Skip to content

Latest commit

 

History

History
326 lines (225 loc) · 8.58 KB

File metadata and controls

326 lines (225 loc) · 8.58 KB

📄 Multi-Model ID classification System: IDentify.AI

An end-to-end multimodal system to classify scanned documents (IDs, receipts, licenses, etc.) using both image and text features. The project combines OCR, deep learning models (LayoutLMv3 and a custom Early Fusion model), FastAPI, and Streamlit into a deployable .exe app.


✨ Demo

OCR in Action LayoutLMv3 in Action Early Fusion in Action

⏰ Table of Contents


🌟 Features

  • OCR-powered extraction using PaddleOCR

  • Multimodal Gov. ID classification using:

    • LayoutLMv3 (text + layout + image)
    • Custom Early Fusion Model (BERT + ResNet + Attention)
  • Fully working Streamlit UI + FastAPI backend

  • Packaged into a single .exe for Windows users

  • Supports classification into 10 categories (Aadhar, Passport, PAN, Voter ID, etc.)


🚀 System Requirements

  • OS: Windows 10 or 11
  • GPU: NVIDIA RTX 3050 (for GPU OCR)
  • CUDA Toolkit: 12.6 or 12.8 (if using GPU PaddleOCR)
  • Python: 3.11.5 (64-bit)
  • RAM: 8GB+
  • Processor: x86_64 / Intel64 / AMD64

✅ Getting Started

📦 Download Dataset

import gdown, zipfile, os
file_id = "1Gu23xr357BPzGoocyPw6IPUhnz5mf52j"
gdown.download(f"https://drive.google.com/uc?id={file_id}", "file.zip", quiet=False)
with zipfile.ZipFile("file.zip", 'r') as zip_ref:
    zip_ref.extractall("Data")
os.remove("file.zip")

Or Download Manually


🧠 OCR Features

This project utilizes PaddleOCR by Alibaba Cloud.

🔍 Key Features in PaddleOCR 3.0:

  • 🖼️ Universal-Scene Text Recognition Model: Handles five text types + complex handwriting. +13% improvement over previous generation.
  • 🧮 General Document-Parsing: Parses multi-layout, multi-scene PDFs with high precision.
  • 📈 Document Understanding: Powered by ERNIE 4.5 Turbo; +15% accuracy boost over its predecessor.

🛠️ Environment & Prerequisites

🧪 Create Separate OCR Environment

python -m venv env/OCRenv

Activate:

env\OCRenv\Scripts\activate.bat

✅ Check Python Compatibility

Supported Python versions:

  • 3.8 / 3.9 / 3.10 / 3.11 / 3.12 / 3.13

Check Python version:

python --version

Check pip version:

python -m pip --version

Check architecture:

python -c "import platform;print(platform.architecture()[0]);print(platform.machine())"

Expected Output:

64bit
x86_64 (or AMD64)

🖥️ Platform Limitations

  • No NCCL/distributed training on Windows.
  • Requires MKL-compatible CPU (all Intel chips support this).

🚀 Install PaddleOCR GPU Version

python -m pip install paddlepaddle-gpu==3.0.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install paddleocr

🍫 Install Chocolatey & ccache (Optional for caching)

Set-ExecutionPolicy Bypass -Scope Process -Force; \
[System.Net.ServicePointManager]::SecurityProtocol = \
[System.Net.ServicePointManager]::SecurityProtocol -bor 3072; \
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

Then:

choco upgrade chocolatey
choco install ccache
where ccache

Add C:\ProgramData\chocolatey\bin to PATH if not found.


🧱 Architecture

  1. OCR (PaddleOCR)
  2. Text/Image Embedding (BERT + ResNet)
  3. Fusion (Early Fusion Attention / LayoutLMv3)
  4. Classification
  5. UI + API Serving (Streamlit + FastAPI)

📁 File Structure

├── api/               # FastAPI endpoints
├── assets/            # icon, demo video
├── pipeline/          # single image end-to-end pipeline
├── requirements/      # env-specific dependencies
├── scripts/           # .bat, .ps1, .exe launcher
├── src/               # development and training code
├── streamlit_ui/      # Streamlit frontend
├── test_results/      # test_pred.csv
├── DocuSort.exe       # Windows executable
├── accuracy_plot.png
├── check_splits.py
├── config.py
├── main_fastapi.py
└── README.md

File Descriptions

  • api/ – Hosts the FastAPI routes that handle classification requests and serve OCR/model predictions.

  • pipeline/ – Integrates OCR and the model to classify a single image end-to-end. Useful for scripting and testing.

  • src/ – Main training code: dataset loaders, model architectures (LayoutLMv3, Early Fusion), and utilities.

  • streamlit_ui/ – User interface to upload and classify documents via the browser. Shows results in real time.

  • scripts/ – Contains .bat, .ps1, and the DocuSort.exe generator for local deployment.

  • requirements/ – Separated .txt files for installing base, training, or OCR-specific Python dependencies.

  • test_results/ – Contains test_pred.csv used for evaluating or submitting to benchmarks.

  • DocuSort.exe – Shortcut of Final packaged application for Windows — opens both backend and UI in one click.

  • config.py – Centralized config: model name, class labels, thresholds, paths.

  • accuracy_plot.png – Snapshot of training performance to visually track overfitting/generalization.

  • check_splits.py – Verifies dataset balance across train, val, and test sets.

  • main_fastapi.py – Starts the FastAPI app and defines how endpoints behave.


🧠 Model Details

✅ LayoutLMv3

  • Combines text + image + layout (bounding boxes)
  • Fine-tuned using Parquet-formatted OCR documents

✅ Early Fusion Model

  • bert-base-uncased for text embeddings
  • resnet-50 for image embeddings
  • Multi-head attention to fuse modalities
  • Weighted CrossEntropyLoss to handle imbalance

👨‍💻 Primary Contributors: Paul Samuel W E, Sanjesh J

❌ ViT Vision-only (FAILED)

  • Overfit on training, poor generalization (Test Acc: 28%)

🔧 How to Run

🔁 During Development

# Start FastAPI server
uvicorn main_fastapi:app --port 8000

# In another terminal
streamlit run streamlit_ui/app.py

🖱️ Using Executable

scripts/DocuSort.exe
  • Launches both servers
  • Opens UI in browser
  • Prompts to shut down (Y/N)

📦 Packaging as EXE

Invoke-ps2exe `
  -inputFile ".\scripts\run_and_stop.ps1" `
  -outputFile ".\scripts\DocuSort.exe" `
  -title "DocuSort" `
  -icon ".\assets\icon.ico" `
  -requireAdmin `
  -noConsole

🚧 Future Scope

  • Sentence-BERT for better textual embeddings
  • Spell/grammar correction on noisy OCR
  • Multilingual support (Hindi, Tamil, etc.)
  • Mobile/web deployment (React Native, Flask, etc.)
  • GPU inference + caching for faster batch processing

🧾 License

MIT License


👥 Contributors

  • Paul Samuel W E (LayoutLMv3 fine-tuning, OCR Pipeline, Architecture, Packaging)
  • Sanjesh J (EDA, Early Fusion Model, Evaluation)
  • Gayathri R
  • Sri Yogesh B A
  • Samritha S

📸 Project Slides

Project Completed: July 21, 2025