Extract, detect, and export text from engineering PDFs at scale.
BAB-AAT is a web application that processes engineering PDFs using optical character recognition to identify and extract equipment tags. It handles the full lifecycle, from document upload through OCR inference to searchable PDF and Excel export.
Key capabilities:
- Upload PDFs or ZIP archives (up to 2.5 GB)
- Run PaddleOCR with configurable models and parameters
- Merge raw detections into meaningful tags via DBSCAN clustering
- Export results as searchable PDFs (invisible text overlay) or Excel spreadsheets
- Process documents asynchronously with Celery workers
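As a rough illustration of the ZIP upload path, the sketch below lists the PDF members of an archive using only the standard library. The function names, the metadata-skipping rules, and the interpretation of 2.5 GB as binary gigabytes are assumptions made for this example, not the project's intake code.

```python
import io
import zipfile
from pathlib import PurePosixPath

# 2.5 GB upload limit from the capability list; binary GB is an assumption.
MAX_UPLOAD_BYTES = int(2.5 * 1024**3)


def list_pdf_members(archive: zipfile.ZipFile) -> list[str]:
    """Return the sorted names of PDF files inside a ZIP archive."""
    names = []
    for info in archive.infolist():
        path = PurePosixPath(info.filename)
        if info.is_dir():
            continue
        # Skip macOS resource forks and hidden files (hypothetical rule).
        if "__MACOSX" in path.parts or path.name.startswith("."):
            continue
        if path.suffix.lower() == ".pdf":
            names.append(info.filename)
    return sorted(names)


def total_uncompressed_size(archive: zipfile.ZipFile) -> int:
    """Sum the uncompressed sizes of all members, useful for a size check."""
    return sum(info.file_size for info in archive.infolist())


# Usage: build a small archive in memory and inspect it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("drawings/P-101.pdf", b"%PDF-1.4")
    zf.writestr("drawings/notes.txt", b"x")
    zf.writestr("__MACOSX/drawings/._P-101.pdf", b"")
    zf.writestr("B.PDF", b"%PDF-1.4")

with zipfile.ZipFile(buffer) as zf:
    pdf_names = list_pdf_members(zf)
    archive_size = total_uncompressed_size(zf)
```

Checking the uncompressed total against the limit before extraction is one way to guard against archives that expand far beyond their upload size.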
┌─────────────────────────────────────────────────────┐
│                   Docker Compose                    │
│                                                     │
│  ┌───────────┐    ┌───────────┐    ┌─────────────┐  │
│  │  Django   │    │  Celery   │    │ PostgreSQL  │  │
│  │  Uvicorn  │◄──►│  Worker   │    │     16      │  │
│  │   :8080   │    │           │    │    :5432    │  │
│  └─────┬─────┘    └─────┬─────┘    └──────▲──────┘  │
│        │                │                 │         │
│        │          ┌─────▼─────┐           │         │
│        └─────────►│   Redis   ├───────────┘         │
│                   │   :6379   │                     │
│                   └───────────┘                     │
└─────────────────────────────────────────────────────┘
| Service | Role | Resource Limit |
|---|---|---|
| web | Django API + static files | 2 GB |
| worker | Celery OCR processing | 8 GB |
| redis | Task broker + result backend | - |
| db | PostgreSQL with named volume | - |
Vessel
└── Document
    ├── Page
    │   └── Detection ──► OCRConfig
    ├── Tag (merged detections)
    └── Truth (ground-truth annotations)
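The hierarchy above can be read as a set of one-to-many relations. The sketch below mirrors it with plain dataclasses rather than Django models. Every field name is hypothetical, and attaching Tag and Truth to Document (rather than Page) is an assumption read off the tree.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OCRConfig:
    name: str              # hypothetical label for a saved configuration
    scale: float
    min_confidence: float
    use_angle_cls: bool


@dataclass
class Detection:
    text: str
    confidence: float
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout
    config: OCRConfig      # the config that produced this detection


@dataclass
class Page:
    number: int
    detections: list[Detection] = field(default_factory=list)


@dataclass
class Tag:
    text: str              # merged detections


@dataclass
class Truth:
    text: str              # ground-truth annotation


@dataclass
class Document:
    filename: str
    pages: list[Page] = field(default_factory=list)
    tags: list[Tag] = field(default_factory=list)
    truths: list[Truth] = field(default_factory=list)


@dataclass
class Vessel:
    name: str
    documents: list[Document] = field(default_factory=list)


def all_detections(vessel: Vessel) -> list[Detection]:
    """Walk Vessel -> Document -> Page -> Detection and flatten the result."""
    return [
        det
        for doc in vessel.documents
        for page in doc.pages
        for det in page.detections
    ]


# Usage with made-up values.
cfg = OCRConfig(name="default", scale=2.0, min_confidence=0.5, use_angle_cls=False)
page1 = Page(1, [Detection("P-101", 0.93, (10, 10, 60, 20), cfg)])
page2 = Page(2, [Detection("V-200", 0.88, (300, 200, 350, 210), cfg)])
vessel = Vessel("example-vessel", [Document("pid-001.pdf", pages=[page1, page2])])
```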
Upload PDF/ZIP         Extract pages        Run PaddleOCR
──────────────► Pages ───────────────► OCR ───────────────► Detections
                                                                 │
                                                            DBSCAN merge
                                                                 │
                                                                 ▼
                                      Export PDF ◄──────────── Tags ───────────────► Export Excel
                                    (text overlay)                                 (structured data)
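To make the DBSCAN merge step concrete, here is a small, dependency-free sketch that clusters detections by the distance between their box centers and joins each cluster into one tag. The clustering feature (box centers), the left-to-right join, and all names are assumptions for illustration. The project's postprocessing may use different features or a library implementation such as scikit-learn's DBSCAN.

```python
from math import hypot

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center(box: Box) -> tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def dbscan(points: list[tuple[float, float]], eps: float, min_samples: int) -> list[int]:
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""

    def neighbors(i: int) -> list[int]:
        xi, yi = points[i]
        # A point counts as its own neighbor.
        return [j for j, (xj, yj) in enumerate(points) if hypot(xi - xj, yi - yj) <= eps]

    labels: list = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # core point: expand the cluster
                queue.extend(reachable)
    return labels


def merge_detections(detections: list[tuple[str, Box]], eps: float, min_samples: int = 1, sep: str = ""):
    """Group detections with DBSCAN and join each group into (text, union_box)."""
    labels = dbscan([center(box) for _, box in detections], eps, min_samples)
    groups: dict[int, list[tuple[str, Box]]] = {}
    for label, det in zip(labels, detections):
        if label != -1:  # noise is dropped in this sketch
            groups.setdefault(label, []).append(det)
    tags = []
    for members in groups.values():
        members.sort(key=lambda d: d[1][0])  # read left to right
        boxes = [box for _, box in members]
        union = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                 max(b[2] for b in boxes), max(b[3] for b in boxes))
        tags.append((sep.join(text for text, _ in members), union))
    return tags


# Usage: two fragments of one tag plus a distant, separate tag.
raw = [("P-", (10, 10, 30, 20)), ("101", (32, 10, 60, 20)), ("V-200", (300, 200, 350, 210))]
tags = merge_detections(raw, eps=30)
```

With `min_samples=1` every detection is a core point, so isolated tags survive as single-member clusters. Raising it discards isolated detections as noise.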
- Docker & Docker Compose
- uv (Python package manager)
- A .env file at the project root (see Configuration)
# Clone and start all services
git clone <repo-url> && cd bab-aat
make run

This will install dependencies, build the Docker image, run migrations, and start all four services.
# Install dependencies
make install
# Start Redis and PostgreSQL (via Docker)
docker compose up redis db -d
# Run Django
uv run python manage.py migrate
uv run uvicorn babaatsite.asgi:application --host 0.0.0.0 --port 8080
# In a separate terminal: start the Celery worker
uv run celery -A babaatsite worker --pool prefork --concurrency 2 --loglevel info

| Target | Description |
|---|---|
| make run | Install deps + build + docker compose up |
| make install | uv sync from pyproject.toml |
| make clean | Remove __pycache__ and .DS_Store |
| make runner | Pull, install, clean, then run (default) |
Create a .env file in the project root:
# Force all services to run locally (skip Supabase/S3/external Redis)
FORCE_LOCAL_DEV=True
# Django
SECRET_KEY=your-secret-key
DEBUG=True
# PostgreSQL (matches docker-compose defaults)
POSTGRES_USER=postgres
POSTGRES_PASSWORD=mysecretpassword
POSTGRES_DB=babaatsite
# Modal (for serverless compute, optional)
MODAL_TOKEN_ID=...
MODAL_TOKEN_SECRET=...

Set FORCE_LOCAL_DEV=True for local development. In production, configure Supabase (database + S3 storage) and an external Redis instance via their respective secret keys.
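One way a settings module could act on these variables is sketched below. This is an assumption about how FORCE_LOCAL_DEV might be consumed, not a copy of babaatsite/settings.py. POSTGRES_HOST and POSTGRES_PORT are hypothetical variable names, and the production branch is left unimplemented because the Supabase variable names are not given here.

```python
import os


def env_flag(name: str, default: bool = False, environ=None) -> bool:
    """Interpret an environment variable such as FORCE_LOCAL_DEV=True as a boolean."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def database_settings(environ=None) -> dict:
    """Build a Django-style database dict for local development."""
    env = os.environ if environ is None else environ
    if not env_flag("FORCE_LOCAL_DEV", environ=env):
        # Production wiring (Supabase, S3, external Redis) is out of scope here.
        raise RuntimeError("FORCE_LOCAL_DEV is not set; configure production secrets instead")
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": env.get("POSTGRES_DB", "babaatsite"),
        "USER": env.get("POSTGRES_USER", "postgres"),
        "PASSWORD": env["POSTGRES_PASSWORD"],  # KeyError if missing, by design
        "HOST": env.get("POSTGRES_HOST", "localhost"),  # hypothetical variable
        "PORT": env.get("POSTGRES_PORT", "5432"),       # hypothetical variable
    }


# Usage with the values from the example .env above.
example_env = {
    "FORCE_LOCAL_DEV": "True",
    "POSTGRES_USER": "postgres",
    "POSTGRES_PASSWORD": "mysecretpassword",
    "POSTGRES_DB": "babaatsite",
}
local_db = database_settings(example_env)
```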
Create OCR configs through the web UI at /ocr/configs/create/. Key parameters:
| Parameter | Range | Description |
|---|---|---|
| scale | 1.0–8.0 | Image upscaling factor |
| min_confidence | 0.0–1.0 | Minimum detection confidence |
| use_angle_cls | bool | Detect rotated text |
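The ranges in the table can be enforced with a small validation layer. The sketch below is illustrative only: the class name, the error messages, and the rule that a detection is kept when its confidence is greater than or equal to min_confidence are assumptions, not the project's form or model code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OCRParams:
    scale: float
    min_confidence: float
    use_angle_cls: bool

    def __post_init__(self):
        # Ranges taken from the parameter table above.
        if not 1.0 <= self.scale <= 8.0:
            raise ValueError(f"scale must be within 1.0 and 8.0, got {self.scale}")
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError(f"min_confidence must be within 0.0 and 1.0, got {self.min_confidence}")


def filter_detections(detections, params: OCRParams):
    """Keep (text, confidence) pairs at or above the configured threshold."""
    return [d for d in detections if d[1] >= params.min_confidence]


# Usage with made-up detections.
params = OCRParams(scale=2.0, min_confidence=0.6, use_angle_cls=True)
sample = [("P-101", 0.93), ("??", 0.41), ("V-200", 0.6)]
kept = filter_detections(sample, params)
```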
bab-aat/
├── babaatsite/                  # Django project config
│   ├── settings.py              # Settings (DB, storage, Celery)
│   ├── celery.py                # Celery app init
│   └── urls.py                  # Root URL routing
├── ocr/                         # Main application
│   ├── models.py                # Data models
│   ├── views.py                 # View handlers
│   ├── tasks.py                 # Celery tasks
│   ├── forms.py                 # Upload & config forms
│   ├── main/
│   │   ├── intake/              # Document upload & ingestion
│   │   ├── inference/           # PaddleOCR pipeline
│   │   │   ├── detections.py    # Core OCR logic
│   │   │   └── postprocessing/  # DBSCAN detection merging
│   │   ├── export/              # PDF & Excel export
│   │   │   ├── pdf.py           # Searchable PDF generation
│   │   │   └── excel.py         # Excel export
│   │   └── utils/               # Shared utilities
│   ├── templates/               # HTML templates
│   └── static/                  # CSS & JS assets
├── resources/
│   └── fonts/                   # Fonts for PDF text overlay
├── build.Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── Makefile
| Endpoint | Method | Description |
|---|---|---|
| /ocr/upload/ | POST | Upload PDFs or ZIP archives |
| /ocr/documents/ | GET | List all documents |
| /ocr/documents/<id>/ | GET | Document detail & page viewer |
| /ocr/documents/detect/by_origin/ | POST | Run OCR by vessel + department |
| /ocr/configs/create/ | POST | Create an OCR configuration |
| Layer | Technology |
|---|---|
| Web framework | Django 5.2 + Uvicorn (ASGI) |
| Task queue | Celery 5.5 + Redis 7 |
| OCR engine | PaddleOCR 3.0 + PaddlePaddle 3.0 |
| Image processing | OpenCV 4.11 |
| PDF handling | PyMuPDF (export), pypdfium2 (page rotation) |
| Excel export | openpyxl, XlsxWriter |
| Database | PostgreSQL 16 |
| Storage | Local filesystem or S3-compatible (Supabase) |
| Containerization | Docker Compose |
| Package manager | uv |
Out-of-memory errors during OCR
Lower the scale parameter in your OCR config or process smaller batches. The Celery worker has an 8 GB memory limit.
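A back-of-the-envelope calculation shows why the scale parameter matters so much for memory. Upscaling multiplies both width and height, so the raw pixel buffer grows with the square of the scale. The page dimensions below are a hypothetical example, and the real footprint of an OCR run is larger than this single buffer because of model weights and intermediate tensors.

```python
def upscaled_image_bytes(width_px: int, height_px: int, scale: float,
                         channels: int = 3, bytes_per_channel: int = 1) -> int:
    """Size of one uncompressed image buffer after upscaling by `scale`."""
    scaled_w = int(round(width_px * scale))
    scaled_h = int(round(height_px * scale))
    return scaled_w * scaled_h * channels * bytes_per_channel


# Hypothetical page rendered at 2480 x 3508 pixels, 8-bit RGB.
at_scale_2 = upscaled_image_bytes(2480, 3508, 2.0)
at_scale_4 = upscaled_image_bytes(2480, 3508, 4.0)

# Doubling the scale quadruples the buffer size.
growth = at_scale_4 / at_scale_2
```

With a concurrency of 2, two such buffers (plus copies made during preprocessing) can be resident in the worker at once, which is why halving the scale is usually the quickest relief.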
Low detection accuracy
Increase the scale factor for higher-resolution input. Lower min_confidence to capture more detections, or enable use_angle_cls for rotated text.
Text missing from exported PDF
This is typically caused by fontsize overflow in PyMuPDF's insert_textbox. The export code uses a conservative fill ratio (0.55) to prevent this. If you still see the issue, check that tag bounding boxes are reasonable.
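For context, PyMuPDF's insert_textbox writes nothing and returns a negative number when the text does not fit the rectangle, which is how text can silently go missing. The sketch below shows one plausible way to sanity-check a box and pick a font size from it. How the export code actually applies the 0.55 ratio is not documented here, so treating it as a fraction of box height is an assumption, as are the function names and the minimum side length.

```python
FILL_RATIO = 0.55  # value quoted in the text above


def is_reasonable_box(box, page_width: float, page_height: float, min_side: float = 2.0) -> bool:
    """Reject boxes that are degenerate, tiny, or outside the page."""
    x0, y0, x1, y1 = box
    inside = 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
    return inside and (x1 - x0) >= min_side and (y1 - y0) >= min_side


def fit_fontsize(box, text_width_at_1pt: float, fill_ratio: float = FILL_RATIO) -> float:
    """Choose a font size limited by both box height and box width.

    `text_width_at_1pt` is the rendered width of the text at font size 1.
    Text width scales linearly with font size, so dividing the box width
    by it gives the largest size that still fits on one line.
    """
    x0, y0, x1, y1 = box
    by_height = (y1 - y0) * fill_ratio
    if text_width_at_1pt <= 0:
        return by_height
    by_width = (x1 - x0) / text_width_at_1pt
    return min(by_height, by_width)


# Usage: a narrow box where width, not height, is the limiting factor.
narrow_box = (0, 0, 30, 20)
size = fit_fontsize(narrow_box, text_width_at_1pt=5.0)
ok = is_reasonable_box(narrow_box, page_width=612, page_height=792)
```

The width at size 1 could come from a font-metrics helper such as PyMuPDF's get_text_length. Checking the return value of insert_textbox and logging negative results is a cheap way to find the offending tags.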