dbchristenson/bab-aat

BAB-AAT

Automated Asset Tagging for Bumi Armada

Extract, detect, and export text from engineering PDFs at scale.

Python 3.12+ · Django 5.2 · PaddleOCR 3.0 · Celery 5.5 · Docker · License


Overview

BAB-AAT is a web application that processes engineering PDFs using optical character recognition to identify and extract equipment tags. It handles the full lifecycle, from document upload through OCR inference to searchable PDF and Excel export.

Key capabilities:

  • Upload PDFs or ZIP archives (up to 2.5 GB)
  • Run PaddleOCR with configurable models and parameters
  • Merge raw detections into meaningful tags via DBSCAN clustering
  • Export results as searchable PDFs (invisible text overlay) or Excel spreadsheets
  • Process documents asynchronously with Celery workers

Architecture

┌─────────────────────────────────────────────────────┐
│                   Docker Compose                    │
│                                                     │
│  ┌───────────┐    ┌───────────┐    ┌─────────────┐  │
│  │  Django   │    │  Celery   │    │ PostgreSQL  │  │
│  │  Uvicorn  │◄──►│  Worker   │    │     16      │  │
│  │  :8080    │    │           │    │   :5432     │  │
│  └─────┬─────┘    └─────┬─────┘    └──────▲──────┘  │
│        │                │                 │         │
│        │          ┌─────▼─────┐           │         │
│        └─────────►│   Redis   ├───────────┘         │
│                   │   :6379   │                     │
│                   └───────────┘                     │
└─────────────────────────────────────────────────────┘
| Service | Role | Resource Limit |
|---|---|---|
| web | Django API + static files | 2 GB |
| worker | Celery OCR processing | 8 GB |
| redis | Task broker + result backend | none |
| db | PostgreSQL with named volume | none |

Data Model

Vessel
  └── Document
        ├── Page
        │     └── Detection  ──►  OCRConfig
        ├── Tag  (merged detections)
        └── Truth  (ground-truth annotations)
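
The hierarchy above reads as a chain of one-to-many relationships, with each Detection pointing at the OCRConfig that produced it. The sketch below mirrors that shape with plain dataclasses, purely for illustration. The project's real models are Django models in ocr/models.py, and every field name here (name, filename, number, text, confidence) is a hypothetical stand-in, not the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class OCRConfig:
    name: str  # hypothetical field


@dataclass
class Detection:
    text: str          # hypothetical field
    confidence: float  # hypothetical field
    config: OCRConfig  # the config that produced this detection


@dataclass
class Page:
    number: int
    detections: list[Detection] = field(default_factory=list)


@dataclass
class Tag:
    text: str  # merged from one or more detections


@dataclass
class Truth:
    text: str  # ground-truth annotation


@dataclass
class Document:
    filename: str
    pages: list[Page] = field(default_factory=list)
    tags: list[Tag] = field(default_factory=list)
    truths: list[Truth] = field(default_factory=list)


@dataclass
class Vessel:
    name: str
    documents: list[Document] = field(default_factory=list)
```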

Pipeline

 Upload PDF/ZIP          Extract pages         Run PaddleOCR
 ─────────────►  Pages  ──────────────►  OCR  ──────────────►  Detections
                                                                     │
                                                               DBSCAN merge
                                                                     │
                                                                     ▼
                Export PDF ◄──────────── Tags ──────────────►  Export Excel
              (text overlay)                               (structured data)
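
The merge step in the pipeline above groups nearby detections into a single tag. As a rough illustration, the sketch below runs a small, dependency-free DBSCAN over detection box centres and joins each cluster's text from left to right. The tuple layout, the eps value, min_samples=1, and the left-to-right join are all assumptions made for the example; the project's own postprocessing code may cluster on different features or use a library implementation.

```python
from math import hypot


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


def merge_detections(detections, eps=50.0):
    """Merge (text, x0, y0, x1, y1) detections into (text, bbox) tags."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2)
               for _, x0, y0, x1, y1 in detections]
    # min_samples=1 means an isolated detection still becomes its own tag.
    labels = dbscan(centers, eps=eps, min_samples=1)
    groups = {}
    for det, label in zip(detections, labels):
        groups.setdefault(label, []).append(det)
    tags = []
    for members in groups.values():
        members.sort(key=lambda d: d[1])  # assumed left-to-right reading order
        text = "".join(m[0] for m in members)
        bbox = (min(m[1] for m in members), min(m[2] for m in members),
                max(m[3] for m in members), max(m[4] for m in members))
        tags.append((text, bbox))
    return tags
```

With these assumptions, two fragments such as "P-" and "101A" whose centres lie within eps of each other come back as one tag with a bounding box covering both, while a distant detection stays separate.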

Quick Start

Prerequisites

  • Docker & Docker Compose
  • uv (Python package manager)
  • A .env file at the project root (see Configuration)

Run with Docker Compose

# Clone and start all services
git clone <repo-url> && cd bab-aat
make run

This will install dependencies, build the Docker image, run migrations, and start all four services.

Run Locally (development)

# Install dependencies
make install

# Start Redis and PostgreSQL (via Docker)
docker compose up redis db -d

# Run Django
uv run python manage.py migrate
uv run uvicorn babaatsite.asgi:application --host 0.0.0.0 --port 8080

# In a separate terminal: start the Celery worker
uv run celery -A babaatsite worker --pool prefork --concurrency 2 --loglevel info

Makefile Targets

| Target | Description |
|---|---|
| make run | Install deps + build + docker compose up |
| make install | uv sync from pyproject.toml |
| make clean | Remove __pycache__ and .DS_Store |
| make runner | Pull, install, clean, then run (default) |

Configuration

Environment Variables

Create a .env file in the project root:

# Force all services to run locally (skip Supabase/S3/external Redis)
FORCE_LOCAL_DEV=True

# Django
SECRET_KEY=your-secret-key
DEBUG=True

# PostgreSQL (matches docker-compose defaults)
POSTGRES_USER=postgres
POSTGRES_PASSWORD=mysecretpassword
POSTGRES_DB=babaatsite

# Modal (for serverless compute, optional)
MODAL_TOKEN_ID=...
MODAL_TOKEN_SECRET=...

Set FORCE_LOCAL_DEV=True for local development. In production, configure Supabase (database + S3 storage) and an external Redis instance via their respective secret keys.
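
Values in a .env file always arrive as strings, so a flag like FORCE_LOCAL_DEV=True has to be parsed rather than tested for truthiness (the string "False" is truthy in Python). The sketch below shows one way a settings module might read such flags. The helper names env_flag and load_flags and the accepted spellings are assumptions for illustration; the project's settings.py may handle this differently.

```python
import os


def env_flag(environ, name, default=False):
    """Interpret a variable such as FORCE_LOCAL_DEV=True as a boolean."""
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumed set of accepted "true" spellings.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_flags(environ=os.environ):
    """Collect the boolean switches described in the .env example."""
    return {
        "force_local_dev": env_flag(environ, "FORCE_LOCAL_DEV"),
        "debug": env_flag(environ, "DEBUG"),
    }
```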

OCR Configuration

Create OCR configs through the web UI at /ocr/configs/create/. Key parameters:

| Parameter | Range | Description |
|---|---|---|
| scale | 1.0–8.0 | Image upscaling factor |
| min_confidence | 0.0–1.0 | Minimum detection confidence |
| use_angle_cls | bool | Detect rotated text |
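
The ranges above can be enforced before a config is saved or used. The following is a minimal sketch of that validation plus a confidence filter, assuming detections are (text, confidence) pairs and that the threshold is inclusive. OCRParams and keep_confident are hypothetical names, not the project's form or model classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OCRParams:
    scale: float
    min_confidence: float
    use_angle_cls: bool = False  # assumed default

    def __post_init__(self):
        # Reject values outside the documented ranges.
        if not 1.0 <= self.scale <= 8.0:
            raise ValueError(f"scale must be in [1.0, 8.0], got {self.scale}")
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError(
                f"min_confidence must be in [0.0, 1.0], got {self.min_confidence}"
            )


def keep_confident(detections, params):
    """Keep (text, confidence) pairs at or above params.min_confidence."""
    return [d for d in detections if d[1] >= params.min_confidence]
```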

Project Structure

bab-aat/
├── babaatsite/               # Django project config
│   ├── settings.py           #   Settings (DB, storage, Celery)
│   ├── celery.py             #   Celery app init
│   └── urls.py               #   Root URL routing
├── ocr/                      # Main application
│   ├── models.py             #   Data models
│   ├── views.py              #   View handlers
│   ├── tasks.py              #   Celery tasks
│   ├── forms.py              #   Upload & config forms
│   ├── main/
│   │   ├── intake/           #   Document upload & ingestion
│   │   ├── inference/        #   PaddleOCR pipeline
│   │   │   ├── detections.py #     Core OCR logic
│   │   │   └── postprocessing/   # DBSCAN detection merging
│   │   ├── export/           #   PDF & Excel export
│   │   │   ├── pdf.py        #     Searchable PDF generation
│   │   │   └── excel.py      #     Excel export
│   │   └── utils/            #   Shared utilities
│   ├── templates/            #   HTML templates
│   └── static/               #   CSS & JS assets
├── resources/
│   └── fonts/                # Fonts for PDF text overlay
├── build.Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── Makefile

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /ocr/upload/ | POST | Upload PDFs or ZIP archives |
| /ocr/documents/ | GET | List all documents |
| /ocr/documents/<id>/ | GET | Document detail & page viewer |
| /ocr/documents/detect/by_origin/ | POST | Run OCR by vessel + department |
| /ocr/configs/create/ | POST | Create an OCR configuration |

Tech Stack

| Layer | Technology |
|---|---|
| Web framework | Django 5.2 + Uvicorn (ASGI) |
| Task queue | Celery 5.5 + Redis 7 |
| OCR engine | PaddleOCR 3.0 + PaddlePaddle 3.0 |
| Image processing | OpenCV 4.11 |
| PDF handling | PyMuPDF (export), pypdfium2 (page rotation) |
| Excel export | openpyxl, XlsxWriter |
| Database | PostgreSQL 16 |
| Storage | Local filesystem or S3-compatible (Supabase) |
| Containerization | Docker Compose |
| Package manager | uv |

Troubleshooting

Out-of-memory errors during OCR
Lower the scale parameter in your OCR config or process smaller batches. The Celery worker has an 8 GB memory limit.
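
To see why lowering scale helps, note that an upscaled raster grows with the square of the factor. The sketch below estimates the size of one uncompressed page image, assuming that scale multiplies both dimensions of a 72 dpi rendering (1 pt = 1 px at scale 1.0) and that pixels are 8-bit RGB. Both are assumptions about the pipeline, and raster_mebibytes is a hypothetical helper.

```python
def raster_mebibytes(width_pt, height_pt, scale, channels=3):
    """Approximate size in MiB of one page rendered as an 8-bit image."""
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels / (1024 ** 2)


# An A1 landscape drawing is about 2384 x 1684 pt.
for scale in (2.0, 4.0, 8.0):
    size = raster_mebibytes(2384, 1684, scale)
    print(f"scale {scale}: {size:.0f} MiB per page copy")
```

Under these assumptions a single A1 page takes roughly 46 MiB at scale 2 and roughly 184 MiB at scale 4, and OCR pipelines usually hold several intermediate copies per page, so doubling the scale quadruples the pressure on the worker's 8 GB limit.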

Low detection accuracy
Increase the scale factor for higher-resolution input. Lower min_confidence to capture more detections, or enable use_angle_cls for rotated text.

Text missing from exported PDF
This is typically caused by fontsize overflow in PyMuPDF's insert_textbox, which writes nothing when the text does not fit the target rectangle. The export code uses a conservative fill ratio (0.55) to prevent this. If you still see the issue, check that tag bounding boxes are reasonable.
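
As an illustration of how a fill ratio keeps text inside its box, the sketch below picks a font size from the box height and then shrinks it if the estimated text width would exceed the box width. How the export code actually applies its 0.55 ratio is not documented here, so the formula, the fit_fontsize name, and the 0.6 em average character width (the advance width of a Courier-style monospace font) are all assumptions.

```python
def fit_fontsize(box_width, box_height, text,
                 fill_ratio=0.55, char_width_em=0.6):
    """Estimate a font size (in points) that keeps text inside the box."""
    # Height constraint: use only a fraction of the box height.
    by_height = box_height * fill_ratio
    if not text:
        return by_height
    # Width constraint: estimated text width is len(text) * char_width_em * size.
    by_width = box_width / (len(text) * char_width_em)
    return min(by_height, by_width)
```

A tall, narrow bounding box is the case to watch: the height-based size alone would overflow horizontally, so the width constraint takes over and yields a smaller size.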


Built for Bumi Armada
