documenta11ist-ocr

Extract text from images and PDF documents using OCR. Fast, accurate, with optional LLM vision fallback for low-confidence results.

REST API v1 Multipart Upload

Authentication

API endpoints require authentication when the server is configured with allowed API keys.

Header: X-API-Key

Format: Plain string token provided by the service administrator

Scope: Applies to /api/* routes when API_KEYS is set. The /health endpoint is public.
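As a minimal client-side sketch of the header rule above: the helper below returns the `X-API-Key` header when a key is available and no header otherwise, which suits a server running without `API_KEYS`. The function name and the `OCR_API_KEY` environment variable are illustrative assumptions, not part of the service.

```python
import os


def auth_headers(api_key=None):
    """Return the headers that carry the API key, or an empty dict when no key is set.

    OCR_API_KEY is a hypothetical client-side environment variable used here
    only for illustration; the service itself does not define it.
    """
    key = api_key if api_key is not None else os.environ.get("OCR_API_KEY")
    return {"X-API-Key": key} if key else {}
```

Merge the returned dict into the headers of any request sent to an `/api/*` route.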

Endpoints

POST /api/v1/ocr (auth required)

Extract text from an uploaded image or PDF document.

Request

Content-Type: multipart/form-data

Body: form field `file`, the document to process

Formats: JPEG, PNG, PDF

Max size: 5 MB (configurable)
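A client can check the format and size limits above before uploading, which avoids a round trip that would end in a 400. The sketch below is one way to do that. The extension-to-MIME mapping, the reading of "5 MB" as 5 × 1024 × 1024 bytes, and all names are assumptions; the server limit is configurable and remains authoritative.

```python
from pathlib import Path

# Assumption: "5 MB" is read as 5 MiB; the real limit is set on the server.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024

# Assumption: file extensions stand in for the three documented formats.
ALLOWED_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".pdf": "application/pdf",
}


def check_upload(filename, size_bytes, max_bytes=MAX_UPLOAD_BYTES):
    """Return the MIME type to send, or raise ValueError if the file would be rejected."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_TYPES:
        raise ValueError(f"unsupported format: {suffix or filename}")
    if size_bytes > max_bytes:
        raise ValueError(f"file too large: {size_bytes} bytes (limit {max_bytes})")
    return ALLOWED_TYPES[suffix]
```

The extension check is only a convenience. The server inspects the actual content, so a renamed file can still be rejected.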

Response

200 (JSON)

{
  "data": "Le texte extrait du document...",
  "metadata": {
    "processing_time_ms": 342,
    "method": "ocr_tesseract"
  }
}

Extraction methods

Method	Description
`ocr_tesseract`	Text extracted via Tesseract OCR engine
`text_extraction`	Text extracted directly from a text-based PDF
`ocr_tesseract_with_llm_fallback`	Tesseract result refined by LLM vision (low confidence)
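The `method` field tells a caller how the text was obtained, which can be useful for logging or for routing low-confidence documents to a manual check. The sketch below parses an `/api/v1/ocr` response body into a small dict. The `needs_review` flag is a hypothetical client-side policy, not something the service returns.

```python
import json

LLM_FALLBACK = "ocr_tesseract_with_llm_fallback"


def parse_ocr_response(body):
    """Turn the JSON body of a 200 response from /api/v1/ocr into a flat dict."""
    payload = json.loads(body)
    meta = payload["metadata"]
    return {
        "text": payload["data"],
        "method": meta["method"],
        "elapsed_ms": meta["processing_time_ms"],
        # Assumption: a caller may want to review text that needed the LLM
        # fallback, since that path is taken on low-confidence OCR results.
        "needs_review": meta["method"] == LLM_FALLBACK,
    }
```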

POST /api/v1/analyze (auth required)

Analyze an uploaded image or PDF document and return its useful text, the blocks that were retained, the blocks removed as noise, and metadata for images found in PDFs.

Request

Content-Type: multipart/form-data

Body: form field `file`, the document to process

Formats: JPEG, PNG, PDF

Max size: 5 MB (configurable)

Response

200 (JSON)

{
  "useful_text": "Titre du document\nParagraphe utile...",
  "blocks": [
    {
      "id": "block-1",
      "kind": "title",
      "text": "Titre du document",
      "page": 1,
      "confidence": 1.0,
      "bbox": null,
      "heading_level": 1
    }
  ],
  "images": [
    {
      "id": "image-1-1",
      "page": 1,
      "width": 640,
      "height": 480,
      "bbox": null,
      "mime_type": "image/jpeg",
      "caption": null,
      "caption_confidence": 0.0,
      "nearby_text": "",
      "alt_text": null
    }
  ],
  "removed_blocks": [
    {
      "kind": "page_number",
      "text": "Page 1 sur 12",
      "page": 1,
      "reason": "pagination_pattern"
    }
  ],
  "metadata": {
    "processing_time_ms": 512,
    "method": "text_extraction",
    "pages": 1
  }
}

Phase 1 scope

heading_level is set when the heading level is available from PDF outlines or can be inferred. The service does not generate alt text yet. Bounding boxes, captions, and nearby text remain empty when reliable source data is not available.
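Because several fields can be null or empty in this phase, client code should read them defensively. The sketch below works on an already-decoded `/api/v1/analyze` response and shows three small helpers: a heading outline, the images that still lack alt text, and a count of removal reasons. The helper names are illustrative, and it is an assumption that blocks which are not headings carry a null or absent `heading_level`.

```python
from collections import Counter


def heading_outline(analysis):
    """Return (heading_level, text) pairs for blocks that carry a heading level."""
    return [
        (block["heading_level"], block["text"])
        for block in analysis.get("blocks", [])
        if block.get("heading_level") is not None
    ]


def images_without_alt_text(analysis):
    """Return the ids of images whose alt_text is null, missing, or empty."""
    return [
        image["id"]
        for image in analysis.get("images", [])
        if not image.get("alt_text")
    ]


def removal_reasons(analysis):
    """Count removed blocks per reason, e.g. 'pagination_pattern'."""
    return Counter(block["reason"] for block in analysis.get("removed_blocks", []))
```

Since the service does not generate alt text yet, `images_without_alt_text` can serve as a work list for whoever writes the descriptions.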

GET /health (public)

Health check endpoint. Returns service status.

200 (JSON)

{
  "status": "ok"
}

Responses

All responses are JSON. Successful responses return the result directly. Error responses use a consistent envelope.

Error format

{
  "error": "Description of what went wrong"
}

Errors

Status	Reason
400	Unsupported file format, file too large, or invalid PDF
401	Missing or invalid API key
422	Text extraction failed on a valid file
500	Internal server error (OCR engine failure)
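The status table and the error envelope can be combined into one client-side check. The sketch below raises an exception that carries the status code and the `error` message, and falls back to the raw body if the response is not the documented envelope (for example, when a proxy answers instead of the service). The exception class and function name are hypothetical.

```python
import json


class OcrApiError(Exception):
    """Raised for any non-2xx response; keeps the status code and server message."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def raise_for_error(status, body):
    """Do nothing for 2xx responses; otherwise raise OcrApiError.

    `body` is the response text. The message comes from the documented
    {"error": "..."} envelope when present, else from the raw body.
    """
    if 200 <= status < 300:
        return
    try:
        message = json.loads(body)["error"]
    except (ValueError, KeyError, TypeError):
        message = body
    raise OcrApiError(status, message)
```

A caller can then branch on `exc.status`: 400 and 401 point to a problem with the request, 422 means the file was valid but extraction failed, and 500 signals a server-side failure.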

Examples

Extract text from an image

curl

$ curl -X POST https://your-domain.com/api/v1/ocr \
  -H "X-API-Key: your-api-key" \
  -F "file=@scan.jpg"

Extract text from a PDF

curl

$ curl -X POST https://your-domain.com/api/v1/ocr \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"

Analyze a document

curl

$ curl -X POST https://your-domain.com/api/v1/analyze \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"

Health check

curl

$ curl https://your-domain.com/health
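The curl upload calls above can be sketched in Python using only the standard library. The code below builds the multipart/form-data body by hand and prepares a `urllib.request.Request`; it is a minimal sketch, not an official client. The function names are illustrative, and `https://your-domain.com` and `your-api-key` are the same placeholders the curl examples use.

```python
import urllib.request
import uuid


def encode_multipart(field, filename, content, mime_type):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(base_url, path, api_key, filename, content, mime_type):
    """Prepare a POST request that uploads `content` in the `file` form field."""
    body, content_type = encode_multipart("file", filename, content, mime_type)
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": content_type},
        method="POST",
    )


# Sending the request (not executed here, since it needs a running server):
#
#   with open("document.pdf", "rb") as fh:
#       req = build_upload_request("https://your-domain.com", "/api/v1/analyze",
#                                  "your-api-key", "document.pdf", fh.read(),
#                                  "application/pdf")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the error envelope has to be read from the exception's body in that case.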