documenta11ist-ocr

Extract text from images and PDF documents using OCR. Fast, accurate, with optional LLM vision fallback for low-confidence results.

REST API v1 Multipart Upload

Authentication

API endpoints require authentication when the server is configured with allowed API keys.

Header  X-API-Key
Format  Plain string token provided by the service administrator
Scope   Applies to /api/* routes when API_KEYS is set. The /health endpoint is public.
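
A minimal sketch of how a client might attach the key, in Python. The helper name auth_headers is hypothetical and not part of the service; only the X-API-Key header name comes from the description above.

```python
from typing import Optional


def auth_headers(api_key: Optional[str]) -> dict:
    """Build request headers for /api/* routes (hypothetical helper).

    Returns an empty dict when no key is given, which is enough for the
    public /health endpoint or for a server running without API_KEYS.
    """
    if not api_key:
        return {}
    # The key is sent as a plain string token in the X-API-Key header.
    return {"X-API-Key": api_key}


print(auth_headers("your-api-key"))
print(auth_headers(None))
```

The returned dict can be passed to whichever HTTP client is in use.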

Endpoints

POST /api/v1/ocr Auth required

Extract text from an uploaded image or PDF document.

Request

Content-Type  multipart/form-data
Body          Form field "file" containing the document to process
Formats       JPEG, PNG, PDF
Max size      5 MB (configurable)
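
The request shape above could be assembled by hand roughly as follows. This is a sketch under stated assumptions: the function name encode_file_upload is hypothetical, the 5 MB limit is taken to mean 5 * 1024 * 1024 bytes, and formats are recognized by file extension only. The server may count and detect both differently.

```python
import uuid
from pathlib import Path
from typing import Tuple

# Assumptions for illustration: the documented default limit and formats.
MAX_BYTES = 5 * 1024 * 1024
MIME_BY_SUFFIX = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".pdf": "application/pdf",
}


def encode_file_upload(filename: str, data: bytes) -> Tuple[str, bytes]:
    """Return (Content-Type header value, body) for a single "file" part.

    The filename is assumed to contain no double quotes or line breaks.
    """
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_BY_SUFFIX:
        raise ValueError(f"unsupported format: {suffix or filename}")
    if len(data) > MAX_BYTES:
        raise ValueError(f"file is {len(data)} bytes, limit is {MAX_BYTES}")

    # A random boundary makes a collision with the file content unlikely.
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {MIME_BY_SUFFIX[suffix]}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return f"multipart/form-data; boundary={boundary}", head + data + tail


content_type, body = encode_file_upload("scan.jpg", b"\xff\xd8\xff")
print(content_type.split(";")[0], len(body))
```

Rejecting oversized or unsupported files on the client avoids a round trip that would likely end in a 400 response.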

Response

json 200
{
  "data": "Le texte extrait du document...",
  "metadata": {
    "processing_time_ms": 342,
    "method": "ocr_tesseract"
  }
}

Extraction methods

Method                           Description
ocr_tesseract                    Text extracted via Tesseract OCR engine
text_extraction                  Text extracted directly from a text-based PDF
ocr_tesseract_with_llm_fallback  Tesseract result refined by LLM vision (low confidence)
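
A client may want to branch on metadata.method, for example to flag results that went through the LLM fallback. The sketch below uses the three method names listed above; the function names and the shortened labels are illustrative only.

```python
# Method names as documented above; the labels are shortened for display.
METHODS = {
    "ocr_tesseract": "Tesseract OCR",
    "text_extraction": "direct extraction from a text-based PDF",
    "ocr_tesseract_with_llm_fallback": "Tesseract OCR refined by LLM vision",
}


def summarize_ocr_response(response: dict) -> str:
    """Return a one-line summary of an /api/v1/ocr response."""
    meta = response["metadata"]
    method = meta["method"]
    # Fall back to the raw name so an unknown method does not raise.
    label = METHODS.get(method, method)
    return f"{len(response['data'])} chars via {label} in {meta['processing_time_ms']} ms"


def used_llm_fallback(response: dict) -> bool:
    """True when the LLM vision fallback refined a low-confidence result."""
    return response["metadata"]["method"] == "ocr_tesseract_with_llm_fallback"


sample = {
    "data": "Le texte extrait du document...",
    "metadata": {"processing_time_ms": 342, "method": "ocr_tesseract"},
}
print(summarize_ocr_response(sample))
print(used_llm_fallback(sample))
```
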

POST /api/v1/analyze Auth required

Analyze an uploaded image or PDF document and return its useful text, the blocks that were retained, the noise blocks that were removed, and metadata for images embedded in PDFs.

Request

Content-Type  multipart/form-data
Body          Form field "file" containing the document to process
Formats       JPEG, PNG, PDF
Max size      5 MB (configurable)

Response

json 200
{
  "useful_text": "Titre du document\nParagraphe utile...",
  "blocks": [
    {
      "id": "block-1",
      "kind": "title",
      "text": "Titre du document",
      "page": 1,
      "confidence": 1.0,
      "bbox": null,
      "heading_level": 1
    }
  ],
  "images": [
    {
      "id": "image-1-1",
      "page": 1,
      "width": 640,
      "height": 480,
      "bbox": null,
      "mime_type": "image/jpeg",
      "caption": null,
      "caption_confidence": 0.0,
      "nearby_text": "",
      "alt_text": null
    }
  ],
  "removed_blocks": [
    {
      "kind": "page_number",
      "text": "Page 1 sur 12",
      "page": 1,
      "reason": "pagination_pattern"
    }
  ],
  "metadata": {
    "processing_time_ms": 512,
    "method": "text_extraction",
    "pages": 1
  }
}

Phase 1 scope

heading_level is set when the heading level is available from PDF outlines or can be inferred. The service does not generate alt text yet. Bounding boxes, captions, and nearby text remain empty when reliable source data is not available.
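
Because several fields may be null or empty in this phase, client code should treat them as optional. The following sketch reads the response shape shown above. The function names are hypothetical, and it assumes that blocks without a known heading level carry heading_level as null or omit it, which the text above does not state explicitly.

```python
from collections import Counter


def headings(response: dict) -> list:
    """Return (heading_level, text) pairs for blocks that carry a level."""
    return [
        (block["heading_level"], block["text"])
        for block in response.get("blocks", [])
        if block.get("heading_level") is not None
    ]


def removal_reasons(response: dict) -> Counter:
    """Count removed blocks per reason, such as pagination_pattern."""
    return Counter(b["reason"] for b in response.get("removed_blocks", []))


def images_missing_alt_text(response: dict) -> list:
    """Return ids of images whose alt_text is null or empty."""
    return [
        img["id"]
        for img in response.get("images", [])
        if not img.get("alt_text")
    ]


sample = {
    "blocks": [{"id": "block-1", "kind": "title", "text": "Titre du document",
                "page": 1, "confidence": 1.0, "bbox": None, "heading_level": 1}],
    "images": [{"id": "image-1-1", "page": 1, "alt_text": None}],
    "removed_blocks": [{"kind": "page_number", "text": "Page 1 sur 12",
                        "page": 1, "reason": "pagination_pattern"}],
}
print(headings(sample))
print(dict(removal_reasons(sample)))
print(images_missing_alt_text(sample))
```

Since the service does not generate alt text yet, a list like the one from images_missing_alt_text could serve as a work queue for captions written elsewhere.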

GET /health Public

Health check endpoint. Returns service status.

json 200
{
  "status": "ok"
}

Responses

All responses are JSON. Successful responses return the result directly. Error responses use a consistent envelope.

Error format
{
  "error": "Description of what went wrong"
}

Errors

Status  Reason
400     Unsupported file format, file too large, or invalid PDF
401     Missing or invalid API key
422     Text extraction failed on a valid file
500     Internal server error (OCR engine failure)
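
One way a client could turn the status codes and the error envelope into exceptions is sketched below. The OcrApiError class and the parse_response function are hypothetical names, and the retry policy (retry only on 5xx) is an assumption, not documented service behavior.

```python
import json


class OcrApiError(Exception):
    """Hypothetical client-side exception carrying the HTTP status."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message

    @property
    def retryable(self) -> bool:
        # Assumption: 400, 401 and 422 describe problems with the request
        # or the file, so only server-side failures are worth retrying.
        return self.status >= 500


def parse_response(status: int, body: bytes) -> dict:
    """Return the decoded JSON on success, raise OcrApiError otherwise."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes.
        payload = None
    if 200 <= status < 300 and isinstance(payload, dict):
        return payload
    message = "unexpected response"
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        message = payload["error"]
    raise OcrApiError(status, message)


try:
    parse_response(401, b'{"error": "Missing or invalid API key"}')
except OcrApiError as exc:
    print(exc.status, exc.message, exc.retryable)
```
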

Examples

Extract text from an image

curl
$ curl -X POST https://your-domain.com/api/v1/ocr \
  -H "X-API-Key: your-api-key" \
  -F "file=@scan.jpg"

Extract text from a PDF

curl
$ curl -X POST https://your-domain.com/api/v1/ocr \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"

Analyze a document

curl
$ curl -X POST https://your-domain.com/api/v1/analyze \
  -H "X-API-Key: your-api-key" \
  -F "file=@document.pdf"

Health check

curl
$ curl https://your-domain.com/health