Read document (OCR)

The "Read document (OCR)" service task processes PDFs and images in a single pass. It extracts full text, layout regions (titles, paragraphs, headers, and footers), tables, mathematical formulas, and seals. In addition, a RAG-optimized Markdown version of the entire document is returned.

This makes it possible to automatically read incoming business documents such as delivery notes, invoices, contracts, or forms, and to turn them into structured data in subsequent steps – for example with the "AI: Extract data from OCR" service task.


Input parameters

Provide the following fields as task input:

{
  "document": {
    "referenceId": "...",
    "filename": "delivery-note.pdf"
  }
}

Explanation:

  • document.referenceId: Reference to the uploaded file. The file must be available as a fileReference, e.g. from a file upload step in a form or from a previous service task.
  • document.filename: The file name including extension. The format is detected from the extension.

Supported formats:

  • PDF (.pdf)
  • Images: PNG, JPG/JPEG, BMP, TIFF, WEBP

The same process can handle both PDFs and images – e.g. a smartphone photo of a delivery note and a scanned PDF.


Output

The task returns a structured JSON object with all detected content:

{
  "metadata": {
    "source_file": "delivery-note.pdf",
    "total_pages": 1,
    "total_text_blocks": 49,
    "total_blocks": 12,
    "total_tables": 1,
    "total_formulas": 0,
    "total_seals": 1,
    "extraction_engine": "PPStructureV3"
  },
  "markdown": "# Delivery Note No. 7208166\n\nDate: 19/03/2025\n\n| Pos | Qty | Article |\n|-----|-----|---------|\n| 1 | 100 | Screws M8 |",
  "full_text": "Delivery Note No. 7208166\nDate: 19/03/2025\nPSL Ltd.",
  "pages": [
    {
      "page_number": 0,
      "markdown": "# Delivery Note No. 7208166\n...",
      "text": "Delivery Note No. 7208166\nDate: 19/03/2025\nPSL Ltd.",
      "blocks": [
        { "label": "doc_title", "content": "Delivery Note No. 7208166", "bbox": [120, 80, 480, 110] },
        { "label": "header", "content": "Pantarey Ltd.", "bbox": [80, 40, 240, 60] },
        { "label": "paragraph_title", "content": "Date: 19/03/2025", "bbox": [120, 130, 320, 150] }
      ],
      "tables": [
        { "index": 0, "html": "<table><tr><td>Pos</td><td>Qty</td></tr><tr><td>1</td><td>100</td></tr></table>" }
      ],
      "text_blocks": [
        { "text": "Delivery Note No. 7208166", "confidence": 0.98 }
      ],
      "formulas": [],
      "seals": [
        { "text": "Pantarey Ltd." }
      ]
    }
  ]
}

Explanation:

  • metadata: Summary of the extraction (page count, number of detected blocks, tables, formulas, seals).
  • markdown: RAG-optimized Markdown of the entire document – ideal as input for AI services. Headers, footers, and page numbers are intentionally stripped here (they live in pages[].blocks).
  • full_text: The complete plain text of all pages.
  • pages: Array with one entry per page. Each page contains:
    • markdown: Markdown of the individual page.
    • text: Plain text of the page.
    • blocks: Layout blocks with semantic labels (e.g. doc_title, header, footer, page_number, paragraph_title, text). This is where titles, headers, and footers live.
    • tables: Detected tables as HTML – the cell structure is preserved even on complex layouts.
  • text_blocks: Individual OCR text fragments, each with a confidence value between 0 and 1.
    • formulas: Mathematical formulas as LaTeX.
    • seals: Text recognized inside seals/stamps – useful for identifying the issuer or signing entity.
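
For downstream steps it is often useful to address parts of this structure directly. As a sketch, assuming the task result is available as ocrResult (as in the JSONata examples in the next section), all table HTML across all pages can be collected with:

// Collect the HTML of every detected table on every page
$.ocrResult.pages.tables.html

Because JSONata path steps map over arrays automatically, this returns a flat sequence of HTML strings regardless of how many pages or tables the document contains.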

JSONata examples

// Reference a file from a previous upload step
{
  "document": {
    "referenceId": $.fileUpload.referenceId,
    "filename": $.fileUpload.filename
  }
}
// Pass the Markdown directly to an AI service
{
  "content": $.ocrResult.markdown,
  "questions": [
    {
      "question": "What is the delivery note number?",
      "attribute": "deliveryNoteNumber"
    }
  ]
}
// Check whether any tables were detected (gateway condition)
$.ocrResult.metadata.total_tables > 0
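A further pattern along the same lines (a sketch; field names as in the output structure above):

// Collect all seal/stamp texts, e.g. to identify the issuer
$.ocrResult.pages.seals.text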

Notes

  • Processing time depends on document size and complexity, typically 30–120 seconds.
  • Multi-page PDFs are processed in full – each page appears as its own element in the pages array.
  • The confidence value of each text fragment is useful for quality assessment – values below 0.7 indicate uncertain recognition.
  • The content of markdown and pages[].blocks overlaps intentionally, because PaddleX strips headers, footers, and page numbers from the Markdown. This separation is helpful for downstream AI steps.
  • Seal recognition works particularly well for clearly defined, circular company stamps.
  • Handwritten content is recognized but is generally less reliable than printed text.
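
The confidence note above can be turned into a concrete gateway condition, for example to route uncertain documents to manual review. A JSONata sketch (the 0.7 threshold is the guideline from the notes, not a fixed platform limit):

// True if any OCR fragment on any page falls below the 0.7 threshold
$boolean($.ocrResult.pages.text_blocks[confidence < 0.7])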

Tip

Combining this task with "AI: Extract data from OCR" is particularly powerful: first the document is read via OCR, then typed fields, line-item tables (e.g. articles with quantity and price), and a key-value index are extracted in a targeted way. Two service tasks are enough to model the full flow from an incoming file to structured data.