AI: Extract data from OCR

The "AI: Extract data from OCR" service task processes the result of the "Read document (OCR)" task and returns structured data. From the combined OCR output (Markdown, layout blocks, tables, seals, and text fragments), it extracts typed fields, line-item tables (e.g. articles with quantity and price), and an automatic key-value index for later search.

The task is generic and works for any business document – e.g. delivery notes, invoices, contracts, purchase orders, certificates, or forms. An optional document context and a typed definition for each field produce significantly more accurate results than generic text extraction.


Input parameters

Provide the following fields as task input:

{
  "ocr": { },
  "documentContext": "Delivery notes from logistics providers",
  "schema": {
    "deliveryNoteNumber": {
      "type": "string",
      "description": "Delivery note number (e.g. DN-2024-001)"
    },
    "deliveryDate": {
      "type": "date",
      "description": "Delivery date in YYYY-MM-DD format"
    },
    "customerName": {
      "type": "string",
      "description": "Recipient / customer name"
    },
    "totalAmount": {
      "type": "currency",
      "description": "Total amount (numeric value in document currency)"
    },
    "paymentMethod": {
      "type": "enum",
      "values": ["cash", "invoice", "card"],
      "description": "Payment method"
    }
  },
  "tables": {
    "items": {
      "description": "Line items of delivered articles",
      "columns": {
        "pos":      { "type": "integer" },
        "article":  { "type": "string"  },
        "quantity": { "type": "number"  },
        "price":    { "type": "currency" }
      }
    }
  },
  "discoverAdditional": true,
  "language": "en",
  "maxContentLength": 50000
}

Explanation:

  • ocr: The full output of the "Read document (OCR)" service task – the object containing metadata, markdown, pages, etc.
  • schema: The fields to be extracted. Each field requires a type and a description. Supported types:
    • string – text
    • number – decimal number
    • integer – whole number
    • date – date (YYYY-MM-DD)
    • currency – currency amount (numeric)
    • iban – IBAN
    • email – email address
    • boolean – true/false
    • enum – one of a fixed list of values; provide values: ["..."] as well
  • tables (optional): Tables to be returned as line-item lists. Each table needs a description and a column schema (name + type). Values are returned as an array of objects.
  • documentContext (optional): A short description of the document type. Helps the AI to disambiguate labels like "No." (e.g. delivery note number vs. invoice number).
  • discoverAdditional (optional, default: false): When true, all additional recognizable key-value pairs are returned as well – useful as a full-text search index.
  • language (optional, default: de): Language used for the labels in discoveredKeyValues.
  • maxContentLength (optional, default: 50000): Maximum number of characters of OCR content passed to the AI.

Output

The task returns a structured JSON object:

{
  "fields": {
    "deliveryNoteNumber": "DN-2024-001",
    "deliveryDate": "2024-03-15",
    "customerName": "Miller Ltd.",
    "totalAmount": 3425.0,
    "paymentMethod": "invoice"
  },
  "positions": {
    "items": [
      { "pos": 1, "article": "Screws M8", "quantity": 100, "price": 0.45 },
      { "pos": 2, "article": "Nuts M8",   "quantity": 100, "price": 0.18 }
    ]
  },
  "discoveredKeyValues": [
    { "key": "Delivery address", "value": "5 Industrial Rd, Stuttgart" },
    { "key": "Order reference",  "value": "PO-7788" }
  ],
  "notFound": []
}

Explanation:

  • fields: The fields defined in schema, with type-correct values. Fields that could not be found are returned as null and additionally listed in notFound.
  • positions: Only present when tables was defined. For each table, an array of objects is returned – the keys match the defined columns.
  • discoveredKeyValues: Only present when discoverAdditional: true was set. A list of all additional recognizable key-value pairs (max. 60 entries). Useful as a full-text search index.
  • notFound: List of schema field names for which no value could be found.
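Downstream steps can consume this output directly in their JSONata input mappings. For instance, a gateway condition can route documents with missing fields to manual review, or a mapping can build a flat record for the data lake. (The variable name extractResult is an assumption – use whatever name the task result is stored under in your process.)

```
// Gateway condition: route to manual review when any schema field was not found
$count($.extractResult.notFound) > 0

// Build a flat record from the extracted fields
{
  "number": $.extractResult.fields.deliveryNoteNumber,
  "date":   $.extractResult.fields.deliveryDate,
  "total":  $.extractResult.fields.totalAmount
}
```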

JSONata examples

// Pass the OCR result directly from the previous step
{
  "ocr": $.ocrResult,
  "documentContext": "Delivery notes",
  "schema": {
    "deliveryNoteNumber": { "type": "string", "description": "Delivery note number" },
    "deliveryDate":       { "type": "date",   "description": "Delivery date" },
    "customerName":       { "type": "string", "description": "Recipient / customer" }
  }
}
// Extract a line-item list with column types
{
  "ocr": $.ocrResult,
  "documentContext": "Invoices",
  "schema": {
    "invoiceNumber": { "type": "string",   "description": "Invoice number" },
    "invoiceDate":   { "type": "date",     "description": "Invoice date" },
    "totalAmount":   { "type": "currency", "description": "Total amount" }
  },
  "tables": {
    "items": {
      "description": "Invoice line items",
      "columns": {
        "article":  { "type": "string"   },
        "quantity": { "type": "number"   },
        "price":    { "type": "currency" }
      }
    }
  }
}
// With search index for later lookup in the data lake
{
  "ocr": $.ocrResult,
  "schema": {
    "documentNumber": { "type": "string", "description": "Document number" }
  },
  "discoverAdditional": true,
  "language": "en"
}

Notes

  • The task strictly requires an OCR result from the "Read document (OCR)" service task as input.
  • A precise description per field improves accuracy significantly (e.g. "Date in YYYY-MM-DD format" instead of just "Date").
  • A short documentContext is enough – the AI uses it to distinguish fields between similar document types (e.g. delivery note number vs. invoice number).
  • Values are extracted exactly as they appear in the document – only type conversions (e.g. date normalization) are applied.
  • If a value cannot be found reliably, null is returned – the AI does not guess.
  • For very large OCR outputs, maxContentLength can be increased.
  • The task uses a powerful language model and typically takes 5–20 seconds – depending on the complexity of the schema and any line-item tables.
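Because missing fields come back as null rather than a guess, downstream mappings should handle that case explicitly. A minimal JSONata sketch (again assuming the result is stored as extractResult):

```
// Fall back to 0 when totalAmount could not be found
$.extractResult.fields.totalAmount ? $.extractResult.fields.totalAmount : 0
```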

Tip

Two service tasks – "Read document (OCR)" followed by "AI: Extract data from OCR" – are enough to model the full flow from an incoming file (PDF or image) to structured data in the data lake. In combination with discoverAdditional: true, a search index is automatically generated as a side effect, making it possible to find documents later via fields that were never even defined in the schema.
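The discovered index can then be queried with an ordinary JSONata predicate – for example, to look up a value by its label (the variable name extractResult and the label "Order reference" are illustrative assumptions):

```
// Look up the order reference from the discovered key-value index
$.extractResult.discoveredKeyValues[key = "Order reference"].value
```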