AI: Extract data from OCR
The "AI: Extract data from OCR" service task processes the result of the "Read document (OCR)" task and returns structured data. From the combined OCR output (Markdown, layout blocks, tables, seals, and text fragments), it extracts typed fields, line-item tables (e.g. articles with quantity and price), and an automatic key-value index for later search.
The task is generic and works for any business document – e.g. delivery notes, invoices, contracts, purchase orders, certificates, or forms. An optional document context and a typed definition for each field produce significantly more accurate results than generic text extraction.
Input parameters
Provide the following fields as task input:
{
  "ocr": { },
  "documentContext": "Delivery notes from logistics providers",
  "schema": {
    "deliveryNoteNumber": {
      "type": "string",
      "description": "Delivery note number (e.g. DN-2024-001)"
    },
    "deliveryDate": {
      "type": "date",
      "description": "Delivery date in YYYY-MM-DD format"
    },
    "customerName": {
      "type": "string",
      "description": "Recipient / customer name"
    },
    "totalAmount": {
      "type": "currency",
      "description": "Total amount (numeric value in document currency)"
    },
    "paymentMethod": {
      "type": "enum",
      "values": ["cash", "invoice", "card"],
      "description": "Payment method"
    }
  },
  "tables": {
    "items": {
      "description": "Line items of delivered articles",
      "columns": {
        "pos": { "type": "integer" },
        "article": { "type": "string" },
        "quantity": { "type": "number" },
        "price": { "type": "currency" }
      }
    }
  },
  "discoverAdditional": true,
  "language": "en",
  "maxContentLength": 50000
}
Explanation:
- ocr: The full output of the "Read document (OCR)" service task – the object containing metadata, markdown, pages, etc.
- schema: The fields to be extracted. Each field requires a type and a description. Supported types:
  - string – text
  - number – decimal number
  - integer – whole number
  - date – date (YYYY-MM-DD)
  - currency – currency amount (numeric)
  - iban – IBAN
  - email – email address
  - boolean – true/false
  - enum – one of a fixed list of values; provide values: ["..."] as well
- tables (optional): Tables to be returned as line-item lists. Each table needs a description and a column schema (name + type). Values are returned as an array of objects.
- documentContext (optional): A short description of the document type. Helps the AI to disambiguate labels like "No." (e.g. delivery note number vs. invoice number).
- discoverAdditional (optional, default: false): When true, all additional recognizable key-value pairs are returned as well – useful as a full-text search index.
- language (optional, default: de): Language used for the labels in discoveredKeyValues.
- maxContentLength (optional, default: 50000): Maximum number of characters of OCR content passed to the AI.
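The rules above (every field needs a type and a description, and enum fields also need values) can be checked before the task input is submitted. The following Python sketch is only an illustration: validate_schema is a hypothetical helper that is not part of the service task, and the set of types is simply copied from the list above.

```python
# Hypothetical pre-flight check; not part of the service task itself.
# The supported types are copied from the documentation above.
SUPPORTED_TYPES = {
    "string", "number", "integer", "date", "currency",
    "iban", "email", "boolean", "enum",
}


def validate_schema(schema):
    """Return a list of problems; an empty list means the schema looks valid."""
    problems = []
    for name, spec in schema.items():
        field_type = spec.get("type")
        if field_type not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported or missing type {field_type!r}")
        if not spec.get("description"):
            problems.append(f"{name}: missing description")
        if field_type == "enum" and not spec.get("values"):
            problems.append(f"{name}: enum requires a non-empty 'values' list")
    return problems


schema = {
    "deliveryDate": {"type": "date", "description": "Delivery date in YYYY-MM-DD format"},
    "paymentMethod": {"type": "enum", "description": "Payment method"},
    "total": {"type": "money"},
}

for problem in validate_schema(schema):
    print(problem)
# paymentMethod: enum requires a non-empty 'values' list
# total: unsupported or missing type 'money'
# total: missing description
```

How the task itself reacts to an invalid schema is not described here, so a check like this is best treated as an early warning rather than a mirror of the task's own validation.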
Output
The task returns a structured JSON object:
{
  "fields": {
    "deliveryNoteNumber": "DN-2024-001",
    "deliveryDate": "2024-03-15",
    "customerName": "Miller Ltd.",
    "totalAmount": 3425.0,
    "paymentMethod": "invoice"
  },
  "positions": {
    "items": [
      { "pos": 1, "article": "Screws M8", "quantity": 100, "price": 0.45 },
      { "pos": 2, "article": "Nuts M8", "quantity": 100, "price": 0.18 }
    ]
  },
  "discoveredKeyValues": [
    { "key": "Delivery address", "value": "5 Industrial Rd, Stuttgart" },
    { "key": "Order reference", "value": "PO-7788" }
  ],
  "notFound": []
}
Explanation:
- fields: The fields defined in schema, with type-correct values. Fields that could not be found are returned as null and additionally listed in notFound.
- positions: Only present when tables was defined. For each table, an array of objects is returned – the keys match the defined columns.
- discoveredKeyValues: Only present when discoverAdditional: true was set. A list of all additional recognizable key-value pairs (max. 60 entries). Useful as a full-text search index.
- notFound: List of schema field names for which no value could be found.
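A consumer of this output has to handle two documented edge cases: fields that come back as null (and are listed in notFound), and a positions object that is absent when no tables were requested. The Python sketch below shows one way to do that. The helper names split_fields and table_rows are hypothetical, and the sample result is adapted from the example above with deliveryDate set to null for illustration.

```python
# Sample output adapted from the example above; deliveryDate is set to
# None (JSON null) here purely for illustration.
result = {
    "fields": {
        "deliveryNoteNumber": "DN-2024-001",
        "deliveryDate": None,
        "totalAmount": 3425.0,
    },
    "positions": {
        "items": [
            {"pos": 1, "article": "Screws M8", "quantity": 100, "price": 0.45},
            {"pos": 2, "article": "Nuts M8", "quantity": 100, "price": 0.18},
        ]
    },
    "notFound": ["deliveryDate"],
}


def split_fields(result):
    """Separate extracted values from fields reported as missing."""
    fields = result.get("fields", {})
    found = {name: value for name, value in fields.items() if value is not None}
    null_fields = {name for name, value in fields.items() if value is None}
    missing = sorted(set(result.get("notFound", [])) | null_fields)
    return found, missing


def table_rows(result, table_name):
    """Return the rows of one table, or [] when 'positions' or the table is absent."""
    return result.get("positions", {}).get(table_name, [])


found, missing = split_fields(result)
print(missing)                           # ['deliveryDate']
print(len(table_rows(result, "items")))  # 2
```

Merging notFound with the null fields is deliberately redundant: both signals are documented to agree, so the union simply guards against relying on only one of them.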
JSONata examples
/* Pass the OCR result directly from the previous step */
{
  "ocr": $.ocrResult,
  "documentContext": "Delivery notes",
  "schema": {
    "deliveryNoteNumber": { "type": "string", "description": "Delivery note number" },
    "deliveryDate": { "type": "date", "description": "Delivery date" },
    "customerName": { "type": "string", "description": "Recipient / customer" }
  }
}
/* Extract a line-item list with column types */
{
  "ocr": $.ocrResult,
  "documentContext": "Invoices",
  "schema": {
    "invoiceNumber": { "type": "string", "description": "Invoice number" },
    "invoiceDate": { "type": "date", "description": "Invoice date" },
    "totalAmount": { "type": "currency", "description": "Total amount" }
  },
  "tables": {
    "items": {
      "description": "Invoice line items",
      "columns": {
        "article": { "type": "string" },
        "quantity": { "type": "number" },
        "price": { "type": "currency" }
      }
    }
  }
}
/* With search index for later lookup in the data lake */
{
  "ocr": $.ocrResult,
  "schema": {
    "documentNumber": { "type": "string", "description": "Document number" }
  },
  "discoverAdditional": true,
  "language": "en"
}
Notes
- The task strictly requires an OCR result from the "Read document (OCR)" service task as input.
- A precise description per field improves accuracy significantly (e.g. "Date in YYYY-MM-DD format" instead of just "Date").
- A short documentContext is enough – the AI uses it to distinguish fields between similar document types (e.g. delivery note number vs. invoice number).
- Values are extracted exactly as they appear in the document – only type conversions (e.g. date normalization) are applied.
- If a value cannot be found reliably, null is returned – the AI does not guess.
- For very large OCR outputs, maxContentLength can be increased.
- The task uses a powerful language model and typically takes 5–20 seconds – depending on the complexity of the schema and any line-item tables.
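To decide whether maxContentLength needs to be raised for a given document, the size of the OCR output can be compared against the current limit. The Python sketch below is a rough heuristic only. It assumes that the length of the markdown string in the OCR object is representative of the content passed to the AI, which is not confirmed here. The function name suggest_max_content_length and the 20 % headroom are hypothetical choices.

```python
DEFAULT_MAX_CONTENT_LENGTH = 50000  # documented default of the task


def suggest_max_content_length(ocr, current_limit=DEFAULT_MAX_CONTENT_LENGTH,
                               headroom_percent=20):
    """Return a larger limit if the OCR markdown exceeds the current one.

    Assumption: the 'markdown' string approximates the content that counts
    against maxContentLength; the task's exact measure is not documented here.
    """
    markdown_length = len(ocr.get("markdown", ""))
    if markdown_length <= current_limit:
        return current_limit
    # Integer arithmetic avoids floating-point rounding surprises.
    return markdown_length * (100 + headroom_percent) // 100


small = {"markdown": "x" * 100}
large = {"markdown": "x" * 60000}
print(suggest_max_content_length(small))  # 50000
print(suggest_max_content_length(large))  # 72000
```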
Tip
Two service tasks – "Read document (OCR)" followed by "AI: Extract data from OCR" – are enough to model the full flow from an incoming file (PDF or image) to structured data in the data lake. In combination with discoverAdditional: true, a search index is automatically generated as a side effect, making it possible to find documents later via fields that were never even defined in the schema.
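To make the search-index idea concrete, the Python sketch below builds a simple token index from the discoveredKeyValues of several extraction results. It is an illustration only: the actual search in the data lake is provided by the platform and may work differently. The function build_search_index and the document ids are hypothetical, and the second sample document is invented for the example.

```python
def build_search_index(documents):
    """Map each lower-cased token from discovered keys and values to document ids."""
    index = {}
    for doc_id, result in documents.items():
        # discoveredKeyValues is only present when discoverAdditional was true.
        for pair in result.get("discoveredKeyValues", []):
            text = f"{pair['key']} {pair['value']}"
            for raw_token in text.lower().split():
                token = raw_token.strip(",.;:")
                if token:
                    index.setdefault(token, set()).add(doc_id)
    return index


documents = {
    "doc-1": {"discoveredKeyValues": [
        {"key": "Delivery address", "value": "5 Industrial Rd, Stuttgart"},
        {"key": "Order reference", "value": "PO-7788"},
    ]},
    "doc-2": {"discoveredKeyValues": [
        {"key": "Order reference", "value": "PO-9001"},
    ]},
}

index = build_search_index(documents)
print(sorted(index["po-7788"]))  # ['doc-1']
print(sorted(index["order"]))    # ['doc-1', 'doc-2']
```

Neither "Delivery address" nor "Order reference" has to appear in the schema for these lookups to work, which is the point of the tip above.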