AI: Extract structured data
The "AI: Extract structured data" service task reads defined fields from unstructured text and returns them as a JSON object. Extraction is performed via AI (OpenAI). The desired fields are specified through a freely definable schema.
Typical use cases include automatic capture of invoice data, reading customer data from emails, or structuring free-text inputs.
Input parameters
Provide the following fields as task input:
{
"text": "Invoice No. 2024-001\nCompany: Müller GmbH\nDate: 15 March 2025\nAmount: EUR 1,500.00\nVAT ID: DE123456789",
"schema": {
"invoiceNumber": "Invoice number as string",
"company": "Company name",
"date": "Date in YYYY-MM-DD format",
"amount": "Amount as number",
"currency": "Currency code (e.g. EUR)",
"vatId": "VAT identification number"
},
"returnNotFoundFields": false
}
Explanation:
- text: The source text from which data should be extracted. It can originate from a document, an email, or a previous process step.
- schema: An object whose keys are the desired field names and whose values describe what to extract. The descriptions help the AI find the correct value and format.
- returnNotFoundFields (optional, default: false): If true, the result includes an additional _notFound field, an array listing the field names that could not be found in the text.
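The input described above can also be assembled programmatically before it is handed to the task. The following Python sketch is an illustration only: the helper name build_extraction_input is hypothetical, and how the payload actually reaches the task depends on your process setup. It mirrors the three documented fields and rejects an empty schema, since the schema must contain at least one field.

```python
import json


def build_extraction_input(text, schema, return_not_found_fields=False):
    """Assemble the task input as a plain dict (hypothetical helper)."""
    if not isinstance(schema, dict) or not schema:
        # The schema must contain at least one field.
        raise ValueError("schema must be an object with at least one field")
    return {
        "text": text,
        "schema": dict(schema),
        "returnNotFoundFields": bool(return_not_found_fields),
    }


payload = build_extraction_input(
    "Invoice No. 2024-001\nCompany: Müller GmbH",
    {"invoiceNumber": "Invoice number as string", "company": "Company name"},
    return_not_found_fields=True,
)
# ensure_ascii=False keeps characters such as "ü" readable in the JSON text.
payload_json = json.dumps(payload, ensure_ascii=False, indent=2)
```

Building the payload in one place makes it easier to reuse the same schema descriptions across several process steps.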
Output
The task returns a JSON object whose keys match the schema fields:
{
"invoiceNumber": "2024-001",
"company": "Müller GmbH",
"date": "2025-03-15",
"amount": 1500.00,
"currency": "EUR",
"vatId": "DE123456789"
}
If a value cannot be found in the text, the field is set to null.
With returnNotFoundFields: true:
{
"invoiceNumber": "2024-001",
"company": "Müller GmbH",
"date": "2025-03-15",
"amount": 1500.00,
"currency": "EUR",
"vatId": null,
"_notFound": ["vatId"]
}
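A downstream step often needs to know which fields are missing, regardless of whether returnNotFoundFields was set. The Python sketch below shows one way to do that outside the process engine; the function name missing_fields is hypothetical. It uses _notFound when the result contains it and otherwise falls back to collecting the fields whose value is null.

```python
def missing_fields(result):
    """Return the names of fields that were not found (hypothetical helper)."""
    if "_notFound" in result:
        # Present only when the task ran with returnNotFoundFields: true.
        return list(result["_notFound"])
    # Fallback: missing values are set to null, which is None in Python.
    return [name for name, value in result.items() if value is None]


with_list = {"invoiceNumber": "2024-001", "vatId": None, "_notFound": ["vatId"]}
without_list = {"invoiceNumber": "2024-001", "vatId": None}

print(missing_fields(with_list))     # ['vatId']
print(missing_fields(without_list))  # ['vatId']
```

Both result shapes yield the same answer, so the consuming logic does not need to know how the task was configured.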
JSONata examples
/* Example: use text from PDF extraction and schema from process data */
{
"text": pdfExtract.text,
"schema": {
"invoiceNumber": "Invoice number",
"date": "Invoice date in YYYY-MM-DD format",
"totalAmount": "Total amount as number"
}
}
/* Example: check whether all fields were found (gateway condition; requires returnNotFoundFields: true) */
$count(result._notFound) = 0
Notes
- The schema object must contain at least one field.
- Field descriptions should be as precise as possible, especially regarding the desired format (e.g. "Date in YYYY-MM-DD format" or "Amount as number").
- Longer and more descriptive texts lead to better results.
- Missing values are automatically set to null, so subsequent steps can reliably check for them.
- The AI automatically selects the most appropriate data type (string, number, or boolean).
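Because the AI chooses the data types itself, it can be worth checking a result before later steps rely on it. The following Python sketch assumes the invoice schema from the input example (amount as a number, date in YYYY-MM-DD format). The names check_invoice_result and is_iso_date, as well as the problem messages, are hypothetical and not part of the service task.

```python
import re
from datetime import date

# Strict YYYY-MM-DD shape; the calendar check happens separately below.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """True if value is a string in YYYY-MM-DD form naming a real calendar date."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_invoice_result(result):
    """Return a list of problems found in an extraction result."""
    problems = []

    amount = result.get("amount")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if amount is not None and not is_number:
        problems.append("amount is not a number")

    raw_date = result.get("date")
    if raw_date is not None and not is_iso_date(raw_date):
        problems.append("date is not in YYYY-MM-DD format")

    return problems
```

Fields that are null are skipped here on purpose, since missing values are reported separately by the task.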
Tip
Combining this task with the "Extract text from PDF" service task is particularly effective: first extract the text from a PDF, then automatically break it down into structured fields. Together with the "AI: Classify document" task, a complete document intake workflow can be built.