AI: Extract structured data

The "AI: Extract structured data" service task reads defined fields from unstructured text and returns them as a JSON object. Extraction is performed by an AI model (OpenAI). The fields to extract are specified in a freely definable schema.

Typical use cases include automatic capture of invoice data, reading customer data from emails, or structuring free-text inputs.

Input parameters

Provide the following fields as task input:

{
  "text": "Invoice No. 2024-001\nCompany: Müller GmbH\nDate: 15 March 2025\nAmount: EUR 1,500.00\nVAT ID: DE123456789",
  "schema": {
    "invoiceNumber": "Invoice number as string",
    "company": "Company name",
    "date": "Date in YYYY-MM-DD format",
    "amount": "Amount as number",
    "currency": "Currency code (e.g. EUR)",
    "vatId": "VAT identification number"
  },
  "returnNotFoundFields": false
}

Explanation:

text: The source text from which data should be extracted. It can originate from a document, an email, or a previous process step.
schema: An object whose keys are the desired field names and whose values describe what to extract. The descriptions help the AI find the correct value and format.
returnNotFoundFields (optional, default: false): If true, the result includes an additional _notFound field — an array listing the field names that could not be found in the text.
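
A minimal sketch of how a calling script might assemble and sanity-check this input before passing it to the task. The helper build_extract_input is hypothetical and not part of the product. It mirrors the three fields described above, and the checks for a non-empty text and non-empty descriptions are assumptions rather than documented requirements.

```python
import json


def build_extract_input(text, schema, return_not_found_fields=False):
    """Assemble the task input described above (hypothetical helper)."""
    # Assumption: an empty source text is never useful, so reject it early.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    # The schema must be an object with at least one field.
    if not isinstance(schema, dict) or not schema:
        raise ValueError("schema must contain at least one field")
    # Assumption: every field needs a textual description of what to extract.
    for name, description in schema.items():
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"schema field {name!r} needs a description")
    return {
        "text": text,
        "schema": dict(schema),
        "returnNotFoundFields": bool(return_not_found_fields),
    }


payload = build_extract_input(
    "Invoice No. 2024-001\nCompany: Müller GmbH",
    {"invoiceNumber": "Invoice number as string", "company": "Company name"},
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Validating on the caller's side keeps obviously malformed requests from reaching the AI step at all.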

Output

The task returns a JSON object whose keys match the schema fields:

{
  "invoiceNumber": "2024-001",
  "company": "Müller GmbH",
  "date": "2025-03-15",
  "amount": 1500.00,
  "currency": "EUR",
  "vatId": "DE123456789"
}

If a value cannot be found in the text, the field is set to null.

With returnNotFoundFields: true, for an input text that does not contain a VAT ID:

{
  "invoiceNumber": "2024-001",
  "company": "Müller GmbH",
  "date": "2025-03-15",
  "amount": 1500.00,
  "currency": "EUR",
  "vatId": null,
  "_notFound": ["vatId"]
}
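
Both output variants can be handled by the same downstream check. The sketch below assumes the task result has already been parsed from JSON; missing_fields is a hypothetical helper that prefers _notFound when it is present and otherwise falls back to the fields set to null.

```python
import json


def missing_fields(result):
    """Names of schema fields the task could not fill (hypothetical helper)."""
    if "_notFound" in result:
        # Present only when the task ran with returnNotFoundFields: true.
        return list(result["_notFound"])
    # Otherwise rely on the null marker (None after json.loads).
    return [name for name, value in result.items() if value is None]


complete = json.loads('{"invoiceNumber": "2024-001", "vatId": "DE123456789"}')
with_list = json.loads(
    '{"invoiceNumber": "2024-001", "vatId": null, "_notFound": ["vatId"]}'
)
without_list = json.loads('{"invoiceNumber": "2024-001", "vatId": null}')

print(missing_fields(complete))      # []
print(missing_fields(with_list))     # ['vatId']
print(missing_fields(without_list))  # ['vatId']
```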

JSONata examples

/* Example: take the text from a PDF extraction result in the process data and define the schema inline */
{
  "text": pdfExtract.text,
  "schema": {
    "invoiceNumber": "Invoice number",
    "date": "Invoice date in YYYY-MM-DD format",
    "totalAmount": "Total amount as number"
  }
}

/* Example: check whether all fields were found (gateway condition; only meaningful when the task ran with returnNotFoundFields: true) */
$count(result._notFound) = 0

Notes

The schema object must contain at least one field.
Field descriptions should be as precise as possible — especially the desired format (e.g. "Date in YYYY-MM-DD format" or "Amount as number").
Longer, more descriptive field descriptions generally lead to better results.
Missing values are automatically set to null, so subsequent steps can reliably check for them.
The AI automatically selects the most appropriate data type (string, number, or boolean).
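
Because the data type is chosen by the AI, a subsequent step may want to confirm that each value has the type its description asked for. The following is a minimal sketch of such a check. The function type_problems, the kind labels ("string", "number", "boolean", "date"), and the expected mapping are illustrative assumptions, not part of the task.

```python
import re
from datetime import date


def _is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def type_problems(result, expected):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, kind in expected.items():
        value = result.get(name)
        if value is None:
            continue  # not found; handled separately via null / _notFound
        if kind == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif kind == "boolean":
            ok = isinstance(value, bool)
        elif kind == "date":
            ok = isinstance(value, str) and _is_iso_date(value)
        else:  # treat every other label as "string"
            ok = isinstance(value, str)
        if not ok:
            problems.append(f"{name}: expected {kind}, got {value!r}")
    return problems


result = {"invoiceNumber": "2024-001", "date": "2025-03-15", "amount": "1,500.00"}
expected = {"invoiceNumber": "string", "date": "date", "amount": "number"}
for problem in type_problems(result, expected):
    print(problem)  # amount: expected number, got '1,500.00'
```

A non-empty problem list could, for example, route the process instance to a manual review step.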

Tip

Combining this task with the "Extract text from PDF" service task is particularly effective: first extract the text from a PDF, then automatically break it down into structured fields. Together with the "AI: Classify document" task, a complete document intake workflow can be built.