AI: Extract structured data

The "AI: Extract structured data" service task reads defined fields from unstructured text and returns them as a JSON object. Extraction is performed by an AI model (OpenAI). You specify the desired fields in a freely definable schema.

Typical use cases include automatic capture of invoice data, reading customer data from emails, or structuring free-text inputs.

Input parameters

Provide the following fields as task input:

{
  "text": "Invoice No. 2024-001\nCompany: Müller GmbH\nDate: 15 March 2025\nAmount: EUR 1,500.00\nVAT ID: DE123456789",
  "schema": {
    "invoiceNumber": "Invoice number as string",
    "company": "Company name",
    "date": "Date in YYYY-MM-DD format",
    "amount": "Amount as number",
    "currency": "Currency code (e.g. EUR)",
    "vatId": "VAT identification number"
  },
  "returnNotFoundFields": false
}

Explanation:

  • text: The source text from which data should be extracted. It can originate from a document, an email, or a previous process step.
  • schema: An object whose keys are the desired field names and whose values describe what to extract. The descriptions help the AI find the correct value and format.
  • returnNotFoundFields (optional, default: false): If true, the result includes an additional _notFound field: an array listing the names of the fields that could not be found in the text.
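
As a rough illustration, a preceding script step could assemble and sanity-check this input before the task runs. The sketch below is in Python. The helper name build_extraction_input is hypothetical and not part of the product, and the checks for an empty text and empty descriptions are assumptions. Only the three field names and the rule that the schema needs at least one field come from this page.

```python
import json


def build_extraction_input(text, schema, return_not_found_fields=False):
    """Assemble the task input (hypothetical helper, not a product API)."""
    # Assumption: an empty source text is never useful, so reject it early.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    # Documented rule: the schema object must contain at least one field.
    if not isinstance(schema, dict) or not schema:
        raise ValueError("schema must contain at least one field")
    # Assumption: every field should carry a textual description.
    for name, description in schema.items():
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"schema field {name!r} needs a description")
    return {
        "text": text,
        "schema": dict(schema),
        "returnNotFoundFields": bool(return_not_found_fields),
    }


payload = build_extraction_input(
    "Invoice No. 2024-001\nCompany: Müller GmbH",
    {"invoiceNumber": "Invoice number as string", "company": "Company name"},
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Failing early on an empty schema avoids sending a request that the task would presumably reject anyway.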

Output

The task returns a JSON object whose keys match the schema fields:

{
  "invoiceNumber": "2024-001",
  "company": "Müller GmbH",
  "date": "2025-03-15",
  "amount": 1500.00,
  "currency": "EUR",
  "vatId": "DE123456789"
}

If a value cannot be found in the text, the field is set to null.

With returnNotFoundFields: true, and assuming the source text contained no VAT ID:

{
  "invoiceNumber": "2024-001",
  "company": "Müller GmbH",
  "date": "2025-03-15",
  "amount": 1500.00,
  "currency": "EUR",
  "vatId": null,
  "_notFound": ["vatId"]
}
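
A downstream step may want to separate found values from missing ones regardless of whether returnNotFoundFields was set. The following is a minimal Python sketch, assuming the result arrives as an already parsed dictionary shaped like the examples above. The function name split_result is made up for illustration.

```python
def split_result(result):
    """Return (found, missing) for an extraction result.

    found:   dict of fields that carry a value
    missing: sorted list of field names that were not found
    """
    # Names the task reported explicitly; the key only exists when
    # returnNotFoundFields was true.
    reported = set(result.get("_notFound") or [])
    found = {}
    missing = set(reported)
    for name, value in result.items():
        if name == "_notFound":
            continue
        if value is None:
            # Missing values are returned as null (None in Python).
            missing.add(name)
        elif name not in reported:
            found[name] = value
    return found, sorted(missing)


result = {
    "invoiceNumber": "2024-001",
    "amount": 1500.00,
    "vatId": None,
    "_notFound": ["vatId"],
}
found, missing = split_result(result)
print(found)    # {'invoiceNumber': '2024-001', 'amount': 1500.0}
print(missing)  # ['vatId']
```

Treating null values and _notFound entries alike lets the same code handle both output variants.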

JSONata examples

/* Example: use text from PDF extraction and schema from process data */
{
  "text": pdfExtract.text,
  "schema": {
    "invoiceNumber": "Invoice number",
    "date": "Invoice date in YYYY-MM-DD format",
    "totalAmount": "Total amount as number"
  }
}

/* Example: check whether all fields were found (gateway condition; requires returnNotFoundFields: true) */
$count(result._notFound) = 0

Notes

  • The schema object must contain at least one field.
  • Field descriptions should be as precise as possible, especially regarding the desired format (e.g. "Date in YYYY-MM-DD format" or "Amount as number").
  • Longer, more descriptive source texts generally lead to better results.
  • Missing values are automatically set to null, so subsequent steps can reliably check for them.
  • The AI automatically selects the most appropriate data type (string, number, or boolean).
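
Because the format is steered only by the field descriptions and the AI picks the data type itself, a consuming step may want to verify what it received. The Python sketch below checks two fields of the invoice schema used on this page. The function name check_invoice_fields and the specific expectations are illustrative assumptions, not rules enforced by the task.

```python
import re
from datetime import date


def check_invoice_fields(result):
    """Return a list of problems found in an extraction result."""
    problems = []

    amount = result.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if amount is not None and (
        isinstance(amount, bool) or not isinstance(amount, (int, float))
    ):
        problems.append("amount is not a number")

    raw_date = result.get("date")
    if raw_date is not None:
        # Check the YYYY-MM-DD shape first, then the calendar validity.
        ok = isinstance(raw_date, str) and bool(
            re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw_date)
        )
        if ok:
            try:
                date.fromisoformat(raw_date)
            except ValueError:
                ok = False
        if not ok:
            problems.append("date is not in YYYY-MM-DD format")

    return problems


print(check_invoice_fields({"amount": 1500.00, "date": "2025-03-15"}))
# []
print(check_invoice_fields({"amount": "1,500.00", "date": "15.03.2025"}))
# ['amount is not a number', 'date is not in YYYY-MM-DD format']
```

Null values pass the check on purpose, since missing fields are a separate concern from wrongly typed ones.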

Tip

Combining this task with the "Extract text from PDF" service task is particularly effective: first extract the text from a PDF, then automatically break it down into structured fields. Together with the "AI: Classify document" task, a complete document intake workflow can be built.