AI: Query Markdown content

The "AI: Query Markdown content" service task answers targeted questions about a Markdown text and returns the answers as a structured JSON object. Questions are passed as an array, and each question is linked to an attribute name that becomes the key of its answer in the result object.

This task is particularly well suited for analyzing OCR results: a scanned PDF is first converted to Markdown, then specific information such as delivery note numbers, dates or amounts can be extracted through targeted questions.

Input parameters

Provide the following fields as task input:

{
  "content": "# Delivery Note\n\nDelivery Note No: DN-2024-001\nCommission: COM-4711\nDate: 15 March 2024\n\n| Pos | Article | Qty |\n|-----|---------|-----|\n| 1 | Widget A | 100 |\n| 2 | Widget B | 50 |",
  "questions": [
    {
      "question": "What is the delivery note number?",
      "attribute": "deliveryNoteNumber",
      "format": "e.g. DN-2024-001"
    },
    {
      "question": "What is the commission number?",
      "attribute": "commissionNumber"
    },
    {
      "question": "What is the date of the delivery note?",
      "attribute": "date",
      "format": "YYYY-MM-DD",
      "validation": "^\\d{4}-\\d{2}-\\d{2}$"
    },
    {
      "question": "How many line items does the table contain?",
      "attribute": "itemCount",
      "format": "Integer"
    }
  ],
  "returnNotFoundFields": false,
  "maxContentLength": 50000
}

Explanation:

content: The Markdown text to be analyzed. Typically an OCR result or other structured text from a previous process step.
questions: An array of questions. Each question contains:
- question: The question in natural language (English or German).
- attribute: The key name in the result object.
- format (optional): A hint for the AI about the expected answer format (e.g. "YYYY-MM-DD" or "integer").
- validation (optional): A regular expression (regex) to validate the extracted value. If the value does not match, the AI is automatically asked to correct it (up to 3 attempts).
returnNotFoundFields (optional, default: false): If true, the result includes an additional _notFound field listing attribute names for which no answer was found.
maxContentLength (optional, default: 50000): Maximum character count of the content. Can be increased if needed.
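The rules above can be checked on the caller's side before the task runs. The following is a minimal sketch, not part of the task itself: check_task_input is a hypothetical helper, and it assumes that Python's re module is close enough to the regex dialect the task uses. What the task does when content exceeds maxContentLength is not described here, so the sketch only reports the overrun.

```python
import re


def check_task_input(task_input: dict) -> list[str]:
    """Return a list of problems found in a task input (empty if none)."""
    problems = []

    content = task_input.get("content")
    if not isinstance(content, str):
        problems.append("content must be a string")
    else:
        # 50000 is the documented default for maxContentLength.
        limit = task_input.get("maxContentLength", 50000)
        if len(content) > limit:
            problems.append(
                f"content has {len(content)} characters, limit is {limit}"
            )

    questions = task_input.get("questions") or []
    if not questions:
        problems.append("questions must contain at least one entry")

    for i, q in enumerate(questions):
        # question and attribute are the two required keys per entry.
        for key in ("question", "attribute"):
            if not q.get(key):
                problems.append(f"questions[{i}] is missing '{key}'")
        # validation is optional; if present it should at least compile.
        pattern = q.get("validation")
        if pattern is not None:
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"questions[{i}] has an invalid regex: {exc}")

    return problems
```

An empty list means the input passed these basic checks; it does not guarantee that the task will accept it.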

Output

The task returns a JSON object whose keys match the attribute names of the questions:

{
  "deliveryNoteNumber": "DN-2024-001",
  "commissionNumber": "COM-4711",
  "date": "2024-03-15",
  "itemCount": 2
}

If an answer cannot be found in the text, the field is set to null.

With returnNotFoundFields: true (every answer was found in this example, so _notFound is empty):

{
  "deliveryNoteNumber": "DN-2024-001",
  "commissionNumber": "COM-4711",
  "date": "2024-03-15",
  "itemCount": 2,
  "_notFound": []
}
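A downstream step usually needs to separate answered attributes from missing ones. The sketch below shows one way to do that in Python; split_result is a hypothetical helper, and it assumes that missing attributes appear as null (None after JSON parsing) and that _notFound, when present, lists the same attribute names.

```python
def split_result(result: dict) -> tuple[dict, list[str]]:
    """Split a task result into found answers and missing attribute names."""
    # Everything except the bookkeeping field is an answer.
    answers = {k: v for k, v in result.items() if k != "_notFound"}

    reported = result.get("_notFound")
    if reported is not None:
        # returnNotFoundFields was true: trust the list from the task.
        missing = list(reported)
    else:
        # Otherwise fall back to the attributes that came back as null.
        missing = [k for k, v in answers.items() if v is None]

    found = {k: v for k, v in answers.items() if v is not None}
    return found, missing
```

For example, a result with "commissionNumber" set to null and no _notFound field yields "commissionNumber" in the list of missing attributes.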

JSONata examples

{
  "content": ocrResult.markdown,
  "questions": [
    {
      "question": "What is the delivery note number?",
      "attribute": "deliveryNoteNumber",
      "format": "e.g. DN-2024-001"
    },
    {
      "question": "What is the date?",
      "attribute": "date",
      "format": "YYYY-MM-DD",
      "validation": "^\\d{4}-\\d{2}-\\d{2}$"
    }
  ]
}

{
  "content": previous_step.text,
  "questions": [
    {
      "question": "What is the total amount?",
      "attribute": "totalAmount",
      "format": "Number"
    },
    {
      "question": "What is the invoice number?",
      "attribute": "invoiceNumber"
    }
  ],
  "returnNotFoundFields": true
}

Notes

At least one question must be included in the questions array.
Questions can be written in English or German; the AI understands both equally well.
Precise format hints significantly improve result quality (e.g. "Date in YYYY-MM-DD format" rather than just "date").
The optional validation enables automatic verification: if the extracted value does not match the regex pattern, the AI is re-queried with targeted correction feedback.
For very long texts, maxContentLength can be increased. The default of 50,000 characters covers most use cases.
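The validation behavior described in these notes can be pictured as a small retry loop. This is an illustrative sketch only, not the task's actual implementation: ask stands in for the AI call and is hypothetical, "up to 3 attempts" is read here as three tries in total, and returning None after the last failed attempt is an assumption.

```python
import re
from typing import Callable, Optional


def ask_with_validation(
    ask: Callable[[Optional[str]], str],
    pattern: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Query until the answer matches the regex or attempts run out."""
    feedback = None  # no correction feedback on the first attempt
    for _ in range(max_attempts):
        value = ask(feedback)
        if re.search(pattern, value):
            return value
        # Pass the failed value and the pattern back as correction feedback.
        feedback = f"'{value}' does not match {pattern}; please correct it"
    return None  # assumption: no valid answer after all attempts
```

With the date pattern from the examples, a first answer such as "15 March 2024" would fail the check and trigger a second query that carries the correction feedback.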

Tip

This task works particularly well in combination with OCR results: first convert a scanned PDF to Markdown via OCR, then use this task to extract the relevant information. Combined with the "AI: Extract key-value pairs" task, it forms a complete document intake workflow: automatic detection of all fields, plus targeted follow-up queries for critical values.