Extract text from PDF (invoice)

The “Extract text from PDF (invoice)” service task extracts text and structured invoice content from a single-page PDF document. The task uses the Amazon Textract AnalyzeExpense API to return invoice details such as total amount, vendor, invoice date, and other relevant fields in a structured format.

Input parameters

Provide the following field as task input:

{
  "fileReference": "string"
}

Explanation:

  • fileReference: Reference to the PDF file to analyze. This can be a file path or an ID in storage.

Output

The task returns the extracted invoice data along with additional document information.

{
  "status": 200,
  "response": {
    "Expenses": [
      {
        "ExpenseType": "INVOICE",
        "SummaryFields": [
          { "Type": "VENDOR_NAME", "Value": "Pantarey GmbH" },
          { "Type": "INVOICE_DATE", "Value": "2024-12-22" },
          { "Type": "TOTAL_AMOUNT", "Value": "499.99" }
        ]
      }
    ]
  }
}

Explanation:

  • status: Status of the operation (e.g., 200 for success).
  • Expenses: List of detected invoice blocks in the document.
  • SummaryFields: List of extracted fields and their values.
  • Type: Type of the detected field (e.g., VENDOR_NAME, INVOICE_DATE, TOTAL_AMOUNT).
  • Value: Recognized value for the field.
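As a rough illustration of how this output could be consumed outside the workflow, the following Python sketch collects all SummaryFields into a single lookup by field type. It assumes the output has exactly the shape shown above; the function name `summary_fields_by_type` and the `sample_output` variable are hypothetical and not part of the task.

```python
from typing import Any


def summary_fields_by_type(task_output: dict[str, Any]) -> dict[str, str]:
    """Collect the SummaryFields of all detected expenses into a Type -> Value dict.

    Hypothetical helper: assumes the output shape documented above.
    """
    status = task_output.get("status")
    if status != 200:
        raise ValueError(f"extraction failed with status {status}")

    fields: dict[str, str] = {}
    for expense in task_output.get("response", {}).get("Expenses", []):
        for field in expense.get("SummaryFields", []):
            # Keep the first value seen for each type; later duplicates are ignored.
            fields.setdefault(field["Type"], field["Value"])
    return fields


# Sample data copied from the output example above.
sample_output = {
    "status": 200,
    "response": {
        "Expenses": [
            {
                "ExpenseType": "INVOICE",
                "SummaryFields": [
                    {"Type": "VENDOR_NAME", "Value": "Pantarey GmbH"},
                    {"Type": "INVOICE_DATE", "Value": "2024-12-22"},
                    {"Type": "TOTAL_AMOUNT", "Value": "499.99"},
                ],
            }
        ],
    },
}

fields = summary_fields_by_type(sample_output)
print(fields["VENDOR_NAME"])
```

Keeping only the first value per type is a simplification; a document with several invoice blocks may need one lookup per entry in Expenses instead.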

JSONata examples

Example expression for processing the extracted data:

response.Expenses[].SummaryFields[].{
  "type": Type,
  "value": Value
}
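The expression is meant to yield a list of type/value pairs. If those pairs are processed further in code, a sketch like the following could convert them into typed values. It assumes that dates arrive in ISO format and amounts use a dot as the decimal separator, as in the sample output; real invoices may differ, in which case the helper returns None for that field. The function name `to_typed_invoice` is hypothetical.

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_date(text: Optional[str]) -> Optional[date]:
    """Parse an ISO date (YYYY-MM-DD); return None if missing or in another format."""
    try:
        return date.fromisoformat(text) if text else None
    except ValueError:
        return None


def parse_amount(text: Optional[str]) -> Optional[Decimal]:
    """Parse a dot-separated decimal amount; return None if missing or unparseable."""
    try:
        return Decimal(text) if text else None
    except InvalidOperation:
        return None


def to_typed_invoice(pairs: list[dict[str, str]]) -> dict[str, object]:
    """Turn [{"type": ..., "value": ...}, ...] into a small typed record."""
    raw = {pair["type"]: pair["value"] for pair in pairs}
    return {
        "vendor": raw.get("VENDOR_NAME"),
        "invoice_date": parse_date(raw.get("INVOICE_DATE")),
        "total_amount": parse_amount(raw.get("TOTAL_AMOUNT")),
    }


# Pairs in the shape the JSONata expression is intended to produce.
pairs = [
    {"type": "VENDOR_NAME", "value": "Pantarey GmbH"},
    {"type": "INVOICE_DATE", "value": "2024-12-22"},
    {"type": "TOTAL_AMOUNT", "value": "499.99"},
]

invoice = to_typed_invoice(pairs)
print(invoice)
```

Using Decimal rather than float avoids rounding surprises when amounts are later summed or compared.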

Notes

  • Currently, this task supports only single-page PDF files.
  • Ensure that fileReference points to an existing PDF file.
  • The task is optimized for invoices. For general PDF text extraction, use Extract text from PDF.

Tip

Use the JSONata Playground to test complex JSONata expressions and process the extracted invoice data.