Extract text from PDF (invoice)
The “Extract text from PDF (invoice)” service task automatically extracts text and structured invoice content from a single-page PDF document. The task uses the AWS Textract AnalyzeExpense
API to return invoice details such as total amount, vendor, invoice date, and other relevant fields in a structured format.
Input parameters
Provide the following field as task input:
{
  "fileReference": "string"
}
Explanation:
- fileReference: Reference to the PDF file to analyze. This can be a file path or an ID in storage.
Output
The task returns the extracted invoice data along with additional document information.
{
  "status": 200,
  "response": {
    "Expenses": [
      {
        "ExpenseType": "INVOICE",
        "SummaryFields": [
          { "Type": "VENDOR_NAME", "Value": "Pantarey GmbH" },
          { "Type": "INVOICE_DATE", "Value": "2024-12-22" },
          { "Type": "TOTAL_AMOUNT", "Value": "499.99" }
        ]
      }
    ]
  }
}
Explanation:
- status: Status of the operation (e.g., 200 for success).
- Expenses: List of detected invoice blocks in the document.
- SummaryFields: List of extracted fields and their values.
- Type: Type of the detected field (e.g., VENDOR_NAME, INVOICE_DATE, TOTAL_AMOUNT).
- Value: Recognized value for the field.
JSONata examples
Example expression for processing the extracted data:
$map(response.Expenses.SummaryFields, function($field) {
  {
    "type": $field.Type,
    "value": $field.Value
  }
})
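A workflow often needs only a few named fields rather than the full list. The following is a minimal sketch of how these could be picked out with JSONata predicates. It assumes the expression is evaluated against the task output exactly as shown in the Output section, that the document contains a single invoice block, and that Type and Value are plain strings. The result keys vendor, invoiceDate, and total are illustrative names, not part of the task's output.

```jsonata
/* Assumption: the input is the task output from the Output section. */
/* [0] selects the first detected invoice block in the document.     */
{
  "vendor": response.Expenses[0].SummaryFields[Type = "VENDOR_NAME"].Value,
  "invoiceDate": response.Expenses[0].SummaryFields[Type = "INVOICE_DATE"].Value,
  /* Value is returned as a string, so convert it before doing arithmetic. */
  "total": $number(response.Expenses[0].SummaryFields[Type = "TOTAL_AMOUNT"].Value)
}
```

For the sample output above, this should yield an object with vendor "Pantarey GmbH", invoiceDate "2024-12-22", and total 499.99 as a number. If a field type was not detected, its predicate matches nothing and the corresponding key is left out of the result.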
Notes
- Currently this task supports single-page PDF files.
- Ensure that fileReference points to an existing PDF file.
- The task is optimized for invoices. For general PDF text extraction, use Extract text from PDF.
Tip
Use the JSONata Playground to test complex JSONata expressions and process the extracted invoice data.