Extract text from PDF
The “Extract text from PDF” service task extracts text from a single-page PDF file and returns it in a structured format that later process steps can use. Internally, the task relies on the AWS Textract API.
Input parameters
Provide the following field as task input:
{
"fileReference": "string"
}
Explanation:
fileReference: Reference to the PDF file to analyze. This can be a file path or an ID in storage.
Output
The task returns the extracted text along with additional information about the document.
{
"status": 200,
"response": {
"Blocks": [
{
"BlockType": "LINE",
"Text": "Sample text",
"Confidence": 99.5
}
]
}
}
Explanation:
status: Status of the operation (e.g., 200 for success).
Blocks: List of detected text blocks in the document.
BlockType: Type of block (e.g., LINE for a line of text).
Text: Recognized text.
Confidence: Confidence score of the text recognition, in percent.
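As a rough illustration of how these fields fit together, the sketch below reads a task result in Python and keeps only the LINE blocks whose confidence reaches a threshold. The `sample_result` dictionary mirrors the example output above. The helper name `confident_lines`, the 90 percent default threshold, and the choice to treat any status other than 200 as a failure are assumptions made for this example, not part of the task.

```python
from typing import Any

# Sample result copied from the output example above.
sample_result: dict[str, Any] = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5}
        ]
    },
}


def confident_lines(result: dict[str, Any], min_confidence: float = 90.0) -> list[str]:
    """Return the Text of LINE blocks at or above min_confidence (hypothetical helper)."""
    if result.get("status") != 200:
        # Assumption: any status other than 200 means the extraction failed.
        raise ValueError(f"extraction failed with status {result.get('status')}")
    blocks = result.get("response", {}).get("Blocks", [])
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


print(confident_lines(sample_result))
```

Raising the threshold is a simple way to drop uncertain lines before they reach later process steps; a suitable value depends on the quality of the scanned documents.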
JSONata examples
The following expression returns the text of every LINE block in the extracted data:
$map(response.Blocks[BlockType="LINE"], function($block) { $block.Text })
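For readers who want to check what the expression above selects, the following Python sketch applies the same logic outside the process engine: filter the blocks by BlockType, then collect their Text. The block contents are made-up illustration data, and `line_texts` is a hypothetical helper name; only the field names come from the output example.

```python
# Made-up task result for illustration; only the field names come from the output example.
result = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "PAGE"},
            {"BlockType": "LINE", "Text": "Invoice 2024", "Confidence": 99.1},
            {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.3},
            {"BlockType": "LINE", "Text": "Total: 42.00", "Confidence": 98.7},
        ]
    },
}


def line_texts(task_result):
    """Keep blocks whose BlockType is LINE and return their Text values in order."""
    blocks = task_result["response"]["Blocks"]
    return [block["Text"] for block in blocks if block["BlockType"] == "LINE"]


lines = line_texts(result)
# Join the lines into one string, e.g. for a later process step that expects plain text.
full_text = "\n".join(lines)
print(full_text)
```

Blocks of other types are skipped, so text that appears both as a LINE and as individual WORD blocks is not counted twice.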
Notes
- Currently, this task supports only single-page PDF files.
- Ensure that fileReference points to a valid PDF file.
- The results can be processed further with JSONata expressions.
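When fileReference is a local file path, a quick pre-check can catch obviously wrong files before the task runs. The sketch below is one possible approach, assuming the file is readable from where the check runs: it tests only whether the file starts with the standard PDF header bytes `%PDF-`. It does not verify that the PDF is complete or that it has a single page, and `looks_like_pdf` is a hypothetical helper, not part of the task.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and begins with the PDF header bytes '%PDF-'."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = Path(tmp) / "sample.pdf"
    pdf_path.write_bytes(b"%PDF-1.7\n")  # header only, not a complete PDF
    txt_path = Path(tmp) / "notes.txt"
    txt_path.write_bytes(b"plain text")
    pdf_ok = looks_like_pdf(str(pdf_path))
    txt_ok = looks_like_pdf(str(txt_path))

print(pdf_ok, txt_ok)
```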
Tip
Use the JSONata Playground to test complex JSONata expressions.