Extract text from PDF
The “Extract text from PDF” service task automatically extracts text from a single-page PDF file. The extracted content is returned in a structured format so that you can use it in later process steps. Internally, the task relies on the AWS Textract API.
Input parameters
Provide the following field as task input:
{
"fileReference": "string"
}
Explanation:
fileReference
: Reference to the PDF file to analyze. This can be a file path or an ID in storage.
Output
The task returns the extracted text along with additional information about the document.
{
"status": 200,
"response": {
"Blocks": [
{
"BlockType": "LINE",
"Text": "Sample text",
"Confidence": 99.5
}
]
}
}
Explanation:
status
: Status of the operation (e.g., 200 for success).
Blocks
: List of detected text blocks in the document.
BlockType
: Type of block (e.g., LINE for a line of text).
Text
: Recognized text.
Confidence
: Confidence score of the text recognition in percent.
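As a rough illustration of how these fields might be consumed outside the process engine, the following Python sketch checks the status and collects the text of LINE blocks at or above a confidence threshold. The function name extract_lines and the min_confidence parameter are hypothetical and not part of the task; the sample dictionary simply mirrors the output shown above.

```python
from typing import Any


def extract_lines(result: dict[str, Any], min_confidence: float = 0.0) -> list[str]:
    """Return the text of LINE blocks whose confidence meets the threshold.

    Hypothetical helper: assumes `result` has the shape shown in the Output section.
    """
    if result.get("status") != 200:
        raise ValueError(f"Extraction failed with status {result.get('status')}")
    blocks = result.get("response", {}).get("Blocks", [])
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Sample data copied from the documented output format.
sample = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5}
        ]
    },
}

print(extract_lines(sample, min_confidence=90.0))  # prints ['Sample text']
```

Raising the threshold above 99.5 would return an empty list for this sample, which can be a simple way to flag pages that need manual review.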
JSONata examples
Example expression for processing the extracted data:
$map(response.Blocks[BlockType="LINE"], function($block) { $block.Text })
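For readers less familiar with JSONata, the intent of the expression above is to select every block whose BlockType is LINE and return its Text. A Python equivalent of that intent might look like the sketch below. The function name line_texts is made up for illustration, and the WORD blocks in the sample data are an assumption added to show the filtering.

```python
def line_texts(task_output: dict) -> list[str]:
    """Filter blocks on BlockType == "LINE", then project the Text field."""
    return [
        block["Text"]
        for block in task_output["response"]["Blocks"]
        if block["BlockType"] == "LINE"
    ]


# Illustrative data: one LINE block plus two WORD blocks (assumed for the example).
task_output = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5},
            {"BlockType": "WORD", "Text": "Sample", "Confidence": 99.6},
            {"BlockType": "WORD", "Text": "text", "Confidence": 99.4},
        ]
    },
}

print(line_texts(task_output))  # prints ['Sample text']
```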
Notes
- Currently, this task supports only single-page PDF files.
- Ensure that fileReference points to a valid PDF file.
- You can further process the results with JSONata expressions.
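If the file bytes are available before the task runs, a lightweight pre-check can catch obviously wrong uploads. The sketch below relies only on the fact that PDF files begin with the marker %PDF-. The helper name looks_like_pdf is hypothetical, and the check proves neither that the file is a fully valid PDF nor that it has a single page.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Heuristic check: a PDF file starts with the header marker '%PDF-'."""
    return data.startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n..."))   # prints True
print(looks_like_pdf(b"PK\x03\x04..."))   # prints False (ZIP signature)
```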
Tip
Use the JSONata Playground to test complex JSONata expressions.