Extract text from PDF

The “Extract text from PDF” service task extracts the text from a single-page PDF file and returns it in a structured format that later process steps can use. Internally, the task relies on the AWS Textract API.

Input parameters

Provide the following field as task input:

{
  "fileReference": "string"
}

Explanation:

fileReference: Reference to the PDF file to analyze. This can be a file path or an ID in storage.
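A minimal sketch in Python of how this input could be assembled before the task runs. The helper name build_task_input and the sample reference are hypothetical; how the input is actually handed to the task depends on your process setup.

```python
import json


def build_task_input(file_reference: str) -> dict:
    """Return the task input for a file reference (hypothetical helper)."""
    # The task expects a single string field, so reject empty or non-string values early.
    if not isinstance(file_reference, str) or not file_reference.strip():
        raise ValueError("fileReference must be a non-empty string")
    return {"fileReference": file_reference.strip()}


# "invoices/2024-001.pdf" is a made-up example reference.
task_input = build_task_input("invoices/2024-001.pdf")
print(json.dumps(task_input, indent=2))
```

The check only guards against an empty value. It cannot tell whether the reference resolves to a valid single-page PDF.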

Output

The task returns the extracted text along with metadata for each detected block, such as its type and a confidence score.

{
  "status": 200,
  "response": {
    "Blocks": [
      {
        "BlockType": "LINE",
        "Text": "Sample text",
        "Confidence": 99.5
      }
    ]
  }
}

Explanation:

status: Status of the operation (e.g., 200 for success).
response: Result of the analysis. It contains the Blocks list.
Blocks: List of detected text blocks in the document.
BlockType: Type of block (e.g., LINE for a line of text).
Text: Recognized text.
Confidence: Confidence score of the text recognition in percent (0 to 100).
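The following Python sketch shows one way to consume this output outside the process engine: it collects the text of all LINE blocks and skips those below a confidence threshold. The function name extract_lines, the min_confidence parameter, and the second sample block are assumptions made for illustration. Treating every status other than 200 as a failure is also an assumption.

```python
from typing import Any


def extract_lines(result: dict[str, Any], min_confidence: float = 0.0) -> list[str]:
    """Return the text of all LINE blocks at or above min_confidence (in percent)."""
    # Assumption: any status other than 200 means the extraction failed.
    if result.get("status") != 200:
        raise RuntimeError(f"Text extraction failed with status {result.get('status')}")

    blocks = result.get("response", {}).get("Blocks", [])
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and "Text" in block
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Sample output from above, plus one made-up low-confidence line.
sample = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5},
            {"BlockType": "LINE", "Text": "Smudged text", "Confidence": 41.0},
        ]
    },
}

lines = extract_lines(sample, min_confidence=90.0)
print(lines)
```

A threshold of 90 is only an example value. A suitable cut-off depends on the scan quality and on how costly a misread is in the later process steps.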

JSONata examples

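Example expression that returns the text of every LINE block:

response.Blocks[BlockType="LINE"].Text

If only one block matches, JSONata returns a single string instead of an array. Append [] to the expression to always get an array:

response.Blocks[BlockType="LINE"].Text[]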

Notes

Currently, this task supports only single-page PDF files.
Ensure that fileReference points to a valid PDF file.
You can further process the results with JSONata expressions.

Tip

Use the JSONata Playground to test complex JSONata expressions.