Extract text from PDF

The “Extract text from PDF” service task extracts text from a single-page PDF file and returns it in a structured format that later process steps can use. Internally, the task relies on the AWS Textract API.

Input parameters

Provide the following field as task input:

{
  "fileReference": "string"
}

Explanation:

fileReference: Reference to the PDF file to analyze. This can be either a file path or the ID of a file in storage.

Output

The task returns the extracted text along with additional information about the document.

{
  "status": 200,
  "response": {
    "Blocks": [
      {
        "BlockType": "LINE",
        "Text": "Sample text",
        "Confidence": 99.5
      }
    ]
  }
}

Explanation:

status: Status of the operation (e.g., 200 for success).
Blocks: List of detected text blocks in the document, nested under response.
BlockType: Type of block (e.g., LINE for a line of text).
Text: Recognized text.
Confidence: Confidence score of the text recognition, as a percentage (0 to 100).
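
The fields above can also be checked outside the process engine, for example in a script that consumes the task output. The following Python snippet is a minimal sketch, not part of the task itself: the helper name confident_lines and the 90 percent threshold are assumptions chosen for illustration, and the sample data mirrors the output shown above.

```python
from typing import Any


def confident_lines(result: dict[str, Any], min_confidence: float = 90.0) -> list[str]:
    """Return the text of LINE blocks whose Confidence is at least min_confidence.

    Hypothetical helper: the name and the default threshold are illustrative.
    """
    # Treat anything other than status 200 as a failed extraction.
    if result.get("status") != 200:
        raise ValueError(f"text extraction failed with status {result.get('status')}")

    # Blocks is nested under response; fall back to an empty list if it is missing.
    blocks = result.get("response", {}).get("Blocks", [])
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Sample data with the same shape as the task output documented above.
sample = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5}
        ]
    },
}

print(confident_lines(sample))
```

Raising the threshold discards low-confidence lines, which may be useful before passing the text to a step that expects clean input.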

JSONata examples

Example expression that returns the text of every LINE block:

response.Blocks[BlockType="LINE"].Text
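
To sanity-check what such an expression should return, the same selection can be reproduced in plain Python. This is a sketch under assumptions: the function name line_texts and the sample blocks (including the WORD block and the text values) are made up for illustration and do not come from a real document.

```python
def line_texts(result: dict) -> list[str]:
    """Collect the Text of every block whose BlockType is LINE."""
    blocks = result["response"]["Blocks"]
    return [block["Text"] for block in blocks if block["BlockType"] == "LINE"]


# Illustrative task output; the WORD block shows that other block types are skipped.
sample_result = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Invoice 2024-001", "Confidence": 99.1},
            {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.3},
            {"BlockType": "LINE", "Text": "Total: 42.00 EUR", "Confidence": 98.7},
        ]
    },
}

# Join the lines into one string, e.g. for a later process step that expects plain text.
full_text = "\n".join(line_texts(sample_result))
print(full_text)
```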

Notes

Currently, this task supports only single-page PDF files.
Ensure that fileReference points to a valid PDF file.
The results can be processed further with JSONata expressions.

Tip

Use the JSONata Playground to test complex JSONata expressions.