Extract text from PDF

The “Extract text from PDF” service task extracts text from a single-page PDF file and returns it in a structured format for use in later process steps. Internally, the task relies on the AWS Textract API (Amazon Textract).

Input parameters

Provide the following field as task input:

{
  "fileReference": "string"
}

Explanation:

  • fileReference: Reference to the PDF file to analyze. This can be a file path or an ID in storage.

Output

The task returns the extracted text along with additional information about the document.

{
  "status": 200,
  "response": {
    "Blocks": [
      {
        "BlockType": "LINE",
        "Text": "Sample text",
        "Confidence": 99.5
      }
    ]
  }
}

Explanation:

  • status: Status of the operation (e.g., 200 for success).
  • Blocks: List of detected text blocks in the document, nested under response.
  • BlockType: Type of a block (e.g., LINE for a line of text).
  • Text: Recognized text of the block.
  • Confidence: Confidence score of the text recognition in percent (0 to 100).
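
If the task output is consumed outside the process engine, the structure described above can be read with a few lines of Python. This is a minimal sketch, not part of the service task: the helper name extract_text is hypothetical, and treating every status other than 200 as a failure is an assumption based on the example above.

```python
import json

# Sample task output, copied from the example above.
SAMPLE_OUTPUT = json.loads("""
{
  "status": 200,
  "response": {
    "Blocks": [
      {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5}
    ]
  }
}
""")


def extract_text(task_output: dict) -> str:
    """Join the text of all LINE blocks, one detected line per row.

    Hypothetical helper: the name and the error handling are assumptions,
    not something the service task itself provides.
    """
    status = task_output.get("status")
    if status != 200:  # assumption: anything but 200 means the extraction failed
        raise ValueError(f"Extraction failed with status {status}")
    blocks = task_output.get("response", {}).get("Blocks", [])
    lines = [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(extract_text(SAMPLE_OUTPUT))
```

Blocks without a Text field are skipped rather than raising an error, so the helper tolerates block types that carry no text.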

JSONata examples

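The following expression returns the text of every LINE block. Note that $map expects a function as its second argument:

$map(response.Blocks[BlockType="LINE"], function($block) { $block.Text })

The path expression response.Blocks[BlockType="LINE"].Text is a shorter, equivalent form.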
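
A common next step is to keep only lines that were recognized with high confidence. In JSONata this roughly corresponds to adding a condition such as Confidence >= 90 to the predicate. The Python sketch below mirrors that selection for use outside the process engine. The helper name confident_lines, the default threshold of 90.0, and the second block in the example data are all made up for illustration.

```python
from __future__ import annotations


def confident_lines(task_output: dict, min_confidence: float = 90.0) -> list[str]:
    """Return the text of LINE blocks whose Confidence is at least min_confidence.

    Hypothetical helper; the 90.0 default is an arbitrary illustration value.
    """
    blocks = task_output.get("response", {}).get("Blocks", [])
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and "Text" in block
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Illustration data: the first block is taken from the output example,
# the second block is invented to show the filter at work.
example_output = {
    "status": 200,
    "response": {
        "Blocks": [
            {"BlockType": "LINE", "Text": "Sample text", "Confidence": 99.5},
            {"BlockType": "LINE", "Text": "Smudged text", "Confidence": 41.0},
        ]
    },
}

if __name__ == "__main__":
    print(confident_lines(example_output))
```

Blocks that lack a Confidence value are treated as 0.0 and therefore dropped for any positive threshold; lower the threshold if that is too strict for your documents.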

Notes

  • Currently, this task supports only single-page PDF files.
  • Ensure that fileReference points to a valid PDF file.
  • You can further process the results with JSONata expressions.

Tip

Use the JSONata Playground to test complex JSONata expressions.