Introduction
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. This article will help you build your own data pipeline for extracting meaningful information from an image or a PDF.
Example of an Amazon Textract use case:
To demonstrate a use case, I ran an input image (shown above) through Amazon Textract. The images below show the information the service extracted from it.
AWS services required
S3
- Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.
- One S3 bucket is required for uploading documents. A second S3 bucket stores the output files.
Lambda functions
- Lambda is a compute service that lets you run code without provisioning or managing servers.
- We will need two Lambda functions: one triggers the Amazon Textract service, and the other logs the result once document analysis is complete.
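As a rough illustration of what the first function receives, here is a minimal sketch that pulls the bucket name and object key out of an S3 event notification. The function name `extract_s3_objects` and the shape of the return value are my own choices for illustration, not part of the pipeline described in this article.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return the bucket and key of every object named in an S3 event."""
    documents = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        documents.append(
            {
                "bucket": s3_info["bucket"]["name"],
                # Object keys arrive URL-encoded (a space becomes '+'),
                # so decode them before using them in another API call.
                "key": unquote_plus(s3_info["object"]["key"]),
            }
        )
    return documents


def lambda_handler(event, context):
    # Hypothetical entry point: it only reports which documents it would
    # pass on to Textract, without calling any AWS service.
    return {"documents": extract_s3_objects(event)}
```

Keeping the event parsing in its own function makes it easy to test locally with a hand-written event before the function is deployed.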
SNS
- Amazon Simple Notification Service (Amazon SNS) is a fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication.
- The Amazon Textract method used for document analysis is asynchronous. Hence, SNS is used to trigger the second Lambda function once the analysis is complete.
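To make the asynchronous hand-off concrete, here is a minimal sketch of how a job could be started so that Textract publishes to an SNS topic when it finishes. I am assuming the boto3 `start_document_analysis` call here; the helper names, the `TABLES` and `FORMS` feature types, and the `textract-output` prefix are illustrative assumptions, and the ARNs must come from your own account.

```python
def build_analysis_request(bucket, key, sns_topic_arn, role_arn, output_bucket):
    """Assemble keyword arguments for Textract's start_document_analysis."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Assumed feature types; choose the ones your documents need.
        "FeatureTypes": ["TABLES", "FORMS"],
        # Textract publishes the completion message to this topic, using
        # an IAM role that is allowed to publish to it.
        "NotificationChannel": {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        },
        # Where Textract writes the result files (prefix is hypothetical).
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": "textract-output"},
    }


def start_analysis(textract_client, request):
    """Start the asynchronous job and return the job id from the response."""
    response = textract_client.start_document_analysis(**request)
    return response["JobId"]
```

Inside a Lambda function, `textract_client` would typically be created with `boto3.client("textract")`. Passing the client in as an argument keeps the sketch testable without AWS credentials.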
RDS
- Amazon Relational Database Service (RDS) is a collection of managed services that makes it simple to set up, operate, and scale databases in the cloud.
- Amazon RDS is used here for logging events, so that the status of the pipeline can be tracked.
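The logging idea can be sketched as follows. Note that this is not RDS code: I am using Python's built-in `sqlite3` as a local stand-in so the example runs anywhere, and the `pipeline_log` table with its columns is a hypothetical schema. Against a real RDS instance you would connect with the driver for your database engine, and the parameter placeholder style may differ.

```python
import sqlite3
from datetime import datetime, timezone


def create_log_table(conn):
    """Create the (hypothetical) log table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pipeline_log (
            job_id TEXT NOT NULL,
            document_key TEXT NOT NULL,
            status TEXT NOT NULL,
            logged_at TEXT NOT NULL
        )
        """
    )


def log_event(conn, job_id, document_key, status):
    """Append one status row for a Textract job."""
    conn.execute(
        "INSERT INTO pipeline_log (job_id, document_key, status, logged_at) "
        "VALUES (?, ?, ?, ?)",
        (job_id, document_key, status, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def latest_status(conn, job_id):
    """Return the most recently logged status for a job, or None."""
    row = conn.execute(
        "SELECT status FROM pipeline_log WHERE job_id = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (job_id,),
    ).fetchone()
    return row[0] if row else None
```

Appending a new row for every status change, rather than updating one row in place, keeps a history of each document as it moves through the pipeline.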
Pipeline
Explanation
- A document is uploaded to the S3 raw bucket.
- The upload triggers a Lambda function, “Textract lambda function”.
- This Lambda function calls an asynchronous Amazon Textract method to extract text from the document. The call returns immediately with a “job id”, which is logged in the database on Amazon RDS.
- Since the Textract method used here is asynchronous, Textract continues processing the document in the background even after “Textract lambda function” has finished executing.
- When Textract finishes processing the document, it saves the output to another S3 bucket and publishes a notification through Amazon SNS.
- The SNS notification triggers a second Lambda function, “Logging and Metadata update”, which logs the current status of the document (success or failure) to the Amazon RDS database.
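The last step of the flow above can be sketched like this: the second function reads the SNS record, decodes the JSON message from Textract, and decides whether the job succeeded. The field names (`JobId`, `Status`, `DocumentLocation`) follow the completion message format as I understand it, so verify them against a real notification. The function name and the returned dictionary are hypothetical.

```python
import json


def parse_textract_notification(event):
    """Extract job id, status, and document location from an SNS event."""
    results = []
    for record in event.get("Records", []):
        # SNS delivers the Textract message as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        location = message.get("DocumentLocation", {})
        results.append(
            {
                "job_id": message["JobId"],
                "status": message["Status"],
                "bucket": location.get("S3Bucket"),
                "key": location.get("S3ObjectName"),
                # Anything other than SUCCEEDED is treated as a failure.
                "succeeded": message["Status"] == "SUCCEEDED",
            }
        )
    return results
```

Each returned entry holds what the logging function needs in order to record success or failure for the document in the database.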
Code Snippet
Below is a simple Python implementation for calling the Amazon Textract service: