Overview of the pipeline:
Step 1: Scan and convert hard copies into PDFs or images.
Step 2: Create a Text Extraction Cloud Function (HTTP Trigger):
- Takes the PDF input as a Base64 stream.
- Converts PDF content to text.
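To make this concrete, here is a minimal sketch of what Step 2's function could look like, assuming Python's Functions Framework and PyMuPDF (fitz); the function name and JSON payload shape are illustrative, not prescribed by the pipeline.

```python
import base64

import fitz  # PyMuPDF
import functions_framework


@functions_framework.http
def extract_text(request):
    # Expect a JSON body such as {"pdf_base64": "<Base64-encoded PDF>"}
    # (an assumed payload shape for this sketch).
    payload = request.get_json(silent=True) or {}
    pdf_bytes = base64.b64decode(payload["pdf_base64"])

    # Open the PDF from memory and concatenate the text of every page.
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    text = "\n".join(page.get_text() for page in doc)

    return {"text": text}
```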
Step 3: Create a Text-to-Embeddings Cloud Function (HTTP Trigger):
- Second Cloud Function, triggered by GCP Cloud Tasks.
- Accepts a text payload.
- Converts text into embeddings using a pre-trained model.
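A hedged sketch of Step 3, assuming Vertex AI's pre-trained text-embedding model via the google-cloud-aiplatform SDK; the model name and payload shape are assumptions for illustration.

```python
import functions_framework
from vertexai.language_models import TextEmbeddingModel

# Load the pre-trained embedding model once, at cold start (assumes
# vertexai.init() / default credentials are configured for the project).
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")


@functions_framework.http
def text_to_embeddings(request):
    # Expect a JSON body such as {"text": "..."}.
    payload = request.get_json(silent=True) or {}
    embedding = model.get_embeddings([payload["text"]])[0]

    # embedding.values is the list of floats representing the text.
    return {"embedding": embedding.values}
```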
Step 4: Create an Index Update Cloud Function (HTTP Trigger):
- Third Cloud Function, triggered by GCP Cloud Tasks.
- Accepts the embeddings payload along with the PDF name and location.
- Generates a unique ID.
- Sends the embeddings and ID to the vector search database.
- Sends the PDF name, location, and ID to an RDBMS for storage.
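Step 4 could look roughly like the sketch below, assuming a Vertex AI Matching Engine index (with stream updates enabled) as the vector search database and a Postgres instance as the RDBMS; the index name, table schema, and connection details are all placeholders.

```python
import uuid

import functions_framework
import psycopg2
from google.cloud import aiplatform

# Hypothetical index resource name; replace with your own.
INDEX_NAME = "projects/my-project/locations/us-central1/indexes/123"


@functions_framework.http
def update_index(request):
    # Expect {"embedding": [...], "pdf_name": "...", "pdf_location": "gs://..."}.
    payload = request.get_json(silent=True) or {}

    # Generate a unique ID that ties the vector to its metadata row.
    doc_id = str(uuid.uuid4())

    # Upsert the embedding into the vector search index under that ID.
    index = aiplatform.MatchingEngineIndex(index_name=INDEX_NAME)
    index.upsert_datapoints(
        datapoints=[{"datapoint_id": doc_id, "feature_vector": payload["embedding"]}]
    )

    # Store the PDF name, location, and ID in the RDBMS for later lookup.
    conn = psycopg2.connect(host="...", dbname="docs", user="...", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO documents (id, pdf_name, pdf_location) VALUES (%s, %s, %s)",
            (doc_id, payload["pdf_name"], payload["pdf_location"]),
        )
    conn.close()

    return {"id": doc_id}
```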
Step 5: Lastly, create a Cloud Function that queries the vector database for summarization (HTTP Trigger):
- Accepts query text as a payload from the user.
- Converts text into embeddings.
- Uses vector search to find the nearest neighbour, obtaining its ID.
- Queries RDBMS using the obtained ID.
- Retrieves PDF name and location.
- Fetches the identified PDF.
- Sends PDF to GenAI for summarization.
- Returns the generated summary as the output.
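Putting it together, Step 5 might look like the following condensed sketch. It reuses the embedding model and documents table assumed above, and additionally assumes a deployed Matching Engine index endpoint, Cloud Storage for the PDFs, and the PaLM text-bison model for summarization; every resource name is a placeholder.

```python
import fitz  # PyMuPDF, reused to turn the fetched PDF back into text
import functions_framework
import psycopg2
from google.cloud import aiplatform, storage
from vertexai.language_models import TextEmbeddingModel, TextGenerationModel

# Loaded once at cold start; assumes default credentials and that
# vertexai.init() / aiplatform.init() are configured for the project.
embed_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
llm = TextGenerationModel.from_pretrained("text-bison@001")
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/my-project/locations/us-central1/indexEndpoints/456"
)


@functions_framework.http
def summarize(request):
    query = (request.get_json(silent=True) or {})["query"]

    # 1. Convert the user's query text into an embedding.
    vector = embed_model.get_embeddings([query])[0].values

    # 2. Vector search for the nearest neighbour; keep only its ID.
    matches = endpoint.find_neighbors(
        deployed_index_id="my_deployed_index", queries=[vector], num_neighbors=1
    )
    doc_id = matches[0][0].id

    # 3. Query the RDBMS with that ID to get the PDF name and location.
    conn = psycopg2.connect(host="...", dbname="docs", user="...", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pdf_name, pdf_location FROM documents WHERE id = %s", (doc_id,)
        )
        pdf_name, pdf_location = cur.fetchone()
    conn.close()

    # 4. Fetch the identified PDF from Cloud Storage (assuming the stored
    #    location is a gs://bucket/object URI) and extract its text.
    bucket_name, blob_name = pdf_location.removeprefix("gs://").split("/", 1)
    pdf_bytes = storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()
    text = "\n".join(p.get_text() for p in fitz.open(stream=pdf_bytes, filetype="pdf"))

    # 5. Send the document text to the GenAI model for summarization.
    summary = llm.predict(f"Summarize the following document:\n\n{text}")
    return {"pdf_name": pdf_name, "summary": summary.text}
```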
Let’s break down the intricacies of this pipeline to better grasp its transformative power.
Data Source:
The data can come from either hard copies or PDFs. To make this information usable, it needs to be converted into text. I have personally explored various Python libraries that give good results when converting PDFs to text, such as:
- PyPDF2
- pytesseract
- PyMuPDF
- pdfminer.six
- easyocr
- pyocr
- GCP’s Document AI.
However, it’s worth noting that using GCP’s Document AI with a custom model can be relatively costly. It’s also important to recognize that these libraries work best with high-quality images or PDFs, so scan quality directly affects the results.
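For a taste of how two of these libraries differ in practice, here is a small sketch: PyMuPDF reading a digital PDF's text layer, and pytesseract running OCR on a scanned image (which also requires the Tesseract binary to be installed). The file names are placeholders.

```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

# Digital PDF: the embedded text layer can be read directly.
pdf_text = "\n".join(page.get_text() for page in fitz.open("report.pdf"))

# Scanned hard copy: run OCR on the image instead.
scan_text = pytesseract.image_to_string(Image.open("scan.png"))
```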
Text Embeddings:
Let’s first understand what Embeddings are!
In simple terms, embeddings in machine learning are a way to represent things like words, phrases, or documents as numbers. Imagine each word or piece of text is assigned a unique set of numbers, kind of like a secret code. The cool part is that similar words or meanings end up having similar codes.
Imagine you have a database that stores information about various things, like documents or sentences. An embedding in this database is like a special code or signature assigned to each piece of information.
This code is a set of numbers that captures the essence or meaning of the information, and similar pieces of information end up with similar codes. So, when you want to find something similar in the database, you compare these codes rather than the actual content.
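Here is a toy illustration of that idea, using made-up three-dimensional vectors and cosine similarity; real embedding models produce vectors with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.9, 0.1, 0.2])        # made-up embedding for "cat"
kitten = np.array([0.85, 0.15, 0.25])  # close in meaning, so close in code
car = np.array([0.1, 0.9, 0.3])        # different meaning, different code

print(cosine_similarity(cat, kitten))  # ~0.99: very similar
print(cosine_similarity(cat, car))     # ~0.27: not similar
```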