Document Parser Data Pipeline using Apache Kafka

by: Sangeeta Saha, Ashwanth K

February 10, 2025

In today's world there are many use cases where information needs to be extracted from files: medical bills submitted to HR, where the total amount claimed can be verified against the employee's name; a billing system that calculates amounts payable to each vendor by processing their bills; or resume processing, where key information about a candidate is extracted and matched against job requirements.

Automating this process greatly improves its efficiency and scalability. One way to do it is with a Document Parser Data Pipeline: files are fed in at one end, and the data they contain is read, verified, and stored in a data store at the other.

There are many tools and techniques available for building such a solution. Two popular approaches are a Modular Document Parser and a Visual Language Model (VLM). In a Modular Document Parser, each step is handled by a separate module, which makes the overall process easier to understand and maintain. With a Visual Language Model, the entire process is performed in a single step.

Of course, there are a few challenges as well, such as the variety of file types involved (text, Word, PDF, or Excel). Documents may arrive as native computer files or as scanned images, in which case Optical Character Recognition (OCR) is needed. Information within documents can also appear in different formats such as paragraphs, tables, or graphs, so the document parser needs to understand different layouts to extract key-value pairs correctly. As a final step, we need to test for data accuracy and integrity.

In one of our projects, we built a Modular Document Parser Data Pipeline using Apache Kafka as the messaging system, along with Python and Java microservices. Apache Kafka is an open-source, event-driven streaming platform that can handle real-time data feeds. A Kafka pipeline typically consists of topics, producers, and consumers. A Kafka topic is a named category used to store messages. Producers are applications that send/write data to a Kafka topic. Consumers are applications that pull messages from a Kafka topic and read them.
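
As a minimal sketch of these three pieces, the snippet below shows a Python producer writing to a topic and a consumer reading from it, using the kafka-python package. The broker address, consumer group, and message fields are illustrative assumptions, not taken from our codebase.

import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"  # assumed broker address

# Producer: write a message to the "doc_upload" topic
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("doc_upload", {"traceId": "abc-123", "uploadDate": "2025-02-10"})
producer.flush()

# Consumer: read messages from the same topic
consumer = KafkaConsumer(
    "doc_upload",
    bootstrap_servers=BROKER,
    group_id="doc-parser",  # assumed consumer group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value["traceId"], message.value["uploadDate"])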

The workflow in our Document Parser Data Pipeline consists of the following modules:

  1. Uploading documents
  2. Document Parsing
  3. Validation and Post Processing
  4. Database Storage

1. Uploading Documents

Users can upload documents using any backend application. Uploaded documents are stored in an AWS S3 bucket. Each uploaded file is assigned a traceId, which uniquely identifies the document and its contents throughout the various stages of the pipeline.

After the document is successfully uploaded, a database record is created with a new recordId and this traceId. The record has a status field, which is initially set to "PROGRESS". Later, information extracted from the document will be populated into this record and the status updated.

Finally, a message containing that traceId and uploadDate is sent to the Kafka topic "doc_upload". This message will be read by the next module.
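
The sketch below illustrates this upload step in Python: store the file in S3 with boto3, create the tracking record, and publish the "doc_upload" event. The bucket name, table schema, and database client (a DB-API cursor is assumed) are placeholders for illustration; the article does not prescribe them.

import json
import uuid
from datetime import date

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_upload(file_bytes, filename, db):
    # db is assumed to be a DB-API cursor (e.g., psycopg2)
    trace_id = str(uuid.uuid4())
    upload_date = date.today().isoformat()

    # 1. Store the raw document in the upload bucket (bucket name assumed)
    s3.put_object(Bucket="doc-upload-bucket", Key=f"{trace_id}/{filename}", Body=file_bytes)

    # 2. Create a database record with status "PROGRESS" (table schema assumed)
    db.execute(
        "INSERT INTO documents (trace_id, status, upload_date) VALUES (%s, %s, %s)",
        (trace_id, "PROGRESS", upload_date),
    )

    # 3. Notify the next module via Kafka
    producer.send("doc_upload", {"traceId": trace_id, "uploadDate": upload_date})
    producer.flush()
    return trace_id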

2. Document Parsing

The Document Parser module subscribes to the "doc_upload" topic. When it receives a new message, it fetches the document file from the AWS S3 bucket using the traceId and uploadDate sent in the Kafka message.

The file is fetched as raw bytes and converted to an ASCII string. The same ASCII string is sent to both an Image Extractor service and an AI Parser service.
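
A sketch of this consume-and-fetch step is shown below. The article does not specify the encoding used for the ASCII string; Base64 is assumed here, and the S3 key layout is illustrative.

import base64
import json

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "doc_upload",
    bootstrap_servers="localhost:9092",
    group_id="document-parser",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    trace_id = message.value["traceId"]
    upload_date = message.value["uploadDate"]

    # Fetch the raw document stored by the upload module (key layout assumed)
    obj = s3.get_object(Bucket="doc-upload-bucket", Key=f"{trace_id}/{upload_date}")
    raw_bytes = obj["Body"].read()

    # Encode to an ASCII-safe string before handing it to the two services
    ascii_doc = base64.b64encode(raw_bytes).decode("ascii")
    # ascii_doc is now passed to the Image Extractor and AI Parser services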

2.1 Image Extractor Service

The Image Extractor service is Python based. It accepts the uploaded file, in ASCII string format, as input. First it identifies the type of the file, i.e. whether it is a Word or a PDF document. It then uses the docx package (python-docx) to extract images from Word documents, or the fitz package (PyMuPDF) for PDF files. Images are returned in binary form.

Before returning the images, we can also check whether the images are of a certain type. For instance, we can check for logos or profile photos and return only the required images.
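
A minimal sketch of the extraction logic, assuming the document arrives as a Base64-encoded string as in the previous step:

import base64
import io

import fitz                # PyMuPDF
from docx import Document  # python-docx

def extract_images(ascii_doc, file_type):
    data = base64.b64decode(ascii_doc)
    images = []

    if file_type == "docx":
        doc = Document(io.BytesIO(data))
        # Images in a .docx are stored as related parts of the document
        for rel in doc.part.rels.values():
            if "image" in rel.reltype:
                images.append(rel.target_part.blob)
    elif file_type == "pdf":
        pdf = fitz.open(stream=data, filetype="pdf")
        for page in pdf:
            for img in page.get_images(full=True):
                xref = img[0]  # first entry is the image's xref
                images.append(pdf.extract_image(xref)["image"])

    return images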

2.2 AI Parser Service

This service uses OpenAI. A GPT prompt is used to parse the document details and generate JSON output. We provide the expected JSON pattern as part of the input to the AI parser.

For example, for a resume screening application, the output would be required in the following format:

{
  "education": {
    "graduation": {
      "university": "",
      "collegeName": "",
      "cgpa": ""
    }
  },
  "experience": [
    {
      "name": "",
      "location": "",
      "role": "",
      "duration": "",
      "responsibilities": []
    }
  ],
  "personalInfo": {
    "name": "",
    "address": "",
    "dob": "",
    "emailAddress": "",
    "gender": "",
    "phoneNumber": ""
  }
}
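
The sketch below shows how such a call could look with the OpenAI Python client. The model name, prompt wording, and abbreviated schema string are assumptions; the article only states that a GPT prompt and a target JSON pattern are used.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Abbreviated stand-in for the full JSON pattern shown above
RESUME_SCHEMA = '{"education": {...}, "experience": [...], "personalInfo": {...}}'

def parse_document(ascii_doc):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract the document's details and respond only with JSON "
                        "matching this pattern: " + RESUME_SCHEMA},
            {"role": "user", "content": ascii_doc},
        ],
    )
    return json.loads(response.choices[0].message.content)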

Both the extracted images and the parsed JSON object are uploaded to the cloud, in another AWS S3 bucket, under the same traceId and uploadDate from the Kafka message received earlier.

The Document Parser module then publishes a new message to another Kafka topic, "doc_parse". This message, too, contains the traceId and uploadDate.

3. Validation and Post Processing

The Validation and Post Processing module subscribes to the Kafka topic "doc_parse". When it receives a new message, it fetches the images and the JSON file with the matching traceId and uploadDate from the second AWS S3 bucket.

Images do not need much further processing. The recordId corresponding to the traceId is obtained from the database, the image files are renamed using the recordId, version, and image type, and they are then uploaded to the final AWS S3 bucket.

The JSON file containing the data, however, needs to be validated and processed. First, the JSON file is downloaded using the traceId and uploadDate from the Kafka message. The data is loaded into a Java JsonNode object and validated against an existing JSON schema.

Once validated, it is mapped to another object using a mapping configuration. During the mapping process, each field of the target object obtains its value from the JsonNode object according to the mapping configuration. If any required data is missing, the transformation stops and a "Transformation could not be completed for traceId-recordId" message is sent to the Kafka topic "doc_transform".

If the transformation is successful, the transformed object is validated once more against the JSON schema. It is then uploaded to the final AWS S3 bucket using the recordId and version.
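
Our validation module is written in Java using Jackson's JsonNode. Purely to illustrate the validate, map, and re-validate flow, here is an equivalent sketch in Python with the jsonschema package; the schemas and mapping configuration are simplified examples, not the real ones.

from jsonschema import Draft7Validator

# Simplified example schemas for the parsed and transformed JSON
PARSED_SCHEMA = {
    "type": "object",
    "required": ["personalInfo"],
    "properties": {"personalInfo": {"type": "object", "required": ["name"]}},
}
TARGET_SCHEMA = {"type": "object", "required": ["candidateName", "email"]}

# Mapping configuration: target field -> path into the parsed JSON (assumed format)
MAPPING = {
    "candidateName": ["personalInfo", "name"],
    "email": ["personalInfo", "emailAddress"],
}

def transform(parsed):
    Draft7Validator(PARSED_SCHEMA).validate(parsed)  # raises if the input is invalid

    target = {}
    for field, path in MAPPING.items():
        value = parsed
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        if value in (None, ""):
            # The caller publishes the "Transformation could not be completed" message
            raise ValueError(f"missing required field: {field}")
        target[field] = value

    Draft7Validator(TARGET_SCHEMA).validate(target)  # re-validate the transformed object
    return target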

Finally, a message object is created to send to the Kafka topic "doc_transform". Along with the success message, an object containing the data is also sent. It is created by downloading the JSON file from the final AWS S3 bucket and populating the data object with it. Presigned URLs for the images corresponding to this recordId are also obtained from the same S3 bucket and assigned to the image fields within the data object.

At this stage we need to check for duplicates. Duplicate documents are often uploaded to the application and should not be replicated in the database. The values of a few key fields from the data object are therefore checked against the database. If any existing record has the same values, the data object is marked as a duplicate and a "Duplicate data found for traceId-recordId" message is sent to the Kafka topic "doc_transform".

If there are no duplicates, a "Transformation successful for traceId-recordId" message, along with the data object, is sent to the Kafka topic "doc_transform".
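
A sketch of these final steps, generating presigned image URLs, checking key fields for duplicates, and publishing the result to "doc_transform", is shown below. The bucket, key layout, table schema, and choice of key fields are illustrative assumptions; db is assumed to be a DB-API cursor.

import json

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_result(trace_id, record_id, data, db):
    # Presigned URL for an image stored under this recordId (key layout assumed)
    data["profileImageUrl"] = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "doc-final-bucket", "Key": f"{record_id}/profile.png"},
        ExpiresIn=3600,
    )

    # Duplicate check on a few key fields (fields and schema assumed)
    db.execute(
        "SELECT 1 FROM documents WHERE email = %s AND phone = %s AND record_id <> %s",
        (data.get("emailAddress"), data.get("phoneNumber"), record_id),
    )
    if db.fetchone():
        message = {"status": f"Duplicate data found for {trace_id}-{record_id}"}
    else:
        message = {"status": f"Transformation successful for {trace_id}-{record_id}",
                   "data": data}

    producer.send("doc_transform", message)
    producer.flush()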

4. Database Storage

The Database Storage module subscribes to the Kafka topic "doc_transform". Once it receives a new message, it checks whether the message indicates a successful transformation.

If it is not successful, the application finds the record in the database using the recordId and updates its status to "FAILED". If duplicate data was found, the status is updated to "DUPLICATE".

Otherwise, it updates the database record with the values from the data object and marks the status as "COMPLETED".
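
The Database Storage consumer can be sketched as below. Parsing the recordId out of the status text and the table schema are assumptions based on the message formats described above.

import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "doc_transform",
    bootstrap_servers="localhost:9092",
    group_id="db-storage",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def store_results(db):
    # db is assumed to be a DB-API cursor (e.g., psycopg2)
    for message in consumer:
        status_text = message.value["status"]
        record_id = status_text.rsplit("-", 1)[-1]  # trailing recordId (assumed format)

        if status_text.startswith("Transformation successful"):
            db.execute(
                "UPDATE documents SET data = %s, status = 'COMPLETED' WHERE record_id = %s",
                (json.dumps(message.value["data"]), record_id),
            )
        elif status_text.startswith("Duplicate data found"):
            db.execute(
                "UPDATE documents SET status = 'DUPLICATE' WHERE record_id = %s",
                (record_id,),
            )
        else:
            db.execute(
                "UPDATE documents SET status = 'FAILED' WHERE record_id = %s",
                (record_id,),
            )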

This record is now available to our application for use.

Conclusion

In this article, we saw how to create modules that, step by step, extract information from documents and store it in a database. All it takes is a little planning and clearly specified requirements. Once these are in place, each module can be implemented independently. The result is a seamless data pipeline: documents go in at one end, and at the other end the information they contain is available in a database, ready to be processed by any application.

This greatly reduces human error and thus improves accuracy. There are many other tools available for achieving the same results; however, Kafka has proven itself an industry leader thanks to its enormous data-handling capacity and very low latency.

Thanks for reading this article; we hope you have found it useful.
