I’m currently working on a project that involves parsing a large PDF file (over 2GB) from a stream, without loading the entire file into memory. I have created a ReadStream from Google Cloud Storage (GCS) / S3 and want to read the PDF in chunks, extracting the text and other relevant information for further processing.
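For context, this is roughly how I create the read stream on the GCS side (the bucket and object names here are placeholders for my actual setup):

```typescript
import { Storage } from "@google-cloud/storage";

const storage = new Storage();

// Stream the object instead of downloading it; no full file in memory yet.
const readStream = storage
  .bucket("my-bucket")            // placeholder bucket name
  .file("large-document.pdf")     // placeholder object name
  .createReadStream();
```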
What I tried
I tried several PDF-parsing libraries, including pdf-lib, pdf-parse, and pdf2json, but none of them appear to support parsing from a stream: they all load the complete PDF into memory first and then parse it.
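For example, the only way I got pdf-parse to work was by buffering the whole stream into memory first, which defeats the purpose for a 2GB+ file. This is just a sketch of my current attempt (readStream is the stream from above):

```typescript
import pdf from "pdf-parse";

// Collect the entire stream into a single Buffer, then hand it to pdf-parse.
// Works for small files, but a 2GB+ PDF exhausts the process memory.
const chunks: Buffer[] = [];
for await (const chunk of readStream) {
  chunks.push(chunk as Buffer);
}

const result = await pdf(Buffer.concat(chunks));
console.log(result.text);
```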
What I expected
I was hoping to find a library or method that allows me to process a large PDF file without exhausting memory resources.
Additionally, I would like to know how this approach can be extended to handle other document types, such as CSV, DOCX, or Excel files.
If anyone has experience with handling large PDFs or other document types in this manner or can share best practices for stream processing, I would greatly appreciate your insights!