Memory optimization tips when processing large PDF files
When dealing with PDF files that are very large (over 1 GB) or that have many pages (more than roughly 1,000 to 10,000, depending on the document's contents), it is desirable, and sometimes necessary, to write code that keeps memory usage from climbing too high. We will continue to add tips to this page.
Close the file and re-open it
Recently we had a customer report that:
“DAExtractPageText uses memory that is not cleared until the document is closed and reopened. We have a 18,6000 page document and we have to close and reopen the document about every 1000 pages to keep the memory below 900 MB. Is there a way to clear the TPDFStructure without closing and reopening the PDF?”
PDF objects have two representations in memory. The first is the textual representation as it appears in the PDF file itself, for example:
<< /Type /Page /Contents 10 0 R >>
The second is created during processing: the textual representation of the PDF object is parsed and a set of objects (class instances) is created. For the above example we would have:
TPDFDictionary x1
TPDFName x3
TPDFIndRef x1
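To make the count above concrete, here is a minimal sketch that tallies how many objects of each kind the example dictionary would produce. It is a toy tokenizer written for illustration only, not the library's parser: the regular expression, the function name count_objects, and the short labels dict, name and ref are all assumptions.

```python
import re
from collections import Counter

# Toy token patterns (assumed for illustration, not the library's grammar):
#   dict -> the "<<" that opens a dictionary      (would become a TPDFDictionary)
#   ref  -> an indirect reference such as 10 0 R  (would become a TPDFIndRef)
#   name -> a name such as /Type                  (would become a TPDFName)
TOKEN = re.compile(
    r"(?P<dict><<)"
    r"|(?P<ref>\d+\s+\d+\s+R\b)"
    r"|(?P<name>/[^\s/<>\[\]()]+)"
)


def count_objects(text: str) -> Counter:
    """Count how many instances of each kind the text would expand into."""
    counts: Counter = Counter()
    for match in TOKEN.finditer(text):
        # lastgroup is the name of the alternative that matched.
        counts[match.lastgroup] += 1
    return counts


counts = count_objects("<< /Type /Page /Contents 10 0 R >>")
print(dict(counts))  # {'dict': 1, 'name': 3, 'ref': 1}
```

Even this one-line dictionary expands into five separate instances, which is why memory grows steadily as more pages are parsed.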
As Foxit Quick PDF Library is currently designed, there is no way to “undo” the conversion from the textual representation to the hierarchy of objects. The reason is that the TPDFStructure cannot know which other parts of the system have stored references to the created objects, so there is no way to safely release them.
The only way to safely reclaim the memory is to do as the customer is currently doing: close the file and re-open it.
Our own internal processing functions (for example, ExtractPagesFromFile) do exactly this.
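The close-and-reopen pattern can be sketched as a small batching helper. This is a sketch under stated assumptions, not the library's API: PdfSession, extract_page_text, close, open_document and pages_per_session are hypothetical names. In real code, extract_page_text would wrap a call such as DAExtractPageText, and open_document and close would wrap whatever open and close calls your edition of the library provides.

```python
from typing import Callable, Iterator, Protocol


class PdfSession(Protocol):
    """Hypothetical minimal interface over one open PDF document."""

    def extract_page_text(self, page: int) -> str: ...

    def close(self) -> None: ...


def extract_text_in_batches(
    open_document: Callable[[], PdfSession],
    page_count: int,
    pages_per_session: int = 1000,
) -> Iterator[str]:
    """Yield the text of pages 1..page_count, reopening the document
    every pages_per_session pages so parsed objects can be released."""
    if pages_per_session < 1:
        raise ValueError("pages_per_session must be at least 1")

    for start in range(1, page_count + 1, pages_per_session):
        end = min(start + pages_per_session - 1, page_count)
        session = open_document()  # fresh session, empty object cache
        try:
            for page in range(start, end + 1):
                yield session.extract_page_text(page)
        finally:
            # Closing is what actually frees the parsed object hierarchy.
            session.close()
```

The batch size of 1000 mirrors the customer's figure; the right value depends on how much memory each page's objects consume, so it is worth measuring on representative documents.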
This article refers to a deprecated product.
Updated on May 16, 2022