Extract text from PDFs as a text block list
Foxit Quick PDF Library provides an extensive API for programmatically extracting text from PDF files. Output options include plain text and a formatted CSV string that carries details about the font, size and style of the text.
The API also includes functions that extract text as a list of text blocks, which can be easier to manage and parse. For each text block you can retrieve the text itself along with its bounds, font, color and size.
The full range of text extraction functions can be found in our online reference for extraction functions.
Here’s some C# sample code which demonstrates how to use some of these text block functions:
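DPL.LoadFromFile(@"C:\Program Files (x86)\Debenu\PDF Library\DLL\GettingStarted.pdf", "");

// Bound indexes run from 1 to 8, so eight slots are enough.
double[] box = new double[8];

for (int i = 1; i <= DPL.PageCount(); i++)
{
    // Select the page first so that extraction runs on page i
    // rather than on whichever page happens to be selected.
    DPL.SelectPage(i);

    // Build the text block list for the selected page.
    int id = DPL.ExtractPageTextBlocks(4);
    int blockCount = DPL.GetTextBlockCount(id);

    for (int f = 1; f <= blockCount; f++)
    {
        double fontSize = DPL.GetTextBlockFontSize(id, f);
        string fontName = DPL.GetTextBlockFontName(id, f);

        // The library's indexes are 1-based; the C# array is 0-based.
        for (int j = 1; j <= 8; j++)
        {
            box[j - 1] = DPL.GetTextBlockBound(id, f, j);
        }

        string text = DPL.GetTextBlockText(id, f);

        Console.WriteLine("Text Block: " + f + " (list ID " + id + ")");
        Console.WriteLine(text);
        Console.WriteLine("Font Name: " + fontName);
        Console.WriteLine("Font Size: " + fontSize);
        Console.WriteLine("Text Block Bounds:");
        foreach (var item in box)
        {
            Console.WriteLine(item.ToString());
        }
        Console.WriteLine(Environment.NewLine);
    }

    // Free the text block list before moving on to the next page.
    DPL.ReleaseTextBlocks(id);
}

// Keep the console window open until input is received.
Console.Read();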
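The eight values returned by GetTextBlockBound are often easier to work with once reduced to a single bounding rectangle. The following is a minimal sketch of such a reduction, not part of the library: the TextBlockBounds class and ToRectangle method are hypothetical names, and the code assumes the eight values are four (x, y) corner pairs in the order x1, y1, x2, y2, x3, y3, x4, y4. Check the function reference for the exact ordering before relying on it.

```csharp
using System;

public static class TextBlockBounds
{
    // Hypothetical helper, not part of Foxit Quick PDF Library.
    // Assumes `corners` holds four (x, y) pairs: x1, y1, x2, y2, x3, y3, x4, y4.
    public static (double MinX, double MinY, double MaxX, double MaxY) ToRectangle(double[] corners)
    {
        if (corners == null || corners.Length != 8)
        {
            throw new ArgumentException("Expected exactly eight bound values.", nameof(corners));
        }

        double minX = double.MaxValue, minY = double.MaxValue;
        double maxX = double.MinValue, maxY = double.MinValue;

        // Even indexes are treated as x values, odd indexes as y values.
        for (int k = 0; k < 8; k += 2)
        {
            minX = Math.Min(minX, corners[k]);
            maxX = Math.Max(maxX, corners[k]);
            minY = Math.Min(minY, corners[k + 1]);
            maxY = Math.Max(maxY, corners[k + 1]);
        }

        return (minX, minY, maxX, maxY);
    }
}
```

Under that assumption, passing an array filled from GetTextBlockBound (stored from index 0) gives the axis-aligned extent of the block. For instance, the values 10, 20, 110, 20, 110, 50, 10, 50 yield MinX 10, MinY 20, MaxX 110 and MaxY 50. Taking the minimum and maximum of each coordinate also covers rotated text, where the four corners do not line up with the page axes.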
This article refers to a deprecated product. If you are looking for support for Foxit PDF SDK, please click here.
Updated on May 16, 2022