Foxit PDF SDK for Windows

How to Extract & Search for Text with Foxit PDF SDK (C++)

Text Page

Foxit PDF SDK provides APIs to extract, select, search, and retrieve text in PDF documents. The text content of a PDF page is exposed through a TextPage object, which is bound to that specific page. Use the TextPage class to retrieve information about the text in a page, such as a single character, a single word, or the text within a specified character range or rectangle. A TextPage object is also used to construct objects of other text-related classes, which perform further operations on the text or access specific information in it:

  • To search for text in the text contents of a PDF page, construct a TextSearch object with a TextPage object.
  • To access text that represents hypertext links, construct a PageTextLinks object with a TextPage object.

Example:

How to extract text from a PDF page

#include "include/common/fs_common.h"
#include "include/pdf/fs_pdfdoc.h"
#include "include/pdf/fs_pdfpage.h"
#include "include/pdf/fs_search.h"
using namespace std;
using namespace foxit;
using namespace foxit::common;
using foxit::common::Library;
using namespace pdf;
...
// Assuming PDFPage page has been loaded and parsed.
// Get the text page object.
TextPage text_page(page);
int count = text_page.GetCharCount();
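if (count > 0) {
 // Read all characters on the page as a wide string.
 WString text = text_page.GetChars();
 // Convert to UTF-8 before writing; file is assumed to be a FILE* already opened for writing.
 String s_text = text.UTF8Encode();
 fwrite((const char*)s_text, sizeof(char), s_text.GetLength(), file);
}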
...
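
The snippet above writes to a file handle without showing how it is opened. As a minimal sketch using only the C++ standard library, the helper below shows one way such a handle might be opened, written to, and closed. The name write_utf8_file is hypothetical and not part of Foxit PDF SDK, and the std::string argument stands in for the UTF-8 bytes returned by UTF8Encode().

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not an SDK function): writes already-encoded UTF-8
// bytes to `path`. Returns true only if every byte was written and the
// file was closed without error.
bool write_utf8_file(const std::string& utf8, const char* path) {
  // Binary mode keeps the bytes unchanged (no newline translation on Windows).
  std::FILE* file = std::fopen(path, "wb");
  if (!file) return false;

  const std::size_t written =
      std::fwrite(utf8.data(), sizeof(char), utf8.size(), file);
  const bool closed = (std::fclose(file) == 0);
  return written == utf8.size() && closed;
}
```

Checking both the fwrite count and the fclose result matters because buffered data may only reach the disk when the file is closed.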

How to get the text within a rectangular area of a PDF page

#include "include/common/fs_common.h"
#include "include/pdf/fs_pdfdoc.h"
#include "include/pdf/fs_pdfpage.h"
#include "include/pdf/fs_search.h"
using namespace foxit;
using namespace foxit::common;
using foxit::common::Library;
using namespace pdf;
...
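// Assuming PDFPage page has been loaded and parsed.
// PDF coordinates have their origin at the bottom-left, so top is greater than bottom.
RectF rect;
rect.left = 90;
rect.right = 450;
rect.top = 595;
rect.bottom = 580;
TextPage textPage(page, TextPage::e_ParseTextNormal);
// Text of the characters that fall inside rect.
WString text = textPage.GetTextInRect(rect);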
...
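
To make the coordinate convention in this example concrete, here is a self-contained sketch of how selecting text by rectangle can work in principle. It is not the SDK's implementation: Box, CharBox, intersects, and text_in_rect are hypothetical names, and the only assumption taken from PDF itself is that user space has its origin at the bottom-left with y growing upward.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a rectangle in PDF user space:
// origin at the bottom-left, y grows upward, so top >= bottom.
struct Box {
  float left, bottom, right, top;
};

// Hypothetical character record: one character plus its bounding box.
struct CharBox {
  wchar_t ch;
  Box box;
};

// True when the two boxes overlap with non-zero area.
bool intersects(const Box& a, const Box& b) {
  return a.left < b.right && b.left < a.right &&
         a.bottom < b.top && b.bottom < a.top;
}

// Collects, in input order, the characters whose boxes overlap `area`.
std::wstring text_in_rect(const std::vector<CharBox>& chars, const Box& area) {
  std::wstring out;
  for (const CharBox& c : chars) {
    if (intersects(c.box, area)) out += c.ch;
  }
  return out;
}
```

In this sketch, a rectangle whose top and bottom values are swapped matches nothing, which is one reason to double-check that top is the larger value when filling in a rectangle by hand.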

Text Search

Foxit PDF SDK provides APIs to search for text in a PDF document, an XFA document, a text page, or the appearance of a PDF annotation. It offers functions to perform a text search and to get the search results:

  • To specify the search pattern and options, use functions TextSearch::SetPattern, TextSearch::SetStartPage (only useful for a text search in a PDF document), TextSearch::SetEndPage (only useful for a text search in a PDF document) and TextSearch::SetSearchFlags.
  • To perform the search, use function TextSearch::FindNext or TextSearch::FindPrev.
  • To get the search results, use the TextSearch::GetMatchXXX functions (for example, TextSearch::GetMatchRects).

Example:

How to search for a text pattern in a PDF document

#include "include/common/fs_common.h"
#include "include/pdf/fs_pdfdoc.h"
#include "include/pdf/fs_pdfpage.h"
#include "include/pdf/fs_search.h"
using namespace foxit;
using namespace foxit::common;
using foxit::common::Library;
using namespace pdf;
...
// Assuming PDFDoc doc has been loaded.
// Search all pages of doc.
TextSearch search(doc, NULL);
int start_index = 0, end_index = doc.GetPageCount() - 1;
search.SetStartPage(start_index);
search.SetEndPage(end_index);
WString pattern = L"Foxit";
search.SetPattern(pattern);
foxit::uint32 flags = TextSearch::e_SearchNormal;
search.SetSearchFlags(flags);
...
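int match_count = 0;
// Each successful FindNext() call advances to the next match.
while (search.FindNext()) {
 // Rectangles covering the current match.
 RectFArray rect_array = search.GetMatchRects();
 match_count++;
}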
...
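
The set-pattern, find-next, read-match sequence used above is a stateful cursor. The sketch below imitates that pattern over a plain std::wstring so the control flow can be tried without the SDK. SimpleSearch, MatchStart, and count_matches are hypothetical names, the matching is a simple case-sensitive substring search, and none of this is meant to describe how TextSearch works internally.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a stateful search cursor over plain text.
class SimpleSearch {
 public:
  explicit SimpleSearch(std::wstring text) : text_(std::move(text)) {}

  // Sets the pattern and rewinds the cursor to the start of the text.
  void SetPattern(const std::wstring& pattern) {
    pattern_ = pattern;
    next_ = 0;
    match_ = std::wstring::npos;
  }

  // Advances to the next non-overlapping, case-sensitive match.
  // Returns false when no further match exists.
  bool FindNext() {
    if (pattern_.empty()) return false;
    match_ = text_.find(pattern_, next_);
    if (match_ == std::wstring::npos) return false;
    next_ = match_ + pattern_.size();
    return true;
  }

  // Start index of the current match; meaningful only after
  // FindNext() has returned true.
  std::size_t MatchStart() const { return match_; }

 private:
  std::wstring text_;
  std::wstring pattern_;
  std::size_t next_ = 0;
  std::size_t match_ = std::wstring::npos;
};

// Same loop shape as the SDK example: count matches until FindNext() fails.
int count_matches(const std::wstring& text, const std::wstring& pattern) {
  SimpleSearch search(text);
  search.SetPattern(pattern);
  int match_count = 0;
  while (search.FindNext()) ++match_count;
  return match_count;
}
```

Because the cursor holds the position of the last match, result accessors are only valid between a successful FindNext() call and the next one.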

Text Link

In a PDF page, text that represents a hypertext link to a website or other internet resource, or an email address, is stored in the same way as ordinary text. To process such text links, first construct a PageTextLinks object from the TextPage object of the page, then call PageTextLinks::GetTextLink to get a TextLink object.

Example:

How to retrieve hyperlinks in a PDF page

#include "include/common/fs_common.h"
#include "include/pdf/fs_pdfdoc.h"
#include "include/pdf/fs_pdfpage.h"
#include "include/pdf/fs_search.h"

using namespace foxit;
using namespace foxit::common;
using foxit::common::Library;
using namespace pdf;
...

// Assuming PDFPage page has been loaded and parsed.

// Get the text page object.
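TextPage text_page(page);
PageTextLinks pageTextLink(text_page);
// index selects which text link on the page to retrieve; it is assumed to be defined by the caller.
TextLink textLink = pageTextLink.GetTextLink(index);
String strURL = textLink.GetURI();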
...
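
Because link text is stored like any other text, something has to recognize it by its shape. The following sketch shows that idea on a plain std::string with a deliberately simple regular expression for http(s) URLs and email addresses. The function name find_link_like_text and the pattern are assumptions made for illustration; the pattern is far from complete (for example, it keeps trailing punctuation on URLs) and is not the detection logic used by PageTextLinks.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns substrings that look like http(s) URLs or
// email addresses, in the order they appear in `text`.
std::vector<std::string> find_link_like_text(const std::string& text) {
  // Simplified pattern: a URL runs until the next whitespace character;
  // an email is local-part@domain.tld with a top-level domain of 2+ letters.
  static const std::regex pattern(
      R"re(https?://\S+|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})re");

  std::vector<std::string> links;
  for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
       it != end; ++it) {
    links.push_back(it->str());
  }
  return links;
}
```

For the input "Visit https://www.foxit.com or mail support@example.com today", this sketch returns the URL followed by the email address.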

Updated on April 26, 2019
