ExtractFilePageText

Extraction, Page properties

Description

Extracts the text of any page in a PDF file.

This function uses the direct access functionality internally. Because the file is never loaded into memory in its entirety, the function can be used on arbitrarily large documents.

Two text extraction algorithms are provided: a standard algorithm and a more accurate but slower one. Each supports several output formats, selected with the Options parameter.

The DASetTextExtractionWordGap, DASetTextExtractionOptions and DASetTextExtractionArea functions can be used to adjust the text extraction process.

Syntax

Delphi

function TDebenuPDFLibrary1811.ExtractFilePageText(InputFileName, 
  Password: WideString; Page, Options: Integer): WideString;

ActiveX

Function DebenuPDFLibrary1811.PDFLibrary::ExtractFilePageText(
  InputFileName As String, Password As String, Page As Long,
  Options As Long) As String

DLL

wchar_t * DPLExtractFilePageText(int InstanceID, wchar_t * InputFileName,
  wchar_t * Password, int Page, int Options)
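
As an illustration of how the DLL declaration above maps onto a foreign-function binding, here is a minimal Python ctypes sketch. It assumes the caller has already loaded the DLL and obtained an InstanceID by whatever means the library provides; neither step is covered on this page. The helper names bind_extract_file_page_text and extract_page_text are hypothetical and are not part of the library.

```python
import ctypes


def bind_extract_file_page_text(lib):
    """Attach the argument and return types from the DLL declaration.

    `lib` is assumed to be an already loaded ctypes library object.
    """
    fn = lib.DPLExtractFilePageText
    # int InstanceID, wchar_t * InputFileName, wchar_t * Password,
    # int Page, int Options
    fn.argtypes = [
        ctypes.c_int,
        ctypes.c_wchar_p,
        ctypes.c_wchar_p,
        ctypes.c_int,
        ctypes.c_int,
    ]
    # wchar_t * return value; ctypes copies it into a Python str.
    fn.restype = ctypes.c_wchar_p
    return fn


def extract_page_text(lib, instance_id, input_file_name, password, page,
                      options=0):
    """Hypothetical convenience wrapper around DPLExtractFilePageText."""
    if page < 1:
        # The first page in the document is page 1.
        raise ValueError("page numbers start at 1")
    if not 0 <= options <= 8:
        # Options values documented on this page run from 0 to 8.
        raise ValueError("options must be between 0 and 8")
    fn = bind_extract_file_page_text(lib)
    return fn(instance_id, input_file_name, password, page, options)
```

The range checks are a design choice of this sketch rather than documented library behaviour; they simply catch out-of-range arguments before the call crosses into native code.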

Parameters

InputFileName The path and file name of the file to extract text from.
Password The password to use, if any, when opening the file.
Page The number of the page to extract. The first page in the document is page 1.
Options Using the standard text extraction algorithm:
0 = Extract text in human readable format
1 = Deprecated
2 = Return a CSV string including font, color, size and position of each piece of text on the page
Using the more accurate but slower text extraction algorithm:
3 = Return a CSV string for each piece of text on the page with the following format:
Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
The co-ordinates are the four points bounding the text, measured using the units set with the SetGlobalMeasurementUnits function and the origin set with the SetGlobalOrigin function. Co-ordinate order is anti-clockwise with the bottom left corner first.
4 = Similar to option 3, but individual words are returned, making searching for words easier
5 = Similar to option 3 but character widths are output after each block of text
6 = Similar to option 4 but character widths are output after each line of text
7 = Extract text in human readable format with improved accuracy compared to option 0
8 = Similar output format to option 0, but using the more accurate algorithm. Returns unformatted lines.
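
The option 3 output described above can be post-processed with an ordinary CSV reader. The following Python sketch splits each line into the twelve documented fields and derives an axis-aligned bounding box from the four corner points. It assumes one record per line and standard CSV quoting for text that contains commas; the exact quoting rules and the colour notation are not specified on this page, so the colour is kept as an opaque string. The names TextBlock, parse_option3_output and bounding_box, and the sample line, are made up for illustration.

```python
import csv
import io
from typing import List, NamedTuple, Tuple


class TextBlock(NamedTuple):
    font_name: str
    text_color: str  # kept as a string; the notation is not documented here
    text_size: float
    corners: Tuple[Tuple[float, float], ...]  # bottom-left first, anti-clockwise
    text: str


def parse_option3_output(raw: str) -> List[TextBlock]:
    """Parse lines of: Font Name, Text Color, Text Size, X1..Y4, Text."""
    blocks = []
    reader = csv.reader(io.StringIO(raw, newline=""), skipinitialspace=True)
    for row in reader:
        if len(row) < 12:
            continue  # skip blank or incomplete lines
        coords = [float(value) for value in row[3:11]]
        corners = tuple(zip(coords[0::2], coords[1::2]))
        # If the text held unquoted commas it was split; rejoin the pieces.
        text = ",".join(row[11:])
        blocks.append(TextBlock(row[0], row[1], float(row[2]), corners, text))
    return blocks


def bounding_box(block: TextBlock) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) enclosing the four corners."""
    xs = [x for x, _ in block.corners]
    ys = [y for _, y in block.corners]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # Invented sample data, not real library output.
    sample = '"Helvetica","#000000",12,72,700,150,700,150,712,72,712,"Hello, world"\r\n'
    for block in parse_option3_output(sample):
        print(block.text, bounding_box(block))
```

Because rotated text produces a rotated quadrilateral, the bounding box is computed from the minimum and maximum of all four points rather than from the first and third corners alone. The coordinate values depend on the units and origin configured with SetGlobalMeasurementUnits and SetGlobalOrigin.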

Copyright © 2020 Debenu. All rights reserved.