ExtractFilePageText
Description
Extracts the text of any page in a PDF file.
This function internally uses the direct access functionality. The entire file is not loaded into memory, so this function can be used on arbitrarily large documents.
Two text extraction algorithms are provided, a standard one and a slower but more accurate one. Each offers several output formats, selected with the Options parameter.
The DASetTextExtractionWordGap, DASetTextExtractionOptions and DASetTextExtractionArea functions can be used to adjust the text extraction process.
Syntax
Delphi
function TDebenuPDFLibrary1811.ExtractFilePageText(InputFileName,
Password: WideString; Page, Options: Integer): WideString;
ActiveX
Function DebenuPDFLibrary1811.PDFLibrary::ExtractFilePageText(
InputFileName As String, Password As String, Page As Long,
Options As Long) As String
DLL
wchar_t * DPLExtractFilePageText(int InstanceID, wchar_t * InputFileName,
wchar_t * Password, int Page, int Options)
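The DLL signature above can be called through a function pointer once the export has been resolved. The following is a minimal sketch, not the library's own header: the names ExtractFilePageTextFn, is_current_option and extract_page_checked are hypothetical, the calling convention is left at the compiler default (an assumption that should be checked against the library's real declaration), and rejecting the deprecated option value is a choice made for this sketch only.

```c
#include <stddef.h>
#include <wchar.h>

/* Function pointer type mirroring the documented DLL signature.
   Assumption: the calling convention is left at the compiler default;
   consult the library's own header for the real declaration. */
typedef wchar_t *(*ExtractFilePageTextFn)(int InstanceID,
                                          wchar_t *InputFileName,
                                          wchar_t *Password,
                                          int Page, int Options);

/* Hypothetical helper: returns 1 if options is a documented value that
   is not marked deprecated (0 or 2 through 8), otherwise 0. */
int is_current_option(int options)
{
    return options == 0 || (options >= 2 && options <= 8);
}

/* Hypothetical helper: checks the arguments, then forwards the call.
   Returns NULL when the arguments are rejected; otherwise returns
   whatever the library function returned, unchanged. */
wchar_t *extract_page_checked(ExtractFilePageTextFn extract,
                              int instance_id,
                              wchar_t *file_name,
                              wchar_t *password,
                              int page, int options)
{
    if (extract == NULL || file_name == NULL) {
        return NULL;
    }
    if (page < 1) {
        return NULL;    /* the first page in the document is page 1 */
    }
    if (!is_current_option(options)) {
        return NULL;
    }
    return extract(instance_id, file_name, password, page, options);
}
```

How the function pointer is obtained (for example, by loading the DLL at run time) and how the returned string is owned are outside the scope of this sketch.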
Parameters
InputFileName | The path and file name of the file to extract text from. |
Password | The password to use, if any, when opening the file. |
Page | The number of the page to extract. The first page in the document is page 1. |
Options |
Using the standard text extraction algorithm:
0 = Extract text in human readable format
1 = Deprecated
2 = Return a CSV string including font, color, size and position of each piece of text on the page
Using the more accurate but slower text extraction algorithm:
3 = Return a CSV string for each piece of text on the page with the following format:
Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
The co-ordinates are the four points bounding the text, measured using the units set with the SetGlobalMeasurementUnits function and the origin set with the SetGlobalOrigin function. Co-ordinate order is anti-clockwise with the bottom left corner first.
4 = Similar to option 3, but individual words are returned, making searching for words easier
5 = Similar to option 3 but character widths are output after each block of text
6 = Similar to option 4 but character widths are output after each line of text
7 = Extract text in human readable format with improved accuracy compared to option 0
8 = Similar output format as option 0 but using the more accurate algorithm. Returns unformatted lines. |
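The field order given for option 3 can be turned into a small parser. The sketch below reads one line into a struct holding the font, color, size, four corner points and text. The TextBlock type and the next_field and parse_option3_line functions are hypothetical names, the buffer sizes are arbitrary, and the quoting rules (a field may optionally be wrapped in double quotes, with a literal quote doubled) are an assumption, since the description above does not specify them.

```c
#include <stddef.h>
#include <wchar.h>

/* One piece of text from an option 3 line. Buffer sizes are arbitrary
   limits chosen for this sketch. */
typedef struct {
    wchar_t font[128];
    wchar_t color[32];
    double  size;
    double  x[4];   /* corner X values, bottom left first, anti-clockwise */
    double  y[4];   /* corner Y values, same order */
    wchar_t text[512];
} TextBlock;

/* Copies one field from *cursor into out (capacity cap, truncating if
   needed) and advances *cursor past the field and its trailing comma.
   Assumption: a field may be wrapped in double quotes, with a literal
   quote written as two quotes. Returns 0 on success, -1 if the field
   is not followed by a comma. */
int next_field(const wchar_t **cursor, wchar_t *out, size_t cap)
{
    const wchar_t *p = *cursor;
    size_t n = 0;

    while (*p == L' ') p++;                 /* skip leading spaces */
    if (*p == L'"') {
        p++;
        while (*p != L'\0') {
            if (*p == L'"') {
                if (p[1] == L'"') p++;      /* doubled quote: keep one */
                else { p++; break; }        /* closing quote */
            }
            if (n + 1 < cap) out[n++] = *p;
            p++;
        }
    } else {
        while (*p != L'\0' && *p != L',') {
            if (n + 1 < cap) out[n++] = *p;
            p++;
        }
    }
    out[n] = L'\0';
    if (*p != L',') return -1;
    *cursor = p + 1;
    return 0;
}

/* Parses one option 3 line (without its line terminator) into block.
   Returns 0 on success, -1 if a field is missing or not numeric. */
int parse_option3_line(const wchar_t *line, TextBlock *block)
{
    const wchar_t *p = line;
    wchar_t num[64];
    wchar_t *end;
    double values[9];   /* Text Size, then X1, Y1, ..., X4, Y4 */
    size_t len;
    int i;

    if (next_field(&p, block->font, 128) != 0) return -1;
    if (next_field(&p, block->color, 32) != 0) return -1;
    for (i = 0; i < 9; i++) {
        if (next_field(&p, num, 64) != 0) return -1;
        values[i] = wcstod(num, &end);
        if (end == num) return -1;          /* no digits were read */
    }
    block->size = values[0];
    for (i = 0; i < 4; i++) {
        block->x[i] = values[1 + 2 * i];
        block->y[i] = values[2 + 2 * i];
    }

    /* Everything after the eleventh comma is treated as the text, so
       commas inside it are kept. One pair of wrapping quotes is removed;
       doubled quotes inside the text are left as they are. */
    len = wcslen(p);
    if (len >= 2 && p[0] == L'"' && p[len - 1] == L'"') {
        p++;
        len -= 2;
    }
    if (len >= 512) len = 511;
    wcsncpy(block->text, p, len);
    block->text[len] = L'\0';
    return 0;
}
```

Because the corner order is anti-clockwise starting at the bottom left, x[0], y[0] is the bottom left corner and x[2], y[2] is the diagonally opposite corner. The values are in whatever units and origin were set with SetGlobalMeasurementUnits and SetGlobalOrigin. Options 4, 5 and 6 produce related but different output, which this sketch does not attempt to handle.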