PDFlib Text Extraction Toolkit (TET) Manual

Text and Image
Extraction Toolkit (TET)
Version 5.0
Toolkit for extracting Text, Images,
and other items from PDF
Copyright © 2002–2015 PDFlib GmbH. All rights reserved.
Protected by European and U.S. patents.
PDFlib GmbH
Franziska-Bilek-Weg 9, 80339 München, Germany
www.pdflib.com
phone +49 • 89 • 452 33 84-0
fax +49 • 89 • 452 33 84-99
If you have questions check the PDFlib mailing list and archive at
groups.yahoo.com/neo/groups/pdflib/info
Licensing contact: [email protected]
Support for commercial PDFlib licensees: [email protected] (please include your license number)
This publication and the information herein is furnished as is, is subject to change without notice, and
should not be construed as a commitment by PDFlib GmbH. PDFlib GmbH assumes no responsibility or liability for any errors or inaccuracies, makes no warranty of any kind (express, implied or statutory) with respect to this publication, and expressly disclaims any and all warranties of merchantability, fitness for particular purposes and noninfringement of third party rights.
Adobe, Acrobat, PostScript, and XMP are trademarks of Adobe Systems Inc. AIX, IBM, OS/390, WebSphere,
iSeries, and zSeries are trademarks of International Business Machines Corporation. ActiveX, Microsoft,
OpenType, and Windows are trademarks of Microsoft Corporation. Apple, Macintosh and TrueType are
trademarks of Apple Computer, Inc. Unicode and the Unicode logo are trademarks of Unicode, Inc. Unix is a
trademark of The Open Group. Java and Solaris are trademarks of Sun Microsystems, Inc. HKS is a registered trademark of the HKS brand association: Hostmann-Steinberg, K+E Printing Inks, Schmincke. Other
company product and service names may be trademarks or service marks of others.
TET contains modified parts of the following third-party software:
Zlib compression library, Copyright © 1995-2012 Jean-loup Gailly and Mark Adler
TIFFlib image library, Copyright © 1988-1997 Sam Leffler, Copyright © 1991-1997 Silicon Graphics, Inc.
Cryptographic software written by Eric Young, Copyright © 1995-1998 Eric Young ([email protected])
Independent JPEG Group’s JPEG software, Copyright © 1991-1998, Thomas G. Lane
Cryptographic software, Copyright © 1998-2002 The OpenSSL Project (www.openssl.org)
Expat XML parser, Copyright © 1998, 1999, 2000 Thai Open Source Software Center Ltd
ICU International Components for Unicode, Copyright © 1995-2012 International Business Machines Corporation and others
OpenJPEG library, Copyright © 2002-2014, Université catholique de Louvain (UCL), Belgium
TET contains the RSA Security, Inc. MD5 message digest algorithm.
Contents
0 First Steps with TET 7
0.1 Installing the Software 7
0.2 Applying the TET License Key 8
1 Introduction 11
1.1 Overview of TET Features 11
1.2 Many ways to use TET 13
1.3 Roadmap to Documentation and Samples 14
1.4 What’s new in TET 5.0? 15
2 TET Command-Line Tool 17
2.1 Command-Line Options 17
2.2 Constructing TET Command Lines 20
2.3 Command-Line Examples 21
2.3.1 Extracting Text 21
2.3.2 Extracting Images 21
2.3.3 Generating TETML 22
2.3.4 Advanced Options 22
3 TET Library Language Bindings 23
3.1 Exception Handling 23
3.2 C Binding 24
3.3 C++ Binding 26
3.4 COM Binding 28
3.5 Java Binding 29
3.6 .NET Binding 31
3.7 Objective-C Binding 32
3.8 Perl Binding 34
3.9 PHP Binding 35
3.10 Python Binding 37
3.11 REALbasic/Xojo Binding 38
3.12 Ruby Binding 39
3.13 RPG Binding 41
4 TET Connectors 43
4.1 Free TET Plugin for Adobe Acrobat 43
4.2 TET Connector for the Lucene Search Engine 44
4.3 TET Connector for the Solr Search Server 47
4.4 TET Connector for Oracle 48
4.5 TET PDF IFilter for Microsoft Products 51
4.6 TET Connector for the Apache TIKA Toolkit 53
4.7 TET Connector for MediaWiki 55
5 Configuration 57
5.1 Extracting Content from protected PDF 57
5.2 Resource Configuration and File Searching 59
5.3 Recommendations for common Scenarios 63
6 Text Extraction 67
6.1 PDF Document Domains 67
6.2 Page and Text Geometry 72
6.3 Text Color 78
6.4 Chinese, Japanese, and Korean Text 80
6.4.1 CJK Encodings and CMaps 80
6.4.2 Word Boundaries for CJK Text 80
6.4.3 Vertical Writing Mode 80
6.4.4 CJK Decompositions: Narrow, wide, vertical, etc. 81
6.5 Bidirectional Arabic and Hebrew Text 83
6.5.1 General Bidi Topics 83
6.5.2 Postprocessing Arabic Text 83
6.6 Content Analysis 85
6.7 Layout Analysis 89
6.8 Check whether an Area is empty 92
7 Advanced Unicode Handling 93
7.1 Important Unicode Concepts 93
7.2 Unicode Preprocessing (Filtering) 96
7.2.1 Filters for all Granularities 96
7.2.2 Filters for Granularity Word and above 97
7.3 Unicode Postprocessing 98
7.3.1 Unicode Folding 98
7.3.2 Unicode Decomposition 101
7.3.3 Unicode Normalization 105
7.4 Supplementary Characters and Surrogates 107
7.5 Unicode Mapping for Glyphs 108
8 Image Extraction 115
8.1 Image Extraction Basics 115
8.2 Extracting Images 118
8.2.1 Placed Images and Image Resources 118
8.2.2 Page-based and Resource-based Image Retrieval 119
8.2.3 Geometry of Placed Images 120
8.3 Merging Fragmented Images 123
8.4 Small Image Filtering 125
8.5 Image Colors and Masking 126
8.5.1 Color Spaces 126
8.5.2 Image Masks and Soft Masks 127
9 TET Markup Language (TETML) 129
9.1 Creating TETML 129
9.2 TETML Examples 131
9.3 Controlling TETML Details 135
9.4 TETML Elements and the TETML Schema 139
9.5 Transforming TETML with XSLT 147
9.6 XSLT Samples 151
10 TET Library API Reference 155
10.1 Option Lists 155
10.1.1 Option List Syntax 155
10.1.2 Basic Types 157
10.1.3 Geometric Types 160
10.1.4 Encoding Names 160
10.2 General Functions 162
10.2.1 Option Handling 162
10.2.2 Setup 164
10.2.3 PDFlib Virtual Filesystem (PVF) 165
10.2.4 Unicode Conversion Function 167
10.2.5 Exception Handling 169
10.2.6 Logging 170
10.3 Document Functions 172
10.4 Page Functions 181
10.5 Text and Glyph Details Retrieval Functions 190
10.6 Image Retrieval Functions 196
10.7 TET Markup Language (TETML) Functions 200
10.8 pCOS Functions 203
A TET Library Quick Reference 207
B Revision History 209
Index 211
0 First Steps with TET
0.1 Installing the Software
TET is delivered as an MSI or compressed package for Windows systems, and as a compressed archive for all other supported operating systems. All TET packages contain the
TET command-line tool and the TET library/component, plus support files, documentation, and examples. After installing or unpacking TET the following steps are recommended:
> Users of the TET command-line tool can use the executable right away. The available
options are discussed in Section 2.1, »Command-Line Options«, page 17, and are also
displayed when you execute the TET command-line tool without any options.
> Users of the TET library/component should read one of the sections in Chapter 3,
»TET Library Language Bindings«, page 23, corresponding to their preferred development environment, and review the installed examples.
If you obtained a commercial TET license you must enter your TET license key according
to Section 0.2, »Applying the TET License Key«, page 8.
CJK configuration. In order to extract Chinese, Japanese, or Korean (CJK) text which is
encoded with legacy encodings TET requires the corresponding CMap files for mapping
CJK encodings to Unicode. The CMap files are contained in all TET packages, and are installed in the resource/cmap directory within the TET installation directory.
On non-Windows systems you must manually configure the CMap files:
> For the TET command-line tool this can be achieved by supplying the name of the directory holding the CMap files with the --searchpath option.
> For the TET library/component you can set the searchpath at runtime:
tet.set_option("searchpath={/path/to/resource/cmap}");
As an alternative method for configuring access to the CJK CMap files you can set the
TETRESOURCEFILE environment variable to point to a UPR configuration file which contains a suitable searchpath definition.
Restrictions of the evaluation version. The TET command-line tool and library can be
used as fully functional evaluation versions even without a commercial license. Unlicensed versions support all features, but will only process PDF documents with up to 10
pages and 1 MB size. Evaluation versions of TET must not be used for production purposes, but only for evaluating the product. Using TET for production purposes requires
a valid TET license.
0.2 Applying the TET License Key
Using TET for production purposes requires a valid TET license key. Once you have purchased
a TET license you must apply your license key in order to allow processing of arbitrarily
large documents. There are several methods for applying the license key; choose one of
the methods detailed below.
Note TET license keys are platform-dependent, and can only be used on the platform for which they
have been purchased.
Windows installer. If you are working with the Windows installer you can enter the license key when you install the product. The installer will add the license key to the registry (see below).
Working with a license file. PDFlib products read license keys from a license file,
which is a text file according to the format shown below. You can use the template
licensekeys.txt which is contained in all TET distributions. Lines beginning with a ’#’
character contain comments and will be ignored; the second line contains version information for the license file itself:
# Licensing information for PDFlib GmbH products
PDFlib license file 1.0
TET 5.0 ...your license key...
The license file may contain license keys for multiple PDFlib GmbH products on separate lines. It may also contain license keys for multiple platforms so that the same license file can be shared among platforms. License files can be configured in the following ways:
> A file called licensekeys.txt will be searched in all default locations (see »Default file
search paths«, page 9).
> You can specify the licensefile option with the set_option( ) API function:
tet.set_option("licensefile={/path/to/licensekeys.txt}");
The licensefile option must be set immediately after instantiating the TET object, i.e.,
after calling TET_new( ) (in C) or creating a TET object.
> Use the --tetopt option of the TET command-line tool to pass the licensefile
option with the name of a license file:
tet --tetopt "licensefile=/path/to/your/licensekeys.txt" ...
If the path name contains space characters you must enclose the path with braces:
tet --tetopt "licensefile={/path/to/your license file.txt}" ...
> You can set an environment (shell) variable which points to a license file. On Windows use the system control panel and choose System, Advanced, Environment
Variables; on Unix apply a command similar to the following:
export PDFLIBLICENSEFILE="/path/to/licensekeys.txt"
On i5/iSeries the license file can be specified as follows (this command can be specified in the startup program QSTRUP and will work for all PDFlib GmbH products):
ADDENVVAR ENVVAR(PDFLIBLICENSEFILE) VALUE(<... path ...>) LEVEL(*SYS)
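The license file layout and the brace rule for path names with space characters can be sketched in a few lines of Python. This is only an illustration: the helper names license_file_text() and licensefile_option() are hypothetical and not part of TET, and the key is a placeholder.

```python
def license_file_text(keys):
    """Return license file contents for (product, version, key) entries.

    The layout follows the template shown above: a comment line, the
    version line for the license file itself, then one line per key.
    """
    lines = [
        "# Licensing information for PDFlib GmbH products",
        "PDFlib license file 1.0",
    ]
    for product, version, key in keys:
        lines.append(f"{product} {version} {key}")
    return "\n".join(lines) + "\n"


def licensefile_option(path):
    """Build the licensefile option list string for --tetopt or set_option().

    Path names containing space characters are enclosed in braces,
    as required by the option list syntax described above.
    """
    if " " in path:
        return "licensefile={" + path + "}"
    return "licensefile=" + path


if __name__ == "__main__":
    # Placeholder key; a real key comes from your license.
    print(license_file_text([("TET", "5.0", "...your license key...")]), end="")
    print(licensefile_option("/path/to/your license file.txt"))
```

The string returned by licensefile_option() would be passed as the single argument after --tetopt; quoting for the shell is a separate concern that this sketch does not handle.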
License keys in the registry. On Windows you can also enter the name of the license
file in the following registry value:
HKLM\SOFTWARE\PDFlib\PDFLIBLICENSEFILE
As another alternative you can enter the license key directly in one of the following registry values:
HKLM\SOFTWARE\PDFlib\TET5\license
HKLM\SOFTWARE\PDFlib\TET5\5.0\license
The MSI installer will write the license key provided at install time in the last of these
entries.
Note Be careful when manually accessing the registry on 64-bit Windows systems: as usual, 64-bit
binaries work with the 64-bit view of the Windows registry, while 32-bit binaries running on a
64-bit system work with the 32-bit view of the registry. If you must add registry keys for a 32-bit
product manually, make sure to use the 32-bit version of the regedit tool. It can be invoked as
follows from the Start, Run... dialog:
%systemroot%\syswow64\regedit
Default file search paths. On Unix, Linux, OS X and i5/iSeries systems some directories
will be searched for files by default even without specifying any path and directory
names. Before searching and reading the UPR file (which may contain additional search
paths), the following directories will be searched:
<rootpath>/PDFlib/TET/5.0/resource/cmap
<rootpath>/PDFlib/TET/5.0/resource/codelist
<rootpath>/PDFlib/TET/5.0/resource/glyphlst
<rootpath>/PDFlib/TET/5.0
<rootpath>/PDFlib/TET
<rootpath>/PDFlib
On Unix, Linux, and OS X <rootpath> will first be replaced with /usr/local and then with
the HOME directory. On i5/iSeries <rootpath> is empty.
Default file names for license and resource files. By default, the following file names
will be searched for in the default search path directories:
licensekeys.txt (license file)
tet.upr (resource file)
This feature can be used to work with a license file without setting any environment
variable or runtime option.
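The default search order described above can be modeled as follows. This Python sketch only mirrors the documented directory list; the function names are made up for illustration, and it assumes that all directories below /usr/local are tried before those below the HOME directory, which the text does not state explicitly.

```python
import os

# Directory suffixes in the order listed above.
DEFAULT_SUBDIRS = [
    "PDFlib/TET/5.0/resource/cmap",
    "PDFlib/TET/5.0/resource/codelist",
    "PDFlib/TET/5.0/resource/glyphlst",
    "PDFlib/TET/5.0",
    "PDFlib/TET",
    "PDFlib",
]


def default_search_dirs(home):
    """Expand <rootpath> with /usr/local first, then with the HOME directory."""
    roots = ["/usr/local", home]
    return [f"{root}/{sub}" for root in roots for sub in DEFAULT_SUBDIRS]


def find_default_file(name, dirs):
    """Return the first existing file called `name` in `dirs`, or None."""
    for directory in dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    dirs = default_search_dirs(os.path.expanduser("~"))
    for default_name in ("licensekeys.txt", "tet.upr"):
        print(default_name, "->", find_default_file(default_name, dirs))
```

Running the script shows which of the default license and resource files, if any, would be picked up on the current machine under these assumptions.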
Setting the license key in an option for the TET command-line tool. If you use the TET
command-line tool you can supply an option which contains the name of a license file
or the license key itself:
tet --tetopt "license ...your license key..." ...more options...
Setting the license key with a TET API call. If you use the TET API you can add an API
call to your script or program which sets the license key at runtime:
> In COM/VBScript:
oTET.set_option "license=...your license key..."
> In C:
TET_set_option(tet, "license=...your license key...");
> In C++, .NET/C#, Java, Python, and Ruby:
tet.set_option("license=...your license key...");
> In Perl and PHP:
$tet->set_option("license=...your license key...");
> In RPG:
d licensekey      s             20
d licenseval      s             50
c                   eval      licenseopt='license=... your license key ...'+x'00'
c                   callp     TET_set_option(TET:licenseopt:0)
The license option must be set immediately after instantiating the TET object, i.e., after
calling TET_new( ) (in C) or creating a TET object.
Licensing options. Different licensing options are available for TET use on one or more
computers, and for redistributing TET with your own products. We also offer support
and source code contracts. Licensing details and the purchase order form can be found
in the TET distribution. Please contact us if you are interested in obtaining a commercial license, or have any questions:
PDFlib GmbH, Licensing Department
Franziska-Bilek-Weg 9, 80339 München, Germany
www.pdflib.com
phone +49 • 89 • 452 33 84-0
fax +49 • 89 • 452 33 84-99
Licensing contact: [email protected]
Support for PDFlib licensees: [email protected]
1 Introduction
The PDFlib Text and Image Extraction Toolkit (TET) is targeted at extracting text and images from PDF documents, but can also be used to retrieve other information from PDF.
TET can be used as a base component for realizing the following tasks:
> search the text contents of PDF
> create a list of all words contained in a PDF (concordance)
> implement a search engine for processing large numbers of PDF files
> extract text from PDF to store, translate, or otherwise repurpose it
> convert the text contents of PDF to other formats
> process or enhance PDFs based on their contents
> compare the text contents of multiple PDF documents
> extract the raster images from PDF
> extract metadata and other information from PDF
TET has been designed for stand-alone use, and does not require any third-party software. It is robust and suitable for multi-threaded server use.
1.1 Overview of TET Features
Supported PDF input. TET has been tested against millions of PDF test files from various sources. It accepts PDF 1.0 up to PDF 1.7 extension level 8 and PDF 2.0, corresponding
to Acrobat 1 through Acrobat DC, including encrypted documents. TET attempts to repair various kinds of
malformed and damaged PDF documents.
Note TET does not support XFA forms. XFA is a separate format which is not part of the PDF standard
ISO 32000-1. Since XFA is packaged inside a small PDF wrapper, XFA forms are often confused
with PDF documents, although XFA is actually a completely different file format which requires
dedicated software.
Unicode support. TET includes a considerable number of algorithms and data to
achieve reliable Unicode mappings for all text. Since text in PDF documents is not usually encoded in Unicode, TET normalizes the text from a PDF document to Unicode:
> TET converts all text contents to Unicode. In C the text is returned in UTF-8 or UTF-16
format; in other language bindings as native Unicode strings.
> Ligatures and other multi-character glyphs are decomposed into a sequence of their
constituent Unicode characters.
> Vendor-specific Unicode values (Corporate Use Subarea, CUS) are identified and
mapped to characters with precisely defined meanings if possible.
> Glyphs which are lacking Unicode mapping information are identified and mapped
to a configurable replacement character.
> UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane (BMP) are
interpreted and maintained. Surrogate pairs and UTF-32 values can be retrieved in all
language bindings.
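The relationship between a character outside the BMP, its UTF-32 value, and its UTF-16 surrogate pair is plain arithmetic defined by the Unicode standard. The following Python sketch is independent of TET and only illustrates that arithmetic; the function names are chosen for this example.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a surrogate pair back into the UTF-32 value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
    print(f"back    -> U+{from_surrogates(high, low):X}")
```

Applications that consume UTF-16 output can use the same arithmetic to count characters rather than code units.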
Some PDF documents do not contain enough information for reliable Unicode mapping. In order to successfully extract the text nevertheless TET offers various configuration options which can be used to supply auxiliary information for proper Unicode
mappings. In order to facilitate writing the required mapping tables we make available
PDFlib FontReporter, a free plugin for Adobe Acrobat. This plugin can be used for analyzing fonts, encodings, and glyphs in PDF.
CJK support. TET includes full support for extracting Chinese, Japanese, and Korean
text:
> All predefined CJK CMaps (encodings) are recognized; CJK text is converted to Unicode. The CMap files for CJK encoding conversion are included in the TET distribution.
> Special character forms (e.g. wide, narrow, prerotated glyphs for vertical text) can optionally be converted (folded) to the corresponding regular forms.
> Horizontal and vertical writing modes are supported.
> CJK font names are normalized to Unicode.
Support for Bidirectional Hebrew and Arabic Text. TET includes the following features
for dealing with Bidi text:
> Re-order right-to-left and Bidi text to logical ordering
> Determine dominant text direction of the page
> Normalize Arabic presentation forms and decompose ligatures
> Remove Arabic Tatweel character used for stretching words
Unicode postprocessing. TET’s Unicode postprocessing features include the following:
> Folding: preserve, replace, or remove one or more characters; affected characters can
conveniently be specified as Unicode sets.
> Decomposition: optionally apply canonical or compatibility decompositions as defined in the Unicode standard. This may make the text better usable in some environments. For example, you can keep or split accented characters, fractions, or symbols like the trademark symbol.
> Normalization: convert the output to Unicode normalization formats NFC, NFD,
NFKC, or NFKD as defined in the Unicode standard. This way TET can produce the exact format required as input in some environments, e.g. databases or search engines.
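To make the difference between the four normalization forms concrete, the following sketch applies them to an accented character, a ligature, and the trademark symbol. It uses Python's standard unicodedata module rather than TET, so it only illustrates the Unicode definitions; how TET's own options select decomposition and normalization is not shown here.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_forms(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def code_points(text):
    """Render a string as a list of U+XXXX values for easy comparison."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    # e with acute accent, fi ligature, trade mark sign
    sample = "\u00e9\ufb01\u2122"
    for form, value in normalization_forms(sample).items():
        print(form, code_points(value))
```

The canonical forms (NFC, NFD) keep the ligature and the trademark symbol intact, while the compatibility forms (NFKC, NFKD) replace them with plain letters; this is the kind of difference that matters when feeding text to a database or search engine.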
Image extraction. TET extracts raster images from PDF. Adjacent parts of a segmented
image are combined to facilitate postprocessing and re-use (e.g. multi-strip images created by some applications). Small images can be filtered in order to exclude tiny image
fragments from cluttering the output. If a mask is attached to an image, the mask can
be extracted as well.
Images are extracted in TIFF, JPEG, JPEG 2000, or JBIG2 format.
Geometry. TET provides precise metrics for the text, such as the position on the page,
glyph widths, and text direction. Specific areas on the page can be excluded or included
in the text extraction process, e.g. to ignore headers and footers or margins.
For images the pixel size, physical size, and color space are available as well as position and angle.
Text color. TET provides information about the color of glyphs. The color spaces for
filling and stroking and the corresponding color values can be retrieved. A convenient
shortcut is available for easily comparing the colors of multiple glyphs without having
to deal with the complexities of PDF color spaces.
Word detection and content analysis. TET can be used to retrieve low-level glyph information, but also includes advanced algorithms for high-level content and layout
analysis:
> Detect word boundaries to retrieve words instead of characters.
> Recombine the parts of hyphenated words (dehyphenation).
> Remove duplicate instances of text, e.g. shadow and fake bold text.
> Recombine paragraphs into reading order.
> Reorder text which is scattered over the page.
> Reconstruct lines of text.
> Recognize tabular structures on the page.
> Recognize superscript, subscript and drop caps (large initial characters at the start of
a paragraph).
TET Markup Language (TETML). The information retrieved from a PDF document can
be presented in an XML format called TET Markup Language (TETML) for processing
with standard XML tools. TETML contains text, image, and metadata information and
can optionally also contain font- and geometry-related details. TETML also contains color and colorspace information as well as interactive elements such as form fields, annotations, bookmarks, etc.
pCOS interface for simple access to PDF objects. TET includes pCOS (PDFlib Comprehensive Object System) for retrieving arbitrary PDF objects. With pCOS you can retrieve
PDF metadata, interactive elements (e.g. bookmark text, contents of form fields), or any
other information from a PDF document with a simple query interface. The syntax of
pCOS query paths is described separately in the pCOS Path Reference.
What is text? While TET deals with a large class of PDF documents, in some cases visible text cannot be extracted. The text must be encoded using PDF’s text and encoding
facilities (i.e., it must be based on a font). Although the following flavors of text may be
visible on the page they cannot be extracted with TET:
> Rasterized (pixel image) text, e.g. scanned pages;
> Text which is represented by vector elements without any font.
Note that metadata and text in hypertext elements (such as bookmarks, form fields,
notes, or annotations) can be retrieved with TETML or the pCOS interface; see Section
6.1, »PDF Document Domains«, page 67, for details. On the other hand, TET may extract
some text which is not visible on the page. This may happen in the following situations:
> Text using PDF’s invisible attribute (however, there is an option to exclude this kind
of text from the text retrieval process)
> Text which is obscured by some other element on the page, e.g. an image.
1.2 Many ways to use TET
TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features,
but are suitable for different deployment tasks. Both the TET library and command-line
tool can create TETML, TET’s XML-based output format.
> The TET programming library can be used for integration into your desktop or server
application. Many different programming languages are supported. Examples for
using the TET library with all supported language bindings are included in the TET
package.
> The TET command-line tool is suited for batch processing PDF documents. It doesn’t
require any programming, but offers command-line options which can be used to
integrate it into complex workflows.
> TETML output is suited for XML-based workflows and developers who are familiar
with the wide range of XML processing tools and languages, e.g. XSLT.
> TET connectors are suited for integrating TET in various common software packages,
e.g. databases and search engines.
> The TET Plugin is a free extension for Adobe Acrobat which makes TET available for
interactive use (see Section 4.1, »Free TET Plugin for Adobe Acrobat«, page 43, for
more information).
1.3 Roadmap to Documentation and Samples
Mini samples for the TET library. The TET distribution contains programming examples for all supported language bindings. These mini samples can serve as a starting
point for your own applications, or to test your TET installation. They comprise source
code for the following applications:
> The extractor sample demonstrates the basic loop for extracting text from a PDF document.
> The images_per_page sample extracts the images on each page and reports about
their geometry and other properties.
> The image_resources sample demonstrates the basic loop for extracting images from
a PDF document in a resource-oriented way (no geometric information available).
> The dumper sample shows the use of the integrated pCOS interface for querying general information about a PDF document.
> The fontfilter sample shows how to process font-related information, such as font
name and font size.
> The glyphinfo sample demonstrates how to retrieve detailed information about
glyphs (font, size, position, etc.) as well as text attributes such as dropcap, shadow,
hyphenation, etc. It also shows how to access text color information.
> The tetml sample contains code for generating TETML (TET’s XML language for expressing PDF contents) from a PDF document.
> The get_attachments sample (not available for all language bindings) demonstrates
how to process PDF file attachments, i.e. PDF documents which are embedded in another PDF document.
XSLT samples. The TET distribution contains several XSLT stylesheets. They demonstrate how to process TETML to achieve various goals:
> concordance.xsl: create list of unique words in a document sorted by descending frequency.
> fontfilter.xsl: List all words in a document which use a particular font in a size larger
than a specified value.
> fontfinder.xsl: For all fonts in a document, list all occurrences along with page number
and position information.
> fontstat.xsl: generate font and glyph statistics.
> index.xsl: create an alphabetically sorted »back-of-the-book« index.
> metadata.xsl: extract selected properties from document-level XMP metadata included in TETML.
> solr.xsl: generate input for the Solr enterprise search server.
> table.xsl: Extract a table to a CSV file (comma-separated values).
> tetml2html.xsl: convert TETML to simple HTML.
> textonly.xsl: extract the raw text from TETML input.
TET Cookbook. The TET Cookbook is a collection of source code examples for solving
specific application problems with the TET library. The Cookbook examples are written
in the Java language, but can easily be adjusted to other programming languages since
the TET API is almost identical for all supported language bindings. Some Cookbook
samples are written in the XSLT language. The TET Cookbook is organized in the following groups:
> Text: samples related to text extraction
> Font: samples related to text with a focus on font properties
> Image: samples related to image extraction
> TET & PDFlib+PDI: samples which extract information from a PDF with TET and construct a new PDF based on the original PDF and the extracted information. These
samples require the PDFlib+PDI product in addition to TET.
> TETML: XSLT samples for processing TETML
> Special: other samples
The TET Cookbook is available at the following URL:
www.pdflib.com/tet-cookbook.
pCOS Cookbook. The pCOS Cookbook is a collection of code fragments for the pCOS interface which is integrated in TET. It is available at the following URL:
www.pdflib.com/pcos-cookbook.
Details of the pCOS interface are documented in the pCOS Path Reference which is
included in the TET package.
1.4 What’s new in TET 5.0?
The features below are new or considerably improved in TET 5.0.
Text retrieval:
> retrieve fill and stroke color of text
> honor vector graphics to improve page and table layout recognition
> support vertical font metrics for CJK text
Image retrieval:
> significantly enhanced merging of fragmented images, e.g. for rotated images
> improved image handling for many special cases and rare PDF image flavors
> extract image masks and soft masks
> merge and convert JPEG 2000-compressed images
> preserve spot color in extracted TIFF images
> restrict image extraction to user-selected area
> collect XMP image metadata stored in non-standard locations by Adobe InDesign
Page processing:
> honor clipping paths to avoid extraction of invisible content
> honor layers (optional content) to avoid extraction of invisible content
> optionally ignore artifacts (irrelevant content) in Tagged PDF
> check whether an area on the page is empty or contains any text, image, or vector
graphics
TETML:
> TETML includes fill and stroke color of glyphs
> TETML includes information about interactive elements including annotations,
form fields, bookmarks, actions, JavaScript, signatures, etc.
> TETML includes color space and ICC profile details
> TETML includes information about layers and page labels
pCOS PDF information retrieval:
> pCOS pseudo objects for ICC profile details and image masking properties
> pCOS pseudo objects for form fields
Other areas:
> additional checks and heuristics for damaged and non-conforming PDF input
> updated TET language bindings, programming samples and TET connectors
> new options for improved PDF processing control
> many improvements in existing functionality
2 TET Command-Line Tool
2.1 Command-Line Options
The TET command-line tool allows you to extract text and images from one or more PDF
documents without the need for any programming. Output can be generated in plain
text (Unicode) format or in TETML, TET’s XML-based output format. The TET program
can be controlled via a number of command-line options. The program will insert space
characters (U+0020) after each word, U+000A after each line, and U+000C after each
page. It is called as follows for one or more input PDF files:
tet [<options>] <filename>...
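Because the plain-text output uses fixed separator characters, it can be split back into pages, lines, and words with very little code. The following Python sketch assumes the output was created with the default UTF-8 format (which carries a BOM according to the --format entry in Table 2.1); the function names are hypothetical and not part of TET.

```python
def decode_tet_output(data):
    """Decode UTF-8 text output; 'utf-8-sig' drops a leading BOM if present."""
    return data.decode("utf-8-sig")


def split_tet_text(text):
    """Split text into a list of pages, each a list of lines of words.

    Separators as described above: U+000C after each page,
    U+000A after each line, U+0020 after each word.
    """
    pages = text.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # nothing follows the final page separator
    result = []
    for page in pages:
        lines = page.split("\n")
        if lines and lines[-1] == "":
            lines.pop()  # nothing follows the final line separator
        result.append([[w for w in line.split(" ") if w] for line in lines])
    return result


if __name__ == "__main__":
    raw = b"\xef\xbb\xbfHello world \nsecond line \n\x0cpage two \n\x0c"
    for number, page in enumerate(split_tet_text(decode_tet_output(raw)), 1):
        print(f"page {number}: {page}")
```

Splitting on the explicit separators instead of using str.splitlines() is deliberate, since splitlines() would also treat the form feed as a line boundary and lose the page structure.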
The TET command-line tool is built on top of the TET library. You can supply library options using the --docopt, --tetopt, --imageopt, and --pageopt options according to the option list tables in Chapter 10, »TET Library API Reference«, page 155. Table 2.1 lists all TET
command-line options (this list will also be displayed if you run the TET program without any options).
Note In order to extract CJK text you must configure access to the CMap files which are shipped with
TET according to Section 0.1, »Installing the Software«, page 7.
Table 2.1 TET command-line options
option, parameters
    function
--
    End the list of options; this is useful if file names start with a - character.
@filename¹
    Specify a response file with options; for a syntax description see »Response files«, page 20. Response files are only recognized before the -- option and before the first filename. Response files can not be used to replace the parameter for another option, but must contain complete option/parameter combinations.
--docopt <option list>
    Additional option list for TET_open_document( ) (see Table 10.8, page 173). The filename suboption of the tetml option can not be used here.
--firstpage, -f <integer> | last
    (Ignored for --imageloop resource) The number of the page where content extraction will start. The keyword last specifies the last page, last-1 the page before the last page, etc. Default: 1
--format utf8 | utf16
    Specifies the format for text output (default: utf8):
    utf8
        UTF-8 with BOM (byte order mark)
    utf16
        UTF-16 in native byte ordering with BOM
    This option does not affect TETML output which is always created in UTF-8.
--help, -? (or no option)
    Display help with a summary of available options.
--image², -i
    Extract images from the whole document (with --imageloop resource) or the selected pages (with --imageloop page). The file name pattern for extracted images depends on the --imageloop option.
--imageloop page | resource
    Specifies the kind of enumeration for extracting images with the --image option (default: page, but forced to resource if --tetml is specified):
    page
        Extract all images on the selected pages. Image resources which are placed multiply are extracted multiply. Extracted images are named according to the following pattern:
        <filename>_p<pagenumber>_<imagenumber>.[tif|jpg|jp2|jpf|j2k|jbig2]
        Images which are used as soft or hard mask for another image are named according to the following pattern:
        <filename>_p<pagenumber>_<imagenumber>_mask.[tif|jpg|jp2|jpf|j2k|jbig2]
        where imagenumber is the number of the masked image on the page. Calculated resolution values based on the size of the placed image are embedded in generated TIFF images.
    resource
        Extract all plain and merged image resources in the document including masks and soft masks. Each image resource is extracted once, regardless of the number of occurrences in the document. Extracted images (including images which are used as soft or hard masks for another image) are named according to the following pattern:
        <filename>_I<imageid>.[tif|jpg|jp2|jpf|j2k|jbig2]
        The same image file names are emitted in the TETML attribute Image/@filename.
        Since no size information is available for image resources, a dummy value of 72dpi is embedded in generated TIFF images.
--imageopt <option list>
    Additional option list for TET_write_image_file( ) (see Table 10.20, page 198)
--lastpage, -l <integer> | last
    (Ignored for --imageloop resource) The number of the page where content extraction will finish. The keyword last specifies the last page, last-1 the page before the last page, etc. Default: last
--outfile, -o <filename>
--pagecount
(Not allowed if multiple input file names are supplied) File name for text or TETML
output. The file name »-« can be used to designate standard output provided only
a single input file has been supplied. Default: name of the input file, with .pdf or
.PDF replaced with .txt (for text output) or .tetml (for TETML output).
Print the number of pages in the document, i.e. the value of the pCOS path
length:pages, to stdout or the file provided with --outfile.
--pageopt
<option list>
Additional option list which will be supplied to TET_open_page( ) if text output is
generated, or to TET_process_page( ) if TETML output is generated (see Table 10.10,
page 181, and Table 10.21, page 200). For text output the option granularity is always set to page.
--password,
-p
<password>
User, master or attachment password for encrypted documents. In some situations the shrug feature can be used to index protected documents without supplying a password (see Section 5.1, »Extracting Content from protected PDF«, page
57).
--samedir
Create output files in the same directory as the input file(s).
1
--searchpath
-s
<path>...
Name of one or more directories where files (e.g. CMaps) will be searched. Default:
installation-specific
--targetdir
-t
<dirname>
Output directory for generated text, TETML, and image files. The directory must
exist. This option is ignored if --samedir is specified. Default: . (i.e. the current
working directory)
18
Chapter 2: TET Command-Line Tool
Table 2.1 TET command-line options
option
parameters
function
--tetml
-m
glyph |
image |
word |
wordplus |
line |
page
(Cannot be combined with --text) Create TETML output with information about
text, images, and interactive elements. TETML is created in UTF-8 format. The supplied parameter selects one of several variants (see Section 9.3, »Controlling TETML Details«, page 135):
glyph
Glyph-based TETML with glyph geometry and font details
image
TETML with image information, but without text and interactive
elements
line
Line-based TETML
page
Page-based TETML
word
Word-based TETML with word boxes
wordplus Word-based TETML with word boxes plus all glyph details
--tetopt
<option list>
--text2
--verbose
-v
--version, -V
Additional option list for TET_set_option( ) (see Table 10.2, page 162). The option
outputformat will be ignored (use --format instead).
(Can not be combined with --tetml) Extract text from the document (enabled by
default)
0 | 1 | 2 | 3
verbosity level (default: 1):
0
no output at all
1
emit only errors
2
emit errors and file names
3
detailed reporting
Print the TET version number.
1. This option can be supplied more than once.
2. The option --image disables text extraction by default, but it can be combined with --text and --tetml.
2.1 Command-Line Options
19
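The page values accepted by --firstpage and --lastpage in Table 2.1 can be illustrated with a small sketch. The Java class below is not part of TET: PageSpec and resolve( ) are hypothetical names, the accepted syntax (a positive integer, last, or last-N) is inferred from the option description, and rejecting out-of-range values with an exception is an assumption about reasonable behavior, not a statement about what the tet program does.

```java
/** Illustrative only: resolves a --firstpage/--lastpage style value to a page number. */
final class PageSpec {
    private PageSpec() {}

    /**
     * @param spec      "last", "last-N" or a positive integer (assumed syntax)
     * @param pageCount number of pages in the document
     * @return the 1-based page number the value refers to
     */
    static int resolve(String spec, int pageCount) {
        String s = spec.trim();
        int page;
        if (s.equals("last")) {
            page = pageCount;                       // keyword: the last page
        } else if (s.startsWith("last-")) {
            // last-1 is the page before the last page, and so on
            page = pageCount - Integer.parseInt(s.substring("last-".length()));
        } else {
            page = Integer.parseInt(s);             // plain page number
        }
        if (page < 1 || page > pageCount) {
            // Assumption: out-of-range values are treated as an error here.
            throw new IllegalArgumentException("page out of range: " + spec);
        }
        return page;
    }
}
```

For a document with 10 pages, resolve("last-1", 10) refers to page 9, which is the value used to exclude the last page from extraction.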
2.2 Constructing TET Command Lines
The following rules must be observed for constructing TET command lines:
> Input files will be searched in all directories specified as searchpath.
> Short forms are available for some options, and can be mixed with long options.
> Long options can be abbreviated provided the abbreviation is unique.
> Depending on the encryption status of the input file, a user or master password may
be required for successfully extracting text. It must be supplied with the --password
option. TET will check whether this password is sufficient for content extraction, and
will generate an error if it isn’t.
TET checks the full command line before processing any file. If an error is encountered
in the options anywhere on the command line, no files will be processed at all.
File names. File names which contain blank characters require some special handling
when used with command-line tools like TET. In order to process a file name with blank
characters you should enclose the complete file name with double quote " characters.
Wildcards can be used according to standard practice. For example, *.pdf denotes all files
in a given directory which have a .pdf file name suffix. Note that on some systems case
is significant, while on others it isn’t (i.e., *.pdf may be different from *.PDF). Also note
that on Windows systems wildcards do not work for file names containing blank characters. Wildcards will be evaluated in the current directory, not any searchpath directory.
On Windows all file name options accept Unicode strings, e.g. as a result of dragging
files from the Explorer to a command prompt window.
Response files. In addition to options supplied directly on the command-line, options
can also be supplied in a response file. The contents of a response file will be inserted in
the command-line at the location where the @filename option was found.
A response file is a simple text file with options and parameters. It must adhere to
the following syntax rules:
> Option values must be separated with whitespace, i.e. space, linefeed, return, or tab.
> Values which contain whitespace must be enclosed with double quotation marks: "
> Double quotation marks at the beginning and end of a value will be omitted.
> A double quotation mark must be masked with a backslash to use it literally: \"
> A backslash character must be masked with another backslash to use it literally: \\
Response files can be nested, i.e. the @filename syntax can itself be used in a response
file.
Response files may contain Unicode strings for file name arguments. Response files
can be encoded in UTF-8, EBCDIC-UTF-8, or UTF-16 format and must start with the corresponding BOM. If no BOM is found, the contents of the response file will be interpreted
in EBCDIC on zSeries, and in ISO 8859-1 (Latin-1) on all other systems.
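The syntax rules above can be sketched as a small tokenizer. The following Java class is an illustration only and not TET's actual parser. The class and method names are hypothetical, nested @filename references and BOM handling are left out, and two details are assumptions: the backslash rules are applied both inside and outside quoted values, and a backslash in front of any other character is kept literally.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative tokenizer for the response file syntax; not TET's actual parser. */
final class ResponseFileTokenizer {
    private ResponseFileTokenizer() {}

    /** Splits response file text into option values according to the rules above. */
    static List<String> tokenize(String text) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // currently inside "..."
        boolean started = false;    // a value is in progress (may be empty, e.g. "")

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()
                    && (text.charAt(i + 1) == '"' || text.charAt(i + 1) == '\\')) {
                current.append(text.charAt(++i));   // \" becomes "  and  \\ becomes \
                started = true;
            } else if (c == '"') {
                inQuotes = !inQuotes;               // delimiting quotes are omitted
                started = true;
            } else if (!inQuotes && (c == ' ' || c == '\n' || c == '\r' || c == '\t')) {
                if (started) {                      // whitespace ends the current value
                    values.add(current.toString());
                    current.setLength(0);
                    started = false;
                }
            } else {
                current.append(c);
                started = true;
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("unterminated quoted value");
        }
        if (started) {
            values.add(current.toString());
        }
        return values;
    }
}
```

Applied to a response file containing --targetdir "my out", the sketch yields the two values --targetdir and my out.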
Exit codes. The TET command-line tool returns with an exit code which can be used to
check whether or not the requested operations could be successfully carried out:
> Exit code 0: all command-line options could be successfully and fully processed.
> Exit code 1: one or more file processing errors occurred, but processing continued.
> Exit code 2: some error was found in the command-line options. Processing stopped
at the particular bad option, and no input file has been processed.
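A calling program can branch on these exit codes. The following Java sketch shows one way to do this; TetRunner, describe( ) and run( ) are hypothetical names, and the sketch assumes that the tet executable can be found via the PATH environment variable. Exit codes other than 0, 1 and 2 are not documented above and are only reported as such.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative wrapper around the tet command-line tool; all names are hypothetical. */
final class TetRunner {
    private TetRunner() {}

    /** Maps the documented exit codes to a short description. */
    static String describe(int exitCode) {
        switch (exitCode) {
            case 0:  return "all options processed successfully";
            case 1:  return "file processing errors occurred, processing continued";
            case 2:  return "error in command-line options, no input file processed";
            default: return "undocumented exit code " + exitCode;
        }
    }

    /** Runs "tet" (assumed to be on the PATH) with the given arguments. */
    static int run(List<String> args) throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add("tet");
        command.addAll(args);
        // inheritIO() passes the tool's messages through to the caller's console
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A batch job could, for example, treat exit code 1 as a warning (some files failed, others were processed) and exit code 2 as a configuration error that must be fixed before any file is processed.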
2.3 Command-Line Examples
The following examples demonstrate some useful combinations of TET command-line
options.
2.3.1 Extracting Text
Extract the text from a PDF document file.pdf in UTF-8 format and store it in file.txt:
tet file.pdf
Exclude the first and last page from text extraction:
tet --firstpage 2 --lastpage last-1 file.pdf
Supply a directory where the CJK CMaps are located (required for CJK text extraction):
tet --searchpath /usr/local/cmaps file.pdf
Extract the text from a PDF in UTF-16 format and store it in file.utf16:
tet --format utf16 --outfile file.utf16 file.pdf
Extract the text from all PDF files in a directory and store the generated *.txt files in another directory (which must already exist):
tet --targetdir out in/*.pdf
Extract the text from all PDF files from two directories and store the generated *.txt files
in the same directory as the corresponding input document:
tet --samedir dir1/*.pdf dir2/*.pdf
Restrict text extraction to a particular area on the page:
tet --pageopt "includebox={{0 0 200 200}}" file.pdf
Use a response file which contains various command-line options and process all PDF
documents in the current directory (the file options contains command-line options):
tet @options *.pdf
2.3.2 Extracting Images
Extract images from file.pdf in a page-oriented manner and store them in the directory
out:
tet --targetdir out --image file.pdf
Extract images from file.pdf in a resource-oriented manner and store them in the directory out:
tet --targetdir out --image --imageloop resource file.pdf
Extract images from file.pdf without image merging; this can be achieved by supplying a
list of page options for image processing:
tet --targetdir out --image --pageopt "imageanalysis={merge={disable}}" file.pdf
2.3.3 Generating TETML
Generate TETML output in word mode for PDF document file.pdf and store it in file.tetml:
tet --tetml word file.pdf
Generate TETML output without any Options elements; this can be achieved by supplying a suitable list of document options:
tet --docopt "tetml={elements={options=false}}" --tetml word file.pdf
Generate TETML output in word mode with all glyph details and store it in file.tetml:
tet --tetml word --pageopt "tetml={glyphdetails={all}}" file.pdf
Extract images and generate TETML with text and image information:
tet --image --tetml word file.pdf
Extract images and generate TETML image information, but no text:
tet --tetml image --image file.pdf
Generate TETML output with topdown coordinates:
tet --tetml word --pageopt "topdown={output}" file.pdf
Generate TETML output with improved table detection:
tet --tetml word --pageopt "vectoranalysis={structures=tables}" file.pdf
2.3.4 Advanced Options
Supply the document option checkglyphlists to improve Unicode mapping for certain
kinds of TeX-generated PDF documents:
tet --docopt checkglyphlists file.pdf
Apply Unicode foldings, e.g. space folding: map all variants of Unicode space characters
to U+0020:
tet --docopt "fold={{[:blank:] U+0020}}" file.pdf
Disable punctuation as word boundary:
tet --pageopt "contentanalysis={punctuationbreaks=false}" file.pdf
3 TET Library Language Bindings
This chapter discusses specifics for the language bindings which are supplied for the
TET library. The TET distribution contains full sample code for several small TET applications in all supported language bindings.
3.1 Exception Handling
Errors of a certain kind are called exceptions in many languages for good reasons: they
are mere exceptions, and are not expected to occur very often during the lifetime of a
program. The general strategy is to use conventional error reporting mechanisms (read:
error return codes) for function calls which may fail often, and to use a special
exception mechanism for those rare occasions which don’t justify cluttering the code
with conditionals. This is exactly the path that TET takes: some operations can be expected to go wrong rather frequently, for example:
> Trying to open a PDF document for which one doesn’t have the proper password (but
see also the shrug feature described in Section 5.1, »Extracting Content from protected PDF«, page 57);
> Trying to open a PDF document with a wrong file name;
> Trying to open a PDF document which is damaged beyond repair.
TET signals such errors by returning a value of –1 as documented in the API reference.
Other events may be considered harmful, but will occur rather infrequently, e.g.
> running out of virtual memory;
> supplying wrong function parameters (e.g. an invalid document handle);
> supplying malformed option lists;
> a required resource (e.g. a CMap file for CJK text extraction) cannot be found.
When TET detects such a situation, an exception will be thrown instead of passing a special error return value to the caller. In languages which support native exceptions,
the exception is thrown using the standard means supplied by the language or environment. For the C language binding TET supplies a custom exception
handling mechanism which must be used by clients (see Section 3.2, »C Binding«, page 24).
It is important to understand that processing a document must be stopped when an
exception has occurred. The only methods which can safely be called after an exception are
delete( ), get_apiname( ), get_errnum( ), and get_errmsg( ). Calling any other method after
an exception may lead to unexpected results. The exception will contain the following
information:
> A unique error number;
> The name of the API function which caused the exception;
> A descriptive text containing details of the problem.
Querying the reason of a failed function call. Some TET function calls, e.g. open_document( )
or open_page( ), can fail without throwing an exception (they will return -1
in case of an error). In this situation the functions get_errnum( ), get_errmsg( ), and
get_apiname( ) can be called immediately after a failed function call in order to retrieve details about the nature of the problem.
3.2 C Binding
TET is written in C with some C++ modules. In order to use the C binding you can use a
static or shared library (DLL on Windows and MVS), and you need the central TET include file tetlib.h for inclusion in your client source modules.
Note Applications which use the TET binding for C must be linked with a C++ compiler since the library includes some parts which are implemented in C++. Using a C linker may result in unresolved externals unless the application is linked against the required C++ support libraries.
Exception handling. The TET API provides a mechanism for acting upon exceptions
thrown by the library in order to compensate for the lack of native exception handling
in the C language. Using the TET_TRY( ) and TET_CATCH( ) macros client code can be set up
such that a dedicated piece of code is invoked for error handling and cleanup when an
exception occurs. These macros set up two code sections: the try clause with code which
may throw an exception, and the catch clause with code which acts upon an exception.
If any of the API functions called in the try block throws an exception, program execution will continue at the first statement of the catch block immediately. The following
rules must be obeyed in TET client code:
> TET_TRY( ) and TET_CATCH( ) must always be paired.
> TET_new( ) will never throw an exception; since a try block can only be started with a
valid TET object handle, TET_new( ) must be called outside of any try block.
> TET_delete( ) will never throw an exception, and therefore can safely be called outside
of any try block. It can also be called in a catch clause.
> Special care must be taken about variables that are used in both the try and catch
blocks. Since the compiler doesn’t know about the transfer of control from one block
to the other, it might produce inappropriate code (e.g., register variable optimizations) in this situation.
Fortunately, there is a simple rule to avoid this kind of problem: Variables used in
both the try and catch blocks must be declared volatile. Using the volatile keyword signals to the compiler that it must not apply dangerous optimizations to the variable.
> If a try block is left (e.g., with a return statement, thus bypassing the invocation of the
corresponding TET_CATCH( )), the TET_EXIT_TRY( ) macro must be called before the return statement to inform the exception machinery.
> Document processing must stop when an exception was thrown.
The following code fragment demonstrates these rules with the typical idiom for dealing with TET exceptions in client code (a full sample can be found in the TET package):
volatile int pageno;
...
if ((tet = TET_new()) == (TET *) 0)
{
    printf("out of memory\n");
    return(2);
}
TET_TRY(tet)
{
    for (pageno = 1; pageno <= n_pages; ++pageno)
    {
        /* process page */
        if (/* error happened */)
        {
            TET_EXIT_TRY(tet);
            return -1;
        }
    }
    /* statements that directly or indirectly call API functions */
}
TET_CATCH(tet)
{
    printf("Error %d in %s() on page %d: %s\n",
        TET_get_errnum(tet), TET_get_apiname(tet), pageno, TET_get_errmsg(tet));
}
TET_delete(tet);
Unicode handling for name strings. The C programming language supports genuine
Unicode strings only in version C11. Since this version is not yet generally supported,
TET offers Unicode support based on the traditional char data type. Some string parameters for API functions may be declared as name strings. These are handled depending
on the length parameter and the existence of a BOM at the beginning of the string. In C,
if the length parameter is different from 0 the string will be interpreted as UTF-16. If the
length parameter is 0 the string will be interpreted as UTF-8 if it starts with a UTF-8
BOM, or as EBCDIC UTF-8 if it starts with an EBCDIC UTF-8 BOM, or as auto encoding if
no BOM is found (or ebcdic on EBCDIC-based platforms).
Unicode handling for option lists. Strings within option lists require special attention
since they cannot be expressed as Unicode strings in UTF-16 format, but only as byte arrays. For this reason UTF-8 is used for Unicode options. TET looks for a BOM at the beginning of an option to determine the format of the string.
More precisely, interpreting a string option works as follows:
> If the option starts with a UTF-8 BOM (\xEF\xBB\xBF) it is interpreted as UTF-8.
> If the option starts with an EBCDIC UTF-8 BOM (\x57\x8B\xAB) it is interpreted as
EBCDIC UTF-8.
> If no BOM is found, the string is treated as winansi (or ebcdic on EBCDIC-based platforms).
Note The TET_convert_to_unicode( ) utility function can be used to create UTF-8 strings from UTF-16
strings, which is useful for creating option lists with Unicode values.
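The BOM rules for option lists can be restated as a short decision function. The sketch below is written in Java for brevity and is not part of the TET API: OptionListEncoding and detect( ) are hypothetical names, and the returned strings are merely labels for the three cases. Only the BOM byte values are taken from the rules above.

```java
/** Illustrative classifier for the option list BOM rules; not part of the TET API. */
final class OptionListEncoding {
    private OptionListEncoding() {}

    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};
    private static final int[] EBCDIC_UTF8_BOM = {0x57, 0x8B, 0xAB};

    /**
     * @param option         raw bytes of the option string
     * @param ebcdicPlatform true on EBCDIC-based platforms
     * @return a label for the interpretation chosen by the rules above
     */
    static String detect(byte[] option, boolean ebcdicPlatform) {
        if (startsWith(option, UTF8_BOM)) {
            return "utf8";
        }
        if (startsWith(option, EBCDIC_UTF8_BOM)) {
            return "ebcdicutf8";
        }
        // No BOM: platform-dependent legacy interpretation
        return ebcdicPlatform ? "ebcdic" : "winansi";
    }

    private static boolean startsWith(byte[] data, int[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {    // compare as unsigned byte values
                return false;
            }
        }
        return true;
    }
}
```

The practical consequence is that an option value containing non-Latin-1 characters must carry a BOM; without one its bytes are interpreted as winansi (or ebcdic) rather than as UTF-8.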
Using TET as a DLL loaded at runtime. While most clients will use TET as a statically
bound library or a dynamic library which is bound at link time, you can also load the
DLL at runtime and dynamically fetch pointers to all API functions. This is especially
useful to load the DLL only on demand, and on MVS where the library is customarily
loaded as a DLL at runtime without explicitly linking against TET. TET supports a special
mechanism to facilitate this dynamic usage. It works according to the following rules:
> Include tetlibdl.h instead of tetlib.h.
> Use TET_new_dl( ) and TET_delete_dl( ) instead of TET_new( ) and TET_delete( ).
> Use TET_TRY_DL( ) and TET_CATCH_DL( ) instead of TET_TRY( ) and TET_CATCH( ).
> Use function pointers for all other TET calls.
> Compile the auxiliary module tetlibdl.c and link your application against the resulting object file.
The dynamic loading mechanism is demonstrated in the extractordl.c sample.
3.3 C++ Binding
Note For .NET applications written in C++ we recommend accessing the TET .NET DLL directly instead
of via the C++ binding (except for cross-platform applications which should use the C++ binding). The TET distribution contains C++ sample code for use with .NET CLI which demonstrates
this combination.
In addition to the tetlib.h C header file, an object-oriented wrapper for C++ is supplied
for TET clients. It requires the tet.hpp header file, which in turn includes tetlib.h. Since
tet.hpp contains a template-based implementation no corresponding tet.cpp module is
required. Using the C++ object wrapper replaces the functional approach (with API functions and TET_ prefixes in all TET function names) with a more object-oriented approach.
String handling in C++. TET’s template-based string handling approach supports the
following usage patterns with respect to string handling:
> Strings of the C++ standard library type std::wstring are used as basic string type.
They can hold Unicode characters encoded as UTF-16 or UTF-32. This is the default behavior and the recommended approach for new applications unless custom data
types (see next item) offer a significant advantage over wstrings.
> Custom (user-defined) data types for string handling can be used as long as the custom data type is an instantiation of the basic_string class template and can be converted to and from Unicode via user-supplied converter methods. This technique is
demonstrated in the glyphinfo.cpp sample in the TET distribution.
The default interface assumes that all strings passed to and received from TET methods
are native wstrings. Depending on the size of the wchar_t data type, wstrings are assumed
to contain Unicode strings encoded as UTF-16 (2-byte characters) or UTF-32 (4-byte characters). Literal strings in the source code must be prefixed with L to designate wide
strings. Unicode characters in literals can be created with the \u and \U syntax. Although this syntax is part of standard ISO C++, some compilers don’t support it. In this
case literal Unicode characters must be created with hex digits.
Note On EBCDIC-based systems the formatting of option list strings for the wstring-based interface
requires additional conversions to avoid a mixture of EBCDIC and UTF-16 wstrings in option
lists. Convenience code for this conversion and instructions are available in the auxiliary module utf16num_ebcdic.hpp.
Error handling in C++. TET API functions will throw a C++ exception in case of an error.
These exceptions must be caught in the client code by using C++ try/catch clauses. In order to provide extended error information the TET class provides a public TET::Exception
class which exposes methods for retrieving the detailed error message, the exception
number, and the name of the TET API function which threw the exception.
Native C++ exceptions thrown by TET routines will behave as expected. The following code fragment will catch exceptions thrown by TET:
try {
    ...some TET instructions...
} catch (TET::Exception &ex) {
    wcerr << L"Error " << ex.get_errnum()
        << L" in " << ex.get_apiname()
        << L"(): " << ex.get_errmsg() << endl;
}
Using TET as a DLL loaded at runtime. Similar to the C language binding the C++ binding allows you to dynamically attach TET to your application at runtime (see »Using TET
as a DLL loaded at runtime«, page 25). Dynamic loading can be enabled as follows when
compiling the application module which includes tet.hpp:
#define TETCPP_DL 1
In addition you must compile the auxiliary module tetlibdl.c and link your application
against the resulting object file. Since the details of dynamic loading are hidden in the
TET object it does not affect the C++ API: all method calls look the same regardless of
whether or not dynamic loading is enabled. The dynamic loading mechanism is demonstrated in the extractordl sample in the shipped Makefile.
3.4 COM Binding
Installing the TET COM edition. TET can be deployed in all environments that support
COM components. Installing TET is an easy and straightforward process. Please note
the following:
> If you install on an NTFS partition all TET users must have read permission for the
installation directory, and execute permission for
...\TET 5.0 32-bit\bind\COM\bin\tet_com.dll.
> The installer must have write permission for the system registry. Administrator or
Power Users group privileges will usually be sufficient.
Exception Handling. Exception handling for the TET COM component is done according to COM conventions: when a TET exception occurs, a COM exception will be raised
and furnished with a clear-text description of the error. In addition the memory allocated by the TET object is released. The COM exception can be caught and handled in the
TET client in whichever way the client environment supports for handling COM errors.
Using the TET COM Edition with .NET. As an alternative to the TET.NET edition (see
Section 3.6, ».NET Binding«, page 31) the COM edition of TET can also be used with .NET.
First, you must create a .NET assembly from the TET COM edition using the tlbimp.exe
utility:
tlbimp tet_com.dll /namespace:tet_com /out:Interop.tet_com.dll
You can use this assembly within your .NET application. If you add a reference to tet_
com.dll from within Visual Studio .NET an assembly will be created automatically. The
following code fragment shows how to use the TET COM edition with C#:
using TET_com;
...
static TET_com.ITET tet;
...
tet = new TET();
...
All other code works as with the .NET edition of TET.
3.5 Java Binding
Installing the TET Java edition. TET is organized as a Java package with the name
com.pdflib.TET. This package relies on a native JNI library; both pieces must be configured
appropriately.
In order to make the JNI library available the following platform-dependent steps
must be performed:
> On Unix systems the library libtet_java.so (on OS X: libtet_java.jnilib) must be placed
in one of the default locations for shared libraries, or in an appropriately configured
directory.
> On Windows the library tet_java.dll must be placed in the Windows system directory,
or a directory which is listed in the PATH environment variable.
The TET Java package is contained in the TET.jar file. In order to supply this package to
your application, you must add TET.jar to your CLASSPATH environment variable, add the
option -classpath TET.jar in your calls to the Java compiler, or perform equivalent steps in
your Java IDE. In the JDK you can configure the Java VM to search for native libraries in a
given directory by setting the java.library.path property to the name of the directory, e.g.
java -Djava.library.path=. extractor
You can check the value of this property as follows:
System.out.println(System.getProperty("java.library.path"));
Using TET in J2EE application servers and Servlet containers. TET is perfectly suited for
server-side Java applications. The TET distribution contains sample code and configuration for using TET in J2EE environments. The following configuration issues must be observed:
> The directory where the server looks for native libraries varies among vendors. Common candidate locations are system directories, directories specific to the underlying Java VM, and local server directories. Please check the documentation supplied
by the server vendor.
> Application servers and Servlet containers often use a special class loader which may
be restricted or may use a dedicated classpath. For some servers it is required to define a
special classpath to make sure that the TET package will be found.
More detailed notes on using TET with specific Servlet engines and application servers
can be found in additional documentation in the J2EE directory of the TET distribution.
Unicode and legacy encoding conversion. For the convenience of TET users we list
some useful string conversion methods here. Please refer to the Java documentation for
more details.
The following constructor creates a Unicode string from a byte array, using the platform’s default encoding:
String(byte[] bytes)
The following constructor creates a Unicode string from a byte array, using the encoding supplied in the enc parameter (e.g. SJIS, UTF8, UTF-16):
String(byte[] bytes, String enc)
The following method of the String class converts a Unicode string to a string according
to the encoding specified in the enc parameter:
byte[] getBytes(String enc)
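The conversions listed above can be combined into a round trip. The following sketch wraps the two encoding-aware calls in helper methods; EncodingRoundTrip, encode( ) and decode( ) are hypothetical names chosen for this illustration, and turning the checked UnsupportedEncodingException into an IllegalArgumentException is a design choice of the sketch, not a TET convention.

```java
import java.io.UnsupportedEncodingException;

/** Illustrative helpers around the String conversions listed above; names are hypothetical. */
final class EncodingRoundTrip {
    private EncodingRoundTrip() {}

    /** Unicode string to bytes in the given encoding, via String.getBytes(String enc). */
    static byte[] encode(String text, String enc) {
        try {
            return text.getBytes(enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported encoding: " + enc, e);
        }
    }

    /** Bytes in the given encoding to Unicode string, via String(byte[] bytes, String enc). */
    static String decode(byte[] bytes, String enc) {
        try {
            return new String(bytes, enc);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported encoding: " + enc, e);
        }
    }
}
```

Passing the encoding name explicitly, as in decode(encode(text, "UTF-8"), "UTF-8"), avoids depending on the platform's default encoding, which differs between systems.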
Javadoc documentation for TET. The TET package contains Javadoc documentation for
TET. The Javadoc contains only abbreviated descriptions of all TET API methods; please
refer to Chapter 10, »TET Library API Reference«, page 155, for more details.
In order to configure Javadoc for TET in Eclipse proceed as follows:
> In the Package Explorer right-click on the Java project and select Javadoc Location.
> Click on Browse... and select the path where the Javadoc (which is part of the TET
package) is located.
After these steps you can browse the Javadoc for TET, e.g. with the Java Browsing perspective or via the Help menu.
Exception handling. The TET language binding for Java will throw native Java exceptions of the class TETException. TET client code must use standard Java exception syntax:
TET tet = null;
try {
    ...TET method invocations...
} catch (TETException e) {
    System.err.print("TET exception occurred:\n");
    System.err.print("[" + e.get_errnum() + "] " + e.get_apiname() + ": " +
        e.get_errmsg() + "\n");
} catch (Exception e) {
    System.err.println(e.getMessage());
} finally {
    if (tet != null) {
        tet.delete();    /* delete the TET object */
    }
}
Since TET declares appropriate throws clauses, client code must either catch all possible
exceptions or declare those itself.
3.6 .NET Binding
Note Detailed information about the various flavors and options for using TET with the .NET Framework can be found in the PDFlib-in-.NET-HowTo.pdf document which is contained in the distribution packages and also available on the PDFlib Web site.
The .NET edition of TET supports all relevant .NET concepts. In technical terms, the
TET.NET edition is a C++ class (with a managed wrapper for the unmanaged TET core library) which runs under control of the .NET framework. It is packaged as a static assembly with a strong name. The TET assembly (TET_dotnet.dll) contains the actual library
plus meta information.
Installing the TET Edition for .NET. The TET.NET MSI installer installs the TET assembly
plus auxiliary data files, documentation and samples on the machine interactively. The
installer also registers TET so that it can easily be referenced on the .NET tab in the Add
Reference dialog box of Visual Studio .NET.
Error handling. TET.NET supports .NET exceptions, and will throw an exception with a
detailed error message when a runtime problem occurs. The client is responsible for
catching such an exception and properly reacting to it. Otherwise the .NET framework
will catch the exception and usually terminate the application.
In order to convey exception-related information TET defines its own exception
class TET_dotnet.TETException with the members get_errnum, get_errmsg, and get_apiname.
Using TET with C++ and CLI. .NET applications written in C++ (based on the Common
Language Infrastructure, CLI) can directly access the TET.NET DLL without using the TET
C++ binding. The source code must reference TET as follows:
using namespace TET_dotnet;
3.7 Objective-C Binding
Although the C and C++ language bindings can be used with Objective-C, a genuine language binding for Objective-C is also available. The TET framework is available in the
following flavors:
> TET for use on OS X
> TET_ios for use on iOS
Both frameworks contain language bindings for C, C++, and Objective-C.
Installing the TET Edition for Objective-C on OS X. In order to use TET in your application you must copy TET.framework or TET_ios.framework to the directory /Library/Frameworks.
Installing the TET framework in a different location is possible, but requires
use of Apple’s install_name_tool which is not described here. The TET_objc.h header file
with TET method declarations must be imported in the application source code:
#import "TET/TET_objc.h"
or
#import "TET_ios/TET_objc.h"
Parameter naming conventions. For TET method calls you must supply parameters according to the following conventions:
> The value of the first parameter is provided directly after the method name, separated by a colon character.
> For each subsequent parameter the parameter’s name and its value (again separated
from each other by a colon character) must be provided. The parameter names can
be found in Chapter 10, »TET Library API Reference«, page 155, and in TET_objc.h.
For example, the following line in the API description:
int open_page(int doc, int pagenumber, String optlist)
corresponds to the following Objective-C method:
- (NSInteger) open_page: (NSInteger) doc pagenumber: (NSInteger) pagenumber optlist: (NSString *) optlist;
This means your application must make a call similar to the following:
page = [tet open_page:doc pagenumber:pageno optlist:pageoptlist];
Xcode Code Sense for code completion can be used with the TET framework.
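The naming rule above is purely mechanical, which the following Python sketch illustrates. It is not part of TET; the helper name objc_call is made up for this illustration. Given a method name and the parameter names from the API reference, it builds the corresponding Objective-C message expression:

```python
def objc_call(receiver, method, params):
    """Build an Objective-C message expression for a TET method.

    params is a list of (parameter_name, argument_expression) pairs in
    the order given in the API reference. The name of the first
    parameter is dropped; every later parameter is written as
    name:argument.
    """
    if not params:
        return "[%s %s]" % (receiver, method)
    _, first_arg = params[0]
    parts = ["%s:%s" % (method, first_arg)]
    parts += ["%s:%s" % (name, arg) for name, arg in params[1:]]
    return "[%s %s]" % (receiver, " ".join(parts))


print(objc_call("tet", "open_page",
                [("doc", "doc"), ("pagenumber", "pageno"),
                 ("optlist", "pageoptlist")]))
# prints: [tet open_page:doc pagenumber:pageno optlist:pageoptlist]
```

The output matches the open_page call shown above: only the first argument follows the method name directly, while all other arguments carry their parameter name.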
Error handling in Objective-C. The Objective-C binding translates TET exceptions to
native Objective-C exceptions. In case of a runtime problem TET throws a native Objective-C exception of the class TETException. These exceptions can be handled with the
usual try/catch mechanism:
@try {
...some TET instructions...
}
@catch (TETException *ex) {
NSString * errorMessage =
[NSString stringWithFormat:@"TET error %d in '%@': %@",
[ex get_errnum], [ex get_apiname], [ex get_errmsg]];
NSAlert *alert = [[NSAlert alloc] init];
[alert setMessageText: errorMessage];
[alert runModal];
[alert release];
}
@catch (NSException *ex) {
NSAlert *alert = [[NSAlert alloc] init];
[alert setMessageText: [ex reason]];
[alert runModal];
[alert release];
}
@finally {
[tet release];
}
In addition to the get_errmsg method you can also use the reason field of the exception
object to retrieve the error message.
3.8 Perl Binding
The TET wrapper for Perl consists of a C wrapper and two Perl package modules, one for
providing a Perl equivalent for each TET API function and another one for the TET object. The C module is used to build a shared library which the Perl interpreter loads at
runtime, with some help from the package file. Perl scripts refer to the shared library
module via a use statement.
Installing the TET Edition for Perl. The Perl extension mechanism loads shared libraries at runtime through the DynaLoader module. The Perl executable must have been
compiled with support for shared libraries (this is true for the majority of Perl configurations).
For the TET binding to work, the Perl interpreter must access the TET Perl wrapper
and the modules tetlib_pl.pm and PDFlib/TET.pm. In addition to the platform-specific
methods described below you can add a directory to Perl’s @INC module search path using the -I command line option:
perl -I/path/to/tet extractor.pl
Unix. Perl will search for tetlib_pl.so (on OS X: tetlib_pl.bundle), tetlib_pl.pm and PDFlib/TET.pm in the current directory, or in the directory printed by the following Perl command:
perl -e 'use Config; print $Config{sitearchexp};'
Perl will also search the subdirectory auto/tetlib_pl. Typical output of the above command looks like
/usr/lib/perl5/site_perl/5.16/i686-linux
Windows. The DLL tetlib_pl.dll and the modules tetlib_pl.pm and PDFlib/TET.pm will be
searched in the current directory, or the directory printed by the following Perl command:
perl -e "use Config; print $Config{sitearchexp};"
Typical output of the above command looks like
C:\Program Files\Perl5.16\site\lib
Exception Handling in Perl. When a TET exception occurs, a Perl exception is thrown. It
can be caught and acted upon using an eval sequence:
eval {
...some TET instructions...
};
die "Exception caught: $@" if $@;
3.9 PHP Binding
Note Detailed information about the various flavors and options for using TET with PHP can be
found in the PDFlib-in-PHP-HowTo document which is included in the distribution packages
and available on the PDFlib Web site. Although it is mainly targeted at using PDFlib with PHP
the discussion applies equally to using TET with PHP.
Installing the TET Edition for PHP. TET is implemented as a C library which can dynamically be attached to PHP. TET supports several versions of PHP. Depending on the version of PHP you use you must choose the appropriate TET library from the unpacked
TET archive.
You must configure PHP so that it knows about the external TET library. You have
two choices:
> Add one of the following lines in php.ini:
extension=php_tet.dll ; for Windows
extension=php_tet.so ; for Unix and OS X
extension=php_tet.sl ; for HP-UX
PHP will search the library in the directory specified in the extension_dir variable in
php.ini on Unix, and additionally in the standard system directories on Windows.
You can test which version of the PHP TET binding you have installed with the following one-line PHP script:
<?php phpinfo(); ?>
This will display a long info page about your current PHP configuration. On this page
check the section titled tet. If this section contains the phrase
PDFlib TET Support enabled
(plus the TET version number) you have successfully installed TET for PHP.
> Alternatively, you can load TET at runtime with one of the following lines at the start
of your script:
dl("php_tet.dll"); # for Windows
dl("php_tet.so"); # for Unix and OS X
dl("php_tet.sl"); # for HP-UX
File name handling in PHP. Unqualified file names (without any path component) and
relative file names are handled differently in Unix and Windows versions of PHP:
> PHP on Unix systems will find files without any path component in the directory
where the script is located.
> PHP on Windows will find files without any path component only in the directory
where the PHP DLL is located.
Exception handling. Since PHP supports structured exception handling, TET exceptions will be propagated as PHP exceptions. You can use the standard try/catch technique to deal with TET exceptions:
try {
...some TET instructions...
} catch (TETException $e) {
print "TET exception occurred:\n";
print "[" . $e->get_errnum() . "] " . $e->get_apiname() . ": " .
$e->get_errmsg() . "\n";
}
catch (Exception $e) {
print $e;
}
Developing with Eclipse and Zend Studio. The PHP Development Tools (PDT) support
PHP development with Eclipse and Zend Studio. PDT can be configured to support context-sensitive help with the steps outlined below.
Add TET to the Eclipse preferences so that it will be known to all PHP projects:
> Select Window, Preferences, PHP, PHP Libraries, New... to launch a wizard.
> In User library name enter TET, click Add External folder... and select the folder
bind\php\Eclipse PDT.
In an existing or new PHP project you can add a reference to the TET library as follows:
> In the PHP Explorer right-click on the PHP project and select Include Path, Configure
Include Path...
> Go to the Libraries tab, click Add Library..., and select User Library, TET.
After these steps you can explore the list of TET methods under the PHP Include Path/TET/TET node in the PHP Explorer view. When writing new PHP code Eclipse will assist with
code completion and context-sensitive help for all TET methods.
3.10 Python Binding
Installing the TET edition for Python. The Python extension mechanism works by
loading shared libraries at runtime. For the TET binding to work, the Python interpreter
must have access to the TET Python wrapper which will be searched in the directories
listed in the PYTHONPATH environment variable. The name of the Python wrapper depends on the platform:
> Unix and OS X: tetlib_py.so
> Windows: tetlib_py.pyd
Error Handling in Python. The Python binding translates TET exceptions to native
Python exceptions. The Python exceptions can be dealt with by the usual try/except
technique:
try:
    ...some TET instructions...
except TETException:
    print("TET exception occurred:\n[%d] %s: %s" %
        ((tet.get_errnum()), tet.get_apiname(), tet.get_errmsg()))
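If importing the binding fails, it can help to check whether the wrapper named in the installation notes above is reachable at all. The following is a minimal sketch which uses only the Python standard library; the helper names wrapper_name and find_wrapper are made up for this illustration and are not part of TET. It looks for the platform-specific wrapper file in the directories listed in PYTHONPATH:

```python
import os
import sys


def wrapper_name(platform=sys.platform):
    """Return the file name of the TET Python wrapper for a platform."""
    # Windows uses the .pyd suffix; Unix and OS X use .so.
    return "tetlib_py.pyd" if platform.startswith("win") else "tetlib_py.so"


def find_wrapper(search_dirs, platform=sys.platform):
    """Return the first directory in search_dirs containing the wrapper."""
    name = wrapper_name(platform)
    for directory in search_dirs:
        if os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


if __name__ == "__main__":
    # PYTHONPATH entries are separated by os.pathsep (":" or ";").
    dirs = [d for d in os.environ.get("PYTHONPATH", "").split(os.pathsep) if d]
    print(find_wrapper(dirs) or "TET wrapper not found in PYTHONPATH")
```

The sketch only checks PYTHONPATH. Python also searches other locations (for example the script directory), so a negative result here does not necessarily mean that the import will fail.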
3.11 REALbasic/Xojo Binding
Installing the TET edition for REALbasic/Xojo. TET for REALbasic/Xojo (TET.rbx) must be
copied to a folder called Plugins in the same folder where the REALbasic/Xojo application
lives. TET for REALbasic/Xojo contains variants for OS X, Windows and Linux. This
means that you can use any version of REALbasic/Xojo to build applications for all supported platforms. When a stand-alone application is generated, REALbasic/Xojo selects
the appropriate parts of TET and includes only the platform-specific portion(s) in the
generated application.
Additional REALbasic/Xojo classes. TET adds two new classes to the object hierarchy:
> The TET class contains all TET API methods.
> The TETException class, which is derived from RuntimeException, can be used to deal
with exceptions thrown by TET (see below).
TET can be used to create GUI applications as well as console applications. Since TET is
not a control it does not install a new icon in the control palette. However, when TET is
available, REALbasic/Xojo will be aware of the TET class and its associated methods. For
example, statement completion and parameter checking fully work for TET API methods.
Error handling in REALbasic/Xojo. In case of an exception TET will throw a native
REALbasic/Xojo exception of the class TETException. TET exceptions can be handled with
a standard try/catch block.
3.12 Ruby Binding
Installing the TET Ruby edition. The Ruby extension mechanism works by loading a
shared library at runtime. For the TET binding to work, the Ruby interpreter must have
access to the TET extension library for Ruby. This library (on Windows and Unix: TET.so;
on OS X: TET.bundle) will usually be installed in the site_ruby branch of the local ruby installation directory, i.e. in a directory with a name similar to the following:
/usr/local/lib/ruby/site_ruby/<version>/
However, Ruby will search other directories for extensions as well. In order to retrieve a
list of these directories you can use the following ruby call:
ruby -e "puts $:"
This list will usually include the current directory, so for testing purposes you can simply place the TET extension library and the scripts in the same directory.
Error Handling in Ruby. The Ruby binding installs an error handler which translates
TET exceptions to native Ruby exceptions. The Ruby exceptions can be dealt with by the
usual rescue technique:
begin
...some TET instructions...
rescue TETException => pe
print pe.backtrace.join("\n") + "\n"
print "Error [" + pe.get_errnum.to_s + "] " + pe.get_apiname + ": " + pe.get_errmsg
print " on page #{pageno}" if (pageno != 0)
print "\n"
rescue Exception => e
print e.backtrace.join("\n") + "\n" + e.to_s + "\n"
ensure
tet.delete() if tet
end
Ruby on Rails. Ruby on Rails is an open-source framework which facilitates Web development with Ruby. The TET extension for Ruby can be used with Ruby on Rails. Follow
these steps to run the TET examples for Ruby on Rails:
> Install Ruby and Ruby on Rails.
> Set up a new controller from the command line:
$ rails new tetdemo
$ cd tetdemo
$ cp <TET dir>/bind/ruby/<version>/TET.so vendor/ # use .so/.dll/.bundle
$ cp <TET dir>/bind/data/TET-datasheet.pdf .
$ rails generate controller home demo
$ rm public/index.html
> Edit config/routes.rb:
...
# remember to delete public/index.html
root :to => "home#demo"
> Edit app/controllers/home_controller.rb as follows and insert TET code for extracting
PDF contents. As a starting point you can use the code in the extractor-rails.rb sample:
class HomeController < ApplicationController
def demo
require "TET"
begin
tet = TET.new
doc = tet.open_document(infilename, docoptlist)
...TET application code, see extractor-rails.rb...
...
# and finally show the retrieved text
send_data text, :type => "text/plain", :disposition => "inline"
rescue TETException => pe
# error handling
end
end
end
> In order to test your installation start the WEBrick server with the command
$ rails server
and point your browser to http://0.0.0.0:3000. The text extracted from the PDF document will be displayed in the browser.
Local TET installation. If you want to use TET only with Ruby on Rails, but cannot install it globally for general use with Ruby, you can install TET locally in the vendor directory within the Rails tree. This is particularly useful if you do not have permission to install Ruby extensions for general use, but want to work with TET in Rails nevertheless.
3.13 RPG Binding
TET provides a /copy module that defines all prototypes and some useful constants
needed to compile ILE-RPG programs with embedded TET functions.
Unicode string handling. Since all TET functions use Unicode strings with variable
length as parameters, you have to use the %ucs2 builtin function to convert a single-byte string to a Unicode string. All strings returned by TET functions are Unicode strings
with variable length. Use the %char builtin function to convert these Unicode strings to
single-byte strings.
Note The %CHAR and %UCS2 functions use the current job’s CCSID to convert strings from and to
Unicode. The examples provided with TET are based on CCSID 37 (US EBCDIC). Some special
characters in option lists (e.g. { [ ] } ) may not be translated correctly if you run the examples under other codepages.
Since all strings are passed as variable length strings you must not pass the length parameters in those functions which expect explicit string lengths (the length of a variable length string is stored in the first two bytes of the string).
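The layout of such a variable length string can be pictured with a short Python sketch. It is an illustration only and not part of TET. The sketch assumes that the two-byte prefix is a big-endian unsigned integer which counts UCS-2 characters (not bytes), and that the text contains only characters which fit in one 16-bit unit; neither detail is stated in this manual. The function names are made up for this illustration:

```python
import struct


def pack_varying_ucs2(text):
    """Pack text as a 2-byte length prefix followed by UCS-2 data.

    Assumptions (not taken from the TET manual): the prefix is a
    big-endian unsigned integer counting UCS-2 characters, and the
    text contains only BMP characters (one 16-bit unit each).
    """
    data = text.encode("utf-16-be")
    return struct.pack(">H", len(data) // 2) + data


def unpack_varying_ucs2(buffer):
    """Read the length prefix and decode that many UCS-2 characters."""
    (count,) = struct.unpack_from(">H", buffer, 0)
    return buffer[2:2 + 2 * count].decode("utf-16-be")


packed = pack_varying_ucs2("TET")
print(packed.hex())                 # prints: 0003005400450054
print(unpack_varying_ucs2(packed))  # prints: TET
```

Because the length travels with the string itself, a separate length argument would be redundant, which is why it must be omitted in the RPG binding.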
Compiling and binding RPG programs for TET. Using TET functions from RPG requires
the compiled TET service program. To include the TET definitions at compile time you
have to specify the name in the D specs of your ILE-RPG program:
d/copy QRPGLESRC,TETLIB
If the TET source file library is not on top of your library list you have to specify the library as well:
d/copy tetsrclib/QRPGLESRC,TETLIB
Before you start compiling your ILE-RPG program you have to create a binding directory
that includes the TETLIB service program shipped with TET. The following example assumes that you want to create a binding directory called TETLIB in the library TETLIB:
CRTBNDDIR BNDDIR(TETLIB/TETLIB) TEXT('TETlib Binding Directory')
After creating the binding directory you need to add the TETLIB service program to your
binding directory. The following example assumes that you want to add the service program TETLIB in the library TETLIB to the binding directory created earlier.
ADDBNDDIRE BNDDIR(TETLIB/TETLIB) OBJ((TETLIB/TETLIB *SRVPGM))
Now you can compile your program using the CRTBNDRPG command (or option 14 in
PDM):
CRTBNDRPG PGM(TETLIB/EXTRACTOR) SRCFILE(TETLIB/QRPGLESRC) SRCMBR(*PGM) DFTACTGRP(*NO)
BNDDIR(TETLIB/TETLIB)
Error Handling in RPG. TET clients written in ILE-RPG can use the monitor/on-error/
endmon error handling mechanism that ILE-RPG provides. Another way to monitor for
exceptions is to use the *PSSR global error handling subroutine in ILE-RPG. If an exception occurs, the job log shows the error number, the function that failed and the reason
for the exception. TET sends an escape message to the calling program.
c                   eval      tet=TET_new
 *
c                   monitor
 *
c                   callp     TET_set_option(tet:globaloptlist)
c                   eval      doc=TET_open_document(tet:%ucs2(%trim(parm1)):docoptlist)
                    :
                    :
 * Error Handling
c                   on-error
 * Do something with this error
 * don't forget to free the TET object
c                   callp     TET_delete(tet)
c                   endmon
4 TET Connectors
TET connectors provide the necessary glue code for interfacing TET with other software.
TET connectors are based on the TET library or the TET command-line tool.
4.1 Free TET Plugin for Adobe Acrobat
This section discusses the TET Plugin, a freely available packaging of TET which can be
used for testing in Adobe Acrobat and interactive use of TET with any PDF document.
The TET Plugin works with Acrobat X-DC Standard, Pro, and Pro Extended (but not the
free Adobe Reader). It can be downloaded for free from the following location:
www.pdflib.com/products/tet-plugin.
What is the TET Plugin? The TET Plugin provides simple interactive access to TET. Although the TET Plugin runs as an Acrobat plugin, the underlying content extraction features do not use Acrobat functions, but are completely based on TET. The TET Plugin is
provided as a free tool which demonstrates the power of PDFlib TET. Since TET is more
powerful than Acrobat’s built-in text and image extraction tools and offers a number of
convenient user interface features, it is useful as a replacement for Acrobat’s built-in
copy and find features. PDFlib TET can successfully process many documents for which
Acrobat provides only garbage when trying to extract the text. The TET Plugin provides
the following functions:
> Copy the text from a PDF document to the system clipboard or a disk file.
> Convert a PDF to TETML and place it on the clipboard or a disk file.
> Copy XMP document metadata to the clipboard.
> Find words in the document.
> Highlight all instances of a search term on the page simultaneously.
> Extract images from the document as TIFF, JPEG, JPEG 2000, or JBIG2 files.
> Display color space and position information for images.
> Detailed configuration settings are available to adjust text and image extraction to
your requirements. Configuration sets can be saved and reloaded.
Advantages over Acrobat’s copy function. The TET Plugin offers several advantages
over Acrobat’s built-in copy facility:
> The output can be customized to match different application requirements.
> TET is able to correctly interpret the text in many cases where Acrobat copies only
garbage to the clipboard.
> Unknown glyphs (for which proper Unicode mapping cannot be established) are
highlighted in red color, and can be replaced with a user-selected character (e.g. question mark).
> TET processes documents much faster than Acrobat.
> Images can be selected interactively for export, or all images on the page or in the
document can be extracted.
> Tiny image fragments are merged to usable images.
4.2 TET Connector for the Lucene Search Engine
Lucene is an open-source search engine. Lucene is primarily a Java project, but a version
for .NET is also available. For more information on Lucene see lucene.apache.org.
Note Protected documents can be indexed with the shrug option under certain conditions (see
Section 5.1, »Extracting Content from protected PDF«, page 57, for details). This is prepared in
the Connector files, but you must manually enable this option.
Requirements and installation. The TET distribution contains a TET connector which
can be used to enable PDF indexing in Lucene Java. We describe this connector for Lucene Java in more detail below, assuming the following requirements are met:
> JDK 1.7 or later for Lucene 5.2.x.
> A working installation of the Ant build tool
> The Lucene distribution with the Lucene core JAR file. The Ant build file distributed
with TET expects the files lucene-core-x.x.x.jar, lucene-analyzers-common-x.x.x.jar and
lucene-queryparser-x.x.x.jar, which are part of the Lucene distribution.
> An installed TET distribution package for Unix, Linux, OS X, or Windows
In order to build and run the TET connector for Lucene perform the following steps at a
command prompt:
> Change to the directory <TET install dir>/connectors/lucene.
> Copy the files lucene-core-x.x.x.jar, lucene-analyzers-common-x.x.x.jar and lucene-queryparser-x.x.x.jar to this directory.
> Optionally customize the settings by adding global, document-, and page-related
TET options in TetReader.java. For example, the global option list can be used to supply a suitable search path for resources (e.g. if the CJK CMaps are installed in a directory different from the default installation).
The PdfDocument.java module demonstrates how to process PDF documents which
are stored either on a disk file or in a memory buffer (e.g. supplied by a Web crawler).
> Run the command ant index. This will compile the source code and run the indexer
on the PDF files contained in the directory <TET install dir>/bind/data.
> Run the command ant search to start the command-line search client where you can
enter queries in the Lucene query language.
Testing TET and Lucene with the command-line search client. The following sample
session demonstrates the commands and output for indexing with TET and Lucene, and
testing the generated index with the Lucene command-line query tool. The process is
started by running the command ant index:
amira (1)$ ant index
Buildfile: build.xml
...
index:
[echo] Indexing PDF files in directory "../../bind/data"
[java] adding ../../bind/data/Whitepaper-Technical-Introduction-to-PDFA.pdf
[java] adding ../../bind/data/Whitepaper-XMP-metadata-in-PDFlib-products.pdf
[java] adding ../../bind/data/PDFlib-datasheet.pdf
[java] adding ../../bind/data/TET-datasheet.pdf
[java] 662 total milliseconds
BUILD SUCCESSFUL
Total time: 1 second
amira (1)$ ant search
Buildfile: build.xml
compile:
search:
[java] Enter query:
PDFlib
[java] Searching for: pdflib
[java] 4 total matching documents
[java] 1. ../../bind/data/PDFlib-datasheet.pdf
[java] Title: PDFlib, PDFlib+PDI, Personalization Server data sheet
[java] Font : PDFlibLogo-Regular
[java] Font : TheSans-Plain
...
[java] 2. ../../bind/data/Whitepaper-XMP-metadata-in-PDFlib-products.pdf
[java] Title: Whitepaper: XMP Metadata support in PDFlib products
[java] Font : PDFlibLogo-Regular
[java] Font : TheSansLight-Italic
...
[java] 3. ../../bind/data/Whitepaper-Technical-Introduction-to-PDFA.pdf
[java] Title: Whitepaper: A Technical Introduction to PDF/A
[java] Font : PDFlibLogo-Regular
[java] Font : TheSansLight-Italic
...
[java] 4. ../../bind/data/TET-datasheet.pdf
[java] Title: PDFlib TET datasheet
[java] Subject: PDFlib TET reliably extracts text, images, and metadata from PDF documents
[java] Font : TheSans-Plain
[java] Font : PDFlibLogo-Regular
...
[java] Press (q)uit or enter number to jump to a page.
q
[java] Enter query:
title:XMP
[java] Searching for: title:xmp
[java] 1 total matching documents
[java] 1. ../../bind/data/Whitepaper-XMP-metadata-in-PDFlib-products.pdf
[java] Title: Whitepaper: XMP Metadata support in PDFlib products
[java] Font : PDFlibLogo-Regular
[java] Font : TheSansLight-Italic
...
Two queries have been performed: one for the word PDFlib in the text, and another one
for the word XMP in the title field. Note that q must be entered to leave the result paging
mode before the next query can be started.
All paths and filenames in the Ant build.xml file are defined via properties so that the
file can be used with different environments, either by providing the properties on the
command line or by entering the properties to override in a file build.properties, or even
platform-specific into the files windows.properties or unix.properties. For example, to run
the sample with a Lucene JAR file which is installed under /tmp you can invoke Ant as
follows:
ant -Dlucene-core.jar=/tmp/lucene-core-x.x.x.jar
-Dlucene-analyzers-common.jar=/tmp/lucene-analyzers-common-x.x.x.jar
-Dlucene-queryparser.jar=/tmp/lucene-queryparser-x.x.x.jar index
Indexing metadata fields. The TET connector for Lucene indexes the following metadata fields:
> path (StringField): the pathname of the document
> modified (DateLongField): the date of the last modification (taken from the PDF file’s
time-stamp, not the PDF metadata)
> contents (ReaderTextField): the full text contents of the document
> All predefined and custom PDF document info entries, e.g. Title, Subject, Author, etc.
Document info entries can be queried with the pCOS interface which is integrated in
TET (see the pCOS Path Reference for more details on pCOS), e.g.
String objType = tet.pcos_get_string(tetHandle, "type:/Info/Subject");
if (!objType.equals("null")) {
doc.add(new TextField("summary",
tet.pcos_get_string(tetHandle, "/Info/Subject"),
Field.Store.YES));
}
> font: the names of all fonts in the PDF document
You can customize metadata fields by modifying the set of indexed document info entries or by adding more information based on pCOS paths in PdfDocument.java.
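The check used in the Java fragment above (query the type: prefix first, then fetch the value only if the entry exists) can be wrapped in a small helper. The following Python sketch assumes that the Python binding offers pcos_get_string() with the same arguments as the Java call shown above. The helper info_entry and the FakeTET class are made up for this illustration; FakeTET merely stands in for a real TET object so that the sketch runs without TET installed:

```python
def info_entry(tet, doc, key):
    """Return the document info entry `key`, or None if it is absent.

    Mirrors the Java fragment: the pCOS path "type:/Info/<key>"
    yields the string "null" when the entry does not exist.
    """
    if tet.pcos_get_string(doc, "type:/Info/" + key) == "null":
        return None
    return tet.pcos_get_string(doc, "/Info/" + key)


class FakeTET:
    """Hypothetical stand-in for a TET object, for demonstration only."""

    def __init__(self, info):
        self.info = info

    def pcos_get_string(self, doc, path):
        type_prefix = "type:/Info/"
        if path.startswith(type_prefix):
            return "string" if path[len(type_prefix):] in self.info else "null"
        return self.info[path[len("/Info/"):]]


tet = FakeTET({"Subject": "PDFlib TET datasheet"})
print(info_entry(tet, 0, "Subject"))  # prints: PDFlib TET datasheet
print(info_entry(tet, 0, "Author"))   # prints: None
```

Testing the type first avoids querying a path which does not exist in the document.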
PDF file attachments. The Lucene connector for TET recursively processes all PDF file
attachments in a document, and feeds the text and metadata of each attachment to the
Lucene search engine for indexing. This way search hits will be generated even if the
searched text is not present in the main document but only in an attachment. Recursive attachment traversal is especially important for PDF packages and portfolios.
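The idea of recursive attachment traversal can be sketched independently of TET and Lucene. The following Python sketch does not show the connector's actual code: the nested dictionary layout (keys name, text and attachments) and the function name walk_attachments are made up for this illustration. Each document is handed to the indexer first, followed by all of its attachments at any nesting depth:

```python
def walk_attachments(doc, visit, prefix=""):
    """Call visit(label, text) for doc and all nested attachments.

    doc is a hypothetical dict with the keys "name", "text" and an
    optional list "attachments" of dicts with the same layout.
    """
    label = prefix + doc["name"]
    visit(label, doc["text"])
    for child in doc.get("attachments", []):
        walk_attachments(child, visit, label + " > ")


portfolio = {
    "name": "portfolio.pdf", "text": "cover sheet",
    "attachments": [
        {"name": "report.pdf", "text": "annual report",
         "attachments": [{"name": "appendix.pdf", "text": "tables"}]},
    ],
}
indexed = []
walk_attachments(portfolio, lambda label, text: indexed.append((label, text)))
for label, text in indexed:
    print(label, "->", text)
```

The label records the chain of containing documents, so a search hit in a deeply nested attachment can still be traced back to the main document.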
4.3 TET Connector for the Solr Search Server
Solr is a high performance open-source enterprise search server based on the Lucene
search library, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted
search, caching, replication, and a web admin interface. It runs in a Java servlet container (see lucene.apache.org/solr).
Solr acts as an additional layer around the Lucene core engine. It expects the indexed
data in a simple XML format. Solr input can most easily be generated based on TETML,
the XML flavor produced by TET. The TET connector for Solr consists of an XSLT
stylesheet which converts TETML to the XML format expected by Solr. The TETML input
for this stylesheet can be generated with the TET library or the TET command-line tool
(see Section 9.1, »Creating TETML«, page 129).
Note Protected documents can be indexed with the shrug option under certain conditions (see
Section 5.1, »Extracting Content from protected PDF«, page 57, for details). In order to index
protected documents you must enable this option in the TET library or the TET command-line
tool when generating the TETML input for Solr.
Indexing metadata fields. The TET connector for Solr indexes all standard document
info fields. The key of each field will be used as the field name.
PDF file attachments. The TET connector for Solr recursively processes all PDF file attachments in a document, and feeds the text and metadata of each attachment to the
search engine for indexing. This way search hits will be generated even if the searched
text is not present in the main document but only in an attachment. Recursive attachment
traversal is especially important for PDF packages and portfolios.
XSLT stylesheet for converting TETML. The solr.xsl stylesheet expects TETML input in
any mode except glyph. It generates the XML required to supply input data to the search
server. Document info entries are supplied as fields which carry the name of the info
entry (plus the _s suffix to indicate a string value), and the main text is supplied in a
number of text fields. PDF attachments (including PDF packages and portfolios) in the
document will be processed recursively:
<?xml version="1.0" encoding="UTF-8"?><add>
<doc>
<field name="id">TET-datasheet.pdf</field>
<field name="Author_s">PDFlib GmbH</field>
<field name="CreationDate_s">2015-08-04T23:45:46+02:00</field>
<field name="Creator_s">Adobe InDesign CS6 (Windows)</field>
<field name="ModDate_s">2015-08-04T23:45:46+02:00</field>
<field name="Producer_s">Adobe PDF Library 10.0.1</field>
<field name="Subject_s">PDFlib TET: Text and Image Extraction Toolkit (TET)</field>
<field name="Title_s">PDFlib TET datasheet</field>
<field name="text">PDFlib</field>
<field name="text">datasheet</field>
<field name="text">PDFlib</field>
<field name="text">TET</field>
<field name="text">5</field>
...
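The structure of this output can also be reproduced outside of XSLT, which may be useful for testing a Solr setup with hand-made input. The following Python sketch is not the TET connector (the connector is the solr.xsl stylesheet). It only builds XML with the same layout as the sample above, using the standard xml.etree.ElementTree module; the function name solr_add_document is made up for this illustration:

```python
import xml.etree.ElementTree as ET


def solr_add_document(doc_id, info, words):
    """Build a Solr <add> document in the layout shown above.

    info maps document info keys (e.g. "Title") to string values;
    each key gets the "_s" suffix. words is the list of text items.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        ET.SubElement(doc, "field", name=name).text = value

    field("id", doc_id)
    for key in sorted(info):
        field(key + "_s", info[key])
    for word in words:
        field("text", word)
    return ET.tostring(add, encoding="unicode")


print(solr_add_document("TET-datasheet.pdf",
                        {"Title": "PDFlib TET datasheet",
                         "Author": "PDFlib GmbH"},
                        ["PDFlib", "TET"]))
```

ElementTree escapes special characters such as & and < in field values, which matters for document info entries that contain arbitrary text.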
4.4 TET Connector for Oracle
The TET connector for Oracle attaches TET to an Oracle database so that PDF documents
can be indexed and queried with Oracle Text. The PDF documents can be referenced via
their path name in the database, or directly stored in the database as BLOBs.
Note Protected documents can be indexed with the shrug option under certain conditions (see
Section 5.1, »Extracting Content from protected PDF«, page 57, for details). This is prepared in
the Connector files, but you must manually enable this option.
Requirements and installation. The TET connector has been tested with Oracle 10g and
Oracle 11g. In order to use the TET connector you must specify the AL32UTF8 database character set when creating the database. This is always the case for the Universal edition of
Oracle Express (but not for the Western European edition). AL32UTF8 is the database
character set recommended by Oracle, and also works best with TET for indexing PDF
documents. However, it is also possible to connect TET to Oracle Text with other character sets according to one of the following methods:
> Starting with Oracle Text 11.1.0.7 the database can perform the required character set
conversion. Please refer to the section »Using USER_FILTER with Charset and Format
Columns« in the Oracle Text 11.1.0.7 documentation, available at
docs.oracle.com/cd/B28359_01/text.111/b28304/cdatadic.htm#sthref497.
> With Oracle Text 11.1.0.6 or earlier the UTF-8 text generated by the TET filter script
must be converted to the database character set. This can be achieved by adding a
character set conversion command to tetfilter.sh:
Unix: call iconv (open-source software) or uconv (part of the free ICU Unicode library)
Windows: call a suitable code page converter in tetfilter.bat.
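What such a conversion step does can be shown with a few lines of Python. This is a sketch of the conversion itself, not a replacement for the filter script, which would call iconv, uconv or a code page converter instead. The target character set iso-8859-1 is only an example; the function name convert_utf8 is made up for this illustration:

```python
def convert_utf8(data, target="iso-8859-1"):
    """Re-encode UTF-8 bytes in the target character set.

    Characters which the target set cannot represent are replaced
    with "?" instead of aborting the conversion.
    """
    return data.decode("utf-8").encode(target, errors="replace")


sample = "Größe 10 €".encode("utf-8")
print(convert_utf8(sample))  # prints: b'Gr\xf6\xdfe 10 ?'
```

The example also shows the drawback of database character sets other than AL32UTF8: text which cannot be represented in the target character set (here the Euro sign) is lost for indexing.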
In order to take advantage of the TET Connector for Oracle you must make the TET filter
script available to Oracle as follows:
> Copy the TET filter script to a directory where Oracle can find it:
Unix: copy connectors/Oracle/tetfilter.sh to $ORACLE_HOME/ctx/bin
Windows: copy connectors/Oracle/tetfilter.bat to %ORACLE_HOME%\bin
> Make sure that the TETDIR variable in the TET filter script (tetfilter.sh or tetfilter.bat, respectively) points to the TET installation directory.
> If required you can supply more TET options for the global, document, or page level
in the TETOPT, DOCOPT, and PAGEOPT variables (see Chapter 10, »TET Library API Reference«, page 155, for option list details). This is especially useful for supplying the TET
license key, e.g.:
TETOPT="license=aaaaaaa-bbbbbb-cccccc-dddddd-eeeeee"
See Section 0.2, »Applying the TET License Key«, page 8, for more options for supplying the TET license key.
Granting privileges to the Oracle user. The examples below assume an Oracle user
with appropriate privileges to create and query an index. The following commands
grant appropriate privileges to the user HR (these commands must be issued as system
and must be adjusted as appropriate):
SQL> GRANT CTXAPP TO HR;
SQL> GRANT EXECUTE ON CTX_CLS TO HR;
SQL> GRANT EXECUTE ON CTX_DDL TO HR;
SQL> GRANT EXECUTE ON CTX_DOC TO HR;
SQL> GRANT EXECUTE ON CTX_OUTPUT TO HR;
SQL> GRANT EXECUTE ON CTX_QUERY TO HR;
SQL> GRANT EXECUTE ON CTX_REPORT TO HR;
SQL> GRANT EXECUTE ON CTX_THES TO HR;
Example A: Store path names of PDF documents in the database. This example stores
file name references to the indexed PDF documents in the database. Proceed as follows:
> Change to the following directory in a command prompt:
<TET installation directory>/connectors/Oracle
> Adjust the tetpath variable in the tetsetup_a.sql script so that it points to the directory
where TET is installed.
> Prepare the database: using Oracle’s sqlplus program create the table pdftable_a, fill
this table with path names of PDF documents, and create the index tetindex_a (note
that the contents of the tetsetup_a.sql script are slightly platform-dependent because
of different path syntax):
SQL> @tetsetup_a.sql
> Query the database using the index:
SQL> select * from pdftable_a where CONTAINS(pdffile, 'Whitepaper', 1) > 0;
> Update the index (required after adding more documents):
SQL> execute ctx_ddl.sync_index('tetindex_a')
> Optionally clean up the database (remove the index and table):
SQL> @tetcleanup_a.sql
Example B: Store PDF documents as BLOBs in the database and add metadata. This
example stores the actual PDF documents as BLOBs in the database. In addition to the
PDF data some metadata is extracted with the pCOS interface and stored in dedicated
database columns. The tet_pdf_loader Java program stores the PDF documents as BLOBs
in the database. In order to demonstrate metadata handling the program uses the pCOS
interface to extract the document title (via the pCOS path /Info/Title) and the number of
pages in the document (via the pCOS path length:pages). The document title and the
page count will be stored in separate columns in the database. Proceed as follows to run
this example:
> Change to the following directory in a command prompt:
<TET installation directory>/connectors/Oracle
> Prepare the database: using Oracle’s sqlplus program create the table pdftable_b and
the corresponding index tetindex_b:
SQL> @tetsetup_b.sql
> Populate the database: fill the table with PDF documents and metadata via JDBC
(note that this is not possible with stored procedures). The ant build file supplied
with the TET package expects the ojdbc14.jar file for the Oracle JDBC driver in the
same directory as the tet_pdf_loader.java source code. Specify a suitable JDBC connection string with the ant command. The build file contains a description of all properties that can be used to specify options for the Ant build. You can supply values for
these options on the command line. In the following example we use localhost as
host name, port number 1521, xe as database name, and HR as user name and password (adjust as appropriate for your database configuration):
ant -Dtet.jdbc.connection=jdbc:oracle:thin:@localhost:1521:xe
-Dtet.jdbc.user=HR -Dtet.jdbc.password=HR
> Update the index (required initially and after adding more documents):
SQL> execute ctx_ddl.sync_index('tetindex_b')
> Query the database using the index:
SQL> select * from pdftable_b where CONTAINS(pdffile, 'Whitepaper', 1) > 0;
> Optionally clean up the database (remove the index and table):
SQL> @tetcleanup_b.sql
4.5 TET PDF IFilter for Microsoft Products
This section discusses TET PDF IFilter, which is a separate product built on top of
PDFlib TET. More information and distribution packages for TET PDF IFilter are available
at www.pdflib.com/products/tet-pdf-ifilter.
TET PDF IFilter is freely available for non-commercial desktop use; commercial use
on desktop systems and deployment on servers requires a commercial license.
What is PDFlib TET PDF IFilter? TET PDF IFilter extracts text and metadata from PDF
documents and makes them available to search and retrieval software on Windows. This allows PDF documents to be searched on the local desktop, a corporate server, or the Web.
TET PDF IFilter is based on the patented PDFlib Text Extraction Toolkit (TET), which is an
established developer product for reliably extracting text from PDF documents.
TET PDF IFilter is a robust implementation of Microsoft’s IFilter indexing interface. It
works with all search and retrieval products which support the IFilter interface, e.g.
SharePoint and SQL Server. Such products use format-specific filter programs – called
IFilters – for particular file formats, e.g. HTML. TET PDF IFilter is such a program, aimed
at PDF documents. The user interface for searching the documents may be the Windows
Explorer, a Web or database frontend, a query script, or a custom application. As an alternative to interactive searches, queries can also be submitted programmatically without any user interface.
Unique Advantages. TET PDF IFilter offers the following advantages:
> Supports Western text, Chinese, Japanese, and Korean (CJK) text and right-to-left languages such as Arabic and Hebrew;
> Indexes protected documents and extracts text even from PDFs where Acrobat fails;
> Supports Unicode folding, decomposition, and normalization;
> Deployment: thread-safe, fast and robust, 32- and 64-bit versions;
> Automatic script and language detection for improved search.
Enterprise PDF Search. TET PDF IFilter is available in fully thread-safe native 32- and
64-bit versions. You can implement enterprise PDF search solutions with TET PDF IFilter
and the following products:
> Microsoft SharePoint Server and FAST server for SharePoint
> Microsoft Search Server
> Microsoft SQL Server
> Microsoft Exchange Server
> Microsoft Site Server
TET PDF IFilter can be used with all other Microsoft and third-party products which support the IFilter interface.
Desktop PDF Search. TET PDF IFilter can also be used to implement desktop PDF
search, e.g. with Windows Search which is integrated in Windows Vista/7/8/10.
TET PDF IFilter is free for non-commercial use on desktop operating systems, which provides a convenient basis for test and evaluation.
Accepted PDF Input. TET PDF IFilter supports all relevant flavors of PDF input:
> All PDF versions up to Acrobat DC, including ISO 32000-1 and ISO 32000-2
> Protected PDFs which do not require a password for opening the document
> Damaged PDF documents will be repaired
Unicode Postprocessing. TET PDF IFilter supports various Unicode postprocessing
steps which can be used to improve the search results:
> Foldings preserve, remove or replace characters, e.g. remove punctuation or characters from irrelevant scripts.
> Decompositions replace a character with an equivalent sequence of one or more other characters, e.g. replace a Chinese character with its canonically equivalent Unicode character.
> Text can be converted to all four Unicode normalization forms, e.g. emit NFC form to
match the requirements of a database.
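The four Unicode normalization forms can be illustrated independently of TET PDF IFilter. The following minimal sketch uses Python's standard unicodedata module rather than the IFilter itself; the sample characters are chosen for illustration only.

```python
import unicodedata

# "e" followed by a combining acute accent (a decomposed sequence)
decomposed = "e\u0301"
# The "fi" ligature, which only has a compatibility decomposition
ligature = "\ufb01"

nfc = unicodedata.normalize("NFC", decomposed)   # canonical composition
nfd = unicodedata.normalize("NFD", "\u00e9")     # canonical decomposition
nfkc = unicodedata.normalize("NFKC", ligature)   # compatibility mapping, then composition
nfkd = unicodedata.normalize("NFKD", ligature)   # compatibility decomposition

# Print the code points of each result
for name, value in [("NFC", nfc), ("NFD", nfd), ("NFKC", nfkc), ("NFKD", nfkd)]:
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

A database which stores precomposed characters would typically be fed NFC, while NFKC or NFKD additionally replace compatibility characters such as ligatures.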
Internationalization. In addition to Western text TET PDF IFilter fully supports Chinese, Japanese, and Korean (CJK) text. All CJK encodings are recognized; horizontal and
vertical writing modes are supported. Automatic detection of the locale ID (language
and region identifier) of the text improves the results of Microsoft’s word breaking and
stemming algorithms, which is especially important for East Asian text.
Right-to-left languages such as Hebrew and Arabic are also supported. Contextual
character forms are normalized and the text is delivered in logical order.
PDF is more than just a Bunch of Pages. TET PDF IFilter treats PDF documents as containers which may contain much more information than only plain pages. TET PDF IFilter indexes all relevant items in PDF documents:
> Page contents
> Text in bookmarks
> Metadata (see below)
> Embedded PDFs and PDF packages/portfolios are processed recursively so that the
text in all embedded PDF documents can be searched.
XMP Metadata and document info. The advanced metadata implementation in TET
PDF IFilter supports the Windows property system for metadata. It indexes XMP metadata as well as standard or custom document info entries. Metadata indexing can be
configured on several levels:
> Document info entries, Dublin Core fields and other common XMP properties are
mapped to equivalent Windows properties, e.g. Title, Subject, Author.
> TET PDF IFilter adds useful PDF-specific pseudo properties, e.g. page size, PDF/A conformance level, font names.
> All relevant predefined XMP properties can be searched.
> User-defined XMP properties can be searched, e.g. company-specific classification
properties, PDF/A extension schemas.
TET PDF IFilter optionally integrates metadata in the full text index. As a result, even
full text search engines without metadata support (e.g. SQL Server) can search for metadata.
4.6 TET Connector for the Apache TIKA Toolkit
TIKA is an open-source »toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser libraries«. For more information about TIKA see tika.apache.org. The TET connector for Tika replaces the default
PDF parser configured in Tika and hooks up TET as parser for the PDF format. The TET
connector supplies the following items to Tika:
> unformatted text contents of all pages
> predefined and custom document info fields
> number of pages in the document
Note Protected documents can be indexed with the shrug option under certain conditions (see Chapter 5.1, »Extracting Content from protected PDF«, page 57, for details). This is prepared in the
Connector files, but you must manually enable this option. TETPDFParser.java additionally provides a method for supplying a password in case the shrug option is not sufficient.
Requirements and installation. The TET distribution contains a TET connector for the
Tika toolkit. In the description below <tet-dir> stands for the directory where the TET
package was unpacked. The following requirements must be met:
> JDK 1.5 or later
> A working installation of the Ant build tool
> An installed TET distribution package for Unix, Linux, OS X, or Windows.
> A pre-built JAR file for Tika called tika-app-1.x.jar. Download information for this file
can be found at the following location:
tika.apache.org/download.html
In general Tika 1.8 or above can be used. However, Tika 1.9 has a bug which prevents
overriding the built-in PDF parser. The TET connector can therefore only be used
with Tika 1.9 if some tweaks are applied to the Tika source code, or by using a mechanism like the Tika XML configuration file.
Building and testing the TET connector for Tika. Proceed as follows to build and test
the TET connector for Tika:
> Copy tika-app-1.x.jar to the directory <tet-dir>/connectors/Tika.
> Change to <tet-dir>/connectors/Tika and build the TET connector for Tika:
ant
If your Tika jar file has a name different from tika-app-1.x.jar you must supply the
name of the jar file on the command line:
ant -Dtika-app.jar=tika-app-1.x.jar
> The build file includes a target for running a test with the TET connector for Tika:
ant test
This command should produce the contents of the test document as XHTML on the
standard output. To test with a PDF file of your choice provide the Ant property
test.inputfile on the command line as follows:
ant -Dtest.inputfile=/path/to/your/file.pdf test
The ability to supply a password for protected documents can be tested as follows:
ant -Dtest.inputfile=<protected file.pdf> -Dtest.outputfile=<output file name>
-Dtest.password=<password> api-test
> To verify that the TET connector for Tika is actually used for the MIME type application/pdf, execute the following command in the directory <tet-dir>/connectors/Tika on
Unix and OS X systems:
java -Djava.library.path=<tet-dir>/bind/java -classpath
<tet-dir>/bind/java/TET.jar:tika-app-1.x.jar:tet-tika.jar
org.apache.tika.cli.TikaCLI --list-parser-details
On Windows:
java -Djava.library.path=<tet-dir>/bind/java -classpath
<tet-dir>/bind/java/TET.jar;tika-app-1.x.jar;tet-tika.jar
org.apache.tika.cli.TikaCLI --list-parser-details
The following fragment should appear in the generated output:
com.pdflib.tet.tika.TETPDFParser
application/pdf
> For running the Tika GUI application with the TET connector, execute the following
command in the directory <tet-dir>/connectors/Tika:
On Unix and OS X systems:
java -Djava.library.path=<tet-dir>/bind/java -classpath
<tet-dir>/bind/java/TET.jar:tika-app-1.x.jar:tet-tika.jar
org.apache.tika.cli.TikaCLI
On Windows:
java -Djava.library.path=<tet-dir>\bind\java -classpath
<tet-dir>\bind\java\TET.jar;tika-app-1.x.jar;tet-tika.jar
org.apache.tika.cli.TikaCLI
Customizing the TET connector for Tika. You can customize the Tika connector as follows in the TETPDFParser.java source module:
> Add document options to the DOC_OPT_LIST variable, e.g. the shrug option for processing protected documents;
> Add page options to the PAGE_OPT_LIST variable;
> Customize the searchpath for resources such as CJK CMaps in the SEARCHPATH variable. Alternatively, the tet.searchpath property can be supplied when processing PDF
documents.
4.7 TET Connector for MediaWiki
MediaWiki is the free Wiki software which is used to run Wikipedia and many other
community Web sites. More details on MediaWiki can be found at
www.mediawiki.org/wiki/MediaWiki.
Note Protected documents can be indexed with the shrug option under certain conditions (see
Chapter 5.1, »Extracting Content from protected PDF«, page 57, for details). This is prepared in
the Connector files, but you must manually enable this option.
Requirements and installation. The TET distribution contains a TET connector which
can be used to index PDF documents that are uploaded to a MediaWiki site. MediaWiki
does not support PDF documents natively, but allows you to upload PDFs as »images«.
The TET connector for MediaWiki indexes all PDF documents as they are uploaded. PDF
documents which already exist in MediaWiki are not indexed. The following requirements must be met:
> MediaWiki 22 or above
> A TET distribution package with the TET binding for PHP on Unix, Linux, OS X, or
Windows.
In order to implement the TET connector for MediaWiki perform the following steps:
> Install the TET binding for PHP as described in Section 3.9, »PHP Binding«, page 35.
> Copy <TET install dir>/connectors/MediaWiki/PDFIndexer.php to
<MediaWiki install dir>/extensions/PDFIndexer/PDFIndexer.php.
> If you need support for CJK text, copy the CMap files in <TET install dir>/resource/cmap
to <MediaWiki install dir>/extensions/PDFIndexer/resource/cmap.
> Add the following lines to the MediaWiki configuration file LocalSettings.php:
# Index uploaded PDFs to make them searchable
include("extensions/PDFIndexer/PDFIndexer.php");
> In order to avoid warnings when uploading PDF documents it is recommended to
add the following lines to <MediaWiki install dir>/includes/DefaultSettings.php in order
to make .pdf a well-known file type extension:
/**
* This is the list of preferred extensions for uploading files. Uploading files
* with extensions not in this list will trigger a warning.
*/
$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'pdf' );
How the TET connector for MediaWiki works. The TET connector for MediaWiki consists of the PHP module PDFIndexer.php. Using one of MediaWiki’s predefined hooks it is
hooked up so that it will be called whenever a new PDF document is uploaded. It extracts text and metadata from the PDF document and appends them to the optional user-supplied comment which accompanies the uploaded document. The text is hidden in
an HTML comment so that it will not be visible to users when they view the document
comment. Since MediaWiki indexes the full contents of the comment (including the
hidden full text) the text contents of the PDF are also indexed. The text for the index is
constructed as follows:
> The TET connector feeds the value of all document info fields to the index.
> The text contents of all pages are extracted and concatenated.
Fig. 4.1 Searching PDF documents in MediaWiki
> If the size of the extracted text is below a limit, it is completely fed to the index. The
advantage of this method is that search results display the search term in context.
> If the size of the extracted text exceeds a limit, the text is reduced to unique words
(i.e. multiple instances of the same word are reduced to a single instance of the
word).
> If the size of the reduced text is below a limit, it is fed to the index. Otherwise it is
truncated, i.e. some text towards the end of the document is not indexed.
The predefined limit is 512 KB, but this can be changed in PDFIndexer.php. If one of the
size tests described above hits the limit, a warning message is written to MediaWiki’s
DebugLogFile if MediaWiki logging is activated.
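The size tests described above can be sketched as follows. This is a simplified model in Python, not the code in PDFIndexer.php; it assumes that the size is measured in UTF-8 bytes and that words are separated by whitespace. The function name build_index_text is hypothetical.

```python
def build_index_text(text, limit=512 * 1024):
    """Return the text to feed to the index, following the three size tests."""

    def size(s):
        # Assumption: the limit applies to the UTF-8 encoded size
        return len(s.encode("utf-8"))

    # 1. Small enough: index the full text so that hits can be shown in context.
    if size(text) < limit:
        return text

    # 2. Too large: keep only the first instance of each word.
    seen = set()
    unique_words = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            unique_words.append(word)
    reduced = " ".join(unique_words)
    if size(reduced) < limit:
        return reduced

    # 3. Still too large: truncate, which drops text towards the end.
    encoded = reduced.encode("utf-8")[: limit - 1]
    return encoded.decode("utf-8", errors="ignore")
```

In the second and third case the search results can no longer show the search term in its original context, since the word order of the document is lost.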
Searching for PDF documents. Since PDF documents are treated as images by MediaWiki you must search them in the Image namespace. This can be achieved by activating
the Image checkbox in the list of namespaces in the Advanced search dialog (see Figure
4.1). The Image namespace will not be searched by default. However, this setting can be
enabled in the LocalSettings.php preferences file as follows:
$wgNamespacesToBeSearchedDefault = array(
    NS_MAIN => true,
    NS_IMAGE => true,
);
The search results will display a list of documents which contain the search term. If the
full text has been indexed (as opposed to the abbreviated word list for long documents)
some additional terms will be displayed before and after the search term to provide
context. Since the PDF text contents are fed to the MediaWiki index in HTML form, line
numbers will be displayed in front of the text. These line numbers are not relevant for
PDF documents, and you can safely ignore them.
Indexing metadata fields. The TET connector for MediaWiki indexes all standard document info fields. The value of each field will be fed to the index so that it can be used in
searches. Since MediaWiki does not support metadata-based searches you cannot directly search for document info entries, but only for info entries as part of the full text.
5 Configuration
5.1 Extracting Content from protected PDF
PDF security features. PDF documents can be protected with password security which
offers the following protection features:
> The user password (also referred to as open password) is required to open the file for
viewing.
> The master password (also referred to as owner or permissions password) is required
to change any security settings, i.e. permissions, user or master password. Files with
user and master passwords can be opened for viewing by supplying either password.
> Permission settings restrict certain actions for the PDF document, such as printing
or extracting text.
> An attachment password can be specified to encrypt only file attachments, but not
the actual contents of the document itself.
If a PDF document uses any of these protection features it will be encrypted. In order to
display or modify a document’s security settings with Acrobat, click File, Properties...,
Security, Show Details... or Change Settings..., respectively.
TET honors PDF permission settings. The password and permission status can be
queried with the pCOS paths encrypt/master, encrypt/user, encrypt/nocopy, etc. as demonstrated in the dumper sample. pCOS also offers the pcosmode pseudo object which can
be used to determine which operations are allowed for a particular document.
Content extraction status. By default, text and image extraction is possible with TET if
the document can successfully be opened (this is no longer true if the requiredmode option of TET_open_document( ) was supplied). Depending on the nocopy permission setting, content extraction may or may not be allowed in restricted pCOS mode (content
extraction is always allowed in full pCOS mode). The following condition can be used to
check whether content extraction is allowed:
if ((int) tet.pcos_get_number(doc, "encrypt/nocopy") == 0)
{
/* content extraction allowed */
}
The need for processing protected documents. PDF permission settings help document authors to enforce their rights as creators of content, and users of PDF documents
must respect the rights of the document author when extracting text or image contents. By default, TET will operate in restricted mode and refuse to extract any contents
from such protected documents. However, content extraction does not in all cases automatically constitute a violation of the author’s rights. Situations where content extraction may be acceptable include the following:
> Small amounts of content are extracted for quoting (»fair use«).
> Organizations may want to check incoming or outgoing documents for certain keywords (document screening) without any further content repurposing.
> The document author himself may have lost the master password.
> Search engines index protected documents without making the document contents
available to the user directly (only indirectly by providing a link to the original PDF).
The last example is particularly important: even if users are not allowed to extract the
contents of a protected PDF, they should be able to locate the document in an enterprise
or Web-based search. It may be acceptable to extract the contents if the extracted text is
not directly made available to the user, but only used to feed the search engine’s index
so that the document can be found. Since the user only gets access to the original protected PDF (after the search engine indexed the contents and the hit list contained a link
to the PDF), the document’s internal permission settings will protect the document as
usual when accessed by the user.
The »shrug« feature for protected documents. TET offers a feature which can be used
to extract text and images from protected documents, assuming the TET user accepts
responsibility for respecting the document author’s rights. This feature is called shrug,
and works as follows: by supplying the shrug option to TET_open_document( ) the user asserts that he or she will not violate any document authors’ rights. PDFlib GmbH’s terms
and conditions require that TET customers respect PDF permission settings.
If all of the following conditions are true, the shrug feature will be enabled:
> The shrug option has been supplied to TET_open_document( ).
> The document requires a master password but it has not been supplied to TET_open_document( ).
> If the document requires a user (open) password, it must have been supplied to TET_open_document( ).
> Text extraction is not allowed in the document’s permission settings, i.e.
nocopy=true.
The shrug feature will have the following effects:
> Extracting content from the document is allowed despite nocopy=true. The user is responsible for respecting the document author’s rights.
> The pCOS pseudo object shrug will be set to true/1.
> pCOS runs in full mode (instead of restricted mode), i.e. the pcosmode pseudo object
will be set to 2.
The shrug pseudo object can be used according to the following idiom to determine
whether or not the contents can directly be made available to the user, or should only
be used for indexing and similar indirect purposes:
int doc = tet.open_document(filename, "shrug");
...
if ((int) tet.pcos_get_number(doc, "shrug") == 1)
{
/* only indexing allowed */
}
else
{
/* content may be delivered to the user */
}
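The conditions for enabling the shrug feature listed above can be modeled as a single boolean expression. The following Python sketch is a stand-alone model for illustration and does not call the TET API; the class OpenAttempt and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OpenAttempt:
    shrug_option: bool              # shrug was supplied to TET_open_document()
    needs_master_password: bool     # the document has a master password
    master_password_supplied: bool
    needs_user_password: bool       # the document has a user (open) password
    user_password_supplied: bool
    nocopy: bool                    # text extraction is not allowed by the permissions


def shrug_enabled(a):
    """True only if all four conditions from the list above are met."""
    return (
        a.shrug_option
        and a.needs_master_password and not a.master_password_supplied
        and (not a.needs_user_password or a.user_password_supplied)
        and a.nocopy
    )
```

If the master password has been supplied, for example, the function returns False: in this case extraction is allowed anyway and the shrug feature is not needed.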
5.2 Resource Configuration and File Searching
UPR files and resource categories. In some situations TET needs access to resources
such as encoding definitions or glyph name mapping tables. In order to make resource
handling platform-independent and customizable, a configuration file can be supplied
for describing the available resources along with the names of their corresponding disk
files. In addition to a static configuration file, dynamic configuration can be accomplished at runtime by adding resources with TET_set_option( ). For the configuration file
a simple text format called Unix PostScript Resource (UPR) is used. The UPR file format as
used by TET will be described below. TET supports the resource categories listed in Table
5.1.
Table 5.1 Resource categories (all file names must be specified in UTF-8)
category      format [1]   explanation
cmap          key=value    Resource name and file name of a CMap
codelist      key=value    Resource name and file name of a code list
encoding      key=value    Resource name and file name of an encoding
glyphlist     key=value    Resource name and file name of a glyph list
glyphmapping  option list  An option list describing a glyph mapping method according to Table 10.9, page 178. This resource will be evaluated in TET_open_document( ), and the result will be appended after the mappings specified in the option glyphmapping of TET_open_document( ).
hostfont      key=value    Name of a host font resource (key is the PDF font name; value is the UTF-8 encoded host font name) to be used for an unembedded font
fontoutline   key=value    Font and file name of a TrueType or OpenType font to be used for an unembedded font
searchpath    value        Relative or absolute path name of directories containing data files
[1] While the UPR syntax requires an equal character ’=’ between the name and value, this character is neither required nor allowed when specifying resources with TET_set_option( ).
The UPR file format. UPR files are text files with a very simple structure that can easily
be written in a text editor or generated automatically. To start with, let’s take a look at
some syntactical issues:
> Lines can have a maximum of 255 characters.
> A backslash ’\’ escapes newline characters. This may be used to extend lines.
> An isolated period character ’ . ’ serves as a section terminator.
> Comment lines may be introduced with a percent ’%’ character, and terminated by
the end of the line.
> Whitespace is ignored everywhere except in resource names and file names.
UPR files consist of the following components:
> A magic line for identifying the file. It has the following form:
PS-Resources-1.0
> A section listing all resource categories described in the file. Each line describes one
resource category. The list is terminated by a line with a single period character.
> A section for each of the resource categories listed at the beginning of the file. Each
section starts with a line showing the resource category, followed by an arbitrary
number of lines describing available resources. The list is terminated by a line with a
single period character. Each resource data line contains the name of the resource
(equal signs have to be quoted). If the resource requires a file name, this name has to
be added after an equal sign. The searchpath (see below) will be applied when TET
searches for files listed in resource entries.
Sample UPR file.
The following listing gives an example of a UPR configuration file:
PS-Resources-1.0
searchpath
glyphlist
codelist
encoding
.
searchpath
/usr/local/lib/cmaps
/users/kurt/myfonts
.
glyphlist
myglyphlist=/usr/lib/sample.gl
.
codelist
mycodelist=/usr/lib/sample.cl
.
encoding
myencoding=sample.enc
.
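To illustrate the structure of UPR files, the following Python sketch parses a UPR file into a dictionary of resource categories. It is a simplified reader written for this description, not the parser used by TET: it does not enforce the 255-character line limit and does not handle quoted equal signs in resource names. The function name parse_upr is hypothetical.

```python
SAMPLE_UPR = """PS-Resources-1.0
searchpath
glyphlist
codelist
encoding
.
searchpath
/usr/local/lib/cmaps
/users/kurt/myfonts
.
glyphlist
myglyphlist=/usr/lib/sample.gl
.
codelist
mycodelist=/usr/lib/sample.cl
.
encoding
myencoding=sample.enc
.
"""


def parse_upr(text):
    """Return {category: [(resource name, file name or None), ...]}."""
    # A backslash escapes the newline, which joins continuation lines.
    text = text.replace("\\\n", "")
    # Strip surrounding whitespace; drop empty lines and % comment lines.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("%")]

    if not lines or lines[0] != "PS-Resources-1.0":
        raise ValueError("missing magic line PS-Resources-1.0")
    remaining = iter(lines[1:])

    # First section: list of categories, terminated by a single period.
    resources = {}
    for ln in remaining:
        if ln == ".":
            break
        resources[ln] = []

    # One section per category: category name, entries, single period.
    current = None
    for ln in remaining:
        if current is None:
            if ln not in resources:
                raise ValueError("undeclared resource category: " + ln)
            current = ln
        elif ln == ".":
            current = None
        else:
            name, sep, filename = ln.partition("=")
            entry = (name.strip(), filename.strip() if sep else None)
            resources[current].append(entry)
    return resources


for category, entries in parse_upr(SAMPLE_UPR).items():
    print(category, entries)
```

Entries in the searchpath category consist of a value only, so the sketch stores None as their file name.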
File search and the searchpath resource category. In addition to relative or absolute
path names you can supply file names without any path specification to TET. The
searchpath resource category can be used to specify a list of path names for directories
containing the required data files. When TET must open a file it will first use the file
name exactly as supplied, and try to open the file. If this attempt fails, TET will try to
open the file in the directories specified in the searchpath resource category one after
another until it succeeds. Multiple searchpath entries can be accumulated, and will be
searched in reverse order (paths set at a later point in time will be searched before earlier
ones). In order to disable the search you can use a fully specified path name in the TET
functions.
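The search order described above can be modeled with a few lines of Python. This sketch only mirrors the documented order of attempts and is not TET's implementation; the function name resolve_file is hypothetical.

```python
import os


def resolve_file(name, searchpath):
    """Return the first existing candidate for name, or None.

    searchpath lists the directories in the order in which they were set.
    """
    # First attempt: use the file name exactly as supplied.
    if os.path.isfile(name):
        return name
    # Then try the searchpath entries, most recently set first.
    for directory in reversed(searchpath):
        # For an absolute name os.path.join() returns the name itself,
        # so a fully specified path name effectively disables the search.
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If the same file name exists in several directories, the directory which was added last wins.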
On Windows TET initializes the searchpath resource category with a value read from
the following registry keys:
HKLM\SOFTWARE\PDFlib\TET5\5.0\SearchPath
HKLM\SOFTWARE\PDFlib\TET5\SearchPath
HKLM\SOFTWARE\PDFlib\SearchPath
These registry entries may contain a list of path names separated by a semicolon ’;’
character. The Windows installer initializes the SearchPath registry entry with the name
of the resource directory in the TET installation directory.
Note Be careful when manually accessing the registry on 64-bit Windows systems: as usual, 64-bit
binaries work with the 64-bit view of the Windows registry, while 32-bit binaries running on a
64-bit system work with the 32-bit view of the registry. If you must add registry keys for a 32-bit
product manually, make sure to use the 32-bit version of the regedit tool. It can be invoked as
follows from the Start, Run... dialog:
%systemroot%\syswow64\regedit
Default file search paths. On Unix, Linux, OS X and i5/iSeries systems some directories will be searched for files by default even without specifying any path and directory
names. Before searching and reading the UPR file (which may contain additional search
paths), the following directories will be searched:
<rootpath>/PDFlib/TET/5.0/resource/cmap
<rootpath>/PDFlib/TET/5.0/resource/codelist
<rootpath>/PDFlib/TET/5.0/resource/glyphlst
<rootpath>/PDFlib/TET/5.0/resource/fonts
<rootpath>/PDFlib/TET/5.0/resource/icc
<rootpath>/PDFlib/TET/5.0
<rootpath>/PDFlib/TET
<rootpath>/PDFlib
On Unix, Linux, and OS X <rootpath> will first be replaced with /usr/local and then with
the HOME directory. On i5/iSeries <rootpath> is empty.
Default file names for license and resource files. By default, the following file names
will be searched for in the default search path directories:
licensekeys.txt (license file)
pdflib.upr (resource file)
This feature can be used to work with a license file without setting any environment
variable or runtime option.
Searching for the UPR resource file. If resource files are to be used you can specify
them via calls to TET_set_option( ) (see below) or in a UPR resource file. TET reads this file
automatically when the first resource is requested. The detailed process is as follows:
> If the environment variable TETRESOURCEFILE is defined TET takes its value as the
name of the UPR file to be read. If this file cannot be read an exception will be
thrown.
> If the environment variable TETRESOURCEFILE is not defined, TET tries to open a file
with the following name:
upr (on MVS; a dataset is expected)
tet.upr (Windows, Unix, and all other systems)
If this file cannot be read no exception will be thrown.
> On Windows TET will additionally try to read the following registry entry:
HKLM\SOFTWARE\PDFlib\TET5\5.0\resourcefile
The value of this key (which will be created with the value <installdir>/tet.upr by the
TET installer, but can also be set manually) serves as the name of the resource file to
be used. If this file cannot be read an exception will be thrown.
> The client can force TET to read a resource file at runtime by explicitly setting the
resourcefile option:
set_option("resourcefile=/path/to/tet.upr");
This call can be repeated arbitrarily often; the resource entries will be accumulated.
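The first two steps of the process above (environment variable and default file name) can be sketched in Python as follows. The sketch is a model for illustration only: it leaves out the Windows registry entry and the resourcefile option, and the function name find_upr_file is hypothetical.

```python
import os


def find_upr_file(environ=None, default_name="tet.upr"):
    """Return the name of the UPR file to read, or None if there is none."""
    environ = os.environ if environ is None else environ

    name = environ.get("TETRESOURCEFILE")
    if name is not None:
        # The variable is defined: an unreadable file is an error,
        # so the OSError raised by open() is passed on to the caller.
        with open(name, "rb"):
            pass
        return name

    # The variable is not defined: try the default name without raising an error.
    try:
        with open(default_name, "rb"):
            pass
    except OSError:
        return None
    return default_name
```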
Configuring resources at runtime. In addition to using a UPR file for the configuration,
it is also possible to directly configure individual resources at runtime via TET_set_option( ). This function takes a resource category name and pairs of corresponding resource names and values as they would appear in the respective section of this category in
a UPR resource file, for example:
set_option("glyphlist={myglyphnames=/usr/local/glyphnames.gl}");
Multiple resource names can be configured in a single option list for a resource category
option (but the same resource category option cannot be repeated in a single call to TET_set_option( )). Alternatively, multiple calls can be used to accumulate resource settings.
Escape sequences for text files. Escape sequences are supported in all text files except
UPR files and CMap files. Special character sequences can be used to include unprintable
characters in text files. All sequences start with a backslash ’\’ character:
> \x introduces a sequence of two hexadecimal digits (0-9, A-F, a-f), e.g. \x0D
> \nnn denotes a sequence of three octal digits (0-7), e.g. \015. The sequence \000 will be
ignored.
> The sequence \\ denotes a single backslash.
> A backslash at the end of a line will cancel the end-of-line character.
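The escape sequences listed above can be decoded with a single regular expression. The following Python sketch is an illustration, not TET's implementation; it assumes that hexadecimal and octal values denote character codes and that unrecognized sequences are left unchanged. The function name decode_escapes is hypothetical.

```python
import re

# \xHH, three octal digits, a double backslash, or a backslash before a line end
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[0-7]{3}|\\|\r?\n)")


def decode_escapes(text):
    def replace(match):
        body = match.group(1)
        if body[0] == "x":
            return chr(int(body[1:], 16))       # two hexadecimal digits
        if body == "\\":
            return "\\"                         # \\ denotes a single backslash
        if body[0] in "01234567":
            code = int(body, 8)                 # three octal digits
            return "" if code == 0 else chr(code)   # \000 is ignored
        return ""                               # backslash cancels the end of line

    return _ESCAPE.sub(replace, text)


print(repr(decode_escapes(r"one\x0Dtwo\015three")))
```

Since the text is scanned from left to right, a double backslash is consumed as a unit and the characters following it are not interpreted as a new escape sequence.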
5.3 Recommendations for common Scenarios
TET offers a variety of options which you can use to control various aspects of operation. In this section we provide some recommendations for typical TET application scenarios. Please refer to Chapter 10, »TET Library API Reference«, page 155, for details on
the functions and options mentioned below.
Optimizing performance. In some situations, particularly when indexing PDF for
search engines, text extraction speed is crucial and may play a more important role
than optimal output. The default settings of TET have been selected to achieve the best
possible output, but can be adjusted to speed up processing. Some tips for choosing options in TET_open_page( ) and TET_open_document( ) to maximize text extraction
throughput:
> docstyle=searchengine
This page option sets up several internal parameters to speed up operation by reducing the output quality in a way which does not affect the indexing process for search
engines.
> engines={image=false textcolor=false}
If image extraction and text color detection are not required, internal processing
steps can be disabled with this document option to speed up operation.
> contentanalysis={merge=0}
This page option disables the expensive strip and zone merging step, and reduces
processing times for typical files to ca. 60% compared to default settings. However,
documents where the contents are scattered across the pages in arbitrary order may
result in some text which is not extracted in logical order.
> contentanalysis={shadowdetect=false}
This page option disables detection of redundant shadow and fake bold text, which
can also reduce processing times.
> When creating TETML the following document option can be used to disable creation
of TETML elements for various interactive PDF features:
tetml={elements={annotations=false bookmarks=false destinations=false fields=false
javascripts=false}}
Words vs. line layout vs. reflowable text. Different applications prefer different kinds
of output (hyphenated words are always dehyphenated with these settings):
> Individual words (ignore layout): a search engine may not be interested in any layout-related aspects, but only the words comprising the text. In this situation use
granularity=word in TET_open_page( ) to retrieve one word per call to TET_get_text( ).
> Keep line layout: use granularity=page in TET_open_page( ) for extracting the full text
contents of a page in a single call to TET_get_text( ). Text lines are separated with a
linefeed character U+000A to retain the existing line structure.
> Reflowable text: in order to avoid line breaks and facilitate reflowing of the extracted
text use the document option lineseparator=U+0020 and the page option granularity=page. The full page contents can be fetched with a single call to TET_get_text( ).
By default, paragraphs are separated by U+000A. If you want to apply a different
paragraph separator use the document option paraseparator=U+2029 (or another
suitable Unicode value).
Writing a search engine or indexer. Indexers are usually not interested in the position
of text on the page (unless they provide search term highlighting). In many cases they
will tolerate errors which occur in Unicode mapping, and process whatever text contents they can get. Recommendations:
> Use granularity=word in TET_open_page( ).
> If the application knows how to process punctuation characters you can keep them
with the adjacent text by setting the following page option:
contentanalysis={punctuationbreaks=false}
Geometry. The geometry features may be useful for some applications:
> The TET_get_char_info( ) interface is only required if you need the position of text on
the page, the respective font name, text color or other details. If you are not interested in text coordinates calling TET_get_text( ) will be sufficient.
> If you have advance information about the layout of pages you can use the includebox and/or excludebox options in TET_open_page( ) to get rid of headers, footers, or
similar items which are not part of the main text.
Complex layouts. Some classes of documents use very elaborate page layouts. For example, with magazines and periodicals TET may not be able to properly determine the
relationship of columns on the page. In such situations it is possible to enhance the extracted text at the expense of processing time. Suitable options for this purpose are
summarized in Section 6.7, »Layout Analysis«, page 89. See Table 10.12, page 187, for
more details on relevant options.
Legal documents. When dealing with legal documents there is usually zero tolerance
for wrong Unicode mappings since they might alter the content or interpretation of a
document. In many cases the text position is not required, and the text must be extracted word by word. Recommendations:
> Use the granularity=word option in TET_open_page( ).
> Use the password option with the appropriate document password in
TET_open_document( ) if you must process documents which require a password for opening, or
the shrug option if content extraction is not allowed in the permission settings and
you are in a legal position to extract text from the document (see »The »shrug« feature for protected documents«, page 58).
> For absolute text fidelity: stop processing as soon as the unknown field in the character info structure returned by TET_get_char_info( ) is 1, or if the Unicode replacement
character U+FFFD is part of the string returned by TET_get_text( ). In TETML with one
of the text modes glyph or wordplus you can identify this situation by the following
attribute in the Glyph element:
unknown="true"
Do not set the unknownchar option to any common character since you may be unable to distinguish it from correctly mapped characters without checking the
unknown field.
> Also to ensure text fidelity you may want to disable text extraction for text which is
not visible on the page:
ignoreinvisibletext=true
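The check for the Unicode replacement character mentioned above can be sketched in a few lines of Java. The class and method names are hypothetical. The sketch assumes that the unknownchar option has been left at U+FFFD and that the string was obtained from TET_get_text( ) (the extraction itself is not shown):

```java
// Sketch: detect unmapped glyphs in extracted text via U+FFFD.
// Hypothetical helper; assumes unknownchar has not been changed from U+FFFD.
final class TextFidelity {

    static final char REPLACEMENT_CHAR = '\uFFFD';

    /** Returns true if the extracted text contains at least one replacement character. */
    static boolean hasUnmappedGlyphs(String extractedText) {
        return extractedText.indexOf(REPLACEMENT_CHAR) >= 0;
    }

    public static void main(String[] args) {
        String text = "Clause 7 \uFFFD liability";   // sample string, not real TET output
        if (hasUnmappedGlyphs(text)) {
            System.out.println("Unmapped glyph found; stop processing this document.");
        }
    }
}
```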
Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF documents on a per-page basis you can integrate TET for controlling the splitting or merging
process. For example, you could split a PDF document based on the contents of a page. If
you have control over the creation process you can insert separator pages with suitable
processing instructions in the text. The TET Cookbook contains examples for analyzing
documents with TET and then processing them with PDFlib+PDI.
Legacy PDF documents with missing Unicode values. In some situations PDF documents created by legacy applications must be processed where the PDF may not contain
enough information for proper Unicode mapping. Using the default settings TET may
be unable to extract some or all of the text contents. Recommendations:
> Start by extracting the text with default settings, and analyze the results. Identify
the fonts which do not provide enough information for proper Unicode mapping.
> Write custom encoding tables and glyph name lists to fix problematic fonts. Use the
PDFlib FontReporter plugin for analyzing the fonts and preparing Unicode mapping
tables.
> Configure the custom mapping tables and extract the text again, using a larger number of documents. If there are still unmappable glyphs adjust the mapping tables as
appropriate.
> If you have a large number of documents with unmappable glyphs PDFlib GmbH
may be able to assist you in creating the required mapping tables.
Convert PDF documents to another format. If you want to import the page contents of
PDF documents into your application, while retaining as much information as possible
you’ll need precise character metrics. Recommendations:
> Use TET_get_char_info( ) to retrieve precise character metrics and font names. Even if
you use the uv field to retrieve the Unicode values of individual characters, you must
also call TET_get_text( ) since it fills the char_info structure.
> Use granularity=glyph or word in TET_open_page( ), depending on what is better suited
for your application. Working with granularity=glyph may result in conflicts between
the visual layout of text and the processed logical text created by TET (e.g. the two
characters created by a ligature glyph may not fit into the same space as the ligature).
Corporate fonts with custom-encoded logos. In many cases corporate fonts containing custom logos include missing or wrong Unicode mapping information for the logos.
If you have a large number of PDF documents containing such fonts it is recommended
to create a custom mapping table with proper Unicode values.
Start by creating a font report (see »Analyzing PDF documents with the PDFlib FontReporter Plugin«, page 109) for a PDF containing the font, and locate mismapped glyphs
in the font report. Depending on the font type you can use any of the available configuration tables to provide the missing Unicode mappings. See »Code list resources for all
font types«, page 110, for a detailed example of a code list for a logotype font.
TeX documents. PDF documents produced with TeX often contain numerical glyph names, Type 3 fonts and other problematic properties which prevent other products from successfully extracting the text. TET contains many heuristics and
workarounds for dealing with such documents. However, a particular flavor of TeX documents
can only be processed with a workaround that requires more processing time,
and is disabled by default. You can enable more CPU-intensive font processing for these
documents with the following document option:
checkglyphlists=true
6 Text Extraction
6.1 PDF Document Domains
PDF documents may contain text in many other places than only the page contents.
While most applications deal with the page contents only, in many situations other
document domains may be relevant as well.
While the page contents can be retrieved with the workhorse functions
TET_get_text( ) and TET_get_image( ), the integrated pCOS interface plays a crucial role for retrieving text from other document domains.
In the remainder of this section we provide information on domain searching with the TET
library and TETML. In addition, we summarize how to search these document domains
with Acrobat X/XI/DC. This is important to locate search hits in Acrobat.
Text on the page. Page contents are the main source of text in PDF. Text on a page is
rendered with fonts and encoded using one of the many encoding techniques available
in PDF.
> How to display with Acrobat: page contents are always visible
> How to search a single PDF with Acrobat X/XI/DC: Edit, Find or Edit, Advanced Search.
TET may be able to process the text in documents where Acrobat does not correctly
map glyphs to Unicode values. In this situation you can use the TET Plugin which is
based on TET (see Section 4.1, »Free TET Plugin for Adobe Acrobat«, page 43). The TET
Plugin offers its own search dialog via Plug-Ins, PDFlib TET Plugin..., TET Find. However,
it is not intended as a full-blown search facility.
> How to search multiple PDFs with Acrobat X/XI/DC: Edit, Advanced Search and in
Show More Options under Look In: select All PDF Documents in, and browse to a folder
with PDF documents.
> Sample code for the TET library: extractor mini sample
> TETML element: /TET/Document/Pages/Page/Content
Predefined document info entries. Traditional document info entries are key/value
pairs.
> How to display with Acrobat X/XI/DC: File, Properties...
> How to search a single PDF with Acrobat X/XI/DC: not available
> How to search multiple PDFs with Acrobat X/XI/DC: click Edit, Advanced Search and
Show More Options near the bottom of the dialog. In the Look In: pull-down select a
folder of PDF documents and in the pull-down menu Use these additional criteria select one of Date Created, Date Modified, Author, Title, Subject, Keywords.
> Sample code for the TET library: dumper mini sample
> TETML element: /TET/Document/DocInfo
Custom document info entries. Custom document info entries can be defined in addition to the standard entries.
> How to display with Acrobat X/XI/DC: File, Properties..., Custom (not available in the
free Adobe Reader)
> How to search with Acrobat X/XI/DC: not available
Fig. 6.1 Acrobat’s advanced search dialog
> Sample code for the TET library: dumper mini sample
> TETML element: /TET/Document/DocInfo/Custom
XMP metadata on document level. XMP metadata consists of an XML stream containing extended metadata.
> How to display with Acrobat X/XI/DC: File, Properties..., Description, Additional Metadata... (not available in the free Adobe Reader)
> How to search a single PDF with Acrobat X/XI/DC: not available
> How to search multiple PDFs with Acrobat X/XI/DC: click Edit, Advanced Search and
Show More Options. In the Look In: pull-down select a folder of PDF documents and in
the pull-down menu Use these additional criteria select XMP Metadata (not available
in the free Adobe Reader).
> Sample code for the TET library: dumper mini sample
> TETML element: /TET/Document/Metadata
XMP metadata on image level. XMP metadata can be attached to document components, such as images, pages, fonts, etc. However, XMP is commonly only found on the
image level (in addition to document level).
> How to display with Acrobat X: Tools, Content, Edit Object, select image, right-click,
Show Metadata... (not available in the free Adobe Reader)
> How to display with Acrobat XI/DC: View, Show/Hide, Navigation Panes, Content. Locate the image in the tree structure, right-click on it and select Show Metadata... (not
available in the free Acrobat Reader)
> How to search with Acrobat X/XI/DC: not available
> Sample code for the TET library: pCOS Cookbook topic image_metadata
> TETML element: /TET/Document/Pages/Page/Resources/Images/Image/Metadata
Text in form fields. Form fields are displayed on top of the page. However, technically
they are not part of the page contents, but represented by separate data structures.
> How to display with Acrobat X/XI: Tools, Forms, Edit (not available in the free Adobe
Reader)
> How to display with Acrobat DC: Tools, Prepare Form (not available in the free Acrobat
Reader)
> How to search with Acrobat X/XI/DC: not available
> Sample code for the TET library: pCOS Cookbook topic fields
> TETML element: /TET/Document/Pages/Page/Fields/Field/Value
Text in comments (annotations). Similar to form fields, annotations (notes, comments, etc.) are layered on top of the page, but are represented by separate data structures. The interesting text contents of an annotation depend on its type. For example,
for Web links the interesting part may be the URL, while for other annotation types the
visible text contents may be relevant.
> How to display with Acrobat X/XI: Comment, Comments List
> How to display with Acrobat DC: Tools, Comment, Comments List
> How to search a single PDF with Acrobat X/XI/DC: Edit, Search and check the box
Include Comments, or use the Search Comments button on the Comments List toolbar
> How to search multiple PDFs with Acrobat X/XI/DC: click Edit, [Advanced] Search and
Show More Options. In the Look In: pull-down select a folder of PDF documents and in
the pull-down menu Use these additional criteria: select Comments.
> Sample code for the TET library: pCOS Cookbook topic annotations
> TETML element: /TET/Document/Pages/Page/Annotations/Annotation
Text in bookmarks. Bookmarks are not directly page-related, although they may contain an action which jumps to a particular page. Bookmarks can be nested to form a hierarchical structure.
> How to display with Acrobat X/XI/DC: View, Show/Hide, Navigation Panes, Bookmarks
> How to search a single PDF with Acrobat X/XI/DC: Edit, Advanced Search and check the
box Include Bookmarks
> How to search multiple PDFs with Acrobat X/XI/DC: click Edit, Advanced Search and
Show More Options. In the Look In: pull-down select a folder of PDF documents and in
the pull-down menu Use these additional criteria select Bookmarks (not available in the
free Adobe Reader)
> Sample code for the TET library: pCOS Cookbook topic bookmarks
> TETML element: /TET/Document/Bookmarks/Bookmark/Title
File attachments. PDF documents may contain file attachments (on document or page
level) which may themselves be PDF documents.
> How to display with Acrobat X/XI/DC: View, Show/Hide, Navigation Panes, Attachments
> How to search with Acrobat X/XI/DC: Use Edit, Advanced Search and check the box
Include Attachments (not available in the free Adobe Reader). Nested attachments are
not searched recursively.
> Sample code for the TET library: get_attachments mini sample
> TETML element: /TET/Document/Attachments/Attachment/Document
PDF packages and portfolios. PDF packages and PDF portfolios are file attachments
with additional properties.
> How to display with Acrobat X/XI/DC: Acrobat presents the cover sheet of the package/portfolio and the constituent PDF documents with dedicated user interface elements for PDF packages.
> How to search a single PDF package with Acrobat X/XI/DC: Edit, Search Entire Portfolio
> How to search multiple PDF packages with Acrobat X/XI/DC: not available
> Sample code for the TET library: get_attachments mini sample
> TETML element: /TET/Document/Attachments/Attachment/Document
PDF standards and other PDF properties. This domain does not explicitly contain text,
but is used as a container which collects various intrinsic properties of a PDF document,
e.g. PDF/X and PDF/A status, Tagged PDF status, etc.
> How to display with Acrobat X/XI/DC: View, Show/Hide, Navigation Panes, Standards (only present for standard-conforming PDFs)
> How to search with Acrobat X/XI/DC: not available
> Sample code for the TET library: dumper mini sample
> TETML elements and attributes: /TET/Document/@pdfa, /TET/Document/@pdfe,
/TET/Document/@pdfua, /TET/Document/@pdfvt, /TET/Document/@pdfx
Tagged PDF. TET reconstructs the layout structure and hierarchy directly from the
page contents without using the structure tree which is present in Tagged PDF documents. Page contents which are not required to understand the document but rather
are generated for layout purposes or as decoration may be marked as Artifacts in
Tagged PDF. The most common use of Artifacts is for running headers and footers including page numbers and chapter titles. Depending on the use case it may or may not
be desirable to process page contents which are marked as Artifacts:
> How to display with Acrobat XI/DC: View, Show/Hide, Navigation Panes, Tags; in the
Tags menu click Find... and select Artifacts. Text, images and vector graphics which are
marked as Artifact are highlighted.
Alternatively, you can activate Tools, Accessibility, Touch Up Reading Order. This tool
highlights the tagged contents on the page with shaded rectangles. Contents which
are not highlighted represent Artifacts.
> How to ignore Artifacts when searching with Acrobat X/XI/DC: not available
> How to ignore Artifacts with TET: provide the page option ignoreartifacts.
> TETML: Artifacts are not identified in TETML, but can be excluded with the page option ignoreartifacts.
Layers. Using layers (technically known as optional content) the page contents can be
made visible or invisible. Depending on the use case it may or may not be desirable to
process page contents on invisible layers.
> How to display with Acrobat XI/DC: View, Show/Hide, Navigation Panes, Layers: layers
which are currently visible have an eye symbol in front of the name. Clicking on this
symbol controls the visibility of a layer.
> How to search with Acrobat X/XI/DC: Acrobat searches the contents of all layers. If a
search result is found on an invisible layer, Acrobat offers to make the layer visible.
> How to process layers with TET: the page option layers can be used to restrict content
extraction to either visible or invisible layers. Alternatively, the contents of all layers
can be processed which only makes sense if the layers don’t overlap.
> TETML: layer contents are processed according to the page option layers. Layer names
as well as their visibility state and other properties are listed in the TETML element
/TET/Document/Pages/Graphics/Layers/Layer.
6.2 Page and Text Geometry
Default coordinate system. By default TET represents all page and text metrics in the
standard coordinate system of PDF. However, the origin of the coordinate system
(which could be located outside the page) is adjusted to the lower left corner of the visible page. More precisely, the origin is located in the lower left corner of the CropBox if it
is present, or the MediaBox otherwise. Page rotation is applied if the page has a Rotate
key. The coordinate system uses the DTP point as unit:
1 pt = 1 inch / 72 = 25.4 mm / 72 = 0.3528 mm
The first coordinate increases to the right, the second coordinate increases upwards.
By default, all coordinates expected or returned by TET are interpreted in this coordinate system, regardless of their representation in the underlying PDF document. See
the pCOS Path Reference to learn how to determine the size of a PDF page.
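Applications which work with metric units can convert coordinates with the relation 1 pt = 25.4 mm / 72 given above. The following Java sketch shows the conversion in both directions; the class and method names are hypothetical and not part of the TET API:

```java
// Sketch: conversion between DTP points (as used by TET) and millimeters.
// Hypothetical helper for illustration only.
final class PdfUnits {

    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts a length in points to millimeters. */
    static double pointsToMm(double points) {
        return points * MM_PER_INCH / POINTS_PER_INCH;
    }

    /** Converts a length in millimeters to points. */
    static double mmToPoints(double mm) {
        return mm * POINTS_PER_INCH / MM_PER_INCH;
    }

    public static void main(String[] args) {
        // Width of an A4 page (210 mm) expressed in points
        System.out.println("210 mm = " + mmToPoints(210.0) + " pt");
    }
}
```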
Top-down coordinate system. Unlike PDF’s bottom-up coordinate system some graphics environments use top-down coordinates which may be preferred by some developers. In order to facilitate the use of top-down coordinates TET supports an alternative
coordinate system in which all relevant coordinates are interpreted relative to the upper left corner of the page instead of the lower left corner, with y coordinates increasing
downwards. This topdown feature has been designed to make it quite natural for TET users to work in a top-down coordinate system. As an additional advantage, top-down coordinates are identical to the coordinate values displayed in Acrobat (see below). The
top-down coordinate system for a page can be activated with the page option
topdown={output}.
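If an application receives coordinates in the default system but needs top-down values (or vice versa), the relationship between both systems can be sketched as follows. This Java fragment only illustrates the geometry described above. The names are hypothetical, and pageHeight is assumed to be the height of the visible page in points, which must be determined separately:

```java
// Sketch: relationship between bottom-up and top-down y coordinates.
// Hypothetical helper; pageHeight is the height of the visible page in points.
final class TopDownCoordinates {

    /** Converts a y coordinate measured from the lower left corner to top-down. */
    static double toTopDownY(double y, double pageHeight) {
        return pageHeight - y;
    }

    /** Converts a top-down y coordinate back to the default bottom-up system. */
    static double toBottomUpY(double yTopDown, double pageHeight) {
        return pageHeight - yTopDown;
    }

    public static void main(String[] args) {
        double pageHeight = 842.0;  // example value: height of an A4 page in points
        System.out.println("y=100 bottom-up is y=" + toTopDownY(100.0, pageHeight) + " top-down");
    }
}
```

The x coordinates are identical in both systems since only the vertical axis is flipped.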
Visualizing coordinates in Acrobat. You can visualize page coordinates in Acrobat as
follows (see Figure 6.2):
> To display cursor coordinates in Acrobat X/XI/DC use View, Show/Hide, Cursor
Coordinates.
> The coordinates are displayed in the unit which is currently selected in Acrobat. To
change the display units to points (as used in TET) in Acrobat X/XI/DC proceed as follows: go to Edit, Preferences, Units & Guides, Units and select Points.
Note that the coordinates displayed refer to an origin in the top left corner of the page,
and not the default coordinate system of PDF and TET with an origin in the lower left
corner. See the previous section for details on selecting a top-down coordinate system
which aligns with Acrobat’s coordinate display.
Area of text extraction. By default, TET extracts all text from the visible page area. Using the clippingarea option of TET_open_page( ) (see Table 10.10, page 181) you can change
this to any of the PDF page box entries (e.g. TrimBox). With the keyword unlimited all text
regardless of any page boxes can be extracted. The default value cropbox instructs TET to
extract text within the area which is visible in Acrobat.
Fig. 6.2 Configuring coordinate display in Acrobat; use View, Show/Hide, Cursor Coordinates to display cursor coordinates.
The area of text extraction can be specified in more detail by providing an arbitrary
number of rectangular areas in the includebox and excludebox options of
TET_open_page( ). This is useful for extracting partial page content (e.g. selected columns), or for
excluding irrelevant parts (e.g. margins, headers and footers). The final clipping area is
constructed by determining the union of all rectangles specified in the includebox option, and subtracting the union of all rectangles specified in the excludebox option. A
glyph is considered inside the clipping area if its reference point is inside the clipping
area. This means that a character could be considered inside the clipping area even if
parts of it extend beyond the clipping area, or vice versa.
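The construction of the clipping area can be illustrated with a small Java sketch which tests whether a glyph's reference point is inside the union of the include rectangles minus the union of the exclude rectangles. This is not TET's implementation. All names are hypothetical, rectangle borders are treated as inside, and an empty include list is interpreted as "no restriction"; both are assumptions:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch: point-in-clipping-area test for includebox/excludebox style rectangles.
// Hypothetical helper for illustration only.
final class ClippingArea {

    /** Axis-parallel rectangle given by its lower left and upper right corners. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        boolean contains(double x, double y) {
            return x >= llx && x <= urx && y >= lly && y <= ury;  // borders count as inside (assumption)
        }
    }

    /** Tests the reference point (x, y) against union(include) minus union(exclude). */
    static boolean isInside(double x, double y, List<Box> include, List<Box> exclude) {
        boolean included = include.isEmpty();   // no include boxes: no restriction (assumption)
        for (Box box : include) {
            included = included || box.contains(x, y);
        }
        for (Box box : exclude) {
            if (box.contains(x, y)) {
                return false;
            }
        }
        return included;
    }

    public static void main(String[] args) {
        List<Box> include = Arrays.asList(new Box(0, 0, 300, 800));        // left column
        List<Box> exclude = Collections.singletonList(new Box(0, 750, 300, 800)); // header
        System.out.println(isInside(50, 100, include, exclude));
        System.out.println(isInside(50, 760, include, exclude));
    }
}
```

Since only the reference point is tested, a wide glyph which starts just inside an include rectangle is accepted even though parts of it extend beyond the rectangle.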
Glyph metrics. Using TET_get_char_info( ) you can retrieve font and metrics information for the characters which are returned for a particular glyph. The following values
are available for each character in the output (see Figure 6.3 and Table 10.16):
> The uv value contains the UTF-32 Unicode value of the current character, i.e. the character for which details are retrieved. This field always contains UTF-32, even in language bindings that can deal only with UTF-16 strings in their native Unicode
strings. Accessing the uv field allows applications to deal with characters outside the
BMP without having to interpret surrogate pairs. Since surrogate pairs are reported
as two separate characters, the uv field of the leading surrogate value contains the actual Unicode value (larger than U+FFFF). The uv field of the trailing surrogate value is
treated as an artificial character, and has a uv value of 0.
> The type field specifies how the character was created. There are two groups: real and
artificial characters. The group of real characters comprises normal characters (i.e.
the complete result of a single glyph) and characters which start a multi-character
sequence that corresponds to a single glyph (e.g. the first character of a ligature). The
group of artificial characters comprises the continuation of a multi-character sequence (e.g. the second character of a ligature) and inserted separator characters. For
artificial characters the position (x, y) specifies the endpoint of the most recent real
character, width and height are 0, and all other fields except uv are those of the most
recent real character. The endpoint is the point (x, y) plus the width added in direction alpha (in horizontal writing mode) or plus the height in direction -90˚ (in vertical
writing mode).
Fig. 6.3 Glyph metrics for horizontal and vertical writing mode in TET’s default coordinate system (topdown=false) [figure labels: (x, y), width, height, fontsize, baseline, alpha, beta]
> The unknown field is usually false (in C and C++: 0), but has a value of true (in C and
C++: 1) if the original glyph could not be mapped to Unicode and has therefore been
replaced with the character specified in the unknownchar option. Using this field you
can distinguish real document content from replaced characters if you specified a
common character as unknownchar, such as a question mark or space.
> The attributes field contains information about the subscript, superscript, dropcap,
or shadow status of the glyph as determined by TET’s content analysis algorithms.
> The (x, y) fields specify the position of the glyph’s reference point, which is the lower
left corner of the glyph rectangle in horizontal writing mode, and the top center in
vertical writing mode (see Section 6.4, »Chinese, Japanese, and Korean Text«, page 80
for details on vertical writing mode). For artificial characters, which do not correspond to any glyph on the page, the point (x, y) specifies the end point of the most recent real character. The value of y is subject to the topdown page option.
> The width field specifies the width of a glyph according to the corresponding font
metrics and text output parameters, such as character spacing and horizontal scaling. Since these parameters control the position of the next glyph, the distance between the reference points of two adjacent glyphs may be different from width. The
width may be zero for non-spacing characters. On the other hand, the outline may
actually be wider than the glyph’s width value, e.g. for slanted text.
The width is 0 for artificial characters.
> The height field in vertical writing mode specifies the height of the corresponding
glyph according to the font metrics and text parameters (e.g. character spacing). The
height is positive in the default coordinate system, but negative for topdown coordinates. In monospaced vertical fonts all glyphs have fontsize as height unless additional character spacing has been applied. Artificial characters (e.g. separators) have
a height of 0.
For horizontal writing mode an approximation of the glyph height is supplied. This
approximate value is derived from font properties and therefore identical for all
glyphs in a font. There is no guarantee that the visible glyph has the exact height value supplied here.
> The angle alpha provides the direction of inline text progression, specified as the deviation from the standard direction. The standard direction is 0˚ for horizontal writing mode, and -90˚ for vertical writing mode (see below for more details on vertical
writing mode). Therefore, the angle alpha is 0˚ for standard horizontal text as well as
for standard vertical text. The values of alpha and beta are subject to the topdown
page option.
> The angle beta specifies any skewing which has been applied to the text, e.g. for
slanted (italicized) text. The angle is measured against the perpendicular of alpha. It
is 0˚ for standard upright text (for both horizontal and vertical writing mode). If the
absolute value of beta is greater than 90˚ the text is mirrored at the baseline.
> The fontid field contains the pCOS ID of the font used for the glyph. It can be used to
retrieve detailed font information, such as the font name, embedding status, writing
mode (horizontal/vertical), etc. The pCOS Path Reference contains sample code for
retrieving font details.
> The fontsize field specifies the size of the text in points. It is normalized and therefore
always positive, even for topdown={output}.
> The colorid field contains an index for the text color. It represents the unique combination of fill color, stroke color, and text rendering. All occurrences of the same combination in a document are represented by the same color id. Different combinations are represented by different ids, which means that colors of multiple glyphs
can be checked for equality by comparing their color ids. For example, by comparing
the colorid values of successive glyphs you can identify changes in text color. The exact color space and color components for filling and/or stroking text can be retrieved
with TET_get_color_info( ) (see Section 6.3, »Text Color«, page 78).
> The textrendering field specifies the kind of rendering for a glyph, e.g. stroked, filled,
or invisible, and possible use of the text as clipping path. This field contains the numerical text rendering mode as defined in PDF (see Table 10.16, page 192). Invisible
text (i.e. textrendering=3) is extracted by default, but this can be changed with the
ignoreinvisibletext option of TET_open_page( ).
Text in Type 3 fonts: textrendering=3 and 7 result in invisible text; all other values of
textrendering are irrelevant and are ignored.
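The comparison of colorid values of successive glyphs described above can be sketched as follows. The Java fragment assumes that the colorid values of all glyphs have already been collected in an array (e.g. from successive calls to TET_get_char_info( ), which are not shown); the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: locate text color changes by comparing the color ids of successive glyphs.
// Hypothetical helper; colorIds is assumed to hold one colorid value per glyph.
final class ColorChanges {

    /** Returns the indices of all glyphs whose color id differs from the preceding glyph. */
    static List<Integer> colorChangeIndices(int[] colorIds) {
        List<Integer> changes = new ArrayList<Integer>();
        for (int i = 1; i < colorIds.length; i++) {
            if (colorIds[i] != colorIds[i - 1]) {
                changes.add(i);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        int[] colorIds = { 0, 0, 1, 1, 0 };   // sample data, not real TET output
        System.out.println("Color changes at glyph indices " + colorChangeIndices(colorIds));
    }
}
```

Only the ids are compared here. The actual color values for an id of interest would still have to be retrieved with TET_get_color_info( ).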
Font-specific metrics. TET uses the glyph and font metrics system used by PostScript
and PDF which shall be briefly discussed here.
The font size is usually chosen as the minimum distance between adjacent text lines
which is required to avoid overlapping character parts. The font size is generally larger
than individual characters in a font, since it spans ascender and descender, plus possibly additional space between lines.
The capheight is the height of capital letters such as T or H in most Latin fonts. The
xheight is the height of lowercase letters such as x in most Latin fonts. The ascender is the
height of lowercase letters such as t or d in most Latin fonts. The descender is the distance from the baseline to the bottom of lowercase letters such as j or p in most Latin
fonts. The descender is usually negative. The values of xheight, capheight, ascender, and
descender are measured in thousandths of the font size.
Fig. 6.4 Font-specific metrics (ascender, capheight, font size, baseline, descender)
These values vary among fonts, and can be retrieved with the pCOS interface. For example, the following code retrieves the ascender and descender values:
/* Query ascender and descender values */
path = "fonts[" + i + "]/ascender";
System.out.println("Ascender=" + p.pcos_get_number(doc, path));
path = "fonts[" + i + "]/descender";
System.out.println("Descender=" + p.pcos_get_number(doc, path));
Note that ascender and other font metrics values should only be queried after calling
TET_get_char_info( ) for a glyph with this font. In other words, using font ids returned by
TET_get_char_info( ) is safe, while enumerating all fonts in the fonts[] array does not necessarily provide metrics values from embedded font data, but possibly inaccurate
values from the PDF FontDescriptor dictionary. For more information refer to the pCOS
Path Reference.
End points of glyphs and words. In order to do proper highlighting you need the end
position of the last character in a word. Using x, y, width, and alpha returned by
TET_get_char_info( ) you can determine the end point of a glyph in horizontal writing mode, i.e.
the end point of the glyph’s advance vector (the lower right corner of the glyph box):
xend = lrx = x + width * cos(alpha)
yend = lry = y + width * sin(alpha)
In the common case of horizontally oriented text (i.e. alpha=0) this reduces to
xend = lrx = x + width
yend = lry = y
More generally, you can calculate the size of the glyph box by determining the coordinates of the upper right corner (for beta=0, i.e. this formula does not take into account
glyph skewing):
urx = x + width * cos(alpha) - dir * height * sin(alpha)
ury = y + width * sin(alpha) + dir * height * cos(alpha)
with dir=1 in the default case topdown={output=false} and dir=-1 if topdown={output=true}
(see »Top-down coordinate system«, page 72). The value of height depends on the
fontsize and the font geometry. The following results in useful values for most common fonts (see »Font-specific metrics«, page 75, for retrieving the ascender value):
height = fontsize * ascender / 1000
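The end point and upper right corner formulas above can be collected in a small helper. The following Java sketch is not part of the TET API: the class name GlyphBox and its method names are hypothetical. It assumes that alpha is supplied in degrees and therefore converts it to radians, that beta=0 (no glyph skewing), and that the ascender value has already been retrieved with pCOS.

```java
class GlyphBox {
    /** End point of the advance vector (lower right corner), horizontal writing mode. */
    public static double[] endPoint(double x, double y, double width, double alphaDeg) {
        double a = Math.toRadians(alphaDeg);   // assumption: alpha is given in degrees
        return new double[] { x + width * Math.cos(a), y + width * Math.sin(a) };
    }

    /** Upper right corner for beta=0; dir is 1 by default and -1 for topdown output. */
    public static double[] upperRight(double x, double y, double width, double alphaDeg,
            double fontsize, double ascender, int dir) {
        double a = Math.toRadians(alphaDeg);
        double height = fontsize * ascender / 1000;   // ascender relative to a font size of 1000
        return new double[] {
            x + width * Math.cos(a) - dir * height * Math.sin(a),
            y + width * Math.sin(a) + dir * height * Math.cos(a)
        };
    }

    public static void main(String[] args) {
        // Sample values only; real values come from TET_get_char_info( ) and pCOS
        double[] lr = endPoint(100, 200, 6, 0);
        double[] ur = upperRight(100, 200, 6, 0, 12, 750, 1);
        System.out.println("lrx=" + lr[0] + " lry=" + lr[1]);
        System.out.println("urx=" + ur[0] + " ury=" + ur[1]);
    }
}
```

For horizontally oriented text (alpha=0) the helper reduces to the simplified formulas: the end point is (x + width, y) and the upper right corner is (x + width, y + dir * height).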
In many graphical development environments the glyph transformations can be expressed as follows:
translate(x, y);
rotate(alpha);
skew(0, -beta);
if (abs(beta) > 90)
scale(1, -1);
After applying these transformations the upper right corner of the glyph box can be expressed as follows:
urx = x + width
ury = y + dir * height
Glyph calculations for vertical writing mode. For text with vertical writing mode the
end point calculation works as follows:
xend = x
yend = y - height
The upper left and lower right corners of the glyph box can be calculated as follows (for
beta=0):
ulx = x - width/2 * cos(alpha)
uly = y - width/2 * sin(alpha)
lrx = ulx + width * cos(alpha) + dir * height * sin(alpha)
lry = uly + width * sin(alpha) - dir * height * cos(alpha)
with dir=1 in the default case topdown={output=false} and dir=-1 if topdown={output=true}
(see »Top-down coordinate system«, page 72).
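The vertical writing mode formulas can be sketched in the same way. The class VerticalGlyphBox below is a hypothetical helper, not a TET function. It assumes alpha in degrees, beta=0, and a glyph height which the caller has already determined.

```java
class VerticalGlyphBox {
    /** End point for standard vertical text (alpha=0): straight below the reference point. */
    public static double[] endPoint(double x, double y, double height) {
        return new double[] { x, y - height };
    }

    /** Returns { ulx, uly, lrx, lry } for beta=0; dir is 1 by default, -1 for topdown output. */
    public static double[] corners(double x, double y, double width, double height,
            double alphaDeg, int dir) {
        double a = Math.toRadians(alphaDeg);   // assumption: alpha is given in degrees
        double ulx = x - width / 2 * Math.cos(a);
        double uly = y - width / 2 * Math.sin(a);
        double lrx = ulx + width * Math.cos(a) + dir * height * Math.sin(a);
        double lry = uly + width * Math.sin(a) - dir * height * Math.cos(a);
        return new double[] { ulx, uly, lrx, lry };
    }

    public static void main(String[] args) {
        // Sample values only: reference point at the top center of a 10 x 10 glyph box
        double[] c = corners(100, 500, 10, 10, 0, 1);
        System.out.println("ul=(" + c[0] + ", " + c[1] + ") lr=(" + c[2] + ", " + c[3] + ")");
    }
}
```

Since the reference point is at the top center of the glyph box, the upper left corner lies half a glyph width to the left of it.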
6.3 Text Color
The text color id returned by TET_get_char_info( ) can be used to retrieve the fill and/or
stroke color of the glyph corresponding to an output character. This can be achieved
with TET_get_color_info( ) which returns the following values for a color id. These values
can be retrieved separately for the fill and stroke color of a glyph:
> The colorspaceid field contains the index of the color space in the colorspaces[ ] pseudo
object (see the pCOS Path Reference), or -1 if no color is applied to the glyph.
> The patternid field contains the index of the pattern in the patterns[ ] pseudo object
(see the pCOS Path Reference), or -1 if no pattern is applied to the glyph.
> The components array contains the color values which must be interpreted in the color space reported with colorspaceid.
> The n field (only available in the C and C++ language bindings) contains the number
of relevant entries in the components field.
The glyphinfo mini sample demonstrates how to interpret the color values provided by
TET_get_color_info( ) and how to augment this information with general color space attributes retrieved with pCOS. The colorspaces and page_colors topics in the pCOS Cookbook demonstrate how to retrieve even more color space details, such as WhitePoint for
calibrated color spaces or the alternate color space of a Separation or DeviceN color space.
Text stroking, i.e. painting the outline of glyphs (as opposed to filling the interior) is
rarely used in PDF documents. Most applications may ignore the stroke color information. Also, patterns are rarely used for text.
Text color retrieval can be disabled with the following document option:
engines={notextcolor}
If the text color engine is disabled, the colorid field of TET_char_info must not be used
since it doesn’t contain any meaningful value.
Table 6.1 provides an overview of PDF color spaces. Unless noted otherwise, color values are in the range 0..1.
Table 6.1 Color spaces in PDF

color space | number of color components | notes

Device color spaces
DeviceGray, DeviceRGB, DeviceCMYK | 1, 3, 4 (respectively) | The device color spaces are widely known, but are device-dependent and therefore don’t represent reliable color information.

CIE-based (device-independent) color spaces
ICCBased | 1, 3 or 4 | ICCBased color spaces are defined by an ICC profile for grayscale, RGB or CMYK color.
Lab | 3 | Lab color spaces are defined by a CIE 1976 L*a*b* space. They require a lightness value in the range 0...100 and two color values which are often in the range -128...127.
CalGray, CalRGB | 1, 3 (respectively) | Calibrated color spaces define a WhitePoint and optional BlackPoint. They are rarely used today since ICCBased color spaces are more flexible.

Special color spaces
Pattern | 0 (PaintType=1), N (PaintType=2), 0 (PatternType=2) | Pattern color spaces are used to apply some graphical pattern instead of a solid color. Tiling patterns (PatternType=1) colorize by repeatedly placing some graphical shape, where the shape may be colored with intrinsic colors (PaintType=1), or may be uncolored like a stencil mask and require external color (PaintType=2). Shading patterns (PatternType=2) apply a color gradient instead of solid color.
Separation | 1 | A Separation color space describes a named spot color and requires an alternate color space which is needed if the named spot color is not directly available for output.
DeviceN | N | DeviceN is a generalization of Separation color space for more than one named spot color. It is also used to apply a subset of CMYK process colors.
Indexed | 1, but N in the base color space | Indexed color spaces allow for efficient storage of a small number of different color values (up to 256) and require an underlying base color space.
6.4 Chinese, Japanese, and Korean Text
6.4.1 CJK Encodings and CMaps
TET supports Chinese, Japanese, and Korean (CJK) text, and converts horizontal and vertical CJK text in arbitrary legacy encodings (CMaps) to Unicode. TET supports all of Adobe’s CJK character collections:
> Simplified Chinese: Adobe-GB1-5
> Traditional Chinese: Adobe-CNS1-6
> Japanese: Adobe-Japan1-6
> Korean: Adobe-Korea1-2
The PDF CMaps in turn cover all of the CJK character encodings which are in use today,
such as Shift-JIS, EUC, Big-5, KSC, and many others. CJK font names encoded with locale-specific encodings (e.g. Japanese font names encoded in Shift-JIS) are normalized to Unicode.
Note In order to extract CJK text which is encoded with legacy encodings you must configure access
to the CMap files which are shipped with TET according to Section 0.1, »Installing the Software«, page 7.
6.4.2 Word Boundaries for CJK Text
Ideographic characters don’t constitute a word boundary, but punctuation and the
transition between ideographic and non-ideographic characters still constitute word
boundaries. For granularity=word ideographic comma U+3001 and ideographic full stop
U+3002 also constitute word boundaries. For granularity=page no line separator is inserted at the end of a line.
Note This default behavior changed with TET 5. TET 4 treated ideographic characters as word boundaries by default.
6.4.3 Vertical Writing Mode
TET supports both horizontal and vertical writing modes, and performs all metrics calculations as appropriate for the respective writing mode. Keep the following in mind
when dealing with text in vertical writing mode:
> The glyph reference point in vertical writing mode is at the top center of the glyph
box. The text position advances downwards as determined by the glyph height, regardless of the glyph width (see Figure 6.3).
> The angle alpha is 0° for standard vertical text. In other words, fonts with vertical
writing mode and alpha=0° progress downwards, i.e. in direction -90°.
> Because of the differences noted above, client code must take the writing mode into
account by using the following pCOS code (note that not all text which appears vertically actually uses a font with vertical writing mode):
/* Check whether any font in the document uses vertical writing mode */
boolean vertical = false;
int count = (int) p.pcos_get_number(doc, "length:fonts");
for (int i = 0; i < count; i++)
{
    if (p.pcos_get_number(doc, "fonts[" + i + "]/vertical") != 0)
    {
        /* font uses vertical writing mode */
        vertical = true;
    }
}
> Prerotated glyphs for vertical text and punctuation are mapped to the corresponding unrotated Unicode characters. Use the following document option to preserve
prerotated characters:
decompose={vertical=_none}
6.4.4 CJK Decompositions: Narrow, wide, vertical, etc.
Unicode and many legacy encodings support the notion of fullwidth and halfwidth
characters (sometimes also called double-byte and single-byte characters). By default,
TET applies the Unicode decompositions wide and narrow which replace fullwidth and
halfwidth characters with the corresponding standard-width counterparts.
In order to preserve the original fullwidth and halfwidth characters you can use the
decompose document option and disable the respective decompositions:
decompose={wide=_none narrow=_none}
Similarly, the small, square, and vertical decompositions also affect CJK characters. Since
all these decompositions (including wide and narrow) are enabled by default, the characters are converted to their normal counterparts. Disable the respective decompositions in order to preserve the original characters. The following document option disables all decompositions:
decompose={none}
Table 6.2 demonstrates the CJK decompositions along with examples. See Section 7.3.2,
»Unicode Decomposition«, page 101, for more information on decompositions.
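The effect of the CJK compatibility decompositions can be tried out without TET, since the JDK class java.text.Normalizer applies the Unicode compatibility mappings. The following Java sketch only illustrates these mappings and is not TET's implementation: NFKC applies all compatibility decompositions at once (followed by canonical composition), while TET lets you control wide, narrow, small, square, and vertical individually. The class name CjkFolding is hypothetical.

```java
import java.text.Normalizer;

class CjkFolding {
    /** Replaces compatibility characters with their standard counterparts via NFKC. */
    public static String toStandardForms(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Formats a string as a list of code points, e.g. "U+30AD U+30ED". */
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // One sample per decomposition: narrow, small, square, vertical, wide
        String[] samples = { "\uFF66", "\uFE50", "\u3314", "\uFE37", "\uFFE1" };
        for (String s : samples) {
            System.out.println(codePoints(s) + " -> " + codePoints(toStandardForms(s)));
        }
    }
}
```

The sample characters were chosen so that each one falls into one of the five decomposition classes discussed above.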
Table 6.2 CJK compatibility decomposition examples (suboptions for the decompose option)

decomposition name | description | affected Unicode characters | decompositions disabled | decompositions enabled (default)

narrow | Narrow (hankaku) compatibility forms | U+FF61-U+FFDC, U+FFE8-U+FFEE | U+FF66 | U+30F2
small | Small forms for CNS 11643 compatibility | U+FE50-U+FE6B | U+FE50 | U+002C
square | CJK squared font variants | U+3250, U+32CC-U+32CF, U+3300-U+3357, U+3371-U+33DF, U+337B-U+337F, U+33FF, U+1F131-U+1F14E, U+1F190, U+1F200, U+1F210-U+1F231 | U+3314 | U+30AD U+30ED
vertical | Vertical layout presentation forms | U+309F, U+30FF, U+FE10-U+FE19, U+FE30-U+FE48 | U+FE37 | U+007B
wide | Wide (zenkaku) compatibility forms | U+3000, U+FF01-U+FF60, U+FFE0-U+FFE6 | U+FFE1 | U+00A3
6.5 Bidirectional Arabic and Hebrew Text
TET applies additional processing to correctly extract text from documents with right-to-left scripts such as Arabic and Hebrew. Since these scripts often contain left-to-right
text inserts (e.g. numbers), such documents are called bidirectional. Extracting bidirectional text involves one or more of the processing steps mentioned below.
6.5.1 General Bidi Topics
Reorder right-to-left and bidirectional text. Right-to-left sequences and left-to-right
sequences must be reordered to form the correct sequence of logical text. In granularity
word or higher TET delivers text in logical order with the following page option (which
is the default setting):
contentanalysis={bidi=logical}
Bidi processing can explicitly be disabled with the following page option:
contentanalysis={bidi=visual}
Determine the dominant text direction of the page. Not only the characters within a
word and words within a line are affected by Bidi reordering, but also other aspects of
page layout recognition. In some cases mixed Bidi lines cannot safely be reordered
without taking into account the fact that the page is an overall right-to-left or left-to-right page. In order to make this decision automatically TET checks the dominant text
direction of the page and adjusts its algorithms depending on whether the page must
be considered mostly left-to-right or mostly right-to-left.
This decision can be overridden with the bidilevel option. For example, the following
option list forces right-to-left handling even on pages where the majority of text runs
left-to-right:
contentanalysis={bidilevel=rtl}
Glyph ordering. The glyph information returned by TET_get_char_info( ) and the Glyph
elements in TETML are always ordered according to visual order, i.e. from left to right
for horizontal baselines. This left-to-right glyph ordering ensures that client applications receive glyph coordinates in deterministic ordering without having to check the
Bidi status of the text. This behavior reflects the fact that the glyphs in Arabic and Hebrew fonts generally have the reference point at the left edge and advance to the right,
despite the fact that the actual text direction is right-to-left.
6.5.2 Postprocessing Arabic Text
Normalize Arabic presentation forms and decompose ligatures. Arabic characters exist in up to four different forms for isolated use, at the beginning, in the middle, or at
the end of a word. These forms can have different Unicode values although semantically they represent the same character. By default, TET converts all presentation forms to
the corresponding canonical forms. As shown in Table 6.3 the decompose option can be
used to preserve presentation forms (see Section 7.3.2, »Unicode Decomposition«, page
101).
Since the PDF document may map presentation forms either to the isolated Unicode
character or one of the presentation forms (e.g. in the document’s ToUnicode CMap),
TET cannot guarantee that the output contains presentation forms even when decompositions are disabled.
Table 6.3 Processing Arabic presentation forms with the decompose option

description and option list | before decomposition | after decomposition (in logical order)

Decompose final, initial, isolated, and medial presentation forms: no decompose option (default) or decompose={final=_all medial=_all initial=_all isolated=_all}. Note that ligatures are only decomposed if they are actually represented by a ligature glyph. If multiple separate glyphs are used these are retained in the output.
U+FEB2 | U+0633
U+FEB3 | U+0633
U+FD0E | U+0633 U+0631
U+FEB4 | U+0633
U+FEFC | U+0644 U+0627

Preserve final, initial, isolated, and medial presentation forms: decompose={final=_none medial=_none initial=_none isolated=_none} or decompose=none
U+FEB2 | U+FEB2
U+FEB3 | U+FEB3
U+FD0E | U+FD0E
U+FEB4 | U+FEB4
U+FEFC | U+FEFC
Remove Arabic Tatweel character. The Tatweel character U+0640 (also called kashida)
is often used in Arabic text to stretch words so that they completely fill the line. Since
the Tatweel doesn’t carry any text information itself it is usually not required in the extracted text. By default, TET removes Tatweel characters from the extracted text. As
shown in Table 6.4 the fold option can be used to preserve Tatweel characters (see Section 7.3.1, »Unicode Folding«, page 98).
Table 6.4 Processing the Tatweel character U+0640 with the fold option

description and option list | before folding | after folding

Remove Arabic Tatweel characters: no fold option (default) or fold={{[U+0640] remove}} or fold={default} | U+0640 | n/a
Preserve Arabic Tatweel characters (which are removed by default): fold={{[U+0640] preserve}} | U+0640 | U+0640
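Both Arabic postprocessing steps can be mimicked in plain Java to see what the default TET output looks like. The sketch below is an approximation under stated assumptions, not TET's implementation: it uses the JDK's NFKC normalization as a stand-in for the final, initial, isolated, and medial decompositions, and a simple string replacement as a stand-in for the Tatweel folding. The class name ArabicCleanup is hypothetical.

```java
import java.text.Normalizer;

class ArabicCleanup {
    /** Maps presentation forms and ligatures to their canonical characters (NFKC). */
    public static String decomposePresentationForms(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Removes the Tatweel (kashida) character U+0640. */
    public static String removeTatweel(String s) {
        return s.replace("\u0640", "");
    }

    public static void main(String[] args) {
        // Lam-alef ligature in final form, and a word stretched with two Tatweel characters
        String ligature = decomposePresentationForms("\uFEFC");
        String stretched = removeTatweel("\u0633\u0640\u0640\u0631");
        System.out.println(ligature.length() + " characters after decomposition");
        System.out.println(stretched.length() + " characters after Tatweel removal");
    }
}
```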
6.6 Content Analysis
PDF documents provide the semantics (Unicode mapping) of individual text characters
as well as their position on the page. However, they usually do not convey information
about words, lines, columns or other high-level text units. The fragments comprising
text on a page may contain individual characters, syllables, words, lines, or an arbitrary
mixture thereof, without any explicit marks designating the start or end of a word, line,
or column.
To make matters worse, the ordering of text fragments on the page may be different
from the logical (reading) order. There are no rules for the order in which portions of
text are placed on the page. For example, a page containing two columns of text could
be produced by creating the first line in the left column, followed by the first line of the
right column, the second line of the left column, the second line of the right column etc.
However, logical order requires all text in the left column to be processed before the
text in the right column is processed. Extracting text from such documents by simply
replaying the instructions on the PDF page generally provides undesirable results since
the logical structure of the text is lost.
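The two-column example can be made concrete with a few lines of Java. This is a deliberately naive sketch and not TET's algorithm: it assumes that the column boundary is already known (the parameter columnBoundary is hypothetical), while detecting such zones automatically is the hard part of content analysis. The class and field names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {
    /** A text fragment with its position; larger y values are higher on the page. */
    static final class Fragment {
        final double x, y;
        final String text;
        Fragment(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /** Sorts fragments so that the left column precedes the right one, each top to bottom. */
    public static List<String> sort(List<Fragment> fragments, double columnBoundary) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparing((Fragment f) -> f.x >= columnBoundary)  // false (left) sorts first
                .thenComparingDouble(f -> -f.y));                  // then top to bottom
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Page order as described above: lines of both columns are interleaved
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(50, 700, "left line 1"));
        page.add(new Fragment(300, 700, "right line 1"));
        page.add(new Fragment(50, 680, "left line 2"));
        page.add(new Fragment(300, 680, "right line 2"));
        System.out.println(sort(page, 200));
    }
}
```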
TET’s content analysis engine analyzes the contents, position, and relationship of
text fragments in order to achieve the following goals:
> create words from characters, and insert separator characters between words if desired;
> remove redundant text, such as duplicates which are only present to create a shadow
effect;
> recombine the parts of hyphenated words which span more than one line;
> identify text columns (zones);
> sort text fragments within a zone, as well as zones within a page.
These operations are discussed in more detail below, as well as options which provide
some control over content processing.
Text granularity. The granularity option of TET_open_page( ) specifies the amount of
text which is returned by a single call to TET_get_text( ):
> With granularity=glyph each fragment contains the result of mapping one glyph,
which may be more than one character (e.g. for ligatures). In this mode content analysis is disabled. TET returns the original text fragments on the page in their original
order. Although this is the fastest mode, it is only useful if the TET client intends to
do sophisticated postprocessing (or is only interested in the text position, but not in
its logical structure) since the text may be scattered all over the page.
> With granularity=word the Wordfinder algorithm groups characters into logical
words. Each fragment contains a word. Isolated punctuation characters (comma, colon, question mark, quotes, etc.) are returned as separate fragments by default, while
multiple sequential punctuation characters are grouped as a single word (e.g. a series
of period characters which simulates a dotted line). However, punctuation treatment can be changed (see »Word boundary detection for Western text«, page 86).
> With granularity=line the words identified by the Wordfinder are grouped into lines.
If dehyphenation is enabled (which is the default) the parts of hyphenated words at
the end of a line are combined, and the full dehyphenated word is part of the line.
> With granularity=page all words on the page are returned in a single fragment.
Separator characters are inserted between multiple words, lines, or paragraphs if the
chosen granularity is larger than the respective unit. For example, with
granularity=word there’s no need to insert word separators since each call to TET_get_
text( ) returns exactly one word.
The separator characters can be specified with the wordseparator and
lineseparator options of TET_open_document( ) (use U+0000 to disable a separator), for example:
lineseparator=U+000A
By default, all content processing operations are disabled for granularity=glyph, and enabled for all other granularity settings. However, more fine-grained control is possible via
separate options (see below).
Word boundary detection for Western text. The Wordfinder, which is enabled for all
granularity modes except glyph, creates logical words from multiple glyphs which may
be scattered all over the page in no particular order. Word boundaries for Western text
are identified by two criteria:
> A sophisticated algorithm analyzes the geometric relationship among glyphs to find
character groups which together form a word. The algorithm takes into account a variety of properties and special cases in order to accurately identify words even in
complicated layouts and for arbitrary text ordering on the page.
The suboption usemetrics of the contentanalysis page option can be used to disable
this algorithm for special situations.
> Some characters, such as space and punctuation characters (e.g. colon, comma, full
stop, parentheses) are considered a word boundary, regardless of their width and position. The suboption useclasses of the contentanalysis page option can be used to disable this algorithm for special situations.
Ignoring punctuation characters for word boundary detection can, for example, be useful for maintaining Web URLs where period and slash characters are usually considered
part of a word (see Figure 6.5). If the punctuationbreaks page option is set to false the
Wordfinder no longer treats punctuation characters as word boundaries:
contentanalysis={punctuationbreaks=false}
Note Word boundary detection for text with ideographic characters works differently; see Section
6.4.2, »Word Boundaries for CJK Text«, page 80, for more information.
Fig. 6.5
The default setting punctuationbreaks=true
separates the parts of URLs (top), while
punctuationbreaks=false keeps the parts together (bottom).
Dehyphenation. Hyphenated words at the end of a line are usually not desired for applications
which process the extracted text on a logical level. TET therefore dehyphenates or recombines the
parts of a hyphenated word. More precisely, if a word at the end of a line ends with a hyphen character
and the first word on the next line starts with a lowercase character, the hyphen is removed and the
first part of the word is combined with the part on the next line, provided there is at least one more line in the same zone.
Dash characters (as opposed to hyphens) are left unmodified. The parts of a hyphenated
word are not modified, only the hyphen is removed. Dehyphenation can be disabled
with the following option list for TET_open_page( ):
contentanalysis={dehyphenate=false}
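The dehyphenation rule described above can be sketched in Java as follows. This is a simplified illustration, not TET's implementation: it works on plain strings, treats only U+002D as a hyphen, and assumes that all lines belong to the same zone. The class name Dehyphenator is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Dehyphenator {
    /** Joins a word split by a line-end hyphen if the next line starts in lowercase. */
    public static List<String> dehyphenate(List<String> lines) {
        List<String> result = new ArrayList<>(lines);
        for (int i = 0; i + 1 < result.size(); i++) {
            String line = result.get(i);
            String next = result.get(i + 1);
            if (line.endsWith("-") && !next.isEmpty()
                    && Character.isLowerCase(next.charAt(0))) {
                // Move the first word of the next line up and drop the hyphen
                int space = next.indexOf(' ');
                String tail = space < 0 ? next : next.substring(0, space);
                String rest = space < 0 ? "" : next.substring(space + 1);
                result.set(i, line.substring(0, line.length() - 1) + tail);
                result.set(i + 1, rest);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("TET dehy-", "phenates words", "well-", "Known");
        System.out.println(dehyphenate(lines));
    }
}
```

In the sample input the first hyphen is removed because the next line starts with a lowercase character, while the second hyphen is kept because the following word is capitalized.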
Shadow and fake bold text removal. PDF documents sometimes include redundant
text which does not contribute to the semantics of a page, but creates certain visual effects only. Shadow text effects are usually achieved by placing two or more copies of the
actual text on top of each other, where a small displacement is applied. Applying
opaque coloring to each layer of text provides a visual appearance where the majority
of the text in lower layers is obscured, while the visible portions create a shadow effect.
Similarly, word processing applications sometimes support a feature for
creating artificial bold text. In order to create bold text appearance even if a bold font is
not available, the text is placed repeatedly on the page in the same color. Using a very
small displacement the appearance of bold text is simulated.
Shadow simulation, artificial bold text, and similar visual artifacts create severe
problems when reusing the extracted text, since redundant text which contributes only to the visual appearance is processed although it does not contribute to
the page contents.
If the Wordfinder is enabled, TET identifies and removes such redundant visual artifacts by default. Shadow removal can be disabled with the following option list for
TET_open_page( ):
contentanalysis={shadowdetect=false}
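The basic idea behind removing shadow and fake bold text can be illustrated with a simple duplicate filter. The following Java sketch is an assumption about how such a filter might look, not TET's actual algorithm: it treats a glyph as redundant if a glyph with the same text was already seen within a small displacement. The class name ShadowFilter and the tolerance parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ShadowFilter {
    /** A glyph with its Unicode text and reference point. */
    static final class Glyph {
        final String text;
        final double x, y;
        Glyph(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Keeps the first copy of each glyph and drops nearby copies with identical text. */
    public static List<Glyph> removeDuplicates(List<Glyph> glyphs, double tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("A", 100, 100));
        glyphs.add(new Glyph("A", 100.5, 99.5));   // shadow copy with small displacement
        glyphs.add(new Glyph("B", 110, 100));
        System.out.println(removeDuplicates(glyphs, 1.0).size() + " glyphs kept");
    }
}
```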
Accented characters. In many languages accents and other diacritical marks are
placed close to other characters to form combined characters. Some typesetting programs, most notably TeX, emit two characters (base character and accent) separately to
create a combined character. For example, to create the character ä first the letter a is
placed on the page, and then the dieresis character ¨ is placed on top of it. TET detects
this situation and recombines both characters to form the appropriate combined character.
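The Unicode side of this recombination can be sketched with the JDK's normalizer. The example below covers only the character mapping and assumes that the geometric detection (accent placed on top of the base character) has already happened. It is an illustration, not TET's implementation, and the class name AccentCombiner as well as the small accent table are hypothetical.

```java
import java.text.Normalizer;

class AccentCombiner {
    /** Maps a spacing accent to the corresponding combining mark; returns 0 if unknown. */
    static char combiningMark(char spacingAccent) {
        switch (spacingAccent) {
            case '\u00A8': return '\u0308';   // dieresis
            case '\u00B4': return '\u0301';   // acute accent
            case '`':      return '\u0300';   // grave accent
            case '^':      return '\u0302';   // circumflex
            case '~':      return '\u0303';   // tilde
            default:       return 0;
        }
    }

    /** Combines base character and accent into a precomposed character where one exists. */
    public static String combine(char base, char spacingAccent) {
        char mark = combiningMark(spacingAccent);
        if (mark == 0) {
            return "" + base + spacingAccent;   // unknown accent: leave both characters
        }
        return Normalizer.normalize("" + base + mark, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String combined = combine('a', '\u00A8');
        System.out.println(String.format("U+%04X", (int) combined.charAt(0)));
    }
}
```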
6.7 Layout Analysis
TET analyses the layout of text on the page in order to determine the best possible order
of text extraction. This automatic process can be assisted by several options. If you have
advance knowledge of the nature of the documents you can improve the text extraction
results by supplying suitable options.
Document styles. Several internal parameters are available for processing documents
of different layout and style. For example, newspaper pages tend to contain lots of text
in multiple columns, while business reports often contain comments in the margins,
etc. TET contains predefined settings for several types of document. These settings can
be activated with an option for TET_open_page( ):
docstyle=papers
If the type of input documents is known it is strongly recommended to supply suitable
values of the docstyle page option and (if applicable) also the layouthint page option.
Supplying the docstyle option activates an advanced layout recognition algorithm.
However, supplying an unsuitable value for this option may actually create worse results.
The following types are available for the docstyle option (Table 6.5 contains typical
examples for some document styles):
> Book: typical book layouts with regular pages
> Business: business documents
> Cad: technical or architectural drawings which are typically heavily fragmented
> Fancy: fancy pages with complex and sometimes irregular layout
> Forms: structured forms
> Generic: the most general document class without any further qualification
> Magazines: magazine articles, usually with three or more columns and interspersed
images and graphics
> Papers: newspapers with many columns, large pages and small type
> Science: scientific articles, usually with two or more columns and interspersed images, formulae, tables, etc.
> Search engine: this class does not refer to a specific type of input document, but rather optimizes TET for the typical requirements of indexers for search engines. Some
layout detection features are disabled to deliver only the raw text and speed up processing. For example, table and page structure recognition are disabled.
> Space grid: this class is targeted at list-oriented reports which are often generated on
mainframe systems. The characteristic of this document class is that the visual layout is generated with space characters instead of explicit positioning of text. When
processing this kind of document text extraction can be accelerated since some processing steps (e.g. shadow detection) can be skipped.
Choosing the most appropriate document style can speed up processing and enhance
text extraction results.
Table 6.5 Document styles (sample pages for docstyle=book, docstyle=business, docstyle=cad, docstyle=fancy, docstyle=magazines, docstyle=papers, docstyle=science, docstyle=spacegrid)
Complex layouts. Some classes of documents often use very elaborate page layouts.
For example, with magazines and periodicals TET may not be able to properly determine the relationship of columns on the page. In such situations it is possible to enhance the extracted text at the expense of processing time. This can be controlled with
the structureanalysis and layoutanalysis page options, e.g.
structureanalysis={list=true bullets={{fontname=ZapfDingbats}}}
layoutanalysis = {layoutrowhint={full separation=preservecolumns}}
layoutdetect=2
layouteffort=high
Table detection. TET detects tabular structures on the page and structures the table
contents in rows, columns and cells. Information about tables detected on the page is
not provided directly by the API, but is only available in TETML output as in the
following example:
<Table llx="302.14" lly="639.72" urx="525.50" ury="731.50">
<Row>
<Cell colSpan="3" llx="306.14" lly="641.52" urx="516.67" ury="650.52">
<Para>
<Box llx="306.14" lly="641.52" urx="516.67" ury="650.52">
<Word>
<Text>TET</Text>
<Box llx="306.14" lly="641.52" urx="319.70" ury="650.52"/>
</Word>
<Word>
<Text>processes</Text>
<Box llx="321.67" lly="641.52" urx="356.89" ury="650.52"/>
</Word>
<Word>
<Text>all</Text>
<Box llx="358.85" lly="641.52" urx="368.15" ury="650.52"/>
</Word>
...
</Box>
</Para>
</Cell>
</Row>
</Table>
TET can optionally analyze the horizontal and vertical lines or colored boxes which are
often used to enhance the table layout. This vector graphics analysis is disabled by default. It can improve the results of table and layout detection if such graphical elements
are present. Vector graphics analysis can be enabled with the page option vectoranalysis,
for example
vectoranalysis={structures=tables}
6.8 Check whether an Area is empty
TET can also be used to check whether a particular area on the page is empty, i.e. does not contain any text, image, or vector graphics objects, which may be useful for postprocessing
applications. For example, consider that you need to place a stamp, page number, barcode or other item somewhere on a page. If the page contents are variable it may be difficult to specify a location on the page where the stamp or barcode can be placed without obscuring some existing contents. TET can check whether the target area is actually
empty. This feature works as follows with the TET API:
> The emptycheck page option activates the feature and disables any page content retrieval.
> The coordinates of the rectangular area which is checked are supplied in the includebox page option. Double braces are required since this option usually accepts multiple boxes (but only a single box makes sense for emptycheck):
includebox={{100 20 500 100}}
If the includebox option is not supplied the whole clipping area is checked. This can
be used to identify empty pages.
> Instead of retrieving any page contents, TET_get_text( ) returns one of the strings
empty or notempty as result of the check.
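The option list for the empty check can be assembled programmatically if the target area varies. The following Java sketch only builds the page option list and interprets the result string. The class EmptyCheck and its methods are hypothetical helpers, and the calls to TET_open_page( ) and TET_get_text( ) are indicated as comments only, since they require the TET library.

```java
import java.util.Locale;

class EmptyCheck {
    /** Builds the page option list for checking the given rectangle. */
    public static String pageOptions(double llx, double lly, double urx, double ury) {
        // Locale.ROOT ensures a decimal point regardless of the system locale
        return String.format(Locale.ROOT,
                "emptycheck includebox={{%.2f %.2f %.2f %.2f}}", llx, lly, urx, ury);
    }

    /** Interprets the string delivered by TET_get_text( ) in emptycheck mode. */
    public static boolean isEmpty(String result) {
        return "empty".equals(result);
    }

    public static void main(String[] args) {
        String optlist = pageOptions(100, 20, 500, 100);
        System.out.println(optlist);
        // Supply optlist to TET_open_page( ), then pass the string
        // returned by TET_get_text( ) to isEmpty( ).
    }
}
```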
The emptycheck feature can be used in the TET command-line tool as follows:
tet --pageopt "emptycheck includebox={{300 760 450 820}}" input.pdf
box on page 1: empty
box on page 2: empty
box on page 3: notempty
7 Advanced Unicode Handling
7.1 Important Unicode Concepts
This section provides basic information about Unicode since text handling in TET heavily relies on the Unicode standard. The Unicode Web site provides a wealth of additional
information:
www.unicode.org
Characters and glyphs. When dealing with text it is important to clearly distinguish
the following concepts:
> Characters are the smallest units which convey information in a language. Common
examples are the letters in the Latin alphabet, Chinese ideographs, and Japanese syllables. Characters have a meaning: they are semantic entities.
> Glyphs are graphical shapes which represent one or more particular characters.
Glyphs have an appearance: they are representational entities.
There is no one-to-one relationship between characters and glyphs. For example, a ligature is a single glyph which represents two or more separate characters. On the other
hand, a specific glyph may be used to represent different characters depending on the
context (some characters look identical, see Figure 7.1).
Unicode postprocessing in TET can change the relationship of glyphs and resulting
characters even more. For example, decompositions may convert a single character into
multiple characters, and foldings may remove characters. For these reasons you must
not assume any specific relationship of characters and glyphs.
BMP and PUA. The following terms occur frequently in Unicode-based environments:
> The Basic Multilingual Plane (BMP) comprises the code points in the Unicode range
U+0000...U+FFFF. The Unicode standard contains many more code points in the supplementary planes, i.e. in the range U+10000...U+10FFFF.
Fig. 7.1 Relationship of glyphs and characters
  Glyph »g«: U+0067 LATIN SMALL LETTER G
  Glyph »fi« (ligature): U+0066 LATIN SMALL LETTER F + U+0069 LATIN SMALL LETTER I
  Glyph »Ω«: U+2126 OHM SIGN or U+03A9 GREEK CAPITAL LETTER OMEGA
  Glyph »VIII«: U+2167 ROMAN NUMERAL EIGHT or U+0056 V U+0049 I U+0049 I U+0049 I
> A Private Use Area (PUA) is one of several ranges which are reserved for private use.
PUA code points cannot be used for general interchange since the Unicode standard
does not specify any characters in this range. The Basic Multilingual Plane includes a
PUA in the range U+E000...U+F8FF. Plane fifteen (U+F0000...U+FFFFD) and plane sixteen (U+100000...U+10FFFD) are completely reserved for private use.
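The ranges above can be checked with a few comparisons. The following Python sketch is only an illustration of the ranges listed in this section; the function name classify_code_point and its return strings are hypothetical and not part of TET:

```python
def classify_code_point(cp: int) -> str:
    """Classify a code point using the BMP and PUA ranges given in the text."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # PUA inside the Basic Multilingual Plane
    if 0xE000 <= cp <= 0xF8FF:
        return "BMP PUA"
    # Planes fifteen and sixteen are reserved for private use
    if 0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD:
        return "supplementary PUA"
    if cp <= 0xFFFF:
        return "BMP"
    return "supplementary"
```

For example, classify_code_point(0x1D11E) reports a supplementary code point which is not in a Private Use Area.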
Unicode encoding forms (UTF formats). The Unicode standard assigns a number (code
point) to each character. In order to use these numbers in computing, they must be represented in some way. In the Unicode standard this is called an encoding form (formerly: transformation format); this term should not be confused with font encodings. Unicode defines the following encoding forms:
> UTF-8: This is a variable-width format where code points are represented by 1-4 bytes.
ASCII characters in the range U+0000...U+007F are represented by a single byte in
the range 00...7F. Latin-1 characters in the range U+00A0...U+00FF are represented by
two bytes, where the first byte is always 0xC2 or 0xC3 (these values represent  and Ã
in Latin-1).
> UTF-16: Code points in the Basic Multilingual Plane (BMP) are represented by a single
16-bit value. Code points in the supplementary planes, i.e. in the range U+10000...
U+10FFFF, are represented by a pair of 16-bit values. Such pairs are called surrogate
pairs. A surrogate pair consists of a high-surrogate value in the range D800...DBFF
and a low-surrogate value in the range DC00...DFFF. High- and low-surrogate values
can only appear as parts of surrogate pairs, but not in any other context.
> UTF-32: Each code point is represented by a single 32-bit value.
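The storage requirements of the three encoding forms can be compared with Python's built-in codecs. This is a minimal sketch which is independent of TET; the helper name encoded_sizes is chosen only for this illustration:

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of code units needed for a single character.

    UTF-8 is counted in bytes, UTF-16 in 16-bit values, UTF-32 in 32-bit values.
    The -be codec variants are used so that no byte order mark is prepended.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")) // 2,
        "utf-32": len(ch.encode("utf-32-be")) // 4,
    }

ascii_sizes = encoded_sizes("A")           # U+0041: one byte in UTF-8
latin1_sizes = encoded_sizes("\u00c4")     # U+00C4: two bytes in UTF-8
bmp_sizes = encoded_sizes("\u20ac")        # U+20AC: three bytes in UTF-8
supp_sizes = encoded_sizes("\U0001d11e")   # U+1D11E: surrogate pair in UTF-16
```

The supplementary character U+1D11E requires four bytes in UTF-8 and two 16-bit values in UTF-16, but only a single 32-bit value in UTF-32.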
Unicode encoding schemes and the Byte Order Mark (BOM). Computer architectures
differ in the ordering of bytes, i.e. whether the bytes constituting a larger value (16- or
32-bit) are stored with the most significant byte first (big-endian) or the least significant
byte first (little-endian). A common example for big-endian architectures is PowerPC,
while the x86 architecture is little-endian. Since UTF-16 and UTF-32 are based on values
which are larger than a single byte, the byte-ordering issue comes into play here. An encoding scheme (note the difference to encoding form above) specifies the encoding
form plus the byte ordering. For example, UTF-16BE stands for UTF-16 with big-endian
byte ordering. If the byte ordering is not known in advance it can be specified by means
of the code point U+FEFF, which is called Byte Order Mark (BOM). Although a BOM is not
required in UTF-8, it may be present as well, and can be used to identify a stream of
bytes as UTF-8. Table 7.1 lists the representation of the BOM for various encoding forms.
Table 7.1 Byte order marks for various Unicode encoding forms
Encoding form          Byte order mark (hex)   Graphical representation in WinAnsi (1)
UTF-8                  EF BB BF                ï»¿
UTF-16 big-endian      FE FF                   þÿ
UTF-16 little-endian   FF FE                   ÿþ
UTF-32 big-endian      00 00 FE FF             ■ ■ þÿ
UTF-32 little-endian   FF FE 00 00             ÿþ ■ ■
1. The black square ■ denotes a null byte.
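The byte order marks can be used to guess the encoding scheme of a byte stream. The following Python sketch relies only on the BOM constants in the standard codecs module; the function name detect_bom is hypothetical. Note that the UTF-32 marks must be tested before the UTF-16 marks because the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM:

```python
import codecs

# Longer marks first: FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

def detect_bom(data: bytes):
    """Return the encoding scheme indicated by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

This check is a heuristic: UTF-16 little-endian text which starts with a BOM followed by U+0000 would be reported as UTF-32 little-endian.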
Composite characters and sequences. Some glyphs map to a sequence of multiple
characters. For example, ligatures will be mapped to multiple characters according to
their constituent characters. However, composite characters (such as the Roman numeral in Figure 7.1) may or may not be split, subject to information in the font and PDF
as well as the decompose document option (see Section 7.3, »Unicode Postprocessing«,
page 98).
If appropriate, TET will split composite characters into a sequence of constituent
characters. The corresponding sequence will be part of the text returned by TET_get_text( ).
For each character, details of the underlying glyph(s) can be obtained via TET_get_char_info( ),
including the information whether the character is the start or continuation of a sequence. Position information will only be returned for the first character of a
sequence. Subsequent characters of a sequence will not have any associated position or
width information, but must be processed in combination with the first character.
Characters without any corresponding glyph. Although every glyph on the page will
be mapped to one or more corresponding Unicode characters, not all characters delivered by TET actually correspond to a glyph. Characters which correspond to a glyph are
called real characters, others are called artificial characters. There are several classes of
artificial characters which will be delivered although a directly corresponding glyph is
not available:
> A composite character (see above) will map to a sequence of multiple Unicode characters. While the first character in the sequence corresponds to the actual glyph, the
remaining characters do not correspond to any glyph.
> Separator characters inserted via the lineseparator, wordseparator, and paraseparator
options are artifacts without any corresponding glyph.
7.2 Unicode Preprocessing (Filtering)
TET applies several filters to remove text which is unlikely to be useful. These filters
modify the text before applying any Unicode postprocessing steps. While some filters
are always active, others require the Wordfinder and are therefore active only for
granularity=word or above.
7.2.1 Filters for all Granularities
The following filters can be used with all granularities.
Text in unwieldy font sizes. Very small or very large text can optionally be ignored, e.g.
large characters in the background of the page. The limits can be controlled with the
fontsizerange page option. By default, text in all font sizes will be extracted.
The following page option limits the range of font sizes for extracted text from 10 to
50 points; text in other font sizes will be ignored:
fontsizerange={10 50}
Invisible text. Invisible text (i.e. text with textrendering=3) is extracted by default. Note
that text in PDF may be invisible for various other reasons than the textrendering property, e.g. the text color is identical to the background color, the text may be obscured by
other objects on the page, etc. The behavior described here relates only to text with
textrendering=3. This PDF technique is commonly used for the results of OCR where the
text sits invisibly »on top of« the scanned raster image.
Invisible text can be identified with the textrendering member of the TET_char_info
structure returned by TET_get_char_info( ) (see Table 10.16, page 192), or with the
Glyph/@textrendering attribute in TETML.
Use the following page option if you want to ignore invisible text:
ignoreinvisibletext=true
Completely ignore text with certain font names or font types. In some situations it
may be useful to completely ignore text in one or more fonts specified by name, e.g. a
symbolic font which does not contribute any meaningful text. As an alternative, the
problematic fonts can also be specified by font type. This is mainly useful for Type 3
fonts which are sometimes used for ornaments. This filter can be controlled via the
remove suboption of the glyphmapping document option.
E.g. ignore all text in Type 3 fonts:
glyphmapping={{fonttype={Type3} remove}}
Ignore all text in the Webdings, Wingdings, Wingdings 2, and Wingdings 3 fonts:
glyphmapping={{fontname=Webdings remove} {fontname=Wingdings* remove}}
The conditions for font name and font type can also be combined, e.g. ignore text in all
Type 3 fonts starting with the letter A:
glyphmapping={{fonttype={Type3} fontname=A* remove}}
7.2.2 Filters for Granularity Word and above
The following filters can be used only for granularity=word, line, and page.
Dehyphenation. Dehyphenation removes hyphen characters and combines the parts
of a hyphenated word.
Hyphens used for splitting words across lines can be identified with the attributes
member of the TET_char_info structure (see Table 10.16, page 192), or with the
Glyph/@hyphenation attribute in TETML.
Dehyphenation can be disabled with the following page option:
contentanalysis={dehyphenate=false}
Hyphen reporting. If dehyphenation is enabled you can decide whether the hyphen characters between the parts of hyphenated words will be reported in the generated glyph lists, i.e. the list of glyphs returned by TET_get_char_info( ) and the Glyph
elements in TETML. By default, hyphens will be removed.
However, some applications may need to know the exact location of the hyphen on
the page. For example, the highlight_search_terms and search_and_replace_text topics in
the TET Cookbook take the hyphen glyph into account when placing an annotation or
replacement text on top of the original word. In this situation you can instruct TET to
include all hyphens which have been detected by the dehyphenation process with the
following page option:
contentanalysis={keephyphenglyphs=true}
The hyphens can be identified by the TET_ATTR_DEHYPHENATION_ARTIFACT flag of the attributes member in the TET_char_info structure returned by TET_get_char_info( ) (see Table 10.16, page 192), or in TETML with the Glyph/@dehyphenation attribute with value
artifact.
Shadow removal. Redundant text which creates only visual artifacts such as shadow
effects or artificial bold text will be removed.
Shadow and artificial bold text can be identified with the attributes member of the
TET_char_info structure (see Table 10.16, page 192), or with the Glyph/@shadow attribute
in TETML.
Shadow removal can be disabled with the following page option:
contentanalysis={shadowdetect=false}
7.3 Unicode Postprocessing
TET offers various controls for fine-tuning the Unicode characters comprising the extracted text. The postprocessing steps discussed in this chapter are defined in the Unicode standard. They are available in TET and are processed in the following order:
> Foldings are controlled by the fold document option and preserve, remove, or replace
certain characters. Examples: remove hyphens which are used to split words, remove
Arabic Tatweel characters.
> Decomposition is controlled by the decompose document option and replaces a character with one or more equivalent characters. Examples: split ligatures, map fullwidth ASCII and symbol variants to the corresponding non-fullwidth characters.
> Normalization is controlled by the normalize document option and converts the text
to one of the normalized Unicode forms. Examples: combine base character and diacritical character to a common character; map Ohm sign to Greek Omega.
Unicode postprocessing is completely disabled for granularity=glyph.
7.3.1 Unicode Folding
Foldings process one or more Unicode characters and apply a certain action on each of
the characters. The following actions are available:
> preserve the character;
> remove the character;
> replace it with another (fixed) character.
Foldings are not chained: the output of a folding will not be processed again by the
available foldings. Foldings affect only the Unicode text output, but not the set of
glyphs reported in the TET_char_info structure or the <Glyph> elements in TETML. For example, if a folding removes certain Unicode characters, the corresponding glyphs which
created the initial characters will still be reported.
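The three actions and the fact that foldings are not chained can be modeled in a few lines of Python. This sketch is not TET's implementation: the function apply_foldings, the rule representation, and the assumption that the first matching rule wins are all made up for this illustration.

```python
import unicodedata

def apply_foldings(text, foldings):
    """Apply (predicate, action) rules to each input character exactly once.

    action is "preserve", "remove", or a fixed replacement character.
    Assumption for this sketch: the first matching rule decides.
    """
    result = []
    for ch in text:
        replacement = ch
        for matches, action in foldings:
            if matches(ch):
                if action == "remove":
                    replacement = ""
                elif action != "preserve":
                    replacement = action
                break
        # The replacement is appended as is; it is never folded again.
        result.append(replacement)
    return "".join(result)

rules = [
    (lambda c: c == "\u00a0", "-"),                           # replace
    (lambda c: unicodedata.category(c) == "Pd", "remove"),    # remove dashes
]
folded = apply_foldings("a\u00a0b-c", rules)
```

The original hyphen is removed by the second rule, but the hyphen created by the first rule survives because the output of a folding is not processed again: the result is a-bc.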
In order to improve readability the examples in the tables below list isolated suboptions of the fold option list. Keep in mind that these suboptions must be combined to a
single large fold option list if you want to apply multiple foldings; do not supply the
fold option more than once. For example, the following is wrong:
fold={ {[:blank:] U+0020} } fold={ {_dehyphenation remove} }
WRONG!
The following option list shows the correct syntax for multiple foldings:
fold={ {[:blank:] U+0020 } {_dehyphenation remove} }
Folding examples. Table 7.2 lists examples for the fold option which demonstrate various folding applications. The options must be supplied in the option list for
TET_open_document( ). TET can apply foldings to a selected subset of all Unicode characters. These
are called Unicode sets; their syntax is discussed in »Unicode sets«, page 158.
Table 7.2 Examples for the fold option
Columns: description and option list; before folding; after folding

Remove all characters in a Unicode set
  Keep only characters in ISO 8859-1 (Latin-1) in the output, i.e. remove all characters
  outside the range U+0020...U+00FF:
    fold={{[^U+0020-U+00FF] remove} default}
    before: U+0104    after: n/a
  Remove all non-alphabetic characters (e.g. punctuation, numbers):
    fold={{[:Alphabetic=No:] remove} default}
    before: U+0037    after: n/a
  Remove all characters except numbers:
    fold={{[^[:General_Category=Decimal_Number:]] remove} default}
    before: U+0037    after: U+0037
    before: U+0041    after: n/a
  Remove all dash punctuation characters:
    fold={{[:General_Category=Dash_Punctuation:] remove} default}
    before: U+002D    after: n/a
  Remove all Bidi control characters:
    fold={{[:Bidi_Control:] remove} default}
    before: U+200E    after: n/a
  Remove all variation selectors for Standard or Ideographic Variation Sequences (IVS):
    fold={{[[\uFE00-\uFE0F][\U000E0100-\U000E01EF]] remove} default}
    before: U+2268 U+FE00    after: U+2268

Replace all characters in a Unicode set with another character
  Space folding: map all variants of Unicode space characters to U+0020:
    fold={{[:blank:] U+0020} default}
    before: U+00A0    after: U+0020
  Dashes folding: map all variants of Unicode dash characters to U+002D:
    fold={{[:Dash:] U+002D} default}
    before: U+2011    after: U+002D
  Replace all unassigned characters (i.e. Unicode code points to which no character
  is assigned) with U+FFFD:
    fold={{[:Unassigned:] U+FFFD} default}
    before: U+03A2    after: U+FFFD

Special handling for individual characters
  Preserve all hyphen characters at line breaks while keeping the remaining default
  foldings. Since these characters are identified internally in TET (as opposed to having
  a fixed Unicode property) the keyword _dehyphenation is used to identify the
  folding’s domain:
    fold={{_dehyphenation preserve} default}
    before: U+002D    after: U+002D
  Preserve Arabic Tatweel characters (which are removed by default):
    fold={{[U+0640] preserve} default}
    before: U+0640    after: U+0640
  Replace various punctuation characters with their ASCII counterparts:
    fold={ {[U+2018] U+0027} {[U+2019] U+0027} {[U+201C] U+0022}
    {[U+201D] U+0022} default}
    before: U+201C    after: U+0022

Handle font-specific PUA characters, e.g. Japanese EUDC or logo font
  Default behavior: replace PUA characters with the Unicode replacement character U+FFFD:
    fold={{[:Private_Use:] U+FFFD} default}
    before: (PUA character)    after: U+FFFD
  Preserve PUA characters:
    fold={{[:Private_Use:] preserve} default}
    before: (PUA character)    after: (PUA character)
  Remove PUA characters:
    fold={{[:Private_Use:] remove} default}
    before: (PUA character)    after: n/a
  Remove TET PUA values for unmappable glyphs, but preserve PUA characters from fonts:
    fold={{_tetpua remove} {[:Private_Use:] preserve} default}
    before: (PUA character) (TET PUA value)    after: (PUA character)
Default foldings. Except for granularity=glyph TET applies the following default foldings which are explained in Table 7.3:
{[:blank:] U+0020}
{_tetpua unknownchar}
{[:Private_Use:] U+FFFD}
{_dehyphenation remove}
{[[\u0640][:Control:][:Unassigned:]] remove}
In order to combine custom foldings with the default foldings, the keyword default
must be supplied after the custom folding options (this is shown in all examples in Table 7.2). For example, the following fold option list preserves hyphens in dehyphenated
words and then applies the default foldings:
fold={ {_dehyphenation preserve} default }
Adding the keyword default to the fold option list is recommended in most cases unless
you want to explicitly disable all default foldings.
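Since all foldings must be combined in a single fold option list, client code which assembles option lists dynamically may want to centralize this step. The following Python sketch only builds the option string; the helper fold_option and its keep_defaults parameter are hypothetical and not part of the TET API:

```python
def fold_option(*suboptions: str, keep_defaults: bool = True) -> str:
    """Combine folding suboptions into one fold option list.

    Each suboption is given without its enclosing braces, e.g.
    "_dehyphenation preserve". The keyword default is appended after the
    custom foldings unless keep_defaults is False.
    """
    parts = ["{" + sub + "}" for sub in suboptions]
    if keep_defaults:
        parts.append("default")
    return "fold={ " + " ".join(parts) + " }"

with_defaults = fold_option("_dehyphenation preserve")
custom_only = fold_option("[:blank:] U+0020", "_dehyphenation remove",
                          keep_defaults=False)
```

The first call yields the option list shown above; the second call creates a list without the default foldings.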
Table 7.3 Default foldings
Columns: folding and description; sample input; output

  Space folding: map all variants of Unicode space characters to U+0020:
    {[:blank:] U+0020}
    input: U+00A0    output: U+0020
  Map TET PUA values for unmappable glyphs to the character specified in the
  unknownchar option (or apply the specified action preserve/remove):
    {_tetpua unknownchar}
    input: (TET PUA value)    output: U+FFFD
  Map PUA characters to the Unicode replacement character U+FFFD:
    {[:Private_Use:] U+FFFD}
    input: (PUA character)    output: U+FFFD
  Remove hyphens in dehyphenated words:
    {_dehyphenation remove}
    input: U+002D    output: n/a
  Remove Arabic Tatweel characters, control characters and characters which are
  not assigned in Unicode (these foldings are always performed after all other foldings
  when creating TETML output):
    {[[\u0640][:Control:][:Unassigned:]] remove}
    input: U+0640 U+000C U+03A2    output: n/a
7.3.2 Unicode Decomposition
Decompositions replace a character with an equivalent sequence of one or more other
characters. A Unicode character is called (either compatibility or canonical) equivalent
to another character or a sequence of characters if they actually mean the same, but for
historical reasons (mostly related to round tripping with legacy encodings) are encoded
separately in Unicode. Decompositions destroy information. This is useful if you are
not interested in the difference between the original character and its equivalent. If you
are interested in the difference, however, the respective decomposition should not be
applied. For a full discussion of Unicode decomposition see
www.unicode.org/versions/Unicode8.0.0/ch02.pdf (section 2.12) and
www.unicode.org/versions/Unicode8.0.0/ch03.pdf (section 3.7)
Note The term »decomposition« is used here as defined in the Unicode standard, although many decompositions do not actually split a character into multiple parts, but convert a single character to another character.
Canonical decomposition. Characters or character sequences which are canonically
equivalent represent the same abstract character and should therefore always have the
same appearance and behavior. Common examples include precomposed characters
(e.g. Ä, U+00C4) vs. combining sequences (e.g. A followed by a combining diaeresis,
U+0041 U+0308): both representations are canonically equivalent. Switching from one representation to the other does not remove information. Canonical decompositions replace one representation with another which is
considered the canonical representation.
In the Unicode code charts (see footnote 1), but not in the character tables, canonical mappings are
marked with the symbol ≡ (U+2261 IDENTICAL TO). The decomposition name <canonical> is
implicitly assumed. Table 7.4 contains several examples.
Compatibility decomposition. Characters which are compatibility equivalent represent the same abstract character, but may differ in appearance or behavior. Examples
include isolated forms of Arabic characters (e.g. U+0633) vs. context-specific shaped forms
(e.g. U+FEB2, U+FEB4, U+FEB3). Compatibility equivalent characters differ in formatting. Removing this formatting information implies loss of information, but may simplify processing for certain types of applications (e.g. searching).
In the Unicode code charts compatibility mappings are marked with the symbol
≈ (U+2248 ALMOST EQUAL TO), followed by the decomposition name (or »tag«) in angle
brackets, e.g. <noBreak>. If no tag name is provided, <compat> is assumed. The tag names
are identical to the option names in Table 7.5. As can be seen in some of the examples,
the result of a decomposition may convert a single character to a sequence of multiple
characters.
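The decomposition mappings and their tags can also be inspected programmatically. The following Python sketch uses the standard unicodedata module (whose data reflects the Unicode version bundled with the Python release, not TET's tables); the helper name decomposition_info is chosen only for this illustration:

```python
import unicodedata

def decomposition_info(ch: str):
    """Return (tag, code points) of a character's decomposition mapping.

    Canonical mappings carry no tag in the Unicode data, so the name
    "canonical" is supplied here. Returns None if there is no mapping.
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None
    fields = raw.split()
    if fields[0].startswith("<"):
        tag = fields[0].strip("<>")
        fields = fields[1:]
    else:
        tag = "canonical"
    return tag, [int(field, 16) for field in fields]
```

For example, decomposition_info("\u00c4") reports a canonical mapping to U+0041 U+0308, while decomposition_info("\ufeb2") reports the tag final and the target U+0633.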
1. See www.unicode.org/Public/8.0.0/charts/
Table 7.4 Canonical decomposition: suboption for the decompose option (canonically equivalent characters are
marked with the symbol ≡ (U+2261 IDENTICAL TO) in the Unicode code charts)
decomposition name   description               before decomposition   after decomposition
canonical (1)        Canonical decomposition   U+00C0                 U+0041 U+0300
                                               U+F9F4                 U+6797
                                               U+2126                 U+03A9
                                               U+3070                 U+306F U+3099
                                               U+FB2F                 U+05D0 U+05B8
1. By default this decomposition is not applied to all characters in order to preserve certain characters; see »Default decompositions«, page 104, for details.
Note Keep in mind that PDF documents may already map glyphs to the decomposed sequence instead of to the non-decomposed Unicode value. In this situation the decompose option doesn’t
affect the output.
Decomposition examples. Decompositions in TET can be controlled with the document option decompose. A decomposition can be restricted to operate only on some, but
not all Unicode characters. The subset on which a decomposition operates is called its
domain. Table 7.5 lists the suboptions for all Unicode decompositions along with examples.
The following examples for the decompose option must be supplied in the option list
for TET_open_document( ). The decomposition names in the decompose option list are
taken from Table 7.5.
Disable all decompositions:
decompose={none}
Preserve wide (double-byte or zenkaku) and narrow (hankaku) characters:
decompose={wide=_none narrow=_none}
Map all canonical equivalents to their counterparts:
decompose={canonical=_all}
The following option list enables the circle decomposition, but disables all other decompositions:
decompose={none circle=_all}
Table 7.5 Compatibility decomposition: suboptions for the decompose option (compatibility equivalent characters
are marked with the symbol ≈ (U+2248 ALMOST EQUAL TO) in the Unicode code charts)
decomposition   before            after decomposition      description
name            decomposition     (in logical order)
circle          U+3251            U+0032 U+0031            Encircled characters
compat          U+FB01            U+0066 U+0069            Other compatibility decompositions, e.g. common ligatures
final           U+FEB2            U+0633                   Final presentation forms, especially Arabic
font            U+2102            U+0043                   Font variants, e.g. mathematical set letters, Hebrew ligatures
fraction        U+00BC            U+0031 U+2044 U+0034     Vulgar fraction forms
initial         U+FEB3            U+0633                   Initial presentation forms, especially Arabic
isolated        U+FD0E            U+0633 U+0631            Isolated presentation forms, especially Arabic
medial          U+FEB4            U+0633                   Medial presentation forms, especially Arabic
narrow          U+FF66            U+30F2                   Narrow (hankaku) compatibility characters
nobreak         U+00A0            U+0020                   Non-breaking characters
none            (leaves all characters unmodified)         Disable all decompositions which are not explicitly specified in the decompose option list.
small           U+FE50            U+002C                   Small forms for CNS 11643 compatibility
square          U+3314            U+30AD U+30ED            CJK squared font variants
sub             U+2081            U+0031                   Subscript forms
super           U+00AA            U+0061                   Superscript forms
                U+2122            U+0054 U+004D
vertical        U+FE37            U+007B                   Vertical layout presentation forms
wide            U+FFE1            U+00A3                   Wide (zenkaku) compatibility forms
In contrast, the following option list enables all decompositions (since omitting the other options activates the default):
decompose={circle=_all}
Default decompositions. By default, all decompositions except fraction are enabled.
While most default decompositions operate on the _all domain (i.e. they will be applied
to all characters), some operate on smaller default domains according to Table 7.6. A
straightforward way of dealing with decompositions is via normalization (see Section
7.3.3, »Unicode Normalization«, page 105). Since Unicode postprocessing is disabled for
granularity=glyph no decompositions are active in this case.
Table 7.6 Default domains for Unicode decompositions (suboptions for the decompose option)
Columns: decomposition; default in TET

canonical
  canonical={[U+0374 U+037E U+0387 U+1FBE U+1FEF U+1FFD U+2000 U+2001 U+2126 U+212A
  U+212B U+2329-U+232A]}
  The default domain includes canonical duplicates (singletons), but not other canonically
  equivalent characters. The default is not _all in order to preserve characters like Ä (U+00C4).
compat
  compat={[U+FB00-U+FB17]}
  The default domain includes Latin and Armenian ligatures, but not other compatibility
  characters. The default is not _all in order to preserve characters like Ĳ (U+0132).
fraction
  fraction=_none
  Fractions are not decomposed by default because this would lead to undesired sequences of the
  digits for integer and fractional parts, e.g. client applications would wrongly interpret the
  sequence U+0039 U+00BD (representing the numerical value 9.5) as U+0039 U+0031 U+2044 U+0032
  which represents the numerical value (91)/2=45.5.
sub, super
  sub={[U+208A-U+208E]}
  super={[U+207A-U+207E]}
  The default domain includes only mathematical signs. Superscript and subscript digits are not
  decomposed by default to avoid problems with the numerical interpretation similar to those
  mentioned above for fraction. Characters such as the trademark sign ™ (U+2122) will not be
  decomposed to U+0054 U+004D by default.
all others
  circle=_all final=_all ... vertical=_all wide=_all
  All other decompositions are enabled for all characters by default.
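The numerical problem with decomposed fractions and superscript digits can be reproduced outside of TET. The following Python sketch uses the NFKD normalization form of the standard unicodedata module, which applies the fraction and super decompositions to all characters; it illustrates the effect and does not call TET:

```python
import unicodedata

# "9" followed by VULGAR FRACTION ONE HALF: reads as 9.5
mixed_number = "9\u00bd"
# "3" followed by SUPERSCRIPT FIVE: reads as 3 to the power of 5
power = "3\u2075"

# After compatibility decomposition the digits run together:
# "9", "1", FRACTION SLASH, "2" can be misread as 91/2,
# and "3", "5" can be misread as 35.
decomposed_number = unicodedata.normalize("NFKD", mixed_number)
decomposed_power = unicodedata.normalize("NFKD", power)
```

This is why an application which evaluates numbers in the extracted text is usually better served by leaving fractions and superscript or subscript digits intact.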
7.3.3 Unicode Normalization
The Unicode standard defines four normalization forms which are based on the notions
of canonical equivalence and compatibility equivalence (these are discussed in Section
7.3.2, »Unicode Decomposition«, page 101). All normalization forms put combining
marks in a specific order and apply decomposition and composition in different ways:
> Normalization Form C (NFC) applies canonical decomposition followed by canonical
composition.
> Normalization Form D (NFD) applies canonical decomposition.
> Normalization Form KC (NFKC) applies compatibility decomposition followed by canonical composition.
> Normalization Form KD (NFKD) applies compatibility decomposition.
The normalization forms are specified in Unicode Standard Annex #15 »Unicode Normalization Forms« (see www.unicode.org/versions/Unicode8.0.0/ch03.pdf#G21796 and
www.unicode.org/reports/tr15/).
TET supports all four Unicode normalization forms. Unicode normalization can be
controlled via the normalize document option, e.g.
normalize=nfc
TET does not apply normalization by default. Because of the possible interaction between the decompose and normalize options, setting the normalize option to a value different from none disables the default decompositions.
The choice of normalization form depends on the application’s requirements. For
example, some databases expect text in NFC which is also the preferred format for Unicode text on the Web. Table 7.7 demonstrates the effect of Normalization on various
characters.
Table 7.7 Examples for Unicode normalization forms
before normalization   NFC              NFD              NFKC                   NFKD
U+00C4                 U+00C4           U+0041 U+0308    U+00C4                 U+0041 U+0308
U+0041 U+0308          U+00C4           U+0041 U+0308    U+00C4                 U+0041 U+0308
U+0308 U+0041          U+0308 U+0041    U+0308 U+0041    U+0308 U+0041          U+0308 U+0041
U+FB01                 U+FB01           U+FB01           U+0066 U+0069          U+0066 U+0069
U+0033 U+2075          U+0033 U+2075    U+0033 U+2075    U+0033 U+0035          U+0033 U+0035
U+212B                 U+00C5           U+0041 U+030A    U+00C5                 U+0041 U+030A
U+2122                 U+2122           U+2122           U+0054 U+004D          U+0054 U+004D
U+2163                 U+2163           U+2163           U+0049 U+0056          U+0049 U+0056
U+FB48                 U+05E8 U+05BC    U+05E8 U+05BC    U+05E8 U+05BC          U+05E8 U+05BC
U+AC00                 U+AC00           U+1100 U+1161    U+AC00                 U+1100 U+1161
U+3062                 U+3062           U+3061 U+3099    U+3062                 U+3061 U+3099
U+32C9                 U+32C9           U+32C9           U+0031 U+0030 U+6708   U+0031 U+0030 U+6708
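The normalization examples can be verified with any conforming Unicode implementation. The following Python sketch uses the standard unicodedata module rather than TET; the helper name normalization_forms is chosen only for this illustration:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalization_forms(text: str) -> dict:
    """Return the code points of text in each of the four normalization forms."""
    return {
        form: [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, text)]
        for form in FORMS
    }

angstrom = normalization_forms("\u212b")   # ANGSTROM SIGN
resh = normalization_forms("\ufb48")       # HEBREW LETTER RESH WITH DAGESH
hangul = normalization_forms("\uac00")     # HANGUL SYLLABLE GA
october = normalization_forms("\u32c9")    # telegraph symbol for October
```

Note that U+FB48 is decomposed even in NFC and NFKC: the character is excluded from composition in the Unicode standard, so the canonical composition step does not restore it.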
7.4 Supplementary Characters and Surrogates
Supplementary characters outside Unicode’s Basic Multilingual Plane (BMP), i.e. those
with Unicode values above U+FFFF, cannot be expressed as a single UTF-16 value, but require a pair of UTF-16 values called a surrogate pair. Examples of supplementary characters include various mathematical and musical symbols at U+1DXXX as well as thousands of CJK extension characters starting at U+20000. TET also uses the Supplementary
Private Use Area to assign Unicode values to glyphs for which no Unicode mapping was
found in the PDF document. By default, these characters are replaced with the Unicode
replacement character U+FFFD. However, with the option unknownchar=preserve they
can occur in the output as Unicode values outside the BMP, i.e. values above U+FFFF (see
»Unmappable glyphs and the TET PUA«, page 108).
TET interprets and maintains supplementary characters and provides access to the
corresponding UTF-32 value even in language bindings where native Unicode strings
support only UTF-16. The uv field returned by TET_get_char_info( ) for the leading surrogate value contains the corresponding UTF-32 value. This allows direct access to the
UTF-32 value of a supplementary character even if you are working in a UTF-16 environment without any support for UTF-32.
Leading (high) surrogates and trailing (low) surrogates are maintained. In UTF-16 output, the string returned by TET_get_text( ) contains two UTF-16 values for each supplementary character.
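The relationship between a supplementary character's UTF-32 value and its surrogate pair follows from the UTF-16 definition in the Unicode standard. The following Python sketch shows the arithmetic; the function names are hypothetical and the code does not use TET:

```python
def to_surrogates(cp: int):
    """Split a supplementary code point into (leading, trailing) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                  # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(leading: int, trailing: int) -> int:
    """Combine a surrogate pair into the corresponding UTF-32 value."""
    if not (0xD800 <= leading <= 0xDBFF and 0xDC00 <= trailing <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((leading - 0xD800) << 10) + (trailing - 0xDC00)

clef_pair = to_surrogates(0x1D11E)   # MUSICAL SYMBOL G CLEF
cjk_pair = to_surrogates(0x20000)    # first CJK Extension B character
```

A client which only has access to UTF-16 values can use the second function to recover the UTF-32 value if it does not retrieve the uv field.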
7.5 Unicode Mapping for Glyphs
While text in PDF can be represented with a variety of font and encoding schemes, TET
abstracts from glyphs and normalizes all text to Unicode characters, regardless of the
original text representation in the PDF. Converting the information found in the PDF to
the corresponding Unicode values is called Unicode mapping, and is crucial for understanding the semantics of the text (as opposed to rendering a visual representation of
the text on screen or paper). In order to provide proper Unicode mapping TET consults
various data structures which are found in the PDF document, embedded or external
font files, as well as builtin and user-supplied tables. In addition, it applies several
methods to determine the Unicode mapping for non-standard glyph names.
Despite all efforts there are still a few PDF documents where some text cannot be
mapped to Unicode. In order to deal with these cases TET offers a number of configuration features which can be used to control Unicode mapping for problematic PDF files.
Unmappable glyphs and the TET PUA. There are several reasons why text in a PDF cannot reliably be mapped to Unicode. For example, Type 1 fonts may contain unknown
glyph names, and TrueType, OpenType, or CID fonts may be addressed with glyph ids
without any Unicode values in the font or PDF. If TET cannot determine a Unicode value
after examining the information in the PDF document, embedded and external fonts,
configured tables and internal tables, the glyph is considered unmappable.
Unmapped glyphs can be identified with the unknown member of the TET_char_info
structure (see Table 10.16, page 192) or the Glyph/@unknown attribute in TETML.
TET assigns decreasing values in the TET Private Use Area (TET PUA) to all unmappable glyphs. The TET PUA is located in the Supplementary Private Use Area, i.e. outside the
BMP, to avoid conflicts with PUA values assigned in fonts. The TET PUA can be addressed
with the keyword _tetpua as source in the fold option.
By default, TET PUA values for unmappable glyphs are replaced with the Unicode replacement character U+FFFD. This behavior can be modified with the unknownchar document option which can be set to an arbitrary Unicode character, or to specify that TET
PUA values for unmappable glyphs are preserved or removed. Table 7.2 explains various
combinations of the fold and unknownchar options for different use cases.
Table 7.8 Specifying treatment of TET PUA values for unmappable glyphs with the unknownchar document option
Columns: description and option list; raw input; result

  Default behavior: replace TET PUA values for unmappable glyphs with the Unicode replacement character U+FFFD:
    unknownchar=U+FFFD
    raw input: (TET PUA value)    result: U+FFFD
  Replace TET PUA values for unmappable glyphs with a question mark (or any other suitable Unicode character); this may be useful for visually identifying problematic glyphs in the text:
    unknownchar=?
    raw input: (TET PUA value)    result: U+003F
  Remove TET PUA values for unmappable glyphs:
    unknownchar=remove
    raw input: (TET PUA value)    result: n/a
  Preserve TET PUA values for unmappable glyphs; this may be useful for debugging and analysis:
    unknownchar=preserve
    raw input: (TET PUA value)    result: (TET PUA value)
Characters in the Private Use Area (PUA). A font or PDF document may map a glyph to
a Unicode character in the Private Use Area. This is commonly used for symbols without
any global standardized meaning, such as fonts for Japanese end-user defined characters (EUDC) or logo fonts. Since PUA characters cannot meaningfully be used in generic
Unicode workflows they are replaced with the Unicode replacement character U+FFFD
by default. See Table 7.2 for preserving PUA values in situations where the applications
can handle PUA values.
Summary of Unicode mapping controls. TET implements many workarounds in
order to process PDF documents which don’t actually contain Unicode values, so that it
can still extract the text successfully. However, there are documents where
the text cannot be extracted since not enough information is available in the PDF and
relevant font data structures. TET contains various configuration features which can be
used to supply additional Unicode mapping information. These features are detailed in
this section.
Using the glyphmapping option of TET_open_document( ) (see Section 10.3, »Document
Functions«, page 172) you can control Unicode mapping for glyphs in several ways. The
following list gives an overview of available methods (which can be combined). These
controls can be applied on a per-font basis or globally for all fonts in a document:
> The suboption forceencoding can be used to completely override all occurrences of
the predefined PDF encodings WinAnsiEncoding or MacRomanEncoding.
> The suboptions codelist and tounicodecmap can be used to supply Unicode values in a
simple text format (a codelist resource) or as a ToUnicode CMap resource.
> The suboption glyphlist can be used to supply Unicode values for non-standard glyph
names.
> The suboption glyphrule can be used to define a rule which will be used to derive Unicode values from numerical glyph names in an algorithmic way. Several rules are already built into TET. The option encodinghint can be used to control the internal
rules.
> In addition to dozens of predefined encodings, custom encodings can be defined for
use with the encodinghint option or the encoding suboption of the glyphrule option.
> External fonts can be configured to provide Unicode mapping information if the
PDF does not provide enough information and the font is not embedded in the PDF.
Analyzing PDF documents with the PDFlib FontReporter Plugin1. In order to obtain
the information required to create appropriate Unicode mapping tables you must analyze the problematic PDF documents.
PDFlib GmbH provides a free companion product to TET which assists in this situation: PDFlib FontReporter is an Adobe Acrobat plugin for easily collecting font, encoding, and glyph information. The plugin creates detailed font reports containing the actual glyphs along with the following information:
> The corresponding code: the first hex digit is given in the left-most column, the second hex digit is given in the top row. For CID fonts the offset printed in the header
must be added to obtain the code corresponding to the glyph.
> The glyph name if present.
> The Unicode value(s) corresponding to the glyph (if Acrobat can determine them).
1. The PDFlib FontReporter plugin is available for free download at www.pdflib.com/products/fontreporter
Fig. 7.2
Sample font reports created with the PDFlib FontReporter plugin for Adobe Acrobat
These pieces of information play an important role for TET’s glyph mapping controls.
Figure 7.2 shows two pages from a sample font report. Font reports created with the
FontReporter plugin can be used to analyze PDF fonts and create mapping tables for
successfully extracting the text with TET. It is highly recommended to take a look at the
corresponding font report if you want to write Unicode mapping tables or glyph name
heuristics to control text extraction with TET.
Precedence rules. TET will apply the glyph mapping controls in the following order:
> Codelist and ToUnicode CMap resources are consulted first.
> If the font has an internal ToUnicode CMap it is considered next.
> For glyph names TET applies an external or internal glyph name mapping rule if one
is available which matches the font and glyph name.
> Lastly, a user-supplied glyph list is applied.
Code list resources for all font types. Code lists are similar to glyph lists except that
they specify Unicode values for individual codes instead of glyph names. Although
multiple fonts from the same foundry may use identical code assignments, codes (also
called glyph ids) are generally font-specific. As a consequence, separate code lists will be
required for individual fonts. A code list is a text file where each line describes a Unicode
mapping for a single code according to the following rules:
> Text after a percent sign ’%’ will be ignored; this can be used for comments.
Fig. 7.3
The font report for a logotype font shows that the font contains wrong Unicode mappings.
A custom code list can correct such mappings.
> The first column contains the glyph code in decimal or hexadecimal notation. This
must be a value in the range 0-255 for simple fonts, and in the range 0-65535 for CID
fonts.
> The remainder of the line contains up to 7 Unicode code points for the code. The values can be supplied in decimal notation or (with the prefix x or 0x) in hexadecimal
notation. UTF-32 is supported, i.e. surrogate pairs can be used.
By convention, code lists use the file name suffix .cl. Code lists can be configured with
the codelist resource category. If no code list resource has been specified explicitly, TET
will search for a file named <mycodelist>.cl (where <mycodelist> is the resource name) in
the searchpath hierarchy (see Section 5.2, »Resource Configuration and File Searching«,
page 59 for details). In other words: if the resource name and the file name (without the
.cl suffix) are identical you don’t have to configure the resource since TET will implicitly
do the equivalent of the following call (where name is an arbitrary resource name):
set_option("codelist {name name.cl}");
The following sample demonstrates the use of code lists. Consider the mismapped logotype glyphs in Figure 7.3 where a single glyph of the font actually represents multiple
characters, and all characters together create the company logotype. However, the
glyphs are wrongly mapped to the characters a, b, c, d, and e. In order to fix this you
could create the following code list:
% Unicode mappings for codes in the GlobeLogosOne font
x61   x0054 x0068 x0065 x0020         % The
x62   x0042 x006F                     % Bo
x63   x0073 x0074 x006F x006E x0020   % ston
x64   x0047 x006C x006F               % Glo
x65   x0062 x0065                     % be
Then supply the codelist with the following option to TET_open_document( ) (assuming
the code list is available in a file called GlobeLogosOne.cl and can be found via the search
path):
glyphmapping {{fontname=GlobeLogosOne codelist=GlobeLogosOne}}
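The code list rules above are simple enough to check a file before handing it to TET. The following is a minimal sketch of such a check, not TET's own parser: the class and method names (CodeListSketch, parse, parseNumber) are hypothetical, and it assumes that a line with a code but no Unicode values should be rejected, which the rules above do not state explicitly. It does not verify the 0-255 or 0-65535 code ranges.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative code list reader; hypothetical helper, not part of the TET API. */
final class CodeListSketch {

    /** Parses a number in decimal notation, or in hexadecimal with an x or 0x prefix. */
    static int parseNumber(String token) {
        if (token.startsWith("0x")) {
            return Integer.parseInt(token.substring(2), 16);
        }
        if (token.startsWith("x")) {
            return Integer.parseInt(token.substring(1), 16);
        }
        return Integer.parseInt(token);
    }

    /** Maps each glyph code to the text formed by its Unicode code points. */
    static Map<Integer, String> parse(String codeListText) {
        Map<Integer, String> mappings = new LinkedHashMap<>();
        for (String rawLine : codeListText.split("\\R")) {
            // Text after a percent sign is a comment.
            int comment = rawLine.indexOf('%');
            String line = (comment >= 0 ? rawLine.substring(0, comment) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] tokens = line.split("\\s+");
            // ASSUMPTION: one code followed by 1 to 7 Unicode values per line.
            if (tokens.length < 2 || tokens.length > 8) {
                throw new IllegalArgumentException("Expected a code and 1-7 Unicode values: " + rawLine);
            }
            StringBuilder text = new StringBuilder();
            for (int i = 1; i < tokens.length; i++) {
                text.appendCodePoint(parseNumber(tokens[i]));
            }
            mappings.put(parseNumber(tokens[0]), text.toString());
        }
        return mappings;
    }
}
```

Applied to the GlobeLogosOne sample, code x61 maps to the four characters "The " (including the trailing space), which matches the comment on that line.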
ToUnicode CMap resources for all font types. PDF supports a data structure called
ToUnicode CMap which can be used to provide Unicode values for the glyphs of a font.
If this data structure is present in a PDF file TET will use it. Alternatively, a ToUnicode
CMap can be supplied in an external file. This is useful when a ToUnicode CMap in the
PDF is incomplete, contains wrong entries, or is missing. A ToUnicode CMap will take
precedence over a code list. However, code lists use an easier format than ToUnicode
CMaps, so they are the preferred format.
By convention, CMaps don’t use any file name suffix. ToUnicode CMaps can be configured with the cmap resource category (see Section 5.2, »Resource Configuration and
File Searching«, page 59). The contents of a cmap resource must adhere to the standard
CMap syntax.1 In order to apply a ToUnicode CMap to all fonts in the Warnock family use
the following option to TET_open_document( ):
glyphmapping {{fontname=Warnock* tounicodecmap=warnock}}
Glyph list resources for simple fonts. Glyph lists (short for: glyph name lists) can be
used to provide custom Unicode values for non-standard glyph names, or override the
existing values for standard glyph names. A glyph list is a text file where each line describes a Unicode mapping for a single glyph name according to the following rules:
> Text after a percent sign ’%’ will be ignored; this can be used for comments.
> The first column contains the glyph name. Any glyph name used in a font can be
used (i.e. even the Unicode values of standard glyph names can be overridden). In order to use the percent sign as part of a glyph name the sequence \% must be used
(since the percent sign serves as the comment introducer).
> At most one mapping for a particular glyph name is allowed; multiple mappings for
the same glyph name will be treated as an error.
> The remainder of the line contains up to 7 Unicode code points for the glyph name.
The values can be supplied in decimal notation or (with the prefix x or 0x) in hexadecimal notation. UTF-32 is supported, i.e. surrogate pairs can be used.
> Unprintable characters in glyph names can be inserted by using escape sequences
for text files (see Section 5.2, »Resource Configuration and File Searching«, page 59).
By convention, glyph lists use the file name suffix .gl. Glyph lists can be configured with
the glyphlist resource. If no glyph list resource has been specified explicitly, TET will
search for a file named <myglyphlist>.gl (where <myglyphlist> is the resource name) in the
searchpath hierarchy (see Section 5.2, »Resource Configuration and File Searching«, page
59, for details). In other words: if the resource name and the file name (without the .gl
suffix) are identical you don’t have to configure the resource since TET will implicitly do
the equivalent of the following call (where name is an arbitrary resource name):
set_option("glyphlist {name name.gl}");
Due to the precedence rules for glyph mapping, glyph lists will not be consulted if the
font contains a ToUnicode CMap. The following sample demonstrates the use of glyph
lists:
% Unicode values for glyph names used in TeX documents
precedesequal    0x227C
similarequal     0x2243
negationslash    0x2044
union            0x222A
prime            0x2032
In order to apply a glyph list called tarski.gl to all font names starting with CMSY use the
following option for TET_open_document( ):
glyphmapping {{fontname=CMSY* glyphlist=tarski}}
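Glyph lists add two rules that code lists do not have: the \% escape inside glyph names, and the ban on duplicate glyph names. The sketch below illustrates both under stated assumptions. GlyphListSketch and its methods are hypothetical names, the escape sequences for unprintable characters are not handled, and the code is not TET's actual parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative glyph list reader; hypothetical helper, not part of the TET API. */
final class GlyphListSketch {

    /** Removes the comment; a percent sign preceded by a backslash does not start a comment. */
    static String stripComment(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '%' && (i == 0 || line.charAt(i - 1) != '\\')) {
                return line.substring(0, i);
            }
        }
        return line;
    }

    /** Parses a number in decimal notation, or in hexadecimal with an x or 0x prefix. */
    static int parseNumber(String token) {
        if (token.startsWith("0x")) {
            return Integer.parseInt(token.substring(2), 16);
        }
        if (token.startsWith("x")) {
            return Integer.parseInt(token.substring(1), 16);
        }
        return Integer.parseInt(token);
    }

    /** Maps each glyph name to the text formed by its Unicode code points. */
    static Map<String, String> parse(String glyphListText) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String rawLine : glyphListText.split("\\R")) {
            String line = stripComment(rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] tokens = line.split("\\s+");
            // ASSUMPTION: one glyph name followed by 1 to 7 Unicode values per line.
            if (tokens.length < 2 || tokens.length > 8) {
                throw new IllegalArgumentException("Expected a name and 1-7 Unicode values: " + rawLine);
            }
            String glyphName = tokens[0].replace("\\%", "%");
            StringBuilder text = new StringBuilder();
            for (int i = 1; i < tokens.length; i++) {
                text.appendCodePoint(parseNumber(tokens[i]));
            }
            // Multiple mappings for the same glyph name are an error.
            if (mappings.put(glyphName, text.toString()) != null) {
                throw new IllegalArgumentException("Duplicate mapping for glyph name " + glyphName);
            }
        }
        return mappings;
    }
}
```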
1. See partners.adobe.com/public/developer/en/acrobat/5411.ToUnicode.pdf
Rules for interpreting numerical glyph names in simple fonts. Sometimes PDF documents contain glyphs with names which are not taken from some predefined list, but
are generated algorithmically. This can be a »feature« of the application generating the
PDF, or may be caused by a printer driver which converts fonts to another format: sometimes the original glyph names get lost in the process, and are replaced with schematic
names such as G00, G01, G02, etc. TET contains builtin glyph name rules for processing
numerical glyph names created by various common applications and drivers. Since the
same glyph names may be created for different encodings you can provide the
encodinghint option to TET_open_document( ) in order to specify the target encoding for
schematic glyph names encountered in the document. For example, if you know that
the document contains Russian text, but the text cannot successfully be extracted for
lack of information in the PDF, you can supply the option encodinghint=cp1251 to specify a Cyrillic codepage.
In addition to the builtin rules for interpreting numerical glyph names you can define custom rules with the fontname and glyphrule suboptions of the glyphmapping option of TET_open_document( ). You must supply the following pieces of information:
> The full or abbreviated name of the font to which the rule will be applied (fontname
option)
> A prefix for the glyph names, i.e. the characters before the numerical part (prefix suboption)
> The base (decimal or hexadecimal) in which the numbers will be interpreted (base
suboption)
> The encoding in which to interpret the resulting numerical codes (encoding suboption)
For example, if you determined (e.g. using PDFlib FontReporter) that the glyphs in the
fonts T1, T2, T3, etc. are named c00, c01, c02, ..., cFF where each glyph name corresponds to
the WinAnsi character at the respective hexadecimal position (00, ..., FF) use the following option for TET_open_document( ):
glyphmapping {{fontname=T* glyphrule={prefix=c base=hex encoding=winansi} }}
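To make the effect of such a rule concrete, the following sketch shows how a prefix, a base and an encoding could turn a schematic glyph name into text. It only illustrates the idea and is not TET's implementation. GlyphRuleSketch and apply are hypothetical names, and Java's windows-1252 charset is used as an approximate stand-in for the winansi encoding.

```java
import java.nio.charset.Charset;
import java.util.Optional;

/** Illustrative glyph name rule; hypothetical helper, not part of the TET API. */
final class GlyphRuleSketch {

    /**
     * Strips the prefix from the glyph name, reads the remainder as a number in the
     * given radix (10 or 16), and decodes that single byte with the given charset.
     * Returns an empty result if the name does not match the rule.
     */
    static Optional<String> apply(String glyphName, String prefix, int radix, Charset encoding) {
        if (!glyphName.startsWith(prefix) || glyphName.length() == prefix.length()) {
            return Optional.empty();
        }
        int code;
        try {
            code = Integer.parseInt(glyphName.substring(prefix.length()), radix);
        } catch (NumberFormatException e) {
            return Optional.empty(); // the part after the prefix is not numerical
        }
        if (code < 0 || code > 255) {
            return Optional.empty(); // simple fonts only use codes 0-255
        }
        return Optional.of(new String(new byte[] {(byte) code}, encoding));
    }
}
```

With prefix c, radix 16 and windows-1252, the glyph name c41 yields the letter A, while a name such as G00 does not match the prefix and yields no result.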
External font files and system fonts. If a PDF does not contain sufficient information
for Unicode mapping and the font is not embedded, you can configure additional font
data which TET will use to derive Unicode mappings. Font data may come from a TrueType or OpenType font file on disk, which can be configured with the fontoutline resource
category. As an alternative on OS X and Windows systems, TET can access fonts which
are installed on the host operating system. Access to these host fonts can be disabled
with the usehostfonts option in TET_open_document( ).
In order to configure a disk file for the WarnockPro font use the following call:
set_option("fontoutline {WarnockPro WarnockPro.otf}");
See Section 5.2, »Resource Configuration and File Searching«, page 59, for more details
on configuring external font files.
8 Image Extraction
8.1 Image Extraction Basics
Image formats. TET extracts raster images from PDF pages and stores the extracted
images in one of the following formats:
> TIFF (.tif) images are created in most cases. The majority of TIFF images created by
TET is compatible with all TIFF viewers and consumers. However, some advanced
TIFF features are not supported by all image viewers, especially additional spot color
channels (see »Spot colors«, page 126). We regard Adobe Photoshop as benchmark for
the validity of TIFF images. Note that the Windows XP image viewer does not support the common Flate compression method in TIFF. In order to work around this
viewer restriction you can enable LZW compression with the option preferredtiffcompression=lzw in TET_write_image_file( ) or TET_get_image_data( ).
> JPEG (.jpg) is created for images which are compressed with the JPEG algorithm
(DCTDecode filter) in PDF. JPEG-compressed image data in the PDF document is validated unless the validation has been disabled with the option validatejpeg=false in
TET_write_image_file( ) or TET_get_image_data( ), which may slightly speed up processing. In some cases DCT-compressed images are extracted as TIFF since not all PDF
color spaces can be expressed in JPEG (e.g. spot colors).
> JPEG 2000 is created for images which are compressed with the JPEG 2000 algorithm
(JPXDecode filter) in PDF. JPEG 2000 images come in different flavors. The main flavor with MIME type image/jp2 and file name suffix .jp2 is encoded according to ISO
15444-1 (Annex I). The extended flavor with MIME type image/jpx is encoded according to ISO 15444-2 (Annex M). It supports additional features such as CMYK and Lab
color and uses .jpf as file name suffix (note the difference between MIME type and
recommended suffix). Finally, raw JPEG 2000 code streams contain only the bare
pixel data without any additional properties such as colorspace information. They
are extracted with file name suffix .j2k.
Applications which cannot handle JPEG 2000 output can avoid this extraction format with the document option allowjpeg2000=false. In this case 8-bit or 16-bit TIFF
images are created instead of JPEG 2000, which may result in larger output. TIFF images for JPX-compressed data are also created if spot color information must be preserved or if image merging is involved. If a JPX-compressed image is extracted as
TIFF, implicit internal ICC profiles in the JPX stream are ignored. For example, sRGB
JPEG 2000 images are extracted as plain RGB TIFF.
> JBIG2 (.jbig2) is created for images which are compressed with the JBIG2 algorithm
(JBIG2Decode filter) in PDF. JBIG2 files are created with »sequential organization« according to ISO 14492.
Extracting images to disk or memory. The TET API can deliver the images extracted
from PDF documents in two different ways:
> The TET_write_image_file( ) API function creates an image file on disk. The base file
name of this image file must be specified in the filename option. TET will automatically add a suitable suffix depending on the image format.
> The TET_get_image_data( ) API function delivers the image data in memory. This is
convenient if you want to pass on the image data to another processing component
without having to deal with disk files.
Details depend on your image extraction requirements (see Section 8.2.2, »Page-based
and Resource-based Image Retrieval«, page 119). In both cases you can determine the
type of the extracted image (see next section).
Determine the file format and name of extracted images. The image file type is reported in the Image/@extractedAs attribute in TETML. At the API level you can use the
following code to determine the type of an extracted image:
int imageType = tet.write_image_file(doc, tet.imageid, "typeonly");
/* Map the numerical image type to a format suffix */
String imageSuffix;
switch (imageType) {
    case 10:
        imageSuffix = ".tif";
        break;
    case 20:
        imageSuffix = ".jpg";
        break;
    case 31:
        imageSuffix = ".jp2";
        break;
    case 32:
        imageSuffix = ".jpf";
        break;
    case 33:
        imageSuffix = ".j2k";
        break;
    case 50:
        imageSuffix = ".jbig2";
        break;
    default:
        System.err.println("write_image_file() returned unknown value "
            + imageType + ", skipping image, error: "
            + tet.get_errmsg());
}
The image file name is reported in the Image/@filename attribute in TETML. At the API
level you can supply the image file name to TET_write_image_file( ).
The structure of the image file names produced by the TET command-line tool is
documented in Section 2.1, »Command-Line Options«, page 17.
XMP metadata for images. PDF uses the XMP format to attach metadata to the whole
document or parts of it. You can find more information about XMP and its use in PDF at
the following location: www.pdflib.com/knowledge-base/xmp-metadata/
An image object may have XMP metadata associated with it in the PDF document.
You can check the presence of image XMP in Acrobat XI/DC as follows:
> Click View, Show/Hide, Navigation Panes, Content.
> Locate the image in the tree structure, right-click on it and select Show Metadata... .
> The image is highlighted and the XMP panel pops up which displays XMP metadata
for the selected image.
If XMP metadata is present, TET by default embeds it in the extracted image for the output formats JPEG and TIFF. This behavior can be controlled with the keepxmp option of
TET_write_image_file( ) and TET_get_image_data( ). If this option has been set to false, TET
ignores image metadata when generating the image output file.
If image metadata is available, TET attaches a Metadata element to the image in the
TETML output. This behavior can be controlled with the tetml={elements={metadata}}
image option.
The image_metadata topic in the pCOS Cookbook demonstrates how to extract image
metadata with the pCOS interface directly, without generating any image file.
TET implements a special heuristic for XMP image metadata which bypasses the
usual PDF method for attaching XMP to an image object, but uses an alternate method
based on marked content properties. This construct is typically generated by Adobe InDesign. Note that this kind of image XMP is not available via pCOS, but only in TETML
and the extracted image files.
Restrictions. In some cases the shape of extracted images may appear different from
the PDF page:
> Images may appear mirrored horizontally or vertically (upside down). This is caused
by the fact that TET extracts the original pixel data of the image, without respect to
any transformation which may have been applied to the image on the PDF page.
> Masking effects achieved by applying a soft mask to another image are not visible in
the extracted image. However, you can extract the mask as a separate image.
> Inline images cannot be extracted, i.e. TET_write_image_file( ) returns -1. Inline images are a rare flavor of PDF images; they are sometimes used for small raster images or
glyphs in Type 3 fonts.
8.2 Extracting Images
8.2.1 Placed Images and Image Resources
TET distinguishes between placed images and image resources:
> A placed image corresponds to an image on a page. A placed image has geometric
properties: it is placed at a certain location and has a size (measured in points, millimeters, or some other absolute unit). In most cases the image is visible on the page,
but in some cases it may be invisible because it is obscured by other objects on the
page, is placed outside the visible page area, is fully or partially clipped, etc. Placed
images are represented by the PlacedImage element in TETML. Processing of placed
images is subject to the clippingarea, excludebox, and includebox options.
> An image resource is a resource which represents the actual pixel data, color space
and number of components, number of bits per component, etc. Unlike placed images, image resources don’t have any intrinsic geometry. However, they do have width
and height properties (measured in pixels). Each image resource has a unique ID
which can be used to extract its pixel data. Image resources are represented by the
Image element in TETML. Processing of image resources is not subject to the
clippingarea, excludebox, and includebox options.
An image resource may be used as the basis for an arbitrary number of placed images in
the document. Commonly each image resource will be placed exactly once, but it could
also be placed repeatedly on the same page or on multiple pages. For example, consider
an image for a company logo which is used repeatedly on the header of each page in the
document. Each logo on a page constitutes a placed image, but all those placed images
may be created by the same image resource in an optimized PDF. On the other hand, in
a non-optimized PDF each placed logo could be based on its own copy of the same image resource. This would result in the same visual appearance, but a larger PDF document. Non-optimized PDF documents may even contain image resources which are not
referenced on any page (i.e. unused resources).
Table 8.1 compares various aspects of placed images and image resources.
How many images are in a document? Surprisingly, there is no simple answer to this
question. The answer depends on the following decisions:
> Do you want to count image resources or placed images?
> Do you want to count images which are only used as parts of merged images, but are
never placed isolated?
> Do you want to count images which are only used as mask?
Using TET and pCOS pseudo objects you can determine all variants of the image count
answer. The image_count topic in the TET Cookbook demonstrates various possibilities
of image counting. It generates output like the following:
No of raw image resources before merging: 82
No of placed images: 12
No of images after merging (all types): 83
normal images: 1
artificial (merged) images: 1
consumed images: 81
No of relevant (normal or artificial) image resources: 2
Table 8.1 Comparison of placed images and image resources
property | placed images | image resources
TETML element | PlacedImage | Image
affected by image merging | yes | yes
associated with a page | yes | –
width and height in pixels | yes | yes
width and height in points | yes | –
image resolution can be determined | yes | –
position on the page | yes | –
number of appearances | 1 | 0, 1, or more
unique ID | no: the imageid member returned by TET_get_image_info( ) and the PlacedImage/@image attribute in TETML identify the underlying image resource | yes: imageid member returned by TET_get_image_info( ) and Image/@id attribute in TETML
file naming convention in the TET command-line tool | <filename>_p<pagenumber>_<imagenumber>.[tif|jpg|jp2|jpf|j2k|jbig2] | <filename>_I<imageid>.[tif|jpg|jp2|jpf|j2k|jbig2]
handling of image masks in the TET command-line tool | masks are extracted as <filename>_p<pagenumber>_<imagenumber>_mask.[tif|jpg|jp2|jpf|j2k|jbig2] | masks are extracted according to their own image id without additional labels in the file name
8.2.2 Page-based and Resource-based Image Retrieval
The distinction between placed images and image resources gives rise to two fundamentally different approaches to image extraction: page-based and resource-based image extraction loops. Both methods can be used to extract images to a disk file or to
memory.
Page-based image extraction loop. In this case the application is interested in the exact page layout and placed images, but doesn’t care about duplicated image data. Extracting images with a page-based loop creates an image file for each placed image, and
may result in the same image data for more than one extracted placed image. The application could avoid image duplication by checking for duplicate image IDs. However,
unique image resources can more easily be extracted with the resource-based image extraction loop (see below).
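The duplicate check mentioned above amounts to remembering which image IDs have already been written. The following sketch shows only that bookkeeping. It assumes the application has collected the imageid values of the placed images in page order, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative duplicate filter for a page-based loop; not part of the TET API. */
final class PlacedImageDedupSketch {

    /** Returns the image IDs in order of first appearance, dropping repeated placements. */
    static List<Integer> uniqueImageIds(int[] placedImageIds) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> unique = new ArrayList<>();
        for (int imageId : placedImageIds) {
            if (seen.add(imageId)) { // add() returns false for an ID that was already seen
                unique.add(imageId);
            }
        }
        return unique;
    }
}
```

In a real page-based loop the same Set would be consulted before each call that writes image data, so that every image resource is written only once.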
The page-based image extraction loop can be activated in the TET command-line
tool with the option --imageloop page. Code for page-based image extraction at the API
level is demonstrated in the images_per_page Cookbook topic and mini sample. These
samples also show how to retrieve the image geometry.
Details of the page-based image extraction loop (please refer to the sample code
mentioned above): TET_get_image_info( ) retrieves geometric information about a
placed image as well as the pCOS image ID (in the imageid field) of the underlying image
data. This ID can be used to retrieve more image details with TET_pcos_get_number( ),
such as the color space, width and height in pixels, etc., as well as the actual pixel data
with TET_write_image_file( ) or TET_get_image_data( ). TET_get_image_info( ) does not
touch the actual pixel data of the image. If the same image is referenced multiple times on
one or more pages, the corresponding IDs will be the same.
Resource-based image extraction loop. In this case the application is interested in the
image resources of the document, but doesn’t care which image is used on which page.
Image resources which are placed more than once (on one or more pages) are extracted
only once. On the other hand, image resources which are not placed at all on any page
will also be extracted.
The resource-based image extraction loop can be activated in the TET command-line
tool with the option --imageloop resource. Code for resource-based image extraction at
the API level is demonstrated in the image_resources mini sample and Cookbook topic.
Details of the resource-based image extraction loop (please refer to the sample code
mentioned above): all pages are opened before extracting image resources to make sure
that image merging is activated; if image merging is not relevant this step can be
skipped. In order to extract an image, the corresponding image ID is required. The code
enumerates all image IDs from 0 to n_images - 1, where n_images is the number of image resources, queried as follows:
n_images = (int) tet.pcos_get_number(doc, "length:images");
In order to skip the consumed parts of merged images (e.g. the strips of a multi-strip image), the type of each image resource is examined with the mergetype pCOS pseudo object. This allows us to skip images which have been consumed by the image merging
process (since we are only interested in the resulting merged image). Once an image ID
has been determined, one of the functions TET_write_image_file( ) or TET_get_image_
data( ) can be called to write the image data to a disk file or pass the pixel data in memory, respectively.
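The filtering step of the resource-based loop can be sketched as follows. To keep the example self-contained, pCOS access is passed in as a function from a pCOS path to a number. In an application this would wrap tet.pcos_get_number(doc, path). The path images[N]/mergetype and the value 2 for consumed images are assumptions that should be checked against the pCOS Path Reference, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative resource-based loop; hypothetical helper, not part of the TET API. */
final class ResourceLoopSketch {

    /** ASSUMPTION: mergetype value reported for images consumed by image merging. */
    static final int MERGETYPE_CONSUMED = 2;

    /** Returns the IDs of all image resources which were not consumed by merging. */
    static List<Integer> extractableImageIds(ToDoubleFunction<String> pcosGetNumber) {
        int nImages = (int) pcosGetNumber.applyAsDouble("length:images");
        List<Integer> ids = new ArrayList<>();
        for (int imageId = 0; imageId < nImages; imageId++) {
            // ASSUMPTION: path syntax of the mergetype pseudo object.
            String path = "images[" + imageId + "]/mergetype";
            int mergeType = (int) pcosGetNumber.applyAsDouble(path);
            if (mergeType != MERGETYPE_CONSUMED) {
                ids.add(imageId);
            }
        }
        return ids;
    }
}
```

Each returned ID can then be passed to TET_write_image_file( ) or TET_get_image_data( ). Passing the query function as a parameter also makes the loop easy to exercise without a PDF document.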
8.2.3 Geometry of Placed Images
Using TET_get_image_info( ) you can retrieve geometric information for a placed image.
The following values are available for each image in the image_info structure (see Figure
8.1):
> The x and y fields are the coordinates of the image reference point. The reference
point is usually the lower left corner of the image. However, coordinate system
transformations on the page may result in a different reference point. For example,
the image may be mirrored horizontally with the result that the reference point becomes the upper left corner of the image. The value of y is subject to the topdown
page option.
> The width and height fields correspond to the physical dimensions of the placed image on the page. They are provided in points (i.e. 1/72 inch).
> The angle alpha describes the direction of the pixel rows. This angle will be in the
range -180˚ < alpha ≤ +180˚. The angle alpha rotates the image at its reference point.
For upright images alpha will be 0˚. The values of alpha and beta are subject to the
topdown page option.
> The angle beta describes the direction of the pixel columns, relative to the perpendicular of alpha. This angle will be in the range -180˚ < beta ≤ +180˚, but different
from ±90˚. The angle beta skews the image, and beta=180˚ mirrors the image at the x
axis. For upright images beta will be in the range -90˚ < beta < +90˚. If abs(beta) > 90˚
the image is mirrored at the baseline.
> The imageid field contains the pCOS ID of the image. It can be used to retrieve detailed image information with pCOS functions and the image pixel data with TET_
write_image_file( ) or TET_get_image_data( ).
As a result of image transformations, the orientation of the extracted images may appear wrong since the extracted image data is based on the image resource in the PDF.
Any rotation or mirror transformations applied to the placed image on the PDF page are
not applied to the extracted pixel data. Instead, the original pixel data is extracted.
Calculating the coordinates of all image corners. The x and y fields retrieved with TET_
get_image_info( ) provide the coordinates of the image’s reference point which is often
located in the lower left corner of the image. Using the image’s x/y, width/height and
alpha/beta values you can calculate the coordinates of all image corners as follows:
llx = x
lly = y
lrx = x + width * cos(alpha)
lry = y + width * sin(alpha)
ulx = x + dir * height * (tan(beta)*cos(alpha) - sin(alpha))
uly = y + dir * height * (tan(beta)*sin(alpha) + cos(alpha))
urx = x + width * cos(alpha) + dir * height * (tan(beta)*cos(alpha) - sin(alpha))
ury = y + width * sin(alpha) + dir * height * (tan(beta)*sin(alpha) + cos(alpha))
with dir=1 in the default case topdown={output=false}. In topdown coordinates, i.e. if
topdown={output= true} (see »Top-down coordinate system«, page 72), you must set
dir=-1 and the corners are swapped, i.e. ll must be swapped with ul, and lr with ur.
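The corner formulas translate directly into code. The sketch below assumes that alpha and beta are supplied in degrees, as in the ranges quoted above, and converts them to radians for the trigonometric functions. The class and method names are hypothetical, and the swapping of corners in the topdown case is left to the caller.

```java
/** Illustrative corner calculation for a placed image; not part of the TET API. */
final class ImageCornersSketch {

    /**
     * Returns {llx, lly, lrx, lry, ulx, uly, urx, ury}.
     * Angles are given in degrees; dir is +1 for topdown={output=false}
     * and -1 for topdown={output=true}.
     */
    static double[] corners(double x, double y, double width, double height,
                            double alphaDeg, double betaDeg, int dir) {
        double a = Math.toRadians(alphaDeg);
        double b = Math.toRadians(betaDeg);
        // Offset from the lower edge to the upper edge, shared by ul and ur.
        double dx = dir * height * (Math.tan(b) * Math.cos(a) - Math.sin(a));
        double dy = dir * height * (Math.tan(b) * Math.sin(a) + Math.cos(a));
        double lrx = x + width * Math.cos(a);
        double lry = y + width * Math.sin(a);
        return new double[] {x, y, lrx, lry, x + dx, y + dy, lrx + dx, lry + dy};
    }
}
```

For an upright image (alpha = beta = 0) with reference point (10, 20), width 100 and height 50, and dir = 1, the upper right corner comes out at (110, 70).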
Image resolution. In order to calculate the image resolution in dpi (dots per inch) you
must divide the image width in pixels by the image width in points and multiply by 72:
while (tet.get_image_info(page) == 1) {
    String imagePath = "images[" + tet.imageid + "]";
    int width = (int) tet.pcos_get_number(doc, imagePath + "/Width");
    int height = (int) tet.pcos_get_number(doc, imagePath + "/Height");
    double xDpi = 72 * width / tet.width;
    double yDpi = 72 * height / tet.height;
    ...
}
Fig. 8.1 Image geometry: reference point (x, y), width, height, and the angles alpha and beta
Note that dpi values for rotated or skewed images may be meaningless. Full code for image dpi calculations can be found in the determine_image_resolution topic in the TET
Cookbook.
TET by default records a dummy resolution value of 72 dpi in generated TIFF images
to satisfy the TIFF specification. The dpi option of TET_write_image_file( ) can be used to
embed calculated resolution values instead. TET cannot embed calculated resolution
values automatically since a particular image may have been placed more than once,
each time with different size and therefore different resolution. The value dpi=0 can be
used to suppress the dummy resolution values.
The TET command-line tool embeds calculated resolution values when operating in
the page-based image loop.
8.3 Merging Fragmented Images
Sometimes it is not desirable to extract images exactly as they are represented in the
PDF document: in many situations what appears to be a single image is actually a collection of several smaller images which are placed adjacent to each other. There are some
common reasons for such image fragmentation:
> Some applications and drivers convert multi-strip TIFF images to fragmented PDF
images. The number of strips can range from dozens to hundreds.
> Some scanning software divides scanned pages in smaller fragments (strips or tiles).
The number of fragments is usually not more than a few dozen.
> Some applications break images into small pieces when generating print or PDF output. In extreme cases, especially documents created with Microsoft Office applications, a page may contain thousands of small image fragments.
> Some page layout programs, e.g. Adobe InDesign, cut images into smaller and sometimes irregular fragments when creating PDF output (see Figure 8.2).
TET’s image merging engine detects this situation and recombines the image parts to
form a larger and more useful image. If the merging candidates can be combined to a
larger image, they will be merged.
Fig. 8.2 Although this image is segmented into smaller parts (top), TET extracts it as a single reusable image (bottom).
In order to disable image merging use the following page option:
imageanalysis={merge={disable}}
Merged images in pCOS. Merged images can be identified by the pCOS pseudo object
images[ ]/mergetype: it has the value 1 (artificial) for merged images and 2 (consumed) for
images which have been consumed by the merging process. Consumed images should
generally be ignored by the receiving application.
In order to compensate for inaccuracies in the image locations some amount of gap
or overlap is allowed between adjacent images. By default, images are merged if the gap
or overlap is smaller than one point. This value can be modified with the following page
option:
imageanalysis={merge={gap=2}}
Larger gap/overlap values are often required when extracting images from newspapers
or magazines.
When are images merged? Analyzing and merging images on a page are triggered by
the corresponding call to TET_open_page( ). This leads to the following consequences:
> The number of entries in the pCOS images[ ] array, i.e. the value of the length:images
pseudo object, may increase: as more pages are processed, artificial images which result from image merging are added to the array. In order to extract all merged images you must therefore open all pages in the document before querying length:images
and extracting image data. Artificial (merged) images are marked with the corresponding flag artificial (numerical value 1) in the images[ ]/mergetype pseudo object.
> On the other hand, some elements in the images[ ] array may only be consumed as
parts of merged images, but are not used as images in their own right. Such entries
are never removed from the images[ ] array, but the consumed entries are marked
with the corresponding flag consumed (numerical value 2) in the images[ ]/mergetype
pseudo object.
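The consequences listed above suggest a two-step approach: open all pages first, then walk the images[ ] array and skip consumed entries. The following Java sketch shows only the filtering step. The class MergeTypeFilter is a hypothetical helper, not a TET API; it assumes that the mergetype values have already been collected, e.g. with tet.pcos_get_number(doc, "images[" + i + "]/mergetype"), and it treats every value other than 2 as an image worth extracting.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (not a TET API): select the images[] entries to extract. */
final class MergeTypeFilter {
    static final int MERGETYPE_ARTIFICIAL = 1; // image created by the merging process
    static final int MERGETYPE_CONSUMED = 2;   // fragment consumed by a merged image

    /**
     * Returns the indices of all entries which are not marked as consumed.
     * mergetypes[i] is assumed to hold the value of images[i]/mergetype.
     */
    static List<Integer> extractableImages(int[] mergetypes) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < mergetypes.length; i++) {
            if (mergetypes[i] != MERGETYPE_CONSUMED) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two fragments (indices 1 and 2) were merged into the artificial image 3
        System.out.println(extractableImages(new int[] {0, 2, 2, MERGETYPE_ARTIFICIAL}));
    }
}
```

The resulting indices could then be used as image ids for extraction; since artificial images are appended as pages are opened, the list is only complete after the last page has been processed.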
8.4 Small Image Filtering
TET ignores very small images since these are often irrelevant or useless. Since the image merging process often combines small image fragments to a larger image, small image removal is performed after image merging. Only images which cannot be merged
to form a larger image are candidates for small image removal. In addition, they must
satisfy the conditions for the height, width, and area which can be specified in the
maxheight/maxwidth/maxarea suboptions of the smallimages suboption of the imageanalysis page option. In order to completely disable small image removal use the following page option:
imageanalysis={smallimages={disable}}
Small images in pCOS. Images which have been classified as small according to the
smallimages option are ignored by TET_write_image_file( ) and TET_get_image_data( ), but
are still present in the pCOS images[ ] array. They can be identified with the pCOS pseudo object images[ ]/small.
8.5 Image Colors and Masking
8.5.1 Color Spaces
Image color fidelity. Table 6.1 provides an overview of PDF color spaces. All color spaces are supported for images. TET does not degrade image quality when extracting images:
> Raster images are never downsampled.
> The color space of an image is retained in the output. TET never applies any CMYK-to-RGB or similar color conversion.
ICC profiles. An image in PDF may have an ICC profile assigned which allows precise
color reproduction. By default, TET processes attached ICC profiles and embeds them in
the generated TIFF or JPEG image files. You can disable ICC profile embedding with the
option keepiccprofile=false in TET_write_image_file( ) and TET_get_image_data( ). This reduces the size of the image files at the expense of color fidelity. Disabling ICC profile
embedding is not recommended for workflows which need precise color representation.
Spot colors. Images in PDF may be colorized with a named color. Usually named colors
are used to specify custom spot colors, but the same mechanism can also be used to
apply a subset of CMYK process colors to an image (e.g. only the Cyan and Magenta
channels). The Separation color space in PDF holds a single named color, while the
DeviceN color space can be used to assign multiple named colors. Separation colors are
accompanied by a so-called alternate color which makes it possible to represent the color even if the spot color is not available (e.g. on a monitor). For example, if a Separation
color is called Company Red it is useful to have an alternate representation in a well-known color space such as RGB or CMYK to display the spot color on devices where
Company Red is not available as named color.
TET extracts images with Separation or DeviceN colors as follows: CMYK process color
names are identified: if a named color is called Black it is treated as process color and the
image is extracted as grayscale image. The color names Cyan, Magenta and Yellow are
also identified and the image is extracted as CMYK image. Custom spot color names, i.e.
names different from Cyan, Magenta, Yellow and Black can be handled in different ways
subject to the document option spotcolor:
> With spotcolor=convert (which is the default) spot colors are converted to the corresponding alternate color space if possible. If such a conversion is not possible this method
behaves like spotcolor=ignore (for a single custom spot color) or spotcolor=preserve
(for two or more custom spot colors).
> The option spotcolor=ignore is similar to spotcolor=convert except that images with
exactly one custom spot color are extracted as grayscale image and the spot color
name is lost.
> With spotcolor=preserve spot color names are preserved, and the image is extracted as
grayscale or CMYK image with one or more extra spot color channels. This requires
TIFF output; the generated TIFF flavor can be viewed with Adobe Photoshop and
compatible programs (see Figure 8.3). Simple TIFF viewers often ignore the extra spot
color channels.
Fig. 8.3
Adobe Photoshop displays spot color channels of TIFF images which
have been extracted with spotcolor=preserve in the Channels
window (left). Double-clicking one of the icons reveals the alternate
color (top).
Table 8.2 summarizes the output formats for different combinations of spot color
names and settings of the document option spotcolor.
Table 8.2 Output formats for images with Separation and DeviceN colors
Separation or DeviceN color names | spotcolor=ignore | spotcolor=convert | spotcolor=preserve
only Black | grayscale | grayscale | grayscale
one or more of Cyan, Magenta, Yellow, Black | CMYK (unused channels are empty) | CMYK (unused channels are empty) | CMYK (unused channels are empty)
exactly one custom spot color (i.e. different from Cyan, Magenta, Yellow, Black) | grayscale | alternate color space if possible¹ | empty grayscale channel plus a named extra channel
two or more color names and all are different from Cyan, Magenta, Yellow | alternate color space if possible¹ | alternate color space if possible¹ | grayscale channel plus one or more named extra channels
two or more color names including one or more of Cyan, Magenta, Yellow | alternate color space if possible¹ | alternate color space if possible¹ | CMYK plus one or more named extra channels
1. Behaves like spotcolor=ignore (for a single custom spot color) or spotcolor=preserve (for two or more custom spot colors) if conversion to the alternate color space is not possible.
8.5.2 Image Masks and Soft Masks
Masking information and the actual image data used for masking another image can be
retrieved with TET. PDF supports the following types of image masking:
> A stencil mask is a 1-bit image with the PDF key ImageMask. The image is used as a
stencil which is partly opaque and partly transparent: by default, color is applied
where the image has pixel value 0, and the background shines through unchanged
where the image has pixel value 1.
> A mask is a 1-bit grayscale image which is applied to another image (PDF key Mask). It
specifies which image areas shall be painted and which shall be masked out (left unchanged).
> A soft mask is a grayscale image of arbitrary bit depth which is applied to another
image (PDF key SMask). It provides a smooth transition between the masked image
and its background, creating a real transparency effect.
Since hard and soft masks differ only in bit depth, they are treated uniformly in TET.
Image masks in TETML. Image masking is handled as follows in TETML:
> Stencil masks: the TETML attribute Image/@stencilmask signals that a 1-bit image itself is used as a stencil mask.
> Masks: the TETML attribute Image/@maskid references an image mask (Mask or
SMask) which may be attached to an image. Details of the mask image can be retrieved in the mask image’s entry in the images[ ] array.
Image masks in the TET command-line tool. Image masks are handled as follows in the
TET command-line tool (information about stencil masking is not available):
> Extracting images with --imageloop page extracts all plain images as usual. Images
used as mask for one of the extracted plain images are also extracted using the suffix
_mask in the image file name.
> Extracting images with --imageloop resource extracts all plain images and mask images. The generated file names include the image/@id TETML attribute of the mask image (which is identical to the image/@maskid attribute of the masked image) so that
applications can locate the corresponding files for images referenced in TETML.
Image masks in pCOS. Image masking is handled in the pCOS pseudo object images[ ]
and TET_pcos_get_number( ) as follows:
> Images which are used as stencil mask can be identified by the images[ ]/stencilmask
pseudo object.
> If an image has a soft mask assigned the corresponding images[ ]/maskid pseudo object has a value different from -1. The value designates the image ID of the mask and
can be used to query further details of the mask using the corresponding entry in the
images[ ] array.
Image masks in the API. Image masking is handled as follows in the TET API:
> TET_get_image_info( ) enumerates only plain images which are placed on the page,
and skips masks. The imageid field in the image_info structure can be used to obtain
the image’s pCOS id, which in turn can be used to query mask and stencil mask information via pCOS as described above.
TET_write_image_file( ) and TET_get_image_data( ) can be used to retrieve the pixel
data of the mask, using the image id retrieved with the maskid pCOS object of the
masked image. This is demonstrated in the images_per_page sample. Alternatively, you
can iterate over all entries in the pCOS images[ ] array to create image files for all plain
images and mask images. This is demonstrated in the image_resources sample.
9 TET Markup Language (TETML)
9.1 Creating TETML
As an alternative to supplying the contents of a PDF document via a programming interface, TET can create XML output. We refer to the XML output created by TET as TET
Markup Language (TETML). TETML contains the text contents of the PDF pages plus
optional information such as text position, font, font size, etc. If TET detects table-like
structures on the page the tables will be expressed in TETML as a hierarchy of table, row,
and cell elements. Note that table information is not available via the TET programming
interface, but only through TETML. TETML also contains information about images and
color spaces as well as annotations, form fields, bookmarks and other interactive elements.
You can convert PDF documents to TETML with the TET command-line tool or the
TET library. In both cases there are various options available for controlling details of
TETML generation.
Creating TETML with the TET command-line tool. Using the TET command-line tool
you can generate TETML output with the --tetml option. The following command creates a TETML output document file.tetml:
tet --tetml word file.pdf
You can use various options to convert only some pages of the document, supply processing options, etc. Refer to Section 2.1, »Command-Line Options«, page 17, for more details.
Creating TETML with the TET library. Using a simple sequence of API calls you can generate TETML output with the TET library. The tetml mini sample demonstrates the canonical code sequence for generating TETML. This sample program is available in all
supported language bindings.
TETML is created page by page, which means that the client may choose to process
only a subset of pages. The TETML trailer must be created after processing the last page:
final int n_pages = (int) tet.pcos_get_number(doc, "length:pages");
/* Loop over all pages in the document */
for (int pageno = 1; pageno <= n_pages; ++pageno)
{
    tet.process_page(doc, pageno, pageoptlist);
}
/* This could be combined with the last page-related call */
tet.process_page(doc, 0, "tetml={trailer}");
If the filename option has been supplied to TET_open_document( ) the TETML output is
written to the specified disk file. Otherwise TETML is accumulated in memory and can
be fetched with TET_get_tetml( ). This can be done for the full TETML stream in a single
call (only recommended for small documents), or with multiple calls where each call retrieves a smaller chunk of the full TETML stream.
The generated TETML stream can be parsed into an XML tree using the XML support
provided by most modern programming languages. Processing the TETML tree is also
demonstrated in the tetml sample programs for language bindings with integrated XML
support.
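As a rough illustration of such XML processing, the following Java sketch parses a TETML string with the DOM parser of the Java runtime and collects the contents of all Text elements. It is not taken from the tetml sample programs: the class TetmlWords and its method are hypothetical, and the sketch assumes that the complete TETML stream fits in memory as a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Illustrative helper (not a TET API): list the contents of all Text elements. */
final class TetmlWords {
    static List<String> texts(String tetml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // TETML elements live in the TETML namespace, so parse namespace-aware
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(tetml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches any namespace, which keeps the sketch independent of the TETML version
            NodeList nodes = dom.getElementsByTagNameNS("*", "Text");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String tetml = "<TET xmlns=\"http://www.pdflib.com/XML/TET5/TET-5.0\"><Para>"
                + "<Word><Text>PDFlib</Text></Word><Word><Text>datasheet</Text></Word>"
                + "</Para></TET>";
        System.out.println(texts(tetml));
    }
}
```

For large documents a streaming parser may be a better fit than building the full DOM tree.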
What’s included in TETML? TETML output is encoded in UTF-8 (on zSeries with USS or
MVS: EBCDIC-UTF-8), and includes the following information (some of these items are
optional):
> general document information, encryption status, PDF standards, Tagged PDF etc.
> document info fields and XMP metadata
> text contents of each page (words or paragraphs; optionally lines)
> font, geometry and color of the glyphs
> layout attributes for the glyph (sub/superscript, dropcap, shadow)
> hyphenation attributes
> structure information, e.g. tables
> information about placed images on the page
> resource information, i.e. fonts, color spaces, images, ICC profiles
> interactive elements: bookmarks, named destinations, annotations, form fields, actions, and JavaScript
> anchors are provided in the text stream for easy reference of links, form fields, and
bookmark targets
> digital signatures
> error messages if an exception occurred during PDF processing
Various elements and attributes in TETML are optional. See Section 9.3, »Controlling
TETML Details«, page 135, for details.
9.2 TETML Examples
The TETML samples below demonstrate some important features. The full list of TETML
elements along with descriptions can be found in Section 9.4, »TETML Elements and the
TETML Schema«, page 139.
Document header and text output. The following fragment shows the most important parts of a TETML document:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by the PDFlib Text and Image Extraction Toolkit TET (www.pdflib.com) -->
<TET xmlns="http://www.pdflib.com/XML/TET5/TET-5.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.pdflib.com/XML/TET5/TET-5.0
http://www.pdflib.com/XML/TET5/TET-5.0.xsd"
version="5.0">
<Creation platform="Win64" tetVersion="5.0" date="2015-08-05T18:26:02+02:00" />
<Document filename="TET-datasheet.pdf" pageCount="6" filesize="508093" linearized="true"
pdfVersion="1.7">
<DocInfo>
<Author>PDFlib GmbH</Author>
<CreationDate>2015-08-05T17:43:14+02:00</CreationDate>
<Creator>Adobe InDesign CS6 (Windows)</Creator>
<ModDate>2015-08-05T17:43:15+02:00</ModDate>
<Producer>Adobe PDF Library 10.0.1</Producer>
<Subject>PDFlib TET: Text and Image Extraction Toolkit (TET)</Subject>
<Title>PDFlib TET datasheet</Title>
</DocInfo>
<Metadata>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
...XMP metadata...
</rdf:RDF>
</x:xmpmeta>
</Metadata>
<Options> tetml={filename={TET-datasheet.word.tetml}}</Options>
<Pages>
<Page number="1" width="595.28" height="841.89">
<Options> granularity=word tetml={}</Options>
<Content granularity="word" dehyphenation="false" dropcap="false" font="false"
geometry="false" shadow="false" sub="false" sup="false">
<Para>
<Box llx="235.80" lly="796.02" urx="397.67" ury="816.72">
<Word>
<Text>PDFlib</Text>
<Box llx="235.80" lly="796.02" urx="291.91" ury="814.02"/>
</Word>
<Word>
<Text>datasheet</Text>
<Box llx="306.14" lly="796.22" urx="397.67" ury="816.72"/>
</Word>
</Box>
...more page contents...
</Content>
</Page>
...more pages...
<Resources>
<Fonts>
<Font id="F0" name="TheSans-Plain" fullname="FXLUMY+TheSans-Plain" type="Type 1 CFF"
embedded="true" ascender="1170" capheight="675" italicangle="0" descender="-433"
weight="400" xheight="497"/>
<Font id="F1" name="PDFlibLogo2-Regular" fullname="DUMIKC+PDFlibLogo2-Regular"
type="Type 1 CFF" embedded="true" ascender="800" capheight="700" italicangle="0"
descender="-9" weight="400" xheight="537"/>
...more fonts...
</Fonts>
<Images>
<Image id="I0" filename="TET-datasheet_I0.tif" extractedAs=".tif" width="885"
height="565" colorspace="CS3" bitsPerComponent="8"/>
<Image id="I1" filename="TET-datasheet_I1.tif" extractedAs=".tif" width="1253"
height="379" colorspace="CS4" bitsPerComponent="8"/>
...more images...
</Images>
<ColorSpaces>
<ColorSpace id="CS0" name="DeviceCMYK" components="4"/>
<ColorSpace id="CS1" name="DeviceGray" components="1"/>
...more colorspaces...
</ColorSpaces>
</Resources>
...
</Pages>
</Document>
</TET>
Glyph coordinates and color. Depending on the selected TETML mode more glyph details can be expressed in TETML. TETML modes are discussed in »Selecting the TETML
mode«, page 135. Here is a variation of the sample above with more glyph details. The
Glyph element contains font, position and color information (see below for more details
on color representation):
<Word>
<Text>datasheet</Text>
<Box llx="306.14" lly="796.22" urx="397.67" ury="816.72">
<Glyph font="F0" size="20.5000" x="306.14" y="796.22" width="11.42" fill="C0">d</Glyph>
<Glyph font="F0" size="20.5000" x="317.87" y="796.22" width="10.68" fill="C0">a</Glyph>
<Glyph font="F0" size="20.5000" x="328.61" y="796.22" width="7.61" fill="C0">t</Glyph>
<Glyph font="F0" size="20.5000" x="336.52" y="796.22" width="10.68" fill="C0">a</Glyph>
<Glyph font="F0" size="20.5000" x="347.51" y="796.22" width="8.71" fill="C0">s</Glyph>
<Glyph font="F0" size="20.5000" x="356.53" y="796.22" width="11.79" fill="C0">h</Glyph>
<Glyph font="F0" size="20.5000" x="368.63" y="796.22" width="10.41" fill="C0">e</Glyph>
<Glyph font="F0" size="20.5000" x="379.35" y="796.22" width="10.41" fill="C0">e</Glyph>
<Glyph font="F0" size="20.5000" x="390.07" y="796.22" width="7.61" fill="C0">t</Glyph>
</Box>
</Word>
Color values and color spaces. Colors are represented by a color space (e.g. DeviceRGB)
and a color value. Color values for text are available for the fill and stroke colors. Since
stroked glyphs are quite rare in PDF, you will see the fill attribute more often. The color
values for images come from the actual pixel data. Color spaces for text, vector graphics
and images are listed in the ColorSpaces element in the Resources section. Each ColorSpace
element contains details depending on the type of color space. Some color spaces refer
to others, e.g. an Indexed color space requires an underlying base color space, and Separation and DeviceN require an alternate color space:
<Resources>
<ColorSpaces>
<ColorSpace id="CS0" name="DeviceCMYK" components="4"/>
<ColorSpace id="CS1" name="DeviceGray" components="1"/>
<ColorSpace id="CS2" name="Indexed" components="1" base="CS0" hival="255">
<Lookup>000000000705000349340029745300416F50003E775600...</Lookup>
</ColorSpace>
<ColorSpace id="CS0" name="Separation" components="1" alternate="CS0">
<Colorant name="PANTONE 294 CVC"/>
<Function type="interpolate">
...
<C1>
<Value>0.93</Value>
<Value>0.62</Value>
<Value>0.00</Value>
<Value>0.00</Value>
</C1>
<Exponent>1</Exponent>
</Function>
</ColorSpace>
...
</ColorSpaces>
</Resources>
Tables in TETML. Tables identified by TET are expressed with table, row and cell structure in TETML. Cells which span multiple columns are labeled with a colSpan attribute:
<Table>
<Box llx="302.14" lly="639.72" urx="525.50" ury="731.50">
<Row>
<Box llx="311.64" lly="721.10" urx="521.50" ury="730.70"/>
<Cell>
<Box llx="311.64" lly="721.10" urx="375.22" ury="730.70"/>
<Para>
<Word>
<Text>Device-dependent</Text>
<Box llx="311.64" lly="721.90" urx="375.22" ury="729.90"/>
</Word>
</Para>
</Cell>
<Cell>
<Box llx="397.91" lly="721.10" urx="431.99" ury="730.70"/>
<Para>
<Word>
<Text>CIE-based</Text>
<Box llx="397.91" lly="721.90" urx="431.99" ury="729.90"/>
</Word>
</Para>
</Cell>
...
<Row>
<Box llx="306.14" lly="641.52" urx="516.67" ury="650.52"/>
<Cell colSpan="3">
<Box llx="306.14" lly="706.42" urx="516.67" ury="716.02"/>
<Para>
<Word>
<Text>TET</Text>
<Box llx="306.14" lly="641.52" urx="319.70" ury="650.52"/>
</Word>
<Word>
<Text>.</Text>
<Box llx="514.83" lly="641.52" urx="516.67" ury="650.52"/>
</Word>
</Para>
</Cell>
</Row>
</Box>
</Table>
Interactive elements. Links, bookmarks, form fields etc. are also available in TETML as
shown in the following example:
<Page number="6" width="595.27600" height="841.89000">
<Annotations>
<Annotation id="ANN0" type="Link" anchor="A0">
<Box llx="327.14" lly="64.89" urx="395.08" ury="79.18"/>
<Action type="URI" trigger="activate" URI="mailto:sales%40pdflib.com"/>
</Annotation>
<Annotation id="ANN1" type="Link" anchor="A1">
<Box llx="327.14" lly="52.89" urx="391.05" ury="67.18"/>
<Action type="URI" trigger="activate" URI="http://www.pdflib.com"/>
</Annotation>
</Annotations>
The text inside a link is wrapped with A (anchor) elements which provide the relationship between the geometrically defined PDF annotation and the corresponding page
contents, i.e. the text which activates the link. Keep in mind that the active content does
not need to correspond to complete semantic entities. For example, a link may span
some fraction of a word or paragraph. Since anchors don’t necessarily span complete
TETML elements separate start/stop anchor elements are required instead of enclosing
the link contents with a single A element:
<A id="A1" type="start"/>
<Word>
<Text>www.pdflib.com</Text>
<Box llx="327.14" lly="56.71" urx="391.05" ury="65.71"/>
</Word>
<A id="A1" type="stop"/>
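Because the anchors are separate start/stop elements, a consumer has to walk the TETML tree in document order and remember which anchors are currently open. The following Java sketch does this with the DOM parser of the Java runtime. The class TetmlAnchors is a hypothetical helper, not a TET API. As a simplifying assumption it collects whole Text elements, so the result is only approximate when an anchor starts or ends inside a word.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Illustrative helper (not a TET API): text between start/stop anchors, keyed by anchor id. */
final class TetmlAnchors {
    static Map<String, String> anchorTexts(String tetml) {
        Element root;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            root = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(tetml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        Map<String, StringBuilder> open = new LinkedHashMap<>(); // anchors seen as "start" only
        Map<String, String> result = new LinkedHashMap<>();
        walk(root, open, result);
        return result;
    }

    /** Depth-first traversal in document order. */
    private static void walk(Node node, Map<String, StringBuilder> open,
                             Map<String, String> result) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            if ("A".equals(e.getLocalName())) {
                String id = e.getAttribute("id");
                if ("start".equals(e.getAttribute("type"))) {
                    open.put(id, new StringBuilder());
                } else if ("stop".equals(e.getAttribute("type")) && open.containsKey(id)) {
                    result.put(id, open.remove(id).toString().trim());
                }
                return;
            }
            if ("Text".equals(e.getLocalName())) {
                // Append the text to every anchor which is currently open
                for (StringBuilder sb : open.values()) {
                    sb.append(e.getTextContent()).append(' ');
                }
                return;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, open, result);
        }
    }
}
```

The anchor ids in the result can be matched against the anchor attribute of the Annotation elements to relate each link to its visible text.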
9.3 Controlling TETML Details
TETML modes. TETML can be generated in various modes which include different
amounts of font and geometry information, and differ regarding the grouping of text
into larger units (granularity). The TETML mode can be specified individually for each
page. Usually TETML files contain the data for all pages in the same mode. The following
TETML modes include text and image information as well as interactive elements:
> Glyph mode is a low-level flavor which includes the text, font, coordinates, and color
for each glyph, without any word grouping or structure information. It is intended
for debugging and analysis purposes since it represents the original text information on the page.
> Word mode groups text into words and adds Box elements with the coordinates of
each word. No font information is available. This mode is suitable for applications
which operate on word basis. Punctuation characters will by default be treated as individual words, but this behavior can be changed with a page option (see »Word
boundary detection for Western text«, page 86). Lines of text can optionally be identified with the Line element; this is controlled via the tetml page option.
> Wordplus mode is similar to word mode, but adds font and coordinate details plus
color information for all glyphs in a word. The coordinates are expressed relative to
the lower left or upper left corner subject to the topdown page option. Wordplus mode
makes it possible to analyze font usage and track changes of font, font size, etc. within a word. Since wordplus is the only TETML mode which contains all relevant TETML
elements it is suited for all kinds of processing tasks. On the other hand, it creates
the largest amount of output due to the wealth of information contained in TETML.
> Line mode includes all text which comprises a line in a separate Line element. In addition, multiple lines may be grouped in a Para element. Line mode is recommended
only in situations where the receiving application can only deal with line-based text
input.
> Page mode includes structure information starting at the paragraph level, but does
not include any font or coordinate details. Note that the layout detection results in
page mode may be slightly different from word mode since anchors for images and
destinations are treated differently.
If you are only interested in image information you can also skip other types of output
in TETML:
> Image mode includes information about placed images and image resources, but not
any text- or font-related elements nor information about interactive elements.
Table 9.1 lists the TETML elements which are present in the TETML modes.
Selecting the TETML mode. With the TET command-line tool (see Section 2.1, »Command-Line Options«, page 17) you can specify the desired mode as a parameter for the --tetml option. The following command generates TETML output in wordplus mode:
tet --tetml wordplus file.pdf
With the TET library the TETML mode cannot be specified directly, but as a combination
of options:
> You can specify the amount of text in the smallest element with the granularity option of TET_process_page( ).
Table 9.1 Text-related elements in various TETML modes; PlacedImage and Image are always present.
TETML mode | structure | tables | text position | glyph details
glyph | – | – | – | Glyph, Color
word | Para, Word; optionally: Line | Table, Row, Cell | Box inside Word; optionally: Box inside Para | –
wordplus | Para, Word; optionally: Line | Table, Row, Cell | Box inside Word; optionally: Box inside Para | Glyph, Color
line | Para, Line | – | optionally: Box inside Para | –
page | Para | Table, Row, Cell | optionally: Box inside Para | –
image | – | – | – | –
> For granularity=glyph or word you can additionally specify the amount of glyph details. With the glyphdetails suboption of the tetml option you can omit some parts of
the glyph information if you don’t need it.
> In order to suppress all text output (i.e. image mode) you can disable the text engine
with the following document option:
engines={notext}
The following page option list generates TETML output in wordplus mode with all glyph
details:
granularity=word tetml={ glyphdetails={all} }
Table 9.2 summarizes the options for creating TETML modes.
Table 9.2 Creating TETML modes with the TET library
TETML mode | document option | options of TET_process_page( )
glyph | – | granularity=glyph tetml={glyphdetails={all}}
word | – | granularity=word
wordplus | – | granularity=word tetml={glyphdetails={all}}
word with Line elements | – | granularity=word tetml={elements={line}}
wordplus with Line elements | – | granularity=word tetml={glyphdetails={all} elements={line}}
line | – | granularity=line
page | – | granularity=page
image | skipengines={text vector} | tetml={elements={annotations=false bookmarks=false destinations=false docinfo=true fields=false javascripts=false metadata=true options=true}}
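An application which offers the TETML modes by name could keep the page option lists of Table 9.2 in a small lookup function. The following Java sketch is one way to do this; the class TetmlModes and its method are hypothetical and not part of the TET API. The image mode is left out on purpose since it also involves a document option.

```java
/** Illustrative helper (not a TET API): page option lists for the text-related TETML modes. */
final class TetmlModes {
    /**
     * Returns the option list for TET_process_page() which creates the given TETML mode.
     * lineElements requests additional Line elements (word and wordplus modes only).
     */
    static String pageOptions(String mode, boolean lineElements) {
        if (lineElements && !mode.equals("word") && !mode.equals("wordplus")) {
            throw new IllegalArgumentException(
                    "Line elements are only tabulated for word and wordplus modes");
        }
        switch (mode) {
            case "glyph":
                return "granularity=glyph tetml={glyphdetails={all}}";
            case "word":
                return lineElements
                        ? "granularity=word tetml={elements={line}}"
                        : "granularity=word";
            case "wordplus":
                return lineElements
                        ? "granularity=word tetml={glyphdetails={all} elements={line}}"
                        : "granularity=word tetml={glyphdetails={all}}";
            case "line":
                return "granularity=line";
            case "page":
                return "granularity=page";
            default:
                throw new IllegalArgumentException("unsupported TETML mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(pageOptions("wordplus", true));
    }
}
```

The returned string would be passed as the page option list to TET_process_page( ), possibly combined with further page options.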
Document options for controlling TETML output. In this section we will summarize the
effect of various options which directly control the generated TETML output. All other
document options can be used to control processing details. The complete description
of document options can be found in Table 10.8.
Document-related options must be supplied to the --docopt command-line option or
to the TET_open_document( ) function.
The tetml option¹ controls general aspects of TETML. The elements suboption can be
used to suppress certain TETML elements if they are not required. The following document option list will suppress document-level XMP metadata in the generated TETML
output:
tetml={ elements={nometadata} }
The engines option can be used to disable some of the TET kernel’s processing engines.
The following option list instructs TET to process text contents, but disable text color retrieval and image processing:
engines={notextcolor noimage}
The following document option makes sense only for granularity=page. It changes the
default line separator character from linefeed to space:
lineseparator=U+0020
All document options which have been supplied when creating TETML are recorded in
the /TET/Document/Options element unless disabled with the following document option:
tetml={ elements={nooptions} }
Document options for controlling TETML output for interactive elements. TETML can
also include information about interactive elements in the PDF document. The document option tetml with the suboption elements can be used to enable or disable TETML
output for various aspects, e.g.
elements={annotations=true bookmarks=true destinations=true fields=true javascripts=true}
Page options for controlling TETML output. The complete description of page options
can be found in Table 10.10. Page-related options must be supplied to the --pageopt command-line option or to TET_process_page( ).
The tetml page option enables or disables coordinate- and font-related information
in the Glyph element. The following page option list enables font details in the Glyph element, but suppresses other glyph attributes:
tetml={ glyphdetails={font} }
The following page option list adds Line elements to the TETML output:
tetml={ glyphdetails={font} elements={line} }
The following page option adds sub and sup attributes to the Glyph element to designate
subscripts and superscripts:
tetml={ glyphdetails={sub sup} }
1. Keep in mind that there are two different tetml options: one on document level and one on page level.
The following page option uses the keyword all to generate all possible attributes of the Glyph element:
tetml={ glyphdetails={all} }
The following page option requests topdown coordinates instead of the default bottomup coordinates:
topdown={output}
The following page option list instructs TET to combine punctuation characters with
the adjacent words, i.e. punctuation characters are no longer treated as individual
words:
contentanalysis={nopunctuationbreaks}
All page options which have been supplied when creating TETML are recorded in the
/TET/Document/Pages/Page/Options elements (individually for each page) unless disabled
with the following document option:
tetml={ elements={nooptions} }
Exception handling. If an error happens during PDF parsing, TET generally tries to repair or ignore the problem if possible, or throws an exception otherwise. However,
when generating TETML output, PDF parsing problems are usually reported as
an Exception element in TETML:
<Exception errnum="4506">Object ’objects[49]/Subtype’ does not exist</Exception>
Applications should be prepared to deal with Exception elements instead of the expected
elements when processing TETML.
Problems which prevent the generation of the TETML output file (e.g. invalid options, no write permission for the output file) still trigger a runtime exception and no
valid TETML output will be created.
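As a minimal sketch of how an application might look for Exception elements before relying on the expected content, the following Java code scans a TETML document with the JDK's DOM parser. The class and method names (TetmlExceptionScan, findExceptions) are hypothetical and not part of the TET distribution; only the namespace URI and the errnum attribute are taken from this manual.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TetmlExceptionScan {
    // TETML namespace as documented for TET 5.0
    static final String TETML_NS = "http://www.pdflib.com/XML/TET5/TET-5.0";

    /** Returns one "errnum: message" string per Exception element found in the TETML text. */
    static List<String> findExceptions(String tetml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to match elements by namespace
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(tetml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse TETML", e);
        }
        NodeList nodes = doc.getElementsByTagNameNS(TETML_NS, "Exception");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element exception = (Element) nodes.item(i);
            result.add(exception.getAttribute("errnum") + ": " + exception.getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily shortened TETML input for illustration only
        String sample = "<TET xmlns=\"" + TETML_NS + "\"><Document>"
            + "<Exception errnum=\"4506\">Object 'objects[49]/Subtype' does not exist</Exception>"
            + "</Document></TET>";
        for (String line : findExceptions(sample)) {
            System.out.println(line);
        }
    }
}
```

An application could call such a check first and fall back to partial processing or error reporting if the returned list is not empty.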
9.4 TETML Elements and the TETML Schema
A formal XML schema description (XSD) for all TETML elements and attributes as well as
their relationships is contained in the TET distribution. The TETML namespace is the following:
http://www.pdflib.com/XML/TET5/TET-5.0
The schema can be downloaded from the following URL on the Web:
http://www.pdflib.com/XML/TET5/TET-5.0.xsd
Both TETML namespace and schema location are present in the root element of each
TETML document.
Table 9.3 describes the role of all TETML elements. Elements and attributes which
have been introduced with TET 4.0 or above are marked. Figure 9.1 and Figure 9.2 visualize the XML hierarchy of TETML elements.
Table 9.3 TETML elements and attributes
TETML element
description and attributes
A
(TET 5.0; only for granularity=glyph and word) Anchor for an annotation, destination or field
within the page content
Attributes: id, type (the types start and stop enclose text, type rect abbreviates anchors without any content)
Action
(TET 5.0) Describes a PDF action.
Attributes: filename, name, javascript, URI, trigger, type
Annotation
(TET 5.0) Describes a PDF annotation (excluding form fields). If the annotation has a corresponding popup annotation, the popup is expressed as a nested Annotation element.
Attributes: alignment, anchor, color, creationdate, destination, hidden, icon, id, intent,
interiorcolor, invisible, moddate, name, onscreen, opacity, open, print, readonly, rotate,
subject, symbol, type
Child elements: Action, Box, Annotation, Contents, Title
Annotations
(TET 5.0) Container of Annotation elements
Attribute: xml:space
Attachment
For PDF attachments describes the contents in a nested Document element. For non-PDF attachments only the name is listed, but no contents.
Attributes: name, level, pagenumber
Attachments
Container of Attachment elements
BitsPerSample
(TET 5.0) Number of bits per sample for sampled functions, i.e. Function/@type="sampled"
BlackPoint
(TET 5.0) Tristimulus value of the black point for CalGray, CalRGB and Lab color spaces
Attributes: x, y, z
Bookmark
(TET 5.0) Contains Bookmark and Title elements to describe text, properties and nested bookmarks of a PDF bookmark (also called outline entry)
Attributes: color, destination, fontstyle, open
Child elements: Action, Bookmark, Title
Bookmarks
(TET 5.0) Container of Bookmark elements
Attribute: xml:space
Bounds
(TET 5.0) Intervals for stitched functions, i.e. Function/@type="stitching"
Child element: Value
Box
Describes the coordinates of a word, paragraph, annotation or form field. The attributes llx and
lly describe the lower left corner, urx and ury describe the upper right corner of the Box. If the
Box represents a rectangle with edges parallel to the page edges, the four values llx,lly,
urx,ury describe the lower left and upper right corners; otherwise the coordinates of all four corners are present. A word or paragraph may contain multiple Box elements, e.g. a hyphenated
word which spans several lines of text or a word which starts with a large dropcap character.
Attributes: llx, lly¹, urx, ury¹, ulx, uly¹, lrx, lry¹
Child elements: A (TET 5.0), Glyph, Line (TET 5.0), Para (TET 5.0), PlacedImage (TET 5.0), Table
(TET 5.0), Text (TET 5.0), Word (TET 5.0)
Parent elements: Para, Word
C0
(TET 5.0) Initial color value for interpolation functions, i.e. Function/@type="interpolate"
Child element: Value
C1
(TET 5.0) Terminal color value for interpolation functions, i.e. Function/@type="interpolate".
As a convenience feature, this element is also created for sampled functions, i.e.
Function/@type="sampled", although it is not present in PDF for such functions. This element describes the
alternate color of a spot color.
Child element: Value
Calculator
(TET 5.0) Operators for PostScript functions, i.e. Function/@type="PostScript"
Cell
Describes the contents of a single table cell.
Attributes: colSpan, llx, lly¹, urx, ury¹, ulx, uly¹, lrx, lry¹
Color
(TET 5.0) Describes a PDF color.
Attributes: colorspace, id, svgname, pattern
Colorant
(TET 5.0) Colorant of a Separation or DeviceN color space
Attributes: name, colorspace
Colors
(TET 5.0) Container of Color elements
ColorSpace
Describes a PDF colorspace.
Attributes: alternate, base, components, hival (TET 5.0), iccprofile (TET 5.0), id, name, pattern
(TET 5.0), subtype (TET 5.0)
Child elements (TET 5.0): BlackPoint, Colorant, Exception, Function, Gamma, Lookup, Matrix,
Process, Range, WhitePoint
ColorSpaces
Container of ColorSpace elements
Content
Describes the page contents as a hierarchical structure.
Attributes: granularity, dehyphenation (TET 4.0), dropcap (TET 4.0), font, geometry, shadow
(TET 4.0), sub (TET 4.0), sup (TET 4.0)
Contents
(TET 5.0) As child of Annotation: contents of an annotation
(TET 5.0) As child of Field: contents of a form field
Creation
Describes the date and operating system platform for the TET execution, plus the version number
of TET.
Attributes: date, platform, tetVersion
Decode
(TET 5.0) Mapping of sample values for sampled functions, i.e. Function/@type="sampled"
Child element: Value
Destination
(TET 5.0) Describes a PDF destination in the document.
Attributes: anchor, bottom¹, id, left, name, page, right, top¹, type, zoom
Destinations
(TET 5.0) Container of Destination elements
DefaultValue
(TET 5.0) Default value of a form field
DocInfo
Predefined and custom document info entries
Child elements: Author, CreationDate, Creator, GTS_PDFXConformance, GTS_PDFXVersion,
GTS_PPMLVDXConformance, GTS_PPMLVDXVersion, ISO_PDFEVersion, Keywords, ModDate, Producer,
Subject, Title, Trapped, Custom (attribute: key), CustomBinary (attribute: key)
Document
Describes general document information including PDF file name and size, PDF version number.
Attributes: filename, destination (TET 5.0), pageCount, filesize, linearized, pdfVersion,
pdfa (TET 4.0: new values for PDF/A-2; TET 4.1: new values for PDF/A-3), pdfe (TET 4.0; TET 4.1:
new values for PDF/E-2), pdfua (TET 4.1), pdfvt (TET 4.1), pdfx (TET 4.1: enumerated values), revisions (TET 5.0), tagged, usagerights (TET 5.0)
Child elements: Action (TET 5.0), Attachments, Bookmarks (TET 5.0), Destinations (TET 5.0),
DocInfo, Encryption, Exception, JavaScripts (TET 5.0), Metadata, Options, OutputIntents,
Pages, SignatureFields (TET 5.0), XFA (TET 5.0)
Domain
(TET 5.0) Input value interval(s) for functions
Child element: Value
Encode
(TET 5.0) Mapping of input values for stitched functions, i.e. Function/@type="stitching"
Child element: Value
Encryption
Describes various security settings.
Attributes: keylength, algorithm (TET 4.1: new values 8-11), attachment (TET 4.1), description
(TET 4.1: new values for algorithms 8-11), masterpassword, userpassword, noprint, nomodify,
nocopy, noannots, noassemble, noforms, noaccessible, nohiresprint, plainmetadata
Exception
Error message and number associated with an exception which was thrown by TET and translated to TETML. The Exception element may replace other elements if not enough information can
be extracted from the input because of malformed PDF data structures.
The following elements may have an Exception element as child:
Annotation, Annotations, Attachment, Attachments, Bookmark, Bookmarks, Color, ColorSpace,
ColorSpaces, Document, Field, Fields, Font, Fonts, ICCProfile, Image, Images, Metadata, Page,
Pattern, Patterns, SignatureField, SignatureFields
Attribute: errnum
Exponent
(TET 5.0) Interpolation exponent for interpolated functions, i.e. Function/@type="interpolate"
Field
(TET 5.0) Describes a PDF form field.
Attributes: alignment, anchor, backgroundcolor, bordercolor, caption, captiondown, captionrollover, destination, export, exportvalue (only for type=radiobutton and checkbox),
hidden, id, mappingname, name, onscreen, print, readonly, required, rotate, sort, state, type,
visible
Child elements: Action, Box, Contents, Field (for the buttons comprising a field with
type=radiogroup), DefaultValue, OptionalValue, Tooltip, Value
Fields
(TET 5.0) Container of Field elements
Attribute: xml:space
Font
Describes a font resource. The required name attribute contains the canonical font name, while
the optional fullname attribute contains the font name including subset prefix.
Attributes: ascender (TET 4.1), capheight (TET 4.1), descender (TET 4.1), embedded, fullname (TET
4.0), id, italicangle (TET 4.1), type, name, vertical, weight (TET 4.1), xheight (TET 4.1)
Fonts
Container of Font elements
Function
(TET 5.0) Tint transform function for a Separation or DeviceN color space
Attribute: type
Child elements: BitsPerSample, Bounds, Calculator, C0, C1, Decode, Domain, Encode, Functions,
Exponent, Order, Range, Samples, Size
Functions
(TET 5.0) Container of sub-functions for stitched functions, i.e. Function/@type="stitching"
Child element: Function
Gamma
(TET 5.0) Gamma values for CalGray or CalRGB color space
Child element: Value
Glyph
Describes font and geometry details for a single glyph. The element content holds the Unicode
character(s) produced by the glyph. A single glyph may produce more than one character, e.g. for
ligatures. The Glyph elements for a word are grouped within one or more Box elements.
Attributes: x, y¹, width, height (TET 5.0; only for vertical writing mode and if the glyph height is
different from the font size), alpha¹, beta¹, shadow (TET 4.0), dropcap (TET 4.0), fill (TET 5.0),
font, size, stroke (TET 5.0), sub (TET 4.0), sup (TET 4.0), textrendering, unknown, dehyphenation (TET 4.0)
Graphics
(TET 5.0) Container of the Colors, ICCProfiles, and Layers elements
ICCProfiles
(TET 5.0) Container of ICCProfile elements
ICCProfile
(TET 5.0) Describes an ICC color profile
Attributes: checksum, iccversion, id, deviceclass, embedded, fromCIE, profilecs, profilename, toCIE
Image
Describes an image resource, i.e. the actual pixel array comprising the image.
Attributes: bitsPerComponent, colorspace, extractedAs (TET 4.0, additional value introduced
with TET 4.2), filename (TET 5.0), height, id, maskid (TET 5.0), mergetype, stencilmask (TET 5.0),
width
Images
Container of Image elements
JavaScript
(TET 5.0) Describes a sequence of JavaScript code
Attributes: id, name
JavaScripts
(TET 5.0) Container of JavaScript elements
Layer
(TET 5.0) Describes an optional content group (OCG), commonly called layer
Attributes: name, visible, label, locked
Child element: Layer
Layers
(TET 5.0) Container of Layer elements
Line
Text for a single line. TET 4.0: Line may also contain Word elements.
Lookup
(TET 5.0) Lookup table for Indexed color spaces, i.e. ColorSpace/@name="Indexed". It contains a
hexadecimal sequence of values which must be interpreted in the Indexed color space’s base color
space.
Matrix
(TET 5.0) Transformation matrix of a CalRGB color space
Metadata
XMP metadata which can be associated with the document, a font, or an image
OptionalValue
(TET 5.0) Optional value of a form field
Options
Document or page options used for generating the TETML
Order
(TET 5.0) Order of sample interpolation for sampled functions, i.e. Function/@type="sampled"
Child element: Value
OutputIntent
(TET 5.0) Describes the output intent of a document or page
Attributes: iccprofile, subtype
Child elements: OutputCondition, OutputConditionIdentifier, RegistryName, Info
OutputIntents
(TET 5.0) Container of OutputIntent elements
Page
Contents of a single page.
Attributes: hasdefaultcmyk, hasdefaultgray, hasdefaultrgb, height, label (TET 5.0), number,
topdown (TET 4.0), width
Child elements: Action (TET 5.0), Annotations (TET 5.0), Content, Exception, Fields (TET 5.0),
Options, OutputIntents
Pages
Container of Page elements
Para
Text comprising a single paragraph
Child elements: A, Box, Para
Pattern
(TET 5.0) Describes a PDF pattern
Attributes: id, patterntype, painttype, tilingtype
Patterns
(TET 5.0) Container of Pattern elements
PlacedImage
Describes an instance of an image placed on the page.
Attributes: alpha¹, beta¹, height, image, width, x, y¹
Process
(TET 5.0) Process color space description of a DeviceN color space with subtype NChannel
Attribute: colorspace
Child element: Component
Range
(TET 5.0) As child of ColorSpace: Range of an Lab color space
As child of Function: Range of output values for functions
Child element: Value
Resources
Container of ColorSpaces, Fonts, Images, and Patterns resource containers
Row
Container of one or more table cells
Child element: Cell
Samples
(TET 5.0) Hexadecimal sequence of samples for sampled functions, i.e. Function/@type="sampled"
SignatureField
(TET 5.0) Describes a signed or unsigned signature field
Attributes: cades, field, fillablefields, permissions, preventchanges, sigtype, visible
SignatureFields
(TET 5.0) Container of SignatureField elements
Size
(TET 5.0) Number of samples in each input dimension for sampled functions, i.e. Function/@type="sampled"
Child element: Value
Table
Container of one or more table rows
Attributes: llx, lly¹, urx, ury¹, ulx, uly¹, lrx, lry¹
Child element: Row
TET
TETML root element. Attribute: version
Text
Text contents of a word or other element
Title
(TET 5.0) As child of Annotation: Title of an annotation
(TET 5.0) As child of Bookmark: Title of a bookmark
As child of DocInfo: document info entry Title
Tooltip
(TET 5.0) Tooltip of a form field
Value
(TET 5.0) Value of a form field
WhitePoint
(TET 5.0) Tristimulus value of the white point for CalGray, CalRGB and Lab color spaces
Attributes: x, y, z
Word
Single word
XFA
(TET 5.0) The document contains XFA form information
Attribute: type (always static since TET refuses to process dynamic XFA forms)
1. All vertical coordinates and angles are expressed relative to the lower left or upper left corner subject to the topdown page option.
Fig. 9.1
Main TETML element hierarchy.
9.5 Transforming TETML with XSLT
Very short overview of XSLT. XSLT (which stands for eXtensible Stylesheet Language
Transformations) is a language for transforming XML documents to other documents.
While the input is always an XML document (TETML in our case), the output does not
necessarily have to be XML. XSLT can also perform arbitrary calculations and produce
plain text or HTML output. We will use XSLT stylesheets to process TETML input in order
to generate a new dataset (in text, XML, CSV, or HTML format) based on the input which
in turn reflects the contents of a PDF document. The TETML document has been created
with the TET command-line tool or the TET library as explained in Section 9.1, »Creating
TETML«, page 129.
While XSLT is very powerful, it is considerably different from conventional programming languages. We do not attempt to provide an introduction to XSLT programming
in this section; please refer to the wide variety of printed and Web resources on this
topic.
However, we do want to assist you in getting XSLT processing of TETML documents
up and running quickly. This section describes the most important environments for
running XSLT stylesheets, and lists common software for this purpose. In order to apply
XSLT stylesheets to XML documents you need an XSLT processor. There are various free
and commercial XSLT processors available which can be used either in a stand-alone
manner or in your own programs with the help of a programming language.
Fig. 9.2
TETML element hierarchy for page contents
XSLT stylesheets can make use of parameters which are passed from the environment to the stylesheet in order to control processing details. Since some of our XSLT
samples make use of stylesheet parameters we also supply information about passing
parameters to stylesheets in various environments.
Common XSLT processors which can be used in various packagings include the following:
> Microsoft’s XML implementation called MSXML
> Microsoft’s .NET Framework 2.0 XSLT implementation
> Saxon, which is available in free and commercial versions
> Xalan, an open-source project (available in C++ and Java implementations) hosted by
the Apache foundation
> The open-source libxslt library of the GNOME project
> Sablotron, an open-source XSLT toolkit
XSLT on the command line. Applying XSLT stylesheets from the command line provides a convenient development and testing environment. The examples below show
how to apply XSLT stylesheets on the command line. All samples below process the input file TET-datasheet.tetml with the stylesheet tetml2html.xsl and write the generated output to TET-datasheet.html; most of them also set the XSLT parameter toc-generate (which is used in the stylesheet) to the value 0:
> The Java-based Saxon processor (see www.saxonica.com) can be used as follows:
java -jar saxon9.jar -o TET-datasheet.html TET-datasheet.tetml tetml2html.xsl toc-generate=0
> You can apply XSLT scripts with the ant build tool. A minimal build file for applying
XSLT looks as follows:
<project name="tetml2html" default="tetml2html">
<target name="tetml2html">
<xslt in="TET-datasheet.tetml" style="tetml2html.xsl" out="TET-datasheet.html"/>
</target>
</project>
The build.xml file in the TET distribution contains XSLT tasks for all samples. The
command ant applies all XSLT samples and converts the input document TET-datasheet.pdf to TETML. The following command processes another PDF input document:
ant -Dinput.pdf=myfile.pdf
> The xsltproc tool is included in most Linux distributions, see xmlsoft.org/XSLT. Use the
following command to apply a stylesheet to a TETML document:
xsltproc --output TET-datasheet.html --param toc-generate 0 tetml2html.xsl TET-datasheet.tetml
The runxslt.sh shell script in the TET distribution can be used to run all XSLT samples
with xsltproc (run ant once to create the TETML input files).
> Xalan C++ provides a command-line tool which can be invoked as follows:
Xalan -o TET-datasheet.html -p toc-generate 0 TET-datasheet.tetml tetml2html.xsl
> On Windows systems with the MSXML parser you can use the free msxsl.exe program
provided by Microsoft. The program (including source code) is available at the following location:
www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=21714
Run the program as follows:
msxsl.exe TET-datasheet.tetml tetml2html.xsl -o TET-datasheet.html toc-generate=0
The runxslt.ps1 and runxslt.vbs scripts in the TET distribution can be used to run all
XSLT samples with msxml (run ant once to create the TETML input files).
XSLT within your own application. If you want to integrate XSLT processing in your
application, the choice of XSLT processor obviously depends on your programming language and environment. The TET distribution contains sample code for various important environments. The runxslt samples demonstrate how to load a TETML document,
apply an XSLT stylesheet with parameters, and write the generated output to a file. If
the programs are executed without any arguments they exercise all XSLT samples supplied with the TET distribution. Alternatively, you can supply parameters for the TETML
input file name, XSLT stylesheet name, output file name and additional parameter/value pairs. You can use the runxslt samples as a starting point for integrating XSLT processing into your application:
> Java developers can use the methods in the javax.xml.transform package. This is
demonstrated in the runxslt.java sample.
> .NET developers can use the methods of the System.Xml.Xsl.XslTransform class.
This is demonstrated in the runxslt.ps1 PowerShell script. Similar code can be used
with C# and other .NET languages.
> All Windows-based programming languages which support COM automation can
use the methods of the MSXML2.DOMDocument automation class supplied by the
MSXML parser. This is demonstrated in the runxslt.vbs sample. Similar code can be
used with other COM-enabled languages.
XSLT extensions are available for many other modern programming languages as well,
e.g. Perl.
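For Java, the steps described above (load a TETML document, apply a stylesheet with parameters, write the output) can be sketched with the javax.xml.transform package as follows. This is not the runxslt.java sample from the TET distribution; the class and method names are hypothetical, and the file names in main( ) are simply those used in the command-line examples. Parameter values are passed as strings here, which XSLT 1.0 converts to a number when a stylesheet compares the parameter with a number.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RunXsltSketch {
    /** Applies the stylesheet to the input and writes the result, passing all stylesheet parameters. */
    static void transform(Source input, Source stylesheet, Map<String, String> params, Result output) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer(stylesheet);
            for (Map.Entry<String, String> param : params.entrySet()) {
                transformer.setParameter(param.getKey(), param.getValue());
            }
            transformer.transform(input, output);
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT processing failed", e);
        }
    }

    /** Convenience variant for in-memory documents, e.g. for testing a stylesheet. */
    static String transformToString(String xml, String xsl, Map<String, String> params) {
        StringWriter out = new StringWriter();
        transform(new StreamSource(new StringReader(xml)),
                  new StreamSource(new StringReader(xsl)), params, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) {
        // Same input, stylesheet, parameter and output as in the command-line examples
        transform(new StreamSource(new File("TET-datasheet.tetml")),
                  new StreamSource(new File("tetml2html.xsl")),
                  Collections.singletonMap("toc-generate", "0"),
                  new StreamResult(new File("TET-datasheet.html")));
    }
}
```

The JDK ships with a built-in XSLT 1.0 processor, so no additional library is required for this sketch; another processor such as Saxon can be substituted through the usual TransformerFactory lookup.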
XSLT on the Web server. Since XML-to-HTML conversion is a common XSLT use case,
XSLT stylesheets are often run on a Web server. Some important scenarios:
> Windows-based Web servers with ASP or ASP.NET can make use of the COM or .NET
interfaces mentioned above.
> Java-based Web servers can make use of the javax.xml.transform package.
> PHP-based Web servers can make use of the Sablotron processor, see www.php.net/manual/en/intro.xsl.php.
XSLT in the Web browser. XSLT transformations are also supported by most modern
browsers. In order to instruct the browser to apply an XSLT stylesheet to a TETML document add a line with a suitable processing instruction after the first line of the TETML
document containing the xml processing instruction and before the root element. You
can then load it in the browser which will apply the stylesheet and display the resulting
output (note that Internet Explorer requires the file name suffix .xml when processing
files from the local disk):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="tetml2html.xsl" version="1.0"?>
<TET xmlns="http://www.pdflib.com/XML/TET5/TET-5.0"
...
The browser will apply the XSLT stylesheet to the TETML document and then display the
resulting text, HTML, or XML output. As an alternative, XSLT processing in the browser
can also be initiated from JavaScript code.
With Firefox you can supply parameters to the XSLT stylesheet with the xslt-param
processing instruction:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="tetml2html.xsl" version="1.0"?>
<?xslt-param name="toc-generate" value="0"?>
<TET xmlns="http://www.pdflib.com/XML/TET5/TET-5.0"
...
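Instead of editing the TETML file by hand, the processing instruction can also be inserted programmatically. The following Java sketch uses the JDK's DOM API to place an xml-stylesheet processing instruction in front of the root element; the class and method names are hypothetical, and the sketch emits only the type and href pseudo-attributes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetPiSketch {
    /** Returns the TETML text with an xml-stylesheet instruction inserted before the root element. */
    static String addStylesheetPi(String tetml, String href) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(tetml)));

            // The instruction must precede the root element, i.e. it is a child of the document node
            ProcessingInstruction pi = doc.createProcessingInstruction(
                "xml-stylesheet", "type=\"text/xsl\" href=\"" + href + "\"");
            doc.insertBefore(pi, doc.getDocumentElement());

            // Serialize the modified document with an identity transformation
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("cannot add stylesheet instruction", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical minimal TETML root element for illustration
        String tetml = "<TET xmlns=\"http://www.pdflib.com/XML/TET5/TET-5.0\"/>";
        System.out.println(addStylesheetPi(tetml, "tetml2html.xsl"));
    }
}
```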
9.6 XSLT Samples
The TET distribution includes several XSLT stylesheets which demonstrate the power of
XSLT applied to TETML, and can be used as a starting point for TETML applications. This
section provides an overview of the XSLT samples and presents sample output. Section
9.5, »Transforming TETML with XSLT«, page 147 discusses many options for deploying
the XSLT stylesheets. More details regarding the functionality and inner workings of the
stylesheets can be found in comments in the XSLT code. Some general aspects of the
stylesheet samples:
> Most XSLT samples support parameters which can be used to control various processing details. These parameters can be set within the XSLT code or overridden from the
environment (e.g. ant).
> Most XSLT samples require TETML input in a certain TETML mode (e.g. word mode,
see »TETML modes«, page 135, for details). In order to protect themselves from wrong
input, they check whether the supplied TETML input conforms to the requirement,
and report an error otherwise.
> Some XSLT samples recursively process PDF attachments in the document (this is
mentioned in the descriptions below). Most samples ignore PDF attachments,
though. They are written in a way so that they can easily be expanded to process attachments as well. It is sufficient to select the interesting elements within the
Attachments element; the relevant xsl:template elements themselves don’t have to be
modified.
Create a concordance. The concordance.xsl stylesheet expects TETML input in word or
wordplus mode. It creates a concordance, i.e. a list of unique words in a document sorted
by descending frequency. This may be useful to create a concordance for linguistic analysis, cross-references for translators, consistency checks, etc.
List of words in the document along with the number of occurrences:
the 138
and 91
TET 87
to 63
of 59
for 57
PDF 53
text 51
in 50
a 44
is 37
be 36
as 34
are 34
PDFlib 32
...
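The same idea can be expressed outside XSLT. The following Java sketch counts the Text contents of all Word elements in a TETML document and sorts the result by descending frequency. It is not a port of concordance.xsl: the class name is hypothetical, the alphabetical tie-breaking is our own assumption, and the sketch assumes that each Word element carries its text in a Text child element as in word or wordplus mode.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConcordanceSketch {
    static final String TETML_NS = "http://www.pdflib.com/XML/TET5/TET-5.0";

    /** Returns "word count" lines, most frequent words first. */
    static List<String> concordance(String tetml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(tetml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse TETML", e);
        }

        // Count the text of the first Text element below each Word element
        Map<String, Integer> counts = new HashMap<>();
        NodeList words = doc.getElementsByTagNameNS(TETML_NS, "Word");
        for (int i = 0; i < words.getLength(); i++) {
            NodeList texts = ((Element) words.item(i)).getElementsByTagNameNS(TETML_NS, "Text");
            if (texts.getLength() > 0) {
                counts.merge(texts.item(0).getTextContent(), 1, Integer::sum);
            }
        }

        // Sort by descending frequency; ties are broken alphabetically (assumption of this sketch)
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            lines.add(entry.getKey() + " " + entry.getValue());
        }
        return lines;
    }
}
```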
Font filtering. The fontfilter.xsl stylesheet expects TETML input in glyph or wordplus
mode. It lists all words in a document which use a particular font in a size larger than a
specified value. This may be useful to detect certain font/size combinations or for quality control. The same concept can be used to create a table of contents based on text portions which use a large font size.
Text containing font 'TheSansBold-Plain' with size greater than 10:
[ThesisAntiqua-Bold/32.0000] PDFlib
[ThesisAntiqua-Bold/32.0000] TET
[ThesisAntiqua-Bold/32.0000] 5
[ThesisAntiqua-Bold/14.0000] What
[ThesisAntiqua-Bold/14.0000] is
[ThesisAntiqua-Bold/14.0000] PDFlib
[ThesisAntiqua-Bold/14.0000] TET
[ThesisAntiqua-Bold/14.0000] ?
[ThesisAntiqua-Bold/14.0000] PDFlib
[ThesisAntiqua-Bold/14.0000] TET
[ThesisAntiqua-Bold/14.0000] Features
[ThesisAntiqua-Bold/14.0000] Challenges
[ThesisAntiqua-Bold/14.0000] with
[ThesisAntiqua-Bold/14.0000] PDF
[ThesisAntiqua-Bold/14.0000] Text
[ThesisAntiqua-Bold/14.0000] Extraction
[ThesisAntiqua-Bold/14.0000] Challenges
...
Searching for font usage. The fontfinder.xsl stylesheet expects TETML input in glyph or
wordplus mode. For all fonts in a document, it lists all occurrences of text using this particular font along with page number and the position on the page. This may be useful
for detecting unwanted fonts and checking consistency, locating use of a particular bad
font size, etc.
TheSans-Plain used on:
page 1:
(306, 796)
ThesisAntiqua-Bold used on:
page 1:
(306, 757), (412, 757), (474, 757), (28, 514), (67, 514), (81, 514), (128, 514), (152, 514),
...
Font statistics. The fontstat.xsl stylesheet expects TETML input in glyph or wordplus
mode. It generates font and glyph statistics. This may be useful for quality control and
even accessibility testing since unmapped glyphs (i.e. glyphs which cannot be mapped
to any Unicode character) will also be reported for each font.
17048 total glyphs in the document; breakdown by font:
85.21% TheSansLight-Plain: 14527 glyphs
5.19% TheSansLight-Italic: 885 glyphs
4.83% ThesisAntiqua-Bold: 823 glyphs, 3 uses of ligatures: fi
2.87% TheSansMonoCondensed-Plain: 489 glyphs
0.33% TheSansSemiLight-Caps: 57 glyphs
0.33% TheSansLight-Plain: 56 glyphs
0.25% TheSansLight-Italic: 42 glyphs
0.17% TheSansExtraLight-Italic: 29 glyphs
0.16% TheSansLight-Plain: 28 glyphs
0.16% TheSansLight-Plain: 28 glyphs
0.16% TheSansLight-Italic: 28 glyphs
0.16% TheSansLight-Plain: 28 glyphs
0.06% TheSansBold-Plain: 10 glyphs
0.05% TheSans-Plain: 9 glyphs
0.04% WarnockPro-It: 7 glyphs, 7 uses of ligatures: fi fl ffi Th sp ct st
0.01% PDFlibLogo2-Regular: 1 glyphs, 1 uses of ligatures: PDFlib
0.01% WarnockPro-Regular: 1 glyphs
Create an index. The index.xsl stylesheet expects TETML input in word or wordplus
mode. It generates a back-of-the-book index, i.e. an alphabetically sorted list of words in
the document and the corresponding page numbers. Numbers and punctuation characters will be ignored.
Alphabetical list of words in the document along with their page number:
A
able 5
about 2
About 6
accent 3
Accented 3
accents 3
accept 5
Accepted 1
access 6
accessible 6
achieved 3
Acrobat 1 2 4 6
actual 2
actually 5
added 5
adding 6
addition 1 2 5
additional 2 4 5
Adobe 2 5 6
advanced 1
algorithm 3 4
...
Extract XMP metadata. The metadata.xsl stylesheet expects TETML input in any mode.
It targets XMP metadata on the document level, and extracts some metadata properties
from the XMP. PDF attachments (including PDF packages and portfolios) in the document are processed recursively:
dc:creator = PDFlib GmbH
xmp:CreatorTool = Adobe InDesign CS6 (Windows)
Extract table contents in CSV format. The table.xsl stylesheet expects TETML input in
word, wordplus, or page mode. It extracts the contents of a selected table and creates a
CSV file (comma-separated values) which contains the table contents. CSV files can be
opened with all spreadsheet applications. This may be useful to repurpose the contents
of tables in PDF documents.
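To illustrate the mapping from the Table, Row and Cell elements to CSV, here is a minimal Java sketch. It is not a port of table.xsl: the class name is hypothetical, and the sketch rests on assumptions of our own, namely that the words of a cell are available as Text elements somewhere below the Cell element, that tables are not nested, and that every field is quoted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableCsvSketch {
    static final String TETML_NS = "http://www.pdflib.com/XML/TET5/TET-5.0";

    /** Converts the table with the given zero-based index to CSV text, one line per Row. */
    static String tableToCsv(String tetml, int tableIndex) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(tetml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse TETML", e);
        }
        NodeList tables = doc.getElementsByTagNameNS(TETML_NS, "Table");
        if (tableIndex < 0 || tableIndex >= tables.getLength()) {
            throw new IllegalArgumentException("no such table: " + tableIndex);
        }
        StringBuilder csv = new StringBuilder();
        NodeList rows = ((Element) tables.item(tableIndex)).getElementsByTagNameNS(TETML_NS, "Row");
        for (int r = 0; r < rows.getLength(); r++) {
            NodeList cells = ((Element) rows.item(r)).getElementsByTagNameNS(TETML_NS, "Cell");
            List<String> fields = new ArrayList<>();
            for (int c = 0; c < cells.getLength(); c++) {
                // Join all Text contents below the cell with single spaces
                NodeList texts = ((Element) cells.item(c)).getElementsByTagNameNS(TETML_NS, "Text");
                List<String> words = new ArrayList<>();
                for (int t = 0; t < texts.getLength(); t++) {
                    words.add(texts.item(t).getTextContent());
                }
                fields.add(quote(String.join(" ", words)));
            }
            csv.append(String.join(",", fields)).append("\n");
        }
        return csv.toString();
    }

    /** Quotes a CSV field; embedded double quotes are doubled. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
}
```

The colSpan attribute of Cell is ignored here; a more complete converter would presumably emit empty fields for spanned columns.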
Convert TETML to HTML. The tetml2html.xsl stylesheet expects TETML input in
wordplus mode. It converts TETML to HTML which can be displayed in a browser. The
converter does not attempt to generate an identical visual representation of the PDF
document, but demonstrates the following aspects:
> Create a linked table of contents at the beginning of the HTML page, where the entries are based on PDF bookmarks or headings in the document.
> Create heading elements (H1, H2, etc.) based on configurable font sizes and font
names.
> Convert link annotations with type URI to HTML links.
> Map table elements in TETML to the corresponding HTML table constructs to visualize tables in the browser.
> Create a list of images for each page where the images are linked to the corresponding image file.
> Create links from PDF annotations.
Extract raw text from TETML. The textonly.xsl stylesheet expects TETML input in any
mode. It extracts the raw text contents by fetching all Text elements while ignoring all
other elements. PDF attachments (including PDF packages and portfolios) in the document are processed recursively.
10 TET Library API Reference
10.1 Option Lists
Option lists are a powerful yet easy method for controlling API function calls. Instead of
requiring a multitude of function parameters, many API methods support option lists,
or optlists for short. These are strings which can contain an arbitrary number of options.
Option lists support various data types and composite data like lists. In most language
bindings optlists can easily be constructed by concatenating the required keywords and
values.
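As a minimal sketch of the concatenation approach in Java, the helpers below assemble option list strings from keys and values. The names option( ) and optlist( ) are hypothetical and not part of the TET API; they only build strings. Every value is wrapped in braces, which the option list syntax permits for any element. Values containing unbalanced braces would additionally need backslash protection, which this sketch does not attempt.

```java
class OptlistBuilder {
    /** Builds a single key={value} pair; the braces protect separators inside the value. */
    static String option(String key, String value) {
        return key + "={" + value + "}";
    }

    /** Joins several options (or list elements) with single spaces. */
    static String optlist(String... options) {
        return String.join(" ", options);
    }

    public static void main(String[] args) {
        // Page options taken from the TETML examples in this manual
        String pageOptions = optlist(
            option("tetml", optlist(option("glyphdetails", "font"), option("elements", "line"))),
            option("topdown", "output"));
        System.out.println(pageOptions);
    }
}
```

The resulting string could then be supplied wherever a page option list is expected, e.g. to the --pageopt command-line option.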
Bindings C language binding: you may want to use the sprintf( ) function for constructing optlists.
Bindings .NET language binding: C# programmers should keep in mind that the AppendFormat( )
StringBuilder method uses the { and } braces to represent format items which will be replaced by the string representation of arguments. On the other hand, the Append( )
method does not impose any special meaning on the brace characters. Since the option
list syntax makes use of the brace characters, care must be taken in selecting the
AppendFormat( ) or Append( ) method appropriately.
10.1.1 Option List Syntax
Formal option list syntax definition. Option lists must be constructed according to the following rules:
> All elements (keys and values) in an option list must be separated by one or more of
the following separator characters: space, tab, carriage return, newline, equal sign ’=’.
> An outermost pair of enclosing braces is not part of the element. The sequence { }
designates an empty element.
> Separators within the outermost pair of braces no longer split elements, but are part
of the element. Therefore, an element which contains separators must be enclosed
with braces.
> An element which contains braces at the beginning or end must be enclosed with
braces.
> If an element contains unbalanced braces, these must be protected with a preceding
backslash character. A backslash in front of the closing brace of an element must also
be preceded by a backslash character.
> Option lists must not contain binary zero values.
An option may have a list value according to its documentation in this reference. List
values contain one or more elements (which may themselves be lists). They are separated according to the rules above, with the only difference that the equal sign is no longer
treated as a separator.
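The splitting rules above can be sketched as a small tokenizer. This is only an illustration under stated assumptions, not the parser used by TET: the function name split_elements and the constants are hypothetical, escaped characters are left in the element as written, and apart from unbalanced braces and a missing separator after a closing brace no errors are detected.

```python
# Hypothetical helper illustrating the element-splitting rules; not part of TET.
SEPARATORS = " \t\r\n="       # separators between keys and values
LIST_SEPARATORS = " \t\r\n"   # inside list values '=' is no longer a separator


def split_elements(optlist, separators=SEPARATORS):
    """Split an option list (or list value) into its top-level elements."""
    elements = []
    i, n = 0, len(optlist)
    while i < n:
        ch = optlist[i]
        if ch in separators:
            i += 1
            continue
        if ch == "{":
            # Braced element: scan to the matching closing brace.
            depth, j = 1, i + 1
            while j < n and depth > 0:
                if optlist[j] == "\\" and j + 1 < n:
                    j += 2  # a backslash protects the following character
                    continue
                if optlist[j] == "{":
                    depth += 1
                elif optlist[j] == "}":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("braces aren't balanced")
            if j < n and optlist[j] not in separators:
                raise ValueError("missing separator after '}'")
            # The outermost pair of braces is not part of the element.
            elements.append(optlist[i + 1:j - 1])
            i = j
        else:
            # Bare element: runs until the next separator.
            j = i
            while j < n and optlist[j] not in separators:
                j += 1
            elements.append(optlist[i:j])
            i = j
    return elements


elements = split_elements("fold={ {[:Private_Use:] remove} {[U+FFFD] remove} }")
nested = split_elements(elements[1], LIST_SEPARATORS)
```

Calling the function again on a list value, with the equal sign removed from the separators, yields the elements of that list.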
Simple option lists. In many cases option lists will contain one or more key/value
pairs. Keys and values, as well as multiple key/value pairs must be separated by one or
more whitespace characters (space, tab, carriage return, newline). Alternatively, keys
can be separated from values by an equal sign ’=’:
key=value
key = value
key value
key1 = value1
key2 = value2
To increase readability we recommend using equal signs between key and value and
whitespace between adjacent key/value pairs.
Since option lists are evaluated from left to right, an option can be supplied multiple times within the same list. In this case the last occurrence will overwrite earlier ones. In
the following example the first option assignment will be overridden by the second,
and key will have the value value2 after processing the option list:
key=value1 key=value2
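The left-to-right evaluation can be illustrated with a minimal sketch for flat option lists. The function parse_simple_optlist is hypothetical; it assumes that there are no braces and that every key has an explicit value, so Boolean options with an omitted value are not covered.

```python
import re


def parse_simple_optlist(optlist):
    """Parse a flat option list (no braces) into a dict; later keys win."""
    # Keys and values are separated by whitespace or an equal sign.
    tokens = [t for t in re.split(r"[ \t\r\n=]+", optlist) if t]
    if len(tokens) % 2:
        raise ValueError("expected key/value pairs")
    options = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        options[key] = value  # a repeated key overwrites the earlier value
    return options


result = parse_simple_optlist("key=value1 key=value2")
```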
List values. Lists contain one or more separated values, which may be simple values or
list values in turn. Lists are bracketed with { and } braces, and the values in the list must
be separated by whitespace characters. Examples:
searchpath={/usr/lib/tet d:\tet}
(list containing two directory names)
A list may also contain nested lists. In this case the lists must also be separated from
each other by whitespace. While a separator must be inserted between adjacent } and {
characters, it can be omitted between braces of the same kind:
fold={ {[:Private_Use:] remove} {[U+FFFD] remove} }
(list containing two lists)
If the list contains exactly one list the braces for the nested list must not be omitted:
fold={ {[:Private_Use:] remove} }
(list containing one nested list)
Nested option lists and list values. Some options accept the type option list or list of
option lists. Options of type option list contain one or more subordinate options. Options
of type list of option lists contain one or more nested option lists. When dealing with
nested option lists it is important to specify the proper number of enclosing braces.
Several examples are listed below.
The value of the option contentanalysis is an option list which itself contains the single option punctuationbreaks:
contentanalysis={punctuationbreaks=false}
The value of the option glyphmapping in the following example is a list of option lists
containing a single option list:
glyphmapping={ {fontname=GlobeLogosOne codelist=GlobeLogosOne} }
The value of the option glyphmapping in the following example is a list of option lists
containing two option lists:
glyphmapping { {fontname=CMSY* glyphlist=tarski} {fontname=ZEH* glyphlist=zeh}}
List containing one option list with a fontname value that includes spaces and therefore
requires an additional pair of braces:
glyphmapping={ {fontname={Globe Logos One} codelist=GlobeLogosOne} }
List containing two keywords:
fonttype={Type1 TrueType}
List containing different types – the inner lists contain a Unicode set and a keyword, the
outer list contains two option lists and the keyword default:
fold={ {[:Private_Use:] remove} {[U+FFFD] remove} default }
List containing one rectangle:
includebox={{10 20 30 40}}
Common traps and pitfalls. This paragraph lists some common errors regarding option list syntax.
Braces are not separators; the following is wrong:
key1 {value1}key2 {value2}
WRONG!
This will trigger the error message Unknown option 'value2'. Similarly, the following are
wrong since the separators are missing:
key{value}
WRONG!
key={{value1}{value2}}
WRONG!
Braces must be balanced; the following is wrong:
key={open brace {}
WRONG!
This will trigger the error message Braces aren't balanced in option list 'key={open brace {}'.
A single brace as part of a string must be preceded by an additional backslash character:
key={closing brace \} and open brace \{}
CORRECT!
A backslash at the end of a string value must be preceded by another backslash if it is
followed by a closing brace character:
key={\value\}
WRONG!
key={\value\\}
CORRECT!
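The pitfalls above lend themselves to a simple pre-flight check before an option list is handed to the library. The following is a heuristic sketch only: lint_optlist is a hypothetical helper, its messages are not the error messages issued by TET, and it does not distinguish braces that are legitimate content of a string value.

```python
SEPARATORS = " \t\r\n="


def lint_optlist(optlist):
    """Return a list of likely brace problems (heuristic, not TET's parser)."""
    problems = []
    depth = 0
    i, n = 0, len(optlist)
    while i < n:
        ch = optlist[i]
        if ch == "\\" and i + 1 < n:
            i += 2  # skip the backslash and the character it protects
            continue
        if ch == "{":
            # Braces are not separators: '{' must follow a separator or a brace.
            if i > 0 and optlist[i - 1] not in SEPARATORS + "{}":
                problems.append(f"missing separator before '{{' at offset {i}")
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                problems.append(f"unexpected '}}' at offset {i}")
                depth = 0
            # '}' must be followed by whitespace, '=', another '}' or the end.
            if i + 1 < n and optlist[i + 1] not in SEPARATORS + "}":
                problems.append(f"missing separator after '}}' at offset {i}")
        i += 1
    if depth != 0:
        problems.append("braces aren't balanced")
    return problems
```

An empty result means only that none of the pitfalls listed above was found; it does not prove that the option list is valid.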
10.1.2 Basic Types
String. Strings are plain ASCII strings (or EBCDIC strings on EBCDIC platforms) which
are generally used for non-localizable keywords. Strings containing whitespace or ’=’
characters must be bracketed with { and }:
password={ secret string }
(string value contains three blanks)
contents={length=3mm}
(string value containing one equal sign)
The characters { and } must be preceded by an additional \ character if they are supposed to be part of the string:
password={weird\}string}
(string value contains a right brace)
A backslash in front of the closing brace of an element must be preceded by a backslash
character:
filename={C:\path\name\\}
(string ends with a single backslash)
An empty string can be constructed with a pair of braces:
{}
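When string values come from user input it may be convenient to apply the bracing and backslash rules programmatically. The sketch below uses a hypothetical helper quote_element which always encloses the value in braces and escapes every brace. Combinations which the rules above do not spell out (a backslash directly in front of a brace, several backslashes at the end) are rejected rather than guessed.

```python
def quote_element(value):
    """Return value as a braced string element (hypothetical helper)."""
    if value.endswith("\\\\"):
        raise ValueError("multiple trailing backslashes are not handled by this sketch")
    out = []
    for i, ch in enumerate(value):
        if ch in "{}":
            if i > 0 and value[i - 1] == "\\":
                raise ValueError("backslash directly before a brace is not handled by this sketch")
            out.append("\\" + ch)  # braces that are part of the string get a backslash
        else:
            out.append(ch)
    if value.endswith("\\"):
        out.append("\\")  # protect a backslash in front of the closing brace
    return "{" + "".join(out) + "}"


optlist = "password=" + quote_element("weird}string")
```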
Non-Unicode capable language bindings: if an option list starts with a [EBCDIC-]UTF-8 BOM, each content, hypertext or name string of the option list will be interpreted as a
[EBCDIC-]UTF-8 string.
Unichar. A Unichar is a single Unicode value where several syntax variants are supported: decimal values ≥ 10 (e.g. 173), hexadecimal values prefixed with x, X, 0x, 0X, or U+
(xAD, 0xAD, U+00AD), numerical references, character references, and glyph name references but without the ’&’ and ’;’ decoration (shy, #xAD, #173). Examples:
unknownchar=?
(literal)
unknownchar=63
(decimal)
unknownchar=x3F
(hexadecimal)
unknownchar=0x3F
(hexadecimal)
unknownchar=U+003F
(Unicode notation)
lineseparator={CRLF}
(standard glyph name reference)
Single characters which happen to be a digit are treated literally, not as decimal Unicode
values:
replacementchar=3
(U+0033 THREE, not U+0003!)
Unichars must be in the hexadecimal range 0-0x10FFFF (decimal 0-1114111).
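The numeric Unichar variants can be summarized in a short sketch. The function parse_unichar is hypothetical and covers only literals, decimal values and the hexadecimal prefixes; numerical references, character references and glyph name references (e.g. shy, #xAD) are deliberately left out and raise an error here.

```python
import re

MAX_UNICODE = 0x10FFFF


def parse_unichar(text):
    """Convert a literal, decimal or hexadecimal Unichar to a code point."""
    hex_match = re.fullmatch(r"(?:0[xX]|[xX]|U\+)([0-9A-Fa-f]+)", text)
    if len(text) == 1:
        value = ord(text)  # single characters, including digits, are literal
    elif hex_match:
        value = int(hex_match.group(1), 16)
    elif re.fullmatch(r"[0-9]+", text):
        value = int(text)  # two or more digits: decimal Unicode value
    else:
        raise ValueError(f"unsupported Unichar syntax in this sketch: {text!r}")
    if value > MAX_UNICODE:
        raise ValueError("Unichar outside the range 0-0x10FFFF")
    return value
```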
Unicode sets. Unicode sets can be constructed with the following building blocks:
> Patterns are a series of characters bounded by square brackets that contain lists of
Unicode characters and Unicode property sets.
> Lists are a sequence of Unicode characters that may have ranges indicated by a '-' between two characters, as in U+FB00-U+FB17. The sequence specifies the range of all
characters from the left to the right, in Unicode order. Multiple Unicode characters
must not be separated by whitespace, but must directly follow each other, e.g.
U+0048U+006C.
> Unicode characters in lists can be specified as follows:
ASCII characters can be specified as literals
Exactly 4 hex digits: \uhhhh or U+hhhh
Exactly 5 hex digits: U+hhhhh
1-6 hex digits: \x{hhhhhh}
Exactly 8 hex digits: \Uhhhhhhhh
escaped backslash: \\
> Unicode property sets are specified by a Unicode property. The syntax for specifying
the property names is an extension of POSIX and Perl syntax, where type represents
the name of a Unicode property (see www.unicode.org/Public/UNIDATA/PropertyAliases.txt) and value the corresponding value (see www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt):
POSIX-style syntax: [:type=value:]
POSIX-style syntax with negation: [:^type=value:]
Perl-style syntax: \p{type=value}
Perl-style syntax with negation: \P{type=value}
The type= can be omitted for the Category and Script properties, but is required for
other properties.
> Set operations can be applied to patterns:
To build the union of two sets, simply concatenate them: [[:letter:] [:number:]]
To intersect two sets, use the '&' operator: [[:letter:] & [U+0061-U+007A]]
To take the set difference of two sets, use the '-' operator: [[:letter:]-[U+0061-U+007A]]
To invert a set, place a '^' immediately after the opening '[':
[^U+0061-U+007A]
In any other location, the '^' does not have a special meaning.
Table 10.1 Unicode set examples
specification of Unicode set             characters in the Unicode set
[U+0061-U+007A]                          lower case letters a through z
[U+0640]                                 single character Arabic Tatweel
[\x{0640}]                               single character Arabic Tatweel
[U+FB00-U+FB17]                          Latin and Armenian ligatures
[^U+0061-U+007A]                         all characters except a through z
[:Lu:] or [:UppercaseLetter:]            all uppercase letters (short and long forms of the Unicode set)
[:L:] or [:Letter:]                      all Unicode categories starting with L (short and long forms of the Unicode set)
[:General_Category=Dash_Punctuation:]    all characters in the general category Dash_Punctuation
[:Alphabetic=No:]                        all non-alphabetic characters
[:Private_Use:]                          all characters in the Private Use Area (PUA)
See Table 10.1 for examples of Unicode sets. You can use the following Web site for interactively testing Unicode set expressions:
unicode.org/cldr/utility/list-unicodeset.jsp
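The meaning of the set operations can be reproduced with ordinary sets of code points. The sketch below is not how TET evaluates Unicode sets; it uses Python's unicodedata module and, as an assumption made only to keep the example small, restricts all sets to the range U+0000-U+024F.

```python
import unicodedata

LIMIT = 0x250  # assumption of this sketch: only U+0000..U+024F is considered


def category_set(prefix):
    """All code points below LIMIT whose general category starts with prefix."""
    return {cp for cp in range(LIMIT)
            if unicodedata.category(chr(cp)).startswith(prefix)}


letters = category_set("L")               # corresponds to [:letter:]
numbers = category_set("N")               # corresponds to [:number:]
a_to_z = set(range(0x0061, 0x007A + 1))   # corresponds to [U+0061-U+007A]

union = letters | numbers                 # [[:letter:] [:number:]]
intersection = letters & a_to_z           # [[:letter:] & [U+0061-U+007A]]
difference = letters - a_to_z             # [[:letter:]-[U+0061-U+007A]]
inverted = set(range(LIMIT)) - a_to_z     # [^U+0061-U+007A]
```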
Boolean. Booleans have the values true or false; if the value of a Boolean option is
omitted, the value true is assumed. As a shorthand notation noname can be used instead
of name=false:
usehostfonts
(equivalent to usehostfonts=true)
nousehostfonts
(equivalent to usehostfonts=false)
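The Boolean shorthand can be sketched as follows. Since an option name may itself start with the letters no, the hypothetical function parse_boolean_option takes the set of known option names as a parameter; this parameter is an assumption of the sketch and not something the API requires.

```python
def parse_boolean_option(token, known_options):
    """Resolve name, noname and name=true/false to a (name, value) pair."""
    name, sep, value = token.partition("=")
    if sep:
        if value not in ("true", "false"):
            raise ValueError(f"not a Boolean value: {value!r}")
        return name, value == "true"
    if name in known_options:
        return name, True        # omitted value means true
    if name.startswith("no") and name[2:] in known_options:
        return name[2:], False   # noname is shorthand for name=false
    raise ValueError(f"unknown option: {name!r}")
```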
Keyword. An option of type keyword can hold one of a predefined list of fixed keywords. Example:
clippingarea=cropbox
For some options the value can hold either a number or a keyword.
Number. Option lists support several numerical types.
Integer types can hold decimal and hexadecimal integers. Positive integers starting
with x, X, 0x, or 0X specify hexadecimal values:
-12345
0
0xFF
Floats can hold decimal floating point or integer numbers; period and comma can be
used as decimal separators for floating point values. Exponential notation is also supported. The following values are all equivalent:
size = -123.45
size = -123,45
size = -1.2345E2
size = -1.2345e+2
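The number formats can be illustrated with two small conversion functions. Both names are hypothetical, and the sketch is more lenient than the description above because Python's int( ) and float( ) accept some additional spellings (e.g. inf).

```python
import re


def parse_integer(text):
    """Decimal integer, or hexadecimal with one of the prefixes x, X, 0x, 0X."""
    hex_match = re.fullmatch(r"(?:0[xX]|[xX])([0-9A-Fa-f]+)", text)
    if hex_match:
        return int(hex_match.group(1), 16)
    return int(text, 10)


def parse_float(text):
    """Float with period or comma as decimal separator, optional exponent."""
    return float(text.replace(",", "."))


values = [parse_float(s) for s in ("-123.45", "-123,45", "-1.2345E2", "-1.2345e+2")]
```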
10.1.3 Geometric Types
Rectangle. A rectangle is a list of four float values specifying the x and y coordinates of
the lower left and upper right corners of a rectangle. The coordinate system for interpreting the coordinates (default or user coordinate system) varies depending on the option, and is documented separately. Example:
includebox = {{0 0 500 100} {0 500 500 600}}
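A list of rectangles can be assembled from numeric tuples as in the following sketch. The helper format_rectangles is hypothetical; it only formats the values and does not check that the first corner is actually the lower left one.

```python
def format_rectangles(option, rectangles):
    """Build 'option = {{llx lly urx ury} ...}' from 4-tuples of numbers."""
    parts = []
    for rect in rectangles:
        if len(rect) != 4:
            raise ValueError("a rectangle needs exactly four values")
        # Each rectangle is a braced list of four whitespace-separated floats.
        parts.append("{" + " ".join(str(v) for v in rect) + "}")
    # The outer braces make the value a list of rectangles.
    return option + " = {" + " ".join(parts) + "}"


optlist = format_rectangles("includebox", [(0, 0, 500, 100), (0, 500, 500, 600)])
```

Note the outer pair of braces, which is required even if the list contains only a single rectangle.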
10.1.4 Encoding Names
Various options and parameters support the names of encodings, e.g. the filenamehandling option of TET_set_option( ), the forceencoding option of TET_open_document( ),
and the inputformat parameter of TET_convert_to_unicode( ). The following keywords can
be supplied as encoding names:
> The keyword auto specifies the most natural encoding for certain environments:
> On Windows: the current system code page
> On Unix and OS X: iso8859-1
> On i5/iSeries: the current job’s encoding (IBMCCSID000000000000)
> On zSeries: ebcdic
> winansi (=cp1252)
> iso8859-1 - iso8859-10, iso8859-13 - iso8859-14
> cp1250 - cp1258
> macroman, macroman_euro (replaces currency with Euro), macroman_apple (replaces
currency with Euro and includes additional mathematical/greek symbols)
> adobesymbol designates the Adobe Symbol encoding
> U+XXXX (256 characters starting at the specified value)
> ebcdic (=code page 1047), ebcdic_37 (=code page 037)
> CJK encodings cp932, cp936, cp949, cp950
> on the following systems all encodings available on the host system can be used:
> Windows: cpXXXX
> Linux: all codesets known to the iconv facility
> i5/iSeries: any Coded Character Set Identifier without the CCSID prefix
> zSeries: any Coded Character Set Identifier (CCSID)
> custom encodings can be defined as resources and referenced by their resource
name
10.2 General Functions
10.2.1 Option Handling
C++ Java C# void set_option(String optlist)
Perl PHP set_option(string optlist)
C void TET_set_option(TET *tet, const char *optlist)
Set one or more global options for TET.
optlist An option list specifying global options according to Table 10.2. If an option is
provided more than once the last instance will override all previous ones. In order to
supply multiple values for a single option (e.g. searchpath) supply all values in a list argument to this option.
The following options can be used: asciifile, cmap, codelist, encoding, filenamehandling,
fontoutline, glyphlist, license, licensefile, logging, userlog, outputformat, resourcefile,
searchpath
Details Multiple calls to this function can be used to accumulate values for those options
marked in Table 10.2. For unmarked options the new value will override the old one.
Table 10.2 Global options for TET_set_option( )
option
description
asciifile
(Boolean; Only supported on i5/iSeries and zSeries). Expect text files (e.g. UPR configuration files, glyph
lists, code lists) in ASCII encoding. Default: true on i5/iSeries; false on zSeries
cmap¹,²
(List of name strings) A list of string pairs, where each pair contains the name and value of a CMap resource (see Section 5.2, »Resource Configuration and File Searching«, page 59).
codelist¹,²
(List of name strings) A list of string pairs, where each pair contains the name and value of a codelist resource (see Section 5.2, »Resource Configuration and File Searching«, page 59).
encoding¹,²
(List of name strings) A list of string pairs, where each pair contains the name and value of an encoding
resource (see Section 5.2, »Resource Configuration and File Searching«, page 59).
filenamehandling
(Keyword) Indicates the encoding of file names. File names supplied as function parameters without
UTF-8 BOM in non-Unicode aware language bindings are interpreted according to this option to guard
against characters which would be illegal in the file system and to create a Unicode version of the file
name. An error occurs if the file name contains characters outside the specified encoding. Default:
unicode on Windows and OS X, auto on i5/iSeries, otherwise honorlang:
ascii
7-bit ASCII
basicebcdic Basic EBCDIC according to code page 1047, but only Unicode values <= U+007E
basicebcdic_37
Basic EBCDIC according to code page 0037, but only Unicode values <= U+007E
honorlang (Not supported on i5/iSeries) The environment variables LC_ALL, LC_CTYPE and LANG will be
interpreted. The codeset specified in LANG is applied to file names if it is available.
legacy
Use auto encoding (i.e. the current system encoding) to interpret the file name and interpret
the LANG variable if the honorlang parameter is set.
unicode
Unicode encoding in (EBCDIC-) UTF-8 format
all names of 8-bit and CJK encodings
Encoding name according to Section 10.1.4, »Encoding Names«, page 160
fontoutline¹,²
(List of name strings) A list of string pairs, where each pair contains the name and value of a FontOutline
resource (see Section 5.2, »Resource Configuration and File Searching«, page 59).
glyphlist¹,²
(List of name strings) A list of string pairs, where each pair contains the name and value of a glyphlist resource (see Section 5.2, »Resource Configuration and File Searching«, page 59).
hostfont¹,²
(List of name strings) A list of string pairs, where each pair contains a PDF font name and the UTF-8 encoded name of a host font to be used for an unembedded font.
license
(String) Set the license key. It must be set before the first call to TET_open_document*( ).
licensefile
(String) Set the name of a file containing the license key(s). The license file can be set only once before the
first call to TET_open_document*( ). Alternatively, the name of the license file can be supplied in an
environment variable called PDFLIBLICENSEFILE or (on Windows) via the registry.
logging¹
(Option list; unsupported) An option list specifying logging output according to Table 10.7. Alternatively,
logging options can be supplied in an environment variable called TETLOGGING or on Windows via the
registry. An empty option list will enable logging with the options set in previous calls. If the environment
variable is set logging will start immediately after the first call to TET_new( ).
userlog
(Name string; unsupported) Arbitrary string which will be written to the log file if logging is enabled.
outputformat
(Keyword; only for the C, Ruby, Perl, Python, and PHP language bindings) Specifies the format of the text
returned by TET_get_text( ):
utf8
Strings are returned in UTF-8 format (in C: null-terminated).
utf16
Strings are returned in UTF-16 format in the machine’s native byte ordering.
utf32
Strings are returned in UTF-32 format in the machine’s native byte ordering.
ebcdicutf8 (Only available on EBCDIC-based systems) Strings are returned in null-terminated EBCDIC-encoded UTF-8 format. Code page 37 is used on i5/iSeries, code page 1047 on zSeries.
Default: utf8 for C, Ruby, Perl, Python, PHP, and ebcdicutf8 for C on i5/iSeries and zSeries
resourcefile
(Name string) Relative or absolute file name of the UPR resource file. The resource file will be loaded
immediately. Existing resources will be kept; their values will be overridden by new ones if they are set
again. Explicit resource options will be evaluated after entries in the resource file.
The resource file name can also be supplied in the environment variable TETRESOURCEFILE or with a
Windows registry key (see Section 5.2, »Resource Configuration and File Searching«, page 59). Default:
tet.upr (on MVS: upr)
searchpath¹
(List of name strings) Relative or absolute path name(s) of a directory containing files to be read. The
search path can be set multiple times; the entries will be accumulated and used in least-recently-set order (see
Section 5.2, »Resource Configuration and File Searching«, page 59). It is recommended to use double
braces even for a single entry to avoid problems with directory names containing space characters. An
empty string list (i.e. {{}} ) deletes all existing search path entries including the default entries. On
Windows the search path can also be set via a registry entry. Default: platform-specific, see »File search
and the searchpath resource category«, page 60.
shutdownstrategy
(Integer) Strategy for releasing global resources which are allocated once for all TET objects. Each global
resource is initialized on demand when it is first needed. This option must be set to the same value for all
TET objects in a process; otherwise the behavior is undefined (default: 0):
0
A reference counter keeps track of how many TET objects use the resource. When the last TET
object is deleted and the reference counter drops to zero, the resource is released.
1
The resource is kept until the end of the process. This may slightly improve performance, but
requires more memory after the last TET object is deleted.
1. Option values can be accumulated with multiple calls.
2. Unlike the UPR syntax an equal character ’=’ between the name and value is neither required nor allowed.
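As an illustration of the double-brace recommendation for searchpath, the following sketch assembles the option from a list of directory names. The helper searchpath_option is hypothetical; it refuses names which would need backslash escaping instead of attempting to escape them.

```python
def searchpath_option(directories):
    """Build a searchpath option with one braced entry per directory."""
    if not directories:
        raise ValueError("at least one directory is required")
    for directory in directories:
        if "{" in directory or "}" in directory or directory.endswith("\\"):
            raise ValueError("name needs escaping, which this sketch omits")
    # Double braces: the outer pair encloses the list, the inner pairs the entries,
    # so directory names containing spaces remain a single entry.
    return "searchpath={" + " ".join("{" + d + "}" for d in directories) + "}"


optlist = searchpath_option(["/usr/lib/tet", "C:\\Program Files\\TET"])
```

The resulting string is intended to be passed to set_option( ) or TET_set_option( ). According to footnote 1, the values of repeated calls are accumulated for this option.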
10.2.2 Setup
C TET *TET_new(void)
Create a new TET object.
Returns A handle to a TET object to be used in subsequent calls. If this function doesn’t succeed
due to unavailable memory it will return NULL.
Bindings This function is not available in object-oriented language bindings since it is hidden in
the TET constructor.
Java void delete( )
C# void Dispose( )
C void TET_delete(TET *tet)
Delete a TET object and release all related internal resources.
Details Deleting a TET object automatically closes all of its open documents. The TET object
must no longer be used in any function after it has been deleted.
Bindings In object-oriented language bindings this function is generally not required since it is
hidden in the TET destructor. However, in Java it is available nevertheless to allow explicit cleanup in addition to automatic garbage collection. In .NET Dispose( ) should be
called at the end of processing to clean up unmanaged resources.
10.2.3 PDFlib Virtual Filesystem (PVF)
C++ void create_pvf(wstring filename, const void *data, size_t size, wstring optlist)
C# Java void create_pvf(String filename, byte[] data, String optlist)
Perl PHP create_pvf(string filename, string data, string optlist)
C void TET_create_pvf(TET *tet,
const char *filename, int len, const void *data, size_t size, const char *optlist)
Create a named virtual read-only file from data provided in memory.
filename (Name string) The name of the virtual file. This is an arbitrary string which
can later be used to refer to the virtual file in other TET calls.
len (C language binding only) Length of filename (in bytes) for UTF-16 strings. If len=0
a null-terminated string must be provided.
data A reference to the data for the virtual file. In COM this is a variant of byte containing the data comprising the virtual file. In C and C++ this is a pointer to a memory
location. In Java this is a byte array. In Perl and PHP this is a string.
size (C and C++ only) The length in bytes of the memory block containing the data.
optlist An option list according to Table 10.3. The following option can be used: copy
Details The virtual file name can be supplied to any API function which uses input files. Some
of these functions may set a lock on the virtual file until the data is no longer needed.
Virtual files will be kept in memory until they are deleted explicitly with TET_delete_pvf( ),
or automatically in TET_delete( ).
Each TET object will maintain its own set of PVF files. Virtual files cannot be shared
among different TET objects. Multiple threads working with separate TET objects do not
need to synchronize PVF use. If filename refers to an existing virtual file an exception
will be thrown. This function does not check whether filename is already in use for a regular disk file.
Unless the copy option has been supplied, the caller must not modify or free (delete)
the supplied data before a corresponding successful call to TET_delete_pvf( ). Violating this rule will most likely result in a crash.
Table 10.3 Options for TET_create_pvf( )
option
description
copy
(Boolean) TET will immediately create an internal copy of the supplied data. In this case the caller may
dispose of the supplied data immediately after this call. The copy option will automatically be set to true
in the COM, .NET, and Java bindings (default for other bindings: false). In other language bindings the
data will not be copied unless the copy option is supplied.
C++ Java C# int delete_pvf(String filename)
Perl PHP int delete_pvf(string filename)
C int TET_delete_pvf(TET *tet, const char *filename, int len)
Delete a named virtual file and free its data structures.
filename (Name string) The name of the virtual file as supplied to TET_create_pvf( ).
len (C language binding only) Length of filename (in bytes) for UTF-16 strings. If len=0
a null-terminated string must be provided.
Returns -1 if the corresponding virtual file exists but is locked, and 1 otherwise.
Details If the file isn’t locked, TET will immediately delete the data structures associated with
filename. If filename does not refer to a valid virtual file this function will silently do
nothing. After successfully calling this function filename may be reused. All virtual files
will automatically be deleted in TET_delete( ).
The detailed semantics depend on whether or not the copy option has been supplied
to the corresponding call to TET_create_pvf( ): If the copy option has been supplied, both
the administrative data structures for the file and the actual file contents (data) will be
freed; otherwise, the contents will not be freed, since the client is supposed to do so.
C++ Java C# int info_pvf(String filename, String keyword)
Perl PHP int info_pvf(string filename, string keyword)
C int TET_info_pvf(TET *tet, const char *filename, int len, const char *keyword)
Query properties of a virtual file or the PDFlib Virtual File system (PVF).
filename (Name string) The name of the virtual file. The filename may be empty if
keyword=filecount.
len (C language binding only) Length of filename (in bytes) for UTF-16 strings. If len=0 a
null-terminated string must be provided.
keyword A keyword according to Table 10.4.
Details This function returns various properties of a virtual file or the PDFlib Virtual File system (PVF). The property is specified by keyword.
Table 10.4 Keywords for TET_info_pvf( )
option
description
filecount
Total number of files in the PDFlib Virtual File system maintained for the current TET object. The
filename parameter will be ignored.
exists
1 if the file exists in the PDFlib Virtual File system (and has not been deleted), otherwise 0
size
(Only for existing virtual files) Size of the specified virtual file in bytes.
iscopy
(Only for existing virtual files) 1 if the copy option was supplied when the specified virtual file was created, otherwise 0
lockcount
(Only for existing virtual files) Number of locks for the specified virtual file set internally by TET functions.
The file can only be deleted if the lock count is 0.
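The bookkeeping described for the PVF functions can be mimicked by a toy model. The class below is not TET code and does not call TET: ToyPvf and all of its methods are hypothetical, the lock( ) method exists only to simulate the locks which TET functions set internally, and querying size, iscopy or lockcount for a missing file simply raises KeyError because the text above does not specify this case.

```python
class ToyPvf:
    """Toy in-memory model of the PVF semantics described above (not TET code)."""

    def __init__(self):
        self._files = {}

    def create(self, filename, data, copy=False):
        if filename in self._files:
            raise ValueError("virtual file already exists")
        # With copy=True the caller may modify or free its buffer afterwards.
        stored = bytes(data) if copy else data
        self._files[filename] = {"data": stored, "iscopy": copy, "lockcount": 0}

    def lock(self, filename):
        self._files[filename]["lockcount"] += 1  # stands in for TET's internal locks

    def delete(self, filename):
        entry = self._files.get(filename)
        if entry is not None and entry["lockcount"] > 0:
            return -1                    # file exists but is locked
        self._files.pop(filename, None)  # a missing file is silently ignored
        return 1

    def info(self, filename, keyword):
        if keyword == "filecount":
            return len(self._files)      # filename is ignored
        if keyword == "exists":
            return 1 if filename in self._files else 0
        entry = self._files[filename]
        if keyword == "size":
            return len(entry["data"])
        if keyword in ("iscopy", "lockcount"):
            return int(entry[keyword])
        raise ValueError(f"unknown keyword: {keyword!r}")
```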
10.2.4 Unicode Conversion Function
C++ string convert_to_unicode(wstring inputformat, string input, wstring optlist)
C# Java String convert_to_unicode(String inputformat, byte[ ] input, String optlist)
Perl PHP string convert_to_unicode(string inputformat, string input, string optlist)
C const char *TET_convert_to_unicode(TET *tet,
const char *inputformat, const char *input, int inputlen, int *outputlen, const char *optlist)
Convert a string in an arbitrary encoding to a Unicode string in various formats.
inputformat Unicode text format or encoding name specifying interpretation of the
input string:
> Unicode text formats: utf8, ebcdicutf8, utf16, utf16le, utf16be, utf32
> An encoding name according to Section 10.1.4, »Encoding Names«, page 160
> The keyword auto specifies the following behavior: if the input string contains a
UTF-8 or UTF-16 BOM it will be used to determine the appropriate format, otherwise
the current system codepage is assumed.
input String to be converted to Unicode.
inputlen (C language binding only) Length of the input string in bytes. If inputlen=0 a
null-terminated string must be provided.
outputlen (C language binding only) C-style pointer to a memory location where the
length of the returned string (in bytes) will be stored.
optlist An option list specifying options according to Table 10.5:
> Input filter options: charref, escapesequence
> Unicode conversion options: bom, errorpolicy, inflate, outputformat
Returns A Unicode string created from the input string according to the specified parameters
and options. If the input string does not conform to the specified input format (e.g. invalid UTF-8 string) an empty output string will be returned if errorpolicy=return, and an
exception will be thrown if errorpolicy=exception.
Details This function may be useful for general Unicode string conversion. It is provided for the
benefit of users working in environments which do not provide suitable Unicode converters.
Bindings C binding: the returned strings will be stored in a ring buffer with up to 10 entries. If
more than 10 strings are converted, the buffers will be reused, which means that clients
must copy the strings if they want to access more than 10 strings in parallel. For example, up to 10 calls to this function can be used as parameters for a printf( ) statement
since the return strings are guaranteed to be independent if no more than 10 strings are
used at the same time.
C++ binding: The parameters inputformat and optlist must be passed as wstrings as usual,
while input and returned data must have type string.
Python binding: UTF-8 results will be returned as a string, Python 3: non-UTF-8 results
will be returned as bytes.
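The behavior described for inputformat=auto can be approximated outside of TET as shown below. This sketch relies only on Python's codecs module and does not call TET_convert_to_unicode( ). The function names are hypothetical, and the current system codepage is replaced by an explicit fallback parameter, which is an assumption made to keep the example deterministic.

```python
import codecs


def detect_inputformat(data, fallback):
    """Pick a format from a UTF-8 or UTF-16 BOM, otherwise use the fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf8"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf16"
    return fallback


def to_text(data, fallback="cp1252"):
    """Decode bytes according to the detected format; the BOM is dropped."""
    fmt = detect_inputformat(data, fallback)
    if fmt == "utf8":
        return data.decode("utf-8-sig")  # strips the UTF-8 BOM
    if fmt == "utf16":
        return data.decode("utf-16")     # uses the BOM to select the byte order
    return data.decode(fallback)
```

Like the description above, the sketch looks only for UTF-8 and UTF-16 BOMs; a UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM and would therefore be misdetected.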
Table 10.5 Options for TET_convert_to_unicode( )
option
description
charref
(Boolean) If true, enable substitution of numeric and character entity references and glyph name references. Default: false
bom
(Keyword; ignored for outputformat=utf32; for Unicode-aware language bindings only none is allowed)
Policy for adding a byte order mark (BOM) to the output string. Supported keywords (default: none):
add Add a BOM.
keep Add a BOM if the input string has a BOM.
none Don’t add a BOM.
optimize Add a BOM except if outputformat=utf8 or ebcdicutf8 and the output string contains only
characters in the range < U+007F.
errorpolicy
(Keyword) Behavior in case of conversion errors (default: exception):
return The replacement character U+FFFD will be used if a character reference cannot be resolved.
An empty string will be returned in case of conversion errors.
exception An exception will be thrown in case of conversion errors.
escapesequence
(Boolean) If true, enable substitution of escape sequences in strings. Default: false
inflate
(Boolean; only for inputformat=utf8; will be ignored if outputformat=utf8) If true, an invalid UTF-8
input string will not trigger an exception, but rather an inflated byte string in the specified output format will be generated. This may be useful for debugging. Default: false
outputformat
(Keyword) Unicode text format of the generated string: utf8, ebcdicutf8, utf16, utf16le, utf16be,
utf32. An empty string is equivalent to utf16. Default: utf16
Unicode-aware language bindings: the output format will be forced to utf16.
C++ language binding: only the following output formats are allowed: ebcdicutf8, utf8, utf16,
utf32.
10.2.5 Exception Handling
C++ Java C# String get_apiname( )
Perl PHP string get_apiname( )
C const char *TET_get_apiname(TET *tet)
Get the name of the API function which caused an exception or failed.
Returns The name of the function which threw an exception, or the name of the most recently
called function which failed with an error code. An empty string will be returned if
there was no error.
C++ Java C# String get_errmsg( )
Perl PHP string get_errmsg( )
C const char *TET_get_errmsg(TET *tet)
Get the text of the last thrown exception or the reason for a failed function call.
Returns Text containing the description of the last exception thrown, or the reason why the
most recently called function failed with an error code. An empty string will be returned
if there was no error.
C++ Java C# int get_errnum( )
Perl PHP long get_errnum( )
C int TET_get_errnum(TET *tet)
Get the number of the last thrown exception or the reason for a failed function call.
Returns The number of an exception, or the error code of the most recently called function
which failed with an error code. This function will return 0 if there was no error.
C TET_TRY(tet)
C TET_CATCH(tet)
C TET_RETHROW(tet)
C TET_EXIT_TRY(tet)
Set up an exception handling block; catch or rethrow an exception; or inform the exception machinery that a TET_TRY( ) block will be left without entering the corresponding
TET_CATCH( ) block. TET_RETHROW( ) can be used to throw an exception again to a higher-level function after catching it.
Details (C language binding only) See Section 3.2, »C Binding«, page 24.
10.2.6 Logging
The logging feature can be used to trace API calls. The contents of the log file may be
useful for debugging purposes, or may be requested by PDFlib GmbH support. Table
10.6 lists the options for activating the logging feature with TET_set_option( ) (see Section 10.2.1, »Option Handling«, page 162).
Table 10.6 Logging-related keys for TET_set_option( )
key
explanation
logging
Option list with logging options according to Table 10.7
userlog
String which will be copied to the log file
The logging options can be supplied in the following ways:
> As an option list for the logging option of TET_set_option( ), e.g.:
tet.set_option("logging={filename={debug.log} remove}")
> In an environment variable called TETLOGGING. Doing so will activate the logging
output starting with the very first call to one of the API functions.
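The option list from the first method can also be assembled programmatically. The following Python sketch is one possible helper for doing so; it only builds the string and does not call the library. The name format_optlist is hypothetical and not a TET function. Writing a false Boolean as key=false is an assumption about the option list syntax, and strings which themselves contain braces are not handled.

```python
def format_optlist(options):
    """Serialize a dict into an option list string (hypothetical helper)."""
    parts = []
    for key, value in options.items():
        if value is True:
            parts.append(key)  # bare keyword, as in "remove"
        elif value is False:
            parts.append("%s=false" % key)  # assumed spelling for false
        elif isinstance(value, dict):
            parts.append("%s={%s}" % (key, format_optlist(value)))  # nested list
        elif isinstance(value, str):
            parts.append("%s={%s}" % (key, value))  # braces protect the string
        else:
            parts.append("%s=%s" % (key, value))  # numbers
    return " ".join(parts)


logging_options = {"filename": "debug.log", "remove": True}
optlist = format_optlist({"logging": logging_options})
print(optlist)
```

The resulting string has the same shape as the example above and would be passed to set_option( ) unchanged.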
Table 10.7 Suboptions for the logging option of TET_set_option( )
key
explanation
(empty list)
Enable log output after it has been disabled with disable.
disable
(Boolean) Disable logging output. Default: false
enable
(Boolean) Enable logging output
filename
(String) Name of the log file (stdout and stderr are also acceptable). Output will be appended to any
existing contents. The log file name can alternatively be supplied in an environment variable called TETLOGFILENAME (in this case the option filename will always be ignored). Default: tet.log (on Windows
and OS X in the / directory, on Unix in /tmp)
flush
(Boolean) If true, the log file will be closed after each output, and reopened for the next output to make
sure that the output will actually be flushed. This may be useful when chasing program crashes where
the log file is truncated, but significantly slows down processing. If false, the log file will be opened only
once. Default: false
includepid
(Boolean; not on MVS) Include the process id in the log file name. This should be enabled if multiple processes use the same log file name. Default: false
includetid
(Boolean; not on MVS) Include the thread id in the log file name. This should be enabled if multiple
threads in the same process use the same log file name. Default: false
includeoid
(Boolean; not on MVS) Include the object id in the log file name. This should be enabled if multiple TET
objects in the same thread use the same log file name. Default: false
remove
(Boolean) If true, an existing log file will be deleted before writing new output. Default: false
removeonsuccess
(Boolean) Remove the generated log file in TET_delete( ) unless an exception occurred. This may be useful
for analyzing occasional problems in multi-threaded applications or problems which occur only sporadically. It is recommended to combine this option with includepid/includetid/includeoid as appropriate.
stringlimit
(Integer) Limit for the number of characters in text strings, or 0 for unlimited. Default: 0
classes
(Option list) List containing options of type integer, where each option describes a logging class and the
corresponding value describes the granularity level. Level 0 disables a logging class, positive numbers enable a class. Increasing levels provide more and more detailed output. The following options are supported (default: {api=1 warning=1}):
api
Log all API calls with their function parameters and results. If api=2 a timestamp is created in
front of all API trace lines, and deprecated functions and options will be marked
filesearch
Log all attempts related to locating files via SearchPath or PVF.
resource
Log all attempts at locating resources via Windows registry, UPR definitions as well as the
results of the resource search.
user
User-specified logging output supplied with the userlog option.
warning
Log all warnings, i.e. error conditions which can be ignored or fixed internally. If warning=2,
the log also contains messages from functions which do not throw an exception but store the
message text for retrieval via TET_get_errmsg( ), as well as the reason for every failed attempt
at opening a file (while searching for a file in the SearchPath).
10.3 Document Functions
C++ Java C# int open_document(String filename, String optlist)
Perl PHP long open_document(string filename, string optlist)
C int TET_open_document(TET *tet, const char *filename, int len, const char *optlist)
Open a disk-based or virtual PDF document for content extraction.
filename The full path name of the PDF file to be opened. The file will be searched by
means of the SearchPath resource.
In non-Unicode language bindings the file name is converted to UTF-8 according to
the filenamehandling option (unless filenamehandling=unicode or the supplied file name
starts with a UTF-8 BOM). If len is different from 0 (C language binding only) the file
name is converted from UTF-16 to UTF-8 regardless of the option filenamehandling. An
error occurs if the file name cannot be converted or if the file name does not constitute
valid UTF-8 or UTF-16.
On Windows it is OK to use UNC paths or mapped network drives as long as you have
the necessary permissions (which may not be the case when running in ASP).
len (Only C language binding) Length of filename (in bytes) for UTF-16 strings. If len = 0
a null-terminated string must be provided.
optlist An option list specifying document options according to Table 10.8. The following options can be used:
allowjpeg2000, checkglyphlists, decompose, encodinghint, engines, fold, glyphmapping,
lineseparator, normalize, inmemory, paraseparator, password, repair, requiredmode, shrug,
spotcolor, tetml, usehostfonts, wordseparator
Returns -1 on error, or a document handle otherwise. For example, it is an error if the input document or the TETML output file cannot be opened. If -1 is returned it is recommended to
call TET_get_errmsg( ) to find out more details about the error.
Details Within a single TET object an arbitrary number of documents may be kept open at the
same time. However, a single TET object must not be used in multiple threads simultaneously without any locking mechanism for synchronizing the access.
Encryption: if the document is encrypted its user password must be supplied in the
password option if the permission settings allow content extraction. The document’s
master password must be supplied if the permission settings do not allow content extraction. If the requiredmode option has been specified, documents can be opened even
without the appropriate password, but operations are restricted. The shrug option can
be used to enable content extraction from protected documents under certain conditions (see Section 5.1, »Extracting Content from protected PDF«, page 57).
Supported file systems on i5/iSeries: TET has been tested with PC type file systems
only. Therefore input and output files should reside in PC type files in the IFS (Integrated File System). The QSYS.lib file system for input files has not been tested and is not
supported. Since QSYS.lib files are mostly used for record-based or database objects, unpredictable behavior may be the result if you use TET with QSYS.lib objects. TET file I/O is
always stream-based, not record-based.
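The password rule in the encryption paragraph above can be summarized as a small decision function. The following Python sketch is only a model of that rule, not library code: both function names are hypothetical, and the encryption status and permission settings are assumed to be known to the caller already (for example from a pCOS query). The requiredmode and shrug alternatives are not modelled.

```python
def required_password(encrypted, extraction_allowed):
    """Return which password the rule above calls for: 'none', 'user' or 'master'."""
    if not encrypted:
        return "none"
    return "user" if extraction_allowed else "master"


def password_optlist(encrypted, extraction_allowed, user_pw=None, master_pw=None):
    """Build the password part of the option list, or raise if it is missing."""
    kind = required_password(encrypted, extraction_allowed)
    if kind == "none":
        return ""
    password = user_pw if kind == "user" else master_pw
    if password is None:
        raise ValueError("the %s password is required" % kind)
    return "password={%s}" % password


for encrypted, allowed in [(False, True), (True, True), (True, False)]:
    print(encrypted, allowed, required_password(encrypted, allowed))
```

Failing early with a clear message is one design choice; an application could equally attempt the call and inspect TET_get_errmsg( ) afterwards.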
Table 10.8 Document options for TET_open_document( ) and TET_open_document_callback( )
option
description
acceptdynamicxfa
(Boolean) If true, dynamic XFA forms can successfully be opened. Querying pCOS paths is the only reasonable activity. Calling TET_open_page( ) will fail since no meaningful text or images can be extracted.
Default: false
allowjpeg2000
(Boolean) If true, JPEG 2000 (*.jp2, *.jpf or *.j2k) is allowed as output format for TET_write_image_file( )
and TET_get_image_data( ). Otherwise JPEG 2000 is avoided in favor of TIFF which may result in larger
image files. Default: true
checkglyphlists
(Boolean) If true, TET will check all builtin glyphmapping rules with condition=allfonts before text extraction starts. Otherwise the global glyphmapping rules will not be applied. This option slows down
processing, but is useful for certain kinds of TeX documents with glyph names which cannot be mapped
to Unicode by default. Default: false
decompose
(Keyword or option list; not relevant for granularity=glyph) Unicode decompositions which are applied
to all characters which have a specified Unicode decomposition tag and are part of the specified Unicode
set. These conditions are provided in the suboption name and value. Decompositions can be used to either remove or preserve the distinction between equivalent Unicode characters (see Section 7.3, »Unicode
Postprocessing«, page 98).
Default: see »Default decompositions«, page 104. However, if the normalize option has a value other
than none, all default decompositions are disabled, i.e. setting the normalize option sets the default to
decompose=none. User-specified decompositions can still be applied.
The following keywords can be supplied instead of a list:
none
No decompositions will be applied.
default
The default decompositions (see »Default decompositions«, page 104) will be applied before
other specified decompositions.
The following suboptions for decompositions are supported:
canonical, circle, compat, final, font, fraction, initial, isolated, medial, narrow, nobreak, small, square,
sub, super, vertical, wide
Each of these suboptions accepts a string or keyword which specifies the decomposition’s domain, i.e. the
set of Unicode characters to which the decomposition will be applied. A string specifies a Unicode set for
the domain. This can be used to restrict decompositions to subsets of the characters with the specified
decomposition tag. Characters outside the domain will not be modified.
As an alternative to a string for a Unicode set the following keywords can be supplied:
_all
The set of all Unicode characters, i.e. the decomposition will be applied to all characters with
the specified decomposition tag.
_none
The empty set, i.e. the decomposition will not be applied at all.
encodinghint
(String; see footnote 1) The name of an encoding which will be used to determine Unicode mappings for glyph names
which cannot be mapped by standard rules, but only by a predefined internal glyph mapping rule. The
keyword none can be used to disable all predefined rules. Default: winansi
engines
(Option list) Enable or disable TET engines for page parsing. Disabled engines never provide any information. Disabling engines which are not required improves performance (default: all engines are active):
image
(Boolean) Enable the image extraction engine.
text
(Boolean) Enable the text extraction engine.
textcolor (Boolean) Enable the text color engine.
vector
(Boolean) Enable the vector graphics engine which is relevant for clipping and improved table
detection.
fold
(Keyword or list of lists; the first element of each inner list is a Unicode set or keyword, the second element is a Unichar or a keyword; not relevant for granularity=glyph) Apply a post-folding (equivalence
mapping) to all characters in a folding domain specified as a Unicode set or keyword. The foldings are applied to all text except separator characters added by the lineseparator, paraseparator, or
wordseparator options (see Section 7.3, »Unicode Postprocessing«, page 98). Default: see Table 7.3, page
100.
The following keyword can be supplied instead of a list:
none
No foldings are applied.
The following keyword can be supplied instead of a sublist:
default
The default foldings are applied.
The first element of each list specifies the folding’s domain, i.e. the set of Unicode characters to which the
folding is applied. A string specifies a Unicode set for the domain. If a character is included in multiple
sets specified within the fold option, the first matching set definition has priority over all others. In order
to avoid unexpected results it is recommended to use disjoint sets.
As an alternative to specifying the domain as a Unicode set the following keywords can be used:
_dehyphenation
The folding is applied to hyphen characters which have been found within hyphenated words
at line breaks. These characters are flagged in the attributes member returned by TET_get_
char_info( ) and the Glyph/@dehyphenation attribute in TETML.
_tetpua
The folding is applied to the TET PUA values which are assigned to unmappable glyphs. These
characters are flagged with the unknown member returned by TET_get_char_info( ) and the
Glyph/@unknown attribute in TETML.
The second element in each list contains the target character or action for the folding. It is specified with
one of the following variants:
(Unichar) Replace all characters in the domain with the specified Unicode character.
preserve
The characters in the domain are not modified.
remove
The characters in the domain are removed.
shift
Shift all characters in the domain by the specified value (which may be negative).
unknownchar
Replace all characters in the domain with the character specified in the unknownchar option,
or apply the action specified in the unknownchar option.
glyphmapping (List of option lists) A list of option lists where each option list describes a glyph mapping method for one
or more font/encoding combinations which cannot reliably be mapped with standard methods. The
mappings will be used in least-recently-set order. If the last option list contains the font name wildcard
»*«, preceding mappings will no longer be used. Each rule consists of an option list according to Table
10.9. All glyph mappings which match a particular font name will be applied to this font (default: predefined internal glyph mappings are applied).
Note that glyph mapping rules can also be specified as an external resource in the UPR file (see Section
5.2, »Resource Configuration and File Searching«, page 59).
ignoreactualtext
(Boolean) If true, all ActualText mappings in the document are ignored. Default: false
lineseparator
(Unichar; Only for granularity=page) Character to be inserted between lines (see footnote 2). Default: U+000A
normalize
(Keyword; not relevant for granularity=glyph) Normalize the text output to one of the Unicode normalization forms (default: none):
none
Do not apply any normalization.
nfc
Normalization Form C (NFC): canonical decomposition followed by canonical composition
nfd
Normalization Form D (NFD): canonical decomposition
nfkc
Normalization Form KC (NFKC): compatibility decomposition followed by canonical composition
nfkd
Normalization Form KD (NFKD): compatibility decomposition
Since the Unicode normalization forms involve canonical and compatibility decompositions, combinations of the options decompose and normalize must be constructed carefully. Setting the normalize option to a value different from none sets the decomposition default to decompose=none.
inmemory
(Boolean; Only for TET_open_document( )) If true, TET will load the complete file into memory and process it from there. This can result in a tremendous performance gain on some systems (especially MVS) at
the expense of memory usage. If false, individual parts of the document will be read from disk as needed. Default: false
paraseparator
(Unichar; Only for granularity=page) Character to be inserted between paragraphs (see footnote 2). Default: U+000A
password
(String) The user, master or attachment password for encrypted documents. If the document’s permission
settings allow text copying then the user password is sufficient, otherwise the master password must be
supplied.
See the pCOS Path Reference to find out how to query a document’s encryption status, and pCOS operations which can be applied even without knowing the user or master password.
The shrug option can be used to enable content extraction from protected documents under certain conditions (see Section 5.1, »Extracting Content from protected PDF«, page 57).
repair
(Keyword) Specifies how to treat damaged PDF documents. Repairing a document takes more time than
normal parsing, but may allow processing of certain damaged PDFs. Note that some documents may be
damaged beyond repair (default: auto):
force
Unconditionally try to repair the document, regardless of whether or not it has problems.
auto
Repair the document only if problems are detected while opening the PDF.
none
No attempt will be made at repairing the document. If there are problems in the PDF the
function call will fail.
requiredmode
(Keyword) The minimum pcosmode (minimum/restricted/full) which is acceptable when opening the
document. The call will fail if the resulting pcosmode (see the pCOS Path Reference) would be lower than
the required mode. If the call succeeds it is guaranteed that the resulting pcosmode is at least the one
specified in this option. However, it may be higher; e.g. requiredmode=minimum for an unencrypted document will result in full mode. Default: full
shrug
(Boolean) If true, the shrug feature will be activated to enable content extraction from protected documents under certain conditions (see Section 5.1, »Extracting Content from protected PDF«, page 57). By
using the shrug option you assert that you will honor the PDF document author’s rights. Default: false
spotcolor
(Keyword) Control treatment of spot color images in TET_write_image_file( ) and TET_get_image_
data( ). Images with a Separation or DeviceN colorspace, i.e. one or more named process or spot colors are
extracted as follows (default: ignore):
convert
Emit a grayscale or CMYK image if no custom spot colors are used. Otherwise convert spot
colors to the corresponding alternate color space. For some images conversion to the
alternate color space is not possible. In this case this method behaves like spotcolor=ignore
(for a single custom spot color) or spotcolor=preserve (for two or more custom spot colors).
ignore
Like convert except that images with exactly one custom spot color are extracted as a grayscale image and the spot color name is lost.
preserve
(Forces TIFF output) Emit a grayscale or CMYK image with one or more extra spot color
channels if required for custom spot color names. TIFF images with preserved spot colors in
extra channels work only in Adobe Photoshop and compatible programs, but not in all simple
TIFF viewers.
tetml
(Option list) TETML output will be initiated, and can be created page by page with TET_process_page( ).
The following suboptions are supported:
elements (Option list) Specify whether certain TETML elements will be included in the output:
annotations (Boolean) Emit /TET/Document/Pages[]/Page/Annotations if the document contains annotations. Default: true
bookmarks (Boolean) Emit /TET/Document/Bookmarks if the document contains bookmarks.
Default: true
destinations (Boolean) Emit /TET/Document/Destinations if the document contains destinations. Default: true
docinfo
(Boolean) Emit /TET/Document/DocInfo element if the document contains document info entries. Default: true
fields
(Boolean) Emit /TET/Document/Pages[]/Page/Fields and /TET/Document/SignatureFields if the document contains AcroForm fields or digital signatures. Default: true
javascripts (Boolean) Emit /TET/Document/JavaScripts if the document contains JavaScript.
Default: true
metadata (Boolean) Emit /TET/Document/Metadata and/or /TET/Document/Images[]/
Image/Metadata if the document contains XMP metadata on the document or
image level. Default: true
options
(Boolean) Emit the elements /TET/Document/Options and /TET/Document/Pages[]/Page/Options. Default: true
encodingname
(Keyword) The name to use in the XML encoding declaration of the text declaration of the
generated TETML. The output will always be created in UTF-8 (default: UTF-8):
_none
No encoding declaration will be created; the output will still be in UTF-8 format.
UTF-8
The declaration encoding="UTF-8" will be created.
Any other encoding name will be used literally in the encoding declaration. The client is
responsible for supplying a suitable encoding name and converting the generated TETML
(which is UTF-8) to the specified encoding after TET finished TETML output.
filename (String) The name of the TETML file. If no filename is supplied, output will be
created in memory, and can be retrieved with TET_get_tetml( ). If the function call
fails (i.e. the PDF input document could not successfully be opened), no TETML
output will be created.
unknownchar
(Unichar or keyword) Character or action to be applied to TET PUA characters for unmappable glyphs (see
»Unmappable glyphs and the TET PUA«, page 108). The following keywords are supported (default: Unicode replacement character U+FFFD):
remove
Unmappable glyphs are removed. The value U+0000 is equivalent to remove.
preserve
Unmappable glyphs are represented by TET PUA values.
usehostfonts (Boolean) If true, data for fonts which are not embedded, but are required for determining Unicode
mappings will be searched on the OS X or Windows host operating system. Default: true
wordseparator
(Unichar; Only for granularity=line and page) Character to be inserted between words (see footnote 2). Default: U+0020
1. See footnote 1 in Table 10.9
2. Use U+0000 to disable the separator.
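The priority rule of the fold option in Table 10.8 (the first matching set wins) and its target actions can be illustrated with a small simulation. The following Python sketch is only a model of the described behavior and not TET's implementation: plain Python sets stand in for Unicode sets, the tuple ("shift", n) stands in for the shift action, and the unknownchar option is reduced to a single replacement character. The function name and the sample foldings are hypothetical.

```python
def apply_foldings(text, foldings, unknownchar="\uFFFD"):
    """Apply (domain, action) pairs to text; the first matching domain wins.

    domain: a set of characters (stand-in for a Unicode set).
    action: "preserve", "remove", "unknownchar", ("shift", n),
            or a single replacement character.
    """
    out = []
    for ch in text:
        for domain, action in foldings:
            if ch not in domain:
                continue
            if action == "preserve":
                out.append(ch)
            elif action == "remove":
                pass  # drop the character
            elif action == "unknownchar":
                out.append(unknownchar)
            elif isinstance(action, tuple) and action[0] == "shift":
                out.append(chr(ord(ch) + action[1]))
            else:
                out.append(action)  # replacement character
            break  # later domains are ignored for this character
        else:
            out.append(ch)  # no folding applies
    return "".join(out)


foldings = [
    ({"\u00AD"}, "remove"),                          # soft hyphen
    ({"\u00A0", "\u2009"}, " "),                     # no-break and thin space
    ({"\u00A0"}, "remove"),                          # shadowed by the set above
    ({"\uFF21", "\uFF22", "\uFF23"}, ("shift", -0xFEE0)),  # fullwidth A-C
]
print(apply_foldings("co\u00ADop\u00A0\uFF21\uFF22", foldings))
```

The third entry never takes effect for U+00A0 because an earlier set already contains that character, which is why disjoint sets are recommended.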
Table 10.9 Suboptions for the glyphmapping option of TET_open_document( ) and TET_open_document_callback( )
option
description
codelist
(String) Name of a codelist resource to be applied to the font. It will have higher priority than an embedded ToUnicode CMap or encoding entry.
fold
Apply a pre-folding (equivalence mapping) to all characters in a folding domain specified as a Unicode
set; see the description of the fold option in Table 10.8. The keywords remove, preserve and unknownchar
cannot be used. Font-specific foldings with the shift keyword can be used to correct systematic errors in a
font’s ToUnicode CMap.
fontname
(Name string) Partial or full name of the font(s) which will be selected for the rule. If a subset prefix has
been supplied only the specified subset is selected. If no subset prefix has been supplied, all fonts where
the name (without any subset prefix) matches are selected. The wildcard character »*« can be used to
specify multiple similar font names. Default: *
fonttype
(List of keywords) The glyphmapping will only be applied to the specified font types: * (designates all font
types), Type1, MMType1, TrueType, CIDFontType2, CIDFontType0, Type3. Default: *
forceencoding
(List with one or two strings; see footnote 1. If there are two names, the first must be winansi, macroman, or Custom)
Fonts with an 8-bit encoding: Replace the first encoding with the encoding resource specified by the second name. If only one entry is supplied, the specified encoding will be used to replace all instances of
MacRoman, WinAnsi, and MacExpert encoding. If this option matches a font no other glyph mappings will
be applied to the same font.
CID fonts: Only the single value unicode is supported. It interprets CID values as Unicode values.
forcettsymbolencoding
(Keyword or string; see footnote 1) The name of an encoding which will be used to determine Unicode mappings for embedded
pseudo TrueType symbol fonts which are actually text fonts, or one of the following keywords
(default: none):
auto
If the font’s builtin encoding (see below) contains at least one Unicode character in the
symbolic range U+F000-U+F0FF, the encoding specified in the encodinghint option will be
used to map the pseudo symbol characters to real text characters. Otherwise encodinghint
will not be used, and the characters will be mapped according to the builtin keyword.
builtin
Use the font’s builtin encoding, which results from the Unicode mappings of the glyph names
in the font’s post table.
none
No encoding is enforced.
The well-known TrueType fonts Wingdings* and Webdings* are always treated as symbol fonts.
globalglyphlist (Boolean) If true, the specified glyph list will be kept in memory until the end of the TET object, i.e. it can
be applied to more than one document. Default: false
glyphlist
(String) Name of a glyphlist resource to be applied
glyphrule
(Option list) Mapping rule for numerical glyph names (in addition to the predefined rules). The option list
must contain the following suboptions:
prefix
(String; may be empty) Prefix of the glyph names to which the rule will be applied. The wildcard character »?« can be used. It matches exactly one character provided this character is different from 0-9.
base
(Keyword) Specifies the interpretation of glyph names:
ascii
Single-byte glyphnames will be interpreted as the corresponding literal ASCII
character (e.g. 1 will be mapped to U+0031).
auto
Automatically determine whether glyph names represent decimal or hexadecimal
values. If the result is not unique, decimal will be assumed.
dec
The glyphnames will be interpreted as a decimal representation of a code.
hex
The glyphnames will be interpreted as a hexadecimal representation of a code.
encoding (String) Name of an encoding resource which will be used for this rule, or the keyword none to
disable the rule.
ignoretounicodecmap
(Boolean) If true, a ToUnicode CMap for the font will be ignored. Default: false
override
(Boolean; only reasonable together with the glyphlist or glyphrule option) If true, the glyphmapping
rule is applied before the standard (builtin) glyph name mappings (i.e. the new mappings have priority
over the builtin ones), otherwise the rule is applied after the builtin mappings. Default: true
remove
(Boolean) If true, all text which uses the specified font name(s) and/or font type(s) will be removed from
the retrieved text.
tounicodecmap
(String) Name of a ToUnicode CMap resource to be applied to the font; it will have higher priority than an
embedded ToUnicode CMap or encoding entry.
1. Encoding name according to Section 10.1.4, »Encoding Names«, page 160
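The prefix and base suboptions of glyphrule in Table 10.9 describe how a numerical glyph name is turned into a code. The following Python sketch models that interpretation for a single glyph name. It is not TET's implementation: the function names are hypothetical, the auto keyword is evaluated per glyph name here (the library may decide on a different basis), and the subsequent lookup of the code in the encoding resource is not modelled.

```python
import re


def prefix_pattern(prefix):
    """Translate a glyphrule prefix into a regular expression.

    "?" matches exactly one character other than 0-9; all else is literal.
    """
    return "".join("[^0-9]" if ch == "?" else re.escape(ch) for ch in prefix)


def glyph_code(glyph_name, prefix, base):
    """Return the code encoded in glyph_name, or None if the rule does not apply."""
    match = re.fullmatch(prefix_pattern(prefix) + "(.+)", glyph_name)
    if match is None:
        return None
    digits = match.group(1)
    is_dec = re.fullmatch("[0-9]+", digits) is not None
    is_hex = re.fullmatch("[0-9A-Fa-f]+", digits) is not None
    if base == "ascii":
        # a single character is taken as the literal ASCII character
        return ord(digits) if len(digits) == 1 and ord(digits) < 128 else None
    if base == "dec" or (base == "auto" and is_dec):
        # auto prefers decimal whenever the name is ambiguous
        return int(digits, 10) if is_dec else None
    if base in ("hex", "auto"):
        return int(digits, 16) if is_hex else None
    raise ValueError("unknown base: %r" % base)


for name, prefix, base in [("g65", "g", "dec"), ("g41", "g", "hex"),
                           ("x1", "?", "ascii"), ("g4A", "g", "auto")]:
    print(name, prefix, base, glyph_code(name, prefix, base))
```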
C++ int open_document_callback(void *opaque, size_t filesize,
size_t (*readproc)(void *opaque, void *buffer, size_t size),
int (*seekproc)(void *opaque, long offset),
wstring optlist)
C int TET_open_document_callback(TET *tet, void *opaque, size_t filesize,
size_t (*readproc)(void *opaque, void *buffer, size_t size),
int (*seekproc)(void *opaque, long offset),
const char *optlist)
Open a PDF document from a custom data source for content extraction.
opaque A pointer to some user data that might be associated with the input PDF document. This pointer will be passed as the first parameter of the callback functions, and
can be used in any way. TET will not use the opaque pointer in any other way.
filesize
Size of the PDF document in bytes.
readproc A C callback function which copies size bytes to the memory pointed to by
buffer. If the end of the document is reached it may copy less data than requested. The
function must return the number of bytes copied.
seekproc A C callback function which sets the current read position in the document.
offset denotes the position from the beginning of the document (0 meaning the first
byte). If successful, this function must return 0, otherwise -1.
optlist
An option list specifying document options according to Table 10.8.
Returns See TET_open_document( ).
Details See TET_open_document( ).
Bindings This function is only available in the C and C++ language bindings.
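The contract of the two callbacks can be illustrated with an in-memory data source. Since the function itself exists only in the C and C++ bindings, the following Python sketch merely models what readproc and seekproc are expected to do; it is not callable by the library. The class name MemorySource is hypothetical, and the object itself plays the role of the opaque pointer.

```python
class MemorySource:
    """Hypothetical model of a custom data source for the callback contract."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current read position

    @property
    def filesize(self):
        return len(self.data)

    def readproc(self, buffer, size):
        """Copy up to size bytes into buffer; return the number of bytes copied."""
        chunk = self.data[self.pos:self.pos + size]  # may be short at the end
        buffer[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

    def seekproc(self, offset):
        """Set the read position relative to the start; 0 on success, else -1."""
        if 0 <= offset <= len(self.data):
            self.pos = offset
            return 0
        return -1


source = MemorySource(b"%PDF-1.7 data")
header = bytearray(4)
count = source.readproc(header, 4)
print(count, bytes(header), source.filesize)
```

Returning fewer bytes than requested at the end of the data, rather than signalling an error, matches the description of readproc above.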
C++ Java C# void close_document(int doc)
Perl PHP close_document(long doc)
C void TET_close_document(TET *tet, int doc)
Release a document handle and all internal resources related to that document.
doc
A valid document handle obtained with TET_open_document*( ).
Details Closing a document automatically closes all of its open pages. All open documents and
pages will be closed automatically when TET_delete( ) is called. It is good programming
practice, however, to close documents explicitly when they are no longer needed.
Closed document handles must no longer be used in any function call.
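One way to make sure that documents are closed explicitly, even when processing fails halfway, is to wrap the open and close calls in a scope guard. The following Python sketch shows this pattern with a context manager. RecordingTET is a hypothetical stand-in which only tracks open handles so that the snippet runs without the library; opened_document is likewise a hypothetical helper, and the file name and error text are illustrative.

```python
from contextlib import contextmanager


class RecordingTET:
    """Hypothetical stand-in which only tracks which handles are open."""

    def __init__(self):
        self.open_handles = set()
        self._next_handle = 0

    def open_document(self, filename, optlist):
        if not filename.endswith(".pdf"):
            return -1  # simulated failure
        handle = self._next_handle
        self._next_handle += 1
        self.open_handles.add(handle)
        return handle

    def close_document(self, doc):
        self.open_handles.remove(doc)

    def get_errmsg(self):
        return "illustrative error text"


@contextmanager
def opened_document(tet, filename, optlist=""):
    """Open a document and close it again when the with block is left."""
    doc = tet.open_document(filename, optlist)
    if doc == -1:
        raise RuntimeError("cannot open %s: %s" % (filename, tet.get_errmsg()))
    try:
        yield doc
    finally:
        tet.close_document(doc)  # runs even if the block raised an exception


tet = RecordingTET()
with opened_document(tet, "input.pdf") as doc:
    print("open handles inside the block:", len(tet.open_handles))
print("open handles afterwards:", len(tet.open_handles))
```

The handle is not valid outside the with block, which mirrors the rule that closed document handles must no longer be used.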
10.4 Page Functions
C++ Java C# int open_page(int doc, int pagenumber, String optlist)
Perl PHP long open_page(long doc, long pagenumber, string optlist)
C int TET_open_page(TET *tet, int doc, int pagenumber, const char *optlist)
Open a page for content extraction.
doc
A valid document handle obtained with TET_open_document*( ).
pagenumber The physical number of the page to be opened. The first page has page
number 1. The total number of pages can be retrieved with TET_pcos_get_number( ) and
the pCOS path length:pages.
optlist An option list specifying page options according to Table 10.10. The following
options can be used:
clippingarea, contentanalysis, docstyle, emptycheck, excludebox, fontsizerange,
granularity, ignoreartifacts, ignoreinvisibletext, imageanalysis, includebox, layers, layoutanalysis, layouteffort, structureanalysis, topdown, vectoranalysis.
Returns A handle for the page, or -1 in case of an error. If -1 is returned it is recommended to call
TET_get_errmsg( ) to find out more details about the error.
Details Within a single document an arbitrary number of pages may be kept open at the same
time. The same page may be opened multiple times with different options. However, options
cannot be changed while processing a page.
Layer definitions (optional content groups): the contents of all visible layers on the
page will be extracted by default. This behavior can be modified with the layers option.
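A typical page loop queries the page count via the pCOS path length:pages and then opens the pages starting with number 1. The following Python sketch shows one possible shape of such a loop. FakeTET is a hypothetical stand-in with three pages, one of which fails to open, so that the snippet runs without the library. The method close_page is not described in this section and is assumed here as the counterpart of open_page; for_each_page is a hypothetical helper.

```python
class FakeTET:
    """Hypothetical stand-in for a TET object with a three-page document."""

    def __init__(self, page_count=3, broken_pages=(2,)):
        self.page_count = page_count
        self.broken_pages = set(broken_pages)
        self.open_pages = set()

    def pcos_get_number(self, doc, path):
        assert path == "length:pages"
        return float(self.page_count)

    def open_page(self, doc, pagenumber, optlist):
        if pagenumber in self.broken_pages:
            return -1  # simulated failure
        self.open_pages.add(pagenumber)
        return pagenumber  # the stub reuses the page number as handle

    def close_page(self, page):
        self.open_pages.remove(page)

    def get_errmsg(self):
        return "illustrative error text"


def for_each_page(tet, doc, optlist, handle_page):
    """Call handle_page(pagenumber, page) per page; return the failed pages."""
    failed = []
    page_count = int(tet.pcos_get_number(doc, "length:pages"))
    for pagenumber in range(1, page_count + 1):  # the first page is number 1
        page = tet.open_page(doc, pagenumber, optlist)
        if page == -1:
            failed.append((pagenumber, tet.get_errmsg()))
            continue  # skip this page, keep processing the others
        try:
            handle_page(pagenumber, page)
        finally:
            tet.close_page(page)
    return failed


visited = []
failed = for_each_page(FakeTET(), 0, "granularity=page",
                       lambda number, page: visited.append(number))
print("processed:", visited, "failed:", failed)
```

Collecting failed pages instead of aborting is a design choice of this sketch; an application may prefer to stop at the first error.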
Table 10.10 Page options for TET_open_page( ) and TET_process_page( )
option
description
clippingarea
(Keyword; ignored if includebox is specified) Specifies the area from which text and images are extracted
(default: cropbox):
mediabox Use the MediaBox (which is always present)
cropbox
Use the CropBox (the area visible in Acrobat) if present, else MediaBox
bleedbox Use the BleedBox if present, else use cropbox
trimbox
Use the TrimBox if present, else use cropbox
artbox
Use the ArtBox if present, else use cropbox
unlimited Consider all text, regardless of its location
contentanalysis
(Option list; not for granularity=glyph) List of suboptions according to Table 10.11 for controlling high-level content analysis and text processing.
docstyle
(Keyword) A hint which is used by the layout detection engine to select various parameters. These parameters optimize layout detection for situations where the document belongs to one of the classes below. If the document is known to fall into one of these classes layout detection results can be improved
significantly by supplying a suitable value for this option. This option activates advanced layout recognition (default: none):
book
Typical book
business
Business documents
cad
Technical or architectural drawings which are typically heavily fragmented
fancy
Fancy pages with complex layout
forms
Structured forms
generic
The most general document class without any further qualification.
magazines Magazine articles
none
No specific document style is known and advanced layout recognition will be disabled.
native
Disable layout recognition and return the contents in native page content ordering. This may
be useful for layouts such as forms where text is placed all over the page and column
detection is not desired, but rather row-by-row text retrieval.
papers
Newspaper
science
Scientific article
searchengine
The application is a search engine indexer or similar application, and mainly interested in
retrieving the word list for the page as fast as possible. Table and page structure recognition
are disabled.
spacegrid List-oriented report (often generated on mainframe systems) where the visual layout is
generated using space characters. Since many heuristics like shadow detection and
sophisticated word boundary detection are not required for this class of documents text
extraction can be accelerated with this option.
emptycheck
(Boolean) If true, normal content extraction is disabled. Instead, the box provided in the includebox option is checked for any text, image, or vector graphics content (only a single includebox is supported). If the includebox option is not supplied, the whole clipping area is checked. This can be
used to identify empty pages. The following options are ignored: granularity, engines, fontsizerange.
Clipping operators are ignored.
The result of the check can be retrieved with a call to TET_get_text( ) which will return one of the strings
empty or notempty instead of any page contents. Default: false
excludebox
(List of rectangles) Exclude the combined area of the specified rectangles from text and image extraction.
Default: empty
fontsizerange
(List of two floats) Two numbers specifying the minimum and maximum font size of text. Text with a size
outside of this interval will be ignored. The maximum can be specified with the keyword unlimited,
which means that no upper limit will be active. Default: { 0 unlimited }
granularity
(Keyword) The granularity of the text fragments returned by TET_get_text( ); all modes except glyph will
enable the Wordfinder. See »Text granularity«, page 85, for more details (default: word).
glyph
A fragment contains the result of one glyph, but may contain more than one character (e.g.
for ligatures).
word
A fragment contains a word as determined by the Wordfinder.
line
A fragment contains a line of text, or the closest approximation thereof. Word separators are
inserted between two consecutive words.
page
A fragment contains the contents of a single page. Word, line, and paragraph separators are
inserted as appropriate.
ignoreartifacts
(Boolean; only relevant for Tagged PDF documents) Ignore text and images which are marked as Artifact.
This can be used to skip irrelevant page contents which are marked as artifact. Default: false
ignoreinvisibletext
(Boolean) If true, text with rendering mode 3 (invisible) will be ignored. Default: false (since invisible
text is mainly used for image+text PDFs containing scanned pages and the corresponding OCR text)
imageanalysis
(Option list) List of suboptions according to Table 10.13 for controlling high-level image processing.
includebox
(List of rectangles) Restrict text and image extraction to the combined area of the specified rectangles.
Default: the complete clipping area
layers
(Keyword) Treatment of page contents within layers (also known as optional content). Supported keywords (default: visible):
all
Extract all page contents regardless of layers. Text may be garbled and image merging may
be spoiled if the contents of multiple layers overlap on the page.
invisible
Extract contents of all layers which are invisible by default and ignore all other layers.
visible
Extract contents of all layers which are visible by default and ignore all other layers.
layoutanalysis
(Option list; not for granularity=glyph) List of suboptions according to Table 10.12 for controlling layout
detection features.
layouteffort
(Keyword) Controls the quality/performance trade-off of layout recognition. Layout recognition can be
improved by spending more effort, but this may slow down operation. The layout recognition effort can
be controlled with the keywords none, low, medium, high, and extra. Default: low
layouthint
(Option list) Inform the layout recognition engine about the presence of certain page layout elements:
subsummary
(Keyword) Informs the engine about the presence of subsummaries (marginalia) and possibly
also their position. Supported keywords (default: none):
auto
Try to detect subsummaries automatically.
left
Try to detect subsummaries on the left side of the page.
none
No subsummary detection
right
Try to detect subsummaries on the right side of the page.
header
(Boolean) If true, the engine tries to detect page headers (default: false).
footer
(Boolean) If true, the engine tries to detect page footers (default: false).
maxvectorcount
(Float) Maximum number of vector objects to be taken into account by the vector graphics engine. Default: 500
minvectorsize
(Float) Minimum size of a vector object to be taken into account by the vector graphics engine. The size of
a vector object is the length of the diagonal of its bounding box in points. Default: 5
skipengines
Deprecated; use the engines option of TET_open_document( )
structureanalysis
(Option list; not for granularity=glyph) List of suboptions according to Table 10.14 for controlling page
structure analysis.
topdown
(Option list) Specify a coordinate system with the origin in the top left corner of the visible page and y
coordinates which increase downwards; otherwise the default coordinate system with the origin in the
lower left corner is used. Topdown coordinates correspond to the coordinate system which
is displayed in Acrobat. Supported suboptions:
input
(Boolean) If true, enable topdown coordinates for the following items (default: false):
page options includebox, excludebox
output
(Boolean) If true, enable topdown coordinates for the following items (default: false):
TET_char_info: y, alpha, beta
TET_image_info: y, alpha, beta
TETML: Destination/@bottom, Destination/@top, Box/@lly, Box/@ury, Box/@uly, Box/@lry,
Cell/@ury, Cell/@uly, Cell/@lry, Glyph/@y, Glyph/@alpha, Glyph/@beta, PlacedImage/@y,
PlacedImage/@alpha, PlacedImage/@beta, Table/@lly, Table/@uly, Table/@ury, Table/@lry.
vectoranalysis
(Option list; not for granularity=glyph) Suboptions according to Table 10.15 for controlling analysis of
vector graphics for table and layout detection. If this option is present, vector graphics are taken into account for table and layout detection.
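The page options in Table 10.10 are passed as one option list string. As a minimal sketch, the following Python helper assembles such a string from nested Python values. The helper names format_value and format_optlist are hypothetical and not part of TET; the sketch only formats text and assumes that a rectangle is written as four numbers in braces and that Boolean suboptions are spelled out as true or false.

```python
def format_value(value):
    """Format a Python value in option list syntax (hypothetical helper)."""
    if isinstance(value, bool):            # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, dict):            # nested option list
        return "{" + format_optlist(value) + "}"
    if isinstance(value, (list, tuple)):   # list of values, e.g. a rectangle
        return "{" + " ".join(format_value(v) for v in value) + "}"
    return str(value)                      # keywords and numbers


def format_optlist(options):
    """Join key=value pairs with blanks (hypothetical helper)."""
    return " ".join(f"{key}={format_value(val)}" for key, val in options.items())


# Assumed example values; the rectangles are arbitrary illustration data.
page_options = format_optlist({
    "granularity": "line",
    "fontsizerange": [0, "unlimited"],
    "includebox": [[0, 0, 300, 400], [300, 0, 600, 400]],
    "topdown": {"input": True, "output": True},
})
print(page_options)
```

The resulting string could then be supplied as the option list of TET_open_page( ) or TET_process_page( ).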
Table 10.11 Suboptions for the contentanalysis option of TET_open_page( ) and TET_process_page( )
option
description
bidi
(Keyword; ignored for granularity=glyph; has an effect only if right-to-left characters are present on the page) Control the inverse Bidi algorithm which reorders right-to-left and left-to-right text in a
chunk (default: logical):
visual
Keep RTL and LTR characters in a chunk in visual order, i.e. do not apply the inverse Bidi
algorithm.
logical
Apply the inverse Bidi algorithm to bring the characters in a chunk into logical order.
bidilevel
(Keyword) Specify the page’s base level (i.e. the main direction of text progression) for the inverse Bidi algorithm (default: auto):
auto
Determine the main direction of text progression heuristically based on the content.
ltr
Assume left-to-right as main direction of text progression (e.g. Latin documents)
rtl
Assume right-to-left as main direction of text progression (e.g. Hebrew or Arabic documents)
dehyphenate
(Boolean) If true, hyphenated words will be identified and the text fragments surrounding the hyphen
will be combined. The hyphen itself will be treated according to the keephyphenglyphs option. Default: true
dropcapsize
(Float) The minimum size at which large glyphs will be recognized as a drop cap. Drop caps are large
characters at the beginning of a zone that are enlarged to »drop« down several lines. They will be
merged with the remainder of the zone and form part of the first word in the zone. Default: 35
dropcapratio (Float) The minimum ratio of the font size of drop caps and neighboring text. Large characters will be recognized as drop caps if their size exceeds dropcapsize and the font size quotient exceeds dropcapratio.
In other words, this is the number of text lines spanned by drop caps. Default: 4 (drop caps spanning
three lines are very common, but additional line spacing must be taken into account)
ideographic
(Keyword; deprecated) In TET 4 it was recommended to set this option to keep in order to avoid the default behavior of split. Since TET 5 no longer treats ideographic characters as word boundaries with
granularity=word, this option is no longer required.
includeboxorder
(Integer) When multiple include boxes have been supplied (see option includebox), this option controls
how the order of boxes affects the Wordfinder (default: 0):
0
Ignore include box ordering when analyzing the page contents. The result will be the same as
if all the text outside the include boxes was deleted. This is useful for eliminating unwanted
text (e.g. headers and footers) while not affecting the Wordfinder in any way.
1
Take include box ordering into account when creating words and zones, but not for zone
ordering. A word will never belong to more than one box. The resulting zones will be sorted in
logical order. In case of overlapping boxes the text will belong to the box which is earlier in
the list. Other than that, the ordering of include boxes in the option list doesn’t matter. This
setting is useful for extracting text from forms, extracting text from tables, or when include
boxes overlap for complicated layouts.
2
Consider include box ordering for all operations. The contents of each include box will be
treated independently from other boxes, and the resulting text will be concatenated
according to the order of the include boxes. This is useful for extracting text from forms in a
particular ordering, or extracting article columns in a magazine layout in a predefined order.
In these cases advance knowledge about the page layout is required in order to specify the
include boxes in appropriate order.
keephyphenglyphs
(Boolean) If true and dehyphenate=true, the hyphen glyph between the parts of dehyphenated words will
be kept in the list of glyphs returned by TET_get_char_info( ) and the Glyph element in TETML. This is useful for applications which need detailed information about the position of hyphens, e.g. for exactly replacing text on the page. Note that this is different from fold={{_dehyphenation remove}}, which only removes hyphens from the logical text returned by TET_get_text( ) but does not affect glyphs. Default: false
linespacing
(Keyword) Specify the typical vertical distance between text lines within a paragraph: small, medium, or
large (default: medium)
maxwords
(Integer or keyword) If the number of words on the page is not greater than the specified number (the
keyword unlimited means that no limit will be active) the detected zones on the page will be merged
appropriately and sorted. If the number of words on the page is greater than the specified number, no
zones will be built, and words will be retrieved in page content reading order. Processing will be faster in
the latter case, but the ordering of the retrieved words may not be optimal. Setting this option to
unlimited is recommended for large pages with many words, such as newspapers. Default: 5000
merge
(Integer) Controls strip and zone merging (default: 2):
0
No merging after strip creation. This can significantly increase processing speed, but may
create less than optimal output, and prevent some shadows from being detected properly.
1
Simple strip-into-zone merging: strips will be merged into a zone if they overlap this
particular zone, but don’t overlap strips other than the next one (to avoid zone overlapping
for non-shadow cases).
2
Advanced zone merging for out-of-sequence text: in addition to merge=1, multiple
overlapping zones will be combined into a single zone, provided the text contents of both
zones do not overlap.
numericentities
(Keyword) Control word boundary detection for numeric entities such as numbers, fractions, and time
(default: keep):
split
Split the entity according to the punctuationbreaks suboption.
keep
Keep the entity as a whole word.
shadowdetect
(Boolean) If true, redundant instances of overlapping text fragments which create a shadow or fake bold
text will be detected and removed. Default: true
punctuationbreaks
(Boolean; only for granularity=word) If true, punctuation characters which are placed close to a letter
are treated as word boundaries; otherwise they are included in the adjacent word. For example, this option may be useful for the treatment of URLs and mail addresses. Default: true
superscript
(Integer) Controls subscript and superscript detection (default: 2):
0
No subscript and superscript detection
1
Simple subscript and superscript detection
2
Advanced algorithm for subscript and superscript detection
useclasses
(Boolean) If true, Unicode classification is considered to determine word boundaries. Default: true
usemetrics
(Boolean) If true, the distance between glyphs is compared with the width of the space glyph to determine word boundaries. Default: true
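The interplay of dropcapsize and dropcapratio in Table 10.11 can be illustrated with a small sketch. The function below is a hypothetical illustration of the rule as stated in the table (size exceeds dropcapsize and the font size quotient exceeds dropcapratio); it is not TET's actual implementation, and the parameter names glyph_size and neighbor_size are assumptions.

```python
def is_drop_cap(glyph_size, neighbor_size, dropcapsize=35.0, dropcapratio=4.0):
    """Return True if a glyph qualifies as a drop cap under both thresholds."""
    if neighbor_size <= 0:
        return False                       # no meaningful quotient without neighboring text
    large_enough = glyph_size > dropcapsize
    ratio_large_enough = glyph_size / neighbor_size > dropcapratio
    return large_enough and ratio_large_enough


print(is_drop_cap(48, 10))   # size 48 exceeds 35 and quotient 4.8 exceeds 4
print(is_drop_cap(36, 10))   # quotient 3.6 does not exceed 4
print(is_drop_cap(30, 6))    # quotient 5 is sufficient, but size 30 is below dropcapsize
```

With the default values, a 48 point initial next to 10 point body text qualifies, while the other two combinations fail one of the two conditions.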
Table 10.12 Suboptions for the layoutanalysis option of TET_open_page( ) and TET_process_page( )
option
description
layoutastable
(Boolean) If true, layout recognition will treat the zones on the page as one or more tables. The minimum number of columns which is required to consider the sequence as a table depends on the document
style. If false, supertable recognition will be disabled (default: true).
layoutcolumnhint
(Keyword) This option may improve zone reading order detection for complex layouts. Supported keywords (default: multicolumn):
multicolumn
The page contains multi-column text; zones will be sorted column by column.
none
No hint available; zone ordering will be determined by page content order.
singlecolumn
The page contains single-column text; zones will be sorted row by row. This keyword should
be used with layouteffort=low.
layoutdetect
(Integer) Specifies the depth of recursive layout recognition (default: 1):
0
No layout recognition.
1
Layout recognition for the whole page. This is sufficient for the vast majority of documents.
2
Layout recognition for the results of level 1. This is required for layouts with different multi-column sublayouts and titles on different parts of the page as well as multi-paragraph tables.
3
Layout recognition for the results of level 2. This is required only for very complex layouts.
layoutrowhint
(Option list) Control layout row processing. Supported options (default: none):
full
Enable layout row processing.
none
Disable layout row processing.
separation
(Keyword) Enable layout row processing, but disable it if layout recognition suspects a
supertable. The following suboptions can be supplied:
preservecolumns
Try to keep vertical columns based on the geometric relationship between zones.
This is recommended if zones within columns are separated by large gaps (e.g.
caused by images).
thick
Try to combine neighboring zones and place them in the same layout row. This results in a smaller number of larger layout rows. This is recommended for complex
layouts, such as magazines and papers where paragraphs within columns are separated from each other by more than the font size, and for layouts with several
multi-column articles one under the other.
thin
Try to separate neighboring zones and place them in different layout rows. This results in a larger number of smaller layout rows.
Example: layoutanalysis = {layoutrowhint={full separation=thick}}
mergetables
(Integer) Tables with a single row will be skipped during table recognition and treated as regular zones. If
two sequential zones are tables (even with only a single row) they can be combined (default: none):
down
Combine downstairs only.
none
Don’t combine.
up
Combine upstairs only.
updown
Combine in both directions.
splithint
(Keyword or option list) Activate special treatment of double-page spreads (or even pages consisting of
more spreads). The page may be divided vertically or horizontally in two or more sections. The keyword
includebox means that the split areas will be defined by the includebox option. Alternatively the following options can be supplied:
x
(Float) Divider for the x axis, e.g. 0.5 for a double-page spread, 0.33 for a three-page spread.
y
(Float) Divider for the y axis.
standalonefontsize
(Float) Minimum font size for huge glyphs. Huge glyphs form single-glyph strips, and will not be combined with other zones (default: 70).
supertablecolumns
(Integer; only if layoutastable=true) Minimum number of columns in a layout row to consider the sequence of zones as a supertable. When a table is created from paragraphs, these columns are recognized
as separate zones instead of being combined. As a consequence of this, layout recognition can identify
these zone sequences as a table (default: 4).
tabledetect
(Integer) Specifies the depth of recursive table recognition (default: 1):
0
No table recognition.
1
Table recognition for each zone.
2
Table recognition for each table cell detected in level 1. This is required for nested tables and
resolving row spans.
Table 10.13 Suboptions for the imageanalysis option of TET_open_page( ) and TET_process_page( )
option
description
merge
(Option list) Control image merging. This process combines adjacent images which together may form a
single larger image. This is useful for multi-strip images where the individual strips have been preserved
in the PDF, and for background images which are broken into a large number of very small images.
Supported options:
disable
(Boolean) If true, image merging will be disabled. Default: false
gap
(Float) Maximum gap or overlap between two images to be considered for merging. The value
is interpreted as absolute distance in points, and also as number of pixels. Default: 1.0
smallimages
(Option list) Control small image removal. Small images must often be ignored since they are not useful
for processing. Small image removal does not affect images used as mask:
disable
(Boolean) If true, small image removal is disabled. Default: false
maxarea
(Integer) Maximum area (=width x height) of an image in pixels to be considered as a small
image. Default: 500
maxcount
Deprecated, don’t use.
maxheight
(Integer) Maximum height of an image in pixels to be considered as a small image. Default: 20
maxwidth
(Integer) Maximum width of an image in pixels to be considered as a small image. Default: 20
C++ Java C# void close_page(int page)
Perl PHP close_page(long page)
C void TET_close_page(TET *tet, int page)
Release a page handle and all related resources.
page
A valid page handle obtained with TET_open_page( ).
Details All open pages of the document will be closed automatically when TET_close_document( )
is called. It is good programming practice, however, to close pages explicitly when they
are no longer needed. Closed page handles must no longer be used in any function call.
Table 10.14 Suboptions for the structureanalysis option of TET_open_page( ) and TET_process_page( )
option
description
bullets
(List of option lists; only if list=true) Specifies combinations of Unicode characters and font names
which are used as bullet characters in lists. Supported suboptions:
bulletchars
(List of Unicode values) One or more Unicode values for the bullet characters. If this suboption
is not supplied, all characters using the specified fontname will be treated as bullet characters.
fontname (String) Name of the font from which bullet characters are drawn. If this suboption is not
supplied, the characters specified in the bulletchars suboption will always be treated as
bullet characters.
Examples:
bullets={{fontname=ZapfDingbats}}
bullets={{bulletchars={U+2022}}}
bullets={{fontname=KozGoPro-Medium bulletchars={U+2460 U+2461 U+2462 U+2463 U+2464}}}
list
(Boolean) Enable list recognition (default: false). If false, no information about list structure will be determined.
Table 10.15 Suboptions for the vectoranalysis option of TET_open_page( ) and TET_process_page( )
option
description
closetablearea (Boolean) If true, create table border for analysis even if none is present. Default: false
ignorelines
(Keyword) Specify which lines to exclude from the analysis. Supported keywords (default: none):
horizontal Ignore horizontal lines.
none
Don’t ignore any lines.
vertical
Ignore vertical lines.
pagesizelines (Boolean) If true, take into account large lines which are almost as long as the page size. Default: false
splitsequence
(Boolean) If true, vertical lines are allowed to split text sequences (or horizontal lines for rotated text).
Default: false
structures
(Keyword) Specify the kind of vector graphics and how it should be processed. Supported keywords (default:
unions):
tables
Extended unions mode: the engine additionally checks whether unions form a table net. If so,
the result is treated as a single table zone.
unions
Try to build sub-layout unions from lines. If such a union is built, it is treated as a complete
sub-layout entity, i.e. all enclosed text zones belong to the sub-layout.
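The minvectorsize page option in Table 10.10 defines the size of a vector object as the length of the diagonal of its bounding box in points. A minimal sketch of this measure follows. The function names are hypothetical, the bounding box is assumed to be given as lower left and upper right corner, and treating the threshold as inclusive is an assumption since the table only speaks of a minimum size.

```python
import math


def vector_object_size(llx, lly, urx, ury):
    """Length of the bounding box diagonal in points."""
    return math.hypot(urx - llx, ury - lly)


def passes_minvectorsize(bbox, minvectorsize=5.0):
    """True if the object is large enough (threshold assumed to be inclusive)."""
    return vector_object_size(*bbox) >= minvectorsize


print(vector_object_size(0, 0, 3, 4))          # 3-4-5 triangle: diagonal of 5 points
print(passes_minvectorsize((0, 0, 3, 3)))      # diagonal of about 4.24 points
```

A 3 x 4 point box just reaches the default threshold of 5, while a 3 x 3 point box stays below it.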
10.5 Text and Glyph Details Retrieval Functions
C++ Java C# String get_text(int page)
Perl PHP string get_text(long page)
C const char *TET_get_text(TET *tet, int page, int *len)
Get the next text fragment from a page’s content.
page
A valid page handle obtained with TET_open_page( ).
len (C language binding only) A pointer to a variable which will hold the length of the
returned string depending on the outputformat option of TET_set_option( ):
If outputformat=utf8 the length is reported as number of Unicode characters. The
number of bytes in the null-terminated string (which is identical to the number of 8-bit
code units) can be determined with the strlen( ) function.
If outputformat=utf16 the length is reported as number of 16-bit code units; surrogate
pairs are counted as two code units. The number of bytes in the string is 2*len.
If outputformat=utf32 the length is reported as number of 32-bit code units (which is
identical to the number of Unicode characters). The number of bytes in the string is
4*len.
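The three length rules above can be checked with a short sketch which uses only Python's built-in codecs and no TET call. The function name byte_length is hypothetical; it returns the value that len would hold for the given outputformat together with the number of bytes in the string, excluding any terminating null.

```python
def byte_length(text, outputformat):
    """Return (len, number of bytes) following the rules for the len parameter."""
    if outputformat == "utf8":
        # len counts Unicode characters; the byte count comes from the encoded form
        return len(text), len(text.encode("utf-8"))
    if outputformat == "utf16":
        units = len(text.encode("utf-16-le")) // 2    # surrogate pairs count as two units
        return units, 2 * units
    if outputformat == "utf32":
        units = len(text)                             # one 32-bit unit per character
        return units, 4 * units
    raise ValueError(f"unsupported outputformat: {outputformat}")


sample = "a\u20ac\U0001F600"   # 'a', EURO SIGN, and one character outside the BMP
for fmt in ("utf8", "utf16", "utf32"):
    print(fmt, byte_length(sample, fmt))
```

For this sample the UTF-16 length is 4 although the string holds only 3 Unicode characters, because the last character requires a surrogate pair.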
Returns A string containing the next text fragment on the page. The length of the fragment is
determined by the granularity option of TET_open_page( ). Even for granularity=glyph the
string may contain more than one character (see Section 7.1, »Important Unicode Concepts«, page 93).
If all text on the page has been retrieved an empty string or null object will be returned (see below). In this case TET_get_errnum( ) should be called to find out whether
there is no more text because of an error on the page, or because the end of the page has
been reached.
Bindings C language binding: the result is provided as null-terminated UTF-8 (default) or UTF-16/
UTF-32 string according to the outputformat option of TET_set_option( ). On i5/iSeries and
zSeries EBCDIC-encoded UTF-8 can also be selected, and is enabled by default. The returned data buffer can be used until the next call to this function. If no more text is
available a NULL pointer and *len=0 will be returned.
C++ and COM: the result is provided as Unicode string in UTF-16 format (wstring in C++).
If no more text is available an empty string will be returned.
Java, .NET and Objective-C: the result is provided as Unicode string. If no more text is
available a null (nil in Objective-C) object will be returned.
Perl and PHP: the result is provided as UTF-8 (default) or UTF-16/UTF-32 string according to the outputformat option of TET_set_option( ). If no more text is available an
empty string is returned.
Python: the result is provided as UTF-8 (default) or UTF-16/UTF-32 string according to
the outputformat option of TET_set_option( ). In Python 3 UTF-16/UTF-32 results are returned as bytes. If no more text is available None is returned.
Ruby: the result is provided as UTF-8 (default) or UTF-16/UTF-32 string according to the
outputformat option of TET_set_option( ). If no more text is available a nil object is returned.
REALbasic/Xojo: the result is provided as Unicode string. If no more text is available an
empty string is returned.
RPG language binding: the result is provided as Unicode string. If no more text is available NULL is returned.
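The retrieval loop implied by the description above can be sketched for the Python binding, where None signals that no more text is available. The sketch assumes a page which has already been opened. FakeTET is a hypothetical stand-in so that the code runs without the library; with the real binding a TET object would be passed instead. That get_errnum( ) returns 0 when no error occurred is an assumption.

```python
class FakeTET:
    """Hypothetical stand-in for a TET object; mimics only the calls used below."""

    def __init__(self, fragments):
        self._fragments = list(fragments)
        self.closed_pages = []

    def get_text(self, page):
        # the Python binding returns None when no more text is available
        return self._fragments.pop(0) if self._fragments else None

    def get_errnum(self):
        return 0                      # assumed to mean "no error"

    def close_page(self, page):
        self.closed_pages.append(page)


def read_page(tet, page):
    """Collect all text fragments of an open page, then release the page handle."""
    fragments = []
    try:
        while True:
            text = tet.get_text(page)
            if text is None:          # end of page or error
                break
            fragments.append(text)
        if tet.get_errnum() != 0:     # distinguish an error from the end of the page
            raise RuntimeError("text extraction stopped because of an error")
    finally:
        tet.close_page(page)          # close explicitly, even if an error occurred
    return fragments


tet = FakeTET(["Hello", "world"])
print(read_page(tet, page=7))
```

Closing the page in a finally clause follows the recommendation to release page handles explicitly when they are no longer needed.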
C++ const TET_char_info *get_char_info(int page)
C# Java int get_char_info(int page)
Perl PHP object get_char_info(long page)
C const TET_char_info *TET_get_char_info(TET *tet, int page)
Get detailed information for the next glyph in the most recent text chunk.
page
A valid page handle obtained with TET_open_page( ).
Note The name of this function is a misnomer. It should better be called TET_get_glyph_info( ) since
it reports information about visual glyphs on the page, not the corresponding Unicode
characters.
Returns If no more glyphs are available for the most recent text fragment returned by TET_get_text( ),
a binding-specific value will be returned. See section Bindings below for more details.
Details This function can be called one or more times after TET_get_text( ). It will advance to the
next glyph for the current text chunk associated with the supplied page handle (or return nothing if there are no more glyphs), and provide detailed information for this
glyph. There will be one or more successful calls to this function for a text chunk with N
glyphs and M logical characters. The relationship between N and M depends on the
granularity:
> For granularity=glyph each text chunk corresponds to a single glyph, i.e. N = 1. One
glyph corresponds to one character in many cases, i.e. M = 1. However, for ligature
glyphs a single glyph creates multiple characters, i.e. M > 1 and TET_get_char_info( )
must be called more than once.
> For granularities other than glyph a sequence of glyphs creates a sequence of characters, where each glyph may contribute to 0, 1, or more characters. The sequence of
glyphs serves as raw material for the sequence of Unicode characters. In other words,
there is no fixed relationship between N and M. The relationship between N and M
may be influenced by content analysis (e.g. hyphens are removed by the dehyphenation process) or Unicode postprocessing (e.g. characters are added or deleted because of a folding).
For granularities other than glyph this function advances to the next glyph which contributes to the chunk returned by the most recent call to TET_get_text( ). This way it is
possible to retrieve glyph details when the Wordfinder is active and a text chunk may
contain more than one character. In order to retrieve all glyph details for the current
text chunk this function must be called repeatedly until it returns no more info.
The glyph details in the structure or properties/fields are valid until the next call to
TET_get_char_info( ) or TET_close_page( ) with the same page handle. Since there is only a
single set of glyph info properties/fields per TET object, clients must retrieve all glyph
info before they call TET_get_char_info( ) again for the same or another page or document.
Bindings C and C++ language bindings: If no more glyphs are available for the most recent text
chunk returned by TET_get_text( ), a NULL pointer is returned. Otherwise, a pointer to a
TET_char_info structure containing information about a single glyph will be returned.
The members of the data structure are detailed in Table 10.16.
COM, Java, .NET, and Objective-C language bindings: -1 is returned if no more glyphs are
available for the most recent text chunk returned by TET_get_text( ), otherwise 1. Individual glyph info can be retrieved from the TET properties/public fields according to Table 10.16. All properties/fields contain the value -1 (the unknown field contains false) if
they are accessed although the function returned -1.
Perl and Python language bindings: 0 is returned if no more glyphs are available for the
most recent text chunk returned by TET_get_text( ), otherwise a hash containing the
keys listed in Table 10.16. Individual glyph info can be retrieved with the keys in this
hash.
PHP language binding: an empty (null) object is returned if no more glyphs are available
for the most recent text chunk returned by TET_get_text( ), otherwise an object containing the fields listed in Table 10.16. Individual glyph info can be retrieved from the member fields of this object. Integer fields in the glyph info object are implemented as long
in the PHP language binding.
REALbasic/Xojo binding: nil is returned if no more glyphs are available for the most recent text chunk returned by TET_get_text( ), otherwise a TET_char_info object containing
the members listed in Table 10.16. Individual glyph info can be retrieved with the keys
in this object. The attributes field is called attrs in the REALbasic/Xojo binding to work
around a REALbasic/Xojo interface problem.
Ruby binding: nil (null object) is returned if no more glyphs are available, and a TET_char_info
object otherwise.
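As a sketch of the glyph loop for the Python binding, the following code calls a get_char_info function until it reports no more glyphs, and decodes the attributes bit field according to the bit assignments listed in Table 10.16. The helper names, the short attribute labels, and the fake glyph data are hypothetical; with the real binding the bound method of a TET object would be passed as get_char_info.

```python
# Short labels for the attribute bits in Table 10.16 (label wording is an assumption)
ATTRIBUTE_BITS = {
    0: "subscript",
    1: "superscript",
    2: "dropcap",
    3: "shadow removed",
    4: "before hyphenation point",
    5: "hyphenation artifact",
    6: "after hyphenation point",
}


def decode_attributes(attributes):
    """Return the labels of all bits which are set in the attributes value."""
    return [name for bit, name in ATTRIBUTE_BITS.items() if attributes & (1 << bit)]


def glyphs_of_chunk(get_char_info, page):
    """Call get_char_info(page) until it signals that no more glyphs are available."""
    glyphs = []
    while True:
        info = get_char_info(page)
        if not info:                  # 0 (or None) means: no more glyphs for this chunk
            break
        glyphs.append(info)
    return glyphs


# Fake glyph data for the chunk "x2" where the digit is a superscript
fake_glyphs = iter([
    {"uv": 0x78, "type": 0, "attributes": 0},
    {"uv": 0x32, "type": 0, "attributes": 0b10},
    0,
])

for glyph in glyphs_of_chunk(lambda page: next(fake_glyphs), page=1):
    print(hex(glyph["uv"]), decode_attributes(glyph["attributes"]))
```

Since several attribute bits can be combined, the decoder returns a list instead of a single label.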
Table 10.16 Members of the TET_char_info structure (C, C++, Ruby), equivalent public fields (Java, PHP, Objective-C), keys
(Perl) or properties (COM and .NET) with their type and meaning. See Section 6.2, »Page and Text Geometry«, page 72 and
Section 6.3, »Text Color«, page 78 for more details.
property/field name
explanation
uv
(Integer) UTF-32 Unicode value for the current glyph. For granularities other than glyph this may be an
artificial value or an inserted separator character which has no relationship to the final text chunk. For
granularity=glyph the sequence of Unicode values for the glyphs is identical to the logical text, but for
other granularities it may be modified by various processing steps.
type
(Integer) Type of the character. The following types describe real characters which correspond to a glyph
on the page. The values of all other properties/fields are determined by the corresponding glyph:
0
Normal character which corresponds to exactly one glyph
1
Start of a sequence (e.g. ligature)
The following types describe artificial characters which do not correspond to a glyph on the page. The x
and y fields will specify the most recent real character’s endpoint, the width field will be 0, and all other
fields except uv will contain the values corresponding to the most recent real character:
10
Continuation of a sequence (e.g. ligature)
11
(Deprecated and unused)
12
Inserted word, line, or paragraph separator
attributes
(Integer) Glyph attributes expressed as bits which can be combined:
bit 0
Geometric or semantic subscript
bit 1
Geometric or semantic superscript
bit 2
Drop cap character (initial large character at the start of a paragraph)
bit 3
Glyph- or word-based shadow duplicate of this glyph has been removed
bit 4
Glyph represents last character before hyphenation point
bit 5
Hyphenation artifact (i.e. the hyphen character) which was removed unless
contentanalysis={keephyphenglyphs=true} was specified.
bit 6
Glyph represents the character after hyphenation point
unknown
(Boolean, in C, C++ and Perl: integer) Usually false (0), but will be true (1) if the original glyph could not
be mapped to Unicode and has been replaced with the character specified as unknownchar.
x, y
(Double) Position of the glyph’s reference point. The reference point is the lower left corner of the glyph
box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.
width
(Double) Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial
characters (i.e. inserted separators with type=12 and hyphenation artifacts with attribute bit 5 set) the
width is 0.
height
(Double) For vertical writing mode: height of the corresponding glyph according to the font metrics and
text output parameters (e.g. character spacing). The height is positive in the default coordinate system,
but negative for topdown coordinates. In monospaced vertical fonts all glyphs have fontsize as height
unless character spacing has been applied. Artificial characters (e.g. separators) have a height of 0.
For horizontal writing mode an approximation of the glyph height is supplied. This approximate value is
derived from font properties and therefore identical for all glyphs in a font. There is no guarantee that
the visible glyph has the exact height value supplied here.
alpha
(Double) Direction of inline text progression in degrees measured counter-clockwise (or clockwise for topdown coordinates). For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the digression from the standard vertical direction. The angle will be in the range
-180° < alpha ≤ +180°. For standard horizontal text as well as for standard text in vertical writing mode
the angle will be 0°.
beta
(Double) Text slanting angle in degrees measured counter-clockwise (or clockwise for topdown coordinates), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text (positive for topdown coordinates). The angle will be in the range -180° < beta ≤ 180°,
but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.
fontid
(Integer) Index of the font in the fonts[] pseudo object (see the pCOS Path Reference). fontid is never
negative.
fontsize
(Double) Size of the font (always positive); the relation of this value to the actual height of glyphs is not
fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses
all ascenders (including accented characters) and descenders.
textrendering
(Integer) Text rendering mode:
0
fill text
1
stroke text (outline)
2
fill and stroke text
3
invisible text (for OCR text)
4
fill text and add it to the clipping path
5
stroke text and add it to the clipping path
6
fill and stroke text and add it to the clipping path
7
add text to the clipping path
Text in Type 3 fonts: textrendering=3 and 7 result in invisible text; all other values of textrendering are
irrelevant and are ignored.
colorid
(Integer) Index of the text color which represents the combination of fill color, stroke color, and text rendering. All occurrences of the same combination in a document are represented by the same color id. Different combinations are represented by different ids, which means that colors of multiple glyphs can be
checked for equality by comparing their color ids. For example, by comparing the colorid values of successive glyphs you can identify changes in text color. The exact color space and color components for filling and/or stroking text can be retrieved with TET_get_color_info( ).
1. In the REALbasic/Xojo binding this field is called attrs.
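The type and attributes values in Table 10.16 can be interpreted with a few lines of client code. The following Python sketch is only an illustration: the helper names and label strings are hypothetical and not part of the TET API; only the numeric type values and bit positions are taken from the table.

```python
# Label strings and helper names are illustrative; only the numeric
# type values and bit positions are taken from Table 10.16.
REAL_TYPES = {0: "normal character", 1: "start of a sequence"}
ARTIFICIAL_TYPES = {
    10: "continuation of a sequence",
    11: "deprecated and unused",
    12: "inserted separator",
}

ATTRIBUTE_BITS = [
    "subscript",                          # bit 0
    "superscript",                        # bit 1
    "drop cap",                           # bit 2
    "shadow duplicate removed",           # bit 3
    "last character before hyphenation",  # bit 4
    "hyphenation artifact",               # bit 5
    "character after hyphenation",        # bit 6
]


def is_artificial(char_type):
    """True for characters which do not correspond to a glyph on the page."""
    return char_type in ARTIFICIAL_TYPES


def describe_type(char_type):
    """Return a readable label for a type value."""
    if char_type in REAL_TYPES:
        return REAL_TYPES[char_type]
    return ARTIFICIAL_TYPES.get(char_type, "unknown type")


def decode_attributes(attributes):
    """Return the labels of all bits which are set in the attributes value."""
    return [label for bit, label in enumerate(ATTRIBUTE_BITS)
            if attributes & (1 << bit)]
```

For example, decode_attributes(18) reports bits 1 and 4, i.e. a superscript glyph which is the last character before a hyphenation point.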
C++ const TET_color_info *get_color_info(int doc, int colorid, wstring optlist)
C# Java int get_color_info(int doc, int colorid, String optlist)
Perl PHP object get_color_info(long doc, long colorid, string optlist)
C const TET_color_info *TET_get_color_info(TET *tet, int doc, int colorid, const char *optlist)
Request color details for a color id which has been retrieved with TET_get_char_info.
doc
Valid document handle obtained with TET_open_document*( ).
colorid
Valid color id obtained from the colorid member of TET_get_char_info( ).
optlist
Option list according to Table 10.17 specifying the kind of color to retrieve.
Table 10.17 Option for TET_get_color_info( )
option
description
usage
(Keyword) Usage of the color (default: fill):
fill
Color used for filling
stroke
Color used for stroking
Returns A structure with details about the requested color space and color.
Bindings C and C++ language bindings: A pointer to a TET_color_info structure containing information about the requested fill or stroke color. The members of the data structure are
detailed in Table 10.18.
COM, Java, .NET, and Objective-C language bindings: color information can be retrieved
from the TET properties/public fields according to Table 10.18.
Perl and Python language bindings: color information can be retrieved from a hash containing the keys listed in Table 10.18.
PHP language binding: color information can be retrieved from an object containing the
fields listed in Table 10.18.
REALbasic/Xojo binding: color information can be retrieved from a TET_color_info object
containing the members listed in Table 10.18.
Ruby binding: color information can be retrieved from a TET_color_info object containing the members listed in Table 10.18.
Table 10.18 Members of the TET_color_info structure (C, C++, Ruby), equivalent public fields (Java, PHP, Objective-C), keys
(Perl) or properties (COM and .NET) with their type and meaning. See Section 6.3, »Text Color«, page 78 for more details.
property/
field name
explanation
colorspaceid
(Integer) Index of the color space in the colorspaces[] pseudo object (see the pCOS Path Reference), or -1
if no color is applied to the glyph.
patternid
(Integer) Index of the pattern in the patterns[] pseudo object (see the pCOS Path Reference), or -1 if no
pattern is applied to the glyph.
components
(Array of double values) Color component values which must be interpreted in the color space reported
with colorspaceid.
C and C++ language bindings: The number of relevant array entries is available in the n field.
n
(Integer; C and C++ language bindings only) Number of array entries in components
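Since all occurrences of the same color combination share one color id, a client can group successive glyphs by colorid and query the color details only once per group. The following Python sketch illustrates the grouping step. It operates on plain sample dictionaries rather than real TET output; the function name is hypothetical, and the uv key is assumed to hold the Unicode code point of the character.

```python
from itertools import groupby


def color_runs(glyphs):
    """Group successive glyphs which share the same colorid.

    glyphs: iterable of dicts with the keys "uv" (assumed to be the
    Unicode code point) and "colorid". These are sample records which
    stand in for the char info delivered by a language binding.
    Returns a list of (colorid, text) tuples, one per run.
    """
    runs = []
    for colorid, group in groupby(glyphs, key=lambda g: g["colorid"]):
        text = "".join(chr(g["uv"]) for g in group)
        runs.append((colorid, text))
    return runs
```

Each colorid in the result could then be passed once to get_color_info( ) with usage=fill or usage=stroke, instead of querying the color for every glyph.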
10.6 Image Retrieval Functions
C++ const TET_image_info *get_image_info(int page)
C# Java int get_image_info(int page)
Perl PHP object image_info get_image_info(long page)
C const TET_image_info *TET_get_image_info(TET *tet, int page)
Retrieve information about the next image on the page (but not the actual pixel data).
page
A valid page handle obtained with TET_open_page( ).
Returns If no more images are available on the page, a binding-specific value will be returned,
otherwise image details are available in a binding-specific manner. See section Bindings
below for more details.
Details This function advances to the next image associated with the supplied page handle (or
returns 0/NULL if there are no more images) and provides detailed information for the
image. The following types of images are ignored:
> Images used as mask are ignored. They can be retrieved via pCOS and the maskid
pseudo object (see Section 8.5.2, »Image Masks and Soft Masks«, page 127).
> Images which have been consumed by the merging process and merged to form a
larger image (i.e. mergetype=consumed) are ignored.
> Images which have been eliminated by the small image filter (see Section 8.4, »Small
Image Filtering«, page 125) are ignored.
> Images which are located completely outside the extraction area specified by the
clippingarea, excludebox, and includebox options are ignored.
The image details in the structure or properties/fields are valid until the next call to
TET_get_image_info( ) or TET_close_page( ) with the same page handle. Since there is only
a single set of image info properties/fields per TET object, clients must retrieve all image
info before they call TET_get_image_info( ) again for the same or another page.
Bindings C and C++ language bindings: If no more images are available on the page a NULL pointer will be returned. Otherwise, a pointer to a TET_image_info structure containing information about the image. The members of the data structure are detailed in Table 10.19.
COM, Java, .NET, and Objective-C language bindings: -1 will be returned if no more images are available on the page, otherwise 1. Individual image info can be retrieved from the
TET properties/fields according to Table 10.19. All properties/fields contain the value -1
if they are accessed although the function returned -1.
Perl and Python language bindings: 0 will be returned if no more images are available
on the page, otherwise a hash containing the keys listed in Table 10.19. Individual image
info can be retrieved with the keys in this hash.
PHP language binding: an empty (null) object will be returned if no more images are
available on the page, otherwise an object of type TET_image_info. Individual image info
can be retrieved from its fields according to Table 10.19. Integer fields in the image info
object are implemented as long in the PHP language binding.
REALbasic/Xojo binding: nil will be returned if no more images are available on the page,
otherwise a TET_image_info object containing the members listed in Table 10.19. Individual image info can be retrieved with the members of this object.
Ruby binding: nil (null object) will be returned if no more images are available, and a
TET_image_info object otherwise.
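The enumeration of all images on a page can be sketched as follows for the Python binding, where 0 signals that no more images are available. The function name collect_image_info and the FakeImageTET class are hypothetical; the latter only stands in for a TET object so that the sketch runs without the library. Copying each result before the next call is a defensive measure motivated by the validity rule described above.

```python
def collect_image_info(tet, page):
    """Copy the info of all remaining images on a page into a list of dicts."""
    images = []
    while True:
        info = tet.get_image_info(page)
        if not info:               # 0: no more images on this page
            break
        images.append(dict(info))  # copy before the next call
    return images


class FakeImageTET:
    """Hypothetical stand-in which delivers prepared sample records."""

    def __init__(self, infos):
        self._infos = list(infos)

    def get_image_info(self, page):
        return self._infos.pop(0) if self._infos else 0
```

With a real TET object the page argument would be a handle obtained with open_page( ), and the imageid entries in the result could be passed on to write_image_file( ) or get_image_data( ).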
Table 10.19 Members of the TET_image_info structure (C, C++, Ruby), equivalent public fields (Java, PHP, Objective-C), and
properties (COM and .NET) with their type and meaning. See Section 8.1, »Image Extraction Basics«, page 115, for details.
property/
field name
explanation
x, y
(Double) Position of the image’s reference point. The reference point is the lower left corner of the image.
width,
height
(Double) Width and height of the image on the page in points, measured along the image’s edges
alpha
(Double) Direction of the pixel rows. The angle will be in the range -180° < alpha ≤ +180°. For upright images alpha will be 0°.
beta
(Double) Direction of the pixel columns, relative to the perpendicular of alpha. The angle will be in the
range -180° < beta ≤ +180°, but different from ±90°. For upright images beta will be in the range -90° <
beta < +90°. If abs(beta) > 90° the image will be mirrored at the baseline.
imageid
(Integer) Index of the image in the pCOS pseudo object images[ ]. Detailed image and mask properties
can be retrieved via the entries in this pseudo object (see the pCOS Path Reference).
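The angle conventions in Table 10.19 can be checked with two small helpers. This Python sketch is an illustration only; the function names are hypothetical. The first helper maps an arbitrary angle to the documented range -180° < angle ≤ +180°, the second applies the mirroring rule for beta.

```python
def normalize_angle(degrees):
    """Map an arbitrary angle to the range -180 < angle <= +180."""
    angle = degrees % 360.0   # now 0 <= angle < 360
    if angle > 180.0:
        angle -= 360.0
    return angle


def is_mirrored(beta):
    """According to Table 10.19 the image is mirrored if abs(beta) > 90."""
    return abs(beta) > 90.0
```

Such a normalization is useful when client code combines alpha with its own rotation values and wants to compare the result against values reported by TET.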
C++ Java C# int write_image_file(int doc, int imageid, String optlist)
Perl PHP long write_image_file(long doc, long imageid, string optlist)
C int TET_write_image_file(TET *tet, int doc, int imageid, const char *optlist)
Write image data to disk.
doc
A valid document handle obtained with TET_open_document*( ).
imageid pCOS ID of the image. This ID can be retrieved from the imageid field after a
successful call to TET_get_image_info( ), or by looping over all entries in the images pseudo object (there are length:images entries in this array).
optlist An option list specifying page options according to Table 10.20. The following
options can be used:
compression, dpi, filename, keepiccprofile, keepxmp, preferredtiffcompression, typeonly,
validatejpeg.
The following options of other functions also affect the generated image output:
> TET_open_document*( ): allowjpeg2000, spotcolor (see Table 10.8)
> TET_open_page/TET_process_page( ): imageanalysis (see Table 10.10 and Table 10.13)
Returns -1 on error, or a value greater than 0 otherwise. If -1 is returned it is recommended to call
TET_get_errmsg( ) to find out more details about the error. No image output is created in
case of an error. If the return value is different from -1 it indicates that the image can be
extracted in the file format indicated by the return value:
> -1: an error occurred; no image is extracted
> 10: image extracted as TIFF (.tif)
> 20: image extracted as JPEG (.jpg)
> 31: image extracted as plain JPEG 2000 (.jp2) (only if allowjpeg2000=true)
> 32: image extracted as extended JPEG 2000 (.jpf) (only if allowjpeg2000=true)
> 33: image extracted as raw JPEG 2000 code stream (.j2k) (only if allowjpeg2000=true)
> 50: image extracted as JBIG2 (.jbig2)
Details This function converts the pixel data for the image with the specified pCOS ID to one of
several image formats and writes the result to a disk file. If the typeonly option has been
supplied, only the image type is returned, but no image file is generated.
Bindings C/C++: macros for the return values are available in tetlib.h.
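In bindings without predefined macros the return values listed above can be mapped to file suffixes with a simple lookup table. The following Python sketch uses only the numeric values and suffixes from the list; the names IMAGE_TYPES and image_suffix are hypothetical.

```python
# Return values of write_image_file( ) as listed above,
# mapped to (format name, file suffix).
IMAGE_TYPES = {
    10: ("TIFF", ".tif"),
    20: ("JPEG", ".jpg"),
    31: ("plain JPEG 2000", ".jp2"),
    32: ("extended JPEG 2000", ".jpf"),
    33: ("raw JPEG 2000 code stream", ".j2k"),
    50: ("JBIG2", ".jbig2"),
}


def image_suffix(return_value):
    """Return the file suffix for a write_image_file( ) return value."""
    if return_value == -1:
        raise RuntimeError("image extraction failed; call get_errmsg( ) for details")
    return IMAGE_TYPES[return_value][1]
```

A return value which is not contained in the table raises a KeyError in this sketch, which makes unexpected values visible instead of silently guessing a suffix.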
Table 10.20 Options for TET_write_image_file( ) and TET_get_image_data( )
option
description
compression
(Keyword) The algorithm for compressing the pixel data (default: auto):
auto
Select a suitable compression algorithm automatically.
none
(Only relevant for TIFF images) Write the pixel data without any compression if possible.
dpi
(List of one or two non-negative float values) One or two values specifying the image resolution in pixels
per inch in horizontal and vertical direction. If a single value is supplied it is used for both dimensions. The
supplied values are recorded in generated TIFF images. They don’t change the number of pixels in the image (i.e. no downsampling). See »Image resolution«, page 121, for details about determining image resolution. If one or two values are zero no resolution entry will be written. Default: 72
filename¹
(String; required unless typeonly is supplied) The name of the image file on disk. A suffix is added to the
filename to indicate the image file format.
Table 2.1, page 17 describes the file name conventions used by the TET command-line tool and in TETML. It
is recommended to use the same file name pattern if images are used together with TETML.
keepiccprofile
(Boolean) If true and an ICC profile is assigned to the image, the ICC profile is embedded in extracted TIFF
and JPEG images. Setting this option to false may result in smaller image files, but sacrifices color management. Default: true
keepxmp
(Boolean) If true and the image has associated XMP metadata in the PDF, the metadata will be embedded in extracted TIFF and JPEG images. Default: true
preferredtiffcompression
(Keyword) Compression scheme used for most extracted TIFF images (default: flate):
lzw
LZW compression (TIFF compression scheme 5)
flate
Flate compression, also called Adobe Deflate or zlib compression (TIFF compression scheme 8)
typeonly¹
(Boolean) The image type will be determined according to the supplied options, but no image file will be
written. This is useful for determining the type of image returned by TET_get_image_data( ), which does
not return the image type itself. Default: false
validatejpeg
(Boolean) If true, extracted JPEG images are validated to ensure correct image output. If false, processing is slightly faster, but invalid JPEG data is copied unmodified to the generated image file. Default: true
1. Only for TET_write_image_file( )
C++ const char *get_image_data(int doc, size_t *length, int imageid, wstring optlist)
C# Java final byte[ ] get_image_data(int doc, int imageid, String optlist)
Perl PHP string get_image_data(long doc, long imageid, string optlist)
C const char * TET_get_image_data(TET *tet, int doc, size_t *length, int imageid, const char *optlist)
Write image data to memory.
doc
A valid document handle obtained with TET_open_document*( ).
length (C and C++ language bindings only) C-style pointer to a memory location where
the length of the returned data in bytes will be stored.
imageid The pCOS ID of the image. This ID can be retrieved from the imageid field after
a successful call to TET_get_image_info( ), or by looping over all entries in the images
pCOS array (there are length:images entries in this array).
optlist An option list specifying image-related options according to Table 10.20. The
following options can be used: compression, keepxmp
Returns The data representing the image according to the specified options. In case of an error a
NULL pointer is returned in C and C++, and empty data in other language bindings. If an
error happens it is recommended to call TET_get_errmsg( ) to find out more details about
the error.
Details This function converts the pixel data for the image with the specified pCOS ID to one of
several image formats, and makes the data available in memory.
Bindings COM: Most client programs will use the Variant type to hold the image data.
C and C++ language bindings: The returned data buffer can be used until the next call to
this function.
REALbasic/Xojo: the result is provided as REALbasic/Xojo string with encoding -1 (binary
data).
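Since get_image_data( ) does not report the image type, it can be combined with a preceding write_image_file( ) call which uses the typeonly option. The following Python sketch shows one possible arrangement. The function name fetch_image and the FakeDataTET class are hypothetical; the class only simulates a TET object so that the sketch runs without the library.

```python
def fetch_image(tet, doc, imageid):
    """Return (type_code, data) for an image without writing a file."""
    # typeonly: determine the image type, but do not create a file
    type_code = tet.write_image_file(doc, imageid, "typeonly=true")
    if type_code == -1:
        raise RuntimeError(tet.get_errmsg())
    data = tet.get_image_data(doc, imageid, "")
    if not data:                  # empty data signals an error
        raise RuntimeError(tet.get_errmsg())
    return type_code, data


class FakeDataTET:
    """Hypothetical stand-in: image 0 is a JPEG, all other ids fail."""

    def write_image_file(self, doc, imageid, optlist):
        return 20 if imageid == 0 else -1

    def get_image_data(self, doc, imageid, optlist):
        return b"\xff\xd8\xff\xd9"

    def get_errmsg(self):
        return "simulated error"
```

The same options should be supplied to both calls where they affect the output format, so that the reported type matches the data which is actually returned.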
10.7 TET Markup Language (TETML) Functions
C++ Java C# int process_page(int doc, int pagenumber, String optlist)
Perl PHP long process_page(long doc, long pagenumber, string optlist)
C int TET_process_page(TET *tet, int doc, int pagenumber, const char *optlist)
Process a page and create TETML output.
doc
A valid document handle obtained with TET_open_document*( ).
pagenumber The physical number of the page to be processed. The first page has page
number 1. The total number of pages can be retrieved with TET_pcos_get_number( ) and
the pCOS path length:pages. The pagenumber parameter may be 0 if trailer=true.
optlist An option list specifying options from the following groups:
> General page-related options according to Table 10.10 (these will be ignored if
pagenumber=0):
clippingarea, contentanalysis, excludebox, fontsizerange, granularity, ignoreinvisibletext,
imageanalysis, includebox, layoutanalysis, skipengines
> Option specifying TETML details according to Table 10.21: tetml
Table 10.21 Additional options for TET_process_page( )
option
description
tetml
(Option list) Controls details of TETML. The following options are available:
elements (Option list) Specify optional TETML elements:
line
(Boolean; only for granularity=word) If true, TETML output includes Line elements between Para and Word levels. Default: false
glyphdetails
(Option list; only for granularity=glyph and word) Specify which attributes are reported for
each Glyph element (default for all suboptions: false):
all
(Boolean) Enable all attribute suboptions.
dehyphenation
(Boolean) Emit attribute dehyphenation to indicate hyphenated words.
dropcap (Boolean) Emit attribute dropcap to indicate large initial characters at the start
of a paragraph.
font
(Boolean) Emit attributes font, size, textrendering, unknown.
geometry (Boolean) Emit attributes x, y, width, alpha, beta.
sub
(Boolean) Emit attribute sub to indicate subscripts.
sup
(Boolean) Emit attribute sup to indicate superscripts.
shadow (Boolean) Emit attribute shadow to indicate shadow or simulated bold text.
textcolor (Boolean) Emit attributes fill and stroke for the glyph colors (subject to
textrendering) and corresponding Color elements.
trailer
(Boolean) If true, document trailer data, i.e. data after the last page, is emitted (it must be
appended to the page-specific data emitted earlier). This option is required in the last call to
this function in order to emit trailer data. If pagenumber=0 only trailer data (without any
page-specific data) is emitted. Once trailer=true has been supplied, no more calls to
TET_process_page( ) are allowed for the same document. Default: false
Returns This function always returns 1. PDF problems are reported in a TETML Exception element.
Problems related to option list parsing trigger an exception.
Details This function opens a page, creates TETML output according to the format-related options supplied to TET_open_document*( ), and closes the page. The generated data can be
retrieved with TET_get_tetml( ).
This function must only be called if the option tetml has been supplied in the corresponding call to TET_open_document*( ). Header data, i.e. document-specific data before
the first page, will be created by TET_open_document*( ) before the first page data. It can
be retrieved separately by calling TET_get_tetml( ) before the first call to
TET_process_page( ), or in combination with page-related data.
Trailer data, i.e. document-specific data after the last page, must be requested with
the trailer suboption when this function is called for the last time for a document. Trailer data can be created with a separate call after the last page (pagenumber=0), or together with the last page (pagenumber is different from 0). Pages can be retrieved in any order, and any subset of the document’s pages can be retrieved.
It is an error to call TET_close_document( ) without retrieving the trailer, or to call
TET_process_page( ) again after retrieving the trailer.
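The call sequence described above can be sketched in Python as follows: process all pages, request the trailer in a separate final call with pagenumber=0, and fetch the accumulated data with a single get_tetml( ) call. The function name extract_tetml and the FakeTetmlTET class are hypothetical. The class models only the buffer behavior so that the sketch runs without the library, and its bracketed markers are placeholders, not real TETML.

```python
def extract_tetml(tet, doc):
    """Process all pages and return the accumulated TETML as bytes."""
    n_pages = int(tet.pcos_get_number(doc, "length:pages"))
    for pageno in range(1, n_pages + 1):
        tet.process_page(doc, pageno, "")
    # Final call: emit only the trailer data (pagenumber=0).
    tet.process_page(doc, 0, "tetml={trailer=true}")
    # One call after the last process_page( ) delivers the whole document.
    return tet.get_tetml(doc, "")


class FakeTetmlTET:
    """Hypothetical stand-in which models the TETML buffer."""

    def __init__(self, n_pages):
        self.n_pages = n_pages
        self.buffer = b"[header]"   # header is created when opening the document
        self.trailer_done = False

    def pcos_get_number(self, doc, path):
        return float(self.n_pages)

    def process_page(self, doc, pagenumber, optlist):
        if self.trailer_done:
            raise RuntimeError("process_page( ) called after the trailer")
        if pagenumber > 0:
            self.buffer += b"[page %d]" % pagenumber
        if "trailer=true" in optlist:
            self.buffer += b"[trailer]"
            self.trailer_done = True
        return 1

    def get_tetml(self, doc, optlist):
        data, self.buffer = self.buffer, b""   # the buffer is cleared
        return data
```

For large documents the data could instead be fetched after each page in order to limit memory usage; the concatenated chunks then form the same output.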
C++ const char *get_tetml(int doc, size_t *length, wstring optlist)
C# Java final byte[ ] get_tetml(int doc, String optlist)
Perl PHP string get_tetml(long doc, string optlist)
C const char * TET_get_tetml(TET *tet, int doc, size_t *length, const char *optlist)
Retrieve TETML data from memory.
doc
A valid document handle obtained with TET_open_document*( ).
length (C and C++ language binding only) A pointer to a variable which will hold the
length of the returned string in bytes. length does not count the terminating null byte.
optlist
(Currently there are no supported options.)
Returns A byte array containing the next chunk of data. If the buffer is empty an empty string
will be returned (in C: a NULL pointer and *length=0).
Details This function retrieves TETML data which has been created by TET_open_document*( )
and one or more calls to TET_process_page( ). The TETML data is always encoded in UTF-8,
regardless of the outputformat option. The internal buffer is cleared by this call. It is not
required to call TET_get_tetml( ) after each call to TET_process_page( ). The client may accumulate the data for one or more pages or for the whole document in the buffer.
In TETML mode this function must be called at least once before
TET_close_document( ) since otherwise the data would no longer be accessible. If TET_get_tetml( ) is
called exactly once (such a single call must happen between the last call to
TET_process_page( ) and TET_close_document( )) the buffer is guaranteed to contain well-formed TETML output for the whole document.
This function must not be called if the filename suboption has been supplied to the
tetml option of TET_open_document*( ).
Bindings C and C++ language bindings: the result is provided as null-terminated UTF-8. On i5/
iSeries and zSeries EBCDIC-encoded UTF-8 is returned. The returned data buffer can be
used until the next call to TET_get_tetml( ).
Java and .NET language bindings: the result is provided as a byte array containing UTF-8
data.
COM: Most client programs will use the Variant type to hold the UTF-8 data.
REALbasic/Xojo: The result is returned as REALbasic/Xojo String with encoding UTF-8.
PHP language binding: the result is provided as UTF-8 string.
Python: the result is returned as 8-bit string (Python 3: bytes).
RPG language binding: the result is returned as null-terminated EBCDIC UTF-8.
C++ const char *get_xml_data(int doc, size_t *length, wstring optlist)
C# Java final byte[ ] get_xml_data(int doc, String optlist)
Perl PHP string get_xml_data(long doc, string optlist)
C const char *TET_get_xml_data(TET *tet, int doc, size_t *length, const char *optlist)
Deprecated, use TET_get_tetml( ).
10.8 pCOS Functions
The full pCOS syntax for retrieving object data from a PDF is supported. For a detailed
description please refer to the pCOS Path Reference which is available as a separate document.
C++ Java C# double pcos_get_number(int doc, String path)
Perl PHP float pcos_get_number(int doc, string path)
C double TET_pcos_get_number(TET *tet, int doc, const char *path, ...)
Get the value of a pCOS path with type number or boolean.
doc
A valid document handle obtained with TET_open_document*( ).
path
A full pCOS path for a numerical or boolean object.
Additional parameters (C language binding only) A variable number of additional parameters can be supplied if the path parameter contains corresponding placeholders (%s
for strings or %d for integers; use %% for a single percent sign). Using these parameters
will save you from explicitly formatting complex paths containing variable numerical
or string values. The client is responsible for making sure that the number and type of
the placeholders matches the supplied additional parameters.
Returns The numerical value of the object identified by the pCOS path. For Boolean values 1 will
be returned if they are true, and 0 otherwise.
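The additional parameters are available in the C language binding only; in other bindings the path must be formatted before the call. The following Python sketch shows a minimal way to expand the same placeholders. The helper name pcos_path is hypothetical, and the sample paths use pseudo objects and keys which are mentioned in this chapter.

```python
def pcos_path(template, *args):
    """Expand %s, %d and %% placeholders in a pCOS path template."""
    return template % args


# Possible usage with a TET object (not executed here):
#   title = tet.pcos_get_string(doc, pcos_path("bookmarks[%d]/Title", 3))
#   count = tet.pcos_get_number(doc, pcos_path("length:%s", "images"))
```

As in the C binding, the caller is responsible for supplying arguments which match the number and type of the placeholders; a mismatch raises a TypeError in this sketch.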
C++ Java C# String pcos_get_string(int doc, String path)
Perl PHP string pcos_get_string(int doc, string path)
C const char *TET_pcos_get_string(TET *tet, int doc, const char *path, ...)
Get the value of a pCOS path with type name, number, string, or boolean.
doc
A valid document handle obtained with TET_open_document*( ).
path
A full pCOS path for a string, name, or boolean object.
Additional parameters (C language binding only) A variable number of additional parameters can be supplied if the path parameter contains corresponding placeholders (%s
for strings or %d for integers; use %% for a single percent sign). Using these parameters
will save you from explicitly formatting complex paths containing variable numerical
or string values. The client is responsible for making sure that the number and type of
the placeholders matches the supplied additional parameters.
Returns A string with the value of the object identified by the pCOS path. For Boolean values the
strings true or false will be returned.
Details This function raises an exception if pCOS does not run in full mode and the type of the
object is string. However, the objects /Info/* (document info keys) can also be retrieved in
restricted pCOS mode if nocopy=false or plainmetadata=true, and bookmarks[...]/Title as
well as all paths starting with pages[...]/annots[...]/ can be retrieved in restricted pCOS
mode if nocopy=false.
This function assumes that strings retrieved from the PDF document are text strings.
String objects which contain binary data should be retrieved with TET_pcos_get_stream( )
instead which does not modify the data in any way.
Bindings C language binding: The string will be returned in UTF-8 format (on zSeries and i5/
iSeries: EBCDIC-UTF-8) without BOM. The returned strings will be stored in a ring buffer
with up to 10 entries. If more than 10 strings are queried, buffers will be reused, which
means that clients must copy the strings if they want to access more than 10 strings in
parallel. For example, up to 10 calls to this function can be used as parameters for a
printf( ) statement since the return strings are guaranteed to be independent if no more
than 10 strings are used at the same time.
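The effect of the ring buffer can be illustrated with a small model. The following Python sketch is not the library's implementation; the class name StringRing is hypothetical. It only demonstrates why the eleventh string replaces the first one when 10 slots are reused in turn.

```python
class StringRing:
    """Model of a ring buffer with a fixed number of string slots."""

    def __init__(self, size=10):
        self.slots = [None] * size
        self.next = 0

    def store(self, value):
        """Store value in the next slot and return the slot index."""
        index = self.next
        self.slots[index] = value
        self.next = (index + 1) % len(self.slots)
        return index


ring = StringRing()
handles = [ring.store("s%d" % i) for i in range(11)]
# The 11th string reuses slot 0, so the first string is no longer available.
```

A C client which needs more than 10 strings at the same time would therefore copy each returned string before issuing further calls.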
C++ language binding: The string will be returned as wstring in the default wstring configuration of the C++ wrapper. In string compatibility mode on zSeries and i5/iSeries the
result will be returned in EBCDIC-UTF-8 without BOM.
Java and .NET bindings: the result will be provided as Unicode string. If no more text is
available a null object will be returned.
Perl, PHP, Python and Ruby language bindings: the result will be provided as UTF-8
string. If no more text is available a null object will be returned.
RPG language binding: the result will be provided as EBCDIC-UTF-8 string.
C++ const unsigned char *pcos_get_stream(int doc, int *length, string optlist, wstring path)
C# Java final byte[ ] pcos_get_stream(int doc, String optlist, String path)
Perl PHP string pcos_get_stream(int doc, string optlist, string path)
C const unsigned char *TET_pcos_get_stream(TET *tet, int doc, int *length, const char *optlist,
const char *path, ...)
Get the contents of a pCOS path with type stream, fstream, or string.
doc
A valid document handle obtained with TET_open_document*( ).
length (C and C++ language bindings only) A pointer to a variable which will receive
the length of the returned stream data in bytes.
optlist
An option list specifying stream retrieval options according to Table 10.22.
path A full pCOS path for a stream or string object.
Additional parameters (C language binding only) A variable number of additional parameters can be supplied if the path parameter contains corresponding placeholders (%s
for strings or %d for integers; use %% for a single percent sign). Using these parameters
will save you from explicitly formatting complex paths containing variable numerical
or string values. The client is responsible for making sure that the number and type of
the placeholders matches the supplied additional parameters.
Returns The unencrypted data contained in the stream or string. The returned data will be empty (in C and C++: NULL) if the stream or string is empty, or if the contents of encrypted
attachments in an unencrypted document are queried and the attachment password
has not been supplied.
If the object has type stream all filters are removed from the stream contents (i.e. the
actual raw data is returned) unless keepfilter=true. If the object has type fstream or string
the data is delivered exactly as found in the PDF file, with the exception of ASCII85 and
ASCIIHex filters which are removed.
In addition to decompressing the data and removing ASCII filters, text conversion
may be applied according to the convert option.
JPX-compressed streams are treated as follows: image data with 1..8 bits per component is returned with 8 bits per component; image data with 9..16 bits per component is
returned with 16 bits per component. If no PDF colorspace is present and the JPX-compressed stream contains an internal color palette, the palette is applied before returning the uncompressed stream data to ensure that the pixel data matches the reported
colorspace and number of components. Note that the palette is not applied if the PDF
colorspace Indexed is present.
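The bit depth rule for JPX-compressed streams can be expressed as a small function. This Python sketch is an illustration; the function name is hypothetical, and since the text above does not describe bit depths outside the range 1..16, the sketch rejects them.

```python
def jpx_output_bits(bits_per_component):
    """Return the bit depth of the data delivered for a JPX stream."""
    if 1 <= bits_per_component <= 8:
        return 8
    if 9 <= bits_per_component <= 16:
        return 16
    # Not covered by the description above
    raise ValueError("unsupported bit depth: %r" % bits_per_component)
```

Clients which allocate buffers for the returned pixel data can use the result to compute the number of bytes per component (1 or 2).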
Details This function will throw an exception if pCOS does not run in full mode (see the pCOS
Path Reference). As an exception, the object /Root/Metadata can also be retrieved in restricted pCOS mode if nocopy=false or plainmetadata=true. An exception will also be
thrown if path does not point to an object of type stream, fstream, or string.
Despite its name this function can also be used to retrieve objects of type string. Unlike TET_pcos_get_string( ), which treats the object as a text string, this function will not
modify the returned data in any way. Binary string data is rarely used in PDF, and cannot be reliably detected automatically. The user is therefore responsible for selecting
the appropriate function for retrieving string objects as binary data or text.
Bindings COM: Most client programs will use the Variant type to hold the stream contents. JavaScript with COM does not allow retrieving the length of the returned variant array (but
this does work with other languages and COM).
C and C++ language bindings: The returned data buffer can be used until the next call to
this function.
Python: the result will be returned as 8-bit string (Python 3: bytes).
Note This function can be used to retrieve embedded font data from a PDF. Users are reminded of
the fact that fonts are subject to the respective font vendor’s license agreement, and must not
be reused without the explicit permission of the respective intellectual property owners. Please
contact your font vendor to discuss the relevant license agreement.
Table 10.22 Options for TET_pcos_get_stream( )
option       description
convert      (Keyword; ignored for streams which are compressed with unsupported filters) Controls whether or not the string or stream contents will be converted (default: none):
             none       Treat the contents as binary data without any conversion.
             unicode    Treat the contents as textual data (i.e. exactly as in TET_pcos_get_string( )), and normalize it to Unicode. In non-Unicode-aware language bindings this means the data will be converted to UTF-8 format without BOM. This option is required for the data type »text stream« in PDF, which is rarely used (e.g. it can be used for JavaScript, although most JavaScript is contained in string objects, not stream objects).
keepfilter   (Boolean; recommended only for image data streams; ignored for streams which are compressed with unsupported filters) If true, the stream data will be compressed with the filter which is specified in the image’s filterinfo pseudo object (see the pCOS Path Reference). If false, the stream data will be uncompressed. Default: true for all unsupported filters, false otherwise
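As a hedged illustration of how the two options combine into an option list, the following sketch builds the string from its parts. The helper build_stream_optlist is hypothetical and not part of the TET API. It relies only on the key=value option list syntax and the option names and values from Table 10.22.

```python
from typing import Optional


def build_stream_optlist(convert: str = "none",
                         keepfilter: Optional[bool] = None) -> str:
    """Assemble an option list for retrieving a string or stream object."""
    if convert not in ("none", "unicode"):
        raise ValueError(f"unknown convert keyword: {convert}")
    parts = [f"convert={convert}"]
    # Leave keepfilter out entirely so that the library default applies
    # (true for unsupported filters, false otherwise).
    if keepfilter is not None:
        parts.append("keepfilter=" + ("true" if keepfilter else "false"))
    return " ".join(parts)


binary_optlist = build_stream_optlist()
text_optlist = build_stream_optlist("unicode", keepfilter=False)
```

The resulting string would be supplied as the optlist argument when retrieving the stream; omitting keepfilter is usually the safer choice unless compressed image data is explicitly wanted.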
A TET Library Quick Reference
The following tables contain an overview of all TET API functions. The prefix (C) denotes
C prototypes of functions which are not available in the Java language binding.
Setup Functions
Function prototype                                                             page
(C) TET *TET_new(void)                                                         164
void delete( )                                                                 164

Option Handling
Function prototype                                                             page
void set_option(String optlist)                                                162

PVF Functions
Function prototype                                                             page
void create_pvf(String filename, byte[] data, String optlist)                  165
int delete_pvf(String filename)                                                166
int info_pvf(String filename, String keyword)                                  166

Unicode Conversion Function
Function prototype                                                             page
String convert_to_unicode(String inputformat, byte[ ] input, String optlist)   167

Exception Handling Functions
Function prototype                                                             page
String get_apiname( )                                                          169
String get_errmsg( )                                                           169
int get_errnum( )                                                              169

Document Functions
Function prototype                                                             page
int open_document(String filename, String optlist)                             172
(C) int TET_open_document_callback(TET *tet, void *opaque, size_t filesize, size_t (*readproc)(void *opaque, void *buffer, size_t size), int (*seekproc)(void *opaque, long offset), const char *optlist)   179
void close_document(int doc)                                                   179

Page Functions
Function prototype                                                             page
int open_page(int doc, int pagenumber, String optlist)                         181
void close_page(int page)                                                      188
Text and Glyph Details Retrieval Functions
Function prototype                                                             page
String get_text(int page)                                                      190
int get_char_info(int page)                                                    191
int get_color_info(int doc, int colorid, String optlist)                       194

Image Retrieval Functions
Function prototype                                                             page
int get_image_info(int page)                                                   196
int write_image_file(int doc, int imageid, String optlist)                     197
final byte[ ] get_image_data(int doc, int imageid, String optlist)             199

TET Markup Language (TETML) Functions
Function prototype                                                             page
int process_page(int doc, int pagenumber, String optlist)                      200
final byte[ ] get_tetml(int doc, String optlist)                               201

pCOS Functions
Function prototype                                                             page
double pcos_get_number(int doc, String path)                                   203
String pcos_get_string(int doc, String path)                                   203
final byte[ ] pcos_get_stream(int doc, String optlist, String path)            204
B Revision History
Revision history of this manual
Date                  Changes
November 03, 2015     Updates for TET 5.0
January 27, 2015      Updates for TET 4.4
May 26, 2014          Updates for TET 4.3
May 17, 2013          Updates for TET 4.2
April 04, 2012        Updates for TET 4.1p1
February 20, 2012     Updates for TET 4.1
September 22, 2010    Updates for TET 4.0p2
July 27, 2010         Updates for TET 4.0
February 01, 2009     Updates for TET 3.0
January 16, 2008      Updated the manual for TET 2.3
January 23, 2007      Minor additions for TET 2.2
December 14, 2005     Additions and corrections for TET 2.1.0; added descriptions for the PHP and RPG language bindings
June 20, 2005         Expanded and reorganized the manual for TET 2.0.0
October 14, 2003      Updated the manual for TET 1.1
November 23, 2002     Added the description of TET_open_doc_callback( ) and a code sample for determining the page size for TET 1.0.2
April 4, 2002         First edition for TET 1
Index
A
annotations 69
API reference 155
Arabic 83
area of text extraction 72
artifacts in Tagged PDF 183
ascender 75
attachment password 57
B
bidirectional text 83
BMP (Basic Multilingual Plane) 93
bookmarks 69
Boolean values in option lists 159
Byte Order Mark (BOM) 94
C
C binding 24
C++ and .NET 31
C++ binding 26
canonical decomposition 101
capheight 75
categories of resources 59
characters and glyphs 93
CJK (Chinese, Japanese, Korean) 12, 80
    compatibility forms 81
    configuration 7
    word boundaries 80
CLI 26
codelist 110
color of text 78
color spaces 126
COM binding 28
command-line tool 17
comments 69
commercial license 10
compatibility decomposition 101
composite characters 95
concordance (XSLT sample) 151
connector 43
content analysis 85
coordinate system 72
CSV format 153
D
decomposition 101
dehyphenation 87
descender 76
DeviceN colorspace 126
Dispose( ) 164
document and page functions 172
document domains 67
document info entries 67
document styles 89
double-byte variants 81
E
end points of glyphs and words 76
EUDC fonts 109
evaluation version 7
examples
    text extraction status 57
    XSLT 151
exception handling 23
    in C 24
F
fake bold removal 87
file attachments 69
file search 60
fill color of text 78
float and integer values in option lists 160
folding 98
font filtering (XSLT sample) 151
font statistics (XSLT sample) 152
FontReporter plugin 12, 109
form fields 69
fullwidth variants 81
G
geometry of images 120
glyph metrics 73
glyph rules 113
glyphlist 112
glyphs 93
granularity 85
H
halfwidth variants 81
Hebrew 83
highlighting 76
HTML converter (XSLT sample) 153
I
ICC profiles 126
ideographic text: word boundaries 80
IFilter for Microsoft products 51
images
    color fidelity 126
    determining type 116
    extract to disk or memory 115
    extracting 115
    formats 115
    geometry 120
    merging 123
    number of images in a document 118
    page-based extraction loop 119
    placed images 118
    resolution 121
    resource-based extraction loop 120
    resources 118
    small image removal 125
    XMP metadata 116
inch 72
index (XSLT sample) 153
installing TET 7
invisible text 194
J
J2EE application servers 29
Java binding 29
Javadoc 30
JBIG2 115
JPEG 115
JPEG 2000 115
K
keywords in option lists 159
L
layers 70, 183
license key 8
ligatures 95
list values in option lists 156
logging 170
logo fonts 109
Lucene search engine 44
M
master password 57
MediaWiki 55
millimeters 72
mini samples 14
N
nested option lists 156
.NET binding 31
normalization 105
numbers in option lists 160
O
Objective-C binding 32
optimizing performance 63
option list syntax 155
option lists 155
Oracle Text 48
outline text 194
owner password 57
P
packages 70
page boxes 72
page-based image extraction loop 119
passwords 57
pCOS
    API functions 203
    Cookbook 15
PDF versions 11
performance optimization 63
Perl binding 34
permissions password 57
PHP binding 35
placed images 118
points 72
portfolios 70
postprocessing 96
preprocessing 96
prerotated glyphs 81
protected documents 57
PUA (Private Use Area) 94, 99, 109
Python binding 37
R
raw text extraction (XSLT sample) 154
REALbasic/Xojo binding 38
rectangles in option lists 160
resource configuration 59
resource-based image extraction loop 120
resourcefile parameter 61
response file 20
roadmap to documentation and samples 14
RPG binding 41
Ruby binding 39
S
schema 139
searching for font usage (XSLT sample) 152
searchpath 60
separation colorspace 126
sequences 95
servlets 29
shadow removal 87
shrug feature 57
single-byte variants 81
small image removal 125
Solr search server 47
spot color 126
strings in option lists 157
stroke color of text 78
surrogates 94
syntax of option lists 155
T
table detection 91
table extraction (XSLT sample) 153
Tagged PDF 70, 183
TET command-line tool 17
TET connector 43
    for Lucene 44
    for MediaWiki 55
    for Microsoft products 51
    for Oracle 48
    for Solr 47
    for TIKA 53
TET Cookbook 15
TET features 11
TET Markup Language (TETML) 129
TET plugin for Adobe Acrobat 43
TET_CATCH( ) 169
TET_close_document( ) 179
TET_close_page( ) 188
TET_convert_to_unicode( ) 167
TET_create_pvf() 165
TET_delete_pvf() 166
TET_delete( ) 164
TET_EXIT_TRY( ) 24, 169
TET_get_apiname() 169
TET_get_char_info( ) 191
TET_get_color_info( ) 194
TET_get_errmsg( ) 169
TET_get_errnum( ) 169
TET_get_image_data( ) 199
TET_get_image_info( ) 196
TET_get_tetml( ) 201
TET_get_text( ) 190
TET_info_pvf() 166
TET_new( ) 164
TET_open_document_callback( ) 179
TET_open_document( ) 172
TET_open_page( ) 181
TET_pcos_get_number( ) 203
TET_pcos_get_stream( ) 204
TET_pcos_get_string( ) 203
TET_RETHROW( ) 169
TET_set_option( ) 162
TET_TRY( ) 169
TET_write_image_file( ) 197
tet.upr 61
TETML 129
    schema 139
TETRESOURCEFILE environment variable 61
TeX documents 65
text color 78
text extraction status 57
text filtering 96
TIFF 115
TIKA toolkit 53
ToUnicode CMap 111
U
Unichar values in option lists 158
Unicode
    BOM 94
    concepts 93
    decomposition 101
    encoding forms 94
    encoding schemes 94
    folding 98
    in option lists 158
    normalization 105
    postprocessing 98
    pre- and postprocessing 96
    preprocessing 96
    sets 158
units 72
unmappable glyphs 108
UPR file format 59
user password 57
UTF formats 94
UTF-32 107
V
vertical writing mode 80
W
word boundary detection 86
Wordfinder 86
X
XFA forms 144
xheight 75
XMP metadata 68
    for images 116
    XSLT sample 153
Xojo binding 38
XSD schema for TETML 139
XSLT 147
samples 14, 151