
i-net PDFC is a tool for comparing two PDF files (or the PDF files contained in two folders) for content differences. It is useful for comparing the output of two different PDF generators, e.g. OpenOffice and Adobe PDF Writer, or of different versions of one program, e.g. two versions of i-net Clear Reports.
As the name states, the PDF content comparer performs neither a pixel-based nor a structural comparison, but an element-based check for differences between the documents.
The following elements are compared and any differences logged:
These differences each have a configurable tolerance value so that minor differences can be ignored if necessary. For example, it might not matter whether a line or box is misplaced by a couple of pixels, as long as the line is there and looks as intended.
To use i-net PDFC, extract the downloaded zip file “PDFC.zip” into a folder of your choice. i-net PDFC Content Comparer can be used in various ways. You can execute it on the command line, e.g. using a .bat or .sh file. Alternatively, you can use the API to write your own Java program that runs as a standalone application, as a JUnit test, or as part of a continuous integration system such as Jenkins. See the section Programming (API) for more information about the API and for Java code samples.
i-net PDFC is platform-independent: it runs on every platform that provides a Java VM version 5 or higher (e.g. Windows, Linux, Unix, Solaris, Mac OS X 10.4+, …).
To compare PDF files, you can execute i-net PDFC on the command line. For a manual start, the i-net PDFC package contains the start files runPDFC.bat for Windows and runPDFC.sh for Unix/Linux.
Usage:
runPDFC [-c <config file>] [-[i][o]] [<Folder1> <Folder2> | <File1> <File2>]
| Parameter | Description |
|---|---|
| -c | Specifies the configuration file (config.xml) for i-net PDFC. If none is specified, the default “config.xml” will be used. |
| -i | Creates difference images in <Folder1>/differences for any differences found (recommended for a graphical comparison) |
| -x | Creates XOR difference images, may be combined with -i. If this parameter is specified, the difference image contains a third image in which the images of both sides are laid over one another, so that small differences can be seen more clearly. |
| -o | Creates images for each page of each version (need only be used for debug purposes) |
| -r | Sets the root folder for difference images. If this parameter is specified, i-net PDFC creates a subdirectory for each pair of PDF files beneath the root directory. In this subdirectory it stores the page images, and it stores the difference images in the directory “differences”. |
Note that if using two folders, the PDF files must have the same names in each folder.
It is also possible to configure the parameter values using the file config.xml (see Configuration).
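Because a folder comparison relies on matching file names, it can help to check both folders before starting a run. The following is a minimal sketch of such a check. The class `PdfFolderCheck` and its methods are hypothetical helpers and not part of i-net PDFC. The sketch assumes that only files directly inside each folder count and that names are compared case-sensitively.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper, not part of i-net PDFC.
final class PdfFolderCheck {

    /** Returns the names of all *.pdf files located directly inside the folder. */
    static Set<String> pdfNames(Path folder) {
        Set<String> names = new TreeSet<>();
        try (Stream<Path> entries = Files.list(folder)) {
            entries.filter(Files::isRegularFile)
                   .map(p -> p.getFileName().toString())
                   .filter(n -> n.toLowerCase().endsWith(".pdf"))
                   .forEach(names::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Returns the names that occur in exactly one of the two sets. */
    static Set<String> unmatched(Set<String> first, Set<String> second) {
        Set<String> result = new TreeSet<>(first);
        result.addAll(second);              // union of both sets
        Set<String> common = new TreeSet<>(first);
        common.retainAll(second);           // names present in both folders
        result.removeAll(common);           // what remains has no partner
        return result;
    }
}
```

Calling `PdfFolderCheck.unmatched(PdfFolderCheck.pdfNames(folder1), PdfFolderCheck.pdfNames(folder2))` returns an empty set when every PDF file has a counterpart of the same name.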
Example usage:
Assuming you extracted the file PDFC.zip into the directory “C:/PDFC” and that the sub-directories “folder1” and “folder2” contain PDF files with the same names, you can start the comparison with the following command:
You can do the same with single PDF files:
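If the start script is to be launched from a Java program, the argument list can be assembled according to the usage line. The following is a minimal sketch under stated assumptions: the class `PdfcCommand` is hypothetical, only the options `-c` and `-i` and the argument order are taken from the usage description, and the location of the start script inside the extracted directory is not confirmed by this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of i-net PDFC.
// Only the options -c and -i and the argument order follow the usage line.
final class PdfcCommand {

    /**
     * Builds: script [-c configFile] [-i] first second
     * configFile may be null, in which case -c is omitted.
     * first and second are either two folders or two PDF files.
     */
    static List<String> build(String script, String configFile,
                              boolean diffImages, String first, String second) {
        List<String> args = new ArrayList<>();
        args.add(script);
        if (configFile != null) {
            args.add("-c");
            args.add(configFile);
        }
        if (diffImages) {
            args.add("-i");
        }
        args.add(first);
        args.add(second);
        return args;
    }
}
```

The resulting list can be passed to `new ProcessBuilder(args).inheritIO().start()`. For the folder example, the arguments would be something like `build("C:/PDFC/runPDFC.bat", null, true, "C:/PDFC/folder1", "C:/PDFC/folder2")`, where the script path is an assumption. On Windows, a .bat file may need to be started through `cmd /c`.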
With the API you can call i-net PDFC from a Java program to compare multiple PDF files programmatically. You can find more information in the API documentation.
The following Java code samples show how to use the API of i-net PDFC:
You can integrate i-net PDFC with Jenkins (formerly called Hudson) to compare PDF files automatically.
The i-net PDFC installation package contains a sample Ant script and a Java sample that executes i-net PDFC as a JUnit test. Both can be found in the directory '<install-dir>/documentation/samples/junit'.
To integrate the i-net PDFC sample with Jenkins / Hudson, the following requirements must be fulfilled:
2.4.1 Jenkins / Hudson Integration sample
To integrate the provided sample into Jenkins / Hudson, perform the following steps.
The behaviour and the precision of i-net PDFC can be specified using configuration properties. These configuration properties are stored in the file config.xml.
The package “PDFC.zip” contains a config.xml with default values. You can change the default values by editing this file. If the file does not exist, i-net PDFC uses the default values.
With the following properties it is possible to configure the output of i-net PDFC and the logging.
| Property Name | Description |
|---|---|
| CREATE_DIFFIMAGES | Specifies if a PNG image with the marked difference will be created for each pair of pages that contains differences. The default value is: false. |
| CREATE_ORIGIMAGES | Specifies if a PNG image with the original content will be created for each compared page. The default value is: false. |
| LOG_LEVEL | Specifies the Logging Level. Available values: OFF, ERROR, WARN, INFO, ALL. The default value is: INFO. OFF switches the output completely off. ERROR logs error messages. WARN contains all the messages from ERROR-Level and furthermore, informs about the irregularities during the execution. INFO contains all the messages from WARN-Level and furthermore, describes settings and environment attributes. ALL is used to display the maximal information during the PDFC execution, e.g. for support requests. |
| MAX_ERRORS_PER_FILE | Specifies the maximum number of errors that can occur before the comparison will be canceled for the current PDF file. The default value is: 100. |
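The LOG_LEVEL values are cumulative: each level contains the messages of the levels before it. The following sketch models that ordering as a Java enum. It is an illustration only. The enum `LogLevel` and its method are assumptions and do not represent the i-net PDFC API.

```java
// Illustration only; this enum is not the i-net PDFC API.
enum LogLevel {
    OFF, ERROR, WARN, INFO, ALL;   // ordered from least to most verbose

    /** True if a message of the given level is written when this level is configured. */
    boolean includes(LogLevel message) {
        if (this == OFF || message == OFF) {
            return false;          // OFF switches the output off completely
        }
        return message.ordinal() <= this.ordinal();
    }
}
```

With the default INFO, for example, `LogLevel.INFO.includes(LogLevel.WARN)` is true, while `LogLevel.WARN.includes(LogLevel.INFO)` is false.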
Modules are various ways of comparing PDF files. Each module examines the differences between similar elements in the two files; if a difference is greater than the tolerance level, it is logged as a difference. The tolerance levels and the modules themselves can be configured in the file config.xml. If a module is not listed in the configuration entry MODULES, its compare operation is not executed.
| Property Name | Description |
|---|---|
| MODULES | Specifies a comma-separated list of modules that will be executed for each page. |
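To illustrate how a comma-separated MODULES value determines which comparisons run, the following sketch splits such a value into a set of module names. The class `ModuleList` is hypothetical, and whether i-net PDFC itself ignores whitespace around the names is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: one way a comma-separated MODULES value could be read.
final class ModuleList {

    /** Splits a comma-separated value into trimmed, non-empty module names. */
    static Set<String> parse(String value) {
        Set<String> modules = new LinkedHashSet<>();
        for (String part : value.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                modules.add(name);
            }
        }
        return modules;
    }
}
```

For the value `MODULE_PAGEPROPERTIES, MODULE_LINES`, the resulting set does not contain `MODULE_IMAGE`, so in this model the image comparison would be skipped.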
The following modules are available:
MODULE_PAGEPROPERTIES
This module compares page properties (page number, rotation, width, height and aspect ratio).
| Property Name | Description |
|---|---|
| TOLERANCE_PAGE_LEFTCORNER | Specifies the maximum number of pixels that the left or top margin of a page (i.e. the upper left corner of all elements) can differ before it is viewed as a difference. The default value is: 3. |
| TOLERANCE_PAGE_RATIO | Specifies the tolerance for the aspect ratio of the PDF page. The default value is: 0.1. |
| TOLERANCE_PAGE_SIZE | Specifies the maximum number of pixels that the width or height of a page can differ before it is viewed as a difference. The default value is: 2. |
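As an illustration of how such a tolerance works, the following sketch compares two page sizes against TOLERANCE_PAGE_SIZE. It is not the actual i-net PDFC implementation. The class name is hypothetical, and the sketch assumes that a deviation exactly equal to the tolerance is still accepted.

```java
// Illustration of a tolerance check; not the actual i-net PDFC implementation.
final class PageSizeCheck {

    /** Default of TOLERANCE_PAGE_SIZE according to the table above. */
    static final double TOLERANCE_PAGE_SIZE = 2.0;

    /** True if width or height differ by more than the tolerance. */
    static boolean differs(double width1, double height1,
                           double width2, double height2, double tolerance) {
        return Math.abs(width1 - width2) > tolerance
            || Math.abs(height1 - height2) > tolerance;
    }
}
```

With the default tolerance, pages of 595 x 842 and 596 x 841 would count as equal, whereas 595 x 842 and 612 x 792 would be reported as a difference.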
MODULE_IMAGE
This module compares the position, size and content of images.
| Property Name | Description |
|---|---|
| TOLERANCE_IMAGE_DISTANCE | Specifies the maximum number of pixels that the position of an image can differ before it is viewed as a difference. The default value is: 3. |
| TOLERANCE_IMAGE_PIXEL_VALUE | Specifies the maximum allowed discrepancy of pixel values (Double) before it is viewed as a difference. The range of this property is [0,1]. The default value is: 0.05. |
| TOLERANCE_IMAGE_SIZE | Specifies the maximum percentage by which the area spanned by an image may differ before it is viewed as a difference. The default value is: 0.1. |
| USE_PIXEL_MEDIUM_VALUE | This property of MODULE_IMAGE specifies whether i-net PDFC should compare the medium values instead of single-pixel values. The default value is: true. |
MODULE_LINES
This module compares the shape positions and their properties.
| Property Name | Description |
|---|---|
| TOLERANCE_BOX_ROUND_EDGES | Specifies the maximum number of pixels (1 pixel is approximately 0.353 mm) that a control point of a quadratic Bézier curve may differ in total before it is viewed as a difference. The default value is: 3. |
| TOLERANCE_LINE_POSITION | Specifies the maximum number of pixels that the position of a line or curve can differ per axis before it is viewed as a difference. The default value is: 3. |
| TOLERANCE_LINE_SIZE | Specifies the maximum number of pixels that the length of a line can differ in total before it is viewed as a difference. The default value is: 2. |
| TOLERANCE_LINE_STYLE | Specifies whether a different line dash pattern (the dashes and gaps used to stroke paths) is viewed as a difference. The default value is: false. |
| TOLERANCE_LINE_THICKNESS | Specifies the maximum difference in stroke thickness of two lines or curves (measured in pt) before it is viewed as a difference. The default value is: 1. |
| TOLERANCE_UNDERLINE_LENGTH | Specifies the maximum percentage by which the length of underlines may differ before it is viewed as a difference. The default value is: 0.1. |
MODULE_TEXT_WORDORDER
This module splits the PDF texts into words and compares these words. It can identify inserted or removed words within a line.
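To make the idea of a word-based comparison concrete, the following sketch splits two lines into words and reports removed and inserted words using a longest common subsequence (LCS). This is an assumption about one possible approach and is not taken from the i-net PDFC implementation. The class `WordDiff` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative word-level diff based on a longest common subsequence (LCS).
// One possible approach, not the algorithm used by i-net PDFC.
final class WordDiff {

    /** Returns entries such as "-old" (removed word) and "+new" (inserted word). */
    static List<String> diff(String line1, String line2) {
        String[] a = split(line1);
        String[] b = split(line2);

        // lcs[i][j] = length of the LCS of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both word lists and record everything outside the LCS.
        List<String> changes = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                changes.add("-" + a[i++]);   // word only in the first line
            } else {
                changes.add("+" + b[j++]);   // word only in the second line
            }
        }
        while (i < a.length) changes.add("-" + a[i++]);
        while (j < b.length) changes.add("+" + b[j++]);
        return changes;
    }

    /** Splits a line on whitespace; an empty line yields no words. */
    private static String[] split(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

Comparing "the quick brown fox" with "the brown lazy fox" reports `-quick` and `+lazy`, while the unchanged words are matched and produce no entry.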
Normalizers modify the content of the PDF files before the comparison starts. This simplifies the comparison by reducing it to the elements that are important.
| Property Name | Description |
|---|---|
| NORMALIZERS | Specifies a comma-separated list of normalizers that will be executed before and after each page. |
The following normalizers are available:
NORMALIZER_CLIP
This normalizer removes the elements lying outside of clip regions, i.e. elements that are not visible.
NORMALIZER_MARGIN
This normalizer changes the coordinates of elements to take into account the differences in margin values.
CHART_REMOVAL
This normalizer attempts to detect charts and removes them. This can be useful when comparing reports generated by Crystal Reports and i-net Clear Reports, since charts look entirely different in these two products.
| Property Name | Description |
|---|---|
| CHART_DENSITY_THRESHOLD | Density-of-shapes threshold for detecting a chart, where the density is (number of shapes)³ / area size. The value must be a Double. |
| CHART_REMOVAL_MARGIN | Percentage of the shape height to use as a margin for removing PDF elements above and below detected charts. The value must be a Double. |
| CHART_REMOVAL_MOD | Specifies the chart detector mode. Available values: CHART_REMOVAL_ALWAYS and CHART_REMOVAL_AUTO (default). |
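The density formula used for chart detection can be worked through with a small sketch. The class and method names are hypothetical. The unit of the area and the direction of the threshold comparison (strictly greater than) are assumptions, since this document does not specify them.

```java
// Illustration of the density formula; names and comparison direction are assumptions.
final class ChartDensity {

    /** density = (number of shapes)^3 / area size */
    static double density(int shapeCount, double area) {
        if (area <= 0) {
            throw new IllegalArgumentException("area must be positive");
        }
        return Math.pow(shapeCount, 3) / area;
    }

    /** True if the region is dense enough to be treated as a chart. */
    static boolean looksLikeChart(int shapeCount, double area, double threshold) {
        return density(shapeCount, area) > threshold;
    }
}
```

Because the shape count is cubed, doubling the number of shapes in the same area raises the density by a factor of eight: 10 shapes on an area of 1000 give a density of 1, and 20 shapes on the same area give 8.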
The tolerance values specify up to which amount differences between the PDF files are ignored. This makes it possible to ignore minor differences. In most cases you can leave the tolerance values at their defaults. The section Modules lists the available tolerance values for each module. They can be set in the configuration file config.xml.
This property specifies whether page images will be saved as temporary PNG files on the hard disk or in memory.
| Property Name | Description |
|---|---|
| CACHE_FOR_PAGE_IMAGES | Specifies the type of temporary storage for page images. These images can be used to show the original pages of the compared PDF files (CREATE_ORIGIMAGES) and/or the marked differences of the compared pages (CREATE_DIFFIMAGES). This option is effective only if at least one of the options CREATE_ORIGIMAGES and CREATE_DIFFIMAGES is set. If “MEMORY” or “HARD_DISK” is set, the page images are created during the comparison. If “NONE” is set, the page images are created in an additional run-through after the comparison is done. If the “HARD_DISK” cache is used, the images are saved as PNG files in subfolders of the folders containing the compared PDF files. |
If you have any questions or problems, please do not hesitate to contact pdfc@inetsoftware.de for technical support.
Copyright 2009 - 2011, i-net software GmbH. All rights reserved.