i-net PDF Content Comparer v1.10

1. Introduction

i-net PDFC is a tool for comparing two PDF files (or the PDF files contained in two folders) for content differences. It is useful for comparing the output of two different PDF generator programs, e.g. OpenOffice and Adobe PDF Writer, or different versions of one program, e.g. two different versions of i-net Clear Reports.

As the name suggests, the PDF content comparer does not perform a pixel-based or structural comparison. Instead, it checks the documents element by element for differences.

The following elements are compared and any differences logged:

  • Text differences (letters or words missing)
  • Line/Arc/Box differences (lines or boxes missing or with different styles)
  • Image differences (images missing or pixel values differ)
  • Margin differences (page margins different)

Each of these differences has a configurable tolerance value so that minor differences can be ignored if necessary. For example, it might not matter whether a line or box is misplaced by a couple of pixels, as long as the line is there and looks as intended.
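The principle of a tolerance value can be sketched in a few lines of Java. This is only an illustration, not the actual implementation of i-net PDFC: the class and method names are hypothetical, and treating a difference exactly equal to the tolerance as "still within tolerance" is an assumption.

```java
// Illustrative sketch only: not part of the i-net PDFC API.
class ToleranceSketch {

    /**
     * Returns true if two measurements differ by no more than the tolerance.
     * Counting the boundary value as "within tolerance" is an assumption.
     */
    static boolean withinTolerance(double first, double second, double tolerance) {
        return Math.abs(first - second) <= tolerance;
    }

    public static void main(String[] args) {
        // A line shifted by 2 pixels with a tolerance of 3 would be ignored.
        System.out.println(withinTolerance(100.0, 102.0, 3.0));
        // A shift of 5 pixels exceeds the tolerance and would be logged.
        System.out.println(withinTolerance(100.0, 105.0, 3.0));
    }
}
```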

2. Using i-net PDFC

To use i-net PDFC, extract the downloaded zip file “PDFC.zip” into a folder of your choice. i-net PDFC can be used in various ways. You can execute it on a command line, e.g. using a .bat or .sh file. As an alternative, you can use the API to write your own Java program that can be executed as a standalone application, run as a JUnit test or integrated into a continuous integration system like Jenkins. See the section Programming (API) for more information about the API and Java code samples.

2.1 System Requirements

i-net PDFC is platform-independent: it runs on every platform that supports Java 5 or higher (e.g. Windows, Linux, Unix, Solaris, Mac OS X 10.4+, …). A Java VM version 5 (or higher) is required to run i-net PDFC.

2.2 Command Line

To compare PDF files you can execute i-net PDFC on a command line. For a manual start, the i-net PDFC package contains the start files runPDFC.bat for Windows and runPDFC.sh for Unix/Linux.

Usage:

runPDFC [-c <config file>] [-[i][o]] [<Folder1> <Folder2> | <File1> <File2>]
Parameter  Description
-c         Specifies the configuration file (config.xml) for i-net PDFC. If none is specified, the default “config.xml” is used.
-i         Creates difference images in <Folder1>/differences for any differences found (recommended for a graphical comparison).
-x         Creates XOR difference images; may be combined with -i. If this parameter is specified, the difference image contains a third image in which the images of both sides are laid over one another, so that small differences can be seen more clearly.
-o         Creates images for each page of each version (only needed for debugging purposes).
-r         Sets the root folder for difference images. If this parameter is specified, i-net PDFC creates a subdirectory for each pair of PDF files beneath the root directory. In this subdirectory it stores the page images and the difference images in the directory “differences”.

Note that if using two folders, the PDF files must have the same names in each folder.
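Before starting a folder comparison it can help to check which files actually have a counterpart of the same name. The following Java sketch lists those files. It is not part of i-net PDFC; the class and method names are hypothetical, and matching the “.pdf” extension case-insensitively is an assumption.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: not part of the i-net PDFC API.
class FolderPairingSketch {

    /** Returns the sorted names of PDF files that exist in both folders. */
    static List<String> commonPdfNames(File folder1, File folder2) {
        List<String> result = new ArrayList<String>();
        String[] names = folder1.list();
        if (names == null) {
            return result; // folder1 does not exist or is not a directory
        }
        Arrays.sort(names);
        for (String name : names) {
            // Assumption: the extension is matched case-insensitively.
            boolean isPdf = name.toLowerCase().endsWith(".pdf");
            if (isPdf && new File(folder2, name).isFile()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java FolderPairingSketch <Folder1> <Folder2>");
            return;
        }
        for (String name : commonPdfNames(new File(args[0]), new File(args[1]))) {
            System.out.println("Present in both folders: " + name);
        }
    }
}
```

Files that appear in only one of the folders are not listed by this sketch; how i-net PDFC itself reports such files is not covered here.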

It is also possible to configure the parameter values using the file config.xml (see Configuration).

Example usage:

Assuming you extracted the file PDFC.zip into the directory “C:/PDFC” and that the sub-directories “folder1” and “folder2” contain PDF files with the same names, you can start the comparison with the following commands:

  • runPDFC.bat folder1 folder2
    This compares all PDF files in the folder “folder1” with the PDF files of the same name in the folder “folder2” and prints any differences found to the console.
  • runPDFC.bat -i folder1 folder2
    In addition to the console output, this creates PNG images in the subfolder “folder1/differences” in which any differences found are marked in color.
  • runPDFC.bat -i -o folder1 folder2
    This creates additional PNG images for each page of the compared PDF files in the subfolder “folder1/differences/<pdf-file-name>” with the original page content.
  • runPDFC.bat -i -o folder1 folder2 > pdfc.log
    Additionally, you can write the console output to a file. With the property LOG_LEVEL you can disable the log output or configure the amount of log output.

You can do the same with single PDF files:

  • runPDFC.bat -i D:/my-PDF-files/file1.pdf D:/my-PDF-files/file2.pdf > pdfc.log
    This will compare both PDF files and create the difference images in the folder “D:/my-PDF-files/differences”.
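If you want to start the command line from your own Java program without using the API, the standard class java.lang.ProcessBuilder can be used. The following is a minimal sketch under several assumptions: the class and method names are hypothetical, redirectOutput requires Java 7 or later for the calling program, and the meaning of the exit code of runPDFC is not documented here. On some systems the start script may have to be launched through a shell instead.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: it calls the documented start script, not the Java API.
class RunPdfcSketch {

    /** Builds the command line "<startScript> [-i] <first> <second>". */
    static List<String> buildCommand(String startScript, boolean diffImages,
                                     String first, String second) {
        List<String> command = new ArrayList<String>();
        command.add(startScript);
        if (diffImages) {
            command.add("-i"); // create difference images
        }
        command.add(first);
        command.add(second);
        return command;
    }

    /** Starts the command, writes its console output to logFile and waits for it. */
    static int run(List<String> command, File logFile)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge error output into standard output
        builder.redirectOutput(logFile);   // comparable to "> pdfc.log" on the shell
        return builder.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: java RunPdfcSketch <start script> <first> <second>");
            return;
        }
        List<String> command = buildCommand(args[0], true, args[1], args[2]);
        int exitCode = run(command, new File("pdfc.log"));
        System.out.println("Start script finished with exit code " + exitCode);
    }
}
```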

2.3 Programming (API)

With the API you can call i-net PDFC from a Java program to compare multiple PDF files programmatically. You can find more information in the API documentation.

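Java code samples that show how to use the API of i-net PDFC are included in the installation package, for example the JUnit sample in the directory '<install-dir>/documentation/samples/junit'.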

2.4 Jenkins / Hudson Integration

You can integrate i-net PDFC with Jenkins (formerly called Hudson) to compare PDF files automatically.

The i-net PDFC installation package contains a sample Ant script and a Java sample that executes i-net PDFC as a JUnit test. These can be found in the directory '<install-dir>/documentation/samples/junit'.

To integrate the i-net PDFC sample with Jenkins / Hudson, the following requirements must be fulfilled:

  1. Jenkins / Hudson must be installed and configured, so that Ant scripts can be executed
  2. the CompareTwoFoldersAsUnitTest.java file in the directory '<install-dir>/documentation/samples/junit' must be compiled and packaged into a Java archive (.jar)
  3. the Ant script from the sample requires a directory containing all archives from i-net PDFC, a JUnit archive and the previously created sample archive.

2.4.1 Jenkins / Hudson Integration sample

To integrate the provided sample into Jenkins / Hudson, perform the following steps:

  • create a free-style software job in Jenkins / Hudson
  • under the 'Advanced Project Options' select 'Use custom workspace' and enter the directory containing the Ant build script
  • select 'Invoke Ant' under the 'Build' section of the job
  • open the 'Advanced…' options of the Ant build and add the following 3 properties
    • source_dir=<path to the source pdf files>
    • reference_dir=<path to the reference pdf files>
    • libraries_dir=<path to the directory containing all required libraries>
  • under the 'Post-build Actions' select 'Publish JUnit test result report' and enter 'junit-reports/*.xml'
  • save the configuration and run the new job

3. Configuration

The behaviour and the precision of i-net PDFC can be specified using the configuration properties. These configuration properties of i-net PDFC are included in the file config.xml.

The package “PDFC.zip” contains the config.xml with default values. You can change the default values by editing this file. If the file does not exist, i-net PDFC uses the default values.

3.1 Logging and Results

With the following properties it is possible to configure the output of i-net PDFC and the logging.

Property Name        Description
CREATE_DIFFIMAGES    Specifies whether a PNG image with the marked differences is created for each pair of pages that contains differences. The default value is: false.
CREATE_ORIGIMAGES    Specifies whether a PNG image with the original content is created for each compared page. The default value is: false.
LOG_LEVEL            Specifies the logging level. Available values: OFF, ERROR, WARN, INFO, ALL. The default value is: INFO.
                       OFF switches the output off completely.
                       ERROR logs error messages.
                       WARN contains all messages of the ERROR level and also reports irregularities during the execution.
                       INFO contains all messages of the WARN level and also describes settings and environment attributes.
                       ALL displays the maximum amount of information during the PDFC execution, e.g. for support requests.
MAX_ERRORS_PER_FILE  Specifies the maximum number of errors that can occur before the comparison is canceled for the current PDF file. The default value is: 100.
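The cumulative nature of the log levels can be modeled as an ordered enumeration. The following Java sketch is only an illustration of the ordering described above; the enum and its method are hypothetical and not part of the i-net PDFC API.

```java
// Illustrative sketch only: this enum is not part of the i-net PDFC API.
enum LogLevelSketch {
    // Declared from least to most verbose, in the order listed above.
    OFF, ERROR, WARN, INFO, ALL;

    /** Returns true if a message of the given level is written at this setting. */
    boolean logs(LogLevelSketch messageLevel) {
        if (this == OFF || messageLevel == OFF) {
            return false; // OFF switches the output off completely
        }
        // Each level contains all messages of the less verbose levels.
        return messageLevel.ordinal() <= this.ordinal();
    }

    public static void main(String[] args) {
        LogLevelSketch configured = LogLevelSketch.valueOf("INFO"); // the default value
        for (LogLevelSketch level : values()) {
            System.out.println(level + " written at " + configured + ": " + configured.logs(level));
        }
    }
}
```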

3.2 Modules

Modules are various ways of comparing PDF files. Each module examines the differences between similar elements in two files; if a difference is greater than the tolerance level, it is logged as a difference. The tolerance levels and the modules themselves can be configured in the file config.xml. If a module is not listed in the configuration entry MODULES, its compare operation is not executed.

Property Name  Description
MODULES        Specifies a comma-separated list of modules that will be executed for each page.
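How a comma-separated module list translates into "this comparison runs, that one does not" can be sketched as follows. The sketch only works on the value of the MODULES entry as a string; it does not read config.xml, the class and method names are hypothetical, and tolerating spaces around the commas is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only: not part of the i-net PDFC API.
class ModuleListSketch {

    /** Splits a comma-separated MODULES value into module names, keeping their order. */
    static Set<String> parseModules(String value) {
        Set<String> modules = new LinkedHashSet<String>();
        if (value == null) {
            return modules;
        }
        for (String part : value.split(",")) {
            String name = part.trim(); // assumption: surrounding spaces are ignored
            if (name.length() > 0) {
                modules.add(name);
            }
        }
        return modules;
    }

    public static void main(String[] args) {
        Set<String> modules = parseModules("MODULE_PAGEPROPERTIES,MODULE_IMAGE,MODULE_LINES");
        // MODULE_TEXT_WORDORDER is not listed, so its comparison would be skipped.
        System.out.println("Text comparison enabled: " + modules.contains("MODULE_TEXT_WORDORDER"));
    }
}
```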

The following modules are available:

MODULE_PAGEPROPERTIES

This module compares page properties (page number, rotation, width, height and aspect ratio).

Property Name              Description
TOLERANCE_PAGE_LEFTCORNER  Specifies the maximum number of pixels that the left or top margin of a page (that is, the upper left corner of all elements) can differ before it is viewed as a difference. The default value is: 3.
TOLERANCE_PAGE_RATIO       Specifies the tolerance for the aspect ratio of the PDF page. The default value is: 0.1.
TOLERANCE_PAGE_SIZE        Specifies the maximum number of pixels that the width or height of a page can differ before it is viewed as a difference. The default value is: 2.

MODULE_IMAGE

This module compares the position, size and content of images.

Property Name                Description
TOLERANCE_IMAGE_DISTANCE     Specifies the maximum number of pixels that the position of an image can differ before it is viewed as a difference. The default value is: 3.
TOLERANCE_IMAGE_PIXEL_VALUE  Specifies the maximum allowed discrepancy of pixel values (Double) before it is viewed as a difference. The range of this property is [0,1]. The default value is: 0.05.
TOLERANCE_IMAGE_SIZE         Specifies the maximum difference in percent that the area spanned by an image may differ before it is viewed as a difference. The default value is: 0.1.
USE_PIXEL_MEDIUM_VALUE       Specifies whether i-net PDFC compares the medium values instead of single-pixel values. The default value is: true.
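The difference between comparing single-pixel values and comparing a medium value can be illustrated with a small Java sketch. It rests on assumptions that this document does not confirm: "medium value" is read here as the arithmetic mean over all pixels, pixel values are taken as already normalized to the range [0,1], and the class and method names are hypothetical.

```java
// Illustrative sketch only: not the image comparison used by i-net PDFC.
class PixelToleranceSketch {

    /** Compares every pixel pair individually against the tolerance. */
    static boolean singlePixelsMatch(double[] first, double[] second, double tolerance) {
        if (first.length != second.length) {
            return false;
        }
        for (int i = 0; i < first.length; i++) {
            if (Math.abs(first[i] - second[i]) > tolerance) {
                return false;
            }
        }
        return true;
    }

    /** Arithmetic mean of the pixel values; 0 for an empty array. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return values.length == 0 ? 0.0 : sum / values.length;
    }

    /** Assumption: the "medium value" comparison compares the two means. */
    static boolean meansMatch(double[] first, double[] second, double tolerance) {
        return Math.abs(mean(first) - mean(second)) <= tolerance;
    }

    public static void main(String[] args) {
        double[] first = {0.0, 1.0};
        double[] second = {1.0, 0.0};
        // Same mean, but every single pixel differs.
        System.out.println("Means match: " + meansMatch(first, second, 0.05));
        System.out.println("Single pixels match: " + singlePixelsMatch(first, second, 0.05));
    }
}
```

Under this reading, a mean-based comparison is more forgiving of local noise but can miss images whose pixels are merely rearranged, as the example in main shows.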

MODULE_LINES

This module compares the shape positions and their properties.

Property Name               Description
TOLERANCE_BOX_ROUND_EDGES   Specifies the maximum number of pixels (1 pixel is approximately 0.353 mm) that a control point of a quadratic Bézier curve may differ in total before it is viewed as a difference. The default value is: 3.
TOLERANCE_LINE_POSITION     Specifies the maximum number of pixels that the position of a line or curve can differ per axis before it is viewed as a difference. The default value is: 3.
TOLERANCE_LINE_SIZE         Specifies the maximum number of pixels that the length of a line can differ in total before it is viewed as a difference. The default value is: 2.
TOLERANCE_LINE_STYLE        Specifies whether a different line dash pattern (the dashes and gaps used to stroke paths) is viewed as a difference. The default value is: false.
TOLERANCE_LINE_THICKNESS    Specifies the maximum difference in stroke thickness of two lines or curves (measured in pt) before it is viewed as a difference. The default value is: 1.
TOLERANCE_UNDERLINE_LENGTH  Specifies the maximum difference in percent by which the length of underlines may differ before it is viewed as a difference. The default value is: 0.1.

MODULE_TEXT_WORDORDER

This module splits the PDF texts into words and compares these words. It can identify inserted or removed words within a line.

3.3 Normalizers

Normalizers modify the content of a PDF file before the comparison is started. This simplifies the comparison by reducing it to the elements that are important.

Property Name  Description
NORMALIZERS    Specifies a comma-separated list of normalizers that will be executed before and after each page.

The following normalizers are available:

NORMALIZER_CLIP

This normalizer removes the elements lying outside of clip regions, i.e. elements which are not visible.

NORMALIZER_MARGIN

This normalizer changes the coordinates of elements to take into account the differences in margin values.

CHART_REMOVAL

This normalizer attempts to detect charts and removes them. This can be useful when comparing reports generated by Crystal Reports and i-net Clear Reports, since charts look entirely different in these two products.

Property Name            Description
CHART_DENSITY_THRESHOLD  Specifies the density-of-shapes threshold for detecting a chart: ((number of shapes)³ / area size). The value must be a Double.
CHART_REMOVAL_MARGIN     Specifies the percentage of the shape height that is used as margin for removing PDF elements above and below detected charts. The value must be a Double.
CHART_REMOVAL_MOD        Specifies the chart detector mode. Available values: CHART_REMOVAL_ALWAYS and CHART_REMOVAL_AUTO. The default value is: CHART_REMOVAL_AUTO.
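The density formula given for CHART_DENSITY_THRESHOLD can be written out in Java as follows. Only the formula itself comes from the description above. That a region counts as a chart when its density is greater than the threshold is an assumption, as are the hypothetical class and method names and the unit of the area.

```java
// Illustrative sketch only: not the chart detector used by i-net PDFC.
class ChartDensitySketch {

    /** Density of shapes as described above: (number of shapes)^3 / area size. */
    static double density(int shapeCount, double area) {
        if (area <= 0) {
            throw new IllegalArgumentException("area must be positive");
        }
        return ((double) shapeCount * shapeCount * shapeCount) / area;
    }

    /** Assumption: a density above the threshold indicates a chart. */
    static boolean looksLikeChart(int shapeCount, double area, double threshold) {
        return density(shapeCount, area) > threshold;
    }

    public static void main(String[] args) {
        // The threshold 0.5 is an arbitrary example value, not a documented default.
        System.out.println(looksLikeChart(10, 1000.0, 0.5));
        System.out.println(looksLikeChart(2, 1000.0, 0.5));
    }
}
```

Because the shape count enters the formula cubed, a region with many small shapes reaches the threshold much faster than a large region with only a few shapes.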

3.4 Tolerance Values

The tolerance values specify up to which value differences between the PDF files are ignored. This makes it possible to ignore minor differences. In most cases you can leave the tolerance values at their defaults. The section Modules lists the available tolerance values for each module. They can be set in the configuration file config.xml.

3.5 Page Image Cache

This property specifies whether page images will be saved as temporary PNG files on the hard disk or in memory.

Property Name          Description
CACHE_FOR_PAGE_IMAGES  Specifies the type of temporary storage for page images. These images can be used to show the original pages of the compared PDF files (CREATE_ORIGIMAGES) and/or to show the marked differences of the compared pages (CREATE_DIFFIMAGES). This option is effective only if at least one of the options CREATE_ORIGIMAGES and CREATE_DIFFIMAGES is set. If “MEMORY” or “HARD_DISK” is set, the page images are created during the comparison. If “NONE” is set, the page images are created in an additional pass after the comparison is done. If the “HARD_DISK” cache is used, the images are saved as PNG files in subfolders of the folders containing the compared PDF files.

4. Limitations

  • PDF documents with restricted access permissions are only partly supported. PDF documents with such restrictions are encrypted. The i-net PDFC parser can read PDF files that are encrypted with the standard algorithm based on RC4. If the revision is not 3, or if AES encryption was used instead of RC4, such documents cannot be read.
  • i-net PDFC does not read annotations, interactive forms and signature fields. These elements are ignored by the parser. If these or other features that are not yet implemented are important for you, please contact our support team and send us two PDF files that should be compared with i-net PDFC.

5. Changes

v1.10 (November 21, 2011)

  • API enhancements: Classes in package “diffimage” added.
  • Command line parameters -x and -r added.
  • Bug: IllegalArgumentException occurred during the file comparison.
  • Bug: Endless loop occurred during the file comparison.
  • The height and width of the text difference mark box were not calculated correctly for large font sizes.

v1.09 (July 29, 2011)

  • Differences in several PDF/A documents were not found.
  • Equal text passages marked as different.
  • NumberFormatException: For input string: “xxx” occurred.

v1.08 (June 10, 2011)

  • Bug in ASCIIHexDecode was fixed. Some texts encoded with this encoding could not be decoded.
  • Text scaling was corrected in order to calculate the proper height of text-difference marker.
  • The international format for date-time is now used for error messages.
  • The batch file runPDFC.bat (or shell script runPDFC.sh) can now be launched from any directory.
  • The log level “OFF” produces no output.
  • The log level list is reduced. The following log levels can be used: OFF, ERROR, WARN, INFO, ALL.
  • Differences on single PDF pages were not found because of an error while reading the PDF comments in the file.

v1.07 (May 09, 2011)

  • API revised
    • Properties “USE_HD_CACHE_FOR_PAGE_IMAGES” and “USE_MEMORY_CACHE_FOR_PAGE_IMAGES” replaced with “CACHE_FOR_PAGE_IMAGES”.
  • API documentation added
  • Java code samples added
  • Documentation enhanced
  • The default encryption of PDF files is taken into account for additional elements such as functions, color spaces and hint tables.
  • Differences in PDF files were not found, because some characters from CFF fonts were not shown.
  • Character codes using MacRomanEncoding could not be compared.
  • Non-embedded CID-fonts could not be used to build the page difference images.
  • Image comparison has been improved through a corrected image cache.
  • Problems with PDF/A files occurred. If the key-length for default encryption is not set, it will be defined on the basis of security handler version.
  • Java 5 supported.

v1.06 (Feb 23, 2011)

  • Rectangles in difference images sometimes appeared a few pixels below the point at which they should appear.

v1.05 (Jan 26, 2011)

  • No differences were found between PDF files because of an error while reading fonts.
  • ClassCastException occurred during the comparison of PDF files.
  • Comparing PDF files created with FastReports is now supported.
  • The width array length during string processing is now limited by the string length to avoid IndexOutOfBoundsExceptions.

v1.04 (Dec 09, 2010)

  • Using hard disk or memory cache is now possible.
  • Improved image comparison: the pixel values of images are now compared too.

v1.03 (Nov 30, 2010)

  • Fixed a problem regarding the reading of compressed objects and xref streams.
  • Fixed a bug with default width for CID-fonts type 0.
  • Fixed a bug with ToUnicode font map ranges for CID-fonts.

v1.02 (Jul 08, 2010)

  • Fixed a bug with inlined DCT-encoded images in a PDF.
  • Fixed a bug regarding Unicode special characters not being read correctly.
  • Fixed a problem regarding the reading of embedded True Type text.

v1.01 (Mar 23, 2010)

  • Improved chart detection and comparison.
  • Fixed problems when identical shapes occurred more than once.

v1.0 (Feb 25, 2010)

  • Initial release.

6. Support

If you have any questions or problems, please do not hesitate to contact pdfc@inetsoftware.de for technical support.

Copyright 2009 - 2011, i-net software GmbH. All rights reserved.

 
