This page describes how to write filters in regular expression syntax. i-net PDFC uses Java's regular expression implementation, so a full specification of the syntax can be found in the Java Pattern Documentation.
i-net PDFC has some specific rules for pattern matching apart from the normal regular expression rules:
To define a pattern from scratch, simply start with one example of the text sequence which will be matched by the filter. Imagine you want to exclude a “last modified” date from the comparison. Your first step might be to start with a date as filter pattern:
But that would only match one date. To allow other dates to match as well, replace all parts which may have more than one value with a more generic expression. In our example, these are the day, month and year numbers. Replace them by a generic number expression:
\d matches any single digit. The + defines that there must be one or more digits. But wait - this would also match other numbers which are not dates at all, such as 1234/521/5122. To avoid accidental matches, use precise expressions wherever possible. As a simple improvement we choose to limit the number of digits:
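The effect of each refinement step can be checked directly against Java's regular expression engine. The following is a minimal sketch: the class name, the sample date 12/31/2020 and the slash-separated date format are assumptions made for illustration and are not taken from i-net PDFC.

```java
import java.util.regex.Pattern;

class DatePatternSteps {
    // Step 1: a literal date only matches itself (the sample value is an assumption).
    static final Pattern LITERAL = Pattern.compile("12/31/2020");
    // Step 2: generic numbers, one or more digits per part.
    static final Pattern GENERIC = Pattern.compile("\\d+/\\d+/\\d+");
    // Step 3: limited digit counts reject obvious non-dates.
    static final Pattern BOUNDED = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}");

    // True if the pattern matches the whole text.
    static boolean matchesFully(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12/31/2020", "1/2/2021", "1234/521/5122"}) {
            System.out.println(s
                    + " literal=" + matchesFully(LITERAL, s)
                    + " generic=" + matchesFully(GENERIC, s)
                    + " bounded=" + matchesFully(BOUNDED, s));
        }
    }
}
```

Note that the backslash is doubled only because the patterns are written as Java string literals; the regular expression itself contains a single backslash.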
There are many more ways to optimize the pattern, but one is special for i-net PDFC and helps you to precisely define the context of the pattern to only remove the last modified dates:
Last Modified: (
An expression in parentheses is a so-called match group. The filter will still match the whole expression, but only the content of the match groups will be excluded.
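In plain Java terms, the whole match and the content of a match group can be told apart as shown below. This sketch only illustrates the regular expression side: the class and method names, the sample text and the date format are assumptions, and how i-net PDFC applies the exclusion internally is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchGroupDemo {
    // "Last Modified: " is the context; only the parenthesized date is a match group.
    // The date part of the pattern is an assumed example.
    static final Pattern LAST_MODIFIED =
            Pattern.compile("Last Modified: (\\d{1,2}/\\d{1,2}/\\d{4})");

    // Returns the text captured by group 1, or null if the pattern is not found.
    static String groupContent(String text) {
        Matcher m = LAST_MODIFIED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Created: 1/2/2019, Last Modified: 12/31/2020";
        Matcher m = LAST_MODIFIED.matcher(text);
        if (m.find()) {
            System.out.println("whole match: " + m.group());
            System.out.println("match group: " + m.group(1));
        }
    }
}
```

The "Created" date is left alone because it lacks the required context, while the date after "Last Modified: " is the only text captured by the group.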
The smallest unit in matching is a character matcher. The following expressions each match a single character. For a complete list, please refer to the Java Pattern Documentation.
|x||the character x|
|\\||the backslash character - keep in mind that a single
|\t||the tab character (Unicode 0x0009)|
|\n||the newline / line feed character (Unicode 0x000A)|
|\r||the carriage-return character (Unicode 0x000D)|
|[xyz]||either x, y or z|
|[^xyz]||any character but x, y or z|
|[a-z0-9]||a through z (lower case only) or 0 through 9|
|[a-z&&[^b-y]]||subtraction of ranges - the result here is only a or z|
|.||any character (except line terminators, unless explicitly specified)|
|\d / \D||a digit / a NON-digit|
|\s / \S||an ASCII whitespace character / a NON-whitespace character|
|\w / \W||a word character ([a-zA-Z_0-9]) / a NON-word character|
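The descriptions in the table can be verified with Java's Pattern class. The following is a small sketch using standard Java syntax; the class and method names are chosen for illustration only.

```java
import java.util.regex.Pattern;

class CharacterMatcherDemo {
    // True if the regex matches the whole input (here: a single character).
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[xyz]", "y"));          // character set
        System.out.println(matches("[^xyz]", "y"));         // negated set
        System.out.println(matches("[a-z0-9]", "7"));       // ranges
        System.out.println(matches("[a-z&&[^b-y]]", "m"));  // range subtraction
        System.out.println(matches(".", "\n"));             // line terminator
        System.out.println(matches("\\\\", "\\"));          // literal backslash
        System.out.println(matches("\\w", "_"));            // word character
    }
}
```

In Java source code every backslash of the pattern has to be written twice, which is why matching a single literal backslash takes four backslashes in the string literal.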
To match words, simply use the character matchers and extend them by a quantifier.
|X?||character matcher X, once or not at all|
|X*||character matcher X, zero or more times|
|X+||character matcher X, one or more times|
|X{n}||character matcher X, exactly n times|
|X{n,}||character matcher X, at least n times|
|X{n,m}||character matcher X, at least n but not more than m times|
As an example, the expression \d+ matches any number with at least one digit.
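To see how a quantifier changes what a character matcher picks up, the matches can be collected from a sample text. This is a minimal sketch; the class name, the helper findAll and the sample text are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every non-overlapping match of the regex in the input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Page 12 of 345";
        System.out.println(findAll("\\w+", text));     // words: one or more word characters
        System.out.println(findAll("\\d+", text));     // numbers: one or more digits
        System.out.println(findAll("\\d{1,2}", text)); // at most two digits per match
    }
}
```

With the upper bound of two digits, the number 345 is split into two separate matches, which shows why the quantifier should fit the expected text as closely as possible.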
A very common term to exclude any text between two known expressions is .*. But since the * operator is greedy, this will match the maximum number of characters, which isn't the expected result in most cases.
To change this, simply add a ? to the quantifier. For example:
|Pattern||Argument||First matched sequence|
| ||12345||(no match at all!)|
|a.*z||a to z or z to a||a to z or z|
|a.*?z||a to z or z to a||a to z|
As a general guideline, the non-greedy quantifiers should be preferred.
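The difference between the greedy and the non-greedy quantifier can be reproduced with a few lines of Java. In this sketch the patterns a.*z and a.*?z are assumed examples that produce the matched sequences listed in the table; the class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the first sequence matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "a to z or z to a";
        System.out.println(firstMatch("a.*z", text));   // greedy: extends to the last z
        System.out.println(firstMatch("a.*?z", text));  // non-greedy: stops at the first z
    }
}
```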
Regular expressions allow more complex patterns as well. Any term up to this point can be nested in parentheses and then used much like a single matcher. For example:
|Pattern||example which matches completely|
Note: If you use match groups, i-net PDFC will exclude only the content matched by these groups. Any part of the pattern outside these groups will be used as an anchor but not be removed. To avoid this, you can simply put an additional group around the whole pattern.
In case a matcher needs to match a list of alternative terms, these terms can be defined using the | operator. For example:
|Pattern||example which matches completely|
| ||May (for instance)|
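One detail worth checking when using alternatives is how far the | operator reaches: without parentheses it splits the whole expression, not just the neighboring word. The sketch below demonstrates this with an assumed pattern and assumed sample texts; none of the names come from i-net PDFC.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // True if the regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Grouped: "Date: " followed by either Jan or Feb.
        System.out.println(matches("Date: (Jan|Feb)", "Date: Feb"));
        // Ungrouped: either "Date: Jan" or just "Feb".
        System.out.println(matches("Date: Jan|Feb", "Date: Feb"));
        System.out.println(matches("Date: Jan|Feb", "Feb"));
    }
}
```

Since parentheses create a match group, keep in mind that grouping the alternatives also affects which part of the match i-net PDFC excludes.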
Some operators don't match anything but can be used to change the behavior of the matcher.
|(?i)||Switches to case-insensitive matching|
|(?s)||Lets the . operator match line breaks as well, enabling matches that span multiple lines|
|(?:X)||Defines a non-capturing group. i-net PDFC will not exclude matches of such groups unless there are no capturing groups in the pattern|
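The behavior of these operators can be observed in plain Java. The following sketch uses the standard Java constructs (?i), (?s) and (?:X); the class name, helper methods and sample patterns are assumptions for illustration.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // True if the regex matches the whole input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Number of capturing groups defined in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // (?i): letter case is ignored.
        System.out.println(matches("(?i)last modified", "Last Modified"));
        // (?s): the dot also matches the line break between a and z.
        System.out.println(matches("a.*z", "a\nz"));
        System.out.println(matches("(?s)a.*z", "a\nz"));
        // (?:X): groups the alternatives without adding a capturing group.
        System.out.println(groupCount("(Page|Sheet): (\\d+)"));
        System.out.println(groupCount("(?:Page|Sheet): (\\d+)"));
    }
}
```

A non-capturing group is useful when alternatives need to be grouped but only another part of the pattern, here the number, should count as a match group.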
| Page: ||ignore page numbers in n of m format|
| ||Valid dates in year-month-day format|