Regular expressions filter

This page describes how to write filters in regular expression syntax. i-net PDFC uses Java's regular expression implementation, so a full specification of the syntax can be found in the Java Pattern Documentation.

i-net PDFC specific implementation

i-net PDFC has some specific rules for pattern matching apart from the normal regular expression rules:

  • a word will be ignored in the comparison if at least one character of the word is matched by any pattern
    (you can however match a space at the end or beginning to match the whole word)
  • patterns are by default case sensitive
  • the . operator does not match a line break
  • if match groups are defined, only the content of the match groups will be removed
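
The rules on case sensitivity and the . operator are the defaults of Java's java.util.regex package, so they can be tried outside i-net PDFC. The following is a minimal sketch: the class DefaultRulesSketch and the helper found are made-up names for illustration, and the word-level and match-group rules are i-net PDFC behavior that plain Java does not reproduce.

```java
import java.util.regex.Pattern;

class DefaultRulesSketch {
    // Hypothetical helper: true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Case sensitive by default: "page" is not found in "Page 1".
        System.out.println(found("page", "Page 1")); // false
        // The . operator does not match a line break ...
        System.out.println(found("a.b", "a\nb"));    // false
        // ... but it does match any other character.
        System.out.println(found("a.b", "a-b"));     // true
    }
}
```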

Developing a pattern

To define a pattern from scratch, start with one example of the text sequence that the filter should match. Imagine you want to exclude a “last modified” date from the comparison. Your first step might be to use a concrete date as the filter pattern:

24/12/2013

But that would only match one date. To allow other dates to match as well, replace all parts which may have more than one value with a more generic expression. In our example, these are the day, month and year numbers. Replace them by a generic number expression:

\\d+/\\d+/\\d+

The expression \\d matches a single digit. The + requires one or more of them, so \\d+ matches any run of digits. But wait: this would also match sequences that are not dates at all, such as 1234/521/5122. To avoid accidental matches, use precise expressions wherever possible. As a simple improvement, we limit the number of digits:

\\d{1,2}/\\d{1,2}/\\d{2,4}
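
The effect of limiting the digit counts can be checked with plain java.util.regex. This is only a sketch: DatePatternSketch, LOOSE, LIMITED and matchesWhole are illustrative names, and Pattern.matches tests the whole input, which is an assumption made here to keep the comparison simple.

```java
import java.util.regex.Pattern;

class DatePatternSketch {
    // Any run of digits in each position.
    static final String LOOSE = "\\d+/\\d+/\\d+";
    // One or two digits for day and month, two to four for the year.
    static final String LIMITED = "\\d{1,2}/\\d{1,2}/\\d{2,4}";

    // Hypothetical helper: true if the regex matches the entire text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(LOOSE, "24/12/2013"));      // true
        System.out.println(matchesWhole(LOOSE, "1234/521/5122"));   // true, an accidental match
        System.out.println(matchesWhole(LIMITED, "24/12/2013"));    // true
        System.out.println(matchesWhole(LIMITED, "1234/521/5122")); // false
    }
}
```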

There are many more ways to refine the pattern, but one is specific to i-net PDFC. It lets you define the context of the pattern precisely, so that only the “last modified” dates are removed:

Last Modified: (\\d{1,2}/\\d{1,2}/\\d{2,4})

Expressions in parentheses form a so-called match group. The filter still matches the whole expression, but only the content of the match groups is excluded from the comparison.
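
In plain Java, the difference between the whole match and the match group can be seen with Matcher.group. The sketch below uses a pattern with a one- or two-digit day and month. MatchGroupSketch and dateIn are hypothetical names, and the code only illustrates what the group captures, not how i-net PDFC applies the filter internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchGroupSketch {
    static final Pattern LAST_MODIFIED =
            Pattern.compile("Last Modified: (\\d{1,2}/\\d{1,2}/\\d{2,4})");

    // Hypothetical helper: content of group 1, or null if the pattern is not found.
    static String dateIn(String text) {
        Matcher m = LAST_MODIFIED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The prefix acts as context; only the date is captured.
        System.out.println(dateIn("Last Modified: 24/12/2013 by admin")); // 24/12/2013
        // Without the prefix the date is left alone.
        System.out.println(dateIn("Created: 24/12/2013"));                // null
    }
}
```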

Character matching types

The smallest unit in matching is a character matcher. Each of the following expressions matches a single character. For a complete list, please refer to the Java Pattern Documentation.

Single characters
x the character x
\\\\ the backslash character - keep in mind that a single \\ is an escape sequence in regular expressions
\\t the tab character (Unicode 0x0009)
\\n the newline / line feed character (Unicode 0x000A)
\\r the carriage-return character (Unicode 0x000D)
Character groups
[xyz] either x, y or z
[^abc] any character but a, b or c
[a-z0-9] a through z (lower case only) or 0 through 9
[a-z&&[^b-y]] subtraction of ranges - result is here only a or z
Predefined groups
. any character (except the line terminator unless explicitly specified)
\\d / \\D a digit / a NON-digit
\\s / \\S an ASCII whitespace character / a NON-whitespace character
\\w / \\W a word character ([a-zA-Z_0-9]) / a NON-word character
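
A few of these character matchers can be tried against single characters with java.util.regex. This is a small sketch with made-up names (CharacterClassSketch, matches); each call checks whether the expression matches the complete one-character input.

```java
import java.util.regex.Pattern;

class CharacterClassSketch {
    // Hypothetical helper: true if the regex matches the entire text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("[xyz]", "z"));         // true
        System.out.println(matches("[^abc]", "a"));        // false
        System.out.println(matches("[a-z&&[^b-y]]", "z")); // true, only a and z remain
        System.out.println(matches("[a-z&&[^b-y]]", "m")); // false
        System.out.println(matches("\\d", "7"));           // true
        System.out.println(matches("\\s", "\t"));          // true, a tab is whitespace
        System.out.println(matches("\\w", "_"));           // true, underscore is a word character
    }
}
```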

Word Matching

To match words, simply use the character matchers and extend them by a quantifier.

Quantifiers
X? character matcher X, once or not at all
X* character matcher X, zero or more times
X+ character matcher X, one or more times
X{n} character matcher X, exactly n times
X{n,} character matcher X, at least n times
X{n,m} character matcher X, at least n but not more than m times

As an example, the expression

\\d+

matches any sequence of one or more digits.
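
The quantifiers can be checked in the same way. The sketch below assumes whole-input matching via Pattern.matches; QuantifierSketch and matches are illustrative names only.

```java
import java.util.regex.Pattern;

class QuantifierSketch {
    // Hypothetical helper: true if the regex matches the entire text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d+", "12345"));   // true, one or more digits
        System.out.println(matches("\\d+", ""));        // false, + needs at least one
        System.out.println(matches("\\d*", ""));        // true, * allows zero
        System.out.println(matches("colou?r", "color")); // true, the u is optional
        System.out.println(matches("a{2}", "aaa"));     // false, exactly two required
        System.out.println(matches("a{2,}", "aaaa"));   // true, at least two
        System.out.println(matches("a{1,3}", "aaaa"));  // false, at most three
    }
}
```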

Greedy / Reluctant quantifier

A very common term to exclude any text between two known expressions is .*. But since the * operator is greedy, it matches the maximum number of characters, which is not the expected result in most cases.

To change this, simply add a ? to the quantifier. For example:

Pattern    Argument            First matched sequence
.*         12345               12345
.*?        12345               (empty string, so effectively no match at all!)
\\d+       12345               12345
\\d+?      12345               1
a.*z       a to z or z to a    a to z or z
a.*?z      a to z or z to a    a to z
a{1,3}     aaaaa               aaa
a{1,3}?    aaaaa               a

As a general guideline, the non-greedy quantifiers should be preferred.
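
The greedy and reluctant results can be reproduced with Matcher.find, which returns the first matched sequence. This is a sketch under the assumption that plain java.util.regex behaves like the filter here; GreedySketch and first are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedySketch {
    // Hypothetical helper: the first matched sequence, or null if nothing is found.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "a to z or z to a";
        System.out.println(first("a.*z", text));         // a to z or z
        System.out.println(first("a.*?z", text));        // a to z
        System.out.println(first("\\d+", "12345"));      // 12345
        System.out.println(first("\\d+?", "12345"));     // 1
        System.out.println(first("a{1,3}?", "aaaaa"));   // a
        // The reluctant .*? is satisfied by the empty string.
        System.out.println(first(".*?", "12345").isEmpty()); // true
    }
}
```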

Match groups

Regular expressions allow more complex patterns as well. Any term up to this point can be enclosed in parentheses and then used much like a single matcher. For example:

Pattern example which matches completely
(123)+ 123123123
(\\d{1,3},)*(\\d{1,3})(\\.\\d+)? 123,456,789.123
([0-9A-F]{2}-){5}[0-9A-F]{2} 0A-12-ED-32-9C-72

Note: If you use match groups, i-net PDFC will exclude only the content matched by these groups. Any part of the pattern outside these groups will be used as an anchor but not be removed. To avoid this, you can simply put an additional group around the whole pattern.
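
The sketch below checks grouped patterns with java.util.regex (with the decimal point escaped in the number pattern) and shows what an additional group around the whole pattern captures. The names MatchGroupsSketch, matches and group are hypothetical, and the "Page: " pattern is only an illustrative input. Which groups i-net PDFC then excludes is product behavior that this code does not model.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchGroupsSketch {
    // Hypothetical helper: true if the regex matches the entire text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // Hypothetical helper: content of group n of the first match, or null.
    static String group(String regex, String text, int n) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        System.out.println(matches("(123)+", "123123123")); // true
        System.out.println(matches("(\\d{1,3},)*(\\d{1,3})(\\.\\d+)?", "123,456,789.123")); // true
        System.out.println(matches("([0-9A-F]{2}-){5}[0-9A-F]{2}", "0A-12-ED-32-9C-72")); // true

        // Inner group only: the text "Page: " stays outside the group.
        System.out.println(group("Page: (\\d+)", "Page: 7 of 9", 1));   // 7
        // Additional outer group: group 1 now spans the whole pattern.
        System.out.println(group("(Page: (\\d+))", "Page: 7 of 9", 1)); // Page: 7
    }
}
```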

Alternative terms

In case a matcher needs to match a list of alternative terms, these terms can be defined using the | operator. For example:

Pattern example which matches completely
(a|b)* abba
(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) May (for instance)
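
Alternatives behave the same way in plain Java. A short sketch, with AlternativesSketch, MONTH and matches as made-up names:

```java
import java.util.regex.Pattern;

class AlternativesSketch {
    static final String MONTH = "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)";

    // Hypothetical helper: true if the regex matches the entire text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("(a|b)*", "abba")); // true
        System.out.println(matches("(a|b)*", "abca")); // false, c is not an alternative
        System.out.println(matches(MONTH, "May"));     // true
        System.out.println(matches(MONTH, "Mai"));     // false
    }
}
```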

Flags and Switches

Some operators don't match anything but can be used to change the behavior of the matcher.

Operator Effect
(?i) Switches to case insensitive matching
(?s) Switches the . operator to match line breaks as well, enabling matches that span multiple lines
(?: ... ) Defines a non-capturing group. i-net PDFC will not exclude matches of such groups unless there are no capturing groups in the pattern
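
The effect of these operators can be observed with java.util.regex. The sketch uses hypothetical names (FlagsSketch, found, capturingGroups) and an invented "Page|Seite" example. It shows only the Java side, namely that a non-capturing group is not counted as a group.

```java
import java.util.regex.Pattern;

class FlagsSketch {
    // Hypothetical helper: true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Hypothetical helper: number of capturing groups in the regex.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(found("page", "PAGE 1"));     // false
        System.out.println(found("(?i)page", "PAGE 1")); // true
        System.out.println(found("a.b", "a\nb"));        // false
        System.out.println(found("(?s)a.b", "a\nb"));    // true
        // (?: ... ) groups without capturing, so only (\d+) counts.
        System.out.println(capturingGroups("(?:Page|Seite): (\\d+)")); // 1
    }
}
```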

Standard patterns

Pattern Use case
Page: (\\d+) of (\\d+) ignore page numbers in n of m format
((19|20)\\d\\d([- /.])(0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01])) Valid dates in year-month-day format
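
Both standard patterns can be tried with java.util.regex before using them as filters. This is a sketch with illustrative names (StandardPatternsSketch, pageNumbers, isDate). Note that the date pattern checks the format only, so a date such as 2013-02-31 is still accepted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StandardPatternsSketch {
    static final Pattern PAGE = Pattern.compile("Page: (\\d+) of (\\d+)");
    static final Pattern DATE = Pattern.compile(
            "((19|20)\\d\\d([- /.])(0[1-9]|1[012])([- /.])(0[1-9]|[12][0-9]|3[01]))");

    // Hypothetical helper: {n, m} from "Page: n of m", or null if not found.
    static String[] pageNumbers(String text) {
        Matcher m = PAGE.matcher(text);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    // Hypothetical helper: true if the entire text has the year-month-day format.
    static boolean isDate(String text) {
        return DATE.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] pages = pageNumbers("Page: 3 of 10");
        System.out.println(pages[0] + " / " + pages[1]); // 3 / 10
        System.out.println(isDate("2013-12-24"));        // true
        System.out.println(isDate("2013-13-24"));        // false, there is no month 13
        System.out.println(isDate("2013-02-31"));        // true, format check only
    }
}
```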

 

© Copyright 1996 - 2017, i-net software; All Rights Reserved.