Regular expressions are a powerful tool for text processing (searching, modifying, parsing). In Java, starting with version 1.4, classes and methods for working with regular expressions are part of the standard library. In this tutorial, I will show you the basics of regular expressions and how to use them (with examples in Java).
A regular expression is a pattern that describes the common features of a set of strings: every string in the set matches the pattern. We write such a pattern using a special regular expression syntax. To begin with, here is a simple example: the simplest pattern is just a sequence of characters that have no special meaning (a sequence of literals).
Example 1
The regular expression oskar is a pattern that describes five consecutive characters: o, s, k, a and r. This pattern describes exactly one string: „oskar”.
In patterns, we can use special characters (so-called metacharacters) and syntactic constructions created with them.
The special characters include:
$, ^, ., *, +, ?, |, \, [ and ], ( and ), { and }
Remember that if we want to treat a special character as a literal, we precede it with a backslash \.
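The escaping rule above can be sketched as follows. This is a minimal illustration, not part of the original examples: the class name EscapeDemo and the sample texts are made up, and Pattern.quote is a standard java.util.regex method that escapes a whole literal for you.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if the text is exactly "1+1"; the plus sign is escaped by hand.
    static boolean isOnePlusOne(String text) {
        // In Java source the backslash must be doubled: "\\+" is the regex \+
        return Pattern.matches("1\\+1", text);
    }

    // True if the text equals the literal; Pattern.quote escapes all metacharacters.
    static boolean equalsLiteral(String literal, String text) {
        return Pattern.matches(Pattern.quote(literal), text);
    }

    public static void main(String[] args) {
        System.out.println(isOnePlusOne("1+1"));             // true
        System.out.println(Pattern.matches("1+1", "1+1"));   // false: unescaped + is a quantifier
        System.out.println(equalsLiteral("(a.b)", "(a.b)")); // true
        System.out.println(equalsLiteral("(a.b)", "(axb)")); // false: the period is literal here
    }
}
```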
With the help of special characters and the more complex syntax constructs they create, we describe things such as:
- occurrence of one of many characters – the corresponding syntax constructs are called character classes (e.g. letters or digits),
- beginning or end of a limited string of characters (e.g. a line or a word),
- repetitions – in the syntax of regular expressions described by the so-called quantifiers,
- logical combinations of regular expressions.
Example 2
The regular expression [0-4] is a pattern that describes one character, which can be any of the digits 0, 1, 2, 3, 4. This pattern describes all strings consisting of a single digit from 0 to 4.
Example 3
The regular expression d.*x (d, period, asterisk, x) describes any sequence of characters starting with the letter d and ending with the letter x. This pattern matches, for example, the following strings: „djx”, „dddddx”, „ddevx” .
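As a quick check, the three patterns from Examples 1 to 3 can be tried out with String.matches, which tests whether the whole string matches. This is only a sketch: the class name FirstPatternsDemo and the helper fullyMatches are made up for illustration.

```java
class FirstPatternsDemo {
    // Returns true when the whole text matches the regular expression.
    static boolean fullyMatches(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullyMatches("oskar", "oskar")); // true
        System.out.println(fullyMatches("[0-4]", "3"));     // true
        System.out.println(fullyMatches("[0-4]", "7"));     // false: 7 is outside the range
        System.out.println(fullyMatches("[0-4]", "12"));    // false: two characters, not one
        System.out.println(fullyMatches("d.*x", "ddevx"));  // true
        System.out.println(fullyMatches("d.*x", "dxy"));    // false: does not end with x
    }
}
```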
The use of regular expressions in programming
- checking if a given string matches the pattern given by the expression,
- checking if a given string contains a string matching the given pattern,
- replacing parts of the string that match the pattern with other strings,
- extracting parts of the string that are delimited by substrings matching the given pattern.
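The four uses listed above could look roughly like this in Java. The class RegexUsesDemo, its method names and the pattern [0-9]+ are assumptions chosen for the illustration; the calls themselves (matches, find, replaceAll, split) are standard methods of Matcher and Pattern.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class RegexUsesDemo {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // 1. Does the whole string match the pattern?
    static boolean isNumber(String s) { return DIGITS.matcher(s).matches(); }

    // 2. Does the string contain a substring matching the pattern?
    static boolean containsNumber(String s) { return DIGITS.matcher(s).find(); }

    // 3. Replace every matching part with another string.
    static String maskNumbers(String s) { return DIGITS.matcher(s).replaceAll("#"); }

    // 4. Extract the parts delimited by matches of the pattern.
    static String[] splitOnNumbers(String s) { return DIGITS.split(s); }

    public static void main(String[] args) {
        System.out.println(isNumber("2024"));                         // true
        System.out.println(containsNumber("room 101"));               // true
        System.out.println(maskNumbers("a1b22c"));                    // a#b#c
        System.out.println(Arrays.toString(splitOnNumbers("a1b22c"))); // [a, b, c]
    }
}
```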
Regular expression in Java
In Java, two classes from the java.util.regex package are used for this: Pattern and Matcher. Before a regular expression can be used to parse any string, it must be compiled. Pattern objects represent compiled regular expressions, and these objects are obtained using the static compile(…) methods of the Pattern class, with a regular expression as an argument. Matcher objects perform text searches by interpreting a compiled regular expression and matching it against the text or a portion of it.
The Matcher object is always associated with a given pattern. So we get it from the pattern-object using the matcher(…) method of the Pattern class, giving the search text as its argument. We can then perform various text search and replace operations by using different methods of the Matcher class.
The matches() method tries to match the entire given string against the pattern, while the find() method searches the input string for the next substring that matches. All match and search methods return a boolean value: true if a match was found, false otherwise.
A typical sequence of operations needed to apply regular expressions can be described schematically as follows:
- The text to be matched can be represented by an object of any class that implements the CharSequence interface (e.g. String, StringBuffer, CharBuffer from the java.nio package).
String myText = "oskarito-190";
- Create a regular expression as a string:
String regexp = "[0-9]";
- Compile the regular expression and get the compiled pattern:
Pattern pattern = Pattern.compile(regexp);
- Create a matcher object associated with the given expression, giving the text to match:
Matcher matcher = pattern.matcher(myText);
- Look for a match in the text according to the pattern (we choose one of the following):
boolean hasMatch = matcher.find();
boolean isMatching = matcher.matches();
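Put together, the steps above form a small program like the one below. It is a minimal sketch using the same text and pattern; the class name FirstMatchDemo and the two helper methods are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchDemo {
    // Index of the first digit in the text, or -1 if there is none.
    static int firstDigitIndex(String text) {
        Pattern pattern = Pattern.compile("[0-9]");
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    // True only if the whole text is exactly one digit.
    static boolean isSingleDigit(String text) {
        return Pattern.compile("[0-9]").matcher(text).matches();
    }

    public static void main(String[] args) {
        String myText = "oskarito-190";
        System.out.println(isSingleDigit(myText));   // false: the text is more than one digit
        System.out.println(firstDigitIndex(myText)); // 9: the digit 1 is at index 9
    }
}
```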
Before we go into a detailed discussion of individual constructs and uses of regular expressions, we will build a simple Java method that runs a pattern against a text and reports the results. Individual fragments of the code are briefly described with comments. If something is unclear to you, please ask in the comments.
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static String checkRegularExpression(String regex, String text) {
    try {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        StringBuilder findResult = new StringBuilder();
        // Try to match the whole text against the pattern
        boolean isMatching = matcher.matches();
        findResult.append("\nmatches(): All text").append(isMatching ? " " : " NOT ").append("matches the pattern.");
        // Restore the matcher to its initial position
        matcher.reset();
        // Find all matching substrings in the given text using a loop
        boolean found = matcher.find();
        if (!found) {
            findResult.append("\nfind(): There is no matching string");
        } else {
            do {
                findResult.append(String.format("\nfind(): Found substring: '%s' between chars in positions: %d and %d",
                        matcher.group(), matcher.start(), matcher.end()));
            } while (matcher.find());
        }
        return findResult.toString();
    } catch (Exception ex) {
        return String.format("Error has occurred %s", ex.getMessage());
    }
}
Literals
The literals used in the regular expression are matched sequentially.
Example X
public static void main(String... args) {
String regexp = "oskar";
String text = "this is oskar. oskar is here";
String result = checkRegularExpression(regexp, text);
System.out.println(result);
}
matches(): All text NOT matches the pattern.
find(): Found substring: 'oskar' between chars in positions: 8 and 13
find(): Found substring: 'oskar' between chars in positions: 15 and 20
The consecutive characters 'o', 's', 'k', 'a' and 'r' are searched for in the text. The matches() method returns true only if the whole text matches the pattern („oskar”), whereas the find() method finds every occurrence of „oskar” in the text.
This example shows the difference mentioned earlier between the matches() method (which tries to match the whole text) and the find() method (which searches the text sequentially for matching substrings).
Example X
Let’s see it again using the text „oooxooxooo” and the pattern „o*” (0 or more occurrences of the character 'o').
Of course, our input text is not a string of 0 or more o’s, so matches() rightly reports „All text NOT matches the pattern”. The find() method, however, applies the pattern while moving through the text from the beginning. It finds three 'o' characters (starting at position 0), which satisfies the pattern (we get a match), and then reaches the character 'x' at position 3. This is not an 'o', but the pattern also allows zero occurrences of 'o', so find() reports a zero-length match (which simply means that there is no 'o' at this position).
String regexp = "o*";
String text = "oooxooxooo";
String result = checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): Found substring: 'ooo' between chars in positions: 0 and 3
find(): Found substring: '' between chars in positions: 3 and 3
find(): Found substring: 'oo' between chars in positions: 4 and 6
find(): Found substring: '' between chars in positions: 6 and 6
find(): Found substring: 'ooo' between chars in positions: 7 and 10
find(): Found substring: '' between chars in positions: 10 and 10
Symbols with a special meaning in regular expression syntax cannot be used directly as literals; they have to be escaped. An unescaped metacharacter in the wrong place causes a syntax error when the expression is compiled. We can handle syntax errors (of various kinds) by catching PatternSyntaxException:
public static void main(String... args) {
String regexp = "(x";
String text = "this is text with brackets (x)";
String result = checkRegularExpression(regexp, text);
System.out.println(result);
}
Error has occurred Unclosed group near index 2
(x
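The helper above catches the general Exception type. A variant that catches PatternSyntaxException specifically might look like the sketch below; the class SyntaxErrorDemo and the method describeError are made up for illustration, while getDescription() and getIndex() are standard accessors of the exception.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Returns null when the regex compiles, otherwise a short error description.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException ex) {
            // The description and the approximate error position inside the regex
            return ex.getDescription() + " (index " + ex.getIndex() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("(x"));   // an error description for the unclosed group
        System.out.println(describeError("\\(x")); // null: the escaped bracket is a literal
    }
}
```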
You can also enter character codes and control characters literally by using the backslash (\). Regular expressions (including those in the form of literals) can be combined with each other. In fact, the regular expression „xyz” is a concatenation (a logical conjunction) of the expressions „x”, „y”, and „z”. We also have alternation (a logical alternative), introduced with the | character.
Example X
For example, if we want to match (or find) text that is „dj” or „man” or „you”, we can build the regular expression „dj|man|you”.
public static void main(String... args) {
String regexp = "dj|man|you";
String text = "This id good man - great dj.";
String result = checkRegularExpression(regexp, text);
System.out.println(result);
}
matches(): All text NOT matches the pattern.
find(): Found substring: 'man' between chars in positions: 13 and 16
find(): Found substring: 'dj' between chars in positions: 25 and 27
Character classes
By using square brackets, we can introduce the so-called character classes. A simple character class is a string of characters enclosed in square brackets, e.g.
[123abc]
The matcher will match any one of the characters listed in this pattern. It is in fact an abbreviation of 1|2|3|a|b|c.
If the first character in square brackets is ^, a match will be made for any character except those listed. This is a kind of negation of the character class. For example, [^xyz] matches any character except x, y, and z.
It is also possible to formulate ranges of characters (which makes writing much easier). When formulating ranges, we use the natural symbol -.
Sample patterns:
[0-9] – any digit,
[a-zA-Z] – any lowercase and uppercase letter of the English alphabet.
[a-zA-Z0-9] – any digit or letter of the English alphabet.
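Ranges and negation can be tried out with a couple of one-character checks. This is a sketch with made-up names (CharClassDemo, isHexDigit, isNotVowel); the patterns are ordinary character classes of the kind described above.

```java
class CharClassDemo {
    // One character that is a hexadecimal digit: three ranges in one class.
    static boolean isHexDigit(String s) {
        return s.matches("[0-9a-fA-F]");
    }

    // One character that is anything except a lowercase vowel (negated class).
    static boolean isNotVowel(String s) {
        return s.matches("[^aeiou]");
    }

    public static void main(String[] args) {
        System.out.println(isHexDigit("c"));  // true
        System.out.println(isHexDigit("g"));  // false: g is outside a-f
        System.out.println(isNotVowel("x"));  // true
        System.out.println(isNotVowel("a"));  // false
        System.out.println(isNotVowel("xy")); // false: a class stands for exactly one character
    }
}
```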
Example X
The word Oskar should be followed by a space, then one of the digits 1, 2, 3, 7, 8, 9, then any digit, a slash, and any character except the digits 0, 1, 2, 3 and lowercase letters of the English alphabet. A character class specifies one character that belongs (or does not belong) to the set given in square brackets. The order of the characters in the set is not important, but each range must be in ascending order.
public static void main(String... args) {
String regexp = "Oskar [1-37-9][0-9]/[^0-3a-z]";
String text = "Oskar 11/9";
String result = checkRegularExpression(regexp, text);
System.out.println(result);
}
matches(): All text matches the pattern.
find(): Found substring: 'Oskar 11/9' between chars in positions: 0 and 10
Example X
public static void main(String... args) {
String regexp = "Oskar [1-37-9][0-9]/[^0-3a-z]";
String text = "Oskar 51/9";
String result = checkRegularExpression(regexp, text);
System.out.println(result);
}
matches(): All text NOT matches the pattern.
find(): There is no matching string
Predefined classes make it easier to write ranges:
Type | Description |
. | Any character (depending on the pattern compilation option, it may or may not match the end of line character) |
\d | Digit: [0-9] |
\D | Non-digit: [^0-9] |
\s | Whitespace character: [ \t\n\x0B\f\r] |
\S | Any character except whitespace: [^\s] |
\w | A word character, i.e. one of: [a-zA-Z_0-9] |
\W | A character that is not a word character: [^\w] |
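A short sketch of the predefined classes \s and \w in use. The class PredefinedClassDemo and its method names are made up for illustration; note that the backslashes are doubled because the patterns are written as Java string literals.

```java
class PredefinedClassDemo {
    // Replaces every run of whitespace (spaces, tabs, line breaks) with one space.
    static String normalizeSpaces(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // True if the text consists only of word characters: letters, digits, underscore.
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(normalizeSpaces("a \t b\n\nc")); // a b c
        System.out.println(isWord("user_42"));             // true: underscore belongs to \w
        System.out.println(isWord("user-42"));             // false: hyphen does not
    }
}
```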
Example X
Pattern that matches texts that consist of two arbitrary digits, followed by two arbitrary characters, and three non-numeric characters.
public static void main(String... args) {
String regexp = "\\d\\d..\\D\\D\\D";
String text = "12##aa$";
String result = checkRegularExpression(regexp, text);
System.out.println(result);
}
matches(): All text matches the pattern.
find(): Found substring: '12##aa$' between chars in positions: 0 and 7
Character classes (which, after all, are treated as sets) can also be combined in an even more flexible way. The following operations are available for this:
Union of classes (sets)
Obtained by nesting additional square brackets. The union of classes contains every character that belongs to any of these classes. For example, [x-z[4-7]] matches the characters x, y, z, 4, 5, 6, 7.
Intersection of classes (sets)
Obtained with the operator &&. The intersection of classes contains the characters that appear in both classes. For example, the pattern [1-9&&3-7] matches the digits 3, 4, 5, 6, 7, which are common to both of the ranges listed. We could also write the same pattern as [1-9&&[3-7]]. Note that the pattern must not contain spaces: inside square brackets a space is an ordinary character and would become part of the class.
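Both operations can be checked with single-character tests. This is a sketch; the class ClassSetDemo and its method names are made up, while the two patterns are written in the nested-bracket form without spaces.

```java
class ClassSetDemo {
    // Union: x, y, z or one of the digits 4-7.
    static boolean inUnion(String s) {
        return s.matches("[x-z[4-7]]");
    }

    // Intersection: digits that are both in 1-9 and in 3-7.
    static boolean inIntersection(String s) {
        return s.matches("[1-9&&[3-7]]");
    }

    public static void main(String[] args) {
        System.out.println(inUnion("y"));        // true
        System.out.println(inUnion("5"));        // true
        System.out.println(inUnion("a"));        // false
        System.out.println(inIntersection("4")); // true
        System.out.println(inIntersection("8")); // false: 8 is in 1-9 but not in 3-7
    }
}
```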
Difference of classes (sets)
Obtained by combining && with negation. The difference is a class that contains all the characters of one class except the characters of the other class. To achieve this effect, the && operator and the negation operator ^ must be combined. For example, the pattern [a-z&&[^mno]] specifies any letter from a to z except the letters m, n, and o.
Example X
String regexp = "[a-z&&[^mno]]";
String text = "x";
String result = checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: 'x' between chars in positions: 0 and 1
Example X
String regexp = "[a-z&&[^mno]]";
String text = "n";
checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): There is no matching string
Note that metacharacters used inside a character class (in square brackets) lose their special meaning and are treated literally, with the exception of the backslash, the symbols ^ and -, and the square brackets themselves. In particular, a period inside square brackets is treated literally (outside of them it stands for any character). You can see this in the code fragment below:
Example X
String regexp = "[({]x[})]";
String text = "{x}";
checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: '{x}' between chars in positions: 0 and 3
The escape symbol (backslash) lets you indicate that a square bracket inside a class should be treated literally. In a Java string literal the backslash itself has to be doubled:
String regexp = "[\\[({]x[)}\\]]";
String text = "[x]";
checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: '[x]' between chars in positions: 0 and 3
Quantifiers
Regular expressions would be completely useless if they could not match repeated sequences of characters. Quantifiers are used to specify repetitions. The symbols of the quantifiers are as follows:
- ? – occurrence once or not at all
- * – occurrence zero or more times
- + – occurrence one or more times
- {n} – occurring exactly n times
- {n,} – occurring at least n times
- {n,m} – occurrence at least n but not more than m times
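Each quantifier from the list above can be checked with a one-line test. This is only a sketch; the class QuantifierDemo, the helper fullyMatches and the sample strings are made up for illustration.

```java
class QuantifierDemo {
    // Returns true when the whole text matches the regular expression.
    static boolean fullyMatches(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullyMatches("colou?r", "color"));  // true: u occurs zero times
        System.out.println(fullyMatches("colou?r", "colour")); // true: u occurs once
        System.out.println(fullyMatches("a*", ""));            // true: zero occurrences allowed
        System.out.println(fullyMatches("a+", ""));            // false: at least one a required
        System.out.println(fullyMatches("a{3}", "aaa"));       // true: exactly three
        System.out.println(fullyMatches("a{3,}", "aa"));       // false: fewer than three
        System.out.println(fullyMatches("a{2,3}", "aaaa"));    // false: more than three
    }
}
```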
When a quantifier follows a literal, it applies to that literal: the literal must occur the number of times given by the quantifier (which, in particular, may be 0).
Example X
„12a+” means 1, then 2, then the character 'a’ one or more times.
Note: „12a+” does not mean that 12a will occur one or more times!
When a quantifier follows a character class, it applies to any character in that class.
Example X
[abc]+ means the occurrence of the character a or the character b or the character c one or more times.
This pattern matches, for example, the following texts: „abc”, „bcaabc”, „aaaaaaaaaa”.
If, however, we want the quantifier to apply to any regular expression, one of the following syntactic constructs must be used (such structures create the so-called groups):
- (X)quantifier_symbol – a capturing group: the parentheses are also used to remember (capture) the text that matches the pattern given in the parentheses,
- (?:X)quantifier_symbol – a non-capturing group: this form is for grouping only, without remembering the matched text.
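The difference between the two kinds of groups shows up in Matcher.group(int) and Matcher.groupCount(). The sketch below uses made-up names (GroupDemo, lastCapturedWord, numberOfGroups); when a capturing group is repeated by a quantifier, it keeps the text of its last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Text remembered by group 1 after a full match, or null if there is no match.
    static String lastCapturedWord(String text) {
        Matcher matcher = Pattern.compile("(dj|singer|song)+").matcher(text);
        return matcher.matches() ? matcher.group(1) : null;
    }

    // Number of capturing groups in the pattern.
    static int numberOfGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(lastCapturedWord("djsongsinger"));          // singer (last repetition)
        System.out.println(numberOfGroups("(dj|singer|song)+"));       // 1
        System.out.println(numberOfGroups("(?:dj|singer|song)+"));     // 0: grouping only
    }
}
```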
Example X
Pattern describes one or more occurrences of any of the words dj, singer, song (not separated by spaces).
String regexp = "(dj|singer|song)+";
String text = "djsongsinger";
checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: 'djsongsinger' between chars in positions: 0 and 12
Example X
The pattern describes a text that consists either of a single word dj, singer or song (without a trailing space), or of any number of these words separated by at least one and no more than three spaces.
So, the text begins with one of the words dj, singer, song, followed by zero or more repetitions (the * quantifier) of the text described by the expression in the second pair of parentheses. That expression states that the text must start with one, two, or three spaces (the quantifier {1,3}), followed by one of the words dj, singer, song.
String regexp = "(dj|singer|song)( {1,3}(dj|singer|song))*";
String text = "dj dj song song";
checkRegularExpression(regexp, text);
String regexp2 = "(dj|singer|song)( {1,3}(dj|singer|song))*";
String text2 = "dj  dj   dj  dj";
checkRegularExpression(regexp2, text2);
matches(): All text matches the pattern.
find(): Found substring: 'dj dj song song' between chars in positions: 0 and 15
matches(): All text matches the pattern.
find(): Found substring: 'dj  dj   dj  dj' between chars in positions: 0 and 15
Example X
Now let’s pay special attention to the use of parentheses. They are used not only to apply quantifiers, but also to change the order in which a regular expression is interpreted. If the first alternative dj|singer|song were not enclosed in parentheses, the expression dj|singer|song( {1,3}(dj|singer|song))* would be interpreted as:
either dj, or singer, or song followed by 0 or more repetitions of the words dj, singer, song, each preceded by one to three spaces. So we would get results different from those expected. The result would be correct only if a text containing more than one word started with „song”.
String regexp = "dj|singer|song( {1,3}(dj|singer|song))*";
String text = "dj singer";
checkRegularExpression(regexp, text);
String regexp2 = "dj|singer|song( {1,3}(dj|singer|song))*";
String text2 = "song dj song";
checkRegularExpression(regexp2, text2);
matches(): All text NOT matches the pattern.
find(): Found substring: 'dj' between chars in positions: 0 and 2
find(): Found substring: 'singer' between chars in positions: 3 and 9
matches(): All text matches the pattern.
find(): Found substring: 'song dj song' between chars in positions: 0 and 12
The quantifiers described so far are the so-called greedy quantifiers. A greedy quantifier first consumes as much of the input text as it can and then tries to match the rest of the pattern. If that fails, the matcher backs off character by character until the rest of the pattern matches or no match is possible.
There are also two other kinds of quantifiers in Java: reluctant (written by appending ? to the quantifier, e.g. *?) and possessive (written by appending +, e.g. *+). Reluctant quantifiers, unlike greedy ones, start at the beginning of the input text and take character by character looking for a match. An attempt to match the entire remaining text is made only at the very end.
Possessive quantifiers, like greedy ones, consume as much text as they can, but when the rest of the pattern then fails to match, the matcher, unlike with greedy quantifiers, does not back off character by character.
Example X
The example shows the difference in how the three types of quantifiers work.
// greedy quantifier
String regexp = ".*dj";
String text = "This is dj and techno is playing by this dj";
String result = checkRegularExpression(regexp, text);
// reluctant quantifier
regexp = ".*?dj";
text = "This is dj and techno is playing by this dj";
result = checkRegularExpression(regexp, text);
// possessive quantifier
regexp = ".*+dj";
text = "This is dj and techno is playing by this dj";
result = checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: 'This is dj and techno is playing by this dj' between chars in positions: 0 and 43
matches(): All text matches the pattern.
find(): Found substring: 'This is dj' between chars in positions: 0 and 10
find(): Found substring: ' and techno is playing by this dj' between chars in positions: 10 and 43
matches(): All text NOT matches the pattern.
find(): There is no matching string
The greedy quantifier (first case) consumed the entire text, then the matcher backed off until it found a match (the „dj” at the end). At that point the matcher was finished, as you can clearly see from the result of the find() method.
The reluctant quantifier (second case) started from the beginning of the text, so the find() method was able to find two substrings matching the pattern.
The possessive quantifier (third case) consumed the entire text, and nothing was left to match the „dj” in the pattern. This quantifier does not allow the matcher to back off, so we got a mismatch.
Which type of quantifier to choose depends on the context.
Boundary matchers (anchors)
If we are interested in matching the pattern in a specific place of the text, we use the so-called boundary matchers otherwise known as „anchors”. In particular, we may want the match to occur at the beginning or end of a line, or at a word boundary. The symbols for marking the boundaries are as follows:
- ^ – beginning of line
- $ – end of line
- \b – word boundary
- \B – not a word boundary
- \A – beginning of input
- \G – end of the previous match
- \Z – end of input, except for the final line terminator (if any)
- \z – end of input
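The less common anchors \A, \Z and \z can be compared on a text that ends with a line break. This is a sketch; the class AnchorDemo, the helper found and the sample texts are made up for illustration.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True if the text contains a substring matching the pattern.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // The text ends with a line terminator after "dj".
        System.out.println(found("dj\\Z", "techno dj\n")); // true: \Z ignores the final terminator
        System.out.println(found("dj\\z", "techno dj\n")); // false: \z is the very end of input
        System.out.println(found("\\Adj", "dj techno"));   // true: dj starts the input
        System.out.println(found("\\Adj", "techno dj"));   // false
    }
}
```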
Example X
The example matches lines that start with one or more digits. Because the regex is written inside a Java String literal, the backslashes in constructs such as \d and \w have to be escaped (\\d, \\w). Note the difference: the pattern „\d+.*” (without specifying that the digits are at the beginning of the line) allows find() to match part of the string „DJ 567”:
String regexp = "^\\d+.*";
String text = "6789";
String result = checkRegularExpression(regexp, text);
regexp = "^\\d+.*";
text = "DJ 567";
result = checkRegularExpression(regexp, text);
regexp = "\\d+.*";
text = "DJ 567";
result = checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: '6789' between chars in positions: 0 and 4
matches(): All text NOT matches the pattern.
find(): There is no matching string
matches(): All text NOT matches the pattern.
find(): Found substring: '567' between chars in positions: 3 and 6
Example X
Now we will check if the line contains an integer at the very end (and extract it in a group).
String regexp = ".+?(\\d+)$";
String text = "this is dj 33";
String result = checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: 'this is dj 33' between chars in positions: 0 and 13
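The helper method prints only the whole match, so the captured number itself is not visible in the output. A minimal sketch of reading it from group 1 is shown below; the class TrailingNumberDemo and the method trailingNumber are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingNumberDemo {
    // Returns the integer at the very end of the line, or null if there is none.
    static Integer trailingNumber(String line) {
        Matcher matcher = Pattern.compile(".+?(\\d+)$").matcher(line);
        // group(1) is the text captured by the first pair of parentheses
        return matcher.matches() ? Integer.valueOf(matcher.group(1)) : null;
    }

    public static void main(String[] args) {
        System.out.println(trailingNumber("this is dj 33")); // 33
        System.out.println(trailingNumber("this is dj"));    // null
    }
}
```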
Example X
Now let’s try to isolate words. If we write the pattern „prod”, every occurrence of this string will be found, regardless of whether it is a whole word or part of a word.
String regexp = "prod";
String text = "prod producer production product unproductive";
String result = checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): Found substring: 'prod' between chars in positions: 0 and 4
find(): Found substring: 'prod' between chars in positions: 5 and 9
find(): Found substring: 'prod' between chars in positions: 14 and 18
find(): Found substring: 'prod' between chars in positions: 25 and 29
find(): Found substring: 'prod' between chars in positions: 35 and 39
Example X
If we want to find only the whole „prod” words we will use the symbols \b.
String regexp = "\\bprod\\b";
String text = "prod producer production product prod unproductive";
String result = checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): Found substring: 'prod' between chars in positions: 0 and 4
find(): Found substring: 'prod' between chars in positions: 33 and 37
Example X
If we want to find the string „prod” at the beginning of a word (whether it is the whole word or only its beginning), we would write:
String regexp = "\\bprod";
String text = "prod unprod producer production product prod unproductive";
String result = checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): Found substring: 'prod' between chars in positions: 0 and 4
find(): Found substring: 'prod' between chars in positions: 12 and 16
find(): Found substring: 'prod' between chars in positions: 21 and 25
find(): Found substring: 'prod' between chars in positions: 32 and 36
find(): Found substring: 'prod' between chars in positions: 40 and 44
Example X
If we want to find „prod” only as the beginning of longer words (but not as a whole word), we can write:
String regexp = "\\bprod\\B";
String text = "prod unprod producer production product prod unproductive";
String result = checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): Found substring: 'prod' between chars in positions: 12 and 16
find(): Found substring: 'prod' between chars in positions: 21 and 25
find(): Found substring: 'prod' between chars in positions: 32 and 36
Flags
The way a regular expression is interpreted can be modified using so-called flags. Flags can be given in two ways:
- when compiling the expression – as the second argument of the compile(…) method from the Pattern class,
- directly in the regular expression using the appropriate symbols.
Available flags:
Static constant of the Pattern class | Equivalent in regular expression | Meaning |
Pattern.CANON_EQ | none | Enables canonical equivalence: characters are compared by their full canonical decomposition (reduces efficiency) |
Pattern.CASE_INSENSITIVE | (?i) | Compares letters without regard to case |
Pattern.COMMENTS | (?x) | Permits whitespace and comments in the expression |
Pattern.MULTILINE | (?m) | Allows ^ and $ to match at the beginning and end of each line (separated by a line break), not only of the whole input |
Pattern.DOTALL | (?s) | Allows the metacharacter „.” to match line breaks as well |
Pattern.UNICODE_CASE | (?u) | Used together with CASE_INSENSITIVE, makes case-insensitive comparison follow Unicode rules |
Pattern.UNIX_LINES | (?d) | Only \n is recognized as a line separator by ., ^ and $ |
When compiling the expression, we give the flags as a bitwise OR of the constants listed above. They then apply to the entire expression.
String regex = "....";
int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
Pattern pattern = Pattern.compile(regex, flags);
Alternatively, we can supply flags directly in the regular expression – then they will work from the moment they appear in the regular expression.
Example X
Search for „Dj” (case sensitive) followed by „plays techno”, which can be lowercase, uppercase, or mixed-case.
String regexp = "Dj (?i)plays techno";
String text = "Dj plays techno";
String result = checkRegularExpression(regexp, text);
text = "Dj PLAYS TECHNO";
result = checkRegularExpression(regexp, text);
text = "DJ PLAYS TECHNO";
result = checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: 'Dj plays techno' between chars in positions: 0 and 15
matches(): All text matches the pattern.
find(): Found substring: 'Dj PLAYS TECHNO' between chars in positions: 0 and 15
matches(): All text NOT matches the pattern.
find(): There is no matching string
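The two ways of giving flags described in this section can be compared side by side. The sketch below uses a made-up pattern, dj.techno, and made-up names (FlagsDemo and its methods); both variants should behave the same, while the pattern without flags rejects the same text.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Flags given as the second argument of compile(...).
    static boolean withConstants(String text) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
        return Pattern.compile("dj.techno", flags).matcher(text).matches();
    }

    // The same flags embedded at the start of the expression.
    static boolean withEmbeddedFlags(String text) {
        return Pattern.compile("(?i)(?s)dj.techno").matcher(text).matches();
    }

    // No flags: case matters and the period does not match a line break.
    static boolean withoutFlags(String text) {
        return Pattern.compile("dj.techno").matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "DJ\nTECHNO";
        System.out.println(withConstants(text));     // true
        System.out.println(withEmbeddedFlags(text)); // true
        System.out.println(withoutFlags(text));      // false
    }
}
```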
Methods of the String class related to regular expressions
As a simplification, for ad hoc use, the String class provides methods that correspond to some of the discussed methods of the Matcher and Pattern classes.
- matches(String regex) – checks if this whole string matches the regex pattern
- replaceAll(String regex, String replacement) – replaces every substring of this string that matches regex with the given replacement
- replaceFirst(String regex, String replacement) – replaces the first substring of this string that matches regex with the given replacement
- split(String regex) – splits this string around separators that are substrings matching the pattern
- split(String regex, int limit) – splits this string around separators that are substrings matching the pattern, but applies the pattern no more than (limit-1) times
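The String methods listed above can be sketched as follows. The class StringRegexDemo, its two helper methods and the sample strings are made up for illustration.

```java
import java.util.Arrays;

class StringRegexDemo {
    // Splits a list whose items are separated by a comma or semicolon and optional spaces.
    static String[] splitItems(String list) {
        return list.split("[,;]\\s*");
    }

    // Splits only at the first comma: limit 2 means the pattern is applied at most once.
    static String[] splitHeadAndRest(String list) {
        return list.split(",", 2);
    }

    public static void main(String[] args) {
        String text = "a1b22c";
        System.out.println("2024-01-15".matches("\\d{4}-\\d{2}-\\d{2}"));    // true
        System.out.println(text.replaceAll("\\d+", "#"));                     // a#b#c
        System.out.println(text.replaceFirst("\\d+", "#"));                   // a#b22c
        System.out.println(Arrays.toString(splitItems("dj, singer;song")));   // [dj, singer, song]
        System.out.println(Arrays.toString(splitHeadAndRest("a,b,c,d")));     // [a, b,c,d]
    }
}
```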
A practical example
To finish this article, let’s consider a more extensive practical example that uses virtually all of the regular expression tools discussed so far and shows how to support them with additional programming means. Along the way, we will once again draw attention to some important issues related to the use of regular expressions.
We want to save to an output file all paragraphs (marked with <p1>, <p2> tags etc.) from some HTML file, creating a kind of table of contents. Choosing the right pattern seems trivial. A paragraph is the text between the opening <pn> tag and the closing </pn> tag, where n is the paragraph number. An obvious pattern could be:
<p[1-9]>(.+)</p[1-9]>
Example X
String regexp = "<p[1-9]>(.+)</p[1-9]>";
String text = "<p1>This is techno 1</p1>\n" +
"<p2>This is techno 2</p2>\n" +
"<p3>This is techno 3</p3>";
String result = checkRegularExpression(regexp, text);
matches(): All text NOT matches the pattern.
find(): Found substring: '<p1>This is techno 1</p1>' between chars in positions: 0 and 25
find(): Found substring: '<p2>This is techno 2</p2>' between chars in positions: 26 and 51
find(): Found substring: '<p3>This is techno 3</p3>' between chars in positions: 52 and 77
In the above case there is no problem, because each paragraph is placed on a separate line. However, in HTML files paragraphs can be broken into multiple lines. In that case, find() will not find a match for a paragraph that spans several lines, and we have to deal with it. Why is it like that? Because the metacharacter „.” by default matches any character except a line break. So we should apply, either when compiling the pattern or in the expression itself, the DOTALL flag, (?s), which makes the metacharacter match line breaks as well. While we are at it, let’s also take into account that HTML tags can be written in both upper and lower case (so let’s add the CASE_INSENSITIVE flag, (?i)).
The new pattern would look like this (we will introduce the flags in the expression itself):
(?i)(?s)<p[1-9]>(.+)</p[1-9]>
Unfortunately, it won’t work this time either. It will find only one large text starting with the first <p1> tag and ending with the last closing tag in the text. The greedy quantifier „.+” consumed all the characters up to the end of the input text (which is possible because the period now also matches end-of-line characters). Then the matcher, backing off, found the last closing tag in the text, and that satisfied the pattern for the find() method, which stopped at this point. So we should use the reluctant quantifier.
(?i)(?s)<p[1-9]>(.+?)</p[1-9]>
Example X
String regexp = "(?i)(?s)<p[1-9]>(.+?)</p[1-9]>";
String text = "<p1>This is \n" +
"techno 1</p1>\n" +
"<p2>This is techno 2</p2>\n" +
"<p3>This is techno 3</p3>";
String result = checkRegularExpression(regexp, text);
matches(): All text matches the pattern.
find(): Found substring: '<p1>This is
techno 1</p1>' between chars in positions: 0 and 26
find(): Found substring: '<p2>This is techno 2</p2>' between chars in positions: 27 and 52
find(): Found substring: '<p3>This is techno 3</p3>' between chars in positions: 53 and 78
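To get from the matches to the paragraph contents, we can read group 1 of every match found. The sketch below only collects the contents into a list; the class ParagraphExtractor and the method paragraphs are made up for illustration, and writing the list to an output file is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphExtractor {
    // The final pattern from the example: case-insensitive, DOTALL, reluctant quantifier.
    private static final Pattern PARAGRAPH =
            Pattern.compile("(?i)(?s)<p[1-9]>(.+?)</p[1-9]>");

    // Returns the text between each opening and closing paragraph tag.
    static List<String> paragraphs(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1)); // group 1 is the content captured by (.+?)
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<p1>This is \n" +
                "techno 1</p1>\n" +
                "<p2>This is techno 2</p2>\n" +
                "<p3>This is techno 3</p3>";
        for (String paragraph : paragraphs(text)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```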
Thank you for making it to the end of the article. If you have any comments or want to share your opinion about my work, write everything in the comments! See you next time! It’s time to code ! 🙂