Using Regular Expressions
To check whether a string contains a match, Perl uses the binding operator =~. The usual syntax is:
variable =~ m/regular_expression/
The result of this operator is either true (if there is a match) or false (no match). The following example
checks whether a string contains a phone number:
$str = "My phone number is 123-3445. This is my home phone.";
if( $str =~ /\d{3}-\d{4}/ ){
    print "There is a phone number in the string '$str'\n";
}
else{
    print "There is not a phone number in the string '$str'\n";
}
As you can see, the character m may be omitted when the delimiters are slashes (m stands for "match").
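With any other delimiter the m has to be written out. The following is a minimal sketch of that case, not part of the original example; the sub name has_usr_path and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains something like /usr/bin.
sub has_usr_path {
    my ($text) = @_;
    # With braces as delimiters the leading m is required,
    # but the slashes inside the pattern no longer need escaping.
    return $text =~ m{/usr/\w+} ? 1 : 0;
}

print has_usr_path("perl lives in /usr/bin") ? "match\n" : "no match\n";
print has_usr_path("nothing here")           ? "match\n" : "no match\n";
```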
Often we need to know not only whether there is a match, but also which substring matched
the pattern. Perl provides several special variables for that purpose:
- $& - contains the matching substring
- $` - contains the substring on the left of the match
- $' - contains the substring on the right of the match
Let's modify the previous example:
$str = "My phone number is 123-3445. This is my home phone.";
if( $str =~ /\d{3}-\d{4}/ ){
    print "There is a phone number in the string '$str'\n";
    print "The phone is: $&\n";
    print " Before: '$`'\n";
    print " After: '$''\n";
}
else{
    print "There is not a phone number in the string '$str'\n";
}
If we want to find all matches in a string, we can use the operator =~ in a loop, setting the
variable $str to the substring on the right of the match after each iteration (note that this overwrites the original string):
$str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
while( $str =~ /\d{3}-\d{4}/ ){
    print "The phone is: $&\n";
    $str = $';
}
Alternatively, we can add the global modifier g and use the operator =~ in list context.
On the right side of an assignment to an array, the operator =~ then returns the list of all matches:
$str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
@phones = $str =~ /\d{3}-\d{4}/g;
foreach $phone (@phones){
    print "$phone\n";
}
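A third variant, sketched here as an alternative to the two approaches above, leaves the string intact: in scalar context a pattern with the g modifier resumes searching where the previous match ended, so it can drive a while loop directly. The sub name find_phones is an assumption made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every NNN-NNNN substring of the text.
sub find_phones {
    my ($text) = @_;
    my @found;
    # In scalar context /g continues from the end of the previous match,
    # so each iteration sees the next phone number; $text is not modified.
    while ( $text =~ /(\d{3}-\d{4})/g ) {
        push @found, $1;
    }
    return @found;
}

my $str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
print "$_\n" for find_phones($str);
```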
In addition to the operator =~, Perl has the operator !~, which returns true if there
is no match and false otherwise.
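As a small sketch of !~ in use, the sub below reports strings that do not contain a phone number. The name lacks_phone and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains no NNN-NNNN phone number.
sub lacks_phone {
    my ($text) = @_;
    # !~ negates =~ : the result is true when the pattern does NOT match.
    return $text !~ /\d{3}-\d{4}/ ? 1 : 0;
}

for my $line ( "Call me at 123-3445.", "No number given." ) {
    print lacks_phone($line) ? "no phone: $line\n" : "has phone: $line\n";
}
```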
Getting information about a match
Besides verifying that a one-field date entry is in the desired format, you can also extract the components
of the match. To get a piece of information from inside the substring matching the pattern, we enclose
the corresponding part of the pattern in parentheses. Note that parentheses are special
symbols inside patterns and do not themselves match any character in the string. If a pattern includes one or more
sets of parentheses, Perl places the substring matched by each set in the special variables
$1, $2, $3, etc. (Perl does not stop at $9), numbered by the position of the opening parenthesis.
For example, if we are checking that a date was entered in either "mm/dd/yyyy" or
"mm-dd-yyyy" format and also need to know the values of the month, day, and year, we can use the following
regular expression:
$today = "1/24/2003";
if( $today =~ /\b(1[0-2]|0?[1-9])[\-\/](0?[1-9]|[12][0-9]|3[01])[\-\/]((19|20)\d{2})/ ){
    print "Date: $&\n";
    print "Month: $1\n";
    print "Day: $2\n";
    print "Year: $3\n";
    print "Century: $4\n";
}
Let's take a closer look at this expression:
- 1[0-2]|0?[1-9] for the month (either 1 followed by 0, 1, or 2, or a digit from 1 to 9 with an
optional leading 0). We put parentheses around this part of the regular expression to extract the month
from the matching string.
- 0?[1-9]|[12][0-9]|3[01] for the day (a digit from 1 to 9 with an optional leading 0,
or 1 or 2 followed by any digit, or 3 followed by 0 or 1).
This part is also inside parentheses because we need to know the day of the month as well.
- (19|20)\d{2} for the year (19 or 20 followed by exactly two digits).
This part of the pattern is inside parentheses (that gives us the year in $3) and also has its own
parenthesized subpart, which gives us the first two digits of the year (the "century") in $4.
Combining these three parts and allowing either a dash or a slash as the separator, we arrive at the
code above. Note that the pattern does not require the two separators to be the same, so "1/24-2003" is also accepted.
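The same pattern can be laid out piece by piece with the x modifier, which allows whitespace and comments inside the expression. The version below is a sketch using the same alternatives as above. The sub name parse_date is an assumption, the century group is made non-capturing with (?:...), and the captures are returned directly by matching in list context.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (month, day, year), or an empty list if no date is found.
sub parse_date {
    my ($date) = @_;
    my ( $month, $day, $year ) = $date =~ m{
        \b
        ( 1[0-2] | 0?[1-9] )             # month: 10-12, or 1-9 with optional leading 0
        [-/]                             # separator: dash or slash
        ( 0?[1-9] | [12][0-9] | 3[01] )  # day: 1-9, 10-29, or 30-31
        [-/]                             # separator: dash or slash
        ( (?:19|20) \d{2} )              # year: 19xx or 20xx
    }x or return;
    return ( $month, $day, $year );
}

my ( $m, $d, $y ) = parse_date("1/24/2003");
print "Month: $m, Day: $d, Year: $y\n";
```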
String Replacement
Let's consider a small example with credit card numbers. A credit card number can be entered as
6432-2345-2342-2342
or
6432 2345 2342 2342
or
6432234523422342
Our goal is to recognize a valid number and transform it into the first form. Let's use
the following regular expression:
$card = "6432 23452342-2342";
if( $card =~ /(\d\d\d\d)[\-\s]?(\d\d\d\d)[\-\s]?(\d\d\d\d)[\-\s]?(\d\d\d\d)/ ){
    $card = "$1-$2-$3-$4";
}
else{
    $card = "Invalid credit card number!";
}
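Wrapped in a sub, the same pattern can be tried against all three input forms. This is only a sketch: the name normalize_card is an assumption, it returns undef rather than an error string, and the ^ and $ anchors are an addition so that input with extra characters around the number is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number as NNNN-NNNN-NNNN-NNNN, or undef if invalid.
sub normalize_card {
    my ($card) = @_;
    # Four groups of four digits, each optionally followed by a dash or whitespace.
    if ( $card =~ /^(\d{4})[-\s]?(\d{4})[-\s]?(\d{4})[-\s]?(\d{4})$/ ) {
        return "$1-$2-$3-$4";
    }
    return;
}

for my $input ( "6432-2345-2342-2342", "6432 2345 2342 2342",
                "6432234523422342",    "6432 2345" ) {
    my $result = normalize_card($input);
    print defined $result ? "$input -> $result\n" : "$input -> invalid\n";
}
```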
To replace the part of a string that matches a regular expression, we can use the substitution
operator s///. It takes a pattern (between the first and second slash)
and a replacement string (between the second and third slash).
When s/// is bound to a variable with the operator =~, the substitution is performed on that variable, and the result is true if a replacement was made.
Thus, the following example replaces the first 4 digits of a credit card number with stars:
$card = "6432 23452342-2342";
if( $card =~ s/\d{4}/****/ ){
    print "$card\n";
}
else{
    print "Sorry, there is no match\n";
}
If used with the modifier g, such an expression replaces all matches in the string.
In the following example we first bring the card number into the normal form and then replace all digits
but the last 4 with stars:
$card = "6432 23452342-2342";
$card =~ s/(\d\d\d\d)[\-\s]?/$1-/g; # separate 4-digit groups with dashes
$card =~ s/-$//; # remove the trailing dash
$card =~ s/\d{4}-/****-/g; # substitute 4-digit groups with 4 stars
print "$card\n";
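The three steps can also be packaged in a sub. The sketch below does that and additionally relies on the fact that s///g returns the number of replacements it made; the name mask_card is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize the number, then hide every group except the last.
# Returns the masked string and the number of groups that were masked.
sub mask_card {
    my ($card) = @_;
    $card =~ s/(\d{4})[-\s]?/$1-/g;             # separate 4-digit groups with dashes
    $card =~ s/-$//;                            # remove the trailing dash
    my $masked = ( $card =~ s/\d{4}-/****-/g ); # s///g returns the number of replacements
    return ( $card, $masked );
}

my ( $shown, $count ) = mask_card("6432 23452342-2342");
print "$shown ($count groups masked)\n";
```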