Using Regular Expressions

To check whether a string contains a match, Perl uses the special binding operator =~. The usual syntax is:
variable =~ m/regular_expression/
The result of this operator is either true (if there is a match) or false (no match). The following example checks whether a string contains a phone number:
$str = "My phone number is 123-3445. This is my home phone.";
if ( $str =~ /\d{3}-\d{4}/ ) {
    print "There is a phone number in the string '$str'\n";
}
else {
    print "There is not a phone number in the string '$str'\n";
}
As you can see, the character m (which stands for "match") may be omitted when the pattern is delimited by slashes.
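
With an explicit m, delimiters other than slashes may be used, which is convenient when the pattern itself contains slashes. The following is a small sketch of that idea; the value of $path and the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";    # sample value, assumed for this sketch

# With slashes as delimiters every / inside the pattern must be escaped
my $with_slashes = ( $path =~ /\/local\/bin\// ) ? 1 : 0;

# With m and another delimiter the slashes can be written as they are
my $with_braces = ( $path =~ m{/local/bin/} ) ? 1 : 0;

print "Both forms match\n" if $with_slashes && $with_braces;
```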

Often we need to know not only whether there is a match, but also which substring matched the pattern. Perl provides several special variables for that purpose:

  • $& - contains the matching substring
  • $` - contains the substring on the left of the match
  • $' - contains the substring on the right of the match
Let's modify the previous example:
$str = "My phone number is 123-3445. This is my home phone.";
if ( $str =~ /\d{3}-\d{4}/ ) {
    print "There is a phone number in the string '$str'\n";
    print "The phone is: $&\n";
    print "      Before: '$`'\n";
    print "       After: '$''\n";
}
else {
    print "There is not a phone number in the string '$str'\n";
}

If we want to find all matches in a string, we can use operator =~ in a loop, setting the variable $str to the substring on the right of the match after each iteration:

$str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
while ( $str =~ /\d{3}-\d{4}/ ) {
    print "The phone is: $&\n";
    $str = $';
}
Or we can use the global modifier g and evaluate operator =~ in list context. When the result is assigned to an array, operator =~ returns the list of all matches:
$str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
@phones = $str =~ /\d{3}-\d{4}/g;
foreach $phone (@phones) {
    print "$phone\n";
}
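
A third option, sketched here as an illustration rather than as part of the examples above, is to use the modifier g in scalar context. Each evaluation of the match in the while condition continues from where the previous match ended, so the string itself does not have to be modified. The array @found is an assumption of this sketch:

```perl
use strict;
use warnings;

my $str = "My phones: 123-3456 (home), 234-4557 (office), 456-4564 (cell).";
my @found;

# In scalar context /g finds one match per evaluation and remembers
# the position, so the loop visits every match without changing $str.
while ( $str =~ /(\d{3}-\d{4})/g ) {
    push @found, $1;
    print "The phone is: $1\n";
}
```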

In addition to operator =~, Perl has operator !~, which returns true if there is no match and false otherwise.
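
As a minimal sketch of the negated form, the following code collects the strings that do not contain a phone number. The sample strings and the array @without are made up for illustration:

```perl
use strict;
use warnings;

my @lines = ( "Call 123-3445 today", "No number here" );
my @without;

foreach my $line (@lines) {
    # !~ is true when the pattern does NOT match
    if ( $line !~ /\d{3}-\d{4}/ ) {
        push @without, $line;
        print "No phone number in '$line'\n";
    }
}
```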

Getting Information About a Match

Besides verifying that a one-field date entry is in the desired format, you can also extract its components. To get any piece of the substring that matches the pattern, we need to enclose the corresponding part of the pattern in parentheses. Please note that parentheses are special symbols inside patterns and do not themselves match anything in the string. If a pattern includes one or more sets of parentheses, Perl places the substring matched by each set in the special variables $1, $2, $3, etc. (Perl does not stop at $9). The sets are numbered by the position of their opening parenthesis, counting from the left.

For example, if we are checking that a date was entered in either "mm/dd/yyyy" or "mm-dd-yyyy" format and also need to know the values of the month, day, and year, we can use the following regular expression:

$today = "1/24/2003";
if ( $today =~ /\b(1[0-2]|0?[1-9])[\-\/](0?[1-9]|[12][0-9]|3[01])[\-\/]((19|20)\d{2})/ ) {
    print "Date: $&\n";
    print "Month: $1\n";
    print "Day: $2\n";
    print "Year: $3\n";
    print "Century: $4\n";
}
Let's take a closer look at this expression:
  • 1[0-2]|0?[1-9] for the month (either 1 followed by 0, 1, or 2, or an optional leading 0 followed by a non-zero digit). We also put parentheses around this part of the regular expression to extract the month from the matching string.
  • 0?[1-9]|[12][0-9]|3[01] for the day (an optional leading 0 followed by a non-zero digit, or 1 or 2 followed by any digit, or 3 followed by 0 or 1). This part is also inside parentheses because we need to know the day of the month as well.
  • (19|20)\d{2} for the year (starts with 19 or 20 and has exactly two digits after that). This part of the pattern is inside parentheses (that gives us the year) and also has its own part parenthesized. This subpart gives us the century digits, 19 or 20.
Combining these three parts and allowing for either separator ([\-\/] matches a dash or a slash), we come up with the code above.
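
In list context a match without the modifier g returns the list of captured substrings, so the components can also be assigned directly to variables instead of being read from $1, $2, and $3. The sketch below wraps this in a helper called parse_date, which is a hypothetical name chosen for this illustration, and reuses the same pattern:

```perl
use strict;
use warnings;

# parse_date is a hypothetical helper written for this sketch.
# It returns (month, day, year), or an empty list if there is no match.
sub parse_date {
    my ($date) = @_;

    # The match returns all four captures; the fourth (century) is dropped here
    my ( $month, $day, $year ) =
        $date =~ /\b(1[0-2]|0?[1-9])[\-\/](0?[1-9]|[12][0-9]|3[01])[\-\/]((19|20)\d{2})/;

    return defined $month ? ( $month, $day, $year ) : ();
}

foreach my $input ( "1/24/2003", "12-05-1999", "13/45/2003" ) {
    my @parts = parse_date($input);
    if (@parts) {
        print "$input: month=$parts[0] day=$parts[1] year=$parts[2]\n";
    }
    else {
        print "$input: not a valid date\n";
    }
}
```

The first two inputs are split into their components, while the third is rejected because neither 13 nor 45 fits the month and day alternatives.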

String Replacement

Let's consider a small example about credit card numbers. A credit card number can be entered as
  • 6432-2345-2342-2342
    or
  • 6432 2345 2342 2342
    or
  • 6432234523422342
Our goal is to recognize a valid number and transform it to the first form. Let's use the following regular expression:
$card = "6432 23452342-2342";
if ( $card =~ /(\d\d\d\d)[\-\s]?(\d\d\d\d)[\-\s]?(\d\d\d\d)[\-\s]?(\d\d\d\d)/ ) {
    $card = "$1-$2-$3-$4";
}
else {
    $card = "Invalid credit card number!";
}
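
Note that the pattern above is not anchored, so it also accepts longer strings that merely contain sixteen digits somewhere inside. A minimal sketch of a stricter check adds the anchors ^ and $ so that the pattern has to cover the whole string. The helper name normalize_card and the second sample input are assumptions made for this illustration:

```perl
use strict;
use warnings;

# normalize_card is a hypothetical helper for this sketch. It returns the
# number in dashed form, or undef if the whole string is not a card number.
sub normalize_card {
    my ($card) = @_;

    # ^ and $ force the pattern to cover the entire string
    if ( $card =~ /^(\d{4})[\-\s]?(\d{4})[\-\s]?(\d{4})[\-\s]?(\d{4})$/ ) {
        return "$1-$2-$3-$4";
    }
    return;
}

foreach my $input ( "6432 23452342-2342", "16432 2345 2342 23429" ) {
    my $normal = normalize_card($input);
    print defined $normal ? "$input -> $normal\n" : "$input -> invalid\n";
}
```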
    

To replace the part of a string that matches a regular expression we can use the substitution operator s///. It takes a pattern (between the first and the second slash) and a replacement string (between the second and the third slash). Operator =~ performs the substitution when it is used with s///. Thus, the following example replaces the first 4 digits in a credit card number with stars:

$card = "6432 23452342-2342";
if ( $card =~ s/\d{4}/****/ ) {
    print "$card\n";
}
else {
    print "Sorry, there is no match\n";
}
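
The if test above works because s/// returns the number of substitutions it made, which is a false value when nothing was replaced. The following sketch stores that result to make it visible; the variable names and the second sample string are chosen for illustration only:

```perl
use strict;
use warnings;

my $card = "6432 23452342-2342";
my $text = "no digits here";

# s/// returns the number of substitutions made (false if there were none)
my $changed   = ( $card =~ s/\d{4}/****/ );
my $unchanged = ( $text =~ s/\d{4}/****/ );

print "card: '$card', substitutions: $changed\n";
print "text: '$text' was left as it was\n" unless $unchanged;
```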
    
If used with the modifier g, such an expression substitutes all matches in the string. In the following example we first bring the card number into the normal form and then replace all digits but the last 4 with stars:
$card = "6432 23452342-2342";
$card =~ s/(\d\d\d\d)[\-\s]?/$1-/g; # separate 4-digit groups with dashes
$card =~ s/-$//;                    # remove the trailing dash
$card =~ s/\d{4}-/****-/g;          # substitute 4-digit groups with 4 stars
print "$card\n";
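
To see that these three substitutions handle every input form listed at the start of this section, they can be wrapped in a small subroutine and applied to each form. This is only a sketch: the name mask_card is hypothetical, and the inputs are assumed to be valid card numbers already.

```perl
use strict;
use warnings;

# mask_card is a hypothetical helper for this sketch. It works on a copy
# of its argument, so the caller's string is not modified.
sub mask_card {
    my ($card) = @_;
    $card =~ s/(\d{4})[\-\s]?/$1-/g;    # separate 4-digit groups with dashes
    $card =~ s/-$//;                    # remove the trailing dash
    $card =~ s/\d{4}-/****-/g;          # mask every group followed by a dash
    return $card;
}

foreach my $input ( "6432-2345-2342-2342", "6432 2345 2342 2342", "6432234523422342" ) {
    print mask_card($input), "\n";
}
```

Only groups followed by a dash are masked, which is why the last group stays readable.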