Introduction
Without regular expressions, Perl would be a fast development environment.
Probably a little faster than VB for console apps. With the addition of regular
expressions, Perl exceeds other RAD environments five to twenty-fold in the
hands of an experienced practitioner, on console apps whose problem domains
include parsing (and that's a heck of a lot of them).
Regular expressions are a HUGE area of knowledge, bordering on an art.
Rather than regurgitate the contents of the Perl documentation or the plethora
of Perl books at your local bookstore, this page will attempt to give you
the 10% of regular expressions you'll use 90% of the time. Note that for
this reason we assume all strings to be single-line strings containing no
newline chars.
What They Are
Regular expressions are a syntax, implemented in Perl and certain other environments,
making it not only possible but easy to do some of the following:
- Complex string comparisons
- $string =~ m/sought_text/; # m before the first slash is the "match" operator.
- Complex string selections
- $string =~ m/whatever(sought_text)whatever2/;
- $soughtText = $1;
- Complex string replacements
- $string =~ s/originaltext/newtext/; # s before first slash is the "substitute" operator.
- Parsing based on the above abilities
Doing String Comparisons
We start with string comparisons because they're the easiest, and yet most
of what's contained here is applicable in selecting and replacing text.
Simple String Comparisons
The most basic string comparison is
$string =~ m/sought_text/;
The above returns true if string $string contains substring "sought_text",
false otherwise. If you want only those strings where the sought text appears
at the very beginning, you could write the following:
$string =~ m/^sought_text/;
Similarly, the $ operator indicates "end of string". If you wanted to find
out if the sought text was the very last text in the string, you could write
this:
$string =~ m/sought_text$/;
Now, if you want the comparison to be true only if $string contains the sought
text and nothing but the sought text, simply do this:
$string =~ m/^sought_text$/;
Now what if you want the comparison to be case insensitive? All you do is
add the letter i after the ending delimiter:
$string =~ m/^sought_text$/i;
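The four comparisons above can be tried side by side. This is only a small sketch: the helper name anchor_report and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of the comparisons succeed for a string.
sub anchor_report
{
    my($string) = @_;
    my(@hits);
    push(@hits, "contains") if $string =~ m/rem/;     # anywhere in the string
    push(@hits, "starts")   if $string =~ m/^rem/;    # at the very beginning
    push(@hits, "ends")     if $string =~ m/rem$/;    # at the very end
    push(@hits, "exact")    if $string =~ m/^rem$/i;  # nothing else, any case
    return join(",", @hits);
}

print anchor_report("rem print a banner"), "\n";   # contains,starts
print anchor_report("theorem"), "\n";              # contains,ends
print anchor_report("REM"), "\n";                  # exact
```

Note that "REM" fails the first three tests because only the last one carries the i modifier.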
Using Simple "Wildcards" and "Repetitions"
Calling these "wildcards" may actually conflict with the theoretical grammar
and syntax of Perl, but it is the most intuitive way to think of them,
and will not lead to any coding mistakes.
. Match any character (except newline)
\w Match "word" character (alphanumeric plus "_")
\W Match non-word character
\s Match whitespace character
\S Match non-whitespace character
\d Match digit character
\D Match non-digit character
\t Match tab
\n Match newline
\r Match return
\f Match formfeed
\a Match alarm (bell, beep, etc)
\e Match escape
\021 Match octal char (in this case 21 octal)
\xf0 Match hex char (in this case f0 hexadecimal)
You can follow any character, wildcard, or parenthesized series of characters
and/or wildcards with a repetition. Here's where you start getting some power:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Now for some examples:
$string =~ m/^\s*rem/i; #true if the first printable text is rem or REM
$string =~ m/^\S{1,8}\.\S{0,3}$/; # check for DOS 8.3 filename
# (note a few illegals can sneak thru)
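Here is a minimal sketch that combines an anchor, a wildcard and a repetition in one place. The helper name classify and the sample lines are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one line of text.
sub classify
{
    my($line) = @_;
    return "remark"   if $line =~ m/^\s*rem/i;            # optional blanks, then rem
    return "8.3 name" if $line =~ m/^\S{1,8}\.\S{0,3}$/;  # 1-8 chars, dot, 0-3 chars
    return "other";
}

print classify("   REM set the path"), "\n";   # remark
print classify("AUTOEXEC.BAT"), "\n";          # 8.3 name
print classify("toolongname.txt"), "\n";       # other
```

Because \S also matches a dot, a name like a.b.c.d still slips through the 8.3 test.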
Using Groups ( ) in Matching
Note: Many situations can be done either with groups ( )
or character classes [ ]. Groups are less quirky and they more often yield
the results you were looking for.
Groups are regular expression characters surrounded by parentheses. They
have two major uses:
- To allow alternative phrases as in /(Clinton|Bush|Reagan)/i. Note that
for single character alternatives, you can also use character classes.
- As a means of retrieving selected text in selection and substitution,
used with the $1, $2, etc scalars.
This section will discuss only the first use. The second use is covered in
the section on Doing String Selections.
Powerful regular expressions can be made with groups. At its simplest,
you can match either all lowercase or name case like this:
if($string =~ m/(B|b)ill (C|c)linton/)
{print "It is Clinton, all right!\n"}
Detect all strings containing vowels
if($string =~ m/(A|E|I|O|U|Y|a|e|i|o|u|y)/)
{print "String contains a vowel!\n"}
Detect if the line starts with any of the last three presidents:
if($string =~ m/^(Clinton|Bush|Reagan)/i)
{print "$string\n"};
Note that the parenthesized element will appear as $1 in statements that follow
the regular expression. That's OK. If you don't want to use $1, just ignore
it. The use of $1, etc, will be explained in the section on
Doing String Selections.
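As a small sketch of alternation with a group, the loop below keeps whatever the group matched. The list of names is invented for the example.

```perl
use strict;
use warnings;

my(@names) = ("Clinton, Bill", "bush, George", "Carter, Jimmy");
my(@found);
foreach my $name (@names)
{
    if ($name =~ m/^(Clinton|Bush|Reagan)/i)
    {
        push(@found, $1);   # $1 holds the text exactly as it appeared in $name
    }
}
print join(" ", @found), "\n";   # Clinton bush
```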
Using Character Classes [ ]
Character classes are alternative single characters within square brackets,
and are not to be confused with OOP classes, which are blueprints for objects.
If not used carefully, they can yield unexpected results. Remember that
groups are an alternative.
Character classes have three main advantages:
- Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y). This advantage
is minor at best.
- Character Ranges, such as [A-Z].
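- One to one mapping from one set of characters to another, as in tr/a-z/A-Z/.
Strictly speaking tr takes plain character lists rather than bracketed classes,
but the range notation is the same. This is essential! It will be discussed in
the section on translations.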
THE WHOLE THING IN THE SQUARE BRACKETS REPRESENTS EXACTLY ONE CHARACTER!!!
Did I shout loud enough? It may be tempting to do something like this:
if($string =~ /[Clinton|Bush|Reagan]/){$office = "President"}
The above may even appear to work upon casual testing. Don't do it. Remember
that everything inside the brackets represents ONE character, simply listing
all its alternative possibilities.
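To see why, compare the bracketed version with a real group. The sketch below uses two hypothetical helpers, by_class and by_group, and made-up names.

```perl
use strict;
use warnings;

# The class matches any ONE of the characters C l i n t o | B u s h R e a g.
sub by_class { return ($_[0] =~ /[Clinton|Bush|Reagan]/) ? 1 : 0; }
# The group matches one of the three whole names.
sub by_group { return ($_[0] =~ /(Clinton|Bush|Reagan)/) ? 1 : 0; }

foreach my $name ("Bush", "Carter", "XYZ")
{
    print "$name: class=", by_class($name), " group=", by_group($name), "\n";
}
```

"Carter" passes the class test simply because it contains a capital C, which is exactly the kind of false positive casual testing tends to miss.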
Other Quirks
I haven't fully investigated this yet, but character classes seem to sometimes
do goofy things in regular expressions where the case is ignored (i after
the trailing delimiter).
Special Characters Inside the Square Brackets
As we've already seen, a hyphen is used to indicate all characters in the
collating sequence between the character on the hyphen's left and the character
on its right.
An uparrow (^) immediately following the opening square bracket means
"Anything but these characters", and effectively negates the character class.
For instance, to match anything that is not a vowel, do this:
if($string =~ /[^AEIOUYaeiouy]/){print "This string contains a non-vowel"}
Contrast to this:
if($string !~ /[AEIOUYaeiouy]/){print "This string contains no vowels at all"}
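The difference between the two tests is easy to miss, so here is a minimal sketch with two hypothetical helpers and invented sample words.

```perl
use strict;
use warnings;

# True if at least one character is not a vowel.
sub has_non_vowel { return ($_[0] =~ /[^AEIOUYaeiouy]/) ? 1 : 0; }
# True if no character at all is a vowel.
sub has_no_vowel  { return ($_[0] !~ /[AEIOUYaeiouy]/)  ? 1 : 0; }

foreach my $word ("bat", "brrr", "aeiou")
{
    print "$word: ", has_non_vowel($word), " ", has_no_vowel($word), "\n";
}
```

"bat" contains a non-vowel but is not vowel-free, "brrr" satisfies both tests, and "aeiou" satisfies neither.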
Best Uses of Character Classes
Print all people whose name begins with A through E
if($string =~ m/^[A-E]/)
{print "$string\n"}
If character classes are giving you quirky results, consider using
groups!
Matching: Putting it All Together
Print everyone whose last name is Clinton, Bush or Reagan. Each element of
the list is first name, blank, last name, and possibly more blanks and more info
after the last name. Study this until you understand it.
if($string =~ m/^\S+\s+(Clinton|Bush|Reagan)/i)
{print "$string\n"};
Print every line with a valid phone number.
if($string =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/)
{print "Phone line: $string\n"};
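The phone number pattern insists on one character before and one after the number. The sketch below, with a hypothetical helper has_phone and invented sample lines, shows what that means in practice.

```perl
use strict;
use warnings;

# Same pattern as above, wrapped in a helper for testing.
sub has_phone
{
    return ($_[0] =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/) ? 1 : 0;
}

print has_phone("Call (555) 123-4567 today"), "\n";   # 1
print has_phone("Is it 555-123-4567?"), "\n";         # 1
print has_phone("123-4567"), "\n";                    # 0
```

The last line fails because there is no character before the first digit or after the last one, so a number at the very start or end of a line is not detected.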
Doing String Selections (Parsing)
If regular expressions' only benefit were looking for an (albeit complex) string
within a string, they wouldn't be worth learning. Regular expressions (and
Perl itself, for that matter) really start earning their keep by allowing
you to select and process substrings based on what they contain, and the
context in which they appear.
For instance, create a program whose input is a piped in directory command
and whose output is stdout, and whose output represents a batch file which
copies every file (not directory) older than 12/22/97 to a directory called
\oldie. This would be pretty nasty in C or C++. The directory output would
look something like this:
 Volume in drive D has no label
 Volume Serial Number is 4547-15E0
 Directory of D:\polo\marco
.              <DIR>        12-18-97 11:14a .
..             <DIR>        12-18-97 11:14a ..
INDEX    HTM         3,237  02-06-98  3:12p index.htm
APPDEV   HTM         6,388  12-24-97  5:13p appdev.htm
NORM     HTM         5,297  12-24-97  5:13p norm.htm
IMAGES         <DIR>        12-18-97 11:14a images
TCBK     GIF           532  06-02-97  3:14p tcbk.gif
LSQL     HTM         5,027  12-24-97  5:13p lsql.htm
CRASHPRF HTM        11,403  12-24-97  5:13p crashprf.htm
WS_FTP   LOG         5,416  12-24-97  5:24p WS_FTP.LOG
FIBB     HTM        10,234  12-24-97  5:13p fibb.htm
MEMLEAK  HTM        19,736  12-24-97  5:13p memleak.htm
LITTPERL       <DIR>        02-06-98  1:58p littperl
         9 file(s)         67,270 bytes
         4 dir(s)     132,464,640 bytes free
UUUUgly! I'd hate to do this in C or C++. But wait. It's 18 lines in Perl?
while(<STDIN>)
{
  my($line) = $_;
  chomp($line);
  if($line !~ /<DIR>/) #directories don't count
  {
    #** only lines with dates at position 28 and (long) filename at pos 44 **
    if ($line =~ /.{28}(\d\d)-(\d\d)-(\d\d).{8}(.+)$/)
    {
      my($filename) = $4;
      my($yymmdd) = "$3$1$2";
      if($yymmdd lt "971222")
      {
        print "copy $filename \\oldie\n";
      }
    }
  }
}
Not bad for 18 lines of code. It could have been shorter, but I wanted to
keep it readable. In the snippet above, $1, $2, $3 and $4 are the scalars
holding the text matched by the first, second, third and fourth sets of
parentheses. The first three are re-assembled into a yymmdd date string which
can be compared with the constant "971222". The fourth holds the filename,
which will be copied to the \oldie directory if the line is not a directory,
has a date, and the date is before 971222. This is the true power of
regular expressions and Perl.
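To watch the selection work on a single line without piping in a whole directory, here is a minimal sketch. It assumes the fixed-column layout the script's comment describes (date at offset 28, long filename at offset 44) and builds one such line with sprintf; the sprintf format is my own reconstruction of that layout.

```perl
use strict;
use warnings;

# Build one listing line in the assumed layout:
# 8-char name, blank, 3-char extension, 14-char size, 2 blanks, date, time, filename.
my($line) = sprintf("%-8s %-3s%14s  %s %6s %s",
                    "TCBK", "GIF", "532", "06-02-97", "3:14p", "tcbk.gif");

my($command) = "";
if ($line =~ /.{28}(\d\d)-(\d\d)-(\d\d).{8}(.+)$/)
{
    my($filename) = $4;
    my($yymmdd) = "$3$1$2";   # mm-dd-yy rearranged so it sorts as text
    $command = "copy $filename \\oldie" if $yymmdd lt "971222";
}
print "$command\n";   # copy tcbk.gif \oldie
```

Rearranging the date as yymmdd is what lets the plain string comparison lt stand in for a date comparison.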
Now count the bytes in the directory:
my($totalBytes) = 0;
while(<STDIN>)
{
  my($line) = $_;
  chomp($line);
  if($line !~ /<DIR>/) #directories don't count
  {
    #*** only lines with dates at position 28 ****
    if ($line =~ /.{12}((\d| |,){14}) \d\d-\d\d-\d\d/)
    {
      my($bytes) = $1;
      $bytes =~ s/,//g; #substitute nothing for every comma -- delete commas
      $totalBytes += $bytes;
    }
  }
}
print "$totalBytes bytes in directory.\n";
Note the group within a group, where the inner one is used for character
alternation, and the outer is used as a selection.
Doing Substitutions
Replace every "Bill Clinton" with an "Al Gore" (the trailing g means "global";
without it only the first occurrence is replaced):
$string =~ s/Bill Clinton/Al Gore/g;
Now do it ignoring the case of bIlL ClInToN.
$string =~ s/Bill Clinton/Al Gore/gi;
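A substitution replaces only the first match unless the g modifier is added after the closing delimiter. The sketch below contrasts the two on invented sample strings.

```perl
use strict;
use warnings;

my($once) = "Bill Clinton met bill clinton";
my($all)  = $once;
$once =~ s/Bill Clinton/Al Gore/i;    # first occurrence only
$all  =~ s/Bill Clinton/Al Gore/gi;   # every occurrence, any case

my($bytes) = "132,464,640";
$bytes =~ s/,//g;                     # delete every comma, not just the first

print "$once\n";    # Al Gore met bill clinton
print "$all\n";     # Al Gore met Al Gore
print "$bytes\n";   # 132464640
```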
Doing Translations
Translations are like substitutions, except they happen on a letter by letter
basis instead of substituting a single phrase for another single phrase. The
two lists are plain character lists, not regular expressions, so don't wrap
them in square brackets or separate the characters with commas. For
instance, what if you wanted to make all vowels upper case:
$string =~ tr/aeiouy/AEIOUY/;
Change everything to upper case:
$string =~ tr/a-z/A-Z/;
Change everything to lower case:
$string =~ tr/A-Z/a-z/;
Change all vowels to numbers to avoid "4 letter words" in a serial number:
$string =~ tr/AEIOUY/123456/;
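The tr operator takes two plain character lists, so any square brackets and commas placed in them are treated as ordinary characters. The sketch below, on invented sample strings, shows two clean translations and what goes wrong when brackets and commas sneak in.

```perl
use strict;
use warnings;

my($string) = "Perl Regular Expressions";

(my $upper  = $string) =~ tr/a-z/A-Z/;         # every lowercase letter to upper case
(my $vowels = $string) =~ tr/aeiouy/AEIOUY/;   # only the vowels

# Pitfall: brackets and commas are literal here, and a short replacement
# list is padded with its last character, so Y ends up as "]".
(my $oops = "YES") =~ tr/[A,E,I,O,U,Y]/[1,2,3,4,5]/;

print "$upper\n";    # PERL REGULAR EXPRESSIONS
print "$vowels\n";   # PErl REgUlAr ExprEssIOns
print "$oops\n";     # ]2S
```

The (my $copy = $string) =~ tr/.../.../ form translates a copy and leaves the original untouched.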
Greedy and Ungreedy Matching
Perl regular expressions normally match the longest string possible. For
instance:
my($text) = "mississippi";
$text =~ m/(i.*s)/;
print $1 . "\n";
Run the preceding code, and here's what you get:
ississ
It matches the first i, the last s, and everything in between them. But what
if you want to match the first i to the s most closely following it? Use this
code:
my($text) = "mississippi";
$text =~ m/(i.*?s)/;
print $1 . "\n";
Now look what the code produces:
is
Clearly, the use of the question mark makes the match ungreedy. But there's
another problem in that regular expressions always try to match as early
as possible. Read on...
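The "as early as possible" rule can be seen on the same word. In this sketch the third pattern is ungreedy yet still starts at the first i; the fourth uses a negated character class, one possible way (not the only one) to force the nearest match.

```perl
use strict;
use warnings;

my($text) = "mississippi";
my($greedy)   = $text =~ m/(i.*s)/;       # ississ
my($ungreedy) = $text =~ m/(i.*?s)/;      # is
my($early)    = $text =~ m/(i.*?p)/;      # ississip: starts at the FIRST i
my($nearest)  = $text =~ m/(i[^i]*?p)/;   # ip: the match may not cross another i
print "$greedy $ungreedy $early $nearest\n";
```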
Resolving Doubledots in A Filepath
Doubledots are placefillers for "go up one directory" in a file path. Typically,
when you desire to create an absolute path, you want to resolve them by deleting
them and the level of directory above them. For instance, /a/b/../whatever
becomes /a/whatever.
This is MUCH trickier than it might seem. It's likely that all your ideas
about greedy matching, replacement strings and the like won't work. Here's
the regular expression to resolve A SINGLE double dot:
$text =~ s/\/[^\/]*\/\.\.//;
In English, this says "find a slash, followed by any number of nonslashes,
followed by a slash, followed by two dots, and replace them with nothing."
This technique will resolve doubledots in a string as long as that string
has only one doubledot. But the plot thickens...
Doubledots can occur alternatively with directories (/a/b/../c/../d)
or nested (/a/b/c/../../d). The best way I've found to reliably
resolve all doubledots is to make a function that loops through the preceding
regular expression until there are no more doubledots. Here's the function:
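sub deleteDoubleDots($)
{
  # Repeat until a pass finds nothing to remove. Testing the substitution
  # itself (rather than m/\.\./) avoids an endless loop on paths such as
  # "/../x", where dots remain but nothing can be resolved.
  1 while $_[0] =~ s/\/[^\/]*\/\.\.//;
}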
The preceding function will resolve all doubledots, be they alternating or
nested, or combinations thereof.
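Here is a small sketch that exercises the looping idea on both kinds of path. It uses a hypothetical helper, resolve_path, that returns a new string instead of changing its argument, and s{...}{} delimiters so the slashes need no escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the path with doubledots resolved.
sub resolve_path
{
    my($path) = @_;
    # Each successful pass removes one "/dir/.." pair; stop when none is left.
    1 while $path =~ s{/[^/]*/\.\.}{};
    return $path;
}

print resolve_path("/a/b/../c/../d"), "\n";   # /a/d
print resolve_path("/a/b/c/../../d"), "\n";   # /a/d
```

Looping on the result of the substitution means the loop ends as soon as a pass changes nothing.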
Kewl Splitpath One Liner Regex
Check out this splitpath command:
my($text) = "/etc/sysconfig/network-scripts/ifcfg-eth0";
my($directory, $filename) = $text =~ m/(.*\/)(.*)$/;
print "D=$directory, F=$filename\n";
Is that cool or what?
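The one liner works because the first .* is greedy and so runs to the LAST slash. If the path has no slash at all, the match fails and both variables come back undefined. The sketch below is one possible variation (the helper name split_path is made up) that makes the directory part optional.

```perl
use strict;
use warnings;

# Hypothetical helper: split a path into directory (with trailing slash) and filename.
sub split_path
{
    my($path) = @_;
    my($directory, $filename) = $path =~ m/^(.*\/)?(.*)$/;
    $directory = "" unless defined($directory);   # no slash anywhere in the path
    return ($directory, $filename);
}

my($d1, $f1) = split_path("/etc/sysconfig/network-scripts/ifcfg-eth0");
my($d2, $f2) = split_path("ifcfg-eth0");
print "D=$d1, F=$f1\n";   # D=/etc/sysconfig/network-scripts/, F=ifcfg-eth0
print "D=$d2, F=$f2\n";   # D=, F=ifcfg-eth0
```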
Using a Variable as a Match Expression
You can use a variable inside the match expression. This yields tremendous
power. Simply place the variable name between the forward slashes, and the
expression will be sought in the string. Here's an example:
#!/usr/bin/perl -w
# use strict;
sub test($$)
{
  my $lookfor = shift;
  my $string = shift;
  print "\n$lookfor ";
  if($string =~ m/($lookfor)/)
  {
    print " is in ";
  }
  else
  {
    print " is NOT in ";
  }
  print "$string.";
  if(defined($1))
  {
    print " <$1>";
  }
  print "\n";
}
test("st.v.", "steve was here");
test("st.v.", "kitchen stove");
test("st.v.", "kitchen store");
The preceding code produces the following output.
[slitt@mydesk slitt]$ ./junk.pl
st.v. is in steve was here. <steve>
st.v. is in kitchen stove. <stove>
st.v. is NOT in kitchen store.
[slitt@mydesk slitt]$
As you can see, you can seek a regular expression stored in a variable, and
you can retrieve the result in $1.
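Sometimes the variable holds plain text rather than a pattern, and its dots should match only dots. Perl's \Q and \E escapes quote everything between them. The sample strings in this sketch are invented.

```perl
use strict;
use warnings;

my($lookfor) = "st.v.";
my($as_pattern)  = ("kitchen stove" =~ m/$lookfor/)          ? 1 : 0;  # dots are wildcards
my($as_literal)  = ("kitchen stove" =~ m/\Q$lookfor\E/)      ? 1 : 0;  # dots must be dots
my($literal_hit) = ("file st.v. found" =~ m/\Q$lookfor\E/)   ? 1 : 0;  # the literal text is there
print "$as_pattern $as_literal $literal_hit\n";   # 1 0 1
```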
Symbol Explanations:
=~
This operator appears between the string var you are comparing, and the regular
expression you're looking for (note that in selection or substitution a regular
expression operates on the string var rather than comparing). Here's a simple
example:
$string =~ m/Bill Clinton/;
#return true if var $string contains the name of the president
$string =~ s/Bill Clinton/Al Gore/; #replace the president with the vice president
!~
Just like =~, except negated. With matching, returns true if it DOESN'T match.
With substitutions and translations the operation is still carried out and only
the return value is negated, which is rarely what you want.
/
This is the usual delimiter for the text part of a regular expression. If
the sought-after text contains slashes, it's sometimes easier to use pipe
symbols (|) for delimiters, but this is rare. Here are simple examples:
$string =~ m/Bill Clinton/;
#return true if var $string contains the name of the president
$string =~ s/Bill Clinton/Al Gore/; #replace the president with the vice president
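When the sought text is full of slashes, another delimiter saves a lot of backslashes. In this sketch (the path is just a sample) all three tests are the same match; with any delimiter other than the slash, the leading m is required.

```perl
use strict;
use warnings;

my($path) = "/usr/local/bin/perl";
my($with_slashes) = ($path =~ m/^\/usr\/local\//) ? 1 : 0;   # every / escaped
my($with_pipes)   = ($path =~ m|^/usr/local/|)    ? 1 : 0;   # same test, no escapes
my($with_braces)  = ($path =~ m{^/usr/local/})    ? 1 : 0;   # paired delimiters also work
print "$with_slashes $with_pipes $with_braces\n";   # 1 1 1
```

One cost of pipe delimiters is that the pipe can no longer be used for alternation inside the expression without escaping it.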
m
The match operator. Coming before the opening delimiter, it means read the
string expression on the left of the =~, and see if any part of it matches
the expression within the delimiters following
the m. Note that if the delimiters are slashes (which is the normal state
of affairs), the m is optional and often not included. Whether it's there
or not, it's still a match operation. Here are some examples:
$string =~ m/Bill Clinton/;
#return true if var $string contains the name of the president
$string =~ /Bill Clinton/;
#same result as previous statement
^
This is the "beginning of line" symbol. When used immediately after the starting
delimiter, it signifies "at the beginning of the line". For instance:
$string =~ m/^Bill Clinton/;
#true only when "Bill Clinton" is the first text in the string
$
This is the "end of line" symbol. When used immediately before the ending
delimiter, it signifies "at the end of the line". For instance:
$string =~ m/Bill Clinton$/;
#true only when "Bill Clinton" is the last text in the string
i
This is the "case insensitivity" operator when used immediately after the
closing delimiter. For instance:
$string =~ m/Bill Clinton/i;
#true when $string contains "Bill Clinton" or "BilL ClInToN"