In an earlier tip we discussed some of the changes to regular expressions in Perl 5.10. In particular Perl 5.10 allows us to name the captures we want to make inside a regular expression. In this tip we explore more powerful capturing techniques, as well as using named groups to parse complex grammars.
Perl's regular expressions set the match variables $1, $2 and friends
by counting each opening parenthesis in an expression. When we embed other
variables inside regular expressions, this can make it very hard to identify
which match variable will be set by later parentheses. For example, which
match variable will receive the last sequence of digits in the
following expression? What if $customer_name_regexp contains
parentheses too?
/ (\d+) $customer_name_regexp (\d+) /x;
In Perl 5.10 we can name our captures:
/ (?<account>\d+) $customer_name_regexp (?<credit>\d+) /x;
and then access our values using $+{account} and $+{credit}. While
you won't see it very often, we're actually looking up entries in the
special hash %+.
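To make the difference concrete, here is a minimal sketch. The pattern stored in $customer_name_regexp and the sample input line are made up for illustration, and the \s+ separators are an addition so that the sample line matches; only the named-capture syntax comes from the text above.

```perl
use strict;
use warnings;
use 5.010;    # named captures and say() need Perl 5.10 or later

# Hypothetical name pattern; note that it brings its own capturing parentheses.
my $customer_name_regexp = qr{ (\w+) \s+ (\w+) }x;

# Made-up input line: account number, customer name, credit.
my $line = '12345678 Jane Citizen 250';

my ( $account, $credit, $second, $fourth );
if ( $line =~ / (?<account>\d+) \s+ $customer_name_regexp \s+ (?<credit>\d+) /x ) {
    # The names stay the same no matter how many groups the embedded pattern adds.
    ( $account, $credit ) = ( $+{account}, $+{credit} );

    # The numbered variables shift: the credit is group 4 here, not group 2.
    ( $second, $fourth ) = ( $2, $4 );

    say "Account: $account";
    say "Credit:  $credit";
    say "\$2 holds '$second', \$4 holds '$fourth'";
}
```

With this input, $2 holds the customer's first name, while the credit has been pushed along to $4.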
We can use the power of named captures to allow us to build up complex
regular expressions out of smaller, simpler pieces - and still trust them
to work. We can use qr{} (quoted regular expressions) to create our regular
expression snippets (these are covered in more detail
in another tip).
In this example we build an expression to match a title and another to match a name before combining them to pull information out of a letter:
my $title = qr{
    (?<title>
        Mrs|Mr|Ms|Miss|Dr
    )
}x;
my $name = qr{
    (?<name> \w+ )
}x;
$letter =~ m{
    Dear \s $title \s $name,
}x;
say "Title: $+{title}";
say "Name: $+{name}";
As we have used named captures, we know that $+{title} will be set for
any successful regular expression that includes the expression in
$title. Most importantly, this makes our expressions much
more maintainable; rather than looking at ugly regexp syntax and numbered
variables, we're now looking at meaningful names. If we update a regexp,
say by allowing hyphens in names and ensuring we match word-boundaries,
we can be sure that all code that uses $name will use that update:
my $name = qr{
(?<name> \b [\w-]+ \b )
}x;
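As a quick check, the pieces can be put together into a complete script. This is only a sketch: the letter text is invented for illustration, and copying %+ into a lexical %found hash is one possible way to keep the values around, not something the original code does.

```perl
use strict;
use warnings;
use 5.010;

my $title = qr{ (?<title> Mrs|Mr|Ms|Miss|Dr ) }x;
my $name  = qr{ (?<name> \b [\w-]+ \b ) }x;

# Made-up letter text for illustration.
my $letter = "Dear Ms Smith-Jones,\nThank you for your order.\n";

my %found;
if ( $letter =~ m{ Dear \s $title \s $name, }x ) {
    # Copy the named captures before a later match can overwrite %+.
    %found = %+;
    say "Title: $found{title}";
    say "Name:  $found{name}";
}
```

Because the updated $name allows hyphens, the whole of Smith-Jones is captured rather than just Smith.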
If a named capture is used more than once, $+{name} refers only to
the leftmost group of that name that took part in the match. However,
all matches can be accessed via the special %- hash, which holds an
array reference for each name:
my $account = qr{ (?<account> \d{8} ) }x;
my $money   = qr{ \$ (?<money> \d+\.\d{2} ) }x;
my $date    = qr{ (?<date> \d{4}-\d{2}-\d{2} ) }x;
if(/Transferred $money from $account to $account on $date/) {
say "From account: $-{account}[0]";
say "To account: $-{account}[1]";
}
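The following sketch runs that pattern against a made-up log line (the text in $log and the variable names $from, $to and $first are assumptions for illustration) and compares what %- and %+ report for the repeated account group.

```perl
use strict;
use warnings;
use 5.010;

my $account = qr{ (?<account> \d{8} ) }x;
my $money   = qr{ \$ (?<money> \d+\.\d{2} ) }x;
my $date    = qr{ (?<date> \d{4}-\d{2}-\d{2} ) }x;

# Made-up log line for illustration.
my $log = 'Transferred $150.00 from 11112222 to 33334444 on 2012-03-01';

my ( $from, $to, $first );
if ( $log =~ /Transferred $money from $account to $account on $date/ ) {
    # %- holds one array reference per name, with an entry for each group.
    ( $from, $to ) = @{ $-{account} };

    # %+ holds a single value per name.
    $first = $+{account};

    say "From account: $from";
    say "To account:   $to";
    say "\$+{account}:  $first";
}
```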
A common problem with Perl 5's normal back references is that you can't
build up patterns which use them, because they rely on knowing the number
of the capture group you want to match. In the following expression, we
don't know what \2 will refer to, as $some_regexp may also contain captures:
/ (\d+) $some_regexp (\d+) \s+ \2 /x;
Perl 5.10 provides a new back reference syntax. We can use \g{-1} to
refer to the previous capture, which means we can always be sure of
getting the same result even if $some_regexp contains captures:
/ (\d+) $some_regexp (\d+) \s+ \g{-1} /x;
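A small sketch of the difference, assuming a hypothetical $some_regexp that happens to contain one capture of its own. The sample strings and the %result hash are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical embedded pattern that contains a capture of its own.
my $some_regexp = qr{ \s+ (\w+) \s+ }x;

my $relative = qr/ (\d+) $some_regexp (\d+) \s+ \g{-1} /x;
my $absolute = qr/ (\d+) $some_regexp (\d+) \s+ \2     /x;

my %result;
for my $text ( '10 apples 25 25', '10 apples 25 apples' ) {
    $result{$text} = {
        relative => ( $text =~ $relative ? 1 : 0 ),
        absolute => ( $text =~ $absolute ? 1 : 0 ),
    };
    say "$text: relative=$result{$text}{relative} absolute=$result{$text}{absolute}";
}
```

Here \g{-1} always means the second run of digits, whereas \2 has silently become the word captured inside $some_regexp.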
We can also use \g{1} to refer to the first capture (the same as \1),
and \g{label} to refer to the first capture with a name of label. This
last form allows writing of regexps like the following, which capture an
account number by name, and then look for the same account later in the
regexp:
/ (?<account> \d{8}) $some_regexp \g{account} /x
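As an illustration, the sketch below wraps that pattern in qr// and tries it on two made-up strings. The contents of $some_regexp and the names $same_account, $matches_same and $matches_other are assumptions, not part of the original tip.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical filler pattern between the two account numbers.
my $some_regexp = qr{ \s+ (?:to|from) \s+ }x;

my $same_account = qr/ (?<account> \d{8}) $some_regexp \g{account} /x;

# Matches only when the second number repeats the captured account.
my $matches_same  = ( '12345678 to 12345678' =~ $same_account ) ? 1 : 0;
my $matches_other = ( '12345678 to 87654321' =~ $same_account ) ? 1 : 0;

say "same account:  $matches_same";
say "other account: $matches_other";
```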
The (?(DEFINE)...) construct allows us to define parts of a regular
expression which are not immediately executed as part of a match, but
which can be recursed into later using (?&NAME). This allows us to
create powerful regular expressions which can match recursive
structures such as grammars.
Let's look at an example designed to recognise a simple set of
algebraic expressions. For example we'd like to match the valid
expressions a=x+1 and x=2, but not the invalid expression =a.
my $expression = qr{
    (?(DEFINE)
        (?<expr>       (?&term) (?&opterm)?                )
        (?<term>       (?&identifier) | (?&number)         )
        (?<opterm>     (?&operator) (?&term) (?&opterm)?   )
        (?<operator>   [=+*/-]                             )
        (?<identifier> [A-Za-z][A-Za-z0-9]*                )
        (?<number>     [0-9]+                              )
    )
    (?&expr)
}x;
There's a lot going on here, so let's look at it line by line:
my $expression = qr{
We use qr{} to create a quoted regular expression and assign it
to $expression. This does not run the expression; we're just building
it for later.
(?(DEFINE)
This tells Perl that we're defining a set of rules. None of the parentheses inside a DEFINE block capture; they only act to group terms together.
(?<expr> (?&term) (?&opterm)? )
This defines the named group expr and says that an expr is a
term which may (optionally) be followed by an opterm. We'll find out
what is allowed as a term and an opterm as we progress through the
expression.
(?<term> (?&identifier) | (?&number) )
A term is either an identifier or a number.
(?<opterm> (?&operator) (?&term) (?&opterm)? )
An opterm is an operator followed by a term, which may then
be optionally followed by another opterm. Note here that we're
defining an opterm in terms of itself!
(?<operator> [=+*/-] )
An operator is one of the five basic algebraic operations
(equals, plus, multiply, divide, and minus).
(?<identifier> [A-Za-z][A-Za-z0-9]* )
An identifier is a sequence of letters and digits, but this
sequence must begin with a letter. For example, value and
x3 are both considered valid identifiers.
(?<number> [0-9]+ )
A number is one or more digits between 0 and 9.
)
Finally, we close our DEFINE block.
(?&expr)
A DEFINE block merely specifies a set of rules. In order to
use those rules, we need to specify which rule to start with. Thus we
tell Perl to recurse into the expr rule to start matching.
}x;
The end of our expression. We're using extended regular expressions (the x flag), so whitespace and comments inside the pattern are ignored.
To show this in action, let's consider how a few expressions might be broken up (spaces are shown for readability only; the grammar itself does not allow whitespace in the input):
x = 3           => expr
                => term opterm
                => ident op term
                => (x) (=) number
                => (x) (=) (3)
x = a + b * c   => expr
                => term opterm
                => ident op term opterm
                => (x) (=) ident op term opterm
                => (x) (=) (a) (+) ident op term
                => (x) (=) (a) (+) (b) (*) ident
                => (x) (=) (a) (+) (b) (*) (c)
However this cannot match the following:
x = a =         => expr
                => term opterm
                => ident op term opterm
                => (x) (=) ident op term
                => (x) (=) (a) (=) ???
because we're missing the final term.
In order to use the regular expression we've built up in $expression, we
just include it inside a regular expression where we want it, as follows:
while (<>) {
say "That's an expression" if /^ $expression $/x;
}
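To try the grammar without typing input by hand, the sketch below runs it over a fixed list of candidate strings instead of reading from standard input. The list in @candidates and the %is_valid hash are made up for illustration; the grammar itself is the one defined earlier.

```perl
use strict;
use warnings;
use 5.010;

my $expression = qr{
    (?(DEFINE)
        (?<expr>       (?&term) (?&opterm)?                )
        (?<term>       (?&identifier) | (?&number)         )
        (?<opterm>     (?&operator) (?&term) (?&opterm)?   )
        (?<operator>   [=+*/-]                             )
        (?<identifier> [A-Za-z][A-Za-z0-9]*                )
        (?<number>     [0-9]+                              )
    )
    (?&expr)
}x;

# Made-up test inputs: the last three should be rejected.
my @candidates = ( 'a=x+1', 'x=2', 'x=a+b*c', '=a', 'x=a=', 'x = 3' );

my %is_valid;
for my $candidate (@candidates) {
    $is_valid{$candidate} = ( $candidate =~ /^ $expression $/x ) ? 1 : 0;
    say sprintf "%-8s %s", $candidate, $is_valid{$candidate} ? 'valid' : 'invalid';
}
```

Note that 'x = 3' is rejected because none of the rules match whitespace in the input; the x flag only affects whitespace written inside the pattern.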
Unfortunately, the named blocks inside a DEFINE section
do not capture, so additional work may be required to extract the
information you're after. However this still allows the regexp engine
to be used for some very powerful tasks that were previously
impossible for the average developer.
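One possible way to do that extra work, sketched here as an assumption rather than a recommended recipe, is to wrap calls to the rules in ordinary named captures placed outside the DEFINE block. The pattern $assignment and the capture names target and value are hypothetical; the rules are the same as in the grammar above.

```perl
use strict;
use warnings;
use 5.010;

my $assignment = qr{
    (?(DEFINE)
        (?<expr>       (?&term) (?&opterm)?                )
        (?<term>       (?&identifier) | (?&number)         )
        (?<opterm>     (?&operator) (?&term) (?&opterm)?   )
        (?<operator>   [=+*/-]                             )
        (?<identifier> [A-Za-z][A-Za-z0-9]*                )
        (?<number>     [0-9]+                              )
    )
    # These two groups sit outside the DEFINE block, so they do capture.
    ^ (?<target> (?&identifier) ) = (?<value> (?&expr) ) $
}x;

my ( $target, $value );
if ( 'x=a+b*c' =~ $assignment ) {
    ( $target, $value ) = ( $+{target}, $+{value} );
    say "Target: $target";
    say "Value:  $value";
}
```

This only recovers the text matched by each top-level piece; it does not build a parse tree of the nested terms.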
This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this copyright notice attached.
Copyright 2001-2012 Perl Training Australia. Contact us at contact@perltraining.com.au