Regular expressions in Perl 5.10

[ Perl tips index ]
[ Subscribe to Perl tips ]

Perl 5.10 was released late last year, and with it come a number of significant improvements to the language. We'll be running a series of Perl tips covering some of the changes, and how you can use them to make your life easier.

say

Perl 5.10 finally has print with a newline! It's called say, and can be enabled with:

        use feature 'say';

at the top of any program or module that needs it. You can then simply write:

        say "Hello World!";             # No \n needed!

rather than:

        print "Hello World!\n";

While we'll be discussing new functions and constructs in a later Perl-tip, the say function is so handy we wanted to mention it before anything else.
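As a minimal sketch of how say compares with print, the script below writes to an in-memory filehandle so the result can be inspected. The $output and $fh names are ours and are used purely for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Capture everything in a string so we can look at exactly what was written.
my $output = '';
open(my $fh, '>', \$output) or die "Cannot open in-memory filehandle: $!";

say   {$fh} "Hello World!";        # newline is added for us
print {$fh} "Hello World!\n";      # same result, newline written by hand
say   {$fh} "a", "b", "c";         # a list gets one newline, after the last item

close $fh;
print $output;
```

Writing use 5.010; at the top of a program also enables say, along with the other features introduced in Perl 5.10.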

Debugging regular expressions

One of the largest improvements in Perl 5.10 has been in the area of regular expressions (regexes). To get started, it's now possible to debug your regexes with:

        use re 'debug';
        $some_string =~ /some_regexp/;

use re 'debug' also existed in Perl 5.8, but its behaviour there was global, resulting in debugging information for all your regexes. In 5.10 the pragma has lexical scope, meaning it lasts only until the end of the current block, file, or eval:

        {
                use re 'debug';
                $some_string =~ /some_regexp/;  # This gets debugged.
        }
        $some_string =~ /some_other_regexp/;    # This isn't debugged.

You can also write no re 'debug'; to turn regex debugging off again, without having to play around with blocks:

        use re 'debug';
        $some_string =~ /some_regexp/;  # This gets debugged.
        no re 'debug';
        $some_string =~ /some_other_regexp/; # This isn't debugged.
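For anyone who wants to try this out, here is a small self-contained sketch of the block-scoped form. The sample string and the variable names are made up for illustration; the trace itself is written to STDERR, so it will not mix with the program's normal output.

```perl
use strict;
use warnings;

my $text = "order 42 shipped";     # hypothetical sample input

my ($debugged, $quiet);
{
    use re 'debug';                # lexically scoped: applies inside this block only
    $debugged = $text =~ /\d+/;    # compilation and matching are traced on STDERR
}
$quiet = $text =~ /shipped/;       # outside the block: no trace output

print "both matched\n" if $debugged && $quiet;
```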

Named capture variables

We've always been able to capture information from regular expressions using parentheses, and to recall it using the match variables $1, $2, $3 and so on. However, it can sometimes be rather challenging to tell which match variable you want.

This can be doubly challenging when we interpolate smaller regexps into bigger ones. For example, what match variable will the last sequence of digits be placed into in the following expression?

        / (\d+) $customer_name_regexp (\d+) /x;

Keep in mind that $customer_name_regexp may or may not contain parentheses itself.

In Perl 5.10 we can now have named captures. This means we can write:

        / (?<account>\d+) $customer_name_regexp (?<credit>\d+) /x;

Using (?<name>...) syntax allows us to capture a match and then later refer to it by name. We can also refer to it by its regular match number, so our account match above can still be referred to as $1.

In order to retrieve named match information, we can use the special hash %+:

        say "Customer account number is $+{account}";
        say "Customer credit balance is $+{credit}";
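To make this concrete, here is a sketch that assumes a $customer_name_regexp with two capturing groups of its own and a made-up input line. Both are our own assumptions, chosen only to show that the names stay stable while the numbered variables shift.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sub-pattern; note that it contains its own capturing parentheses.
my $customer_name_regexp = qr/ ([A-Z][a-z]+) \s+ ([A-Z][a-z]+) /x;

my $line = "1234 Jane Citizen 500";    # made-up sample record

my ($account, $credit, $fourth);
if ($line =~ / (?<account>\d+) \s+ $customer_name_regexp \s+ (?<credit>\d+) /x) {
    $account = $+{account};
    $credit  = $+{credit};
    $fourth  = $4;                     # the credit digits landed in $4, not $2

    say "Customer account number is $account";
    say "Customer credit balance is $credit";
}
```

If the name sub-pattern later gains or loses a set of parentheses, the number of the last group changes, but $+{credit} keeps working.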

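Named captures can also be used in substitutions. Inside the pattern itself, the new \k<name> sequence acts as a backreference to a named capture; in the replacement part, the captured text is read from %+. For example, we can swap the first and last words on a line (ignoring punctuation) using:

        s{
                ^
                (?<first>  \w+)
                (?<middle> .* ) \b
                (?<last>   \w+)
                $
         }
         {$+{last}$+{middle}$+{first}}x;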
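The following sketch shows the two ways of referring back to a named capture side by side: \k<name> as a backreference inside a pattern, and %+ in the replacement part of a substitution. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Inside a pattern, \k<word> matches whatever (?<word>...) captured earlier.
my $text = "this is is a test";        # made-up sample input
my $doubled;
if ($text =~ / \b (?<word>\w+) \s+ \k<word> \b /x) {
    $doubled = $+{word};
    say "Doubled word: $doubled";
}

# In the replacement part, named captures are read from the %+ hash.
my $line = "hello big world";          # made-up sample input
(my $swapped = $line) =~ s{ ^ (?<first>\w+) (?<middle>.*) \b (?<last>\w+) $ }
                           {$+{last}$+{middle}$+{first}}x;
say $swapped;
```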

Alternatives to $`, $& and $'

The special regex variables $`, $& and $' hold everything before, inside, and after a match respectively. However, they come at a great cost: mentioning one of these special variables anywhere in your program turns them on for all your regular expressions, even those that don't need them. As such, the use of these variables is strongly discouraged in all but the simplest of programs.

However they can be very useful. There are some algorithms that really appreciate knowing everything that was before or after a given match.

In Perl 5.10 there's a new regexp modifier, /p, that gives us all the convenience of the old $`, $& and $' variables, but without the global performance penalty. Here's how it works:

        /(foo|bar|baz)/p;
        say "Everything before the match: ${^PREMATCH}";
        say "Everything inside the match: ${^MATCH}";
        say "Everything after  the match: ${^POSTMATCH}";

The ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} variables are only set when the /p switch is used.
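As a small runnable sketch, here is the same match applied to a made-up string, with the three pieces copied into ordinary variables (the names $pre, $match and $post are ours):

```perl
use strict;
use warnings;
use feature 'say';

my $text = "one foo two";              # made-up sample input

my ($pre, $match, $post);
if ($text =~ /(foo|bar|baz)/p) {
    # Copy the values straight away; the next successful match replaces them.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});

    say "Everything before the match: [$pre]";
    say "Everything inside the match: [$match]";
    say "Everything after  the match: [$post]";
}
```

The square brackets in the output are only there to make leading and trailing spaces visible.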

More information

This tip only reveals some of the improvements made to the regexp engine in Perl 5.10. A lot of advanced features have been added, and a lot of new optimisations and improvements have been made under the hood.

For further information, we recommend the following resources:



This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this copyright notice attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001
