Back-references

Back-references allow us to re-match the same sequence of characters matched by an earlier set of parentheses. For example, we could use back-references to match repeating words:

    this this
    that that
    bang bang

While we can use our capture variables in substitutions, they are no use inside a simple match pattern, because $1 and friends aren't set until after the match is complete. Something like:

    say if m{(\b\w+) $1\b};

will not match "this this" or "that that". Rather, it will match a word followed by a space, followed by whatever $1 was set to by an earlier match.
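To see the problem in action, here is a minimal sketch. The sample strings and the names $pattern and @matched are our own illustration, not part of the original tip. The key point is that $1 is interpolated when the pattern is built, so it carries whatever an earlier match left behind:

```perl
use strict;
use warnings;
use feature 'say';

# An earlier, unrelated match leaves "cat" in $1.
"cat" =~ m{(\w+)} or die "setup match failed";

# $1 is interpolated when the pattern is built, so this is the
# same as writing qr{(\b\w+) cat\b}.
my $pattern = qr{(\b\w+) $1\b};

my @matched;
for my $text ("this this", "black cat") {
    push @matched, $text if $text =~ $pattern;
}

say for @matched;    # prints "black cat" only
```

The repeated word is missed entirely, while an unrelated string matches simply because it ends with the stale value of $1.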

In order to match "this this" (or "that that") we need to use the special regular expression metacharacters called back-references. These are written \1, \2, and so on. They refer to parenthesized parts of a match pattern, just as $1 does, but within the same match rather than referring back to the previous match.

    # say if we find repeated words, eg: "this this"
    say $1 if m{(\b\w+) \1\b};
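The two forms work well together in a substitution. The following is a small sketch (the sample text and the variable names are assumptions of ours) that collapses doubled words: the back-reference \1 is used inside the pattern, and $1 is used in the replacement, where it is safe because the match has already completed.

```perl
use strict;
use warnings;
use feature 'say';

my $text = "this this is is a test";

# \1 (inside the pattern) must equal what the parentheses just matched;
# $1 (in the replacement) is set by the time the replacement is built.
(my $fixed = $text) =~ s{\b(\w+) \1\b}{$1}g;

say $fixed;    # prints "this is a test"
```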

Back-references for named matches

Along with named matches, Perl 5.10.0 provides us with named back-references using the \g{name} syntax:

    say $+{repeated} if m{(?<repeated>\b\w+) \g{repeated}\b};
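Named back-references are especially readable when a pattern has more than one group. As a hedged illustration (the group names "quote" and "body" and the sample strings are our own), here is a sketch that finds a quoted string by requiring the closing quote to be the same character as the opening one:

```perl
use strict;
use warnings;
use feature 'say';

# "quote" captures the opening quote character; \g{quote} then
# insists that the closing quote is the same character.
my $quoted = qr{(?<quote>["'])(?<body>.*?)\g{quote}};

my @bodies;
for my $source (q{say "it's fine"}, q{say 'he said "hi"'}) {
    push @bodies, $+{body} if $source =~ $quoted;
}

say for @bodies;
# prints:
#   it's fine
#   he said "hi"
```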

Relative back-references

We can also use the \g{} syntax to refer to recent capture groups by counting backwards from the back-reference. \g{-1} refers to the most recent set of capturing parentheses before it (including named matches), \g{-2} to the second most recent set, and so on. Thus the above could also have been written:

    say $+{repeated} if m{(?<repeated>\b\w+) \g{-1}\b};
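Relative back-references are most useful when a pattern fragment is built once and then embedded in larger patterns, because the fragment does not need to know how many groups come before it. The sketch below is our own example (the fragment names and the sample line are assumptions) comparing a relative fragment with an absolute one:

```perl
use strict;
use warnings;
use feature 'say';

# Reusable fragments: one relative, one absolute.
my $relative = qr{\b(\w+) \g{-1}\b};
my $absolute = qr{\b(\w+) \g{1}\b};

my $line = "42: we saw the the error";

# In the combined patterns the fragment's parentheses become group 2.
# \g{-1} still points at them; \g{1} now points at (\d+) instead.
my @relative = $line =~ m{^(\d+): .*?$relative};
my @absolute = $line =~ m{^(\d+): .*?$absolute};

say "relative: @relative";                                  # prints "relative: 42 the"
say "absolute: ", (@absolute ? "@absolute" : "no match");   # prints "absolute: no match"
```

The absolute version fails here because, once embedded, it is looking for a word followed by "42" rather than a repeated word.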

Safe back-references

Finally, we can use \g{} to write regular back-references in a way that is always safe. Inside a regular expression, \10 can mean either the character whose ordinal in octal is 010 (a backspace) or, if at least 10 sets of capturing parentheses have already appeared in the regular expression, the 10th back-reference. Thus it is better (for clarity and to avoid surprises) to always use \g{}. \1 and \g{1} are identical.

    say $1 if m{(\b\w+) \g{1}\b};
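The following sketch illustrates the ambiguity, under the assumption that the sample strings and variable names are ours. The same \10 is treated as a backspace when there is only one group, and as a back-reference when there are ten. By contrast, \g{10} with too few groups is rejected outright; the pattern is built from a string inside eval so that the failure happens at run time and can be caught.

```perl
use strict;
use warnings;
use feature 'say';

# One group: \10 cannot be a back-reference, so it is read as
# octal 010, the backspace character.
my $octal = "a\x08" =~ m{(a)\10} ? "yes" : "no";

# Ten groups: the same \10 is now a back-reference to group 10 ("j").
my $ten       = qr{(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10};
my $backref   = "abcdefghijj"    =~ $ten ? "yes" : "no";
my $backspace = "abcdefghij\x08" =~ $ten ? "yes" : "no";

# \g{10} is never read as octal. With too few groups the pattern
# fails to compile, so eval returns undef.
my $source   = '(a)\g{10}';
my $compiled = eval { qr{$source} };

say "one group, \\10 matches a backspace:   $octal";      # yes
say "ten groups, \\10 matches repeated 'j': $backref";    # yes
say "ten groups, \\10 matches a backspace:  $backspace";  # no
say "\\g{10} with one group compiles:       ", defined $compiled ? "yes" : "no";   # no
```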

Other syntaxes

The braces in the above examples are not required if you're using a numbered back-reference. Thus sometimes you'll see \g1 and \g2 etc:

    say $1 if m{(\b\w+) \g1\b};

Likewise, named matches can also use \k<name>:

    say $+{repeated} if m{(?<repeated>\b\w+) \k<repeated>\b};

Although these syntaxes are allowed, we recommend always using \g{} for consistency.
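As a quick check that these spellings really are interchangeable, here is a small sketch. The list of patterns and the test strings are our own, and the group name "word" is just an example:

```perl
use strict;
use warnings;
use feature 'say';

# Each entry pairs a label with an equivalent repeated-word pattern.
my @syntaxes = (
    [ '\1',       qr{(\b\w+) \1\b} ],
    [ '\g1',      qr{(\b\w+) \g1\b} ],
    [ '\g{1}',    qr{(\b\w+) \g{1}\b} ],
    [ '\g{word}', qr{(?<word>\b\w+) \g{word}\b} ],
    [ '\k<word>', qr{(?<word>\b\w+) \k<word>\b} ],
);

# A syntax "agrees" if it accepts a repeated word and rejects a
# pair of different words.
my @agree;
for my $syntax (@syntaxes) {
    my ($label, $pattern) = @$syntax;
    push @agree, $label
        if "bang bang" =~ $pattern and "bang boom" !~ $pattern;
}

say scalar(@agree), " of ", scalar(@syntaxes), " syntaxes agree";
# prints "5 of 5 syntaxes agree"
```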

Back-reference basics

Regardless of the syntax we use to write our back-reference, it is important to remember that a back-reference will match only the exact characters that the corresponding set of parentheses matched. This is why our repeated-word patterns need both word boundaries. Without them we'd match part-way through words:

    say $1 if m{(\w+) \g{1}};
    # matches: "this is a test" (prints "is")
    # matches: "an antelope ate the apples" (prints "an")

In the first case, the parentheses start their match part-way through a word (at "is" in "this") and the back-reference matches "is" standing alone. In the second case, the parentheses match a whole word ("an") but the back-reference matches only part of a word ("an" in "antelope"). If we wish to match duplicate words, we need to match only full words, so we require that both the parentheses and the back-reference be bounded by word boundaries.
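To compare the two behaviours side by side, here is a short sketch. The third sample string and all variable names are assumptions added for illustration:

```perl
use strict;
use warnings;
use feature 'say';

my $loose  = qr{(\w+) \g{1}};        # no word boundaries
my $strict = qr{(\b\w+) \g{1}\b};    # both word boundaries

my @texts = (
    "this is a test",
    "an antelope ate the apples",
    "it was was fine",
);

my (@loose_hits, @strict_hits);
for my $text (@texts) {
    push @loose_hits,  $text =~ $loose  ? $1 : "none";
    push @strict_hits, $text =~ $strict ? $1 : "none";
}

say "$texts[$_]: loose=$loose_hits[$_], strict=$strict_hits[$_]"
    for 0 .. $#texts;
# prints:
#   this is a test: loose=is, strict=none
#   an antelope ate the apples: loose=an, strict=none
#   it was was fine: loose=was, strict=was
```

Only the genuinely repeated word survives once both boundaries are in place.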

More information

For more information on back-references, check out the handy Perl Regular Expression Tutorial (perlretut).

This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this copyright notice attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001