[ Perl tips index ]
[ Subscribe to Perl tips ]
Today's Perl tip is about a simple programming concept, but one that's easy to get wrong: basic operations when dealing with lists of data. Let's pretend we wish to find the unique elements in an array. For example, if we had:
my @array = (1, 1, 1, 2, 1);
we'd want to get back:
my @unique = (1, 2);
This problem is naively solved using a nested loop. For example:
# This is an AWFUL solution
my @array = (1, 1, 1, 2, 1);
my @uniques;
FIND:
foreach my $element (@array) {
foreach my $unique (@uniques) {
if($element == $unique) {
# This is already in our unique array
# so skip to the next element
next FIND;
}
}
# If we're here, it's not yet in our unique array, so add
# it.
push @uniques, $element;
}
# Uniques now contains all the unique elements
print "@uniques\n";
While this solution looks correct, and may seem fast when dealing with
small lists, it's extremely slow when working with large
lists. In fact this algorithm has a complexity of O(n2), which
means the time taken increases in proportion to the square of
the number of elements. (See our Perl tip on Big-O
notation for more information.)
A better way to find the unique elements in a list is to use a hash,
since it guarantees its keys are unique. To build a hash, we only need
to walk through our list once, which gives us an algorithmic complexity
of O(n), meaning our time scales in linear proportion to the number
of elements we're processing.
We could use a looping construct to build our hash, but Perl actually provides us with a short-cut, known as a hash slice. A hash slice allows us to reference multiple keys at once, and looks a little like this:
my %favourite_colour;
# A hash slice still uses curly-braces around the keys, but
# an '@' to access multiple keys at once.
@favourite_colour{ qw(Paul Jacinta Sean) } = qw(Blue Green Red);
We can use the fact that hash keys are unique, as well as the short-cut of using a hash-slice, to very quickly find our unique set:
my @duplicates = (1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 2, 3, 3,);
my %unique;
# Our hash here has keys, but all its values will be undefined.
# That's fine, since we're only interested in the keys.
@unique{@duplicates} = ();
my @unique_elements = sort keys %unique;
print "@unique_elements\n"; # 1 2 3 4 5 6 7 8 9
If we did not sort the results, in the previous example, we might get back the set:
6 3 7 9 2 8 1 4 5
which is not related to the original order of @duplicates.
This is because hashes do not keep any internal ordering. If it is
important that the original ordering is preserved we can use
Tie::IxHash (covered in our tip about Ordered
hashes.
Tie::IxHash does not come standard with Perl, but can be downloaded
from the CPAN.
use Tie::IxHash;
tie my %unique => 'Tie::IxHash';
my @duplicates = (1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 2, 3, 3,);
@unique{@duplicates} = ();
my @unique_elements = keys %unique;
print "@unique_elements\n"; # 1 2 3 4 5 6 7 8 9
This works even if our @duplicates are words:
my @duplicates = qw(e d c b e e e b c a a d e e a b );
@unique{@duplicates} = ();
my @unique_elements = keys %unique;
print "@unique_elements\n"; # e d c b a
Although hash slices provide us with a very useful way to gain unique
values, there's another, even easier way to achieve the same purpose. The
List::MoreUtils module (available from the CPAN) provides a uniq
function which with the same results as our Tie::IxHash example
above:
use List::MoreUtils qw(uniq);
my @duplicates = (1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, 9, 2, 3, 3,);
my @unique_elements = uniq( @duplicates );
print "@unique_elements\n"; # 1 2 3 4 5 6 7 8 9
The List::MoreUtils subroutines are not only easier to understand
than our hand-rolled code, but since they're implemented in pure C
they should also outperform our home-grown code by a significant
margin.
For more information about keeping insert order for hashes see our
Perl tip on
Ordered Hashes and
Tie::IxHash.
For more on List::MoreUtils and the other excellent methods it provides
see the
documentation.
For more on algorithmic complexity and why the naive solution is a bad choice see our tip on Big-O notation.
[ Perl tips index ]
[ Subscribe to Perl tips ]
This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.
If you have any questions please don't hesitate to contact us:
| Email: | contact@perltraining.com.au |
| Phone: | 03 9354 6001 (Australia) |
| International: | +61 3 9354 6001 |
Copyright Perl Training Australia. Contact us at contact@perltraining.com.au