[ Perl tips index]
[ Subscribe to Perl tips ]
Often we need to search for one or more items in a larger list. This tip examines how we can do this and examines the efficiency of various techniques.
Let's pretend we have a list of usernames and we wish to find out whether a given name is in that list. A traditional approach using a foreach loop would look like this:
my $found = 0;
foreach my $username (@usernames) {
if($username eq $check_name) {
$found = 1;
last;
}
}
Those who have used Perl's grep function before would probably write:
my $found = grep { $_ eq $check_name } @usernames;
We could also use first from the Scalar::Util module (which comes
standard with Perl), which allows us to find the first occurrence of
the element in the list.
use Scalar::Util qw(first);
my $found = first { $_ eq $check_name } @usernames;
Each of these are fine solutions, although the solution using
foreach provides the best performance in most circumstances.
It avoids searching the entire list (which grep does even in
a scalar context), and the overhead of subroutine calls (which
first uses for each comparison).
Let's say we have a list of items and we want to test each one for existence in a much larger list. A simple solution may look like this:
foreach my $wanted_book (@books_I_want) {
foreach my $library_book (@library_books) {
if($library_book eq $wanted_book) {
# My book is available. Order it.
order_book($wanted_book);
last;
}
}
}
An alternate solution using grep could look like this:
foreach my $wanted_book (@books_I_want) {
if(grep { $_ eq $wanted_book} @library_books) {
# My book is available. Order it.
order_book($wanted_book);
last;
}
}
If we have many library books (for example approximately 10,000) and a medium
demand for book orders (approximately 150), one third of which do not exist in
the library, then we find that foreach is moderately faster (26%) than grep
which is moderately faster (26%) than first. Each of these take between one
third and half of a second to run on modern hardware.
If we're doing lots of searches, it becomes much faster to have our books in a hash, and use a hash-lookup:
my $found = exists $library_index{$book_name};
How much faster? About 100,000 times faster on average when compared to a linear search with foreach on a list with with 10,000 items in it. Of course, there is the additional overhead of building the hash in the first place. For two-character long keys, building a hash takes twice the time as building an array, and it's even longer for longer keys. It's just not worth the extra effort to build a hash if we're going to perform just a single search.
However in our library example we're walking over our arrays multiple times, making this a prime candidate for using a hash to speed things up. We can use a hash slice to add our keys to the hash in a single step.
my %library_search;
@library_search{@library_books} = ();
foreach my $book (@books_I_want) {
if(exists $library_search{$book}) {
# My book is available. Order it.
order_book($book);
}
}
Or alternately, putting our wanted books into a hash (notice the change in
the foreach loop's list).
# Add library books into a hash
my %books_I_want;
@books_I_want{@books_I_want} = ();
foreach my $library_book (@library_books) {
if(exists $books_I_want{$library_book}) {
# yay my book's available
order_book($library_book);
# We've ordered it, delete from hash
delete $books_I_want{$library_book};
}
# end if we have no more books we want
last unless keys %books_I_want;
}
Building the hash of all the library books (our first example) and then
searching is much faster (1928%) than our simple linear searches with
foreach, while building the hash of our wanted books (our second example) is
even faster (160%) than that.
Both our examples requires that Perl walks through each of our arrays at least once, either to build the hash (using our hash slice), or to search the hash (using our foreach loop). Our second example is the faster of the two since we're building a smaller hash, which takes less time. We also have the potential to exit our main loop early if we find all the books we're looking for.
However our second solution is only superior if we throw away our hashes at the end, and rebuild them whenever we need to order new books. If we're able to retain our hashes in a persistent process, then it's much faster to index all the library books once, and then loop over the smaller list each time:
foreach my $wanted_book (@books_I_want) {
if(exists $prebuilt_library_books{$wanted_book}) {
# yay my book's available
order_book($wanted_book);
}
}
In fact, it's blindingly fast, running at more than 78 times the speed of the fastest code that builds the hash each time. Whenever you build a hash for the purposes of searching, it's often worthwhile to hang onto it as long as possible.
We used Perl's Benchmark module to obtain these results as follows:
Rate first grep foreach lb-hash w-hash pre-built
first 2.03/s -- -20% -37% -97% -99% -100%
grep 2.55/s 26% -- -21% -96% -98% -100%
foreach 3.21/s 58% 26% -- -95% -98% -100%
lb-hash 65.0/s 3103% 2451% 1928% -- -62% -100%
w-hash 169/s 8238% 6540% 5178% 160% -- -99%
pre-built 13444/s 662497% 527577% 419353% 20588% 7846% --
(where lb-hash is building the hash from the library books and w-hash is building the hash from the wanted books). Note that these results will vary slightly from machine to machine and run to run.
We can tell by the rate that we can run the solution using first 2.03
times in a second, while we can use the solution using foreach 3.21
times in a second. The results with less library books (about 5000) are
much the same.
Benchmark is a standard Perl module, and its documentation can be found
on CPAN ( http://search.cpan.org/perldoc ) or can be obtained
with perldoc Benchmark.
The original code used to generate these tests can be found at http://perltraining.com.au/tips/listsearch.pl.html. Note that there will be some variation between results from one invocation to the next.
There are reasons why you may have large lists instead of hashes. For
example you may need the elements to be ordered, or an array may make more
sense at other points of your program. However, if you are likely to be
doing lots of random accesses into a list, including looking for items
which do not exist, then it may be more efficient to create a hash. In our
situation, even with the cost of creating a hash of all of the library books,
we discovered a massive performance increase of over 1900% than using a simple
foreach loop. Using a better approach and creating a hash of our wanted
books was an improvement of more than 5100% over the simple foreach loop.
If we can instead use a hash instead of a list our lookups are even faster.
[ Perl tips index ]
[ Subscribe to Perl tips ]
| Location | Course | Course Date | Duration | Early Bird Date |
|---|---|---|---|---|
| Melbourne | Programming Perl | Tue 2 Sep 2008 | 4 days | Mon 4 Aug 2008 |
| Sydney | Programming Perl | Tue 7 Oct 2008 | 4 days | Mon 8 Sep 2008 |
| Canberra | Programming Perl | Mon 24 Nov 2008 | 4 days | Mon 27 Oct 2008 |
For future dates, please see our training calendar.
This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.
If you have any questions please don't hesitate to contact us:
| Email: | contact@perltraining.com.au |
| Phone: | 03 9354 6001 (Australia) |
| International: | +61 3 9354 6001 |
Copyright 2001-2008 Perl Training Australia. Contact us at contact@perltraining.com.au