Working with comma/tab separated data

[ Perl tips index ]
[ Subscribe to Perl tips ]

Sooner or later you'll be given some data in comma or tab separated data to read into your program. It's tempting to use split to do the task for you, and for small, tidy data sets this might be sufficient. The problem is that real-world data is rarely small or tidy.

Real world data contains commas

Consider for a moment, the data shown below:

        Item ID,Arrival Date,Supplier ID,Name,Description,Quantity,QC
        234,2010/03/17,1234,Wudget,Attach your frames to the wall without a problem,5,18
        245,2010/03/17,1234,Widget,For all oven uses,12,12
        267,2010/03/17,1225,Acme Play,Ages 5 and up,310
        890,2010/03/17,2344,"A4 recycled ream, 500 pages",photo-copier stationary,5,0

Hidden down on the last line, we have an embedded comma. As such we've had to quote that field.

If we were to use split, however, it wouldn't know that that field is quoted and so we'd end up with 8 fields instead of 7. Work-arounds aside, split rapidly becomes the wrong tool for the job. Fortunately there are modules to help solve all the kinds of problems you can hit with CSV and other *SV files. We recommend Text::CSV_XS.

Reading from a CSV file with Text::CSV_XS

Text::CSV_XS handles escapes, quoting and any other special behaviour. It is more robust and well tested than any alternative you are likely to be able to quickly create on your own.

In the most simple version, we create a Text::CSV_XS object, read one (or more) lines in from a CSV file and work with that data. For example:

        use Text::CSV_XS;

        # create my object
        my $csv = Text::CSV_XS->new()
                or die "Cannot use CSV: ".Text::CSV_XS->error_diag();

        # open my file
        open my $fh, "<", "test.csv" or die "test.csv: $!";

        # read one (or more) lines in from the file (getline parses the CSV for us)
        while (my $row = $csv->getline ($fh)) {

                # do something with $row (it's an array reference)
                push @rows, $row;

        }
        # check for errors
        $csv->eof or $csv->error_diag();

        # close our file
        close $fh;

The above example will work with most CSV files without a hitch, however we can also do more cool things. We can handle files which might have non-ASCII characters (such as Microsoft Word "smart quotes") or which may be encoded in Unicode (or both). We can get Text::CSV_XS to do its own error handling and throw exceptions on failure rather than requiring us to do our own checking.

We can also take advantage of the fact that the first line of our data contains the field names, to create hashes of fieldname => value rather than just relying on position. Here's a more complete example taking advantage of these extra options:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use autodie;
        use Text::CSV_XS;

        # Create our Text::CSV_XS object, file may contain binary
        # characters.  auto_diag will make it throw an exception on
        # failure.

        my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

        # Open a file for reading (this is just a regular open, although
        # we're going to assume that the file uses utf8 encoding)

        open my $fh, "<:encoding(utf8)", "test.csv";

        # The first line of our CSV are all of the field names.  Capture
        # this so that we don't need to care about what order they've been
        # written in:

        my @fields = @{ $csv->getline( $fh ) };

        # Read each line from the CSV file, and store it in @rows
        
        my @rows;
        while (my $row = $csv->getline ($fh)) {

                # We can use the field names from above to store the values
                # into a hash with fieldname => value.  This will make
                # using the data much easier as we don't need to worry
                # about what position what piece of data is

                my %data;
                @data{ @fields } = @$row;       # This is a hash slice

                # We could either use the data here, or store it in @rows
                # for later access, we'll do a bit of both.  First we store
                # it

                push @rows, \%data;

                # Let's also print some values:

                print "Item ID: $data{'Item ID'}, Quantity: $data{Quantity}\n";
        }

        # Close the file handle.
        close $fh;

        # All of our CSV data is now in hash references inside @rows

The printed output of running this program with our sample data above is:

    Item ID: 234, Quantity: 5
    Item ID: 245, Quantity: 12
    Item ID: 267, Quantity: 310
    Item ID: 890, Quantity: 5

If we examine the final data structure with Data::Dumper we'd have:

    $VAR1 = [
                {
                    'Supplier ID' => '1234',
                    'Item ID' => '234',
                    'Arrival Date' => '2010/03/17',
                    'Quantity' => '5',
                    'Description' => 'Attach your frames to the wall without a problem',
                    'Name' => 'Wudget',
                    'QC' => '18'
                },
                {
                    'Supplier ID' => '1234',
                    'Item ID' => '245',
                    'Arrival Date' => '2010/03/17',
                    'Quantity' => '12',
                    'Description' => 'For all oven uses',
                    'Name' => 'Widget',
                    'QC' => '12'
                },
                {
                    'Supplier ID' => '1225',
                    'Item ID' => '267',
                    'Arrival Date' => '2010/03/17',
                    'Quantity' => '310',
                    'Description' => 'Ages 5 and up',
                    'Name' => 'Acme Play',
                    'QC' => undef
                },
                {
                    'Supplier ID' => '2344',
                    'Item ID' => '890',
                    'Arrival Date' => '2010/03/17',
                    'Quantity' => '5',
                    'Description' => 'photo-copier stationary',
                    'Name' => 'A4 recycled ream, 500 pages',
                    'QC' => '0'
                }
            ];

Now that we have read this data in, we can do whatever we need to do with it: put it into a database, create XML or HTML with it, work out how much our inventory is worth...

Writing CSV files with Text::CSV_XS

Text::CSV_XS can be told to be very lenient with what it accepts. When you create the object you can tell it to be more forgiving of whitespace, unnecessary escaping and anything else you need. It is however, very clean in what it outputs.

We can write out our data as follows:


        my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

        # Open a file for writing
        open $fh, ">:encoding(utf8)", "new.csv";

        # Set our line endings (use "\r\n" for Windows") 
        # This will be printed after each call to $cvs->print()
        $csv->eol ("\n");

        # Print out our headers
        $csv->print($fh, \@fields);

        # Write out each CSV row to $fh.  We use a hash slice again, 
        # to get out our fields in the desired order.

        foreach my $row (@rows) {

            $csv->print($fh, [ @$row{ @fields }]);
        }

        # And close our file.
        close $fh;

Text::CSV_XS's print method takes a file handle and an array reference containing your row data. We need to massage our data slightly as our @rows is an array of hash references. It quotes fields that require quoting and escapes extra quote characters as required. As expected the output is the same as what we read in (with extra quotes).

    "Item ID","Arrival Date","Supplier ID",Name,Description,Quantity,QC
    234,2010/03/17,1234,Wudget,"Attach your frames to the wall without a problem",5,18
    245,2010/03/17,1234,Widget,"For all oven uses",12,12
    267,2010/03/17,1225,"Acme Play","Ages 5 and up",310,
    890,2010/03/17,2344,"A4 recycled ream, 500 pages","photo-copier stationary",5,0

Conclusion

Although it can be very tempting to attempt to parse regularly separated files with split, it's dangerous to assume that real word data will not occasionally contain your separator. Text::CSV_XS is a robust and safe solution. In addition to comma separated data, you can use almost any other separator that you want. Likewise you can change your quoting character, and how quoting characters are escaped.

Finally, if you are not certain that you can use Text::CSV_XS due to the XS component, there's also Text::CSV which is compatible with Text::CSV_XS but uses a (slower) pure Perl implementation if the XS code isn't compiled.

[ Perl tips index ]
[ Subscribe to Perl tips ]


This Perl tip and associated text is copyright Perl Training Australia. You may freely distribute this text so long as it is distributed in full with this Copyright noticed attached.

If you have any questions please don't hesitate to contact us:

Email: contact@perltraining.com.au
Phone: 03 9354 6001 (Australia)
International: +61 3 9354 6001

Valid XHTML 1.0 Valid CSS