Perl: Downloading and Parsing XML

One fairly common task in Perl is downloading and parsing data. For instance, you might want to download sports data and parse it so that you can import it to a database or do some sports ranking calculations on it.

In this tutorial I'll show you a really simple way of downloading and parsing XML. You'll need, of course, a site that provides you with XML to practice on. The technique I show you here may work for some HTML too, but I certainly can't guarantee it.

The code I'm about to show you doesn't do a whole lot of error checking, in order to keep things simple for the purposes of this tutorial. If you want more detailed information on downloading stuff using Perl, including how to do error checking, see this page.

Firstly, the XML. Let's pretend that you can download the following XML from some URL or other. It contains information about various bands and the positions their albums reached in the album charts.

 
<xml>
<entry>
    <band>The Basspluckers</band>
    <album>
        <name>Revenge of the Squirrels</name>
        <chartposition>434</chartposition>
    </album>
    <album>
        <name>Pluck My Bass</name>
        <chartposition>123</chartposition>
    </album>
</entry>
<entry>
    <band>The Dead Drunks</band>
    <album>
        <name>Spy in the House of Love</name>
        <chartposition>10</chartposition>
    </album>
    <album>
        <name>Get Out of My House of Love, You Spy</name>
        <chartposition>74</chartposition>
    </album>
</entry>
</xml>
 

Now, the Perl!

The first thing in any Perl script is to turn on warnings and 'strict', to catch typos in your variable names.

 
use strict;
use warnings;
 

Next we import some modules that we want to use. We'll use the XML::Simple module for parsing XML; this module parses XML to massively-nested Perl data structures. For downloading stuff we'll use LWP. And for printing out those massive data structures so that we can figure out how to get at the data in them, we'll use the standard Data::Dumper module.

Many Perl distributions bundle XML::Simple and LWP, but neither is a core module, so if they're missing you'll need to download them from CPAN and install them. Usually your Perl installation will come with an easy way of doing this.

 
# For parsing
use XML::Simple;

# For downloading
use LWP::Simple;

# For debug output
use Data::Dumper;

# Turn off output buffering
$|=1;
 

Notice I've also turned off output buffering, which always helps in debugging.

Next we'll create a main subroutine and call it. This is just a way of ensuring a clearly-defined start point in our program.

 
sub main 
{
}

main();

 

Now we can go ahead and put code in the main function.

First, let's download the data. As I say, we're not doing any error checking here for the sake of simplicity.

 
print "\nDownloading ....";
my $data = get('http://www.somesiteorother.com/freedata/bands.xml');
 

get is from the LWP::Simple module.
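Since get returns undef when a download fails, a minimal check might look something like the sketch below. Note that fetch_or_die is a made-up helper name for this example, not part of LWP::Simple.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_or_die is a hypothetical helper, not something LWP::Simple provides.
sub fetch_or_die {
    my ($url) = @_;

    # get() returns undef if the download fails for any reason.
    my $content = get($url);
    die "Could not download $url\n" unless defined $content;

    return $content;
}
```

You'd then write my $data = fetch_or_die($url); in place of the bare call to get. LWP::Simple doesn't tell you why a download failed, so the error message here can only name the URL.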

Next we use XML::Simple to parse the XML.

 
my $parser = XML::Simple->new;

print "\nParsing ...";
my $dom = $parser->XMLin($data);
 

I like to keep the following lines handy in my code. You can use Data::Dumper to output any Perl variable in a readable form. You'll need to keep using it to drill down into the parsed XML bit by bit and pluck out the bits you want. This is the only downside to this technique, and I'll admit it's a real one. But at least it saves you from regular expressions or XPath. On the other hand, you'll have to know your arrays and hashes.

 
# Debug output. 
# print Dumper($dom);
 

Now the Hard Bit



Now you've got your data and you've parsed it, but you need to figure out bit by bit how to get at the details you want. You will end up with a hash of arrays of hashes of arrays of God-knows-what. Frequent use of Data::Dumper is advised.
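As a sketch of what that looks like in practice, here's a hand-built structure with the same shape XML::Simple gives for the first band in the sample XML. That shape is an assumption based on the access paths used in this tutorial, not real downloaded data. Setting Sortkeys and Indent just makes the dumps easier to read and compare.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys and use a compact indent so dumps are easy to compare.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# Hand-built stand-in for what XMLin() returns (first <entry> only).
my $dom = {
    entry => [
        {
            band  => 'The Basspluckers',
            album => {
                'Revenge of the Squirrels' => { chartposition => '434' },
                'Pluck My Bass'            => { chartposition => '123' },
            },
        },
    ],
};

# Dump the whole thing, then just the part you want to drill into next.
print Dumper($dom);
print Dumper($dom->{entry}[0]{album});
```

Dumping a smaller and smaller piece each time is usually quicker than trying to read the whole structure at once.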

For instance, to get just one chart position you'd need:

 

print 
$dom->{'entry'}->[0]->{'album'}->{'Pluck My Bass'}->{'chartposition'};

 

... and since actual XML field values (here, the album names) end up as keys in the data structure, you can't normally get what you want in one go unless you already know those values. You have to work systematically through the data, getting out particular hashes and arrays and so on in nested loops or whatever.
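The folding of values into keys comes from XML::Simple's default options. As a rough sketch, assuming you'd rather have plain arrays, the KeyAttr and ForceArray options to XMLin switch the folding off and make the elements you name always come back as arrays, even when there's only one of them. The cut-down XML here is just the sample above reduced to a single band with a single album.

```perl
use strict;
use warnings;
use XML::Simple;

# One band with one album: the case where the default shape would change.
my $xml = <<'XML';
<xml>
<entry>
    <band>The Basspluckers</band>
    <album>
        <name>Pluck My Bass</name>
        <chartposition>123</chartposition>
    </album>
</entry>
</xml>
XML

my $dom = XML::Simple->new->XMLin(
    $xml,
    KeyAttr    => [],                  # don't fold arrays into hashes keyed by 'name'
    ForceArray => ['entry', 'album'],  # always return these elements as arrays
);

# Every level is now a plain array of hashes, so two simple loops do the job.
foreach my $entry (@{ $dom->{entry} }) {
    foreach my $album (@{ $entry->{album} }) {
        print "$entry->{band}: '$album->{name}' reached $album->{chartposition}\n";
    }
}
```

The trade-off is that you lose the lookup-by-album-name shortcut, but the structure no longer changes shape depending on how many entries or albums the XML happens to contain.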

The alternative to this is of course to use a CPAN module that lets you work with XML nodes or XPath expressions. Either way it's not going to be pretty, and which you choose depends on which method of drilling down through XML you prefer.
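For comparison, here's a minimal sketch of the XPath route using XML::LibXML. That module is my pick for illustration (it isn't used anywhere else in this tutorial), and the XML is pasted inline rather than downloaded.

```perl
use strict;
use warnings;
use XML::LibXML;

# Second band from the sample XML, inline so the sketch is self-contained.
my $xml = <<'XML';
<xml>
<entry>
    <band>The Dead Drunks</band>
    <album>
        <name>Spy in the House of Love</name>
        <chartposition>10</chartposition>
    </album>
    <album>
        <name>Get Out of My House of Love, You Spy</name>
        <chartposition>74</chartposition>
    </album>
</entry>
</xml>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Walk the entries and their albums in document order.
foreach my $entry ($doc->findnodes('/xml/entry')) {
    my $band = $entry->findvalue('band');
    foreach my $album ($entry->findnodes('album')) {
        my $name     = $album->findvalue('name');
        my $position = $album->findvalue('chartposition');
        print "$band: '$name' reached chart position $position\n";
    }
}

# A single XPath expression can also pick out one value directly.
my $one = $doc->findvalue('//album[name="Spy in the House of Love"]/chartposition');
print "$one\n";
```

Here the albums come out in the order they appear in the XML, and you never have to guess whether something is a hash or an array.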

If you know regular expressions pretty well, you might prefer just to sidestep all XML parsers and use regexes directly.

The complete example below shows you what you have to do for the example XML given above. Your code will of course be different, since you'll be working with different XML.

 
use strict;
use warnings;

# For parsing
use XML::Simple;

# For downloading
use LWP::Simple;

# For debug output
use Data::Dumper;

# Turn off output buffering
$|=1;

# Note, no error checking here!

sub main 
{

    print "\nDownloading ....";
    my $data = get('http://www.somesiteorother.com/freedata/bands.xml');
    
    my $parser = XML::Simple->new;
    
    print "\nParsing ...";
    my $dom = $parser->XMLin($data);
    
    print "\n\n";
    
    # Debug output. 
    # print Dumper($dom);
    
    # Data structure is a hash containing one key, 'entry'.
    # Its value is an array reference; dereference it to get the array.
    my @entries = @{ $dom->{'entry'} };
    
    # Go through each array 'entry' element.
    foreach my $entry(@entries) 
    {
        # Each element is a hash.
        # The band name can be got from one hash key.
        my $band = $entry->{'band'};
        
        print "$band\n";
        
        # The albums are in a hash stored under another key.
        my $albums = $entry->{'album'};
        
        # Go through all key-value pairs of the albums hash.
        # Notice we dereference the hash reference (%$albums).
        while( my ($album, $details) = each %$albums ) 
        {
            # The chart position is the sole member of the hash stored
            # as the value for this album.
            my $position = $details->{'chartposition'};
            
            # The album name is the key in the albums hash.
            print "'$album' reached chart position $position\n";
        }
    
        print "\n\n";
    }


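}

main();
 

Running the script prints something like the following (the order of the albums within each band may vary, because Perl hashes are unordered):

 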
Downloading ....
Parsing ...

The Basspluckers
'Pluck My Bass' reached chart position 123
'Revenge of the Squirrels' reached chart position 434


The Dead Drunks
'Get Out of My House of Love, You Spy' reached chart position 74
'Spy in the House of Love' reached chart position 10
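
If you want the albums in a predictable order, one option is to sort the hash keys yourself instead of relying on each. The sketch below sorts by chart position, using a hand-built stand-in for one band's albums hash rather than downloaded data.

```perl
use strict;
use warnings;

# Hand-built stand-in for one band's $entry->{'album'} hash.
my $albums = {
    'Revenge of the Squirrels' => { chartposition => 434 },
    'Pluck My Bass'            => { chartposition => 123 },
};

# Sort the album names numerically by chart position, best position first.
my @by_position = sort {
    $albums->{$a}{chartposition} <=> $albums->{$b}{chartposition}
} keys %$albums;

foreach my $album (@by_position) {
    print "'$album' reached chart position $albums->{$album}{chartposition}\n";
}
```

In the complete example you'd swap the while/each loop for a foreach over the sorted keys like this.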