One fairly common task in Perl is downloading and parsing data. For instance, you might want to download sports data and parse it so that you can import it into a database or run some sports ranking calculations on it.
In this tutorial I'll show you a really simple way of downloading and parsing XML. You'll need, of course, a site that provides you with XML to practice on. The technique I show you here may work for some HTML too, but I certainly can't guarantee it.
The code I'm about to show you doesn't do much error checking, to keep things simple for the purposes of this tutorial. If you want more detailed information on downloading stuff using Perl, including how to do error checking, see this page.
Firstly, the XML. Let's pretend that you can download the following XML from some URL or other. It contains information about various bands and the positions their albums reached in the album charts.
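As a stand-in to practice on, here's a made-up sample. The element and attribute names (charts, band, album, name, position) are my own assumptions, not a real feed's format; the band names, album titles and chart positions are the ones that appear in the output at the end of this tutorial.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<charts>
  <band name="The Basspluckers">
    <album name="Pluck My Bass" position="123" />
    <album name="Revenge of the Squirrels" position="434" />
  </band>
  <band name="The Dead Drunks">
    <album name="Spy in the House of Love" position="10" />
    <album name="Get Out of My House of Love, You Spy" position="74" />
  </band>
</charts>
```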
Now, the Perl!
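The first thing to do in any Perl script is to turn on warnings and 'strict'. Strict catches typos in your variable names by turning undeclared variables into compile-time errors, and warnings flags other likely mistakes.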
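A minimal sketch of that opening; the $band variable is only there to give strict something to check.

```perl
#!/usr/bin/perl
use strict;    # undeclared variables become compile-time errors
use warnings;  # warn about likely mistakes, such as using undefined values

# Hypothetical variable, just for illustration.
my $band = 'The Basspluckers';
print "$band\n";

# Mistyping the name as $bnad here would now stop the script from compiling.
```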
Next we import the modules that we want to use. We'll use the XML::Simple module for parsing XML; this module turns XML into deeply nested Perl data structures. For downloading stuff we'll use LWP, through its LWP::Simple interface. And for printing out those massive data structures so that we can figure out how to get at the data in them, we'll use the standard Data::Dumper module.
You may already have XML::Simple installed, but it isn't part of the Perl core, so if it's missing you'll need to install it from CPAN. Usually your Perl installation will come with an easy way of doing this, such as the cpan command-line client.
Notice I've also turned off output buffering (by setting $| to 1), which helps in debugging because output appears as soon as it is printed.
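Put together, the top of the script might look something like this. It assumes XML::Simple and LWP are installed; Data::Dumper ships with Perl.

```perl
use strict;
use warnings;

use XML::Simple;    # exports XMLin() for parsing
use LWP::Simple;    # exports get() for downloading
use Data::Dumper;   # exports Dumper() for inspecting data structures

# Setting $| to a true value makes STDOUT flush after every print.
$| = 1;
```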
Next we'll create a main subroutine and call it. This is just a way of ensuring a clearly-defined start point in our program.
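A bare-bones sketch of that layout. The print statement is a placeholder of mine; the real work goes in its place.

```perl
use strict;
use warnings;

# Call main() up front so there is one obvious place where execution starts.
# Perl compiles the sub before running this line, so the order is fine.
main();

sub main {
    print "Program starts here\n";   # placeholder body
    return;
}
```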
Now we can go ahead and put code in the main function.
First, let's download the data. As I said, we're not doing any error checking here for the sake of simplicity.
The get function is from the LWP::Simple module. It returns the content of the page as a string, or undef if the download fails.
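Here's roughly what the download step could look like. The URL is a placeholder I've made up, so swap in the address of your own XML; the ten-second timeout is also my addition, there only so the script doesn't hang for long when there's no network.

```perl
use strict;
use warnings;
use LWP::Simple qw(get $ua);   # $ua is the underlying user agent object

$ua->timeout(10);              # assumption: give up after ten seconds

# Hypothetical URL: replace with a real XML source.
my $url = 'http://www.example.com/charts.xml';

print "Downloading ....\n";
my $xml = get($url);           # undef if anything went wrong

if (defined($xml)) {
    print "Got ", length($xml), " characters\n";
}
else {
    print "Nothing came back from $url\n";
}
```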
Next we use XML::Simple to parse the XML.
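A sketch of the parsing step. So that it runs on its own, the XML sits in a string rather than coming from a download; XMLin accepts either a filename or a string of XML. The sample document and its element names are my own invention.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical XML standing in for the downloaded text.
my $xml = <<'END_XML';
<charts>
  <band name="The Basspluckers">
    <album name="Pluck My Bass" position="123" />
    <album name="Revenge of the Squirrels" position="434" />
  </band>
  <band name="The Dead Drunks">
    <album name="Spy in the House of Love" position="10" />
    <album name="Get Out of My House of Love, You Spy" position="74" />
  </band>
</charts>
END_XML

print "Parsing ...\n";

# With default options XMLin returns a hash reference; the root element
# itself is dropped and its children become the top-level keys.
my $data = XMLin($xml);

print join(', ', sort keys %$data), "\n";
```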
I like to keep the following lines handy in my code. You can use Data::Dumper to output any Perl variable in a readable form. You'll need to keep using it to drill down in the XML bit by bit and pluck out the bits you want. This is the only downside to this technique, and I'll admit it is one. But at least it saves you from regular expressions or XPath. On the other hand, you'll have to know your arrays and hashes.
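Something along these lines does the job. The $data structure here is hand-written to stand in for whatever your parser returned, and the two Data::Dumper settings are just my preference for tidier output.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical structure standing in for parsed XML.
my $data = {
    band => {
        'The Basspluckers' => {
            album => { 'Pluck My Bass' => { position => 123 } },
        },
    },
};

$Data::Dumper::Sortkeys = 1;   # print hash keys in a stable order
$Data::Dumper::Indent   = 1;   # compact indentation

print Dumper($data);           # the whole structure
print Dumper($data->{band});   # one level further down
```

Dump the top level first, pick a key, dump that, and repeat until you reach the value you're after.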
Now you've got your data and you've parsed it, but you need to figure out bit by bit how to get at the details you want. You will end up with a hash of arrays of hashes of arrays of God-knows-what. Frequent use of Data::Dumper is advised.
For instance, to get just one chart position you'd need:
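With made-up XML in which each band and album carries a name attribute and each album a position attribute, the lookup might be sketched like this. The $data structure is written out by hand to mimic what XMLin would give you under its default settings; all the key names are assumptions, and yours will differ.

```perl
use strict;
use warnings;

# Hypothetical parsed structure: note that the band and album names
# have become hash keys.
my $data = {
    band => {
        'The Basspluckers' => {
            album => {
                'Pluck My Bass'            => { position => 123 },
                'Revenge of the Squirrels' => { position => 434 },
            },
        },
    },
};

# One chart position, reached by walking down through every level.
my $position =
    $data->{band}{'The Basspluckers'}{album}{'Pluck My Bass'}{position};

print "$position\n";
```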
... and since actual XML field values end up as keys in the data structure (by default XML::Simple folds repeated elements into a hash keyed on their name, key or id), you can't normally get what you want in one go. You have to work systematically through the data, pulling out particular hashes and arrays in nested loops.
The alternative to this is of course to use a CPAN module that lets you work with XML nodes or XPath expressions. Either way it's not going to be pretty, and which you choose depends on which method of drilling down through XML you prefer.
If you know regular expressions pretty well, you might prefer just to sidestep all XML parsers and use regexes directly.
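For what it's worth, a regex version might be sketched like this. It assumes the made-up attribute layout below, with name always before position and double quotes throughout, and it will break as soon as the real XML deviates from that.

```perl
use strict;
use warnings;

# Hypothetical XML fragment.
my $xml = <<'END_XML';
<band name="The Basspluckers">
  <album name="Pluck My Bass" position="123" />
  <album name="Revenge of the Squirrels" position="434" />
</band>
END_XML

my @albums;

# Capture the name and position attributes of each album tag.
while ($xml =~ /<album\s+name="([^"]*)"\s+position="(\d+)"/g) {
    push @albums, [ $1, $2 ];
    print "'$1' reached chart position $2\n";
}
```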
The complete example below shows you what you have to do for the example XML given above. Your code will of course be different, since you'll be working with different XML.
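Here is one way the whole thing could fit together. To keep the sketch runnable without a real server, it writes a made-up sample document to a temporary file and fetches it through a file:// URL; that part, the sample_url and report helper names, and the XML layout are all my assumptions. In real use you'd hand get an http:// address instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Simple;
use LWP::Simple;
use File::Temp qw(tempfile);

$| = 1;    # turn off output buffering

main();

sub main {
    my $url = sample_url();    # assumption: stands in for a real URL

    print "Downloading ....\n";
    my $xml = get($url);
    die "Could not fetch $url\n" unless defined $xml;

    print "Parsing ...\n";
    my $data = XMLin($xml);

    print report($data);
    return;
}

# Walk the nested hashes: band names first, then each band's albums.
sub report {
    my ($data) = @_;
    my $bands  = $data->{band};
    my $out    = '';

    for my $band (sort keys %$bands) {
        $out .= "$band\n";
        my $albums = $bands->{$band}{album};
        for my $album (sort keys %$albums) {
            $out .= "'$album' reached chart position "
                  . "$albums->{$album}{position}\n";
        }
    }
    return $out;
}

# Hypothetical helper: writes sample XML to a temporary file and returns
# a file:// URL for it (assumes a Unix-style absolute path).
sub sample_url {
    my ($fh, $path) = tempfile(SUFFIX => '.xml', UNLINK => 1);
    print {$fh} <<'END_XML';
<charts>
  <band name="The Basspluckers">
    <album name="Pluck My Bass" position="123" />
    <album name="Revenge of the Squirrels" position="434" />
  </band>
  <band name="The Dead Drunks">
    <album name="Spy in the House of Love" position="10" />
    <album name="Get Out of My House of Love, You Spy" position="74" />
  </band>
</charts>
END_XML
    close $fh or die "Could not write $path: $!\n";
    return "file://$path";
}
```

Sorting the keys keeps the output order predictable, since Perl hashes have no fixed order of their own.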
Downloading ....
Parsing ...
The Basspluckers
'Pluck My Bass' reached chart position 123
'Revenge of the Squirrels' reached chart position 434
The Dead Drunks
'Get Out of My House of Love, You Spy' reached chart position 74
'Spy in the House of Love' reached chart position 10