Perl Wget -- Retrieving HTML Pages In Perl

Wget is a command-line program that allows you to retrieve files via HTTP or FTP from a UNIX prompt. People who are familiar with UNIX or Linux often wonder how to use wget in Perl. The simple answer is -- don't! OK, if you really want to use wget in Perl, you can always execute it like any other command-line program and capture the output.




For example, the following program runs a simple UNIX command, captures its output, and displays it.

 
use strict;
use warnings;

sub main {
    my @lines = `ls -l`;

    print join("", @lines);
}

main();
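
The same approach works for wget itself. The following is only a minimal sketch: the helper name capture_command is made up for illustration, and calling it with wget assumes that wget is installed and on your PATH. It uses the list form of a piped open, so the URL is never interpreted by the shell.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return everything it
# writes to standard output as a single string.
sub capture_command {
    my @command = @_;

    # The list form of a piped open runs the program directly,
    # without passing the arguments through a shell.
    open(my $pipe, '-|', @command)
        or die "Cannot run '$command[0]': $!\n";

    # Read the whole output at once.
    my $output = do { local $/; <$pipe> };

    # close() returns false if the command exited with an error.
    close($pipe)
        or die "'$command[0]' failed with exit status " . ($? >> 8) . "\n";

    return $output;
}
```

To fetch a page, you could call it as capture_command('wget', '-q', '-O', '-', 'http://news.bbc.co.uk'). Here -q suppresses wget's progress messages and -O - makes wget write the page to standard output instead of a file.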

 

However, it's better to bypass wget altogether and use the Perl package LWP.

Don't Use Wget -- Use LWP



This program retrieves an HTML page using LWP::Simple.

 
use strict; 
use warnings; 

use LWP::Simple;

sub main
{
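    my $data = get("http://news.bbc.co.uk");

    # get() returns undef if the request fails.
    die "Could not retrieve the page.\n" unless defined $data;

    print "Retrieved " . length($data) . " bytes of data.\n";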
}

main();

 
Retrieved 84294 bytes of data.




Much cleaner and more portable than executing wget on the command line!
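
Wget is most often used to save a page straight to disk, and the closest LWP::Simple equivalent is getstore(), which returns the HTTP status code of the request. Here is a minimal sketch; the helper name save_page is a placeholder chosen for illustration.

```perl
use strict;
use warnings;

use LWP::Simple;

# Hypothetical helper: download $url into $file.
# Returns 1 on success and 0 on failure.
sub save_page {
    my ($url, $file) = @_;

    # getstore() returns the HTTP status code of the response.
    my $status = getstore($url, $file);

    # is_success() is exported by LWP::Simple and is true
    # for any 2xx status code.
    return is_success($status) ? 1 : 0;
}
```

A call might look like save_page("http://news.bbc.co.uk", "bbc.html") or die "Download failed.\n"; where the file name bbc.html is just an example.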

Checking For Errors While Retrieving HTML in Perl



If you want to do anything more complicated than simply fetch an HTML file, you might want to go to the extra trouble of creating an LWP user agent object. The "user agent" is really a virtual browser: it allows you to accept cookies, pretend to be Internet Explorer, and so on.

Here's a simple example that does the same thing as the code above, but contains better error checking.

 
use strict; 
use warnings; 

use LWP::UserAgent;

sub main
{
    my $ua = LWP::UserAgent->new();
    
    my $response = $ua->get('http://news.bbc.co.uk');
    
    if ($response->is_success) {
        print "Retrieved " . 
                      length($response->decoded_content) . 
                      " bytes of data.";
    }
    else {
        print "Error: " . $response->status_line;
    }
}

main();

 
Retrieved 84245 bytes of data.




If we mangle the URL, we get a message like this:

Error: 500 Can't connect to news.bbc.co.ukd:80 
(Bad hostname 'news.bbc.co.ukd')
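
Beyond is_success, the response object exposes the numeric status code, which lets a program react differently to different failures. The sketch below is illustrative only: the helper name describe_response and the wording of its messages are assumptions, and it builds HTTP::Response objects by hand so that it runs without a network connection.

```perl
use strict;
use warnings;

use HTTP::Response;

# Hypothetical helper: turn a response into a one-line summary.
sub describe_response {
    my ($response) = @_;

    if ($response->is_success) {
        return "OK: " . length($response->decoded_content) . " characters";
    }

    # code() is the numeric status; status_line() adds the message.
    if ($response->code == 404) {
        return "Page not found: " . $response->status_line;
    }

    return "Error: " . $response->status_line;
}

# Responses built by hand, standing in for what $ua->get() returns.
my $missing = HTTP::Response->new(404, 'Not Found');
my $broken  = HTTP::Response->new(500, 'Internal Server Error');

print describe_response($missing), "\n";
print describe_response($broken), "\n";
```

In a real program you would pass in the object returned by $ua->get(...) instead of a hand-built response.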




Pretending To Be a Real Browser with Perl LWP



A lot of sites won't let you in unless you accept cookies and appear to be a real browser. This is especially true of sites that require you to log in. Let's take a look at a program that accepts website cookies and masquerades as a real browser. By default, the cookies are discarded when the program finishes running, but just for kicks we'll save them in a file called "cookies.txt" (note: you could specify a full file path if you wanted).

We'll also save the retrieved HTML in a file called "save.html".

 
use strict; 
use warnings; 

use LWP::UserAgent;
use HTTP::Cookies;

sub main
{
    # Create the fake browser (user agent).
    my $ua = LWP::UserAgent->new();
    
    # Accept cookies. You don't need to supply
    # any options to new() here, but just for
    # kicks we'll save the cookies to a file.
    my $cookies = HTTP::Cookies->new(
        file => "cookies.txt",
        autosave => 1,
    );
    
    $ua->cookie_jar($cookies);
    
    # Pretend to be Internet Explorer 7 by sending its
    # User-Agent string. Any other browser's string works too.
    $ua->agent("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
    
    # Get some HTML.
    my $response = $ua->get('http://news.bbc.co.uk');
    
    # Stop here on failure; there is nothing useful to save.
    unless ($response->is_success) {
        die "Error: " . $response->status_line . "\n";
    }
    
            
    # Let's save the output.
    my $save = "save.html";
    
    # Open the file for writing. Without the encoding layer,
    # we may get a 'Wide character in print' warning.
    open(my $fh, '>:encoding(UTF-8)', $save)
        or die "\nCannot create save file '$save': $!\n";

    print $fh $response->decoded_content;

    close($fh);
    
    print "Saved " . 
        length($response->decoded_content) . 
        " bytes of data to '$save'.";
}

main();

 
Saved 83938 bytes of data to 'save.html'.
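
For sites that require a login, the usual next step is to submit the login form with a POST request while the cookie jar keeps track of the session. The following is a rough sketch under stated assumptions: the helper names, the URL http://www.example.com/login, and the form field names username and password are all hypothetical, and real sites will differ. It only builds the request and prints it, so it runs without a network connection.

```perl
use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(POST);

# Hypothetical helper: a user agent that keeps cookies in memory
# and identifies itself with the given User-Agent string.
sub make_browser {
    my ($agent_string) = @_;

    my $ua = LWP::UserAgent->new();
    $ua->cookie_jar(HTTP::Cookies->new());
    $ua->agent($agent_string);

    return $ua;
}

# Hypothetical helper: build (but do not send) a login request.
# The URL and the field names depend entirely on the target site.
sub build_login_request {
    my ($url, $user, $password) = @_;

    return POST($url, [ username => $user, password => $password ]);
}

my $ua      = make_browser("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
my $request = build_login_request("http://www.example.com/login", "alice", "secret");

# Show what would be sent; $ua->request($request) would send it.
print $request->method, " ", $request->uri, "\n";
print $request->content, "\n";
```

Sending the request with $ua->request($request) returns a response object like the one from get(), and any cookies the site sets are stored in the cookie jar for later requests made with the same user agent.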




FTP and Other Requests With LWP



LWP can also handle FTP and other protocols. Check out the LWP documentation on CPAN for more information.
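
As a small illustration, the user agent can report which URL schemes it is able to handle. This is a sketch only: the helper name supported_schemes is made up, and the result depends on which protocol modules are installed alongside LWP.

```perl
use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

# Hypothetical helper: return the schemes from the list that
# this LWP installation can handle.
sub supported_schemes {
    my ($agent, @schemes) = @_;
    return grep { $agent->is_protocol_supported($_) } @schemes;
}

my @available = supported_schemes($ua, qw(http ftp file));
print "Supported: @available\n";
```

If ftp is in the list, an FTP URL can be passed to $ua->get() in the same way as an HTTP one, for example $ua->get('ftp://ftp.example.com/pub/README'), where the host and path are placeholders.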