Perl Split Whitespace

How can we split a string in Perl on whitespace? The simplest way of doing this is to use the split() function, supplying a regular expression that matches whitespace as the first argument.

my $string = "hello there how are you?";
    
my @tokens = split / /, $string;
    
# Print out the tokens we've extracted by 
# splitting the string, one per line.
foreach my $token(@tokens) {
    print "$token\n";
}





hello
there
how
are
you?






However, the above will not work as intended if the string you want to split contains other whitespace characters, such as the newline character \n or the tab character \t. So to really split on all whitespace, use the whitespace escape sequence \s:


my $string = "hello\tthere how\nare you?";
    
my @tokens = split /\s/, $string;
    
# Print out the tokens we've extracted by 
# splitting the string, one per line.
foreach my $token(@tokens) {
    print "$token\n";
}
    




hello
there
how
are
you?




Perl -- Split on Whitespace: Final Version!



The above program is better, but if your input contains two or more consecutive whitespace characters, the result will include empty strings, because each whitespace character is treated as a separate delimiter. That brings us to the final version of this little program, which can split on any kind of whitespace (within reason!) and treats a run of consecutive whitespace characters as a single delimiter.


my $string = "hello\tthere  how\nare you?";
    
my @tokens = split /\s+/, $string;
    
# Print out the tokens we've extracted by 
# splitting the string, one per line.
foreach my $token(@tokens) {
    print "$token\n";
}
    



hello
there
how
are
you?
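One case the patterns above do not cover is input that begins with whitespace: split /\s+/ then returns an empty string as its first element. Perl's split treats a pattern consisting of a single space string, ' ', as a special case that splits on runs of whitespace and also skips leading whitespace. The sketch below compares the three forms on a made-up input; the variable names @single, @plus and @awk are just for illustration.

```perl
use strict;
use warnings;

# Illustrative input: two leading spaces, a double tab, a double space.
my $string = "  hello\t\tthere  how\nare you?";

# Every whitespace character is its own delimiter, so empty fields appear.
my @single = split /\s/, $string;

# Runs of whitespace collapse, but the leading run still yields one empty field.
my @plus = split /\s+/, $string;

# Special case: a single-space string skips leading whitespace as well.
my @awk = split ' ', $string;

# Show each result with the fields bracketed so empty strings are visible.
for my $result (['/\s/', \@single], ['/\s+/', \@plus], ["' '", \@awk]) {
    my ($label, $fields) = @$result;
    printf "%-6s %d fields: %s\n", $label, scalar @$fields,
        join ' ', map { "[$_]" } @$fields;
}
```

If your input might start with whitespace and you do not want an empty first token, the ' ' form is usually the more convenient choice.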





Splitting Really Big Chunks of Text



The program above will do the trick for most purposes. But what if your chunk of text is absolutely massive, like a huge piece of XML exported from a database for instance? Depending on how much memory your machine has, you may or may not want to store the results of the split in an array. If you want to avoid putting everything in an array, you can always use a regular expression that matches tokens one by one.

# Pretend this is a huge piece of data that we
# don't want to bung into an array.
my $string = "hello\tthere  how\nare you?";

while ($string =~ /(\S+)/g) {
    print "$1\n";
}






hello
there
how
are
you?




This program works by searching for strings of non-whitespace characters within the input data. Since tokens are processed one at a time, it avoids bunging everything into a huge array.

Such concern for memory is becoming less relevant as computers become ever more powerful, but this little trick is still often useful.
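If the data lives in a file rather than in a string, the same matching loop can be run one line at a time, so that neither the whole text nor the whole list of tokens has to sit in memory. Here is a minimal sketch of that idea. It uses an in-memory filehandle as a stand-in for a real file so that it runs on its own; $data and $count are illustrative names, and in practice you would open your own file path instead.

```perl
use strict;
use warnings;

# Stand-in for a large file: an in-memory filehandle over a short string.
my $data = "hello\tthere  how\nare you?\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my $count = 0;
while (my $line = <$fh>) {
    # Pull out one run of non-whitespace characters at a time.
    while ($line =~ /(\S+)/g) {
        print "$1\n";
        $count++;
    }
}
close $fh;

print "Processed $count tokens\n";
```

Reading line by line is safe here because a newline is itself whitespace, so no token can span two lines.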