This tutorial is based on the last one. We read in a file and print out each line of data with a line number. However, this time, if a line is blank or contains only white space, we remove it. To show that the script works, the line numbering in the output will match that of the input file. So there will be gaps in the numbering in the output. Once again, minimal error-checking is performed. The code is discussed below.

#!/usr/bin/perl
	
##########################
#
# FILE: rmblanklines01.pl
#
##########################
#
# Read all the lines of a named input data file
# and print each line out with a line number.
# Unless a line is blank; in which case, remove it.
#
	
use strict; # Declare strict checking on variable names, etc.
	
my $argc;   # Declare variable $argc. This represents
            # the number of commandline parameters entered.
my $lnum = 0;  # Declare and initialize the line counter
my $fname;  # Declare file name variable
	
# Assign the number of commandline parameters to $argc,
# then check the number
#
$argc = @ARGV;
	
if ($argc != 1)
{
  # If the number of commandline parameters is not 1,
  # print a usage message.
  #
  usage();  # Call subroutine usage()
  exit(1);  # When usage() has completed execution,
            # exit the program with a non-zero (error) status.
}
	
# You could use an else clause here, but it's not necessary
#
# Get the pathname and filename.
# Note: first item in a Perl array is at index 0, not 1
#
$fname = $ARGV[0];
	
# Open the file for reading, and assign arbitrary filehandle, FH.
# If the file does not exist or is otherwise unopenable, let the
# program stop gracefully with a message. The special variable $!
# holds the reason the open failed.
#
open(FH, '<', $fname) || die("Couldn't open file $fname: $!\n");
	
#
# While there are lines to read from the input file, loop
#
while (<FH>)
{
  # The <FH> operator assigns the next line of data from
  # the named file to the Perl system variable $_.
  #
  $lnum++; # Increment the line counter
  chomp; # Remove the 'newline' character, '\n', from $_
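  # \s* matches zero or more whitespace characters, so a line that
  # is completely empty after chomp also counts as blank.
  print "$lnum: $_\n" unless ($_ =~ /^\s*$/);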
}
close(FH);
exit(0); # Exit gracefully
	
sub usage
{
  # Display the usage of this script.
  # The first parameter is the filename,
  # including any absolute or relative path
  #
  print "Usage: perl rmblanklines01.pl inputfilename\n";
}

Let’s discuss the print statement near the end of the main code:

  print "$lnum: $_\n" unless ($_ =~ /^\s*$/);
The unless clause is rather cryptic looking, but it is checking some condition. If you’ve never worked with regular expressions before, the characters in parentheses at the end of the print statement will look like gobbledygook. It should be obvious, though, that the print statement is run only if the condition in the unless clause is not true. If the condition is true, the print statement is skipped. Since we only want to print non-blank lines, the condition must be checking whether the line is blank.

Let’s define what we mean by a blank line. In this case, a blank line consists of only whitespace (spaces, tabs, and the like) or no characters whatsoever, followed by an invisible end-of-line character. Perl writes this character as ‘\n’ (linefeed, ctrl-J), although on some operating systems the line ending is actually a combination of ‘carriage return’ and ‘linefeed’. In Perl, we often refer to it as the newline character.
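
As a small illustration of the line-ending point, the sketch below chomps two "empty" lines, one ending in a linefeed and one ending in carriage return plus linefeed. The helper name chomped is made up for this example, and it assumes Perl's default input record separator ("\n").

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: chomp a copy of the string and return it.
# Assumes the default input record separator, "\n".
sub chomped
{
  my ($s) = @_;
  chomp $s;
  return $s;
}

my $lf   = chomped("\n");    # empty line with a linefeed ending
my $crlf = chomped("\r\n");  # empty line with a CR+LF ending

print "LF line length after chomp:   ", length($lf), "\n";
print "CRLF line length after chomp: ", length($crlf), "\n";

# The leftover carriage return is itself a whitespace character.
print "Leftover character is whitespace\n" if ($crlf =~ /^\s$/);
```

So a file with CR+LF line endings, read where only the linefeed is stripped, can leave one whitespace character behind on an otherwise empty line.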

Now let’s dissect the cryptic condition code in the unless clause. The code represents what is known as a regex (regular expression). Regexes are basically pattern rules. We specify a pattern rule, then compare it against each line of data. Since we want to ignore blank lines, we are checking for lines that contain only whitespace characters. So the entire line, from start to finish, must only be whitespace characters.

In a Perl regex, the caret, ‘^’, anchors the match at the start of the string (in this case, the line of data from the input file). Similarly, ‘$’ anchors it at the end of the string. Any single whitespace character is matched by the escape sequence ‘\s’ (backslash-s). A quantifier after ‘\s’ says how many: ‘\s+’ means “match one or more whitespace characters in a row”, while ‘\s*’ means “match zero or more”. The difference matters here because chomp has already removed the newline, so a completely empty line holds zero characters and only ‘\s*’ matches it. With ‘^’ in front and ‘$’ behind, the pattern says that from start to end, the string being checked must contain nothing but whitespace characters.
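
To see the anchors and quantifiers at work, here is a minimal sketch that tests a few sample strings against the ‘+’ and ‘*’ variants of the pattern. The sample strings and the helper name describe are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report which of the two patterns a string matches.
sub describe
{
  my ($s) = @_;
  my $plus = ($s =~ /^\s+$/) ? 1 : 0;  # one or more whitespace characters
  my $star = ($s =~ /^\s*$/) ? 1 : 0;  # zero or more whitespace characters
  return "plus=$plus star=$star";
}

# Sample lines as they would look after chomp (no trailing newline).
my @samples = ("", "   ", "\t \t", "  text  ");

foreach my $s (@samples)
{
  print "[$s] ", describe($s), "\n";
}
```

The empty string is the interesting case: it matches only the ‘*’ variant, while the string containing text matches neither.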

These regex characters only mean something inside a regex match or substitute statement. A match is typically written “$variable =~ m/__regex__/;”, where ‘=~’ is the binding operator that applies the pattern to the variable. Perl lets you leave out the ‘m’ when the delimiters are slashes. The first slash, ‘/’, starts the pattern; the second slash ends it. A substitution is written with a leading ‘s’ instead: “$variable =~ s/__regex__/__replacement__/;”, where the second slash is followed by a replacement string and then a third slash.
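
A short sketch of the two statement forms, using a made-up sample string. The sub name trim_trailing is hypothetical; it is only there to show a substitution in action.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace with a substitution.
sub trim_trailing
{
  my ($text) = @_;
  $text =~ s/\s+$//;  # s/pattern/replacement/ with an empty replacement
  return $text;
}

my $line = "hello world   ";

# Match: the leading 'm' is optional when the delimiters are slashes.
print "explicit m: matched\n" if ($line =~ m/world/);
print "implicit m: matched\n" if ($line =~ /world/);

# Substitute: the leading 's' is required.
print "[", trim_trailing($line), "]\n";
```

The brackets in the last print make it easy to see that the trailing spaces are gone.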

In the unless clause above, we are checking the contents of the Perl system variable ‘$_’, which is the default input variable when reading from STDIN (standard input) or from a named file. So to summarize, the statement

  print "$lnum: $_\n" unless ($_ =~ /^\s*$/);
means print the stuff in double quotes unless the current line of data is blank, that is, it contains nothing but whitespace. If any other characters occur in the data line, print the line with a line number. Keep in mind that the line numbers will match the position of a data line in the input file. So if you have 10 lines of data in the source file, and line 5 is “blank”, then the output lines will be numbered 1-4, then 6-10. If you would rather have the output lines numbered 1-9, then substitute the while loop above with this one:


while (<FH>)
{
  # The <FH> operator assigns the next line of data from
  # the named file to the Perl system variable $_.
  #
  chomp; # Remove the 'newline' character, '\n', from $_
  if ($_ !~ /^\s*$/)  # True when the line is NOT blank
  {
    $lnum++; # Increment the line counter
    print "$lnum: $_\n";
  }
}
Notice that we no longer have an unless clause. We’re using an if block now, and the condition is the opposite of the one in the unless clause: ‘!~’ is true when the pattern does not match, whereas ‘=~’ is true when it does. So an unless is simply an if with the condition negated.
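
The relationship between the two forms can be checked with a small sketch. The helper names keep_unless and keep_if and the sample strings are made up for this example, and the pattern here uses ‘\s*’ so that an empty string also counts as blank.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide with unless + "matches" (=~).
sub keep_unless
{
  my ($line) = @_;
  my $keep = 0;
  $keep = 1 unless ($line =~ /^\s*$/);
  return $keep;
}

# Hypothetical helper: decide with if + "does not match" (!~).
sub keep_if
{
  my ($line) = @_;
  my $keep = 0;
  $keep = 1 if ($line !~ /^\s*$/);
  return $keep;
}

# Both helpers should agree on every sample.
foreach my $line ("", "  \t", "data", " x ")
{
  print "[$line] unless=", keep_unless($line), " if=", keep_if($line), "\n";
}
```

Which form to use is a matter of readability: the statement modifier suits a single print, while the if block is convenient once more than one statement depends on the condition.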

One thing to consider is how short this script is in Perl, and how much longer an equivalent program in some other language, say C or C++, would likely be. Perl (often expanded as Practical Extraction and Report Language) has pattern matching - that is, regexes - built into the language.

This is #5 in a series of Perl scripts. Please feel free to leave comments if there is something that you would like to see. I cannot guarantee that I’ll be able to answer immediately, but I will answer as soon as I can.
