Previous Section Table of Contents Next Section

Sort a File by Line Length

This function reads lines from standard input and outputs them sorted by length.

For each line of text, it inserts a string in a new array. This string consists of the length of the line of text, concatenated with the vertical bar ('|') character, concatenated with the index of the line of text in the original array. The program sorts the new array, and then uses the results to display the original lines in sorted order.

This depends on the fact that when Perl converts a string to a number, it stops the conversion when it hits a non-numeric character, but still returns the number converted so far. Converting these artificially fabricated "length|index" array strings back to a number returns the length of the line of text because conversion stops (but does not fail) when it hits the vertical bar character.

Remember that index() returns the zero-based index of the first occurrence of a string in another string, and substr() with two parameters returns a substring from the specified index to the end.

This program uses Perl's built-in sort() function. By default, it sorts a list in ASCII order:


@sortedlist = sort @unsortedlist;


However, you can also specify a subroutine to use for comparisons. This can be a named subroutine, or it can be specified inline as a parameter to sort(). The semantics of the subroutine are that it always has two arguments-$a and $b-and it should return -1, 0, or 1, respectively, if $a should be sorted before, at the same place, or after $b. For your convenience, Perl defines two built-in operators that have these same semantics: <=> (known as the "spaceship" operator) does numeric comparisons and cmp does string comparisons. Thus, the following line of code


@sortarray = sort { $a <=> $b } @sortarray;


replaces the list @sortarray with its numerically sorted equivalent.

Source Code


 1.     # read in the entire input

 2.

 3.     @lines = <STDIN>;

 4.

 5.     #construct the array with the length prepended

 6.

 7.     foreach (0..$#lines) {

 8.         $sortarray[$_] = length($lines[$_]) . "|" . $_;

 9.     }

10.

11.     # sort using numeric comparison

12.

13.     @sortarray = sort { $a <=> $b } @sortarray;

14.

15.     # display results

16.

17.     foreach (@sortarray) {

18.         print $lines[substr($_, index($_, "|"))];

19.     }


Suggestions

  1. How will it affect @sortarray if one of the elements in lines has a | character in it?

  2. What are the empty and trivial cases for this function? How are they handled?

  3. Is the assignment in line 3 done in scalar context or list context?

Hints

Walk through the function with the following lines of input:

  1. The trivial case:

    abc

  2. Check if the program handles an empty line:

    
    this
    
    is
    
    
    
    test
    
    

  3. See if the program handles an already sorted file:

    1

    22

    333

Explanation of the Bug

The problem is in the display of the text at the end, on line 18:


print $lines[substr($_, index($_, "|"))];


An element in @sortarray will be of the form "12|3", which means the fourth entry in @lines was 12 bytes long. The index() function returns the location of the '|' character, but the code needs to index into @lines using the "3" part of the string, so it needs to add one to the value that index() returns:


print $lines[substr($_, index($_, "|") + 1)];


This is your basic A.off-by-one error. As it is currently written, the function attempts to display $lines["|3"], which converts to $lines[0]. It displays the first line of the file over and over, as many times as there were lines in the file.

    Previous Section Table of Contents Next Section