Previous Section Table of Contents Next Section

Merge Sort of Multiple Files

This function merges multiple sorted files into one large sorted result. It sorts in ascending order.

The program opens all the files and stores the handles in an array. It has a corresponding array that contains the current line in each file. When the end of a file is reached, it closes the file and takes its slot out of both arrays, so the sort then proceeds with one fewer file.

The handles are stored in variable names, but other than that, reading from the file is as usual in Perl. An assignment from the handle to a string reads a single line into the string, as shown in the following line:


$firstline = <$thishandle>;


The read returns undef when the end of file is reached.

The program assumes that the input files are already sorted in ascending order. The behavior is undefined if they are not.

This program opens files using the open() function, which takes a file handle and a filename. A file handle has no prefix character, and it is usually written in all uppercase for clarity. After the file is opened, the file handle is used the same as the STDIN file handle, and then closed with the close() function:


open FH, "logfile";

while (<FH>) {

    chomp($_);

    process($_);

}

close FH;


In the program here, the name of the handle is stored in a string variable and the code uses the variable in place of the handle


$thishandle = "FH$_";

open $thishandle, $thisfile;


to open the handles FH1, FH2, and so forth.

As with STDIN, when end-of-file is reached, reading from the file handle in scalar context returns undef.

Source Code


 1.     # open a handle to each file and read the first line

 2.

 3.     for (0..$#ARGV) {

 4.         $thisfile = $ARGV[$_];

 5.         $thishandle = "FH$_";

 6.         open $thishandle, $thisfile;

 7.         $firstline = <$thishandle>;

 8.         if (defined($firstline)) {

 9.             push @handlenames, $thishandle;

10.             push @currentlines, $firstline;

11.         } else {

12.             close $thishandle;

13.         }

14.     }

15.

16.     # now keep going until no more files have data left

17.

18.     while (@handlenames > 0) {

19.

20.         # find the file whose current line is first

21.

22.         $smallest = 0;

23.         for (1..$#handlenames) {

24.             if ($currentlines[$_] lt

25.                     $currentlines[$smallest]) {

26.                 $smallest = $_;

27.             }

28.         }

29.

30.         # display that line and read the next one

31.         print $currentlines[$smallest];

32.         $thishandle = $handlenames[$smallest];

33.         $currentlines[$smallest] = <$thishandle>;

34.

35.         # if no line was read, then close this file

36.         unless (defined($currentlines[$smallest])) {

37.             close $thishandle;

38.

39.             # take it out of @handlenames and @currentlines

40.             $justbefore = $smallest-1;

41.             $justafter = $smallest+1;

42.             @handlenames =

43.               ( @handlenames[0..$justbefore] ,

44.                 @handlenames[$justbefore..$#handlenames]);

45.             @currentlines =

46.               ( @currentlines[0..$justbefore] ,

47.                 @currentlines[$justbefore..$#currentlines]);

48.         }

49.     }


Suggestions

  1. What is the trivial case for this program?

  2. The @handlenames and @currentlines arrays need to be synchronized so they are always the same length and the entries at the same index correspond. To ensure that this is correct, verify every place where they are modified.

  3. When will the main while loop, starting at line 18, terminate?

  4. What is the goal of the loop on lines 22-28?

Hints

Walk through the program with the following inputs:

  1. A single file containing three lines with the words apple, carrot, and eggplant.

  2. Three files, each with one line, containing the words, respectively, tomato, cucumber, and parsley.

  3. Two files, the first containing three lines with the words airplane, boat, and train, the second containing two lines with the words car and helicopter.

Explanation of the Bug

The code on lines 42-47, to remove a file from the @handlenames and @currentlines arrays, has a B.variable error (or two, actually). The code


@handlenames =

  ( @handlenames[0..$justbefore] ,

    @handlenames[$justbefore..$#handlenames]);

@currentlines =

  ( @currentlines[0..$justbefore] ,

    @currentlines[$justbefore..$#currentlines]);


is using the wrong variable for the start of the second range. The code actually duplicates an element in the array, instead of removing an entry, and since the undefined entry at $currentlines[$smallest] will keep being determined to be the "smallest" by the code on lines 22-28, the program will loop forever printing out undefined strings and then increasing the size of the arrays by one.

It should use $justafter, not $justbefore, so lines 42-47 should read:


@handlenames =

  ( @handlenames[0..$justbefore] ,

   @handlenames[$justafter..$#handlenames]);

@currentlines =

  ( @currentlines[0..$justbefore] ,

   @currentlines[$justafter..$#currentlines]);


    Previous Section Table of Contents Next Section