16.1 Debugging and Parallel Programs
Parallel code presents new difficulties, and the task of coordinating
processes can result in some novel errors not seen in serial code.
While elaborate classification schemes for parallel problems exist,
there are two broad categories of
errors in parallel code that you are
likely to come up against. These are
synchronization problems that stem
from inherent nondeterminism found in parallel code and deadlock.
While we can further subclassify problems, you
shouldn't be too concerned about finer distinctions.
If you can determine the source of error and how to correct it, you
can leave the classification to the more academically inclined.
Synchronization problems result from variations in the order that
instructions may be executed when spread among multiple processes. By
contrast, serial programs are deterministic, executing each line of
code in the order it was written. Once you start forking off
processes, all bets are off. Moreover, since the loads on machines
fluctuate, as does the competition for communications resources, the
timing among processes can vary radically from run to run. One
process may run before another process one day and lag behind it the
next. If the order of execution among cooperating processes is
important, this can lead to problems. For example, the multiplication
of matrices is not commutative. If you are multiplying a chain of
matrices, you'll need to explicitly control the
order in which the multiplications occur when dividing the problem
among the processes. Otherwise, a race condition may exist among
processes.
Deadlock
occurs when two or more processes are waiting on each other for
something. For example, if process A is waiting for process B to send
it information before it can proceed, and if process B is waiting for
information from process A before it can proceed, then neither
process will be able to advance and send the other process what it
needs. Both will wait, very patiently, for the other to act first.
While this may seem an obvious sort of problem that should be easy to
spot, deadlock can involve a chain of different processes and may
depend on a convoluted path through conditional statement in code. As
such, it can occur in very nonobvious ways. A variant of deadlock is
livelock, where the process is still busy computing but
can't proceed beyond some point.
This shouldn't intimidate you. While you may
occasionally see explicitly parallel problems, most of the problems
you are likely to see are not new. They are the same mistakes
you'll have made with serial code. This is good
news! It means you should already be familiar with most of the
problems you'll see. It also suggests a strategy for
developing and debugging code.
Start, whenever possible, with a serial version of the code. This
will help you identify potential problems and work out the details of
your code. Once you have the serial version fully debugged, you can
move on to the parallel version. Depending of the complexity of the
problem, the next step may be running the code with a small number of
processes on the same machine. Only after this is working properly
should you scale up the problem.
Since most problems are serial, we'll start with a
quick review of debugging in general and then look at how we can
expand traditional techniques to parallel programs.
|