Previous Section Table of Contents Next Section

16.1 Debugging and Parallel Programs

Parallel code presents new difficulties, and the task of coordinating processes can result in some novel errors not seen in serial code. While elaborate classification schemes for parallel problems exist, there are two broad categories of errors in parallel code that you are likely to come up against. These are synchronization problems that stem from inherent nondeterminism found in parallel code and deadlock. While we can further subclassify problems, you shouldn't be too concerned about finer distinctions. If you can determine the source of error and how to correct it, you can leave the classification to the more academically inclined.

Synchronization problems result from variations in the order that instructions may be executed when spread among multiple processes. By contrast, serial programs are deterministic, executing each line of code in the order it was written. Once you start forking off processes, all bets are off. Moreover, since the loads on machines fluctuate, as does the competition for communications resources, the timing among processes can vary radically from run to run. One process may run before another process one day and lag behind it the next. If the order of execution among cooperating processes is important, this can lead to problems. For example, the multiplication of matrices is not commutative. If you are multiplying a chain of matrices, you'll need to explicitly control the order in which the multiplications occur when dividing the problem among the processes. Otherwise, a race condition may exist among processes.

Deadlock occurs when two or more processes are waiting on each other for something. For example, if process A is waiting for process B to send it information before it can proceed, and if process B is waiting for information from process A before it can proceed, then neither process will be able to advance and send the other process what it needs. Both will wait, very patiently, for the other to act first. While this may seem an obvious sort of problem that should be easy to spot, deadlock can involve a chain of different processes and may depend on a convoluted path through conditional statement in code. As such, it can occur in very nonobvious ways. A variant of deadlock is livelock, where the process is still busy computing but can't proceed beyond some point.

This shouldn't intimidate you. While you may occasionally see explicitly parallel problems, most of the problems you are likely to see are not new. They are the same mistakes you'll have made with serial code. This is good news! It means you should already be familiar with most of the problems you'll see. It also suggests a strategy for developing and debugging code.

Start, whenever possible, with a serial version of the code. This will help you identify potential problems and work out the details of your code. Once you have the serial version fully debugged, you can move on to the parallel version. Depending of the complexity of the problem, the next step may be running the code with a small number of processes on the same machine. Only after this is working properly should you scale up the problem.

Since most problems are serial, we'll start with a quick review of debugging in general and then look at how we can expand traditional techniques to parallel programs.

    Previous Section Table of Contents Next Section