1.2 Race Conditions¶
This section builds on the simple fork-join and SPMD patterns and introduces the notion of a race condition. By the end of this section, you should be able to define what a race condition is, identify race conditions in sample code, and extrapolate some strategies for fixing them.
1.2.1 Estimating the Area Under a Curve¶
As our next example, let’s look at the problem of estimating the area under a curve. If you have taken a calculus course, you may recognize this problem as a Riemann sum, specifically the trapezoid rule, which approximates the area under the curve (i.e. the integral) by splitting it into a series of trapezoids.
In the following programs, we attempt to use the trapezoid rule to approximate the integral
using \(2^{20}\) equal subdivisions. The answer from this computation should be 2.0. The following video shows how a single thread would solve this problem:
In this example, the single thread serially computes the area of each trapezoid and adds the areas of all the trapezoids together into one value.
A C implementation of this program may look like the following:
- The program works as expected (outputs answer 2.0)
- Correct! The area under the curve sums to 2.
- The program returns a different approximate value (but the same value with every run)
- Not quite. Did you try running the code block above?
- The program returns a different approximate value every run.
- Try again. Did you run the code block above?
Q-3: Run the above serial implementation of integration a few times. What do you discover?
Let’s now consider how we can use multiple threads to approximate the area under the curve in parallel. One strategy would be to assign each thread a subset of the total set of subdivisions, so that each thread separately computes its assigned set of trapezoids.
The following video illustrates how 4 threads would work together to approximate the area under the curve:
- Task parallelism
- Correct! In this example, each thread is assigned a separate subset of trapezoids, and the threads work together to solve the larger problem.
- Data parallelism
- Incorrect. Remember that in data parallelism, each thread is operating on a different unit of data or memory.
- Neither
- Actually, it is one of the options listed!
Q-5: Is estimating the area under the curve an example of data parallelism or task parallelism?
1.2.2 Parallel Integration - First Attempt¶
One of the advantages of OpenMP is the ability to incrementally add parallelism to a program. Using what we learned about the fork-join pattern and the pragmas available in the last section, let’s update the serial version of the integration program with OpenMP pragmas:
Our parallel implementation adds just two lines to the serial code. First, we include the header file <omp.h>, in order to access all the functions available to us in the OpenMP library. The second line is the inclusion of the omp parallel for pragma on line 26. Recall that the omp parallel for pragma combines the functionality of the omp parallel and omp for pragmas we covered in the last section. Specifically, the omp parallel for pragma:
creates a team of threads
assigns each thread a subset of iterations of the for loop
joins the threads back into a single threaded process at the end of the for loop.
For a machine running 4 threads, each thread receives n/4 (in this case 262,144) iterations of the for loop, with each thread running its subset of the for loop in parallel. The omp parallel for pragma also has some additional clauses. The private(i) clause states that the variable i is private to each thread; in other words, each thread has its own copy of the variable i. In contrast, the shared(a, n, h, integral) clause specifies that the variables a, n, h, and integral are shared amongst the threads. In other words, there is exactly one copy of each of the a, n, h, and integral variables, and all the threads have equal access to them.
If our program is parallelized correctly, it should estimate the area under the curve as 2.0, identical to the output of the serial program.
- The program works as expected (outputs answer 2.0)
- Not quite. Try running the program a few more times. What do you see?
- The program returns a different approximate value (but the same value with every run)
- Close, but not correct. Try running the program again. Do you really get the same answer?
- The program returns a different approximate value every run.
- Correct. Something is seriously wrong with our program!
Q-7: Run the OpenMP implementation of integration a few times. What do you discover?
- This scenario illustrates a classic limitation of task parallelism.
- Nope. This is actually not an issue with task parallelism. We can reproduce the issue with data parallelism too.
- The program is losing values or overwriting values somewhere.
- Correct! Can you figure out why?
- Fairies have infested the computer and are wreaking havoc as we speak.
- Nope! Thankfully, our servers are fairy-proof. :)
- It's some other issue.
- Actually, the issue is listed in one of the options!
Q-8: Do you have any guesses on what could be causing the issue?
1.2.3 Returning to the Array Example¶
While some may find it tempting to point the finger at task parallelism for our incorrect result, the truth is that we can get an incorrect result even when employing the SPMD pattern. Suppose we modify the array population problem to one of array addition. In other words, instead of simply populating an array with random values, we add up the values in the array in parallel.
Here is a modified code snippet with the omp parallel for pragma placed in the correct position but commented out. Since the sum of the n values \(1 \ldots n\) is \(\frac{n(n+1)}{2}\), we know the sum when n is 20 million is a really long number: 200,000,010,000,000.
Running the snippet with the pragma still commented out should confirm that the sum is the long value shown above.
- Yes, the program always prints out the correct result.
- Incorrect. Did you remember to uncomment the pragma? Did you run the code a few times?
- No, the result is different every time.
- Correct! Once again, the program is losing or overwriting values somewhere...
Q-10: Now uncomment the pragma on line 22 and re-run the program a few times. Do you always get the correct result?
1.2.4 Race Conditions and Critical Sections¶
To understand what is going on, we need to define a few new terms: race condition, critical section, and lock. Watch the following video, which uses an analogy of adding up an array in parallel to explain what these terms mean:
Q-13: Match descriptions that best define each term.
- arises when two or more threads attempt to modify a shared variable at the same time
- race condition
- smallest set of instructions that must execute sequentially to ensure correctness
- critical section
- a mechanism by which to protect a resource
- lock
- common pattern that often causes race conditions
- read-modify-write
Let’s watch another video that explains how these mechanisms can help fix the issue in our program:
The omp critical pragma allows a programmer to define a critical section. At first glance, it is tempting to place the omp critical pragma around the statement sum += array[i]. However, since the critical section should be as small as possible, it is necessary to separate the summing of the array elements from the update to the shared sum variable. The following program does just that:
If you are having trouble understanding how the sum += local_sum statement on line 34 translates to a read-modify-write pattern, recall that a single line of code often translates to multiple instructions in assembly. The line can also be written as sum = sum + local_sum. This single line of code involves:

reading the sum variable (and the local_sum variable),

adding the values of sum and local_sum together, and

writing the total to the sum variable.
Lastly, observe that fixing the race condition added several lines to our program. One way to shorten the code is to use a new pragma called omp atomic. The omp atomic pragma forces the program to treat the enclosed statement as atomic, that is, as something that executes without interruption. The following program illustrates how the omp atomic pragma can be used to shorten the program:
- The revised program is slower.
- Correct! In fact, it is so slow it exceeds Runestone's time limit!
- Both versions of the program take an equal amount of time to execute.
- Incorrect. Did you try modifying the program and re-running it?
- The revised program is faster.
- Incorrect. Did you try modifying the program and re-running it?
- The program is impossible to revise. You always get a syntax error.
- No, it's fixable. Try replacing the word 'atomic' with the word 'critical'.
Q-17: Try replacing the omp atomic pragma (line 24) with the omp critical pragma and re-run the program. What happens?
In the next section, we cover a technique called reduction that offers another alternative.