1.2 Race Conditions¶
This section builds on the simple fork-join and SPMD patterns and introduces the notion of a race condition. By the end of this section, you should be able to define what a race condition is, identify race conditions in sample code, and extrapolate some strategies for fixing them.
1.2.1 Estimating the Area Under a Curve¶
As our next example, let’s look at the problem of estimating the area under a curve. If you have taken a calculus course, you may recognize this problem as a Riemann sum, specifically the trapezoid rule, which approximates the area under the curve (i.e. the integral) by splitting it into a series of trapezoids.
In the following programs, we attempt to use the trapezoid rule to approximate the integral
using \(2^{20}\) equal subdivisions. The answer from this computation should be 2.0. The following video shows how a single thread would solve this problem:
In this example, the single thread serially computes the area of each trapezoid and adds the areas of all the trapezoids together into one value.
A C implementation of this program may look like the following:
- The program works as expected (outputs answer 2.0)
- Correct! The area under the curve sums to 2.
- The program returns a different approximate value (but the same value with every run)
- Not quite. Did you try running the code block above?
- The program returns a different approximate value every run.
- Try again. Did you run the code block above?
Q-3: Run the above serial implementation of integration a few times. What do you discover?
Let’s now consider how we can use multiple threads to approximate the area under the curve in parallel. One strategy would be to assign each thread a subset of the total set of subdivisions, so that each thread separately computes its assigned set of trapezoids.
The following video illustrates how 4 threads would work together to approximate the area under the curve:
- Task parallelism
- Correct! In this example, each thread is assigned a separate subset of trapezoids, and the threads work together to solve the larger problem.
- Data parallelism
- Incorrect. Remember that in data parallelism, each thread is operating on a different unit of data or memory.
- Neither
- Actually, it is one of the options listed!
Q-5: Is estimating the area under the curve an example of data parallelism or task parallelism?
1.2.2 Parallel Integration - First Attempt¶
One of the advantages of OpenMP is the ability to incrementally add parallelism to a program. Using what we learned about the fork-join pattern and the pragmas available in the last section, let’s update the serial version of the integration program with OpenMP pragmas:
Our parallel implementation adds just two lines to the serial code. First, we include the header file <omp.h>, in order to access all the functions available to us in the OpenMP library. The second line is the inclusion of the omp parallel for pragma on line 26. Recall that the omp parallel for pragma combines the functionality of the omp parallel and omp for pragmas we covered in the last section. Specifically, the omp parallel for pragma:
creates a team of threads
assigns each thread a subset of iterations of the for loop
joins the threads back into a single threaded process at the end of the for loop.
For a machine running 4 threads, each thread receives n/4 (in this case 262,144) iterations of the for loop, with each thread running its subset of the for loop in parallel. The omp parallel for pragma also has some additional clauses. The private(i) clause states that the variable i is private to each thread; in other words, each thread has its own copy of the variable i. In contrast, the shared(a, n, h, integral) clause specifies that the variables a, n, h, and integral are shared amongst the threads. In other words, there is exactly one copy of each of the a, n, h, and integral variables, and all the threads have equal access to them.
If our program is parallelized correctly, it should estimate the area under the curve as 2.0, identical to the output of the serial program.
- The program works as expected (outputs answer 2.0)
- Not quite. Try running the program a few more times. What do you see?
- The program returns a different approximate value (but the same value with every run)
- Close, but not correct. Try running the program again. Do you really get the same answer?
- The program returns a different approximate value every run.
- Correct. Something is seriously wrong with our program!
Q-7: Run the OpenMP implementation of integration a few times. What do you discover?
- This scenario illustrates a classic limitation of task parallelism.
- Nope. This is actually not an issue with task parallelism. We can reproduce the issue with data parallelism too.
- The program is losing values or overwriting values somewhere.
- Correct! Can you figure out why?
- Fairies have infested the computer and are wreaking havoc as we speak.
- Nope! Thankfully, our servers are fairy-proof. :)
- It's some other issue.
- Actually, the issue is listed in one of the options!
Q-8: Do you have any guesses on what could be causing the issue?
1.2.3 Returning to the Array Example¶
While some may find it tempting to point the finger at task parallelism for our incorrect result, the truth is that we can get an incorrect result even when employing the SPMD pattern. Suppose we modify the array population problem to one of array addition. In other words, instead of simply populating an array with random values, we add up the values in the array in parallel.
Here is a modified code snippet with the omp parallel for pragma placed in the correct position but commented out. Since the sum of the n values \(1 \ldots n\) is \(\frac{n(n+1)}{2}\), we know the sum when n is 20 million is a really long number: 200,000,010,000,000.
Running the snippet with the pragma still commented out should confirm that the sum is the long value shown above.
- Yes, the program always prints out the correct result.
- Incorrect. Did you remember to uncomment the pragma? Did you run the code a few times?
- No, the result is different every time.
- Correct! Once again, the program is losing or overwriting values somewhere...
Q-10: Now uncomment the pragma on line 22 and re-run the program a few times. Do you always get the correct result?
1.2.4 Race Conditions and Critical Sections¶
To understand what is going on, we need to define a few new terms: race condition, critical section, and lock. Watch the following video, which uses an analogy of adding up an array in parallel to explain what these terms mean:
Q-13: Match descriptions that best define each term.
- arises when two or more threads attempt to modify a shared variable at the same time
- race condition
- smallest set of instructions that must execute sequentially to ensure correctness
- critical section
- a mechanism by which to protect a resource
- lock
- common pattern that often causes race conditions
- read-modify-write
Let’s watch another video that explains how these mechanisms can help fix the issue in our program:
The omp critical pragma allows a programmer to define a critical section. At first glance, it is tempting to place the omp critical pragma around the statement sum += array[i]. However, since the critical section should be as small as possible, it is necessary to separate the summing of the array elements from the update to the shared sum variable. The following program does just that:
If you are having trouble understanding how the sum += local_sum statement on line 34 translates to a read-modify-write pattern, recall that a single line of code often translates to multiple instructions in assembly. The line can also be written as sum = sum + local_sum. This single line of code involves:

reading the sum variable (and the local_sum variable),

adding the values of sum and local_sum together, and

writing the total to the sum variable.
Lastly, observe that fixing the race condition added several lines to our program. One way to shorten the code is to use a new pragma called omp atomic. The omp atomic pragma forces the program to treat the enclosed statement as atomic, that is, as something that executes without interruption. The following program illustrates how the omp atomic pragma can be used to shorten the program:
- The revised program is slower.
- Correct! In fact, it is so slow it exceeds Runestone's time limit!
- Both versions of the program take an equal amount of time to execute.
- Incorrect. Did you try modifying the program and re-running it?
- The revised program is faster.
- Incorrect. Did you try modifying the program and re-running it?
- The program is impossible to revise. You always get a syntax error.
- No, it's fixable. Try replacing the word 'atomic' with the word 'critical'.
Q-17: Try replacing the omp atomic pragma (line 24) with the omp critical pragma and re-run the program. What happens?
In the next section, we cover a technique called reduction that offers another alternative.