A preface
Some time ago I started tinkering with OpenMP and forming a rough idea of how it worked when I discovered something that caught my eye. Of course, I tweeted it in a thread without missing a beat.
🧵 (1/6) I just made an interesting discovery in #OpenMP that, if you think about it, makes total sense. So, if you parallelize a loop, it's actually slower than making it sequential: pic.twitter.com/CJCu4c3Agb
— Mr. RattleBrain (@Mr_RattleBrain) November 24, 2024
A quick discovery I made: using a directive (I'll explain this in time) called #pragma omp parallel for can sometimes be less efficient than running a loop sequentially. Instead, there's another directive we should be using, #pragma omp for, which works as expected for parallelizing this particular case.
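I won't spoil the full explanation just yet, but to make the difference concrete, here is a minimal sketch of my own (not the code from the screenshot), assuming the loop already sits inside a parallel region, which is where the two directives genuinely diverge:

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int data[N];

    #pragma omp parallel  // we are already inside a parallel region here
    {
        // A nested "#pragma omp parallel for" would start a *new* parallel
        // region per thread; with nested parallelism disabled (the default),
        // every thread of the existing team would run the whole loop by
        // itself, duplicating the work instead of sharing it.

        // "#pragma omp for" shares the iterations among the threads that
        // already exist, which is what we actually want:
        #pragma omp for
        for (int i = 0; i < N; i++) {
            data[i] = i * i;
        }
    }

    for (int i = 0; i < N; i++) {
        printf("%d ", data[i]);
    }
    printf("\n");
    return 0;
}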
Now, this isn’t some obscure revelation I stumbled upon at 3 a.m. while sipping cold coffee. It’s (should be) actually pretty common knowledge for anyone who works with OpenMP regularly—or even just has a decent level of familiarity with it.
You might be thinking, "Well, obviously, you need to use the right directives for the job." And yes, you’d be correct. But this realization made me rethink my initial perception of OpenMP. It’s not just a dumbed-down "plug-and-play" tool for parallelizing code. It’s more nuanced than that. The fact that a single word can drastically impact performance proves it, and it got me wondering:
- What’s the most efficient way to use OpenMP?
- How far can we push it?
Well, I don’t have answers to either question, or not yet at least. But maybe this is what introduces you to OpenMP and you strike gold where I found just dirt.
In this post, we’re diving into the basics of OpenMP and exploring some cool things you can do with it. Don’t worry, you don’t need to read the whole thread. I’m going to introduce the technology first, as I always do.
What is OpenMP?
OpenMP is short for Open Multi-Processing. It's an API, defined by an open specification, that lets you write multithreaded code with "minimal" hassle (you'll understand later why I put "minimal" in quotes). It's basically a tool that helps you tell the compiler what you want to achieve, and it takes care of how it happens, so you can just focus on the big picture. Utopian, I hear you say? Yes. Actually using it is not as good as it sounds.
Of course, you’ll need to understand the basics of how OpenMP manages threads to use it effectively. But the nitty-gritty details of thread scheduling and execution? That’s mostly handled behind the scenes.
According to the official specification, there are a few terms you need to understand. If you don't, you're in for a world of confusion, trust me. And the doc doesn't help either. It's the longest, most boring documentation I have ever seen. And the quick reference guide assumes you are already familiar with it to a relatively high degree, so I will try my best to explain in simple terms what you need to understand before diving in.
Here are what I consider the basics. The most challenging concept to master is clauses. There are a ton of them, and many look similar or even produce the same result in simple scenarios. However, they behave slightly differently under the hood. Which means that a clause may work perfectly for a small case and fail miserably on a sufficiently large scale or an ever-so-slightly different case. And now you are scratching your head for hours, wondering where you messed up, only to realize the issue isn’t with your code, it’s that pesky clause you misused.
- Directives:
  Directives are special preprocessor commands in OpenMP that guide the compiler on how to parallelize sections of the code. They start with the #pragma keyword.
  Example:
  #pragma omp parallel // THE #pragma LINE IS THE DIRECTIVE; TOGETHER WITH THE BLOCK IT FORMS A CONSTRUCT
  {
      printf("Hello from thread %d\n", omp_get_thread_num());
  }
- Constructs:
  Constructs are blocks of code that define specific parallel regions, work-sharing, or synchronization behavior. They consist of a directive and, if the syntax defines one, an associated structured block that follows. Think of them as a function running parallelized code: everything between the curly brackets is the code the directive applies to.
  Example:
  #pragma omp sections
  {
      #pragma omp section
      {
          printf("This is section 1\n");
      }
      #pragma omp section
      {
          printf("This is section 2\n");
      }
  }
- Clauses:
  Clauses modify the behavior of directives or constructs, specifying how variables are shared, scheduled, or initialized in parallel regions. These are the hardest to understand and use, since you need to be familiar with the underlying process being used.
  Example:
  int i;
  #pragma omp parallel for private(i) // private(i) needs i to be visible here, hence the declaration above
  for (i = 0; i < 10; i++) {
      array[i] = i * i; // Parallel loop execution (array is declared elsewhere)
  }
- Routines:
  Routines are library functions provided by OpenMP to manage and query the parallel environment. Examples include getting the thread ID, setting the number of threads, etc.
  Example:
  #include <stdio.h>
  #include <omp.h>

  int main() {
      omp_set_num_threads(4); // ROUTINE TO SET THE NUMBER OF THREADS
      #pragma omp parallel
      {
          printf("Thread %d out of %d threads\n", omp_get_thread_num(), omp_get_num_threads());
      }
      return 0;
  }
As you can see, it's pretty intuitive to use. You tell the compiler what region of code you want it to make parallel and how. Then, magically, it just works (or not).
I’ve said it before, and I’ll say it again: clauses are hard. When you start working with OpenMP, directives and constructs are usually straightforward. Sure, there will be cases, like the one I shared earlier, where I used the wrong directive, but most of the time that's just not an issue. Clauses, however, are a different story. They can give you a major headache if you’re not familiar with each one of them.
The tricky part with clauses is their sheer number and how similar they can appear. Take private and firstprivate, for example. At first glance, they seem almost identical, but they’re really not. On the one hand, private creates a separate copy of the variable for each thread, but it doesn’t initialize those copies with the value from the original variable. On the other hand, firstprivate also creates a separate copy for each thread but initializes each copy with the original value the variable has when the parallel region begins. It’s subtle, but those small differences can have big implications.
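To make that concrete, here is a tiny sketch of my own; nothing from the spec, just a variable x and two parallel regions:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 42;

    // With private(x), each thread gets an UNINITIALIZED copy of x.
    // Reading it before writing is undefined, so we only write here.
    #pragma omp parallel private(x)
    {
        x = omp_get_thread_num();   // must initialize before use
        printf("private: thread %d sees x = %d\n", omp_get_thread_num(), x);
    }

    // With firstprivate(x), each thread's copy starts at 42.
    #pragma omp parallel firstprivate(x)
    {
        printf("firstprivate: thread %d starts with x = %d\n", omp_get_thread_num(), x);
    }

    // Neither clause writes back: the original x is untouched.
    printf("after both regions, x is still %d\n", x);
    return 0;
}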
Using the wrong clause might not just keep your code from being as fast as it could be; it might actively make it worse than the sequential version, or it can straight up return a wrong value. For example, if you use private when you really need the variable’s initial value, you could introduce subtle bugs or extra computations to manually initialize each thread’s copy, degrading performance. And when that happens, it’s frustrating. Very frustrating.
So, while clauses may not seem like a big deal at first, mastering them is critical. Those tiny monsters can make a huge difference in your code, and finding what the hell is wrong with it will drive you up the wall.
A boring masterclass in clauses
"Well, since clauses are so hard, can you teach us something? Or are you just going to complain over and over again?" Yes, yes, settle down. I'll give a quick rundown of the most common clauses, so you can start coding something useful.
First of all, you can always refer to the quick reference guide. I know I said it assumes you are already relatively familiar with the API, but that doesn't mean it's completely useless, does it? You may struggle a tad to figure out what it's trying to tell you, true, but you'll manage. I know my audience is smart.
Still, some clauses can be better explained, or at least given some clues as to when it's best to use them and when it's better to avoid them. I'll do a bit of foreshadowing and tell you there are a few clauses that are not the best at their job, so it's better to avoid them unless you have no other option.
- private
  Action: Creates a separate instance of the variable for each thread. The variable is uninitialized and exists only within the scope of the parallel region.
  Good when: You need each thread to have its own independent copy of a variable, such as temporary variables in calculations.
  Bad when: You need the variable to retain its initial value or its final state after the parallel region has ended.
- firstprivate
  Action: Similar to private, but the variable is initialized with the value it had before the parallel region begins.
  Good when: Each thread needs its own copy of a variable but also needs that copy to start with the value from the main thread.
  Bad when: You aren't certain of the value the variable holds when the region begins.
- shared
  Action: Makes a variable shared among all threads, meaning all threads can access and modify the same memory location.
  Good when: Variables represent shared resources or aggregate results, such as arrays or counters, with appropriate synchronization.
  Bad when: You lack proper synchronization mechanisms. Have you ever heard of race conditions? Oh boy, you're in for a treat.
- default
  Action: Specifies the default sharing attributes (shared, none, etc.) for variables in a parallel region.
  Good when: You want to ensure consistency and avoid repetitive declarations for many variables.
  Bad when: You need fine-grained control over individual variable attributes.
- reduction
  Action: Combines values from all threads into a single result using a specified operation (e.g., sum, product, min, max).
  Good when: Aggregating results across threads, such as summing elements in an array or finding the maximum value (see the sketch after this list).
  Bad when: The operation you need isn’t supported or is highly complex. This is one of those examples of a clause that's better to avoid if you can.
- schedule
  Action: Determines how iterations of a loop are divided among threads (static, dynamic, or guided).
  Good when: Controlling workload distribution in loops, especially when iterations vary in complexity (the sketch after this list uses it too).
  Bad when: You misjudge the workload balance, which can lead to thread starvation or suboptimal performance.
- nowait
  Action: Allows threads to proceed without waiting for others at the end of a work-sharing construct (e.g., for or sections).
  Good when: You know synchronization at the end of a construct isn’t necessary.
  Bad when: Synchronization is required for correctness. Again, if you don't know what race conditions are, I envy you.
- atomic
  Action: Ensures a specific memory update operation is performed atomically (either it completes all steps successfully or does nothing), preventing race conditions.
  Good when: You need simple, lightweight synchronization, such as incrementing a shared counter (again, see the sketch after this list).
  Bad when: Operations are complex, as atomic supports only basic updates.
- collapse
  Action: Collapses nested loops into a single loop for parallelization.
  Good when: Dealing with nested loops that can be parallelized together.
  Bad when: Loops depend on one another or the overhead of collapsing outweighs the benefit.
- ordered
  Action: Ensures that iterations of a loop are executed in the order they would be in a sequential program. It's another good example of a clause to avoid if you can.
  Good when: You need ordered execution within a parallel loop.
  Bad when: Performance is critical, as ordering can severely reduce parallel efficiency.
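To make a few of these less abstract, here is a small sketch of my own that puts reduction, schedule, and atomic together. Nothing fancy, just summing an array and counting how many elements cross a threshold:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 0.001 * i;

    double sum = 0.0;   // combined across threads with reduction(+:sum)
    int big_count = 0;  // shared counter, protected with atomic

    // reduction gives every thread its own private sum and combines them at
    // the end; schedule(static) hands each thread a contiguous chunk of
    // iterations, which is fine here because every iteration costs the same.
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
        if (a[i] > 500.0) {
            // atomic protects this single shared update without the cost of
            // a full critical section; leaving it out is a race condition.
            #pragma omp atomic
            big_count++;
        }
    }

    printf("sum = %f, values above 500: %d\n", sum, big_count);
    return 0;
}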
These descriptions are alright, but they are not sufficient in many ways. Honestly, I think you should take a look at the OpenMP Specification. Specifically, Part II of the book. There you can find a lot of clauses and directives, what they mean, and when it's best to use them. Are the descriptions a bit obscure and difficult to understand? Yes, but they are precise.
If you boldly clicked that link and realized that Part II is over 300 pages long, congratulations, now you understand why I say that clauses are hard. Well, to be completely fair, that section doesn't cover only clauses; it also explains directives and constructs. Still, that's just one of the six sections the book has to offer.
Performance comparison
The easiest way to see how OpenMP improves performance by just adding a simple directive is to compute nested loops in different ways and compare the results. This is how I started playing with OpenMP's API and seeing which directives work best with which kind of code. Obviously, you can find ways to perform tasks in parallel in virtually any code. Be that as it may, computing matrices is always the easiest to understand.
The code is available on my GitHub profile. You can find the exact programs used in this blog post here.
Installation
Well, I'm getting ahead of myself. I assume you don't have OpenMP installed on your system, so that goes first. We need the OpenMP library and LLVM.
Linux:
For any Linux distro, the name of the package to install is usually openmp. As an example, for Arch-based distros, it's as follows:
sudo pacman -Syy llvm openmp
If you already have LLVM installed on your system, you don't need to install it again, but if you aren't sure, just run that command.
macOS
Same as for Linux, you need both LLVM and the OpenMP library. However, in this case, the OpenMP package is called libomp instead of openmp. Using brew, run this:
brew install llvm
brew install libomp
Windows
I had to look this up, because I had no idea whether you have to install something other than the C/C++ compiler or whether the OpenMP library comes included with it. Apparently, it does: if you have Visual Studio with C/C++ support, MSVC ships its own OpenMP runtime, and you enable it with the /openmp compiler option. I may be wrong about the details, but I can't be bothered to check it manually on a Windows system, so you're going to have to figure the rest out by yourself.
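Whatever the platform, the package alone doesn't do anything until you ask the compiler for OpenMP support. Assuming gcc or clang, the flag is -fopenmp; the commands below are what I'd expect to work for the mandelbrot.c we'll write in a minute (exact paths may vary on your machine):

# Linux, with gcc or clang (-lm for the math library cabs() needs)
gcc -fopenmp mandelbrot.c -o mandelbrot -lm

# macOS, using Homebrew's clang (Apple's stock clang doesn't accept -fopenmp on its own)
$(brew --prefix llvm)/bin/clang -fopenmp mandelbrot.c -o mandelbrot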
Mandelbrot
Let’s be honest: a Mandelbrot set isn’t exactly a real-world use case. But for comparing performance, it’s more than good enough. I’ve written a simple C program that generates a Mandelbrot fractal and dumps it into a PPM image file both in single-threaded mode and using OpenMP. Now, I’m not going to dive into the details of how the Mandelbrot set works because this post is already long enough as it is, and it’s a bit outside our scope. But here’s a quick rundown: the Mandelbrot set is a type of fractal. And what’s a fractal, you ask? Oh boy, you’re in for a treat!
In short, a fractal is a self-replicating mathematical pattern that looks incredibly intricate, no matter how closely you zoom in. Basically nature’s version of recursive art, showing up in things like snowflakes, coastlines, and even the veins of leaves. The Mandelbrot set, in particular, is a famous example that arises from iterating simple equations in the complex plane. Its beauty lies in its infinite detail and stunning symmetry.
To code the Mandelbrot set, we need three key components:
- A function to create an image file to save the generated fractal
- A function to calculate the Mandelbrot value for each pixel
- A nested loop to process the image, pixel by pixel
We’ve got all three covered! Below, I’ve provided the single-threaded implementation in mandelbrot.c. Your mission? To optimize where no man has optimized before (or not) and do so to the best of your abilities. Experiment with OpenMP directives, constructs, and clauses to see how much performance you can squeeze out of it. Once you’ve done that, you can compare your results with my OpenMP implementation over on my GitHub. Who knows, you might come up with something even better than mine!
So, without further ado, here’s the single-threaded code for mandelbrot.c. Let the optimization games begin!
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <time.h>

#define WIDTH 1000    // Width of the image
#define HEIGHT 1000   // Height of the image
#define MAX_ITER 500  // Maximum number of iterations per pixel

/*
The write_image function writes the image to a file in PPM format.
*/
void write_image(const char *filename, int width, int height, int *buffer) {
    FILE *f = fopen(filename, "w");
    fprintf(f, "P3\n%d %d\n255\n", width, height);
    for (int i = 0; i < width * height; i++) {
        int color = buffer[i];
        fprintf(f, "%d %d %d ", color, color, color);
    }
    fclose(f);
}

/*
The mandelbrot function calculates the number of iterations required for the
complex number c to escape the circle of radius 2.0. If the number of iterations
exceeds the maximum number of iterations, the function returns the maximum number
of iterations. Otherwise, it returns the number of iterations required for the
complex number to escape the circle.
*/
int mandelbrot(double complex c) {
    double complex z = 0;
    int iter;
    // Think we can parallelize this loop? Why or why not?
    for (iter = 0; iter < MAX_ITER; iter++) {
        if (cabs(z) > 2.0) break; // Maybe this break statement is a problem?
        z = z * z + c;
    }
    return iter;
}

int main() {
    int *image = (int *)malloc(WIDTH * HEIGHT * sizeof(int));
    clock_t start_time = clock();
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            double real = (x - WIDTH / 2.0) * 4.0 / WIDTH;
            double imag = (y - HEIGHT / 2.0) * 4.0 / HEIGHT;
            double complex c = real + imag * I;
            int color = mandelbrot(c);
            image[y * WIDTH + x] = (color == MAX_ITER) ? 0 : (255 * color / MAX_ITER);
        }
    }
    clock_t end_time = clock();
    double execution_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
    printf("Execution time single thread: %f seconds\n", execution_time);
    write_image("mandelbrot.ppm", WIDTH, HEIGHT, image);
    free(image);
    return 0;
}
The mathematics of this are a little complicated, but not that much, so you should be able to understand roughly what's going on even if you've never coded the Mandelbrot set before. Still, I think it's interesting to include the image produced here, just in case you don't even know what the Mandelbrot fractal looks like.
Comparison
Well, after finishing the OpenMP version and running it, how did it go? Did you get a better result? Maybe not so good? Or worse? Don’t sweat it if you ended up with a worse result, chances are you’re overcomplicating the solution. OpenMP can be sneaky like that; simplicity is often the key to success.
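To give you a concrete picture of what I mean by "simple", here is the kind of change I have in mind. It's a sketch, not necessarily the code in my repo: keep write_image, mandelbrot, and the defines exactly as they are, add the OpenMP header, and only touch main. Note the switch from clock() to omp_get_wtime(): clock() adds up CPU time across all threads, so it would make the parallel version look slower than it really is.

#include <omp.h>  // in addition to the headers already in mandelbrot.c

int main() {
    int *image = (int *)malloc(WIDTH * HEIGHT * sizeof(int));
    double start_time = omp_get_wtime();  // wall-clock time, unlike clock()

    // One directive on the outer loop is enough. schedule(dynamic) helps
    // because rows near the set take far more iterations than rows far away.
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            double real = (x - WIDTH / 2.0) * 4.0 / WIDTH;
            double imag = (y - HEIGHT / 2.0) * 4.0 / HEIGHT;
            double complex c = real + imag * I;
            int color = mandelbrot(c);
            // Each thread writes to its own rows of image, so no race here.
            image[y * WIDTH + x] = (color == MAX_ITER) ? 0 : (255 * color / MAX_ITER);
        }
    }

    double execution_time = omp_get_wtime() - start_time;
    printf("Execution time with OpenMP: %f seconds\n", execution_time);
    write_image("mandelbrot_omp.ppm", WIDTH, HEIGHT, image);
    free(image);
    return 0;
}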
If you cracked the problem on your own, congrats! If you didn’t and had to check out my repo, what did you think? A little disappointing? I get it. OpenMP tends to take a lot of the "fun" out of multithreading. You don’t have to worry about thread management, synchronization, or most of the usual headaches. It’s both a blessing and a curse. Anyway, here are the results I got. Yours might differ slightly depending on your processor and what’s hogging your CPU in the background.
In my case, I achieved almost an order of magnitude improvement over the single-threaded version. Not bad, right?
The "Oh God, why did I use OpenMP"
So, if OpenMP is so great, why isn’t it the go-to solution for every multithreaded application out there? Well, there are some significant drawbacks that keep it from being universally embraced:
- Limited Scalability: OpenMP is best suited for shared-memory systems, which means it works great on a single machine with multiple cores. However, if your project needs to scale across distributed systems or clusters, OpenMP simply doesn't cut it.
- Overhead for Small Tasks: OpenMP’s setup overhead can outweigh its benefits for small workloads or fine-grained parallelism. It’s better for large, compute-heavy tasks where the overhead becomes negligible.
- Lack of Fine-Grained Control: OpenMP abstracts away much of the complexity of multithreading, which is fantastic for ease of use, but not so great if you need low-level control over thread behavior, memory placement, or task scheduling.
- Non-Portability: Yes, OpenMP is widely supported; however, the performance of OpenMP programs can vary significantly across compilers and hardware. A solution optimized for one system might underperform on another. This is partly what makes OpenMP not scale well.
- Potential for Subtle Bugs: Clauses like private and shared can introduce subtle issues if misused. Debugging parallel code is challenging, and OpenMP doesn’t magically solve that. If anything, it makes matters worse by hiding from you what is actually going on.
- Single Responsibility: OpenMP is focused exclusively on parallelism. If you need a solution for asynchronous tasks, reactive programming, or GPU acceleration, you’ll need to look elsewhere or combine it with other technologies.
So, even if it works perfectly for the task it was designed for, it still solves a very particular problem; it's not a universal multithreading tool. And if you underestimate the project you have at hand and pick it for its ease of use, at some point you will crash into a brick wall and stare at the ceiling for hours pondering, "Oh God, why did I use OpenMP".
Phew
And with that, we’re officially done! Seeing the impact OpenMP can have on a seemingly simple problem like Mandelbrot set generation is pretty cool. At least I think so. Of course, this was just scratching the surface. OpenMP is capable of so much more, but the rest is up to you to explore and experiment with.
I’ll confess, the moment I compiled the executable, curiosity got the better of me. I just *had* to take a peek at the Assembly code. My inner nerd couldn’t resist. What happened next? Absolutely nothing. I understood none of it. Zero. Zilch.
If deciphering the Assembly version of Mandelbrot is already hard enough, imagine throwing OpenMP into the mix. Yeah, I’m not that smart. But you know what? The idea of diving under the hood of OpenMP sounds very interesting, not gonna lie. Maybe a future post could dig into what OpenMP actually compiles down to. Would you like to see that? No promises, but it might be worth a shot.
Anyway, enough rambling. I hope you enjoyed this post and found OpenMP at least a little bit awesome. Go forth and tinker with it. Who knows what you’ll create?
Stay curious my nerds. I’ll catch you next time!