Message passing, MPI ranks, threads, and mini-ranks

- October 20, 2012

Hey,

The name's Sébastien Boisvert (Sebastian GreenWood in English).

It's been a while since my last significant post. That's because I was busy.

What I have been up to

I have been busy for the last 4 months (July, August, September, October 2012) with these major tasks:

- preparing a manuscript about scalable metagenomics;
- submitting (and resubmitting owing to editorial rejections) my manuscript;
- coding (Ray plugins, RayPlatform engine);
- participating in a 1-week workshop in Utah, U.S.A. in August 2012;
- visiting researchers at Argonne National Laboratory in October 2012;
- preparing the 2013 Compute Canada proposal for my director;
- helping with a Genome Canada grant application based on Ray plugins and RayPlatform;
- buying a new computer (my Samsung NP-NF210 lost some keys);
- working on a contract for the CLUMEQ supercomputing center (they are picky about confidential information).

Manuscript publication means more coding

I am about to submit my revised manuscript to the editor. This should give me more coding time once the manuscript ships.

New computer

My new computer is a Lenovo ThinkPad X230 with Fedora 17. I ditched Gentoo because I wanted systemd, the latest gcc (4.7.2) and the latest buggy Linux kernel (3.6.2). In contrast with Unity in Ubuntu, which I dislike, I really like GNOME 3. One important thing for me is having applications grouped in categories instead of a brain-dead endless list of applications.

Hardware is hierarchical like society

Most of today's processor architectures are not as flat as a slice of bread. Indeed, these electronics blueprints exhibit nested and hierarchical designs. For example, a supercomputer node usually has one or more sockets. Each of these sockets accommodates a single processor. In turn, a processor has one or more cores. And finally, a core has one or more hardware execution threads.

Software is (or should be) hierarchical

Any modern operating system abstracts 3 major components:

- processors => processes;
- physical memory => virtual memory;
- storage appliances => virtual file systems.

When running Linux, each hardware execution thread is reported as a virtual processor in /proc/cpuinfo. A process can more or less be scheduled on one hardware thread, but the software ecosystem also provides a nested way of organizing computing tasks: within any process managed by the operating system, there can be any number of threads running.
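The nesting described above (hardware threads reported by the operating system, and software threads living inside one process) can be sketched in a few lines of C++. This is only an illustration: the function names are hypothetical, and the value returned by std::thread::hardware_concurrency() usually matches the processor count in /proc/cpuinfo but is allowed to be 0 or to differ (for example under CPU affinity restrictions).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Number of hardware execution threads reported to this process.
// The standard allows 0 when the value cannot be determined.
unsigned int reportedHardwareThreads() {
    return std::thread::hardware_concurrency();
}

// Start `count` software threads inside the current process, wait for all
// of them, and return how many actually ran.
std::size_t runThreadsInsideProcess(std::size_t count) {
    std::atomic<std::size_t> executed{0};
    std::vector<std::thread> threads;
    threads.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        threads.emplace_back([&executed] { executed.fetch_add(1); });
    }
    assert(threads.size() == count);
    for (std::thread& t : threads) {
        t.join();
    }
    return executed.load();
}
```

Calling runThreadsInsideProcess with a count larger than reportedHardwareThreads() still works: the operating system simply time-shares the hardware threads among the software threads.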

MPI (message-passing interface) is not hierarchical

MPI (which stands for Message Passing Interface) is a standard for passing messages between processes. These processes need not be on the same machine -- they can be relatively remote processes. But because message passing is between processes, there is no room for hierarchy.

Figure 1: The MPI programming model.

            +--------------------+
            |   MPI_COMM_WORLD   |           MPI communicator
            +---------+----------+
                      |
    +------+------+---+--+------+------+
    |      |      |      |      |      |
  +---+  +---+  +---+  +---+  +---+  +---+
  | 0 |  | 1 |  | 2 |  | 3 |  | 4 |  | 5 |    MPI ranks
  +---+  +---+  +---+  +---+  +---+  +---+

Meeting Professor Rick Stevens at Argonne National Laboratory

The one single hour spent in the office of Rick Stevens was very productive.

My opinion is that explicit hybrid models (MPI+OpenMP, MPI+pthreads, MPI+Windows threads) are nice and all, but they fall short because they lack uniformity. The programmer has to deal with two programming models (MPI and threads), both of which are arguably difficult on their own.

A pure thread-only application cannot scale beyond one node. And a pure MPI application does not use threads.

So the question was: is there a way to write your application with only message passing in mind (that's easier than with threads because it's lockless), but at the same time run some of the computation in threads instead of in processes?

Thus was born the concept of mini-ranks in software!

Mini-ranks

The hierarchical design of mini-ranks was introduced in 2008 in the field of hardware memory subsystem design ("Mini-rank: Adaptive DRAM architecture for improving memory power efficiency", IEEE, 2008).

I did not get everything in this paper as I am no expert in that field.

But for mini-ranks in distributed programming, the idea is fairly simple. The application is coded as usual, using only one message inbox and one message outbox per rank. But instead of mapping each rank to a process, each rank is actually a mini-rank running in a thread. See the figure below.

Figure 2: The MPI programming model, with mini ranks.

            +--------------------+
            |   MPI_COMM_WORLD   |           MPI communicator
            +---------+----------+
                      |
    +------+------+---+--+------+------+
    |      |      |      |      |      |
  +---+  +---+  +---+  +---+  +---+  +---+ 
  | 0 |  | 1 |  | 2 |  | 3 |  | 4 |  | 5 |    MPI ranks   (1 VirtualMachine.cpp instance per rank)
  +---+  +---+  +---+  +---+  +---+  +---+                    with the main for MPI calls
  |   |  |   |  |   |  |   |  |   |  |   |
  | 0 |  | 4 |  | 8 |  |12 |  |16 |  |20 |  |                
  | 1 |  | 5 |  | 9 |  |13 |  |17 |  |21 |  | => mini ranks
  | 2 |  | 6 |  |10 |  |14 |  |18 |  |22 |  |
  | 3 |  | 7 |  |11 |  |15 |  |19 |  |23 |  |  (1 Minirank instance per minirank (in 1 pthread))
  |   |  |   |  |   |  |   |  |   |  |   |
  +---+  +---+  +---+  +---+  +---+  +---+       (will wrap Machine.cpp and ComputeCore.cpp)
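The numbering in Figure 2 follows a simple rule: with 4 mini-ranks per MPI rank, mini-ranks 0 to 3 live in rank 0, 4 to 7 in rank 1, and so on. A minimal sketch of that mapping is below. The struct and function names are hypothetical (they are not taken from RayPlatform), and the block layout is an assumption read off the figure.

```cpp
#include <cassert>

// Where a mini-rank lives: which MPI rank (process) hosts it, and which
// slot (thread) it occupies inside that rank.
struct MiniRankLocation {
    int mpiRank;
    int localIndex;
};

// Global mini-rank number -> (MPI rank, local slot), assuming each MPI rank
// hosts `miniRanksPerRank` consecutive mini-ranks as in Figure 2.
MiniRankLocation locateMiniRank(int miniRank, int miniRanksPerRank) {
    assert(miniRank >= 0 && miniRanksPerRank > 0);
    return MiniRankLocation{miniRank / miniRanksPerRank,
                            miniRank % miniRanksPerRank};
}

// Inverse mapping: (MPI rank, local slot) -> global mini-rank number.
int globalMiniRank(int mpiRank, int localIndex, int miniRanksPerRank) {
    assert(mpiRank >= 0 && localIndex >= 0 && localIndex < miniRanksPerRank);
    return mpiRank * miniRanksPerRank + localIndex;
}
```

With the values of Figure 2, locateMiniRank(13, 4) gives MPI rank 3 and local slot 1, and globalMiniRank(5, 3, 4) gives mini-rank 23. A message addressed to a mini-rank can then be sent with a real MPI call to the hosting rank and handed to the right thread on arrival.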

First, I tested the ability of spinlocks to synchronize everything.
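The post does not show the spinlock code that was tested, so here is only a minimal sketch of the idea, assuming a spinlock built on std::atomic_flag that guards a shared message queue. The class and function names (SpinLock, Mailbox, countDeliveredMessages) and the Message fields are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Minimal spinlock: a waiting thread busy-loops instead of sleeping.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin until the current holder calls unlock()
        }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct Message {
    int source;
    int payload;
};

// Message queue protected by the spinlock.
class Mailbox {
    SpinLock lock;
    std::deque<Message> messages;
public:
    void push(const Message& m) {
        std::lock_guard<SpinLock> hold(lock);
        messages.push_back(m);
    }
    // Returns false when the mailbox is empty.
    bool pop(Message& out) {
        std::lock_guard<SpinLock> hold(lock);
        if (messages.empty()) return false;
        out = messages.front();
        messages.pop_front();
        return true;
    }
};

// Several threads push into one mailbox; the caller then drains it and
// returns the number of messages found.
int countDeliveredMessages(int senderCount, int messagesPerSender) {
    assert(senderCount >= 0 && messagesPerSender >= 0);
    Mailbox mailbox;
    std::vector<std::thread> senders;
    for (int s = 0; s < senderCount; ++s) {
        senders.emplace_back([&mailbox, s, messagesPerSender] {
            for (int i = 0; i < messagesPerSender; ++i) {
                mailbox.push(Message{s, i});
            }
        });
    }
    for (std::thread& t : senders) t.join();
    int delivered = 0;
    Message m{0, 0};
    while (mailbox.pop(m)) ++delivered;
    return delivered;
}
```

A spinlock fits here because each critical section is tiny (one push or one pop) and, in the mini-ranks model, each thread is meant to own a hardware thread, so busy-waiting does not steal time from other work.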

Yesterday, I completed the port of RayPlatform to this mini-ranks programming model.

Porting Ray plugins to the new RayPlatform will be straightforward.

With this model, there is one MPI rank per node. One of the hardware threads does the communication and, for the rest, there is one mini-rank per hardware thread.
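The thread layout described above can be sketched inside a single process, without MPI. In this sketch every mini-rank thread has one inbox and one outbox, and the calling thread plays the role of the communication thread by moving messages from outboxes to the destination inboxes. All names are hypothetical, std::mutex is used instead of spinlocks for brevity, and a real implementation would forward messages for remote mini-ranks through MPI calls where this sketch only delivers locally.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Message {
    int source;
    int destination;
};

// Thread-safe queue used for both the inbox and the outbox.
struct Box {
    std::mutex guard;
    std::deque<Message> queue;
    void push(const Message& m) {
        std::lock_guard<std::mutex> hold(guard);
        queue.push_back(m);
    }
    bool pop(Message& out) {
        std::lock_guard<std::mutex> hold(guard);
        if (queue.empty()) return false;
        out = queue.front();
        queue.pop_front();
        return true;
    }
};

struct MiniRank {
    Box inbox;
    Box outbox;
};

// One hardware thread is kept for communication, the rest host mini-ranks.
unsigned int miniRanksForNode(unsigned int hardwareThreads) {
    return hardwareThreads > 1 ? hardwareThreads - 1 : 1;
}

// Each mini-rank sends one message to its right neighbour (a ring) and waits
// for one message. Returns the number of messages received.
int runRing(int miniRankCount) {
    assert(miniRankCount >= 0);
    std::vector<MiniRank> miniRanks(miniRankCount);
    std::atomic<int> received{0};
    std::vector<std::thread> workers;
    for (int id = 0; id < miniRankCount; ++id) {
        workers.emplace_back([&, id] {
            MiniRank& self = miniRanks[id];
            self.outbox.push(Message{id, (id + 1) % miniRankCount});
            Message m{0, 0};
            while (!self.inbox.pop(m)) std::this_thread::yield();
            received.fetch_add(1);
        });
    }
    // Communication thread role: route outbox messages to inboxes.
    while (received.load() < miniRankCount) {
        for (MiniRank& miniRank : miniRanks) {
            Message m{0, 0};
            while (miniRank.outbox.pop(m)) {
                miniRanks[m.destination].inbox.push(m);
            }
        }
        std::this_thread::yield();
    }
    for (std::thread& w : workers) w.join();
    return received.load();
}
```

The point of the design is that the code inside each worker only ever touches its own inbox and outbox, exactly as a pure message-passing application would, while the threads stay hidden in the runtime.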
