2014-08-01

The Thorium actor engine is operational now, we can start to work on actor applications for metagenomics

I have been very busy during the last months. In particular, I completed my doctorate on April 10th, 2014 and we moved from Canada to the United States on April 15th, 2014. I started a new occupation on April 21st, 2014 at Argonne National Laboratory (a U.S. Department of Energy laboratory).

But the biggest change, perhaps, was not one listed in the enumeration above. The biggest change was to stop working on Ray. Ray is built on top of RayPlatform, which in turn uses MPI for the parallelism and distribution. But this approach is not an easy way of devising applications because message passing alone is a very leaky, not self-contained, abstraction. Ray usually works fine, but it has some bugs.

The problem with leaky abstractions is that they lack simplicity and are way too complex to scale out.

For example, it is hard to add new code to an existing code base without breaking anything. This is the case because MPI only offers a fixed number of ranks. Sure, the MPI standard has some features to spawn ranks, but it's not supported on most platforms and when it is ranks are spawned as operating system processes.

There are arguably 3 known methods to reduce the number of bugs. First is to (1) write a lot of tests. But it's better if you can have a lower number of bugs in the first place. The second one is to use pure (2) functional programming. The third is to use the (3) actor model.

If you look at what the industry is doing, Erlang, Scala (and perhaps D) use the actor model of computation. The actor model of computation was introduced by the legendary (that's my opinion) Carl Hewitt in two seminal papers (Hewitt, Bishop, Steiger 1973 and Hewitt and Baker 1977).

Erlang is cooler than Scala (this is an opinion, not a fact) because it enforces both the actor model and functional programming whereas Scala (arguably) does not enforce anything.

The real thing, perhaps, is to apply the actor model to high-performance computing. In particular, I am applying it to metagenomics because there is a lot of data. For example, Janet Jansson and her team generated huge datasets in 2011 in the context of a Grand Challenge.

So basically I started to work on biosal (biological sequence analysis library) on May 22th, 2014. The initial momentum for the SAL concept (Sequence Analysis Library) was created in 2012 at a workshop. So far, at least two projects (that I am aware of) are related to this workshop: KMI (Kmer Matching Interface) and biosal.

The biosal team is small: we are currently 6 people and we are only 2 that are pushing code.

Here is the current team:







Person (alphabetical order) Roles in biosal project
Pavan Balaji
  • MPI consultant
Sébastien Boisvert
  • Master branch owner
  • Actor model enthusiast
  • Metagenomics person
  • Scrum master
Huy Bui
  • PAMI consultant
  • Communication consultant
Rick Stevens
  • Supervisor
  • Metagenomics person
  • Stakeholder
  • Product owner
  • Exascale computing enthusiast
Venkatram Vishwanath
  • Actor model enthusiast
  • Exascale computing enthusiast
Fangfang Xia
  • Product manager
  • Actor model enthusiast
  • Metagenomics person





When I started to implement the runtime system in biosal, I did not plan to give a name to that component. But I changed my mind because the code is general and very cool. It is a distributed actor engine in C 1999, MPI 1.0, and Pthreads and it's named Thorium (like the atom).

Thorium uses the actor model, but does not use functional programming.

It is quite easy to get started with this. It is a two step process.

The first step is to create an actor script (3 C functions called init, destroy and receive). For a given actor script, you need to write 2 files (a H header file and a C implementation file).

The first step defines a actor script structure like this:

struct bsal_script hello_script = {
    .name = HELLO_SCRIPT,
    .init = hello_init,
    .destroy = hello_destroy,
    .receive = hello_receive,
    .size = sizeof(struct hello),
    .description = "hello"
};
The prototype for the 3 functions are:



Function Concrete actor function
init
void hello_init(struct bsal_actor *actor);
destroy
void hello_destroy(struct bsal_actor *self);
receive
void hello_receive(struct bsal_actor *self,
   struct bsal_message *message);


The functions init and destroy are called automatically by Thorium when an actor is spawned and killed, respectively. The function receive is called automatically by Thorium when the actor receives a message. Sending messages is the only way to interact with an actor.

There is only one (very simple) way to send a message to an actor:

void bsal_actor_send(struct bsal_actor *self, int destination, struct bsal_message *message);


The second step is to create a Thorium runtime node in a C file with a main function (around 10 lines).

After creating the code in two easy steps, you just need to compile and link the code.

After that, you can perform actor computations anywhere. A typical command to do so is:

mpiexec -n 1024 ./hello_world -threads-per-node 32
 
 

Obviously, you need more than just one actor script to actually something cool with actors.


On a final note, biosal is an object-oriented project. The current object is typically called self, like in Swift, Ruby, and Smalltalk.

4 comments:

Olivier said...

Congrats on finishing your doctorate! You are awesome!

Here are a few thoughts on your post:
1) To reduce the number of bugs of a software, to write a lot of tests is not sufficient to conclude in a reduction of the number of bugs. One can write a lot of bad tests! ;-)
Going with Test-Triven Development would be, for me, a way to really try to reduce the number of bugs.

2) I would be really interested to know what is your argument for the link between less number of bugs and functional programming (as for the actor model, I haven't tried it yet, I can't tell).

3) I've read somewhere that Scala is a post-functional language, as it incorporates functional programming in the tools of the language. I personnaly have a preference for a language that allows me to use the paradigm of my choice, because imho, functional programming (also true for structured or OO) is not always the right paradigm to construct a given software.

Sébastien Boisvert said...

> 1)

We have a robot who is sending us daily report like this:

From: biosal-tests [no-reply@sns.amazonaws.com]
Subject: [biosal_bot][state=PASSED] biosal quality assurance report

A quality assurance report (state=PASSED) is available at http://biosal.s3.amazonaws.com/quality-assurance-department/2014-08-01-19:58:18.log


> 2)

Functional programming is very good at lowering the number of bugs mainly because of the single-assignment rule. http://en.wikipedia.org/wiki/Assignment_%28computer_science%29#Single_assignment

Actor model enforces real encapsulation using messages whereas objects allow one to call a method on the object. Actors provide true isolation.

> 3)

Sure.

My preference is to use object-oriented programming. Functional programming can be nice though.

Rick said...

Aren't you still tied to MPI? At least the way you are calling hello world indicates the requirement to use MPI.

Anyway good luck on the new project.

Sébastien Boisvert said...

Hi Rick,

That's a very good question !

Even if a software uses MPI on a Cray XE6, you have to use aprun instead of mpiexec to launch the computation:

aprun -n 256 -N 1 -d 24 \
argonnite -threads-per-node 24 -print-load -print-memory-usage \
-k 43 Iowa_Continuous_Corn/*.fastq -o test-beagle-512x24-1 > test-beagle-512x24-1.stdout



On a IBM Blue Gene/Q, you have to use qsub instead of mpiexec for MPI computations:

qsub \
-A CompBIO \
-n 1024 \
-t 01:00:00 \
-O issue-471-mira-1024x24-1 \
--mode c1 \
argonnite -print-load -threads-per-node 24 -k 43 Iowa_Continuous_Corn/*.fastq -o issue-471-mira-1024x24-1


But yes, Thorium currently only support MPI for the transport layer.

For Blue Gene, Huy will implement PAMI transport (which will still use qsub to launch the job).

For Cray, we will eventually add uGNI transport.

Finally, perhaps the best transport that Thorium will have will be TCP.

So on a Cray, you'll be able to use Thorium with one of these:

- aprun + MPI
- aprun + uGNI
- aprun + TCP
(compute nodes have a IPv4 address, at least on Cray XE67 they do).

On a Blue Gene/Q, you'll be able to use Thorium with these:

- qsub + MPI
- qsub + PAMI
- aprun + TCP
(Venkat told me that compute nodes on BGQ have IP addresses)


On anything else, you'll be able to use Thorium with these:


- mpiexec + MPI
- mpiexec + TCP


The transport code is here:

https://github.com/sebhtml/biosal/tree/master/engine/thorium/transport



As far as I know, MPI, PAMI, uGNI all fail when a compute node fails.

That's why we want to have TCP transport in Thorium: to be able to sustain loss of compute nodes (and add compute nodes during a computation).



There was an error in this gadget