Tuesday, January 17, 2012

Ray Platform: a message-passing-interface programming framework

The first pieces of Ray were put together in a git repository on 21 January 2010. Since that, Ray has undergo numerous major refactoring events.

The biggest changes to date are those of a few days ago. The source code of Ray is now divided in Ray Platform (around 13000 source lines of code including comments and blank lines) and Ray Application (around 29 000 source lines of code including comments and blank lines).

Ray Platform compiles independently of Ray Application -- it can be reused over and over again. I changed the license too. While Ray Application is still licensed under the GNU General Public License, version 3, Ray Platform is licensed under the GNU Lesser General Public License, version 3.

This means that Ray Platform can be incorporated in proprietary software (for free, if that matters). If changes are made to Ray Platform, however, these have to be shared.

How it works ?

The main class of the Ray Platform is ComputeCore. It represents a process mapped to a single processor core. An instance of this class has a message inbox, a message outbox, a virtual processor and a virtual communicator. The programmer can use the application programming interface to register callbacks for message tags, slave modes and master modes.

Message tag handlers

Any MPI message tag (type of message) has a default callback method to handle it that does nothing at all. Let's say the MPI tag is called MPI_TAG_GET_COFFEE. The corresponding callback is call_MPI_TAG_GET_COFFEE. The empty callbacks are generated in MessageTagHandler automatically. To overwrite one, a class that inherits from MessageTagHandler has to be defined with the appropriate new handler (with the same name). Then, the programmer needs to register an instance of this new class as being the handler of the MPI tag.

m_computeCore.setMessageTagObjectHandler(MPI_TAG_GET_COFFEE, &m_coffeeHandler);

Henceforth, when a message is received, Ray Platform will call m_coffeeHandler.call_MPI_TAG_GET_COFFEE(message).


Master and slave modes

In the Ray Platform, each ComputeCore has a current master mode and a current slave mode. The master mode of all ranks but the master is MASTER_RANK_DO_NOTHING. Just like for the message tags, the default handlers for slave modes and master modes can be overwritten with custom ones.


Links

Saturday, January 7, 2012

Making a github project readable as a web page

It is no secret, I love github (and git).

Today I updated my knowledge on javascript.

I started to wrote a game, the slim volleyball game to be exact (see reddit).

So I pushed the thing on github.
However, the files were not served as being part of a web page.

Fortunately, github implements a service called github pages which allows any github user to share a web page about something using the user's username.

This service also allows one to add a branch called 'gh-pages' (refs/heads/gh-pages to be exact I believe) to git repository to serve its content on the web. The content of the branch 'gh-pages' for projectX of userA is served as a web page at http://userA.github.com/projectX.


Initially, the only reference is usually 'master' (along with its remote twin).

seb@godzilla$ git branch -a
* master
  remotes/origin/master

To serve all the content of the branch 'refs/heads/master' on github pages, a symbolic reference called 'refs/heads/gh-pages' must be created.


seb@godzilla$ git symbolic-ref "refs/heads/gh-pages" "refs/heads/master"

Then you can list the branches.

seb@godzilla$ git branch -a
  gh-pages -> master
* master
  remotes/origin/master

The next thing you need to do is to push the references.

seb@godzilla$ git push --mirror

The gh-pages reference is now on the remote site too.

seb@godzilla$ git branch -a
  gh-pages -> master
* master
  remotes/origin/gh-pages
  remotes/origin/master

While the the project is available on github as usual, the same content is served on the web too !

Thursday, November 3, 2011

Qui est en charge ?

Construction d'un bâtiment

2012 commencera bientôt et le Centre de génomique de Québec est toujours en piètre état. Rénovation ici, bruits insupportables là. Et il devait être livré (ou délivré) en 2005. Il faut déjà réparer ses défectuosités majeures.

Selon le Journal de Québec, ce projet aura coûté (au moins) 22 M$.

L'ancien directeur du centre de recherche du CHUQ, Jean-Claude Forest, affirme que l'impact financier est en train d'être calculé.

Mais il est (un peu) tard pour les chercheurs.


Gestion informatique d'étudiants

Un autre échec est le système informatique Capsule de l'Université Laval -- un portail web qui gère (très mal) les informations des étudiants. Ce projet aura coûté 26 M$.

André Armstrong, le directeur du projet de modernisation des études, attribue son échec à la non-participation étudiante à des comités de suivi.

 Je trouve difficile à comprendre comment on peut laisser un projet ayant un aussi gros budget échouer autant à cause de sièges étudiants vacants.


Carnet informatisé de santé

"Les médecins pourraient traiter 20 % plus de patients s'ils avaient accès aux dossiers électroniques."

Mais le Dossier (informatique) de santé du Québec, avec un budget de 563 M$, doit maintenant traverser l'étape décisive: l'implémentation d'une interopérabilité. Il est très difficile de modifier des logiciels qui n'ont pas été initialement conçus pour opérer ensemble (inter-opération).

Un autre échec, quoi.


Marché émergent

Les firmes d'avocats adorent. La corruption et la collusion est un marché lucratif pour celles-ci. Il y a même des chercheurs qui poursuivent l'université à la Cours supérieure du Québec (et vice-versa).

Tuesday, September 27, 2011

Code review: What happens in Open-MPI's MPI_Iprobe ?

Code review: What happens in Open-MPI's MPI_Iprobe ?


Update 1

Subject: Re: [OMPI devel] Implementation of MPI_Iprobe
From: George Bosilca (bosilca_at_[hidden])
Date: 2011-09-27 15:34:05

Sebastien,

Your analysis is correct in case the checkpoint/restart approach maintained by ORNL is enabled. This is not the code path of the "normal" MPI processes, where the PML OB1 is used. In this generic case the function mca_pml_ob1_iprobe, defined in the file ompi/mca/pml/ob1/pml_ob1_iprobe.c is used.

george.


http://www.open-mpi.org/community/lists/devel/2011/09/9766.php

End of update 1


The message-passing interface (MPI) standard defines an interface for passing messages between processes.

These processes are not necessarily running on the same physical computer.


Open-MPI is an implementation of the MPI standard.

Here, I have utilised openmpi-1.4.3 to find out what is happening when a call to MPI_Iprobe occurs.


According to the MPI 2.2 standard


"The MPI_PROBE and MPI_IPROBE operations allow incoming messages to be checked for, without actually receiving them."


The prototype of MPI_Iprobe is

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status);


In ompi/mpi/c/iprobe.c, MPI_Iprobe is embodied.


The code (some lines omitted):

ompi/mpi/c/iprobe.c

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2008 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
{
    int rc;

    MEMCHECKER(
        memchecker_comm(comm);
    );

    if ( MPI_PARAM_CHECK ) {
        rc = MPI_SUCCESS;
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
        if (((tag < 0) && (tag != MPI_ANY_TAG)) || (tag > mca_pml.pml_max_tag)) {
            rc = MPI_ERR_TAG;
        } else if (ompi_comm_invalid(comm)) {
            rc = MPI_ERR_COMM;
        } else if ((source != MPI_ANY_SOURCE) &&
                   (MPI_PROC_NULL != source) &&
                   ompi_comm_peer_invalid(comm, source)) {
            rc = MPI_ERR_RANK;
        }
        OMPI_ERRHANDLER_CHECK(rc, comm, rc, FUNC_NAME);
    }

    if (MPI_PROC_NULL == source) {
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;
            /*
             * Per MPI-1, the MPI_ERROR field is not defined for single-completion calls
             */
            MEMCHECKER(
                opal_memchecker_base_mem_undefined(&status->MPI_ERROR, sizeof(int));
            );
        }
        return MPI_SUCCESS;
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = MCA_PML_CALL(iprobe(source, tag, comm, flag, status));

    /*
     * Per MPI-1, the MPI_ERROR field is not defined for single-completion calls
     */
    MEMCHECKER(
        opal_memchecker_base_mem_undefined(&status->MPI_ERROR, sizeof(int));
    );

    OMPI_ERRHANDLER_RETURN(rc, comm, rc, FUNC_NAME);
}



In this function, the following line is the one we want to follow.

rc = MCA_PML_CALL(iprobe(source, tag, comm, flag, status));


The rest is mostly just parameter validation, I guess.


MCA_PML_CALL is probably a macro. Let's search it.


In ompi/mca/pml/pml.h, MCA_PML_CALL is defined. Here is the code (some lines omitted):

ompi/mca/pml/pml.h

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC.  All rights
 *                         reserved. 
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

#if MCA_pml_DIRECT_CALL

#include MCA_pml_DIRECT_CALL_HEADER

#define MCA_PML_CALL_EXPANDER(a, b) MCA_PML_CALL_STAMP(a,b)
#define MCA_PML_CALL(a) MCA_PML_CALL_EXPANDER(MCA_pml_DIRECT_CALL_COMPONENT, a)

#else
#define MCA_PML_CALL(a) mca_pml.pml_ ## a
#endif


With MCA_PML_CALL, iprobe becomes mca_pml.pml_iprobe and some additional code is added depending on the value of MCA_pml_DIRECT_CALL.




In ompi/mca/pml/pml.h, mca_pml_base_module_t is defined.

ompi/mca/pml/pml.h

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC.  All rights
 *                         reserved. 
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

struct mca_pml_base_module_1_0_0_t {

    /* downcalls from MCA to PML */
    mca_pml_base_module_add_procs_fn_t    pml_add_procs;
    mca_pml_base_module_del_procs_fn_t    pml_del_procs;
    mca_pml_base_module_enable_fn_t       pml_enable;
    mca_pml_base_module_progress_fn_t     pml_progress;

    /* downcalls from MPI to PML */
    mca_pml_base_module_add_comm_fn_t     pml_add_comm;
    mca_pml_base_module_del_comm_fn_t     pml_del_comm;
    mca_pml_base_module_irecv_init_fn_t   pml_irecv_init;
    mca_pml_base_module_irecv_fn_t        pml_irecv;
    mca_pml_base_module_recv_fn_t         pml_recv;
    mca_pml_base_module_isend_init_fn_t   pml_isend_init;
    mca_pml_base_module_isend_fn_t        pml_isend;
    mca_pml_base_module_send_fn_t         pml_send;
    mca_pml_base_module_iprobe_fn_t       pml_iprobe;
    mca_pml_base_module_probe_fn_t        pml_probe;
    mca_pml_base_module_start_fn_t        pml_start;

    /* diagnostics */
    mca_pml_base_module_dump_fn_t         pml_dump;

    /* FT Event */
    mca_pml_base_module_ft_event_fn_t     pml_ft_event;

    /* maximum constant sizes */
    int                                   pml_max_contextid;
    int                                   pml_max_tag;
};
typedef struct mca_pml_base_module_1_0_0_t mca_pml_base_module_1_0_0_t;
typedef mca_pml_base_module_1_0_0_t mca_pml_base_module_t;




In mca_pml_base_module_t, there is the attribute pml_iprobe. This type of this attribute is mca_pml_base_module_iprobe_fn_t.


This type is defined also in ompi/mca/pml/pml.h and is basically a function pointer (right?).

ompi/mca/pml/pml.h

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC.  All rights
 *                         reserved. 
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

/**
 * Probe to poll for pending recv.
 *
 * @param src (IN)        Source rank w/in communicator.
 * @param tag (IN)        User defined tag.
 * @param comm (IN)       Communicator.
 * @param matched (OUT)   Flag indicating if matching recv exists.
 * @param status (OUT)    Completion statuses.
 * @return                OMPI_SUCCESS or failure status.
 *
 */
typedef int (*mca_pml_base_module_iprobe_fn_t)(
    int src,
    int tag,
    struct ompi_communicator_t* comm,
    int *matched,
    ompi_status_public_t *status
);




I believe that this scheme allows to have several implementations of pml_iprobe.



Let us look for these now.


In ompi/mca/pml/crcpw/pml_crcpw_module.c, there is this code (some lines omitted):

ompi/mca/pml/crcpw/pml_crcpw_module.c

/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2006 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

int mca_pml_crcpw_iprobe(int dst, int tag, struct ompi_communicator_t* comm, int *matched, ompi_status_public_t* status )
{
    int ret;
    ompi_crcp_base_pml_state_t * pml_state;

    PML_CRCP_STATE_ALLOC(pml_state, ret);

    pml_state->wrapped_pml_component = &(mca_pml_crcpw_module.wrapped_pml_component);
    pml_state->wrapped_pml_module    = &(mca_pml_crcpw_module.wrapped_pml_module);

    pml_state->state = OMPI_CRCP_PML_PRE;
    pml_state = ompi_crcp.pml_iprobe(dst, tag, comm, matched, status, pml_state);
    if( OMPI_SUCCESS != pml_state->error_code) {
        ret =  pml_state->error_code;
        PML_CRCP_STATE_RETURN(pml_state);
        return ret;
    }

    if( OMPI_CRCP_PML_DONE == pml_state->state) {
        goto CLEANUP;
    }

    if( OMPI_CRCP_PML_SKIP != pml_state->state) {
        if( OMPI_SUCCESS != (ret = mca_pml_crcpw_module.wrapped_pml_module.pml_iprobe(dst, tag, comm, matched, status) ) ) {
            PML_CRCP_STATE_RETURN(pml_state);
            return ret;
        }
    }

    pml_state->state = OMPI_CRCP_PML_POST;
    pml_state = ompi_crcp.pml_iprobe(dst, tag, comm, matched, status, pml_state);
    if( OMPI_SUCCESS != pml_state->error_code) {
        ret =  pml_state->error_code;
        PML_CRCP_STATE_RETURN(pml_state);
        return ret;
    }

 CLEANUP:
    PML_CRCP_STATE_RETURN(pml_state);

    return OMPI_SUCCESS;
}



This code however seems to rely heavily on ompi_crcp.pml_iprobe.



In ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c (some lines omitted):

ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c

/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */
/**************** Probe *****************/
/* JJH - Code reuse: Combine iprobe and probe logic */
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_iprobe(
                                  int dst, int tag,
                                  struct ompi_communicator_t* comm,
                                  int *matched,
                                  ompi_status_public_t* status,
                                  ompi_crcp_base_pml_state_t* pml_state )
{
    ompi_crcp_bkmrk_pml_drain_message_ref_t   *drain_msg_ref = NULL;
    ompi_crcp_bkmrk_pml_message_content_ref_t *content_ref = NULL;
    int exit_status = OMPI_SUCCESS;
    int ret;

    OPAL_OUTPUT_VERBOSE((30, mca_crcp_bkmrk_component.super.output_handle,
                        "crcp:bkmrk: pml_iprobe(%d, %d)", dst, tag));

    /*
     * Before PML Call
     * - Determine if this can be satisfied from the drained list
     * - Otherwise let the PML handle it
     */
    if( OMPI_CRCP_PML_PRE == pml_state->state) {
        /*
         * Check to see if this message is in the drained message list
         */
        if( OMPI_SUCCESS != (ret = drain_message_find_any(PROBE_ANY_COUNT, tag, dst,
                                                          comm, PROBE_ANY_SIZE,
                                                          &drain_msg_ref,
                                                          &content_ref,
                                                          NULL) ) ) {
            ERROR_SHOULD_NEVER_HAPPEN("crcp:bkmrk: pml_iprobe(): Failed trying to find a drained message.");
            exit_status = ret;
            goto DONE;
        }

        /*
         * If the message is a drained message
         *  - Copy of the status structure to pass back to the user
         *  - Mark the 'matched' flag as true
         */
        if( NULL != drain_msg_ref ) {
            OPAL_OUTPUT_VERBOSE((12, mca_crcp_bkmrk_component.super.output_handle,
                                 "crcp:bkmrk: pml_iprobe(): Matched a drained message..."));

            /* Copy the status information */
            if( MPI_STATUS_IGNORE != status ) {
                memcpy(status, &content_ref->status, sizeof(ompi_status_public_t));
            }

            /* Mark as complete */
            *matched = 1;

            /* This will identify to the wrapper that this message is complete */
            pml_state->state = OMPI_CRCP_PML_DONE;
            pml_state->error_code = OMPI_SUCCESS;
            return pml_state;
        }
        /*
         * Otherwise the message is not drained (common case), so let the PML deal with it
         */
        else {
            /* Mark as not complete */
            *matched = 0;
        }
    }

 DONE:
    pml_state->error_code = exit_status;
    return pml_state;
}




How I understand it, ompi_crcp_bkmrk_pml_iprobe first check if any message in the drained list satisfies what the calling code wants with
a call to drain_message_find_any.

If this is not the case, the code "let the PML deal with it", which probably means that that this case would be dealt with somewhere in mca_pml_crcpw_iprobe although I am not sure.


drain_message_find_any is defined in ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c:

ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c

/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

static int drain_message_find_any(size_t count, int tag, int peer,
                                  struct ompi_communicator_t* comm, size_t ddt_size,
                                  ompi_crcp_bkmrk_pml_drain_message_ref_t ** found_msg_ref,
                                  ompi_crcp_bkmrk_pml_message_content_ref_t ** content_ref,
                                  ompi_crcp_bkmrk_pml_peer_ref_t **peer_ref)
{
    ompi_crcp_bkmrk_pml_peer_ref_t *cur_peer_ref = NULL;
    opal_list_item_t* item = NULL;

    *found_msg_ref = NULL;

    for(item  = opal_list_get_first(&ompi_crcp_bkmrk_pml_peer_refs);
        item != opal_list_get_end(&ompi_crcp_bkmrk_pml_peer_refs);
        item  = opal_list_get_next(item) ) {
        cur_peer_ref = (ompi_crcp_bkmrk_pml_peer_ref_t*)item;

        /*
         * If we ware not MPI_ANY_SOURCE, then extract the process name from the
         * communicator, and search only the peer that matches.
         */
        if( MPI_ANY_SOURCE != peer && peer >= 0) {
            /* Check to see if peer could possibly be in this communicator */
            if( comm->c_local_group->grp_proc_count <= peer ) {
                continue;
            }

            if( OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
                                                            &(cur_peer_ref->proc_name),
                                                            &(comm->c_local_group->grp_proc_pointers[peer]->proc_name)) ) {
                continue;
            }
        }

        drain_message_find(&(cur_peer_ref->drained_list),
                           count, tag, peer,
                           comm->c_contextid, ddt_size,
                           found_msg_ref,
                           content_ref);
        if( NULL != *found_msg_ref) {
            if( NULL != peer_ref ) {
                *peer_ref = cur_peer_ref;
            }
            return OMPI_SUCCESS;
        }
    }

    return OMPI_SUCCESS;
}



If peer is MPI_ANY_SOURCE, then drain_message_find_any will fetch a message from any source.
Otherwise, it will look only for messages from the desired peer.


ompi_crcp_bkmrk_pml_peer_refs is populated in ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c (some lines omitted):

ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c

/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

/**************** Processes *****************/
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_add_procs(
                                   struct ompi_proc_t **procs,
                                   size_t nprocs,
                                   ompi_crcp_base_pml_state_t* pml_state )
{
    int ret;
    ompi_crcp_bkmrk_pml_peer_ref_t *new_peer_ref;
    size_t i;

    if( OMPI_CRCP_PML_PRE != pml_state->state ){
        goto DONE;
    }

    OPAL_OUTPUT_VERBOSE((30, mca_crcp_bkmrk_component.super.output_handle,
                        "crcp:bkmrk: pml_add_procs()"));

    /*
     * Save pointers to the wrapped PML
     */
    wrapped_pml_component = pml_state->wrapped_pml_component;
    wrapped_pml_module    = pml_state->wrapped_pml_module;

    /*
     * Create a peer_ref for each peer added
     */
    for( i = 0; i < nprocs; ++i) {
        HOKE_PEER_REF_ALLOC(new_peer_ref, ret);

        new_peer_ref->proc_name.jobid  = procs[i]->proc_name.jobid;
        new_peer_ref->proc_name.vpid   = procs[i]->proc_name.vpid;

        opal_list_append(&ompi_crcp_bkmrk_pml_peer_refs, &(new_peer_ref->super));
    }

 DONE:
    pml_state->error_code = OMPI_SUCCESS;
    return pml_state;
}


Presumably, nprocs is the number of MPI ranks in MPI_COMM_WORLD because ompi_crcp_bkmrk_pml_add_procs has
no argument struct ompi_communicator_t* comm.



Let's say that the peer is 255 and that the number of MPI ranks in MPI_COMM_WORLD is 256. In this case,
drain_message_find_any will need to loop over all the peers before calling drain_message_find for the peer 255 !

That is slow but it works, right ?

With MPI_ANY_SOURCE, I suspect message starvation can occur because there is no round-robin for the starting point of the loop -- the code always starts with the first peer.

Accordingly, if all peers send messages at the same rate, peers with lower ranks will see their messages received first.

This leads to starvation of peers with higher ranks.

Thursday, August 25, 2011

The VirtualProcessor technology in Ray



The VirtualProcessor I developed enables any MPI rank to have thousands of workers working
on different tasks. In reality, only one worker can work at any given moment, but
the VirtualProcessor schedules fairly the workers on the only instruction pipeline
available so that is 1d not a problem at all.


The VirtualProcessor is a technology that make thousands of worker compute tasks
in parallel on a single MPI rank. Obviously, only one such worker is active at
any point, but they all get to work.

The idea is that when a worker pushes a message on the VirtualCommunicator, it has
to wait for a reply. And this reply may arrive later. The idea of the
VirtualProcessor is to easily submit communication-intensive tasks.

Basically, the VirtualCommunicator groups smaller messages into larger messages
to send fewer messages on the physical network.

But to achieve that, an easy way of generating a lot of small messages is
needed. This is the use of the VirtualProcessor.


== Implementation ==

Author: Sébastien Boisvert
License: GPL
code/scheduling/VirtualProcessor.h
code/scheduling/VirtualProcessor.cpp

http://github.com/sebhtml/ray

== Workers ==

In Ray, each MPI rank actually has thousands of workers within it. All these
workers are scheduled, one after the other, inside the same process. No thread
are utilised as these would trash the processor cache.

At any point in time, each of these worker is in one of these states:

- active;
- sleeping;
- completed.



active

    The worker is not waiting for a message reply.

sleeping

    The worker is waiting for a message reply.

completed

    The worker has completed its task.

Thursday, July 21, 2011

Understand the main loop of message-passing-interface software

In video games, the main loop usually looks like this:


1
2
3
4
5
while(running){
    processInput();
    updateGameState();
    drawScreen();
}


(from higherorderfun.com)

Message-passing-interface (MPI) software can be designed in a similar fashion.

Each message-passing-interface rank of an MPI software has its own message inbox and its own message outbox.

Like for emails, received messages go in the inbox and sent messages go in the outbox.

The main loop of an MPI rank usually looks like this:

1
2
3
4
5
6
while(running){
    receiveMessages();
    processMessages();
    processData();
    sendMessages();
}

This general architecture is utilised in the Ray de novo genome assembler.

Read the code: github.com/sebhtml/ray

Tuesday, July 19, 2011

I don't understand why Ray was not included in the paper Bioinformatics 27, 2031–2037

In the paper

Lin, Y. et al. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27, 2031–2037 (2011).


Ray is not mentioned.

Why is that so ? 
We have been working on Ray for quite a while with an early prototype of the assembly engine called OpenAssembler (started in 2009-01-21).

We published Ray in 2010 in Journal of Computational Biology.


We have also presented Ray at a few places.
 


Cool facts about Ray:

- We are participating to Assemblathon 2.
- We assembled a genome on 512 compute cores in 18 hours, there were > 3 000 000 000 Illumina TruSeq paired reads (that is a lot of reads !)
- Ray is an high-performance peer-to-peer assembler
- Can assemble mixtures of technologies
- Is open source, licensed with the GNU GPL
- Ray is free.
- Works on POSIX systems (Linux, Mac, and others) and on Microsoft Windows
- Compiles cleanly with gcc and with Microsoft Visual Studio

- Ray utilises the message-passing interface

- Works well with Open-MPI or MPICH2 -- the 2 main open source implementations of the MPI standard.

- Ray does very few assembly errors.

- Ray is a single executable called Ray
- Implemented in C++
- Ray is object-oriented
- Ray is modular
- Ray is scalable

- Ray is easy to use.

- All the code is on github

I think the paper should have compared Ray with the other assemblers...



                                                     Sébastien