2012-05-17

Passing messages in the cloud

In this tutorial, I will show how to run Ray (v2.0.0-rc7) on 4 Amazon EC2 small instances (m1.small).

Tests showed that micro instances are not suitable for high-performance computing. The virtual core of a micro instance sometimes runs at 5%, and most of the time it does not run at all. That is why it is free!

For a list of instance types, see this page.

Each small instance provides 1 virtual core with 1 EC2 Compute Unit (ECU) and 1.7 GB of memory. We will run Ray on 4 instances with message passing in the cloud.
A small instance costs 0.080 $ per hour (on-demand pricing). So running 4 of them for about one hour should cost me 0.32 $, and maybe some more for the Elastic Block Store, although I think that is free up to a point.
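
As a quick check of that arithmetic, here is a minimal sketch. It assumes each of the 4 instances is billed for one hour at the on-demand price quoted above; the function name `ec2_cost` is made up for this example.

```shell
# Hypothetical helper: cost = number of instances * price per hour * hours.
ec2_cost() {
    awk -v n="$1" -v price="$2" -v hours="$3" \
        'BEGIN { printf "%.2f\n", n * price * hours }'
}

# 4 small instances at 0.080 $ per hour for 1 hour.
ec2_cost 4 0.080 1
```

Storage and data transfer are not included in this estimate.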

You can also pay upfront for a term of 1 year or 3 years to get lower pricing.


Tasks to do on the Amazon Web Services Console

First, I launched 4 m1.small instances in the same availability zone, running 64-bit Ubuntu 12.04. When creating your instances, be careful to select permissive firewall rules, because mpiexec will use port 22 (SSH) and various other TCP ports.

In this tutorial, I called the instances ray0, ray1, ray2 and ray3.



Name    Instance      Public DNS
ray0    i-25c6ad43    ec2-50-17-142-151.compute-1.amazonaws.com
ray1    i-23c6ad45    ec2-23-22-128-218.compute-1.amazonaws.com
ray2    i-21c6ad47    ec2-75-101-188-165.compute-1.amazonaws.com
ray3    i-2fc6ad49    ec2-184-72-135-177.compute-1.amazonaws.com


At this point, you should also have a key file for connecting to these instances. I called mine BlackMesa.pem.

The job will be launched from ray0. What is needed to do distributed computing
is a network file system and reachable TCP/IP addresses.

Here, I used SSHFS to deploy the network file system over the TCP/IP
interconnect readily available on Amazon EC2. The backend for the network
file system is simply the Elastic Block Store (EBS), so the method used here
can be called SSHFS/EBS.

Other ways of setting up a network file system include Lustre, NFS, and
Amazon S3 with SSHFS.


Stuff to do only on ray0


First, connect to ray0.

    ssh -i BlackMesa.pem ubuntu@ec2-50-17-142-151.compute-1.amazonaws.com



Here we create the source for our network file system.

    mkdir NetworkFileSystem.0
    cd NetworkFileSystem.0

We also create a machine file so that MPI knows which instances
will host the MPI ranks.

    vim Machines.txt

    cat Machines.txt
ec2-50-17-142-151.compute-1.amazonaws.com
ec2-23-22-128-218.compute-1.amazonaws.com
ec2-75-101-188-165.compute-1.amazonaws.com
ec2-184-72-135-177.compute-1.amazonaws.com
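
If you prefer not to open an editor, the same file can be written with a here-document. This is only a sketch using the four public DNS names from my instance table; your host names will differ.

```shell
# Write the machine file non-interactively: one host per line, ray0 first.
cat > Machines.txt << 'EOF'
ec2-50-17-142-151.compute-1.amazonaws.com
ec2-23-22-128-218.compute-1.amazonaws.com
ec2-75-101-188-165.compute-1.amazonaws.com
ec2-184-72-135-177.compute-1.amazonaws.com
EOF

# Count the non-empty lines; there should be one per instance.
grep -c . Machines.txt
```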



Next, we need to add the hosts to the list of known hosts to avoid
(yes/no) questions later.


    for i in $(cat Machines.txt)
    do
        ssh -i ~/BlackMesa.pem $i date
    done


Then we transfer sequence data.

    scp -r joe@colosse.clumeq.ca:Amazon/Sample .



Now we are done for this part.

    exit



Tasks to do on ray0, ray1, ray2, ray3


(You need to repeat this section for each instance, that is 4 times.)



Install the software with apt-get.

    sudo apt-get install -y openmpi1.5-dev openmpi1.5-bin pentium-builder g++ sshfs


Get the private key. This is needed to mount the file system.

    scp joe@colosse.clumeq.ca:Amazon/BlackMesa.pem .

Then we mount the network file system.

    mkdir NetworkFileSystem

    sshfs -o IdentityFile=/home/ubuntu/BlackMesa.pem \
    ubuntu@ec2-50-17-142-151.compute-1.amazonaws.com:NetworkFileSystem.0 NetworkFileSystem

    exit


Stuff to do on ray0



First, we copy the private key to a place where Open-MPI/ssh will find it.

    cp ~/BlackMesa.pem ~/.ssh/id_rsa


Now we will install Ray in the network file system.

    cd NetworkFileSystem
    wget http://downloads.sourceforge.net/project/denovoassembler/Ray-2.0.0-rc7/Ray-v2.0.0-rc7.tar.bz2
    tar xjf Ray-v2.0.0-rc7.tar.bz2
    cd Ray-v2.0.0-rc7
    make PREFIX=/home/ubuntu/NetworkFileSystem/RayCloudApplication
    make install
    cd ..


Now, we will test the network with Ray.

    mpiexec -n 4 -machinefile Machines.txt \
    ./RayCloudApplication/Ray \
    -test-network-only \
    -o CloudNetworkTest | tee CloudNetworkTest.log

    cat CloudNetworkTest/NetworkTest.txt


ubuntu@ip-10-38-49-187:~/NetworkFileSystem$ cat CloudNetworkTest/NetworkTest.txt
# average latency in microseconds (10^-6 seconds) when requesting a reply for
# a message of 4000 bytes
# MessagePassingInterfaceRank    Name    ModeLatencyInMicroseconds
# NumberOfTestMessages
# AverageForAllRanks: 16.75
# StandardDeviation: 1.22474
0    ip-10-38-49-187    18    4000
1    ip-10-244-135-103    14    4000
2    ip-10-32-219-241    17    4000
3    ip-10-38-51-160    18    4000
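
The AverageForAllRanks value can be rechecked from the four data rows. Below is a minimal sketch with awk; the function name `average_latency` is made up for this example, and it assumes the latency is always the third column and that comment lines start with `#`.

```shell
# Average the third column (mode latency in microseconds) over all ranks,
# skipping the comment lines that start with '#'.
average_latency() {
    awk '!/^#/ && NF >= 4 { sum += $3; n++ }
         END { if (n > 0) printf "%.2f\n", sum / n }' "$@"
}

# The four data rows from my run; with no file argument awk reads stdin.
average_latency << 'EOF'
0    ip-10-38-49-187    18    4000
1    ip-10-244-135-103    14    4000
2    ip-10-32-219-241    17    4000
3    ip-10-38-51-160    18    4000
EOF
```

On a real run, the file can be passed directly: `average_latency CloudNetworkTest/NetworkTest.txt`.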


This latency is very low, perhaps because my 4 instances are on the same
physical computer or on the same rack. I don't know.

Finally, we are ready to assemble our data in the cloud.

    mpiexec -n 4 -machinefile Machines.txt \
    ./RayCloudApplication/Ray \
    -k 31 -p Sample/_1.fasta Sample/_2.fasta \
    -o CloudAssembly | tee CloudAssembly.log


Let us check some files.



    cat CloudAssembly/ElapsedTime.txt
Elapsed time for each step, Fri May 18 03:08:58 2012

 Network testing: 29 seconds
 Counting sequences to assemble: 0 seconds
 Sequence loading: 0 seconds
 K-mer counting: 12 seconds
 Coverage distribution analysis: 1 seconds
 Graph construction: 24 seconds
 Null edge purging: 6 seconds
 Selection of optimal read markers: 14 seconds
 Detection of assembly seeds: 2 minutes, 21 seconds
 Estimation of outer distances for paired reads: 2 minutes, 25 seconds
 Bidirectional extension of seeds: 9 minutes, 45 seconds
 Merging of redundant paths: 14 minutes, 33 seconds
 Generation of contigs: 0 seconds
 Scaffolding of contigs: 5 minutes, 11 seconds
 Counting sequences to search: 0 seconds
 Graph coloring: 0 seconds
 Counting contig biological abundances: 6 seconds
 Counting sequence biological abundances: 0 seconds
 Loading taxons: 0 seconds
 Loading tree: 0 seconds
 Processing gene ontologies: 0 seconds
 Computing neighbourhoods: 0 seconds
 Total: 35 minutes, 47 seconds
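
To check that the steps add up to the Total line, the durations can be summed with awk. This is only a sketch: the function name `sum_steps` is made up, and it assumes every step line ends with "seconds" and uses the "N minutes, N seconds" wording shown above (an "hours" unit is handled too, as a guess for longer runs).

```shell
# Sum the per-step durations of an ElapsedTime.txt file, ignoring the
# header line and the "Total:" line.
sum_steps() {
    awk '/seconds *$/ && !/^ *Total:/ {
             sub(/^[^:]*: /, "")              # drop the step name
             for (i = 1; i < NF; i += 2) {    # fields are (value, unit) pairs
                 if ($(i + 1) ~ /^hour/)   total += $i * 3600
                 if ($(i + 1) ~ /^minute/) total += $i * 60
                 if ($(i + 1) ~ /^second/) total += $i
             }
         }
         END { printf "%d minutes, %d seconds\n", int(total / 60), total % 60 }' "$@"
}

# Small demonstration on two of the steps above.
sum_steps << 'EOF'
 K-mer counting: 12 seconds
 Detection of assembly seeds: 2 minutes, 21 seconds
EOF
```

Running `sum_steps CloudAssembly/ElapsedTime.txt` on the listing above should give 35 minutes, 47 seconds, the same as the Total line.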

    cat CloudAssembly/RayVersion.txt
Ray version: 2.0.0-rc7

    cat CloudAssembly/RayPlatform_Version.txt
RayPlatform 1.0.2

    cat CloudAssembly/OutputNumbers.txt
Contigs >= 100 nt
 Number: 34
 Total length: 69851
 Average: 2054
 N50: 3207
 Median: 1660
 Largest: 6591
Contigs >= 500 nt
 Number: 26
 Total length: 67913
 Average: 2612
 N50: 3335
 Median: 2332
 Largest: 6591
Scaffolds >= 100 nt
 Number: 31
 Total length: 70135
 Average: 2262
 N50: 4045
 Median: 1631
 Largest: 6591
Scaffolds >= 500 nt
 Number: 23
 Total length: 68197
 Average: 2965
 N50: 4808
 Median: 2410
 Largest: 6591

    cat CloudAssembly/LibraryStatistics.txt
NumberOfPairedLibraries: 1

LibraryNumber: 0
 InputFormat: TwoFiles,Paired
 DetectionType: Automatic
 File: Sample/_1.fasta
  NumberOfSequences: 10000
 File: Sample/_2.fasta
  NumberOfSequences: 10000
 Distribution: CloudAssembly/Library0.txt
 Peak 0
  AverageOuterDistance: 199
  StandardDeviation: 19

    cat CloudAssembly/CoverageDistributionAnalysis.txt
k-mer length:    31
Number of k-mers in the distributed de Bruijn graph: 149320
Lowest coverage observed:    2
MinimumCoverage:    3
PeakCoverage:    9
RepeatCoverage:    15
Number of k-mers with at least MinimumCoverage:    147164 k-mers
Percentage of vertices with coverage 2:    1.44388 %
DistributionFile: CloudAssembly/CoverageDistribution.txt
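
The percentage of vertices with coverage 2 appears to follow from the two k-mer counts: the lowest observed coverage is 2 and MinimumCoverage is 3, so the k-mers below the threshold are exactly those with coverage 2. Here is a small check under that interpretation (the function name `low_coverage_percent` is made up for this example):

```shell
# Hypothetical helper: percentage of k-mers below MinimumCoverage.
# Arguments: total number of k-mers, number with at least MinimumCoverage.
low_coverage_percent() {
    awk -v total="$1" -v kept="$2" \
        'BEGIN { printf "%.5f\n", 100 * (total - kept) / total }'
}

# Values taken from CoverageDistributionAnalysis.txt above.
low_coverage_percent 149320 147164
```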




    ls CloudAssembly/Plugins
plugin_Amos.txt
plugin_CoverageGatherer.txt
plugin_EdgePurger.txt
plugin_FusionData.txt
plugin_FusionTaskCreator.txt
plugin_GeneOntology.txt
plugin_GenomeNeighbourhood.txt
plugin_JoinerTaskCreator.txt
plugin_KmerAcademyBuilder.txt
plugin_Library.txt
plugin_MachineHelper.txt
plugin_MessageProcessor.txt
plugin_MessageRouter.txt
plugin_MessagesHandler.txt
plugin_NetworkTest.txt
plugin_Partitioner.txt
plugin_PhylogenyViewer.txt
plugin_Scaffolder.txt
plugin_Searcher.txt
plugin_SeedExtender.txt
plugin_SeedingData.txt
plugin_SequencesIndexer.txt
plugin_SequencesLoader.txt
plugin_SwitchMan.txt
plugin_VerticesExtractor.txt
plugin_VirtualCommunicator.txt
plugin_VirtualProcessor.txt

    ls CloudAssembly/Scheduling
0.MasterTicks.txt
0.SlaveTicks.txt
1.MasterTicks.txt
1.SlaveTicks.txt
2.MasterTicks.txt
2.SlaveTicks.txt
3.MasterTicks.txt
3.SlaveTicks.txt

    cat CloudAssembly/Scheduling/0.MasterTicks.txt



Index Master mode Start time in milliseconds End time in milliseconds Duration in milliseconds Number of ticks Average granularity in nanoseconds
0 RAY_MASTER_MODE_LOAD_CONFIG 0 1 1 1 1000000
1 RAY_MASTER_MODE_TEST_NETWORK 1 28384 28383 4978634 5700
2 RAY_MASTER_MODE_COUNT_FILE_ENTRIES 28384 28722 338 8197 41234
3 RAY_MASTER_MODE_LOAD_SEQUENCES 28722 28722 0 1 0
4 RAY_MASTER_MODE_DO_NOTHING 28722 29128 406 34884 11638
5 RAY_MASTER_MODE_TRIGGER_VERTICE_DISTRIBUTION 29128 29128 0 1 0
6 RAY_MASTER_MODE_DO_NOTHING 29128 41289 12161 1970193 6172
7 RAY_MASTER_MODE_PREPARE_DISTRIBUTIONS 41289 41289 0 1 0
8 RAY_MASTER_MODE_DO_NOTHING 41289 41295 6 94 63829
9 RAY_MASTER_MODE_PREPARE_DISTRIBUTIONS_WITH_ANSWERS 41295 41295 0 1 0
10 RAY_MASTER_MODE_DO_NOTHING 41295 41386 91 126 722222
11 RAY_MASTER_MODE_SEND_COVERAGE_VALUES 41386 41386 0 1 0
12 RAY_MASTER_MODE_DO_NOTHING 41386 41431 45 14771 3046
13 RAY_MASTER_MODE_TRIGGER_GRAPH_BUILDING 41431 41431 0 1 0
14 RAY_MASTER_MODE_DO_NOTHING 41431 65663 24232 3830837 6325
15 RAY_MASTER_MODE_PURGE_NULL_EDGES 65663 65688 25 1 25000000
16 RAY_MASTER_MODE_DO_NOTHING 65688 71772 6084 874612 6956
17 RAY_MASTER_MODE_WRITE_KMERS 71772 72086 314 35569 8827
18 RAY_MASTER_MODE_TRIGGER_INDEXING 72086 72086 0 1 0
19 RAY_MASTER_MODE_DO_NOTHING 72086 85963 13877 2114629 6562
20 RAY_MASTER_MODE_PREPARE_SEEDING 85963 85968 5 1 5000000
21 RAY_MASTER_MODE_TRIGGER_SEEDING 85968 85968 0 1 0
22 RAY_MASTER_MODE_DO_NOTHING 85968 226937 140969 24690296 5709
23 RAY_MASTER_MODE_TRIGGER_DETECTION 226937 226937 0 1 0
24 RAY_MASTER_MODE_DO_NOTHING 226937 371422 144485 25985741 5560
25 RAY_MASTER_MODE_ASK_DISTANCES 371422 371422 0 1 0
26 RAY_MASTER_MODE_DO_NOTHING 371422 371450 28 195 143589
27 RAY_MASTER_MODE_START_UPDATING_DISTANCES 371450 371450 0 1 0
28 RAY_MASTER_MODE_UPDATE_DISTANCES 371450 371531 81 5711 14183
29 RAY_MASTER_MODE_TRIGGER_EXTENSIONS 371531 371531 0 1 0
30 RAY_MASTER_MODE_DO_NOTHING 371531 956517 584986 101238269 5778
31 RAY_MASTER_MODE_TRIGGER_FUSIONS 956517 956517 0 1 0
32 RAY_MASTER_MODE_TRIGGER_FIRST_FUSIONS 956517 956517 0 1 0
33 RAY_MASTER_MODE_START_FUSION_CYCLE 956517 1830216 873699 151986087 5748
34 RAY_MASTER_MODE_ASK_EXTENSIONS 1830216 1830337 121 14395 8405
35 RAY_MASTER_MODE_SCAFFOLDER 1830337 1830337 0 1 0
36 RAY_MASTER_MODE_DO_NOTHING 1830337 2139880 309543 54403163 5689
37 RAY_MASTER_MODE_WRITE_SCAFFOLDS 2139880 2141209 1329 222425 5975
38 RAY_MASTER_MODE_COUNT_SEARCH_ELEMENTS 2141209 2141231 22 5269 4175
39 RAY_MASTER_MODE_ADD_COLORS 2141231 2141317 86 19 4526315
40 RAY_MASTER_MODE_CONTIG_BIOLOGICAL_ABUNDANCES 2141317 2146382 5065 862888 5869
41 RAY_MASTER_MODE_SEQUENCE_BIOLOGICAL_ABUNDANCES 2146382 2146510 128 14545 8800
42 RAY_MASTER_MODE_SEARCHER_CLOSE 2146510 2146512 2 7 285714
43 RAY_MASTER_MODE_PHYLOGENY_MAIN 2146512 2146724 212 14596 14524
44 RAY_MASTER_MODE_ONTOLOGY_MAIN 2146724 2146872 148 17121 8644
45 RAY_MASTER_MODE_NEIGHBOURHOOD 2146872 2146962 90 7591 11856
46 RAY_MASTER_MODE_KILL_RANKS 2146962 2146964 2 1 2000000
47 RAY_MASTER_MODE_KILL_ALL_MPI_RANKS 2146964 2147078 114 16207 7033
48 RAY_MASTER_MODE_DO_NOTHING 2147078 2147316 238 3 79333333
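
The last column looks like the duration divided by the number of ticks, converted to nanoseconds and truncated to an integer. I did not check this against the Ray source code; it is inferred from the rows above. A minimal sketch with a made-up function name `granularity_ns`:

```shell
# Hypothetical helper: average granularity in nanoseconds.
# Arguments: duration in milliseconds, number of ticks.
granularity_ns() {
    awk -v ms="$1" -v ticks="$2" \
        'BEGIN { if (ticks > 0) printf "%d\n", int(ms * 1000000 / ticks) }'
}

# Row 1 above (RAY_MASTER_MODE_TEST_NETWORK): 28383 ms, 4978634 ticks.
granularity_ns 28383 4978634
```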

    cat CloudAssembly/Scheduling/2.MasterTicks.txt



Index Master mode Start time in milliseconds End time in milliseconds Duration in milliseconds Number of ticks Average granularity in nanoseconds
0 RAY_MASTER_MODE_DO_NOTHING 0 2147297 2147297 378412973 5674




    cat CloudAssembly/Scheduling/3.SlaveTicks.txt


Index Slave mode Start time in milliseconds End time in milliseconds Duration in milliseconds Number of ticks Average granularity in nanoseconds Received messages Average number of received messages per second Sent messages Average number of sent messages per second
0 RAY_SLAVE_MODE_DO_NOTHING 0 3 3 888 3378 2 666 0 0
1 RAY_SLAVE_MODE_TEST_NETWORK 3 28374 28371 5036353 5633 8049 283 8049 283
2 RAY_SLAVE_MODE_DO_NOTHING 28374 28385 11 3995 2753 1 90 0 0
3 RAY_SLAVE_MODE_COUNT_FILE_ENTRIES 28385 28570 185 22828 8104 1 5 2 10
4 RAY_SLAVE_MODE_DO_NOTHING 28570 28918 348 34728 10020 1 2 0 0
5 RAY_SLAVE_MODE_LOAD_SEQUENCES 28918 28918 0 1 0 0 0 1 1
6 RAY_SLAVE_MODE_DO_NOTHING 28918 29276 358 36783 9732 1 2 0 0
7 RAY_SLAVE_MODE_BUILD_KMER_ACADEMY 29276 41313 12037 1990908 6045 810 67 810 67
8 RAY_SLAVE_MODE_SEND_DISTRIBUTION 41313 41314 1 222 4504 1 1000 2 2000
9 RAY_SLAVE_MODE_DO_NOTHING 41314 41432 118 20474 5763 2 16 1 8
10 RAY_SLAVE_MODE_EXTRACT_VERTICES 41432 65688 24256 3808992 6368 10618 437 10618 437
11 RAY_SLAVE_MODE_PURGE_NULL_EDGES 65688 72016 6328 915472 6912 326 51 326 51
12 RAY_SLAVE_MODE_WRITE_KMERS 72016 72016 0 1 0 0 0 1 1
13 RAY_SLAVE_MODE_DO_NOTHING 72016 72113 97 13227 7333 1 10 0 0
14 RAY_SLAVE_MODE_INDEX_SEQUENCES 72113 85963 13850 2128138 6508 1157 83 1158 83
15 RAY_SLAVE_MODE_DO_NOTHING 85963 85969 6 2362 2540 1 166 0 0
16 RAY_SLAVE_MODE_START_SEEDING 85969 226743 140774 24660206 5708 44306 314 44307 314
17 RAY_SLAVE_MODE_DO_NOTHING 226743 226831 88 10290 8551 1 11 0 0
18 RAY_SLAVE_MODE_SEND_SEED_LENGTHS 226831 226831 0 2 0 0 0 2 2
19 RAY_SLAVE_MODE_DO_NOTHING 226831 226951 120 22282 5385 3 25 1 8
20 RAY_SLAVE_MODE_AUTOMATIC_DISTANCE_DETECTION 226951 371423 144472 25839442 5591 42019 290 42020 290
21 RAY_SLAVE_MODE_DO_NOTHING 371423 371423 0 342 0 1 1 0 0
22 RAY_SLAVE_MODE_SEND_LIBRARY_DISTANCES 371423 371425 2 349 5730 1 500 2 1000
23 RAY_SLAVE_MODE_DO_NOTHING 371425 371533 108 17483 6177 2 18 1 9
24 RAY_SLAVE_MODE_EXTENSION 371533 875049 503516 86024411 5853 153627 305 153628 305
25 RAY_SLAVE_MODE_DO_NOTHING 875049 956709 81660 14967577 5455 5036 61 5035 61
26 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 956709 959308 2599 480956 5403 241 92 242 93
27 RAY_SLAVE_MODE_DO_NOTHING 959308 959309 1 360 2777 1 1000 0 0
28 RAY_SLAVE_MODE_FUSION 959309 1113988 154679 27155156 5696 45091 291 45092 291
29 RAY_SLAVE_MODE_DO_NOTHING 1113988 1114022 34 1660 20481 2 58 1 29
30 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 1114022 1114632 610 92412 6600 233 381 234 383
31 RAY_SLAVE_MODE_DO_NOTHING 1114632 1115292 660 131616 5014 5 7 4 6
32 RAY_SLAVE_MODE_FINISH_FUSIONS 1115292 1244313 129021 22651248 5695 45309 351 45310 351
33 RAY_SLAVE_MODE_DO_NOTHING 1244313 1244408 95 2008 47310 2 21 1 10
34 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 1244408 1245115 707 117336 6025 239 338 240 339
35 RAY_SLAVE_MODE_DO_NOTHING 1245115 1245249 134 27084 4947 6 44 5 37
36 RAY_SLAVE_MODE_FUSION 1245249 1396124 150875 26703744 5649 47633 315 47634 315
37 RAY_SLAVE_MODE_DO_NOTHING 1396124 1425655 29531 5412364 5456 2792 94 2791 94
38 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 1425655 1427976 2321 424400 5468 225 96 226 97
39 RAY_SLAVE_MODE_DO_NOTHING 1427976 1428101 125 23388 5344 8 64 7 56
40 RAY_SLAVE_MODE_FINISH_FUSIONS 1428101 1572525 144424 25336440 5700 44544 308 44545 308
41 RAY_SLAVE_MODE_DO_NOTHING 1572525 1572717 192 15652 12266 2 10 1 5
42 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 1572717 1577802 5085 935540 5435 234 46 235 46
43 RAY_SLAVE_MODE_DO_NOTHING 1577802 1577878 76 5748 13221 1 13 0 0
44 RAY_SLAVE_MODE_FUSION 1577878 1711356 133478 23081916 5782 44544 333 44545 333
45 RAY_SLAVE_MODE_DO_NOTHING 1711356 1711449 93 1460 63698 2 21 1 10
46 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 1711449 1712791 1342 241900 5547 231 172 232 172
47 RAY_SLAVE_MODE_DO_NOTHING 1712791 1712799 8 2736 2923 4 500 3 375
48 RAY_SLAVE_MODE_FINISH_FUSIONS 1712799 1828907 116108 20311968 5716 44544 383 44545 383
49 RAY_SLAVE_MODE_DO_NOTHING 1828907 1829003 96 2364 40609 2 20 1 10
50 RAY_SLAVE_MODE_DISTRIBUTE_FUSIONS 1829003 1830182 1179 201264 5857 226 191 227 192
51 RAY_SLAVE_MODE_DO_NOTHING 1830182 1830332 150 27824 5391 9 60 8 53
52 RAY_SLAVE_MODE_SEND_EXTENSION_DATA 1830332 1830332 0 1 0 0 0 1 1
53 RAY_SLAVE_MODE_DO_NOTHING 1830332 1830338 6 2039 2942 1 166 0 0
54 RAY_SLAVE_MODE_SCAFFOLDER 1830338 2084513 254175 44050323 5770 116776 459 116777 459
55 RAY_SLAVE_MODE_DO_NOTHING 2084513 2141225 56712 10379537 5463 4482 79 4481 79
56 RAY_SLAVE_MODE_COUNT_SEARCH_ELEMENTS 2141225 2141226 1 397 2518 2 2000 3 3000
57 RAY_SLAVE_MODE_DO_NOTHING 2141226 2141232 6 2203 2723 1 166 0 0
58 RAY_SLAVE_MODE_ADD_COLORS 2141232 2141252 20 1781 11229 1 50 3 150
59 RAY_SLAVE_MODE_DO_NOTHING 2141252 2141347 95 11791 8056 1 10 0 0
60 RAY_SLAVE_MODE_CONTIG_BIOLOGICAL_ABUNDANCES 2141347 2145419 4072 712838 5712 341 83 342 83
61 RAY_SLAVE_MODE_DO_NOTHING 2145419 2146393 974 180490 5396 9 9 7 7
62 RAY_SLAVE_MODE_SEQUENCE_BIOLOGICAL_ABUNDANCES 2146393 2146505 112 18809 5954 2 17 3 26
63 RAY_SLAVE_MODE_DO_NOTHING 2146505 2146511 6 2187 2743 1 166 0 0
64 RAY_SLAVE_MODE_SEARCHER_CLOSE 2146511 2146511 0 1 0 0 0 1 1
65 RAY_SLAVE_MODE_DO_NOTHING 2146511 2146528 17 571 29772 1 58 0 0
66 RAY_SLAVE_MODE_PHYLOGENY_MAIN 2146528 2146640 112 13757 8141 1 8 2 17
67 RAY_SLAVE_MODE_DO_NOTHING 2146640 2146745 105 16783 6256 1 9 0 0
68 RAY_SLAVE_MODE_ONTOLOGY_MAIN 2146745 2146865 120 11310 10610 1 8 3 25
69 RAY_SLAVE_MODE_DO_NOTHING 2146865 2146874 9 2982 3018 1 111 0 0
70 RAY_SLAVE_MODE_NEIGHBOURHOOD 2146874 2146874 0 1 0 0 0 1 1
71 RAY_SLAVE_MODE_DO_NOTHING 2146874 2146993 119 20607 5774 2 16 1 8
72 RAY_SLAVE_MODE_DIE 2146993 2147349 356 1 356000000 0 0 1 2
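
The message rates in the slave table seem to be the message count divided by the duration in seconds, truncated to an integer. This is again inferred from the rows above, not taken from the Ray source. The zero-duration case in particular is a guess based on rows such as 5 and 18, where the rate equals the count. The function name `messages_per_second` is made up.

```shell
# Hypothetical helper: messages per second for one row of a SlaveTicks file.
# Arguments: message count, duration in milliseconds.
# Assumption: when the duration is 0 ms, the reported rate equals the count.
messages_per_second() {
    awk -v count="$1" -v ms="$2" \
        'BEGIN {
             if (ms > 0) printf "%d\n", int(count * 1000 / ms)
             else        printf "%d\n", count
         }'
}

# Row 1 above (RAY_SLAVE_MODE_TEST_NETWORK): 8049 messages in 28371 ms.
messages_per_second 8049 28371
```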

