Function pointers vs. virtual methods in HPC

The 'Adapter' design pattern is useful for adding a component to an existing architecture without modifying that architecture. The Ray assembler is built on top of RayPlatform, a distributed compute engine. Ray is a collection of plugins implementing genomics algorithms, and each of these plugins uses adapters to bind its behavior onto the RayPlatform core.

The initial design of adapters in RayPlatform was based on a common interface using pure virtual classes. Although this model is an elegant design, it also implies significant overhead because adapters in RayPlatform are called from the main loop, which must run as fast as possible. There are at least two problems with virtual methods: they add an extra indirection and they cannot be inlined. Furthermore, implementing these adapters as classes requires declaring each class in a .h file and implementing it in a .cpp file.
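As a rough illustration, the previous approach looked something like the sketch below. The names are hypothetical and do not match the actual RayPlatform headers; it only shows the shape of a virtual-method adapter.

 // Hypothetical sketch of the old, virtual-method-based adapter design.
 // Common interface, declared in a .h file.
 class SlaveModeHandler {
 public:
     virtual void call() = 0;               // resolved through the vtable, hard to inline
     virtual ~SlaveModeHandler() {}
 };

 class MyPlugin {
 public:
     void countEntries() { /* actual work of the slave mode */ }
 };

 // One adapter class per handle, declared in the .h and implemented in the .cpp.
 class Adapter_CountEntries : public SlaveModeHandler {
     MyPlugin* m_object;
 public:
     void setObject(MyPlugin* object) { m_object = object; }
     void call() { m_object->countEntries(); }   // forwards the call to the plugin
 };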

New adapter architecture

Recently, we have moved away from virtual methods in favor of function pointers as the means to create the adapters. From a developer's point of view, the process is simpler, as nothing is required inside the .h file of any plugin. Instead, only a few macro calls are needed inside the .cpp file of a plugin.

At the top of a plugin .cpp file, the macro call __CreatePlugin(NameOfPlugin) must appear. Then adapters are created with __CreateSlaveModeAdapter(NameOfPlugin,SlaveModeHandle), __CreateMasterModeAdapter(NameOfPlugin,MasterModeHandle), and __CreateMessageTagAdapter(NameOfPlugin,MessageTagHandle), respectively for slave modes, master modes and message tags.
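For a plugin named MyPlugin with one slave mode, one master mode, and one message tag (the handle names below are made up for illustration), the top of the .cpp file would contain something like:

 // Top of MyPlugin.cpp -- handle names are hypothetical examples.
 __CreatePlugin(MyPlugin);

 __CreateSlaveModeAdapter(MyPlugin, RAY_SLAVE_MODE_EXAMPLE);
 __CreateMasterModeAdapter(MyPlugin, RAY_MASTER_MODE_EXAMPLE);
 __CreateMessageTagAdapter(MyPlugin, RAY_MPI_TAG_EXAMPLE);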

Finally, in the registerPlugin() method of the plugin, behavior is attached with core->setSlaveModeObjectHandler(), core->setMasterModeObjectHandler(), and core->setMessageTagObjectHandler(), respectively for slave modes, master modes and message tags. Adapters are fetched with the macro call __GetAdapter(NameOfPlugin,Handle).

The last thing required is the macro call __BindPlugin(NameOfPlugin).
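Putting these pieces together, a registerPlugin() body could look roughly like the following. The parameter lists are simplified and the handle names are the same hypothetical ones as above; refer to an actual Ray plugin for the exact signatures.

 // Sketch only: simplified parameter lists, hypothetical handles.
 void MyPlugin::registerPlugin(ComputeCore* core) {
     // attach the slave mode behavior through its adapter
     core->setSlaveModeObjectHandler(RAY_SLAVE_MODE_EXAMPLE,
         __GetAdapter(MyPlugin, RAY_SLAVE_MODE_EXAMPLE));
     // attach the master mode behavior
     core->setMasterModeObjectHandler(RAY_MASTER_MODE_EXAMPLE,
         __GetAdapter(MyPlugin, RAY_MASTER_MODE_EXAMPLE));
     // attach the message tag behavior
     core->setMessageTagObjectHandler(RAY_MPI_TAG_EXAMPLE,
         __GetAdapter(MyPlugin, RAY_MPI_TAG_EXAMPLE));
 }

 // Once per plugin .cpp file, to bind everything to the core:
 __BindPlugin(MyPlugin);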

This architecture is simpler than the previous one, as it only requires half as many macro calls for the binding.

Setup

Benchmarks were done on 64 processor cores, with 8 cores per node. The processors were Intel(R) Xeon(R) X5560 CPUs @ 2.80GHz and the interconnect was a Mellanox Technologies MT26428 InfiniBand adapter (ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE, rev a0). The system was colosse.

The software was compiled with

 make ASSERT=n DEBUG=n HAVE_LIBZ=y HAVE_LIBBZ2=y EXTRA=-march=native

which gave the following options to the C++ compiler:

 -Wall -ansi -O3 -D MAXKMERLENGTH=32 -march=native -DHAVE_LIBZ -DHAVE_LIBBZ2 -D RAY_VERSION=\"2.1.0-devel\" -D RAYPLATFORM_VERSION=\"1.1.0-devel\" -I. -IRayPlatform

The dataset used was the sample SRS017209 from the NIH Human Microbiome Project. It is described as a human metagenome sample from G_DNA_Tongue dorsum of a male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)", and it contained 191 588 208 short sequence reads.

With virtual tables


End-to-end latency

# AverageForAllRanks: 66.6875
# StandardDeviation: 2.35186

Running time

 Network testing: 10 seconds
 Counting sequences to assemble: 8 minutes, 8 seconds
 Sequence loading: 1 hours, 6 minutes, 12 seconds
 K-mer counting: 1 hours, 16 minutes, 22 seconds
 Coverage distribution analysis: 2 seconds
 Graph construction: 1 hours, 54 minutes, 58 seconds
 Null edge purging: 37 minutes, 31 seconds
 Selection of optimal read markers: 1 hours, 22 minutes, 9 seconds
 Detection of assembly seeds: 44 minutes, 1 seconds
 Estimation of outer distances for paired reads: 6 minutes, 41 seconds
 Bidirectional extension of seeds: 7 hours, 3 minutes, 32 seconds
 Merging of redundant paths: 2 hours, 16 minutes, 59 seconds
 Generation of contigs: 14 seconds
 Scaffolding of contigs: 37 minutes, 29 seconds
 Counting sequences to search: 10 seconds
 Graph coloring: 42 minutes, 8 seconds
 Counting contig biological abundances: 2 minutes, 36 seconds
 Counting sequence biological abundances: 1 hours, 24 minutes, 42 seconds
 Loading taxons: 7 seconds
 Loading tree: 4 seconds
 Processing gene ontologies: 55 seconds
 Computing neighbourhoods: 0 seconds
 Total: 19 hours, 25 minutes, 12 seconds

Results

Contigs >= 100 nt
 Number: 531984
 Total length: 206921447
 Average: 388
 N50: 910
 Median: 155
 Largest: 226570
Contigs >= 500 nt
 Number: 60928
 Total length: 124017769
 Average: 2035
 N50: 3951
 Median: 897
 Largest: 226570
Scaffolds >= 100 nt
 Number: 528927
 Total length: 207378098
 Average: 392
 N50: 954
 Median: 154
 Largest: 295571
Scaffolds >= 500 nt
 Number: 58086
 Total length: 124575648
 Average: 2144
 N50: 4554
 Median: 900
 Largest: 295571

With function pointers

End-to-end latency

# AverageForAllRanks: 63.3281
# StandardDeviation: 2.41738

Running time

 Network testing: 6 seconds
 Counting sequences to assemble: 7 minutes, 44 seconds
 Sequence loading: 1 hours, 7 minutes, 0 seconds
 K-mer counting: 1 hours, 8 minutes, 17 seconds
 Coverage distribution analysis: 53 seconds
 Graph construction: 1 hours, 41 minutes, 50 seconds
 Null edge purging: 35 minutes, 24 seconds
 Selection of optimal read markers: 1 hours, 17 minutes, 8 seconds
 Detection of assembly seeds: 40 minutes, 29 seconds
 Estimation of outer distances for paired reads: 6 minutes, 18 seconds
 Bidirectional extension of seeds: 4 hours, 4 minutes, 18 seconds
 Merging of redundant paths: 1 hours, 45 minutes, 48 seconds
 Generation of contigs: 15 seconds
 Scaffolding of contigs: 34 minutes, 54 seconds
 Counting sequences to search: 18 seconds
 Graph coloring: 26 minutes, 0 seconds
 Counting contig biological abundances: 2 minutes, 10 seconds
 Counting sequence biological abundances: 57 minutes, 39 seconds
 Loading taxons: 4 seconds
 Loading tree: 5 seconds
 Processing gene ontologies: 4 seconds
 Computing neighbourhoods: 0 seconds
 Total: 14 hours, 36 minutes, 46 seconds

Results

Contigs >= 100 nt
 Number: 532150
 Total length: 207164179
 Average: 389
 N50: 911
 Median: 155
 Largest: 226570
Contigs >= 500 nt
 Number: 61070
 Total length: 124223710
 Average: 2034
 N50: 3946
 Median: 898
 Largest: 226570
Scaffolds >= 100 nt
 Number: 529111
 Total length: 207618791
 Average: 392
 N50: 956
 Median: 155
 Largest: 295712
Scaffolds >= 500 nt
 Number: 58239
 Total length: 124776327
 Average: 2142
 N50: 4515
 Median: 901
 Largest: 295712

Conclusion

In HPC, virtual methods should be avoided in the main loop because they produce machine instructions (extra indirection through the virtual table) that are avoided with function pointers. In the presented benchmark, using function pointers instead of virtual methods to bind adapters to the RayPlatform architecture reduced the running time by 25% while producing the same result.


Comments


sebhtml said…
The trick is that the this pointer is actually assigned to a static global variable for each plugin object.

Then it is easy to define a global function that calls this->something.
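A minimal sketch of this trick, with hypothetical names (not the actual RayPlatform macro expansion), would be:

 // Hypothetical expansion of the adapter macros: a static global pointer
 // holds the plugin object, and a plain function forwards the call, so the
 // core only needs to store and invoke a function pointer -- no vtable.
 class MyPlugin {
 public:
     void callSlaveMode() { /* actual work of the slave mode */ }
 };

 static MyPlugin* MyPlugin_object = 0;       // receives the "this" pointer at binding time

 static void call_MyPlugin_SlaveMode() {     // ordinary function, usable as a function pointer
     MyPlugin_object->callSlaveMode();
 }

 typedef void (*SlaveModeHandler)();         // the type the core would store

 int main() {
     MyPlugin plugin;
     MyPlugin_object = &plugin;              // __BindPlugin would do something like this
     SlaveModeHandler handler = call_MyPlugin_SlaveMode;
     handler();                              // dispatch through the function pointer
     return 0;
 }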
