Function pointers v. virtual methods in HPC

The design pattern 'Adapter' is useful to add a component to an existing architecture without modifying the said architecture. The Ray assembler is built on top of RayPlatform, a distributed compute engine. Ray is an array of plugins implementing algorithms for various things in genomics, and each of these plugins uses adapters to bind their behavior onto the RayPlatform core.

The initial design of adapters in RayPlatform was based on a common interface, using pure virtual classes. Although this model is an elegant design, it also implies a significant overhead because adapters in RayPlatform are called in the main loop, which aims at running as fast as possible. There are at least two problems with virtual methods: they add additional indirections and they can not be inlined. Furthermore, using classes to create these adapters implies that it is required to declare the class in a .h, and to implement it in a .cpp.

New adapter architecture

Recently, we have moved away from virtual methods in favor of function pointers as the mean to create the adapters. From a developer point of view, the process is simplier as nothing is required inside the .h file of any plugin. Instead, only a few macro calls are needed inside the .cpp file of a plugin.

At the top of a plugin .cpp file, the macro call __CreatePlugin(NameOfPlugin) must appear. Then adapters are created with __CreateSlaveModeAdapter(NameOfPlugin,SlaveModeHandle), __CreateMasterModeAdapter(NameOfPlugin,MasterModeHandle), and __CreateMessageTagAdapter(NameOfPlugin,MessageTagHandle), respectively for slave modes, master modes and message tags.

Finally, in the method registerPlugin() of the plugin, behavior is attached with core->setMasterModeObjectHandler(), core->setSlaveModeObjectHandler(), and core->setMessageTagObjectHandler(), respectively for slave modes, master modes and message tags. Adapters are fetched with the macro call __GetAdapter(NameOfPlugin,Handle).

The last thing required is the macro call __BindPlugin(NameOfPlugin).

This architecture is simplier than the previous as it only requires half the macro calls for the binding.

Setup

Benchmarks were done on 64 processor cores, with 8 cores per node. The processor technology was Intel(R) Xeon(R) CPU X5560 @ 2.80GHz and the interconnect was 0d:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0). The system was colosse.

The software was compiled with make ASSERT=n DEBUG=n HAVE_LIBZ=y HAVE_LIBBZ2=y EXTRA=-march=native, which gave the options -Wall -ansi -O3 -D MAXKMERLENGTH=32 -march=native -DHAVE_LIBZ -DHAVE_LIBBZ2 -D RAY_VERSION=\"2.1.0-devel\" -D RAYPLATFORM_VERSION=\"1.1.0-devel\" -I. -IRayPlatform to the C++ compiler.

The dataset used was the sample SRS017209 from the NIH Human Microbiome Project. Its description is Human metagenome sample from G_DNA_Tongue dorsum of a male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)"from G_DNA_Tongue dorsum of a human male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)" and contained 191 588 208 short sequence reads.

With virtual tables

End-to-end latency

# AverageForAllRanks: 66.6875
# StandardDeviation: 2.35186

Running time

Network testing: 10 seconds
Counting sequences to assemble: 8 minutes, 8 seconds
Sequence loading: 1 hours, 6 minutes, 12 seconds
K-mer counting: 1 hours, 16 minutes, 22 seconds
Coverage distribution analysis: 2 seconds
Graph construction: 1 hours, 54 minutes, 58 seconds
Null edge purging: 37 minutes, 31 seconds
Selection of optimal read markers: 1 hours, 22 minutes, 9 seconds
Detection of assembly seeds: 44 minutes, 1 seconds
Estimation of outer distances for paired reads: 6 minutes, 41 seconds
Bidirectional extension of seeds: 7 hours, 3 minutes, 32 seconds
Merging of redundant paths: 2 hours, 16 minutes, 59 seconds
Generation of contigs: 14 seconds
Scaffolding of contigs: 37 minutes, 29 seconds
Counting sequences to search: 10 seconds
Graph coloring: 42 minutes, 8 seconds
Counting contig biological abundances: 2 minutes, 36 seconds
Counting sequence biological abundances: 1 hours, 24 minutes, 42 seconds
Loading taxons: 7 seconds
Loading tree: 4 seconds
Processing gene ontologies: 55 seconds
Computing neighbourhoods: 0 seconds
Total: 19 hours, 25 minutes, 12 seconds

Result

Contigs >= 100 nt
Number: 531984
Total length: 206921447
Average: 388
N50: 910
Median: 155
Largest: 226570
Contigs >= 500 nt
Number: 60928
Total length: 124017769
Average: 2035
N50: 3951
Median: 897
Largest: 226570
Scaffolds >= 100 nt
Number: 528927
Total length: 207378098
Average: 392
N50: 954
Median: 154
Largest: 295571
Scaffolds >= 500 nt
Number: 58086
Total length: 124575648
Average: 2144
N50: 4554
Median: 900
Largest: 295571

With function pointers

End-to-end latency

# AverageForAllRanks: 63.3281

# StandardDeviation: 2.41738

Running time

Network testing: 6 seconds

Counting sequences to assemble: 7 minutes, 44 seconds

Sequence loading: 1 hours, 7 minutes, 0 seconds

K-mer counting: 1 hours, 8 minutes, 17 seconds

Coverage distribution analysis: 53 seconds

Graph construction: 1 hours, 41 minutes, 50 seconds

Null edge purging: 35 minutes, 24 seconds

Selection of optimal read markers: 1 hours, 17 minutes, 8 seconds

Detection of assembly seeds: 40 minutes, 29 seconds

Estimation of outer distances for paired reads: 6 minutes, 18 seconds

Bidirectional extension of seeds: 4 hours, 4 minutes, 18 seconds

Merging of redundant paths: 1 hours, 45 minutes, 48 seconds

Generation of contigs: 15 seconds

Scaffolding of contigs: 34 minutes, 54 seconds

Counting sequences to search: 18 seconds

Graph coloring: 26 minutes, 0 seconds

Counting contig biological abundances: 2 minutes, 10 seconds

Counting sequence biological abundances: 57 minutes, 39 seconds

Loading taxons: 4 seconds

Loading tree: 5 seconds

Processing gene ontologies: 4 seconds

Computing neighbourhoods: 0 seconds

Total: 14 hours, 36 minutes, 46 seconds

Results

Contigs >= 100 nt

Number: 532150

Total length: 207164179

Average: 389

N50: 911

Median: 155

Largest: 226570

Contigs >= 500 nt

Number: 61070

Total length: 124223710

Average: 2034

N50: 3946

Median: 898

Largest: 226570

Scaffolds >= 100 nt

Number: 529111

Total length: 207618791

Average: 392

N50: 956

Median: 155

Largest: 295712

Scaffolds >= 500 nt

Number: 58239

Total length: 124776327

Average: 2142

N50: 4515

Median: 901

Largest: 295712

Conclusion

Virtual methods in HPC should be avoided in the main loop as it produces machine instructions that are avoided with function pointers. In the presented benchmark, using function pointers instead of virtual methods to bind adapters to the RayPlatform architecture reduced the running time by 25% while producing the same result.

Comments

navaneedh said…

Great thoughts you got there, believe I may possibly try just some of it throughout my daily life.

Function Point Estimation Training

Thursday, September 6, 2012 at 3:32:00 AM EDT

sebhtml said…

The trick is that the this pointer is actually affected to a static global variable for each plugin object.

Then it is easy to define a global function that calls this->something.

Thursday, September 20, 2012 at 8:59:00 AM EDT

Search This Blog

DSKernel: AI and Strength Training

Function pointers v. virtual methods in HPC

Comments

Popular posts from this blog

The Thorium actor engine is operational now, we can start to work on actor applications for metagenomics

Learning to solve the example 1 of puzzle 3aa6fb7a in the ARC prize

Adding ZVOL VIRTIO disks to a guest running on a host with the FreeBSD BHYVE hypervisor