Function pointers v. virtual methods in HPC
The design pattern 'Adapter' is useful to add a component to an existing architecture without modifying the said architecture. The Ray assembler is built on top of RayPlatform, a distributed compute engine. Ray is an array of plugins implementing algorithms for various things in genomics, and each of these plugins uses adapters to bind their behavior onto the RayPlatform core.
The initial design of adapters in RayPlatform was based on a common interface, using pure virtual classes. Although this model is an elegant design, it also implies a significant overhead because adapters in RayPlatform are called in the main loop, which aims at running as fast as possible. There are at least two problems with virtual methods: they add additional indirections and they can not be inlined. Furthermore, using classes to create these adapters implies that it is required to declare the class in a .h, and to implement it in a .cpp.
New adapter architecture
Recently, we have moved away from virtual methods in favor of function pointers as the mean to create the adapters. From a developer point of view, the process is simplier as nothing is required inside the .h file of any plugin. Instead, only a few macro calls are needed inside the .cpp file of a plugin.
At the top of a plugin .cpp file, the macro call __CreatePlugin(NameOfPlugin) must appear. Then adapters are created with __CreateSlaveModeAdapter(NameOfPlugin,SlaveModeHandle), __CreateMasterModeAdapter(NameOfPlugin,MasterModeHandle), and __CreateMessageTagAdapter(NameOfPlugin,MessageTagHandle), respectively for slave modes, master modes and message tags.
Finally, in the method registerPlugin() of the plugin, behavior is attached with core->setMasterModeObjectHandler(), core->setSlaveModeObjectHandler(), and core->setMessageTagObjectHandler(), respectively for slave modes, master modes and message tags. Adapters are fetched with the macro call __GetAdapter(NameOfPlugin,Handle).
The last thing required is the macro call __BindPlugin(NameOfPlugin).
This architecture is simplier than the previous as it only requires half the macro calls for the binding.
Setup
Benchmarks were done on 64 processor cores, with 8 cores per node. The processor technology was Intel(R) Xeon(R) CPU X5560 @ 2.80GHz and the interconnect was 0d:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0). The system was colosse.
The software was compiled with make ASSERT=n DEBUG=n HAVE_LIBZ=y HAVE_LIBBZ2=y EXTRA=-march=native, which gave the options -Wall -ansi -O3 -D MAXKMERLENGTH=32 -march=native -DHAVE_LIBZ -DHAVE_LIBBZ2 -D RAY_VERSION=\"2.1.0-devel\" -D RAYPLATFORM_VERSION=\"1.1.0-devel\" -I. -IRayPlatform to the C++ compiler.
The dataset used was the sample SRS017209 from the NIH Human Microbiome Project. Its description is Human metagenome sample from G_DNA_Tongue dorsum of a male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)"from G_DNA_Tongue dorsum of a human male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)" and contained 191 588 208 short sequence reads.
With virtual tables
End-to-end latency
# AverageForAllRanks: 66.6875
# StandardDeviation: 2.35186
Running time
Network testing: 10 seconds
Counting sequences to assemble: 8 minutes, 8 seconds
Sequence loading: 1 hours, 6 minutes, 12 seconds
K-mer counting: 1 hours, 16 minutes, 22 seconds
Coverage distribution analysis: 2 seconds
Graph construction: 1 hours, 54 minutes, 58 seconds
Null edge purging: 37 minutes, 31 seconds
Selection of optimal read markers: 1 hours, 22 minutes, 9 seconds
Detection of assembly seeds: 44 minutes, 1 seconds
Estimation of outer distances for paired reads: 6 minutes, 41 seconds
Bidirectional extension of seeds: 7 hours, 3 minutes, 32 seconds
Merging of redundant paths: 2 hours, 16 minutes, 59 seconds
Generation of contigs: 14 seconds
Scaffolding of contigs: 37 minutes, 29 seconds
Counting sequences to search: 10 seconds
Graph coloring: 42 minutes, 8 seconds
Counting contig biological abundances: 2 minutes, 36 seconds
Counting sequence biological abundances: 1 hours, 24 minutes, 42 seconds
Loading taxons: 7 seconds
Loading tree: 4 seconds
Processing gene ontologies: 55 seconds
Computing neighbourhoods: 0 seconds
Total: 19 hours, 25 minutes, 12 seconds
Result
Contigs >= 100 nt
Number: 531984
Total length: 206921447
Average: 388
N50: 910
Median: 155
Largest: 226570
Contigs >= 500 nt
Number: 60928
Total length: 124017769
Average: 2035
N50: 3951
Median: 897
Largest: 226570
Scaffolds >= 100 nt
Number: 528927
Total length: 207378098
Average: 392
N50: 954
Median: 154
Largest: 295571
Scaffolds >= 500 nt
Number: 58086
Total length: 124575648
Average: 2144
N50: 4554
Median: 900
Largest: 295571
The initial design of adapters in RayPlatform was based on a common interface, using pure virtual classes. Although this model is an elegant design, it also implies a significant overhead because adapters in RayPlatform are called in the main loop, which aims at running as fast as possible. There are at least two problems with virtual methods: they add additional indirections and they can not be inlined. Furthermore, using classes to create these adapters implies that it is required to declare the class in a .h, and to implement it in a .cpp.
New adapter architecture
Recently, we have moved away from virtual methods in favor of function pointers as the mean to create the adapters. From a developer point of view, the process is simplier as nothing is required inside the .h file of any plugin. Instead, only a few macro calls are needed inside the .cpp file of a plugin.
At the top of a plugin .cpp file, the macro call __CreatePlugin(NameOfPlugin) must appear. Then adapters are created with __CreateSlaveModeAdapter(NameOfPlugin,SlaveModeHandle), __CreateMasterModeAdapter(NameOfPlugin,MasterModeHandle), and __CreateMessageTagAdapter(NameOfPlugin,MessageTagHandle), respectively for slave modes, master modes and message tags.
Finally, in the method registerPlugin() of the plugin, behavior is attached with core->setMasterModeObjectHandler(), core->setSlaveModeObjectHandler(), and core->setMessageTagObjectHandler(), respectively for slave modes, master modes and message tags. Adapters are fetched with the macro call __GetAdapter(NameOfPlugin,Handle).
The last thing required is the macro call __BindPlugin(NameOfPlugin).
This architecture is simplier than the previous as it only requires half the macro calls for the binding.
Setup
Benchmarks were done on 64 processor cores, with 8 cores per node. The processor technology was Intel(R) Xeon(R) CPU X5560 @ 2.80GHz and the interconnect was 0d:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0). The system was colosse.
The software was compiled with make ASSERT=n DEBUG=n HAVE_LIBZ=y HAVE_LIBBZ2=y EXTRA=-march=native, which gave the options -Wall -ansi -O3 -D MAXKMERLENGTH=32 -march=native -DHAVE_LIBZ -DHAVE_LIBBZ2 -D RAY_VERSION=\"2.1.0-devel\" -D RAYPLATFORM_VERSION=\"1.1.0-devel\" -I. -IRayPlatform to the C++ compiler.
The dataset used was the sample SRS017209 from the NIH Human Microbiome Project. Its description is Human metagenome sample from G_DNA_Tongue dorsum of a male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)"from G_DNA_Tongue dorsum of a human male participant in the dbGaP study "HMP Core Microbiome Sampling Protocol A (HMP-A)" and contained 191 588 208 short sequence reads.
With virtual tables
End-to-end latency
# AverageForAllRanks: 66.6875
# StandardDeviation: 2.35186
Running time
Network testing: 10 seconds
Counting sequences to assemble: 8 minutes, 8 seconds
Sequence loading: 1 hours, 6 minutes, 12 seconds
K-mer counting: 1 hours, 16 minutes, 22 seconds
Coverage distribution analysis: 2 seconds
Graph construction: 1 hours, 54 minutes, 58 seconds
Null edge purging: 37 minutes, 31 seconds
Selection of optimal read markers: 1 hours, 22 minutes, 9 seconds
Detection of assembly seeds: 44 minutes, 1 seconds
Estimation of outer distances for paired reads: 6 minutes, 41 seconds
Bidirectional extension of seeds: 7 hours, 3 minutes, 32 seconds
Merging of redundant paths: 2 hours, 16 minutes, 59 seconds
Generation of contigs: 14 seconds
Scaffolding of contigs: 37 minutes, 29 seconds
Counting sequences to search: 10 seconds
Graph coloring: 42 minutes, 8 seconds
Counting contig biological abundances: 2 minutes, 36 seconds
Counting sequence biological abundances: 1 hours, 24 minutes, 42 seconds
Loading taxons: 7 seconds
Loading tree: 4 seconds
Processing gene ontologies: 55 seconds
Computing neighbourhoods: 0 seconds
Total: 19 hours, 25 minutes, 12 seconds
Result
Contigs >= 100 nt
Number: 531984
Total length: 206921447
Average: 388
N50: 910
Median: 155
Largest: 226570
Contigs >= 500 nt
Number: 60928
Total length: 124017769
Average: 2035
N50: 3951
Median: 897
Largest: 226570
Scaffolds >= 100 nt
Number: 528927
Total length: 207378098
Average: 392
N50: 954
Median: 154
Largest: 295571
Scaffolds >= 500 nt
Number: 58086
Total length: 124575648
Average: 2144
N50: 4554
Median: 900
Largest: 295571
With function pointers
End-to-end latency
# AverageForAllRanks: 63.3281
# StandardDeviation: 2.41738
Running time
Network testing: 6 seconds
Counting sequences to assemble: 7 minutes, 44 seconds
Sequence loading: 1 hours, 7 minutes, 0 seconds
K-mer counting: 1 hours, 8 minutes, 17 seconds
Coverage distribution analysis: 53 seconds
Graph construction: 1 hours, 41 minutes, 50 seconds
Null edge purging: 35 minutes, 24 seconds
Selection of optimal read markers: 1 hours, 17 minutes, 8 seconds
Detection of assembly seeds: 40 minutes, 29 seconds
Estimation of outer distances for paired reads: 6 minutes, 18 seconds
Bidirectional extension of seeds: 4 hours, 4 minutes, 18 seconds
Merging of redundant paths: 1 hours, 45 minutes, 48 seconds
Generation of contigs: 15 seconds
Scaffolding of contigs: 34 minutes, 54 seconds
Counting sequences to search: 18 seconds
Graph coloring: 26 minutes, 0 seconds
Counting contig biological abundances: 2 minutes, 10 seconds
Counting sequence biological abundances: 57 minutes, 39 seconds
Loading taxons: 4 seconds
Loading tree: 5 seconds
Processing gene ontologies: 4 seconds
Computing neighbourhoods: 0 seconds
Total: 14 hours, 36 minutes, 46 seconds
Results
Contigs >= 100 nt
Number: 532150
Total length: 207164179
Average: 389
N50: 911
Median: 155
Largest: 226570
Contigs >= 500 nt
Number: 61070
Total length: 124223710
Average: 2034
N50: 3946
Median: 898
Largest: 226570
Scaffolds >= 100 nt
Number: 529111
Total length: 207618791
Average: 392
N50: 956
Median: 155
Largest: 295712
Scaffolds >= 500 nt
Number: 58239
Total length: 124776327
Average: 2142
N50: 4515
Median: 901
Largest: 295712
Conclusion
Virtual methods in HPC should be avoided in the main loop as it produces machine instructions that are avoided with function pointers. In the presented benchmark, using function pointers instead of virtual methods to bind adapters to the RayPlatform architecture reduced the running time by 25% while producing the same result.
Comments
Function Point Estimation Training
Then it is easy to define a global function that calls this->something.