Showing posts from July, 2015

The convergence of HPC and cloud computing

An exascale computer is an imaginary system that can sustain one exaflops (10^18 floating point operations per second.) Such an object is needed in science and engineering, mostly for simulating virtual versions of objects found in the real world, such as proteins, planes, and cities. Important requirements for such a computer are 1) memory bandwidth, 2) floating point operation throughput, 3) low network latency, and so on.

2 of the many challenges for possibly having exascale supercomputers by 2020 are 1) improving fault-tolerance and 2) lowering energy consumption. (see "No Exascale for You!" An Interview with Berkeley Lab's Horst Simon).

One typical solution to implement fault tolerance in HPC is the use of the checkpoint/restart cycle whereas in most cloud technologies fault tolerance is instead implemented using different principles/abstractions such as load balancing and replication (see the CAP theorem). The checkpoint/restart can not work at the exa scale because …