The convergence of HPC and cloud computing
An exascale computer is a hypothetical system that can sustain one exaflops (10^18 floating-point operations per second). Such a machine is needed in science and engineering, mostly for simulating virtual versions of real-world objects such as proteins, planes, and cities. Important requirements for such a computer include 1) high memory bandwidth, 2) high floating-point throughput, and 3) low network latency.
Two of the many challenges standing in the way of exascale supercomputers by 2020 are 1) improving fault tolerance and 2) lowering energy consumption (see "No Exascale for You!", an interview with Berkeley Lab's Horst Simon).
One typical way to implement fault tolerance in HPC is the checkpoint/restart cycle, whereas most cloud technologies implement fault tolerance with different principles and abstractions, such as load balancing and replication (see the CAP theorem). Checkpoint/restart cannot work at exascale because, at that scale, there will almost always be a failing component somewhere, so an exascale computation would need to survive such failures. In that regard, Facebook is a very large fault-tolerant system built on cloud technologies rather than HPC. A sketch of the checkpoint/restart idea follows below.
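To make the contrast concrete, here is a minimal sketch of the checkpoint/restart idea in Python. The file name, checkpoint interval, and simulated work loop are hypothetical choices for illustration only, not a description of any production HPC checkpointing library.

    import os
    import pickle

    CHECKPOINT_FILE = "state.ckpt"  # hypothetical path, for this sketch only

    def load_checkpoint():
        """Resume from the last saved state, or start fresh."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "value": 0.0}

    def save_checkpoint(state):
        """Write the whole state to disk; at exascale this global I/O is the bottleneck."""
        with open(CHECKPOINT_FILE + ".tmp", "wb") as f:
            pickle.dump(state, f)
        os.replace(CHECKPOINT_FILE + ".tmp", CHECKPOINT_FILE)  # atomic rename

    state = load_checkpoint()
    while state["step"] < 1000:
        state["value"] += 1.0          # stand-in for one unit of simulation work
        state["step"] += 1
        if state["step"] % 100 == 0:   # checkpoint every 100 steps
            save_checkpoint(state)

If the job is killed, rerunning it resumes from the last checkpoint; the problem at exascale is that failures arrive faster than the full state can be written out and read back.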
Because fault tolerance was figured out in cloud technologies a while ago, the cloud community has been able to move on to other important problems. One of the most active areas of development in cloud computing in 2015 has without a doubt been orchestration and provisioning. Meanwhile, HPC is still making progress on solving the fault-tolerance problem in its own context.
Abstractions
A significant body of research output keeps coming from UC Berkeley's AMPLab and other research groups, as well as from Internet companies (Google, Facebook, Amazon, Microsoft, and others). The "cloud stack" (see the Apache projects such as Spark, Mesos, ZooKeeper, and Cassandra) covers a significant part of today's market needs: datacenter abstraction, distributed databases, and map-reduce abstractions at scale. In other words, anyone can get started very quickly with these off-the-shelf components, typically through high-level abstractions such as Spark's Resilient Distributed Datasets (RDDs), as in the sketch below. Moreover, these building blocks can be deployed very easily in various cloud environments, whereas this is rarely the case in HPC.
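As a rough illustration of this level of abstraction, here is a minimal word-count sketch using Spark's RDD API through PySpark. The input and output paths are hypothetical, and the cluster configuration is left to Spark's defaults.

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    # Build an RDD from a text file (hypothetical path), then apply the classic
    # map-reduce pipeline. Recovering lost partitions (by recomputing them from
    # the RDD lineage) is handled by the framework, not by the application.
    counts = (sc.textFile("hdfs:///data/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///data/word_counts")
    sc.stop()

Nothing in this program mentions nodes, failures, or message passing; that is precisely the point of the abstraction.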
One observation is that HPC always wants the highest processing speed, usually on bare metal. This low level of abstraction comes with the convenience that everything is built on a very small number of abstractions (typically MPI libraries and job schedulers).
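For contrast, here is a minimal MPI sketch using the mpi4py binding (the problem being summed is, of course, a toy chosen for illustration). The programmer works directly with ranks and explicit communication, and a single failed rank typically takes down the whole job unless checkpoint/restart is layered on top.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank computes a partial sum; rank 0 gathers the total through an
    # explicit collective call. There is no built-in recovery if a rank fails.
    local = sum(range(rank, 1000, size))
    total = comm.reduce(local, op=MPI.SUM, root=0)

    if rank == 0:
        print("total =", total)

The code is fast and close to the metal, but every detail of data distribution and communication is the application's responsibility.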
On the other hand, abstractions abound in the cloud world, and things are evolving much faster there than in HPC (see "HPC is dying, and MPI is killing it").
But... I need a fast network for my HPC workflow
One thing typically associated with HPC, and not with the cloud, is a very fast network. But this gap is closing, and the cloud is catching up in that regard. Recently, Microsoft added RDMA to Windows Azure, so the cloud now offers low latency (on the order of microseconds) and high bandwidth (40 Gbps). This is no longer an exclusive feature of HPC.
The network is the computer
In the end, as Sun Microsystems's John Gage said, "The Network is the Computer." The HPC stack is already converging toward what is found in the web/cloud/big-data stack (see this piece). There are significant advances in cloud networking too, such as software-defined networks, convenient and automated network provisioning, and performance improvements. So the prediction that can perhaps be made today is that HPC and cloud will no longer be two solitudes in the not-so-distant future: HPC will benefit from the cloud, and vice versa.
What the future holds in this ongoing convergence will be very exciting.
References
-----------------
Daniel A. Reed, Jack Dongarra
Exascale Computing and Big Data
Communications of the ACM, Vol. 58, No. 7, pages 56-68. DOI: 10.1145/2699414
http://cacm.acm.org/magazines/2015/7/188732-exascale-computing-and-big-data/fulltext
This comprehensive survey highlights how HPC (called exascale computing, even though no operational exascale computer exists as of today) and cloud can meet at the crossroads.
Tiffany Trader
Fastest Supercomputer Runs Ubuntu, OpenStack
HPCwire, May 27, 2014
http://www.hpcwire.com/2014/05/27/fastest-supercomputer-runs-ubuntu-openstack/
This article reports on a very large supercomputer that is running OpenStack instead of the classic HPC schedulers (like MOAB, SGE, Cobalt, Maui).
Jonathan Dursi
HPC is dying, and MPI is killing it
R&D computing at scale, 2015-04-03
http://www.dursi.ca/hpc-is-dying-and-mpi-is-killing-it/
This piece is a provocative yet realistic depiction of the current state of popularity of various HPC and cloud technologies (surveyed using Google Trends).