Among AI startups, Cerebras has been our front-runner to make it to the next stage for years. Now, it seems to have broken out of the gaggle of startups by scaling its giant wafer scale engine to AI supercomputer scale (for revenue.) At Hot Chips 2023, the company is detailing the new cluster that it plans to use to dwarf what NVIDIA is building.
We are doing this live, so please excuse typos.
Details of the NVIDIA-Dwarfing Cerebras Wafer-Scale Cluster
Cerebras started the presentation with a company update and the observation that AI/ML models are getting bigger (~40,000x in five years.) It also discussed some of the history of ML acceleration.
![Cerebras Wafer Scale Cluster HC35_Page_05](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_05-800x452.jpg)
Process technology has given gains over time.
![Cerebras Wafer Scale Cluster HC35_Page_06](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_06-800x452.jpg)
Architectural techniques, such as moving calculations from FP32 to bfloat16, INT8, or other formats, have also given huge gains.
![Cerebras Wafer Scale Cluster HC35_Page_07](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_07-800x452.jpg)
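As a rough illustration of why those format changes matter, here is a toy sizing exercise showing how much memory (and therefore bandwidth) each step down in precision saves. NumPy has no bfloat16, so float16 stands in for it here; the INT8 scheme is a crude symmetric quantization purely for sizing, not anything Cerebras described.

```python
import numpy as np

# Toy weight tensor: 1M parameters at full FP32 precision.
w32 = np.random.rand(1_000_000).astype(np.float32)

# Half precision halves memory and bandwidth; float16 stands in
# for bfloat16, which NumPy does not provide.
w16 = w32.astype(np.float16)

# Crude symmetric INT8 quantization, just to show the footprint.
scale = np.abs(w32).max() / 127.0
w8 = np.round(w32 / scale).astype(np.int8)

print(w32.nbytes, w16.nbytes, w8.nbytes)  # 4,000,000 -> 2,000,000 -> 1,000,000 bytes
```

The same 4x reduction applies to every transfer of those weights over an interconnect, which is where the cluster-level gains come from.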
Still, which models are practical to use depends on getting gains not just at the chip level, but also at the cluster level.
![Cerebras Wafer Scale Cluster HC35_Page_08](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_08-800x452.jpg)
One of the challenges of current scale-out is simply the communication needed to keep data moving between many smaller compute and memory nodes.
![Cerebras Wafer Scale Cluster HC35_Page_09](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_09-800x452.jpg)
Cerebras built a giant chip to get an order-of-magnitude improvement, but it also needs to scale out to clusters since one chip is not enough.
![Cerebras Wafer Scale Cluster HC35_Page_10](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_10-800x452.jpg)
Traditional scale-out has challenges because it is trying to split a problem, data, and compute across so many devices.
![Cerebras Wafer Scale Cluster HC35_Page_12](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_12-800x452.jpg)
On GPUs, that means using different types of parallelism to scale out to more compute and memory devices.
![Cerebras Wafer Scale Cluster HC35_Page_13](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_13-800x452.jpg)
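For readers who have not worked with these schemes, here is a minimal sketch (not anything from the talk) of the two most common GPU parallelism styles applied to a single matrix multiply: data parallelism slices the batch, tensor parallelism slices the weights, and both reproduce the single-device result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))   # batch of activations
w = rng.standard_normal((64, 32))  # layer weights

full = x @ w  # single-device reference result

# Data parallelism: each "device" holds all of w and a slice of the batch.
data_parallel = np.concatenate([x[:4] @ w, x[4:] @ w], axis=0)

# Tensor parallelism: each "device" holds a column slice of w.
tensor_parallel = np.concatenate([x @ w[:, :16], x @ w[:, 16:]], axis=1)

assert np.allclose(full, data_parallel)
assert np.allclose(full, tensor_parallel)
```

Real training mixes these (plus pipeline parallelism) across thousands of devices, which is exactly the complexity Cerebras is arguing against.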
Cerebras is looking to scale cluster-level memory and cluster-level compute independently, decoupling the compute and memory scaling that is coupled together on GPUs.
![Cerebras Wafer Scale Cluster HC35_Page_14](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_14-800x452.jpg)
Cerebras has 850,000 cores on the WSE-2 for its base. When will we get a 5nm WSE-3? Sounds like not today.
![Cerebras Wafer Scale Cluster HC35_Page_15](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_15-800x452.jpg)
Cerebras houses the WSE-2 in a CS-2 and then connects it to MemoryX, which can then stream data to the big chip.
![Cerebras Wafer Scale Cluster HC35_Page_17](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_17-800x452.jpg)
It then has the SwarmX interconnect to handle the data parallel scaling.
![Cerebras Wafer Scale Cluster HC35_Page_18](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_18-800x452.jpg)
Weights are never stored on the wafer. They are just streamed in.
![Cerebras Wafer Scale Cluster HC35_Page_19](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_19-800x452.jpg)
The SwarmX fabric scales weights and reduces gradients on the return.
![Cerebras Wafer Scale Cluster HC35_Page_20](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_20-800x452.jpg)
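The broadcast-out, reduce-back pattern can be sketched in a few lines. This is a toy simulation under my own naming (the lists standing in for MemoryX, SwarmX, and the CS-2 workers are analogies, not Cerebras APIs): weights are streamed out a layer at a time, each worker computes a gradient on its data shard, and the gradients are summed on the return path, matching what a single device with the full batch would compute.

```python
import numpy as np

rng = np.random.default_rng(0)

# "MemoryX": weights live off the wafer, one entry per layer.
memory_x = [rng.standard_normal((4, 4)) for _ in range(3)]

# Per-worker data shards (the "CS-2s" in this analogy).
shards = [rng.standard_normal((2, 4)) for _ in range(4)]
x_full = np.vstack(shards)

for step_w in memory_x:
    # "SwarmX" broadcast: every worker receives the same layer weights.
    # Toy gradient of 0.5*||x @ w||^2 with respect to w is x.T @ (x @ w).
    grads = [x.T @ (x @ step_w) for x in shards]

    # "SwarmX" reduce on the return path: sum gradients across workers.
    g = np.sum(grads, axis=0)

    # The reduced gradient matches a single device seeing the full batch.
    assert np.allclose(g, x_full.T @ (x_full @ step_w))

    step_w -= 1e-3 * g  # the weight update happens off-wafer, beside the weights
```

The key property the slide claims is exactly the assertion above: the reduce makes many shard-local gradients equivalent to one full-batch gradient.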
Each MemoryX unit has 12x MemoryX nodes. States are stored in DRAM and in flash, with up to 1TB of DRAM and 500TB of flash. Interestingly, the CPUs are only 32-core parts.
![Cerebras Wafer Scale Cluster HC35_Page_22](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_22-800x452.jpg)
Finally, it is connected to the cluster using 100GbE. One port goes to the CS-2 and one to other MemoryX modules.
MemoryX has to handle the sharding of the weights in a thoughtful way to make this work. Ordering the streaming carefully yields an almost-free transpose.
![Cerebras Wafer Scale Cluster HC35_Page_23](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_23-800x452.jpg)
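The "almost free transpose" idea is worth unpacking. A plausible reading (my illustration, not Cerebras code): the forward pass streams the weight matrix row by row, and since the backward pass needs the transpose, MemoryX can simply stream the same stored matrix column by column instead of materializing W transposed.

```python
import numpy as np

w = np.arange(12).reshape(3, 4)  # weights as stored in MemoryX

# Forward pass: stream the rows in order.
forward_stream = [w[i] for i in range(w.shape[0])]

# Backward pass needs w.T. Rather than building a transposed copy,
# just change the streaming order to column-by-column.
backward_stream = [w[:, j] for j in range(w.shape[1])]

# The column-order stream is exactly the row stream of w.T.
assert all(np.array_equal(c, col) for c, col in zip(backward_stream, w.T))
```

The transpose costs only a different read order, not extra memory or a copy, which is presumably why the ordering of the streaming matters so much here.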
MemoryX has a high-performance runtime to transfer data and perform computations.
![Cerebras Wafer Scale Cluster HC35_Page_24](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_24-800x452.jpg)
The SwarmX fabric uses 100GbE and RoCE RDMA for connectivity, with the broadcast reduce happening on CPUs.
![Cerebras Wafer Scale Cluster HC35_Page_25](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_25-800x452.jpg)
Each broadcast reduce unit has 12 nodes, each with 6x 100GbE links. Five of the links are used for a 1:4 broadcast plus a redundant link. That works out to 150Tbps of broadcast reduce bandwidth.
![Cerebras Wafer Scale Cluster HC35_Page_26](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_26-800x452.jpg)
100GbE is interesting since it is now a very commoditized interconnect compared to NVLink/NVSwitch and InfiniBand.
Cerebras is doing these operations off the CS-2/WSE, and that is helping this scale.
![Cerebras Wafer Scale Cluster HC35_Page_27](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_27-800x452.jpg)
This is the SwarmX topology.
![Cerebras Wafer Scale Cluster HC35_Page_28](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_28-800x452.jpg)
The flexibility in the fabric can be used to effectively provision work across the cluster while supporting sub-cluster partitioning.
![Cerebras Wafer Scale Cluster HC35_Page_29](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_29-800x452.jpg)
Here is the Cerebras WSE-2 with me at ISC 2022:
![Patrick With Cerebras WSE 2 Hamburg ISC 2022](https://www.servethehome.com/wp-content/uploads/2022/06/Patrick-with-Cerebras-WSE-2-Hamburg-ISC-2022.jpg)
That goes into an engine block that looks like this:
![Cerebras CS 2 WSE 2 Heart At SC22 4](https://www.servethehome.com/wp-content/uploads/2022/12/Cerebras-CS-2-WSE-2-Heart-at-SC22-4.jpg)
That goes into the Cerebras CS-2.
![Cerebras Wafer Scale Cluster HC35_Page_31](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_31-800x452.jpg)
Those were built into racks.
![Cerebras Wafer Scale Cluster HC35_Page_32](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_32-800x452.jpg)
We can say hello to the Supermicro 1U servers above the CS-2s.
Then the CS-2s went into larger clusters.
![Cerebras Wafer Scale Cluster HC35_Page_33](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_33-800x452.jpg)
Now bigger clusters.
![Cerebras Wafer Scale Cluster HC35_Page_34](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_34-800x452.jpg)
This is the older Andromeda wafer scale cluster.
![Cerebras Wafer Scale Cluster HC35_Page_35](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_35-800x452.jpg)
Cerebras was training large models on Andromeda quickly with 16x CS-2s.
![Cerebras Wafer Scale Cluster HC35_Page_36](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_36-800x452.jpg)
It found that a job programmed for a single CS-2 scaled to 16x CS-2s.
![Cerebras Wafer Scale Cluster HC35_Page_37](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_37-800x452.jpg)
Then Cerebras got bigger with the Condor Galaxy-1 Wafer Scale Cluster that we covered in: 100M USD Cerebras AI Cluster Makes it the Post-Legacy Silicon AI Winner.
![Cerebras Wafer Scale Cluster HC35_Page_38](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_38-800x452.jpg)
Cerebras trained BTLM on that cluster, which is the top 3B-parameter model right now.
![Cerebras Wafer Scale Cluster HC35_Page_39](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_39-800x452.jpg)
Next, Cerebras is scaling to even larger clusters.
![Cerebras Wafer Scale Cluster HC35_Page_40](https://www.servethehome.com/wp-content/uploads/2023/08/Cerebras-Wafer-Scale-Cluster-HC35_Page_40-800x452.jpg)
Final Words
I fell pretty far behind covering this talk. Still, Cerebras is big game hunting, which is important in the era of big models. Having customers buy huge amounts of hardware to get these clusters online is a big vote of confidence for the company. It is a very different approach to scaling than that of NVIDIA and the companies trying to duplicate NVIDIA.