SambaNova SN10 RDU at Hot Chips 33

SambaNova SN10 RDU Cover

SambaNova is a hot company because it does something a bit different, focusing on AI acceleration with an eye toward scaling to large models and large memory. SambaNova’s approach requires some fairly heavy compiler work alongside the hardware. This is being covered live, and SambaNova is going through more than a slide per minute, so we are going to do our best to capture as much as possible.

SambaNova SN10 RDU at Hot Chips 33

First, let us get into what this Cardinal SN10 RDU is. SambaNova calls its chip a Reconfigurable Dataflow Unit (RDU), and it gets into why it uses that name later. Beyond the linear algebra units, there is a big focus on routing and on-chip memory bandwidth.

HC33 SambaNova SN10 RDU Cardinal Overview

SambaNova is not just thinking in terms of a PCIe expansion card. Instead, it has systems that it sees as scaling. As an example, it puts 8x RDUs in each SN10-8R, which uses around a quarter of a rack. Each RDU has six memory channels, so that is 48 channels at 256GB per channel for 12TB total. The slide says DDR4-2667, but in Q&A the company also mentioned DDR4-3200.

HC33 SambaNova SN10 RDU DataScale SN10 8R Systems
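
As a quick sanity check on the slide's figures, the math works out as follows (a back-of-the-envelope sketch in Python; the only inputs are the numbers above):

```python
# Back-of-the-envelope check of the SN10-8R memory figures from the slide.
rdus_per_system = 8
channels_per_rdu = 6
gb_per_channel = 256

channels_total = rdus_per_system * channels_per_rdu  # 48 channels
capacity_tb = channels_total * gb_per_channel / 1024  # 12 TB

print(f"{channels_total} channels, {capacity_tb:.0f} TB total")
# -> 48 channels, 12 TB total
```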

The software side is extremely important. The software needs to be very aware of the underlying architecture, as well as the model, to use the RDU efficiently.

HC33 SambaNova SN10 RDU Software

Something SambaNova’s software does is map communication and then compile that communication to the underlying hardware.

HC33 SambaNova SN10 RDU Dataflow Mappings
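
To make that idea more concrete, here is a minimal, hypothetical sketch of what such a mapping could look like: a model expressed as an operator graph whose ops get assigned to compute units, output buffers to memory units, and producer-consumer edges to switch routes. The class and function names here are our illustration, not SambaNova’s actual compiler API.

```python
# Hypothetical illustration of mapping an operator graph onto a grid of
# pattern compute units (PCUs) and pattern memory units (PMUs).
# None of these names come from SambaNova's actual toolchain.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str     # e.g., "matmul1"
    inputs: list  # names of upstream ops feeding this op

@dataclass
class Placement:
    pcu_assignments: dict = field(default_factory=dict)  # op name -> PCU id
    pmu_assignments: dict = field(default_factory=dict)  # buffer  -> PMU id
    routes: list = field(default_factory=list)           # (src, dst) pairs for switches

def map_graph(ops):
    """Greedy placement: ops onto PCUs in topological order, their output
    buffers onto neighboring PMUs, producer->consumer edges onto routes."""
    placement = Placement()
    for i, op in enumerate(ops):
        placement.pcu_assignments[op.name] = f"PCU{i}"
        placement.pmu_assignments[op.name + "_out"] = f"PMU{i}"
        for src in op.inputs:
            placement.routes.append((src + "_out", op.name))
    return placement

graph = [Op("matmul1", []), Op("bias1", ["matmul1"]), Op("relu1", ["bias1"])]
print(map_graph(graph).routes)
# -> [('matmul1_out', 'bias1'), ('bias1_out', 'relu1')]
```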

The key insight behind the company is that GPUs do very well in a specific zone of problems. Some models cannot use the hardware efficiently, while others are simply too big for the onboard memory. As a result, those models do not work well on GPUs and often end up with lower performance.

HC33 SambaNova SN10 RDU Goldilocks Zone Around GPUs

SambaNova’s dataflow has many compute and memory units to exploit parallelism. It can also exploit data locality.

HC33 SambaNova SN10 RDU Dataflow Exploits Data Locality And Parallelism

Here is the chip architecture.

HC33 SambaNova SN10 RDU Chip Overview

Here is the tile deep dive. We can see the tile is made up mostly of three key components: switches, pattern compute units (PCUs), and pattern memory units (PMUs). If you are struggling with SambaNova’s RDU, we are going to see an example later that should make it make more sense.

HC33 SambaNova SN10 RDU Chip Overview Tile

The Pattern Compute Units (PCUs) are the compute engines with the SIMD paths.

HC33 SambaNova SN10 RDU Chip PCU

The other big unit is the Pattern Memory Unit (PMU). This is the on-chip memory system with banked SRAM arrays.

HC33 SambaNova SN10 RDU Chip PMU

The switch has a router pipeline and router crossbar. This is the component that helps direct data to and from the PCUs and PMUs.

HC33 SambaNova SN10 RDU Switch And On Chip Interconnect

There are also address generation (AG) and coalescing units (CU).

HC33 SambaNova SN10 RDU AG And CU

Here is part of the example using LayerNorm. One can see the different steps where computations happen alongside the data changes on the top. Instead of just executing these sequentially, the switches can be used to create a pipeline on the RDU where data moves directly to the adjacent step.

HC33 SambaNova SN10 RDU LayerNorm Pipelined In Space And Fused
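
For reference, the LayerNorm math being pipelined breaks down into a handful of stages. Here is a minimal NumPy sketch of those stages; this is just the standard LayerNorm formula, not SambaNova’s implementation:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """LayerNorm as discrete stages; on the RDU each stage can map to its
    own PCUs/PMUs and run as a pipeline instead of sequentially."""
    mean = x.mean(axis=-1, keepdims=True)                  # stage 1: mean
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)   # stage 2: variance
    xhat = (x - mean) / np.sqrt(var + eps)                 # stage 3: normalize
    return gamma * xhat + beta                             # stage 4: scale and shift

x = np.random.randn(4, 8).astype(np.float32)
out = layernorm(x, gamma=np.ones(8, np.float32), beta=np.zeros(8, np.float32))
print(out.mean(axis=-1))  # ~0 per row after normalization
```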

Then, taking into account both time and space, the computation can be folded onto a smaller number of PMUs, PCUs, and switches. This adds the time component of when data needs to be where. The net impact is that fewer chip resources are used per computation, so more can be done on the chip. Combine this with the high-speed fabric and lots of off-chip DDR4 memory, and the solution can handle larger problem sizes.

HC33 SambaNova SN10 RDU Add Hybrid Space And Time Execution

Here is another example of the dataflow architecture with a bit more granularity.

HC33 SambaNova SN10 RDU Data Flow Instead Of Kernel By Kernel

By doing this work as a dataflow, models do not need to be hand-optimized kernel by kernel.

HC33 SambaNova SN10 RDU Kernel Fusion
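
A rough way to picture the difference, in NumPy terms (our illustration, not SambaNova’s software): a kernel-by-kernel flow materializes every intermediate buffer, while a fused dataflow composes the steps so values stream through.

```python
import numpy as np

x = np.random.randn(1024).astype(np.float32)
w, b = np.float32(0.5), np.float32(0.1)

# Kernel-by-kernel: each step is a separate "kernel" that writes its full
# intermediate result to memory before the next step reads it back.
t1 = x * w                      # kernel 1: scale
t2 = t1 + b                     # kernel 2: bias
y_unfused = np.maximum(t2, 0)   # kernel 3: ReLU

# Fused dataflow: the same steps written as one composed expression,
# standing in for a pipeline where each element flows through
# scale -> bias -> ReLU without round trips through named buffers.
y_fused = np.maximum(x * w + b, 0)

assert np.allclose(y_unfused, y_fused)
```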

Here are some of the features of SambaNova’s solution.

HC33 SambaNova SN10 RDU Spatial Dataflow Architecture

With the quarter-rack system, we get 12TB of memory. That gives plenty of capacity to keep the chips fed.

HC33 SambaNova SN10 RDU Large Memory Capacity

If you are thinking that an approach like this requires really solid compiler work, it does. This approach leans heavily on the software side to map dataflows as part of its process.

SambaNova is showing multiple deployment examples.

HC33 SambaNova SN10 RDU Flexibility 4 RDU

Perhaps the one it is most focused on is scaling out to a data center-scale solution.

HC33 SambaNova SN10 RDU Scale Out

One benefit is that, with scale-out and lots of memory, SambaNova says it can train large 1 trillion parameter language models without needing to split them up behind the scenes, as one would need to do across many GPUs in GPU-based systems.

HC33 SambaNova SN10 RDU 1T Parameter NLP In A Small Footprint
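
Some quick arithmetic shows why memory capacity matters here (our back-of-the-envelope numbers, with an assumed 40GB of HBM per GPU as a stand-in for accelerators of this era):

```python
# Why 12 TB matters for a 1-trillion-parameter model (rough numbers).
params = 1e12
bytes_fp16 = 2  # FP16/BF16 weight storage

weights_tb = params * bytes_fp16 / 1e12  # 2 TB just for the weights
gpu_hbm_gb = 40                          # assumed per-GPU HBM capacity

print(f"Weights alone: {weights_tb:.0f} TB")
print(f"GPUs needed just to hold weights: {weights_tb * 1000 / gpu_hbm_gb:.0f}")
# -> 2 TB of weights; ~50 x 40GB GPUs before counting activations or
#    optimizer state, versus 12 TB of DDR4 in a quarter-rack SN10-8R.
```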

Other benefits include being able to train on full 4K and even 50K x 50K images without having to go through today’s generation of low-resolution downsampling.

HC33 SambaNova SN10 RDU True Resolution Images

GPU-based systems may downsample a larger image (potentially on the CPU) or cut it into tiles that fit in a GPU’s memory. With SambaNova, the additional memory means analytics happen on the full image without downsampling or tiling.

HC33 SambaNova SN10 RDU Medical Without Downsampling
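
The memory math makes clear why tiling is normally forced (again, a rough sketch, assuming FP32 storage for the input tensor):

```python
# Rough memory footprint of holding full-resolution images as FP32 tensors.
def image_gb(h, w, channels=3, bytes_per_val=4):
    return h * w * channels * bytes_per_val / 1e9

for label, (h, w) in {"4K": (2160, 3840), "50K x 50K": (50_000, 50_000)}.items():
    print(f"{label}: {image_gb(h, w):.1f} GB")
# 4K:        ~0.1 GB -- fine on a GPU
# 50K x 50K: ~30 GB  -- close to an entire GPU's HBM before any activations,
#                       which is why images normally get downsampled or tiled
```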

SambaNova says that is part of the reason it is also able to deliver high accuracy. As one can see, its inputs are higher resolution than on the GPU side.

HC33 SambaNova SN10 RDU Accuracy

The company also says its RDUs can be used on other large problems beyond AI.

HC33 SambaNova SN10 RDU Science

Summarizing, the company went back to the Goldilocks slide and said that its solution can handle not just current GPU-optimized models faster, but also those that sit outside of that Goldilocks zone.

HC33 SambaNova SN10 RDU Spectrum

This was certainly a lot to take in.

Final Words

SambaNova, along with Cerebras, was given a special mention by Dimitri Kusnezov, Deputy Under Secretary for AI and Technology at the Department of Energy, during his Hot Chips 33 keynote just before this session. That seems to indicate that the DOE sees some promise in its partnership with the two companies. This was the first time we have seen SambaNova go into its architecture at this depth publicly, so it was great to see what the hype is about.
