Samsung Processing in Memory Technology at Hot Chips 2023

Samsung PIM PNM For Transformer Based AI HC35_Page_24

At Hot Chips 2023 (HC35), Samsung is talking about its processing-in-memory (PIM) again, with new research and a new twist. We have covered this previously, for example in our Hot Chips 33 coverage of Samsung HBM2-PIM and Aquabolt-XL. Now, Samsung is showing this in the context of AI.

Since these are being done live from the auditorium, please excuse typos. Hot Chips is a crazy pace.

Samsung Processing in Memory Technology at Hot Chips 2023

One of the biggest costs in computing is moving data from different storage and memory locations to the actual compute engines.

Samsung PIM PNM For Transformer Based AI HC35_Page_03

Currently, companies try to add more lanes or channels for different types of memory. That has its limits.

Samsung PIM PNM For Transformer Based AI HC35_Page_04

Samsung is discussing CXL. CXL helps because it allows for things like repurposing wires used for PCIe to provide more memory bandwidth. We are going to discuss CXL Type-3 devices more in the future on STH and have covered them a few times.
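
For a rough sense of scale (our own back-of-envelope math, not from the talk): a CXL x16 link running on PCIe Gen5 signaling delivers roughly 64GB/s per direction, well short of a single HBM stack, which is why simply adding CXL lanes also has its limits.

```python
# Back-of-envelope CXL link bandwidth (our illustration, not from Samsung's slides).
# CXL rides PCIe Gen5 signaling: 32 GT/s per lane with 128b/130b line coding.
lanes = 16
gt_per_s = 32                    # PCIe Gen5 transfer rate per lane
encoding = 128 / 130             # 128b/130b coding efficiency

raw_gbps = lanes * gt_per_s * encoding   # gigabits per second
raw_gBps = raw_gbps / 8                  # gigabytes per second, per direction

print(f"CXL x16 (Gen5): ~{raw_gBps:.0f} GB/s per direction before protocol overhead")
# -> ~63 GB/s, versus several hundred GB/s for a single HBM stack
```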

Samsung PIM PNM For Transformer Based AI HC35_Page_05

Samsung is discussing GPT bottlenecks.

Samsung PIM PNM For Transformer Based AI HC35_Page_06

Samsung has profiled GPT's compute-bound and memory-bound workloads.
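
To see why the decode phase of GPT inference ends up memory bound, a quick arithmetic-intensity sketch helps. This is our illustration with assumed hardware numbers, not figures from Samsung's profiling:

```python
# Roofline-style sketch of why GPT decode is memory bound. The peak compute and
# bandwidth figures are assumptions for illustration, not from Samsung's slides.
peak_flops = 100e12           # assumed accelerator peak, FLOP/s
mem_bw = 1.0e12               # assumed HBM bandwidth, bytes/s
ridge = peak_flops / mem_bw   # FLOPs/byte needed to stay compute bound -> 100

# Decode: batch-1 GEMV over FP16 weights. Each 2-byte weight is read once for
# one multiply-add (2 FLOPs), so intensity is ~2 FLOPs / 2 bytes = 1 FLOP/byte.
decode_intensity = 2 / 2

# Prefill: a GEMM over L tokens reuses each weight L times.
L = 2048
prefill_intensity = (2 * L) / 2

print(f"ridge point:  {ridge:.0f} FLOPs/byte")
print(f"decode GEMV:  {decode_intensity:.0f} FLOP/byte  -> memory bound")
print(f"prefill GEMM: {prefill_intensity:.0f} FLOPs/byte -> compute bound")
```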

Samsung PIM PNM For Transformer Based AI HC35_Page_07

Here is a bit more on the profiling work in terms of utilization and execution time.

Samsung PIM PNM For Transformer Based AI HC35_Page_08

Samsung shows how parts of the compute pipeline can be offloaded to processing-in-memory (PIM) modules.
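
As a sketch of what that partitioning could look like in software (a hypothetical dispatcher of our own, not Samsung's code or heuristics), memory-bound ops like batch-1 GEMVs would route to PIM while compute-bound GEMMs stay on the accelerator:

```python
# Hypothetical op-level dispatcher illustrating the offload idea. The ridge-point
# threshold and op shapes are our assumptions, not Samsung's actual mapping.
def dispatch(name, flops, bytes_moved, ridge=100):
    """Route an op by arithmetic intensity (FLOPs per byte of DRAM traffic)."""
    intensity = flops / bytes_moved
    target = "PIM" if intensity < ridge else "accelerator"
    print(f"{name}: {intensity:.0f} FLOPs/byte -> {target}")

d = 4096                                                                 # assumed hidden size
dispatch("decode GEMV", flops=2 * d * d, bytes_moved=2 * d * d)          # -> PIM
dispatch("prefill GEMM", flops=2 * 2048 * d * d, bytes_moved=2 * d * d)  # -> accelerator
```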

Samsung PIM PNM For Transformer Based AI HC35_Page_09

Doing processing at the memory module, instead of at the accelerator, saves data movement, lowering power consumption and interconnect costs.
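
To put rough numbers on that (the pJ/bit figures below are commonly cited ballpark values we are assuming, not numbers from the presentation), shipping every weight over an off-package link costs a multiple of the energy of reading it in place:

```python
# Illustrative energy arithmetic for streaming model weights. The pJ/bit costs
# are assumed ballpark values for illustration, not Samsung's measurements.
WEIGHT_BYTES = 350e9            # e.g., a 175B-parameter model in FP16

pj_per_bit_dram = 5             # assumed DRAM array access cost
pj_per_bit_link = 10            # assumed off-package interconnect cost

def joules(num_bytes, pj_per_bit):
    return num_bytes * 8 * pj_per_bit * 1e-12

move_to_xpu = joules(WEIGHT_BYTES, pj_per_bit_dram + pj_per_bit_link)
stay_in_pim = joules(WEIGHT_BYTES, pj_per_bit_dram)

print(f"read + ship to xPU: {move_to_xpu:.0f} J per full weight pass")   # ~42 J
print(f"read in PIM only:   {stay_in_pim:.0f} J per full weight pass")   # ~14 J
```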

Samsung PIM PNM For Transformer Based AI HC35_Page_11

While SK hynix was talking about GDDR6 for its solution, Samsung is showing its high-bandwidth memory (HBM) based HBM-PIM. We are going to be showing HBM on Intel Xeon MAX CPUs in the next week or so on STH, but that is not using this new memory type.

Samsung PIM PNM For Transformer Based AI HC35_Page_12

Apparently, Samsung and AMD had MI100s with HBM-PIM instead of just standard HBM so Samsung could build a cluster, what sounds like a 12-node, 8-accelerator cluster, to try out the new memory.

Samsung PIM PNM For Transformer Based AI HC35_Page_13

Here is how the T5-MoE model uses HBM-PIM in the cluster.

Samsung PIM PNM For Transformer Based AI HC35_Page_14

Here are the performance and energy efficiency gains.

Samsung PIM PNM For Transformer Based AI HC35_Page_15

A big part of this is also how to get the PIM modules to do useful work. That requires software work to program and utilize the PIM modules.
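
To give a flavor of why new software is needed (a hypothetical programming model of our own invention, not Samsung's actual stack), the runtime has to place weights into PIM-enabled banks and launch bank-local kernels, which a standard malloc-plus-BLAS path knows nothing about:

```python
# Hypothetical PIM programming-model sketch. Every name here (FakePIM, alloc,
# gemv) is invented for illustration and emulated on the host; Samsung's real
# software stack differs.
import numpy as np

class FakePIM:
    """Stand-in for a PIM runtime; 'banks' are just host arrays in this sketch."""
    def alloc(self, array):      # place weights into PIM-enabled DRAM banks
        return np.asarray(array)
    def gemv(self, w_banks, x):  # GEMV executed by per-bank ALUs near the data
        return w_banks @ x       # emulated on the host here

pim = FakePIM()
w = pim.alloc(np.random.randn(4096, 4096).astype(np.float16))
x = np.random.randn(4096).astype(np.float16)
y = pim.gemv(w, x)               # only the small result crosses the interface
print(y.shape)                   # (4096,)
```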

Samsung PIM PNM For Transformer Based AI HC35_Page_16

Samsung hopes to get this built into standard programming models.

Samsung PIM PNM For Transformer Based AI HC35_Page_17

Here is the to-be state for OneMCC (memory-coupled computing), but this sounds like a future, rather than a current, state.

Samsung PIM PNM For Transformer Based AI HC35_Page_18

It looks like Samsung is showing off not just the HBM-PIM, but also an LPDDR-PIM. As with everything today, it needs a Generative AI label.

Samsung PIM PNM For Transformer Based AI HC35_Page_19

This one seems to be more of a concept than the HBM-PIM that is being used on AMD MI100s in a cluster.

Samsung PIM PNM For Transformer Based AI HC35_Page_20

This LPDDR-PIM offers only 102.4GB/s of internal bandwidth, but the idea is that keeping compute on the memory module means lower power since data does not have to be transmitted back to the CPU or xPU.
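
For context on what 102.4GB/s buys (our own roofline math under assumed model sizes), a memory-bound decode pass has to stream every weight once per token, so internal bandwidth directly caps tokens per second:

```python
# Upper-bound token rate for memory-bound LLM decode at the LPDDR-PIM's internal
# bandwidth. The model sizes and precisions are our assumptions for illustration.
internal_bw = 102.4e9                     # bytes/s, from the slide

for name, weight_bytes in [("7B @ INT4", 7e9 * 0.5),
                           ("7B @ FP16", 7e9 * 2.0)]:
    tokens_per_s = internal_bw / weight_bytes
    print(f"{name}: <= {tokens_per_s:.0f} tokens/s (bandwidth-bound ceiling)")
# 7B @ INT4 -> ~29 tokens/s; 7B @ FP16 -> ~7 tokens/s
```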

Samsung PIM PNM For Transformer Based AI HC35_Page_21

Here is the architecture with the PIM banks and DRAM banks on the module.

Samsung PIM PNM For Transformer Based AI HC35_Page_22

Here is what the performance and power analysis looks like on the possible LP5-PIM modules.

Samsung PIM PNM For Transformer Based AI HC35_Page_23

If HBM-PIM and LPDDR-PIM were not enough, Samsung is looking at putting compute onto CXL modules with PNM-CXL.

Samsung PIM PNM For Transformer Based AI HC35_Page_25

The idea here is to not just put memory on CXL Type-3 modules. Instead, Samsung is proposing to put compute on the CXL module. This can be done either by adding a compute element to the CXL module and using standard memory or by using PIM on the modules and a more standard CXL controller.

Samsung PIM PNM For Transformer Based AI HC35_Page_26

Of course, we have the obligatory slide showing how this helps generative AI on the GPT side.

Samsung has a concept 512GB CXL-PNM card with up to 1.1TB/s of bandwidth.
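
A quick capacity-and-bandwidth sanity check (our arithmetic, not Samsung's): 512GB is enough to hold a GPT-3-class 175B-parameter model in FP16 (about 350GB) on a single card, and at 1.1TB/s a full pass over those weights takes roughly a third of a second.

```python
# Sanity check on the concept card's headline numbers (our arithmetic; the
# 175B-parameter FP16 model is our example workload).
capacity = 512e9          # bytes
bw = 1.1e12               # bytes/s

params = 175e9            # GPT-3-class model
weights = params * 2      # FP16 -> ~350 GB

print(f"fits on one card: {weights < capacity}")          # True
print(f"full weight pass: {weights / bw * 1e3:.0f} ms "
      f"(~{bw / weights:.1f} tokens/s decode ceiling)")   # ~318 ms, ~3.1 tok/s
```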

Samsung PIM PNM For Transformer Based AI HC35_Page_27

Here is Samsung’s proposed CXL-PNM software stack.

Samsung PIM PNM For Transformer Based AI HC35_Page_28

Here are the expected energy savings and throughput for large-scale LLM workloads. CXL usually runs over wires also used for PCIe, so the energy cost of transmitting data is very high. As a result, there are large gains from being able to avoid that data transfer.

Samsung PIM PNM For Transformer Based AI HC35_Page_29

Samsung is also focused on the emissions reductions that result from the above.

Samsung PIM PNM For Transformer Based AI HC35_Page_30

Google earlier today gave a big talk about CO2 emissions in AI computing. We plan to cover that later this week on STH.

Final Words

Samsung has been pushing PIM for years, but PIM/PNM seems to be moving from purely a research concept to something the company is actually looking to productize. Hopefully, we get to see more of this in the future. The CXL-PNM might end up being a ripe area for this type of compute.
