Big News From GTC 2017 China: NVIDIA CUDA 9 Launch

10x NVIDIA GTX 1080 TI FE Plus Mellanox Top

At GTC 2017 in China (NVIDIA holds multiple GTCs), NVIDIA announced that CUDA 9 is now available. That is a major milestone for the HPC and AI industries, because each new CUDA release generally brings support for new architectures and libraries optimized for the most cutting-edge applications. NVIDIA CUDA 9 has been available in release candidate form for some time, but we are finally seeing the general availability (GA) release of the new tooling.

New NVIDIA CUDA 9 Features

If you want a full overview, the NVIDIA Parallel Forall blog has an in-depth look at the new features of NVIDIA CUDA 9. We suggest giving it a read.

The NVIDIA Developer site lists the key features as:

  • Speed up high-performance computing (HPC) and deep learning apps with new GEMM kernels in cuBLAS
  • Execute image and signal processing apps faster with performance optimizations across multiple GPU configurations in cuFFT and NVIDIA Performance Primitives
  • Solve linear and graph analytics problems common in HPC with new algorithms in cuSOLVER and nvGRAPH
  • Express rich parallel algorithms with threads from sub-tiles to warps, blocks, and grids
  • Manage and reuse threads efficiently within an application with new API and function primitives
  • Optimize and pre-fetch memory access by identifying source code causing page faults in unified memory
  • Inspect unified memory performance bottlenecks with new event filters based on virtual address, migration reason and page fault access type
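The "sub-tiles to warps, blocks, and grids" item refers to the Cooperative Groups API that ships with CUDA 9. As a rough illustration only, here is a minimal sketch of a sum reduction that partitions each thread block into 32-thread tiles. The kernel name `sum_kernel`, the helpers `tile_sum` and `sum_on_gpu`, and the launch configuration are our own assumptions for the example, not anything taken from NVIDIA's materials.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Sum one value per thread across a 32-thread tile using the tile's
// shuffle collective. Only the thread with thread_rank() == 0 ends up
// holding the complete sum.
__device__ int tile_sum(cg::thread_block_tile<32> tile, int value) {
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        value += tile.shfl_down(value, offset);
    }
    return value;
}

__global__ void sum_kernel(const int *in, int *out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads contribute 0 but still take part in the shuffle.
    int value = (idx < n) ? in[idx] : 0;

    int partial = tile_sum(tile, value);
    if (tile.thread_rank() == 0) {
        atomicAdd(out, partial);  // one atomic per tile
    }
}

// Hypothetical host helper: sums n ones on the GPU.
// Returns the sum, or -1 if a CUDA call fails.
int sum_on_gpu(int n) {
    int *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, sizeof(int)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;

    const int threads = 256;  // multiple of 32, so every tile is full
    sum_kernel<<<(n + threads - 1) / threads, threads>>>(in, out, n);
    int result = (cudaDeviceSynchronize() == cudaSuccess) ? *out : -1;

    cudaFree(in);
    cudaFree(out);
    return result;
}

int main() {
    const int n = 1024;
    printf("sum = %d (expected %d)\n", sum_on_gpu(n), n);
    return 0;
}
```

This should build with `nvcc` from CUDA 9 or later. We used unified memory (`cudaMallocManaged`) only to keep the host code short; explicit device allocations and copies would work just as well.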

CUDA 9 also adds a number of Volta and NVLink support items:

  • Replace warp-synchronous programming with a robust programming model on Kepler architecture and above
  • Execute AI applications faster with Tensor Cores performing 5X faster than Pascal GPUs
  • Scale multi-GPU applications with next-generation NVLink delivering 2X throughput of prior generation
  • Increase GPU utilization with Volta Multi-Process Service (MPS)
  • Profile PCIe usage by analyzing bandwidth of memory transfers, latency, and comparison with NVLink
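For the warp-synchronous item, CUDA 9 introduces `_sync` variants of the warp intrinsics that take an explicit mask of participating lanes instead of relying on implicit lockstep execution. The following is a minimal sketch, assuming a simple "count the positive values" task of our own invention; `count_positive`, `count_positive_on_gpu`, and the test data are hypothetical names and values chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count_positive(const float *in, int *count, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads vote "false" rather than returning early,
    // so every lane of the warp reaches the ballot below.
    bool positive = (idx < n) && (in[idx] > 0.0f);

    // Explicit full-warp mask: all 32 lanes participate in the vote.
    unsigned votes = __ballot_sync(0xffffffffu, positive);

    // Lane 0 of each warp adds the number of set bits to the total.
    if ((threadIdx.x % warpSize) == 0) {
        atomicAdd(count, __popc(votes));
    }
}

// Hypothetical host helper: fills n values alternating +1, -1 and
// counts the positive ones on the GPU. Returns -1 if a CUDA call fails.
int count_positive_on_gpu(int n) {
    float *in = nullptr;
    int *count = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&count, sizeof(int)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    for (int i = 0; i < n; ++i) in[i] = (i % 2 == 0) ? 1.0f : -1.0f;
    *count = 0;

    const int threads = 256;  // 1D block, multiple of 32: only full warps
    count_positive<<<(n + threads - 1) / threads, threads>>>(in, count, n);
    int result = (cudaDeviceSynchronize() == cudaSuccess) ? *count : -1;

    cudaFree(in);
    cudaFree(count);
    return result;
}

int main() {
    const int n = 1000;
    printf("positive = %d (expected %d)\n", count_positive_on_gpu(n), n / 2);
    return 0;
}
```

The full mask is only valid here because the block size is a multiple of 32 and no thread exits before the ballot. Code with divergent branches would need to compute the mask of lanes that actually reach the call.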

STH will be updating many of our nvidia-docker images with the new CUDA 9 after testing.

