During the opening two-plus-hour keynote of NVIDIA’s GPU Technology Conference in San Jose this week, CEO Jensen Huang made announcements and proclamations on everything from autonomous driving to medical imaging to ray tracing. The breadth of the company’s coverage is now substantial, a dramatic shift from its roots solely in graphics and gaming. These kinds of events underscore the value that NVIDIA has created as a company, both for itself and for the industry.
In that series of announcements, Huang launched a $399,000 server. Yes, you read that right – a machine with a $400k price tag. The hardware is aimed at the highest-end, most demanding AI applications on the planet, combining the best of NVIDIA’s hardware stack with its years of software expertise. Likely the biggest customer for these systems will be NVIDIA itself, as the company continues to upgrade and improve its deep learning systems to aid in the development of self-driving cars, robotics, and more.
The NVIDIA DGX-2 claims to be the world’s first 2-petaFLOPS system, generating more compute power than any competing server of similar size and density.
The DGX-2 is powered by 16 discrete V100 graphics chips based on the Volta architecture. These sixteen GPUs have a total of 512GB of HBM2 memory (now 32GB per card rather than 16GB) and an aggregate memory bandwidth of 14.4 TB/s. Each GPU offers 5,120 CUDA cores for a total of 81,920 in the system. The Tensor Cores, which provide much of the design’s AI capability, push the system past the 2.0 PFLOPS mark. This is a massive collection of computing hardware.
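For readers who want to check the math, here is a quick back-of-the-envelope sketch. It assumes NVIDIA’s published per-GPU V100 (32GB) figures – roughly 900 GB/s of HBM2 bandwidth and about 125 TFLOPS of peak Tensor Core throughput per card – and simply multiplies them out to the system totals quoted above.

```cpp
// Back-of-the-envelope totals for a 16x V100 system.
// Per-GPU figures are NVIDIA's published V100 (32GB) specs; treat the
// Tensor Core TFLOPS value as an assumed peak figure, not a measurement.
#include <cstdio>

int main() {
    const int    gpus            = 16;
    const int    hbm2_gb_per_gpu = 32;      // GB of HBM2 per V100
    const double mem_bw_gbs      = 900.0;   // GB/s of memory bandwidth per V100
    const int    cuda_cores      = 5120;    // CUDA cores per V100
    const double tensor_tflops   = 125.0;   // peak Tensor Core TFLOPS per V100 (assumed)

    printf("Total HBM2:          %d GB\n",     gpus * hbm2_gb_per_gpu);             // 512 GB
    printf("Aggregate memory BW: %.1f TB/s\n", gpus * mem_bw_gbs / 1000.0);         // 14.4 TB/s
    printf("Total CUDA cores:    %d\n",        gpus * cuda_cores);                  // 81,920
    printf("Tensor throughput:   %.1f PFLOPS\n", gpus * tensor_tflops / 1000.0);    // 2.0 PFLOPS
    return 0;
}
```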
The previous DGX-1 V100 system, launched just 6 months ago, ran on 8 GPUs with half the memory per GPU. Part of the magic that makes the DGX-2 possible is the development of NVSwitch, a new interconnect architecture that allows NVIDIA to scale its AI integrations further. The physical switch itself is built on 12nm process technology from TSMC and encompasses 2 billion transistors all on its own. It offers 2.4 TB/s of bandwidth.
As PCI Express became a bottleneck for multi-GPU systems crunching the enormous data sets typical of deep learning applications, NVIDIA developed NVLink. First released with the Pascal GPU design and carried forward to Volta, NVLink gives the V100 chip support for 6 connections and a total of 300 GB/s of bandwidth for cross-GPU communication.
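As a concrete illustration, here is a minimal CUDA sketch of how software typically drives that cross-GPU traffic: check whether one device can directly access another’s memory, enable peer access, and copy a buffer between the two. This is generic multi-GPU CUDA rather than anything DGX-specific; on NVLink-connected GPUs the runtime routes the transfer over NVLink instead of PCI Express.

```cpp
// Minimal CUDA sketch: direct GPU-to-GPU copy using peer access.
// On NVLink-connected systems this traffic avoids the PCIe bottleneck;
// the runtime picks the fastest available path, so the code stays generic.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) { printf("Need at least two GPUs.\n"); return 0; }

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1's memory directly?
    printf("GPU0 -> GPU1 peer access: %s\n", canAccess ? "yes" : "no");
    if (!canAccess) return 0;

    const size_t bytes = 256 << 20;              // 256 MB test buffer
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(1); cudaMalloc(&src, bytes);   // source buffer on GPU 1
    cudaSetDevice(0); cudaMalloc(&dst, bytes);   // destination buffer on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);            // let GPU 0 map GPU 1's memory

    cudaMemcpyPeer(dst, 0, src, 1, bytes);       // direct GPU1 -> GPU0 copy
    cudaDeviceSynchronize();
    printf("Copied %zu MB from GPU1 to GPU0.\n", bytes >> 20);

    cudaFree(dst);
    cudaSetDevice(1); cudaFree(src);
    return 0;
}
```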
NVSwitch builds on NVLink as an on-node design and allows any pair of GPUs to communicate at full NVLink speed. This facilitates the next level of scaling, moving beyond the per-GPU limit on NVLink connections and allowing a network to be built around the interface. The switch itself has 18 NVLink ports, each built from eight bi-directional lanes running at 25 Gbps. Though the DGX-2 uses twelve NVSwitch chips to connect 16 GPUs, NVIDIA tells me there is no technological reason it couldn’t push beyond that; it is simply a question of need and physical capability.
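To see what “any pair of GPUs” looks like from the application side, a short loop over all device pairs prints the peer-access matrix; on a fully switched fabric every off-diagonal entry should report as reachable, while PCIe-only or partially meshed NVLink topologies show gaps. Again, this is plain CUDA with no NVSwitch-specific calls.

```cpp
// Print which GPU pairs can access each other's memory directly.
// On a fully switched NVLink fabric every pair should report "1".
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Peer-access matrix for %d GPUs (row -> column):\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int ok = 0;
            if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
            printf("%d ", (i == j) ? 1 : ok);   // a device trivially reaches itself
        }
        printf("\n");
    }
    return 0;
}
```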
With the DGX-2 system in place, NVIDIA claims to see as much as a 10x speedup on select workloads, like training FAIRSEQ, in just the 6 months since the release of the DGX-1. Huang stated that, compared to traditional data center servers built on Xeon processors, the DGX-2 can provide equivalent computing capability at 1/8 the cost, in 1/60 the physical space, and at 1/18 the power. Though the repeated line of “the more you spend, the more you save” might seem cliché, NVIDIA hopes that the organizations investing in AI applications see the value and adopt.
One oddity in the announcement of the DGX-2 was Huang’s claim that it represents the “world’s largest GPU”. The argument likely stems from Google’s branding of the “TPU”, which rolls a collection of processors, platforms, and infrastructure into a singular device, and NVIDIA’s desire to show similar impact. The company may feel that “GPU” is too generic a term for the complex systems it builds, which I would agree with, but I don’t think co-opting a term that has significant value in many other spaces is the right direction.
In addition to the GPUs, the DGX-2 includes substantial hardware from other vendors acting as support systems: a pair of Intel Xeon Platinum processors, 1.5 TB of system memory, eight 100 GigE network connections, and 30TB of NVMe storage. This is an incredibly powerful rackmount server that services AI workloads at unprecedented levels.
The answer I am still searching for is to the simple question of “who buys these?” NVIDIA clearly has its own need for high-performance AI compute capability, and the need to simplify and compress that capability to save money on server infrastructure is substantial. NVIDIA is one of the leading developers of artificial intelligence for autonomous driving, robotics training, algorithm and container set optimization, and more. But other clients are buying in – organizations like New York University, Massachusetts General Hospital, and UC Berkeley have been using the first-generation device in flagship, leadership development roles. I expect the DGX-2’s sales targets will look similar: that small group on the bleeding edge of AI development.
Announcing a $400k AI accelerator may not have a direct effect on many of NVIDIA’s customers, but it clearly solidifies the company’s position of leadership and its internal drive to maintain it. With added pressure from Intel, which is pushing hard into the AI and machine learning fields with acquisitions and internal development, NVIDIA needs to keep moving down that path. If GTC has shown me anything this week, it’s that NVIDIA is doing just that.