NVIDIA DGX-1: The Fastest Deep Learning System | NVIDIA Developer Blog

One year ago today, NVIDIA announced the NVIDIA® DGX-1™, an integrated system for deep learning. DGX-1 (shown in Figure 1) features eight Tesla P100 GPU accelerators connected through NVLink, the NVIDIA high-performance GPU interconnect, in a hybrid cube-mesh network. Together with dual-socket Intel Xeon CPUs and four 100 Gb InfiniBand network interface cards, DGX-1 provides unprecedented performance for deep learning training. Moreover, the DGX-1 system software and powerful libraries are tuned for scaling deep learning on its network of Tesla P100 GPUs, providing a flexible and scalable platform for the application of deep learning in both production and research settings.

To celebrate the first birthday of DGX-1, NVIDIA is releasing a detailed new technical white paper about the DGX-1 system architecture. This white paper includes an in-depth look at the hardware and software technologies that make DGX-1 the fastest platform for deep learning training. In this post, I will summarize those technologies, but make sure to read the DGX-1 white paper for complete details.

DGX-1 System Architecture


DGX-1 is a deep learning system architected for high throughput and high interconnect bandwidth to maximize neural network training performance. The core of the system is a complex of eight Tesla P100 GPUs connected in a hybrid cube-mesh NVLink network topology. (For more details about the NVIDIA Pascal-architecture-based Tesla P100, see the post Inside Pascal.) In addition to the eight GPUs, DGX-1 includes two CPUs for boot, storage management, and deep learning framework coordination. DGX-1 is built into a three-rack-unit (3U) enclosure that provides power, cooling, network, multi-system interconnect, and SSD file system cache, balanced to optimize throughput and deep learning training time. Figure 2 shows the DGX-1 system components.

Figure 2: DGX-1 system components.

NVLink is an energy-efficient, high-bandwidth interconnect that enables NVIDIA Pascal GPUs to connect to peer GPUs or other devices within a node at an aggregate bidirectional bandwidth of 160 GB/s per GPU: roughly five times that of current PCIe Gen3 x16 interconnections. The NVLink interconnect and the DGX-1 architecture's hybrid cube-mesh GPU network topology enable the highest-bandwidth data interchange between a group of eight Tesla P100 GPUs.
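As a quick sanity check, the bandwidth figures above are consistent with each other. The sketch below is a back-of-the-envelope calculation using numbers from the text; the per-direction PCIe Gen3 x16 rate of ~16 GB/s is an assumption I am adding, not stated in the post.

```python
# Back-of-the-envelope check of the NVLink vs. PCIe figures quoted above.
NVLINK_LINKS_PER_GPU = 4     # Tesla P100 exposes four NVLink connection points
NVLINK_GBPS_PER_LINK = 20    # peak per-direction bandwidth per link (GB/s)

# Aggregate bidirectional NVLink bandwidth per GPU: 4 links x 20 GB/s x 2 directions
nvlink_bidir = NVLINK_LINKS_PER_GPU * NVLINK_GBPS_PER_LINK * 2
print(nvlink_bidir)          # 160 GB/s, matching the text

# PCIe Gen3 x16: ~16 GB/s per direction (assumed), ~32 GB/s bidirectional
pcie_bidir = 16 * 2
print(nvlink_bidir / pcie_bidir)  # 5.0, i.e. the "roughly five times" above
```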

Tesla P100's Page Migration Engine allows high-bandwidth, low-overhead sharing of data between the GPUs and bulk host memory. For scaling to many-node high-performance clusters, DGX-1 provides high system-to-system bandwidth through InfiniBand (IB) networking.

NVLink for Efficient Deep Learning Scaling

Figure 3: The Tesla P100 accelerator.

To provide the highest possible computational density, DGX-1 includes eight NVIDIA Tesla P100 accelerators (Figure 3). Application scaling on this many highly parallel GPUs is hampered by today's PCIe interconnect. NVLink provides the communications performance needed to achieve good (weak and strong) scaling on deep learning and other applications. Each Tesla P100 GPU has four NVLink connection points, each providing a point-to-point connection to another GPU at a peak bandwidth of 20 GB/s. Multiple NVLink connections can be bonded together, multiplying the available interconnection bandwidth between a given pair of GPUs. The result is that NVLink provides a flexible interconnect that can be used to build a variety of network topologies among multiple GPUs. Pascal also supports 16 lanes of PCIe 3.0. In DGX-1, these are used for connecting between the CPUs and GPUs. PCIe is also used for high-speed networking interface cards.

The design of the NVLink network topology for DGX-1 aims to optimize a number of factors, including the bandwidth achievable for a variety of point-to-point and collective communications primitives, the flexibility of the topology, and its performance with a subset of the GPUs. The hybrid cube-mesh topology (Figure 4) can be thought of as a cube with GPUs at its corners and with all twelve edges connected through NVLink, and with two of the six faces having their diagonals connected as well. It can also be thought of as two interwoven rings of single NVLink connections.

Figure 4: DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology. The corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.
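The cube-plus-diagonals description above can be sketched in a few lines. The GPU numbering below is illustrative (the post does not specify one): GPUs 0-3 and 4-7 form the two diagonal-connected faces, so each quad is fully connected, and four remaining cube edges join the quads. The arithmetic works out: 12 cube edges plus 4 face diagonals is 16 links, and every GPU uses exactly its four NVLink ports.

```python
# Sketch of the 8-GPU hybrid cube-mesh topology described above.
# GPU numbering is an illustrative assumption, not taken from the text.
from itertools import combinations

links = set()
# The two diagonal-connected faces become fully connected quads
# (four face edges + two diagonals = all 6 pairs of each quad).
for quad in ((0, 1, 2, 3), (4, 5, 6, 7)):
    links.update(combinations(quad, 2))
# The remaining four cube edges join the two quads.
links.update([(0, 4), (1, 5), (2, 6), (3, 7)])

# 12 cube edges + 2x2 face diagonals = 16 NVLink connections...
assert len(links) == 16
# ...and every GPU uses exactly its four NVLink connection points.
assert all(sum(g in link for link in links) == 4 for g in range(8))
```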

Figure 5 shows deep learning training performance and scaling on DGX-1. The bars in Figure 5 represent training performance in images per second for the ResNet-50 deep neural network architecture using the Microsoft Cognitive Toolkit (CNTK), and the lines represent the parallel speedup of 2, 4, or 8 P100 GPUs versus a single GPU. The tests used a minibatch size of 64 images per GPU.

Figure 5: DGX-1 (weak) scaling results and performance for training the ResNet-50 neural network architecture using the Microsoft Cognitive Toolkit (CNTK) with a batch size of 64 per GPU. The bars present performance on one, two, four, and eight Tesla P100 GPUs in DGX-1 using NVLink for inter-GPU communication (light green) compared to an off-the-shelf system with eight Tesla P100 GPUs using PCIe for communication (dark green). The lines present the speedup compared to a single GPU. On eight GPUs, NVLink provides about 1.4x (1513 images/s vs. 1096 images/s) higher training performance than PCIe. Tests used NVIDIA DGX containers version 16.12, processing real data with cuDNN 6.0.5, NCCL 1.6.1, gradbits=32.
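The "about 1.4x" claim follows directly from the two throughput numbers quoted in the caption:

```python
# Speedup implied by the Figure 5 numbers quoted above:
# ResNet-50 training throughput on eight P100s, NVLink vs. PCIe.
nvlink_imgs_per_s = 1513   # from the text
pcie_imgs_per_s = 1096     # from the text

speedup = nvlink_imgs_per_s / pcie_imgs_per_s
print(round(speedup, 2))   # 1.38, i.e. the "about 1.4x" in the text
```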

The benefits of NVLink show clearly when comparing deep learning training using 1, 2, 4, and 8 GPUs on PCIe (tree topology) to the 8-GPU hybrid cube-mesh NVLink interconnect of DGX-1, as Figure 5 shows. NVLink really shines in the 4x and 8x cases, where DGX-1 aggregates multiple NVLink connections in a way that cannot be done with PCIe, achieving nearly 1.4x total speedup vs. PCIe. Not only does the DGX-1 architecture's NVLink interconnect achieve better scaling than PCIe, the NVLink hybrid cube-mesh network topology provides the best overall scaling for deep learning, compared to alternative NVLink network configurations such as a ring topology.

InfiniBand for Multi-System Scaling of DGX-1 Systems

Multi-system scaling of the latest computational workloads, especially deep learning, requires strong communications between GPUs both inside the system and between systems to match the significant GPU performance of each system. In addition to NVLink for high-speed internal communication between GPUs, DGX-1 also uses Mellanox ConnectX-4 EDR InfiniBand ports to provide significant bandwidth between systems and reduce bottlenecks. DGX-1 is configured with EDR IB, the latest InfiniBand standard.

DGX-1 comes configured with four EDR IB ports providing 800 Gb/s (400 Gb/s in and 400 Gb/s out of the system simultaneously) that can be used to build a high-speed cluster of DGX-1 systems. Four EDR IB ports balance intra- and inter-node bandwidth, and in certain use cases inter-node communication can fully consume them. When compared to typical networking technologies such as Ethernet, InfiniBand provides twenty times the bandwidth and four times lower latency, even across a large multi-system cluster (see the white paper for details).
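The 800 Gb/s figure above decomposes as four ports at the EDR line rate. The 100 Gb/s per-port, per-direction rate is a property of the EDR standard that I am assuming here; the post quotes only the aggregate.

```python
# Checking the aggregate EDR InfiniBand bandwidth quoted above.
EDR_GBPS_PER_PORT = 100    # per direction; assumed EDR line rate
PORTS = 4                  # EDR IB ports in DGX-1 (from the text)

per_direction = PORTS * EDR_GBPS_PER_PORT
total = per_direction * 2  # in and out simultaneously
print(per_direction, total)  # 400 800, matching the 400/800 Gb/s above
```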

The latest DGX-1 multi-system clusters use a network based on a fat-tree topology providing well-routed, predictable, contention-free communication from each system to every other system (see Figure 6). A fat tree is a tree-structured network topology with systems at the leaves that connect up through multiple switch levels to a central top-level switch. Each level in a fat tree has the same number of links, providing equal bandwidth. The fat-tree topology ensures the highest communication bisection bandwidth and lowest latency for all-to-all or all-gather type collectives that are common in computational and deep learning applications.

Figure 6: Example multi-system cluster of 124 DGX-1 systems tuned for deep learning.

DGX-1 Software

The DGX-1 software has been built to run deep learning at scale. A key goal is to enable practitioners to deploy deep learning frameworks and applications on DGX-1 with minimal setup effort. The design of the platform software is centered around a minimal OS and driver install on the server, and provisioning of all application and SDK software in NVIDIA Docker containers through the DGX container registry, maintained by NVIDIA. Containers available for DGX-1 include multiple optimized deep learning frameworks, the NVIDIA DIGITS deep learning training application, third-party accelerated solutions, and the NVIDIA CUDA Toolkit. Figure 7 shows the DGX-1 deep learning software stack.

Figure 7: The DGX-1 deep learning software stack.

• The NVIDIA Collective Communications Library (NCCL, pronounced "nickel"), a library of topology-aware multi-GPU collective communication primitives. NVIDIA Docker containers for DGX-1 include a version of NCCL that optimizes these collectives for the DGX-1 architecture's 8-GPU hybrid cube-mesh NVLink network. Learn more about NCCL in this Parallel Forall blog post.

• Deep learning frameworks for DGX-1. The NVIDIA Deep Learning SDK accelerates widely used deep learning frameworks such as Caffe, CNTK, MXNet, TensorFlow, Theano, and Torch. The DGX-1 software stack provides containerized versions of these frameworks optimized for the system. These frameworks, including all necessary dependencies, are pre-built, tested, and ready to run. For users who need more flexibility to build custom deep learning solutions, each framework container image also includes the framework source code to enable custom modifications and enhancements, along with the complete software development stack.
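To give a feel for the collectives NCCL provides, here is a toy, single-process simulation of a ring all-reduce, one of the topology-aware algorithms such libraries use. This is purely an illustrative sketch of the idea, not the NCCL implementation or API: real NCCL splits each buffer into chunks and pipelines them around the ring.

```python
# Toy simulation of a ring all-reduce: each "rank" contributes one
# value, a running sum circulates around the ring (n-1 reduce steps),
# then the total circulates back (n-1 broadcast steps) so every rank
# ends up with the full sum. Illustrative only; not the NCCL API.
def ring_allreduce(values):
    n = len(values)
    running = values[0]
    for rank in range(1, n):   # reduce pass around the ring
        running += values[rank]
    return [running] * n       # broadcast pass: all ranks get the sum

# One value per "GPU" on an 8-GPU ring:
print(ring_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))  # [36, 36, ..., 36]
```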

The performance of DGX-1 for training popular deep neural networks speaks volumes about the value of an integrated system for deep learning. The graph in Figure 8 shows the training speedup of DGX-1 compared to an off-the-shelf system with the same GPUs for the ResNet-50 and ResNet-152 deep neural networks using the Microsoft Cognitive Toolkit, TensorFlow, and Torch. This graph demonstrates two clear benefits.

Figure 8: DGX-1 deep learning training speedup using all eight Tesla P100s of DGX-1 vs. 8-GPU Tesla M40 and Tesla P100 systems using PCIe interconnect, for the ResNet-50 and ResNet-152 deep neural network architectures on the popular CNTK (2.0 beta5), TensorFlow (0.12-dev), and Torch (11-08-16) deep learning frameworks. Training used 32-bit floating-point arithmetic and total batch size 512 for ResNet-50 and 128 for ResNet-152. Other software: NVIDIA DGX containers version 16.12, NCCL 1.6.1, CUDA 8.0.54, cuDNN 6.0.5, Ubuntu 14.04, NVIDIA Linux display driver 375.30. The 8x M40 and 8x P100 PCIe server is an SMC 4028GR with dual Intel Xeon E5-2698 v4 CPUs and 256 GB DDR4-2133 RAM (DGX-1 has 512 GB DDR4-2133).

The high performance of DGX-1 is due in part to the NVLink hybrid cube-mesh interconnect between its eight Tesla P100 GPUs, but that is not the whole story. Much of the performance benefit of DGX-1 comes from the fact that it is an integrated system, with a complete software platform aimed at deep learning. This includes deep learning framework optimizations such as those in NVIDIA Caffe, cuBLAS, cuDNN, and other GPU-accelerated libraries, and NVLink-tuned collective communications through NCCL. This integrated software platform, combined with Tesla P100 and NVLink, ensures that DGX-1 outperforms similar off-the-shelf systems.
