The computer, to be called Dojo, is where the company plans to train neural networks for use in its cars.
Within it, the smallest entity, said project head Ganesh Venkataramanan at the launch, is the ‘training node’.
Much thought, he said, went into sizing this fundamental building block.
“If you go too small, it will run fast but the overheads of synchronisation will dominate,” said Venkataramanan. “If you pick it too big, it will have complexities in implementation in the real hardware and ultimately run into memory bottleneck issues.”
Designing for minimum latency and maximum bandwidth, the team decided to go with a 64bit superscalar CPU (left) optimised around 8×8 matrix multiply units and vector SIMD for FP32, BFP16, CFP8, INT32, INT16 and INT8 data types. 1.25Mbyte of local ECC SRAM is included.
This is the maximum resource that can be spanned by a 2GHz clock on the chosen fabrication process.
It is a superscalar in-order CPU with four-wide scalar and two-wide vector pipes. “We call it in-order for the purists, although the vector and the scalar pipes can go out of order,” said Venkataramanan.
The processor also has four-way multi-threading to increase utilisation by allowing compute and data transfers to proceed simultaneously.
An instruction set was created particularly for machine learning workloads, with features including transfer, gather, link traversals and broadcast.
The whole thing “packs more than 1Tflop of compute,” according to Venkataramanan. In detail, it provides 1.024Tflop/s (BFP16 or CFP8, 64Gflop/s FP32) and 512Gbit/s in each cardinal direction.
Tesla decided to go with a two-dimensional network throughout this computer, so training nodes have been designed with a square footprint to allow them to be tiled in a square array across the die. Each side has a wide parallel data bus (north, south, east and west on the diagram above left) to transfer data to immediate neighbours without intervening logic.
The network switch in each training node has a latency of 1cycle/hop.
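As a rough illustration of what 1cycle/hop implies, the sketch below estimates best-case transit time between two training nodes on a nearest-neighbour 2D mesh. It assumes shortest-path routing with no contention and that the switch runs at the 2GHz clock mentioned above. Neither assumption was stated in the presentation, and the function names and grid coordinates are hypothetical.

```python
# Best-case hop latency on a 2D nearest-neighbour mesh.
# Assumptions (not from the presentation): shortest-path routing,
# no contention, and a switch clocked at the 2GHz figure quoted above.

CLOCK_HZ = 2e9        # 2GHz clock quoted in the article
CYCLES_PER_HOP = 1    # 1 cycle/hop quoted in the article


def mesh_hops(src, dst):
    """Minimum hops between two (row, col) nodes: the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mesh_latency_ns(src, dst):
    """Best-case transit time in nanoseconds under the assumptions above."""
    return mesh_hops(src, dst) * CYCLES_PER_HOP * 1e9 / CLOCK_HZ


if __name__ == "__main__":
    a, b = (0, 0), (3, 4)  # hypothetical node coordinates
    print(mesh_hops(a, b), "hops,", mesh_latency_ns(a, b), "ns")
```

Under these assumptions latency grows linearly with the distance between nodes, so it matters how close together the compiler places layers that communicate heavily.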
354 of these training nodes are arrayed across a single 645mm2 (25 x 25mm) 7nm die to form what the company is calling its ‘D1’ integrated circuit.
This is capable of delivering 362Tflop (BFP16 or CFP8 data, 22.6Tflop of FP32) of machine learning compute, said Venkataramanan, who added that it has 50 billion transistors, over 17km of wiring and dissipates 400W.
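The per-chip figures follow from the per-node figures quoted earlier. A quick check, assuming compute scales linearly with node count (the constant and function names are illustrative only):

```python
# Cross-check of the D1 compute figures from the per-node numbers.
# Assumption: chip throughput is simply nodes x per-node throughput.

NODES_PER_D1 = 354          # training nodes per D1 die (from the article)
NODE_TFLOPS_LOW_PREC = 1.024  # BFP16/CFP8 per node (from the article)
NODE_TFLOPS_FP32 = 0.064      # FP32 per node, i.e. 64Gflop/s (from the article)


def d1_tflops(per_node_tflops, nodes=NODES_PER_D1):
    """Aggregate chip throughput in Tflop/s under linear scaling."""
    return nodes * per_node_tflops


if __name__ == "__main__":
    print(f"BFP16/CFP8: {d1_tflops(NODE_TFLOPS_LOW_PREC):.3f} Tflop/s")
    print(f"FP32:       {d1_tflops(NODE_TFLOPS_FP32):.3f} Tflop/s")
```

The results, roughly 362.5 and 22.66Tflop/s, agree with the quoted 362Tflop and 22.6Tflop once truncated.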
In keeping with the square array theme, packaging is flip-chip BGA, and each of the four package sides has an enormous data bus to connect to other D1 chips arranged in a square array.
Per chip, there are 576 lanes of 112Gbit serial data, providing 4Tbit/s IO on each edge of the package (=16Tbit/s/chip) for connections to identical D1 chips to the north, south, east and west – again, without any glue logic.
“The D1 chip is entirely designed by the Tesla team internally,” said Venkataramanan. “100% of its area is going towards machine learning training and bandwidth – there is no dark silicon, there is no legacy support.”
Completing the Dojo will be ‘Dojo interface processors’ that will bridge the spare 112Gbit serial data channels around the edge of the array to PCI Gen4 lanes for onward connection to the host system and the compute plane’s shared DRAM memory.
To physically implement the 15 x 100 D1 array, Tesla has created a square hardware sub-assembly called a ‘training tile’ (right) that contains a 5 x 5 D1 IC array together with associated power supplies and cooling, as well as high-speed interfacing for adjacent identical training tiles.
Within the training tile, 25 D1 known-good die (= 9Pflop/s of BF16/CFP8 compute) are bonded to a carrier wafer that electrically connects them in a way that preserves maximum bandwidth between them.
Around the edge of the wafer are 40 IO connectors, 10 per edge, that will be able to deliver 9Tbit/s per edge to adjacent training tiles (36Tbit/s total IO).
Their power comes from a further dc-dc converter also integrated into the training tile’s mechanical stack (below right), which is fed from outside with 52Vdc. Total power per training tile is projected to be somewhere between 10 and 15kW.
The bulk of this heat is extracted from the D1 chip side of the carrier wafer.
Training tiles, said Venkataramanan, will be installed as 2 x 3 tiles in a tray and two trays per cabinet, yielding over 100Pflop/s/cabinet, and the entire computer will have 10 cabinets (= 120 training tiles = 3,000 D1 chips) and be able to process at 1.1Eflop (BF16/CFP8).
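The tile, cabinet and system totals can be reproduced from the quoted per-chip figure. The sketch below assumes linear scaling with no interconnect or utilisation losses, which the presentation did not claim explicitly. The names are illustrative.

```python
# Roll-up of Dojo compute from D1 chip to full system.
# Assumption: throughput scales linearly with chip count.

D1_TFLOPS = 362.0        # BF16/CFP8 per D1 chip (from the article)
D1_PER_TILE = 5 * 5      # 5 x 5 D1 array per training tile
TILES_PER_TRAY = 2 * 3   # 2 x 3 tiles per tray
TRAYS_PER_CABINET = 2
CABINETS = 10


def rollup():
    """Return counts and aggregate throughput at each packaging level."""
    tiles_per_cabinet = TILES_PER_TRAY * TRAYS_PER_CABINET
    total_tiles = tiles_per_cabinet * CABINETS
    tile_pflops = D1_PER_TILE * D1_TFLOPS / 1e3
    return {
        "tiles_per_cabinet": tiles_per_cabinet,
        "total_tiles": total_tiles,
        "total_chips": total_tiles * D1_PER_TILE,
        "tile_pflops": tile_pflops,
        "cabinet_pflops": tiles_per_cabinet * tile_pflops,
        "system_eflops": total_tiles * tile_pflops / 1e3,
    }


if __name__ == "__main__":
    for key, value in rollup().items():
        print(f"{key}: {value}")
```

This gives 12 tiles and about 108.6Pflop/s per cabinet, and 120 tiles, 3,000 chips and about 1.09Eflop for the system, in line with the quoted “over 100Pflop/s/cabinet” and 1.1Eflop.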
Data transfer bandwidth will be the same between architecturally adjacent training tiles whether they are physically adjacent in a tray or are in the next cabinet, according to Venkataramanan. “We broke the cabinet walls,” he said. “We integrated these tiles seamlessly.”
In use, compiler software will allow the whole processing plane to be partitioned into ‘Dojo processing units’ including one or more D1 chips and interface processing to one or more hosts.
“The compiler suite takes care of fine-grain parallelism – mapping the neural networks onto our compute plane – using multiple techniques to extract parallelism,” said Venkataramanan. Usually “model parallelism could not have been extended beyond chip boundaries. Because of our high bandwidth, we can extend it to training tiles and beyond: large networks can be mapped at low batch sizes.”
All of this information, including the images, was presented at Tesla’s AI Day.
Caveat: transcription errors may have been introduced, including bits/bytes confusion from fluid use of b and B in the written presentation. Electronics Weekly has requested clarification.