Thursday, July 13, 2023

NVDA And AMD Today -- July 13, 2023

Locator: 10001SUPCOMP.

Key words:

  • blades: black boxes, made by AMD, Nvidia
  • systems: designed by IBM, Cray
  • dragonfly topology: link here; simply the "way" the black boxes are wired together
  • nodes: the compute units inside the blades (black boxes); on Frontier, each blade holds 2 nodes
  • switches: self-explanatory

Locator: 45121CHIPS.  



Why are these important today? Answer: later. Stay tuned.

Chips, semiconductor: link here.

From wiki:


 We still haven't gotten to the answer to the question.

***************************
Supercomputers

Locations:

  • China?
  • Japan
  • US:
    • NSA?
    • Tennessee: Oak Ridge, nuclear
    • California: Lawrence Livermore National Laboratory, nuclear

Frontier

Hewlett Packard Enterprise Frontier, or OLCF-5, is the world's first and fastest exascale supercomputer, hosted at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee, United States, and first operational in 2022.
It is based on the Cray EX and is the successor to Summit (OLCF-4).
As of March 2023, Frontier is the world's fastest supercomputer.
Frontier achieved an Rmax of 1.102 exaFLOPS, which is 1.102 quintillion operations per second, using AMD CPUs and GPUs.
Measured at 62.68 gigaflops/watt, Frontier topped the Green500 list for most efficient supercomputer, until it was dethroned (in efficiency) by the Flatiron Institute's Henri supercomputer in November 2022.
Design: Frontier uses 9,472 AMD Epyc 7A53 "Trento" 64-core 2 GHz CPUs (606,208 cores) and 37,888 Radeon Instinct MI250X GPUs (8,335,360 cores). The GPUs can perform double-precision operations at the same speed as single-precision ones.
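The core counts in the design paragraph can be checked with a few lines of arithmetic. This is only a sanity check on the quoted figures; the variable names are mine, and the "cores per GPU" number is simply what falls out of dividing the two totals.

```python
# Figures quoted in the design paragraph above.
CPUS = 9_472
CORES_PER_CPU = 64
GPUS = 37_888
GPU_CORES_TOTAL = 8_335_360

cpu_cores = CPUS * CORES_PER_CPU          # total CPU cores
gpus_per_cpu = GPUS // CPUS               # GPUs paired with each CPU
cores_per_gpu = GPU_CORES_TOTAL // GPUS   # implied "cores" per GPU

# Both divisions come out exact, so the quoted totals are self-consistent.
assert GPUS % CPUS == 0
assert GPU_CORES_TOTAL % GPUS == 0

print(f"CPU cores:     {cpu_cores:,}")   # 606,208
print(f"GPUs per CPU:  {gpus_per_cpu}")  # 4
print(f"Cores per GPU: {cores_per_gpu}") # 220
```

The 606,208 figure matches the text exactly. The implied 220 per GPU lines up with the compute-unit count AMD publishes for the MI250X, so the "cores" in the GPU total appear to be compute units rather than individual stream processors.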

"Trento" is an optimized 3rd Gen EPYC CPU ("Milan"), which itself is based on the Zen 3 microarchitecture.

Frontier occupies 74 19-inch rack cabinets. Each cabinet hosts 64 blades, each consisting of 2 nodes.

Blades are interconnected by HPE Slingshot 64-port switches that each provide 12.8 terabits/second of bandwidth. Groups of blades are linked in a dragonfly topology with at most three hops between any two nodes. Cabling is either optical or copper, customized to minimize cable length. Total cabling runs 90 miles.
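To see why a dragonfly topology gets you "at most three hops," here is a toy model. It is a sketch under my own assumptions, not Frontier's actual wiring: every group of switches is fully connected internally, each switch has exactly one link to another group, and every pair of groups shares one such link. The function names and the group size are made up for illustration. The last line also divides the quoted switch bandwidth by the quoted port count.

```python
from collections import deque
from itertools import combinations

def build_dragonfly(a):
    """Toy dragonfly: a switches per group, a + 1 groups,
    one global (group-to-group) link per switch."""
    g = a + 1
    adj = {(grp, sw): set() for grp in range(g) for sw in range(a)}
    # Local links: every switch talks to every other switch in its group.
    for grp in range(g):
        for s1, s2 in combinations(range(a), 2):
            adj[(grp, s1)].add((grp, s2))
            adj[(grp, s2)].add((grp, s1))
    # Global links: exactly one link between every pair of groups.
    for i, j in combinations(range(g), 2):
        u = (i, (j - i - 1) % g)
        v = (j, (i - j - 1) % g)
        adj[u].add(v)
        adj[v].add(u)
    return adj

def diameter(adj):
    """Longest shortest path (in hops) between any two switches, via BFS."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

adj = build_dragonfly(a=4)   # 5 groups of 4 switches (hypothetical sizes)
print(len(adj), "switches, worst case", diameter(adj), "hops")

# Simple division of the quoted figures: 12.8 Tb/s across 64 ports.
PER_PORT_GBPS = 12.8e3 / 64
print(PER_PORT_GBPS, "Gb/s per port")
```

The worst case is one local hop to the switch that owns the right global link, one global hop to the destination group, and one more local hop inside that group. That is the whole trick: the hop count stays at three no matter how many groups you add, as long as every pair of groups keeps a direct link.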
Frontier is liquid-cooled, allowing 5x the density of air-cooled architectures.

Each node consists of one CPU, 4 GPUs and 5 terabytes of flash memory. Each GPU has 128 GB of RAM soldered onto it.
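The cabinet, blade, and node figures can be cross-checked against the CPU and GPU totals quoted earlier. A quick sketch, using only numbers from the text (the variable names are mine):

```python
# Rack layout quoted above.
CABINETS = 74
BLADES_PER_CABINET = 64
NODES_PER_BLADE = 2

# Per-node contents quoted above.
CPUS_PER_NODE = 1
GPUS_PER_NODE = 4
GPU_RAM_GB = 128

nodes = CABINETS * BLADES_PER_CABINET * NODES_PER_BLADE
cpus = nodes * CPUS_PER_NODE
gpus = nodes * GPUS_PER_NODE
gpu_mem_gb = gpus * GPU_RAM_GB

print(f"Nodes: {nodes:,}")   # 9,472
print(f"CPUs:  {cpus:,}")    # 9,472
print(f"GPUs:  {gpus:,}")    # 37,888
print(f"Total GPU memory: {gpu_mem_gb:,} GB (about {gpu_mem_gb / 1e6:.2f} PB)")
```

The node count works out to exactly the 9,472 CPUs and 37,888 GPUs given in the design paragraph, so the rack layout and the chip counts tell the same story.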

Frontier has coherent interconnects between CPUs and GPUs, allowing GPU memory to be accessed coherently by code running on the Epyc CPUs.

Frontier uses an internal 75 TB/s read / 35 TB/s write / 15 billion IOPS flash storage system, along with the 700 PB Orion site-wide Lustre filesystem.
The original design envisioned hundreds of thousands of GPUs and 150–500 MW of power.
Oak Ridge partnered with HPE Cray and AMD to build the system.

The machine was built at a cost of $600 million. It began deployment in 2021 and reached full capability in 2022.
It clocked 1.1 exaflops Rmax in May 2022, making it the world's fastest supercomputer as measured in the June 2022 edition of the TOP500 list, replacing Fugaku.

El Capitan 

Hewlett Packard Enterprise El Capitan is an upcoming exascale supercomputer, hosted at the Lawrence Livermore National Laboratory in Livermore, California, and projected to become operational in 2023. It is based on the Cray EX Shasta architecture. When deployed, El Capitan is projected to displace Frontier as the world's fastest supercomputer.

El Capitan is slated to use an as-yet-undisclosed number of AMD Instinct MI300 accelerated processing units (APUs). The MI300 combines 24 AMD Zen 4 AMD64-based CPU cores and a CDNA 3-based GPU in a single organic package, along with 128 GB of HBM3 RAM.

The floor space and number of racks for El Capitan have not yet been announced.

Blades are interconnected by HPE Slingshot 64-port switches that each provide 12.8 terabits/second of bandwidth. Groups of blades are linked in a dragonfly topology with at most three hops between any two nodes. Cabling is either optical or copper, customized to minimize cable length. Total cabling runs 90 miles.

El Capitan has coherent interconnects between CPUs and GPUs, allowing GPU memory to be accessed coherently by code running on the Epyc CPUs.
El Capitan was ordered as part of the Department of Energy's CORAL-2 initiative and is intended to replace Sierra, an IBM/NVIDIA machine deployed in 2018.
The original design envisioned hundreds of thousands of GPUs and 40 MW of power. LLNL partnered with HPE Cray and AMD to build the system.

So, when Japan, France, and the UK need to upgrade / update their supercomputers, with whom do you think they will partner?

*******************************
The Wagner Group

Less than a month ago, there were numerous stories suggesting the US intelligence apparatus completely missed the Wagner Group mutiny and subsequent events.  

The story quickly fell off the front pages as the US intelligence agencies closed ranks.

But when something this serious happens in Washington, DC, how do "they" solve those problems?

They throw money at the problem(s).

My hunch: the NSA, the CIA, and the other intelligence agencies went looking for faster and more capable computers.

Just saying.

This slide continues to haunt me.
