Sunday, August 31, 2025

Amazon's Project Rainier -- Apple Reportedly Using New, High-End Amazon Trainium2 Chips -- August 31, 2025

Locator: 49003AMAZON. 

Amazon chips: wiki. Must-read. Annapurna Labs -- an Israeli company now fully owned by Amazon.

In November 2024, Annapurna announced its second-generation Trainium2 chip, intended for training AI models. Based on internal testing, Amazon claims "a 4-times performance increase between Trainium 1 and Trainium 2".
AWS Graviton - an ARM-based CPU developed by Annapurna Labs for exclusive use by Amazon Web Services.
Note: some sources call them "Trainium 2" chips and other sources call them "Trainium2" chips. In AI, "spaces" are "tokens."

From CHIPS:

AWS (Amazon): new chips -- Ocelot (quantum computing); competing with Nvidia -- Trainium, Inferentia. World's largest supercomputer to be completed in 2025 (?) -- Project Rainier -- partnered with Anthropic, AI --> Claude.

What is Project Rainier? Link here


From the linked press release:
Spread across multiple data centers in the U.S., the sheer size of the project is unlike anything AWS has ever attempted.

AWS customer AI safety and research company Anthropic will use this brand-new “AI compute cluster” to build and deploy future versions of its leading AI model, Claude.

“Rainier will provide five times more computing power compared to Anthropic’s current largest training cluster,” said Gadi Hutt, director of product and customer engineering at Annapurna Labs, the specialist chips arm of AWS responsible for designing and building the hardware that will power the project.

To deliver on that mission, Project Rainier is designed as a massive “EC2 UltraCluster of Trainium2 UltraServers.” The first part refers to Amazon Elastic Compute Cloud (EC2), an AWS service that lets customers rent virtual computers in the cloud rather than buying and maintaining their own physical servers.

The more interesting bit is Trainium2, a custom-designed AWS computer chip built specifically for training AI systems. Unlike the general-purpose chips in your laptop or phone, Trainium2 is specialized for processing the enormous amounts of data required to teach AI models how to complete all manner of different and increasingly complex tasks—fast.

To put the power of Trainium2 in context: a single chip is capable of completing trillions of calculations a second. If, understandably, that’s a little hard to visualize: consider that it would take one person more than 31,700 years to count to one trillion. A task that would require millennia for a human to complete can be done in the blink of an eye with Trainium2.
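(A quick sanity check on that figure -- my note, not part of the press release -- assuming one number counted per second, nonstop, with 365-day years:)

    # Counting one number per second, how long does it take to reach one trillion?
    seconds_per_year = 60 * 60 * 24 * 365
    print(1_000_000_000_000 / seconds_per_year)   # ~31,710 years, consistent with "more than 31,700 years"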

Impressive, yes. But Project Rainier doesn’t just use one, or even a few, chips. This is where the UltraServers and UltraClusters come in.

Traditionally, servers in a data center operate independently. If and when they need to share information, that data has to travel through external network switches. This introduces latency (i.e., delay), which is not ideal at such large scale.

AWS’s answer to this problem is the UltraServer. A new type of compute solution, an UltraServer combines four physical Trainium2 servers, each with 16 Trainium2 chips. They communicate via specialized high-speed connections called “NeuronLinks.” Identifiable by their distinctive blue cables, NeuronLinks are like dedicated express lanes, allowing data to move much faster within the system and significantly accelerating complex calculations across all 64 chips.

When you connect tens of thousands of these UltraServers and point them all at the same problem, you get Project Rainier—a mega “UltraCluster.”
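To keep the nesting straight, here is a minimal sketch of the hierarchy described above. The only hard numbers given are 16 chips per server and 4 servers per UltraServer; the UltraServer count below is purely a placeholder for "tens of thousands."

    # chip -> Trainium2 server -> UltraServer -> UltraCluster (Project Rainier)
    CHIPS_PER_SERVER = 16              # each physical Trainium2 server
    SERVERS_PER_ULTRASERVER = 4        # joined by NeuronLink (the blue cables)
    chips_per_ultraserver = CHIPS_PER_SERVER * SERVERS_PER_ULTRASERVER
    print(chips_per_ultraserver)       # 64 chips, matching the press release

    ULTRASERVERS = 10_000              # placeholder for "tens of thousands"
    print(ULTRASERVERS * chips_per_ultraserver)   # 640,000 chips at that assumed count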

ChatGPT confirms the above.

Trainium2-powered EC2 instances (Trn2) are generally available, offering up to 4× the performance of first-generation Trainium. A single instance packs 16 Trainium2 chips, delivering approximately 20.8 petaflops of FP8 compute, with 1.5 TB of HBM3 memory and 46 TB/s of memory bandwidth. Price performance is 30–40% better than comparable GPU-based P5e/P5en instances.
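"Generally available" here means a Trn2 instance is requested through the ordinary EC2 APIs like any other instance type. A minimal sketch with boto3, assuming the trn2.48xlarge instance type name and a placeholder AMI ID; check the AWS documentation for current AMIs, regions, and quotas:

    import boto3

    # Request one Trainium2-powered (Trn2) instance via the standard EC2 API.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder; use a Neuron-compatible AMI in your region
        InstanceType="trn2.48xlarge",      # 16 Trainium2 chips per instance, per the specs above
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])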

The Trn2 UltraServers—which connect 64 Trainium2 chips across four Trn2 instances via the proprietary NeuronLink interconnect—are available in preview. Each UltraServer delivers up to 83.2 petaflops, with 6 TB HBM3 and 185 TB/s bandwidth, along with 12.8 Tb/s EFAv3 networking for large-scale AI workloads.
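The UltraServer numbers are essentially the per-instance numbers scaled by four, which makes for an easy cross-check of the figures quoted above:

    # Per Trn2 instance (16 chips), from the figures above
    pflops, hbm_tb, bw_tbs = 20.8, 1.5, 46
    print(4 * pflops)    # 83.2 petaflops per UltraServer, as quoted
    print(4 * hbm_tb)    # 6.0 TB of HBM3, as quoted
    print(4 * bw_tbs)    # 184 TB/s; the quoted 185 TB/s presumably reflects rounding of the 46 TB/s figure
    print(pflops / 16)   # ~1.3 petaflops per Trainium2 chip, implied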

Anthropic is a major early adopter: AWS and Anthropic are building an interconnected supercluster (the "UltraCluster") with hundreds of thousands of Trainium2 chips to train next-generation models, estimated at roughly 5× the compute of Anthropic's previous largest training cluster.

AWS has labeled this deployment effort Project Rainier—a high-performance AI compute infrastructure underway in AWS data centers.

Apple is reportedly also testing and evaluating Trainium2 chips for potential AI workloads, signaling broader industry interest beyond AWS-internal users.

Where are AWS data centers located? 

  • United States
    • US East (N. Virginia) – Ashburn, Loudoun County, Virginia (largest AWS hub globally)
    • US East (Ohio) – Columbus, Ohio
    • US West (N. California) – Bay Area
    • US West (Oregon) – Boardman, Oregon, on the Columbia River, 165 miles east of Portland
    • AWS GovCloud (US-East/West) – Isolated regions for U.S. government, locations not fully disclosed but tied to Virginia and Oregon
  • Canada (2)
  • South America (1)
  • Europe (8)
  • Middle East (3)
  • Africa (1)
  • Asia-Pacific (10)
  • China (2; operated separately from the rest)