AMD's compute APU Instinct MI300A achieves up to 4× higher performance than conventional accelerators


Some workloads are limited by the raw computing power of current hardware. Others, however, are not limited by the accelerator's computing power at all, but by data transfers. When the processor and the accelerator are separate devices, each with its own memory, moving data between processor memory and accelerator memory can take more time than the computation itself.
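The trade-off can be sketched with a back-of-the-envelope timing model. The bandwidth and throughput figures below are illustrative assumptions for a generic discrete CPU-plus-GPU setup, not MI300A specifications:

```python
def task_time_separate(bytes_moved, flops, pcie_gbps, tflops):
    """Discrete CPU + GPU: data must cross PCIe before and after the kernel."""
    transfer = 2 * bytes_moved / (pcie_gbps * 1e9)  # copy in + copy result back
    compute = flops / (tflops * 1e12)
    return transfer + compute

# Illustrative assumptions: 8 GB of data, a 64 GB/s PCIe link,
# a 50 TFLOPS accelerator, and a low-arithmetic-intensity kernel
# performing only 10 FLOPs per byte moved.
data_bytes = 8e9
total = task_time_separate(data_bytes, 10 * data_bytes, 64, 50)
copy_share = (2 * data_bytes / 64e9) / total
print(f"total {total:.3f} s, {copy_share:.0%} spent on copies")
```

With these numbers the kernel itself finishes in under 2 ms while the copies take 250 ms: nearly all of the wall time is data movement, which is exactly the regime the article describes.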

AMD's Instinct MI300A is the first high-performance product to break with the classic design of a CPU with its own memory and a GPU with its own memory, interconnected by a relatively slow PCIe link. On the MI300A the memory is unified and shared: thanks to a single address space, the CPU and GPU parts have equal access to it. So if the GPU is to work on some data, nothing has to be moved from one memory to the other (and the result possibly moved back); everything happens in one place.

For tasks that are limited precisely by these data transfers, the performance gain of the MI300A is huge: up to four times that of a classic CPU-plus-accelerator solution.
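The "up to 4×" figure follows directly from how large a share of the runtime the copies consume. A minimal sketch (the 3 s / 1 s split is a hypothetical example, not a measured workload):

```python
def unified_memory_speedup(transfer_s, compute_s):
    """Ideal speedup if host<->device copies disappear entirely."""
    return (transfer_s + compute_s) / compute_s

# A transfer-bound job spending 3 s on copies for every 1 s of compute:
# eliminating the copies yields exactly the quoted 4x.
speedup = unified_memory_speedup(3.0, 1.0)
print(speedup)  # -> 4.0
```

In other words, a 4× gain corresponds to a workload that spends 75% of its time on data movement; jobs that copy even more than that would, in this idealized model, gain even more.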

The next graph shows what share of a task's processing time individual hardware solutions spend on the computation itself (dark) and on data transfers (light). This ratio also explains why raising computing power has only a minimal effect on the overall performance of classic accelerators in this type of task.


The Instinct MI300A grew out of the original Exascale Heterogeneous Processor (EHP), aka Exascale APU, project that was already being discussed in 2017. In retrospect, it is interesting how AMD had to adapt to changes in technology along the way. For example, the original plan called for two quad-core processor chiplets, i.e. a total of 8 cores per APU. In the end the APU carries 24 of them (three chiplets of eight cores each).

Exascale Heterogeneous Processor (AMD, 2017)

On the other hand, HBM memory has developed more slowly than originally expected, a consequence of memory manufacturers deciding to position it as a high-end product that pays off only in the most powerful accelerators (rather than the broadly applicable product originally intended). Instead of the originally considered HBM4, which was supposed to be stacked on top of low-clocked graphics chiplets (so that the HBM would not overheat), HBM3 had to be used, placed conventionally beside the compute dies. That removed the need to keep the graphics chiplets at low clocks (~1 GHz), and AMD could afford clocks slightly above 2 GHz.

| | Instinct MI100 | Instinct MI210 | Instinct MI250X | Instinct MI300A | Instinct MI300X |
|---|---|---|---|---|---|
| codename | Arcturus | Aldebaran | Aldebaran | Rigel | Rigel |
| architecture | CDNA | CDNA 2 | CDNA 2 | CDNA 3 | CDNA 3 |
| CPU | — | — | — | 24× Zen 4 | — |
| form factor | PCIe | PCIe | OAM | socket SH5 | OAM |
| CUs | 120 | 104 (128) | 220 (256) | 228 | 304 |
| FP32 cores | 7680 | 6656 (8192) | 14080 (16384) | 14592 | 19456 |
| tensor cores | 440? | 416 | 880 | ? | ? |
| clock (max.) | 1502 MHz | 1700 MHz | 1700 MHz | 2100 MHz | 2100 MHz |
| FP16 | 184.6 | 181 | 383 | 980.6 | 1300 |
| BF16 | 92.3 | 181 | 383 | 980.6 | 1300 |
| FP32 | 23.5 | 45.3 / 22.6 | 95.7 / 47.9 | 122.6 | 163.4 |
| FP64 | 11.5 | 22.6 | 47.9 | 61.3 | 81.7 |
| INT4 | 184.6 | 181 | 383 | ? | ? |
| INT8 | 184.6 | 181 | 383 | 1960 | 2600 |
| INT16 | ? | ? | ? | ? | ? |
| INT32 | ? | ? | ? | ? | ? |
| FP8 tensor | — | — | — | 3922.4* / 1961.2 | 5229.8* / 2614.9 |
| FP16 tensor | 184.6 | 181 | 383 | 1961.2* / 980.6 | 2614.9* / 1307.5 |
| BF16 tensor | 92.3 | 181 | 383 | 1961.2* / 980.6 | 2614.9* / 1307.5 |
| FP32 tensor | 46.1 | 45.3 | 95.7 | 122.6 | 163.4 |
| TF32 tensor | — | — | — | 980.6* / 490.3 | 1307.4* / 653.7 |
| FP64 tensor | — | 45.3 | 95.7 | 122.6 | 163.4 |
| INT4 tensor | — | — | — | — | — |
| INT8 tensor | 184.6 | 181 | 383 | 3922.4* / 1961.2 | 5229.8* / 2614.9 |
| TMUs | 480 | ? | — | — | — |
| cache | ? | ? | 16 MB | 256 MB Infinity Cache | 256 MB Infinity Cache |
| memory bus | 4096-bit | 4096-bit | 8192-bit | 8192-bit | 8192-bit |
| memory capacity | 32 GB | 64 GB | 128 GB | 128 GB | 192 GB |
| memory type | HBM2, 2.4 GHz | HBM2, 3.2 GHz | HBM2, 3.2 GHz | HBM3, >5 GHz | HBM3, >5 GHz |
| memory bandwidth | 1229 GB/s | 1639 GB/s | 3277 GB/s | 5.3 TB/s | 5.3 TB/s |
| TDP | 300 W | 300 W | 500 W / 560 W | 550–760 W | 750 W |
| transistors | 25.6 billion | 29.1 billion | 58.2 billion | 146 billion | 153 billion |
| GPU area | 750 mm² | 362 mm² | 724 mm² | 660 mm² | ? |
| process | 7 nm | 6 nm | 6 nm | 5 nm + 6 nm | 5 nm + 6 nm |
| launch | 2020 | 2022 | 2021 | 2023 | 2023 |

Throughput rows (FP/INT formats) are in T(FL)OPS. Values in parentheses denote the full physical configuration; asterisked values apply with sparsity, the lower figure without it.


Despite this, the originally targeted energy efficiency was exceeded: instead of the targeted 50 GFLOPS per watt, the Instinct MI300A achieves 80–111 GFLOPS per watt (both figures referring to general-purpose double-precision compute). What has not changed significantly is the number of stream processors: originally planned at 16,384, it finally stands at 14,592.
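Assuming "GFLOPS per watt" here means the table's vector FP64 throughput divided by the ends of the configurable TDP window, the quoted 80–111 range can be reproduced directly:

```python
fp64_tflops = 61.3         # MI300A vector FP64 throughput (from the table)
tdp_window_w = (760, 550)  # configurable TDP range in watts

for watts in tdp_window_w:
    gflops_per_w = fp64_tflops * 1e3 / watts
    print(f"{gflops_per_w:.1f} GFLOPS/W at {watts} W")
```

This yields roughly 80.7 GFLOPS/W at the 760 W limit and 111.5 GFLOPS/W at 550 W, matching the article's 80–111 range.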

However, what was not discussed at all in 2017, and what the MI300A ultimately handles very well, is AI acceleration. For AI calculations in double precision, the efficiency relative to the original plan is even twice as high as the values mentioned in the previous paragraph.
