Tobias Mann
2026-01-18 12:21:00
Double precision floating point computation (aka FP64) is what keeps modern aircraft in the sky, rockets going up, vaccines effective, and, yes, nuclear weapons operational. But rather than building dedicated chips that process this essential data type in hardware, Nvidia is leaning on emulation to increase performance for HPC and scientific computing applications, an area where AMD has had the lead in recent generations.
This emulation, we should note, hasn’t replaced hardware FP64 in Nvidia’s GPUs. Nvidia’s newly unveiled Rubin GPUs still deliver about 33 teraFLOPS of peak FP64 performance, but that’s actually one teraFLOP less than the now four-year-old H100.
If you switch on software emulation in Nvidia’s CUDA libraries, the chip can purportedly achieve up to 200 teraFLOPS of FP64 matrix performance. That’s roughly 4.4x what its outgoing Blackwell accelerators could muster in hardware.
On paper, Rubin isn’t just Nvidia’s most powerful AI accelerator ever; it’s also the most potent GPU for scientific computing in years.
“What we found is, through many studies with partners and with our own internal investigations, is that the accuracy that we get from emulation is at least as good as what we would get out of a tensor core piece of hardware,” Dan Ernst, senior director of supercomputing products at Nvidia, told El Reg.
Emulated FP64, which is not exclusive to Nvidia, has the potential to dramatically improve the throughput and efficiency of modern GPUs. But not everyone is convinced.
“It’s quite good in some of the benchmarks, it’s not obvious it’s good in real, physical scientific simulations,” Nicholas Malaya, an AMD fellow, told us. He argued that, while FP64 emulation certainly warrants further research and experimentation, it’s not quite ready for prime time.
Why FP64 still matters in the age of AI
Even as chip designs push for ever lower-precision data types, FP64 remains the gold standard for scientific computing for good reason. FP64 is unmatched in its dynamic range, capable of expressing more than 18.44 quintillion (2^64) unique values.
To put that in perspective, modern AI models like DeepSeek R1 are commonly trained at FP8, which can express a paltry 256 unique values. Taking advantage of the general homogeneity of neural networks, block-floating-point data types like MXFP8 or MXFP4 can be used to expand their dynamic range.
That’s fine for the fuzzy math that defines large language models, but it’s no replacement for FP64, particularly when it’s the difference between life and death.
Unlike AI workloads, which are highly error-tolerant, HPC simulations rely on fundamental physical principles like conservation of mass and energy. “As soon as you start incurring errors, these finite errors propagate, and they cause things like blow ups,” Malaya said.
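A toy example makes the point. The Python sketch below is our own illustration, nothing to do with either vendor’s libraries: it accumulates the same increment a million times at single and double precision, and the lower-precision running total drifts visibly off course, a small-scale version of the compounding error Malaya is describing.

```python
import numpy as np

# Toy illustration of compounding rounding error: accumulate the same
# increment a million times at two precisions. The float32 total drifts
# because each add is rounded to an ever-coarser grid as the sum grows;
# the float64 total stays effectively exact at this scale. Real
# simulations do vastly more arithmetic, so the drift can "blow up".
n = 1_000_000
exact = n * 0.1

for dtype in (np.float32, np.float64):
    total = dtype(0.0)
    step = dtype(0.1)
    for _ in range(n):
        total += step
    print(f"{dtype.__name__}: sum={float(total):,.3f}  drift={abs(float(total) - exact):,.3f}")
```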
Emulated FP64 and the Ozaki scheme
Using lower-precision, often integer, datatypes to emulate FP64 isn’t a new idea. “Emulation is old as dirt,” Ernst said. “We had emulation in the mid ’50s before we had hardware for floating point.”
This process required significantly more operations to complete, and often incurred a stiff performance penalty as a result, but enabled floating point mathematics even when hardware lacked a dedicated floating point unit (FPU).
By the 1980s, FPUs were becoming commonplace and the need for emulation largely disappeared. However, in early 2024, researchers at the Tokyo and Shibaura institutes of technology published a paper reviving the concept by showing that FP64 matrix operations could be decomposed into multiple INT8 operations that, when run on Nvidia’s tensor cores, achieved higher-than-native performance.
This approach is commonly referred to as the Ozaki scheme, and it’s the foundation for Nvidia’s own FP64 emulation libraries, which were released late last year. And, as Ernst was quick to point out, “it’s still FP64. It’s not mixed precision. It’s just done and constructed in a different way from the hardware perspective.”
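To make the trick concrete, here’s a deliberately simplified sketch of the idea in Python, our own reconstruction rather than anything from Nvidia’s libraries: each FP64 element is split against a per-row (for A) or per-column (for B) power-of-two scale into a handful of small-integer slices, every slice pair is multiplied with an exact integer matrix multiply (the job the INT8 tensor cores, with INT32 accumulation, do in the real scheme), and the partial products are summed back together with the appropriate scaling.

```python
import numpy as np

def ozaki_style_matmul(A, B, num_slices=8, bits=7):
    """Toy sketch of emulating an FP64 GEMM with small-integer slices.

    Each element is decomposed against a per-row (A) / per-column (B)
    power-of-two scale into `num_slices` integer slices of `bits` bits.
    Every slice-pair product is an exact integer matmul (the role the
    INT8 tensor cores play in the real scheme), and the partials are
    recombined in FP64. Illustrative only, not Nvidia's implementation.
    """
    # Scales picked from the largest magnitude so slices stay small.
    row_scale = 2.0 ** np.ceil(np.log2(np.abs(A).max(axis=1, keepdims=True)))
    col_scale = 2.0 ** np.ceil(np.log2(np.abs(B).max(axis=0, keepdims=True)))

    def to_slices(M, scale):
        rem, out = M / scale, []
        for _ in range(num_slices):
            s = np.rint(rem * 2.0**bits)   # next small-integer slice
            out.append(s)
            rem = rem * 2.0**bits - s      # remainder carried forward
        return out

    A_sl, B_sl = to_slices(A, row_scale), to_slices(B, col_scale)

    # Sum every slice-pair product, each weighted by its power of two.
    C = np.zeros((A.shape[0], B.shape[1]))
    for i, Ai in enumerate(A_sl):
        for j, Bj in enumerate(B_sl):
            C += (Ai @ Bj) * 2.0 ** (-(i + j + 2) * bits)
    return C * row_scale * col_scale


rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
err = np.max(np.abs(ozaki_style_matmul(A, B) - A @ B))
print(f"max abs deviation from native FP64 GEMM: {err:.2e}")
```

The slice products here run in ordinary FP64 arithmetic on integer-valued arrays purely to keep the sketch short; the speedup in the real scheme comes from routing them through the INT8 tensor cores instead.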
Modern GPUs are packed with low-precision tensor cores. Even without the fancy adaptive compression found in Rubin’s tensor cores, the chips are capable of 35 petaFLOPS of dense FP4 compute. By comparison, at FP64, the chips are more than 1,000x slower.
These low-precision tensor cores are really efficient to build and run, so the question became why not use them to do FP64, Ernst explained. “We have the hardware, let’s try to use it. That’s the history of supercomputing.”
But is it actually accurate?
While Nvidia is keen to highlight the capabilities FP64 emulation enables on its Rubin and even its older Blackwell GPUs, rival AMD doesn’t believe the approach is quite ready.
According to Malaya, FP64 emulation works best for well-conditioned numerical systems, with the High Performance Linpack (HPL) benchmark being a prime example. “But when you look at material science, combustion codes, banded linear algebra systems, things like that, they are much less well conditioned systems, and suddenly it starts to break down,” he said.
In other words, whether FP64 emulation makes sense depends on the application in question. For some it’s fine; for others it’s not.
One of the major sticking points for AMD is that FP64 emulation isn’t exactly IEEE compliant. Nvidia’s algorithms don’t account for things like positive versus negative zeros, not-a-number (NaN) values, or infinities.
Because of this, small errors in the intermediate operations used to emulate the higher precision can result in perturbations that throw off the final result, Malaya explained.
One way around this is to increase the number of operations used. However, at a certain point, the sheer number of operations required outweighs any advantage emulation might have provided.
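As a rough illustration of that tradeoff, the scalar sketch below (a hypothetical stand-in for the matrix case, reusing the same slice idea) reconstructs an FP64 value from a growing number of integer slices: the error shrinks with every extra slice, but in a matrix multiply each extra slice also multiplies the number of integer GEMMs that have to be run.

```python
import numpy as np

def reconstruct(x, num_slices, bits=7):
    """Rebuild x from `num_slices` small-integer slices of `bits` bits.

    A scalar stand-in for the matrix decomposition: more slices means a
    closer match to the original FP64 value, but also more integer
    operations when the same idea is applied to a matrix multiply.
    """
    scale = 2.0 ** np.ceil(np.log2(abs(x)))
    rem, approx, weight = x / scale, 0.0, 1.0
    for _ in range(num_slices):
        s = round(rem * 2.0**bits)
        weight /= 2.0**bits
        approx += s * weight
        rem = rem * 2.0**bits - s
    return approx * scale

x = float(np.pi) * 1e6   # an arbitrary FP64 value
for n in (2, 4, 6, 8):
    print(f"{n} slices -> absolute error {abs(reconstruct(x, n) - x):.3e}")
```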
All of those operations also take up memory. “We have data that shows you’re using about twice the memory capacity in Ozaki to emulate the FP64 matrices,” Malaya said.
For these reasons, the House of Zen is focusing its attention on specialized hardware for applications that rely on double and single precision. Its upcoming MI430X takes advantage of AMD’s chiplet architecture to bolster double and single precision hardware performance.
Filling the gaps
The challenges facing FP64 emulation algorithms like the Ozaki scheme aren’t lost on Ernst, who is well aware of the gaps in Nvidia’s implementation.
Ernst contended that, for most HPC practitioners, things like positive versus negative zeros aren’t that big a deal. Meanwhile, Nvidia has developed supplemental algorithms to detect and mitigate issues like NaNs, infinities, and inefficient emulation operations.
As for memory consumption, Ernst conceded that it can be a bit higher, but emphasized that this overhead is relative to the operation, not the application itself. Most of the time, he said, we’re talking about matrices that are at most a few gigabytes in size.
So while it’s true that FP64 emulation isn’t IEEE-compliant, whether this actually matters is heavily dependent on the application in question, Ernst argued. “Most of the use cases where IEEE compliance ordering rules are in play don’t come up in matrix-matrix multiplication cases. There’s not a DGEMM that tends to actually follow that rule anyway,” he said.
Great for matrices, not so much for vectors
Even if Nvidia can overcome the potential pitfalls of FP64 emulation, it doesn’t change the fact that the method is only useful for a subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations.
According to Malaya, for somewhere between 60 and 70 percent of HPC workloads, emulation offers little to no benefit.
“In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM,” he said. “I wouldn’t say it’s a tiny fraction of the market, but it’s actually a niche piece.”
For vector-heavy workloads, like computational fluid dynamics, Nvidia’s Rubin GPUs are forced to run on the slower FP64 vector accelerators in the chip’s CUDA cores.
However, as Ernst was quick to point out: more FLOPS doesn’t always mean useful FLOPS. The same workloads that tend to run on the FP64 vector engines rarely manage to harness more than a fraction of the chip’s theoretical performance, all because the memory can’t keep up.
We see this quite clearly in the TOP500’s vector-heavy High Performance Conjugate Gradient (HPCG) benchmark, where CPUs tend to dominate thanks to the higher ratio of bytes per FLOP afforded by their memory subsystems.
Rubin may not deliver the fastest FP64 vector performance on paper, but with 22 TB/s of HBM4 bandwidth, its real-world performance in these workloads is likely to be much higher than the spec sheet would suggest.
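A back-of-the-envelope roofline estimate shows why. The numbers below are our own illustrative assumptions (roughly 2 FLOPs and about 12 bytes of memory traffic per matrix nonzero for a double precision sparse matrix-vector product, the kind of kernel HPCG leans on), combined with the bandwidth and peak figures quoted above; the precise values vary by code, but the shape of the result doesn’t.

```python
# Back-of-the-envelope roofline for a memory-bound FP64 kernel.
# Assumptions (illustrative, not vendor specs beyond the quoted
# bandwidth and peak): a CSR-style sparse matrix-vector product does
# ~2 FLOPs per nonzero while moving roughly 12 bytes per nonzero
# (8-byte value + 4-byte column index), ignoring vector reuse.
hbm4_bandwidth = 22e12       # bytes/s, Rubin's quoted HBM4 bandwidth
peak_fp64 = 33e12            # FLOP/s, Rubin's quoted peak FP64

arithmetic_intensity = 2.0 / 12.0                    # FLOP per byte
ceiling = hbm4_bandwidth * arithmetic_intensity      # bandwidth-limited FLOP/s

print(f"bandwidth-limited ceiling: {ceiling / 1e12:.1f} TFLOPS")
print(f"fraction of peak FP64:     {ceiling / peak_fp64:.0%}")
```

On those assumptions the kernel tops out at a few teraFLOPS no matter how much FP64 hardware sits idle behind it, which is why feeding the chip matters more than its headline vector number.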
Ready or not, here FP64 emulation comes
With an influx of new supercomputers powered by Nvidia’s Blackwell and Rubin GPUs coming online over the next few years, any questions regarding the viability of FP64 emulation will be put to the test sooner rather than later.
And since this emulation isn’t tied to specific hardware, there’s the potential for the algorithms to improve over time as researchers uncover scenarios where the technique excels or struggles.
Despite his concerns, Malaya noted that AMD is also investigating the use of FP64 emulation on chips like the MI355X, through software flags, to see where it may be appropriate.
IEEE compliance, he told us, would go a long way towards validating the approach by ensuring that the results you get from emulation are the same as what you’d get from dedicated silicon.
“If I can go to a partner and say run these two binaries: this one gives you the same answer as the other and is faster, and yeah under the hood we’re doing some scheme — I think that’s a compelling argument that is ready for prime time,” Malaya said.
It may turn out that, for some applications, emulation is more reliable than others, he noted. “We should, as a community, build a basket of apps to look at. I think that’s the way to progress here.” ®