Nvidia’s recent announcement of the Groq 3 LPU has drawn significant attention to the growing importance of specialized hardware for AI inference. At the GTC 2026 conference in San Jose, CEO Jensen Huang highlighted the chip as a key component in the company’s strategy to support the next phase of AI development, particularly as models shift from training-heavy workloads to real-time, reasoning-driven applications. The Groq 3 LPU, developed through a licensing agreement with the AI chip startup Groq, is designed to optimize the inference phase of AI workloads, where low latency and rapid token generation are critical.
According to Nvidia’s official announcement, the Groq 3 LPU integrates intellectual property licensed from Groq in a deal finalized in late 2025. The company stated that the agreement, valued at $20 billion, grants Nvidia access to Groq’s processor architecture for use in its data center platforms. This move underscores Nvidia’s effort to diversify its hardware offerings beyond traditional GPUs by incorporating alternative architectures tailored for specific AI tasks.
The Groq 3 LPU is built around a systolic array architecture that tightly couples processing elements with on-chip SRAM. This design minimizes data movement by allowing computations to occur directly within the memory fabric, reducing latency compared with traditional GPU architectures that rely on off-chip high-bandwidth memory (HBM). Nvidia reports that the Groq 3 LPU delivers up to 1.2 petaFLOPS of 8-bit compute and 150 terabytes per second of memory bandwidth, well above the 22 terabytes per second of its Rubin GPU, although the Rubin offers a far larger 288 GB of HBM capacity.
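Those two figures imply a compute-to-bandwidth ratio, sometimes called the roofline ridge point, of roughly 8 operations per byte moved. Sequential token generation is dominated by matrix-vector products, which perform only about two 8-bit operations per byte of weights read, so the decode phase sits well below that ridge point and is limited by memory bandwidth rather than peak compute. The short sketch below works through that arithmetic; it uses only the numbers quoted above, and the decode intensity of 2 ops/byte is a standard rule-of-thumb assumption rather than an Nvidia-published figure.

```python
# Back-of-the-envelope roofline check for the decode phase, using the
# figures quoted above (1.2 petaFLOPS of 8-bit compute, 150 TB/s of
# on-chip bandwidth). The decode intensity is an illustrative assumption.

peak_ops_per_s = 1.2e15         # 8-bit operations per second (quoted)
bandwidth_bytes_per_s = 150e12  # on-chip SRAM bandwidth (quoted)

# Ridge point: arithmetic intensity (ops per byte moved) at which the chip
# shifts from being bandwidth-bound to compute-bound.
ridge_point = peak_ops_per_s / bandwidth_bytes_per_s
print(f"ridge point: {ridge_point:.1f} ops/byte")  # ~8 ops/byte

# Decode is dominated by matrix-vector products: each 8-bit weight byte is
# read once and used in one multiply-accumulate (~2 ops), so its intensity
# is roughly 2 ops/byte -- far below the ridge point.
decode_intensity = 2.0
print("decode is", "bandwidth-bound" if decode_intensity < ridge_point
      else "compute-bound")
```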
These specifications position the Groq 3 LPU as a latency-optimized accelerator for the decode phase of large language model inference, where tokens are generated one at a time and response speed directly shapes the user experience. The Rubin GPU, by contrast, with its large memory capacity and parallel compute throughput, is better suited to the initial prompt-processing (prefill) phase, which can be computed over all prompt tokens at once and tolerates higher latency.
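To see how the bandwidth figures translate into user-visible latency, consider a rough per-token estimate for a bandwidth-bound decode step, where each generated token requires streaming essentially all of the model's weights once: the latency floor is roughly weight bytes divided by memory bandwidth. The sketch below applies that estimate to a hypothetical 70-billion-parameter model stored at 8 bits per weight; the model size is an assumption for illustration, and the estimate ignores KV-cache traffic, interconnect overhead, and batching.

```python
# Rough lower bound on per-token decode latency: bytes of weights streamed
# per token divided by memory bandwidth. The 70B-parameter model is a
# hypothetical example; bandwidth figures are those quoted in the article.

params = 70e9        # hypothetical model size (parameters)
bytes_per_param = 1  # 8-bit weights
weight_bytes = params * bytes_per_param

for name, bandwidth in [("Groq 3 LPU (150 TB/s)", 150e12),
                        ("Rubin GPU (22 TB/s)", 22e12)]:
    latency_ms = weight_bytes / bandwidth * 1e3
    tokens_per_s = 1e3 / latency_ms
    print(f"{name}: ~{latency_ms:.2f} ms/token, ~{tokens_per_s:.0f} tokens/s")
```

Under these assumptions the LPU's bandwidth advantage translates directly into a several-fold reduction in per-token latency, which is the metric that matters most once a response has begun streaming to the user.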
To leverage the strengths of both architectures, Nvidia has developed the Groq 3 LPX, a modular tray-based system that integrates eight Groq 3 LPUs per tray. The LPX is designed to operate alongside the Vera Rubin NVL72 rack, which houses the CPUs and GPUs used for the prefill and early decode stages. In this disaggregated inference workflow, the Rubin handles the computationally intensive, parallelizable portion of the workload, while the Groq 3 LPU takes over the final, latency-sensitive token-generation steps.
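Nvidia has not published the software interface for this handoff, so the following is only a generic sketch of how disaggregated prefill/decode serving is typically structured: the prompt is processed on the GPU pool, the resulting attention (KV) cache is transferred, and the remaining tokens are generated on the latency-optimized pool. All class and function names here are invented placeholders for illustration, not Nvidia APIs.

```python
# Generic sketch of disaggregated inference serving (not Nvidia's API):
# prefill on a GPU pool, decode on a latency-optimized accelerator pool,
# with the KV cache handed off between them. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class KVCache:
    """Placeholder for the attention cache produced by the prefill stage."""
    prompt_tokens: list
    blob: bytes  # serialized cache contents, transferred between pools

class PrefillPool:
    """Stands in for the throughput-oriented GPU (Rubin-class) stage."""
    def prefill(self, prompt_tokens):
        # Process the whole prompt in parallel and return its KV cache.
        return KVCache(prompt_tokens=prompt_tokens, blob=b"...")

class DecodePool:
    """Stands in for the latency-optimized (LPU-style) decode stage."""
    def decode(self, cache, max_new_tokens):
        generated = []
        for _ in range(max_new_tokens):
            # Each step streams the weights once and emits one token; a real
            # implementation would sample from the model and grow the cache.
            next_token = 0  # placeholder for the sampled token id
            generated.append(next_token)
        return generated

def serve(prompt_tokens, max_new_tokens=64):
    cache = PrefillPool().prefill(prompt_tokens)       # parallel, latency-tolerant
    return DecodePool().decode(cache, max_new_tokens)  # sequential, latency-sensitive

if __name__ == "__main__":
    print(len(serve([101, 2023, 2003, 1037, 3231, 102])), "tokens generated")
```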
Nvidia confirmed during the GTC 2026 keynote that the Groq 3 LPX system is now in volume production, marking a transition from prototype to scalable deployment. The company emphasized that this approach allows data centers to optimize for both throughput and latency by matching the right hardware to each stage of the inference pipeline.
The development reflects a broader industry trend toward hardware specialization in AI infrastructure. As models grow in size and complexity, the computational demands of inference—particularly for reasoning-intensive applications such as AI agents and multi-step task planners—have increased significantly. Unlike training, which can be batched and scheduled, inference must respond to user queries in real time, making low-latency hardware essential for maintaining usability and scalability.
Other companies are pursuing similar strategies. Amazon Web Services, for example, has announced plans to deploy a combined system using its Trainium chips and Cerebras’ CS-3 servers to accelerate cloud-based inference. This system similarly employs inference disaggregation, assigning different stages of the workload to specialized hardware based on their computational profiles.
Analysts note that while GPUs remain versatile and widely adopted, purpose-built inference accelerators like the Groq 3 LPU may offer advantages in specific scenarios where speed and efficiency are paramount. The long-term viability of such specialized hardware will depend on factors including software compatibility, ecosystem support, and the ability to integrate seamlessly into existing data center infrastructures.
As of April 2026, Nvidia has not disclosed detailed performance benchmarks or customer adoption figures for the Groq 3 LPU or LPX systems. The company continues to position the technology as part of its broader platform strategy, emphasizing flexibility and scalability for enterprises deploying AI at scale.
For ongoing updates on Nvidia’s AI hardware roadmap, including future developments in inference acceleration and system-level integration, readers can refer to the company’s official newsroom and technical documentation portals.