DeepSeek Introduces Engram to Cut High‑Bandwidth Memory Needs in Large AI Models

Key Points

  • DeepSeek and Peking University introduced Engram, a method that separates static memory from computation in large language models.
  • Engram uses hashed N‑gram lookups and a context‑aware gating mechanism to retrieve knowledge efficiently.
  • Testing on a 27‑billion‑parameter model showed measurable benchmark improvements and better performance than pure MoE models.
  • The technique reduces reliance on high‑bandwidth memory, allowing models to run on standard GPU memory.
  • Engram integrates with existing hardware solutions, including Phison’s SSD‑based accelerators and emerging CXL standards.
  • By reallocating 20–25% of the sparse parameter budget to Engram, models achieve stable gains without extra FLOPs.
  • The method supports asynchronous prefetching across multiple GPUs, scaling memory capacity linearly.

Background and Motivation

Large language models traditionally depend on high‑bandwidth memory (HBM) to store and retrieve knowledge during inference and training. This dependency creates both performance bottlenecks and cost pressures, and it has contributed to a rapid five‑fold rise in DRAM prices over a short period as demand for AI hardware surged.

Engram Architecture

DeepSeek, collaborating with researchers at Peking University, introduced Engram, a method that decouples static knowledge storage from the dynamic computation performed by the model. Engram stores essential information as hashed N‑grams in a static memory module, which the model accesses through efficient lookups rather than sequential processing. A context‑aware gating mechanism adjusts the retrieved data to align with the model’s hidden state, enabling seamless integration with the transformer backbone without additional FLOPs or parameters.
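To make the lookup‑plus‑gating idea concrete, the sketch below shows one way such a module could look in PyTorch. It is an illustrative assumption rather than DeepSeek's published code: the class name, the bigram hash, the bucket count, and the sigmoid gate are hypothetical stand‑ins for the components the article describes.

```python
# Illustrative sketch of a hashed bigram memory with a context-aware gate.
# All names and sizes here are hypothetical, not DeepSeek's implementation.
import torch
import torch.nn as nn

class HashedNGramMemory(nn.Module):
    def __init__(self, num_buckets: int, dim: int):
        super().__init__()
        self.num_buckets = num_buckets
        # Static table holding the "memorized" vectors; it is never recomputed,
        # only looked up, so it can live outside scarce HBM.
        self.table = nn.Embedding(num_buckets, dim)
        # Context-aware gate: decides, per position, how much retrieved memory to use.
        self.gate = nn.Linear(dim, dim)

    def hash_bigrams(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq). Mix each token with its predecessor and hash
        # the pair into a fixed number of buckets (deterministic, no training).
        prev = torch.roll(token_ids, shifts=1, dims=1)
        return (token_ids * 1_000_003 + prev) % self.num_buckets

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        buckets = self.hash_bigrams(token_ids)        # (batch, seq)
        retrieved = self.table(buckets)               # O(1) lookup per position
        gate = torch.sigmoid(self.gate(hidden))       # gate conditioned on hidden state
        return hidden + gate * retrieved              # fuse memory into the residual stream
```

In this toy version the extra compute is a small gate projection and an elementwise product, negligible next to attention and MLP blocks; the cost is storage, which is exactly the part the article says can be moved off HBM.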

Performance Benefits

In experiments with a 27‑billion‑parameter model, Engram delivered measurable improvements on standard benchmarks. By reallocating roughly 20–25% of the sparse parameter budget to the Engram memory module, the system outperformed pure Mixture‑of‑Experts (MoE) configurations while maintaining stable gains across scales. The deterministic retrieval mechanism allows memory capacity to scale linearly across multiple GPUs and supports asynchronous prefetching during inference, freeing attention mechanisms to focus on global context.
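Because the bucket index depends only on the token IDs and not on any activations, the rows a batch will need are known before its forward pass runs. The sketch below, again a hypothetical illustration rather than the paper's code, shards a single table across several devices and gathers the needed rows ahead of time; in a real system each shard would sit on its own GPU and the gather would run asynchronously so it overlaps with compute.

```python
# Illustrative sketch of deterministic sharding plus ahead-of-time gathering.
# Shard layout, sizes, and function names are hypothetical.
import torch
import torch.nn as nn

NUM_SHARDS = 4              # e.g. one shard per GPU
NUM_BUCKETS = 1_000_000
DIM = 64

# One slice of the memory table per shard; in a real deployment each slice
# would live on a different GPU, so total capacity grows linearly with GPUs.
tables = [nn.Embedding(NUM_BUCKETS // NUM_SHARDS + 1, DIM) for _ in range(NUM_SHARDS)]

@torch.no_grad()
def prefetch_memory(token_ids: torch.Tensor) -> torch.Tensor:
    """Gather the memory rows a batch will need before its forward pass runs."""
    prev = torch.roll(token_ids, shifts=1, dims=1)
    buckets = (token_ids * 1_000_003 + prev) % NUM_BUCKETS   # depends only on tokens
    shard_ids = buckets % NUM_SHARDS                          # which shard owns each row
    local_ids = buckets // NUM_SHARDS                         # row index inside that shard
    out = torch.empty(*token_ids.shape, DIM)
    for s, table in enumerate(tables):
        mask = shard_ids == s
        if mask.any():
            # On real hardware this gather would be issued asynchronously
            # (e.g. on a side CUDA stream) and overlapped with the compute
            # of earlier layers in the current micro-batch.
            out[mask] = table(local_ids[mask])
    return out

# Example: prefetch rows for a (batch=2, seq=128) chunk of token IDs.
rows = prefetch_memory(torch.randint(0, 50_000, (2, 128)))
print(rows.shape)   # torch.Size([2, 128, 64])
```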

Hardware Compatibility

Engram is designed to work with existing GPU and system memory architectures, potentially avoiding the need for costly HBM upgrades. It complements other hardware‑efficient solutions such as Phison’s AI inference accelerators, which expand total memory using SSDs, and aligns with emerging Compute Express Link (CXL) standards aimed at overcoming GPU memory bottlenecks in large‑scale AI workloads.

Implications for the AI Ecosystem

The approach offers a pathway to reduce pressure on expensive memory hardware, particularly in regions where HBM access lags behind leading manufacturers. By enabling more efficient memory usage, Engram may help stabilize sharp DDR5 DRAM price swings and make large‑scale AI models more affordable to train and deploy.

Source: techradar.com