I was asked for scaling up a system very often recently. Scaling up normally means you adds more resource for better result before you reach the limit. We also want to go beyond the limit sometime. I recently needs to understand a legacy system with 100 MB code. I cannot put it to the latest Open AI GPT 4.1 with up to 1,047,576 tokens context window. I tried several ways with GPT 4.1. Each call costs me about 1 dollar with long context length and the output is not good. Why LLM cannot scale up to long context length even I already pay for the expensive call. I will explain it by math, hardware here.

Mathematical Background: How Attention Works

Attention weights are computed through the softmax function applied to attention scores. Specifically, given:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V\]
  • \(Q, K, V\) are \(n \times d\) matrices.
  • \(n\) is the context length.
  • \(d\) is the embedding dimension.

Softmax is applied row-wise on the scaled dot-product matrix:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}\]

Each attention output is thus a weighted average of all tokens in the context.

Mathematical issues for long context length

Issue Mathematical Explanation Consequence
Softmax dilution \(\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}\) Less focused attention weights
Noise accumulation Signal-to-noise ratio decreases as \(O(\frac{1}{n^2})\) Relevant info obscured by irrelevant tokens
Positional embedding degradation Accuracy \(\sim \frac{1}{n}\) Positional info becomes less reliable
Numerical instability Exponential scaling \(e^{QK^T}\) Computational instability
Limited parameters Capacity per token \(\sim \frac{\text{Params}}{n}\) Reduced representation per token

Thus, from the theoretical and mathematical perspective, increasing context length without increasing model complexity or employing specialized architectural modifications (sparse attention, rotary embeddings, ALiBi, sliding window attention) inevitably results in degraded performance.

Numerical instability already address hardware limit but the more general one is

Precision & Numerical Stability

GPUs perform computations typically using lower-precision arithmetic (FP16, INT8) to boost performance:

  • Lower precision (FP16) arithmetic has limited dynamic range and precision.

  • For very large context sizes, small errors compound during attention computation and softmax (exponential functions), resulting in numerical instability.

Hardware limitation:

  • Increased context → Accumulation of small arithmetic errors → Numerical instability (overflow/underflow) → Decreased inference accuracy.

Go beyond the limit

Google’s Gemini 2.5-Pro performance excellent for long context length. One reason in my opinion that is Google’s ASICs TPU goes beyond hardware limit to handle numerical operation very well with FP16 than Nvidia GPU.

However, even with Google Gemini 2.5-Pro I still cannot dump 100M code to LLM. My solution go beyond the limit here is

Go beyond the limit with removing deduplication

My quick solution here is to removing logic duplication code based on my knowledge. Deduplication reduces the code from 100M to 50M even 20M. With prompt engineering, it’s quite easy to generate a nice system architecture diagram by LLM today.