Deterministic LLMs

Temperature

  • LLMs compute a probability distribution over the next token, given the previous context.
  • Next token is sampled randomly from this distribution.
  • At temperature 0 (greedy decoding), the model always picks the most probable token.
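A minimal sketch of the two decoding rules, using a made-up three-token distribution (the tokens and probabilities are illustrative, not from any real model):

```python
import random

# Toy next-token distribution (illustrative values only).
probs = {"the": 0.5, "a": 0.3, "cat": 0.2}

def sample(probs):
    # Temperature > 0: draw a token at random, weighted by probability.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

def greedy(probs):
    # Temperature 0: always pick the most probable token.
    return max(probs, key=probs.get)

print(greedy(probs))  # "the", every time
```

`sample` can return any of the three tokens; `greedy` returns the same token on every call.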

Normal sampling

[Figure: a token sampled at random from the probability distribution over next tokens]

Greedy decoding

[Figure: the same distribution, but greedy decoding always picks the max — same distribution, different rule]
  • Greedy decoding is deterministic!
  • And yet...


Floating Point...

  • Assume arithmetic with 3 significant digits. Let \(a = 1.23\times 10^{4}\), \(b = -1.23\times 10^{4}\), \(c = 1.23\).
  • \((a+b)+c = 0 + 1.23 = 1.23\)
  • \(a+(b+c) = 1.23\times 10^{4} + (-1.23\times 10^{4}) = 0\), because \(b+c = -1.23\times 10^{4} + 1.23\) rounds back to \(-1.23\times 10^{4}\) at 3 significant digits — the \(1.23\) is lost.
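The same effect shows up in ordinary 64-bit floats, not just the 3-digit toy arithmetic above — a quick sketch:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds to 0.30000000000000004 first
right = a + (b + c)  # 0.2 + 0.3 rounds to 0.5 first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```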

...Non-associativity

flowchart TB
  subgraph A["Schedule A: cores finish 1 → 2 → 3"]
    direction TB
    A1["Core 1: a"]
    A2["Core 2: b"]
    A3["Core 3: c"]
    A4["(a+b)"]
    A5["(a+b)+c = 1.23"]
    A1 --> A4
    A2 --> A4
    A4 --> A5
    A3 --> A5
  end
  subgraph B["Schedule B: cores finish 3 → 2 → 1"]
    direction TB
    B1["Core 1: a"]
    B2["Core 2: b"]
    B3["Core 3: c"]
    B4["(b+c)"]
    B5["a+(b+c) = 0"]
    B2 --> B4
    B3 --> B4
    B1 --> B5
    B4 --> B5
  end

Same inputs (a, b, c); different completion order → different reduction → different result.

Deterministic Reductions?

  • Could we simply fix the summation order across the cores?
  • Always combine results from core 1, then core 2, then core 3, even if some finish earlier.
  • This works, but cores sit idle waiting their turn: in practice, throughput is prioritized over determinism.
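A sketch of such a fixed-order reduction, using Python threads in place of GPU cores (the values and the random delays are illustrative): each "core" finishes at an arbitrary time, but we always fold the results in submission order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Values chosen so grouping matters under float64:
# summed left-to-right the result is 1.0, but b + c rounds to -1e16.
a, b, c = 1.0e16, -1.0e16, 1.0

def work(x):
    time.sleep(random.random() * 0.01)  # "cores" finish in arbitrary order
    return x

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(work, v) for v in (a, b, c)]
    # Deterministic reduction: combine in fixed (submission) order,
    # blocking on each core even if a later one already finished.
    total = 0.0
    for fut in futures:
        total += fut.result()

print(total)  # always 1.0, regardless of completion order
```

Combining results as they arrive instead would sometimes compute \(a+(b+c) = 0.0\); paying the waiting cost buys back determinism.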

Batching

  • Prompts of different lengths (e.g. 100 tokens vs 500 tokens).
  • Shorter prompts padded out with special padding tokens to fit a matrix.
  • Matrices are 'tiled' and computation partitioned across GPU cores.
  • Tiling and partitioning depend on the batch shape, so the same prompt can take a different path through the floating point operations, with different rounding \(\rightarrow\) a different response to the same prompt.

Example

Batch of 5 prompts; max length = 5 tokens → 5×5 matrix. Shorter prompts padded with PAD.

           pos 0   pos 1   pos 2   pos 3   pos 4
prompt 0      42      89     PAD     PAD     PAD
prompt 1      12      77     234     891     PAD
prompt 2     301     PAD     PAD     PAD     PAD
prompt 3      55      23    1024      88     PAD
prompt 4       7    1024     156     302     445

Columns = token positions (fixed by longest prompt). Rows = batch dimension. PAD cells are masked out in attention.
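A minimal sketch of how such a batch is assembled — the `PAD` id of `-1` is an assumption for illustration (real tokenizers define their own):

```python
PAD = -1  # placeholder padding id (assumption; tokenizers define their own)

# Token-ID prompts of different lengths, as in the table above.
prompts = [
    [42, 89],
    [12, 77, 234, 891],
    [301],
    [55, 23, 1024, 88],
    [7, 1024, 156, 302, 445],
]

# Right-pad every prompt to the length of the longest one.
max_len = max(len(p) for p in prompts)
batch = [p + [PAD] * (max_len - len(p)) for p in prompts]

# Attention mask: True where there is a real token, False on PAD cells.
mask = [[tok != PAD for tok in row] for row in batch]

for row in batch:
    print(row)
```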

Batch-indeterminism

  • GPU computation can be deterministic.
  • User has no control over the batch!
  • Different batch compositions = different computations for “your” prompt.
  • So even with temperature 0 and deterministic kernels, responses can still appear non-deterministic.
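The effect can be imitated in miniature: below, the same four numbers are summed under two different groupings, standing in for the different tilings two batch compositions might induce (the values are contrived to make the rounding difference visible; real GPU tilings are far more complex).

```python
# The same numbers, grouped as two different "tilings".
vals = [1.0e16, 1.0, -1.0e16, 1.0]

# Tiling A: pairwise tiles (v0+v1) and (v2+v3), then combine.
# 1e16 + 1.0 rounds back to 1e16, and -1e16 + 1.0 rounds to -1e16,
# so both 1.0s are lost before the tiles are combined.
tiling_a = (vals[0] + vals[1]) + (vals[2] + vals[3])

# Tiling B: cancel the large values first, so both 1.0s survive.
tiling_b = ((vals[0] + vals[2]) + vals[1]) + vals[3]

print(tiling_a)  # 0.0
print(tiling_b)  # 2.0
```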

Summary

  • Greedy decoding does not guarantee determinism.
  • Floating point addition is non-associative, and reduction order across cores is nondeterministic.
  • Can use deterministic reduction algorithms, but at a performance cost.
  • Batching means some indeterminism is beyond hardware control.