Deterministic LLMs

Temperature

  • LLMs compute a probability distribution over the next token, given the previous context.
  • Next token is sampled randomly from this distribution.
  • At temperature 0 (greedy decoding), the model always picks the most probable token.
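A minimal sketch of the two decoding rules, using a made-up three-token distribution (the tokens and probabilities are illustrative, not from any real model):

```python
import random

# Toy next-token distribution (illustrative values only).
probs = {"the": 0.5, "a": 0.3, "cat": 0.2}

def sample(probs):
    # Temperature > 0: draw a token at random, weighted by probability.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

def greedy(probs):
    # Temperature 0: always pick the most probable token.
    return max(probs, key=probs.get)

print(greedy(probs))  # "the", every time
```

`sample` can return any of the three tokens; `greedy` returns the same token on every call.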

Normal sampling

[Figure: a token sampled at random from the probability distribution over next tokens]

Greedy decoding

[Figure: the same distribution, but greedy decoding always picks the max — same distribution, different rule]
  • Greedy decoding is deterministic!
  • And yet...


Floating Point...

  • Assume arithmetic with 3 significant digits. Let \(a = 1.23\times 10^{4}\), \(b = -1.23\times 10^{4}\), \(c = 1.23\).
  • \((a+b)+c = 0 + 1.23 = 1.23\)
  • \(a+(b+c) = 1.23\times 10^{4} + (-1.23\times 10^{4}) = 0\), because \(b+c = -1.23\times 10^{4} + 1.23\) rounds back to \(-1.23\times 10^{4}\) at 3 significant digits — the \(1.23\) is lost.
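The same effect shows up in ordinary 64-bit floats, not just the 3-digit toy arithmetic above — a quick sketch:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds to 0.30000000000000004 first
right = a + (b + c)  # 0.2 + 0.3 rounds to 0.5 first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```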

...Non-associativity

flowchart TB
  subgraph A["Schedule A: cores finish 1 → 2 → 3"]
    direction TB
    A1["Core 1: a"]
    A2["Core 2: b"]
    A3["Core 3: c"]
    A4["(a+b)"]
    A5["(a+b)+c = 1.23"]
    A1 --> A4
    A2 --> A4
    A4 --> A5
    A3 --> A5
  end
  subgraph B["Schedule B: cores finish 3 → 2 → 1"]
    direction TB
    B1["Core 1: a"]
    B2["Core 2: b"]
    B3["Core 3: c"]
    B4["(b+c)"]
    B5["a+(b+c) = 0"]
    B2 --> B4
    B3 --> B4
    B1 --> B5
    B4 --> B5
  end

Same inputs (a, b, c); different completion order → different reduction → different result.

Deterministic Reductions?

  • Could we simply fix the summation order across the cores?
  • Always combine results from core 1, then core 2, then core 3, even if some finish earlier.
  • This works, but cores sit idle waiting their turn: in practice, throughput is prioritized over determinism.
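A sketch of such a fixed-order reduction, using Python threads in place of GPU cores (the values and the random delays are illustrative): each "core" finishes at an arbitrary time, but we always fold the results in submission order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Values chosen so grouping matters under float64:
# summed left-to-right the result is 1.0, but b + c rounds to -1e16.
a, b, c = 1.0e16, -1.0e16, 1.0

def work(x):
    time.sleep(random.random() * 0.01)  # "cores" finish in arbitrary order
    return x

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(work, v) for v in (a, b, c)]
    # Deterministic reduction: combine in fixed (submission) order,
    # blocking on each core even if a later one already finished.
    total = 0.0
    for fut in futures:
        total += fut.result()

print(total)  # always 1.0, regardless of completion order
```

Combining results as they arrive instead would sometimes compute \(a+(b+c) = 0.0\); paying the waiting cost buys back determinism.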

Batching

  • Prompts of different lengths (e.g. 100 tokens vs 500 tokens).
  • Shorter prompts padded out with special padding tokens to fit a matrix.
  • Matrices are 'tiled' and computation partitioned across GPU cores.
  • Tiling and partitioning depend on the batch shape, so the same prompt can take a different path through the floating point operations, with different rounding \(\rightarrow\) a different response to the same prompt.

Example

Batch of 5 prompts; max length = 5 tokens → 5×5 matrix. Shorter prompts padded with PAD.

           pos 0   pos 1   pos 2   pos 3   pos 4
prompt 0      42      89     PAD     PAD     PAD
prompt 1      12      77     234     891     PAD
prompt 2     301     PAD     PAD     PAD     PAD
prompt 3      55      23    1024      88     PAD
prompt 4       7    1024     156     302     445

Columns = token positions (fixed by longest prompt). Rows = batch dimension. PAD cells are masked out in attention.
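A minimal sketch of how such a batch is assembled — the `PAD` id of `-1` is an assumption for illustration (real tokenizers define their own):

```python
PAD = -1  # placeholder padding id (assumption; tokenizers define their own)

# Token-ID prompts of different lengths, as in the table above.
prompts = [
    [42, 89],
    [12, 77, 234, 891],
    [301],
    [55, 23, 1024, 88],
    [7, 1024, 156, 302, 445],
]

# Right-pad every prompt to the length of the longest one.
max_len = max(len(p) for p in prompts)
batch = [p + [PAD] * (max_len - len(p)) for p in prompts]

# Attention mask: True where there is a real token, False on PAD cells.
mask = [[tok != PAD for tok in row] for row in batch]

for row in batch:
    print(row)
```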

Batch-indeterminism

  • GPU computation can be deterministic.
  • User has no control over the batch!
  • Different batch compositions = different computations for “your” prompt.
  • So even with temperature 0 and deterministic kernels, responses can still appear non-deterministic.
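The effect can be imitated in miniature: below, the same four numbers are summed under two different groupings, standing in for the different tilings two batch compositions might induce (the values are contrived to make the rounding difference visible; real GPU tilings are far more complex).

```python
# The same numbers, grouped as two different "tilings".
vals = [1.0e16, 1.0, -1.0e16, 1.0]

# Tiling A: pairwise tiles (v0+v1) and (v2+v3), then combine.
# 1e16 + 1.0 rounds back to 1e16, and -1e16 + 1.0 rounds to -1e16,
# so both 1.0s are lost before the tiles are combined.
tiling_a = (vals[0] + vals[1]) + (vals[2] + vals[3])

# Tiling B: cancel the large values first, so both 1.0s survive.
tiling_b = ((vals[0] + vals[2]) + vals[1]) + vals[3]

print(tiling_a)  # 0.0
print(tiling_b)  # 2.0
```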

Summary

  • Greedy decoding does not guarantee determinism.
  • Floating point addition is non-associative, and reduction order across cores is nondeterministic.
  • Can use deterministic reduction algorithms, but at a performance cost.
  • Batching means some indeterminism is beyond hardware control.