
Comparing LLM Output Diversity

Not all models distribute their creativity the same way. Here's how to measure and compare output diversity across different LLMs.

Marc-Aurele Besner · 5 min read

Ask different AI models the same question with the same parameters and you'll get strikingly different results. Some models play it safe across the board. Others take genuine creative swings. The probability distributions that underlie these responses reveal a lot about how different models are trained and aligned.

Why Diversity Varies

Reinforcement Learning from Human Feedback (RLHF) is the primary technique used to make models helpful. The process shapes the distribution over tokens to favor responses that human raters consistently prefer. But here's the thing — humans tend to agree on what's safe and coherent more than they agree on what's interesting or surprising.

This means RLHF tends to compress the distribution. The model learns that the "right" answer is usually the conventional one. Low-probability responses get suppressed not because they're wrong, but because they're unexpected, and from the perspective of the training signal, unexpectedness looks like risk.

Different models handle this compression differently:

  • Models with heavier RLHF tend to have tighter distributions. Less variance, more consistency, fewer surprises.
  • Models with lighter alignment pressure maintain wider distributions, giving you access to more of what the model actually "knows" about a topic.
  • Some models are explicitly trained for creativity and will sample from deeper in the tail even at conventional temperature settings.

How to Measure Diversity

With Verbalized Sampling, you can quantify diversity directly. Run the same prompt on two different models with k=10 and τ=0.01, then compare:

  1. The probability spread: How wide is the range from highest to lowest probability? A model with a 45% top response and 2% bottom response has a very different distribution than one with 28% to 18%.
  2. The tier distribution: What percentage of responses land in Conventional vs. Creative vs. Wild? A model that's 80% conventional at threshold τ=0.01 is playing a very different game than one that's 40% conventional.
  3. The entropy: −Σ p·log(p) across all responses gives you a single number for how "spread out" the model is being. Higher entropy means probability mass is distributed more evenly.
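The three metrics above are easy to compute once you have the (response, probability) pairs a Verbalized Sampling run gives you. Here's a minimal sketch; the tier cutoffs (`conventional=0.15`, `creative=0.03`) are illustrative assumptions, not canonical values, and the probabilities are renormalized before computing entropy since verbalized probabilities rarely sum to exactly 1.

```python
import math

def diversity_metrics(responses, conventional=0.15, creative=0.03):
    """Summarize a list of (text, probability) pairs from one model.

    Tier cutoffs are illustrative: responses at or above `conventional`
    count as Conventional, those between `creative` and `conventional`
    as Creative, and everything below as Wild.
    """
    probs = [p for _, p in responses]
    total = sum(probs)
    norm = [p / total for p in probs]  # renormalize so entropy is well-defined

    spread = max(probs) - min(probs)
    entropy = -sum(p * math.log(p) for p in norm if p > 0)

    tiers = {"conventional": 0, "creative": 0, "wild": 0}
    for p in probs:
        if p >= conventional:
            tiers["conventional"] += 1
        elif p >= creative:
            tiers["creative"] += 1
        else:
            tiers["wild"] += 1

    return {"spread": spread, "entropy": entropy, "tiers": tiers}

# Example: a tight distribution vs. a wide one (made-up numbers)
tight = [("idea %d" % i, p) for i, p in enumerate([0.28, 0.22, 0.20, 0.18, 0.12])]
wide = [("idea %d" % i, p) for i, p in enumerate([0.45, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02])]
print(diversity_metrics(tight))
print(diversity_metrics(wide))
```

Running both models' outputs through the same function makes the comparison concrete: the wide distribution will show a larger spread and a nonzero Wild count, while the tight one concentrates in the Conventional tier.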

What the Spectrum Reveals

When you run a "Which stocks should I buy?" prompt, a conventional model might give you 10 variations on the same advice, all ranked at 30–40% probability. A less-aligned model might give you genuinely different investment philosophies — value investing, momentum trading, contrarian thinking — each with meaningful probability mass.

This doesn't mean one is better than the other. For factual questions, conventional consistency is a feature. For creative tasks, the diversity is the point. But most users default to models that suppress diversity without knowing it.

Running Your Own Comparison

The best way to understand model diversity is to run the same prompt across models side by side. Use the same k and τ values so the comparison is fair. Then look at:

  • How many of the top-k responses are genuinely distinct ideas vs. rewrites of the same point?
  • Where does each model draw the line between conventional and creative territory?
  • Are there responses in the wild zone that are surprisingly coherent?
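For the first question in that checklist, a rough first-pass number helps before you eyeball the responses. One crude proxy, sketched below, is to greedily cluster responses by token overlap: a response only counts as a new idea if it's sufficiently dissimilar to every idea seen so far. The 0.5 similarity threshold is a judgment call, and embedding similarity would be a stronger signal than word overlap.

```python
def jaccard(a, b):
    """Word-overlap similarity between two responses (a crude proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def count_distinct(responses, threshold=0.5):
    """Greedy clustering: count responses that are below `threshold`
    similarity to every previously accepted idea."""
    distinct = []
    for text in responses:
        if all(jaccard(text, seen) < threshold for seen in distinct):
            distinct.append(text)
    return len(distinct)

answers = [
    "buy broad index funds and hold",
    "buy broad index funds and hold long term",
    "momentum trade the strongest sectors",
]
print(count_distinct(answers))  # two rewrites of the same point collapse into one
```

This won't catch paraphrases that share no vocabulary, so treat the count as a floor, not a verdict; the eyeball test still matters.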

The answers will tell you a lot about which model is right for which task — and maybe change how you think about what AI "capability" really means.