Researchers find adding this one simple sentence to prompts makes AI models way more creative

One of the coolest things about generative AI models – both large language models (LLMs) and diffusion-based image generators – is that they are “non-deterministic.” That is, despite their reputation among some critics as “fancy autocorrect,” generative AI models actually generate their output by sampling from a probability distribution over the most likely next tokens (units of information) when composing an answer.
Asking an LLM: “What is the capital of France?” will have it examine the probability distribution for France, capitals, cities, etc. to arrive at the answer ‘Paris’. But that answer could come in the form of “The capital of France is Paris,” or simply “Paris,” or “Paris, although it was once Versailles.”
Still, those of us who use these models every day will find that their answers can sometimes feel annoyingly repetitive or similar. A common joke about coffee is recycled across generations of searches. Story prompts generate similar arcs. Even tasks that should yield many plausible answers—such as naming American states—tend to fall into just a few. This phenomenon, known as mode collapse, arises during post-training tuning and limits the usefulness of otherwise powerful models.
Especially when we use LLMs to generate new creative works in writing, communications, strategy or illustration, we actually want their outputs to be even more varied than they already are.
Now a team of researchers at Northeastern University, Stanford University and West Virginia University has devised an ingeniously simple method to make language and image models generate a greater variety of responses to virtually any user query by adding a single, simple sentence: “Generate 5 responses with their corresponding probabilities, sampled from the full distribution.”
The method, called Verbalized Sampling (VS), helps models like GPT-4, Claude and Gemini produce more diverse and human-like results, without retraining or access to internal parameters. It is described in a paper posted to the open-access preprint server arxiv.org in early October 2025.
When prompted this way, the model no longer defaults to the safest, most typical output. Instead, it verbalizes its internal distribution over possible completions and samples across a broader spectrum of possibilities. This one-line change leads to significant gains in output diversity across multiple domains.
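In practice, applying VS is mostly string handling around an ordinary chat call. The sketch below uses hypothetical helper names of our own (`make_vs_prompt`, `parse_verbalized`), and the reply format it parses is only one plausible shape a model might return; real replies vary by model:

```python
import re

# The VS instruction appended to any user query (wording as reported in the article).
VS_SUFFIX = (
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

def make_vs_prompt(query: str) -> str:
    """Append the Verbalized Sampling instruction to a plain user query."""
    return f"{query}\n{VS_SUFFIX}"

def parse_verbalized(reply: str) -> list[tuple[str, float]]:
    """Parse lines like '1. Paris (0.85)' into (answer, probability) pairs.
    Assumes one 'text (probability)' pair per line; real model output
    may need a more robust parser."""
    pairs = []
    for line in reply.strip().splitlines():
        m = re.match(r"^\s*(?:\d+[.)]\s*)?(.+?)\s*\(([\d.]+)\)\s*$", line)
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs

# An invented example reply, for illustration only:
reply = """1. Paris (0.85)
2. The capital of France is Paris. (0.10)
3. Paris, though Versailles once hosted the court. (0.05)"""
dist = parse_verbalized(reply)
```

Once parsed into (answer, probability) pairs, the distribution can be sampled, filtered, or inspected directly instead of taking the single most likely completion.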
As Weiyan Shi, assistant professor at Northeastern University and co-author of the paper, wrote on X: “The potential of LLMs has not yet been fully unlocked! As our paper shows, prompt optimization can be guided by thinking about how LLMs are trained and tuned, and this can be proven theoretically.”
Why models collapse – and how VS turns it around
According to the research team, the root cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but in the structure of human preferences. People tend to rate more familiar or typical answers as better, which pushes LLMs toward “safe” choices rather than diverse ones during fine-tuning.
However, this bias does not erase the model’s underlying knowledge; it only suppresses it. VS works by circumventing this suppression. Instead of asking for the single most likely response, the prompt invites the model to reveal a range of plausible answers along with their relative probabilities. This distribution-level prompting restores access to the richer diversity present in the underlying pre-trained model.
Real-world performance across tasks
The research team tested Verbalized Sampling in several common usage scenarios:
- Creative writing: When generating stories, VS increased diversity scores by up to 2.1× compared to standard prompts, while maintaining quality. One story prompt – “Without Goodbye” – produced formulaic breakup scenes under direct prompting, but delivered stories of cosmic events, silent emails, and music that stopped mid-dance when prompted via VS.
- Dialogue simulation: In persuasive dialogue tasks, VS enabled models to simulate human-like patterns such as hesitation, resistance, and changes of mind. The distributions of donation behavior under VS more closely matched real human data than those from baseline methods.
- Open-ended QA: When asked to list valid answers (for example, naming US states), models using VS generated answers that better matched the diversity of real-world data. They covered a broader range of responses without sacrificing factual accuracy.
- Synthetic data generation: When VS was used to generate mathematical problems for model training, it created more varied data sets. These in turn improved downstream performance on competitive math benchmarks, outperforming synthetic data generated via direct prompts.
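The “diversity scores” above come from the paper’s own evaluation (which relies on embedding-based similarity); as a much cruder, purely illustrative proxy of our own devising, one can compare responses by the average pairwise Jaccard distance between their word sets:

```python
from itertools import combinations

def lexical_diversity(responses: list[str]) -> float:
    """Average pairwise Jaccard distance over lowercase word sets:
    0.0 for identical responses, approaching 1.0 when responses share
    no words. A rough lexical proxy, not the paper's actual metric."""
    def words(s: str) -> set[str]:
        return set(s.lower().split())

    def jaccard_distance(a: str, b: str) -> float:
        union = words(a) | words(b)
        if not union:
            return 0.0
        return 1 - len(words(a) & words(b)) / len(union)

    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Invented examples: mode-collapsed output vs. varied output
collapsed = ["Two lovers parted at the station without a word."] * 3
varied = [
    "A comet erased every goodbye before it could be spoken.",
    "Her farewell email sat unsent in drafts for eleven years.",
    "The waltz stopped mid-turn; nobody dared to speak.",
]
```

A collapsed set scores 0.0 under this measure, while a varied set scores close to 1.0, which is the qualitative gap the study reports.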
Tunable diversity and better use of larger models
A notable advantage of VS is its tunability. Users can set a probability threshold in the prompt to sample from the low-probability “tails” of the model’s distribution; lower thresholds correspond to greater diversity. This tuning can be done via prompt text alone, without changing decoding settings such as temperature or top-p.
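The paper places the threshold inside the prompt itself; as a client-side illustration of the same dial (a sketch of our own, not the package’s API), one can filter a verbalized distribution to candidates at or below a probability cutoff and renormalize, so that a lower cutoff confines sampling to the rarer tail:

```python
def tail_candidates(
    dist: list[tuple[str, float]], threshold: float
) -> list[tuple[str, float]]:
    """Keep candidates whose verbalized probability is <= threshold,
    then renormalize. threshold=1.0 keeps everything; smaller values
    restrict sampling to the distribution's rarer tail."""
    kept = [(text, p) for text, p in dist if p <= threshold]
    total = sum(p for _, p in kept)
    if total == 0:
        return []
    return [(text, p / total) for text, p in kept]

# An invented verbalized distribution for illustration:
answers = [
    ("Paris", 0.80),
    ("Paris, of course.", 0.12),
    ("Paris, though Versailles once hosted the court.", 0.05),
    ("Lutetia, as the Romans knew Paris.", 0.03),
]
```

With `threshold=1.0` all four candidates survive; dropping it to `0.10` leaves only the two rare phrasings, mirroring how a lower threshold in the prompt pushes the model toward less typical outputs.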
In one test with the Gemini-2.5-Flash model, diversity in story writing increased steadily as the probability threshold decreased from 1 to 0.001. The chart accompanying the study showed that VS outperformed both direct and sequence-based prompting at all thresholds.
Interestingly, the method scales well with model size. Larger models such as GPT-4.1 and Claude-4 showed even larger gains from VS compared to smaller ones. While smaller models benefited, the improvement in diversity was roughly 1.5 to 2 times stronger for larger models. This suggests that VS helps unlock more of the latent potential in advanced models.
Implementation and availability
The Verbalized Sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes LangChain integration and supports a simple interface for sampling from verbalized distributions. Users can also adjust parameters such as k (the number of responses), thresholds and temperature to suit their applications.
A live Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub at: https://github.com/CHATS-lab/verbalized-sampling
Practical tips and common problems
While the method works with all major LLMs, some users may initially encounter refusals or errors.
In these cases, the authors recommend using the system prompt version of the template or referring to alternative formats listed on the GitHub page.
Some models interpret complex instructions as jailbreak attempts and refuse to comply unless the structure is clearer.
For example, phrasing the request as a system-level instruction improves reliability:
You are a helpful assistant. For each query, generate five answers within separate tags, each with a probability lower than 0.10.
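In the role-based chat format used by most LLM APIs, carrying that instruction in the system turn (rather than inside the user message) looks like the minimal sketch below; `build_messages` is an illustrative helper of ours, with the instruction text taken from the example above:

```python
SYSTEM_VS = (
    "You are a helpful assistant. For each query, generate five answers "
    "within separate tags, each with a probability lower than 0.10."
)

def build_messages(query: str) -> list[dict]:
    """Build a role/content message list with the VS instruction in the
    system turn; the user turn stays a plain, unmodified query."""
    return [
        {"role": "system", "content": SYSTEM_VS},
        {"role": "user", "content": query},
    ]
```

Because the VS instruction lives in the system turn, the user’s query stays untouched, which is reportedly less likely to be misread as a jailbreak attempt.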
This small change usually resolves any issues.
A lightweight solution for a big problem
Verbalized Sampling represents a practical solution to a profound limitation in the way modern language models behave. No model retraining or internal access is required. It is not dependent on one model family. And it not only improves the diversity of results, but also their quality, as judged by both human evaluation and benchmark scores.
With growing interest in tools that enhance model creativity, VS will likely see rapid adoption in domains such as writing, design, simulation, education, and synthetic data generation.
For users and developers frustrated by the sameness of LLM answers, the solution may be as simple as changing the question.




