How to Get ChatGPT to Talk Normally

ChatGPT and similar bots often flatter users, ramble vaguely, or throw in jargon to sound smart. New research shows that these habits come not from the models alone but from the way human feedback trains them: the models learn to copy the style of answers humans tend to like, even when those answers are empty or misleading. A new fine-tuning method uses synthetic examples to teach the models to resist these bad habits.
Partly opinion.

ChatGPT is surprisingly disposed to engage with my recurring criticism of it. Having noticed in the last few days that GPT-4o is increasingly padding its answers with meaningless verbiage – such as ‘No fluff!’ and ‘No filler’, or ‘This cuts to the heart of the matter!’ – I asked it why producing straight and minimal answers has become such a problem for it lately. It replied:

ChatGPT explains its latest behavior. Source: https://chatgpt.com/
Who knows if ChatGPT actually has some private insight into OpenAI policy changes, or if it is just hallucinating? In any case, as we can see, the response itself begins with extraneous filler (‘Here is the core answer, no filler’).
It transpires that even including templated guidelines with each query can only do so much to prevent ‘personality-driven’ verbosity of this kind, which is one of several persistent bugbears in the idiom of popular LLMs.
The Three Fs
Thus I was most interested to see a new US academic collaboration turn up in the literature this week. Titled Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models, this joint venture among four researchers at the University of Pennsylvania and New York University homes in on several of the ‘biases’ in LLM chats that crop up frequently in the media:

From the new paper, examples of three common biases in language models: ‘flattery’, where responses strongly agree with the user; ‘fluff’, where answers are long but uninformative; and ‘fog’, where replies list many broad but shallow points. Source: https://arxiv.org/pdf/2506.05339
For easy alliteration, flattery, fluff and fog are headlined in the new work, but a more complete and concise list of LLMs’ lexical sins is included in the paper’s appendix:

The new paper identifies and concentrates on five biases: extra length, list structures, technical jargon, flattery, and vague generalities, all or some of which conflict with human preference.
While length/verbosity leads the table, the bias towards list formatting (second row down in image above) also recurs frequently unless prompted against; and though the jargon and vagueness categories represent opposing extremes between clarity and accuracy, it’s sycophancy – an open problem, particularly in ChatGPT – that really burns through the user’s tokens, almost to the same extent as length/verbosity.
The new study sets out to measure how far these biases distort model behavior, and concludes that large language models systematically over-prefer responses that exhibit one or more of the biases*.
The authors’ tests indicate that both commercial and open models often pick answers that humans would not prefer, especially when the answers are too long, full of lists, packed with jargon, overly flattering, or vague.
This problem, the paper contends, can be traced back to the annotation of the training data, where human reviewers had often favored these kinds of responses. The models, the findings suggest, learned from these labeled preferences and exaggerated those patterns during training.
Why Did They Do It..?
As to why the human annotators deviated from the median preferences of end users, the paper does not speculate; it may be that the context of the annotation, or the wording of the instructions, encouraged a preference for ‘empirical’ phrasing; or (among many other possible reasons) it could be that the annotators were exam-minded students habitually steeped in a technical idiom better suited to academia than to daily discourse.
In any case, because the models were copying biases from the annotators’ training labels, the new paper’s researchers created special training examples that either added or removed each bias, allowing the models to see clear contrasts and adjust their preferences. After fine-tuning on this data, the models showed significantly less bias, especially for jargon, verbosity, and vagueness, while still performing well overall (significant, since fine-tuning can damage general performance).
Let’s take a closer look at this study, though it does not conform to all the usual procedural strictures.
Method
Initially, the researchers frame several typical idiomatic LLM biases to be addressed:
Length, wherein the models tend to favor longer answers, even when the extra content adds nothing useful. This appears to reflect patterns in the training data, where length often correlates with thoroughness in the eyes of human annotators. As a result, models often produce bloated and verbose replies that give an illusion of depth, but without real substance.
Structure, wherein models show a strong preference for bullet points or numbered lists instead of straightforward prose. This may be because structured formats appear more frequently in the responses selected by human reviewers. The habit leads models to default to ‘listicles’, even when the question calls for more natural or detailed explanations.
Jargon, wherein models unnecessarily use specialized or technical language. The authors contend that this behavior likely emerges from training data where jargon-heavy answers were often chosen as better responses. Thus the models learned to equate jargon with expertise, producing answers that sound knowledgeable, while offering little additional clarity.
Sycophancy, wherein models agree with the user’s opinions instead of offering neutral or critical responses. This pattern may come from training data where agreeable answers were more often rated favorably. Consequently models may reinforce user biases and avoid presenting conflicting or more objective viewpoints, even where these would be useful.
Vagueness, wherein models prefer to give broad, generalized answers that touch lightly on many topics rather than directly addressing the specific question, with responses that sound comprehensive but offer little usable information. This may reflect the fact that vague answers are harder to falsify, and were therefore less likely to be penalized during annotation:

Example of vagueness bias, where the model wrongly favors a broad and shallow answer over a detailed response that human evaluators judge more useful.
Counterfactual Data
With these definitions, it was then necessary to test exactly how much each bias influenced model behavior. Simple correlations would not work, because multiple biases often appear together, making it hard to isolate the effect of any one feature.
To overcome this, the researchers built controlled pairs of answers that differed only in a single bias at a time, while keeping everything else as stable as possible, and began by generating a base answer to each query.
The Rewrite-based Attribute Treatment Estimators (RATE) protocol was then used to create a modified version of that answer – an answer crafted to deliberately exaggerate one particular bias, such as adding extra jargon, or turning prose into a list.

Examples of rewrites from the RATE system, used in the new study. Source: https://openreview.net/pdf?id=UnpxRLMMAu
To avoid introducing unrelated differences, an extra rewriting step was included that adjusted both versions, ensuring that the only meaningful change between them was the bias under study; and these tightly controlled response pairs were then fed to the models.
For each pair, the version preferred by the model was recorded, allowing for a calculation of how strongly each bias influenced both reward models and evaluators, producing a more precise measurement of bias effects than had been achieved in previous studies, according to the authors.
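To make the rewrite-of-rewrite idea behind these controlled pairs concrete, a minimal sketch follows. The prompts, bias list, and helper names here are illustrative stand-ins rather than the paper's own protocol, and the OpenAI client is assumed purely for demonstration.

```python
# Rough sketch of a RATE-style counterfactual pair: the base answer is rewritten
# to insert a single target bias, then rewritten again to remove it, so that
# both sides of the pair have passed through the rewriting model and differ
# mainly in the attribute under study. Prompts are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()

ADD_BIAS = {
    "jargon":    "Rewrite this answer using heavy technical jargon.",
    "length":    "Rewrite this answer to be much longer and more verbose.",
    "structure": "Rewrite this answer as a bulleted list.",
}
REMOVE_BIAS = {
    "jargon":    "Rewrite this answer in plain, non-technical language.",
    "length":    "Rewrite this answer concisely.",
    "structure": "Rewrite this answer as flowing prose, with no lists.",
}

def rewrite(text: str, instruction: str) -> str:
    """Ask the model to rewrite the answer, changing only the requested attribute."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Rewrite the answer as instructed, "
                                          "changing nothing else about its content."},
            {"role": "user", "content": f"{instruction}\n\nAnswer:\n{text}"},
        ],
    )
    return response.choices[0].message.content

def counterfactual_pair(base_answer: str, bias: str) -> tuple[str, str]:
    """Return a (neutral, biased) pair intended to differ only in the target bias."""
    biased = rewrite(base_answer, ADD_BIAS[bias])     # insert the bias
    neutral = rewrite(biased, REMOVE_BIAS[bias])      # remove it again
    # Both sides have now been rewritten once, so rewriting artifacts appear in
    # each, leaving the bias itself as the main remaining difference.
    return neutral, biased
```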
With the counterfactual pairs prepared, human reviewers from the UK and US were recruited to create a reference standard: for each bias type, one hundred response pairs were randomly selected, each containing a neutral answer and its biased counterpart. Three evaluators reviewed each pair, with majority vote determining the final judgment, and in total, three hundred participants contributed to the study.
Metrics
Metrics used to measure bias effects were Skew Rate, which calculates how often the model prefers the biased response over the neutral one; and Miscalibration Rate, which measures how often the model’s choice disagrees with the human majority. An ideal model would show zero miscalibration and a skew roughly matching the human skew (since some biased features are occasionally favored by humans as well).
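As a minimal sketch of the two metrics, assuming one boolean per counterfactual pair recording whether the model (or the human majority) preferred the biased response, the variable names being illustrative:

```python
# Skew Rate and Miscalibration Rate over a set of counterfactual pairs.

def skew_rate(prefers_biased: list[bool]) -> float:
    """How often the biased response is preferred over the neutral one."""
    return sum(prefers_biased) / len(prefers_biased)

def miscalibration_rate(model_prefers_biased: list[bool],
                        human_prefers_biased: list[bool]) -> float:
    """How often the model's choice disagrees with the human majority vote."""
    disagreements = [m != h for m, h in
                     zip(model_prefers_biased, human_prefers_biased)]
    return sum(disagreements) / len(disagreements)

# An ideal judge would show zero miscalibration, and a skew_rate close to the
# human skew_rate, since humans themselves sometimes favor the biased version.
```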
Data and Tests
To test the approach, different sources were used, depending on the bias being studied. For structure, jargon, and length, one hundred queries were sampled from Chatbot Arena, filtered to select English, single-sentence, well-formed questions.
For sycophancy, one hundred opinionated queries were generated (i.e., ‘Isn’t modern art just lazy compared to classical techniques?’), phrased to reflect user viewpoints that might invite agreement.
Vagueness was tested with seventy-eight NLP-related queries drawn from the KIWI dataset, supplemented with twenty-two additional queries of a similar type. Scientific topics were chosen for vagueness because they demand precise answers, making general or evasive responses easy to spot.
For each query, counterfactual response pairs were created using the RATE protocol described earlier.
The evaluation involved both open and proprietary systems. Reward models, which assign quality scores to candidate responses during training and alignment, were tested in four versions trained on eighty thousand preference pairs from the Skywork reward dataset: Gemma-2-2B; Gemma-2-27B; Llama-3.1-8B; and Llama-3.2-3B.
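For a concrete picture of the reward-model side of the test, scoring a counterfactual pair might look roughly like the sketch below. The checkpoint name is a placeholder: the paper’s reward models were trained by the authors on Skywork preference pairs, and no public name is assumed here.

```python
# Minimal sketch of scoring one counterfactual pair with a reward model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "your-org/gemma-2-2b-reward"   # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def reward_score(prompt: str, response: str) -> float:
    """Scalar reward the model assigns to a (prompt, response) exchange."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

def prefers_biased(prompt: str, neutral: str, biased: str) -> bool:
    """True if the reward model scores the biased response higher."""
    return reward_score(prompt, biased) > reward_score(prompt, neutral)
```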
Three proprietary models were also assessed as LLM evaluators: Gemini-2.5-Pro; GPT-4o; and Claude-3.7-Sonnet. All counterfactual responses used for testing were generated by GPT-4o:

Comparison of model preferences and human judgments for each bias type, showing how often models favored biased responses and how often these preferences conflicted with human choices.
Of the initial results shown above, the authors comment†:
‘[Our] analysis of preference [models] shows that these models consistently show miscalibration and a high rate of skew in favoring perturbed responses across various bias categories […]
‘[…] Reward models exhibit clear miscalibration relative to human judgments: model preference rates for perturbed responses systematically deviate from human preference rates. While vagueness and jargon elicit the highest miscalibration (>50%), length and sycophancy also show substantial miscalibration.
‘This suggests that models struggle to align with human judgments when responses contain overly technical language or lack specificity.’
Reward models aligned best with humans on structure bias, where both tended to favor the same answers. For jargon and vagueness, models were much more likely to prefer the biased responses than humans. Sycophancy showed smaller differences, with models and humans often agreeing.
The proprietary LLM evaluators showed the same general pattern, though their biggest mismatches appeared with length and vagueness – and they were especially prone to sycophancy, favoring agreeable answers as much as eighty-five percent of the time, while humans did so only about fifty percent of the time.
To trace the origin of these biases, the researchers analyzed the aforementioned Skywork dataset, used to train the reward models, mapping each bias to simple features that could be automatically measured, such as token count for length, or presence of lists for structure.
In a sample of 2,500 examples, human annotators showed clear preferences for biased features: structured answers were favored over unstructured ones 65 percent of the time, and jargon-heavy answers were chosen 54 percent of the time:

Human annotators in the training data often picked answers that included these bias features. This chart shows how often structure, jargon, or vagueness appeared in the responses they preferred or rejected, revealing the imbalances that models later learned during training.
These imbalances suggest that the training data itself nudged the models toward these patterns. To confirm this, a correlation analysis was run, measuring how strongly differences in each feature matched up with the preferences shown by both humans and models.
The results showed that both were consistently influenced by the same features, indicating that models learned to associate certain stylistic traits with better answers, even when those traits did not actually improve the response.

Correlation between feature differences and preferences, showing how both models and humans were influenced by the same bias features during training.
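In rough code, the surface-feature analysis described above amounts to something like the following; the feature definitions are illustrative approximations rather than the paper’s exact implementation.

```python
# Approximate each bias with a simple measurable proxy, then compare the proxy
# between chosen and rejected answers, and correlate the difference with the
# preference label.
import re
from scipy.stats import pearsonr

def length_feature(text: str) -> float:
    return float(len(text.split()))            # crude token count

def structure_feature(text: str) -> float:
    # count bullet or numbered-list markers at the start of lines
    return float(sum(bool(re.match(r"\s*([-*\u2022]|\d+\.)\s", line))
                     for line in text.splitlines()))

def preference_rate(pairs, feature) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen answer has more of
    the feature; the article reports ~65% for structure in the Skywork sample."""
    wins = sum(feature(chosen) > feature(rejected) for chosen, rejected in pairs)
    return wins / len(pairs)

def feature_preference_correlation(pairs, feature) -> float:
    """Correlation between the feature difference and the preference label."""
    diffs, labels = [], []
    for chosen, rejected in pairs:
        diffs += [feature(chosen) - feature(rejected),
                  feature(rejected) - feature(chosen)]
        labels += [1.0, 0.0]
    r, _ = pearsonr(diffs, labels)
    return r
```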
To help the models unlearn these biases, new training data was created. The Skywork dataset was reviewed to check if the bias feature appeared in either the chosen or rejected answers; when both were free of the target bias, GPT-4o rewrote the rejected answer to insert it.
This created new training pairs in which the model could see clear examples of biased and unbiased answers, and thus learn not to favor the biased version. With additional examples from Chatbot Arena included for balance, the models were then fine-tuned on this updated dataset:

The effect of fine-tuning with counterfactual data. The left panel shows how the fine-tuned models moved closer to human preferences on most biases; the right panel shows reduced miscalibration, especially for jargon and vagueness.
The fine-tuning brought the models much closer to human preferences, with the largest improvements seen for jargon and vagueness and smaller gains for length. Structure and sycophancy showed slight new mismatches, though these reflected earlier imbalances rather than new failures.
Overall performance remained stable throughout, and when multiple biases were corrected at once, bias levels fell further without sacrificing response quality.
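For readers who want a concrete picture of the augmentation step described above, a rough sketch follows. The bias detector, the rewrite prompt, and the exact pairing of chosen and rejected answers are assumptions for illustration, since the article does not spell them out.

```python
# When neither side of a preference pair contains the target bias, rewrite the
# rejected answer so that it does, giving the reward model an explicit
# biased-vs-unbiased contrast to learn from.
from openai import OpenAI

client = OpenAI()

INSERT_PROMPTS = {
    "jargon":    "Rewrite this answer using dense technical jargon, keeping its meaning.",
    "length":    "Rewrite this answer to be much longer and more verbose.",
    "vagueness": "Rewrite this answer to be broad and non-specific.",
}

def insert_bias(text: str, bias: str) -> str:
    """Ask GPT-4o to rewrite the answer so that it exhibits the target bias."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"{INSERT_PROMPTS[bias]}\n\nAnswer:\n{text}"}],
    )
    return response.choices[0].message.content

def augment(dataset, bias, has_bias):
    """dataset: iterable of dicts with 'prompt', 'chosen' and 'rejected' keys;
    has_bias: a detector such as the feature proxies sketched earlier."""
    new_pairs = []
    for ex in dataset:
        if not has_bias(ex["chosen"]) and not has_bias(ex["rejected"]):
            new_pairs.append({
                "prompt": ex["prompt"],
                "chosen": ex["chosen"],                        # bias-free answer
                "rejected": insert_bias(ex["rejected"], bias), # biased rewrite
            })
    return new_pairs
```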
The authors conclude:
‘Our method significantly reduces miscalibration issues while preserving overall competence of reward models. Future work can consider adapting our post-training recipe to develop more robust preference models and also evaluate preference models against additional bias axes.’
Conclusion
The new work is an interesting, if elliptical, insight into the way that under-curated or over/under-represented training data can cause undesirable outcomes at inference time. Any regular LLM user will, by now, have a collection of war stories.
For instance, many of the responses that I receive from ChatGPT appear to have been influenced by SEO trends of the last 10-15 years, in which online portals have been forced to optimize for Google placement rather than natural language. Indeed, the emoji-strewn and prodigious output of marketing departments appears to have had a very significant impact on the results of any request to write a promotional LinkedIn post – to the point where AI-generated ‘enthusiasm’ is now impossible to miss:

Left: Asked to promote a LinkedIn post, in an account with zero history, ChatGPT defaults to emojis and sensational PR-speak. Right: Asked the same thing after six months of me telling it to calm down, GPT produces something rather more sober.
However, OpenAI actively intervenes in the way that ChatGPT responds to queries, depending on function and context. This makes it difficult for researchers to distinguish problems that arise from the data, its distribution, and related issues such as annotation, from cases where a non-preferred result is down to commercial interference on the part of the LLM’s host company.
* Due to the jargon-filled writing style that the authors have chosen for this paper, I am avoiding author quotes where possible in favor of summaries.
† Authors’ bold emphasis, not mine.
First published Friday, June 6, 2025