The rise of prompt ops: Tackling hidden AI costs from bad inputs and context bloat

This article is part of VentureBeat's special issue, “The real costs of AI: Performance, efficiency and ROI at scale.” Read more from this special issue.

Model providers continue to roll out increasingly sophisticated large language models (LLMs) with longer context windows and enhanced reasoning capabilities.

This allows models to process and “think” more, but it also increases compute: The more a model takes in and puts out, the more energy it expends and the higher the costs.

Couple this with all the tinkering involved with prompting (it can take a few tries to get to the intended result, and sometimes the question at hand simply doesn't need a model that can think like a PhD) and compute spend can get out of hand.

This has given rise to prompt ops, a whole new discipline in the dawning age of AI.

“Prompt engineering is kind of like writing, the actual creating, whereas prompt ops is like publishing, where you're evolving the content,” Crawford Del Prete, IDC president, told VentureBeat. “The content is alive, the content is changing, and you want to make sure you're refining that over time.”

The challenge of compute use and cost

Compute use and cost are two “related but separate concepts” in the context of LLMs, explained David Emerson, applied scientist at the Vector Institute. Generally, the price users pay scales based on both the number of input tokens (what the user prompts) and the number of output tokens (what the model delivers). However, they are not charged for behind-the-scenes actions like meta-prompts, steering instructions or retrieval-augmented generation (RAG).
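
That pricing structure can be sketched in a few lines. This is a minimal illustration of how per-request cost scales with tokens; the per-million-token rates are hypothetical placeholders, not any provider's actual pricing.

```python
# Minimal sketch of how per-request LLM cost scales with tokens.
# The rates below are hypothetical placeholders, not real pricing.

def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float = 2.00,    # $ per 1M input tokens (assumed)
                  output_rate: float = 8.00    # $ per 1M output tokens (assumed)
                  ) -> float:
    """Price scales with both what the user sends and what the model returns."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A verbose answer to a short question can dominate the bill:
short_q_long_a = estimate_cost(input_tokens=50, output_tokens=2_000)
short_q_short_a = estimate_cost(input_tokens=50, output_tokens=20)
```

Note that output tokens are typically priced higher than input tokens, which is one reason verbose responses matter so much for the bill.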

While longer context allows models to process much more text at once, it directly translates into significantly more FLOPs (a measurement of compute power), he explained. Some aspects of transformer models even scale quadratically with input length if not well managed. Unnecessarily long responses can also slow down processing time and require additional compute and cost to build and maintain algorithms that post-process responses into the answer users were hoping for.
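
The quadratic scaling is easy to see in a back-of-the-envelope calculation. This sketch counts only the attention-score matrix multiply and ignores constants and all other layers; it is an illustration of the scaling behavior, not a real FLOP counter.

```python
# Rough illustration of why context length matters: computing attention
# scores grows quadratically with sequence length. Constants are ignored.

def attention_score_flops(seq_len: int, d_model: int) -> int:
    # QK^T alone is seq_len x seq_len dot products, each of size d_model:
    return seq_len * seq_len * d_model

base = attention_score_flops(seq_len=1_000, d_model=1024)
longer = attention_score_flops(seq_len=10_000, d_model=1024)
ratio = longer / base  # 10x more text -> ~100x more attention-score FLOPs
```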

Typically, longer-context environments incentivize providers to deliberately deliver verbose responses, said Emerson. Heavier reasoning models (o3 or o1 from OpenAI, for example) will often provide long responses to even simple questions, incurring heavy compute costs.


Here is an example:

Input: Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have?

Output: If I eat 1, I would only have 1 left. I would then have 5 apples if I bought 4 more.

The model not only generated more tokens than it needed to, it buried its answer. An engineer may then have to design a programmatic way to extract the final answer or ask follow-up questions like “What is your final answer?” that incur even more API costs.

Alternatively, the prompt could be redesigned to guide the model to produce an immediate answer. For instance:

Input: Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have? Start your response with “The answer is”…

Or:

Input: Answer the following math problem. If I have 2 apples and I buy 4 more at the store after eating 1, how many apples do I have? Wrap your final answer in bold tags (<b></b>).

“The way the question is asked can reduce the effort or cost in getting to the desired answer,” said Emerson. He also pointed out that techniques like few-shot prompting (providing a few examples of what the user is looking for) can help produce quicker outputs.
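
Few-shot prompting can be as simple as prepending worked examples so the model answers tersely in the same format. A minimal sketch; the example questions and the Q/A layout are illustrative, not from the article.

```python
# Minimal few-shot prompt builder: a couple of worked examples show the
# model the expected format so it can answer in kind. The examples and
# phrasing here are illustrative assumptions.

def build_few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")  # leave the final answer for the model
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "If I have 2 apples and I buy 4 more at the store after eating 1, "
    "how many apples do I have?",
    examples=[
        ("If I have 3 pens and lose 1, how many pens do I have?", "2"),
        ("If I have 5 books and buy 2 more, how many books do I have?", "7"),
    ],
)
```

Because the examples answer with a bare number, the model is nudged toward a short, directly parseable completion.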

One danger is not knowing when to use sophisticated techniques like chain-of-thought (CoT) prompting (which generates answers in steps) or self-refinement, which directly encourage models to produce many tokens or go through several iterations when generating responses, said Emerson.

Not every query requires a model to analyze and re-analyze before providing an answer, he emphasized; they could be perfectly capable of answering correctly when instructed to respond directly. Additionally, incorrect prompting API configurations (such as OpenAI o3, which requires a high reasoning effort) will incur higher costs when a lower-effort, cheaper request would suffice.
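
In practice this means matching the reasoning-effort setting to the query instead of defaulting to the maximum. A sketch of building such a request; the "reasoning_effort" field mirrors the parameter some reasoning-model APIs expose, but treat the exact parameter name, values and model name as assumptions to check against your provider's documentation.

```python
# Sketch of matching reasoning effort to the query instead of always
# paying for deep deliberation. Parameter and model names are assumptions.

def build_request(question: str, hard: bool) -> dict:
    return {
        "model": "o3-mini",  # placeholder model name
        "messages": [{"role": "user", "content": question}],
        # Simple questions don't need (or pay for) heavy reasoning:
        "reasoning_effort": "high" if hard else "low",
    }

cheap = build_request("What is 2 + 2?", hard=False)
```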

“With longer contexts, users can also be tempted to use an ‘everything but the kitchen sink’ approach, where you dump as much text as possible into a model's context in the hope that doing so will help the model perform a task more accurately,” said Emerson. “While more context can help models perform tasks, it isn't always the best or most efficient approach.”


Evolution to prompt ops

It's no big secret that AI-optimized infrastructure can be hard to come by these days; IDC's Del Prete pointed out that enterprises must be able to minimize the amount of GPU idle time and fill more queries into idle cycles between GPU requests.

“How do I squeeze more out of these very, very precious commodities?” he noted. “Because I've got to get my system utilization up, because I just don't have the benefit of simply throwing more capacity at the problem.”

Prompt ops can go a long way toward addressing this challenge, as it ultimately manages the lifecycle of the prompt. While prompt engineering is about the quality of the prompt, prompt ops is where you iterate, Del Prete explained.

“It's more orchestration,” he said. “I think of it as the curation of questions and the curation of how you interact with AI to make sure you're getting the most out of it.”

Models can tend to get “fatigued,” cycling in loops where the quality of outputs degrades, he said. Prompt ops helps manage, measure, monitor and tune prompts. “I think when we look back three or four years from now, it's going to be a whole discipline. It'll be a skill.”

While it's still very much an emerging field, early providers include QueryPal, Promptable, Rebuff and TruLens. As prompt ops evolves, these platforms will continue to iterate, improve and provide real-time feedback to give users more capacity to tune prompts over time, Del Prete noted.

Eventually, he predicted, agents will be able to tune, write and structure prompts on their own. “The level of automation will increase, the level of human interaction will decrease, and you'll be able to have agents operating more autonomously in the prompts that they're creating.”

Common mistakes

Until prompt ops is fully realized, there is ultimately no perfect prompt. Some of the biggest mistakes people make, according to Emerson:

  • Not being specific enough about the problem to be solved. This includes how the user wants the model to provide its answer, what should be considered when responding, constraints to take into account and other factors. “In many settings, models need a good amount of context to provide a response that meets users' expectations,” said Emerson.
  • Not taking into account the ways a problem can be simplified to narrow the scope of the response. Should the answer be within a certain range (0 to 100)? Should the answer be phrased as a multiple-choice problem rather than something open-ended? Can the user provide good examples to contextualize the query? Can the problem be broken into steps for separate, simpler queries?
  • Not taking advantage of structure. LLMs are very good at pattern recognition, and many can understand code. While using bullet points, itemized lists or bold indicators (****) may seem “a bit cluttered” to human eyes, Emerson noted, these callouts can be beneficial for an LLM. Asking for structured outputs (such as JSON or Markdown) can also help when users are looking to process responses automatically.
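
The structured-output point in the list above is worth a concrete sketch: if the prompt asks for a JSON object, the response can be parsed directly instead of scraped from prose. The model response below is a stand-in, not real output.

```python
import json

# Sketch of why structured outputs help automation: a JSON-constrained
# response can be parsed directly. The response string is a stand-in.

prompt = (
    "Answer the following math problem. If I have 2 apples and I buy 4 "
    "more at the store after eating 1, how many apples do I have? "
    'Respond only with JSON of the form {"answer": <number>}.'
)

model_response = '{"answer": 5}'  # illustrative stand-in for a model reply

parsed = json.loads(model_response)
print(parsed["answer"])  # 5
```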

There are many other factors to consider when maintaining a production pipeline, based on engineering best practices, Emerson noted. These include:

  • Ensuring that the throughput of the pipeline remains consistent;
  • Monitoring the performance of the prompts over time (possibly against a validation set);
  • Setting up tests and early warning detection to identify pipeline issues.
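
The monitoring and early-warning items above can be combined into a simple regression check. This is a sketch with a stubbed-in model call; in production, run_prompt would hit the LLM API, the validation set would be larger, and the score would be tracked over time.

```python
# Sketch of monitoring prompt performance against a validation set.
# run_prompt is a hypothetical stub standing in for a real model call.

def run_prompt(question: str) -> str:
    # A deployed pipeline would call the model here.
    canned = {"2 + 2": "4", "3 + 5": "8"}
    return canned.get(question, "unknown")

validation_set = [("2 + 2", "4"), ("3 + 5", "8")]

def accuracy(cases: list[tuple[str, str]]) -> float:
    hits = sum(run_prompt(q) == expected for q, expected in cases)
    return hits / len(cases)

score = accuracy(validation_set)
# Early-warning check: fail loudly if a prompt change degrades quality.
assert score >= 0.9, f"Prompt regression detected: accuracy={score:.2f}"
```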

Users can also take advantage of tools designed to support the prompting process. For instance, the open-source DSPy can automatically configure and optimize prompts for downstream tasks based on a few labeled examples. While this may be a fairly sophisticated example, there are many other offerings (including some built into tools like ChatGPT, Google and others) that can assist in prompt design.

And in the end, Emerson said, “I think one of the simplest things users can do is to try to stay up-to-date on effective prompting approaches, model developments and new ways to configure and interact with models.”
