New training method boosts AI multimodal reasoning with smaller, smarter datasets

Researchers from MiroMind AI and several Chinese universities have released OpenMMReasoner, a new training framework that improves the multimodal reasoning capabilities of language models.

The framework uses a two-stage process. It first refines a base model with a composite dataset in a supervised fine-tuning (SFT) phase. Then, a reinforcement learning (RL) phase guides the model to reason more effectively on tasks involving both text and visual data.

Experiments show that models trained with OpenMMReasoner outperform other leading visual reasoning models, often while training on a smaller, higher quality dataset. The framework and all its resources, including a trained 7B model, are completely open source and provide a reliable foundation for building applications that require traceability and robustness.

According to Kaichen Zhang, co-author of a research paper outlining the new method, OpenMMReasoner offers significant benefits for companies looking beyond large, closed systems. “A smaller open-source reasoning model has practical benefits: enterprises can deploy it locally, reduce latency, reduce token costs associated with long thought chains, maintain full control over their data, and [it is] fine-tuned to adapt to their specific downstream task,” he told VentureBeat.

The challenge of transparent multimodal reasoning

Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the reasoning power of large language models (LLMs). RLVR trains LLMs to generate chain-of-thought (CoT) tokens (which mimic the reasoning processes that humans use) before generating the final answer. This improves the model’s ability to solve complex reasoning tasks such as mathematics and coding.

Motivated by this success, researchers have applied similar RL-based methods to large multimodal models (LMMs), showing that the benefits can extend beyond text to improve visual comprehension and problem solving across modalities.

However, a lack of transparency in the training pipeline was a major barrier. Many multimodal reasoning studies do not provide detailed information about their data curation and training processes, making it difficult to reproduce their results or understand why these models work.

“This lack of openness limits reproducibility and obscures a deeper understanding of how reasoning LMMs are actually built and how their training dynamics evolve,” the researchers note.

The OpenMMReasoner recipe

OpenMMReasoner addresses this gap with a fully transparent and scalable training recipe built on open-source LMMs. The researchers found that building high-quality datasets hinges on scaling data diversity: while drawing on diverse data sources matters, increasing the diversity of correct answers for the same question proved an even more important lever.

The first phase of the recipe is a three-step supervised fine-tuning (SFT) pipeline. It starts with data sourcing, where the team collected approximately 103,000 raw question-answer pairs from public datasets that cover common visual question-answering and reasoning tasks. Then they added a data distillation step, using a powerful model (Qwen3-VL-235B-Instruct) to generate new high-quality reasoning traces for selected questions. (This data is then used to train a smaller model.)

To increase answer diversity, the team generated multiple verified reasoning traces for each question, expanding the dataset to 583,000 samples. Finally, they implemented a “domain mixing” step, adding data from mathematical reasoning domains to further generalize the model’s capabilities, resulting in a final SFT dataset of 874,000 examples.
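The answer-diversity step can be illustrated with a minimal sketch. The paper does not publish this exact code, so the helper names below (`expand_with_verified_traces`, `generate_trace`) are hypothetical; the idea is simply to sample several candidate reasoning traces per question from a teacher model and keep only those whose final answer matches the reference:

```python
def expand_with_verified_traces(qa_pairs, generate_trace, num_samples=8):
    """Hypothetical sketch: grow a dataset by sampling several reasoning
    traces per question and keeping only verified (correct-answer) ones."""
    expanded = []
    for question, reference in qa_pairs:
        for _ in range(num_samples):
            trace, answer = generate_trace(question)
            if answer == reference:  # simple verifiable check
                expanded.append({"question": question,
                                 "trace": trace,
                                 "answer": answer})
    return expanded
```

Because each kept sample pairs the same question with a different correct reasoning path, the dataset grows in diversity rather than just volume.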

The second stage is an RL recipe that uses a smaller dataset of 74,000 samples drawn from domains such as science, math, and puzzles. The model is trained with a composite reward function that accounts for both the correctness of the final answer and the consistency of the output format. To improve efficiency, the process includes an “overthinking” penalty, which discourages the model from generating excessively long answers. (Many reasoning models trained via RL learn to produce needlessly long reasoning strings, inflating costs and slowing responses.)
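A composite reward of this kind can be sketched in a few lines. The exact weights and shape of OpenMMReasoner's reward are not given in the article, so the values below (a 0.2 format bonus, a linear per-token penalty beyond a budget) are illustrative assumptions:

```python
def composite_reward(response, reference_answer, format_ok,
                     length_budget=2048, overlength_penalty=0.001):
    """Illustrative composite RL reward: correctness + format bonus,
    minus a linear 'overthinking' penalty for tokens past the budget.
    All coefficients here are assumed, not taken from the paper."""
    correct = 1.0 if response["answer"] == reference_answer else 0.0
    fmt = 0.2 if format_ok else 0.0
    excess = max(0, response["num_tokens"] - length_budget)
    return correct + fmt - overlength_penalty * excess
```

Under this shape, a correct but verbose answer can score lower than a correct concise one, which is exactly the pressure that curbs overthinking.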

This recipe can provide a blueprint for companies training their own models. “For companies with limited domain-specific data, a viable strategy is to first increase answer diversity for their existing data set and then use domain mixing to integrate this domain data into a general reasoning recipe like ours,” explains Zhang. “This allows the model to acquire strong general reasoning skills while adapting to industry-specific tasks, without the need for millions of samples.”
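The domain-mixing strategy Zhang describes amounts to blending a small domain-specific dataset into a larger general reasoning corpus at a chosen ratio. A minimal sketch, with the function name and the 30% fraction as assumptions rather than published details:

```python
import random

def mix_domains(general_data, domain_data, domain_fraction=0.3,
                size=None, seed=0):
    """Hypothetical sketch of domain mixing: sample a training set where
    roughly `domain_fraction` of examples come from domain data and the
    rest from a general reasoning corpus (sampling with replacement)."""
    rng = random.Random(seed)
    size = size or len(general_data) + len(domain_data)
    mixed = []
    for _ in range(size):
        source = domain_data if rng.random() < domain_fraction else general_data
        mixed.append(rng.choice(source))
    return mixed
```

Tuning `domain_fraction` lets a team trade general reasoning skill against domain specialization without needing millions of domain samples.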

A more efficient and capable reasoning model

According to Zhang, the step-by-step process fundamentally changes the reliability of the model’s results. “Traditional models often ‘jump’ straight to an answer, meaning they explore only a small part of the reasoning space,” he said. “In contrast, a reasoning-oriented approach forces the model to explicitly examine multiple intermediate steps… [allowing it] to explore much deeper paths and arrive at answers with much more internal consistency.”

The researchers used the OpenMMReasoner recipe to generate data to refine the Qwen2.5-VL-7B-Instruct open-source vision language model. The result is a highly capable LMM that consistently outperforms state-of-the-art methods, such as Open Vision Reasoner (OVR), on a wide range of multimodal reasoning benchmarks. The SFT phase alone creates a strong base model that achieves superior performance and data efficiency compared to other SFT approaches, despite using a significantly smaller training dataset.

The subsequent RL phase further sharpens and stabilizes these skills, leading to more consistent and improved performance. After RL, the final model achieves state-of-the-art results on several benchmarks, including WeMath, MathVerse and MathVista.

One of the key findings was that as the model improved in multimodal reasoning, it also showed a “gradual emergence of textual reasoning behavior, indicating a transfer of reasoning competence from multimodal to purely linguistic domains,” the researchers note. This indicates that skills learned in one modality can enhance performance in another modality.

“Our results show that strengthening multimodal reasoning can even improve text-only math skills – evidence that core logical skills can be transferred across modalities,” said Zhang. “Looking ahead, we expect these methods to expand to video and audio.”

The researchers also found that token efficiency is crucial. While allowing a model to generate longer reasoning steps can improve performance, excessive tokens reduce efficiency. Their results show that setting a smaller ‘reasoning budget’ can achieve similar or even better accuracy, an important consideration when deploying cost-effective business applications.

By open-sourcing all components of their workflow, the researchers offer a reproducible blueprint of the entire process. For enterprise teams, this transparency is invaluable. “For business leaders concerned about supplier lock-in, hidden biases or opaque data sources, this level of transparency is essential,” said Zhang. “It allows teams to validate the data, adapt the pipeline for new domains, and maintain long-term independence from a single provider.”
