Google Cloud takes aim at CoreWeave and AWS with managed Slurm for enterprise-scale AI training


Some companies are best served by tailoring existing large models to their needs, but a number of companies plan to build their own models, a project that requires access to GPUs.
Google Cloud wants to play a greater role in that process with its new service, Vertex AI Training. The service gives companies that want to train their own models access to a managed Slurm environment, data science tools and chips suited to large-scale model training.
With this new service, Google Cloud hopes to turn more enterprises away from other providers and encourage the building of more business-specific AI models.
While Google Cloud has always offered the ability to customize its Gemini models, the new service allows customers to bring in their own models or customize any open-source model that Google Cloud hosts.
Vertex AI Training positions Google Cloud directly against companies like CoreWeave and Lambda Labs, as well as its cloud competitors AWS and Microsoft Azure.
Jaime de Guerre, senior director of product management at Google Cloud, told VentureBeat that the company has heard from organizations of many sizes that they need a way to better optimize compute, in a more reliable environment.
“What we’re seeing is that there are more and more companies building or adapting large generative AI models to introduce product offerings built around those models, or to power their business in some way,” de Guerre said. “This includes AI startups, technology companies, sovereign organizations that are building a model for a particular region, culture or language, and some large enterprises that may be building this into internal processes.”
De Guerre noted that while anyone can technically use the service, Google is targeting companies planning large-scale model training, rather than simple fine-tuning or LoRA adapters. Vertex AI Training will focus on longer-term training engagements involving hundreds or even thousands of chips. Pricing will depend on the amount of compute a company needs.
“Vertex AI Training is not about adding more information to the context or using RAG; this is about training a model where you can start with completely random weights,” he said.
Model adaptation on the rise
Companies recognize the value of building custom models that go beyond simply fine-tuning an LLM or grounding it with retrieval-augmented generation (RAG). Custom models would know more in-depth business information and respond with answers specific to the organization. Companies like Arcee.ai have started offering their models for customization by customers. Adobe recently announced a service that allows companies to retrain Firefly for their specific needs. Organizations such as FICO, which creates small language models specifically for the financial sector, often buy GPUs to train them at significant cost.
Google Cloud said Vertex AI Training differentiates itself by providing access to a larger set of chips, services to monitor and manage training, and the expertise it has gained from training its Gemini models.
Some early customers of Vertex AI Training include AI Singapore, a consortium of Singaporean research institutes and startups that built the 27-billion-parameter SEA-LION v4, and Salesforce's AI research team.
Companies often have to choose between taking an already built LLM and refining it, or building their own model. Building an LLM from scratch is usually infeasible for smaller companies, and for some use cases it simply doesn't make sense. For organizations where a completely custom or completely new model does make sense, the problem is gaining access to the GPUs needed to run training.
Model training can be expensive
Training a model, de Guerre said, can be difficult and expensive, especially when organizations compete with several others for GPU capacity.
Hyperscalers like AWS and Microsoft – and yes, Google – have argued that their massive data centers, with racks and racks of high-end chips, deliver the most value to enterprises. Not only do customers gain access to expensive GPUs, but cloud providers often offer full-stack services to help companies move to production.
Services like CoreWeave rose to prominence for offering on-demand access to Nvidia H100s, giving customers flexibility in computing power when building models or applications. This has also created a business model in which companies with GPUs rent out server space.
De Guerre said Vertex AI Training isn't just about providing bare compute, where a company rents a GPU server and has to bring its own training software and manage scheduling and failures itself.
“This is a managed Slurm environment that helps schedule all tasks and automatically recover from failed tasks,” de Guerre said. “So if a training job slows down or stops due to a hardware failure, training will automatically restart very quickly, based on automatic checkpoints that we manage, to continue with very little downtime.”
He added that this provides higher throughput and more efficient training for larger-scale compute clusters.
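The recover-and-resume pattern de Guerre describes looks roughly like the sketch below in practice. It is illustrative only, assuming a generic Python training loop and a hypothetical checkpoint path, not Google's actual implementation: when a scheduler such as Slurm requeues a failed job, the script resumes from the most recent checkpoint rather than from step zero.

```python
import os
import pickle

# Hypothetical checkpoint location on shared storage; not a real Vertex AI path.
CKPT_PATH = os.environ.get("CKPT_PATH", "/checkpoints/run1.pkl")
TOTAL_STEPS = 10_000
CKPT_EVERY = 500


def load_state():
    """Resume from the last checkpoint if one exists (e.g. after a job requeue)."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0}  # fresh start: no prior progress


def save_state(state):
    """Write the checkpoint atomically so a crash mid-write can't corrupt it."""
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_PATH)


def train_step(state):
    """Placeholder for one optimizer step on the model."""
    state["step"] += 1
    return state


if __name__ == "__main__":
    state = load_state()
    while state["step"] < TOTAL_STEPS:
        state = train_step(state)
        if state["step"] % CKPT_EVERY == 0:
            save_state(state)
```

In a managed setup, the scheduler detects the hardware failure, restarts the job on healthy nodes, and the training script picks up from the last saved step, which is what keeps downtime low.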
Services like Vertex AI Training can make it easier for enterprises to build niche models or completely customize existing models. But just because the option exists doesn’t mean it’s right for every business.




