Cerebras Introduces World’s Fastest AI Inference Solution: 20x Speed at a Fraction of the Cost
Cerebras Systems, a pioneer in high-performance AI computing, has introduced a breakthrough solution poised to revolutionize AI inference. On August 27, 2024, the company announced the launch of Cerebras Inference, the world’s fastest AI inference service. With performance metrics that dwarf those of traditional GPU-based systems, Cerebras Inference delivers 20 times the speed at a fraction of the cost, setting a new benchmark in AI computing.
Unprecedented speed and cost efficiency
Cerebras Inference is designed to deliver exceptional performance across a range of AI models, especially in the rapidly evolving segment of large language models (LLMs). For example, it processes 1,800 tokens per second for the Llama 3.1 8B model and 450 tokens per second for the Llama 3.1 70B model. These speeds are not only 20 times faster than those of NVIDIA GPU-based solutions, but they also come at a significantly lower cost. Cerebras is offering the service starting at just 10 cents per million tokens for the Llama 3.1 8B model and 60 cents per million tokens for the Llama 3.1 70B model, which represents 100x better price-performance compared to existing GPU-based offerings.
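To put the pricing in perspective, the short Python sketch below turns the quoted throughput and per-token prices into an hourly cost of generation. The figures come straight from the announcement; the assumption of sustained, single-stream, full-utilization output is purely illustrative:

SECONDS_PER_HOUR = 3600

# (tokens per second, dollars per million tokens), as quoted above
models = {
    "Llama 3.1 8B": (1800, 0.10),
    "Llama 3.1 70B": (450, 0.60),
}

for name, (tokens_per_sec, usd_per_million) in models.items():
    tokens_per_hour = tokens_per_sec * SECONDS_PER_HOUR
    cost_per_hour = tokens_per_hour / 1_000_000 * usd_per_million
    print(f"{name}: {tokens_per_hour / 1_000_000:.2f}M tokens/hour, ~${cost_per_hour:.2f}/hour")

At the quoted rates, an hour of continuous output works out to roughly 6.5 million tokens (about $0.65) on the 8B model and 1.6 million tokens (about $0.97) on the 70B model.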
Maintaining accuracy while pushing the limits of speed
One of the most impressive aspects of Cerebras Inference is its ability to maintain cutting-edge accuracy while delivering unparalleled speed. Unlike other approaches that sacrifice precision for speed, Cerebras’ solution remains within the 16-bit domain throughout the entire inference run. This ensures that performance gains do not come at the expense of the quality of AI model output, a crucial factor for developers focused on precision.
Micah Hill-Smith, co-founder and CEO of Artificial Analysis, emphasized the importance of this achievement: “Cerebras delivers speeds that are orders of magnitude faster than GPU-based solutions for Meta’s Llama 3.1 8B and 70B AI models. We measured speeds above 1,800 output tokens per second on Llama 3.1 8B, and above 446 output tokens per second on Llama 3.1 70B – a new record in these benchmarks.”
The growing importance of AI inference
AI inference is the fastest-growing segment of AI compute, accounting for approximately 40% of the total AI hardware market. The advent of fast AI inference like Cerebras’ is akin to the introduction of broadband internet, unlocking new possibilities and ushering in a new era for AI applications. With Cerebras Inference, developers can now build next-generation AI applications that require complex, real-time performance, such as AI agents and intelligent systems.
Andrew Ng, founder of DeepLearning.AI, underlined the importance of speed in AI development: “DeepLearning.AI has multiple agentic workflows that require repeatedly querying an LLM to get a result. Cerebras has built an impressively fast inference capability that will be very useful in such workloads.”
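The sketch below illustrates why speed compounds in such workloads: a hypothetical draft-and-critique agent issues several sequential LLM calls per task, so end-to-end latency scales with the number of round trips. The query_llm helper is a placeholder for any chat-completion call, not an actual Cerebras or DeepLearning.AI API:

def query_llm(prompt: str) -> str:
    # Placeholder for a chat-completions call to an inference service.
    raise NotImplementedError

def refine(task: str, rounds: int = 3) -> str:
    # One initial call plus two calls per round: seven sequential round
    # trips at the defaults, so per-token speed dominates wall-clock time.
    draft = query_llm(f"Write a first draft: {task}")
    for _ in range(rounds):
        critique = query_llm(f"Critique this draft:\n{draft}")
        draft = query_llm(f"Revise the draft to address this critique:\n{critique}\n\nDraft:\n{draft}")
    return draft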
Broad sector support and strategic partnerships
Cerebras has received strong support from industry leaders and has formed strategic partnerships to accelerate the development of AI applications. Kim Branson, SVP AI/ML at GlaxoSmithKline, an early Cerebras customer, highlighted the transformative potential of this technology: “Speed and scale change everything.”
Other companies, such as LiveKit, Perplexity and Meter, have also expressed excitement about the impact Cerebras Inference will have on their businesses. These companies are leveraging Cerebras’ computing capabilities to create more responsive, human-like AI experiences, enhance user interaction in search engines, and streamline network management systems.
Cerebras Inference: tiers and accessibility
Cerebras Inference is available in three competitively priced tiers: Free, Developer and Enterprise. The Free Tier offers API access at no cost with generous usage limits, making it accessible to a wide range of users. The Developer Tier offers a flexible, serverless deployment option, with the Llama 3.1 8B and 70B models priced at 10 cents and 60 cents per million tokens, respectively. The Enterprise Tier is aimed at organizations with long-term workloads and offers fine-tuned models, customized service level agreements and dedicated support, with pricing available upon request.
Powering Cerebras Inference: The Wafer Scale Engine 3 (WSE-3)
At the heart of Cerebras Inference is the Cerebras CS-3 system, powered by the industry-leading Wafer Scale Engine 3 (WSE-3). This AI processor is unparalleled in size and speed, offering 7,000 times more memory bandwidth than NVIDIA’s H100. The WSE-3’s massive scale allows it to serve many simultaneous users while maintaining blistering speeds. This architecture enables Cerebras to avoid the tradeoffs that typically constrain GPU-based systems and to deliver best-in-class performance for AI workloads.
Seamless integration and developer-friendly API
Cerebras Inference is designed with developers in mind. It features an API that is fully compatible with the OpenAI Chat Completions API, allowing easy migration with minimal code changes. This developer-friendly approach ensures that the integration of Cerebras Inference into existing workflows is as seamless as possible, enabling rapid deployment of high-performance AI applications.
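In practice, this means an existing OpenAI client can simply be repointed at Cerebras with a different base URL and API key. The sketch below assumes a Python environment with the openai package installed; the endpoint URL and model identifier shown are assumptions, so check the Cerebras documentation for the exact values:

import os
from openai import OpenAI

# Point the standard OpenAI client at the Cerebras endpoint
# (base_url and model name are assumptions; see Cerebras' docs).
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Explain wafer-scale computing in one sentence."}],
)
print(response.choices[0].message.content)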
Cerebras Systems: driving innovation across industries
Cerebras Systems is not only a leader in AI computing, but also a major player in several industries, including healthcare, energy, government, scientific computing and financial services. The company’s solutions have been instrumental in achieving breakthroughs at institutions such as National Laboratories, Aleph Alpha, The Mayo Clinic and GlaxoSmithKline.
By delivering unparalleled speed, scalability and accuracy, Cerebras enables organizations in these industries to tackle some of the most challenging problems in AI and beyond. Whether accelerating drug discovery in healthcare or improving computing capabilities in scientific research, Cerebras is at the forefront of driving innovation.
Conclusion: A new era for AI inference
Cerebras Systems is setting a new standard for AI inference with the launch of Cerebras Inference. By offering 20 times the speed of traditional GPU-based systems at a fraction of the cost, Cerebras not only makes AI more accessible, but also paves the way for the next generation of AI applications. With its cutting-edge technology, strategic partnerships and commitment to innovation, Cerebras is poised to lead the AI industry into a new era of unprecedented performance and scalability.
To learn more about Cerebras Systems and try Cerebras Inference, visit www.cerebras.ai.