
From logs to insights: The AI breakthrough redefining observability

Presented by Elastic


Logs will become the primary tool for finding the “why” when diagnosing network incidents

Modern IT environments have a data problem: there’s too much of it. Organizations tasked with managing the enterprise environment are increasingly challenged to detect and diagnose issues in real time, optimize performance, improve reliability, and ensure security and compliance – all within limited budgets.

The modern observability landscape offers many tools that promise a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns, figure out what’s happening in the network, and diagnose why a problem or incident occurred. The trouble is that this process creates information overload: a single Kubernetes cluster can generate 30 to 50 gigabytes of logs per day, and suspicious behavior patterns can slip past the human eye.

“It’s so anachronistic today, in the world of AI, to just think about humans observing the infrastructure,” said Ken Exner, chief product officer at Elastic. “I hate to break it to you, but machines are better than humans at matching patterns.”

An industry-wide focus on visualizing symptoms forces engineers to search manually for answers. The crucial “why” lies hidden in logs, but because logs contain vast amounts of unstructured data, the industry tends to treat them as a last resort. This pushes teams into costly trade-offs: spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or simply log everything and forget it.

Elastic, the Search AI Company, recently released a new observability feature called Streams, which aims to become the primary signal for investigations by taking noisy logs and turning them into patterns, context, and meaning.


Streams uses AI to automatically partition and parse raw logs and extract relevant fields, significantly reducing the effort SREs need to make logs usable. Streams also automatically surfaces significant events, such as critical errors and anomalies, from context-rich logs, giving SREs early warning and clear visibility into their workloads so they can investigate and resolve issues faster. The ultimate goal is to show recovery steps.
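To make the idea concrete, here is a minimal, hypothetical sketch of what "parsing raw logs into structured fields and surfacing significant events" means. The log format, field names, and regex are illustrative assumptions, not Elastic's actual implementation, which uses AI rather than a fixed pattern:

```python
import re

# Hypothetical log format: "<timestamp> <LEVEL> <service> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)"
)

def parse_line(line):
    """Turn one raw log line into a dict of structured fields, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

raw = "2024-05-01T12:00:00Z ERROR checkout-db connection pool exhausted"
parsed = parse_line(raw)

# Once fields exist, "significant events" can be filtered out mechanically.
is_significant = parsed is not None and parsed["level"] in {"ERROR", "FATAL"}
```

The point of the sketch is the before/after: a free-text line becomes queryable fields (`level`, `service`), which is what makes automated alerting on logs possible at all.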

“From raw, bulky, messy data, Streams automatically creates structure, puts it into a usable form, automatically alerts you to problems and helps you solve them,” says Exner. “That’s the magic of Streams.”

A broken workflow

Streams upends an observability workflow that many consider broken. Typically, SREs create metrics, logs, and traces. They then set alerts and service level objectives (SLOs), often hard-coded rules that indicate when a service or process has crossed a threshold or a specific pattern has been detected.
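A hard-coded rule of this kind can be sketched in a few lines. The metric name and threshold below are illustrative assumptions, not taken from any real SLO:

```python
def check_slo(metric_name, value, threshold):
    """Return an alert message when a metric crosses its threshold, else None."""
    if value > threshold:
        return f"ALERT: {metric_name}={value} exceeds SLO threshold {threshold}"
    return None

# A single static rule: fire when p99 latency crosses 500 ms.
alert = check_slo("p99_latency_ms", 870.0, threshold=500.0)
```

The brittleness the article describes follows directly from this shape: the rule only knows the one metric and the one threshold someone wrote down, so everything downstream of the alert is left to a human.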

When an alert fires, it points to the metric showing an anomaly. From there, SREs turn to a metrics dashboard, where they can visualize the problem and compare the alerting metric against others (CPU against memory against I/O), looking for patterns.

They may then need to pull up a trace and investigate the upstream and downstream dependencies within the application to determine the root cause. Once they figure out what is causing the problem, they dig into the logs of that database or service to try to fix it.

Some companies simply add more tools when the current ones prove ineffective. The result is SREs hopping from tool to tool to monitor and troubleshoot their infrastructure and applications.


“You’re jumping through different tools. You’re relying on a human to interpret these things, visually looking at the relationship between systems in a service map, visually looking at graphs on a metrics dashboard, to figure out what and where the problem is,” says Exner. “But AI automates that workflow.”

With AI-powered Streams, logs are used not only reactively, to troubleshoot issues, but also proactively, to flag potential issues and create information-rich alerts. These alerts help teams start troubleshooting immediately, propose a remediation, or even resolve the issue entirely and then automatically notify the team that it has been resolved.

“I believe that logs, the richest set of information, the original signal type, are going to drive a lot of the automation that a site reliability engineer does very manually today,” he adds. “A human being shouldn’t have to be in that loop, digging through the data themselves, trying to figure out what’s going on, where and what the problem is, and then, once they find the cause, trying to figure out how to fix it.”

The future of observability

Large Language Models (LLMs) can play a key role in the future of observability. LLMs excel at recognizing patterns in large amounts of repetitive data, and the log and telemetry data of complex, dynamic systems is exactly that. Today’s LLMs can also be trained for specific IT processes: paired with automation tools, an LLM has the information and tools needed to resolve database errors, Java heap issues, and more. Embedding these models in platforms that provide context and relevance will be essential.
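The claim that logs are "repetitive data" can be illustrated without any LLM at all: mask the variable parts of each line (here, just numbers) and thousands of lines collapse into a handful of templates. This is a toy sketch with made-up log lines, not a real log-mining algorithm:

```python
import re
from collections import Counter

def template(line):
    """Mask numeric values so lines differing only in IDs/counts collapse."""
    return re.sub(r"\d+", "<NUM>", line)

logs = [
    "user 42 logged in",
    "user 7 logged in",
    "request 9001 timed out after 30 s",
    "request 17 timed out after 30 s",
]

counts = Counter(template(line) for line in logs)
# Four lines reduce to two templates; at scale, millions of lines
# typically reduce to a few hundred, which is the structure that
# machine pattern-matching (and LLMs) can exploit.
```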


Automated remediation will take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next few years. In other words, recovery steps will be driven by LLMs: the LLM will propose solutions, and a human will verify and implement them, rather than having to be the expert.

Addressing skills shortages

Going all-in on AI for observability could also help address a major shortage of the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with deep experience and insight into potential problems and how to resolve them quickly. That experience can come from a contextually grounded LLM, Exner says.

“We can help address the skills shortage by augmenting people with LLMs, making them all instant experts,” he explains. “I think this will make it much easier for us to take novice practitioners and turn them into expert practitioners, in both security and observability.”

Streams in Elastic Observability is now available. Get started by reading more about Streams.


Sponsored articles are content produced by a company that pays for the post or has a business relationship with VentureBeat, and is always clearly marked. For more information please contact sales@venturebeat.com.

