Tags: observability, future, distributed-traces, ebpf
Distributed Tracing in 2025: What the future holds

Explore our predictions for the evolution of distributed tracing, a critical component in achieving comprehensive observability, as we foresee widespread adoption and groundbreaking advancements.

Eden Federman
Nov 09 2022

This blog predicts developments in distributed tracing that will accelerate its widespread adoption, which is vital to achieving end-to-end observability.

But before we put forth our predictions, let’s briefly recap where we are today.

The Current State of Distributed Tracing: Theory vs Reality

If you are reading this, you probably know what distributed tracing is and why you need it (we recommend watching this video). In short, metrics and logs do not scale well in distributed environments. Beyond a certain number of applications, debugging issues with logs and metrics alone becomes problematic, mainly because the information they provide is relevant only to the application being monitored. What is missing is an understanding of how events flow through the distributed system. Distributed tracing gives us that missing context by tracking requests as they flow through the system, allowing a quick understanding of what happens to requests as they transit the microservices that make up your distributed applications.

Sounds promising, but how many companies are actually using distributed tracing? With the rise of distributed architectures, one would expect this technology to be widely adopted by now. There are many reasons why it should be:

  • It solves a real problem: the growing complexity of distributed architectures. If you have ever tried to debug a system composed of multiple microservices, you have felt this problem.
  • There are dozens of observability companies that visualize, store, and analyze distributed traces for you.
  • It is not a new concept: distributed tracing was described in Google's Dapper paper back in 2010.
  • There is an excellent open-source ecosystem: OpenTelemetry and Jaeger are both CNCF projects and are very active.

And yet...

The adoption of distributed tracing has not hit critical mass.

Tracing Adoption Chart

This slow adoption is due to the difficulty of implementing distributed tracing. Simply put, it is much more complicated to implement than any other telemetry signal. Unlike metrics or logs, distributed tracing is not a stand-alone signal that spans just one application; it is a collaborative signal that requires instrumenting all of your applications. Because traces span all of your applications, the different development teams must collaborate to make sure each of their applications is instrumented. If some of the applications in the chain are not instrumented properly, the trace will be broken.
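
To make this concrete, here is a minimal sketch of what every service in the chain has to do (or have done for it by auto-instrumentation) so a trace can continue through it: extract the incoming trace context, record its own span, and propagate the context to the next hop. It uses the OpenTelemetry Go API; the service, route, and span names are illustrative, and a real setup would also configure a TracerProvider and exporter.

```go
// A minimal sketch of one service participating in a distributed trace.
// Without this (or equivalent auto-instrumentation), the trace breaks at this hop.
package main

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// W3C Trace Context propagator: reads and writes the `traceparent` header.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	tracer := otel.Tracer("checkout-service")

	http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		// 1. Continue the trace started upstream (or start a new one if none exists).
		ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

		// 2. Record this service's work as a span within that trace.
		ctx, span := tracer.Start(ctx, "checkout")
		defer span.End()

		// 3. Pass the trace context on to the next hop, e.g. a payments service.
		req, _ := http.NewRequestWithContext(ctx, http.MethodPost, "http://payments/charge", nil)
		otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}

		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```

If any service in the request path skips these steps, the downstream spans end up in a separate, disconnected trace.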

If you attended KubeCon 2022 in Detroit, you might have heard about the pain of adopting distributed tracing. This pain was highlighted in presentations at EnvoyCon, eBPF Day, and this one (titled “Distributed Tracing - The Struggle is Real”), which compared distributed tracing to the Segway 🤔 (great hype and potential, but an early demise due to lack of adoption and impact).

We believe all of this is going to change. Here are 5 of our predictions on the future of distributed tracing:

Prediction #1: Applicative Distributed Tracing Becomes Mainstream

App Trace

We believe 2023 is the year distributed tracing will achieve mass adoption. At first, most traces will span across backend applications. The required context propagation will start at the first application layer after the load balancer, tracing across applications, backend services and databases.

Automation is key

As we explained earlier, the main challenge in adopting distributed tracing is the collaboration it requires across development teams, each of which needs to prioritize and implement instrumentation for its own applications.

What do you do with a manual process that needs to be implemented on every team and every application? You automate it, of course!

Automatic instrumentation has existed for years, but for wide adoption to take place, it will need to improve in multiple ways:

  1. Stability - The quality of the generated traces differs greatly according to the programming language. The OpenTelemetry community is working hard to change that and we expect stability to improve greatly in the coming months.
  2. New Compiled Languages Support - Historically, automatic instrumentation was possible only for JIT languages (e.g. Java and .NET) or interpreted languages (e.g. Python and JavaScript). Missing from that list were compiled languages (like Go and Rust). With eBPF now allowing us to automatically instrument compiled languages, the final piece of the puzzle is in place to complete auto-instrumentation for every programming language (the sketch after this list contrasts this with today's library-based wrapping). At Keyval, we have developed automatic instrumentation for Go based on eBPF, which has been adopted by the OpenTelemetry community. In the future, Keyval plans to apply the same mechanism used for Go to other compiled languages like Rust and C++.
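
For context, this is roughly what today's library-based approach looks like in Go: each team imports a wrapper such as otelhttp, wraps its handlers, and rebuilds the service. The handler below is illustrative; the point of eBPF-based instrumentation is to get equivalent spans without this build-time step.

```go
// A sketch of today's library-based instrumentation for Go: even "automatic"
// instrumentation still requires importing a wrapper and rebuilding the app.
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	// Every team has to remember to wrap every handler (and rebuild and redeploy)
	// before its service shows up in the trace.
	http.ListenAndServe(":8080", otelhttp.NewHandler(hello, "hello"))
}
```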

Finally, we predict wide adoption of distributed tracing will happen only once developers can take any container image and magically auto-instrument it, without any knowledge of the underlying technology. This is why we built Odigos - an observability control plane that generates distributed traces within minutes by automatically detecting applications, applying automatic instrumentation, and managing collectors.

Prediction #2: Distributed Tracing Goes Deeper and Wider

The same eBPF innovation that allows us to instrument compiled languages will also drive further advancements in distributed tracing, enabling us to look deeper into the stack.

Automatic Instrumentation of Databases & Kubernetes

Kubernetes and Databases Trace

According to the Data on Kubernetes community survey, more and more companies are running stateful workloads on Kubernetes. We predict the next phase of distributed tracing will let us look deeper into those stateful workloads. Instead of a single span representing a database query execution, we will have multiple spans that break the execution down further, giving us additional visibility into what is actually occurring inside the database (i.e. logical plan, physical plan, query execution, etc.). We can already see this change happening in projects like sqlcommenter, which enables context propagation over SQL queries. A few databases (like Vitess) are already instrumented, and we expect this trend to continue and automatic instrumentation to be developed for additional databases.
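
The idea behind sqlcommenter-style propagation is simple enough to sketch: the query itself carries the current trace context as a comment, so a database that understands the convention can attach its own spans to the application's trace. The helper below is hypothetical and only illustrates the format; it is not the sqlcommenter library.

```go
// A hypothetical helper illustrating sqlcommenter-style context propagation:
// the W3C traceparent of the active span is appended to the query as a comment.
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/trace"
)

func withTraceComment(ctx context.Context, query string) string {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		// No active span: send the query unchanged.
		return query
	}
	return fmt.Sprintf("%s /*traceparent='00-%s-%s-%s'*/",
		query, sc.TraceID(), sc.SpanID(), sc.TraceFlags())
}

func main() {
	// With an active span in ctx, the query would carry a comment such as
	// /*traceparent='00-<trace-id>-<span-id>-01'*/; without one it is unchanged.
	fmt.Println(withTraceComment(context.Background(), "SELECT * FROM orders WHERE id = $1"))
}
```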

In addition, automatic instrumentation will give us better visibility into Kubernetes itself. There is already work being done by SIG Instrumentation and projects like kspan. Using automatic instrumentation for Go, we will be able to perform context propagation across Kubernetes API transactions, which will provide complete coverage of what happens inside Kubernetes and the various operators.

Cloud Providers & Frontends

Frontend and Cloud Providers Trace

The rise of OpenTelemetry as a standard also affects the major cloud providers. AWS, Google Cloud, and Azure all support OpenTelemetry. We predict that in the coming years, managed services from these cloud providers will emit telemetry in the OTLP format by default, allowing distributed traces to extend beyond compute services.

The trend of traces extending outside the cluster will have an effect on frontend applications as well. The Java and JavaScript instrumentations already support Android and web browsers. Together with automatic instrumentation for CLI applications, we will see distributed traces span frontend applications too.

Automatic Instrumentation of CI/CD

CI CD Trace

Once we achieve coverage of everything that runs code, the next logical step is to add everything that builds code. This is already partially supported today by projects like buildevents, but it requires the significant effort of manually instrumenting every tool used in the CI/CD pipeline. Since many pipelines are simply bash scripts, there is a constant race between new commands being added and instrumentation being written for those commands. Automatic instrumentation of CLI applications lays the foundation for automatically instrumenting the entire CI/CD pipeline.
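
To illustrate what instrumenting a pipeline boils down to, here is a minimal sketch that wraps each external command a pipeline runs in a span recording its duration and exit status. The tracer name, step names, and commands are illustrative, and a real setup would configure a TracerProvider with an OTLP exporter.

```go
// A minimal sketch of CI/CD tracing: one parent span per pipeline run,
// one child span per executed command, with failures recorded on the span.
package main

import (
	"context"
	"os/exec"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

func runStep(ctx context.Context, name string, args ...string) error {
	ctx, span := otel.Tracer("ci-pipeline").Start(ctx, name)
	defer span.End()

	cmd := exec.CommandContext(ctx, args[0], args[1:]...)
	if err := cmd.Run(); err != nil {
		// Mark failed pipeline steps so they stand out in the trace.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

func main() {
	ctx, root := otel.Tracer("ci-pipeline").Start(context.Background(), "build-and-test")
	defer root.End()

	runStep(ctx, "go-build", "go", "build", "./...")
	runStep(ctx, "go-test", "go", "test", "./...")
}
```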

Prediction #3: Sampling will lose to better compression

All these great improvements have a cost: adding more spans and systems to the traces requires more bandwidth. One common way to address this is sampling, but statistical sampling (e.g. keeping only 5% of requests) makes our traces useless for debugging the requests that were not sampled. One interesting innovation we expect to see is the adoption of columnar encoding for telemetry data. Columnar encoding (such as Apache Arrow) provides better compression by grouping similar data together (column-oriented vs row-oriented) and has the potential to reduce the bandwidth required to transmit telemetry without dropping important data.
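
As a toy illustration of why columnar layouts compress better (this is not the actual OTLP or Arrow wire format, just the underlying intuition), the snippet below serializes the same synthetic span data row by row and column by column and compares the gzip-compressed sizes. Grouping the highly repetitive fields together typically yields the smaller output.

```go
// A toy demonstration (not the actual OTLP/Arrow encoding) of why grouping
// similar data together compresses better than interleaving it row by row.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"math/rand"
)

func gzipSize(b []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(b)
	w.Close()
	return buf.Len()
}

func main() {
	r := rand.New(rand.NewSource(1))
	services := []string{"frontend", "checkout", "payments"}
	ops := []string{"GET /cart", "POST /checkout", "charge"}

	var rows, ids, svcs, names, durs bytes.Buffer
	for i := 0; i < 10000; i++ {
		id := fmt.Sprintf("%016x", r.Uint64()) // unique span ID
		svc, op := services[i%3], ops[i%3]     // highly repetitive fields
		dur := fmt.Sprintf("%d", r.Intn(500))  // duration in ms

		// Row-oriented: each span's fields are interleaved.
		fmt.Fprintf(&rows, "%s,%s,%s,%s\n", id, svc, op, dur)

		// Column-oriented: each field is stored contiguously.
		fmt.Fprintln(&ids, id)
		fmt.Fprintln(&svcs, svc)
		fmt.Fprintln(&names, op)
		fmt.Fprintln(&durs, dur)
	}
	columns := bytes.Join([][]byte{ids.Bytes(), svcs.Bytes(), names.Bytes(), durs.Bytes()}, nil)

	fmt.Println("row-oriented gzip bytes:   ", gzipSize(rows.Bytes()))
	fmt.Println("column-oriented gzip bytes:", gzipSize(columns.Bytes()))
}
```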

Prediction #4: Generating metrics based on distributed traces

Metrics can be aggregated from distributed traces. In fact, most observability vendors already do it for you today: send them traces and you will also see various metrics graphs. The SpanMetrics Processor makes this possible without using an observability vendor. As distributed traces become commonplace, we expect metrics to be generated from traces rather than from the target applications themselves.
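
The idea is simple enough to sketch: every finished span is also folded into a request counter and a latency histogram keyed by service and span name, which is roughly what the SpanMetrics Processor automates for you. The Span struct, keys, and bucket boundaries below are illustrative, not the processor's actual implementation.

```go
// A toy aggregation of spans into RED-style metrics: call counts and latency
// histograms per service/span name.
package main

import (
	"fmt"
	"time"
)

type Span struct {
	Service, Name string
	Duration      time.Duration
}

type metrics struct {
	calls   map[string]int
	buckets map[string][]int // latency histogram per service/span name
}

// Illustrative histogram boundaries.
var bounds = []time.Duration{10 * time.Millisecond, 100 * time.Millisecond, time.Second}

func (m *metrics) record(s Span) {
	key := s.Service + "/" + s.Name
	m.calls[key]++
	if m.buckets[key] == nil {
		m.buckets[key] = make([]int, len(bounds)+1)
	}
	i := 0
	for i < len(bounds) && s.Duration > bounds[i] {
		i++
	}
	m.buckets[key][i]++
}

func main() {
	m := &metrics{calls: map[string]int{}, buckets: map[string][]int{}}
	for _, s := range []Span{
		{"checkout", "POST /checkout", 40 * time.Millisecond},
		{"checkout", "POST /checkout", 250 * time.Millisecond},
		{"payments", "charge", 8 * time.Millisecond},
	} {
		m.record(s)
	}
	fmt.Println(m.calls)   // call counts per service/span name
	fmt.Println(m.buckets) // latency histogram buckets per service/span name
}
```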

Prediction #5: Manual instrumentation is here to stay

Manual instrumentation will still have its place, as there is a limit to how deep automatic instrumentation can trace. We do think that signals like continuous profiling, especially when correlated with other telemetry signals, will reduce the amount of manual instrumentation needed. In addition, as support for OpenTelemetry logs matures, we expect developers to use logs (familiar APIs, no learning curve) to add manual instrumentation rather than enriching traces with additional spans.

Summary

Distributed tracing has been around for years. With the rise of distributed systems, the need for distributed tracing has grown, and yet its adoption rate remains low due to the difficulty of implementing it across an organization. We predict all of this is going to change thanks to recent advancements in auto-instrumentation and observability control planes that automate the entire process and produce distributed traces within minutes.

If you want to learn more about how you can generate distributed traces instantly, check out our GitHub repository. We'd really appreciate it if you could throw us a ⭐👇
https://github.com/keyval-dev/odigos
