This post predicts developments in distributed tracing that will accelerate its widespread adoption - a technology vital to achieving end-to-end observability.
But before we put forth our predictions, let’s briefly recap where we are today.
If you are reading this, you probably know what distributed tracing is and why you need it (we recommend watching this video). In short, metrics and logs do not scale well in distributed environments. Beyond a certain number of applications, debugging issues with logs and metrics alone becomes problematic, mainly because the information they provide is relevant only to the application being monitored. What is missing is an understanding of how events flow through the distributed system. Distributed tracing supplies that missing context by tracking requests as they flow through the system, letting you quickly understand what happens to a request as it transits the microservices that make up your distributed applications.
Sounds promising, but how many companies are actually using distributed tracing? With the rise of distributed architectures, one would expect this technology to be widely adopted by now. There are many reasons why it should be:
The adoption of distributed tracing has not hit critical mass.
This slow adoption is due to the difficulty of implementing distributed tracing - simply put, it is far more complicated to implement than any other telemetry signal. Unlike metrics or logs, distributed tracing is not a stand-alone signal that spans just one application, but a collaborative signal that requires instrumenting all of your applications. Because distributed traces span all of your applications, the different development teams must collaborate, each making sure their applications are instrumented. If some of the applications in the chain are not instrumented properly, the trace will be broken.
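To see why a single uninstrumented hop breaks the chain, here is a minimal sketch of W3C Trace Context propagation using only the Python standard library. Each service must read the incoming `traceparent` header and forward it (with a new span id) on outgoing requests; a service that drops the header severs the trace. The function names are illustrative, not part of any particular SDK.

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)   # 16-byte trace id
    span_id = secrets.token_hex(8)                 # 8-byte span id
    return f"00-{trace_id}-{span_id}-01"

def propagate(incoming):
    """Keep the trace id from the incoming header, mint a new span id."""
    _, trace_id, _, _ = incoming.split("-")
    return make_traceparent(trace_id)

# One hop: the trace id survives, the span id changes.
header = make_traceparent()
forwarded = propagate(header)
```

Every application in the request path has to do this forwarding step - which is exactly why instrumentation is an all-or-nothing, cross-team effort.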
If you attended KubeCon 2022 in Detroit, you might have heard about the pain of adopting distributed tracing. This pain was highlighted in presentations at EnvoyCon, eBPF Day, and this one (titled “Distributed Tracing - The Struggle is Real”), comparing distributed tracing to the Segway 🤔 (great hype and potential, but an early demise due to lack of adoption and impact).
We believe all of this is going to change. Here are 5 of our predictions on the future of distributed tracing:
We believe 2023 is the year distributed tracing will achieve mass adoption. At first, most traces will span across backend applications. The required context propagation will start at the first application layer after the load balancer, tracing across applications, backend services and databases.
As we explained earlier, the main challenge in adopting distributed tracing is the collaboration required across the different development teams, each of which needs to prioritize and implement instrumentation for all of its applications.
What do you do with a manual process that needs to be implemented on every team and every application? You automate it, of course!
Automatic instrumentation has existed for years, but for wide adoption to take place, it will need to improve in multiple ways:
Finally, we predict wide adoption of distributed tracing will happen only once developers can take any container image and magically auto-instrument it, without knowledge of the underlying technology. This is why we built Odigos - an observability control plane that generates distributed traces within minutes by automatically detecting applications, applying automatic instrumentations, and managing collectors.
The same eBPF innovation that allows us to instrument compiled languages will drive further advancement in distributed tracing - enabling us to look deeper into the stack.
According to the Data on Kubernetes Community survey, more and more companies are running stateful workloads on Kubernetes. We predict the next phase of distributed tracing will let us look deeper into those stateful workloads. Instead of a single span representing a database query execution, we’ll have multiple spans that break the query execution down further, giving us additional visibility and detail on what is actually occurring in the database (i.e. logical plan, physical plan, query execution, etc.). We can already see this change happening in projects like sqlcommenter, which enables context propagation over SQL queries. A few databases (like Vitess) are already instrumented, and we expect this trend to continue and automatic instrumentation to be developed for additional databases.
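The idea behind sqlcommenter-style propagation is simple: the trace context rides along with the SQL statement itself, inside a comment, so the database (or a proxy in front of it) can continue the trace. A rough sketch of the mechanism, with the exact comment format simplified from what the real sqlcommenter libraries emit:

```python
import urllib.parse

def add_trace_comment(sql, traceparent):
    """Append the trace context to a SQL statement as a trailing comment.

    The comment is ignored by the SQL engine but can be parsed by an
    instrumented database or proxy to link the query to its trace.
    """
    value = urllib.parse.quote(traceparent)
    return f"{sql.rstrip(';')} /*traceparent='{value}'*/"

query = add_trace_comment(
    "SELECT * FROM orders WHERE id = 42",
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
)
```

Because the comment is inert to the query planner, this works against any SQL database today; the extra spans only appear once the database side knows how to read it.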
In addition, automatic instrumentation will give us better visibility into Kubernetes itself. Work is already being done by SIG Instrumentation and in projects like kspan. By using automatic instrumentation for Go, we will be able to perform context propagation across API transactions, providing complete coverage of what happens inside Kubernetes and the different Operators.
The rise of OpenTelemetry as a standard also affects the major cloud providers. AWS, Google Cloud, and Azure all support OpenTelemetry. We predict that in the coming years, managed services from these cloud providers will emit instrumentation in the OTLP format by default - allowing for distributed traces that go beyond the compute services.
Once we achieve coverage of everything that runs code, the next logical step is to add everything that builds code. This is already partially supported today by projects like buildevents, but it requires the significant effort of manually instrumenting every tool used in the CI/CD pipeline. Since many pipelines are simply bash scripts, there is a constant race between new commands being added and instrumentation catching up with them. Automatic instrumentation of CLI applications lays the foundation for automatically instrumenting the entire CI/CD pipeline.
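To make the idea concrete, here is a toy sketch of what pipeline instrumentation captures: every command in a CI run is wrapped in a span recording its timing and exit code, with all spans sharing one trace id for the run. The span field names are illustrative, not the OTLP schema, and real tools like buildevents do this far more thoroughly.

```python
import subprocess
import time
import uuid

def run_traced(cmd, trace_id):
    """Run one pipeline command and return a span-like record for it."""
    start = time.time()
    result = subprocess.run(cmd, capture_output=True)
    return {
        "trace_id": trace_id,            # shared across the whole pipeline run
        "span_id": uuid.uuid4().hex[:16],
        "name": " ".join(cmd),
        "start": start,
        "duration_s": time.time() - start,
        "status": result.returncode,
    }

# One trace id per pipeline run; one span per command.
pipeline_trace = uuid.uuid4().hex
spans = [run_traced(cmd, pipeline_trace) for cmd in (
    ["echo", "build"],
    ["echo", "test"],
)]
```

The "race" the text describes is visible here: every new command added to the script needs to be routed through a wrapper like this, which is exactly what automatic CLI instrumentation would remove.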
All these great improvements come at a cost: adding more spans and systems to traces requires more bandwidth. A common way to address this is sampling, but statistical sampling (e.g. sampling only 5% of requests) makes our traces useless for debugging the requests that were not sampled. One interesting innovation we expect to see is the adoption of columnar encoding for telemetry data. Columnar encoding (such as Apache Arrow) provides better compression by grouping similar data together (column-oriented vs. row-oriented) and has the potential to reduce the bandwidth required to transmit telemetry without dropping important data.
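A toy demonstration of why columnar layouts compress telemetry so well, using only the standard library (real Arrow-based pipelines use dictionary and delta encodings on top of this): grouping similar values together - all service names in one list, all durations in another - gives a general-purpose compressor longer runs of redundancy than interleaved per-span records. The span data here is synthetic and the exact sizes are illustrative.

```python
import json
import zlib

# 1,000 synthetic spans with highly repetitive fields, as telemetry tends to be.
spans = [{"service": f"svc-{i % 3}", "op": "GET /items", "ms": 10 + i % 5}
         for i in range(1000)]

# Row-oriented: one record per span, field names repeated every time.
row_oriented = json.dumps(spans).encode()

# Column-oriented: one list per field, similar values grouped together.
col_oriented = json.dumps({
    "service": [s["service"] for s in spans],
    "op": [s["op"] for s in spans],
    "ms": [s["ms"] for s in spans],
}).encode()

row_size = len(zlib.compress(row_oriented))
col_size = len(zlib.compress(col_oriented))
# The columnar layout typically compresses noticeably smaller.
```

The same principle is what lets a columnar OTLP encoding ship full-fidelity traces in less bandwidth, rather than discarding data through sampling.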
Metrics can be aggregated from distributed traces. In fact, most observability vendors will already do this for you today - send them traces and you will also see various metrics graphs. The SpanMetrics Processor makes this possible without an observability vendor. As distributed traces become commonplace, we expect metrics to be generated from the traces themselves instead of from the target applications.
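The idea underlying span-derived metrics can be sketched in a few lines: aggregate finished spans into RED-style metrics (request count, error count, latency) per service and operation, instead of having the application emit metrics separately. The span dictionaries and field names below are simplified stand-ins for real span data.

```python
from collections import defaultdict
from statistics import mean

def spans_to_metrics(spans):
    """Aggregate finished spans into per-(service, operation) metrics."""
    acc = defaultdict(lambda: {"count": 0, "errors": 0, "latencies": []})
    for span in spans:
        m = acc[(span["service"], span["operation"])]
        m["count"] += 1
        m["errors"] += span["status"] != "OK"
        m["latencies"].append(span["duration_ms"])
    return {key: {"count": v["count"], "errors": v["errors"],
                  "avg_ms": mean(v["latencies"])}
            for key, v in acc.items()}

metrics = spans_to_metrics([
    {"service": "cart", "operation": "GET /cart", "status": "OK", "duration_ms": 12},
    {"service": "cart", "operation": "GET /cart", "status": "ERROR", "duration_ms": 40},
])
```

Deriving metrics this way keeps the two signals consistent by construction - a latency spike on the graph always corresponds to actual traces you can drill into.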
Manual instrumentation will still have its place - there is a limit to how deep automatic instrumentation can trace. We do think that signals like continuous profiling, especially when correlated with other telemetry signals, will reduce the amount of manual instrumentation needed. In addition, as support for OpenTelemetry logs matures, we expect developers to use logs (familiar APIs, no learning curve) to add manual instrumentation rather than enriching traces with additional spans.
Distributed tracing has been around for years. With the rise of distributed systems, the need for distributed tracing has grown, and yet adoption remains low due to the difficult implementation required across the organization. We predict all of this is going to change thanks to recent advancements in auto-instrumentation and observability control planes that automate the entire process and produce distributed traces within minutes.