Opentelemetry Trace Sampling

How to use Odigos for cost-effective trace data sampling with no code changes

Why Sampling is important

In the world of distributed systems and microservices, tracing has become an essential tool for understanding and troubleshooting complex architectures. Distributed tracing are a powerful tool for monitoring and diagnosing complex, microservices-based architectures, enabling developers to track the flow of requests across different services. However, as the volume of trace data can grow exponentially in large-scale systems, capturing every trace can be impractical and costly. Trace sampling addresses this challenge.

This is where sampling comes into play. Sampling in distributed tracing allows teams to manage data volume and storage costs by selectively capturing a representative subset of traces, without losing critical insights into system behavior. By intelligently sampling traces, companies can strike a balance between capturing enough data to gain insights into system behavior and reducing the overhead associated with storing and analyzing large volumes of trace data.

Additional reasons companies use trace sampling include:

Performance Impact: Collecting and processing every single trace can introduce overhead on the system being traced. This can impact the performance and scalability of the application, especially in high-throughput environments. By sampling traces, companies can mitigate this performance impact and ensure that their systems continue to operate smoothly.
Focus on Relevant Data: Not all traces are equally important or informative. Sampling allows companies to focus on capturing traces that are most relevant to their specific use cases or areas of interest. Filtering out less relevant traces allows companies to focus their analysis on the most critical aspects of their system.

Tail sampling VS Head Sampling

In distributed tracing, the difference between head and tail sampling depends on when the decision to capture a trace is made. Different use cases demand different sampling strategies, for example, head sampling is typically used for reducing trace volume upfront, while tail sampling is better suited for capturing significant or unusual events after their outcomes are known.

Head-based sampling is generally more efficient as it reduces the need to collect and store data that may ultimately be discarded. By making the decision early, it minimizes resource usage and has less impact on system performance. However, head-based sampling can't apply sampling policies that require the context of the entire trace, such as identifying traces that contain error messages or other critical events. Tail-based sampling, while more resource-intensive, is better suited for capturing these important traces because it has access to the complete context before making a decision.

Sampling with Odigos: Simpler Than Ever

With the Odigos open-source project, we abstracted OpenTelemetry's sampling methods of head and tail sampling into a concept called Odigos Actions. You just focus on your use case without worrying about the underlying complexities. You simply choose your preferred Odigos Action, and we implement it in the most effective way possible.
These actions are designed to capture user use cases and simplify their implementation using the Odigos platform. In this blog, we'll walk you through a brief demo on implementing sampling for common use cases using our intuitive actions.

Example Use Case

We’re going to sample traces for Company X, which has five main applications: Frontend, Pricing, Membership, Inventory, and Coupon. Here’s an overview of their architecture:
Odigos Demo Architecture

Company X aims to implement the following sampling strategies:

Capture 100% of error traces and retain only 5% of non-error traces.
For the Frontend service:
- Capture requests to /buy only if they exceed 1 second, while still keeping 10% of shorter traces.
- Capture all requests to /products that exceed 290 ms.

Here’s how to implement it:

First, we will create the Error Sampler action to capture 100% of error traces and retain only 5% of non-error traces.

Go to Actions > Add New Action > Error Sampler:

Next, we’ll create the Latency Sampler action to capture specific requests in the Frontend service. We want to capture all requests to /products that exceed 290 ms and requests to /buy that exceed 1 second, while still keeping 10% of shorter traces.

Go to Actions > Add New Action > Latency Sampler:

Now, let’s explore Jaeger, the trace backend installed on my cluster. (Odigos supports all major tracing backends.)

The first two traces were sampled due to the presence of errors.
The third trace was sampled because it exceeded the 1-second threshold defined for the /buy endpoint.
The last trace was sampled because it exceeded the 290ms threshold defined for the /products endpoint.

Try It Yourself

Ready to see the benefits of Odigos in your own environment? Get started with our quick-start guide or watch a demo video. Explore how easy it is to implement trace sampling without any code changes and optimize your system’s performance.

Links

Odigos Sampling documentation

Feedback

As always, we would love to hear your feedback. Please reach out to us on Odigos Slack or GitHub if you have any questions or suggestions.

LEARN MORE