Introduction to Log Pipelines

What are log pipelines, why do we need them, what purpose do they serve, and how do we design them?

Amir Blum
Oct 15 2023

Log Pipelines

Log pipelines are essential for operating any system, but especially cloud native systems. Simply adding log statements to your code is not enough. Logs need to be shipped to a central place where they can be consumed and processed to deliver value to your organization.

But let's go back a few steps and start from the very beginning.

Console Printing

I am sure we all remember our first days of programming, writing simple programs and practicing our new skills. When things worked as expected, we were happy and excited. But when things didn't work, we needed some way to understand what went wrong. A simple way to do this is by just printing things to the console. This is probably the simplest form of observability, allowing us to examine the internal events in our programs (a loop iteration, a function execution) as well as some useful context like the values of variables.
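For instance, a first attempt at understanding a misbehaving loop might look something like this (the function and data here are purely illustrative):

```python
def apply_discounts(orders, discount_rate):
    total = 0
    for i, order in enumerate(orders):
        # Peek at the internal state of each loop iteration
        print(f"processing order {i}: amount={order['amount']}")
        total += order["amount"] * (1 - discount_rate)
    print(f"finished, discounted total={total}")
    return total
```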

While this is a very simple and common way to debug our program during development, it does not work in production:

  • Our programs normally run on remote machines, and we don't have access to the console.
  • Even when we have console access, production traffic generates a huge volume of console prints, and the console stream offers very limited filtering and searching, making it hard to extract only the information we care about.
  • Unlike our development environment, production environments may run multiple instances of our program, often making it hard to discover which instance produced the information we are interested in.
  • If the program crashes, is terminated due to scaling down, or restarts after a new deployment, it takes its console down with it and the information of interest is lost.

So as we can see, printing to the console might work in development, but it is not a good solution for production.

Log Files

The next step in the evolution of logging is redirecting these prints to a file on the local filesystem. This is a common practice in cloud-native software.

It solves some of the problems we mentioned above: we can now log in to a remote machine to consume those files, and we can use tools like grep to search and filter for the data we are interested in.

On the flip side, we still have to deal with some of the old problems:

  • We need to know which instance produced the log file we are interested in and locate this file on the proper remote machine.
  • If the host is ephemeral (like a container), the log file might be lost when the host is terminated (if we don't have a way to persist it).

Moreover, this introduces us to a new set of challenges:

  • Storage - Making sure we do not fill up the disk with logs.
  • Structure - Adding structure to this (potentially enormous) stream of events to enable easier search and filtering. For example, adding a timestamp, the name of the program, the name of the component, severity, etc.
  • Programmer Usability - Making sure our code writes to the file through a consistent and friendly-to-use API (see the sketch after this list).
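As a rough sketch of how these concerns are commonly addressed (shown here with Python's standard logging module purely for illustration; the file path is hypothetical): rotation bounds the storage, the formatter adds structure, and the logger gives application code a simple API.

```python
import logging
from logging.handlers import RotatingFileHandler

# Storage: cap each file at 10 MB and keep at most 5 backups,
# so the logs can never fill up the disk.
handler = RotatingFileHandler(
    "/var/log/myapp/app.log", maxBytes=10 * 1024 * 1024, backupCount=5
)

# Structure: every record gets a timestamp, component name, and severity.
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

# Programmer usability: application code logs through one consistent API.
logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("connected to database")
```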

Log Records

Those primitive prints we described above are commonly referred to as log records. Logs record events that happen in our program which the developer might find value in examining later.

While you can record any event you want, popular events include:

  • Errors encountered during execution (db errors, network errors, exceptions, functions that returned an error, etc.)
  • Evidence of something happening (db connected, http endpoint hit, business events)
  • Debug information that might help us understand what is going on (values of variables, function arguments, branch decisions)

While there is a wealth of information to delve into regarding logs, for the scope of our current discussion let's briefly touch on some of the common attributes found in log records (a sample record is sketched after this list):

  • Timestamp - when did the event happen
  • Severity - how important is this event (trace, debug, info, warning, error, critical)
  • Message - a human readable description of the event
  • Attributes - a set of key-value pairs that provide additional information about the event

Loggers

Like anything else in software - there is no need to reinvent the wheel. There are tons of libraries that can help us produce log records in a consistent and friendly way.

Those libraries are commonly referred to as loggers, and your programming language and runtime probably have a few of them to choose from.
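For example, in Python you could use the standard library's logging module, or a structured logging library such as structlog (shown here as one illustrative option among many); each call records an event plus key-value attributes in a consistent way:

```python
import structlog

# A structured logger: each call records an event name plus key-value attributes.
log = structlog.get_logger()

log.info("db_connected", db="orders", pool_size=10)
log.warning("http_retry", url="https://api.example.com", attempt=2)
log.error("charge_failed", order_id="A-1042", reason="card_declined")
```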

Log Services / Backends

Ok, so writing log files to the local filesystem is a good start, but it is not a good solution for a real system with dozens of components, hosts, and a large amount of traffic.

We can't start SSHing and grepping on each host to find the log file we are interested in. We also want results quickly and efficiently, which usually requires some indexing and aggregation of the logs.

We need a way to collect those logs from all the instances of our program, and store them in a central place.

To the rescue come log services, or log backends. Instead of storing each log record in a file on the local host filesystem, we now ship these logs to a central service, which significantly reduces our workload:

  • Persist the logs so they are not lost
  • Provide a way to consume the logs, with a UI including search and filter
  • Provide APIs to run queries on the logs
  • Auto-scale to adapt to the volume of logs we produce
  • Provide a way to alert on logs that match certain criteria
  • Provide a way to aggregate logs from multiple instances of our program
  • Automate the process of rotating and deleting old logs (retention)
  • Provide a way to control access to the logs
  • Provide a way to control the cost of storing the logs
  • Integrate with other tools to allow us to extract analytics and dashboards based on this data

and many many more.

These services can be open-source or commercial, and you can run them yourself or use a managed SaaS offering. Popular self-managed examples include Grafana Loki and Elasticsearch; popular managed services include Datadog, Splunk, Coralogix, Grafana Cloud, AWS CloudWatch, Google Cloud Logging, Azure Monitor, and many more.

Log Pipelines

Ok, so we have implemented loggers in our code that produce log records for anything we might consider interesting and worth recording. We have also researched and chosen a log backend (self-managed or paid SaaS) to which we want to ship logs. But how do we connect the two? How do we get the log records from our code to the log service?

This is where things get interesting. There are many ways to do it, and you should choose the right one for your use case. We will get to them soon.

So why the term pipelines? We already described the first part of the pipeline - the logger, where log records are born. We also described the last part of the pipeline - the log service, where log records are stored and consumed. The pipeline consists of all the components that a log record encounters on its journey - from recording until it is consumed downstream.

Tasks of a Log Pipeline

Now, we can simply send the log records from our application to the log service, and we are done, right?

Well, not so fast.

Let's discuss some of the practices and common requirements to do this properly:

  • Log Format - The log records need to be formatted to the format expected by the log service (or the next receiver in the pipeline). For example, we might read JSON lines and emit OTLP (OpenTelemetry Protocol) or a vendor format like Datadog's.
  • Log Transport - The log records need to be transported to the log service. For example, we might use HTTP, gRPC, or a vendor-specific protocol.
  • Authentication - The log service might require the sender to authenticate with an API key, which the user needs to provide.
  • Additional Attributes - We might want to add additional attributes to each log record, for example: the k8s namespace, the cluster name, host, OS type and version, runtime version, etc.
  • Log Aggregation - To be more cost effective and consume fewer resources, it is common to batch records before exporting them downstream.
  • Log Sampling and Filtering - We might want to sample the log records or filter them by severity to reduce the amount of data we send to the log service and pay for.
  • Retry and Backoff - If the log service experiences a temporary outage, or signals network backoff, the sender needs to be able to buffer and retry sending the log records.
  • Flushing - We need to hook into the application lifecycle and make sure we flush any buffered log records before the application terminates.
  • TLS - If the log service uses TLS, the sender needs to apply encryption and authentication, which might add some overhead.

That is a lot of tasks to take care of. Of course, we do not need to implement any of this ourselves. The ecosystem is full of libraries and tools that can help us with that; all we need to do is decide which ones to use and what tasks to delegate to each one.

Simplest Log Pipeline

As stated above, the simplest log pipeline will require no additional components. We will simply use a logger in our code to produce log records, and hook up an exporter to send them to the final destination:

Simplest Log Pipeline

This is a working setup, but it has many drawbacks - we will need to be aware of all the points mentioned above and implement them ourselves in each and every application we write (a sketch of what that involves follows the list below). While possible, this scales poorly, is error prone, and is hard to maintain.

  • Each service has to spend extra CPU cycles on log handling, which often adds latency to your business logic.
  • Log pipeline updates (versions, configuration, etc.) require redeployment of all the services and potentially code changes.
  • When using multiple languages and frameworks, we will need to implement and maintain the same logic and configuration in each - multiplying the work.
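To make these costs concrete, here is a deliberately simplified, hypothetical sketch of the kind of code every application would have to carry: batching records in memory, exporting them over HTTP, retrying with backoff, and flushing on shutdown. The endpoint, API key, and field names are made up for illustration.

```python
import atexit
import json
import time
import urllib.request

BACKEND_URL = "https://logs.example.com/ingest"  # hypothetical log service endpoint
API_KEY = "my-api-key"                           # authentication (placeholder)

_buffer = []

def emit(severity, message, **attributes):
    # Aggregation: buffer records in memory and ship them in batches.
    _buffer.append({
        "timestamp": time.time(),
        "severity": severity,
        "message": message,
        "attributes": attributes,
    })
    if len(_buffer) >= 100:
        flush()

def flush():
    if not _buffer:
        return
    body = json.dumps(_buffer).encode()
    # Retry and backoff: tolerate temporary outages of the log service.
    for attempt in range(3):
        try:
            req = urllib.request.Request(
                BACKEND_URL,
                data=body,
                headers={
                    "Authorization": f"Bearer {API_KEY}",  # authentication
                    "Content-Type": "application/json",    # format and transport
                },
            )
            urllib.request.urlopen(req, timeout=5)
            _buffer.clear()
            return
        except OSError:
            time.sleep(2 ** attempt)  # exponential backoff before the next attempt

# Flushing: make sure buffered records are not lost when the application exits.
atexit.register(flush)
```

Multiply this by every service and every language in your system, and the maintenance burden becomes obvious.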

Let's review other options...

Log Pipeline with Sidecar Agent

Since most loggers are very good at writing to stdout/stderr or files, let's let them do that efficiently, and delegate the rest of the work to a sidecar container or a K8s DaemonSet agent.

The agent tails the log files, and takes care of formatting, batching, exporting, retrying, etc. This way, the application code is not aware of any of this, and we can configure the agent to do all the work for us in a separate container.

This is a very common pattern these days, and looks like this:

Logs Pipeline with Agent

Nice, we can now configure the agent once, and all the applications will benefit from it. We can also use the same agent for multiple languages and frameworks. The CPU and I/O work happens away from the business logic and leaves a small resource footprint on the application. This is a good solution, but it still has some drawbacks:

  • Another component to deploy and maintain
  • The agent consumes resources on the host which translates to higher cost
  • There are multiple formats involved - the log record format in code is translated to a file format (usually JSON lines), and the agent reads that file format and converts it once again to the destination format. It is easy to mess up and lose information along the way

It is common to use configuration files to tell the agent what to do.
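Agents such as Fluent Bit, Vector, or the OpenTelemetry Collector are driven by that configuration rather than code, but to make the division of labor concrete, here is a rough, hypothetical Python sketch of the tail-and-enrich step an agent performs (the file path and attribute names are illustrative):

```python
import json
import socket
import time

LOG_FILE = "/var/log/myapp/app.log"  # file the application writes to (illustrative)

def tail_and_enrich():
    """Follow the log file and yield records enriched with host-level context."""
    with open(LOG_FILE) as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.2)  # wait for the application to write more
                continue
            record = json.loads(line)  # parse the JSON line written by the app
            # Enrich with attributes the application does not need to know about.
            record.setdefault("attributes", {})["host.name"] = socket.gethostname()
            yield record  # hand off to the batching/exporting stages downstream
```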

Logs Gateway

In the previous section, we delegated all the logs processing tasks to the agent, which runs on each pod / node. Why not take it one step further and delegate all the work to a dedicated service? This is the idea behind the logs gateway pattern:

Logs Pipeline with Gateway

Great, now we have a dedicated service that takes care of all the log processing tasks. We can configure it once, monitor it once, and all the applications will benefit from it. The CPU and I/O required for exporting happen on dedicated resources, away from the applications.

The obvious drawback is that we now have another service to deploy and maintain, and we need to make sure it scales with the load. We also need to make sure it is highly available and does not become a single point of failure.

Other than that, we still face the challenge of sending the logs from the application to the gateway. This may still require the application to handle remote sockets, retries, backoff, flushing, format conversions, etc. Lastly, if we have multiple languages and frameworks, it will be harder to maintain.

Logs Agent with Gateway

To get the best of all worlds, we can combine the above approaches and have a dedicated gateway service plus an agent on the node/pod. This way we can do anything common to all applications in the gateway once, and have the agent take care of exporting from the application to the gateway:

Logs Pipeline Agent with Gateway

The main drawback here is the complexity, but it offers a very robust solution that targets all the requirements we mentioned above in an elegant way.

How to choose?

The above are just common examples, and there are many more. You have a toolbox of components which you can configure, combine, and compose into your own pipeline.

You should balance between the following factors:

  • Work - More components mean more to maintain, configure, learn, monitor, integrate, etc. If you are a small team, you might want to keep it simple
  • Complexity - If you aren't sensitive to latency, you can consider a simpler, less performance-sensitive solution
  • Operations - Log pipelines are commonly managed by the DevOps team, for whom it is simpler to set up agents and gateways than to integrate SDKs into each application's source code

OpenTelemetry

Where does OpenTelemetry fit in? OpenTelemetry is about telemetry, and telemetry is any signal that helps us understand the internal state of our system (when debugging, operating, or monitoring it). A signal is a type of telemetry data, and logs are perhaps the most used and best understood signal due to their ubiquity.

Using OpenTelemetry for log pipelines means:

  • Using the OpenTelemetry protocol (OTLP) to encode and transport log records between components in the pipeline
  • Using the OpenTelemetry Collector, which is a vendor agnostic component that can be used as a gateway, agent, or both (or for other processing tasks in the pipeline or in your vendor's backend)
  • Using the OpenTelemetry SDKs to instrument your code and produce log records (a brief sketch follows this list)
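As a rough sketch of the SDK piece in Python: note that log support in opentelemetry-python was still marked experimental at the time of writing, so module paths and parameter names may differ; treat this as illustrative rather than definitive. It bridges the standard logging module into OpenTelemetry and exports records over OTLP to a local collector (agent or gateway).

```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

# Describe the emitting service; these attributes ride along with every record.
provider = LoggerProvider(resource=Resource.create({"service.name": "checkout"}))
set_logger_provider(provider)

# Batch records and export them over OTLP/gRPC to a local collector.
provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="localhost:4317", insecure=True))
)

# Bridge Python's standard logging module into OpenTelemetry.
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))
logging.getLogger("checkout").info("order placed")
```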

The OpenTelemetry ecosystem is rich with modern, open-source, vendor-agnostic, high-performance tools and components. These components can help you build a robust log pipeline that scales with your needs and ships logs to the backend of your choice.

By using a single data model and a single compatible protocol, you can mix and match components from different vendors and projects to build your own log pipeline that fits your needs. Since there are fewer conversions and translations, you can be confident you are delivering high-quality observability data to your backend.

Another major benefit is that you can use the same components and protocols to instrument your code for logs, metrics, and traces. You can even correlate between them and get a holistic view of your system.

OpenTelemetry is part of the Cloud Native Computing Foundation and is expected to become the de facto standard for telemetry in the cloud native world. It is backed by all major cloud vendors and monitoring providers and is a safe bet for the future.

If you are starting a new project, or looking to improve your existing log pipelines, I highly recommend you check out OpenTelemetry.

OpenSource and Managed Services

While you can build a pipeline yourself and take care of configuring, maintaining, and operating it, there are also open-source tools and SaaS services that can do just that for you, so you can keep your focus on your business logic.

One such tool is Odigos, an open-source tool that automatically instruments your cloud services and ships logs (as well as traces and metrics) to your backend of choice. It is a great way to adopt OpenTelemetry instantly, without any code changes, and it automatically deploys a modern log pipeline with no manual effort.

Want to learn more about Odigos or talk with us directly about log pipelines and observability? Join our Slack channel here

Summary

Logs are an essential component of any software system, and a crucial factor for success lies in constructing log pipelines that transport logs to your chosen backend. Investing in observability yields long-term benefits in terms of reduced downtime, enhanced system stability, and more efficient engineers equipped with the necessary tools to manage a cloud-native system effectively.

OpenTelemetry stands out as an exceptionally adaptable and efficient log pipeline implementation, offering substantial assistance in achieving these goals.

If you want to learn more about how you can generate distributed traces instantly, check out our GitHub repository. We'd really appreciate it if you could throw us a ⭐👇
https://github.com/keyval-dev/odigos
