Log pipelines are essential for operating any system, but especially cloud native systems. Simply adding log statements to your code is not enough. Logs need to be shipped to a central place where they can be consumed and processed to deliver value to your organization.
But let's go back a few steps and start from the very beginning.
I am sure we all remember our first days of programming, writing simple programs and practicing our new skills. When things worked as expected, we were happy and excited. But when things didn't work, we needed some way to understand what went wrong. A simple way to do this is by just printing things to the console. This is probably the simplest form of observability, allowing us to examine the internal events (loop iterations, function executions) in our programs, as well as some useful context like the values of variables.
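A minimal sketch of this style of print debugging (the function and field names are illustrative, not from any real codebase):

```python
def process_orders(orders):
    total = 0
    for i, order in enumerate(orders):
        # Print-debugging: expose the loop iteration and the variable values
        print(f"iteration {i}: processing order={order}")
        total += order["amount"]
    print(f"finished: total={total}")
    return total

process_orders([{"amount": 10}, {"amount": 5}])
```

Each `print` is, in effect, a primitive log record: an event plus its context, written to the console.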
While this is a very simple and common way to debug our program during development, it does not work in production:
So as we can see, printing to the console might work in development, but it is not a good solution for production.
The next step in the evolution of logging is redirecting these prints to a file on the local filesystem. This is a common practice in cloud native software. It solves some of the problems we mentioned above: we can now log in to a remote machine to consume those files, and we can use tools like `grep` to search and filter the data we are interested in.
On the flip side, we still have to deal with some of the old problems:
Moreover, this introduces us to a new set of challenges:
Those primitive prints we described above are commonly referred to as log records. Logs record events that happen in our program which a developer might find valuable to examine later.
While you can record any event you want, popular events include:
While there is a wealth of information to delve into regarding logs, for the scope of our current discussion, let's briefly touch on some of the common attributes found in log records:
As with anything else in software, there is no need to reinvent the wheel. There are plenty of libraries that can help us produce log records in a consistent and friendly way.
Those libraries are commonly referred to as loggers, and your programming language and runtime probably has a few of them to choose from.
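As a sketch of what such a logger produces, here is a minimal example using Python's standard `logging` module with a formatter that emits the common attributes mentioned above as structured JSON (the field names are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record with common log attributes: timestamp, severity, logger name, body."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "logger": record.name,
            "body": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order created")
```

Emitting records as structured data rather than free text makes them much easier to index, filter, and aggregate later in the pipeline.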
Ok, so writing log files to the local filesystem is a good start, but it is not a good solution for a real system with dozens of components, hosts, and large amounts of traffic. We can't start `grep`-ing on each host to find the log file we are interested in. We also want results fast and efficiently, which usually requires some indexing and aggregation of the logs.
We need a way to collect those logs from all the instances of our program, and store them in a central place.
To the rescue come log services, or log backends. Instead of storing each log record in a file on the local host filesystem, we now ship these logs to a central service, which will significantly reduce our workload:
and many many more.
These services can be open-source or commercial, and you can run them yourself or use a managed SaaS offering. Popular self-managed options include Grafana Loki and Elasticsearch; popular managed services include Datadog, Splunk, Coralogix, Grafana Cloud, AWS CloudWatch, Google Cloud Logging, Azure Monitor, and many more.
Ok, so we have implemented loggers in our code that produce log records for anything we might consider interesting and worth recording. We have also researched and chosen a log backend (self-managed or paid SaaS) to which we want to ship logs. But how do we connect the two? How do we get the log records from our code to the log service?
This is where things get interesting. There are many ways to do this, and you should choose the right one for your use case. We will get to that soon.
So why the term pipelines? We already described the first part of the pipeline - the logger, where log records are born. We also described the last part of the pipeline - the log service, where log records are stored and consumed. The pipeline consists of all the components that a log record encounters on its journey - from recording until it is consumed downstream.
Now, we can simply send the log records from our application to the log service, and we are done, right?
Well, not so fast.
Let's discuss some of the practices and common requirements to do this properly:
Many tasks to take care of. Of course, we do not need to implement any of this ourselves. The ecosystem is full of libraries and tools that can help us with that. All we need to do is decide which ones to use and which tasks to delegate to each one.
As stated above, the simplest log pipeline will require no additional components. We will simply use a logger in our code to produce log records, and hook up an exporter to send them to the final destination:
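A minimal sketch of such an in-application pipeline: a custom handler that buffers records and hands batches to an export function. The export function here is a stand-in for a real network exporter; a production setup would add retries, backoff, and flushing on shutdown:

```python
import logging

class BatchingExportHandler(logging.Handler):
    """Buffer formatted records in memory and hand full batches to an export function."""
    def __init__(self, export, batch_size=10):
        super().__init__()
        self.export = export          # e.g. an HTTP POST to your log backend
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, record):
        self.buffer.append(self.format(record))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.export(self.buffer)  # retries/backoff would wrap this call
            self.buffer = []

sent = []  # stub destination standing in for the remote log service
handler = BatchingExportHandler(sent.append, batch_size=2)
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("first")
log.info("second")   # batch size reached, batch is exported
print(sent)
```

Even this toy version hints at how much pipeline logic (batching, flushing, error handling) ends up living inside every application with this approach.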
This is a working setup, but it has many drawbacks: we will need to be aware of all the points mentioned above and implement them ourselves in each and every application we write. While possible, this scales poorly, is error prone, and is hard to maintain.
Let's review other options...
Since most loggers are very good at writing to `stderr` or files, let's let them do that efficiently, and delegate the rest of the work to a sidecar container or a Kubernetes DaemonSet agent.
The agent tails the log files, and takes care of formatting, batching, exporting, retrying, etc. This way, the application code is not aware of any of this, and we can configure the agent to do all the work for us in a separate container.
This is a very common pattern these days, and looks like this:
Nice, we can now configure the agent once, and all the applications will benefit from it. We can also use the same agent for multiple languages and frameworks. The CPU and I/O work is done away from the business logic and leaves a small resource footprint on the application. This is a good solution, but it still has some drawbacks:
It is common to use configuration files to tell the agent what to do.
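For illustration, a node-level agent configuration in the style of the OpenTelemetry Collector (the paths and endpoint are placeholders for your environment, and the `filelog` receiver is part of the Collector's contrib distribution):

```yaml
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]   # tail container log files on the node
processors:
  batch: {}                              # batch records before export
exporters:
  otlp:
    endpoint: logs-gateway.example.com:4317   # placeholder destination
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [otlp]
```

The application only writes to files or `stderr`; tailing, batching, and exporting are entirely the agent's concern.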
In the previous section, we delegated all the log processing tasks to the agent, which runs on each pod or node. Why not take it one step further and delegate all the work to a dedicated service? This is the idea behind the logs gateway pattern:
Great, now we have a dedicated service that takes care of all the log processing tasks. We can configure it once, monitor it once, and all the applications will benefit from it. The CPU and I/O required for exporting happens on dedicated resources, away from the applications.
The obvious drawback is that we now have another service to deploy and maintain, and we need to make sure it scales with the load. We also need to make sure it is highly available and does not become a single point of failure.
Other than that, we still face the challenge of sending the logs from the application to the gateway. This may still require the application to handle remote sockets, retries, backoff, flushing, format conversions, etc. Lastly, if we have multiple languages and frameworks, it will be harder to maintain.
To get the best of all worlds, we can combine the approaches above and run a dedicated gateway service plus an agent on each node/pod. This way we can do anything common to all applications in the gateway once, and have the agent take care of exporting from the application to the gateway:
The main drawback here is the complexity, but it offers a very robust solution that targets all the requirements we mentioned above in an elegant way.
The above are just common examples, and there are many more. You have a toolbox of components which you can configure, combine, and compose into your own personal pipeline.
You should balance between the following factors:
Where does OpenTelemetry fit in? OpenTelemetry is about telemetry, and telemetry is any signal that helps us understand the internal state of our system (when debugging, operating, or monitoring it). A signal is a type of telemetry data, and logs are perhaps the most used and best understood signal due to their ubiquity.
Using OpenTelemetry for log pipelines means:
The OpenTelemetry ecosystem is rich with modern, open-source, vendor-agnostic, high-performance tools and components. These components can help you build a robust log pipeline that scales with your needs and ships logs to the backend of your choice.
By using a single data model and a single compatible protocol, you can mix and match components from different vendors and projects to build your own log pipeline that fits your needs. Since there are fewer conversions and translations, you can be sure to deliver high quality observability data to your backend.
Another major benefit is that you can use the same components and protocols to instrument your code for logs, metrics, and traces. You can even correlate between them and get a holistic view of your system.
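For illustration, this is roughly what correlation looks like at the log record level: each record carries its trace context so the backend can join logs with the traces they belong to. The `trace_id`/`span_id` values below are made up for the example; in practice the OpenTelemetry SDK injects the real ones:

```python
import json
import logging

class CorrelatedFormatter(logging.Formatter):
    """Attach trace context to every log line so logs can be joined with traces."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "body": record.getMessage(),
            # Illustrative: real SDKs read these from the active span context
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(CorrelatedFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)
# The `extra` fields become attributes on the log record
log.info("payment captured", extra={"trace_id": "4bf92f35", "span_id": "00f067aa"})
```

A backend that receives both this record and the trace with id `4bf92f35` can link them, which is what enables jumping from a log line straight to the distributed trace that produced it.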
OpenTelemetry is part of the Cloud Native Computing Foundation, and is expected to become the de facto standard for telemetry in the cloud native world. It is backed by all major cloud vendors and monitoring providers, and is a safe bet for the future.
If you are starting a new project, or looking to improve your existing log pipelines, I highly recommend you check out OpenTelemetry.
While you can build a pipeline yourself and take care of configuring, maintaining, and operating it, there are also open-source tools and SaaS services that can do just that for you, so you can keep your focus on your business logic.
One such tool is Odigos, an open-source tool that automatically instruments your cloud services and ships logs (as well as traces and metrics) to your backend of choice. It is a great way to adopt OpenTelemetry instantly, without any code changes, automatically deploying a modern log pipeline with no manual effort.
Want to learn more about Odigos, or talk with us directly about log pipelines and observability? Join our Slack channel here.
Logs are an essential component of any software system, and a crucial factor for success lies in constructing log pipelines that transport logs to your chosen backend. Investing in observability yields long-term benefits in terms of reduced downtime, enhanced system stability, and more efficient engineers equipped with the necessary tools to manage a cloud-native system effectively.
OpenTelemetry stands out as an exceptionally adaptable and efficient log pipeline implementation, offering substantial assistance in achieving these goals.