What is SnappyFlow?

SnappyFlow is a comprehensive monitoring and log management solution built for the needs of today’s cloud-native applications. This blog highlights SnappyFlow’s signature analysis feature, which helps users troubleshoot more effectively, reduce noise and cut log storage costs.

Monitoring

Infrastructure monitoring
Application monitoring
Kubernetes monitoring
Cloud services monitoring

Log Management

Out-of-box standard parsers
Search & Analytics
Feature extraction at ingest
Signature based filtering

APM

Trace services, transactions, spans
Multi-service analysis
Asynchronous analysis
Anomalous span analysis
Jaeger integration

Dashboards

Powerful dashboard builder
Pre-built dashboards & auto recommendation
Rich correlation within application context

Alerts

Pre-built alert library with auto-recommendation
Auto-thresholding
Integration with multiple notification systems
Noise reduction constructs

Easy Onboarding

sfAgent, sfPoller, sfPod
Single agent for Metrics, Logs, Tracing
Simple discovery & configuration
Multi-cloud support

What is the Problem We Solve for?

Logs provide valuable insights about an application. They are useful to troubleshoot issues with the application, track user access, understand usage of the application’s features, track load patterns and so on. Consequently, log analysis and log management solutions have become a “must-have” in an SRE’s tool repertoire, all the more so with the growing complexity of the cloud-native stacks that the SRE needs to manage.

There are several good log management solutions in the market but most have two pronounced drawbacks:

Triaging issues is not easy. Users have to find a trail of relevant logs in a vast deluge of logs, and the effort of cutting through the noise has a direct impact on resolution time.
As the size of deployments, the number of deployments and the load grow, the volume of logs grows rapidly, and the cost grows with it.

What are Log Signatures in SnappyFlow?

Log Signature is a unique feature in SnappyFlow that is used to reduce noisy logs, which improves triaging and reduces log storage costs.

A signature is a string pattern present in a log that uniquely identifies it. String patterns used to define signatures can contain the variables $w and $i. The variable $w represents a word consisting of alphanumeric characters, and $i represents a decimal number.

For example, the signature “missed heartbeat from $w@” would uniquely identify these logs:

"missed heartbeat from provision@stage-apm-sfapm-apm-celery-provision-5579fbffc9-st9dm"
"missed heartbeat from notify@stage-apm-sfapm-apm-celery-notify-78b46bd6cc-85shb"
"missed heartbeat from default@stage-apm-sfapm-apm-celery-default-6c44857687-4zx2t"
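One way to picture how such a pattern could be matched is to translate it into a regular expression. The sketch below is only an illustration and not SnappyFlow's implementation: the function names are hypothetical, and it assumes that $w maps to one or more alphanumeric characters, that $i maps to a run of digits, and that a signature may match anywhere in the log line.

```python
import re

def signature_to_regex(signature: str) -> re.Pattern:
    """Compile a signature string into a regular expression.

    Assumed semantics (not confirmed by SnappyFlow documentation):
    $w -> one or more alphanumeric characters
    $i -> one or more decimal digits
    Everything else is treated as literal text.
    """
    pieces = []
    # The capturing group keeps the $w / $i tokens in the split result.
    for part in re.split(r"(\$w|\$i)", signature):
        if part == "$w":
            pieces.append(r"[A-Za-z0-9]+")
        elif part == "$i":
            pieces.append(r"\d+")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))

def matches(signature: str, log_line: str) -> bool:
    """Return True if the signature occurs anywhere in the log line."""
    return signature_to_regex(signature).search(log_line) is not None

SIGNATURE = "missed heartbeat from $w@"
logs = [
    "missed heartbeat from provision@stage-apm-sfapm-apm-celery-provision-5579fbffc9-st9dm",
    "missed heartbeat from notify@stage-apm-sfapm-apm-celery-notify-78b46bd6cc-85shb",
    "missed heartbeat from default@stage-apm-sfapm-apm-celery-default-6c44857687-4zx2t",
]
print(all(matches(SIGNATURE, line) for line in logs))  # True
```

The literal “@” at the end of the signature is what anchors $w to the worker name, so the pod-specific suffix after it does not affect the match.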

A Signature Group is a set of signatures that are related to a problem or a workflow.

Users can perform the following operations related to Log Signatures in SnappyFlow:

Add or delete a signature
Group multiple signatures into a group
Get volume statistics of logs based on a signature or a group
Hide or Unhide logs belonging to a signature or a group
Show only logs belonging to a signature or a group
Stop or Restart collection to the primary store of logs that belong to a signature or a group
Stop or Restart collection to the archive of logs that belong to a signature or a group

SnappyFlow’s Overall Signature Analysis Flow

So, How Do Signatures Help Users?

A large proportion of logs are of very little or no value. Many of these just add to volume and cost. We have seen situations where 80-95% of logs may belong to this category. Users live with them because it is not easy to selectively turn them off at the source.

With SnappyFlow, users can turn the collection of a log on or off with a single click. In the example below, just 2 clicks turn off 2 logs that take up 40% of storage space and have very little value. If the user does want to retain a log for future use, it can still be stored in the archive with 10-40x compression and searched as needed.

When a critical issue occurs and an SRE is racing to troubleshoot it, the SRE first has to wade through a ton of noisy logs to get to the few logs of interest. The experience can range from irritating to frustrating depending on the situation at hand.

With SnappyFlow, if a user finds a noisy log and wants to mask it out, all the user has to do is “hide” the log or set of logs, and they will be removed from the log view.
Depending on the problem being investigated, the user’s field of interest is a finite set of logs, and these logs of interest vary from problem to problem. The user would ideally like to see the trail of these logs of interest, i.e., when, where & how many, and easily mask out everything else.

This is not possible in most log management solutions, where the workflow is fairly cumbersome. Users typically filter logs based on log level, instance and file, after which they search for individual logs or scroll through logs to find what they are looking for. This is a time-consuming process with a big impact on resolution time.

Suppose a user is debugging an OOM issue and is interested in a set of 8-10 logs to understand the behavior of the application. The user can group these logs into a group called “OOM” and show only the logs that belong to the “OOM” group. Over time, the user can create multiple such groups that correspond to playbooks for specific issues.
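The “hide” and “show only” operations on a group can be pictured as two complementary filters over the log view. The following is a minimal sketch under stated assumptions, not SnappyFlow's implementation: the class and function names are hypothetical, the OOM signatures and sample log lines are made up for illustration, and $w and $i are assumed to match alphanumeric words and runs of digits.

```python
import re
from typing import List

def compile_signature(signature: str) -> re.Pattern:
    """Assumed semantics: $w = alphanumeric word, $i = digits, rest literal."""
    tokens = {"$w": r"[A-Za-z0-9]+", "$i": r"\d+"}
    parts = re.split(r"(\$w|\$i)", signature)
    return re.compile("".join(tokens.get(p, re.escape(p)) for p in parts))

class SignatureGroup:
    """A named set of signatures, e.g. a playbook for one kind of issue."""

    def __init__(self, name: str, signatures: List[str]):
        self.name = name
        self.patterns = [compile_signature(s) for s in signatures]

    def contains(self, log_line: str) -> bool:
        # A log belongs to the group if any of its signatures matches.
        return any(p.search(log_line) for p in self.patterns)

def show_only(log_lines: List[str], group: SignatureGroup) -> List[str]:
    """Keep only logs that belong to the group."""
    return [line for line in log_lines if group.contains(line)]

def hide(log_lines: List[str], group: SignatureGroup) -> List[str]:
    """Remove logs that belong to the group from the view."""
    return [line for line in log_lines if not group.contains(line)]

# Hypothetical signatures and log lines, for illustration only.
oom = SignatureGroup("OOM", ["invoked oom-killer", "Out of memory: Kill process $i"])
view = [
    "java invoked oom-killer: gfp_mask=0x0",
    "GET /health 200",
    "Out of memory: Kill process 4321 (java)",
]
print(show_only(view, oom))  # the two OOM-related lines
print(hide(view, oom))       # ['GET /health 200']
```

Because a group is just a list of signatures, the same two filters serve both use cases: hiding a group of noisy logs, or showing only the logs in a troubleshooting playbook.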

What is trace retention?

Tracing is an indispensable tool for application performance management (APM) providing insights into how a certain transaction or a request performed – the services involved, the relationships between the services and the duration of each service. This is especially useful in a multi-cloud, distributed microservices environment with complex interdependent services. These data points in conjunction with logs and metrics from the entire stack provide crucial insights into the overall application performance and help debug applications and deliver a consistent end-user experience.

Among all ingested observability data, trace data is typically stored for only an hour or two. This is because trace data by itself is enormous. A single transaction involves multiple services or APIs, and an organization may run thousands of business transactions an hour, which translates to hundreds of millions of API calls an hour. Storing traces for all these transactions would need terabytes of storage and extremely powerful compute engines for indexing, visualization, and search.

Why is it required?

To strike a balance between storage/compute costs and troubleshooting ease, most organizations choose to retain only a couple of hours of trace data. What if we need historical traces? Modern APM tools like SnappyFlow can intelligently and selectively retain certain traces beyond this limit of a couple of hours. This is enabled for important API calls and for calls that the tool deems anomalous. In most troubleshooting scenarios, we do not need all the trace data. For example, a SaaS-based payment solutions provider would want to monitor the more important APIs/services related to payments rather than, say, customer support services.

Intelligent trace retention with SnappyFlow

SnappyFlow by default retains traces for

HTTP requests with durations > 90th percentile (anomalous incidents)

In addition to these rules, users can specify additional rules that filter on services, transaction types, request methods, response codes and transaction duration. These rules are run every 30 minutes and all traces that satisfy these conditions are retained for future use.
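The selection logic described here can be sketched as a periodic pass over a window of traces: keep anything slower than the 90th percentile, plus anything that matches a user-defined rule. This is an illustrative sketch only and not SnappyFlow's implementation: the Trace fields, the function names and the sample data are hypothetical, and the way the percentile is computed (Python's statistics.quantiles with its default method) is an assumption.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Callable, Iterable, List

@dataclass
class Trace:
    service: str
    method: str
    status: int
    duration_ms: float

Rule = Callable[[Trace], bool]

def p90(durations: List[float]) -> float:
    """90th percentile: the last of the 9 cut points that split the data in 10.

    statistics.quantiles needs at least two data points.
    """
    return quantiles(durations, n=10)[-1]

def select_for_retention(window: List[Trace],
                         extra_rules: Iterable[Rule] = ()) -> List[Trace]:
    """Pick traces from one evaluation window (e.g. 30 minutes) to retain."""
    rules = list(extra_rules)
    threshold = p90([t.duration_ms for t in window])
    return [
        t for t in window
        if t.duration_ms > threshold        # default rule: slower than p90
        or any(rule(t) for rule in rules)   # user-defined rules
    ]

# Hypothetical window: nine fast requests, one very slow one, one fast error.
window = [Trace("cart", "GET", 200, 100.0 + i) for i in range(9)]
window.append(Trace("payments", "POST", 200, 5000.0))
window.append(Trace("payments", "POST", 500, 50.0))

# Example user rule: also retain server errors, regardless of duration.
retained = select_for_retention(window, [lambda t: t.status >= 500])
for t in retained:
    print(t.service, t.status, t.duration_ms)
```

Expressing rules as plain predicates keeps the default and user-defined conditions composable: a rule on service, request method, response code or duration is just another function over a trace.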

With the built-in trace history retention and custom filters enabled, SREs and DevOps practitioners can look further back to understand historical API performance, troubleshoot effectively and provide end-users with a consistent and delightful user experience.