Tracing in Application Performance Troubleshooting - Part 1/3
By Pramod Murthy
This is the first part of a three-part series covering the basics of tracing, how to use trace data for troubleshooting, and how to set up tracing in SnappyFlow.
Why Tracing?
Automation of business processes has radically changed the application landscape. Traditionally, an application carried out a basic business procedure and stored the results of that process in a database. Multiple such applications were built, each independently implementing a simple process, and collaboration between them took place offline. As the need to provide real-time, online services grew, applications became more complex and interacted with one another in real time. Applications evolved from a simple sequential execution model to a distributed, concurrent execution model.
Tracking Down Problems with Distributed, Asynchronous and Concurrent Applications is Hard
Traditionally, failures were tracked by their symptoms: alerts based on time-series data or error events in logs. Once a symptom was found, the root cause was identified by analyzing logs and/or correlating time-series metrics from different applications and instances. Because of the asynchronous and concurrent nature of execution, it is very difficult to reconstruct the exact sequence of events that led to a failure by these traditional means.
Tracing stitches together or aggregates the actions performed by applications to service a request. This aggregated data is then presented in a chronologically organized manner for analysis. A context is created when a request is received by an application and the request is tracked, using this context, through all the execution paths in the application(s). At each execution path, the entry and exit times along with other useful information are logged. The traces thus collected are analyzed using powerful visualizations to identify the hot spots and bottlenecks.
Trace data consists of traces, transactions and spans. A trace is a collection of transactions and spans that share a common starting point, or root. A span records information about the activity in one execution path: its start and end times, other useful attributes, and its parent-child relationship with other spans. A transaction is a special span that is captured at the entry point of a service, such as an HTTP/RPC handler, a message broker consumer or a cron job.
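To make the terminology concrete, the sketch below models a trace as a tree of spans, with the root transaction as a special span recorded at the service entry point. The class and field names are illustrative only, not SnappyFlow's internal schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """One unit of work on an execution path (DB query, HTTP call, etc.)."""
    name: str
    start_ms: float                    # entry time
    end_ms: float                      # exit time
    parent_id: Optional[str] = None    # parent span; None for the root
    span_id: str = ""
    tags: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

@dataclass
class Transaction(Span):
    """A special span captured at a service entry point (HTTP/RPC handler, consumer, cron job)."""
    result: str = "success"

@dataclass
class Trace:
    """All transactions and spans that share a common root."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)
```

Walking the parent-child links from the root transaction down through its spans is what lets a tracing tool reconstruct the chronological view described above.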
Using Trace Data to Troubleshoot Effectively
Trace data is used to quickly identify the root cause of a failure. In asynchronous, concurrent applications, failures or delays occur in one of many execution paths, so detecting them effectively requires powerful visualization and analysis tools. To troubleshoot a failure, a user needs to know:
Contextual view of execution – an easy-to-follow view of the sequence in which the transaction execution progressed and the time taken in each step
Child transactions and spans – delays in child transactions can contribute to the overall delay. Prior execution times (average, median, 95th percentile, 99th percentile) help compare the current execution with previous runs
Time spent in each span and comparison with prior executions – it is important to know how the current span duration ranks in comparison to previous runs, typically by comparing it with the average, median, 95th percentile and 99th percentile values
Percentage of time spent by a span with respect to overall time – this helps identify hot spots in the execution
Cumulative span execution time – computed after accounting for span parallelism, this value measures the delay contributed by all spans to the overall delay. The gap between the cumulative span execution time and the total transaction duration indicates time the transaction spent in additional processing or waiting on resources: I/O wait, DB locks, compute time, event loop saturation, etc. (a sketch follows this list)
Stack traces – stack traces list the stack frames from the point where execution failed back to the start of the application, making it easy to identify the failing execution path and pinpoint the reason for the failure
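As referenced above, here is a hedged sketch of how cumulative span execution time can be computed while accounting for span parallelism: overlapping span intervals are merged so concurrent work is not double-counted, and the gap to the total transaction duration hints at time spent in the transaction's own processing or waiting on resources. The helper name and numbers are illustrative, not SnappyFlow functions.

```python
from typing import List, Tuple

def cumulative_span_time(intervals: List[Tuple[float, float]]) -> float:
    """Merge overlapping (start, end) span intervals and sum the merged lengths,
    so parallel spans are counted only once."""
    if not intervals:
        return 0.0
    merged_total = 0.0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start <= cur_end:             # overlaps the current merged interval
            cur_end = max(cur_end, end)
        else:                              # disjoint: close out the previous interval
            merged_total += cur_end - cur_start
            cur_start, cur_end = start, end
    merged_total += cur_end - cur_start
    return merged_total

# Two parallel DB spans (10-40 ms, 15-35 ms) and one cache span (50-60 ms)
spans = [(10.0, 40.0), (15.0, 35.0), (50.0, 60.0)]
transaction_duration = 80.0
covered = cumulative_span_time(spans)    # 40 ms of wall-clock time covered by spans
gap = transaction_duration - covered     # 40 ms of own processing or resource waits
print(covered, gap)
```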
SnappyFlow supports distributed tracing compliant with the OpenTracing standard. Tracing allows users to visualize the sequence of steps a transaction (whether an API call or a non-API job such as a Celery task) takes during its execution. This analysis is extremely powerful: it pinpoints the source of problems such as abnormal time spent in an execution step and identifies the point of failure in a transaction.
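For a sense of what OpenTracing-compliant instrumentation looks like in application code, here is a minimal Python sketch using the opentracing package. The tracer shown is the default no-op tracer; in practice an APM agent or concrete tracer implementation would be registered instead, and the operation and tag names below are made up for the example.

```python
import opentracing

# No-op tracer unless a real tracer implementation is registered at startup
tracer = opentracing.global_tracer()

def handle_checkout(order_id: str):
    # The entry-point span corresponds to a "transaction" in trace terminology
    with tracer.start_active_span("POST /checkout") as scope:
        scope.span.set_tag("order_id", order_id)

        # Child spans capture individual execution paths within the transaction
        with tracer.start_active_span("db.save_order") as db_scope:
            db_scope.span.set_tag("db.type", "postgresql")
            # ... write the order to the database ...

        with tracer.start_active_span("notify.payment_service") as rpc_scope:
            rpc_scope.span.set_tag("peer.service", "payments")
            # ... call the downstream payment service ...

handle_checkout("ord-42")
```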
What is trace retention?
Tracing is an indispensable tool for application performance management (APM), providing insights into how a certain transaction or request performed – the services involved, the relationships between those services and the duration of each service. This is especially useful in a multi-cloud, distributed microservices environment with complex interdependent services. These data points, in conjunction with logs and metrics from the entire stack, provide crucial insights into overall application performance and help debug applications and deliver a consistent end-user experience.
Amongst all observability ingest data, trace data is typically stored for only an hour or two, because trace data by itself is humongous. A single transaction involves multiple services or APIs, and an organization running thousands of business transactions an hour can generate hundreds of millions of API calls an hour. Storing traces for all these transactions would need terabytes of storage and extremely powerful compute engines for indexing, visualization and search.
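A rough back-of-envelope calculation shows why full retention gets expensive quickly; the per-span size and span counts below are assumptions for illustration only, not measured figures.

```python
# Illustrative assumptions only
api_calls_per_hour = 100_000_000   # "hundreds of millions" of API calls an hour
spans_per_call = 5                 # assumed average spans per call
bytes_per_span = 500               # assumed average encoded span size

bytes_per_hour = api_calls_per_hour * spans_per_call * bytes_per_span
tb_per_day = bytes_per_hour * 24 / 1e12
print(f"~{tb_per_day:.0f} TB of raw trace data per day")   # ~6 TB/day before indexing overhead
```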
Why is it required?
To strike a balance between storage/compute costs and troubleshooting ease, most organizations choose to retain only a couple of hours of trace data. What if we need historical traces? Modern APM tools like SnappyFlow can intelligently and selectively retain certain traces beyond this couple-of-hours limit. This is enabled for important API calls and for calls the tool deems anomalous. In most troubleshooting scenarios, we do not need all the trace data. For example, a SaaS-based payment solutions provider would want to monitor the APIs/services related to payments more closely than, say, customer support services.
Intelligent trace retention with SnappyFlow
SnappyFlow by default retains traces for:
HTTP requests with durations > 90th percentile (anomalous incidents)
In addition to these defaults, users can specify rules to filter traces by service, transaction type, request method, response code and transaction duration. These rules run every 30 minutes, and all traces that satisfy them are retained for future use.
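The sketch below is a hypothetical illustration of how such retention rules could be evaluated against trace metadata. The rule fields mirror the filters mentioned above, but the structure and field names are not SnappyFlow's actual configuration format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RetentionRule:
    # All fields optional; a None field matches any value
    service: Optional[str] = None
    transaction_type: Optional[str] = None
    request_method: Optional[str] = None
    response_code: Optional[int] = None
    min_duration_ms: Optional[float] = None

    def matches(self, trace: dict) -> bool:
        return (
            (self.service is None or trace.get("service") == self.service)
            and (self.transaction_type is None or trace.get("type") == self.transaction_type)
            and (self.request_method is None or trace.get("method") == self.request_method)
            and (self.response_code is None or trace.get("status") == self.response_code)
            and (self.min_duration_ms is None or trace.get("duration_ms", 0) >= self.min_duration_ms)
        )

# Retain slow payment-service requests and any failed (HTTP 500) requests
rules = [
    RetentionRule(service="payments", min_duration_ms=500),
    RetentionRule(response_code=500),
]

trace = {"service": "payments", "type": "request", "method": "POST",
         "status": 200, "duration_ms": 730}
retain = any(rule.matches(trace) for rule in rules)   # True: slow payments request
```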
With built-in trace history retention and custom filters enabled, SREs and DevOps practitioners can look further back to understand historical API performance, troubleshoot effectively and provide end users with a consistent and delightful experience.