Pramod Murthy

Published On :

This is part third of the three part tracing series. To illustrate how SnappyFlow can be used to analyze application performance, we will consider a sample application and walk through the traces. The sample application provides users the ability to provision computing instances in different public cloud providers.

User creates a request to provision instances using the web UI provided by this application
Application accepts the request after validation, and provides a task ID back to the client
Client UI monitors the status of the task progress using the task ID
Application is implemented as an asynchronous concurrent application and has a provision-manager service and one or more provision-worker service
Communication between the provision-manager and provision-worker services is via a message broker – Kafka

Application topology is shown in the figure below:

The execution flow for the various user requests is summarized in the table below:

Method URL	Action
POST /api/provision	Request is processed by provision-manager. Task details are updated in DB. Sends task execution request to worker pool via Kafka. Responds to the request with 202 ‘Accepted’. One of the workers reads the request from Kafka. Worker accepts the task and sends “in-progress” notification to the manager via Kafka. Worker continues to process the request. Manager reads the request from notification-topic and updates the task status. Worker Completes the provisioning task and sends “completed” notification to the manager via Kafka. Manager reads the notification and updates the task status
PUT /api/provision/{taskID}	Manager updates the task details synchronously and sends 200 OK response
DELETE /api/provision/{taskID}	Manager deletes the task details synchronously and sends 200 OK response
GET /api/workers/get-counters	Manager accepts the request. Forwards the request to worker. Worker responds to the forwarded request with the required data. Manager responds to the original request using the data received from worker
GET /api/workers/reset-counters	Manager accepts the request. Forwards the request to worker. Worker responds to the forwarded request with the required data. Manager responds to the original request using the data received from worker

Flame Graph Showing the Sequence of Execution

In the example application, when a client request is received in provision-manager, a trace context is created and added to that request. When this request traverses through the processing functions like Kafka producer, Kafka consumer, dB query, http processing etc. the trace context is extracted and maintained. The trace data produced at each stage of execution will contain this trace context. The trace context helps stitching together the events and visualization of the execution path.

The figure above shows a flame graph showing the initial execution of the “POST /api/provision”. The sequence diagram shows the 3 steps:

Provision-manager receiving the request, updating the task database, sending notification to provision-worker through Kafka-topic. Notice how the trace context, kept track of the communication between provision-manager, Kafka producer and provision-worker
Provision-worker reads the task details from Kafka and starts processing. It sends a notification to provision-manager to inform the task status
Provision-manager reads the notification from provision-worker and updates the task status in DB

All traces where captured by sfTrace, without any modification to the application code. sfTrace, instrumented the java classes at runtime to introduce the trace context and collect the start and end of each action.

Flame graph can help identify gaps in execution. If there are significant gaps in execution between transactions or spans those get highlighted in flame graph. These gaps may indicate probable delays caused because of resource constraints or other scheduling issues.

Related Transactions

The journey of a trace through various services is summarized in the Transactions View. These services may get invoked through HTTP/RPC request, message broker service or similar inter-process communication frameworks. The transaction summary view, lists the related transactions in a chronological order along with duration spent in each transaction, performance stats from previous runs, and the percentage duration contributed by each related transaction with reference to the overall duration.

If a related transaction consumes higher time than the average or 95 percentile stats from previous runs, that transaction could be causing a potential delay and will need to be analyzed in detail using the Span view.

The percentage duration spent by each transaction provides an easy way to identify hot spots in execution.

Span List

Each transaction is broken down into spans. Each span represents an execution path which the transaction took to complete the run.

Spans are different types – DB, Cache, Template, External, Custom spans. Span list can be filtered by span type for analysis. Span view also shows.

Total number of spans – number of execution points detected
Net transaction duration – total time spent in child transactions
Net span duration – total time spent in all spans contributing to the overall trace duration (this time is adjusted for parallel execution of spans)
Net span duration percentage – net span duration as a percentage of overall duration. If this value is not very significant compared to the overall duration, spans are not contributing to delays

Similar to transaction view, span view provides, the following metrics for each span:

The time spent by each span
Performance stats for this span from the previous runs
The percentage of time contributed by each span in the overall execution

Filtering Traces for Analysis

The sfTrace collects traces related to all services which it monitors. Real-time view provides an insight into overall performance of the application. Filters can be used to analyze and drill down to problem transactions.

Response Time is often used as a measure of performance. By selecting a specific transaction, it is possible to see the variation in response time range. Once an transaction with very high variance between the minimum and maximum response time is identified, the response time filter slider can be used to select a range. All traces with response time within the selected range are shown, which then can be analyzed using flame graph, transaction list and span list.

The result of a transaction or response code for HTTP request is another important parameter monitored. If the result is FAILURE or HTTP Response codes indicate failures, those traces can be filtered separately and analyzed.

Aggregate View

In aggregate view, summary of all traces is captured. The view shows, for each unique trace, the captured number of occurrences, average number of spans captured, number of errors, response time - average, median, 95 percentile, 99 percentile values.

By clicking a specific trace from Aggregate view, user can view the real-time stats and drill down to the specific instances of the trace for trouble shooting.

Transaction Map

Transaction map visual summary of the interaction between different application components. A transaction in its execution depends on Databases, message buses and other application services. Transaction map shows you the connectivity between different components along with statistics on query rate, average, minimum and maximum response times. This high level view helps drilling down to the individual transaction flows and identify anomalies.

Example – Troubleshooting Trace Delays

An example to illustrate how trace can help trouble shoot issues quickly. Consider the following flame graph of a trace. The flame graph shows the trace with all spans encountered during its execution. All spans show DB operations to Postgres, Redis and Elasticsearch. Closer view of the flame graph shows, many Elasticsearch DB spans with longer in duration.

To analyze this further, Span view of the trace is selected. In the span view shown below, notice that 95.3% of the total transaction duration is spent in DB Spans.

To view the spans contributing to this high percentage of time, the spans can be sorted based on duration percentage.

The sorted span list, clearly shows the eight spans contributing to this high percentage of execution time. Examining the stack trace for the affected DB span shows the actual Elasticsearch query. Analyzing the query confirms the behavior. Developers can now easily use this information to optimize the query and execution.

Example – Troubleshooting Exceptions

sfTrace captures exceptions generated during a transaction execution along with the stack trace. This makes the debugging easy and fast.

Using the filter, select traces which show Failure.

Select one of the traces to view the flame graph.

The flame graph shows that transaction encountered an error. An exception occurred almost at the end of execution.

In the Transaction view, the result status shows Failure.

Span list shows all the 7 spans. Error may have occurred in one of the spans.

The error tab shows the details of this exception with complete stack trace. No further debugging is needed, root cause is detailed here - one of the DB spans querying Elasticsearch was using an index which is not present in Elasticsearch.

Example – Analyze Trace with Historical Performance Data

By using the response time filter it was observed that some of the traces in a service were taking longer time to complete.

Viewing the flame graph for this trace showed two potential issues. The parent transaction of the trace is invoking another service and that service is contributing to some of the execution time.

Transaction view showed the time spent in executing the related service (41.31% of the total duration). Also, it highlighted that the child service took more time compared the previous median execution time (see the yellow highlight). It also showed that the total duration was beyond the previous high 99 percentile value.

In Span list, select the child transaction and see the list of spans associated with the child transaction. It was observed that there were many spans showing higher execution time compared to the prior runs. The child transaction had 93 spans.

To see which spans are contributing to the higher execution times, the list is sorted based on the % time spent column (IN(%)). Observe many spans runtime varying more than the last 95 Percentile or 99 Percentile values.

Click on the affected span, to view its details. The detail shows the DB query being executed along with full statement. Analyzing each of the highlighted span helps identify the delays and optimize code for further performance improvement.

Conclusion

From this blog series, we hope you could appreciate the power of tracing in performance troubleshooting.

SnappyFlow is designed to help organizations monitor a variety of applications, physical and virtual infrastructure including Kubernetes pods. The built in features such as ready to use dashboards, tracing, synthetics monitoring and alerting help monitor your infrastructure with complete application context and help troubleshoot problems with remarkable speed. SnappyFlow clients have been able to reduce troubleshooting times from weeks to hours and optimize their overall running costs.

To experience SnappyFlow in action, get in touch with us or sign up for a Free Trial.

‍

What is trace retention

Tracing is an indispensable tool for application performance management (APM) providing insights into how a certain transaction or a request performed – the services involved, the relationships between the services and the duration of each service. This is especially useful in a multi-cloud, distributed microservices environment with complex interdependent services. These data points in conjunction with logs and metrics from the entire stack provide crucial insights into the overall application performance and help debug applications and deliver a consistent end-user experience.

Amongst all observability ingest data, trace data is typically stored for an hour or two. This is because trace data by itself is humongous. For just one transaction, there will be multiple services or APIs involved and imagine an organization running thousands of business transactions an hour which translates to hundreds of millions of API calls an hour. Storing traces for all these transactions would need Tera Bytes of storage and extremely powerful compute engines for index, visualization, and search.

Why is it required

To strike a balance between storage/compute costs and troubleshooting ease, most organizations choose to retain only a couple of hours of trace data. What if we need historical traces? Today, modern APM tools like SnappyFlow have the advantage of intelligently and selectively retaining certain traces beyond this limit of a couple of hours. This is enabled for important API calls and certain calls which are deemed anomalous by the tool. In most troubleshooting scenarios, we do not need all the trace data. For example, a SaaS-based payment solutions provider would want to monitor more important APIs/services related to payments rather than say customer support services.

Intelligent trace retention with SnappyFlow

SnappyFlow by default retains traces for

HTTP requests with durations > 90th percentile (anomalous incidents)

In addition to these rules, users can specify additional rules to filter out services, transaction types, request methods, response codes and transaction duration. These rules are run every 30 minutes and all traces that satisfy these conditions are retained for future use.

With the built-in trace history retention and custom filters enabled, SREs and DevOps practitioners can look further to understand historical API performance, troubleshoot effectively and provide end-users with a consistent and delightful user experience.