In part 2 of this blog series, we will consider a sample Java application that consists of two programs, a provision-manager and a provision-worker, and use SnappyFlow's sfTrace agent to analyze its performance. sfTrace can auto-instrument Java, Python and Node.js applications.
Instrumenting Java Applications
The sfTrace Java agent uses Byte Buddy, an open-source code generation and instrumentation library. Byte Buddy builds on the instrumentation APIs provided by the JVM and modifies Java classes at runtime, so the application does not have to be recompiled for instrumentation. The modified code emits information about code paths such as processing HTTP requests and querying databases. The sfTrace Java agent can instrument a wide range of technologies: web frameworks, application servers/servlets, data stores, networking frameworks, asynchronous frameworks, scheduling frameworks, messaging frameworks, logging frameworks and more. It is also possible to create custom instrumentation through the agent APIs.
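As a rough illustration of how runtime instrumentation with Byte Buddy works (a minimal, generic sketch with hypothetical class names, not the sfTrace agent's actual instrumentation code), the snippet below subclasses a SampleService at runtime and wraps its handleRequest method with a timing interceptor, without modifying or recompiling the original class:

import java.lang.reflect.Method;
import java.util.concurrent.Callable;
import net.bytebuddy.ByteBuddy;
import net.bytebuddy.implementation.MethodDelegation;
import net.bytebuddy.implementation.bind.annotation.Origin;
import net.bytebuddy.implementation.bind.annotation.RuntimeType;
import net.bytebuddy.implementation.bind.annotation.SuperCall;
import net.bytebuddy.matcher.ElementMatchers;

public class TimingDemo {

    // Hypothetical application class, standing in for real application code.
    public static class SampleService {
        public String handleRequest(String payload) {
            return "processed:" + payload;
        }
    }

    // Measures how long the original method takes, then lets it run unchanged.
    public static class TimingInterceptor {
        @RuntimeType
        public static Object intercept(@Origin Method method,
                                       @SuperCall Callable<?> original) throws Exception {
            long start = System.nanoTime();
            try {
                return original.call();
            } finally {
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println(method.getName() + " took " + micros + " us");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Generate and load a subclass whose handleRequest is wrapped by the
        // interceptor; SampleService itself is never touched or recompiled.
        SampleService traced = new ByteBuddy()
                .subclass(SampleService.class)
                .method(ElementMatchers.named("handleRequest"))
                .intercept(MethodDelegation.to(TimingInterceptor.class))
                .make()
                .load(TimingDemo.class.getClassLoader())
                .getLoaded()
                .getDeclaredConstructor()
                .newInstance();

        System.out.println(traced.handleRequest("order-42"));
    }
}

The sfTrace agent applies the same idea automatically via the -javaagent hook shown below, attaching its instrumentation to the supported frameworks rather than to a hand-picked class.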
Setting Up sfTrace to Trace the Example Application
The example application is written in Java and consists of two programs, provision-manager and provision-worker. To start tracing them:
- Install sfAgent, the SnappyFlow agent, on the instances where these applications run. sfAgent is used to monitor VM performance and dependencies.
- Detailed installation and configuration instructions are available at https://docs.snappyflow.io/docs/Quick_Start/getting_started#sfagent
- To trace the provisioning manager and worker, launch them under the sfTrace Java agent as follows:
java -javaagent:/opt/sfagent/sftrace/java/sftrace-java-agent.jar -Dsftrace.service_name=provision-manager -jar provision-manager.jar
java -javaagent:/opt/sfagent/sftrace/java/sftrace-java-agent.jar -Dsftrace.service_name=provision-worker -jar provision-worker.jar
The provision-manager and provision-worker processes now run under the sfTrace Java agent, which automatically instruments the application and its libraries, intercepting execution steps to extract information and create trace data.
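The sample application's source is not shown in this post, but as a purely illustrative sketch (class, endpoint and parameter names here are hypothetical), a servlet handler like the one below is the kind of code the agent picks up automatically: each incoming request is recorded as an HTTP transaction without any tracing calls in the application itself.

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical endpoint in provision-manager. When the JVM runs under the
// sfTrace javaagent, servlet requests are traced automatically; note that
// no tracing API appears anywhere in the application code.
public class ProvisionRequestServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String requestId = req.getParameter("requestId");

        // ... hand the provisioning job over to provision-worker (omitted) ...

        resp.setStatus(HttpServletResponse.SC_ACCEPTED);
        resp.getWriter().println("queued " + requestId);
    }
}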
What is trace retention?
Tracing is an indispensable tool for application performance management (APM), providing insight into how a given transaction or request performed: the services involved, the relationships between those services and the duration of each. This is especially useful in a multi-cloud, distributed microservices environment with complex, interdependent services. These data points, in conjunction with logs and metrics from the entire stack, provide crucial insights into overall application performance and help debug applications and deliver a consistent end-user experience.
Amongst all observability ingest data, trace data is typically stored for only an hour or two, because trace data by itself is voluminous. A single transaction can involve multiple services or APIs; now imagine an organization running thousands of business transactions an hour, which translates to hundreds of millions of API calls an hour. Storing traces for all these transactions would need terabytes of storage and extremely powerful compute engines for indexing, visualization and search.
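A quick back-of-the-envelope calculation shows why. The numbers below are assumptions chosen only to illustrate the scale (roughly one span per API call and about 1 KB of indexed data per span), not measured values:

public class TraceVolumeEstimate {
    public static void main(String[] args) {
        long apiCallsPerHour = 200_000_000L; // "hundreds of millions" of API calls an hour
        long spansPerCall = 1;               // assume at least one span per call
        long bytesPerSpan = 1_000;           // assume ~1 KB of indexed trace data per span

        double gbPerHour = apiCallsPerHour * spansPerCall * bytesPerSpan / 1e9;
        System.out.printf("~%.0f GB of trace data per hour, ~%.1f TB per day%n",
                gbPerHour, gbPerHour * 24 / 1000);
    }
}

Even with these modest assumptions the raw trace stream runs into hundreds of gigabytes an hour and several terabytes a day, which is why retaining everything for long periods quickly becomes impractical.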
Why is it required?
To strike a balance between storage/compute costs and ease of troubleshooting, most organizations choose to retain only a couple of hours of trace data. What if we need historical traces? Modern APM tools like SnappyFlow can intelligently and selectively retain certain traces beyond this couple-of-hours limit; this is done for important API calls and for calls the tool deems anomalous. In most troubleshooting scenarios, we do not need all the trace data. For example, a SaaS-based payment solutions provider would rather monitor the important APIs/services related to payments than, say, customer support services.
Intelligent trace retention with SnappyFlow
SnappyFlow by default retains traces for:
- HTTP requests with durations > 90th percentile (anomalous incidents)
In addition to these default rules, users can specify additional rules to retain traces filtered by service, transaction type, request method, response code and transaction duration. These rules run every 30 minutes, and all traces that satisfy the conditions are retained for future use.
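Conceptually, such a custom rule is just a predicate over a trace's attributes that is evaluated periodically against recent traces. The sketch below is only a mental model with made-up field names; it is not SnappyFlow's actual rule syntax or configuration format:

import java.time.Duration;
import java.util.function.Predicate;

public class RetentionRuleSketch {

    // Illustrative summary of a trace; real trace documents carry many more fields.
    record TraceSummary(String service, String transactionType, String requestMethod,
                        int responseCode, Duration duration) {}

    public static void main(String[] args) {
        // "Retain payment-service POST requests that failed or took longer than 2 s."
        Predicate<TraceSummary> retain = t ->
                t.service().equals("payment-service")
                && t.requestMethod().equals("POST")
                && (t.responseCode() >= 500
                        || t.duration().compareTo(Duration.ofSeconds(2)) > 0);

        TraceSummary slowPayment = new TraceSummary(
                "payment-service", "request", "POST", 200, Duration.ofSeconds(3));
        System.out.println("retain? " + retain.test(slowPayment)); // prints: retain? true
    }
}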
With built-in trace history retention and custom filters enabled, SREs and DevOps practitioners can look further back to understand historical API performance, troubleshoot effectively and provide end users with a consistent and delightful experience.