Servers and applications generate logs to report informational, error, debug and fatal events. These logs are used to troubleshoot problems, understand user behavior, identify hosts requesting access to applications and detect anomalous behavior. They are also used to analyze differences in an application's behavior from day to day. The appearance of a log that was not seen earlier may indicate (a) a new scenario being triggered or (b) unexpected behavior due to a change within the application.
In some scenarios, enterprises may need to analyze old logs for security forensics and to gather crucial information about cyber-attacks, fraudulent access and the like. To perform such analysis, logs need to be retained for a longer period of time. Certain regulatory compliance requirements may also necessitate longer retention periods for logs.
Existing Solutions and Gaps
Existing log management solutions such as Splunk and Elasticsearch archive logs by compressing them and shifting them from primary storage to an offline archive. When archived logs need to be analyzed, they are brought back from the archival store to primary storage and searched there. This design has several shortcomings:
It is not possible to search the archive directly; consequently, users retain logs for longer periods in primary storage itself
The process of bringing an archive back to primary storage is cumbersome, impacts the performance of the cluster and makes sizing of primary storage unpredictable
Searching archived logs usually spans a long time period and involves a large amount of data. The optimal way to search archives is through background jobs, with the ability to retain results, search within results, set up recurring jobs and so on. Splunk and Elasticsearch support only interactive searches, which leads to poorer usability and requires larger clusters
SnappyFlow’s Log Archival Approach
SnappyFlow's SaaS solution provides comprehensive log archival functionality that leverages an S3-compatible object store. Unlike competing solutions, users can search and visualize logs seamlessly across both active data and archived data, without needing to move data from the archive to primary storage. Further, SnappyFlow analyzes logs, extracts signatures, adds search metadata and compresses the logs before storing them in the archive.
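To make the idea concrete, here is a minimal Python sketch of such a pipeline, assuming a boto3-style S3 client. The function names (template_of, archive_chunk) and the sidecar metadata layout are illustrative assumptions, not SnappyFlow's actual implementation:

```python
import gzip
import json
import re

def template_of(line: str) -> str:
    """Mask variable tokens so lines differing only in values share a signature."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # hex identifiers
    masked = re.sub(r"\d+", "<NUM>", masked)           # numbers, timestamps
    return masked

def archive_chunk(s3_client, bucket: str, key: str, lines: list[str]) -> None:
    """Compress a chunk of logs and store it alongside small search metadata."""
    templates = sorted({template_of(l) for l in lines})
    meta = {"templates": templates, "line_count": len(lines)}
    # Compressed log chunk plus a tiny sidecar object holding the metadata.
    s3_client.put_object(Bucket=bucket, Key=key,
                         Body=gzip.compress("\n".join(lines).encode()))
    s3_client.put_object(Bucket=bucket, Key=key + ".meta",
                         Body=json.dumps(meta).encode())
```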
Key Benefits from SnappyFlow’s Archival Feature to Users
10-40x compression, combined with maintaining data in S3, translates to significant cost reduction
Extensive search using regex patterns
Ability to visualize results, zoom into results, search within results, join results, store results for reference
Faster search of logs at petabyte scale as a result of superior organization, metadata, signatures and other innovative techniques (see the sketch after this list)
Signature-based log filtering and analysis
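As an illustration of how signature metadata can speed up regex search over an archive, the sketch below (continuing the hypothetical pipeline above) downloads a chunk only when its template metadata could possibly match the pattern. This is a simplified pruning heuristic: masked templates alone cannot pre-filter value-level patterns, so a production system would rely on richer indexes.

```python
import gzip
import json
import re

def search_archive(s3_client, bucket: str, chunk_keys: list[str], pattern: str):
    """Yield archived lines matching a regex, skipping chunks whose
    template metadata rules out a match."""
    rx = re.compile(pattern)
    for key in chunk_keys:
        # Fetch the tiny sidecar metadata first, not the full chunk.
        meta = json.loads(s3_client.get_object(
            Bucket=bucket, Key=key + ".meta")["Body"].read())
        if not any(rx.search(t) for t in meta["templates"]):
            continue  # no template in this chunk can match; skip the download
        chunk = gzip.decompress(s3_client.get_object(
            Bucket=bucket, Key=key)["Body"].read()).decode()
        for line in chunk.splitlines():
            if rx.search(line):
                yield line
```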
What is trace retention?
Tracing is an indispensable tool for application performance management (APM), providing insights into how a certain transaction or request performed: the services involved, the relationships between the services and the duration of each service. This is especially useful in a multi-cloud, distributed microservices environment with complex, interdependent services. These data points, in conjunction with logs and metrics from the entire stack, provide crucial insights into overall application performance and help teams debug applications and deliver a consistent end-user experience.
Amongst all observability ingest data, trace data is typically stored for only an hour or two, because trace data by itself is enormous. A single transaction involves multiple services or APIs; imagine an organization running thousands of business transactions an hour, which translates to hundreds of millions of API calls an hour. Storing traces for all these transactions would need terabytes of storage and extremely powerful compute engines for indexing, visualization and search.
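A rough back-of-envelope calculation shows the scale; all numbers here are assumed for illustration, and real span sizes vary widely:

```python
# Back-of-envelope estimate of raw trace volume with assumed numbers.
api_calls_per_hour = 200_000_000  # "hundreds of millions" of calls per hour
bytes_per_span = 2_000            # assumed ~2 KB per span/API call
hours_per_day = 24

daily_bytes = api_calls_per_hour * bytes_per_span * hours_per_day
print(f"~{daily_bytes / 1e12:.1f} TB of raw trace data per day")  # ~9.6 TB
```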
Why is it required?
To strike a balance between storage/compute costs and ease of troubleshooting, most organizations choose to retain only a couple of hours of trace data. But what if we need historical traces? Modern APM tools like SnappyFlow can intelligently and selectively retain certain traces beyond this couple-of-hours limit. This is done for important API calls and for calls the tool deems anomalous. In most troubleshooting scenarios, we do not need all the trace data. For example, a SaaS-based payment solutions provider would want to monitor the important APIs and services related to payments rather than, say, customer support services.
Intelligent trace retention with SnappyFlow
SnappyFlow by default retains traces for:
HTTP requests with durations > 90th percentile (anomalous incidents)
In addition to these rules, users can specify additional rules to filter traces by service, transaction type, request method, response code and transaction duration. These rules run every 30 minutes, and all traces that satisfy the conditions are retained for future use, as sketched below.
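A simplified sketch of how such retention rules might be evaluated follows. The Trace fields, the rule-as-predicate shape and the percentile computation are assumptions for illustration, not SnappyFlow's internals:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Trace:
    service: str
    transaction_type: str
    method: str
    response_code: int
    duration_ms: float

def traces_to_retain(window: list[Trace], user_rules) -> list[Trace]:
    """Pick traces to keep from the latest window: anything slower than the
    90th-percentile duration, plus anything matching a user-defined rule.
    A scheduler would run this over each 30-minute window."""
    p90 = quantiles([t.duration_ms for t in window], n=10)[-1]  # 90th pct
    return [t for t in window
            if t.duration_ms > p90 or any(rule(t) for rule in user_rules)]

# Example user rule: always keep failing requests to the payments service.
keep_payment_errors = lambda t: t.service == "payments" and t.response_code >= 500
```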
With built-in trace history retention and custom filters enabled, SREs and DevOps practitioners can look back further to understand historical API performance, troubleshoot effectively and provide end users with a consistent and delightful experience.