Predictive performance management of a petabyte-scale Kubernetes application for a global hi-tech major

SnappyFlow’s powerful Kubernetes monitoring capabilities helped the client troubleshoot performance issues and right-size EFS storage, yielding significant cost savings.

Petabytes of Data

The client has hundreds of thousands of hardware and software endpoints that routinely send petabytes of logs and metrics to a centralized monitoring application. The application is built on a microservices architecture and runs in Kubernetes pods. The volume of ingested data and the need for the application to scale called for a powerful monitoring solution that would speed up troubleshooting, right-size Kubernetes pods and lower storage costs.
While this is rather routine, the client faced some unique challenges:
Large volume of diagnostics data per device
Thousands of devices sending data in bursts
Massive storage requirement, running into petabytes on EFS
This situation raised a further issue. The system was designed to scale up as and when necessary, so the key questions were: was the application scaling up efficiently? Was the scaling caused by increased data load or by inherent application issues? The large volume of data arriving in short bursts also caused problems in the network layer: packet loss, connection timeouts and degraded response times. If a pod failed and restarted, it was impossible to attribute the failure to overload or to application issues.
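As an illustration of the attribution problem described above, the sketch below reads the last termination state that Kubernetes records for each container in a pod and separates memory-limit kills from ordinary error exits. It is not SnappyFlow's method, only a minimal example assuming pod JSON in the shape returned by `kubectl get pod <name> -o json`; the function name `classify_restarts` and the category labels are hypothetical.

```python
from typing import Any, Dict, List


def classify_restarts(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarise why each restarted container in a pod last terminated.

    `pod` is a parsed Pod object (for example from `kubectl get pod -o json`).
    The category labels are illustrative, not a Kubernetes convention.
    """
    findings = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        terminated = cs.get("lastState", {}).get("terminated")
        if not terminated:
            continue  # no previous termination recorded for this container
        reason = terminated.get("reason", "Unknown")
        exit_code = terminated.get("exitCode")
        if reason == "OOMKilled":
            category = "oom"  # killed for exceeding its memory limit
        elif exit_code not in (0, None):
            category = "app_error"  # process exited with a non-zero code
        else:
            category = "other"
        findings.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "reason": reason,
            "exit_code": exit_code,
            "category": category,
        })
    return findings
```

An `oom` result alone does not say whether the cause was a traffic burst or a memory leak; that still requires correlating the restart time with load and application metrics, which is the gap a monitoring tool is meant to close.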

How SnappyFlow Helped

SnappyFlow provided an application-centric, simplified view of the overall system, its performance and its metrics data. It helped the client understand system load, how that load was balanced between pods, and how resources were utilized, and it provided insights for fine-tuning individual systems to deliver predictable performance, cost and scaling.
SnappyFlow helped to
Detect root causes of out-of-memory issues through better observability of application metrics, container metrics and logs
Reduce the infrastructure footprint by right-sizing containers and hosts, made possible by understanding load and performance patterns
Achieve large savings in data costs by resolving performance bottlenecks and drastically reducing data buildup
Detect faulty elements and trigger support requests
Detect systemic patterns linked to the quality of products and versions
Predict failures through signature analysis
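To make the right-sizing idea concrete, here is a minimal sketch of one common heuristic: set a container's memory request near a high percentile of observed usage and its limit at the observed peak plus headroom. The source does not describe how SnappyFlow derives its recommendations, so the function names, the 95th percentile and the 1.2 headroom factor are all assumptions chosen for illustration.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list (pct in 0..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_memory_mib(
    usage_mib: Sequence[float],
    pct: float = 95.0,      # assumed percentile for the request
    headroom: float = 1.2,  # assumed safety factor applied to the peak
) -> Dict[str, int]:
    """Suggest a memory request and limit (MiB) from observed usage samples."""
    request = percentile(usage_mib, pct)
    limit = max(usage_mib) * headroom
    return {"request_mib": math.ceil(request), "limit_mib": math.ceil(limit)}
```

With bursty ingest traffic such as that described here, the gap between the percentile and the peak can be large, which is why the request and the limit are derived separately rather than from a single average.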
Reduction in storage costs: 75%
Reduction in troubleshooting time: 5x (from weeks to hours)
Daily ingest: 2 TB of raw data parsed to 75 GB of structured data and stored in Elasticsearch per day
Archive data: 2 TB per day
Ingest rate: 3000 requests at peak (250 MB to 3 GB of data per request)
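The ingest figures above can be turned into two quick back-of-the-envelope numbers: how much the parsing step shrinks the raw data, and the range of data volume that 3000 requests can carry. This is only a worked calculation on the stated figures, assuming decimal units (1 TB = 1000 GB); the source does not give the time window for the peak, so the result is a volume per 3000 requests, not a rate.

```python
from typing import Tuple

RAW_GB_PER_DAY = 2000.0       # 2 TB of raw data, assuming 1 TB = 1000 GB
STRUCTURED_GB_PER_DAY = 75.0  # structured data stored in Elasticsearch
PEAK_REQUESTS = 3000
MIN_REQUEST_GB = 0.25         # 250 MB per request
MAX_REQUEST_GB = 3.0          # 3 GB per request


def reduction_percent(raw_gb: float, structured_gb: float) -> float:
    """Percentage by which parsing shrinks the raw data."""
    return (1 - structured_gb / raw_gb) * 100


def burst_volume_gb(requests: int, min_gb: float, max_gb: float) -> Tuple[float, float]:
    """Lower and upper bound on the data carried by `requests` requests."""
    return requests * min_gb, requests * max_gb


if __name__ == "__main__":
    pct = reduction_percent(RAW_GB_PER_DAY, STRUCTURED_GB_PER_DAY)
    low, high = burst_volume_gb(PEAK_REQUESTS, MIN_REQUEST_GB, MAX_REQUEST_GB)
    print(f"Parsing reduces stored volume by about {pct:.2f}%")
    print(f"{PEAK_REQUESTS} requests carry between {low:.0f} GB and {high:.0f} GB")
```

The bounds assume every request is at the minimum or the maximum size; the real mix of request sizes is not stated in the source.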

Benefits

Scalability
The overall system was fine-tuned to handle large point loads from thousands of devices
The overall data pipeline was streamlined
Pods were right-sized for proper scaling
Debuggability
Hierarchical view from stack to application to pods to containers
Metrics linked across application, Kubernetes and logs to bring actionable insights
Powerful Kubernetes, Node.js/Express and Java monitoring with APM
Storage cost
Huge reduction in EFS storage costs by streamlining the data pipeline
Use of tiered S3 storage to save costs