SnappyFlow helps run a massive big data stack at peak performance


Sustaining peak performance across multiple hyperscale Hadoop clusters is no easy job, especially when mission-critical data analytics depends on it.

The client’s app teams run their data analytics jobs, written in MapReduce, Hive and Spark, on multiple 1,000-node Hadoop clusters, each with tens of thousands of jobs running per day. Powering these jobs is a complex infrastructure, and operations teams faced significant challenges in troubleshooting performance issues at this scale.
Client: Fortune 500
Hadoop cluster: 5,000+ nodes
Logs: 5–10 TB/day
Jobs: 10K+ per day per cluster

Many unanswered questions

Why is an app running slow? Is it due to the cluster or the data?
Why are apps crashing?
Why is the cluster not performing to its expected SLA?
What is causing erratic performance issues such as node failures or high latency?
Why does an app run on one cluster but not on another?

And a nightmare for troubleshooters

Where to look and what to look for?
Information was spread across multiple Big Data platform components
Multiple independent dashboards for Oozie, Yarn, MapReduce, Spark, Name Node stats, Yarn stats and Linux metrics
Server issue? Yes, but which server?
Were the servers healthy when running jobs?
Which servers to isolate? Was there any unusual activity?
More logs but less joy
Too many verbose logs
No indication of relevance or correlation
How SnappyFlow helped
Data Ingestion
Ingests 5–10 TB of data per day with 1-year retention
Plugin suite for Linux, Name Node, Resource Manager, Yarn, Oozie Service, Hive and Hadoop logs (the sketch below illustrates the kind of data these plugins collect)
Cost-effective data management compared to alternatives
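
As a rough illustration of the kind of data such plugins pull in (this is not SnappyFlow's actual plugin code), the sketch below polls two standard Hadoop HTTP endpoints: the NameNode JMX servlet and the YARN ResourceManager cluster-metrics REST API. The host names and ports are placeholders.

```python
# Illustrative sketch only: polls standard Hadoop HTTP endpoints for the kind of
# stats a monitoring plugin would ingest. Hosts and ports below are placeholders.
import requests

NAMENODE_JMX = "http://namenode.example.com:9870/jmx"  # HDFS NameNode JMX servlet
RM_METRICS = "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics"  # YARN RM REST API

def namenode_fs_stats():
    """Fetch FSNamesystem stats (capacity, blocks, files) from the NameNode."""
    resp = requests.get(
        NAMENODE_JMX,
        params={"qry": "Hadoop:service=NameNode,name=FSNamesystem"},
        timeout=10,
    )
    beans = resp.json().get("beans", [])
    return beans[0] if beans else {}

def yarn_cluster_metrics():
    """Fetch cluster-wide YARN metrics: apps, containers, memory usage."""
    resp = requests.get(RM_METRICS, timeout=10)
    return resp.json().get("clusterMetrics", {})

if __name__ == "__main__":
    fs = namenode_fs_stats()
    rm = yarn_cluster_metrics()
    print("HDFS blocks:", fs.get("BlocksTotal"), "files:", fs.get("FilesTotal"))
    print("YARN apps running:", rm.get("appsRunning"))
    print("YARN memory allocated (MB):", rm.get("allocatedMB"), "of", rm.get("totalMB"))
```
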
Key Analysis
Run Comparisons
Compare the same workflow across multiple runs: trends in data size vs runtime and CPU efficiency
Workflow/cluster performance correlation
How did the Hadoop services and Hadoop nodes affect workflow performance?
Workflow action analysis across runs
How did a specific workflow action perform across its runs?
Workflow Gantt Chart
Illustrates the progress of workflow actions and child jobs
Node Performance analysis
Which nodes were used to run the app, and how did these nodes typically perform for the same workflow across runs?
Comparison of workflow with a baseline
Select a baseline workflow and compare poorly performing workflows against it
Map and Reduce Job analysis
Stragglers in map jobs, data spread across jobs, shuffle performance, GC performance (a rough sketch of this kind of analysis follows this list)
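
Under the hood, the straggler and run-comparison views above reduce to simple statistics over per-task and per-run records. A minimal, illustrative sketch (field names, thresholds and data are assumptions, not SnappyFlow's data model):

```python
# Illustrative sketch of straggler detection and run-to-run comparison.
# Field names, thresholds and data are assumptions, not SnappyFlow's data model.
from statistics import median

def find_stragglers(task_durations_s, factor=2.0):
    """Indices of map tasks that ran longer than `factor` times the median task."""
    med = median(task_durations_s)
    return [i for i, d in enumerate(task_durations_s) if d > factor * med]

def run_efficiency(runs):
    """Compare runs of the same workflow: seconds of runtime per GB of input."""
    return [
        {
            "run_id": r["run_id"],
            "runtime_s": r["runtime_s"],
            "input_gb": r["input_gb"],
            "sec_per_gb": r["runtime_s"] / max(r["input_gb"], 1e-9),
        }
        for r in runs
    ]

if __name__ == "__main__":
    durations = [42, 40, 44, 39, 180, 41]   # one obvious straggler (index 4)
    print("Straggler task indices:", find_stragglers(durations))

    runs = [
        {"run_id": "2024-05-01", "runtime_s": 3600, "input_gb": 800},
        {"run_id": "2024-05-02", "runtime_s": 5200, "input_gb": 820},  # similar data, much slower
    ]
    for row in run_efficiency(runs):
        print(row)
```
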
Bringing significant benefits
Lower resolution time
Order of magnitude reduction in resolution time through improved diagnostics
Improved capacity planning
Accurate assessment of job capacity needs allows improved scheduling as well as infrastructure planning
Lower CapEx need
The same capacity insight supports better long-term infrastructure planning, reducing capital spend
Diagnostic workflows
Name Node service
Name Node stats
JVM performance
RPC performance
Resource Manager Service
Cluster statistics (containers, applications, memory, CPU, operations delay)
Yarn JVM performance
Oozie Service
Callable queue
JDBC connections
JVM Memory
Jobs processed
Hadoop data node and manager node Linux metrics
CPU, RAM, Disk, Network performance metrics
Hadoop Logs
NameNode, Oozie, ResourceManager, DataNode, NodeManager, ZooKeeper and application container logs
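
Hadoop service logs generally follow a Log4j-style layout (timestamp, level, logger, message), which is what makes it possible to correlate log events with metrics by time and component. A minimal parsing sketch, assuming that common layout; the regex and sample line are illustrative, not SnappyFlow's parser:

```python
# Minimal sketch of parsing a typical Hadoop/YARN Log4j line so that log events
# can be correlated with metrics by timestamp, component and severity.
# The regex assumes the common default layout; real clusters may add thread names.
import re
from datetime import datetime

LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+):\s+"
    r"(?P<msg>.*)$"
)

def parse(line):
    m = LOG_LINE.match(line)
    if not m:
        return None
    return {
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "level": m.group("level"),
        "logger": m.group("logger"),
        "message": m.group("msg"),
    }

sample = ("2024-05-02 10:15:32,481 WARN "
          "org.apache.hadoop.hdfs.server.datanode.DataNode: "
          "Slow BlockReceiver write packet to mirror took 412ms")
print(parse(sample))
```
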
App Runs
Application runs across time: runtime trends, data processing efficiency, data size vs runtime
Illustrates graphically the progression of stage runs
Analysis of task execution time and data processed
Analyze task runs in each stage: task parallelism, task spread across executors
Task execution time analysis – average, min, max, 75th percentile, 25th percentile
Task data processing – input, shuffle, output, result sizes
Data locality, task locality
Executor performance
Input data distribution across executors
Shuffle data distribution across executors
Execution time breakdown
Spark App performance correlation with Cluster performance
How did the Hadoop services and Hadoop nodes affect Spark application performance?
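
The task- and executor-level views listed above are again aggregations over per-task records. A rough sketch of the two core calculations, execution-time percentiles and shuffle-data skew across executors, using an invented record layout for the example:

```python
# Illustrative sketch of the per-stage task statistics listed above:
# execution-time percentiles and shuffle-data distribution across executors.
# The record layout is invented for the example.
from collections import defaultdict
from statistics import quantiles

def task_time_summary(durations_s):
    """Min, 25th percentile, median, 75th percentile and max of task durations."""
    q1, q2, q3 = quantiles(durations_s, n=4)
    return {"min": min(durations_s), "p25": q1, "median": q2,
            "p75": q3, "max": max(durations_s)}

def shuffle_skew(tasks):
    """Total shuffle bytes per executor; skew = max / mean across executors."""
    per_executor = defaultdict(int)
    for t in tasks:
        per_executor[t["executor_id"]] += t["shuffle_read_bytes"]
    mean = sum(per_executor.values()) / len(per_executor)
    return dict(per_executor), max(per_executor.values()) / mean

if __name__ == "__main__":
    tasks = [
        {"executor_id": "1", "duration_s": 30, "shuffle_read_bytes": 2_000_000},
        {"executor_id": "1", "duration_s": 35, "shuffle_read_bytes": 2_100_000},
        {"executor_id": "2", "duration_s": 95, "shuffle_read_bytes": 9_500_000},  # skewed partition
        {"executor_id": "3", "duration_s": 33, "shuffle_read_bytes": 1_900_000},
    ]
    print(task_time_summary([t["duration_s"] for t in tasks]))
    per_exec, skew = shuffle_skew(tasks)
    print("Shuffle bytes per executor:", per_exec, "skew:", round(skew, 2))
```
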
Get in touch
Is SnappyFlow right for you?
Sign up for a 14-day trial