SnappyFlow helps run a massive big data stack at peak performance


Sustaining peak performance across multiple hyperscale Hadoop clusters is no easy job, especially when mission-critical data analytics depends on it.

The client’s app teams run their data analytics jobs, written in MapReduce, Hive and Spark, on multiple 1,000-node Hadoop clusters, each with tens of thousands of jobs running per day. Powering these jobs is a complex infrastructure, and operations teams faced significant challenges in troubleshooting performance issues at this scale.
Client: Fortune 500
Hadoop cluster: 5,000+ nodes
Logs: 5–10 TB/day
Jobs: 10K+ per day per cluster

Many unanswered questions

Why is an app running slow? Is it due to the cluster or the data?
Why are apps crashing?
Why is the cluster not performing to its expected SLA?
What is causing erratic performance issues such as node failures or high latency?
Why does an app run on one cluster but not on another?

And a nightmare for troubleshooters

Where to look and what to look for?
Information was spread across multiple Big Data platform components
Multiple independent dashboards for Oozie, Yarn, MapReduce, Spark, Name Node stats, Yarn stats and Linux metrics
Server issue? Yes, but which server?
Were the servers healthy when running jobs?
Which servers to isolate? Was there any unusual activity?
More logs but less joy
Too many verbose logs
No indication of relevance or correlation
How SnappyFlow helped
Data Ingestion
Ingests 5–10 TB of data per day with 1-year retention
Plugin suite for Linux, Name Node, Resource Manager, Yarn, Oozie Service, Hive and Hadoop logs (the sketch below illustrates the kind of data these plugins collect)
Cost-effective data management compared to alternatives
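
As a rough illustration of the kind of data such plugins pull in (this is not SnappyFlow's actual plugin code), the sketch below polls two standard Hadoop HTTP endpoints: the NameNode JMX servlet and the YARN ResourceManager cluster-metrics REST API. The host names and ports are placeholders.

```python
# Illustrative sketch only: polls standard Hadoop HTTP endpoints for the kind of
# stats a monitoring plugin would ingest. Hosts and ports below are placeholders.
import requests

NAMENODE_JMX = "http://namenode.example.com:9870/jmx"  # HDFS NameNode JMX servlet
RM_METRICS = "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics"  # YARN RM REST API

def namenode_fs_stats():
    """Fetch FSNamesystem stats (capacity, blocks, files) from the NameNode."""
    resp = requests.get(
        NAMENODE_JMX,
        params={"qry": "Hadoop:service=NameNode,name=FSNamesystem"},
        timeout=10,
    )
    beans = resp.json().get("beans", [])
    return beans[0] if beans else {}

def yarn_cluster_metrics():
    """Fetch cluster-wide YARN metrics: apps, containers, memory usage."""
    resp = requests.get(RM_METRICS, timeout=10)
    return resp.json().get("clusterMetrics", {})

if __name__ == "__main__":
    fs = namenode_fs_stats()
    rm = yarn_cluster_metrics()
    print("HDFS blocks:", fs.get("BlocksTotal"), "files:", fs.get("FilesTotal"))
    print("YARN apps running:", rm.get("appsRunning"))
    print("YARN memory allocated (MB):", rm.get("allocatedMB"), "of", rm.get("totalMB"))
```
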
Key Analysis
Run Comparisons
Compare the same workflow across multiple runs: trends in data size vs runtime and CPU efficiency
Workflow/cluster performance correlation
How did the Hadoop services and Hadoop nodes affect workflow performance?
Workflow action analysis across runs
How did a specific workflow action perform across its runs?
Workflow Gantt Chart
Illustrates the progress of workflow actions and child jobs
Node Performance analysis
Which nodes were used to run the app, and how did these nodes typically perform for the same workflow across runs?
Comparison of workflow with a baseline
Select a baseline workflow and compare poorly performing workflows against it
Map and Reduce Job analysis
Stragglers in map jobs, data spread across jobs, shuffle performance, GC performance (a rough sketch of this kind of analysis follows this list)
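
Under the hood, the straggler and run-comparison views above reduce to simple statistics over per-task and per-run records. A minimal, illustrative sketch (field names, thresholds and data are assumptions, not SnappyFlow's data model):

```python
# Illustrative sketch of straggler detection and run-to-run comparison.
# Field names, thresholds and data are assumptions, not SnappyFlow's data model.
from statistics import median

def find_stragglers(task_durations_s, factor=2.0):
    """Indices of map tasks that ran longer than `factor` times the median task."""
    med = median(task_durations_s)
    return [i for i, d in enumerate(task_durations_s) if d > factor * med]

def run_efficiency(runs):
    """Compare runs of the same workflow: seconds of runtime per GB of input."""
    return [
        {
            "run_id": r["run_id"],
            "runtime_s": r["runtime_s"],
            "input_gb": r["input_gb"],
            "sec_per_gb": r["runtime_s"] / max(r["input_gb"], 1e-9),
        }
        for r in runs
    ]

if __name__ == "__main__":
    durations = [42, 40, 44, 39, 180, 41]   # one obvious straggler (index 4)
    print("Straggler task indices:", find_stragglers(durations))

    runs = [
        {"run_id": "2024-05-01", "runtime_s": 3600, "input_gb": 800},
        {"run_id": "2024-05-02", "runtime_s": 5200, "input_gb": 820},  # similar data, much slower
    ]
    for row in run_efficiency(runs):
        print(row)
```
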
Bringing significant benefits
Lower resolution time
Order of magnitude reduction in resolution time through improved diagnostics
Improved capacity planning
Accurate assessment of job capacity needs allows improved scheduling as well as infrastructure planning
Lower CapEx need
The same capacity insight supports better long-term infrastructure planning, reducing capital spend
Diagnostic workflows
Name Node service
Name Node stats
JVM performance
RPC performance
Resource Manager Service
Cluster statistics (containers, applications, memory, CPU, operations delay)
Yarn JVM performance
Oozie Service
Callable queue
JDBC connections
JVM Memory
Jobs processed
Hadoop data node and manager node Linux metrics
CPU, RAM, Disk, Network performance metrics
Hadoop Logs
NameNode, Oozie, ResourceManager, DataNode, NodeManager, ZooKeeper and application container logs
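
Hadoop service logs generally follow a Log4j-style layout (timestamp, level, logger, message), which is what makes it possible to correlate log events with metrics by time and component. A minimal parsing sketch, assuming that common layout; the regex and sample line are illustrative, not SnappyFlow's parser:

```python
# Minimal sketch of parsing a typical Hadoop/YARN Log4j line so that log events
# can be correlated with metrics by timestamp, component and severity.
# The regex assumes the common default layout; real clusters may add thread names.
import re
from datetime import datetime

LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+):\s+"
    r"(?P<msg>.*)$"
)

def parse(line):
    m = LOG_LINE.match(line)
    if not m:
        return None
    return {
        "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "level": m.group("level"),
        "logger": m.group("logger"),
        "message": m.group("msg"),
    }

sample = ("2024-05-02 10:15:32,481 WARN "
          "org.apache.hadoop.hdfs.server.datanode.DataNode: "
          "Slow BlockReceiver write packet to mirror took 412ms")
print(parse(sample))
```
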
App Runs
Application runs across time: runtime trends, data processing efficiency, data size vs runtime
Illustrates graphically the progression of stage runs
Analysis of task execution time and data processed
Analyze task runs in each stage: task parallelism, task spread across executors
Task execution time analysis – average, min, max, 75th percentile, 25th percentile
Task data processing – input, shuffle, output, result sizes
Data locality, task locality
Executor performance
Input data distribution across executors
Shuffle data distribution across executors
Execution time breakdown
Spark App performance correlation with Cluster performance
How did the Hadoop services and Hadoop nodes affect Spark application performance?
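
The task- and executor-level views listed above are again aggregations over per-task records. A rough sketch of the two core calculations, execution-time percentiles and shuffle-data skew across executors, using an invented record layout for the example:

```python
# Illustrative sketch of the per-stage task statistics listed above:
# execution-time percentiles and shuffle-data distribution across executors.
# The record layout is invented for the example.
from collections import defaultdict
from statistics import quantiles

def task_time_summary(durations_s):
    """Min, 25th percentile, median, 75th percentile and max of task durations."""
    q1, q2, q3 = quantiles(durations_s, n=4)
    return {"min": min(durations_s), "p25": q1, "median": q2,
            "p75": q3, "max": max(durations_s)}

def shuffle_skew(tasks):
    """Total shuffle bytes per executor; skew = max / mean across executors."""
    per_executor = defaultdict(int)
    for t in tasks:
        per_executor[t["executor_id"]] += t["shuffle_read_bytes"]
    mean = sum(per_executor.values()) / len(per_executor)
    return dict(per_executor), max(per_executor.values()) / mean

if __name__ == "__main__":
    tasks = [
        {"executor_id": "1", "duration_s": 30, "shuffle_read_bytes": 2_000_000},
        {"executor_id": "1", "duration_s": 35, "shuffle_read_bytes": 2_100_000},
        {"executor_id": "2", "duration_s": 95, "shuffle_read_bytes": 9_500_000},  # skewed partition
        {"executor_id": "3", "duration_s": 33, "shuffle_read_bytes": 1_900_000},
    ]
    print(task_time_summary([t["duration_s"] for t in tasks]))
    per_exec, skew = shuffle_skew(tasks)
    print("Shuffle bytes per executor:", per_exec, "skew:", round(skew, 2))
```
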
Get in touch
Is SnappyFlow right for you?
Sign up for a 14-day trial