L_monitoring.sp11 - CS 525 Advanced Distributed Systems...

1 CS 525 Advanced Distributed Systems
Spring 2011
Indranil Gupta (Indy)
Distributed Monitoring
April 5, 2011
All Slides © IG
Acknowledgments: Some slides by Steve Ko, Jin Liang, Ahmed Khurshid, Abdullah Al-Nayeem

2 Distributed Applications run over Many Environments
- The Internet
- Gnutella peer-to-peer system
- PlanetLab
- Grids, Datacenters, Cloud Computing
- Distributed applications
  - Large scale: 1000’s of nodes
  - Unreliable: nodes and network
  - Churned: node join, leave, failure
3 Management and Monitoring of Distributed Applications
- Monitoring of nodes, system-wide or per-node
- Many applications can benefit from this, e.g.,
  - DNS, cooperative caching, CDN, streaming, etc., on PlanetLab
  - Web hosting in server farms, data centers, Grid computations, etc.
- A new and important problem direction for the next decade [CRA03], [NSF WG 05], [IBM, HP, Google, Amazon]
  - The goal is more end-to-end than cluster or network management
- Today it typically constitutes 33% of the TCO of distributed infrastructures
  - Will only get worse with consolidation of data centers, and expansion and use of PlanetLab

4 What do sysadmins want?
Two types of Monitoring Problems:
I. Instant (on-demand) queries across the node population, requiring up-to-date answers
   - {average, max, min, top-k, bottom-k, histogram, etc.} for {CPU util, RAM util, disk space util, app. characteristics, etc.}
   - E.g., max CPU, top-5 CPU, avg RAM, etc.
II. Long-term monitoring of node contribution
   - availability, bandwidth, computation power, disk space
Requirements:
- Low bandwidth
  - For instant queries, since infrequent
  - For long-term monitoring, since node investment
- Low memory and computation
- Scalability
- Addresses failures and churn
- Good performance and response
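The instant-query aggregates listed on this slide can be illustrated with a minimal Python sketch. It assumes all per-node readings are already available in one place, which sidesteps the actual distributed collection problem the lecture is about. The node names, the readings, and the function names (`top_k`, `bottom_k`, `histogram`, `instant_query`) are hypothetical and chosen only for illustration.

```python
import heapq
from statistics import mean

# Hypothetical per-node CPU utilization readings (percent); not from the slides.
cpu_util = {
    "node-a": 12.5, "node-b": 97.0, "node-c": 40.0,
    "node-d": 63.5, "node-e": 5.0, "node-f": 88.0,
}

def top_k(readings, k):
    """Return the k (node, value) pairs with the largest values, largest first."""
    return heapq.nlargest(k, readings.items(), key=lambda kv: kv[1])

def bottom_k(readings, k):
    """Return the k (node, value) pairs with the smallest values, smallest first."""
    return heapq.nsmallest(k, readings.items(), key=lambda kv: kv[1])

def histogram(readings, bucket_width):
    """Count readings per bucket; bucket b covers [b, b + bucket_width)."""
    counts = {}
    for value in readings.values():
        bucket = int(value // bucket_width) * bucket_width
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))

def instant_query(readings, k=3, bucket_width=25):
    """Evaluate the slide's example aggregates over one metric."""
    values = list(readings.values())
    return {
        "avg": mean(values),
        "max": max(values),
        "min": min(values),
        "top_k": top_k(readings, k),
        "bottom_k": bottom_k(readings, k),
        "histogram": histogram(readings, bucket_width),
    }

if __name__ == "__main__":
    for name, answer in instant_query(cpu_util, k=2).items():
        print(name, answer)
```

The same functions apply unchanged to any other metric (RAM, disk space) stored as a node-to-value mapping.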
5 Existing Solutions: Bird’s Eye View
[Figure: diagram with the labels CENTRALIZED, DECENTRALIZED, Decentralized overlays, Instant Queries, Long-term monitoring]

6 Existing Monitoring Solutions
- Centralized/Infrastructure-based
- Decentralized
7 Existing Monitoring Solutions
- Centralized/Infrastructure-based: user scripts, CoMon, Tivoli, Condor, server+backend, etc.
  - Efficient and enable long-term monitoring
  1. Provide stale answers to instant queries (data collection throttled due to scale)
     - CoMon collection: 5 min intervals
     - HP OpenView: 6 hours to collect data from 6000 servers!
  2. Often require infrastructure to be maintained
     - No in-network aggregation
     - Could scale better
- Decentralized
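To put the staleness figures on this slide in perspective, here is a small back-of-the-envelope calculation in Python. It rests on two assumptions that are not stated in the slides: the OpenView collection time is amortized evenly over all servers, and an instant query arrives at a uniformly random point within a collection interval. The function names are hypothetical.

```python
def per_server_seconds(total_hours, servers):
    """Average collection time per server, assuming even amortization."""
    return total_hours * 3600 / servers

def staleness_bounds(interval_seconds):
    """Worst-case and expected age of a reading polled every interval_seconds.

    Assumes the query arrives uniformly at random within the interval.
    """
    return interval_seconds, interval_seconds / 2

if __name__ == "__main__":
    # HP OpenView figure from the slide: 6 hours for 6000 servers.
    print("seconds per server:", per_server_seconds(6, 6000))
    # CoMon figure from the slide: 5-minute collection intervals.
    worst, expected = staleness_bounds(5 * 60)
    print("worst-case age (s):", worst, "expected age (s):", expected)
```

Under these assumptions, an answer built from a 5-minute collection cycle can be up to 5 minutes old, and one built from a 6-hour sweep can be hours old, which is why such systems fit long-term monitoring better than instant queries.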

8 Existing Monitoring Solutions
- Centralized/Infrastructure-based
- Decentralized: Astrolabe, Ganglia, SWORD, Plush, SDIMS, etc.
  - Nodes organized in an overlay graph, where nodes maintain neighbors according to overlay rules
    - E.g., distributed hash tables (DHTs): Pastry-based
    - E.g., hierarchy: Astrolabe
  - Can answer instant queries but need to be maintained all the time
  - Nodes spend resources on maintaining peers according to overlay rules => complex failure repair
    - Churn => node needs to change its neighbors to
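As a rough illustration of how a hierarchy-style overlay can answer an instant query, the following Python sketch aggregates a metric up a tree: each node combines its own reading with the partial results of its children and passes one small summary upward. This is a generic sketch, not Astrolabe's actual protocol. The `Node` class, the `aggregate` function, and the readings are hypothetical, and the recursive call stands in for what would be network messages in a real system. It also ignores failures and churn, which is exactly the maintenance cost the slide warns about.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node in a hypothetical aggregation hierarchy."""
    name: str
    cpu: float                       # this node's own CPU utilization (percent)
    children: list = field(default_factory=list)

def aggregate(node):
    """Return (sum, count, max) of CPU readings in the subtree rooted at node.

    Each level forwards only this fixed-size summary, so the root never
    needs to see every individual reading.
    """
    total, count, peak = node.cpu, 1, node.cpu
    for child in node.children:
        child_total, child_count, child_peak = aggregate(child)
        total += child_total
        count += child_count
        peak = max(peak, child_peak)
    return total, count, peak

if __name__ == "__main__":
    leaves = [Node("c", 90.0), Node("d", 10.0)]
    root = Node("root", 20.0, [Node("a", 60.0, leaves), Node("b", 30.0)])
    total, count, peak = aggregate(root)
    print("avg CPU:", total / count, "max CPU:", peak)
```

Sum, count, and max compose cleanly across subtrees, which is what makes them suitable for this kind of partial aggregation; an exact median, by contrast, cannot be computed from such fixed-size summaries.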

This note was uploaded on 12/08/2011 for the course CS 525 taught by Professor Gupta during the Spring '08 term at University of Illinois, Urbana Champaign.
