Chukwa: A large-scale monitoring system

Jerome Boulon firstname.lastname@example.org Yahoo!, inc
Andy Konwinski email@example.com UC Berkeley
Runping Qi firstname.lastname@example.org Yahoo!, inc
Ariel Rabkin email@example.com UC Berkeley
Eric Yang firstname.lastname@example.org Yahoo!, inc
Mac Yang email@example.com Yahoo!, inc

Abstract

We describe the design and initial implementation of Chukwa, a data collection system for monitoring and analyzing large distributed systems. Chukwa is built on top of Hadoop, an open source distributed filesystem and MapReduce implementation, and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying monitoring and analysis results, in order to make the best use of this collected data.

1 Introduction

Hadoop is a distributed filesystem and MapReduce implementation that is used pervasively at Yahoo! for a variety of critical business purposes. Production clusters often include thousands of nodes. Large distributed systems such as Hadoop are fearsomely complex, and can fail in complicated and subtle ways. As a result, Hadoop is extensively instrumented. A two-thousand node cluster configured for normal operation generates nearly half a terabyte of monitoring data per day, mostly application-level log files.

This data is invaluable for debugging, performance measurement, and operational monitoring. However, processing this data in real time at scale is a formidable challenge. A good monitoring system ought to scale out to very large deployments, and ought to handle crashes gracefully. In Hadoop, only a handful of aggregate metrics, such as task completion rate and available disk space, are computed in real time. The vast bulk of the generated data is stored locally, and accessible via a per-node web interface.
Unfortunately, this mechanism facilitates neither programmatic analysis of the log data nor the long-term archiving of such data. To make full use of log data, users must first write ad-hoc log aggregation scripts to centralize the required data, and then build mechanisms to analyze the collected data. Logs are periodically deleted, unless users take the initiative in storing them.

We believe that our situation is typical, and that local storage of logging data is a common model for very large deployments. To the extent that more sophisticated data management techniques are utilized, they are largely supported by ad-hoc proprietary solutions. A well documented open source toolset for handling monitoring data thus solves a significant practical problem and provides a valuable reference point for future development in this area.

We did not aim to solve the problem of real-time monitoring for failure detection, which systems such as Ganglia already do well. Rather, we wanted a system that would process large volumes of data on a timescale of minutes, not seconds, to detect more subtle conditions, and to aid in failure diagnosis. Human engineers and op-...