kasick - Black-Box Problem Diagnosis in Parallel File...


Black-Box Problem Diagnosis in Parallel File Systems

Michael P. Kasick¹, Jiaqi Tan², Rajeev Gandhi¹, Priya Narasimhan¹

¹ Electrical & Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA 15213-3890
{mkasick, rgandhi, priyan}@andrew.cmu.edu

² DSO National Labs, Singapore, Singapore 118230
tjiaqi@dso.org.sg

Abstract

We focus on automatically diagnosing different performance problems in parallel file systems by identifying, gathering and analyzing OS-level, black-box performance metrics on every node in the cluster. Our peer-comparison diagnosis approach compares the statistical attributes of these metrics across I/O servers to identify the faulty node. We develop a root-cause analysis procedure that further analyzes the affected metrics to pinpoint the faulty resource (storage or network), and demonstrate that this approach works commonly across stripe-based parallel file systems. We demonstrate our approach for realistic storage and network problems injected into three different file-system benchmarks (dd, IOzone, and PostMark), in both PVFS and Lustre clusters.

1 Introduction

File systems can experience performance problems that can be hard to diagnose and isolate. Performance problems can arise from different system layers, such as bugs in the application, resource exhaustion, misconfigurations of protocols, or network congestion. For instance, Google reported the variety of performance problems that occurred in the first year of a cluster's operation [10]: 40-80 machines saw 50% packet-loss, thousands of hard drives failed, connectivity was randomly lost for 30 minutes, 1000 individual machines failed, etc. Often, the most interesting and trickiest problems to diagnose are not the outright crash (fail-stop) failures, but rather those that result in a "limping-but-alive" system (i.e., the system continues to operate, but with degraded performance).
Our work targets the diagnosis of such performance problems in parallel file systems used for high-performance cluster computing (HPC).

Large scientific applications consist of compute-intense behavior intermixed with periods of intense parallel I/O, and therefore depend on file systems that can support high-bandwidth concurrent writes. Parallel Virtual File System (PVFS) [6] and Lustre [23] are open-source, parallel file systems that provide such applications with high-speed data access to files. PVFS and Lustre are designed as client-server architectures, with many clients communicating with multiple I/O servers and one or more metadata servers, as shown in Figure 1.

Problem diagnosis is even more important in HPC, where the effects of performance problems are magnified due to long-running, large-scale computations. Current diagnosis of PVFS problems involves the manual analysis of client/server logs that record PVFS operations through code-level print statements. Such (white-box) problem diagnosis incurs significant runtime overheads, and re-...
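To make the peer-comparison idea from the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes each I/O server reports one scalar value of an OS-level metric per time window and flags servers that deviate from the peer median by more than a multiple of the median absolute deviation (MAD). The MAD rule, the function name `flag_outlier_servers`, the `threshold` parameter, and the server names and values are all illustrative assumptions; the preview text does not specify which statistical attributes the paper compares.

```python
from statistics import median


def flag_outlier_servers(metric_by_server, threshold=3.0):
    """Return the servers whose metric deviates strongly from their peers.

    Assumption (not from the paper): a server is anomalous when its distance
    from the peer median exceeds `threshold` times the median absolute
    deviation (MAD) of all servers.
    """
    values = list(metric_by_server.values())
    med = median(values)
    deviations = {s: abs(v - med) for s, v in metric_by_server.items()}
    mad = median(list(deviations.values()))
    if mad == 0:
        # Most peers report the same value: flag any server that differs at all.
        return sorted(s for s, d in deviations.items() if d > 0)
    return sorted(s for s, d in deviations.items() if d / mad > threshold)


# Hypothetical per-server values of one metric (e.g. disk wait time in ms).
sample = {"ios0": 10.2, "ios1": 9.8, "ios2": 10.5, "ios3": 41.0}
suspects = flag_outlier_servers(sample)
print(suspects)
```

The sketch relies on the premise stated in the abstract: in a stripe-based file system, fault-free I/O servers should behave alike, so a server that stands apart from its peers is a candidate faulty node. Median-based statistics are used here only because a single slow server barely shifts them.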
This note was uploaded on 11/12/2011 for the course CE 726 taught by Professor Staf during the Spring '11 term at SUNY Buffalo.
