Performance Debugging for
Distributed Systems of Black Boxes
Marcos K. Aguilera
Jeffrey C. Mogul
Janet L. Wiener
HP Labs, Palo Alto
MIT Lab for Computer Science
Many interesting large-scale systems are distributed systems of
multiple communicating components. Such systems can be very
hard to debug, especially when they exhibit poor performance.
The problem becomes much harder when systems are composed
of “black-box” components: software from many different (per-
haps competing) vendors, usually without source code available.
Typical solutions-provider employees are not always skilled or ex-
perienced enough to debug these systems efficiently. Our goal is
to design tools that enable modestly-skilled programmers (and ex-
perts, too) to isolate performance bottlenecks in distributed systems
composed of black-box nodes.
We approach this problem by obtaining message-level traces of
system activity, as passively as possible and without any knowledge
of node internals or message semantics. We have developed two
very different algorithms for inferring the dominant causal paths
through a distributed system from these traces. One uses tim-
ing information from RPC messages to infer inter-call causality;
the other uses signal-processing techniques. Our algorithms can
ascribe delay to specific nodes on specific causal paths. Unlike
previous approaches to similar problems, our approach requires no
modifications to applications, middleware, or messages.
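To make the timing-based idea concrete, the following is a minimal sketch, not the algorithm developed in this paper. It assumes a hypothetical trace format in which each message carries a timestamp, a sender, a receiver, a call identifier, and a call/return flag; the names `Message`, `CallPair`, `pair_calls`, `candidate_parents`, and `self_delay` are illustrative only. It pairs each call with its return, treats a call from node B as possibly caused by any call into B whose interval encloses it, and charges a node the time not covered by its nested calls.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    """One passively captured message (hypothetical trace record)."""
    timestamp: int
    sender: str
    receiver: str
    call_id: int   # assumption: calls and returns can be matched by an ID
    kind: str      # "call" or "return"


@dataclass(frozen=True)
class CallPair:
    """A call matched with its return."""
    caller: str
    callee: str
    start: int
    end: int


def pair_calls(trace: List[Message]) -> List[CallPair]:
    """Match each return message with the earlier call that has the same ID."""
    open_calls: Dict[int, Message] = {}
    pairs: List[CallPair] = []
    for msg in sorted(trace, key=lambda m: m.timestamp):
        if msg.kind == "call":
            open_calls[msg.call_id] = msg
        elif msg.kind == "return" and msg.call_id in open_calls:
            call = open_calls.pop(msg.call_id)
            pairs.append(CallPair(call.sender, call.receiver,
                                  call.timestamp, msg.timestamp))
    return pairs


def candidate_parents(child: CallPair, pairs: List[CallPair]) -> List[CallPair]:
    """Calls into the child's caller whose interval strictly encloses the child."""
    return [p for p in pairs
            if p.callee == child.caller
            and p.start < child.start and child.end < p.end]


def self_delay(parent: CallPair, pairs: List[CallPair]) -> int:
    """Time spent at the callee itself.

    Assumes nested calls do not overlap and that each nested call has
    exactly one candidate parent; real traces are more ambiguous.
    """
    nested = [c for c in pairs if candidate_parents(c, pairs) == [parent]]
    return (parent.end - parent.start) - sum(c.end - c.start for c in nested)


# A client calls a web server, which calls a database before replying.
trace = [
    Message(0, "client", "web", 1, "call"),
    Message(2, "web", "db", 2, "call"),
    Message(7, "db", "web", 2, "return"),
    Message(10, "web", "client", 1, "return"),
]
pairs = pair_calls(trace)
outer = next(p for p in pairs if p.callee == "web")
inner = next(p for p in pairs if p.callee == "db")
```

In this toy trace the only candidate parent of the web-to-database call is the client-to-web call, so the ten time units of the outer call split into five at the database and five at the web server. With concurrent requests a nested call may have several candidate parents, which is why causal paths must be inferred rather than simply read off the trace.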
Categories and Subject Descriptors
Testing and Debugging: distributed debugging, testing tools
General Terms
Algorithms, Performance, Measurement
Keywords
Performance debugging, black box systems, distributed systems,
The order of author names is random.
October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 .
Many commercially-important systems, especially Web-based
applications, are composed of a number of communicating com-
ponents. These are often structured as distributed systems, with
components running on different processors or in different pro-
cesses. For example, a multi-tiered system might start with requests
from Web clients that flow through a Web-server front-end and then
to a Web “application server,” which in turn makes calls to a data-
base server, and perhaps additional services (authentication, name
service, credit-card authorization, customer relationship manage-