: Global Comprehension for Distributed Replay
, Gautam Altekar
, Petros Maniatis
, Timothy Roscoe
, Ion Stoica
University of California at Berkeley,
Intel Research Berkeley,
Debugging and profiling large-scale distributed applications is a daunting task. We present a system for debugging distributed applications that combines deterministic replay of components with the power of symbolic, low-level debugging and a simple language for expressing higher-level distributed conditions and actions. The system allows the programmer to understand the collective state and dynamics of a distributed collection of coordinated application components.
We consider several distributed problems, including routing consistency in overlay networks and temporal state abnormalities caused by route flaps. We show via micro-benchmarks and larger-scale application measurement that the system can be used interactively to debug large distributed applications under replay on common hardware.
Distributed applications are complex, hard to design and implement, and harder to validate once deployed. The difficulty derives from the distribution of application state across many distinct execution environments, which can fail individually or in concert, span large geographic areas, be connected by brittle network channels, and operate at varying speeds and capabilities. Correct operation is frequently a function not only of single-component behavior, but also of the global collection of states of multiple components. For instance, in a message routing application, individual routing tables may appear correct while the system as a whole exhibits routing cycles, flaps, wormholes or other inconsistencies.
To face this difficulty, a programmer would ideally be able to debug the whole application, inspecting the state of any component at any point during a debugging execution, or even creating custom invariant checkers on global predicates that can be evaluated continuously as the system runs. In the routing application example, a programmer would be able to program her debugger to check continuously that no routing cycles exist across the running state of the entire distributed system, as easily as she can read the current state of program variables in typical symbolic debuggers.
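As an illustration of the kind of global predicate meant here, the following minimal sketch checks a snapshot of per-node routing state for cycles. It is not the checker language of the system described in this paper; the snapshot format (one next hop per node for a single destination) and the name `find_routing_cycle` are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical snapshot format: for one destination, each node's next hop.
# A next hop of None means the node delivers the message locally.
NextHops = Dict[str, Optional[str]]


def find_routing_cycle(next_hop: NextHops) -> Optional[List[str]]:
    """Return the nodes of one routing cycle, or None if no cycle exists."""
    done = set()  # nodes already known to reach delivery without looping
    for start in next_hop:
        path: List[str] = []   # nodes visited on this walk, in order
        position = {}          # node -> index of that node in path
        node: Optional[str] = start
        while node is not None and node not in done:
            if node in position:
                # We returned to a node on the current walk: a cycle.
                return path[position[node]:]
            position[node] = len(path)
            path.append(node)
            # Nodes missing from the snapshot are treated as delivery points.
            node = next_hop.get(node)
        done.update(path)
    return None


if __name__ == "__main__":
    # Each entry looks locally plausible, yet the global state loops.
    snapshot = {"a": "b", "b": "c", "c": "a"}
    print(find_routing_cycle(snapshot))
```

Each node's entry is individually valid (it names a real neighbor), so the fault is visible only when the entries of all nodes are examined together, which is the point of a global predicate.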
The system we present in this paper is a first step towards realizing this vision. It (1) captures the distributed execution of a system, (2) replays the captured execution trace within a symbolic debugger, and (3) extends the debugger’s programmability for complex predicates that involve the state of the replayed system. To our knowledge, this is the first replay-based debugging system for unmodified distributed applications that can track arbitrary global invariants at the fine granularity of source symbols.
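The three steps above can be pictured with a toy model. The sketch below is only a simplified illustration, not the system's implementation: it assumes a hypothetical trace of `(logical time, node, variable, value)` updates, whereas the actual system replays real processes inside a symbolic debugger. It replays the trace in time order and re-evaluates a global invariant after every update.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical trace record: (logical time, node, variable name, new value).
Event = Tuple[int, str, str, object]
# Global state: node name -> {variable name -> value}.
GlobalState = Dict[str, Dict[str, object]]


def replay_with_invariant(
    trace: Iterable[Event],
    invariant: Callable[[GlobalState], bool],
) -> List[int]:
    """Replay updates in time order; return the times the invariant fails."""
    state: GlobalState = {}
    violations: List[int] = []
    for time, node, var, value in sorted(trace, key=lambda e: e[0]):
        state.setdefault(node, {})[var] = value  # apply the recorded update
        if not invariant(state):                 # check the global predicate
            violations.append(time)
    return violations


def at_most_one_leader(state: GlobalState) -> bool:
    """Example invariant: no two nodes believe they are leader at once."""
    return sum(1 for vars_ in state.values() if vars_.get("leader")) <= 1


if __name__ == "__main__":
    trace = [
        (1, "n1", "leader", True),
        (2, "n2", "leader", True),   # two leaders from here...
        (3, "n1", "leader", False),  # ...until n1 steps down
    ]
    print(replay_with_invariant(trace, at_most_one_leader))
```

Because the invariant is evaluated against the combined state of all nodes after each replayed update, a transient violation is reported even though each node's own state looks reasonable at every step.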
Capture and replay in