A Gossip-Style Failure Detection Service

Robbert van Renesse, Yaron Minsky, and Mark Hayden*
Dept. of Computer Science, Cornell University
4118 Upson Hall, Ithaca, NY 14853

Abstract

Failure Detection is valuable for system management, replication, load balancing, and other distributed services. To date, Failure Detection Services scale badly in the number of members that are being monitored. This paper describes a new protocol based on gossiping that does scale well and provides timely detection. We analyze the protocol, and then extend it to discover and leverage the underlying network topology for much improved resource utilization. We then combine it with another protocol, based on broadcast, that is used to handle partition failures.

1 Introduction

Accurate failure detection in asynchronous (non-realtime) distributed systems is notoriously difficult. In such a system, a process may appear failed because it is slow, or because the network connection to it is slow or even partitioned. Because of this, several impossibility results have been found [7, 9]. In systems that have to make minimal progress even in the face of process failures, it is still important to try to determine if a process is reachable or not. False detections are allowable as long as they are reasonable with respect to performance. That is, it is acceptable to report an exceedingly slow process, or a badly connected one, as failed.

Unfortunately, when scaled up to more than several dozens of members, many failure detectors are either unreasonably slow, or make too many false detections. Although we are not aware of any publications about this, we know this from experiences with our own Isis, Horus, and Ensemble systems (see, for example, [13, 14, 15]), as well as from experiences with Transis.

In this paper, we present a failure detector based on random gossiping that has, informally, the following properties:

1. the probability that a member is falsely reported as having failed is independent of the number of processes.
2. the algorithm is resilient against both message loss (or rather, message delivery timing failures) and process failures, in that a small percentage of lost (or late) messages or small percentage of failed members does not lead to false detections.

* This work is supported in part by ARPA/ONR grant N00014-92-J-1866, ARPA/RADC grant F30602-96-1-0317 and AFOSR grant F49620-94-1-0198. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of these organizations or the U.S. Government. The current address of Mark Hayden is DEC SRC, 130 Lytton Ave., Palo Alto, CA....