Distributed Eventually Perfect Failure Dectector (EPFD)

Partitialy Synchronous System
Eventually Perfect Failure Dectectors (EPFD)

This is the note for Reliable Distributed Algorithm Week 2 - Basic Abstractions and Failure Dectectors. It discusses what’s a Partitialy Synchronous System, and one of Failure Dectectors - Eventually Perfect Failure Dectectors (EPFD) under the assumption of Partitialy Synchrony.

Partitialy Synchronous System

Synchronous Model assumes :
- Synchronous computation
  - Known upper bound on how long it takes to perform computation
- Synchronous communication
  - Known upper bound on message transmission delay
- Synchronous physical clocks
  - Nodes have local physical clock
  - Known upper bound clock-drift rate and clock skew
Partitialy Synchronous System
- Asynchronous system which eventually becomes synchronous
  - Cannot know when, but in every execution, some bounds eventually will hold
  - “My server processes requests within one week when it is running, and it will eventually be running for at least a week, I just don’t know when that will be.”
Fail-noisy model
- Crash-stop process model
- Perfect links
- Eventually Perfect failure detector (◊P)
Partially synchronous model
- To guarantee safety properties any algorithm has to assume the failure detector inaccurate
- Eventual accuracy is only used to guarantee liveness

Eventually Perfect Failure Dectectors (EPFD)

Definition

An EPFD, defined in a partially synchronous model, should satisfy the following properties:

Completeness: Every process that crashes should be eventually suspected permanently by every correct process
Eventual Strong Accuracy: No correct process should be eventually suspected by any other correct process

Algorithm

See here : EPFD-Algorithm

Algorithm 1 Increasing Timeout

Algorithm 2 Increasing Timeout with sequence numbers

My Question :

I wonder what’s the difference between two different versions? (EPFD - Algorithm)

Without sequence number :

upon event ⟨ pp2p, Deliver | p, [HeartbeatReply] ⟩ do 
    alive := alive ∪ {p}

With sequence number :

upon event ⟨ pp2p, Deliver | p, [HeartbeatReply, n] ⟩ do 
    if n = seqnum ∨ p ∈ suspected then
        alive := alive ∪ {p}

I can understand the algorithm without sequence number, when a process receive a heartbeat message from p, it just marks p as alive. But why we need to check n == seqnum || suspected.contains(p) in second algorithm? What’s the advantage to use a sequence number?

TA’s answer :

Both algorithms are correct, but the one with sequence numbers typically leads to shorter “convergence” times (i.e. time until the FD has detected and adjusted to the latency of the system) and less “noise” in real implementations, because it rejects more heartbeats that are outdated.

Without sequence numbers the algorithm will accept responses from very old phases (if the actual network delay is much longer than the heartbeat delay), which can lead, under certain timing conditions, to awkward resonances where processes alternate between suspected and restored over and over again until they finally settle, which may have adverse effects on downstream components (which typically do some work whenever a suspect or restore is issued).

Implementation (Using Kompics Scala library)

Kompics Scala’s documentation is here

An Eventually Perfect Failure Detector (EPFD), in Kompics terms, is a component that provides the following port (already imported in the notebook).

class EventuallyPerfectFailureDetector extends Port {
 indication[Suspect];
 indication[Restore];
}

Simply put, your component should indicate or ‘deliver’ to the application the following messages:

case class Suspect(src: Address) extends KompicsEvent;
case class Restore(src: Address) extends KompicsEvent;

(Actual implementation is in private git repository)

Few questions remaining :

how to write tests for EPFD?
how does trigger() work?