Long-lasting outages affecting Internet connectivity are widespread and contribute significantly to overall unavailability. To address such problems, we developed LIFEGUARD, a system for automatic failure localization and remediation. LIFEGUARD uses active measurements and a historical path atlas to locate faults, even in the presence of asymmetric paths and failures. Locating Internet faults facilitates recovery, allowing edge ISPs to steer traffic around failures, potentially without requiring the involvement of the network causing the failure.

This site describes outages that our LIFEGUARD deployment isolated. The system has run continuously since January 2011. We currently use one host at each of 12 PlanetLab sites as a source. We monitor paths between those sources and the following sets of destinations:
  • approximately 100 routers in highly connected PoPs in the core of the Internet, selected using iPlane’s traceroute atlas.
  • approximately 200 targets on the edge of the Internet, with paths traversing at least one of the PoPs above.
  • approximately 80 PlanetLab nodes. These targets provide us with a mechanism for validating our results.

In our current deployment, failure isolation is triggered by unreachability events detected by repeated ping failures.
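
As a rough illustration of this trigger, the sketch below counts consecutive ping failures per (source, destination) pair and signals once when a threshold is reached. It is not the LIFEGUARD implementation: the class name, the method name, and the threshold of 4 are hypothetical, and the number of failed pings the deployment actually requires is not stated here.

```python
from collections import defaultdict
from typing import Dict, Tuple

# A monitored path is identified by its (source, destination) pair.
Path = Tuple[str, str]


class UnreachabilityDetector:
    """Hypothetical detector: signals after N consecutive ping failures."""

    def __init__(self, threshold: int = 4) -> None:
        # The threshold value is an assumption made for illustration only.
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self._failures: Dict[Path, int] = defaultdict(int)

    def record(self, source: str, destination: str, ping_succeeded: bool) -> bool:
        """Record one ping result.

        Returns True only on the ping that brings the run of consecutive
        failures up to the threshold, so each outage triggers isolation once.
        """
        path = (source, destination)
        if ping_succeeded:
            # Any successful ping ends the current run of failures.
            self._failures[path] = 0
            return False
        self._failures[path] += 1
        return self._failures[path] == self.threshold
```

A monitoring loop would call `record` after every ping and start failure isolation for the path whenever it returns `True`. Resetting the counter on success means a single lost packet does not trigger isolation.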

We recommend starting here, as it walks through in detail how the system works.

In the future, we plan to publish outage location information in real time as a service to operators. We are currently investigating how best to communicate that information to them. In the meantime, we have put up a few examples.