Long-lasting outages affecting Internet connectivity are widespread and contribute significantly to overall unavailability. To address such problems, we developed LIFEGUARD, a system for automatic failure localization and remediation. LIFEGUARD uses active measurements and a historical path atlas to locate faults, even in the presence of asymmetric paths and failures. Locating Internet faults facilitates recovery: it allows edge ISPs to steer traffic around failures, potentially without requiring the involvement of the network causing the failure. This site describes outages that our LIFEGUARD deployment isolated. The system has run continuously since January 2011. We currently use one host at each of 12 PlanetLab sites as sources. We monitor paths between those sources and the following sets of destinations:
In our current deployment, failure isolation is triggered by unreachability events, which are detected by repeated ping failures. Click the links for the outages listed on the left sidebar to view each isolated failure, along with text explaining how LIFEGUARD identified the problem. We recommend starting here, as that example walks through how the system works in detail. In the future, we plan to publish outage location information in real time as a service to operators. We are currently investigating how best to communicate the information in service of this goal. In the meantime, we have put up a few examples.
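
As an illustration of the trigger described above, in which repeated ping failures mark an unreachability event, here is a minimal Python sketch. It is not LIFEGUARD's actual code: the `PathMonitor` class, its `record` method, and the failure threshold are hypothetical names and values chosen for this example, and the ping results are supplied by the caller rather than measured.

```python
from dataclasses import dataclass


@dataclass
class PathMonitor:
    """Tracks ping results for one source-destination pair (illustrative only)."""

    threshold: int = 4             # hypothetical: consecutive failures needed to trigger
    consecutive_failures: int = 0  # length of the current run of failed pings
    outage_active: bool = False    # True once isolation has been triggered for this run

    def record(self, ping_succeeded: bool) -> bool:
        """Record one ping result; return True when failure isolation should start."""
        if ping_succeeded:
            # Any successful ping ends the current unreachability event.
            self.consecutive_failures = 0
            self.outage_active = False
            return False
        self.consecutive_failures += 1
        # Trigger once per event, when the run of failures first reaches the threshold.
        if self.consecutive_failures >= self.threshold and not self.outage_active:
            self.outage_active = True
            return True
        return False


monitor = PathMonitor(threshold=3)
results = [True, False, False, False, False, True, False]
triggers = [monitor.record(r) for r in results]
print(triggers)
```

Requiring several consecutive failures, rather than reacting to a single lost ping, is one simple way to avoid triggering isolation on transient packet loss. The sketch triggers only once per run of failures, so a long outage does not start isolation repeatedly.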