Monday night, as Hurricane Sandy bore down on New York City, the websites for the United Nations, Buzzfeed, Gawker, the Huffington Post, Daily Kos, Bloomberg News and Livestream went down. (As of this writing, the UN and Gawker are still down.) How did such large and well-engineered websites lose service? Here’s the breakdown.
1. A hosting company called Datagram thought it would be a good idea to put a data center in lower Manhattan.
Datagram’s data center is a building full of computers that never switch off. It hosts key portions of various websites, including Buzzfeed.com and Gawker.com, and possibly also Huffingtonpost.com and Bloomberg.com.
Datagram told Buzzfeed that its basement was flooded with five feet of water. Flooding at 33 Whitehall, where Datagram is located, is rare but, as Sandy proves, certainly not impossible.
Update from Bloomberg: The company does not host any of its services with Datagram. Its issues could be related to the broader connectivity problems that plagued other East Coast data centers, including those used by the Huffington Post.
2. ConEd, New York City’s electrical utility, preemptively shut off power to lower Manhattan, including Datagram.
3. Data centers have backup electrical generators, but the fuel pumps that supply Datagram’s were in the basement, which flooded.
“Basement flooded, fuel pump off line—we got people working on it now. 5 feet of water now,” an official from Datagram told Buzzfeed via text message. As of this morning, Datagram’s website was back up, and carried only this message:
We are continuing to battle flooding and fiber outages in downtown New York and Connecticut. Verizon and other carriers in the area are down as well. Generators are unable pump fuel due to the flooding in the basements. Downed trees are causing fiber outages all across the northeast. We will provide updates as soon as possible. Thank you for your patience.
Rackspace, a different hosting company, outlined on its blog how it prepared its data centers in Northern Virginia for Hurricane Sandy. Rackspace is pretty intense about its disaster planning, but this is a reasonable approximation of how most data centers handle the possibility of power failure:
Currently, the generator fuel tanks in both [Northern Virginia data centers] IAD1 and IAD2 are full and the generators can run for roughly 60 hours—that’s 2 ½ days—without refueling. If needed, we can refuel the generators while they are running. Our emergency fuel provider is on notice that we may require additional fuel, which will be delivered within 24 hours of a service call. Additionally, our facility partner stands at the ready and maintains its own emergency fuel delivery contracts. We have also contacted our generator maintenance and repair vendor, which will place a diesel engine specialist on site at IAD1 if required.
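The figures in Rackspace's statement imply a simple planning margin, sketched below in Python. The 60-hour runtime and 24-hour delivery window come from the quote; the function name and the worst-case assumption (that fuel arrives at the very end of the delivery window) are illustrative and not anything Rackspace has described.

```python
def latest_refuel_call_hour(runtime_hours: float, delivery_hours: float) -> float:
    """Latest hour after the generators start at which a fuel order can be
    placed and still arrive before the tanks run dry.

    Assumes the worst case: delivery takes the full contracted window.
    """
    return max(runtime_hours - delivery_hours, 0.0)


RUNTIME_HOURS = 60    # from the Rackspace quote
DELIVERY_HOURS = 24   # from the Rackspace quote

runtime_days = RUNTIME_HOURS / 24
call_deadline = latest_refuel_call_hour(RUNTIME_HOURS, DELIVERY_HOURS)

print(f"Runtime on full tanks: {runtime_days} days")
print(f"Place the fuel order within {call_deadline} hours of switching to generators")
```

Under those assumptions, an operator has about a day and a half on generator power before a fuel order becomes urgent.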
4. Most of the websites involved continued to be available on the West Coast, but redirecting users on the East Coast can take time.
Contingency planning is tough, and it’s clear that the websites involved weren’t adequately prepared for an eventuality like Sandy. When one set of servers fails, it’s possible to redirect traffic to another set located elsewhere, but because of the way the internet’s Domain Name System (DNS) works, this can take time. DNS servers translate a site’s name into a numerical IP address, and networks around the world keep cached copies of that answer for a set period. “DNS propagation,” as it’s known, can take anywhere from minutes to hours, depending on how things are set up, and until the old cached copies expire and the address of the alternate servers has spread through the network, a site will remain offline for at least some of its users.
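Why the switch takes time can be illustrated with a toy model of a caching DNS resolver, sketched here in Python. This is a simplified simulation, not a real DNS client: the class and function names, the hostname, the one-hour time-to-live (TTL) and the addresses (taken from ranges reserved for documentation) are all hypothetical, and real resolvers do not always honor TTLs exactly.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # seconds on the simulated clock


class CachingResolver:
    """Toy stand-in for an ISP's caching resolver (hypothetical, no network)."""

    def __init__(self, authoritative: dict, ttl_seconds: float):
        self.authoritative = authoritative  # hostname -> current address
        self.ttl_seconds = ttl_seconds
        self.cache = {}

    def resolve(self, hostname: str, now: float) -> str:
        record = self.cache.get(hostname)
        if record is not None and now < record.expires_at:
            return record.address  # cached answer, possibly out of date
        # Cache miss or expired entry: ask the authoritative source again.
        address = self.authoritative[hostname]
        self.cache[hostname] = CachedRecord(address, now + self.ttl_seconds)
        return address


# Hypothetical site whose East Coast servers sit at 192.0.2.10.
zone = {"example.com": "192.0.2.10"}
resolver = CachingResolver(zone, ttl_seconds=3600)

first = resolver.resolve("example.com", now=0)      # caches the East Coast address
zone["example.com"] = "198.51.100.20"               # operator fails over to the West Coast
during = resolver.resolve("example.com", now=900)   # cache has not expired yet
after = resolver.resolve("example.com", now=3600)   # cache expired, new address fetched

print(first, during, after)
```

In this sketch, users behind the resolver keep being sent to the flooded data center until the cached entry expires, which is why a shorter TTL set ahead of time tends to make failover faster.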
Update: Poynter has a detailed account of what happened to the Huffington Post, including network failure at a backup data center.