Two years of disaster planning massively reduced recovery time objective, company says.
If you’re unsure how resilient your organization is to a disaster, there’s a simple way to find out: unplug one of your datacenters from the internet and see what happens.
That’s what Dropbox did in November, though with a bit more forethought. It had been planning for some time to take the San Jose datacenter (its largest) offline, and performed extensive tests before the event. On the day, it took all three datacenters in the city offline by physically pulling each site’s main fiber connection from its port.
Dubbed the “SJC blackhole,” the experiment was determined to be a success after 30 minutes had elapsed with what Dropbox described as no impact to its global availability. “In the unlikely event of a disaster, our revamped failover procedures showed that we now had the people and processes in place to offer a significantly reduced RTO [recovery time objective],” Dropbox said in a postmortem of the event.
According to the company, RTOs were reduced from eight or nine minutes to four or five.
What was Dropbox thinking?
After parting ways with previous hosting service AWS and building its own datacenters, Dropbox said it realized there was a problem: its metadata was highly replicated, but block data wasn’t. “Given San Jose’s proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn’t take Dropbox offline,” the company said.
Dropbox’s first attempt to remove that single point of failure was Magic Pocket, a system that distributes block data across multiple datacenters. Each of them can serve portions of files at the same time, so an outage at one datacenter does not take the service down. This is known as an active-active system because multiple nodes are serving files to users simultaneously.
Dropbox ultimately settled on an active-passive failover model, which still replicates blocks across multiple datacenters, but only serves files from a single location. It said this was necessary because of limitations imposed by how Dropbox itself chose to manage metadata.
“These choices severely limited our architectural choices when designing an active-active system, and made the resulting system much more complex,” Dropbox said.
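The difference between the two models can be shown with a minimal sketch. This is an illustration only, not Dropbox's code: the Datacenter class, the function names and the "sjc" and "dfw" labels are all hypothetical, and a real system would also have to handle replication and metadata consistency.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    # Hypothetical model of a site; not taken from Dropbox's tooling.
    name: str
    healthy: bool = True


def active_active_targets(datacenters):
    """Active-active: every healthy site serves traffic at the same time."""
    return [dc.name for dc in datacenters if dc.healthy]


def active_passive_target(datacenters):
    """Active-passive: only the first healthy site, in priority order, serves."""
    for dc in datacenters:
        if dc.healthy:
            return dc.name
    raise RuntimeError("no healthy datacenter available")


sites = [Datacenter("sjc"), Datacenter("dfw")]

normal_active_active = active_active_targets(sites)   # both sites serve
normal_active_passive = active_passive_target(sites)  # only the primary serves

# Simulate losing the primary site: the passive site takes over.
sites[0].healthy = False
failed_over_target = active_passive_target(sites)
```

In the active-passive case, the time between the primary going unhealthy and the passive site taking over is what the RTO measures.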
Failing over and over
A May 2020 failover tooling failure caused a 47-minute service outage, which pushed Dropbox into high gear on improving its disaster recovery systems. It started by forming a dedicated disaster recovery team, which rebuilt Dropbox’s failover-handling software and then ran a series of tests, of which the November 2021 shutdown was one.
Testing began at Dropbox’s two Dallas-Fort Worth datacenters, and initially things were less than smooth because the team had not realized that all of its S3 proxies were running from the datacenter it took offline. A second test proved more successful, which led to the San Jose experiment.
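The kind of hidden dependency that tripped up the first test could, in principle, be caught by a check run before a site is unplugged. The sketch below is an assumption about how such a check might look, not a description of Dropbox's tooling: the placement map, the service names and the "dfw-a" and "dfw-b" labels are invented for illustration.

```python
def single_site_services(placements, target):
    """Return services whose every instance runs in the target datacenter.

    `placements` maps a service name to the list of sites hosting its
    instances. Both the structure and the names are hypothetical.
    """
    return sorted(
        service
        for service, sites in placements.items()
        if sites and set(sites) == {target}
    )


# Invented inventory: the proxy service runs only in the site to be unplugged.
placements = {
    "s3-proxy": ["dfw-a", "dfw-a"],
    "metadata": ["dfw-a", "sjc"],
    "block-store": ["sjc", "dfw-b"],
}

at_risk = single_site_services(placements, "dfw-a")
if at_risk:
    print("Would lose these services entirely:", ", ".join(at_risk))
```

A check like this only finds what the inventory records, which is one reason a physical unplug test remains useful.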
“Much like our second DFW test, we saw no impact to global availability—and ultimately reached our goal of a 30-minute SJC blackhole,” Dropbox said.
Dropbox’s postmortem is worth paying attention to: not only does it describe how the company distributed its services and made its entire system more resilient, it also shows the kind of work it takes for a large enterprise to commit to such a project.
The entire effort to improve resiliency was described by Dropbox as a multi-year, multi-team project. Its nature as a cloud service may mean Dropbox is more complex than other enterprises, but that should serve as a motivator: disaster recovery planning in other companies may be a lot easier.
Dropbox also recommends that other companies perform regular disaster recovery practise exercises. “Like a muscle, it takes training and practise to get stronger.” ®
Courtesy of: Brandon Vigliarolo