The company details the 20-hour ordeal in this blog post, from the first symptoms of the attack, to Amazon denying there was a problem, to Amazon finally taking significant action.
Here's how it started, from Bitbucket's blog:
What we were seeing on the server was high load, even after turning off anything that took up CPU. Load is a result of stuff "waiting to happen", and after reviewing iostat, it became apparent that the "iowait" was very high, while the "tps" (transactions per second) was very low for our EBS volume. We tried several things at this point:
Un-mounting and re-mounting the volume.
Running xfs_check on the volume, which reported no errors (we use XFS.)
Moving our instances and volumes from us-east-1b to both us-east-1a and us-east-1c.
None of these resolved the problem, and it was at this point we decided to upgrade to the "Gold plan" of support to gain access to the 1-hour turnaround technical support with Amazon.
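The symptom Bitbucket describes, high `%iowait` alongside very low `tps`, is the classic signature of a disk the kernel is waiting on rather than a CPU-bound workload. A quick way to pull that figure out of `iostat` output is a one-line `awk` parse; the sample output and device name below are illustrative, not taken from Bitbucket's actual server:

```shell
# Sample iostat output of the kind Bitbucket would have been staring at:
# high %iowait, almost no transactions per second on the EBS device.
iostat_sample='avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.10    0.00    1.30   78.50    0.00   18.10

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvdf              0.40         1.20         0.00       1200          0'

# %iowait is the 4th field of the avg-cpu data line (line 2 of the report)
iowait=$(printf '%s\n' "$iostat_sample" | awk 'NR==2 {print $4}')
echo "iowait: ${iowait}%"
```

On a live box you would run `iostat -x 5` and watch the same columns over time; a value like 78% iowait with sub-1 tps says the process is blocked on I/O that simply isn't completing, which is exactly why unmounting, `xfs_check`, and moving availability zones changed nothing.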
What is described next is many hours of back and forth between Bitbucket's management and administrator and Amazon support. Nearly 17 hours into the ordeal, Amazon, according to Bitbucket, finally figured out what had happened:
We were attacked. Bigtime. We had a massive flood of UDP packets coming in to our IP, basically eating away all bandwidth to the box. This explains why we couldn't read with any sort of acceptable speed from our EBS, as that is done over the network. So, basically a massive-scale DDOS. That's nice.
This is 16-17 hours after we reported the problem, which frankly, is a bit disheartening. Why did it take so long to discover? Oh well.
Shortly after Amazon blocked the offending UDP traffic, the attackers switched tactics and fired a TCP SYN flood. Fortunately, that was quickly filtered and rectified.
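Bitbucket's post doesn't say how Amazon filtered the SYN flood; the mitigation happened upstream in Amazon's network. For context, a typical host-side defense on a Linux server looks something like the sketch below. The thresholds are purely illustrative, and none of this would have helped Bitbucket here, since the flood was saturating the link before packets ever reached the instance:

```shell
# Hedged sketch of common host-side SYN-flood mitigation (illustrative only).

# SYN cookies let the kernel answer SYNs without committing backlog memory,
# so half-open connections can't exhaust the accept queue.
sysctl -w net.ipv4.tcp_syncookies=1

# Rate-limit inbound SYNs and drop the excess (numbers are examples).
iptables -A INPUT -p tcp --syn -m limit --limit 25/second --limit-burst 50 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP
```

The design point is that SYN floods attack connection state, not bandwidth, so they can often be absorbed on the host, whereas a bandwidth-saturating UDP flood like the first attack has to be filtered by the provider.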
It's a good bet that you are going to read a number of stories about how this event exposes the weakness of cloud computing: how businesses are too dependent on cloud service providers, with some even questioning the entire cloud services model. But this isn't a new story; it's a very old IT story.
Denial-of-service attacks have been with us in a big way since 2000, when Mafiaboy took down a number of large e-commerce sites, including Yahoo!, CNN, and others. Many other causes of downtime have been with us since the beginning: power outages, dying systems, broken applications, business partners going out of business, the admin who decides to lock everyone out of the network, hurricanes, floods, and strong winds. You get the idea.
When any of these things happen, you need to open your business-continuity and disaster recovery plan and put it into action.