On Monday, December 15, 2008 I went to work for an online video start-up. On my first day our video players loaded on 384 of our partner’s web pages. As you can see from the first chart that was actually a pretty good day for us (it was actually our biggest day to date) as we were averaging less than 100 pages a day. During my interview with the founder and CEO he confidently told me the company would be one of the 1000 busiest sites within one year. I believed him.
|Chart 1 – 384 page views on Dec 15, 2008|
I had a comfortable job as a Network Operations Director with a mature and stable company, but I felt like I needed a change so I was looking a little for something new. A colleague had recently found a job which had been posted on craigslist.org, so I was trolling IT position postings there. I was frustrated because most jobs wanted specific skills, like Active Directory, Exchange, routers, firewalls, etc. I didn’t want to do just one thing or focus on a narrow area. Then I came across a post on craigslist looking for a utility player, one who had varied experience and could wear many hats. It was a match made in heaven. (I later found out this company was frustrated by not being able to find someone with a wide range of experience as all previous applicants had been siloed in specific areas. They had actually given up and pulled the post less than a day after I ran across it.)
Based on the CEO’s statement that we were going to become one of the one thousand busiest web services on the Internet I was tasked with building a scale-able system that could grow rapidly (along with all other IT related duties, but that’s another story…). Oh, and because we were a start-up funded solely by friends and family of our founder I had an extremely lean budget for equipment, facilities and personnel. Basically the budget was zero.
Admittedly I was a little naive, but I’m also optimistic and very determined. So I set out to do what I was asked.
At the time we were sharing a handful of servers with another start-up in a colo across the country. I had 20 year-old 1U Dell servers, a couple gigabit switches, two entry-level Cisco firewalls, and two low-end load balancers. I quickly put together a fairly lean list of servers and networking equipment I needed and tried to get a few hundred grand to buy that and setup at least one separate location. The answer came back that I couldn’t even spend one tenth of what I needed & I had to figure out how to make things work without any capital expenditure.
Then on January 19-24, 2009 while I was trying to figure out how to work miracles we had our first Slashdot effect event when one of our partners had an article containing our player featured on politico.com (note: at the time we were mainly politically oriented, now we are a broad-based news, entertainment and sports organization). We went from averaging less than 100 player loads (AKA page views) per day to over 500,000 in a single day. Needless to say our small company was ecstatic, but I was a bit nervous. While our small infrastructure handled the spike, it did so just barely.
|Chart 2 – January 19-24, 2009 Slashdot effect|
When I started with this new company was when I was introduced to Amazon Web Services and started dabbling with EC2 and S3 right away. In fact, we started running our corporate website on EC2 a little over a month before I started, and we ran it on the exact same server for just over five years.
Admittedly I was somewhat hesitant to use AWS. First, the concept of every server having a public IP address, then the fact that they didn’t have an SLA, and finally the only way to load balance was to build your own with something like HA Proxy on EC2 servers. But the compelling factors, elasticity, pay as you go, no CapEx, etc., were really attractive, especially to someone like me who didn’t have any money for equipment, nor could I hire anyone to help build and maintain the infrastructure.
Sometime in the spring of 2009 when AWS announced Elastic Load Balancing I was swayed and fully embraced moving to “the cloud.” I started right away copying our (~200 GB) video library and other assets to S3, and started a few EC2 servers on which I started running our web, database and application stacks. By August of 2009 we were serving our entire customer-facing infrastructure on AWS, and averaging a respectable quarter million page views per day. In October of that year we had our second 500,000+ day, and that was happening consistently.
|Chart 3 – 2009 Traffic|
Through most of 2009 our company had 1 architect, 0 DBA’s (so this job defaulted to me), and 1 operations/infrastructure guy (me), and we were outsourcing our development. We finally started to hire a few developers and brought all development in-house, and we hired our first DBA, but it was still a skeleton crew. By the end of that year we were probably running 20-30 EC2 servers, had a couple ELB’s, and stored and served (yes, served) static content on S3. Things were doing fairly well and we were handling the growth.
|Chart 4 – Explosive Growth in 2010|
2010 was a banner year for us. In Q1 we surpassed 1 million, 2 million and even 5 million page views per day. And by Q3 we were regularly hitting 10 million per day. Through it all we leveraged AWS to handle this load, adding EC2 servers, up-sizing servers, etc. And (this is one of my favorite parts) didn’t have to do a thing with ELB as AWS scaled that up for us as needed, automatically.
We were still a skeleton crew, but finally had about ten people in the dev, database and operations group(s). Through this all and well beyond we never had more than one DBA, and one operations/infrastructure guy.
I can’t say this growth wasn’t without pain though. We did have a few times when traffic spikes would unexpectedly hit us, or bottlenecks would expose themselves. But throughout this time we were able to optimize our services making them more efficient, more able to grow and handle load, and even handle more calls per server driving costs (on a per call basis) down considerably. And, yes, we benefited greatly from Amazon’s non-stop price reductions. I regularly reported to our CEO and others about how our traffic was growing exponentially but our costs weren’t. Win, win, win!
I’m a bit of a data junky and I generate and keep detailed information on number of calls/hits to our infrastructure, amount of data returned per call, and ultimately cost per call. This has enabled me to keep a close eye on performance and costs. And I’ve been able to document when we’ve had numerous wins and fails. I’ve identified when particular deployments have begun making more calls or returning more data usually causing slower performance and always costing more money. I’ve also been able to identify when we’ve had big wins by improving performance and saving money.
The main way I’ve done this is to leverage available CPU capacity when servers have been underutilized on evenings and weekends. Currently on a daily basis I analyze close to 1 billion log lines, effectively for free. This is a high-level analysis looking at things like numbers of particular calls, bandwidth, HTTP responses, browser types, etc.
|Chart 5 – More Growth in 2011|
|Chart 6 – Continued Growth in 2012 and 2013|
|Chart 07 – AWS CloudWatch Stats Showing Over 400,000 Calls Per Day|