We've had a bad couple of weeks and we are the first to admit it. For April we had 99.62% uptime, which means that we were down for 2h 45m 5s. This is by far the worst month in awhile, latency has increased and we have a bunch of frustrated clients. I wanted to give you all the inside view of what's going on and what we are doing to fix it.
While no one was looking Embedly has grown. Here are a few stats.
- 1,200 URLs per second average with a peak of 4,600.
- We will serve requests for ~2,500,000,000 URLs over the next month.
- Team of 4 Engineers.
We on-boarded a few large clients that have quadrupled our unique traffic over the last 2 weeks and brought the pain train. Here is the nerdy, technical story of how Embedly scaled to handle the load, the tools that we used and why we went down. If you are a "Social Media Expert" or care about your Klout score, you should stop reading now.
At Embedly we measure uncached URLs per second (UUPS), as they are the bottleneck of the system. About a month ago we were doing about 25 UUPS, today that number stands between 100-125 with spikes up to 800 UUPS. Yes, I just made up a new measurement, but if Groupon can do it, so can I.
Before on-boarding these clients we did some initial load testing and it was very clear that our current system was not going to work. We had about 30 1 Gig boxes on Rackspace with 4 instances of a Tornado app running on each. This was excessive for traffic at the time, but we kept them up just in case. Here is what happened over the next 2 weeks.
4/11: Load Testing
Our first load test was a complete failure, but it was short, so we only fell over for a few minutes. Like any good startup, we threw more boxes at the problem. 60 1 Gig boxes here we come. To the cloud!
4/16: Load Testing
This one was sustained for a few hours and it cost Embedly about a half hour of down time. This was when we knew we were in trouble.
The first bottleneck was in the app servers that actually make requests and do the parsing. Tornado, like any async framework, is only as good as the time that you spend not blocking. Parsing large HTML documents and images means that each app blocks the IOLoop for a substantial amount of time. Because of this we were always memory bound, rather than CPU bound. Each thread could only do so much work and we couldn't push any more work through them.
Enter ZMQ. The only way we could push more work is to create more instances of Tornado onto each box. To do this we set up frontends and workers using the PUSH/PULL pattern in ZMQ.
No one likes queues because they create single points of failure. ZMQ is a little better, but the trade off is in configuration. If you have 30 workers across 30 boxes, everyone has to know about each other. In Embedly's case, that's about 1800 ports that need to stay semi static. We drop, create and get migrated by Rackspace so often that this wasn't feasible.
Instead we opted for larger boxes that contained 8 frontends and 30 workers on each. The frontends PUSH/PULL down to the workers and the workers PUB/SUB back to the frontends. This allows us to scale quicker without worrying about notifying existing frontends that new workers are available.
pyzmq comes with built in Tornado support. A quick ioloop.install() and then ZMQ can run off the same IOLoop that Tornado is running on.
4/19: Jimbo
Once we deployed this fix we were able to keep up with the load testing traffic, but then it became a game of Whac-A-Mole. All the supporting systems we had in place couldn't handle the load.
The first to go was Analytics. Our real time reporting process (Jimbo) is based on LogStash dumping logs into a Redis queue that workers pull off of that tells us how we are doing. That queue got backed up to about a million items, then died. We rely on Jimbo pretty heavily for health checks, so we were flying blind.
More workers, helped, but now we have abandoned Jimbo completely for about 16 88 lines of node.js, Statsd and Graphite. Jimbo gave us more insights, but maintaining it took time away from keeping the site up.
4/25: Cassandra
Next to die was Cassandra. I believe that this is mostly our fault, rather than the tool itself, but after about a terabyte of data we got a ton of unavailable exceptions from PyCassa. Each one of these errors cost us about 3 seconds of blocking time. Lowering timeout helped, but in reality we had too much data and not enough boxes. TTL also isn't working properly for us as well, hence why we have a terabyte of data in Cassandra.
Luckily Embedly's storage library (Coffer) is configurable so we can shut off writes and reads via config files. We took Cassandra out, life goes back to normal. We will eventually add Cassandra back in, as it gives us more permanent storage for things like RSS feeds and API payloads. We just won't be putting all our data in there forever anymore. At this point we are feeling pretty good.
4/27: Couchbase
This weekend Couchbase took a dive. We had a pretty good run with it, but when Couchbase got 60% full it died hard. We were simply saving too much data. At 60% Couchbase starts writing to disk and everything falls over.
We can't save the cluster at that time and need to bring up a new one. Saturday and Sunday we had 2 different Couchbase clusters, rotating traffic around them after one died. This might have been a new low. Literally the worst possible way to handle traffic that I know of. I hope you don't judge us for this one.
4/29: Fixed?
Sunday we finally fixed the issue by creating a 180 GB Couchbase cluster without replication. We also lowered cache time to 3 hours instead of 5 days. Our working set now fits into about 15% of capacity which seems to be a sweet spot. In Couchbase's defense it does handle 14,000 ops per second for us.
And that brings us up today. Defcon 3.
We obviously know that this isn't the solution. We could buy everyone in the company a car each month with what we spend on hosting. We do however need to make smarter choices about technology, caching and persistent storage.
We apologize for the issues. We are working on bettering the service everyday.
Going forward there there are a ton of optimizations we're plan on making. Async DNS, analytics, long term storage, multiple availability zones, faster image processing, a URL fetching service etc, etc. If any of the above interests you, we are hiring!
BTW, If you find yourself in this situation, strip everything down to the bare bones and get a big cache.
Thanks to Ben Darnell for helping us with blocking in Tornado and more importantly the team here that made it happen.
Sean