Blog

Just dropping some knowledge on y’all.

The Rise of Open Graph

42% of all URLs that Embedly processes have one or more Open Graph tags.

If you aren't familiar with Open Graph, it's the semantic metadata that Facebook introduced in 2010. Initially, it could only provide the title, image, and description for links and a few other objects, but it's been extended to power pretty much every third-party application in the stream. Yes, the special sauce that allowed Viddy and SocialCam to amass millions of users in days is Open Graph.

Recently on Quora, we were asked how Open Graph had affected us. So we added a few variables to our Statsd/Graphite setup. In this graph, the purple region represents links we've crawled that provide Open Graph metadata as a percentage of all links:

Open_graph_percent
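For anyone wanting to track something similar, here is a minimal sketch of how two StatsD counters could feed a graph like this. The bucket names, the helper functions, and the local StatsD address are assumptions for illustration, not our actual setup; only the `bucket:value|c` counter format comes from StatsD itself.

```python
import socket

def statsd_counter(bucket, count=1):
    """Format a StatsD counter increment, e.g. b'some.bucket:1|c'."""
    return "{}:{}|c".format(bucket, count).encode("ascii")

def record_crawl(has_open_graph, sock, addr=("127.0.0.1", 8125)):
    """Count every crawled URL, and separately count those with Open Graph tags.

    The bucket names below are hypothetical.
    """
    sock.sendto(statsd_counter("crawler.urls.total"), addr)
    if has_open_graph:
        sock.sendto(statsd_counter("crawler.urls.opengraph"), addr)

if __name__ == "__main__":
    # UDP is fire-and-forget, so this works even if nothing is listening.
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    record_crawl(True, udp)
    record_crawl(False, udp)
```

With both series in Graphite, dividing one by the other (for example with Graphite's asPercent function) would give the percentage shown in the graph.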

Embedly's crawler doesn't go out looking for URLs. We only process URLs that have been shared through our API. You could therefore postulate that our Open Graph percentage is higher than the web-wide figure, since the sites that are shared most are the ones optimized for sharing on Facebook. All these graphs were generated over the last 36 hours, which is a sample size of 12 million URLs.

Open_graph_types

By far the most popular tags are title, description, image, type and site_name. Article/Blog is the most prevalent type out there.
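If you want to see which of these tags a page carries, a rough sketch using only Python's standard library might look like this. The sample markup and the function names are made up for illustration, and a real crawler has to cope with far messier HTML than this does.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            # Strip the "og:" prefix and keep the first value seen per property.
            self.tags.setdefault(prop[3:], attrs["content"])

def open_graph_tags(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags

# Hypothetical page head, not taken from a real site.
SAMPLE = """<html><head>
<meta property="og:title" content="The Rise of Open Graph" />
<meta property="og:type" content="article" />
<meta property="og:site_name" content="Embedly Blog" />
<meta name="description" content="not an Open Graph tag" />
</head><body></body></html>"""

if __name__ == "__main__":
    print(open_graph_tags(SAMPLE))
```

Nested properties such as og:video:secure_url come out under the key "video:secure_url".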

Open_graph_video

Video is the most popular rich type. People are fairly good about setting width, height and type, but fewer have a secure_url.

Open_graph_audio

Audio is barely used. We do have far fewer audio providers than video, so this isn't a shock.

Open_graph_location

Location is used a bit more often than Audio. This one is surprising, because large providers like Foursquare don't use Open Graph location tags; instead, they have a special Facebook app syntax. We are going to look into this one a little more.

It's astonishing how quickly Facebook was able to affect metadata. Open Graph is trending up, so I assume this percentage will increase greatly over the next few years.

Open_graph_trends

The next post in this series will compare how different formats have been adopted. 

Sean

If you are interested in working with this data, check out our jobs page.

 

 

 

Scaling

We've had a bad couple of weeks and we are the first to admit it. For April we had 99.62% uptime, which means that we were down for 2h 45m 5s. This is by far the worst month in a while: latency has increased and we have a bunch of frustrated clients. I wanted to give you all the inside view of what's going on and what we are doing to fix it.
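For the curious, the uptime figure follows directly from the downtime. A quick sketch of the arithmetic, assuming a 30-day April and a helper name of my own choosing:

```python
from datetime import timedelta

def uptime_percent(downtime, period):
    """Share of the period the service was up, as a percentage."""
    return 100.0 * (1 - downtime / period)

april = timedelta(days=30)
downtime = timedelta(hours=2, minutes=45, seconds=5)

print(round(uptime_percent(downtime, april), 2))  # 99.62
```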

Defcon_3
While no one was looking, Embedly has grown. Here are a few stats.

  • 1,200 URLs per second average with a peak of 4,600.
  • We will serve requests for ~2,500,000,000 URLs over the next month.
  • Team of 4 Engineers.

We on-boarded a few large clients that have quadrupled our unique traffic over the last 2 weeks and brought the pain train. Here is the nerdy, technical story of how Embedly scaled to handle the load, the tools that we used and why we went down. If you are a "Social Media Expert" or care about your Klout score, you should stop reading now.

At Embedly we measure uncached URLs per second (UUPS), as they are the bottleneck of the system. About a month ago we were doing about 25 UUPS; today that number stands between 100 and 125, with spikes up to 800 UUPS. Yes, I just made up a new measurement, but if Groupon can do it, so can I.

Before on-boarding these clients we did some initial load testing and it was very clear that our current system was not going to work. We had about 30 1 Gig boxes on Rackspace with 4 instances of a Tornado app running on each. This was excessive for traffic at the time, but we kept them up just in case. Here is what happened over the next 2 weeks.

4/11: Load Testing

Our first load test was a complete failure, but it was short, so we only fell over for a few minutes. Like any good startup, we threw more boxes at the problem. 60 1 Gig boxes here we come. To the cloud!

4/16: Load Testing

This one was sustained for a few hours and it cost Embedly about a half hour of downtime. This was when we knew we were in trouble.

The first bottleneck was in the app servers that actually make requests and do the parsing. Tornado, like any async framework, is only as good as the time that you spend not blocking. Parsing large HTML documents and images means that each app blocks the IOLoop for a substantial amount of time. Because of this we were always memory bound, rather than CPU bound. Each thread could only do so much work and we couldn't push any more work through them.
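To make the blocking problem concrete, here is a small sketch using asyncio from the standard library rather than our Tornado code. Everything in it is an assumption for illustration: parse_document is a stand-in that just sleeps, and the heartbeat coroutine plays the role of all the other requests waiting on the same loop.

```python
import asyncio
import time

def parse_document(seconds):
    """Stand-in for heavy HTML/image parsing; blocks the calling thread."""
    time.sleep(seconds)

async def heartbeat(gaps, stop):
    """Wake up every 10 ms and record how long each wake-up really took."""
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(0.01)
        now = time.monotonic()
        gaps.append(now - last)
        last = now

async def measure(offload, work=0.3):
    """Return the longest stall the heartbeat saw while the 'parse' ran."""
    gaps = []
    stop = asyncio.Event()
    beat = asyncio.ensure_future(heartbeat(gaps, stop))
    await asyncio.sleep(0.05)            # let the heartbeat get going
    if offload:
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, parse_document, work)
    else:
        parse_document(work)             # runs on the loop: nothing else moves
    await asyncio.sleep(0.05)
    stop.set()
    await beat
    return max(gaps)

if __name__ == "__main__":
    print("parse on the loop, worst stall:", asyncio.run(measure(offload=False)))
    print("parse offloaded, worst stall:", asyncio.run(measure(offload=True)))
```

The thread pool only helps here because the stand-in sleeps. Real CPU-bound parsing in Python would still fight over the interpreter lock, which is why pushing the work out to separate worker processes is the more realistic fix.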

Enter ZMQ. The only way we could push more work was to run more instances of Tornado on each box. To do this we set up frontends and workers using the PUSH/PULL pattern in ZMQ.

No one likes queues because they create single points of failure. ZMQ is a little better, but the trade off is in configuration. If you have 30 workers across 30 boxes, everyone has to know about each other. In Embedly's case, that's about 1800 ports that need to stay semi static. We drop, create and get migrated by Rackspace so often that this wasn't feasible.

Instead we opted for larger boxes that contained 8 frontends and 30 workers on each. The frontends PUSH/PULL down to the workers and the workers PUB/SUB back to the frontends. This allows us to scale quicker without worrying about notifying existing frontends that new workers are available.

pyzmq comes with built in Tornado support. A quick ioloop.install() and then ZMQ can run off the same IOLoop that Tornado is running on.

4/19: Jimbo

Once we deployed this fix we were able to keep up with the load testing traffic, but then it became a game of Whac-A-Mole. All the supporting systems we had in place couldn't handle the load.

The first to go was analytics. Our real-time reporting process (Jimbo) is based on LogStash dumping logs into a Redis queue; workers pull items off that queue and tell us how we are doing. That queue got backed up to about a million items, then died. We rely on Jimbo pretty heavily for health checks, so we were flying blind.

More workers helped, but we have now abandoned Jimbo completely for about 16 88 lines of node.js, Statsd and Graphite. Jimbo gave us more insights, but maintaining it took time away from keeping the site up.

4/25: Cassandra

Next to die was Cassandra. I believe that this is mostly our fault, rather than the tool's, but after about a terabyte of data we got a ton of unavailable exceptions from PyCassa. Each one of these errors cost us about 3 seconds of blocking time. Lowering the timeout helped, but in reality we had too much data and not enough boxes. TTL also isn't working properly for us, which is why we have a terabyte of data in Cassandra.

Luckily Embedly's storage library (Coffer) is configurable, so we can shut off writes and reads via config files. We took Cassandra out and life went back to normal. We will eventually add Cassandra back in, as it gives us more permanent storage for things like RSS feeds and API payloads. We just won't be putting all our data in there forever anymore. At this point we are feeling pretty good.

4/27: Couchbase

This weekend Couchbase took a dive. We had a pretty good run with it, but when Couchbase got 60% full it died hard. We were simply saving too much data. At 60% Couchbase starts writing to disk and everything falls over.

We can't save the cluster at that time and need to bring up a new one. Saturday and Sunday we had 2 different Couchbase clusters, rotating traffic around them after one died. This might have been a new low. Literally the worst possible way to handle traffic that I know of. I hope you don't judge us for this one.

4/29: Fixed?

Sunday we finally fixed the issue by creating a 180 GB Couchbase cluster without replication. We also lowered cache time to 3 hours instead of 5 days. Our working set now fits into about 15% of capacity which seems to be a sweet spot. In Couchbase's defense it does handle 14,000 ops per second for us.
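To show why the shorter cache time shrinks the working set, here is a toy in-memory TTL cache. This is not Couchbase's API or our Coffer library, just an illustrative sketch with names I made up; the clock is injectable so the example can fast-forward time.

```python
import time

THREE_HOURS = 3 * 60 * 60
FIVE_DAYS = 5 * 24 * 60 * 60

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock            # injectable so examples can fake time
        self._store = {}              # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]      # expired: free the space, report a miss
            return None
        return value

    def __len__(self):
        """Number of entries that are still live right now."""
        now = self.clock()
        return sum(1 for expires_at, _ in self._store.values() if expires_at > now)
```

A URL fetched four hours ago is still taking up room under the 5 day setting but is already gone under the 3 hour one. The trade-off is more cache misses, and so more uncached fetches, in exchange for a smaller working set.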

And that brings us up to today. Defcon 3.

We obviously know that this isn't the solution. We could buy everyone in the company a car each month with what we spend on hosting. We do however need to make smarter choices about technology, caching and persistent storage.

We apologize for the issues. We are working on bettering the service every day.

Going forward there are a ton of optimizations we plan on making: async DNS, analytics, long-term storage, multiple availability zones, faster image processing, a URL fetching service, etc. If any of the above interests you, we are hiring!

BTW, if you find yourself in this situation, strip everything down to the bare bones and get a big cache.

Thanks to Ben Darnell for helping us with blocking in Tornado and more importantly the team here that made it happen.

Sean

 

Yesterday

We had a bad day yesterday and we are still waiting on some permission to publish a postmortem about what exactly happened. In the meantime, here are the basics.

  1. We got a spike.
  2. Things crashed.
  3. We hit our Rackspace API and RAM limits at the same time.
  4. Membase is corrupted.
  5. Traffic dies down.
  6. Rackspace ups our limit.
  7. Things get better.

It was less than ideal for everyone involved. So while we wait, we have set up http://status.embed.ly to notify everyone of what's going on behind the scenes.

There is already a post on there (Analytics is down). 

We are also going to use it to announce smaller updates to the API, like bug fixes and small enhancements.

It should be interesting, so follow along!

Sean 

 

Welcome Andy Pellett

Andy joined Embedly a few weeks ago. We told him we wouldn't announce it till he pushed his first major change to Embedly. That happened today.

Andy pushed a new, more efficient way of pulling, parsing and saving images to obtain the correct metadata. This dramatically reduces the number of HTTP calls Embedly has to make.

Andy grew up in Alaska and received his bachelor's and master's from the University of Maine. He hates condiments and is a decent fisherman.

Anrope

If you notice that Embedly is a bit faster today, thank Andy.

 

oEmbed for Spotify

Spotify is the default music service around the office, so when they added embeds, we jumped all over it. They launched the "Spotify Play Button" with Tumblr, but it should be on every service. The branding is interesting: "embed" is only mentioned once in that post, whereas "Play Button" is mentioned 6 times.

Here is a Storify with example Spotify embeds. 

I have to say, it's pretty awesome. Press play on any track above, everything is in sync and it just works. 

Here is how to use it yourself:

oEmbed API call:

http://api.embed.ly/1/oembed?url=http%3A%2F%2Fopen.spotify.com%2Ftrack%2F6ol4...

Explorer View:

http://embed.ly/docs/explore/oembed?url=http%3A%2F%2Fopen.spotify.com%2Ftrack...
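If you'd rather build the API call in code, a minimal sketch could look like the following. The track ID is a placeholder (the real one above is truncated), the helper name is mine, and maxwidth is just the standard optional oEmbed parameter; check the API docs for what your account actually requires.

```python
from urllib.parse import urlencode

OEMBED_ENDPOINT = "http://api.embed.ly/1/oembed"

def oembed_url(page_url, **params):
    """Build an oEmbed request URL, percent-encoding the target page URL."""
    query = {"url": page_url}
    query.update(params)
    return OEMBED_ENDPOINT + "?" + urlencode(query)

if __name__ == "__main__":
    # EXAMPLE_TRACK_ID is a placeholder, not a real Spotify track.
    print(oembed_url("http://open.spotify.com/track/EXAMPLE_TRACK_ID"))
    print(oembed_url("http://open.spotify.com/track/EXAMPLE_TRACK_ID", maxwidth=500))
```

Fetching that URL should return a JSON payload containing the embed HTML for the player.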

You can also use Embedly's Parrotfish plugin to see Spotify in Twitter or the Embedly WordPress plugin for easy blogging.

Enjoy!

Sean

 

Why no internetz, Boston?

We finally moved into our new Boston West End/North Station office space (blog post coming) on Portland St on March 2nd. After spending the first 6 months of Embedly in San Francisco, the next year in Boston's Innovation District on the waterfront, and then 10 months in the Cambridge Innovation Center (directly across from MIT), we decided to round out our tour of Boston and select office space in the heart of the city, right across the street from the legendary Boston Garden, home of the Celtics and Bruins. The area is currently being revived, not only because Embedly has moved in, but also because we are 2 blocks over from the new luxury 12-15 floor Archstone apartment buildings and about 3-4 blocks away from Government Center/Boston City Hall, Faneuil Hall Marketplace, Suffolk County Courthouse, and the Financial District. We plan on being here for the foreseeable future, and the energy of a new space and a new hire has been extremely productive for Embedly.


Art (me) was in charge of setting up internet. This is pretty important for a web-based company, and in this day and age every business should have a solid internet connection. Our research led to Comcast and Verizon, which informed us that we were eligible for cable and DSL, respectively. This was music to our ears, and it was great to know that we would be avoiding the misery of a DSL connection. We immediately followed up with the New England Comcast business rep and ended up meeting with an onsite engineer to proceed with our setup. Despite our original conversations, the engineer came back to say that there was no way we were getting cable in our building. Our initial excitement was quickly lost when he mentioned there was actually a draft construction plan in front of Boston officials. The plan has an estimated cost of $60,000 that no one seems to want to agree on, with neither the City of Boston nor Comcast assuming any responsibility for the project's completion. It seems that our office building, containing 10+ businesses, is not worthy of their consideration.

With options quickly disappearing, we were forced to take Verizon DSL.  Verizon and Comcast must be working together in this City splitting referrals because Verizon quickly fell into the ‘Over Promise, Under Deliver’ bucket. After starting off with a paltry 5Mbps, Verizon has made us jump through hoops to try to upgrade to the 10-15 Mbps "Fastest" plan. According to their phone sales reps we are about 1400 feet from their Verizon central office, which qualifies us for getting the upgraded speed. Unfortunately, this did not go as planned (see embedded pdf of our email conversation).  After the Verizon Boston central office offered us the upgrade, a week passed with no response from the Verizon side and it may not even be available!?!  We just don't get it. Bob has a 4G connection of 14Mbps down/4Mbps up on his cell phone. Why no bandwidth for businesses?

We find this whole issue ironic when you look at Boston's push to attract businesses. The City of Boston wants us to innovate and to keep technology businesses in the city, but the services to allow us to do so are severely lacking. Let's make "good" internet available everywhere in the city, and let's figure out a solution that allows companies to grow and "stay" in Boston. I am pretty sure San Francisco has about 10 different internet offerings for businesses at competitive prices.

One more parting note: our floor mates have propositioned us with a 100Mbps fiber line, at a pricey $2000/month. Really, that's my alternative? Our rent is barely that high. Get it together, Boston.

 

Post sources:

* Email w/ Verizon

verizon_email_log.pdf (161 KB)

* Andy (the new guy):

Moved into his Boston West End apartment and within 1 day had a 25 Mbps RCN connection.

 

We've Missed Provider Updates

We have not done a providers blog post in over 6 months, and really do miss finding some shiny new videos or images to present to you guys. Our provider queue is heavy with budding video startups who are even sending us links to videos hosted on localhost, but being early to the embedding game is a good thing.

We have a unique bunch to show you today: a Napster for photos, a professional social network, real-time video casting, and an E-Learning site.

Let's jump to it with a few examples:

* Tipi Trampoline from Pinterest.

Pinterest

Linkedin

* Spreecast with Embedly airing on Spreecast.

Spreecast

* Lesson on Circumference and Area from ShowMe.

Showme

Check out our Spreecast w/ Spreecast. Enjoy!

 

Why blindly following meta tags is a bad idea.

We do something that is completely radical when it comes to a description of a page. We try to pick the best one. GASP!

Here is a hypothetical situation. For this html, what would the user expect the description to be?

Embedly will pick the following excerpt:

"This is a funny and insightful article that somehow got on this evil site that I would like to share with my friends. I would expect when I share this link that the first sentence of the article is the description."

Facebook will pick:

"WIN A FREE IPAD: http://fake.net"

Google will pick:

"WIN A FREE IPAD: http://fake.net"

Though interestingly enough, Google will use: "This is a funny and insightful article that somehow got on this ..." as the title.

If you ever wondered why Embedly doesn't blindly follow meta tags, this is why.
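The markup for the hypothetical page isn't shown here, so the sketch below invents a stand-in and a deliberately naive picker. To be loud about it: this is not Embedly's algorithm. The looks_spammy heuristic (a link in the description, or mostly uppercase text) and every name in the code are assumptions, there only to illustrate choosing between a meta description and the article text.

```python
import re
from html.parser import HTMLParser

class PageText(HTMLParser):
    """Grab the meta description and the text of the first <p>."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.first_paragraph = None
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content")
        elif tag == "p" and self.first_paragraph is None:
            self._in_p = True

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace in the collected paragraph text.
            self.first_paragraph = " ".join("".join(self._buf).split())

def looks_spammy(text):
    """Hypothetical heuristic: contains a link, or is mostly uppercase."""
    letters = [c for c in text if c.isalpha()]
    shouting = bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.5
    return bool(re.search(r"https?://", text)) or shouting

def pick_description(html):
    """Prefer the meta description unless it looks spammy; else the first paragraph."""
    page = PageText()
    page.feed(html)
    meta = page.meta_description
    if meta and not looks_spammy(meta):
        return meta
    return page.first_paragraph or meta

# Invented stand-in for the evil page.
PAGE = """<html><head>
<meta name="description" content="WIN A FREE IPAD: http://fake.net">
</head><body><p>This is a funny and insightful article.</p></body></html>"""

if __name__ == "__main__":
    print(pick_description(PAGE))
```

A well-behaved meta description still wins under this sketch. The fallback only kicks in when the tag looks like it was written for someone other than the reader.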

Screen_shot_2012-03-30_at_10
Screen_shot_2012-03-30_at_10

Embedly Challenge Results

On Friday, Embedly offered Hacker News a coding challenge. Apply.embed.ly asked developers to solve 3 different problems and submit their solutions. We didn't force people to apply for 1 of the 3 positions that we have open, just nerd out on some problems. We are going to talk about the results and the answers.

Here is a quick funnel of users to apply.embed.ly

1. Sum of Digits

Based on a question from Project Euler. Given the formula:

R(n) is the sum of the digits of n!.
For example, 10! = 3628800
R(10) = 3 + 6 + 2 + 8 + 8 + 0 + 0 = 27.

We loved this question for the golf aspect. In Python, R(n) can be written:

import math
# Sum the decimal digits of x!
R = lambda x: sum(map(int, str(math.factorial(x))))

To actually solve it (find the smallest n for which R(n) = 8001), almost everyone used a brute-force algorithm, like so:

min([i for i in range(1000) if R(i) == 8001])

We got a total of 1008 distinct answers for this question. 758 were seen fewer than 2 times (some people tried to brute force the value).

The top three answers:

  1. 787 (992)
  2. 0 (384)
  3. 802 (105)

2. Standard Deviation of P tags

This one was a mess. When the problem was first put up, we had a very large, invalid, randomly generated HTML file. If you used the Chrome console, lxml, or Nokogiri, or ran the HTML through Tidy, you got the 'correct' answer. If you used a SAX parser, the answers were much different.

After a few confused tweets, we allowed any answer between 0.5 and 2.0. We then simplified the html greatly. This allowed people to manually count the depths of each p tag or use the white space to determine the depth. This may have defeated the purpose, but ok internet, you win.
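For reference, here is one way the depth calculation could be sketched with Python's standard library. It rests on assumptions that may not match the challenge exactly: well-formed HTML, depth counted as the number of open ancestor elements, and population (not sample) standard deviation. The class and function names are mine.

```python
from html.parser import HTMLParser
from statistics import pstdev

# Elements that never have a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class PDepths(HTMLParser):
    """Record the nesting depth of every <p> tag in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.p_depths = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depths.append(self.depth)
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth -= 1

def p_depth_stdev(html):
    parser = PDepths()
    parser.feed(html)
    return pstdev(parser.p_depths)

if __name__ == "__main__":
    doc = "<html><body><p>a</p><div><p>b</p><div><p>c</p></div></div></body></html>"
    print(p_depth_stdev(doc))
```

Note that html.parser is event-based, much like a SAX parser, so on invalid HTML it would disagree with the tree-building parsers in exactly the way described above.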

We got a total of 242 distinct answers for this question. 117 were seen fewer than 2 times.

The top 3 answers:

  1. 1.4 (335)
  2. 0.767 (164)
  3. 1.253 (101)

3. Zipf's Law.

We simplified Zipf's law to:

Z(x) = [x, x/2, x/3, x/4...]

This described the frequency distribution for words in a random body of text. Given that x = 2520 and a text of 900 unique words, how many words make up half the text?

This one got a little confusing too.

We can get the word count by using:

words = [2520/float(i) for i in range(1, 901)]
word_count = sum(words)

We can then iterate over the words till they are greater than 50% of the total word count.

min(i for i in range(1, 31) if sum(words[:i]) > word_count / 2.0)

It got a bit hairy when it came to rounding. We were in the wrong here by using float instead of integer because it doesn't make sense to have fractional word counts. We should have accepted 21 instead of 22.

The top 3 answers:

  1. 22 (450)
  2. 21 (204)
  3. 20 (120)

Hacking

We intentionally made it easy to hack apply.embed.ly. The url paths were /1 /2 /3 and every time you got a question right, we just added a cookie 'au_embedly_1=true' for the problem you solved. Only Will Pearson used this to his advantage and skipped a problem.

Standing Out.

Some notable examples of different ways people solved this.

  1. A couple people solved it in the Chrome console, no text editor needed.
  2. 6 minutes. The total time it took one college sophomore to solve it.
  3. Ruby one liners for all: https://gist.github.com/1792968/cbb3f5c22ff2e7d174734c780df87e8b9e85153e
  4. All in Mathematica: https://gist.github.com/1797321
  5. You can solve the first problem in J in 27 chars: "(+/ "1 f"0 !i.1000x) i.8001"
  6. A number of people used Excel to solve a majority of the problems. There seem to be a lot of finance nerds lurking on HN.

 

Gists:

If you are interested in seeing the solutions everyone posted, here you go. I embedded a gist of gists because I cannot for the life of me figure out how to get Posterous not to embed them.

 

#New#New Parrotfish - Twitter Plugin Released

We woke up yesterday with smiles on our faces, using our favorite Chrome plugin (Parrotfish). Then we got the news that the #new#new Twitter was released. Jaws dropped, tweets flew in, and profanities flew out. Our users expected results. @hotdogsladies tweeted that lovemaking was just not the same without Embedly. We concur.

For a brief moment we considered retiring Parrotfish. Surely in this latest release Twitter would have implemented embeds the way they should have from the beginning. Lucky for us, it appears to be the same crippled system that caused us to create Parrotfish in the first place.

So, off to the Batcave Sean and Bob went. After all, we do our best work with bats circling. Who doesn't? They (Sean and Bob, not the bats) woke up this morning, ready again to tread through the depths of a Twitter re-design, this time armed with some new toys that we have created over the last few months.

We now present to you the latest and greatest Parrotfish ready to conquer your timeline (the Twitter one, not the Facebook one):

New_new_parrotfish

  • Enabled with SSL support for embeds and images. (Secure)
  • Better favicons and logos.
  • Available in Chrome and Safari. (FF you're next)

Get it right away at Embedly Labs.