Quick Q&A on Extractiv

January 31st, 2010 Posted in Blog

I had so much fun writing up my answers to Mark Johnson’s panel questions that I thought I’d put together another “mock” interview — with myself.

This time, I’m going to be tackling some of the more popular questions we get regarding Extractiv. As a brand-new start-up (only about 8 weeks old), we’re still finding our strengths, but I thought it’d be safe to share a little more about who we are — and what we’re trying to do under the Extractiv name. Want to know more? Write us at support@extractiv.com; we’d be happy to answer any questions you might have (or to show you a demo)!

(As always, the views expressed on this blog are mine, and do not necessarily reflect the views of Language Computer or Extractiv or its subsidiaries or parent companies. Well, until we get the Extractiv Blog put together and start blogging there in earnest, that is.)


Andy Hickl: What is Extractiv?

Extractiv is a new content provisioning service that helps consumers “make sense” of large amounts of unstructured text. We use natural language processing — in conjunction with one of the world’s best distributed computing platforms — in order to turn text into structured data that can be used in a variety of apps, such as sentiment tracking or semantic search.

AH: Why did you build Extractiv? Why now?

We’re building Extractiv because we want to give consumers a better way to access all of the knowledge that’s available on the Web.

AH: Okay, so you’re all about getting knowledge from the Web. Isn’t that what search engines do?

Well, yes and no.

Search engines are great ways to get your hands on lots of relevant content related to a keyword query. Want 10 million pages on Labrador Retrievers? Or all the Tweets talking about the Grammy awards? We’d recommend you use a search engine.

But search engines can only take you so far. Let’s say you want a list of all of the men who have ever won a Grammy award. (That’s a pretty disparate group, mind you: one that includes Bill Clinton as well as George Clinton.) Sorry to say, but search, even semantic search, ain’t going to help you much here. If you speak SPARQL, you can try to pull the knowledge out of a pre-compiled, hand-vetted knowledge repository like NNDB or DBpedia. If you don’t? You’re left hoping that the Grammys compiled a list that you can use.

Most of the time, however, the knowledge you want won’t have been compiled into a single, handy-dandy list. What do you do if you want the list of people who have been killed at U.S. sporting events since 1925? Or the comprehensive list of people who have been killed by Somali pirates? Well, before Extractiv, you had to:

  1. Search the Web.
  2. Download lots and lots of documents.
  3. Start reading.

AH: Okay, that’s not much fun. But how does Extractiv help?

Instead of simply searching the Web for pages that might (or might not) be relevant to your query, Extractiv goes one step further and actually extracts the exact piece of knowledge you’re looking for.

Simply put, we turn a bit of text like this:

An unlikely nominee, Clinton won his second consecutive nod for music’s top awards in the best spoken word album category for the recording of his best-selling autobiography “My Life.” Earlier this year, the former leader of the free world won a golden gramophone statuette for lending his voice to the spoken word recording of Russian folk tale of “Peter and the Wolf.”

into a structured record like this:

GRAMMY WINNER: Bill Clinton, 2004, spoken word, “Peter and the Wolf”

where Bill Clinton refers to the name of the winner, 2004 refers to the year he won, and so on.
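To make that concrete, here is a minimal Python sketch of how a record like this could be held as structured data. The class name and field names are assumptions made up for illustration; they are not Extractiv's actual schema or output format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GrammyWinner:
    """Hypothetical record type; the field names are illustrative assumptions."""
    winner: str    # name of the person who won
    year: int      # year of the win
    category: str  # award category
    work: str      # title of the winning recording


record = GrammyWinner(
    winner="Bill Clinton",
    year=2004,
    category="spoken word",
    work="Peter and the Wolf",
)

# Unlike the raw paragraph, each field can be read, filtered, or exported directly.
print(asdict(record))
```

Once text is in this shape, a question like "who won in the spoken word category?" becomes a simple filter over records instead of a reading assignment.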

But we don’t do that just for one bit of text: we do it for the millions of pages we encounter on a Web crawl. Extractiv’s unique distributed computing platform makes it possible for us to crawl — and extract content from — zillions of pages at the same time. (Our performance is pretty unbeatable, too: we’re currently able to download and extract content from 1 million pages in just under an hour.)
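For a rough sense of scale, the quick calculation below turns that figure into a per-second rate. It assumes an even rate over exactly one hour, so it is a lower bound: finishing in "just under" an hour means the real rate would be a little higher.

```python
# Back-of-the-envelope throughput from the figure quoted above.
PAGES = 1_000_000
SECONDS_PER_HOUR = 60 * 60  # assumes the full hour is used

pages_per_second = PAGES / SECONDS_PER_HOUR
print(f"{pages_per_second:.0f} pages per second")  # prints "278 pages per second"
```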

AH: Whoa. But what kinds of content can I extract? I’m not exactly interested in male Grammy winners, you know.

What, you’re not? That’s okay. We aren’t either.

Extractiv currently offers more content extractors than any other provider, including more than 10,000 different types of named entities, along with hundreds of facts, attributes, relationships, and events.

We also have the ability to create custom extractors for practically any content type imaginable. Want a list of all of the IED bombings in Iraq since 2008? We can do that. Want a list of sex scandals involving U.S. politicians? We can do that, too.

AH: Who’s behind Extractiv?

Extractiv’s a joint venture between two companies: 80Legs and Language Computer. It’s really a great match. 80Legs offers the world’s first truly scalable web crawling platform, while Language Computer provides some of the world’s best — and most scalable — natural language processing tools.

AH: Are you based in the Bay Area?

No, we’re 100% Texan. (And darned proud of it.) Language Computer is based in Dallas. 80Legs is out of Houston.

AH: What products do you offer?

We’re currently in alpha with two products: a content extraction service and a sentiment tracking service. Both are available for demos. Just shoot us an email at support@extractiv.com, and we’ll show you what we can do.

3 Responses to “Quick Q&A on Extractiv”

  1. Stefano Bertolo Says:

    would be interesting to have an idea of the precision/recall scores and how they are distributed across the 10,000 entity types.

    also interesting would be to know what counts as an entity type. for example, would

    “whitening toothpaste”
    “toothpaste containing whitening”

    count as two distinct types?


  2. andy Says:

    @Stefano: We’re in the process of pulling together a lot of metrics that we plan to make available to customers. However, have you seen comparable metrics for OpenCalais?

    Also: right now, most of the “named entities” we capture are proper names; so, type #toothpaste would capture both “Crest whitening toothpaste” and “Aqua Fresh toothpaste with whiteners”.


  3. Tweets that mention AndyHickl.com » Blog Archive » Quick Q&A on Extractiv -- Topsy.com Says:

    [...] This post was mentioned on Twitter by andyhickl, 80legs, Shion Deysarkar, SemanticWeb Eqentia, Twitt3r News Eqentia and others. Twitt3r News Eqentia said: New Blog Post: Quick Q&A on Extractiv http://bit.ly/9CymrR #semanticweb #web3 #nlp: Source: #semanticweb @ Twitter… http://ow.ly/16t9LN [...]

