
Avoiding Search Overload

August 5th, 2009 Posted in Blog


Like you, we’ve heard a lot this summer about the challenges facing America:  the financial crisis, healthcare reform, and worst of all:  search overload.

Well, here at Swingly HQ, we’ve been doing our part.  We’ve been trying to find new ways to figure out what kinds of information are most relevant to a particular search topic.

While relevance modeling isn’t exactly new, it’s becoming an increasingly important problem for semantic search applications.   Information Extraction apps are rapidly increasing the amount of factual information that’s available from the Internet.  That’s good.  Unfortunately, instead of being buried under mountains of irrelevant information, we’re now being overwhelmed with gigabytes of factual information which may (or may not) be exactly what we’re looking for.  That’s bad.

So, what’s a new semantic search app to do?  Full details after the jump.

Let’s imagine you’re interested in learning more about Jesse “The Body” Ventura.  Well, if you’ve got access to a named entity recognizer (like the one we use with Swingly), you might be able to infer that he’s a:

  • person
  • Navy SEAL
  • professional wrestler
  • actor
  • politician
  • mayor
  • governor
  • talk show host

While these classes might not tell you anything you don’t already know, they (in theory) can provide semantic search apps with some of the intelligence needed to provide better, more informative results.

Since we know he’s a talk show host (among other things), we could have our app focus on finding information that’s relevant to any talk show host, such as:

  • the station he’s on
  • the format / length / medium of his show
  • when his show started / ended
  • how big his audience is

But since he’s a former politician-turned-talk show host, we might want to go further and find other talk show host-related facts that may only be relevant for talk show hosts with this kind of background, such as:

  • his political affiliation
  • his endorsements
  • the blogs that cover him

These kinds of facts are most likely not relevant for other kinds of talk show hosts, e.g. your local sports-talk radio jock or the CarTalk guys.
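One way to picture this is as a lookup from entity classes to fact types, where some facts only unlock when an entity carries a particular combination of classes. Here's a minimal Python sketch of that idea. The class names, fact labels, and the `relevant_facts` helper are all made up for illustration; none of this is Swingly's actual code.

```python
# Hypothetical tables: fact types relevant to a single class, and fact types
# that only become relevant when an entity belongs to several classes at once.
CLASS_FACTS = {
    "talk show host": ["station", "show format", "show start/end", "audience size"],
}
COMBINED_FACTS = {
    frozenset({"politician", "talk show host"}): [
        "political affiliation",
        "endorsements",
        "blogs that cover him",
    ],
}


def relevant_facts(entity_classes):
    """Return the fact types relevant to an entity, given its set of classes."""
    classes = set(entity_classes)
    facts = []
    for cls, cls_facts in CLASS_FACTS.items():
        if cls in classes:
            facts.extend(cls_facts)
    for combo, combo_facts in COMBINED_FACTS.items():
        if combo <= classes:  # every class in the combination is present
            facts.extend(combo_facts)
    return facts


ventura = {"person", "politician", "governor", "talk show host"}
sports_jock = {"person", "talk show host"}
print(relevant_facts(ventura))
print(relevant_facts(sports_jock))
```

The catch, of course, is that somebody still has to fill in those tables by hand.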

But we’re not out of the woods just yet.

While information extraction apps are now able to capture lots and lots of different types of facts (including some of the ones I’ve listed above), they still require a human to tell them which classes of facts are relevant (e.g. political affiliation for a former politician-turned-talk show host) and which ones aren’t (like home runs hit for a hockey player).

What’s worse?  Even though extractor ontologies are growing, most of the facts that we’ll need coverage for won’t be covered by an extractor.  Despite lots of efforts to reduce the cost of creating (and maintaining) extractors, ensuring adequate coverage across multiple domains requires serious investment in time and money.

So, what’s a semantic search app to do?  Start small — and work your way up.

At LCC, John Lehmann (and his team) recently developed a new algorithm which determines the most relevant predicates for individual names, or classes of names, mentioned in Swingly’s index.

For example, if you’re interested in information on NASCAR great Richard Petty, we think you might be most interested in sentences which contain any of the following predicates.  (I’ve bolded some of the ones that I think are particularly interesting.)

{ win=46, appear=13, take=11, lead=10, drive=6, hold=6, race=6, finish=6, make=6, step=5, qualify=4, begin=4, compete=3, announce=3, mark=3, managed to qualify=3, fill=3, tie=2, visit=2, own=2, run=2, leave=2, return=2, come=2, suffer=2, remain=2, be part of=2, provide=2, drop=2, crash=2, follow=2, participate=2, spend=2, wear=2, miss=2, is currently=1, was back=1, serve=1, host=1, duplicate=1, form=1, log=1, tangle=1, record=1, start=1, discuss=1, running eleventh=1, put=1, give=1, use=1, bump=1, express=1, recognize=1, donate=1, unveil=1, stay=1, edge=1, slam=1, produce=1, send=1, remark=1, pull=1, become=1, snap=1, overcome=1, rebound=1, rivaled only=1, get=1, enter=1, raced alongside=1, collaborate=1, sign=1, failed to qualify=1, wave=1, match=1, release=1, was formerly=1, collect=1, trying to pass=1, allow=1, voice=1, set=1, retire=1, achieve=1, circle=1, established to honor=1, supply=1, develop=1, dominate=1, chose to run=1, begrudge=1, feel=1, tried to pass=1, outlast=1, sliding sideways=1, pit=1, claim=1, …}

Or, if you’re interested in wizards (yes, LCC provides access to a named entity type #wizard), you might be preferentially interested in answers which talk about:

{ appear=29, use=24, tell=22, take=21, have=18, find=17, give=15, make=14, become=13, ask=11, create=10, leave=8, visit=7, turn=7, reveal=7, send=7, place=6, choose=6, discover=6, hide=5, cast=5, put=5, hear=5, arrive=5, summon=5, defeat=5, recruit=4, meet=4, possess=4, provide=4, narrate=4, was once=4, destroy=4, confront=4, see=4, help=4, allow=4, order=4, offer=4, begin=4, air=3, go=3, agree=3, entrust=3, learn=3, lives backwards=3, inform=3, argue=3, hold=3, imprison=3, return=3, fall=3, kill=3, instruct=3, live=3, warn=3, appoint=3, die=3, conjure=3, begging to die=2, associate=2, search=2, drink=2, serve=2, convince=2, face=2, subdue=2, witness=2, transform=2, look=2, seek=2, prophesy=2, change=2, lead=2, watch=2, battle=2, advertise=2, catch=2, interpret=2, revive=2, throw=2, returned to punish=2, ally=2, locate=2, rush=2, mature=2, approach=2, involve=2, avoid=2, sacrifice=2, flee=2, announce=2, acquire=2, remind=2, save=2, travel=2, request=2, hire=2, …}
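To give a rough idea of what lists like these represent, here's a minimal Python sketch that tallies predicates per entity and prints them in the same `pred=n` style. It assumes the (entity, predicate) pairs have already been pulled out of the text. The sample pairs and both helper functions are invented for illustration, and plain counting is only a stand-in here, not the actual LCC algorithm.

```python
from collections import Counter, defaultdict

# Made-up (entity, predicate) pairs standing in for parser output.
pairs = [
    ("Richard Petty", "win"),
    ("Richard Petty", "win"),
    ("Richard Petty", "lead"),
    ("Richard Petty", "win"),
    ("Merlin", "cast"),
    ("Merlin", "summon"),
    ("Merlin", "cast"),
]


def predicate_counts(pairs):
    """Tally how often each predicate occurs with each entity."""
    counts = defaultdict(Counter)
    for entity, predicate in pairs:
        counts[entity][predicate] += 1
    return counts


def format_counts(counter):
    """Render a Counter as { pred=n, ... }, most frequent predicate first."""
    body = ", ".join(f"{pred}={n}" for pred, n in counter.most_common())
    return "{ " + body + " }"


counts = predicate_counts(pairs)
print(format_counts(counts["Richard Petty"]))  # { win=3, lead=1 }
```

The same tally works for a class of names: key the counts on the entity type (say, #wizard) instead of the individual name.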

We then use LCC’s dependency parsers to expand each of these predicates into a set of semantically-typed triplets.  We then submit these triplets to a modeling framework to learn which triplets are most relevant for the individual entity, the entity type, or the user’s query:

Richard Petty:

  • #driver – win – #race, #person – win – #raceEvent
  • #driver – take – #ordinalNumber
  • #driver – passed – #driver

#Wizards:

  • #wizard – cast – #spell
  • #wizard – killed – {#person, #monster}
  • #wizard – suffered – #quantity
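If the typed triplets above look abstract, here's a small Python sketch of the general shape of the idea: take (subject, predicate, object) triples, swap each argument for its semantic type, and count what comes out. The type lookup, the sample triples, and the `typed_triplets` helper are hypothetical. A real pipeline would get the arguments from a dependency parser and the types from a named entity recognizer, and the modeling step is not shown.

```python
from collections import Counter

# Hypothetical entity-type lookup; a real system would get these types
# from a named entity recognizer.
ENTITY_TYPES = {
    "Richard Petty": "#driver",
    "Daytona 500": "#race",
    "Merlin": "#wizard",
    "Fireball": "#spell",
}

# Made-up (subject, predicate, object) triples standing in for parser output.
raw_triples = [
    ("Richard Petty", "win", "Daytona 500"),
    ("Richard Petty", "win", "Daytona 500"),
    ("Merlin", "cast", "Fireball"),
]


def typed_triplets(triples, types):
    """Replace each argument with its semantic type and count the results."""
    counts = Counter()
    for subj, pred, obj in triples:
        # Arguments with no known type fall back to a generic placeholder.
        typed = (types.get(subj, "#unknown"), pred, types.get(obj, "#unknown"))
        counts[typed] += 1
    return counts


triplet_counts = typed_triplets(raw_triples, ENTITY_TYPES)
for (subj, pred, obj), n in triplet_counts.most_common():
    print(f"{subj} – {pred} – {obj}: {n}")
```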

We then use the output of this modeling to pick out Q&As from Swingly’s index which are expected to be most relevant with respect to the query.  This means that for a generic query like Richard Petty, the top Q&As that Swingly returns go from:

  • Where can I buy Richard Petty die-cast replica cars?
  • Which number did NASCAR retire in honor of Richard Petty?
  • What type of car did Richard Petty race in 1964?

to

  • What racing championship did Richard Petty win 7 times?
  • What races did Richard Petty win more than once?
  • How many races did Richard Petty win over his career?
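As a toy illustration of this kind of re-ranking, the Python sketch below scores each question by the heaviest predicate it contains and sorts on that score. Treating the raw counts from the Richard Petty list as relevance weights is an assumption made for this example, and the `score` helper is hypothetical. Swingly's real ranking works from the learned triplets, not from bare word matching.

```python
import re

# Predicate weights copied from the Richard Petty list above; using raw
# counts as relevance scores is an assumption made for this sketch.
PREDICATE_WEIGHTS = {"win": 46, "race": 6, "retire": 1}

questions = [
    "Where can I buy Richard Petty die-cast replica cars?",
    "Which number did NASCAR retire in honor of Richard Petty?",
    "What type of car did Richard Petty race in 1964?",
    "What racing championship did Richard Petty win 7 times?",
    "How many races did Richard Petty win over his career?",
]


def score(question, weights):
    """Score a question by the heaviest predicate that appears in it as a word."""
    words = re.findall(r"[a-z]+", question.lower())
    return max((weights.get(word, 0) for word in words), default=0)


# sorted() is stable, so questions with equal scores keep their original order.
ranked = sorted(questions, key=lambda q: score(q, PREDICATE_WEIGHTS), reverse=True)
for q in ranked:
    print(score(q, PREDICATE_WEIGHTS), q)
```

Note that this matches exact words only, so "races" and "racing" don't count as "race". A real system would work from parsed predicates rather than surface strings.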

Have we eliminated the pernicious problem of search overload?  Nah. (Well, not yet.)  But we expect techniques like these will be able to push high-quality, relevant content to the top of search results — even when users don’t give us enough information to figure out what they’re really looking for.

Author’s Note:  “Search Overload” is one of the taglines used in the Bing marketing campaign. I like Bing.  And anything that’s going to reduce the amount of crap I have to deal with from my search engine.
