Overview of TREC 2006
Ellen Voorhees
Sponsored by: NIST, DTO
Text REtrieval Conference (TREC)
TREC 2006 Program Committee
Ellen Voorhees (chair), James Allan, Chris Buckley, Gord Cormack, Sue Dumais, Donna Harman, Bill Hersh, David Lewis, John Prager, Steve Robertson, Mark Sanderson, Ian Soboroff, Karen Sparck Jones, Richard Tong, Ross Wilkinson
TREC 2006 Track Coordinators
• Blog: Ounis, de Rijke, Macdonald, Mishne, Soboroff
• Enterprise: Nick Craswell, Arjen de Vries, Ian Soboroff
• Genomics: William Hersh
• Legal: Jason Baron, David Lewis, Doug Oard
• Question Answering: Hoa Dang, Jimmy Lin, Diane Kelly
• Spam: Gordon Cormack
• Terabyte: Stefan Büttcher, Charles Clarke, Ian Soboroff
TREC 2006 Participants Arizona State U. Australian Nat. U. & CSIRO Beijing U. Posts & Telecom Carnegie Mellon U. Case Western Reserve U. Chinese Acad. Sciences (2) The Chinese U. of Hong Kong City U. of London CL Research Concordia U. (2) Coveo Solutions Inc. CRM114 Dalhousie U. DaLian U. of Technology Dublin City U. Ecole Mines de Saint-Etienne Erasmus, TNO, & U. Twente Fidelis Assis Fudan U. (2) Harbin Inst. of Technology Humboldt U. & Strato AG Hummingbird IBM Research Haifa IBM Watson Research Illinois Inst. of Technology Indiana U.
Inst. for Infocomm Research ITC-irst Jozef Stefan Institute Kyoto U. Language Computer Corp. (2) LexiClone LowLands Team Macquarie U. Massachusetts Inst. Tech. Massey U. Max-Planck Inst. Comp. Sci. The MITRE Corp. National Inst of Informatics National Lib. of Medicine National Security Agency National Taiwan U. National U. of Singapore NEC Labs America, Inc. Northeastern U. The Open U. Oregon Health & Sci. U. Peking U. Polytechnic U. Purdue U. & CMU Queen Mary U. of London
Queensland U. of Technology Ricoh Software Research Ctr RMIT U. Robert Gordon U. Saarland U. Sabir Research, Inc. Shanghai Jiao Tong U. Stan Tomlinson SUNY Buffalo Technion-Israel Inst. Tech. Tokyo Inst. of Technology TrulyIntelligent Technologies Tsinghua U. Tufts U. UCHSC at Fitzsimons U. of Alaska Fairbanks U. of Albany U. of Amsterdam (2) U. of Arkansas Little Rock U. of Calif., Berkeley U. of Calif., Santa Cruz U. Edinburgh U. of Glasgow U. of Guelph U. of Hannover Berlin
U. & Hospitals of Geneva U. of Illinois Chicago (2) U. Illinois Urbana Champaign U. of Iowa U. Karlsruhe & CMU U. of Limerick U. MD Baltimore Cnty & APL U. of Maryland U. of Massachusetts The U. of Melbourne U. degli Studi di Milano U. of Missouri Kansas City U. de Neuchatel U. of Pisa U. of Pittsburgh U. Rome “La Sapienza” U. of Sheffield U. of Strathclyde U. of Tokyo U. Ulster & St. Petersburg U. of Washington U. of Waterloo U. Wisconsin Weill Med. College, Cornell York U.
TREC Goals
• To increase research in information retrieval based on large-scale collections
• To provide an open forum for exchange of research ideas to increase communication among academia, industry, and government
• To facilitate technology transfer between research labs and commercial products
• To improve evaluation methodologies and measures for information retrieval
• To create a series of test collections covering different aspects of information retrieval
The TREC Tracks
[Figure: timeline of TREC tracks, 1992-2006, grouped by theme]
• Static text: Ad Hoc, Robust
• Streamed text: Filtering, Routing
• Human-in-the-loop: Interactive, HARD
• Beyond text / beyond just English: Video, Speech, OCR, X→{X,Y,Z}, Chinese, Spanish
• Web searching, size: Enterprise, Terabyte, Web, VLC
• Retrieval in a domain / answers, not docs: Legal, Genomics, Novelty, Q&A
• Personal documents: Blog, Spam
Common Terminology
• “Document” broadly interpreted, for example:
  – email message in enterprise, spam tracks
  – blog posting plus comments in blog track
• Classical IR tasks:
  – ad hoc search: collection known; new queries
  – routing/filtering: standing queries; streaming document set
  – known-item search: find a partially remembered specific document
  – question answering
TREC 2006 Themes
• Explore broader information contexts
  – different document genres
    • newswire (QA); web (terabyte)
    • blogs; email (enterprise, spam); corporate repositories (legal, enterprise); scientific reports (genomics, legal)
  – different tasks
    • ad hoc (terabyte, enterprise discussion, legal, genomics); known-item (terabyte); classification (spam)
    • specific responses (QA, genomics, enterprise expert); opinion finding (blog)
TREC 2006 Themes
• Construct evaluation methodologies
  – fair comparisons on massive data sets (terabyte, legal)
  – quality of a specific response (QA, genomics)
  – evaluation strategies balancing realism and privacy protection (spam, enterprise)
  – distributed efficiency benchmarking (terabyte)
Terabyte Track
• Motivations
  – investigate evaluation methodology for collections substantially larger than existing TREC collections
  – provide a test collection for exploring system issues related to size
• Tasks
  – traditional ad hoc retrieval task
  – named page finding task
  – efficiency task: systems required to report various timing and resource statistics
Terabyte Collection
• Documents
  – ~25,000,000 web documents (426 GB)
  – spidered in early 2004 from the .gov domain
  – includes text from PDF, Word, etc. files
• Topics
  – 50 new information-seeking topics for ad hoc
  – 181 named page queries contributed by participants
  – 100,000 queries from web logs for efficiency
• Judgments
  – new pooling methodology in ad hoc to address bias
  – duplicate detection using DECO
Ad Hoc Task
• Emphasis on exploring pooling strategies
  – task guidelines strongly encouraged manual runs to help the pools
    • [unspecified] prize offered for most unique rels
  – judgment budget allocated among 3 pools
    • traditional pools to depth 50; used for scoring runs by trec_eval
    • traditional pools starting at rank 400; not used for scoring
    • sample of documents per topic such that the number judged is topic-dependent; used for scoring runs by inferred AP [Yilmaz & Aslam]
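Runs scored from the traditional depth-50 pools use trec_eval's mean average precision; a minimal sketch of the per-topic computation (document ids here are hypothetical):

```python
def average_precision(ranking, relevant):
    """AP for one topic: mean of the precision values at the ranks
    where relevant documents appear, normalized by the total number
    of known relevant documents (unretrieved relevants count as 0)."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical topic: 3 known relevant docs, two retrieved.
ap = average_precision(["d7", "d2", "d9", "d4"], {"d7", "d9", "d5"})
# precisions at hits: 1/1 and 2/3, over 3 relevant -> (1 + 2/3)/3 ≈ 0.556
```

MAP is then the mean of this value over all 50 topics.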
Ad Hoc Task Results
[Figure: bpref, MAP, and infAP scores (0-0.5) for the top automatic, title-only runs on the 50 new topics: uwmtFadTPFB, humT06xle, indri06AlceB, TWTB06AD01, JuruTWE, uogTB06QET2, AMRIMtp20006, CoveoRun1, mg4jAutoV, zetadir]
Named Page Finding Results
[Figure: MRR (0-0.6) and % of topics with no page found (0-50) for the top runs: uwmtFnpsRR1, zetnpft, MU06TBn5, humTN06dp1, THUNPNOSTOP, CoveoNPRun2, uogTB06MP, indri06Nsdp, ictb0603, TWTB06NP02, UAmsT06n3SUM]
Efficiency Task
• Goal: compare efficiency/scalability despite different hardware
• 100,000 queries vetted for hits in the corpus and seeded with queries from the other tasks
• queries divided into 4 streams; within-stream queries required to be processed serially; between streams, systems could use arbitrary interleaving
• participants also report efficiency measures for running an open source retrieval system with known characteristics, for normalization
• P(20) or MRR measured for the seeded queries
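The stream constraint above — serial within a stream, arbitrary interleaving across streams — can be sketched with one thread per stream; the `search` callable here is a stand-in for a real retrieval system, not part of the track software:

```python
import threading

def run_streams(streams, search):
    """Process each query stream serially on its own thread.
    The threads interleave freely, which is exactly the freedom
    the task allows between streams."""
    results = {}
    lock = threading.Lock()

    def worker(stream_id, queries):
        for q in queries:           # serial within the stream
            answer = search(q)
            with lock:              # record result thread-safely
                results[(stream_id, q)] = answer

    threads = [threading.Thread(target=worker, args=(i, s))
               for i, s in enumerate(streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_streams([["a", "b"], ["c"]], str.upper)` runs "a" before "b" but may interleave "c" anywhere.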
Efficiency Task Results
[Figure: fastest and best runs per group, plotting Prec(20) (0-1) against query latency in ms (log scale, roughly 13-4630 ms) for MU, CWI, RMIT, MPI, UWMT, P6, HUM, THU]
Question Answering Track
• Goal: return answers, not document lists
• Tasks:
  – define a target by answering a series of factoid and list questions about that target, plus returning other info not covered by previous questions
  – complex interactive question answering (ciQA)
• AQUAINT document collection is the source of answers for all tasks
  – 3 GB of text; approx. 1,033,000 newswire articles
Question Series
185 Iditarod race
  185.1 FACT  In what city does the Iditarod start?
  185.2 FACT  In what city does the Iditarod end?
  185.3 FACT  In what month is it held?
  185.4 FACT  Who is the founder of the Iditarod?
  185.5 LIST  Name people who have run the Iditarod.
  185.6 FACT  How many miles long is the Iditarod?
  185.7 FACT  What is the record time in which the Iditarod was won?
  185.8 LIST  Which companies have sponsored the Iditarod?
  185.9 OTHER
• 75 series in the test set with 6-9 questions per series
• Target types: 19 People, 19 Organizations, 18 Things, 19 Events
• 403 total factoid questions; 89 total list questions; 75 total “other” questions
Main Task
• Similar to the previous 2 years, but:
  – questions required to be answered within a particular time frame
    • present tense: latest in corpus
    • past tense: implicit default reference to the timeframe of the series
  – factoid questions relatively less important in the final score
  – conscious attempt to make questions somewhat more difficult
    • temporal processing
Series Score
• Score a series using a weighted average of components:
  Score = 1/3·FactoidScore + 1/3·ListScore + 1/3·OtherScore
• Each component score is the mean of the scores for questions of that type in the given series
  – FactoidScore: average accuracy; an individual question has a score of 1 or 0
  – ListScore: average F measure score; recall & precision of the response based on the entire set of known answers
  – OtherScore: F(β=3) score for that series' Other question, calculated using “nugget” evaluation
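A minimal sketch of the series score, assuming the per-question component scores have already been computed; the input values here are hypothetical:

```python
def series_score(factoid_correct, list_f_scores, other_f_score):
    """Equally weighted average of the three component scores.
    factoid_correct: 0/1 outcome per factoid question in the series;
    list_f_scores: F score per list question;
    other_f_score: F(beta=3) nugget score for the Other question."""
    factoid = sum(factoid_correct) / len(factoid_correct)
    list_score = sum(list_f_scores) / len(list_f_scores)
    return (factoid + list_score + other_f_score) / 3

# Hypothetical series: 5 of 7 factoids right, two list questions, one Other.
score = series_score([1, 1, 0, 1, 1, 0, 1], [0.4, 0.6], 0.3)
# (5/7 + 0.5 + 0.3) / 3 ≈ 0.505
```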
Series Task Results
[Figure: average series scores (0-0.5), broken into factoid, list, and other components, for the top runs: LCCFerret, lccPA06, cuhkqaepisto, ed06qar1, ILQUA1, QASCU1, Roma2006run3, InsunQA06, QACTIS06A, NUSCHUAQA3, FDUQAT15A]
Pyramid Nugget Scoring
• Variant of nugget scoring for ‘Other’ questions, suggested by Lin and Demner-Fushman [2006]
• the vital/okay distinction made by a single assessor is a major contributor to the instability of the evaluation; thus have multiple judges “vote” on nuggets, and use the percentage voting for a nugget as the nugget's weight
• each QA assessor was given the list of nuggets created by the primary assessor for a series and asked simply to say vital/okay/NaN
• the pyramid of votes does not represent any single user, but it does increase stability, and average series scores are highly correlated
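A sketch of the pyramid variant: each nugget is weighted by its share of vital votes, and weighted recall plus a length-allowance precision feed the usual F(β=3). Normalizing weights by the maximum vote count and the 100-character allowance per matched nugget are assumptions about details not spelled out on this slide:

```python
def pyramid_f3(vital_votes, matched, response_length, allowance=100):
    """vital_votes: nugget -> number of assessors who called it vital.
    matched: set of nuggets found in the system response.
    response_length: length of the response in characters.
    Assumption: weights are normalized by the max vote count, and
    precision uses the standard 100-char-per-nugget allowance."""
    max_votes = max(vital_votes.values())
    weights = {n: v / max_votes for n, v in vital_votes.items()}
    recall = sum(weights[n] for n in matched) / sum(weights.values())
    allowed = allowance * len(matched)
    if response_length <= allowed:
        precision = 1.0
    else:
        precision = 1 - (response_length - allowed) / response_length
    beta2 = 9  # beta = 3: recall weighted much more heavily
    if precision == 0 and recall == 0:
        return 0.0
    return (beta2 + 1) * precision * recall / (beta2 * precision + recall)
```

With single-assessor scoring every nugget the primary assessor called vital has weight 1 and the rest weight 0; the pyramid smooths that cliff.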
Single vs. Pyramid Nugget Scores
[Figure: scatter plot of pyramid average F(3) score vs. primary assessor average F(3) score, both on a 0-0.3 scale]
Complex Interactive QA
• Goals:
  – investigate richer user contexts within QA
  – have (limited) actual interaction with the user
• Task inspired by the TREC 2005 relationship QA task and the HARD track
  – “essay” questions
  – interaction forms allowed participants to solicit information from the assessor (surrogate user)
Complex Questions
• Questions taken from the relationship type identified in the AQUAINT pilot
  – each question formed from a relationship template
  – also included a narrative giving more details
• Templates:
  – What evidence is there for transport of [goods] from [entity] to [entity]?
  – What [financial relationship] exists between [entity] and [entity]?
  – What [organizational ties] exist between [entity] and [entity]?
  – What [familial ties] exist between [entity] and [entity]?
  – What [common interests] exist between [entity] and [entity]?
  – What influence/effect does [entity] have on/in [entity]?
  – What is the position of [entity] with respect to [issue]?
  – Is there evidence to support the involvement of [entity] in [event/entity]?
ciQA Protocol
• Perform baseline runs
• Receive interaction form responses
  – interaction forms are web forms that ask the assessor for more information
  – content is the participant's choice, but the assessor spends at most 3 minutes per topic per form
• Perform additional (non-baseline) runs exploiting the additional info
Relationship Task Results
[Figure: scatter plot of final pyramid F(3) score vs. initial score (both 0-0.4) for automatic and manual runs]
Genomics Track
• Track motivation: explore information use within a specific domain
  – focus on a person experienced in the domain
• 2006 task
  – single task with elements of both IR and IE
  – an instance-finding, QA task where the unit of retrieval is a (sub)paragraph
Genomics Track Task
• Documents
  – full-text journal articles provided through Highwire Press
  – associated metadata (e.g., MEDLINE record) available
  – 162,259 articles from 49 journals; about 12.3 GB of HTML
• Topics
  – 28 well-formed questions derived from topics used in 2005 (two questions discarded as they had no relevant passages)
  – topics based on 4 generic topic type templates and instantiated from real user requests, e.g.:
    • What is the role of DRD4 in alcoholism?
    • How do HMG and HMGB1 interact in hepatitis?
• System response
  – ranked list of up to 1000 passages (pieces of paragraphs)
  – each passage a contribution to the answer
Task Evaluation
• Relevance judging
  – pools built from standardized paragraphs by mapping each retrieved passage to its unique standard
  – judged by domain experts using 3-way judgments: not/possibly/definitely relevant
  – assessor marked a contiguous span in the paragraph as the answer
  – answers assigned a set of MeSH terms
Task Evaluation
• Scoring
  – document:
    • standard ad hoc retrieval task (MAP)
    • a doc is relevant iff it contains a relevant passage
    • collapse the system ranking so each doc appears just once
  – aspect:
    • use MeSH term sets to define aspects
    • a retrieved passage that overlaps with a marked answer is assigned the aspect(s) of that answer
  – passage:
    • calculate the fraction of answers (in characters) that are contained in retrieved passages
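The passage measure can be sketched as a character-overlap computation; representing answers and passages as character-offset spans, and merging overlapping retrieved spans so no character is counted twice, are illustrative assumptions:

```python
def passage_recall(answers, retrieved):
    """answers, retrieved: lists of (start, end) character offsets
    (end exclusive).  Returns the fraction of gold answer characters
    covered by at least one retrieved passage."""
    covered = 0
    total = sum(end - start for start, end in answers)
    for a_start, a_end in answers:
        # clip retrieved spans to this answer, in offset order
        spans = sorted((max(a_start, s), min(a_end, e))
                       for s, e in retrieved
                       if s < a_end and e > a_start)
        pos = a_start
        for s, e in spans:          # count overlaps only once
            s = max(s, pos)
            if e > s:
                covered += e - s
                pos = e
    return covered / total if total else 0.0

# Hypothetical answer span of 100 chars, half covered by one passage:
frac = passage_recall([(0, 100)], [(50, 150)])  # 0.5
```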
Genomics Track Results
[Figure: document, aspect, and passage scores (0-0.6) for the top runs: UICGenRun1, UICGenRun2, UICGenRun3, NLMInter, THU1, THU2, THU3, iitx1, PCPsgRescore, PCPClean, PCPsgAspect, UIUCinter, UIUCinter2]
Enterprise Track
• Goal: investigate enterprise search, searching the data of an organization to complete some task
  – find-an-expert task
  – ad hoc search for email messages that discuss reasons pro/con for a particular past decision
• Document set
  – crawl of the W3C public web as of June 2004
  – ~300,000 documents divided into five major areas: discussion lists (email), code, web, people, other
Search-for-Experts Task
• Motivation
  – exploit corporate data to determine who the experts are on a given topic
• Task
  – given a topic area, return a ranked list of people; also return supporting documents
  – people are represented as an id from a canonical list of people in the corpus
  – evaluated as a standard ad hoc retrieval task where a person is relevant if he/she is indeed an expert [and expertise is documented]
  – topics and judgments constructed by participants
Expert Search Topic
Title: W3C translation policy
Description: I want to find people in charge of or knowledgeable about W3C translation policies and procedures.
Narrative: I am interested in translating some W3C document into my native language. I want to find experts on W3C translation policies and procedures. Experts are those persons at W3C who are in charge of the translation issues or know about the procedures and/or the legal issues. I do not consider other translators experts.
Search-for-Experts Results
[Figure: R-precision (0-1) for the top runs, scored both with no support and with support required: kmiZhu1, SJTU04, SRCBEX5, PRISEXB, IBM06MA, UMaTDFb, THUPDDSNEMS, ICTCSXRUN01, FDUSO, UvAprofiling]
Discussion Search
• Retrieve (email) documents containing arguments pro/con about a decision
  – discussion list portion of the collection only
• Topics created at NIST based on categories produced from last year's track
• Messages judged not relevant, on-topic, or containing an argument
  – the assessor marked whether the argument was pro, con, or both, but this was not part of the system's task
Discussion Email Search Results
[Figure: precision-recall curves (0-1) for the top runs: THUDSTHDPFSM, srcbds5, DUTDS3, UAmsPOSBase, york06ed03, UMaTDMixThr, IIISRUN, IBM06JAQ]
Blog Track
• New track in 2006
  – explore information access in the blogosphere
• One shared task
  – find opinions about a given target
  – traditional ad hoc retrieval task evaluation, similar to discussion search in the enterprise track
• Document set
  – set of blogs collected December 2005-February 2006 and distributed by the University of Glasgow
  – a document is a permalink: a blog post plus all its comments
  – ~3.2 million permalinks; some other info also collected
Topics & Judgments
• 50 topics
  – created at NIST by backfitting queries from blog search engine logs
  – the title was the submitted query; a NIST assessor created the rest of the topic to fit
• Judgments
  – judgments distinguished posts containing opinions from simply on-topic posts; polarity was also marked
  – systems were not required to give polarity, but did have to distinguish opinions from on-topic posts
  – assessors could skip a document in the pool if the post was likely offensive, but none ever exercised that option
Sample Topics
“March of the Penguins”
Provide opinion of the film documentary “March of the Penguins”. Relevant documents should include opinions concerning the film documentary “March of the Penguins”. Articles or comments about penguins outside the context of this film documentary are not relevant.

larry summers
Find opinions on Harvard President Larry Summers' comments on gender differences in aptitude for mathematics and science. Statements of opinion on Summers' comments are relevant. Quotations of Summers without comment or references to Summers' statements without discussion of their content are not relevant. Opinions on innate gender differences without reference to Summers' statements are not relevant.
Opinion Results
[Figure: precision-recall curves (0-1) for the top automatic, title-only runs: woqln2, ParTiDesDmt2, uicst, UAmsB06All, uscsauto, pisaB1Des, UABas11, UALR06a260r2]
Legal Track
• Goal: contribute to the understanding of ways in which automated methods can be used in the legal discovery of electronic records
• Timely track in that changes to the US Federal Rules of Civil Procedure regarding electronic records take effect Dec. 1
Legal Track Collection
• Document set
  – almost 7 million documents made public through the tobacco Master Settlement Agreement
  – snapshot of the University of California Library's Legacy Tobacco Document Library produced by the IIT CDIP project [IIT CDIP Test Collection, version 1.0]
  – documents are XML records including metadata and a text field consisting of the output of OCR
  – wide variety of document types including scientific reports, memos, email, budgets…
Legal Track Collection
• Topic set
  – modeled after actual legal discovery practice and created by lawyers
  – five (hypothetical) complaints with several requests to produce (topics) per complaint
  – also included a negotiated Boolean query statement that participants could use as they saw fit
  – 46 topics distributed; 39 used in scoring
Legal Track Collection
• Relevance judgments
  – pools built from participants' submitted runs (λ=100 for one run per participant; λ=10 for the remaining runs) …
  – plus the set of documents retrieved by a manual run created by a document set expert who targeted unique relevant docs (approx. 100 docs/topic) …
  – plus a stratified sample of a baseline Boolean run (up to 200 documents/topic)
  – judgments made by law professionals
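The first pooling component — depth-λ pools with λ=100 for one run per participant and λ=10 for the rest — can be sketched as a union of ranked-list prefixes; the run names and data shapes below are hypothetical:

```python
def build_pool(runs, primary_depth=100, secondary_depth=10):
    """runs: mapping run_id -> (is_primary, ranked list of doc ids).
    Take the top-lambda documents from each run (lambda depends on
    whether the run was the participant's designated primary run)
    and union them into the judging pool."""
    pool = set()
    for run_id, (is_primary, ranking) in runs.items():
        depth = primary_depth if is_primary else secondary_depth
        pool.update(ranking[:depth])
    return pool

# Hypothetical: one primary run of 200 docs, one secondary of 50,
# with no overlap -> pool of 100 + 10 = 110 documents to judge.
pool = build_pool({
    "groupA_primary": (True, [f"d{i}" for i in range(200)]),
    "groupA_other": (False, [f"x{i}" for i in range(50)]),
})
```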
Ranked Retrieval Results
[Figure: precision-recall curves (0-1) for the top runs: humL06t, SabLeg06aa1, york06la01, UMKCB2, UmdComb, NUSCHUA2]
R-Precision Results
[Figure: per-topic R-precision (0-0.8) for the negotiated Boolean baseline and the runs sableg06ao1 and HumL06t]
Spam Track
• Motivation:
  – assess quality of an email spam filter in actual usage
  – lay groundwork for other tasks with sensitive data
• How to get an appropriate corpus?
  – true mail streams have privacy issues
  – simulated/cleansed mail streams introduce artifacts that affect filter performance
  – track solution: create a software jig that applies a given filter to a given message stream and evaluates performance based on judgments
  – have participants send filters to the data
Spam Tasks
• 4 email streams [ham; spam; total]
  – public English  [12,910; 24,912; 37,822]
  – public Chinese  [21,766; 42,854; 64,620]
  – MrX2 (private)  [9,039; 40,135; 49,174]
  – SB2 (private)   [9,274; 2,695; 11,969]
• 3 tasks
  – immediate feedback filtering
  – delayed feedback filtering
  – active learning
Evaluation
• Ham misclassification rate (hm%)
• Spam misclassification rate (sm%)
• ROC curve
  – assumes the filter computes a “spamminess” score
  – use the score to compute sm% as a function of hm%
  – the area under the ROC curve is a measure of filter effectiveness
  – use 1−area, expressed as a %, to reflect filter ineffectiveness: (1−ROCA)%
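The area under the ROC curve equals the probability that a randomly chosen spam message receives a higher spamminess score than a randomly chosen ham message. A small sketch of (1−ROCA)% computed that way; the pairwise loop is O(n²), which is fine for illustration but not how the track's evaluation jig computes it:

```python
def one_minus_roca(scores, labels):
    """scores: spamminess scores; labels: True for spam, False for ham.
    AUC via pairwise comparison (ties count half); assumes both
    classes are present.  Returns the complement as a percentage."""
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    wins = sum((s > h) + 0.5 * (s == h) for s in spam for h in ham)
    auc = wins / (len(spam) * len(ham))
    return (1 - auc) * 100

# Hypothetical scores: one spam message scored below one ham message.
ineffectiveness = one_minus_roca([0.9, 0.2, 0.5, 0.1],
                                 [True, True, False, False])  # 25.0
```

A perfect filter ranks every spam above every ham and scores 0; lower is better.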
Immediate vs. Delayed Feedback (SB2 Corpus)
[Figure: ROC curves of % spam misclassification vs. % ham misclassification (logit scales, 0.01-50.00) for immediate feedback (oflS3b2, ijsS1b2, CRMS3b2, tufS4b2, hubS2b2, tamS4b2, hitS1b2, bptS2b2, dalS1b2) and delayed feedback (oflS4b2d, ijsS1b2d, CRMS3b2d, hubS2b2d, tufS4b2d, tamS4b2d, hitS1b2d, bptS2b2d, dalS1b2d)]
Active Learning (SB2 Corpus)
[Figure: 1−ROCA (%) on a log scale (0.01-100) as a function of the number of training messages (100-10,000) for runs oflA1b2, hubA1b2, ijsA1b2, bptA2b2, hitA1b2]
Future
• TREC expected to continue into 2007
• TREC 2007 tracks:
  – all but terabyte continuing
  – adding a “million query” track
    • goal is to test the hypothesis that a test collection built from very many very incompletely judged topics is a better tool than a traditional TREC collection
  – start planning for an eventual desktop search track
    • goal is to test the efficacy of search algorithms for the desktop
    • privacy/realism considerations are severe barriers to traditional methodology
Track Planning Workshops
• Thursday, 4:00-5:30pm
  – Blog, LR B
  – Genomics, Green
  – Terabyte, LR D
  – Desktop, LR A
  – Legal, LR E
• Friday, 10:50am-12:20pm
  – Enterprise, LR B
  – QA, Green
  – Million Query, LR A
  – Spam, LR D