Characterising the Emergent Semantics in Twitter Lists - VideoLectures

Andrés García-Silva †, Jeon-Hyung Kang *, Kristina Lerman *, Oscar Corcho †
† {hgarcia, ocorcho}@fi.upm.es, Facultad de Informática, Universidad Politécnica de Madrid, Spain
* {jeonhyuk, lerman}@isi.edu, Information Sciences Institute, University of Southern California, USA

Introduction: Twitter Lists


Introduction: Curators and List Names


Introduction: Members and List Names


Introduction: Subscribers and List Names


Introduction

• Previous examples showed individual uses of lists
• Some list names were related to each other
• What if we group the lists?


Introduction

Lists where the Yahoo!Finance user is a member grouped by frequency of membership

Lists where the NASDAQ user is a member grouped by number of subscriptions


Introduction: Research questions

• Is it possible to identify related keywords from list names according to the use given by the different user roles?
  • Are two list names related if they have been used by a similar set of curators?
  • Are two list names related if a similar set of users have subscribed to the corresponding lists?
  • Are two list names related if their corresponding lists have a similar set of members?
• Which user roles generate more related keywords?
• What types of relations between keywords can we obtain?
  • Synonyms, is-a, siblings...?

[Figure: list names (Investment, Stocks, Banks, PersonalBanking) shown together with the user roles Curator 1, Curator 2, list members, and Subscriber 1]


Approach

• Elicit related keywords from Twitter lists
  • Twitter Lists
  • Schema representation of keywords: based on curators, based on subscribers, based on members
  • Model to identify similar keywords: Vector Space Model, Latent Dirichlet Allocation
  • Pairs of related keywords per schema representation and model
• Characterise the semantics of the relations
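As a rough illustration of the vector space model step, a keyword can be represented as a vector over the users in one role (here curators) and compared with cosine similarity. This is a minimal sketch, not the authors' implementation: the toy data, the raw-count weighting, and the names `cosine` and `keyword_by_curator` are all assumptions made for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = sqrt(sum(weight * weight for weight in u.values()))
    norm_v = sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy data: keyword -> {curator id: number of lists whose
# name contains the keyword}. Real weights could differ (e.g. TF-IDF).
keyword_by_curator = {
    "investment": {"curator1": 2, "curator2": 1},
    "stocks": {"curator1": 1, "curator2": 1},
    "comedy": {"curator3": 4},
}

# Keywords used by the same curators score high (about 0.95 here).
sim_related = cosine(keyword_by_curator["investment"], keyword_by_curator["stocks"])
# Keywords with no curator in common score 0.
sim_unrelated = cosine(keyword_by_curator["investment"], keyword_by_curator["comedy"])
```

The same function applies unchanged to the member-based and subscriber-based representations; only the dictionary keys change from curator ids to member or subscriber ids.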

Approach

• Elicit related keywords from Twitter lists
  • Pairs of related keywords per schema representation and model
• Characterise the semantics of the relations
  • Similarity based on WordNet
    • Path length: synonyms, is-a, siblings, indirect is-a
    • Wu & Palmer (hierarchical information)
    • Jiang & Conrath (distributional information)
    • Specificity of relations
  • SPARQL queries over general KBs published as Linked Data (DBpedia, OpenCyc, and UMBEL)
    • Synonyms (sameAs)
    • Binary relations (TypeOf, BT)
    • Object properties (Occupation)
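The two hierarchy-based measures can be sketched on a small is-a tree. The taxonomy below is a hypothetical toy, not WordNet data, and the convention of counting nodes (so that a concept compared with itself has path length 1, a direct is-a pair 2, and siblings 3) is an assumption chosen for the example. Wu & Palmer similarity is computed as 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the least common subsumer. Jiang & Conrath is left out because it also needs corpus-based information content.

```python
# Hypothetical toy is-a hierarchy (child -> parent); not actual WordNet data.
PARENT = {
    "entity": None,
    "person": "entity",
    "writer": "person",
    "novelist": "writer",
    "poet": "writer",
    "scientist": "person",
}


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def lcs(a, b):
    """Least common subsumer: the deepest node that subsumes both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward from b, so the first hit is deepest
        if node in seen:
            return node
    return None


def path_length(a, b):
    """Number of nodes on the shortest is-a path between a and b."""
    common = depth(lcs(a, b))
    return (depth(a) - common) + (depth(b) - common) + 1


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))
```

In this toy tree, `path_length("novelist", "writer")` is 2 (direct is-a), `path_length("novelist", "poet")` is 3 (siblings), and `wu_palmer("novelist", "poet")` is 0.75 because the shared parent `writer` sits at depth 3.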


Experiment: Setup

• Data set
  • Total: 297,521 lists, 2,171,140 members, 215,599 curators, and 616,662 subscribers
  • We extracted 5,932 unique keywords from list names; 55% of them were found in WordNet.
  • We use approximate matching of the list names with dictionary entries
  • The dictionary was created from Wikipedia article titles
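The slide does not say which approximate-matching technique was used. As one possible sketch, Python's standard `difflib.get_close_matches` can map a normalised list name to its closest dictionary entry. The four-entry `DICTIONARY`, the `normalise` rules, and the 0.8 cutoff are assumptions for illustration only.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a dictionary built from Wikipedia article titles.
DICTIONARY = ["life science", "biotechnology", "stock market", "personal banking"]


def normalise(list_name):
    """Lower-case a list name and turn common separators into spaces."""
    return list_name.lower().replace("-", " ").replace("_", " ").strip()


def match_keyword(list_name, cutoff=0.8):
    """Return the closest dictionary entry, or None if nothing is close enough."""
    matches = get_close_matches(normalise(list_name), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With these assumptions, `match_keyword("Life-Sciences")` resolves to `"life science"`, while a name with no near neighbour in the dictionary returns `None`.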


Experiment: Execution

• Elicit related keywords from Twitter lists
  • Data set
  • Schema representation of keywords: based on curators, based on subscribers, based on members
  • Model to identify similar keywords: Vector Space Model, Latent Dirichlet Allocation
  • Pairs of related keywords per schema representation and model: each keyword with the 5 most related
• Characterise the semantics of the relations
  • Similarity based on WordNet (WordNet Similarity)
    • Path length
    • Wu & Palmer (hierarchical information)
    • Jiang & Conrath (distributional information)
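The step that pairs each keyword with its 5 most related keywords could look roughly like the following. This is a sketch under assumptions: cosine similarity stands in for whichever model score is used, pairs with zero similarity are dropped, ties are broken alphabetically, and the member vectors are invented toy data.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = sqrt(sum(weight * weight for weight in u.values()))
    norm_v = sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_related(vectors, k=5):
    """Map each keyword to its k most similar other keywords."""
    related = {}
    for keyword, vector in vectors.items():
        scored = [
            (cosine(vector, other), name)
            for name, other in vectors.items()
            if name != keyword
        ]
        # Keep only keywords that share at least one user (assumption).
        scored = [(score, name) for score, name in scored if score > 0]
        # Highest score first; alphabetical order breaks ties.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        related[keyword] = [name for _, name in scored[:k]]
    return related


# Hypothetical toy data: keyword -> {member id: weight}.
keyword_by_member = {
    "stocks": {"m1": 1, "m2": 1},
    "investment": {"m1": 1, "m2": 1, "m3": 1},
    "banks": {"m3": 1},
    "comedy": {"m9": 1},
}
pairs = top_related(keyword_by_member, k=5)
```

Here `pairs["investment"]` lists `stocks` before `banks`, and `pairs["comedy"]` is empty because no other keyword shares a member with it.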


Experiment: Data Analysis

• Pearson's correlation coefficient (values from -1 to 1)
• Average J&C distance and W&P similarity
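For reference, Pearson's coefficient can be computed directly from its definition. The sketch below is generic; which quantities were paired in the experiment (for example, a model's similarity scores against W&P similarity or J&C distance for the same keyword pairs) is an assumption, and the function name is illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)
```

Because J&C is a distance and W&P a similarity, agreement with a similarity model would show up as a negative coefficient against J&C and a positive one against W&P.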


Experiment: Data Analysis

Path length in WordNet: % of relations found by each schema representation and model

Path length                    Members           Subscribers       Curators
                               VSM      LDA      VSM      LDA      VSM      LDA
1 (synonyms)                   8.58%    10.87%   3.97%    3.24%    1.24%    0.50%
2 (is-a)                       3.42%    3.08%    1.93%    0.47%    0.70%    0.00%
3 (siblings, indirect is-a)    2.37%    3.77%    2.96%    2.06%    2.38%    4.03%
>3                             67.61%   65.5%    67.2%    67.5%    77.8%    75.8%

On average, 97.65% of the relations with a path length greater than 3 involve a common subsumer.
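The path-length categories used on this slide can be tallied with a few lines of code. This is a sketch only: the input is a hypothetical list of path lengths, one per keyword pair, and the separate "no path" bucket for pairs without a WordNet path is an assumption, not something the slide reports.

```python
from collections import Counter


def relation_type(path_length):
    """Map a WordNet path length to the category labels used on the slide."""
    if path_length is None:
        return "no path"  # assumption: pair not connected or not in WordNet
    if path_length == 1:
        return "synonyms"
    if path_length == 2:
        return "is-a"
    if path_length == 3:
        return "siblings, indirect is-a"
    return ">3"


def percentages(path_lengths):
    """Percentage of keyword pairs falling into each category."""
    total = len(path_lengths)
    if total == 0:
        return {}
    counts = Counter(relation_type(length) for length in path_lengths)
    return {label: 100 * count / total for label, count in counts.items()}


# Hypothetical path lengths for eight keyword pairs.
shares = percentages([1, 2, 2, 3, 5, None, 7, 4])
```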


Experiment: Data Analysis

Depth of the least common subsumer (LCS) and path length as indicators of specificity

[Figures: relations in WordNet by depth of the least common subsumer; relations with depth(LCS) >= 5 by length of the path setting up the relation]


Experiment: Findings Summary

• Similarity models based on members
  • produce the results that are most correlated with the results of similarity measures based on WordNet
  • find more synonyms and direct is-a relations than the other models (path length)
• The majority of relations found by any model have a path length >= 3 and involve a common subsumer.
• Depth of LCS
  • VSM based on subscribers produces the highest number of specific relations (depth of LCS >= 5 or 6).
• Similarity models based on curators produce a lower number of relations.


Experiment: Execution

• Elicit related keywords from Twitter lists
  • Data set
  • Schema representation of keywords: based on curators, based on subscribers, based on members
  • Model to identify similar keywords: Vector Space Model, Latent Dirichlet Allocation
  • Pairs of related keywords per schema representation and model: each keyword with the 5 most related
• Characterise the semantics of the relations
  • Ontological relations between keywords
  • SPARQL queries over general KBs published as Linked Data (DBpedia, OpenCyc, and UMBEL)

Experiment

• We anchor 63.77% of the keywords extracted from Twitter Lists to DBpedia resources
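Once two keywords are anchored to resources, a direct relation between them can be looked up with a SPARQL query. The sketch below only builds the query string; it does not contact any endpoint. The function name, the choice to check both directions with UNION, and the use of the Houston and Texas DBpedia resources as an example are assumptions for illustration, not the queries used in the experiment.

```python
def direct_relation_query(uri_a, uri_b):
    """Build a SPARQL query for properties linking two resources directly,
    in either direction."""
    for uri in (uri_a, uri_b):
        # Reject characters that would break out of the <...> IRI syntax.
        if any(ch in uri for ch in '<>" \n'):
            raise ValueError(f"not a plain IRI: {uri!r}")
    return (
        "SELECT DISTINCT ?relation WHERE {\n"
        f"  {{ <{uri_a}> ?relation <{uri_b}> . }}\n"
        "  UNION\n"
        f"  {{ <{uri_b}> ?relation <{uri_a}> . }}\n"
        "}"
    )


# Illustrative pair of resources.
query = direct_relation_query(
    "http://dbpedia.org/resource/Houston",
    "http://dbpedia.org/resource/Texas",
)
```

The resulting string can be sent to any SPARQL endpoint that serves the knowledge base; longer relation paths would need additional intermediate variables in the pattern.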


Experiment

Vector-space model based on members (direct relations)

Relation type    %      Example of keywords
Broader Term     26%    life-science, biotech
subClassOf       26%    writers, authors
developer        11%    google, google_apps
genre            11%    funland, comedy
largest city     6%     houston, texas
Others           20%    -

Vector-space model based on subscribers (relations of length 3)

Linked data pattern (54.73%): x -> object
