Characterising the Emergent Semantics in Twitter Lists
Andrés García-Silva†, Jeon-Hyung Kang*, Kristina Lerman*, Oscar Corcho†
† {hgarcia, ocorcho}@fi.upm.es, Facultad de Informática, Universidad Politécnica de Madrid, Spain
* {jeonhyuk, lerman}@isi.edu, Information Sciences Institute, University of Southern California, USA
Introduction: Twitter Lists
Introduction: Curators and List Names
Introduction: Members and List Names
Introduction: Subscribers and List Names
Introduction
• The previous examples showed individual uses of lists
• Some list names were related to each other
• What if we group the lists?
Introduction
Lists where the Yahoo!Finance user is a member, grouped by frequency of membership
Lists where the NASDAQ user is a member, grouped by number of subscriptions
Introduction: Research questions
• Is it possible to identify related keywords from list names according to how the different user roles use them?
  • Are two list names related if they have been used by a similar set of curators?
  • Are two list names related if a similar set of users have subscribed to the corresponding lists?
  • Are two list names related if their corresponding lists have a similar set of members?
• Which user roles generate more related keywords?
• What types of relations between keywords can we obtain?
  • Synonyms, is-a, siblings...?
[Figure: lists named Investment, Stocks, Banks, and PersonalBanking, connected to Curator 1, Curator 2, list members, and Subscriber 1]
Approach
Elicit related keywords from Twitter lists
• Twitter Lists
• Schema representation of keywords
  • Based on curators
  • Based on subscribers
  • Based on members
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation
• Pairs of related keywords per schema representation and model
Characterise the semantics of the relations
Approach
Elicit related keywords from Twitter lists
Characterise the semantics of the relations
• Pairs of related keywords per schema representation and model
• Similarity based on WordNet
  • Path length: synonyms, is-a, siblings, indirect is-a; specificity of relations
  • Wu & Palmer (hierarchical information)
  • Jiang & Conrath (distributional information)
• SPARQL queries over general KBs published as Linked Data (DBpedia, OpenCyc, and UMBEL)
  • Synonyms (sameAs)
  • Binary relations (TypeOf, BT)
  • Object properties (Occupation)
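To make the WordNet-based measures concrete, here is a minimal sketch of path length, least common subsumer (LCS), and Wu & Palmer similarity over a tiny hand-made is-a hierarchy. The taxonomy and all function names are hypothetical and stand in for WordNet. Path length is counted in nodes, an assumption chosen so that 1 corresponds to synonyms, 2 to is-a, and 3 to siblings or indirect is-a, as in the slides.

```python
# Toy is-a taxonomy (hypothetical, NOT WordNet): child -> parent.
PARENT = {
    "entity": None,
    "organization": "entity",
    "institution": "organization",
    "bank": "institution",
    "university": "institution",
    "company": "organization",
}


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))


def least_common_subsumer(a, b):
    """Deepest node that subsumes (or equals) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None


def path_length(a, b):
    """Number of nodes on the path from a to b through their LCS."""
    lcs = least_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs)) + 1


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    for pair in [("bank", "institution"), ("bank", "university"), ("bank", "company")]:
        print(pair, path_length(*pair), least_common_subsumer(*pair), round(wu_palmer(*pair), 3))
```

In this toy hierarchy, "bank" and "university" are siblings (path length 3), while "bank" and "company" only share the shallower subsumer "organization" (path length 4). A deeper LCS indicates a more specific relation.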
Experiment: Setup
• Data set
  • Total: 297,521 lists, 2,171,140 members, 215,599 curators, and 616,662 subscribers
  • We extracted 5,932 unique keywords from list names; 55% of them were found in WordNet.
  • We used approximate matching of the list names with dictionary entries.
  • The dictionary was created from Wikipedia article titles.
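The slides do not say which approximate matching technique was used. As one possible illustration, the sketch below matches a normalised list name against a small dictionary using the similarity ratio from Python's standard `difflib` module. The dictionary entries, the `normalise` and `match_keyword` helpers, and the 0.8 cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical stand-in for a dictionary built from Wikipedia article titles.
DICTIONARY = ["investment", "stocks", "banks", "personal banking", "biotech"]


def normalise(list_name):
    """Lower-case a list name and turn common separators into spaces."""
    return list_name.lower().replace("-", " ").replace("_", " ").strip()


def match_keyword(list_name, cutoff=0.8):
    """Return the closest dictionary entry, or None if nothing is similar enough."""
    candidates = difflib.get_close_matches(
        normalise(list_name), DICTIONARY, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    for name in ["Investments", "personal_banking", "xyzzy"]:
        print(name, "->", match_keyword(name))
```

A higher cutoff trades recall for precision: fewer list names are matched, but the matches are closer to the dictionary entry.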
Experiment: Execution
Elicit related keywords from Twitter lists
• Data set
• Schema representation of keywords
  • Based on curators
  • Based on subscribers
  • Based on members
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation
• Pairs of related keywords per schema representation and model
  • Each keyword with its 5 most related keywords
Characterise the semantics of the relations
• Similarity based on WordNet
  • Path length
  • Wu & Palmer (hierarchical information)
  • Jiang & Conrath (distributional information)
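The vector space model step can be sketched as follows: each keyword is represented by a vector over the users who play a given role (here, members), and the most related keywords are those with the highest cosine similarity. The toy data, the function names, and the use of raw counts as weights are assumptions; the slides do not specify the weighting scheme.

```python
import math
from collections import Counter

# Hypothetical toy data: (list-name keyword, user id) pairs, where the user
# is a member of a list carrying that keyword.
MEMBERSHIPS = [
    ("investment", "u1"), ("investment", "u2"), ("investment", "u3"),
    ("stocks", "u1"), ("stocks", "u2"),
    ("banks", "u3"), ("banks", "u4"),
    ("comedy", "u5"),
]


def build_vectors(pairs):
    """Map each keyword to a sparse vector of per-user counts."""
    vectors = {}
    for keyword, user in pairs:
        vectors.setdefault(keyword, Counter())[user] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def most_related(keyword, vectors, k=5):
    """Return up to k (keyword, similarity) pairs, most similar first."""
    scores = [
        (other, cosine(vectors[keyword], vec))
        for other, vec in vectors.items()
        if other != keyword
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


if __name__ == "__main__":
    vectors = build_vectors(MEMBERSHIPS)
    print(most_related("investment", vectors))
```

Swapping the member ids for curator or subscriber ids gives the other two schema representations without changing the similarity code.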
Experiment: Data Analysis
Pearson's correlation coefficient (values range from -1 to 1)
Average J&C distance and W&P similarity
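For reference, Pearson's correlation coefficient between two lists of scores can be computed as in the sketch below. The example values are hypothetical and only illustrate comparing a model's similarity scores with a WordNet-based measure over the same keyword pairs; they are not results from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for four keyword pairs.
    model_similarity = [0.82, 0.41, 0.10, 0.65]
    wordnet_similarity = [0.75, 0.50, 0.20, 0.60]
    print(pearson(model_similarity, wordnet_similarity))
```

A value near 1 means the model ranks keyword pairs much like the WordNet-based measure does; for a distance such as J&C, a strongly negative value would indicate agreement instead.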
Experiment: Data Analysis
Path length in WordNet: % of relations found by each schema representation and model

Path length              Members         Subscribers     Curators
                         VSM     LDA     VSM     LDA     VSM     LDA
1 (synonyms)             8.58%   10.87%  3.97%   3.24%   1.24%   0.50%
2 (is-a)                 3.42%   3.08%   1.93%   0.47%   0.70%   0.00%
3 (siblings, ind. is-a)  2.37%   3.77%   2.96%   2.06%   2.38%   4.03%
>3                       67.61%  65.5%   67.2%   67.5%   77.8%   75.8%
On average, 97.65% of the relations with a path length greater than 3 involve a common subsumer.
Experiment: Data Analysis
Depth of the least common subsumer (LCS) and path length as indicators of specificity
[Figures: relations in WordNet by depth of the least common subsumer; relations with depth(LCS) >= 5 by length of the path setting up the relation]
Experiment: Findings Summary
• Similarity models based on members
  • produce the results that are most correlated with the results of the similarity measures based on WordNet
  • find more synonyms and direct is-a relations than the other models (path length).
• The majority of relations found by any model have a path length >= 3 and involve a common subsumer.
• Depth of LCS
  • VSM based on subscribers produces the highest number of specific relations (depth of LCS >= 5 or 6).
• Similarity models based on curators produce a lower number of relations.
Experiment: Execution
Elicit related keywords from Twitter lists
• Data set
• Schema representation of keywords
  • Based on curators
  • Based on subscribers
  • Based on members
• Model to identify similar keywords
  • Vector Space Model
  • Latent Dirichlet Allocation
• Pairs of related keywords per schema representation and model
  • Each keyword with its 5 most related keywords
Characterise the semantics of the relations
• SPARQL queries over general KBs published as Linked Data (DBpedia, OpenCyc, and UMBEL)
• Ontological relations between keywords
Experiment
• We anchored 63.77% of the keywords extracted from Twitter Lists to DBpedia resources.
Experiment
Vector-space model based on members (direct relations)

Relation type   %     Example of keywords
Broader Term    26%   life-science, biotech
subClassOf      26%   writers, authors
developer       11%   google, google_apps
genre           11%   funland, comedy
largest city    6%    houston, texas
Others          20%   -
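A direct relation between two keywords that have been anchored to DBpedia resources could be looked up with a SPARQL query like the one built below. This is only a sketch: the slides do not show the authors' actual queries, the helper names are hypothetical, and mapping the keywords "houston" and "texas" to the resources `Houston` and `Texas` is an assumption. The code builds the query and the request URL but does not send anything over the network.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def direct_relation_query(subject, obj):
    """Build a SPARQL query listing the properties that link subject directly to obj."""
    return (
        "SELECT DISTINCT ?relation WHERE { "
        f"<{RESOURCE_PREFIX}{subject}> ?relation <{RESOURCE_PREFIX}{obj}> . "
        "}"
    )


def request_url(query):
    """Encode the query as a GET request URL for the endpoint (not sent here)."""
    return DBPEDIA_ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = direct_relation_query("Texas", "Houston")
    print(query)
    print(request_url(query))
```

Because a relation may hold in either direction, the query would typically be issued twice per keyword pair, once with the arguments swapped.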
Vector-space model based on subscribers (relations of length 3) Linked data pattern (54.73%): x -> object