MRR: an Unsupervised Algorithm to Rank Reviews by ... - GitHub

MRR: an Unsupervised Algorithm to Rank Reviews by Relevance Vinicius Woloszyn

Henrique D. P. dos Santos

et al.

Department of Computer Science Federal University of Rio Grande do Sul and Pontifical Catholic University of Rio Grande do Sul 2017 IEEE/WIC/ACM International Conference on Web Intelligence

Leipzig, August 24, 2017

Vinicius Woloszyn, Henrique D. P. dos Santos,MRR: et al.an(UFRGS) Unsupervised Algorithm to Rank Reviews Leipzig, by Relevance August 24, 2017

1 / 21

Introduction

Many works address the problem of ranking documents by their relevance. Most of them rely on supervised algorithms such as classification and regression. Annotated: Neural Network, SVM Statistics: TF-IDF, Readability, POS-Tag


2 / 21

Introduction

The quality of results produced by supervised algorithms is dependent on the existence of a large, domain-dependent training data set. Amazon, Yelp Netflix, IMDB

Unsupervised methods are an attractive alternative to avoid the labor-intense and error-prone task of manual annotation of training datasets.


3 / 21

MRR - Ranking documents by their relevance

Graph-based Vertices are the documents (review), and the edges are defined in terms of the similarity between pairs of documents (ratings score and textual). f (u, v ) = α ∗ sim txt(u, v ) + (1 − α) ∗ sim star(u, v )

(1)

α : tune similarity function


4 / 21


Similarity Functions Textual Cosine similarity of TF-IDF vectors sim txt(u, v ) = cos(tfidf (t.t), tfidf (v .t))

(2)

Stars Euclidean distance normalized by Min-Max scaling sim star (u, v ) = 1 −

|u.rs − v .rs| − min(rs) max(rs) − min(rs)


(3)

5 / 21


Graph Centrality Hypothesis: a relevant document has a high centrality index since it is similar to many other documents. Centrality index produces a ranking of vertices’ importance, indicating the ranking of the most relevant document.


6 / 21

MRR - Graph-Specific Similarity Threshold

Graph Pruning Centrality is dependent on the existence of edges between nodes. Prune the graph based on a minimum similarity between review. E : mean of graph similarity ( W 0 (u, v ) =

1, 0,

f (u, v ) ≥ E ∗ β otherwise

(4)

β : tune prune function


7 / 21

Main steps of the MRR algorithm

♦♦ 3 ♥♥

0.55

0.9

7

♠♠ 4 ♥ ♦♦♦ ♣♣

0.55 0.01

2

0.45

(A) Similarity Function

♠♠ 3 ♥ ♦♦

♦ 2 ♣♣

8

0.8

0.15

0.08

♠♠ 4 ♥

0.8

♠♠ 4 ♥ ♦♦♦ ♣♣

0.32

♦♦ 3 ♥♥

7

0

0.85

0.9

8

0.8

♠♠ 4 ♥ 0.9

0.8

0

♦ 2 ♣♣

♦♦ 3 ♥♥

♠♠ 4 ♥

0.85

0.9

2

♠♠ 3 ♥ ♦♦

0.34

♠♠ 4 ♥ ♦♦♦ ♣♣ ♠♠ 3 ♥ ♦♦

♦ 2 ♣♣ 0.08

(B) Graph-Specic Threshold

0.22

(C) PageRank Scores

(A) Builds a similarity graph G between pairs of documents; (B) Prune by removing all edges lower than the similarity threshold; (C) Employ PageRank to obtain the centrality scores;


8 / 21

MRR Algorithm Algorithm 1 - MRR Algorithm (R, α, β): S 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:

for each u, v ∈ R do W [u, v ] ← α ∗ sim txt(u, v )+(1-α) ∗ sim star(u, v ) end for E ← mean(W ) for each u, v ∈ R do if W [u, v ] ≥ E ∗ β then W 0 [u, v ] ← 1 else W 0 [u, v ] ← 0 end if end for S ← PageRank(W 0 ) Return S


9 / 21

Experiment Design

Dataset: reviews (rating score and text) of electronics and books from the Amazon website. Gold Standard: Human perception of helpfulness: h(r ∈ R) =

vote+ (r ) vote+ (r ) + vote− (r )

(5)

Metric: Normalized Discounted Cumulative Gain as NDCG@n

Vinicius Woloszyn, Henrique D. P. dos Santos,MRR: et al.an(UFRGS) Unsupervised Algorithm to Rank ReviewsLeipzig, by Relevance August 24, 2017

10 / 21

Amazon Dataset

Electronics Books Votes 48.20 (± 302.84) 29.71 (± 73.58) Positive 40.12 (± 291.99) 20.60 (± 64.18) Negative 8.08 (± 22.27) 9.11 (± 21.44) Rating 3.73 (± 1.50) 3.41 (± 1.54) Words 350.32 (± 402.02) 287.44 (± 273.75) Products 383 461 Total 19,756 24,234 Table: Profiling of the Amazon dataset.


11 / 21

MRR Evaluation

Experiments: Baselines comparison; Graph-Specific Threshold Assessment; Parameter Sensibility; and Run-time Performance.


12 / 21

Experiment Design

Baselines: TSUR et al. (2009) as REVRANK; Core Virtual Review (200 most frequent words), Rank by similarity distance to Core

Wu et al. (2011) as PR HS LEN; Sentences similarity based on POS-Tags, PageRank, Hits and Length

SVM Regression: a) textual features TF-IDF and the star score, b) the same features used by Wu et al. (2011)


13 / 21

Relevance Ranking Assessment

SVM WU SVM TFIDF REVRANK PR HS LEN MRR

NDCG@1

NDCG@5

0.80770 0.85539 0.66052 0.72689 0.79877

0.91817 0.93119 0.68172 0.77131 0.81876

Table: Mean Performance on Book Reviews

MRR statistically outperformed all unsupervised baselines


14 / 21

Relevance Ranking Assessment

SVM WU SVM TFIDF REVRANK PR HS LEN MRR

NDCG@1

NDCG@5

0.76416 0.88986 0.67903 0.87434 0.89403

0.91535 0.94621 0.72133 0.87184 0.89246

Table: Mean Performance on Electronic Reviews

MRR statistically outperformed all unsupervised baselines MRR is comparable to supervised methods


15 / 21

Graph-Specific Threshold Assessment

MRR performance is always better using a Graph-Specific threshold.


16 / 21

Parameter Sensibility: α and β

α in all settings had a low influence (4%) β produced the highest variation (17%). Nevertheless when 0.8 ≤ β ≤ 0.9, the MRR varying only 6% .


17 / 21

Run-time Assessment Time required for producing a ranking for 383 products (log scale)

MRR presents a significantly lower running time Vinicius Woloszyn, Henrique D. P. dos Santos,MRR: et al.an(UFRGS) Unsupervised Algorithm to Rank ReviewsLeipzig, by Relevance August 24, 2017

18 / 21

Final Remarks

Contributions: Unsupervised method: does not depend on an annotated training set; Faster than other graph-centrality methods; It performs well in different domains (e.g. closed vs. open-ended); Significantly superior to the unsupervised baselines, and comparable to a supervised approach in a specific setting.


19 / 21

Further Work

Next steps: Others clustering techniques for graph; Methods to select the most relevant reviews; Segmented Bushy Path widely explored in text summarization;


20 / 21

Thanks

Thank You! Question? source: https://github.com/vwoloszyn/MRR contact: [email protected]


21 / 21

MRR: an Unsupervised Algorithm to Rank Reviews by ... - GitHub

MRR: an Unsupervised Algorithm to Rank Reviews by ... - GitHub

Suggest Documents

An unsupervised learning algorithm - Core

Learning to efficiently rank - GitHub Pages

An Unsupervised Algorithm for Segmenting Categorical ... - CiteSeerX

An Adaptive Unsupervised Segmentation Algorithm based ... - CiteSeerX

An Unsupervised Algorithm for Segmenting Categorical ... - CiteSeerX

Unsupervised Neural-Network-Based Algorithm for an

An efficient algorithm for learning to rank from preference graphs

Fixed-Rank Representation for Unsupervised Visual ... - CiteSeerX

Fixed-Rank Representation for Unsupervised Visual Learning

The DES Algorithm Illustrated - GitHub

Unsupervised Genetic Algorithm Deployed for

An Approach to Algorithm Design by Patterns

An improved memetic algorithm using ring neighborhood ... - GitHub

An improved memetic algorithm using ring neighborhood ... - GitHub

LEDIR: An Unsupervised Algorithm for Learning Directionality of ...

Voting Experts: An Unsupervised Algorithm for ... - Paul Cohen

An Unsupervised Feature Selection Algorithm with Feature Ranking ...

An Unsupervised Point Alignment Detection Algorithm - IPOL Journal

An Algorithm for the Unsupervised Learning of Morphology

An Algorithm for Unsupervised Topic Discovery from ... - CiteSeerX

An Information-theoretic Unsupervised Learning Algorithm for Neural ...

An Algorithm for the Unsupervised Learning of Morphology - first

An Unsupervised Alignment Algorithm for Text ... - Semantic Scholar

An unsupervised learning algorithm for fatigue crack ... - CiteSeerX