
Web scraping, text mining and machine learning experiences at Istat with R and other open source systems
Giulio Barcaroli, Tiziana Tuoto
Methods, Quality and Metadata Division
Italian National Institute of Statistics (Istat)

Giulio Barcaroli, Tiziana Tuoto - Web scraping, text mining and machine learning experiences at Istat with R and other open source systems – CMStatistics 2016 – 9-11 December 2016, University of Seville, Spain

Outline
• Overview of Istat experiments on the use of new data sources for statistical production, in particular Internet data sources (Internet queries, social networks, websites)
• Focus on the ICT in Enterprises survey
• First results of the combined use of survey data and data from the Internet
• Illustration of the open source software employed


Big Data: which sources, and which use

Source type                            Domain(s)
Internet data (web scraping)           ICT in Enterprises, Consumer Prices and Agritourism statistics
Online Search data (Google queries)    Labour Force statistics
Mobile Phone data (CDR)                Mobility and Tourism
Scanner data                           Consumer Prices
Social Media                           Social statistics (e.g. Consumer Confidence)
Traffic webcams & Satellite Imagery    Traffic and Agriculture statistics


Internet as Data source: web scraping
So far, Istat has used web scraping techniques in three different domains:
1. Consumer price index: scraping of prices of electronic products (PCs, laptops, tablets, smartphones, …) and airline tickets
2. Agritourism: scraping of website content in order to collect information on agritourism farms (availability of services, prices, …)
3. ICT in Enterprises: scraping of website content in order to collect information on enterprises (e-commerce, job vacancies, presence in social networks, …)
The first is in production, while the other two are under evaluation.



The Istat Survey on ICT in Enterprises
The «Survey on the use of ICT by Enterprises» is carried out in all Member States of the European Union. In Italy, the survey covers a universe of about 182,000 enterprises with at least 10 employees, by means of a sample survey of 32,000 of them, of which 61% respond (about 19,000).



The Istat Survey on ICT in Enterprises
[Figure: a subsection of the questionnaire]
In the 2016 round of the survey, more than 14,000 respondent enterprises (74%) declared a website and indicated its URL.



Current estimation approach
One of the target estimates of the survey is the «Number of enterprises that have a website and use it for online ordering (e-commerce)».
Under the current approach, this number is estimated by using the calibration estimator:

\[
\hat{Y}_{GREG} = \Big(\sum_{k \in U} x_k\Big)' \hat{\beta} + \sum_{k \in s} d_k \,(y_k - x_k' \hat{\beta}) = \hat{Y}_{HT} + (X - \hat{X}_{HT})' \hat{\beta} = \sum_{k \in s} w_k y_k
\]

The estimate is obtained by modifying the initial weights (which depend only on the inclusion probabilities of the sampling units), using as auxiliary variables the number of firms and the number of employees, as recorded in the Business Register ASIA. This reduces the bias induced by the high non-response rate, because the final weights let each respondent unit also represent non-respondent units with similar characteristics.
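As a toy illustration of the weighting idea described above, the following Python sketch computes a Horvitz-Thompson total and its GREG adjustment with a single auxiliary variable (number of employees) and no intercept. The data, the function name `greg_estimate` and the register total are invented for illustration; the survey itself uses two auxiliary variables from ASIA, and its production procedure is not shown in these slides.

```python
def greg_estimate(y, x, d, x_total):
    """GREG estimate of a population total with one auxiliary variable.

    y: survey values of the respondents (1 = e-commerce, 0 = no)
    x: auxiliary values of the respondents (e.g. number of employees)
    d: design weights (1 / inclusion probability)
    x_total: known population total of x (hypothetical register figure)
    """
    y_ht = sum(dk * yk for dk, yk in zip(d, y))  # Horvitz-Thompson total of y
    x_ht = sum(dk * xk for dk, xk in zip(d, x))  # Horvitz-Thompson total of x
    sxx = sum(dk * xk * xk for dk, xk in zip(d, x))
    sxy = sum(dk * xk * yk for dk, xk, yk in zip(d, x, y))
    beta = sxy / sxx  # weighted least-squares slope of y on x
    y_greg = y_ht + (x_total - x_ht) * beta
    # Equivalent calibrated weights: the same estimate is sum(w_k * y_k)
    w = [dk * (1 + (x_total - x_ht) * xk / sxx) for dk, xk in zip(d, x)]
    return y_ht, y_greg, w


# Invented toy sample of four respondents
y = [1, 0, 1, 0]
x = [10, 20, 50, 15]
d = [5, 5, 5, 5]
x_total = 500

y_ht, y_greg, w = greg_estimate(y, x, d, x_total)
print(y_ht, round(y_greg, 3))
```

The final weights `w` reproduce the known total of the auxiliary variable exactly, which is the defining property of calibration.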


Big Data approach
Under this approach, survey data are used together with data collected directly from the Internet:
1. All websites used by enterprises included in the target population are scraped, in order to collect the HTML text they contain.
2. The text is processed with text processing techniques, in order to produce a «document-term matrix».
3. Survey data act as the training set, used to tune a machine learning algorithm where the target variable is «e-commerce (yes/no)» and the explanatory variables are the terms in the document-term matrix.
4. The algorithm is applied to the whole population, in order to predict the value of «e-commerce (yes/no)» for every unit.
5. The estimate of the number of enterprises with e-commerce is then obtained by counting the «yes» values in the whole population:

\[
\hat{Y}_{Alg} = \sum_{k \in s} y_k + \sum_{k \in (U - s)} \hat{y}_k = \sum_{k \in s} y_k + \sum_{k \in (U - s)} F(x^{*}_{k}, \theta_i)
\]
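The steps above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the texts are invented, the document-term matrix is built by plain whitespace tokenisation, and the function `predict` is a hypothetical stand-in for the fitted learner F (a real learner would be trained on the surveyed units).

```python
from collections import Counter


def document_term_matrix(texts):
    """Return (vocabulary, rows), where rows[i][j] counts term j in text i."""
    tokenised = [text.lower().split() for text in texts]
    vocabulary = sorted({term for doc in tokenised for term in doc})
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def algorithm_estimate(y_sample, x_nonsample, predict):
    """Observed 'yes' values in the sample plus predicted 'yes' outside it."""
    return sum(y_sample) + sum(predict(row) for row in x_nonsample)


# Invented toy data: two surveyed enterprises, two known only via their website
sample_texts = ["add to cart and pay by credit card", "company history and contacts"]
y_sample = [1, 0]
other_texts = ["online shop cart checkout", "about us"]

vocabulary, rows = document_term_matrix(sample_texts + other_texts)
cart = vocabulary.index("cart")


def predict(row):
    """Hypothetical stand-in for the learner F: 'yes' if "cart" occurs."""
    return 1 if row[cart] > 0 else 0


y_alg = algorithm_estimate(y_sample, rows[len(sample_texts):], predict)
print(y_alg)
```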



Web scraping + text processing + machine learning

[Diagram: combined use of survey data and Internet data]
Survey side: population frame (ASIA); reference population of 182,000 enterprises; sample selection of 32,000 enterprises; data collection on 19,000 enterprises; microdata.
Internet side (Big Data: Internet as Data Source): websites and social networks; 14,000 URLs; 10,000 websites; texts.
Steps: 1. Web scraping; 2. Text processing and machine learning.
Target variables: e-commerce, e-recruitment, e-tendering, …


Predictors


Web scraping and text processing



Dimensionality reduction = feature selection
The information we give as input to the learners contains:
• signals
• noise
If we do not filter out the noise, the signals cannot be detected and the predictive capability of the algorithm is reduced. For this reason, in most cases a feature selection step must be performed before tuning the learners.

In our case, a dramatic reduction in the number of terms extracted from the scraped text is obtained by applying:
• correspondence analysis (reducing more than 50,000 terms to about 1,000);
• importance of terms in Random Forest (from 1,000 to 200).
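To make the idea of filtering terms concrete, here is a minimal Python sketch that ranks terms by their association with the target variable and keeps only the best ones. Note that it swaps in a simple chi-square score on a 2x2 table, which is not the correspondence analysis or Random Forest importance used in the experiment; the documents, labels and function names are invented for illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table (term present/absent by class)."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0


def select_terms(docs, labels, top_k):
    """Rank terms by chi-square association with the label, keep top_k."""
    vocabulary = sorted({term for doc in docs for term in doc})
    scores = {}
    for term in vocabulary:
        n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
        n10 = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
        n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
        n00 = sum(1 for d, y in zip(docs, labels) if term not in d and y == 0)
        scores[term] = chi_square(n11, n10, n01, n00)
    # Highest score first; ties broken alphabetically for reproducibility
    ranked = sorted(vocabulary, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Invented toy corpus: each document is the set of terms found on a website
docs = [
    {"cart", "shop", "the"},
    {"cart", "pay", "the"},
    {"history", "the"},
    {"contacts", "the"},
]
labels = [1, 1, 0, 0]  # 1 = e-commerce declared in the survey

selected = select_terms(docs, labels, top_k=1)
print(selected)
```

A term such as "the", which occurs in every document, gets a score of zero: it is noise in the sense used above.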



Machine learning algorithms

Using the document-term matrix obtained after the web scraping and text mining steps, several models were estimated and performance indicators were calculated for each of them.

Learner            Accuracy  Sensitivity  Specificity  F1 measure  Est. diff.  p-value
1. Logistic        0.83      0.53         0.89         0.53        0.01        0.01625
2. Naïve Bayes     0.80      0.46         0.87         0.46        0.00        0.99490
3. Random Forest   0.83      0.53         0.90         0.55        0.01        0.00006
4. Bagging         0.82      0.44         0.90         0.48        0.03        0.11520
5. Boosting        0.81      0.50         0.88         0.50        0.00        0.56530
6. Neural Net      0.82      0.52         0.89         0.52        0.01        0.10180
7. SVM             0.83      0.64         0.88         0.59        0.01        0.00018
8. SLAD            0.84      0.62         0.90         0.60        0.01        0.00018
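For reference, the first four indicators in the table follow the standard confusion-matrix definitions, sketched below in Python. The counts are invented and do not come from the experiment.

```python
def classification_indicators(tp, fp, tn, fn):
    """Standard indicators from confusion-matrix counts.

    tp/fn: enterprises with e-commerce predicted yes/no
    tn/fp: enterprises without e-commerce predicted no/yes
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # share of true "yes" that is detected
    specificity = tn / (tn + fp)  # share of true "no" that is detected
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Invented counts for illustration only
indicators = classification_indicators(tp=50, fp=10, tn=90, fn=50)
print(indicators)
```

With unbalanced classes, accuracy can stay high even when sensitivity is modest, which is the pattern visible in the table.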


Machine learning algorithms The sample estimates currently obtained from the survey were compared with the estimates obtained by applying two different models (logistic model and random forest) to the matrix obtained by web-scraping.

Variances were also estimated for these different estimates, in order to evaluate their reliability.



Comparison of estimates: % of e-commerce in different domains



Comparison of estimates: variance of estimators



Improvement of the estimation procedure for ICT survey
In the short term:
• adoption of new approaches (e.g. Deep Learning, Natural Language Processing, ontologies) so as to improve the accuracy of predictions;
• maximisation of the coverage of websites by asking enterprises directly for their URLs;
• extension of the procedure to enterprises with fewer than 10 employees.


The use of open source software
During the experiment carried out in the ICT pilot, a number of open source software packages and systems have been used and evaluated, namely:
• the Java library “Jsoup”, embedded in the ADaMSoft system, and Solr for web scraping
• the library “TreeTagger” for text processing
• “NLTK” for natural language processing
• the R packages “RTextTools”, “rattle” and “caret” for machine learning


Web scraping
Two systems have been used to perform web scraping:
• Jsoup, embedded in ADaMSoft, for web mining
• Solr (http://lucene.apache.org/solr/), an open source enterprise search platform built on top of Apache Lucene
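The core of the scraping step is turning an HTML page into plain text. The experiment used Jsoup and Solr; the sketch below is not their API but a minimal stand-in built on Python's standard `html.parser` module, applied to an invented page held in a string (no network access). The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented page for illustration
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Our shop</h1><p>Add to cart</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(page)
print(text)
```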






Text processing: lemmatisation
Once the texts have been scraped from the websites, the library TreeTagger is used to perform lemmatisation. The goal is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be differ color
The TreeTagger library can be used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian, Czech, Coptic and old French texts, and is adaptable to other languages if a lexicon and a manually tagged training corpus are available.
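The mapping in the examples above can be mimicked with a toy lookup table, sketched here in Python. This is not how TreeTagger works (it relies on trained language models rather than a fixed table); the tiny lexicon and the function `lemmatise` are invented purely to reproduce the examples on this slide.

```python
import re

# Tiny hand-made lexicon, invented for illustration only
LEMMAS = {
    "am": "be",
    "are": "be",
    "is": "be",
    "cars": "car",
    "colors": "color",
    "different": "differ",
}


def lemmatise(text):
    """Lower-case, strip possessive endings and map tokens to base forms."""
    lemmas = []
    for token in re.findall(r"[a-z']+", text.lower()):
        if token.endswith("'s"):
            token = token[:-2]  # boy's -> boy
        elif token.endswith("'"):
            token = token[:-1]  # cars' -> cars
        lemmas.append(LEMMAS.get(token, token))
    return " ".join(lemmas)


print(lemmatise("the boy's cars are different colors"))
print(lemmatise("car, cars, car's, cars'"))
```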


Text processing: natural language processing
If we consider single terms as predictors and detect the most important of them (“best words”), we ignore a very important source of information: the link between words. Natural language processing makes it possible to identify, by “tokenizing” the text, not only single words but also pairs, triples, etc., and to calculate for each of them its influence on a characteristic of the document.




Text mining: natural language processing
For instance, instead of considering the single terms
cart
card
which are certainly positively correlated with “e-commerce” but in some cases do not uniquely identify it, we consider the bigrams
add cart
credit card
which are much stronger predictors of e-commerce. The use of the Natural Language ToolKit (NLTK) Python software has greatly increased the performance of the learners, which are applied not only to best words but also to bigrams:

Learner                            Accuracy  Sensitivity  Specificity  F1 measure  Est. diff.  p-value
Best words + bigrams + logistic    0.91      0.64         0.98         0.74        0.05        2.2e-16
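Extracting bigrams is simple enough to show in plain Python. The experiment used NLTK; the sketch below deliberately avoids it and uses only built-ins, with an invented, already-cleaned text and hypothetical function names.

```python
def bigrams(text):
    """Return the list of adjacent word pairs in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def word_and_bigram_features(text):
    """Set of single words plus bigrams, usable as predictor names."""
    return set(text.lower().split()) | set(bigrams(text))


# Invented text, assumed to be already stripped of stop words
cleaned = "add cart pay credit card"
pairs = bigrams(cleaned)
features = word_and_bigram_features(cleaned)
print(pairs)
```

A page mentioning only "business card" would share the single term "card" with this text, but not the bigram "credit card".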


Machine learning: RTextTools R package
The first machine learning experiments were carried out with RTextTools, a package for automatic text classification that makes it easy to prepare textual data for machine learning techniques. The package is a wrapper around several other packages and makes many different algorithms available:
• Support Vector Machines
• ensemble learners: boosting, bagging, random forests
• Lasso and Elastic-Net regularised Generalised Linear Models (glmnet)
• decision trees
• neural networks
• maximum entropy
It also produces standard performance indicators.


Machine learning: Rattle
Rattle, the R data mining system, has also been used because of its friendly interface, which makes it possible to perform all modelling, prediction and evaluation activities without writing code.

Rattle makes it possible to apply:
1. decision trees
2. SVM
3. neural networks
4. random forests
5. boosting
6. linear models


Machine learning: caret
At present, the package “caret” is under evaluation. Like the others, this package makes it possible to apply machine learning algorithms and evaluate their performance, but it also offers a framework for optimising their application through resampling and parameter tuning.
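The idea of tuning by resampling can be illustrated without caret. The Python sketch below is not caret's API: it uses a toy one-dimensional nearest-neighbour classifier and k-fold cross-validation to choose the number of neighbours from a small grid. The data (counts of the term "cart" per website) and all function names are invented.

```python
def knn_predict(train_x, train_y, x, n_neighbours):
    """Majority vote among the n_neighbours training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))
    votes = sum(label for _, label in nearest[:n_neighbours])
    return 1 if 2 * votes > n_neighbours else 0


def cv_accuracy(xs, ys, n_neighbours, k):
    """Held-out accuracy averaged over k folds (fold i = every k-th unit)."""
    n = len(xs)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            prediction = knn_predict(train_x, train_y, xs[i], n_neighbours)
            correct += int(prediction == ys[i])
    return correct / n


def tune(xs, ys, grid, k):
    """Return the grid value with the best cross-validated accuracy."""
    return max(grid, key=lambda g: cv_accuracy(xs, ys, g, k))


# Invented data: occurrences of "cart" on a website, and e-commerce yes/no
xs = [0, 1, 0, 1, 8, 9, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
grid = [1, 3]

best = tune(xs, ys, grid, k=4)
print(best, cv_accuracy(xs, ys, best, k=4))
```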




Thank you for your attention

