Machine learning and statistical inference: the case of Istat survey on ICT

Giulio Barcaroli (1), Gianpiero Bianchi (2), Renato Bruni (3), Alessandra Nurra (4), Sergio Salamone (5), Marco Scarnò (6)

Abstract

Istat is experimenting with web scraping, text mining and machine learning techniques in order to obtain a subset of the estimates currently produced by the sampling "Survey on ICT usage and e-Commerce in Enterprises", carried out yearly by Istat and by the other EU member states. The target estimates of this survey include the characteristics of the websites used by enterprises to present their business (for instance, whether the website offers e-commerce facilities or job vacancies). The aim of the experiment is to evaluate the possibility of using the sample of surveyed data as a training set in order to fit models that will then be applied to the generality of websites. The usefulness of such an approach is twofold: (i) to enrich the information available in the Business Register; (ii) to increase the quality of the estimates produced by the survey. These different objectives can be reached by combining web scraping procedures with text mining and machine learning techniques, making optimal use of all available information.
(1) ISTAT ([email protected])
(2) ISTAT ([email protected])
(3) University of Rome “Sapienza” ([email protected])
(4) ISTAT ([email protected])
(5) ISTAT ([email protected])
(6) CINECA ([email protected])
Key words: web scraping, text mining, machine learning, logic-based classification, statistical inference, ICT survey

1. Introduction

The "Survey on ICT usage and e-Commerce in Enterprises" is carried out yearly by Istat and by the other EU member states. Among the target estimates there are a number of characteristics that describe the relation between the enterprise and its customers as expressed through the enterprise's website (for instance, the presence of online ordering or job application facilities). Our objective is to use survey data as ground truth to estimate a classification model able to predict the values of the target variables from the scraped content of the websites reached through the URLs indicated in the questionnaires. This model can be applied to the whole population of enterprises (nearly 190,000, of which some 70% own a website), making it possible to: (i) enrich the information available in the Business Register; (ii) increase the quality of the estimates produced by the survey, by adopting a model-based approach that can reduce the Mean Square Error. This paper reports the substantial improvements obtained with respect to the results presented in [1]. After describing the ICT survey in more detail in paragraph 2, we illustrate the new characteristics of the scraper that has been implemented (par. 3), together with a new algorithm used as a classifier (par. 4). In par. 5 the results are presented, while par. 6 reports conclusions together with indications on future work.

2. The Istat Survey on ICT usage and e-Commerce in Enterprises

Since 2001 the European Commission has established the annual Information Society surveys to benchmark the development driven by Information and Communications Technologies (ICT) in enterprises and among individuals. The Community survey on ICT usage and e-commerce in enterprises (ICT survey) is developed in close collaboration with Member States and the OECD and is adapted every year to the new needs of users and policy makers and to capture technological changes (7). The survey is intended to measure the degree of use of new technologies in enterprises and provides the EU with information for comparison among Member States and for the evaluation of national policies on their capacity to grasp the potential of technological progress.
(7) The ICT survey is part of the European Community statistics on the information society, following Commission Regulation No 808/2004, which establishes the legal basis for harmonized statistics on ICT usage in enterprises. The methodology and the data are available at European level on the Eurostat website at http://ec.europa.eu/eurostat/web/information-society/overview.
The survey provides a wide and articulated set of indicators on Internet activities and connections used, e-Business, e-Commerce, ICT skills and e-Invoicing. The ICT survey is also one of the major data sources for the Digital Agenda Scoreboard, which measures the progress of the European digital economy. The target population consists of Italian enterprises with at least 10 persons employed, active in manufacturing, electricity, gas and steam, water supply, sewerage and waste management, construction and non-financial services. The survey is carried out as a stratified sample for the enterprises with 10-249 persons employed, while all the enterprises with at least 250 persons employed are included. The strata are defined by the combination of economic activity, size class of the statistical units in terms of persons employed and the administrative region in which the enterprise is located (8). The final estimates take into account a calibration weight attributed to each unit, derived according to the theory of Deville and Särndal [5]. In 2015 the sample involved 31,738 enterprises, representative of a universe of 188,625 units.

(8) The region attributed to the enterprise is the legal or administrative one resulting from the ASIA business register.

Since 2012 the technique used for data collection has been the self-completion of an online HTML questionnaire, and in 2015 the response rate was about 61%. The internal coherence of the data is verified through a specific editing and imputation (E&I) process, which considers both automatic editing and the possibility of recalling an enterprise to treat the most influential errors. The automatic editing applies some deterministic rules and a mixed procedure based on the Fellegi-Holt paradigm [6], as implemented by the software SCIA (System for the Automatic Control and Imputation) developed by Istat. In some cases (as for quantitative data), corrective methods are adopted to reduce the effect of non-respondents and wrong answers, checking the consistency of the data against administrative sources, structural business statistics and the economic data of the previous round of the same survey. The E&I procedure is monitored by an analysis of performance conducted with transition matrices on raw and clean data, which shows a limited impact of the E&I procedure on the final estimates. This was also made possible by the set of online rules included in the electronic questionnaire, allowing inconsistencies or missing answers to be removed during data collection.

Table 1 reports the distribution of the main variables collected in the 2011-2015 period: while the share of enterprises declaring a website has increased, the percentage of those using its various utilities has remained fairly stable over the years, with the exception of the privacy policy statement.

Table 1 – Indicators on website and website utilities, years 2011-2015 (percentages of total enterprises with at least 10 persons employed)

Web site and web site utilities                2011    2012    2013    2014    2015
Enterprises with a web site or a homepage      62.6    64.5    67.3    69.2    70.7
Online ordering, reservation or booking        13.5    10.6    11.7    11.5    12.8
Product catalogues or price lists              33.4    32.1    33.1    33.0    33.3
Products personalization                        2.4     2.1     2.9     2.2     3.0
Website content personalization                 4.0     4.9     4.7     4.7     5.7
Online job application                          7.6     8.2     8.4     9.7     9.8
Privacy policy statement                       31.1    31.3    33.9    36.6    42.7
It has to be noted that, in the framework of this survey (as in the survey's Methodological Manual (9)), the 'website' is intended as the virtual place managed and owned by the enterprise itself. In this sense it is not merely an HTML page with some information about the location of the enterprise: it has to present the enterprise's business (e-marketplaces, used by the enterprise to sell its products, are not included in this definition). It is also important to underline that the Eurostat definition of web sales includes sales made via an online store (web shop), via web forms on a website, via extranet and via "apps". This means that the main concept investigated here (website and ordering/sales) is narrower than the web sales definition.

(9) The Methodological Manual can be downloaded at http://ec.europa.eu/eurostat/web/information-society/methodology.

3. Web scraping, text processing and features selection

From a statistical point of view, the aim of this study is the implementation of supervised classification models that map a series of explanatory variables into a desired value. The explanatory variables are represented by the characteristics of the enterprises' websites, from which it should be possible to verify whether a given service is offered through the web. We consider the website of each enterprise (i.e. all its web pages) as a single document characterized by a set of terms; these can also include other elements of the web pages, such as their tags (10). To extract and collect the information from the websites we use a web scraping and text mining approach, which transforms unstructured data (a document accessible through the web) into structured data. The software we used (for this task and for all the steps illustrated in this paragraph) is based on ADaMSoft (11), an open source general-purpose package written in Java for data management, data analysis, ETL, etc., which integrates, among other things, methods and libraries to parse HTML pages.

(10) HTML elements that have a meaning, either in a semantic sense or as behaviour for the browser, for instance: "put a button in the page (with a given image and a tooltip) that, if pressed, will cause this action".
(11) See https://en.wikipedia.org/wiki/ADaMSoft or the documentation at http://adamsoft.sourceforge.net/documentation.html (sections Web and Text Mining).

The procedure can scrape more than one website in parallel; obviously, the time required depends on the number of pages to analyse and on the network speed. We chose to run 20 parallel processes, each of them able to follow all the links up to a maximum of two levels from the main page (i.e. main page -> link -> link), with a limit of 1,000 pages retrieved. The procedure took almost 20 hours and produced a final data set with more than 200 million records, related to all the tags analysed, covering 10,289 (80%) of the initial 12,816 websites. The reasons for not succeeding in scraping all the websites are: (i) wrong specification of the URLs; (ii) errors in communicating with their servers; (iii) technologies not supported by the parser (mainly websites implemented with Adobe Flash).
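The crawling logic just described can be sketched as follows. This is only an illustrative Python reimplementation (using the requests and BeautifulSoup libraries and a hypothetical list of URLs), not the ADaMSoft procedure actually employed.

```python
# Illustrative sketch of the crawling logic described above (links followed up to
# two levels from the home page, capped number of pages, several sites in parallel).
# The actual procedure was implemented with ADaMSoft; this is only an approximation.
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def scrape_site(start_url, max_depth=2, max_pages=1000):
    """Return {url: html_text} for the pages reachable within max_depth clicks."""
    domain = urlparse(start_url).netloc
    queue, seen, pages = deque([(start_url, 0)]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # e.g. errors in communicating with the server
        pages[url] = html
        if depth < max_depth:
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages

urls = ["http://www.example-enterprise.it"]          # hypothetical list of enterprise URLs
with ThreadPoolExecutor(max_workers=20) as pool:     # 20 sites handled in parallel
    scraped = dict(zip(urls, pool.map(scrape_site, urls)))
```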
After the scraping step, the collected texts have been processed by first treating the non-ASCII characters (for example "/", "&", etc.), which have been transformed into spaces. The resulting text was tokenized to obtain all the single elements (words). This step made it possible to extract meaningful terms hidden inside the HTML code (for instance "/basket.jpg" becomes "basket jpg"). Because of the huge number of different terms (several millions), we reduced the inflected words to their stem, base or root form (lemmatization step), with reference to the Italian and English dictionaries. At the end of this step we obtained 10,094 websites characterized by a list of valid terms, the basis for the construction of the document-term matrix, i.e. a matrix containing the frequencies of the terms (columns) in each website (rows).

A further transformation reduced the terms (initially more than 50,000) to only those having an influence on the values of each target variable. We randomly selected a set of rows from the websites-terms matrix (the set of data on which the classification models had to be trained). Then we collapsed the rows into two parts, corresponding to the two values of the characteristic to be modelled (for instance "yes" or "no" for online ordering), each row reporting the frequencies of the related terms. We then applied Lexical Correspondence Analysis, which provides for each term a coefficient of projection onto the first (and unique) resulting dimension, together with its cosine (which can be interpreted as the quality of its projection). Both these values allowed us to reduce the number of terms (and the initial matrix) to those really discriminant for the target characteristic (1,000 in total).
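A minimal sketch of these two steps (document-term matrix and selection of discriminant terms) is the following. It is not the ADaMSoft implementation: it uses scikit-learn and NumPy on a toy corpus, and it computes only the term coordinates on the single correspondence-analysis dimension, omitting the cosine-based quality of projection used in the actual procedure.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the scraped websites (one string of stemmed terms per site).
docs = ["cart checkout basket price", "company contact map history",
        "cart price order basket", "contact history company news"]
y = np.array([1, 0, 1, 0])   # survey answer for the target characteristic (e.g. online ordering)

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().astype(float)   # document-term matrix

# Collapse the rows into a 2 x V contingency table ("yes" vs "no"), then project every
# term onto the single correspondence-analysis dimension via the SVD of the matrix of
# standardized residuals.
N = np.vstack([X[y == 1].sum(axis=0), X[y == 0].sum(axis=0)])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
coord = s[0] * Vt[0] / np.sqrt(c)            # coefficient of projection of each term

keep = np.argsort(-np.abs(coord))[:1000]     # retain the most discriminant terms
terms = np.array(vec.get_feature_names_out())[keep]
```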
4. Logic-based classification

Several classification techniques have been proposed in the literature, based on completely different paradigms; we test a selection of the most established ones in the next paragraph. In particular, a methodology that generally provides good accuracy together with additional useful by-products is the Statistical and Logical Analysis of Data (SLAD), recently proposed in [3] as an evolution of the classical Logical Analysis of Data (LAD) [2]. This methodology is based on Boolean techniques and is closely related to Decision Trees and Nearest Neighbor methods, actually constituting an extension of the latter two, as shown in [4].

The LAD technique is inspired by the mental processes that a human being applies when learning to classify from examples. Data are organized into records, constituted by the rows of the above-mentioned matrix (i.e., the enterprises' websites). Each record is a set of fields (i.e., the terms), and each field has a value (i.e., the number of occurrences of a term). The class is the presence or absence of the aspect under analysis (e.g., "Online ordering"). The procedure can be roughly described as follows. Data are initially encoded in binary form by a discretization process called binarization. This is obtained by using the training set to compute specific values, called cut-points, which convert each field into a set of binary attributes. Cut-points should be set at values representing some kind of watershed for the analyzed phenomenon. For a given numerical field, this can be done by considering, for each pair of adjacent values belonging to records of opposite classes, the middle value. Cut-points are then used to binarize the values of that numerical field into a set of binary attributes, each representing being above or below the value of a cut-point. Since in practice the number of binary attributes obtainable with the above procedure is very large, a selection step is needed to compute an effective binarization of reasonable size. The selection step is modeled as a binary optimization problem, either of set covering or of knapsack type (see [7] for details on these Combinatorial Optimization problems).

The selected set of binary attributes is then used to build the patterns. A pattern is a conjunction of binary attributes characterizing a class; each binary attribute can be seen as a condition. A conjunction of binary attributes is a positive pattern if it evaluates to True on (i.e., it covers) at least a certain number of positive records and does not cover more than a certain number of non-positive records. A negative pattern is defined symmetrically. In the space defined by the original fields of the records, each record is a point and each pattern corresponds to a polyhedron (the intersection of a finite number of equations and inequalities). Now, a new record located in a region of the space covered only by positive patterns is classified as positive, and vice versa. However, most regions of the space are actually covered by patterns of mixed classes. In this case, each pattern receives a weight, which is a measure of its importance, and a weighted sum determines the class of the record under classification. More details can be found in [3]. An additional key feature of LAD methodologies is that patterns can be seen as a compact description of the data or, in other words, as an interpretation of the analyzed phenomenon. Therefore, the described procedure can also be used to perform rule extraction tasks.
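The cut-point generation used for binarization can be illustrated with a minimal Python sketch on a toy example (the field below, counting the occurrences of one stemmed term in each website, and its labels are hypothetical; this is not the SLAD implementation).

```python
def candidate_cut_points(values, classes):
    """Midpoints between adjacent values that belong to records of opposite classes."""
    pairs = sorted(zip(values, classes))
    cuts = set()
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            cuts.add((v1 + v2) / 2.0)
    return sorted(cuts)

def binarize(value, cuts):
    """One binary attribute per cut-point: 1 if the field value exceeds the cut-point."""
    return [int(value > c) for c in cuts]

# Hypothetical field: occurrences of the stemmed term "cart" in six websites.
freq  = [0, 0, 1, 3, 4, 7]
label = [0, 0, 0, 1, 1, 1]                    # 1 = online ordering declared in the survey
cuts = candidate_cut_points(freq, label)      # -> [2.0]
print([binarize(f, cuts) for f in freq])      # binary attributes for each record
```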
The SLAD procedure has been applied to the described dataset with a number of additional steps, described below. This was necessary because, after the initial selection of a training set L and of a test set T, preliminary tests with several classifiers clearly showed that there is some inherent noise in the class label: records with positive features sometimes have a negative label, and vice versa. This makes the obtained dataset very hard to classify accurately. Moreover, the dataset is imbalanced, since the proportion of positive records is sensibly smaller than that of the negative ones, and this causes poor performance especially in the prediction of positive records.

In order to reduce the noise in the class attribution and deal more successfully with the dataset, we developed additional filtering steps. These steps are allowed to use the class information in the case of the training set, but only our prediction of the class in the case of the test set. We initially apply the SLAD procedure to L and generate a set of positive patterns P+1 and a set of negative ones P-1. Then we use the patterns in P+1 and P-1 to classify the test set T, so that each record receives a tentative label. After this, we use this initial class prediction to further treat T. For each field fi of the data records, corresponding to word wi, we evaluate its positive valence by counting the presences of wi in positive records of L and its absences from negative records of L, subtracting its presences in positive records with negative features (noisy positives) and its absences from negative records with positive features (noisy negatives). The fields with the best positive valences constitute a set B that is used to generate two new sets P+2 and P-2 of positive and negative patterns. Patterns P+2 are obtained by solving a set covering problem corresponding to covering all positive records in L with combinations of elements of B; this combinatorial optimization problem is solved by means of a dual heuristic procedure. Patterns P-2 are obtained symmetrically. Finally, the test set T is classified by using the patterns P+1, P-1, P+2 and P-2, with the latter two sets having the larger weight values. The described reinforcement of the SLAD procedure leads to better performance.
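The valence computation described above can be sketched as follows. This is a hedged illustration: how the "noisy" records are flagged is simplified here to an input array, whereas in the actual procedure it derives from the comparison between labels and features/predictions.

```python
import numpy as np

def positive_valence(x, y, noisy):
    """Valence of one word, as described above.

    x     : occurrences of the word in each training record (0 = absent)
    y     : class labels (1 = positive, 0 = negative)
    noisy : flags for records whose features look inconsistent with their label
            (obtaining these flags is simplified to an input -- an assumption of the sketch)
    """
    present, absent = x > 0, x == 0
    score  = np.sum(present & (y == 1)) + np.sum(absent & (y == 0))   # rewarding evidence
    score -= np.sum(present & (y == 1) & noisy)                       # noisy positives
    score -= np.sum(absent  & (y == 0) & noisy)                       # noisy negatives
    return int(score)

# Hypothetical data: the words with the highest valence form the set B used for P+2 / P-2.
x = np.array([3, 0, 1, 0, 2, 0])
y = np.array([1, 0, 1, 0, 1, 0])
noisy = np.array([0, 0, 1, 0, 0, 1], dtype=bool)
print(positive_valence(x, y, noisy))   # -> 4
```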
5. Computational results

The execution of the web scraping procedure, the processing of the scraped texts and the feature selection step produced a documents/terms matrix, where each row represents a website and each column refers to an influential word; each cell reports the number of times the term is contained in the document. To each row are also attached the values of the target variables observed in the 2015 round of the survey, regarding the presence of the utilities reported in Table 2, plus "Links or references to the enterprise's social media profiles (Links to social)" and "Possibility of online submission of complaints (Online complaints)". This matrix is the input for fitting models in which the observed variables are the y's and the terms are the x's. The 10,094 rows of the matrix have been equally split into train and test datasets, both with 5,047 rows.

Different learners have been considered: one belonging to the classical statistical parametric models (the Logistic model), others to the ensemble learners (Random Forest, Adaptive Boosting, Bootstrap Aggregating), together with Naïve Bayes, Neural Networks, Support Vector Machines (SVM) and the Statistical and Logical Analysis of Data (SLAD). We report the results obtained by applying to the test dataset each learner fitted on the train dataset for one target variable, namely "Online ordering". The indicators considered in order to comparatively evaluate the performance of the different learners are: (1) accuracy (rate of correct predictions on the total of cases); (2) sensitivity (or recall: rate of true positives on the total number of positives); (3) specificity (rate of true negatives on the total number of negatives); (4) difference between (i) the proportion of positives calculated on the observed values and (ii) the proportion of positives calculated on the predicted values; (5) F1-measure: harmonic mean of recall and precision (rate of true positives on the total of predicted positives); (6) p-value related to the test [Accuracy > Non Informative Rate].

For the first six learners (so excluding SVM and SLAD), a further feature selection step has been carried out by selecting 200 terms (out of the 1,000 selected in the previous Lexical Correspondence Analysis step), making use of the "importance" of terms evaluated by a preliminary application of Random Forest. This has significantly increased the performance of some learners (for instance, of the Logistic model).
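The evaluation scheme can be sketched as follows. The snippet is only an illustration (synthetic data, a single scikit-learn Random Forest, and an exact one-sided binomial test for indicator (6), which may differ from the test actually employed); it is not the pipeline used for the experiments.

```python
import numpy as np
from scipy.stats import binom
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(10094, 200))      # stand-in for the 200-term frequency matrix
y = rng.binomial(1, 0.2, size=10094)         # stand-in for the observed yes/no answers

# Equal split into train and test sets, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
pred = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train).predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
accuracy    = (tp + tn) / len(y_test)                 # (1)
sensitivity = tp / (tp + fn)                          # (2)
specificity = tn / (tn + fp)                          # (3)
delta_prop  = abs(y_test.mean() - pred.mean())        # (4)
f1          = f1_score(y_test, pred)                  # (5)
nir = max(y_test.mean(), 1 - y_test.mean())           # Non Informative Rate
p_value = binom.sf(tp + tn - 1, len(y_test), nir)     # (6) one-sided test of Accuracy > NIR
```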
The performance obtained by each learner is reported in Table 3.

Table 3 – Performance of predictors applied for variable "Online ordering (yes/no)"

                                     Indicators
Learner             (1)     (2)     (3)     (4)     (5)     (6)
1. Logistic         0.83    0.53    0.89    0.01    0.53    0.01625
2. Naïve Bayes      0.80    0.46    0.87    0.00    0.46    0.99490
3. Random Forest    0.83    0.53    0.90    0.01    0.55    0.00006
4. Bagging          0.82    0.44    0.90    0.03    0.48    0.11520
5. Boosting         0.81    0.50    0.88    0.00    0.50    0.56530
6. Neural Net       0.82    0.52    0.89    0.01    0.52    0.10180
7. SVM              0.83    0.64    0.88    0.01    0.59    0.00018
8. SLAD             0.84    0.62    0.90    0.01    0.60    0.00018
The p-values related to Naïve Bayes and Boosting show that the null hypothesis that their accuracy does not exceed the Non Informative Rate (the proportion of the most numerous category, the negatives, on the total) cannot be rejected. SVM turns out to be the best among the classic learners. On the other hand, the new classifier SLAD is comparable to SVM: its overall accuracy is 1% better (and it is the best among all learners), while its sensitivity is only 2% worse than SVM and its specificity is 2% better. SLAD is the best in terms of F1-measure. Observe that these performances are in general better than those recorded in previous experiments (reported in [1]). This is due to several reasons: (i) a higher efficiency of the new web scraping procedure, able to capture much more information; (ii) a better feature selection step, using not only the Lexical Correspondence Analysis but also the importance of terms as detected by Random Forest; (iii) the choice of the threshold used to decide, on the basis of the score output by each learner, whether the predicted value is positive or negative: for the first six learners this threshold has been determined by minimizing the difference between the proportions of positives calculated on observed and on predicted values.
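A minimal sketch of this threshold choice, assuming the learner outputs a score or posterior probability for each record (the values below are hypothetical):

```python
import numpy as np

def calibrate_threshold(scores, y_observed):
    """Pick the score threshold that makes the predicted proportion of positives
    as close as possible to the observed one (driving indicator (4) towards zero)."""
    candidates = np.unique(scores)
    diffs = [abs((scores >= t).mean() - y_observed.mean()) for t in candidates]
    return float(candidates[int(np.argmin(diffs))])

scores = np.array([0.9, 0.8, 0.35, 0.3, 0.2, 0.1])   # hypothetical learner scores
y_obs  = np.array([1,   1,   0,    1,   0,   0])     # observed values
t = calibrate_threshold(scores, y_obs)                # -> 0.35
pred = (scores >= t).astype(int)                      # 3 predicted positives, as observed
```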
The previous learners, with the exclusion of SVM and SLAD (which require a heavy fine tuning of parameters and will be applied in a next phase), have been used for the whole set of target variables, obtaining the results reported in Table 4. For each variable, the results obtained by the best predictor (always Random Forest, with the exception of "Online ordering", where SLAD was used) are reported. Looking at the p-values, the predictions related to four target variables out of nine turn out to be non-significant. The best results in terms of F1-measure are obtained for the variables "Catalogue", "Links to social" and "Online job application", as well as for "Online ordering". Also in this case, results are in general better than those reported in [1], especially for "Online ordering", "Online job application" and "Catalogue". Taking into account the different performance of SLAD and SVM with respect to Random Forest, better results may be expected by applying these two learners to the whole set of target variables.
Table 4 – Results obtained for the different variables

                                                      Indicators
Variable                                   (1)     (2)     (3)     (4)     (5)     (6)
Online ordering, reservation or booking    0.84    0.62    0.90    0.01    0.60    0.00018
Product catalogue or price lists           0.70    0.65    0.73    0.01    0.67    < 2e-16
Products personalization                   0.84    0.18    0.91    0.00    0.17    1.0
Orders tracking                            0.86    0.37    0.92    0.01    0.32    1.0
Website content personalization            0.92    0.19    0.96    0.00    0.16    1.0
Links to social                            0.72    0.65    0.78    0.01    0.67    < 2e-16
Privacy policy statement                   0.92    0.15    0.96    0.00    0.15    1.0
Online job application                     0.80    0.64    0.86    0.01    0.64    < 2e-16
Online complaints submission               0.67    0.51    0.77    0.02    0.53    4.855e-09
6. Future work and conclusions

Web scraping and text mining can be applied to real-world websites to produce datasets for the automatic detection of some aspect of interest by means of classification algorithms. However, this turns out to be a difficult task. A dataset gathered in this way is likely to contain some inherent noise in the class label: records with positive features sometimes have a negative label, and vice versa. This makes the obtained dataset very hard to classify. Moreover, as happens in the analyzed "Online ordering" case, the dataset is often imbalanced: the proportion of positive records is sensibly smaller than that of the negative ones. This causes poor performance especially in the prediction of positive records (sensitivity). After testing several classifiers, in this work we have developed improvements of the Statistical and Logical Analysis of Data (SLAD) that allow better performance, making it comparable to SVM. Results are encouraging, especially when compared with those obtained in previous experiments.
The next steps in the short term will be: (i) the application of SVM and SLAD to the whole set of target variables; (ii) a more detailed analysis of the learners' performance, also considering subdomains of the target population defined by cross-classifying enterprises by size and economic activity; (iii) a further improvement of the performance of the models by adding explanatory variables consisting not only of single terms but also of vectors of terms, where these vectors could represent sequences of terms relevant for each characteristic of interest (a sketch of this idea is given below). Only when a sufficient degree of quality of the resulting predictive models is guaranteed will they be applied to the whole population of enterprises owning a website. A crucial task will also be the retrieval of the URLs of the websites for the whole population of enterprises. Finally, once the values of the target variables have been predicted for all reachable units in the population, the quality of the produced estimates will be analysed and compared to the current sampling estimates obtained by the survey: a quality framework specifically dedicated to this task is being defined.
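One possible way to build such term-sequence features (an assumption of this sketch, not a design choice already made in the paper) is to extend the document-term matrix with word n-grams:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Extend the document/term matrix with bigrams and trigrams, so that sequences such as
# "add to cart" become features alongside the single terms.
vec = CountVectorizer(ngram_range=(1, 3))
X_ngrams = vec.fit_transform(["add to cart proceed to checkout",
                              "send your curriculum vitae online"])
print(vec.get_feature_names_out()[:5])
```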
References

[1] G. Barcaroli, A. Nurra, S. Salamone, M. Scannapieco, M. Scarnò, D. Summa. Internet as Data Source in the Istat Survey on ICT in Enterprises. Austrian Journal of Statistics, Vol. 44, 31-43, April 2015.
[2] E. Boros, P.L. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz, I. Muchnik. An Implementation of Logical Analysis of Data. IEEE Transactions on Knowledge and Data Engineering, Vol. 12(2), 292-306, 2000.
[3] R. Bruni, G. Bianchi. Effective Classification using Binarization and Statistical Analysis. IEEE Transactions on Knowledge and Data Engineering, Vol. 27(9), 2349-2361, 2015.
[4] Y. Crama, P.L. Hammer. Boolean Functions: Theory, Algorithms, and Applications. Cambridge University Press, New York, 2011. ISBN 9780521847513.
[5] J.C. Deville, C.E. Särndal. Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, Vol. 87, 376-382, 1992.
[6] I.P. Fellegi, D. Holt. A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association, Vol. 71(353), 17-35, 1976.
[7] G.L. Nemhauser, L.A. Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons, New York, 1999.