Checking Eligibility of Google and Microsoft Machine Learning Tools for use by JACK e-Learning System
Master's Thesis for the MSc in Applied Computer Science – Systems Engineering at: Paluno – The Ruhr Institute for Software Technology, Department of Specification of Software Systems, University of Duisburg-Essen
Reem Kabbani 3013969 Essen, 22.07.2016
First Evaluator: Prof. Dr. Michael Goedicke Second Evaluator: Prof. Dr. Klaus Pohl
Statutory Declaration
I hereby declare that I have produced this work without the help of third parties and using only the cited sources and tools. All passages taken from the sources, whether verbatim or in substance, have been identified as such. This work has not yet been submitted to any examination authority in the same or a similar form.
Essen, 22.07.2016
Abstract
At the department of Specification of Software Systems, the training and examination system JACK is operated. This system already has some features for assessing free-text-based answers; however, it still needs Machine Learning tools to develop these capabilities so that they work more efficiently. In this Master's thesis, Microsoft and Google machine learning tools are investigated for their eligibility for use by JACK's assessment component for free-text-based answers. The proposed approaches depend mainly on new Deep Learning models, which have been a topic of interest for many researchers in the last few years, because many promising results have been achieved using such models, especially in speech recognition and web search. These Deep Learning models are the Paragraph Vector proposed by Google researchers and the Deep Structured Semantic Model proposed by Microsoft.
Zusammenfassung
At the chair for Specification of Software Systems, the training and examination system JACK is developed. This system already has some features for assessing free-text-based answers; however, it still needs machine learning tools to develop these capabilities so that it operates more efficiently. In this Master's thesis, Microsoft and Google machine learning frameworks are investigated for their eligibility for use by JACK's assessment component, which is dedicated to free-text-based answers. The proposed approaches depend mainly on new deep learning models, which have been a topic of interest for many researchers in recent years, because many promising results have been achieved with such models, especially in speech recognition and web search. These models are the Paragraph Vector proposed by Google researchers and the Deep Structured Semantic Model proposed by Microsoft.
Note of Thanks
This work, including research, structuring, analysis, experimentation and documentation, was realized with the help of Dr. Michael Striewe, lecturer at the University of Duisburg-Essen, who kindly gave continuous feedback and guidance so that the goals of this thesis could be achieved. Linguistic corrections and many significant text improvements were made with the help of Ms. Sawsan Kabbani, professional Arabic-English translator, who kindly provided her expertise in this field. And of course many thanks for the extraordinary care and support of my family, especially my mom and dad, who encouraged me all the time to deliver useful and well-done work.
Figures List
Figure 1: Overview of JACK system architecture showing all main components. This figure is taken from [2].
Figure 2: OpenNLP, NELA, and NLP Checkers integrated within JACK Backend [3].
Figure 3: The object used for the evaluation [4].
Figure 4: An illustration of artificial neuron [8].
Figure 5: An illustration of multi-layer perceptron (MLP) [8].
Figure 6: An illustration of Feedforward Backpropagation Algorithm [9].
Figure 7: Steps of creating a Deep Boltzmann Machine – Deep Neural Network [1].
Figure 8: An abstract illustration of Recurrent Neural Network (RNN) [14].
Figure 9: A part of the t-SNE of the phrase representation [17].
Figure 10: An illustration of Neural Net Language Model by [18].
Figure 11: An illustration of the word embedding matrix W [1].
Figure 12: A framework for learning word vectors [20].
Figure 13: Word Embedding of the word "song" in the sentence "cat sat on the mat" [21].
Figure 14: The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word [16].
Figure 15: A framework for learning paragraph vectors [20].
Figure 16: An illustration of the steps of Semantic Hashing [1].
Figure 17: Illustration of the DSSM [26].
Figure 18: Building semantic embedding on top of sub-word units (i.e. SWU or letter n-grams) [1].
Figure 19: Initialization step of DSSM for semantic embedding learning [1].
Figure 20: Training step of DSSM for semantic embedding learning [1].
Figure 21: Runtime step of DSSM for semantic embedding learning [1].
Figure 22: Comparison between the auto-encoders of [28] and DSSM for semantic embedding learning [1].
Figure 23: An illustration of the Deeplearning4j tool [37].
Figure 24: Most interesting StackExchange statistics from 2015, which reflect the huge database available in this network [41].
Figure 25: Entity Relation Diagram (ERD) of the StackExchange Data Dump.
Figure 26: A snapshot taken from MS Excel of the lists of scores grouped by question IDs using the Subtotal feature.
Figure 27: Results of normalizing human scores.
Figure 28: The resulting new list after sorting and replacement with original human scores.
Figure 29: A partial screenshot of DSSM Rank, CDSSM Rank, and Ideal Rank side by side with the corresponding DCGp, IDCGp and nDCGp metrics.
Figure 30: Average values of all the used metrics.
Figure 31: A snapshot of some captured output results of Doc2Vec (i.e. Paragraph Vector) in Java.
Figure 32: Sample GUI, similar to the one of [3], that can be used for the Sent2Vec.V2 Checker.
Figure 33: Selecting a checker from the available checkers in JACK.
Figure 34: Sample solution for the formerly added question.
Figure 35: Sample assessment result, with an overall result of full marks.
Figure 36: Overall process of free-text-based answers assessment in JACK.
Figure 37: Sub-process "Assess Result using selected Checker" that takes place directly after submitting the student's answer.
Tables List
Table 1: The optimal weights given for the checkers in the calibration phase [4].
Table 2: Results of the test run using different weights [4].
Table 3: The performance of Paragraph Vector and bag-of-words models on the information retrieval task [19].
Table 4: Word hashing token size and collision numbers as a function of the vocabulary size and the type of letter n-grams [26].
Table 5: Comparative results with the previous state-of-the-art approaches and various settings of DSSM [26].
Table 6: Summary of available open source tools which can be used for our task.
Table 7: Some of the currently available forums in StackExchange with corresponding statistics according to [41].
Content List
Statutory Declaration
Abstract
Zusammenfassung
Note of Thanks
Figures List
Tables List
Content List
1 Introduction
  1.1 Motivation
  1.2 Problem Description
    1.2.1 What we already have
    1.2.2 What we want to do
2 Highlights on the E-Learning Platform (JACK)
  2.1 Application Environment
    2.1.1 Overview of JACK System Architecture
    2.1.2 JACK Implementation Specifications
    2.1.3 Structure of NLP Checker
    2.1.4 Free-text-based answers Checker (RegExTool)
      2.1.4.1 Brief description [4]
      2.1.4.2 Evaluation [4]
      2.1.4.3 Conclusion
3 Background about Machine Learning
  3.1 Artificial Neural Networks
  3.2 Deep Learning
    3.2.1 Deep Boltzmann Machine - Deep Neural Network (DBN-DNN)
    3.2.2 Recurrent Neural Network (RNN)
  3.3 Language Model
    3.3.1 Neural Net Language Model
    3.3.2 Recurrent Neural Net Language Model
  3.4 Word Embedding Model of Google
    3.4.1 Continuous Bag of Words (CBOW)
    3.4.2 Skip-gram
4 Semantic Similarity Retrieval using Deep Learning
  4.1 Paragraph Vector Model from Google
    4.1.1 The distributed memory model
    4.1.2 Advantages of paragraph vectors
    4.1.3 Information Retrieval with Paragraph Vectors
  4.2 Deep Structured Semantic Model (DSSM) from Microsoft
    4.2.1 Latent Semantic Models and the Use of Click-through Data
    4.2.2 Deep Learning models for Semantic Modeling
    4.2.3 Deep Structured Semantic Models for Web Search
    4.2.4 Word Hashing
    4.2.5 Learning the DSSM
    4.2.6 Implementation Details
    4.2.7 Data Sets and Evaluation Methodology
    4.2.8 Results
5 Available tools for Deep Learning
  5.1 Tools based on Paragraph Vector from Google [20]
    5.1.1 gensim
      5.1.1.1 Features
      5.1.1.2 Basic components in this tool
        5.1.1.2.1 Corpus
        5.1.1.2.2 Vector
        5.1.1.2.3 Model
      5.1.1.3 Examples
        5.1.1.3.1 Example 1 – Corpora and Vector Spaces
        5.1.1.3.2 Example 2 – Topics and Transformations
          5.1.1.3.2.1 Some Available Transformations
        5.1.1.3.3 Example 3 – Similarity Queries
          5.1.1.3.3.1 Initializing query structures
          5.1.1.3.3.2 Performing queries
          5.1.1.3.3.3 Important Notes
    5.1.2 Sentence2Vec
    5.1.3 Doc2Vec based on gensim, Python
    5.1.4 Doc2Vec in Deeplearning4j, Java
      5.1.4.1 Deeplearning4j
      5.1.4.2 Doc2vec based on JAVA
  5.2 Tools based on Deep Structured Semantic Model from Microsoft [26]
    5.2.1 Sent2Vec.V2
      5.2.1.1 Package structure
      5.2.1.2 Usage
        5.2.1.2.1 Sent2Vec using DSSM
  5.3 Summary
6 Available Datasets for Semantic Similarity Retrieval
  6.1 StackExchange
    6.1.1 Main features in StackExchange
    6.1.2 StackExchange Data Dump
7 Deep Learning for free-text-based Answers Assessment
  7.1 Proposed Approach
    7.1.1 Problem approximation
    7.1.2 Adding Feedback generation feature
    7.1.3 Prerequisites for the solution design
      7.1.3.1 Training and testing dataset
      7.1.3.2 Tools ready to use
      7.1.3.3 Predefined metrics for experiments
    7.1.4 Technical Design
8 Experiments using Deep Learning tools
  8.1 Evaluation Method
  8.2 Data acquisition and preprocessing
  8.3 Performed Experiments and Results
    8.3.1 Using Sent2Vec.V2
    8.3.2 Using Doc2Vec-JAVA
  8.4 Summary and Discussion
9 Proposed Solution Design
  9.1 Sample GUI
  9.2 To Be Process
10 Conclusion
  10.1 Thesis Summary and Outcome
  10.2 Future Work
11 Appendix
  11.1 Appendix 1 - Discounted cumulative gain
    11.1.1 Cumulative Gain
    11.1.2 Discounted Cumulative Gain
    11.1.3 Normalized DCG
    11.1.4 Example
  11.2 Appendix 2 – Implementation of nDCGp
References
1 Introduction

1.1 Motivation
The ability of computers to understand human language has been an obstacle facing researchers for decades. Understanding meaning, which depends on context, and resolving ambiguity are complex problems that constitute major difficulties and make good understanding of text by a computer a real challenge. Most Natural Language Processing (NLP) tasks require a deep understanding of text, which is why both Microsoft and Google have been working on this subject for many years. In the decade from 1999 to 2009, the only progress was in the field of recognizing spoken speech, which is considered one of the most difficult tasks and requires great effort in the field of NLP. Since 2012, new opportunities have emerged through the use of deep learning techniques in neural networks, with the error rate meanwhile not exceeding 7% [1].
1.2 Problem Description
At the department of Specification of Software Systems, the training and examination system JACK is operated. This system already has some features for evaluating free-text answers; however, it still needs Machine Learning tools to develop these capabilities so that it works more efficiently. In this Master's thesis, the possibility of using Machine Learning tools from Google and Microsoft to evaluate free text is investigated. Hopefully, this will help to integrate JACK with such tools in future work, so that the features of free-text-based assessment are enhanced.

1.2.1 What we already have
- JACK e-Learning System, which already has a component for assessing free-text-based answers.
- Machine Learning tools provided by Google, such as:
  o Word Embedding: Word2Vec (Continuous Bag of Words, Skip-gram).
  o Paragraph Vector Representation: Sentence2Vec and Doc2Vec.
  o Recurrent Neural Network (RNN).
- Machine Learning tools provided by Microsoft, such as:
  o Deep Neural Network (DNN).
  o Deep Structured Semantic Model (DSSM).
1.2.2 What we want to do
The goal of this project is to check the eligibility of Microsoft and Google machine learning tools for use by JACK's assessment component for free-text-based answers. This check includes an explanation of the state of the art in the Deep Learning field, a comparison between different approaches, a proposed solution that addresses the problem, and support for this solution by a number of experiments and evaluations. Finally, the most likely helpful tools are discussed, leading to a conclusion about which of them to invest in through future work. An actual adaptation of JACK that allows integration with these tools and other supporting tools exceeds the scope of this thesis.
2 Highlights on the E-Learning Platform (JACK)

2.1 Application Environment
For a better understanding of the current status of the JACK e-Learning platform and its free-text-based answers assessment component, this chapter gives an overview of JACK and takes a closer look at the current performance of the NLP checkers that are already built into JACK.

2.1.1 Overview of JACK System Architecture
A good illustration of the JACK design is given by the component-based system architecture shown in Figure 1:
Figure 1: Overview of JACK system architecture showing all main components. This figure is taken from [2].
According to [2], the architecture is organized in two parts that run on different servers. The first part runs on the "Core Server", which is designed as a classical three-tier architecture and consists of a data storage layer (persistence), a business logic layer (including synchronous grading) and a presentation layer. The presentation layer comprises the web-frontend, the web-service for Eclipse integration, which are directly responsible for the appropriate presentation of exercises and exercise resources and for the submission of students' solutions, and the web-service for worker integration. On the one hand, these components are directly responsible for the presentation of solutions, either for manual review via the web-frontend or for automated grading on another server. On the other hand, they are responsible for enabling the submission of results from manual assessment by teachers or from automated grading components. The web-frontend alone additionally provides means for authoring, import and export of exercises,
courses and exams by teachers. All three presentation components call methods of the business logic core to provide or request data. The business logic core takes care of processing this data, e.g. permission checks for access to exercises and exams, running synchronous checks on solutions, and maintenance of the job queue for asynchronous checks. The business logic core in turn calls the persistence layer or external services (e.g. for user authentication) to fulfill its tasks.
The second part runs on the "Worker Servers". Worker servers are designed as a service-oriented system and take care of asynchronous grading; each worker server can be equipped with different marking components. Two main marking components are, for example, the static checker component and the dynamic checker component, each of which is a concrete, complete and runnable implementation of such a component. Several worker servers can connect to the same core server, and one worker server can also connect to several core servers. These worker servers issue periodic requests for work to the core server and use the methods provided by the web-service for workers to retrieve data and, if a job is available for them, send back results later.

2.1.2 JACK Implementation Specifications
JACK is implemented in Java, where EJB3 is used for the core server and JSF2 for the web-frontend; additionally, OSGi is used for the worker servers and grading components. Web-services are used for the communication between core and workers, as well as for access by other clients like Eclipse. One of the most important interfaces is the IChecker interface: each marking component (also called "checker component" or "checker" for short) implements a Java interface named IChecker, which defines a unified interface for all marking components. This interface includes the following methods (a minimal sketch of such an interface is given after Figure 2):
- getCheckerId: takes no parameters and is supposed to return a unique string identifying a marking component.
- doCheck: takes arbitrary resources and configuration information, organized as typed pairs of name and value, as input and returns an object containing the checker result and feedback messages.
- getNewResources: takes no parameters and returns a list of arbitrary resources, if these are created during the grading process.
- getCheckerAttributes: takes no parameters and returns a list of arbitrary pairs of name and value, if these are created during the grading process.

2.1.3 Structure of NLP Checker
The most important marking component here is the NLP checker, which, just like any other checker in JACK, needs to implement the IChecker interface. There are already three implemented NLP checkers in JACK: the Bayes Checker, the BLEU Checker and the RegEx Checker. Each of them is realized as a plug-in component that can work within JACK whenever the user chooses a specific checker for an assessment task. However, some NLP supporting functions are needed to perform commonly known NLP pre-processing steps, such as chunking, sentence detection, punctuation filtering, stop-word filtering, tagging, tokenization and n-gram
extraction, in addition to some math and utility functions. These functions are provided by a library called NELA, which is also implemented in Java and is in turn based on the OpenNLP library. The encapsulation of this library within JACK is explained further in [3]. As a result, the component designed for free-text-based answers assessment has a structure similar to the one shown in Figure 2 below:
Figure 2: OpenNLP, NELA, and NLP Checkers integrated within JACK Backend [3].
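To make the IChecker contract described in Section 2.1.2 more concrete, the following minimal Java sketch restates it. All type names other than IChecker itself (in particular CheckerResult and the name/value maps) are assumptions made for illustration and are not the actual JACK API.

import java.util.List;
import java.util.Map;

// Hypothetical sketch of the IChecker contract; all type names besides
// IChecker itself are illustrative assumptions, not the actual JACK code.
interface IChecker {

    // Assumed shape of the result object: points plus feedback messages.
    class CheckerResult {
        public final int points;
        public final List<String> feedbackMessages;
        public CheckerResult(int points, List<String> feedbackMessages) {
            this.points = points;
            this.feedbackMessages = feedbackMessages;
        }
    }

    // Returns a unique string identifying this marking component.
    String getCheckerId();

    // Takes arbitrary resources and configuration information, organized as
    // typed pairs of name and value, and returns the checker result together
    // with feedback messages.
    CheckerResult doCheck(Map<String, Object> resources,
                          Map<String, Object> configuration);

    // Returns a list of arbitrary resources created during grading, if any.
    List<Object> getNewResources();

    // Returns name/value pairs created during grading, if any.
    Map<String, String> getCheckerAttributes();
}

An NLP checker such as the Bayes, BLEU or RegEx checker described below would then implement this interface and be selected by JACK at runtime.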
2.1.4 Free-text-based answers Checker (RegExTool)

2.1.4.1 Brief description [4]
According to [4], the approach presented here for the automatic assessment of free-text responses provides a configurable combination of three methods: Bayes, BLEU and RegEx. Each of these methods provides an assessment of the submitted answer in the form of points, and these points are combined into a total score through weights set by the teachers. The methods have been implemented on the basis of the architecture of the test system JACK, a summative system that also supports formative assessments. Due to the diverse requirements for the assessment of free-text based answers, checker components can be developed independently from each other and are combined at runtime in JACK. Each of the three methods has been implemented in such a checker component. The evaluation of a response is calculated from the scores returned by the components, which are weighted according to rules defined by the teachers.
The Bayes method is based on a Bayes-theorem-based method for the automatic assessment of essays. Using this method, essays can be classified into predefined, nominally scaled classes after a training phase. The method can be applied to various types of terms such as words or n-grams, i.e. phrases composed of n words. The output of this process is a mapping between the evaluation classes and the responses, by which each response can be assigned to a class. The categories by which responses are rated allow the teachers to classify responses into several schemes. The method was chosen because its statistical approach can also assess formulations not foreseen by the teacher. A major disadvantage of this method is the
large number of responses required for the training phase, which is normally not available for newly designed tasks.
The Bilingual Evaluation Understudy (BLEU) is, in its original form, a method for the automatic evaluation of machine translation. Machine-translated text segments (called candidates) are compared, using n-grams of different orders, with several manually produced reference translations of the same source sentences. Candidates that differ substantially from the references in their length are penalized by BLEU. The problem of automatic evaluation of machine translation by the BLEU method can be mapped to the problem of automatic evaluation of free-text responses by treating the students' responses just like the machine translations, while the correct sample responses created by the teachers are interpreted as the reference translations.
The approach integrated in the RegEx (i.e. regular expressions) method extends the method described in [5]. For each task to be assessed, the teacher creates a template which contains a set of evaluation keys. An evaluation key in turn consists of a regular expression, a score and optional alternatives in the form of further keys. These alternatives can be defined with the same score (e.g. for synonyms) or a lower score (e.g. for inaccurate statements). To evaluate a response, the tree spanned by the evaluation keys is traversed, and the scores of those keys whose regular expression matches the answer are summed up. Since no training data is required for this method, it can be used directly for new tasks given a careful definition of the evaluation template. However, correct answers that are not covered by any evaluation key are rated as bad answers, although they are correct.
According to [4], other methods have been considered for the combination as well. Because of its poor correlation with manual reviews, the Vector Space Model was discarded in favor of the BLEU method. Latent Semantic Analysis (LSA), like the Bayes method, requires a large amount of training data; the Bayes method was nevertheless preferred over LSA due to its configuration options.

2.1.4.2 Evaluation [4]
For the evaluation task of [4], answers to the question shown in Figure 3 below have been digitized. This question was presented in the course "Programming" and was therefore taken from the archived exam documents.
Figure 3: The object used for the evaluation [4].
The student responses were taken unchanged, i.e. including all spelling and grammatical errors. The manual evaluations of the digitized exam answers were adapted to the scoring system in JACK, since the students were able to achieve up to 5 points in the exam while a solution in JACK is assigned up to 100 points. Hereinafter, "points" refers to the number of points attainable in JACK, while "point level" refers to the rating assigned in the exam. A total of 161 replies with an average length of 19 words were digitized, of which 128 randomly selected replies (80%) were used for the training phase and 33 replies (about 20%) for the determination of the optimal weights, taking the distribution of the answers with respect to their score into account. In the calibration phase, these 33 responses were evaluated individually by the previously trained methods. From these evaluations, the weighting was calculated that minimized the arithmetic mean of the deviations from the manual evaluations or maximized the number of matches. The optimal weightings for the checkers are shown in Table 1 below.

Weighting                 ω_Bayes   ω_BLEU   ω_RegEx
Min. Deviation (min_A)     0.63      0.04     0.33
Max. Matches (max_U)       0.89      0.00     0.11

Table 1: The optimal weights given for the checkers in the calibration phase [4].
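Written out, the combination scheme described in Section 2.1.4.1 amounts to a weighted sum of the individual checker scores. This is a sketch of the idea, since [4] does not restate the exact formula here:

$$ s_{\text{total}} \;=\; \omega_{\text{Bayes}}\, s_{\text{Bayes}} \;+\; \omega_{\text{BLEU}}\, s_{\text{BLEU}} \;+\; \omega_{\text{RegEx}}\, s_{\text{RegEx}}, \qquad \omega_{\text{Bayes}} + \omega_{\text{BLEU}} + \omega_{\text{RegEx}} = 1, $$

where each $s$ is the score returned by the corresponding checker and the weights are those of Table 1.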
By applying the weightings for the determination of total scores, some ratings outside the set {0, 20, 40, 60, 80, 100} occurred. To ensure comparability with the manual evaluations, these scores were rounded to the nearest number of points in this set. As Table 1 above shows, the BLEU method had no or only a very small share of the optimal weights, at least in the configuration used. Both for the minimum deviation (min_A) and for the maximum number of matches (max_U), the Bayes method dominates.

Evaluation model          Mean deviation   Standard deviation   Matches   Deviation ≤ 20
Median                    18.750           22.472               50.00%    68.75%
Min. Deviation (min_A)    10.000           10.328               50.00%    100.00%
Max. Matches (max_U)       5.000            8.944               75.00%    100.00%

Table 2: Results of the test run using different weights [4].
The answers used for the test run were collected from a task entered into JACK (see Figure 3 above) that had been offered to students for exam preparation; the entries were corrected manually. Among the answers, 16 unique responses were correct, with an average length of 29 words. Table 2 shows the results obtained in the test run. The metrics by which the evaluation models were compared were:
- The average deviation of the automatic from the manual evaluations.
- The standard deviation of these deviations.
- The rate of exact matches.
- The rate of deviations within a maximum tolerance of 20 points (equivalent to one point level in the exam).
Both weightings reach the optimal value of 100% for the latter metric; thus each answer in the test data deviates from the manual rating by at most one point level. The weighting max_U, with 75%, actually achieves the best exact-match rate of the compared evaluation models. However, this weighting also has to be considered with respect to the remaining two criteria: min_A clearly shows, at 10 points, twice the average deviation of max_U. Perhaps this is due to fluctuations caused by the small amount of test data. In conclusion, we can say that combining the three methods in JACK, i.e. Bayes, BLEU and RegEx, with adjustable weights makes it possible to exploit the strengths of each method. In principle, this approach does not depend on training data; however, if training data is present, additional methods can be activated that are able to process unknown formulations. According to [4], the unexpectedly poor performance of the BLEU method, which is comparable to the results of [6], could be improved by more selective reference responses. Please note that the generation of feedback on submitted solutions was not yet integrated into this approach in JACK.

2.1.4.3 Conclusion
Indeed, this type of processing using RegEx rules can be helpful for giving detailed feedback with rational reasoning in future work, but the difficulty of formulating all possible rules makes such an approach costly and time-consuming. Furthermore, if we would like to make use of training datasets, they would be difficult to gather, and even if we make use of data dumps from the internet, a long preprocessing phase would be required before exploiting them in such approaches. Since such preprocessing can still be faulty, it would be even harder to achieve good and reliable results in real assessments of text-based exams.
3 Background about Machine Learning
A good understanding of two main machine learning techniques is required in order to understand the tools provided by Google and Microsoft for similarity retrieval: Artificial Neural Networks and Deep Learning. Furthermore, some information about the language model and word embedding concepts needs to be provided. Therefore, this chapter gives an overview of all these basics.
3.1 Artificial Neural Networks
One of the basic machine learning technologies of this thesis is artificial neural networks. A neural network (NN) is a connection of nodes seen as 'artificial neurons'. An artificial neuron is a computational model inspired by natural neurons. Natural neurons receive signals through synapses located on the dendrites or membrane of the neuron. When the received signals are strong enough (surpass a certain threshold), the neuron is activated and emits a signal through the axon. This signal might be sent to another synapse and might activate other neurons [7]. Figure 4 illustrates an artificial neuron.
Figure 4: An illustration of artificial neuron [8].
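In formula form (a standard textbook formulation rather than one taken from [7] or [8]), an artificial neuron with inputs $x_1, \dots, x_n$ computes

$$ y \;=\; \varphi\Big(\sum_{i=1}^{n} w_i x_i + b\Big), $$

where the $w_i$ are the weights, $b$ is a bias (threshold) term and $\varphi$ is the activation function described in the next paragraph.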
The complexity of real neurons is highly abstracted when modelling artificial neurons. These basically consist of inputs (like synapses), which are multiplied by weights (the strength of the respective signals), and then computed by a mathematical function which determines the activation of the neuron. Another function (which may be the identity) computes the output of the artificial neuron (sometimes depending on a certain threshold) [7]. NNs combine artificial neurons in order to process information, as in the multi-layer perceptron, see Figure 5 below. The higher the weight of an artificial neuron is, the stronger the input which is multiplied by it will be. Weights can also be negative, so we can say that the signal is inhibited by a negative weight. Depending on the weights, the computation of the neuron will be different. By adjusting the weights of an artificial neuron we can obtain the output we want for specific inputs. However, when we have an NN of hundreds or thousands of neurons, it would be quite complicated to find all the necessary weights by hand.
But we can find algorithms which can adjust the weights of the NN in order to obtain the desired output from the network. This process of adjusting the weights is called learning or training [7].
Figure 5: An illustration of multi-layer perceptron (MLP) [8].
The backpropagation algorithm is used in layered feed-forward NNs. This means that the artificial neurons are organized in layers, and send their signals “forward”, and then the errors are propagated backwards. The network receives inputs by neurons in the input layer, and the output of the network is given by the neurons on an output layer. There may be one or more intermediate hidden layers. The backpropagation algorithm uses supervised learning, which means that we provide the algorithm with examples of the inputs and outputs we want the network to compute, and then the error (difference between actual and expected results) is calculated.
Figure 6: An illustration of Feedforward Backpropagation Algorithm [9].
The idea of the backpropagation algorithm is to reduce this error until the NN learns the training data. The training begins with random weights, and the goal is to adjust them so that the error will be minimal [10]. This adjustment of weights is performed by means of specific formulas that aim at minimizing the difference between the actual and the targeted output. Figure 6 above illustrates this algorithm briefly.
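One common instantiation of these formulas (an illustrative sketch; [9] and [10] may use slightly different notation) is the squared-error measure together with a gradient-descent weight update:

$$ E \;=\; \tfrac{1}{2}\sum_{k}\big(t_k - y_k\big)^2, \qquad w_{ij} \;\leftarrow\; w_{ij} \;-\; \eta\,\frac{\partial E}{\partial w_{ij}}, $$

where $t_k$ are the target outputs, $y_k$ the actual outputs and $\eta$ the learning rate; backpropagation computes the partial derivatives layer by layer, starting from the output layer and propagating the error backwards.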
3.2 Deep Learning
The term "Deep Learning" was first introduced to Machine Learning by Dechter (1986), and to Artificial Neural Networks (NNs) by Aizenberg et al. (2000). Deep Learning has revolutionized Pattern Recognition and Machine Learning. It became especially popular in the context of deep NNs, the most successful Deep Learners, which are much older though, dating back half a century [11]. Learning or credit assignment is about finding weights that make the NN exhibit desired behavior. Depending on the problem and how the units are connected, such behavior may require long causal chains of computational stages, where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep Learning in NNs is about accurately assigning credit across many such stages [11]. To measure whether credit assignment in a given NN application is of the deep or shallow type, we consider the length of the corresponding credit assignment paths, which are chains of possibly causal connections between subsequent unit activations [11]. Many models have been proposed in deep learning; however, only a few of them are a topic of interest in this Master's thesis.

3.2.1 Deep Boltzmann Machine - Deep Neural Network (DBN-DNN)
This model was originally developed for acoustic modeling in Speech Recognition, and it is important to understand it before going through the details of the Deep Structured Semantic Model (DSSM), which is the main topic of interest in this thesis. Figure 7 illustrates the sequence of operations used to create a DBN (Deep Boltzmann Machine) with three hidden layers and to convert it to a pre-trained DBN-DNN [12]:
Figure 7: Steps of creating a Deep Boltzmann Machine – Deep Neural Network [1].
1. A restricted Gaussian-Bernoulli Boltzmann Machine (GRBM) is trained to model a window of frames of real-valued acoustic coefficients.
2. The states of the binary hidden units of the GRBM are used as data for training a restricted Boltzmann Machine (RBM).
3. This is repeated to create as many hidden layers as desired.
4. Then the stack of RBMs is converted to a single generative model, a DBN, by replacing the undirected connections of the lower-level RBMs with top-down, directed connections.
5. Finally, a pre-trained DBN-DNN is created by adding a "softmax"1 output layer that contains one unit for each possible state of each HMM (Hidden Markov Model).
6. The DBN-DNN is then discriminatively trained to predict the HMM state corresponding to the central frame of the input window in a forced alignment.

3.2.2 Recurrent Neural Network (RNN)
Sequence-processing recurrent NNs (RNNs) are the ultimate NN. In fully connected RNNs, all units have connections to all non-input units. Unlike feedforward NNs, RNNs can implement while loops, recursion, etc. The program of an RNN is its weight matrix. RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way [11]. This type of NN has additional output nodes which are copied back to its inputs with a time delay [13]. Training is done with Back-Propagation Through Time, see Figure 8 below. This model is important to keep in mind while going through some later sections such as "Recurrent Neural Net Language Model (RNNLM)".
Figure 8: An abstract illustration of Recurrent Neural Network (RNN) [14].
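The feedback loop sketched in Figure 8 can be written as a recurrence (one common formulation, not taken verbatim from [13] or [14]):

$$ h_t \;=\; f\big(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\big), \qquad y_t \;=\; g\big(W_{hy}\, h_t + b_y\big), $$

where $x_t$ is the input at time step $t$, $h_t$ the hidden state that carries information from previous steps, and $y_t$ the output; Back-Propagation Through Time unrolls this recurrence over the time steps in order to compute the gradients.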
1 The softmax function is used at the output layer to ensure that each output lies between 0 and 1 and that all outputs sum to 1.

>>> class MyCorpus(object):
>>>     def __iter__(self):
>>>         for line in open('mycorpus.txt'):
>>>             # assume there's one document per line, tokens separated by whitespace
>>>             yield dictionary.doc2bow(line.lower().split())
The assumption that each document occupies one line in a single file is not important; you can mold the __iter__ function to fit your input format, whatever it is. Walking directories, parsing XML, accessing network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside __iter__. The text file 'mycorpus.txt' contains the following lines, which can be useful to see before going through the next steps:

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
Now, to construct the dictionary without loading all texts into memory, the following statements can do the work:

>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stoplist = set('for a of the and to in'.split())
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>>             if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
5.1.1.3.2 Example 2 – Topics and Transformations:
In the previous example on Corpora and Vector Spaces, a corpus of documents was created, represented as a stream of vectors. Now we can use that corpus and fire up gensim:

>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm')
In this example, we can see how to transform documents from one vector representation into another. This process serves two goals:
1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
2. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, noise reduction).
Now let's create a transformation. Please note that the transformations are standard Python objects, typically initialized by means of a training corpus. So, using the same corpus variable:

>>> tfidf = models.TfidfModel(corpus)  # step 1 -- initialize a model
Now we turn to transforming vectors. From now on, the Tf-Idf model is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (Tf-Idf real-valued weights). So, to apply the transformation to a whole corpus:

>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf: print(doc)

Output:
[(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)]
[(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)]
[(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.70710678118654746), (10, 0.70710678118654746)]
[(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)]
[(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)]
Transformations can also be serialized, one on top of another, in a sort of chain:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
Here the Tf-Idf corpus was transformed via Latent Semantic Indexing into a latent 2-D space (2-D because we set num_topics=2). Now the question is: what do these two latent dimensions stand for? Let's inspect with models.LsiModel.print_topics():

>>> lsi.print_topics(2)

Output:
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + 0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + 0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
It appears that according to LSI, "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic:

>>> for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)

Output:
[(0, -0.066), (1, 0.520)]   # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)]   # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)]   # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)]   # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)]   # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)]  # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)]  # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)]  # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)]   # "Graph minors A survey"
Model persistency is achieved with the save() and load() functions:

>>> lsi.save('/tmp/model.lsi')  # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')
5.1.1.3.2.1 Some Available Transformations:
Gensim implements several popular Vector Space Model algorithms.
Firstly, Term Frequency - Inverse Document Frequency (Tf-Idf) expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality, except that features which were rare in the training corpus will have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.

>>> model = tfidfmodel.TfidfModel(bow_corpus, normalize=True)
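A common form of this weighting (shown here for orientation; gensim's exact weighting and normalization options may differ in detail) is

$$ w_{t,d} \;=\; \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}, $$

where $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$, $\mathrm{df}_t$ the number of documents containing $t$, and $N$ the total number of documents; with normalize=True the resulting vector is scaled to unit Euclidean length.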
Secondly, Latent Semantic Indexing, LSI (or sometimes LSA), transforms documents from either bag-of-words or (preferably) Tf-Idf-weighted space into a latent space of lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora a target dimensionality of 200–500 is recommended as a "golden standard".

>>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
Note: LSI training is unique in that we can continue "training" at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!

>>> model.add_documents(another_tfidf_corpus)  # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
>>> lsi_vec = model[tfidf_vec]  # convert some new document into the LSI space, without affecting the model
>>> ...
>>> model.add_documents(more_documents)  # tfidf_corpus + another_tfidf_corpus + more_documents
>>> lsi_vec = model[tfidf_vec]
>>> ...
5.1.1.3.3 Example 3 – Similarity Queries:
Based on the previous examples we can construct the required similarity queries for our task as follows:

>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm')  # comes from the first tutorial, "From strings to vectors"
Now we can use this tiny corpus to define a 2-dimensional LSI space:

>>> lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
Now suppose a user typed in the query "Human computer interaction". We would like to sort the nine corpus documents in decreasing order of relevance to this query:

>>> doc = "Human computer interaction"
>>> vec_bow = dictionary.doc2bow(doc.lower().split())
>>> vec_lsi = lsi[vec_bow]  # convert the query to LSI space
>>> print(vec_lsi)
[(0, -0.461821), (1, 0.070028)]
In addition, we will be considering cosine similarity to determine the similarity of two vectors, since the cosine similarity is a standard measure in Vector Space Modeling.

5.1.1.3.3.1 Initializing query structures
To prepare for similarity queries, we need to enter all documents which we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to 2-D LSA space. But that's only incidental; we might also be indexing a different corpus altogether.

>>> index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it
Index persistency is handled via the standard save() and load() functions:

>>> index.save('/tmp/deerwester.index')
>>> index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
This works as well for all similarity indexing classes (similarities.Similarity, similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity).

5.1.1.3.3.2 Performing queries
To obtain similarities of our query document against the nine indexed documents:

>>> sims = index[vec_lsi]  # perform a similarity query against the corpus
>>> print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples
[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]
The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar), so that the first document has a score of 0.99809301, etc. Now we sort these similarities into descending order and obtain the final answer to the query "Human computer interaction":

>>> sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print(sims)  # print sorted (document number, similarity score) 2-tuples
[(2, 0.99844527),   # The EPS user interface management system
 (0, 0.99809301),   # Human machine interface for lab abc computer applications
 (3, 0.9865886),    # System and human system engineering testing of EPS
 (1, 0.93748635),   # A survey of user opinion of computer system response time
 (4, 0.90755945),   # Relation of user perceived response time to error measurement
 (8, 0.050041795),  # Graph minors A survey
 (7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
 (6, -0.1063926),   # The intersection graph of paths in trees
 (5, -0.12416792)]  # The generation of random binary unordered trees
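For reference, the cosine similarity between two vectors $u$ and $v$ used throughout these queries is

$$ \cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, $$

i.e. the cosine of the angle between the two vectors, which is 1 for identical directions, 0 for orthogonal vectors and -1 for opposite directions.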
5.1.1.3.3.3 Important Notes
(Please note that the original documents were added in their "string form" to the output comments to improve clarity.)
When examining the previous results obtained by gensim, we can say that it practically works on the semantic level rather than the lexical one, since it has ranked "The EPS user interface management system" before "Human machine interface for lab abc computer applications", and that is what we need.
The class similarities.MatrixSimilarity is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2GB of RAM in a 256-dimensional LSI space when used with this class. Without 2GB of free RAM, you would need to use the similarities.Similarity class. This class operates in fixed memory, by splitting the index across multiple files on disk, called shards. It uses similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity internally, so it is still fast, although slightly more complex.
5.1.2 Sentence2Vec
Sentence2Vec is a tool written in Python for mapping a sentence of arbitrary length to vector space. This tool provides an implementation of the Paragraph Vector of [20] and is in turn based on gensim. Indeed, the usability and documentation of this tool are not as good as those described for gensim; some tests were made using this tool, but
apparently, the features provided in this tool are so far insufficient for our specific task. Additionally, there is no training dataset combined with this tool, nor any pre-trained models or executable files for a quick start or performance testing. Therefore, no further details about it are provided in this thesis.
5.1.3 Doc2Vec based on gensim, Python
Based on gensim there is an extension called Doc2Vec implemented in Python, which also realizes the research work presented in [20]. According to [34], in this tool we can easily adjust the dimension of the representation, the size of the sliding window, the number of workers, etc. In this Doc2Vec architecture, the corresponding algorithms are “Distributed Memory” (DM) and “Distributed Bag of Words” (DBOW). Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. However, we can still force the DBOW model by using the dm=0 flag in the constructor [34]. According to [34], the input to Doc2Vec is an iterator of LabeledSentence objects, where each object represents a single sentence and consists of two simple lists: a list of words and a list of labels, as follows:
sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
The algorithm can then run through the sentences iterator twice: once to build the vocabulary, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset. Although this architecture permits more than one label per sentence, that is not likely to be the most popular use case. Instead, most cases would probably use a single label per sentence, namely the unique identifier of the sentence. One could implement this kind of use case for a file with one sentence per line by using the following class as training data:
class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # yield one labeled sentence per line of the input file
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])
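A minimal training sketch using this class could then look as follows (assuming the older gensim API described in [34]; the file name and parameter values are only illustrative):
sentences = LabeledLineSentence('my_corpus.txt')  # one sentence per line
model = Doc2Vec(size=100, window=10, min_count=1, workers=4)
model.build_vocab(sentences)  # first pass: build the vocabulary
model.train(sentences)        # second pass: learn word and label vectors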
During training, Doc2Vec learns representations for words and labels simultaneously. If we would like to learn representations for words only, we can use the flag train_lbls=False in our Doc2Vec class. Similarly, if we only wish to learn representations for labels and leave the word representations fixed, the model also has the flag train_words=False. Finally, one can save and load gensim Doc2Vec instances in the usual way, using the optimized Doc2Vec.save() and Doc2Vec.load() methods:
model = Doc2Vec(sentences)
...
model.save('/tmp/my_model.doc2vec') # to save the model
# then load the model back:
model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')
Helper functions like model.most_similar(), model.doesnt_match() and model.similarity() also exist. The raw word and label vectors are also accessible individually via
model['word']. So, to get the most similar words/sentences to the first sentence (labeled with SENT_0, for example), we would do the following:
print model.most_similar("SENT_0")
[('SENT_48859', 0.2516525387763977), (u'paradox', 0.24025458097457886), (u'methodically', 0.2379375547170639), (u'tongued', 0.22196565568447113), (u'cosmetics', 0.21332012116909027), (u'Loos', 0.2114654779434204), (u'backstory', 0.2113303393125534), ('SENT_60862', 0.21070502698421478), (u'gobble', 0.20925869047641754), ('SENT_73365', 0.20847654342651367)]
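For completeness, the helper functions named above are used in the same way as in Word2Vec; a small hedged sketch (the word and label names are illustrative and depend on the trained corpus):
print model.similarity(u'some', u'words')               # cosine similarity of two word vectors
print model.doesnt_match([u'some', u'words', u'here'])  # the word least similar to the others
print model['SENT_0']                                   # raw vector of a sentence label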
Further explanations and examples based on this tool can be found in [35].
5.1.4 Doc2Vec in Deeplearning4j, Java
5.1.4.1 Deeplearning4j
Deeplearning4j is an open-source, distributed deep-learning project in Java and Scala, spearheaded by the people at Skymind, a business intelligence and enterprise software firm. In addition to the open-source code, it provides tutorials on how to use Word2vec tools (extracting relations from raw text), eigenvectors, PCA and entropy, regression, predictive analytics and neural nets, Restricted Boltzmann machines, convolutional nets (images), Long Short-Term Memory networks and RNNs, deep auto-encoders, and deep-belief networks [36]. Deeplearning4j contains an n-dimensional array class for Java and Scala which is scalable on Hadoop8. It utilizes GPU support for scaling on AWS9, includes a general vectorization tool for machine-learning libs, and most of all relies on ND4J, a matrix library much faster than NumPy and other projects written in C++ [37]. Figure 23 below illustrates the structure of this project:
Figure 23: An illustration of the Deeplearning4j tool [37].
8 Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
9 Amazon Web Services
5.1.4.2 Doc2vec based on JAVA
Doc2vec is a tool implemented in Java and is an extension of word2vec that learns to correlate labels and words, rather than words with other words. The main purpose of Doc2Vec is associating arbitrary documents with labels, so labels here are required. The first step is coming up with a vector that represents the “meaning” of a document, which can then be used as input to a supervised machine learning algorithm to associate documents with labels [37]. In the ParagraphVectors builder pattern, the labels() method points to the labels to train on. Here is a full working example of measuring similarity with paragraph vectors, which is originally based on ParagraphVectorsTextExample available at GitHub on [38]:
public class ParagraphVectorsTextExample {
    private static final Logger log = LoggerFactory.getLogger(ParagraphVectorsTextExample.class);

    public static void main(String[] args) throws Exception {
        // load the training corpus: one sentence (paragraph) per line
        ClassPathResource resource = new ClassPathResource("/raw_sentences.txt");
        File file = resource.getFile();
        SentenceIterator iter = new BasicLineIterator(file);

        // vocabulary cache and tokenizer preparation
        InMemoryLookupCache cache = new InMemoryLookupCache();
        TokenizerFactory t = new DefaultTokenizerFactory();
        t.setTokenPreProcessor(new CommonPreprocessor());

        // generates unique labels DOC_0, DOC_1, ... for each line
        LabelsSource source = new LabelsSource("DOC_");

        ParagraphVectors vec = new ParagraphVectors.Builder()
                .minWordFrequency(1)
                .iterations(5)
                .epochs(1)
                .layerSize(100)
                .learningRate(0.025)
                .labelsSource(source)
                .windowSize(5)
                .iterate(iter)
                .trainWordVectors(false)
                .vocabCache(cache)
                .tokenizerFactory(t)
                .sampling(0)
                .build();

        vec.fit();

        // similarity of the first paragraph against the following 99 paragraphs
        for (int i = 1; i < 100; i++)
            log.info("\n" + vec.similarity("DOC_0", "DOC_" + i));
    }
}
The above example shows how to first load a resource file such as raw_sentences.txt, then iterate over all the lines, i.e. paragraphs, of this file line by line, then prepare the caching parameter as well as the tokenizer. Please take note that such preparations are needed for any use of the ParagraphVectors model. Additionally, most of the parameters fed into this model are necessary for the underlying neural network. One interesting observation here is the LabelsSource class, which performs the labeling for each line (or paragraph) by adding a number to the end of each label, e.g. DOC_0, DOC_1, DOC_2, etc., such that we can ensure unique labeling for each paragraph. The output results of this example are discussed with some supporting snapshots in a later section.
One more thing to mention, before moving on to the next section, is that this package does not provide any executable files for quick testing. However, a large training dataset within this package is the raw_sentences.txt file, which includes exactly 97,162 lines, each of which contains a sentence, i.e. a single paragraph.
5.2 Tools based on Deep Structured Semantic Model from Microsoft [26]
5.2.1 Sent2Vec.V2
Sent2vec is a C# tool provided by Microsoft that maps a pair of short text strings (e.g., sentences or query-answer pairs) to a pair of feature vectors in a continuous, low-dimensional space, where the semantic similarity between the text strings is computed as the cosine similarity between their vectors in that space [39]. Sent2vec performs the mapping using the Deep Structured Semantic Model (DSSM) proposed in [26], or the DSSM with convolutional-pooling structure (CDSSM) proposed in [40]. Details about CDSSM are out of the scope of this thesis. This package contains, besides the source code, some really helpful data templates for training, in addition to a pre-trained model that can easily be exploited through an executable file for an easier and faster start. The only weak point of this package is the lack of detailed documentation: there is a user manual that supports the quick use of the tool, but the source code, for example, is not accompanied by detailed in-line documentation through comments, and there is no wide range of examples for the various types of use.
5.2.1.1 Package structure
This package is available from the Microsoft Research web site and contains the following:
A source code folder: The source code can be loaded to enable the use of the trained models for semantic mapping. The code is written in C#. The DSSM model trainer is also provided, which is implemented using a GPU-accelerated linear algebra library developed on CUDA 6.5.
A sample folder that contains the following subfolders:
o sent2vec:
  run.bat: this is the batch file which can work as a sample execution file.
  pairs.txt: here we can add the text pairs for which we would like to get the similarity scores.
  dssm_model: here we find the pre-trained source and target models, which are needed to find out the similarity scores using DSSM.
  cdssm_model: here we find the pre-trained source and target models, which are needed to find out the similarity scores using CDSSM.
o Bin: contains all needed binary files for run.bat.
o Training: contains training data templates, which are designed based on a relatively large set of anonymized training data.
  Data: this folder contains:
  → L3g.txt: letter-trigrams text file.
  → train.pair.tok.tsv: tab separated words file for the tokenized pairs of training texts.
  → dev.pair.tok.tsv: tab separated words file for the tokenized pairs of training texts (after grouping and refinement).
  Training batch files.
  Configuration text files.
5.2.1.2 Usage
5.2.1.2.1 Sent2Vec using DSSM
To try the sample available in this package, we just need to run the batch file (i.e. the run.bat file). However, the following resources should be available in order to run this batch file properly:
Input:
o inSrcModel: the source model, i.e., the neural net to embed the source string (e.g. the smodel binary file).
o inTgtModel: the target model, i.e., the neural net to embed the target string (e.g. the tmodel binary file).
o inSrcVocab: the vocab used by the source model (e.g. the vocab binary file).
o inTgtVocab: the vocab used by the target model.
o inFilename: input sentence pair file, where each line is a pair of short text strings separated by a tab (i.e. the above mentioned file pairs.txt).
Output:
o outFilenamePrefix: output files for the similarity scores and the semantic vectors of the input sentence pairs (e.g. the following text files: dssm_out.score.txt, dssm_out.src_vect.txt, dssm_out.tgt_vect.txt).
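To make the expected input and output formats concrete, the following is a small sketch (not part of the package) that prepares pairs.txt and reads the resulting DSSM scores back; the one-score-per-line output format is an assumption based on the description above, and the file names follow the package sample:
pairs = [
    ("What is a hash table?", "A hash table maps keys to values using a hash function."),
    ("What is a hash table?", "Paris is the capital of France."),
]
with open("pairs.txt", "w", encoding="utf-8") as f:
    for question, answer in pairs:
        f.write(question + "\t" + answer + "\n")

# ... run run.bat from the command line here ...

with open("dssm_out.score.txt", encoding="utf-8") as f:
    scores = [float(line) for line in f if line.strip()]
for (question, answer), score in zip(pairs, scores):
    print("%.4f\t%s" % (score, answer))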
5.3 Summary
Table 6 below summarizes the open source tools described in this section with the main features they tend to have:

Tool | Documentation | Quick start | Pre-trained Model | Training Data | Programming Language
Sent2Vec.V2 | Partially available - within the code | Available as exe | Available | Partially available | C#
Doc2Vec-Java | Fully available (with examples) | Available as code | Not Available | Not Available | Java
gensim | Fully available (with examples) | Not Available | Not Available | Not Available | Python
Sentence2Vec | Partially available - within the code | Not Available | Not Available | Not Available | Python
Doc2Vec-Python | Partially available - within the code | Not Available | Not Available | Not Available | Python

Table 6: Summary of available open source tools which can be used for our task.
Based on Table 6 above, the quickest tool to use and test is Sent2Vec.V2. However, for further possible adaptations a better documentation of the source code might be needed. Moreover, the JACK system, which is the e-Learning target platform in this thesis, is based on JAVA, so Doc2Vec-JAVA can be a tool of interest for technical reasons, since it is written in JAVA as well. This does not make gensim a bad tool at all, especially because of the well explained examples combined with the documentation, but since we prefer a quick start and programming language compatibility, Sent2Vec.V2 and Doc2Vec-JAVA are selected to be used for the experiments of our task. Finally, Sentence2Vec and Doc2Vec-Python do not provide significant or additional features compared with the previously mentioned tools, which is why we will not perform real tests using gensim, Sentence2Vec or Doc2Vec-Python.
6 Available Datasets for Semantic Similarity Retrieval
Available language resources for semantic similarity retrieval training and testing are limited to some extent, since we do not only need question-answer pairs, but also the most probable human scores (or votes) of the answers as input for the training step. In other words, we need a dataset composed of at least the following three elements:
1. Question.
2. One or more answers.
3. Most probable human score combined with each answer.
Obtaining question-answer pairs is possible by some automatic crawling on search engines. However, a proper human score combined with each answer is not always available. Therefore, a convenient dataset can be obtained from online forums, such as the StackExchange network, where a human rating system is available. This network involves multiple specialized forums on diverse topics such as technology, culture and recreation, life and arts, science, business, etc. Please note that datasets obtained from cultural, scientific and technical topics are the most proper ones for the assessment of free-text-based answers.
6.1 StackExchange
Stack Exchange is a network of 150+ Q&A communities including Stack Overflow, the preeminent site for programmers to find, ask, and answer questions about software development. Founded in 2008 by Joel Spolsky and Jeff Atwood, the company was built on the premise that serving the developer community at large would lead to a better, smarter Internet. Since then, the Stack Exchange network has grown into a top-50 online destination, with Stack Overflow alone serving more than 40 million professional and novice programmers every month. Figure 24 below illustrates some important statistics about StackExchange until 2015.
Figure 24: Most interesting StackExchange statistics from 2015, which reflect the huge database available in this network [41].
StackExchange currently contains the following most popular professional forums:

Forum | Short Description | Questions | Answers | Answered Rate | Users
StackOverflow | Q&A for professional and enthusiast programmers | 12 Mega | 19 Mega | 73% | 5.6 Mega
ServerFault | Q&A for system and network administrators | 221 Kilos | 381 Kilos | 80% | 263 Kilos
SuperUser | Q&A for computer enthusiasts and power users | 313 Kilos | 472 Kilos | 69% | 430 Kilos
Mathematics | Q&A for people studying math at any level and professionals in related fields | 612 Kilos | 874 Kilos | 79% | 255 Kilos
Ask Ubuntu | Q&A for Ubuntu users and developers | 228 Kilos | 299 Kilos | 66% | 384 Kilos
Arqade | Q&A for passionate video-gamers on all platforms | 313 Kilos | 472 Kilos | 69% | 430 Kilos
English Language & Usage | Q&A for linguists, etymologists, and serious English language enthusiasts | 71 Kilos | 178 Kilos | 96% | 121 Kilos
Ask Different | Q&A for power users of Apple hardware and software | 72 Kilos | 106 Kilos | 67% | 127 Kilos

Table 7: Some of the currently available forums in StackExchange with corresponding statistics according to [41].
6.1.1 Main features in StackExchange
In this network, good answers are voted up and rise to the top, and the person who asked can mark one answer as "accepted". Moreover, questions here need to be about specific issues within each site's area of expertise, and they need to be about real problems or questions that were actually encountered. All questions are tagged with their subject areas. Each can have up to 5 tags, since a question might be related to several subjects. A really useful feature for us is that the user’s reputation score goes up when others vote up his questions, answers and edits. By earning reputation, the user can unlock new privileges like the ability to vote, comment, and even edit other people's posts. At the highest levels, professional users have access to special moderation tools and can work alongside the community moderators to keep the site focused and helpful. A nice feature is the ability to add comments to ask
for more information or to clarify a question or answer. A user can always comment on his own questions and answers, and once a user earns 50 reputation, he can comment on anybody's post. The concept of badges is also used: badges are special achievements earned for participating on the site, and they come in three levels: bronze, silver, and gold.
6.1.2 StackExchange Data Dump
This data dump is an anonymized dump of all user-contributed content on the Stack Exchange network, uploaded by Stack Exchange, Inc on 01/03/2016 and available online on [42]. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive is dedicated to only one forum, and each of them includes Posts, Users, Votes, Comments, PostHistory and PostLinks in a separate compressed folder. The complete ERD schema is shown below in Figure 25. The most important table for our task, as shown in later sections, is the Posts table, where a post can be a question or an answer depending on the PostTypeId field. The Score field matters most when the post type is an Answer (i.e. PostTypeId = 2), since we need the human score combined with the answer: on one hand for the training phase of the Deep Learning models, and on the other hand for the comparison between the testing phase results and the most probable human score. This comparison is the main goal of assessment in this thesis’ experiments. However, other fields in the Posts.xml table can be useful as well. For example, fields such as ViewCount, AnswerCount, CommentCount and FavoriteCount might be good indicators for the importance of a question and/or the quality of an answer. Moreover, the text fields Title and Body are even more important, since they contain the title of the question and the body of the question/answer respectively. Another important field is the Tags field, which can be very helpful for selecting question/answer pairs revolving around one or more topics of interest. Some auxiliary fields such as ParentId, which exists only if the post type is an Answer, can be really helpful for querying over all posts available for a specific forum of interest. For further details about the design architecture of this data dump, please refer to the ReadMe text file available on [42]. In this file some more information is available in the form of plain text, which describes the structure of the tables (or entities) and the datatypes of their fields as well. However, to make the xml database structure more readable, a simplified ERD was constructed by the author of this thesis specifically for this purpose.
Figure 25: Entity Relation Diagram (ERD) of StackExchange Data Dump.
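As an alternative to the XQuery approach used later in chapter 8, the fields described above (PostTypeId, ParentId, Score, Title, Body) could also be extracted with a few lines of Python; this is only an illustrative sketch and assumes that questions appear before their answers in the dump (posts are ordered by Id):
import xml.etree.ElementTree as ET

questions = {}  # question Id -> Title
triples = []    # (question title, answer body, human score)
for _, row in ET.iterparse("Posts.xml"):
    if row.tag != "row":
        continue
    if row.get("PostTypeId") == "1":      # the post is a question
        questions[row.get("Id")] = row.get("Title", "")
    elif row.get("PostTypeId") == "2":    # the post is an answer
        parent = row.get("ParentId")
        if parent in questions:
            triples.append((questions[parent], row.get("Body", ""), int(row.get("Score", "0"))))
    row.clear()                           # keep memory usage low on large dumps
print(len(triples), "question-answer-score triples extracted")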
7 Deep Learning for free-text-based Answers Assessment
7.1 Proposed Approach
7.1.1 Problem approximation
The problem of automatic assessment of free-text based answers in e-Learning systems is very similar to the problem of information retrieval, or rather semantic similarity retrieval, which is in turn solved in the web search engines of Google and Microsoft by means of word embedding approaches; therefore the similarity assessment task can be approximated as follows: the text of the question in a free-text based exam can be treated as a query typed into a web search engine, while each of the student’s sentences in this exam can be treated as a document retrieved by the web search engine. In other words, the mark deserved by each student’s sentence is equivalent to the semantic similarity score between the query (i.e. the teacher’s question) and the retrieved documents (i.e. the student’s sentences). By means of such an approximation, the retrieved semantic similarity score is equivalent to the score to be assigned to each sentence in the student’s answer, which is entered into an e-Learning system like JACK. Therefore, if we have a pre-trained semantic model such as Paragraph Vector or DSSM, see chapter 4 - Semantic Similarity Retrieval based on Deep Learning, we only need to segment each free-text based answer provided by the student in the exam into sentences. These sentences can be split by punctuation marks, and then we may use one of the above mentioned semantic models in order to measure the semantic similarity score between two input sentences: the first is the teacher’s question and the second is one of the split sentences of the student’s answer. The total score assigned to the student’s answer can simply be the average or the normalized summation of all sentence scores.
7.1.2 Adding a Feedback generation feature
By means of the proposed problem approximation in the previous section we can give a partial form of feedback to the student. This holds to some extent, since the proposed solution provides a relevance score for each sentence. This can definitely be helpful for the student to detect the strengths and weaknesses in his answer, taking into consideration that providing the standard expected solution combined with the sentence scores can be helpful as well. Moreover, the concept of querying can also be applied to the answers repository from previous years, so that some of the anonymized best answers in this repository can be provided as examples of standard, or at least expected, well-formed answers. However, the relevance score cannot be expected to be accurate for extremely poor answers that do not relate to any linguistically correct answer at all. In order to give a reliable assessment for the evaluation of such answers, a deeper test analysis supported with statistics and numbers will be required. Furthermore, such poor answers can be detected in some
preprocessing phases, which in turn can use some spell and grammar checkers to assess the quality of the answer linguistically.
7.1.3 Prerequisites for the solution design
7.1.3.1 Training and testing dataset
The required data structure for training and testing the models or tools used in our task can come in different forms; however, any structure to be exploited in the proposed solution must have at least three basic elements for each block, as mentioned in chapter 6 - Available Datasets for Semantic Similarity Retrieval: the teacher’s question in pure free-text form, one or more student’s answers also in pure free-text form, and a predefined human score assigned to each student’s answer. This predefined human score is necessary on one hand for the training phase, and on the other hand for the comparison between the testing phase results, which give a similarity score as output, and the predefined human score related to each answer taken from the repository. In other words, the input for the training phase consists of all three elements, i.e. question, answer and score, while the output is the trained model. The input for the testing phase, however, consists only of the question-answer pairs, and the output is the similarity score.
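A purely illustrative sketch of one such block (names and values invented only for clarity) could look as follows:
training_block = {
    "question": "What is a binary search tree?",
    "answers": [
        # (free-text answer, predefined human score)
        ("A binary tree in which each left subtree holds keys smaller than the node itself ...", 12),
        ("A tree that stores the colors of pixels.", -2),
    ],
}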
7.1.3.2 Tools ready to use
Based on the overview of available deep learning tools, see chapter 5 - Available tools from Google and Microsoft for Deep Learning, the candidate tools selected to be tested in this thesis’ experiments are Sent2Vec.V2, which already has a pre-trained model with an executable file ready to use, and Doc2Vec built in JAVA, which is supported by examples within the same context provided by some expert contributors. Other tools such as gensim, Sentence2Vec, or Doc2Vec built in Python are also supported by examples within the same context; however, these tools do not differ much, in principle, from Doc2Vec in JAVA. Therefore, and because of programming compatibility issues, focusing on Sent2Vec.V2 and Doc2Vec in JAVA is expected to be more useful for the goal of integration with the JACK e-Learning System.
7.1.3.3 Predefined metrics for experiments
After obtaining the execution results, the output similarity scores should be compared with the expected scores, i.e. the human scores taken from the repository for the same question-answer pairs. In other words, one or more predefined assessment metrics, which are usually used for information retrieval tasks, need to be applied in this final comparison for the experiments performed by the author of this thesis. One of the most famous metrics used for measuring the efficiency of web search engines, or of semantic similarity retrieval tools, is the NDCG metric, see Appendix 1. Other metrics such as the Pearson product-moment correlation coefficient10 are also used for the same purpose. However, it is preferable to use the NDCG metric, since it has also been used in the previously explained research provided by Microsoft in [26], see section 4.2 - Deep Structured Semantic Model (DSSM) from Microsoft. In this way, it is easier to compare the results of the tests made by the author of this thesis with the results provided by one of the main references of this thesis, which is [26]. The following chapter - Experiments using Deep Learning tools - will address, discuss and evaluate the results obtained from the deep learning tools under investigation.

10 In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the PPMCC or PCC or Pearson's r) is a measure of the linear correlation between two variables X and Y, giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation [51].

7.1.4 Technical Design
After testing the tools ready to use in practice, a general technical solution design needs to be provided to define what we already have and what we will practically need for the real implementation. Therefore, the formerly described overview of JACK, see chapter 2 – Highlights on the E-Learning Platform (JACK), especially the essay grading component, is of great significance, since we shall need such background to know how far we are from the real realization, or integration, within the JACK system. Based on this overview, some details about the technical solution design, i.e. for the integration of the finally selected tool with JACK, are provided in a later chapter to ease the task of concrete design realization in future work.
8 Experiments using Deep Learning tools
8.1 Evaluation Method
In this section we shall go through the experiments and their results; however, some data acquisition and preprocessing operations must be performed prior to the real experiments, such as querying the available databases for questions and answers with their related scores, and data reformation and cleansing. Afterwards, the obtained results are stated, analyzed, summarized and finally explained.
8.2 Data acquisition and preprocessing
As shown before in section 6.1. StackExchange, some testing datasets can be obtained from such networks, from which some xml database files are available for download and public use. However, some queries with data cleansing operations need to be performed prior to the real experiments. For example, to perform experiments that best suit our semantic similarity retrieval task on free texts, we need to choose forums that have a good diversity of questions and a sufficient number of scored answers, such as the Computer Science forum, which contains various questions in different subfields of computer science. This forum was specifically selected since it contains free texts rather than formulas and symbols. For example, a well performing XQuery can be as follows:
for $q in db:open("csstackexchange")/posts/row,
    $a in db:open("csstackexchange")/posts/row
let $qu := xs:normalizedString($q/@Title)
let $an := xs:normalizedString($a/@Body)
let $score := $a/@Score
where $q/@PostTypeId = "1" and $a/@PostTypeId = "2"
  and $a/@ParentId = $q/@Id and $q/@AnswerCount >= 6
group by $qu, $an
order by $qu, $an
return (<post>{$qu}</post>, <comment>{$an}</comment>, <score>{data($score)}</score>)
In this XQuery the three fields Title (i.e. the question), Body (i.e. the answer) and Score (i.e. the answer score) are the output of the query. Please take note that for each row in the Posts table, the field PostTypeId is equal to 1 for questions and 2 for answers, see section 6.1.2. StackExchange Data Dump. In the experiments performed by the author of this thesis, all of the 806 question-answer pairs resulting from this XQuery were exploited in the real tests. For the so-called data cleansing, the function xs:normalizedString helps in replacing all whitespace characters like tabs and new line controls by a single space. Please note that questions with fewer than 6 answers are neglected by this XQuery. One example of the result rows is as follows:
"For small values of n, O(n) can be treated as if it's O(1)"
Practically, it's the point where building the hash table takes more than the benefit you gain from the improved lookups. This will vary a lot based on how often you're doing the lookup, versus how often you're doing other things. O(1) vs O(10) isn't a big deal if you do it once. If you do it thousands of times a second, even that matters (though at least it matters at a linearly increasing rate).
However, such a structure is not yet suitable for our tools; therefore, some more operations need to be performed as well. For example, we need to remove the unwanted html tags (e.g. paragraph and emphasis tags) and then separate the question-answer pairs by tabs. Moreover, the scores, with or without question ids, need to be kept in a separate list in the same order as the question-answer pairs. This is necessary for the correct comparison with the similarity scores obtained by the tools exploited in this thesis’ experiments, as shown in the following section. Anyway, removing the unwanted html tags can be performed in several ways; one of them is executing the following JavaScript in an IE browser:
var xhttp = new XMLHttpRequest(); xhttp.onreadystatechange = function() { if (xhttp.readyState == 4 && xhttp.status == 200) { myFunction(xhttp); } }; xhttp.open("GET", "QuestionAnswers_CS.xml", true); xhttp.send(); function myFunction(xml) { var xmlDoc = xml.responseXML; var question = strip(xmlDoc.getElementsByTagName("post")[0].childNodes[0].nodeValue); var answer = strip(xmlDoc.getElementsByTagName("comment")[0].childNodes[0].nodeValue); document.getElementById("demo").innerHTML = question +'tabToken' + answer; } function strip(myText) { var tmp = document.createElement("DIV"); tmp.innerHTML = myText; return tmp.textContent || tmp.innerText || ""; }
The output of executing this JavaScript code in an IE browser is the pairs of question-answers, which are extracted from QuestionAnswers_CS.xml, in the form of:
question tabToken answer
Placing the actual tab characters is not troublesome, since word editing applications such as MS-Word can be used to replace the tabToken placeholder between question and answer, see the above mentioned myFunction JavaScript function, by a single tab-space.
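Equivalently, the same cleansing step can be scripted outside the browser; the following is a minimal sketch (assuming the XQuery output file QuestionAnswers_CS.xml uses the <post>, <comment> and <score> elements read above, and that simple regex-based tag stripping is sufficient):
import re
import xml.etree.ElementTree as ET

def strip_tags(text):
    # drop html tags and collapse whitespace into single spaces
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text or "")).strip()

root = ET.parse("QuestionAnswers_CS.xml").getroot()
with open("pairs.txt", "w", encoding="utf-8") as pairs, \
     open("human_scores.txt", "w", encoding="utf-8") as human:
    for post, comment, score in zip(root.iter("post"), root.iter("comment"), root.iter("score")):
        pairs.write(strip_tags(post.text) + "\t" + strip_tags(comment.text) + "\n")
        human.write((score.text or "").strip() + "\n")  # same order as pairs.txt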
8.3 Performed Experiments and Results
8.3.1 Using Sent2Vec.V2
The most powerful candidate to be exploited in free-text-based answers assessment is Sent2Vec.V2, which can be used directly without real difficulties. Therefore, some further explanations about this tool are provided here for a better understanding of the features provided within this package. Using this tool, or package, requires neither additional installations nor extra source code. As shown in section 5.2 - Tools based on Deep Structured Semantic Model from Microsoft, this tool already provides a pre-trained model for both DSSM and CDSSM. Both DSSM and CDSSM have the same basic structure; however, minor enhancements are realized through CDSSM; for further information about the CDSSM model kindly refer to [40]. After all, employing either of these two models can show good performance. The most important thing about this tool is that all we need in order to obtain the semantic similarity scores is the question-answer pairs separated by tabs in the file pairs.txt, see section 5.2.1.1. Package Structure. This is valid when we want to use the pre-trained model in this package; however, a training data template is also available in this package, which can help very much to retrain the model with some additional data. Please take note that the training template files here represent only a condensed and anonymized dataset of the training data, which can help to train the DSSM model as in [26]. In other words, each anonymized query-result pair comprises more than one concrete query-result pair. For example, the concrete query-result pairs can look like:
What kind of money to take to Paris?	Europe / France Euro.
What kind of money to take to Washington DC?	USA Dollar.
What kind of money to take to England?	England British Pound.
While the anonymized query-result pair, for all of the above concrete pairs, looks like:
What kind of money to take to e1	location country currency_used.
Therefore the total number of training template pairs, which are tokenized in the training files with .tsv extension, is much smaller than the concrete number mentioned in [26], namely 16,510 English queries, each of which is combined with 15 documents (or results). However, for real training we shall need a concrete dataset, which is apparently not available within this package. An important thing to keep in mind here is that we need a file containing the tri-grams needed for the Word Hashing task in both phases: testing, e.g. the vocab file, and training, e.g. l3g.txt. As mentioned before, all we need to do here after data acquisition and preprocessing is to place the tab separated question-answer pairs in the text file pairs.txt, and then run the batch file from the command line. The resulting files are as follows: semantic similarity score files, such as dssm_out.score.txt & cdssm_out.score.txt, the word embedding of the source model as vector
representation files, such as dssm_out.src_vect.txt & cdssm_out.src_vect.txt, in addition to the word embedding of the target model as vector representation files, such as dssm_out.tgt_vect.txt & cdssm_out.tgt_vect.txt. However, these vector representations are not relevant for the measurements needed for the quality assessment of such a tool; therefore, the only files of interest for us are the semantic similarity score files mentioned above. Now we need to compare the resulting similarity scores with the expected scores (the human scores, so to speak). To perform this, we can create a new MS-Excel sheet and place the similarity score lists, which result from executing the batch file on the DSSM and CDSSM pre-trained models, side by side and in the same order with the human scores taken from the data dump.
Figure 26: A snapshot taken from MS-Excel of the lists of scores grouped by question IDs using the Subtotal feature. Please note that the Subtotal feature is built into MS-Excel.
A very helpful move is to get the question ids from the data dump as well and perform a grouping operation using the Subtotal feature built into MS-Excel. The resulting Excel sheet will look like Figure 26 above. An interesting thing to mention here is that some of the semantic similarity scores were negative, i.e. with a minus sign. This can be interpreted as an extremely
irrelevant or wrong answer, which might be helpful for the feedback to be generated by JACK for the student. Before applying assessment metrics such as NDCG, we need to sort the lists first in ascending order by question id, and then in descending order by human score. Afterwards, the results are normalized and reordered to ease the calculation of NDCG by means of a macro script programmed in VBA of MS-Excel by the author of this thesis; for further details about how this implementation was realized, kindly refer to Appendix 2. To normalize the human scores of each answer group, which is grouped by the question ID, we first divide the highest human score by the count of answers in the same group. Each human score is then divided by this quotient, which assigns the highest value to the greatest human score and so on. For example, if we take the same values as displayed in Figure 26, we get the following:
Figure 27: Results of normalizing the human scores. Note that the highest human score is assigned the greatest value and the other cells are assigned a relevance value proportional to the human score in the same cell.
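The same normalization can be expressed compactly; the following small sketch uses purely illustrative values rather than the exact ones from Figure 26:
def normalize_group(human_scores):
    # highest human score divided by the number of answers in the group,
    # then every human score divided by that quotient
    step = max(human_scores) / float(len(human_scores))
    return [score / step for score in human_scores]

print(normalize_group([5, 3, 2, 0]))  # -> [4.0, 2.4, 1.6, 0.0]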
Going back to the snapshot above, we need to sort the scores of DSSM and CDSSM in descending order and then replace each value with the original human score, which represents the relevance of the corresponding answer. For example, we need to place the highest DSSM score, i.e. 0.789307 at line 7, at the top of the scores list. Afterwards,
we need to replace this score with the original human score, i.e. the human score of 2 located originally at line 7. The resulting new list for these two operations, i.e. sorting and then replacing with the original human score, is shown in Figure 28 below. These two operations are needed to convert the similarity score list into a ranked list based on relevance, i.e. the human scores as shown in Figure 27 above, or the ideal rank as shown in Figure 28 below. Please take note that these two operations are applied to the CDSSM score list in the same way.
Figure 28: The resulting new lists after sorting and replacement with the original human scores. In other words, these lists now represent the document relevance ranking in the context of similarity retrieval and the DCGp metric. Please note that these lists are now labeled DSSM Rank, CDSSM Rank and Ideal Rank respectively, and are used by the NDCG calculations later on.
Now we have to make use of these new lists to find out the following three values: the DCGp value for the DSSM score list, the DCGp value for the CDSSM score list, and IDCGp for the human scores, i.e. the ideal rank. The steps followed to figure out these values are as described in Appendix 1 - Discounted cumulative gain. Finally, we need to normalize the resulting DCGp values by computing nDCGp for DSSM and CDSSM respectively; also refer here to Appendix 1. A partial snapshot of the final results is shown below in Figure 29:
Figure 29: A partial screenshot of DSSM Rank, CDSSM Rank and Ideal Rank side by side with the corresponding DCGp, IDCGp and nDCGp metrics.
Please take note here that Figure 29 above shows that when the ranking of DSSM and CDSSM was poor compared with the ideal rank, such as in the group of id 358, the nDCGp value was accordingly low. On the contrary, when the ranking of DSSM and CDSSM correlated with the ideal rank to a very high extent, such as in the group of id 14, the nDCGp value was close to the ideal. Finally, the average values of DCGp, IDCGp and nDCGp are calculated at the bottom of the result lists, as shown in Figure 30 below:
Figure 30: Average values of all the used metrics.
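For clarity, the following is a hedged sketch of how the DCGp, IDCGp and nDCGp values can be computed from such ranked lists, assuming the common log2(i+1) discounting (Appendix 1 may state an equivalent variant); the relevance values below are only illustrative normalized human scores:
import math

def dcg(relevances):
    # DCG_p = sum of rel_i / log2(i + 1), with positions starting at i = 1
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(system_rank, ideal_rank):
    return dcg(system_rank) / dcg(ideal_rank)  # nDCG_p = DCG_p / IDCG_p

ideal_rank = [4.0, 2.4, 1.6, 0.0]  # normalized human scores sorted in descending order
dssm_rank = [2.4, 4.0, 1.6, 0.0]   # human scores reordered according to the DSSM ranking
print(ndcg(dssm_rank, ideal_rank))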
Now if we take a look at the average values, we can say that the assessment by means of the nDCGp metric for both models, DSSM and CDSSM, shows relatively high results, i.e. 0.75 for DSSM and 0.76 for CDSSM, which makes such models reliable to a very high extent in the context of semantic similarity retrieval, or rather free-text based answers assessment.
8.3.2 Using Doc2Vec-JAVA
The results obtained in this section are based on the example mentioned before in section 5.1.4.2. Doc2Vec based on JAVA. An interesting observation, which might be considered one of the main drawbacks of the required labeling there, is the need to feed the testing sentences side by side with the training sentences and then build the model. Such a step is needed to get the testing sentences plotted into the semantic space with respect to the training sentences, which are in turn considered as the corresponding context for such testing sentences. Take the piece of code described in section 5.1.4.2. Doc2Vec based on JAVA as an example. In this example the similarity is measured between the first sentence, with label DOC_0, and the following sentences one by one up to the paragraph (here the line sentence) with label DOC_99. This means that the similarity was found here only for sentences that were already labeled and plotted into the semantic space through training of the Paragraph Vector model. Nevertheless, the output results do not ensure correspondence with human assessment for all of the testing sentences. Figure 31 below shows some output results, where the tagged results indicate either poor similarity or good correspondence with the expected similarity, as marked in the “Human Assessment” column.
Figure 31: A snapshot of some captured output results of Doc2Vec (i.e. Paragraph Vector) in Java. The “Human Assessment” column in this table marks the poor results and the extremely poor results with tagging symbols, while the good results are tagged with a separate symbol.
In general, using such an approach gives a ratio of poor results of no more than 8 out of 99, i.e. the error rate is no more than roughly 8% among the first 100 sentences. However, we need to take into consideration that the dataset contained about 100,000 training sentences, which is already very large. This is a significant drawback especially if we
take into consideration that the need for labeling and plotting testing sentences into the semantic space through training the model is not an efficient solution in practice at all, since this will take a very long time, repeatedly for each tested sentence. Furthermore, by examining the positive results, or even the rest of the results, take for example the sentences tagged as good above, we might find a relevance between the two tested sentences, but it is not necessarily the same meaning of relevance that we seek in our question answering task. In other words, the relevance here might mean that the two sentences are likely to show up in the same context, but that does not mean that the second sentence is a candidate answer for the first. The most plausible explanation for this observation is that this model, unlike DSSM and CDSSM, was designed originally for topic modelling tasks and not for similarity measurement tasks like Information Retrieval or Question Answering Assessment. Therefore, the meaning of relevance in such a tool is likely to be different from our targeted relevance. Such drawbacks are critical when we talk about a cost and time consuming task like our question answering assessment task. If we go back through the piece of code mentioned above, one can argue that we could measure the similarity between training sentences and new testing sentences, which have never been fed into the trained model or even labeled, by means of other functions such as similarityToLabel. However, the output results in this case were extremely poor and did not even come close to the expected similarity values, i.e. the human assessment. Anyway, the piece of code that was used for this aim was as follows:
for (int i = 1; i < 100; i++)
    log.info("\n" + vec.similarityToLabel("Will they make it?", "DOC_" + i));
Another argument could be that better results might be obtained if we use fewer but longer paragraphs, because the richer the paragraph, the more informative it will be. However, this argument holds only to some extent, since the more features a paragraph contains, the more difficult it becomes to represent it in the semantic space. Therefore, using longer paragraphs for training will require an even larger dataset for training. Anyway, some experiments were performed using the same dataset mentioned in the previous section 8.2. Data acquisition and preprocessing, which, as mentioned before, contains 806 question-answer pairs. Such a dataset may therefore be a good candidate for training and/or testing to check the validity of the former argument. Nevertheless, performing the desired experiment did not show any acceptable correlation with the expected results; a possible reason is, as mentioned before, that the longer the paragraphs we have, the more difficult it is to represent them. In other words, we may here emphasize the fact that we shall need an extremely large dataset for training and testing if we want to exploit the ParagraphVectors model, provided in such a JAVA tool, for our answers assessment task.
8.4 Summary and Discussion
In this chapter we described the results obtained from each tool, together with an assessment of each. As a summary we shall here emphasize the meaning
of relevance that we seek in our study: as mentioned before, DSSM was designed mainly for Information Retrieval and Question Answering Assessment tasks, and the dataset used for training is taken from web search engines, which can be the ideal dataset for our task. Paragraph Vector, however, was mainly designed for clustering into the same topic class, which can help in topic modelling tasks, and the meaning of similarity here does not necessarily mean a relationship between questions and candidate answers. Indeed, the Sent2Vec.V2 model does not require an individual training phase per exam or exercise. However, since the Sent2Vec.V2 package offers the implementation of the DSSM and CDSSM models as open source, a customized training phase can be performed, where some of the teacher’s own assessments side by side with a crawled dataset from web search engines can be exploited as training data; therefore, a better adjustment to evaluator needs can be realized through this package. Anyway, we should keep in mind that a new training over a large dataset of question-answer pairs might take days of runtime execution for training a new model. Although this long training phase will not affect the execution time of the real exam assessment, it might be considered a somewhat expensive solution, because performing a customized training phase for each teacher assessment and/or each level of assessment difficulty is a demanding task anyway. This is true especially when taking into consideration that, to get an assessment closer to the teacher’s rating, a large corpus of training data should be available; otherwise we might go through a long training phase without perceiving a real change in the assessment scores. For all of these reasons, we may say that DSSM and CDSSM are the best models to be exploited in JACK for free-text-based answers assessment. Moreover, according to the original research work described in [20], the ParagraphVector model was checked in the context of Information Retrieval for relevance between several web search results given a query; this means that in order to perform tests here we should know the ideal rank of some query results, and then check whether the similarity between a new paragraph and the ranked paragraphs does not conflict with this ranking. In other words, this should give a higher similarity to the top ranked paragraph only if the new paragraph is truly the best possible result given the same query, and so on. Indeed, the problem here is that none of the ParagraphVector tools trains the model over paragraphs, in the training phase, with respect to a given query for each group of pre-ranked results, which makes ParagraphVector a type of unsupervised learning. Such a drawback possibly exists because this model was originally meant for topic modelling rather than IR tasks. Anyway, if we would like to exploit the Doc2Vec-Java tool whatever it takes, some other solutions might be considered to overcome such drawbacks: in all cases, we shall need a huge preprocessed dataset cleansed with all the needed data cleansing techniques. Besides that, we may perform parallel processing on several servers, apply memory management and also make use of the virtual memory of these servers. Or we might think of some Cloud Computing options, which will work better with parallel processing algorithms.
After all, if sufficient financial support is available, one might think of communicating with the Skymind team, the business owners of DL4J, and buying a customized solution for this specific task. Therefore, because of the current difficulties of exploiting Paragraph Vector tools for our task, and because of the better performance, the readiness to use, as well as the availability of the source code, the training data templates and, most importantly, a pre-trained model in the Sent2Vec.V2 package, the best candidate for the targeted similarity assessment in this thesis is Sent2Vec.V2 of Microsoft; thus, the proposed solution design in the next chapter will discuss the integration design of Sent2Vec.V2 within JACK and will not address integration proposals for Doc2Vec or any other tool.
9 Proposed Solution Design
After selecting the most convenient tool that fits our needs, which is Sent2Vec.V2, and taking a closer look at JACK’s main structure, especially the NLP checkers, we may now have a clear vision of the integration of Sent2Vec.V2 with JACK, since this integration will be very similar to that of the formerly built-in NLP checkers, with only minor changes.
9.1 Sample GUI
The proper GUI for our task is likely to be very similar to the one of the formerly built NLP checkers. Take for example Figure 32 below, where the exercise needs to be initialized with proper attributes, such as title, category, level of difficulty, internal/external description, penalty percent, minimum success score, and finally the evaluation rule. Please take note that the number 100 in the evaluation rule here is only a placeholder, while the real rule should take the form {#c123} * 𝑤𝑖, where 123 is an example of a checker id, which is automatically generated for each running checker within JACK, and 𝑤𝑖 is the weighting factor for the result of the checker.
Figure 32: Sample GUI similar to the one of [3] that can be used for the Sent2Vec.V2 checker. Please take note that the AES view type here stands for Automated Essay Scoring, and the evaluation rule here should be of the form {#c123} * 𝑤𝑖, instead of 100.
The next step is choosing the checker to be used for assessment, see Figure 33 below, where the available checkers should appear in a list. For example, if we implement a new checker component to run the Sent2Vec.V2 model, then this checker should show up in the list of checkers just as shown in Figure 33 below:
Figure 33: Selecting a checker from the available checkers in JACK. Please take note that in the JACK version from which this snapshot is taken, no NLP checkers had been added yet.
Moreover, the answer to each question to be assessed can be entered through a very simple GUI, such as the one shown in Figure 34 below:
Figure 34: Sample solution for the formerly added question.
A sample assessment of such an answer can look like the one in Figure 35 below:
Figure 35: Sample assessment result, with an overall result of full mark.
9.2 To-Be Process
To summarize the whole assessment process and for a better understanding of the steps that need to take place in the final application, the following BPMN process is shown in Figure 36 below:
Figure 36: Overall process of free-text-based answers assessment in JACK.
As shown in Figure 36 above, a sub-process called “Assess Result using selected Checker” needs to be expanded to show the part to be realized in future work; this sub-process can be as follows, see Figure 37 below. The process here will have two main paths: the first is for assessing the answer as a whole, while the second is for assessing the answer
as sentences split by punctuation marks in order to get a detailed feedback, if it is required. In both cases, the overall answer or the single sentences are then placed side by side with the question on one line, inside a tab separated text file. This text file is then the input for the next step, which is running the Sent2Vec.V2 executable file. The results of this execution, i.e. the similarity score between the answer as a whole, or between each sentence, and the corresponding question, are then automatically stored in an output text file, where the resulting similarity scores are placed in the same order as the input question-answer pairs file. After reading the resulting similarity scores, the feedback can be formulated based on the relevance, i.e. similarity score, of each sentence with respect to the corresponding question. Within this feedback it is also possible to include samples of standard solutions for the question, which should be predefined by the teacher, for a better clarification. The final scores can then be normalized and adapted to fit the teacher’s grading system. After that, the final score of Sent2Vec.V2 can be weighted and added to the weighted final scores of other checkers, which might be syntactical, spelling or even other semantic checkers. By sending the overall final score together with the feedback, the to-be sub-process ends at this point, where the end results are ready for display. Figure 37 below illustrates the whole to-be process visualized within a BPMN diagram.
Figure 37: Sub-process "Assess Result using selected Checker", which takes place directly after submitting the student’s answer.
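To make the data flow of this sub-process more concrete, the following minimal Java sketch outlines how a future checker component could write the tab-separated question-answer file, invoke the Sent2Vec.V2 executable, read back the similarity scores and combine weighted checker scores. It is only an illustration: the class name, the method names and in particular the command-line arguments of the executable are assumptions and would have to be adapted to the actual JACK checker interface and to the real input/output conventions of the tool.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the "Assess Result using selected Checker" sub-process.
 * All class, method and file names as well as the exact command line of the
 * Sent2Vec.V2 executable are assumptions for illustration only.
 */
public class Sent2VecChecker {

    private final Path executable; // path to the Sent2Vec.V2 executable (assumed)

    public Sent2VecChecker(Path executable) {
        this.executable = executable;
    }

    /** Writes question-answer pairs as a tab-separated file, one pair per line. */
    Path writeInputFile(String question, List<String> answerSentences) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String sentence : answerSentences) {
            lines.add(question + "\t" + sentence);
        }
        Path input = Files.createTempFile("qa-pairs", ".tsv");
        Files.write(input, lines, StandardCharsets.UTF_8);
        return input;
    }

    /** Runs the executable and reads one similarity score per input line (assumed output format). */
    List<Double> runModel(Path input, Path output) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(executable.toString(), input.toString(), output.toString())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("Sent2Vec.V2 terminated abnormally");
        }
        List<Double> scores = new ArrayList<>();
        for (String line : Files.readAllLines(output, StandardCharsets.UTF_8)) {
            scores.add(Double.parseDouble(line.trim()));
        }
        return scores;
    }

    /** Combines this checker's score with other checkers' scores using teacher-defined weights. */
    static double weightedTotal(double[] checkerScores, double[] weights) {
        double total = 0.0;
        for (int i = 0; i < checkerScores.length; i++) {
            total += checkerScores[i] * weights[i];
        }
        return total;
    }
}

Invoking the executable as an external process keeps the integration free of any middleware layer, which matches the simple technical design described in this chapter; the output file is assumed to contain one similarity score per line, in the same order as the input pairs, as described above.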
10 Conclusion
10.1 Thesis Summary and Outcome
In this thesis, the problem of automatic assessment of free-text-based answers in an e-Learning system was investigated. Firstly, the thesis started with an introduction consisting of the motivation and a brief problem description. Secondly, JACK and its built-in NLP checkers were highlighted. Thirdly, basic background on machine learning techniques, such as neural networks and deep learning, was presented. Fourthly, semantic similarity retrieval techniques were explained, such as Paragraph Vector, proposed by researchers at Google, and the Deep Structured Semantic Model (DSSM), proposed by Microsoft Research. Fifthly, an overview of available tools, provided by volunteer contributors as well as Microsoft researchers, was given and supported with examples. Sixthly, suitable datasets for the task, such as the Stack Exchange data dump, were examined in detail. Seventhly, an approximation of the problem as an information retrieval task was provided, together with details about the metrics used to validate the self-driven experiments, such as the nDCGp metric. Eighthly, experiments were carried out to pick the most suitable tool for integration with the JACK e-Learning system. Ninthly, the proposed solution design, which gathers all the information needed for the targeted integration with JACK, was provided and supported by analysis diagrams, such as the to-be process, and a simple technical solution was provided as the final output of this thesis. Finally, the thesis summary and outcome were stated to give a complete overview of the whole thesis.
As a result of this study of deep learning techniques in the context of semantic similarity retrieval, the following outcomes were found: Approximating the problem as an information retrieval task is an efficient method, especially when pretrained deep learning models such as Sent2Vec.V2 are ready for use, where reasonably good results can be expected from such a tool. In other words, the Sent2Vec.V2 model does not require an individual training phase per exam or exercise. Nevertheless, customized training through this package can deliver a better adjustment to the evaluator's needs. If exclusive training on available large datasets such as Stack Exchange is chosen, this can be managed, but data cleansing and preprocessing techniques are still needed. Additionally, it should be kept in mind that training a new model over a large dataset of question-answer pairs might take days of execution time. Although this long training phase does not affect the execution time of the actual exam assessments (i.e. the to-be process mentioned before), it might still be considered a somewhat expensive solution, because performing a customized training phase for each teacher's assessment and/or each level of assessment difficulty is a demanding task in any case. This is particularly relevant because getting an assessment closer to the teacher's rating requires a large corpus for the new training dataset; otherwise, a long training phase might pass without a noticeable change in assessment scores. Although self-driven training requires large amounts of memory and parallel processing over servers, this problem can be solved by means of cloud computing platforms and services.
In fact, since other deep learning models such as Paragraph Vector, which is trained by an unsupervised algorithm, are more about topic modelling than about measuring similarity scores for new examples, the DSSM and CDSSM models, which use weakly supervised training, were found to be more usable and beneficial in practice than Paragraph Vector models such as the Doc2Vec tools, whether in Java or Python. Another important outcome was a very simple technical design for the integration solution within JACK, which exploits the ready-to-use executable file of the Sent2Vec.V2 model and does not require any time-consuming techniques such as middleware; even if a new model is trained, the training phase occurs offline and only once before each teacher/exam customization task. This simple solution avoids the incompatibility issues that might arise from the difference in programming language between JACK, which is implemented in Java, and the Sent2Vec.V2 model, which is implemented in C#.
10.2 Future Work
Based on the proposed solution design provided in chapter 9, code implementations need to be realized according to the provided technical model and the visualized to-be analysis process. The actual implementation is not expected to cause major difficulties, since similar implementations are already available and tested within JACK, such as the implementations of the NLP checkers and the preprogrammed read/write functions that are necessary to handle the input and output of the Sent2Vec.V2 model through web service techniques. However, in future work, further enhancements can be investigated in the context of rational and detailed feedback generation by means of such deep learning tools.
11 Appendix
11.1 Appendix 1 – Discounted Cumulative Gain
Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure the effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks [43]. Two assumptions are made in using DCG and its related measures:
1. Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks).
2. Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than irrelevant documents.
DCG originates from an earlier, more primitive measure called Cumulative Gain.
11.1.1 Cumulative Gain
Cumulative Gain (CG) is the predecessor of DCG and does not include the position of a result in the consideration of the usefulness of a result set. In this way, it is the sum of the graded relevance values of all results in a search result list. The CG at a particular rank position p is defined as:

CG_p = \sum_{i=1}^{p} rel_i

where rel_i is the graded relevance of the result at position i.

The value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document above a higher ranked, less relevant document does not change the computed value for CG. Based on the two assumptions made above about the usefulness of search results, DCG is used in place of CG for a more accurate measure.
11.1.2 Discounted Cumulative Gain
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result. The discounted CG accumulated at a particular rank position p is defined as [44]:

DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}

Previously there has not been any theoretically sound justification for using a logarithmic reduction factor [45] other than the fact that it produces a smooth reduction. But Wang et al. (2013) [46] give a theoretical guarantee for using the logarithmic reduction factor in nDCG. The authors show that for every pair of substantially different ranking functions, nDCG can decide which one is better in a consistent manner. An alternative formulation of DCG [47] places stronger emphasis on retrieving relevant documents:

DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}
The latter formula is commonly used in industry, including major web search companies [44] and data science competition platforms such as Kaggle [48].
11.1.3 Normalized DCG
Search result lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of p should be normalized across queries. This is done by sorting the documents of a result list by relevance, producing the maximum possible DCG up to position p, also called the Ideal DCG (IDCG) up to that position. For a query, the normalized discounted cumulative gain, or nDCG, is computed as:

nDCG_p = \frac{DCG_p}{IDCG_p}
The nDCG values for all queries can be averaged to obtain a measure of the average performance of a search engine's ranking algorithm. Note that for a perfect ranking algorithm, the DCG_p will be the same as the IDCG_p, producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable. The main difficulty encountered in using nDCG is the unavailability of an ideal ordering of results when only partial relevance feedback is available.
11.1.4 Example
According to [49], presented with a list of documents in response to a search query, an experiment participant is asked to judge the relevance of each document to the query. Each document is to be judged on a scale of 0–3, with 0 meaning irrelevant, 3 meaning completely relevant, and 1 and 2 meaning "somewhere in between". For the documents ordered by the ranking algorithm as

D1, D2, D3, D4, D5, D6

the user provides the following relevance scores:

3, 2, 3, 0, 1, 2

That is: document 1 has a relevance of 3, document 2 has a relevance of 2, etc. The Cumulative Gain of this search result listing is:

CG_6 = \sum_{i=1}^{6} rel_i = 3 + 2 + 3 + 0 + 1 + 2 = 11

Changing the order of any two documents does not affect the CG measure. If D3 and D4 are switched, the CG remains the same, i.e. 11. DCG is used to emphasize highly relevant documents appearing early in the result list. Using the logarithmic scale for reduction, the DCG for each result in order is:
i | rel_i | log2(i) | rel_i / log2(i)
1 | 3 | 0 | N/A
2 | 2 | 1 | 2
3 | 3 | 1.585 | 1.892
4 | 0 | 2.0 | 0
5 | 1 | 2.322 | 0.431
6 | 2 | 2.584 | 0.774

So the DCG_6 of this ranking is:

DCG_6 = 3 + 2 + 1.892 + 0 + 0.431 + 0.774 = 8.10
Now a switch of documents 3 and 4 results in a reduced DCG, because a less relevant document is placed higher in the ranking; that is, a more relevant document is discounted more by being placed in a lower rank. The performance of this query compared to another is incomparable in this form, since the other query may have more results, resulting in a larger overall DCG which may not necessarily be better. In order to compare them, the DCG values must be normalized. To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of the relevance judgments provided by the experiment participant, which is:

3, 3, 2, 2, 1, 0

The DCG of this ideal ordering, or IDCG, is then:

IDCG_6 = 3 + 3 + 1.262 + 1 + 0.431 + 0 = 8.69

And so the nDCG for this query is given as:

nDCG_6 = \frac{DCG_6}{IDCG_6} = \frac{8.10}{8.69} = 0.932
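As a small cross-check of these numbers, the following minimal Java sketch reproduces the example computation with the formulation DCG_p = rel_1 + \sum_{i=2}^{p} rel_i / \log_2 i. It is provided for illustration only and is not part of the JACK code base; the VBA implementation actually used for the measurements follows in Appendix 2.

import java.util.Arrays;

/** Minimal sketch reproducing the DCG/nDCG computation of the example above. */
public class NdcgExample {

    /** DCG over the given relevance scores in ranked order. */
    static double dcg(double[] rel) {
        double dcg = rel.length > 0 ? rel[0] : 0.0;
        for (int i = 1; i < rel.length; i++) {
            dcg += rel[i] / (Math.log(i + 1) / Math.log(2)); // rank is i+1, discount log2(rank)
        }
        return dcg;
    }

    /** nDCG = DCG of the actual ranking divided by DCG of the ideal (descending) ranking. */
    static double ndcg(double[] rel) {
        double[] ideal = rel.clone();
        Arrays.sort(ideal);                          // ascending...
        for (int i = 0; i < ideal.length / 2; i++) { // ...then reverse to descending
            double tmp = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = tmp;
        }
        return dcg(rel) / dcg(ideal);
    }

    public static void main(String[] args) {
        double[] rel = {3, 2, 3, 0, 1, 2};              // relevance judgments from the example
        System.out.printf("DCG6  = %.2f%n", dcg(rel));  // expected: about 8.10
        System.out.printf("nDCG6 = %.3f%n", ndcg(rel)); // expected: about 0.932
    }
}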
11.2 Appendix 2 – Implementation of nDCGp
This appendix shows how the implementation of the nDCG metric, mentioned before, was realized in VBA code within the Excel sheet where the final measurements to assess DSSM and CDSSM were performed:

Public Sub Calc_Click() ' This procedure is bound to the Calc button
    Dim r As Integer
    Dim EndR As Integer
    Dim myCount As Integer
    For EndR = 1 To 914 ' 914 is the total count of lines in the Excel sheet to be processed
        Set myRange = Worksheets("Analysis").Range("A" & EndR)
        Set myCells = myRange.Cells
        If Not IsEmpty(myCells.Value) Then ' check whether the Count column is not empty
            myCount = Worksheets("Analysis").Cells(EndR, 2).Value ' read the total count of the corresponding group of answers (grouping is based on question id)
            r = EndR - myCount ' find the row where the group range starts
            ' Perform normalization:
            Call calcNormalizedHumanScore(r, EndR - 1, "E")
            Call calcNormalizedRelevance(r, EndR - 1, "C")
            Call calcNormalizedRelevance(r, EndR - 1, "D")
            ' Compute DCGp values, where column E stands for DCGp of the ideal rank, C for DCGp of the DSSM rank, D for DCGp of the CDSSM rank
            Call calcDCGp(r, EndR - 1, "E")
            Call calcDCGp(r, EndR - 1, "C")
            Call calcDCGp(r, EndR - 1, "D")
            ' Compute normalized DCGp (i.e. nDCGp)
            Worksheets("Analysis").Cells(r, 9).Value = Worksheets("Analysis").Cells(r, 6).Value / Worksheets("Analysis").Cells(r, 8).Value
            Worksheets("Analysis").Cells(r, 10).Value = Worksheets("Analysis").Cells(r, 7).Value / Worksheets("Analysis").Cells(r, 8).Value
        End If
    Next EndR
End Sub

Public Sub calcNormalizedHumanScore(mySubRangeStart As Integer, mySubRangeEnd As Integer, myColumn As String)
    mySubRange = myColumn & mySubRangeStart ' configure the name of the first cell in the group range
    mySubRangeCnt = Application.WorksheetFunction.RoundUp(Application.WorksheetFunction.Max(Worksheets("Analysis").Range(mySubRange)) / Worksheets("Analysis").Range(myColumn & (mySubRangeEnd + 1)).Value, 0) ' find the normalization value by taking the max human score and dividing it by the total count of answers within the same group
    For myRank = mySubRangeStart To mySubRangeEnd ' loop over the current group range
        Worksheets("Analysis").Range(myColumn & myRank).Value = Application.WorksheetFunction.RoundUp(Worksheets("Analysis").Range(myColumn & myRank).Value / mySubRangeCnt, 0) ' normalize through division by the normalization value
    Next
End Sub

Public Sub calcNormalizedRelevance(mySubRangeStart As Integer, mySubRangeEnd As Integer, myColumn As String)
    Dim mySubRange As String
    mySubRange = myColumn & mySubRangeStart & ":" & myColumn & mySubRangeEnd
    Dim newVals() As Variant
    Dim newRank() As Variant
    Call AssignValsToArr(newVals, mySubRange)
    Dim pos As Integer
    Dim myIndex As Integer
    myIndex = 0
    ReDim newRank(0)
    For i = mySubRangeStart To mySubRangeEnd
        maxVal = Application.WorksheetFunction.Max(newVals) ' find the max similarity score over the range
        pos = Application.WorksheetFunction.Match(maxVal, newVals, 0) ' find the index of the max similarity score
        Worksheets("Analysis").Range(myColumn & (mySubRangeStart + pos - 1)).Value = Worksheets("Analysis").Range("E" & i).Value ' place the corresponding original human score instead of the max score
        newRank(myIndex) = Worksheets("Analysis").Range("E" & (mySubRangeStart + pos - 1)).Value ' append to the newRank array
        myIndex = myIndex + 1 ' increment index for later use
        ReDim Preserve newRank(myIndex) ' resize the newRank array
        newVals(pos - 1) = 0 ' remove the current value from the newVals array
    Next
    Call AssignArrToVals(mySubRange, newRank) ' update the corresponding ranking column with the new values
End Sub

Public Sub calcDCGp(mySubRangeStart As Integer, mySubRangeEnd As Integer, myColumn As String)
    Dim DCGp As Double, myCnt As Integer, rel1 As Double, reli As Double, myDiv As Double
    ' find the first relevance value, i.e. rel1, as follows:
    rel1 = Worksheets("Analysis").Range(myColumn & mySubRangeStart).Value
    myCnt = 1
    DCGp = rel1
    For i = mySubRangeStart + 1 To mySubRangeEnd
        myCnt = myCnt + 1
        ' find the corresponding relevance value, i.e. reli, as follows:
        reli = Worksheets("Analysis").Range(myColumn & i).Value
        ' divide reli by the binary logarithm of its rank:
        myDiv = reli / Application.WorksheetFunction.Log(myCnt, 2)
        ' accumulate values:
        DCGp = DCGp + myDiv
    Next
    ' place and color the resulting DCGp cell
    resColIdx = Range(myColumn & "1").Column
    Worksheets("Analysis").Cells(mySubRangeStart, resColIdx + 3).Value = DCGp
    Range(Chr(64 + resColIdx + 3) & mySubRangeStart).Interior.Color = RGB(169, 208, 142)
End Sub

' The following procedure is a supporting routine to assign range values into an array:
Public Sub AssignValsToArr(myArr() As Variant, myRange As String)
    Dim cell As Range
    Dim i As Integer
    i = 0
    ReDim myArr(0)
    For Each cell In Worksheets("Analysis").Range(myRange)
        myArr(i) = cell
        i = i + 1
        ReDim Preserve myArr(i)
    Next
End Sub
' The following procedure is a supporting routine to assign array values back into a range:
Public Sub AssignArrToVals(ByRef myRange As String, myArr() As Variant)
    Dim cell As Range
    Dim i As Integer
    For Each cell In Worksheets("Analysis").Range(myRange)
        cell = myArr(i)
        i = i + 1
    Next
End Sub
References
[1] X. He, J. Gao and L. Deng, "Deep Learning for Natural Language Processing: Theory and Practice," in Tutorial presented at CIKM'14, Deep Learning Technology Center, Microsoft Research, Redmond, WA, 2014.
[2] M. Striewe, PhD Dissertation: Automated Analysis of Software Artefacts - A Use Case in E-Assessment, Campus Essen: Duisburg-Essen University, 2014.
[3] M. Filipczyk, Master Thesis: Bewertung von Freitextaufgaben in automatischen Prüfungssystemen, Essen: Duisburg-Essen University, 2012.
[4] M. Filipczyk, M. Striewe and M. Goedicke, "Bewertung von kurzen Freitextantworten in automatischen Prüfungssystemen," in e-Learning Fachtagung Informatik, Bremen, Germany, 2013.
[5] L. F. Bachman, N. Carr, G. Kamei, M. Kim, M. J. Pan, C. Salvador and Y. Sawaki, "A reliable approach to automatic assessment of short answer free responses," in 19th International Conference on Computational Linguistics (COLING '02), Stroudsburg, PA, USA, 2002.
[6] D. Pérez, E. Alfonseca and P. Rodríguez, "Application of the BLEU method for evaluating," in LREC'04 Conference, Lisbon, 2004.
[7] B. Croft and J. Lafferty, "Language Model for Information Retrieval," in Information Retrieval Book Series, vol. 13, Dordrecht, Kluwer Academic Publishers, 2003.
[8] F. M. Ham and I. Kostanic, "Nonlinear model of an artificial neuron," in Principles of Neurocomputing for Science and Engineering, McGraw-Hill Higher Education, 2001, p. 25.
[9] F. Dieterle, "Principles of Neural Networks," in Dissertation: Multianalyte Quantifications by Means of Integration of Artificial Neural Networks, Genetic Algorithms and Chemometrics for Time-Resolved Analytical Data, Eberhard-Karls-Universität Tübingen, 2003, p. 20.
[10] D. Rumelhart, J. McClelland and the PDP Research Group, Parallel Distributed Processing, Explorations in the Microstructure of Cognition: Foundations, vol. 1, Cambridge: MIT Press, 1986.
[11] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks, vol. 61, pp. 85-117, 2015.
[12] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition," Signal Processing Magazine, 2012.
[13] A. J. Robinson, L. Almeida, J. Boite, H. Bourlard, F. Fallside, M. Hochberg, D. Kershaw, P. Kohn, Y. Konig, N. Morgan, J. P. Neto, S. Renals, M. Saerens and C. Wooters, "A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The Wernicke project," in EUROSPEECH'93 Conference, Berlin, Germany, 1993.
[14] A. Senior, "Speech recognition," Google NYC, Lecture 14: Neural Networks, 2013.
[15] Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin, "A Neural Probabilistic Language Model," Machine Learning Research, no. 3, pp. 1137-1155, 2003.
[16] T. Mikolov, K. Chen, G. Corrado and J. Dean, "Efficient Estimation of Word Representations in Vector Space," in ICLR'13 Workshop, Scottsdale, Arizona, 2013.
[17] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk and Y. Bengio, "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation," in EMNLP'14 Conference, Doha, Qatar, 2014.
[18] Y. Bengio, H. Schwenk, J.-S. Senca, F. Morin and J.-L. Gauvain, "Neural probabilistic language models," in Innovations in Machine Learning, Springer Berlin Heidelberg, 2006, pp. 137-186.
[19] C. Olah, "Deep Learning, NLP, and Representations," colah.github.io, 7 July 2014. [Online]. Available: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.
[20] T. Mikolov and Q. Le, "Distributed Representations of Sentences and Documents," in ICML'14 Conference, Beijing, China, 2014.
[21] L. Bottou, "From Machine Learning to Machine Reasoning," Cornell University Library, arXiv:1102.1808, 2011.
[22] T. Kenter and M. de Rijke, "Short Text Similarity with Word Embeddings," in CIKM'15 Conference, Melbourne, Australia, 2015.
[23] D. C. Carmen Banea, C. Cardie and J. Wiebe, "SimCompass: Using Deep Learning Word Embeddings to Assess Cross-level Similarity," in SemEval'14 Conference, Dublin, Ireland, 2014.
[24] M. J. Kusner, Y. Sun, N. I. Kolkin and K. Q. Weinberger, "From Word Embeddings To Document Distances," in ICML'15 Conference, Lille, France, 2015.
[25] R. Socher, E. H. Huang, J. Pennington, C. D. Manning and A. Y. Ng, "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection," in Advances in Neural Information Processing Systems, 2011.
[26] P. S. Huang, X. He, J. Gao, L. Deng, A. Acero and L. Heck, "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data," in CIKM'13 Conference, San Francisco, USA, 2013.
[27] J. Gao, K. Toutanova and W. T. Yih, "Click-through-based latent semantic models for web search," in SIGIR, 2011.
[28] R. Salakhutdinov and G. Hinton, "Semantic Hashing," in SIGIR, 2007.
[29] J. Gao, X. He and J. Y. Nie, "Click-through-based translation models for web search: from word models to phrase models," in CIKM, 2010.
[30] G. Montavon, G. B. Orr and K.-R. Müller, Neural Networks: Tricks of the Trade, Berlin, Germany: Springer Berlin Heidelberg, 2012.
[31] J. Gao, W. Yuan, X. Li, K. Deng and J.-Y. Nie, "Smoothing clickthrough data for web search ranking," in SIGIR'09, Boston, Massachusetts, USA, 2009.
[32] R. Socher, B. Huval, C. D. Manning and A. Y. Ng, "Semantic compositionality through recursive matrix-vector spaces," in EMNLP'12, Stroudsburg, PA, USA, 2012.
[33] "gensim - topic modelling for humans," Radim Rehurek, 12 Feb 2016. [Online]. Available: https://radimrehurek.com/gensim/intro.html.
[34] R. Řehůřek, "Doc2vec tutorial," Rare Technologies, 15 Dec 2014. [Online]. Available: http://rare-technologies.com/doc2vec-tutorial/.
[35] R. Řehůřek, "models.doc2vec – Deep learning with paragraph2vec," 2014. [Online]. Available: http://radimrehurek.com/gensim/models/doc2vec.html.
[36] S. Company, "Deeplearning4j: Open-source distributed deep learning for the JVM, Apache Software Foundation License 2.0," Deeplearning4j Development Team, 2016. [Online]. Available: http://deeplearning4j.org/doc/.
[37] "Deeplearning4j," Skymind firm, 2016. [Online]. Available: http://deeplearning4j.org/doc2vec.html.
[38] [email protected], "deeplearning4j/dl4j-0.4-examples," [Online]. Available: https://github.com/deeplearning4j/dl4j-0.4-examples/blob/master/src/main/java/org/deeplearning4j/examples/nlp/paragraphvectors/ParagraphVectorsTextExample.java. [Accessed 4 July 2016].
[39] "Sent2Vec," Microsoft Research, 28 July 2015. [Online]. Available: http://research.microsoft.com/en-us/downloads/731572aa-98e4-4c50-b99d-ae3f0c9562b9/.
[40] Y. Shen, X. He, J. Gao, L. Deng and G. Mesnil, "A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval," in CIKM'14 Conference, Shanghai, China, 2014.
[41] M. o. StackExchange, "StackExchange," [Online]. Available: http://stackexchange.com/sites. [Accessed 23 May 2016].
[42] I. Stack Exchange, "Stack Exchange Data Dump," 1 March 2016. [Online]. Available: https://archive.org/details/stackexchange.
[43] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422-446, 2002.
[44] S. University, "Presentation: Introduction to Information Retrieval - Evaluation," Stanford University, 21 April 2013. [Online]. Available: http://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6per.pdf.
[45] B. Croft, D. Metzler and T. Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2009.
[46] Y. Wang, L. Wang, Y. Li, D. He, W. Chen and T.-Y. Liu, "A Theoretical Analysis of NDCG Ranking Measures," in 26th Annual Conference on Learning Theory (COLT), USA, 2013.
[47] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton and G. Hullender, "Learning to rank using gradient descent," in ICML'05 Conference, New York, USA, 2005.
[48] Kaggle Inc., "Normalized Discounted Cumulative Gain," 23 March 2014. [Online]. Available: https://www.kaggle.com/wiki/NormalizedDiscountedCumulativeGain.
[49] Wikipedia, "Discounted cumulative gain - Example," 11 April 2016. [Online]. Available: https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG.
[50] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," in Journal of Machine Learning Research 9, 2008.
[51] F. Galton, "Nature, Volume 32," in The British Association: Section II, Anthropology: Opening address by Francis Galton, Macmillan Journals Limited, 1885, pp. 507–510.