FunkR-pDAE: Personalized Project Recommendation

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2018.2870734, IEEE Transactions on Emerging Topics in Computing.

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, VOL. XX, NO. X, XXXX


FunkR-pDAE: Personalized Project Recommendation Using Deep Learning

Pengcheng Zhang, Fang Xiong, Hareton Leung, and Wei Song

Abstract—In open source communities, developers always need to spend plenty of time and energy discovering specific projects among massive numbers of open source projects. Consequently, the study of personalized project recommendation for developers has important theoretical and practical significance. However, existing recommendation approaches have clear limitations, such as ignoring developers' operating behavior, social relationships and practical skills, and are very inefficient for large amounts of data. To address these limitations, this paper proposes FunkR-pDAE (Funk singular value decomposition Recommendation using pearson correlation coefficient and Deep Auto-Encoders), a novel personalized project recommendation approach using a deep learning model. FunkR-pDAE first extracts data related to developers and open source projects from open source communities, which is used to build a developer-open source project relevance matrix and a developer-developer relevance matrix. Meanwhile, the Pearson correlation coefficient is utilized to calculate developer similarity using the developer-developer relevance matrix. Second, deep auto-encoders are used to learn the factor vectors that represent developers and open source projects. Finally, a sorting method is defined to provide personalized project recommendations. Experimental results on real-world GitHub data sets show that FunkR-pDAE achieves a precision rate of 75.46% and a recall rate of 40.32%, providing more effective recommendations than state-of-the-art approaches.

Index Terms—Open Source; project recommendation; deep auto-encoder; GitHub.


1 INTRODUCTION

The Open Source Software (OSS) development paradigm is defined as collaborative work among multiple developers. Through online development, geographically dispersed developers can work together to maintain or improve OSS projects, so the same software project includes participants from different areas of the world. OSS also allows different developers to modify and improve source code according to their own needs. In addition, unlike traditional organizations, OSS has a composite structure of social-technical processes and does not need a specific organization [9]. It usually has a lively software process, constantly changing requirements, and fast development phases. In traditional software development, if the project team is large, project progress is often very slow. OSS, on the other hand, is developed in an environment where developers can communicate and share code via the Internet rather than in the same geographical place. Developers gather on social-coding sites in the same virtual environment, such as GitHub and SourceForge, for mutual assistance and sharing of experience. Developers can operate on the code through watch, fork, pull-request and so on. Meanwhile, they can also establish their own

P. Zhang and F. Xiong are with the College of Computer and Information, Hohai University, Nanjing, China. E-mail: [email protected]; [email protected]
H. Leung is with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, China. E-mail: [email protected]
W. Song is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. E-mail: [email protected]

Manuscript received XXXX XXXX; revised XXXX, XXXX.

social network by following other developers to achieve socialization and transparent programming [2]. According to a report released by GitHub in 2017, GitHub has more than 24 million users, hosts more than 67 million repositories, and has handled over one billion code requests. Due to the rich resources on GitHub, developers and projects are not distributed evenly: it is very hard for developers to find suitable projects for themselves among 67 million code repositories, while most of the projects hosted on GitHub have only a handful of developers involved. Consequently, personalized project recommendation for GitHub is urgently needed.

Project recommendation is a kind of Recommendation System in Software Engineering (RSSE) [19]. RSSE differs from other recommendation areas primarily in the models used, and often relies on data mining and the predictive nature of its functionality. The traditional recommendation system is user-oriented: users usually create data items directly, for example in the form of ratings, and an important challenge is to infer and simulate user preferences and needs. Instead, the main challenge in designing an RSSE is to automatically interpret the highly technical data stored in the repository. Recently, researchers have proposed several novel recommendation approaches for GitHub [15], [27], [30]. However, existing approaches have some limitations, which we summarize in the following:

• Most of the existing approaches do not fully extract developers' operating behaviors on open source projects. They also do not consider developers' social relationships when open source projects are recommended. These omissions may lead to low recommendation precision.

• These approaches are not validated on large data sets, which hampers their wide use on real open source projects.

• Due to limited time and effort, developers can only participate in a few open source projects in a short time. Consequently, the project period is also an important factor in project recommendation, and it has not been fully investigated.

2168-6750 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

For traditional recommendation systems, many experts and scholars have adopted various deep learning algorithms [6], [13], [20], [21], [25] and achieved promising results. However, for project recommendation, the use of deep learning models is still lacking. Toward this research direction, this paper presents FunkR-pDAE, a personalized project recommendation approach based on a deep learning model. Overall, the approach addresses the aforementioned limitations in the following aspects:







• To fully exploit the correlation between developers and open source projects, FunkR-pDAE extracts data related to developers and open source projects, such as watch, fork, issue comment, and pull-request comment, from GitHub's dataset. At the same time, FunkR-pDAE extracts the follow attribute, which expresses the social relationships among developers. Based on these data, FunkR-pDAE builds a developer-open source project relevance matrix (R) and a developer-developer relevance matrix (D); the historical behaviors between developers and open source projects are translated into matrix ratings. Furthermore, FunkR-pDAE uses the Pearson correlation coefficient to calculate developer relevance based on the developer-developer relevance matrix (D) and obtains the similarity of interest among all developers.

• When dealing with large amounts of data, deep learning models can extract many more data features than traditional models. Consequently, FunkR-pDAE uses a deep auto-encoder to learn from the two matrices (R and D) and obtain two feature vectors representing developers and open source projects. Based on the Funk singular value decomposition principle, FunkR-pDAE fits the inner product of the two feature vectors to the original scores to obtain a predictive matrix. Then, FunkR-pDAE defines a sorting formula for the Top-N recommendation.

• Considering the influence of the project period on recommendation results, we conducted open source project recommendation over four different periods: 3 months, 6 months, 9 months, and 12 months. This measures whether the project period affects recommendation results or not.

• We extracted four data sets from GitHub based on different types of programming languages. The experimental results show that FunkR-pDAE can extract features from large amounts of data and effectively recommend personalized open source projects.

The rest of this paper is organized as follows: Section 2 introduces related work on open source project recommendation. Section 3 gives background knowledge and the related theoretical basis. Section 4 introduces the proposed recommendation approach FunkR-pDAE. Section 5 presents the experimental design and result analysis. Section 6 concludes the paper and looks into future work.

2 RELATED WORK

2.1 Traditional Recommendation Systems

In traditional recommendation systems, the SVD (singular value decomposition) model can be used to predict a user's scores on other items. However, the accuracy of the predicted values is not high enough to reflect the individual needs of the user. Also, for a larger matrix, SVD decomposition is very memory-intensive and time-consuming. Consequently, Webb [26] proposed FunkSVD, later called the Latent Factor Model (LFM), mainly to improve computational efficiency and support sparse matrices in SVD. However, the computational efficiency of the algorithm is still low when dealing with large amounts of data.

In recent years, deep learning networks have been well developed in traditional recommendation systems [31]. Salakhutdinov first applied the restricted Boltzmann machine (RBM) [20], [14] to recommendation systems. Since then, many researchers have used deep Auto-Encoders (AE) [21], [6], [13], Recurrent Neural Networks (RNN) [5] and Convolutional Neural Networks (CNN) [25] to implement traditional recommendation. Zhang et al. [31] provide a comprehensive summary of current research on deep learning based recommendation systems; they identify open problems currently limiting real-world implementations and point out future directions. RNNs are suitable for modeling chronological data; since the datasets we extracted from GitHub do not have chronological characteristics, RNNs are not suitable. A CNN [24] is a special feedforward neural network with convolutional layers and aggregation operations; the types of features handled by CNNs mainly include images, audio, and text, so it is also not very suitable for project recommendation. At the same time, deep reinforcement learning algorithms [32], [12] have also been applied to recommendation systems; they can effectively simulate short-term and long-term feedback to improve the accuracy of recommendations.
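The LFM idea behind FunkSVD can be sketched in a few lines of Python: only the observed entries of the sparse rating matrix are touched, and two low-rank factor matrices are learned by stochastic gradient descent on a regularized squared error. This is an illustrative sketch on toy data, not the implementation used in the paper.

```python
import random

random.seed(0)

# Observed (user, item, rating) triples of a sparse matrix (toy data).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 4.0)]
m, n, k = 3, 3, 2           # users, items, latent factors
lam, lr = 0.02, 0.01        # regularization coefficient and learning rate

U = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
V = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# SGD over observed entries only: minimize (r - u·v)^2 + lam(|u|^2 + |v|^2).
for _ in range(3000):
    for i, j, r in ratings:
        e = r - dot(U[i], V[j])          # error on one observed entry
        for f in range(k):
            uif, vjf = U[i][f], V[j][f]
            U[i][f] += lr * (e * vjf - lam * uif)
            V[j][f] += lr * (e * uif - lam * vjf)
```

Because only observed entries enter the loss, no filling of missing values is needed, which is exactly what makes this formulation practical for sparse matrices.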
Among these models, the deep auto-encoder is an excellent tool for reducing dimensions. For a given input, output data that approximates the input can be obtained through an encoding-decoding process, and the encoded hidden-layer data is a characteristic representation of the input. At the same time, the amount of data in open source communities is relatively large and its patterns are difficult to learn, which makes the deep auto-encoder an ideal choice for project recommendation.

2.2 Project Recommendation

RSSE [19] is a research hotspot in software engineering data mining. RSSE aims to develop software tools that help developers conduct a wide range of activities, from reusing code to writing effective bug reports. This field mainly includes source code recommendation [16], [29], technical expert recommendation [11], [23], knowledge document recommendation [4], [18], and open source project recommendation [28], [30].

For open source project recommendation, Matek et al. [15] construct a bipartite graph in which developers and projects are modeled as nodes, predict whether a project is related to a developer through link prediction, and then perform the final recommendation. This method takes into account only the direct relationship between the developer and the project, i.e., whether the developer is the owner or a collaborator of the project, but does not consider developer behavior data specific to GitHub such as watch, fork, and pull-request. Since active developers tend to constantly look for interesting challenges, this results in a subjective recommendation that does not consider the developers' needs. Zhang et al. [30] recommend related open source projects to developers based on developer behavior data (watch, fork, pull-request, etc.). However, the social relations among developers and some other properties of the projects are not fully taken into account. Yang et al. [28] design a series of quantitative formulas to calculate the association between developers and open source projects in three dimensions; finally, a machine learning algorithm is used to rank recommendations, and effective recommendation results can be obtained. However, during training data generation, manual annotation is very time-consuming. Sun et al. [22] consider three kinds of developer behavior (create, star, and fork) and analyze open source project descriptions and corresponding information. However, only a small amount of data is used, and the scalability of their model is not validated on large-scale data sets. Moreover, due to limited time and effort, developers' practical skills are limited to several programming languages, which affects widespread use.
To solve the above problems, this paper presents a personalized open source project recommendation approach using deep auto-encoders, called FunkR-pDAE.

3 PRELIMINARIES

3.1 GitHub Behavior Characteristics

Social coding sites play two key roles in open source software projects: social networks and online hosted repositories, which contain projects that can be developed by different teams for different organizations. One of the largest social coding sites is GitHub, which is built on the distributed version control system Git. Git is a popular open source version control system (VCS); it supports many instances of software repositories, and developers can perform tasks locally and then push their changes through the Web. GitHub is an online open source environment that not only allows individuals and organizations to create and navigate code repositories but also provides community-based software development capabilities that allow them to collaborate on projects. A developer can follow other developers and watch or fork projects. We describe these main features below:

Follow: Following a developer is like following a person on Twitter or Facebook. After following, a developer can receive all the dynamic information about what another developer has been working on in open source projects.

Watch: By watching a project, a developer can get all the dynamic information of the project. When the project changes, such as receiving a pull request or having an issue opened by other developers, the developer will receive updated information about the project.

Fork: Fork means creating a copy of a repository (including files, commit history, issues, etc.). When developers want to make changes to other developers' projects, they usually fork them, which has no effect on the original project. Using the fork as their starting point, they copy the project files locally and can help the original project's developers improve the project, for example through bug fixes or code optimization.

Issue Comment: The issue system is GitHub's problem tracking system, used to check the history of communication among developers. Each developer is free to comment on the project or discuss issues, and the comments for each project are also recorded.

Pull-Request Comment: Once a developer makes some changes to a forked project, the changes can be shared with others via a pull request; reviewers can then assess the changes. If the desired changes are accepted, a reviewer can incorporate the request into the upstream branch; otherwise, the merge operation is not performed. To complete the merge operation, developers must have write access to the repository.

3.2 Funk Singular Value Decomposition

Singular Value Decomposition (SVD) is widely used in machine learning, image recognition, and natural language processing [3]. It can transform high-dimensional data into low-dimensional data. SVD is defined as follows: supposing R is an m × n matrix, the SVD of R is

R_{m×n} = U_{m×m} Σ_{m×n} V_{n×n}^T    (1)

where Σ is an m × n matrix whose elements are all zeros except for those on the main diagonal; each element on the main diagonal is called a singular value. The singular values are arranged in descending order, so we can approximate the matrix with the largest k singular values and the corresponding left and right singular vectors, where k is much smaller than n. That is, the large matrix R can be represented by three smaller matrices U_{m×k}, Σ_{k×k}, and V_{k×n}^T. Both U and V are unitary matrices; they satisfy U^T U = I and V^T V = I, where I is the identity matrix.

Traditional singular value decomposition requires that the matrix to be decomposed have no missing values, so missing values must be filled before SVD is used. Although SVD decomposition can be done by filling in zeros or average values, the precision of the predicted values is not very high, which cannot exactly reflect developers' specific requirements. Furthermore, SVD decomposition of a large matrix is very memory-intensive and time-consuming. Consequently, Webb [26] introduced FunkSVD during the Netflix competition to improve computing efficiency and enable sparse matrices to be decomposed; the approach was later called the Latent Factor Model (LFM). As shown in Fig. 1, FunkSVD decomposes R into two lower-rank matrices:

R_{m×n} = U_{m×k} V_{k×n}^T    (2)

Fig. 1. The Funk-SVD decomposition diagram

Since the evaluation index is Root Mean Squared Error (RMSE), Webb can learn the user characteristic matrix U and item characteristic matrix V by minimizing the RMSE directly from the observations in the training set, with a regularization term to avoid over-fitting. It takes the idea of linear regression and uses the RMSE as the loss function to find the final U and V. FunkSVD's optimization objective function J(u, v) is:

argmin_{U,V} Σ_{(i,j)∈K} (r_{ij} − u_i v_j^T)² + λ(‖u_i‖² + ‖v_j‖²)    (3)

where K is the set of (i, j) pairs with scoring records in matrix R, r_{ij} is the original score of user i on item j, λ(‖u_i‖² + ‖v_j‖²) is a regularization term to prevent over-fitting, and λ is the regularization coefficient. Finally, the newly obtained scoring matrix R̂ is calculated as:

r̂_{ij} = u_i v_j^T    (4)

3.3 Deep Auto-Encoder

An auto-encoder is a specific form of neural network consisting of an encoder and a decoder [1]. The purpose of the auto-encoder is to make the output consistent with the original input, which is equivalent to learning an identity x = x̃. The number of input nodes and output nodes of the auto-encoder is the same, but a copy-only identity is meaningless; the auto-encoder is expected to reconstruct the input from a small number of sparse high-order features. The model is shown in Fig. 2.

Fig. 2. The topology structure chart of DAE neural networks

The network converts the input layer data x to the middle (hidden) layer h and then to the output layer x̃. Each node represents one dimension of the data. The transformation between every two layers is a linear change followed by a non-linear activation, formulated as:

h = f(Wx + b)    (5)

x̃ = f(W′h + b′)    (6)

The purpose of the neural network is to approximate the transformation function from the input layer to the output layer. Therefore, we need to define an objective function to measure the difference between the current output and the actual result. By gradually adjusting the system parameters W, b, W′, b′ through the objective function (e.g., by gradient descent), the entire network can be fitted to the training data as well as possible. If there is a regularization constraint, the model is also required to be as simple as possible (to prevent overfitting). After training, the network has learned the mapping x → h → x̃. The hidden layer h is crucial because it is another expression of the original data that loses as little information as possible. To learn a meaningful expression, some constraints are added to the hidden layer. In terms of the data dimension, the following two situations are common:

• The hidden layer dimension is smaller than the input layer dimension. The transformation x → h is then a dimension reduction operation; the network tries to describe the original data in a smaller dimension without losing data information.

• The hidden layer dimension is larger than the input layer dimension. In this case, many dimensions in the hidden layer are 0, i.e., not activated. This is a sparse auto-encoder. The sparse expression means that the system is performing feature selection: it needs to find the important dimensions among a large number of dimensions.

The auto-encoder is similar to the matrix decomposition model: it predicts the matrix score through dimension reduction. Therefore, we mainly consider the first case mentioned above.
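The undercomplete case can be sketched with a minimal NumPy auto-encoder trained by gradient descent; the forward pass follows h = f(Wx + b) and x̃ = f(W′h + b′) with a sigmoid activation. The data, layer sizes, and learning rate below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating-like data: 20 samples, 8-dimensional input.
X = rng.random((20, 8))

n_in, n_hid = 8, 3          # undercomplete: hidden dim < input dim
W = rng.normal(0, 0.1, (n_hid, n_in));  b = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_in, n_hid)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses, lr = [], 0.5
for _ in range(500):
    # Forward pass: h = f(Wx + b), x_hat = f(W'h + b')
    H = sigmoid(X @ W.T + b)
    X_hat = sigmoid(H @ W2.T + b2)
    losses.append(float(np.mean((X_hat - X) ** 2)))

    # Backward pass on the mean squared reconstruction error.
    d_out = (X_hat - X) * X_hat * (1 - X_hat)   # sigmoid derivative
    d_hid = (d_out @ W2) * H * (1 - H)
    W2 -= lr * d_out.T @ H / len(X); b2 -= lr * d_out.mean(0)
    W  -= lr * d_hid.T @ X / len(X); b  -= lr * d_hid.mean(0)
```

After training, the hidden activations H are the low-dimensional representation of the input, which is the role the hidden layer plays in FunkR-pDAE.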

3.4 Pearson Correlation Coefficient (PCC)

The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC or PCC), is a measure of linear correlation proposed by the British statistician Pearson in the 20th century. It is used to calculate the similarity of two vectors. The Pearson correlation coefficient r(A,B) of two continuous variables (A, B) is equal to their covariance cov(A, B) divided by the product of their standard deviations σ_A σ_B:

r(A,B) = cov(A, B) / (σ_A σ_B)
       = E((A − µ_A)(B − µ_B)) / (σ_A σ_B)
       = (E(AB) − E(A)E(B)) / ( √(E(A²) − E²(A)) · √(E(B²) − E²(B)) )
       = Σ_{i=1}^{n} (A_i − Ā)(B_i − B̄) / ( √(Σ_{i=1}^{n} (A_i − Ā)²) · √(Σ_{i=1}^{n} (B_i − B̄)²) )    (7)
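The last form of Eq. (7) translates directly into code; a small sketch (the zero-variance guard is our own convention, not specified in the paper):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if sd_a == 0 or sd_b == 0:
        return 0.0              # a constant vector has no linear correlation
    return cov / (sd_a * sd_b)
```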


The value of r(A,B) is between −1 and 1. Generally speaking, the greater the absolute value, the stronger the correlation. When r(A,B) > 0, the two variables are positively correlated: the larger the value of one variable, the greater the value of the other. When r(A,B) < 0, the two variables are negatively correlated: the larger the value of one variable, the smaller the value of the other. When r(A,B) = 0, the two variables are not linearly related. When r(A,B) = 1 or −1, the two variables A and B can be described by a linear equation.

4 THE FUNKR-PDAE RECOMMENDATION APPROACH

Fig. 3. FunkR-pDAE overview

The main steps of FunkR-pDAE are described in Fig. 3.

1) Data collection and preprocessing. From the entire GitHub dataset, based on the attributes of developers and projects and the relations between them, we collect the historical data we need. The development history data collected during this phase includes attributes such as watch, fork, issue comment, and pull-request comment, which represent the association between developers and projects, and the follow attribute, which represents the social relations between developers. According to the above attributes, the matrices D and R are respectively constructed, and the relevance of the developers in matrix D is calculated through the Pearson correlation coefficient.

2) Model training and feature vector generation. By setting the loss function, the parameters of the auto-encoder are learned to minimize the reconstruction error. The feature vectors of the auto-encoder are learned by SGD (stochastic gradient descent) to fit the original developer-project relevance score matrix; SGD is computed using the Back Propagation (BP) algorithm.

3) Prediction and recommendation. In this step, based on the predictive scoring matrix and the developer correlations obtained by the model, we define a recommendation formula, recommend the corresponding open source projects that may be of interest, and return the Top-N projects to the developers. Finally, the model and recommendation results are evaluated based on indices such as RMSE, precision, and recall.

4.1 Data Collection and Preprocessing

4.1.1 Data Collection

GHTorrent [7] retrieves high-quality interconnected data through the REST API provided by GitHub and contains all the public projects on GitHub. In this step, we obtain historical developer behavior data from the website. The data includes information about developers, language information for open source projects, and a series of developer behaviors on open source projects (watch, fork, pull-request comment, issue comment). Analysis of the data shows that some projects have little information, few audiences, and low recommendation value; if these data are not removed, the model fitting speed can be affected. Therefore, the data set needs to be filtered. The main idea is to remove projects that do not satisfy our criteria. Given developers' levels of activity, developers who lack social information should also be removed, as they may affect the training process of the deep auto-encoder. In general, based on our previous experience, the screening criteria are defined as follows:

• Developer: the developer follows 10 to 50 other developers;
• Open Source Project: the project has at least 5 watches and 5 forks.
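The screening step above can be sketched as a simple filter over the collected records; the developer and project names and counts below are hypothetical.

```python
# Hypothetical records: developer follow counts and project watch/fork counts.
developers = {"tor": 12, "ain": 60, "joh": 8, "isl": 35}   # name -> #followed
projects = {"oem": (9, 6), "eng": (3, 1), "bra": (5, 5)}   # name -> (watches, forks)

# Screening criteria from Section 4.1.1.
kept_devs = {d for d, follows in developers.items() if 10 <= follows <= 50}
kept_projects = {p for p, (w, f) in projects.items() if w >= 5 and f >= 5}
```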

(1) Developer-Project Relevance Matrix (DP or R)

According to Zhang [30], developer behaviors such as fork, watch, and pull-request are suitable for recommending related projects. Developers' different actions on a project can also reflect their preference for the project. The watch, fork, issue-comment, and pull-request-comment behaviors that developers generate on open source projects form a gradual progression: a developer's pull-request behavior is more reflective of the developer's preference for the project than a watch. We construct an m × n developer-project relevance matrix as the primary matrix R based on developers' actions on projects (watch, fork, pull-request comment, issue comment, etc.). The rows of the matrix represent developers and the columns represent projects, where m is the number of developers, n is the number of projects, and the value r_{ij} represents developer i's interest in project j. Different behaviors represent different preferences, and the quantitative rating criteria for developer behaviors are shown in Table 1:

TABLE 1
Developer historical behavior rating criteria

Behavior | watch | fork | issue comment | pull-request comment
Rate     |   1   |  2   |       3       |          4

The value of each item in the main matrix R is:

R_{i,j} = Σ_ℓ σ_ℓ    (ℓ ≤ 4)    (8)
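Equation (8) says that an entry of R is the sum of the ratings of all behaviors a developer has shown on a project. A small sketch under the Table 1 weights (the event log is hypothetical):

```python
# Rating per behavior, following Table 1.
BEHAVIOR_RATE = {"watch": 1, "fork": 2, "issue comment": 3, "pull-request comment": 4}

# Hypothetical event log: (developer, project, behavior).
events = [
    ("Tor", "oem", "watch"),
    ("Tor", "oem", "fork"),
    ("Tor", "oem", "issue comment"),
    ("Joh", "bra", "fork"),
]

# R[i][j] = sum of behavior ratings, as in Eq. (8).
R = {}
for dev, proj, behavior in events:
    R.setdefault(dev, {}).setdefault(proj, 0)
    R[dev][proj] += BEHAVIOR_RATE[behavior]
```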


Tor

issue-comment watch fork

oem

Developer-Project Relevance Matrix

eng

oem eng

Joh bra Isl

WL

bra WL

Tor

6

0

1

0

Joh

0

2

3

0

Isl

0

0

4

4

Fig. 4. Construct a score matrix based on the a developer’s historical behavior

6

Since the open source project scored in the developer matrix D is not the developer actually involved in the development, it can be used as an additional weight information to find developers with similar interest, and thus perform related recommendations. Therefore, when matrix D is constructed, the value of the rating item is taken as 1. The basic idea of this recommendation is that if developers a and b have similar interests, we can recommend a’s favorite open source project to b. We use Pearson Correlation Coefficient to calculate the degree of correlation between two developers a and b: P

Tor Ain Tna

pullrequestcomment

Joh

oem eng bra

Isl

WL

Developer Relevance Matrix

oem eng bra WL Ain α

α

α

0

Tna 0

0

α

α

Fig. 5. Construct a developer relevance matrix based on a developer’s follow attribute

That is, the sum of the developers’ scores on the behavior of open source projects. Fig. 4 shows an example to illustrate the calculating process. The developer Tor generates a variety of operational behaviors (watch, fork, issue-comment) for the project oem. Finally, Tor’s rating of the project oem is the sum of all the multiple behaviors. The final rating is 6, where the value of watch is 1, the value of fork is 2, and the value of issue-comment is 3. (2) Developer Relevance Matrix(D) GitHub’s social networking system is mainly through follow operator to provide interaction between developers. Developers build relationships through mutual attention, through which they can share knowledge, documents, and codes, and attract other developers. The type of interaction of individual choice is closely related to information resources, the developer’s technical relevancy and interest similarities. For example, if Developer A is concerned with Developer B, Developer A may be also interested in open source projects owned or operated by Developer B (watch, fork, etc.). In social networks, it is usually considered that developers who are interested in each other have similar preferences. Therefore, the developers’ indirect relation with a certain project can be obtained according to developer relevance information so as to construct a m × n developer relevance matrix D. As shown in Fig. 5, a developer Ain is concerned about Tor and Joh, and Tor and Joh have operated on Oem, eng, and bra. Therefore, Ain also indirectly scored the above three projects α. The projects with no indirect relationship are defined as 0. 4.1.2

4.1.2 Developer Similarity

In developer relevance matrix, existing scores represent indirect preferences of developers for open source projects.

$$Sim(A,B)=\frac{\sum_{i\in SET_{a,b}}(r_{a,i}-\bar{r}_a)(r_{b,i}-\bar{r}_b)}{\sqrt{\sum_{i\in SET_{a,b}}(r_{a,i}-\bar{r}_a)^2}\,\sqrt{\sum_{i\in SET_{a,b}}(r_{b,i}-\bar{r}_b)^2}} \quad (9)$$

where $r_{a,i}$ and $r_{b,i}$ represent the $i$-th score of developers $a$ and $b$, respectively, and $\bar{r}_a$ and $\bar{r}_b$ are their average scores. $SET_{a,b}$ is the set of open source projects that developers $a$ and $b$ have both scored. By using developers as both rows and columns and taking the developers' correlation coefficients as values, we construct an $m \times m$ symmetric matrix that represents the correlation between developers. The magnitude of each entry represents the degree of similarity between the two corresponding developers.
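Equation 9 can be computed directly over the co-rated projects. The sketch below is a minimal implementation; the dictionary representation and the sample ratings are illustrative.

```python
import math

def pearson_sim(ratings_a, ratings_b):
    """Pearson correlation (Eq. 9) over the projects both developers have scored."""
    shared = set(ratings_a) & set(ratings_b)   # SET_{a,b}
    if len(shared) < 2:
        return 0.0                             # not enough overlap to correlate
    mean_a = sum(ratings_a[i] for i in shared) / len(shared)
    mean_b = sum(ratings_b[i] for i in shared) / len(shared)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in shared)
    den = (math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in shared))
           * math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in shared)))
    return num / den if den else 0.0

# Illustrative rating profiles keyed by project name.
a = {"oem": 6, "eng": 3, "bra": 1}
b = {"oem": 4, "eng": 2, "bra": 1, "wl": 5}
sim = pearson_sim(a, b)   # close to 1: the two profiles rise and fall together
```

Note that the means are computed over $SET_{a,b}$ only, matching the summation range of Equation 9; averaging over each developer's full rating history is a common variant but not what the equation states.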

4.2 FunkR-pDAE Model Training and Recommendation

We propose FunkR-pDAE, which combines the main idea of Funk singular value decomposition with deep auto-encoders to perform score prediction (Funk-DAE). Singular value decomposition (SVD) is a traditional approach to dimension reduction, but it is very inefficient when applied to large amounts of data. We therefore adopt the main idea of SVD and combine it with deep auto-encoders to solve the dimension-reduction problem efficiently for large amounts of data during the model training phase.

4.2.1 The Model Training Phase

Consider the scoring matrix R, where m and n are the numbers of developers and projects, respectively. Each project can be viewed as an input instance with the developers as its features, and an auto-encoder can then learn a latent representation of the project. Similarly, if each project is treated as a feature of a developer, we can learn a latent representation of the developer. To speed up gradient descent and reduce the number of iterations needed for convergence, we normalize the data. In the scoring matrix constructed above, developer i scores project j with an integer in the range 1–10, which is normalized as:

$$r_{ij} = \frac{r_{ij} - r_{min}}{r_{max} - r_{min}} \quad (10)$$
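Equation 10 can be applied to the observed entries only, leaving the zeros that mark missing ratings untouched. This is a sketch; the toy matrix is illustrative, and the score range 1–10 comes from the text.

```python
import numpy as np

# Toy scoring matrix; 0 marks a missing rating, observed scores lie in 1..10.
R = np.array([[6., 0., 3.],
              [0., 9., 1.],
              [2., 0., 0.]])

observed = R != 0                 # which entries actually carry a rating
r_min, r_max = 1.0, 10.0          # score range stated in the text

R_norm = R.copy()
R_norm[observed] = (R[observed] - r_min) / (r_max - r_min)   # Equation 10
```

One subtlety: with this range a score of 1 normalizes to 0, colliding with the missing-value marker, so the observation mask (the index matrix I of Equation 14) must be built before normalization and carried alongside the normalized matrix.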

During this stage, the input samples are $\{x_1, x_2, \ldots, x_m\}$ or $\{y_1, y_2, \ldots, y_n\}$, where each input $x_i$ is row $i$ and each $y_j$ is column $j$ of the scoring matrix $R$. We take the input samples $\{x_1, x_2, \ldots, x_m\}$ as the running example; the samples $\{y_1, y_2, \ldots, y_n\}$ are processed in the same way.

2168-6750 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TETC.2018.2870734, IEEE Transactions on Emerging Topics in Computing IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, VOL. XX, NO. X, XXXX

Fig. 6. Auto-encoder structure diagram in collaborative filtering. Only existing rating items participate in encoding/decoding; missing items do not.

We learn the potential features of developers and open source projects ($U$ and $V$) through the following optimization objective:

$$\begin{aligned} J = \; & \|I \circ (R - UV^T)\| + \alpha\|I \circ (X - \hat{X})\| + \beta\|I \circ (Y - \hat{Y})\| \\ & + \gamma\,(\|W_1\|^2 + \|b_1\|^2 + \|W_1'\|^2 + \|b_1'\|^2 \\ & \qquad + \|W_2\|^2 + \|b_2\|^2 + \|W_2'\|^2 + \|b_2'\|^2) \end{aligned} \quad (15)$$

where $\alpha$, $\beta$, and $\gamma$ are weight parameters. The terms $J_X = \|I \circ (X - \hat{X})\|$ and $J_Y = \|I \circ (Y - \hat{Y})\|$, weighted by $\alpha$ and $\beta$, are the objective functions for learning the latent factors of developers and open source projects, and $UV^T$ is the scoring matrix formed by the inner product of the learned latent factor vectors.
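Evaluating an objective of the shape of Equation 15 with NumPy can be sketched as follows. Two points are assumptions of this sketch rather than statements of the paper: the three reconstruction norms are taken as squared Frobenius norms for concreteness, and the project-side term is masked with $I^T$ so that the mask shape matches $Y$.

```python
import numpy as np

def masked_sq_norm(I, A):
    """Squared Frobenius norm restricted to observed entries: ||I o A||^2."""
    return float(np.sum((I * A) ** 2))

def objective(R, I, U, V, X, X_hat, Y, Y_hat, params, alpha, beta, gamma):
    """Objective J in the spirit of Equation 15 (norms squared for concreteness)."""
    J = masked_sq_norm(I, R - U @ V.T)           # rating reconstruction term
    J += alpha * masked_sq_norm(I, X - X_hat)    # developer auto-encoder term J_X
    J += beta * masked_sq_norm(I.T, Y - Y_hat)   # project auto-encoder term J_Y
    J += gamma * sum(float(np.sum(p ** 2)) for p in params)  # L2 regularization
    return J

# Sanity check: a perfectly factorized, perfectly reconstructed matrix gives J = 0.
m, n, k = 3, 4, 2
rng = np.random.default_rng(0)
U, V = rng.random((m, k)), rng.random((n, k))
R = U @ V.T                       # ratings exactly equal to the factor product
I = np.ones_like(R)               # pretend every entry is observed
X, Y = R, R.T                     # developer rows and project columns as inputs
J = objective(R, I, U, V, X, X, Y, Y, [np.zeros(1)], 0.1, 0.1, 0.01)
```

With zero reconstruction error and zero-valued parameters, every term vanishes, which makes this a convenient unit test for any fuller implementation.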

The input layer is encoded using the hidden-layer weight matrix $W$ and the offset item $b$ to obtain the feature expression $h$ of the DAE hidden layer:

$$h = f(Wx + b) \quad (11)$$

Through the weight matrix $W'$ and the offset $b'$ between the hidden layer and the output layer, the auto-encoder reconstructs the original input $x_i$ to obtain the output $\tilde{x}$:

$$\tilde{x} = g(W'h + b') \quad (12)$$

where $f(\Theta)$ is the encoding function and $g(\Theta)$ is the decoding function; both are the sigmoid function $f(\Theta) = g(\Theta) = \frac{1}{1 + e^{-\Theta}}$. The model optimizes the parameters of the auto-encoder by minimizing the reconstruction error:

$$\min_{W,b,W',b'} \sum_{i=1}^{n} \|x_i - \tilde{x}_i\|^2 \quad (13)$$

Since developers act on only a small portion of open source projects, most values are missing when the scoring matrix $R$ is used as input. If these missing values were all filled with zero, they would be treated as negative samples, resulting in an imbalance between positive and negative samples. Therefore, when the auto-encoders are used, we only encode and decode existing rating items, as shown in Fig. 6. Likewise, during backward propagation we only consider the errors of projects that the current developer has scored, and ignore the errors of unscored items. To this end we introduce the index matrix $I$:

$$I_{i,j} = \begin{cases} 0 & \text{if } R_{ij} = 0 \\ 1 & \text{if } R_{ij} \neq 0 \end{cases} \quad (14)$$

In our model, we assume that the value of a single hidden layer approximates the $U$ or $V$ matrix. If an encoder has $L$ layers in total, the matrices $U$ and $V$ are each taken from the layer-$L/2$ representation of the corresponding auto-encoder, and the potential features of developers and open source projects are learned through the optimization objective in Equation (15).

4.2.2 The Model Optimization Phase

The model is trained with the BP algorithm, with the training at each level done by SGD [10]. Each training step consists of two processes: forward propagation and backward propagation. In forward propagation, the hidden layer $h$ and the output layer $\tilde{x}$ are calculated by Equation 11 and Equation 12, respectively. In backward propagation we calculate the stochastic gradient for each parameter. The partial derivatives of the objective function $J$ with respect to $W_1$, $b_1$, $W_2$, $b_2$, $W_1'$, $b_1'$, $W_2'$, and $b_2'$ are calculated as follows:

$$\frac{\partial J}{\partial W_1} = -2\left(V^T(I \circ (R - UV^T)) \cdot \delta_1\right) \cdot X^T - 2\alpha W_1' \cdot I\varepsilon_1 \cdot \delta_1 \cdot X^T + 2\gamma W_1$$

$$\frac{\partial J}{\partial b_1} = -2\left[V^T(I \circ (R - UV^T)) \cdot \delta_1\right]^T - 2\alpha\left[W_1' \cdot I\varepsilon_1 \cdot \delta_1\right]^T + 2\gamma b_1$$

$$\frac{\partial J}{\partial W_2} = -2\left(U^T(I \circ (R - UV^T))^T \cdot \delta_2\right) \cdot Y^T - 2\beta W_2' \cdot I\varepsilon_2 \cdot \delta_2 \cdot Y^T + 2\gamma W_2$$

$$\frac{\partial J}{\partial b_2} = -2\left[U^T(I \circ (R - UV^T))^T \cdot \delta_2\right]^T - 2\beta\left[W_2' \cdot I\varepsilon_2 \cdot \delta_2\right]^T + 2\gamma b_2$$

$$\frac{\partial J}{\partial W_1'} = -2\alpha\, I \circ \varepsilon_1 \cdot U^T + 2\gamma W_1'$$

$$\frac{\partial J}{\partial b_1'} = -2\left[\alpha\, I \circ \varepsilon_1\right]^T + 2\gamma b_1'$$

$$\frac{\partial J}{\partial W_2'} = -2\beta\, I \circ \varepsilon_2 \cdot V^T + 2\gamma W_2'$$

$$\frac{\partial J}{\partial b_2'} = -2\left[\beta\, I \circ \varepsilon_2\right]^T + 2\gamma b_2'$$

where:

$$\delta_1 = U \cdot (1 - U), \qquad \delta_2 = V \cdot (1 - V)$$

$$\varepsilon_1 = (X - \hat{X}) \cdot X \cdot (1 - \hat{X}), \qquad \varepsilon_2 = (Y - \hat{Y}) \cdot Y \cdot (1 - \hat{Y})$$

According to the partial derivatives calculated above, the weights and offsets are updated along the gradient direction:
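Hand-derived gradients like those above are easy to get wrong, and a standard safeguard is a finite-difference check. The sketch below compares the analytic gradient of a masked loss with its numerical estimate; for brevity it uses only the first term of the objective, $\|I \circ (R - UV^T)\|^2$, and all variable names are illustrative.

```python
import numpy as np

def loss(U, V, R, I):
    """First term of the objective: ||I o (R - U V^T)||^2."""
    E = I * (R - U @ V.T)
    return float(np.sum(E ** 2))

def grad_U(U, V, R, I):
    """Analytic gradient of the masked loss with respect to U."""
    return -2.0 * (I * (R - U @ V.T)) @ V

def numeric_grad_U(U, V, R, I, eps=1e-6):
    """Central finite differences, perturbing one entry of U at a time."""
    G = np.zeros_like(U)
    for idx in np.ndindex(U.shape):
        Up, Um = U.copy(), U.copy()
        Up[idx] += eps
        Um[idx] -= eps
        G[idx] = (loss(Up, V, R, I) - loss(Um, V, R, I)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
U, V = rng.random((3, 2)), rng.random((4, 2))
R = rng.random((3, 4))
I = (rng.random((3, 4)) > 0.5).astype(float)   # random observation mask
diff = np.max(np.abs(grad_U(U, V, R, I) - numeric_grad_U(U, V, R, I)))
# diff should be tiny: analytic and numerical gradients agree
```

The same check extends term by term to the full objective, which is useful here given how many coupled parameters ($W_1, b_1, \ldots, b_2'$) the derivation involves.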

$$W \leftarrow W - \eta \frac{\partial J}{\partial W}, \quad W \in \{W_1, W_2, W_1', W_2'\} \quad (16)$$



$$b \leftarrow b - \eta \frac{\partial J}{\partial b}, \quad b \in \{b_1, b_2, b_1', b_2'\} \quad (17)$$

where $\eta$ is the step size of gradient descent, i.e., the learning rate, which determines the speed of convergence. The specific process is described in Algorithm 1. The loss function is minimized by taking the partial derivatives with respect to the parameters to obtain the gradient vector of the corresponding features; the feature variables are updated along the gradient direction until the gradient is close to zero. If the obtained error $E$ falls below the threshold $e_{min}$, the training process stops. Using the feature vectors of the finally trained developers and projects, the open source projects in the test data are predicted and scored; the number of iterations and the RMSE value can also be obtained.

Algorithm 1 Auto-encoder learning algorithm optimized by SGD
Input: $x_i$, $y_i$, weight parameters $\alpha$, $\beta$, $\gamma$, feature vector dimension $\kappa$;
Output: Feature vectors $u$, $v$.
1: Initialize the weights and offsets $W_1$, $b_1$, $W_1'$, $b_1'$ and $W_2$, $b_2$, $W_2'$, $b_2'$ of the two auto-encoders respectively;
2: // unsupervised learning section
3: while E
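The training loop of Algorithm 1 (truncated above) can be sketched for a single masked auto-encoder. This is a simplified stand-in for the full two-encoder model: the learning rate, network sizes, threshold value, and toy data are illustrative assumptions, and plain NumPy gradient descent replaces the paper's full derivative formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(R, k=2, eta=0.1, e_min=1e-3, max_epochs=5000, seed=0):
    """Gradient-descent training of one masked auto-encoder on scoring matrix R."""
    m, n = R.shape
    I = (R != 0).astype(float)                        # index matrix, Equation 14
    rng = np.random.default_rng(seed)
    W, b = rng.normal(0, 0.1, (k, n)), np.zeros(k)    # encoder parameters
    W2, b2 = rng.normal(0, 0.1, (n, k)), np.zeros(n)  # decoder parameters
    err = float("inf")
    for _ in range(max_epochs):
        H = sigmoid(R @ W.T + b)                # hidden layer, Equation 11
        X_hat = sigmoid(H @ W2.T + b2)          # reconstruction, Equation 12
        E = I * (X_hat - R)                     # error only on observed entries
        err = float(np.sum(E ** 2))
        if err < e_min:                         # stop once error drops below e_min
            break
        # backward propagation through the sigmoid units
        d_out = 2 * E * X_hat * (1 - X_hat)
        d_hid = (d_out @ W2) * H * (1 - H)
        W2 -= eta * d_out.T @ H
        b2 -= eta * d_out.sum(axis=0)
        W -= eta * d_hid.T @ R
        b -= eta * d_hid.sum(axis=0)
    return H, err

# Toy normalized scoring matrix (values in (0,1) as after Equation 10; 0 = missing).
R = np.array([[0.9, 0.0, 0.3],
              [0.0, 0.8, 0.1],
              [0.2, 0.0, 0.7]])
H, err = train_autoencoder(R)   # H holds the latent developer factors
```

In the full FunkR-pDAE model, a second auto-encoder is trained on the columns of R in the same fashion, and the rating-reconstruction and regularization terms of Equation 15 are added to the loss; the masking pattern shown here carries over unchanged.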