Multilayer Perceptron Discrimination of Software Quality in a Biomedical Data Analysis System

Mark D. Alexiuk, Nicolino J. Pizzi
Institute for Biodiagnostics, National Research Council
Winnipeg MB Canada R3B 1Y6
[email protected]
Abstract
Biomedical data analysis typically involves large data sets, a diverse user base, and intensive visualization procedures. These features place stringent quality demands upon software development and performance for both the architect and supervisor. An invaluable tool for software supervisors would be the automatic qualitative grading of software objects in terms of their extensibility, reusability, clarity, and efficiency. This paper examines the quality assessment of software objects by a multilayer perceptron in relation to a gold standard provided by two independent software architects.

Keywords: Artificial neural networks; parallel programming; software metrics.
1. INTRODUCTION
Object-oriented (OO) software systems are used extensively for biomedical data analysis. One goal of software supervisors is to leverage existing code and developer experience through the application of OO principles, namely data abstraction, dynamic binding, inheritance, and polymorphism. Supervisors are also interested in the early detection of weak or error-prone software objects [1][2]. Software metrics [3], mappings from source code to numerical values, are used to quantify the characteristics of software objects. Thus, instead of examining pages of source code, the supervisor need only evaluate a vector of metrics in order to assign a quality grade. This is a classification problem: the determination of a mapping from a sample to its class label. After an initial quality assessment by a supervisor, the labeled data set can be used to train a classifier that discriminates between future software objects.

The paper is organized as follows. In the second section we describe the software that will be analyzed and discuss the metrics used to quantify its characteristics.
The third section discusses the use of neural networks for classification and their potential for parallelization. Section four describes the software implementation of this analysis. We then discuss the results and planned future research.
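Before proceeding, note that each software object is represented for classifier training as a labeled vector of metrics. The following Java sketch shows one hypothetical encoding; the class and field names are illustrative and are not part of the EvIdent code base.

public final class MetricSampleDemo {
    // Hypothetical encoding of one software object: an 18-element metric
    // vector plus an architect-assigned quality grade (illustrative only).
    record MetricSample(double[] metrics, int qualityGrade) { }

    public static void main(String[] args) {
        double[] metrics = new double[18]; // e.g., LC, TOK, DEC, WND, ...
        MetricSample sample = new MetricSample(metrics, 2); // 2 = medium (our encoding)
        System.out.println(sample.qualityGrade());
    }
}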
2. DATA
EvIdent® (EVent IDENTification) [4] is a model-free, GUI-based, 3D analysis package for biomedical data that locates regions of activation and detects artifacts and trends. This user-friendly, algorithm-rich, exploratory data analysis software provides a wide variety of both Java and C++ object types for analysis. Typically, the Java objects are user-interface oriented while the C++ objects implement numerical algorithms (Fig. 1). The quality of EvIdent® software components has been graded independently by two software architects, and quantitative software features were compiled at several levels of granularity. An aggregate class label is assigned to each sample as the ceiling of the average of the architect-assigned labels. This mapping labeled the 312 software objects as having low (26), medium (132), or high (154) quality.

Figure 1. EvIdent® screenshot
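To make the labeling rule concrete, the following minimal Java sketch applies the ceiling-of-average aggregation; the integer encoding of the grades (1 = low, 2 = medium, 3 = high) is our assumption and is not stated in the paper.

public final class QualityLabel {
    // Aggregate label: ceiling of the average of the two architect grades.
    // Grades are assumed to be encoded as 1 = low, 2 = medium, 3 = high.
    static int aggregate(int gradeA, int gradeB) {
        return (gradeA + gradeB + 1) / 2; // integer ceil((a + b) / 2) for positive grades
    }

    public static void main(String[] args) {
        System.out.println(aggregate(1, 2)); // low and medium aggregate to medium (2)
        System.out.println(aggregate(2, 3)); // medium and high aggregate to high (3)
    }
}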
2.1 Software Metrics
The initial data set was composed of 18 software metrics per object (Table 1). The software objects belong to one of three functional classes: GUI objects, Data Model objects, or Utility objects. Utility objects include functionality for I/O, the algorithm framework, and native interface calls. There were 216 GUI, 51 Data Model, and 45 Utility objects. API designates whether the object is associated with the application programming interface (150 no, 162 yes). COOK refers to the number of programmers who have contributed to an object. NM, IC, KIDS, SIBS, and FACE are the respective numbers of methods, inner classes, children, siblings, and implemented interfaces for each object. DI is the object's depth of inheritance (including interfaces). The features rCC and rCR are the respective ratios of code lines to comment lines and of overloaded inherited methods to those that are not. CBO is a measure of inter-object coupling: a count of the number of other objects to which the corresponding object is coupled. RFO is a measure of the response set for the object, that is, the number of methods that can be executed in response to a message received by the object. LCOM, lack of cohesion in methods, reveals the disparate nature of an object's methods. If I_j is the set of instance variables used by method j, and the sets P and Q are defined as

P = \{ (I_i, I_j) \mid I_i \cap I_j = \emptyset \}, \quad Q = \{ (I_i, I_j) \mid I_i \cap I_j \neq \emptyset \}    (1)

then LCOM = card(P) - card(Q) if card(P) > card(Q), and LCOM = 0 otherwise. LC, TOK, and DEC are the number of lines of code, tokens, and decisions, respectively. WND is the weighted number of nested decisions, where the weighting is proportional to nesting depth. Following a previous study [1], we use statistical variants of the last four metrics and denote the resulting data sets as Mean, Median, and Median2. We also include the totality of original and derived metrics as All.
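The LCOM computation in Eq. (1) can be sketched in a few lines of Java; representing each method by the set of instance-variable names it uses is our assumed encoding, not a detail given in the paper.

import java.util.Collections;
import java.util.List;
import java.util.Set;

public final class Lcom {
    // LCOM per Eq. (1): count method pairs with disjoint instance-variable
    // sets (P) versus pairs sharing at least one variable (Q), then take
    // card(P) - card(Q), floored at zero.
    static int lcom(List<Set<String>> methodVars) {
        int p = 0, q = 0;
        for (int i = 0; i < methodVars.size(); i++) {
            for (int j = i + 1; j < methodVars.size(); j++) {
                if (Collections.disjoint(methodVars.get(i), methodVars.get(j))) {
                    p++;
                } else {
                    q++;
                }
            }
        }
        return (p > q) ? p - q : 0;
    }

    public static void main(String[] args) {
        // Three methods: the first two share "x"; the third uses only "z".
        // Pairs: (1,2) share -> Q; (1,3) and (2,3) disjoint -> P. LCOM = 2 - 1 = 1.
        System.out.println(lcom(List.of(Set.of("x", "y"), Set.of("x"), Set.of("z"))));
    }
}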
Table 1. Software metrics

API    association with the application programming interface
NM     number of methods
DEC    number of decisions
DI     depth of inheritance (including interfaces)
FACE   number of implemented interfaces
CBO    coupling between objects (inter-object coupling)
TYPE   functional class (GUI, Data Model, Utility)
LC     lines of code
WND    weighted number of nested decisions
KIDS   number of children
rCC    ratio of code lines to comment lines
LCOM   lack of cohesion in methods
COOK   number of contributing programmers
TOK    number of tokens
IC     number of inner classes
SIBS   number of siblings
rCR    ratio of overloaded inherited methods to those not overloaded
RFO    response set for the object
2.2 Performance Measure
When a data set has classes with disparate numbers of samples, a classifier may achieve a high classification success rate simply by assigning all or most samples to the most populous class. One way to account for unequal class sizes is to use a kappa score [5], a chance-corrected performance measure. The kappa score for a confusion matrix cm with N samples is

\kappa = \frac{P_o - P_c}{1 - P_c}    (2)

where P_o is the proportion of correctly classified samples (the sum of the diagonal entries of cm divided by N) and

P_c = \frac{1}{N^2} \sum_i \Bigl( \sum_j cm_{ij} \Bigr) \Bigl( \sum_j cm_{ji} \Bigr)    (3)

i.e., P_c is an averaged product of the marginals. A kappa score greater than zero indicates agreement better than chance, a kappa score of zero indicates agreement no better than chance, and a negative kappa score indicates systematic disagreement. A de facto standard confidence measure for kappa scores [6] is listed in Table 2.
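A minimal Java sketch of the kappa computation in Eqs. (2) and (3) follows; the confusion matrix values in the example are illustrative only and are not results from this study.

public final class Kappa {
    // Chance-corrected kappa score, Eqs. (2) and (3), for a square
    // confusion matrix cm whose entry cm[i][j] counts samples of class i
    // assigned to class j.
    static double kappa(int[][] cm) {
        int k = cm.length;
        double n = 0.0, diag = 0.0;
        double[] rowSum = new double[k];
        double[] colSum = new double[k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                n += cm[i][j];
                rowSum[i] += cm[i][j];
                colSum[j] += cm[i][j];
            }
            diag += cm[i][i];
        }
        double po = diag / n;      // observed agreement, P_o
        double pc = 0.0;           // chance agreement, P_c, Eq. (3)
        for (int i = 0; i < k; i++) {
            pc += rowSum[i] * colSum[i];
        }
        pc /= n * n;
        return (po - pc) / (1.0 - pc); // Eq. (2)
    }

    public static void main(String[] args) {
        // Illustrative three-class confusion matrix (low, medium, high).
        int[][] cm = { {10, 8, 8}, {5, 90, 37}, {3, 30, 121} };
        System.out.printf("kappa = %.3f%n", kappa(cm));
    }
}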
Table 2. Confidence measure for the kappa statistic

kappa statistic        Strength of agreement
\kappa \le 0           poor
0 < \kappa \le 0.2     slight
0.2 < \kappa \le 0.4   fair
0.4 < \kappa \le 0.6   moderate
0.6 < \kappa \le 0.8   substantial
0.8 < \kappa \le 1.0   almost perfect