University of Botswana
Faculty of Science Department of Computer Science
A Dynamic Graph-based Visualization for Spreadsheets
By
Bennett Freinderson Kankuzi
Student ID: 200509238
A dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Computer Science
Supervised By
Dr. Yirsaw Ayalew
June 2008
Dedication
I would like to dedicate this work to my parents: my late father, Mr. Freinderson Ishmael Kankuzi and my mother, Mrs Ireen NyaMayuni Kankuzi.
Approval
This dissertation has been examined as meeting the requirements for the partial fulfillment of Master of Science Degree in Computer Science.
—————————– Supervisor
———————– Date
—————————– Internal Examiner
———————– Date
—————————– External Examiner
———————– Date
—————————– Head of Department
———————– Date
—————————– Dean, School of Graduate Studies
———————– Date
Acknowledgements
Firstly, I would like to thank God for giving me strength and courage in the course of carrying out this work! A big thank you also goes to my supervisor, Dr Yirsaw Ayalew, who tirelessly guided me in the course of this work. I also thank Dr Ayalew for introducing me to academic research as well as to the exciting world of spreadsheet research. My thanks also go to Dr Stephen Kobourov of the University of Arizona, who co-supervised me in the initial stages of this work and provided me with the open-source code of the Graphael graph drawing software.
My heartfelt thanks also go to Mr. Y. Alide and Dr. P.C. Chamdimba, both from the University of Malawi, for all the encouragement and support. May God richly bless you. I would also like to thank all friends and relatives who gave me support in the course of the work.
Finally, I also thank God for the ‘insights’ in the course of this work such that a number of research papers have been published out of this research work.
This document has been produced with TeXnicCenter, a free and open-source software for the LaTeX typesetting system. I am also grateful to its developers.
Declaration
I hereby declare that this is my original work, except where due reference is made, and that this dissertation has not been submitted for any degree award in any other university.
Signed: ———————————– Bennett Freinderson Kankuzi (STUDENT)
Abstract
Spreadsheet systems are widely used and highly popular end-user systems. They are popular because of the simplicity with which one can create spreadsheets. However, despite this simplicity in creating spreadsheets, they are generally difficult to understand and comprehend. The need for understanding spreadsheets arises when one wants to debug a spreadsheet, as well as when one wants to maintain or even just to understand a spreadsheet created by others. One contributing factor to the difficulty in understanding spreadsheets is the invisibility of the data dependencies which are associated with cell formulas.
This research work aims to provide a graph-based visualization approach, based on the MCL (Markov Clustering) algorithm, that can simplify the understanding and debugging of spreadsheets. The MCL algorithm helps in visualizing spreadsheet data-flow graphs by generating clusters of cells. Navigation through graph clusters is provided through the complementary techniques of compound fisheye views and treemaps. More importantly, our experiments show that graph-based visualization using the MCL algorithm generates clusters which match the corresponding logical areas of a spreadsheet. Identified MCL clusters are then dynamically highlighted in the original spreadsheet using different cell background colours. Hence, instead of looking at the whole spreadsheet at once, the user focusses his/her attention on one highlighted logical area at a time. The spreadsheet comprehension process is therefore properly guided, since the focus area matches what the user might perceive to be a logical unit.
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
  1.1 Background
    1.1.1 The end-user programming paradigm
    1.1.2 Challenges in end-user programming
    1.1.3 Popularity of spreadsheet systems
    1.1.4 Importance of spreadsheets
    1.1.5 Impact of errors in spreadsheets
    1.1.6 Classification of errors in spreadsheets
  1.2 Statement of the Problem
  1.3 Objectives of our research
  1.4 Research Methodology
  1.5 Overview of the rest of the Dissertation
2 Related Work
  2.1 Introduction
  2.2 Spreadsheet error prevention techniques
  2.3 Spreadsheet error detection techniques
  2.4 Spreadsheet visualization techniques
  2.5 Summary
3 Graph-based Visualization
  3.1 Introduction
  3.2 The need for graph clustering
  3.3 An overview of clustering algorithms
    3.3.1 Optimization algorithms
    3.3.2 Construction algorithms
    3.3.3 Hierarchical algorithms
    3.3.4 Graph theoretical algorithms
  3.4 Choice of clustering algorithm
    3.4.1 An overview of the MCL algorithm
  3.5 Choice of graph drawing software
    3.5.1 Experiments with the ZGRViewer graph drawing software
    3.5.2 Experiments with the Graphael graph drawing software
4 The MCL Algorithm and Logical Areas in Spreadsheets
  4.1 Introduction
  4.2 Generating spreadsheet data-flow graphs using Graphael
  4.3 Determining the inflation operator for the MCL algorithm
    4.3.1 Discussion of experiment results
  4.4 Testing the efficacy of the MCL algorithm on more spreadsheets
    4.4.1 Discussion of experiment results
5 Comprehending and Debugging Spreadsheets Using MCL Clusters
  5.1 Introduction
  5.2 Analysis of the Project Accounting spreadsheet
    5.2.1 Verification of MCL clusters for the Project Accounting spreadsheet
  5.3 Analysis of the IPO spreadsheet
    5.3.1 Verification of MCL clusters for the IPO spreadsheet
  5.4 Summary of experiment results
6 Implementation
  6.1 Introduction
  6.2 Software architecture of the visualization tool
  6.3 Summary
7 Discussion
  7.1 Introduction
  7.2 Spreadsheet understanding and comprehension
  7.3 The spreadsheet debugging process
  7.4 Spreadsheet maintenance
  7.5 Addressing HCI aspects
  7.6 Summary
8 Conclusion
  8.1 A summary of the research work
  8.2 Our contribution
  8.3 Limitations
  8.4 Future work
Bibliography
Glossary
Appendix A
Appendix B
List of Figures
1.1 An illustration of different views of a spreadsheet by Igarashi et al. [34]
2.1 A Microsoft Excel spreadsheet with data-flow graph arrows. Sourced from [53].
2.2 Formula view of the Microsoft Excel spreadsheet depicted in Fig. 2.1.
2.3 A spreadsheet with its corresponding online data dependency diagram
2.4 An animated presentation of fluid-like flow of data in a spreadsheet by Igarashi et al.
2.5 A screenshot of the S2 visualization by Sajaniemi.
2.6 A formula view of the spreadsheet given in Fig. 2.5.
2.7 A spreadsheet with highlighted logical areas (equivalence classes) by Clermont et al.
2.8 A data-flow graph of semantic classes as proposed by Clermont et al.
2.9 A sample spreadsheet data-flow graph by Ballinger et al.
2.10 Hyperbolic view of a spreadsheet data-flow graph by Ballinger et al.
3.1 A sample Project Accounting spreadsheet. Adapted from [38].
3.2 The formula view of the Project Accounting spreadsheet
3.3 A data-flow graph of the given Project Accounting spreadsheet generated by the Graphael graph drawing software.
3.4 An example MCL cluster separation process from van Dongen [61].
3.5 A screenshot of the ZGRViewer graph drawing software displaying an unzoomed data-flow graph of a spreadsheet.
3.6 A screenshot of a zoomed-in spreadsheet data-flow graph in ZGRViewer.
4.1 The sample Project Accounting spreadsheet
4.2 Formula view of the Project Accounting spreadsheet.
4.3 An illustration of a cluster tree
4.4 A top-most level view of the cluster tree of the Project Accounting spreadsheet data-flow graph as displayed using Graphael.
4.5 Second level view of the cluster tree.
4.6 An MCL cluster containing cells D6, F6, G6, H6
4.7 An MCL cluster containing cells F10, G10 and H10
4.8 Treemap and cluster tree with Γ = 1.1
4.9 Treemap and cluster tree with Γ = 1.5
4.10 Treemap and cluster tree with Γ = 2.0
4.11 Treemap and cluster tree with Γ = 2.5
4.12 Treemap and cluster tree with Γ = 3.0
4.13 Treemap and cluster tree with Γ = 5.0
4.14 Treemap and cluster tree with Γ = 7.0
4.15 The Project Accounting spreadsheet showing highlighted MCL clusters (when Γ = 2)
4.16 The formula view of the Project Accounting spreadsheet with highlighted MCL clusters (when Γ = 2)
4.17 The Consolidated Balance Sheet spreadsheet from the EUSES spreadsheet corpus [25]
4.18 The formula view of the Consolidated Balance Sheet spreadsheet
4.19 A treemap and cluster tree for the Consolidated Balance Sheet depicting a cluster with cell members F34, F35, F36, F37, F38, F39 and F40
4.20 The Consolidated Balance Sheet with highlighted (shaded) MCL clusters
4.21 Formula view of the Consolidated Balance Sheet with highlighted (shaded) MCL clusters
5.1 The Project Accounting spreadsheet
5.2 The formula view of the Project Accounting spreadsheet.
5.3 Microsoft Excel displays an error message for a cell in MCL cluster number 5 in the Project Accounting spreadsheet.
5.4 A sample IPO spreadsheet sourced from Ray Panko’s spreadsheet research website [43]
5.5 The IPO spreadsheet with highlighted MCL clusters.
5.6 The formula view of the IPO spreadsheet
5.7 IPO spreadsheet with a Microsoft Excel warning message
6.1 Conceptual architecture of the spreadsheet visualization tool
6.2 A screenshot of the prototype for the visualization with a “Balance Sheet” spreadsheet, a cluster window (top-right window) and a treemap window (bottom-right window).
6.3 A screenshot of the prototype showing the formula view of the “Balance Sheet” spreadsheet.
6.4 A screenshot of the prototype showing the “Balance Sheet” spreadsheet with highlighted logical areas.
List of Tables
4.1 MCL clusters for the Project Accounting spreadsheet with Γ = 1.1
4.2 MCL clusters for the Project Accounting spreadsheet with Γ = 1.5
4.3 MCL clusters for the Project Accounting spreadsheet with Γ = 2.0
4.4 MCL clusters for the Consolidated Balance Sheet spreadsheet
5.1 MCL clusters for the Project Accounting spreadsheet
5.2 MCL clusters for the IPO spreadsheet given in Fig. 5.4
List of Algorithms
1 The basic MCL algorithm
2 The algorithm for the spreadsheet parser module
3 The algorithm for the spreadsheet highlighter module
Chapter 1
Introduction
1.1 Background
1.1.1 The end-user programming paradigm
Computer end-users may be defined as people for whom conventional computer programming is not their main job although they use computers as part of their daily lives [10]. However, it is now commonplace to see computer end-users (hereafter referred to as end-users) being involved in some form of “programming”. End-users are “programming” applications such as spreadsheets, databases, animations, web applications and simulations, to mention but a few. Although end-users are not professional programmers, they might be experts in their professional domains. Some of these end-users are educators, scientists, engineers and business professionals, and many more belong to other professions.
End-users may be motivated to do some “programming” because they want to use a computer to accomplish a particular goal. For example, a teacher may create a spreadsheet for recording student grades for a particular course. End-users may also do some “programming” because this might be an efficient way of solving a problem in comparison to solving it manually. For example, a mathematician may write some program code using a mathematical software application to find a solution to a complex differential equation. In all these cases, their main goal would be accomplishing the task at hand rather than producing high-quality, dependable program code [37]. Pre-packaged software applications may not be suitable in these situations because they cannot do every task required by an individual and, worse still, cannot be customized to every individual’s needs [40]. This need has led to the birth of the end-user programming paradigm.
The rising popularity of the end-user programming paradigm can be attributed to the tools that have been developed to empower this kind of computer user. For example, the development of the spreadsheet paradigm has led many users to develop their own spreadsheets, and hence to do some “programming”. An end-user programming environment provides the tools for an end-user to accomplish a task at hand. Examples of end-user programming environments include spreadsheet systems, web authoring applications and animation environments. Ideally, an end-user programming environment should possess the following characteristics [57]:
• It should provide rapid feedback to the user as he/she works in the environment
• It should provide a framework for end-users to easily and smoothly externalize the problem-solving knowledge in their minds into a computer-readable form
• It should have the capability to interpret a problem as described by the user (conceptual model) and then generate an equivalent runnable problem-solving model without violating the intended computational semantics
It was estimated that, in the year 2005, there were 55 million end-user programmers in the United States alone. This was about 20 times greater than the estimated number of professional programmers [14, 55, 67]. These estimates clearly indicate that a sizable share of the software produced worldwide is developed by non-professional programmers. These end-user programmers write programs not as their primary job function but rather to support their main goal, such as accounting, doing office work or developing a web page [40].
1.1.2 Challenges in end-user programming
Despite the huge popularity of end-user programming, programs developed by end-users are very prone to errors. This is because the programs are not developed according to software engineering principles, as is the case with software developed by professional software developers. Many end-user developers would not want to get involved in the details of coding in a particular programming language, let alone learn its formal syntax and semantics. In fact, learning programming language syntax has been identified as one of the significant learning barriers in end-user programming environments [37]. Other learning barriers in end-user programming environments include [37]:
• Design barriers: the end-user programmer might not know what he/she wants the computer to do in order to solve a problem
• Selection barriers: the end-user programmer might know what he/she wants the computer to do but does not know how to choose an appropriate tool for the task
• Coordination barriers: the end-user programmer might know the appropriate tools for a particular task but does not know how to make the tools work together in order to solve the problem at hand
• Use barriers: the end-user programmer might know what tools to use for a particular task but does not know how to use those tools
• Understanding barriers: the end-user programmer might think that he/she knows how to use a particular tool but the tool does not do what he/she expects
• Information barriers: the end-user programmer might think that he/she knows why a tool behaved in an unexpected or problematic manner but might not have the knowledge needed to check the problem
Another major challenge in end-user programming is the reality that end-user needs vary so widely that one cannot come up with general design tools and languages that fit every end-user programmer’s needs [37]. It is also a major challenge to make users understand the importance of the programs they develop [37]. This is particularly true for non-trivial programs that have long life spans and might therefore need long-term maintenance. A case in point is spreadsheets that are not simple throw-away calculations but evolve continuously as part of a business enterprise’s reporting function. The question is therefore how to develop tools that can capture the evolving program’s history and design [37].
In trying to address some of the challenges outlined above, some researchers have advocated that research on end-user programming should focus on development environments which can help end-users achieve their goals through the use of metaphors such as forms and spreadsheets [56]. Coupled with the statistics on the number of end-user programmers, it is easy to see the need for more research in end-user programming.
1.1.3 Popularity of spreadsheet systems
Spreadsheet systems have become a common end-user programming environment used for trivial as well as non-trivial applications in private and public enterprises [9, 45, 67]. They are used for a variety of important tasks such as mathematical modelling, scientific computations, tabular and graphical data representation, data analysis and decision making. The provision of computational techniques that match users’ tasks makes spreadsheet programming easier. There is also a trend towards using the spreadsheet model as a general model for end-user programming [41]. Spreadsheet systems are widely used by end-user programmers not only due to their simplicity but also due to their features which facilitate programming. The suppression of low-level details of programming, the immediate visual feedback and the availability of high-level task-specific functions are among the most commonly cited features [35].
Spreadsheet systems allow computations to be defined by cells and their formulas. A cell’s value is defined solely by the formula explicitly given to it by the user [11]. A cell value is recalculated automatically whenever a value on which it depends (a reference) changes, thus providing immediate feedback. Spreadsheet systems also provide for copying of contiguous regions of cells from one physical area to another. References between cells may be either absolute or relative in their horizontal index, their vertical index, or both. All copies of an absolute reference refer to the same row, column or cell, whereas a relative reference refers to a cell at a given offset from the current cell.
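The difference between absolute and relative references when a formula is copied can be illustrated with a minimal Python sketch. It assumes A1-style references in which a `$` prefix marks an absolute column or row, as in Microsoft Excel; the function names `col_to_index`, `index_to_col` and `copy_formula` are hypothetical and are not part of any spreadsheet system. The sketch does not distinguish cell references from function names that happen to look like references (for example `LOG10`), and it does not guard against shifting a reference above row 1.

```python
import re

# A1-style reference: optional '$' (absolute marker), column letters,
# optional '$', row digits. The '$' convention is an assumption of this sketch.
REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def col_to_index(col: str) -> int:
    """Convert column letters to a 1-based index ('A' -> 1, 'AA' -> 27)."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def index_to_col(index: int) -> str:
    """Convert a 1-based column index back to letters (27 -> 'AA')."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def copy_formula(formula: str, d_rows: int, d_cols: int) -> str:
    """Rewrite a formula as if its cell were copied d_rows down and d_cols right."""

    def shift(match: re.Match) -> str:
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:  # relative column: keep the offset from the current cell
            col = index_to_col(col_to_index(col) + d_cols)
        if not row_abs:  # relative row: keep the offset from the current cell
            row = str(int(row) + d_rows)
        return f"{col_abs}{col}{row_abs}{row}"

    return REF.sub(shift, formula)


# Copying one row down: the relative reference B2 moves, the absolute $C$1 does not.
copied = copy_formula("=B2*$C$1", d_rows=1, d_cols=0)
```

Here `copied` is `"=B3*$C$1"`: every copy keeps pointing at the same cell `C1`, while the relative reference follows the formula to its new position.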
Spreadsheet systems are an example of a functional programming language environment. In functional programming, computations are specified by providing arguments to functions and/or operators [11]. However, spreadsheet systems differ from traditional functional programming language environments in mainly two ways [11]:
• Spreadsheets are usually associated with first-order functions only, whereas traditional functional programming languages also support higher-order functions. In a first-order function, the arguments are objects such as numbers and class objects, but not functions themselves.
• There is continuous program evaluation in spreadsheet systems, which is necessary to provide immediate feedback to the user.
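The functional view of a spreadsheet can be sketched in a few lines of Python. The sketch below is only an illustration under simplifying assumptions: the class `Sheet` and its methods are hypothetical, each formula cell is modelled as a list of referenced cells plus a first-order function of their values, and "continuous evaluation" is approximated by re-evaluating formulas on every read. A real spreadsheet system would instead propagate changes to dependent cells and would detect circular references, which this sketch does not.

```python
from typing import Callable, Dict, Tuple, Union

# A formula cell is modelled as (referenced cell names, first-order function).
Formula = Tuple[Tuple[str, ...], Callable[..., float]]
Content = Union[float, Formula]


class Sheet:
    """A toy spreadsheet: every read re-evaluates the formulas it depends on."""

    def __init__(self) -> None:
        self.cells: Dict[str, Content] = {}

    def set(self, name: str, content: Content) -> None:
        """Store a constant or a formula in a cell."""
        self.cells[name] = content

    def value(self, name: str) -> float:
        """Evaluate a cell by applying its function to the values it references."""
        content = self.cells[name]
        if isinstance(content, tuple):
            refs, func = content
            return func(*(self.value(ref) for ref in refs))
        return content


sheet = Sheet()
sheet.set("A1", 10.0)
sheet.set("A2", 20.0)
sheet.set("A3", (("A1", "A2"), lambda a, b: a + b))  # corresponds to =A1+A2
sheet.set("A4", (("A3",), lambda total: total * 2))  # corresponds to =A3*2
before = sheet.value("A4")
sheet.set("A1", 5.0)  # changing an input is reflected on the next read
after = sheet.value("A4")
```

The functions passed to the cells take only numbers as arguments, which matches the first-order restriction described above; the change to `A1` reaches `A4` without any explicit "run" step.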
The spreadsheet paradigm also differs from the procedural programming paradigm in several ways:
• Spreadsheet programs are modeless in the sense that they do not require the user to separately code, compile, link, and execute the spreadsheet program as is the case with procedural programs [52].
• Spreadsheet programs provide immediate feedback to the user. For example, when a formula for a particular cell changes, the results are immediately reflected [52].
• The structure of a spreadsheet program is usually represented in a two-dimensional tabular layout while the code for procedural programs is represented in a linear fashion [6].
• From the point of view of a user, a spreadsheet program does not have clear separation between input, computational code and output. This is not the case with procedural programs [6].
1.1.4 Importance of spreadsheets
Although some spreadsheet programs (hereafter, “spreadsheet program” is used synonymously with “spreadsheet”) are simple throw-away scratch-pad calculations, many spreadsheets have been quite useful for business as well as personal endeavours [67]. There are some large, periodically used spreadsheets that are subjected to regular update cycles like any conventionally evolving application software [16]. This shows that end-user programming, with spreadsheet programming as an example, cannot be regarded as a trivial subject.
Panko [45] observed that in a previous study, 46 percent of the non-trivial spreadsheets examined were rated as important or very important to the surveyed organization. Panko also noted that another study found that information generated from spreadsheets is also used in high-level decision-making offices in business enterprises. This shows how critical non-trivial spreadsheets can be to the running of a business enterprise. Errors in spreadsheets may therefore lead to erroneous decisions.
Spreadsheets have also been used in science and engineering disciplines such as physics and chemistry, just to mention a few, because they are more usable than procedural programs [16]. Another reason for spreadsheet usage in science and engineering is the fact that spreadsheets already incorporate a way of displaying graphs and this can be very useful in displaying results of scientific experiments.
1.1.5 Impact of errors in spreadsheets
Errors in spreadsheet programs are non-trivial and costly [27, 45]. Despite this observation, there has not been quantitative data on the impact of spreadsheet errors. However, the European Spreadsheet Risks Interest Group (EuSpRIG) publishes on its web page, http://www.eusprig.org/stories.htm, verified stories on how errors in spreadsheet programs have impacted on public as well as private enterprises.
For example, it is documented on the website that in 2004, some city officials, in one of the cities in the United States, miscalculated the amount of sales taxes generated at one of the city’s parks during the first couple of months of its operation. The mistake inflated the figures by tens of thousands of dollars, which in turn meant the total sales estimates were overblown by millions of dollars. The mistake was attributed to an error in a spreadsheet formula which amplified a subtotal amount.
It is also documented that some candidates for police officer jobs were told that they had passed an admission test when in fact they had failed. The reason for this mishap was that the spreadsheet which the examiners had used to record the scores was sorted improperly.
It is also documented that mis-stated earnings caused the stock price of an online retailer to fall by 25 percent in a day, and the Chief Executive Officer had to resign. Again a spreadsheet error was to blame: a single erroneous numerical input in a spreadsheet was the cause of the mis-statement. These are just some of the stories that underscore the fact that spreadsheet errors are non-trivial and costly.
1.1.6 Classification of errors in spreadsheets
Data from spreadsheet field studies and laboratory experiments indicate that errors in spreadsheets are inevitable. Panko [45] has tabulated data indicating error rates in spreadsheets as reported by the authors of the various field audits and laboratory experiments. The most important result of these studies is that spreadsheet error rates are high enough to tell us that most non-trivial spreadsheets will contain errors.
Several classification schemes have been identified to categorize these errors depending on the context in which a researcher is doing the analysis [7]. Panko [45] identified three categories of spreadsheet errors namely mechanical, logical and omission errors. Mechanical errors are simple slips such as mistyping a number or pointing to a wrong cell when entering a formula. Logical errors are defined as errors that occur when a spreadsheet developer has a wrong algorithm for a particular formula cell. On the other hand, omission errors are defined as errors that occur when a spreadsheet developer does not have complete understanding of the problem at hand and therefore produces an incomplete spreadsheet model of the problem. Hence omission errors are introduced due to faulty reasoning.
Another general classification scheme used by Panko [45] categorizes spreadsheet errors as quantitative errors and qualitative errors. A quantitative error is defined as an error that produces an incorrect value in at least one bottom-line variable in a spreadsheet. On the other hand, qualitative errors emanate from factors such as poor spreadsheet design which may later cause problems in data entry or even lead to incorrect data modifications and hence generate quantitative errors. This scheme further categorizes quantitative errors into mechanical, omission and logical errors, which have already been defined in the preceding paragraph.
There is another spreadsheet error classification scheme proposed by Ayalew et al. [7]. Unlike the classification schemes given above, it categorizes errors not by their cause but by the spreadsheet concept with which the errors seem to be associated. It has three categories of errors, namely: physical area related errors, logical area related errors and general errors.
Physical area related errors are defined as those errors that normally deal with missing values in a physical area or values of the wrong type somewhere in the physical area. Errors of this kind lead to several side-effects, such as affecting the results if new values are added to the area. According to this classification scheme, physical area related errors include what are termed “reference to a blank cell/reference to a cell with value of wrong type” errors, “incorrect physical area specification” errors, “accidental deletion/addition of a cell within a physical area” errors and “physical area mix up” errors.
A logical area is defined as an area that represents some kind of cohesion between cells. It usually originates from copying from the same source multiple times. Examples of logical area errors include overwriting a formula with a constant value and having a formula copy misreference.
General errors have been defined as those errors that are not explicitly associated with a physical or logical area and are usually made during formula definition. An error might occur due to typographical errors or an inability to formulate the necessary mathematical expression for a formula. An error might also occur due to incorrect use of formats, which might affect the way a value is displayed.
1.2 Statement of the Problem
Despite the simplicity in creating spreadsheets, they are generally difficult to understand and comprehend [17]. The need to understand a spreadsheet may arise if one wants to debug a spreadsheet. It may also be necessary to understand a spreadsheet when one wants to maintain or even just to comprehend a spreadsheet created by others.
Most spreadsheets are created by end-users and they contain errors which the developers themselves may not easily notice [45]. Unfortunately, most spreadsheet errors are not trivial considering the fact that key decisions, for example in business firms, are based on information extracted from spreadsheets [27, 45]. Therefore, it is important to help spreadsheet developers expose these errors or even prevent them from occurring in spreadsheets.
Furthermore, non-trivial spreadsheets may need to be modified by people other than the spreadsheet developer himself/herself. Moreover, changes to the structure of the spreadsheet may be necessary, since spreadsheets may need to be maintained just as any conventionally evolving application software [16]. However, for one to make meaningful changes to the structure of a spreadsheet, he/she needs to understand the spreadsheet first. Spreadsheets normally come in a two-dimensional tabular arrangement of numeric values with some accompanying explanatory text. Usually this does not suffice for a third party to clearly comprehend and understand what the spreadsheet is all about.
A spreadsheet is usually perceived only as a two-dimensional grid of cells populated mainly with numerical values although every spreadsheet has a formula view as well as an underlying data-flow graph [34] (see illustration in Fig. 1.1). A data-flow graph represents the network structure of cell dependencies expressed by the references in the individual formulas. However, the data-flow graph is normally “hidden” from the spreadsheet developer. It is therefore not surprising that most spreadsheet developers view a spreadsheet as a word processor for numbers and not as the complex data-flow graph that a spreadsheet really is [15]. Despite this view, the key to understanding spreadsheets is to clarify the data dependencies among the cells [17]. In other words, visualizing and clarifying the inherent data-flow graph can help users understand a spreadsheet as well as aid in the spreadsheet debugging process. This is because human beings process and understand visual representations of data much faster and more effectively than numerical or textual representations of the same data [18].
Figure 1.1: An illustration of different views of a spreadsheet by Igarashi et al. [34]
Like the numerical view of a spreadsheet, the formula view also has some disadvantages. In the numerical view, the formulas which compute the values of cells are hidden, while the formula view hides the computed values: it is possible to see either the formulas or the values but not both at the same time. For a single cell, it is possible to see both at the same time but this does not give much information about the overall structure of the spreadsheet. In some cases, this locality to a single cell may help by narrowing the point of focus instead of dealing with the spreadsheet as a whole, but it also makes it difficult to get a sense of the general structure of the whole spreadsheet [32, 42]. As a result, it is difficult to identify where data comes from and where it goes unless one makes a detailed examination of the cell dependencies.
Therefore, it is against this background that this research work was embarked on with the aim of developing a tool for visualizing spreadsheet data-flow graphs that could help in solving problems of spreadsheet comprehension as well as spreadsheet debugging.
1.3 Objectives of our research
Our research work has four main objectives:
(i) We want to generate the data-flow graph of a given spreadsheet, with nodes representing cells in the spreadsheet and edges representing dependencies between cells, so that the visualization (the generated data-flow graph) is useful for spreadsheet understanding, debugging and maintenance. However, generating data-flow graphs leads to the problem of visualizing large graphs, since the number of nodes and edges in the generated graph normally becomes large, hence introducing problems of graph navigability and comprehension.
(ii) We would like to deal with the problem of visualizing large graphs through graph clustering. Clustering allows us to view a manageable subset of the data-flow graph at a time. Provision of proper navigation techniques through the generated clusters shall also be an important aspect of this work. More importantly, we would like to produce “meaningful” clusters, i.e. clusters that match the logical areas of the given spreadsheet. A logical area in a spreadsheet may be defined as a group of cells that, from the perspective of the spreadsheet creator/user, form a logical unit due to the semantics of the spreadsheet.
(iii) We would like to separate the graph-based visualization from the spreadsheet so as to avoid cluttering the spreadsheet display, as this introduces information overload. At the same time we would like to maintain the mapping between spreadsheet cells and graph nodes.
(iv) We would like to generate our visualization dynamically so that we are able to achieve real-time spreadsheet-visualization interactivity.
1.4 Research Methodology
We conduct our research work using a combination of research methodologies, namely experimentation, case study and prototyping.
Experimentation is a term without a universal definition [65]. Therefore we define it in the context of this research work. An experiment shall involve running a computer program multiple times while varying either program inputs or program parameters and observing the program outcomes. Based on the observed outcomes, we infer some system properties and characteristics. Observations in experiments are very important because they can lead to new, useful and unexpected insights that can also open new areas of investigation [59]. We use experimentation in this research work in different tasks such as:
• choice of suitable graph drawing software
• determination of the performance of our chosen graph clustering algorithm on different spreadsheets
• determination of suitable clustering parameters for our chosen graph clustering algorithm.
A case study is an empirical enquiry that allows one to investigate a contemporary phenomenon within its real-life context [58]. In software engineering, case studies are useful for the industrial evaluation of different software engineering tools and methods [58]. For example, different software tools may be evaluated on how suitable their features are for accomplishing a particular task. Hence to avoid bias and to ensure internal validity, a valid basis is identified to assess the results of the case study [58]. However, case studies have the disadvantage that their results may not be generalized easily [59]. In this work, we use spreadsheets sourced from the EUSES Spreadsheet Corpus [25] and the Spreadsheet Research website [43]. This is because we want to conduct our experiments on real-life spreadsheets. The referred sources are repositories of spreadsheets collected from different organizations and business firms.
Prototyping involves the assembly of a model of an unfinished software system. The features of a prototype portray the capabilities of a finished software system at a glance. Prototyping may also offer a demonstration that theoretical ideas can be put into a “real-life” software tool or product. In other words, prototypes provide proofs of concept and they may also provide incentives to study a research question further [59]. However, it is important to note that prototypes do not provide solid evidence supporting a theory or ideas [59].
In this research work, we assemble a prototype of the spreadsheet visualization tool using the Microsoft Excel spreadsheet system in conjunction with open-source, Java-based graph drawing software. Programming in the Microsoft Excel spreadsheet system is done in Visual Basic for Applications (VBA). We also modify the source code of the graph drawing software to suit the requirements of our application.
1.5 Overview of the rest of the Dissertation
The rest of the dissertation is organized as follows: Chapter 2 provides a review of related work by other researchers in this research area. Our graph-based approach to the research problem is introduced in Chapter 3. Our experiments with the MCL algorithm on spreadsheets using the Graphael graph drawing software are presented in Chapter 4. We demonstrate how clusters identified using the MCL algorithm can be used to comprehend and debug spreadsheets in Chapter 5.
A conceptual architecture of an implementation of the prototype of the visualization tool is presented in Chapter 6. A discussion of the results from this research work as well as a discussion of some issues that emerged from this research work is given in Chapter 7. We conclude this dissertation in Chapter 8 with a summary of our contribution in this research area, a presentation of limitations of our spreadsheet visualization technique as well as proposed future works.
Chapter 2 Related Work
2.1 Introduction
Considering the importance of spreadsheets, several research works have been undertaken to address the problem of quality in spreadsheets. Some research works focussed on error prevention techniques in spreadsheets while others focussed on error detection techniques in spreadsheets. Furthermore, other research works focussed on spreadsheet visualization techniques with the aim of improving error detection, debugging and general comprehension of spreadsheets. Other researchers focus on the application of principles of software engineering to spreadsheet development. This research direction is being embodied in a new and growing discipline known as end-user software engineering [13, 39, 51, 56]. Some of the research questions being tackled in this research direction include:
• How can software engineering life cycle models be used in spreadsheet development?
• How can improved programming practices such as teamwork and code inspection help in creating error-free spreadsheets? Some work in this area includes that of Panko and Sprague [44, 46], which explored the benefits of code inspection in spreadsheets. Vemula et al. [63] also researched groupwork in spreadsheet development and testing.
• How can tools and techniques be developed that help in testing, debugging and verification of spreadsheets to minimize risk from errors in spreadsheets? Some work in this research direction includes using assertions to help end-user programmers correct spreadsheet errors [12], fault tracing in spreadsheets using “interval testing” and slicing [8], and using type inference to identify programming errors in spreadsheets [5], to mention but a few.
2.2 Spreadsheet error prevention techniques
Several research endeavours have already reported on techniques that can be used to prevent errors from occurring in spreadsheets. The rationale for this research path is that it is easier to prevent errors in spreadsheets than to correct them.
Ronen et al. [49] proposed a structured approach to spreadsheet design as a way of preventing errors in spreadsheets. The basis of this proposal was that a lack of design methodology in spreadsheets brings in problems of reliability, auditability and modifiability of spreadsheets. They introduced spreadsheet flow diagrams (SFDs) as a way of structuring spreadsheets. Spreadsheet flow diagrams are similar to flow-chart diagrams for structured programming. They argued that spreadsheet flow diagrams would help the designer structure the spreadsheet solution model to a problem. Spreadsheet flow diagrams could also assist in communicating the structure of a spreadsheet model to others and they could also serve as a documentation tool when it is necessary to audit or modify the spreadsheet.
Some researchers have also proposed data control techniques as one way of preventing errors from occurring in spreadsheets (e.g. Panko [45]). Some proposed data control techniques include:
• protection of cells and worksheets from unauthorized use. For example, cell protection can allow users to change only pre-specified input cells so that if a user attempts to “hardwire” a formula cell, they will be prevented from doing so. In hardwiring a formula cell, a user moves the cursor to a formula cell and enters a number in the cell. This usually happens when a user does not realize that the cell is a formula cell and thinks that a value should simply be entered in it.
• provision of data entry validation through the re-keying of input data. This method is also used in traditional data processing, where it is called data verification. It readily prevents errors since it is easy to check whether two input areas are the same and, if they are not, to determine where the error lies.
Erwig et al. [23] developed a system called Gencel in which spreadsheet templates written in the Visual Template Specification Language (ViTSL) are used to generate spreadsheets which are free from reference, range or type errors. With this technique, spreadsheet templates are created and verified by domain experts and can later be used by less experienced users to generate spreadsheets that always conform to the template. This concept was extended to include the automatic generation of spreadsheet templates from object-oriented specifications expressed as Unified Modeling Language (UML) diagrams [21].
2.3 Spreadsheet error detection techniques
Introducing errors in spreadsheets may be inevitable. Therefore, some researchers have focussed on techniques that help in the detection of errors in spreadsheets as well as on testing techniques for spreadsheets.
Ayalew et al. [7, 8] developed a spreadsheet debugging technique based on “interval testing” and slicing. In this technique, each formula cell has a user-specified value interval and a system-generated value interval. When the user-specified interval and the system-generated interval for a cell do not agree with the actual spreadsheet computation, the cell is marked as displaying a symptom of a fault. A fault tracing strategy is then used to identify the most influential faulty cell from the cells perceived by the system to contain faults. This is based on the number of precedents and dependents of the influential faulty cells.
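The symptom check described above can be illustrated with a small Python sketch. It is not the authors' implementation: the `Interval` class, the `shows_symptom` function and, in particular, the agreement rule used (a cell is flagged when its computed value lies outside the user-specified interval, or when the system-generated interval is not contained in the user-specified one) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed numeric interval [low, high] (hypothetical helper type)."""
    low: float
    high: float

    def contains_value(self, value: float) -> bool:
        return self.low <= value <= self.high

    def contains_interval(self, other: "Interval") -> bool:
        return self.low <= other.low and other.high <= self.high


def shows_symptom(computed: float, user: Interval, system: Interval) -> bool:
    """Return True when a formula cell displays a symptom of a fault.

    Assumed rule (not taken from the source): the cell is consistent only if
    the computed value lies in the user-specified interval AND the
    system-generated interval is contained in the user-specified interval.
    """
    consistent = user.contains_value(computed) and user.contains_interval(system)
    return not consistent


# The user expects the cell value to lie between 0 and 100.
expected = Interval(0.0, 100.0)
print(shows_symptom(50.0, expected, Interval(10.0, 90.0)))   # intervals agree
print(shows_symptom(50.0, expected, Interval(10.0, 150.0)))  # system interval too wide
```

Under this assumed rule, a cell can be flagged even when its current value looks plausible, because the interval derived by the system shows that other admissible inputs could push the value outside what the user expects.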
Rothermel et al. [50] also developed a spreadsheet testing methodology which they termed “What You See Is What You Test” (WYSIWYT) to help users test spreadsheets. Since testing and debugging are closely interrelated, we find it worthwhile to mention this methodology. The methodology uses data-flow adequacy and coverage criteria to give the user feedback on how well tested a spreadsheet is. The WYSIWYT testing methodology has been integrated with another spreadsheet testing technique known as the “Help Me Test” (HMT) [24] technique into the Forms/3 [11] spreadsheet language. The HMT technique automatically generates test cases for the user as he/she actively works on the spreadsheet. Forms/3 is a form-based research spreadsheet language developed at Oregon State University. The Forms/3 spreadsheet language also allows users to define assertions on the expected cell values [12]. To promote the usage of assertions by end-user programmers, Wilson et al. [67] devised a curiosity-centred approach to eliciting assertions from end-users through a “surprise-explain-reward” strategy.
Randolph et al. [48] developed a spreadsheet verification tool based on the WYSIWYT methodology. Their main emphasis was to use the WYSIWYT methodology algorithms in implementing a spreadsheet independent tool. They placed much emphasis on issues of portability and the automatic generation of test cases.
Abraham and Erwig [2] developed an automated reasoning system for spreadsheets called UCheck. UCheck infers header unit information for cells in a spreadsheet. Based on the header unit information, the system identifies cells in the spreadsheet that contain erroneous formulas. They extended the UCheck system to produce a system known as UFix [4] in order to improve the way error messages are reported to users, hence improving the spreadsheet debugging process. Abraham and Erwig also developed a type system and a type inference algorithm for spreadsheets which can be used in identifying some kinds of errors in spreadsheets [2].
Abraham and Erwig [3] also developed a spreadsheet debugger known as GoalDebug based on a technique known as “goal-directed debugging”. GoalDebug allows users to mark cells with incorrect outputs and specify the expected output. The GoalDebug system then generates a list of change suggestions, any one of which when applied would result in the expected output being computed in the marked cell. The generated change suggestions are ranked based on a set of heuristics before being presented to the user. The generated change suggestions can be applied automatically, hence eliminating errors that could be introduced by end users through manual editing of cell formulas.
Metamorphic testing has also been proposed as a potential way of testing spreadsheets [14]. This technique has also been used to test other end-user developed software such as web applications, simulation and scientific computations. Metamorphic testing utilizes information carried in successful test cases. An essential part of metamorphic testing is to identify effective metamorphic relations. A metamorphic relation is any relation among program inputs and the outcomes of multiple executions of the target program. The outcomes of multiple executions of the target program using isomorphic test cases are supposed to match, otherwise the tested program is at fault. A good metamorphic relation can be identified easily by a program tester who has black-box knowledge of the problem domain and white-box knowledge of the program structure.
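As a minimal illustration of a metamorphic relation, the following Python sketch tests a column-total computation without knowing the correct total in advance. The relation used (rotating the order of the input cells must not change the total) and the names `column_total`, `faulty_column_total` and `rotation_relation_holds` are hypothetical examples chosen for this sketch; they are not taken from [14].

```python
from typing import Callable, List


def column_total(values: List[int]) -> int:
    """Hypothetical program under test: totals a column of cell values."""
    return sum(values)


def faulty_column_total(values: List[int]) -> int:
    """Faulty variant whose range misses the last cell."""
    return sum(values[:-1])


def rotation_relation_holds(program: Callable[[List[int]], int],
                            values: List[int]) -> bool:
    """Check the relation: rotating the inputs leaves the outcome unchanged."""
    baseline = program(values)
    for shift in range(1, len(values)):
        rotated = values[shift:] + values[:shift]  # follow-up test case
        if program(rotated) != baseline:
            return False  # outcomes of related executions do not match
    return True


cells = [3, 1, 4, 1, 5, 9]
print(rotation_relation_holds(column_total, cells))         # True
print(rotation_relation_holds(faulty_column_total, cells))  # False
```

The faulty variant is exposed because its outcome depends on which cell happens to be last, even though no expected total was ever supplied.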
2.4 Spreadsheet visualization techniques
Various spreadsheet visualization tools have also been proposed for different purposes such as spreadsheet comprehension, debugging, documentation, etc. Most of these spreadsheet visualization tools are based on the data-flow graph behind the spreadsheets [53]. Spreadsheet visualization is part of a discipline known as Information Visualization. Information visualization through automatic graph drawing involves construction of geometric representations of conceptual structures that are modelled as objects and connections between those objects [60]. In a graph, the objects are represented by nodes, and edges are used to represent connections (relationships) between those objects. Automatic generation of graph drawings has been carried out for a wide variety of information visualization applications in science as well as in engineering [19]. Some example application areas include:
• the World Wide Web: visualization of site maps and construction of browsing history diagrams, etc.
• Software Engineering: construction of data flow diagrams, program call graphs, object-oriented class hierarchies, entity-relationship diagrams, etc.
• Artificial Intelligence: construction of knowledge representation diagrams
• Management Science: construction of organization charts, PERT diagrams
Our research work focusses on the visualization of spreadsheet structures through the automatic generation of corresponding data dependency (data-flow) graphs.
Microsoft Excel, a popular commercial spreadsheet system, provides a built-in precedents/dependents tracer tool which upon request allows a spreadsheet developer to either get precedents or dependents of a particular cell. Arrows are then drawn linking the precedents or dependents to the selected cell. Fig. 2.1 shows a Microsoft Excel spreadsheet with arrows depicting the data-flow graph as generated by the tracer tool. The formula view of the spreadsheet is given in Fig. 2.2.
One problem with this tool is that one cannot get the overall data-flow graph for the whole spreadsheet with a single request. Therefore one cannot have a global view of the overall data-flow graph in a single step. Another major drawback with this kind of visualization is that the visualization is superimposed on the spreadsheet display. This clutters the spreadsheet view and as a result reduces readability and comprehension of the spreadsheet.
Figure 2.1: A Microsoft Excel spreadsheet with data-flow graph arrows. Sourced from [53].
Figure 2.2: Formula view of the Microsoft Excel spreadsheet depicted in Fig. 2.1.
Davis [17] produced two spreadsheet visualization tools: the arrow tool and the online data dependency diagram. The arrow tool is similar to the earlier versions of the Microsoft Excel (MS Excel 97) precedents/dependents tracer tool with the exception that the arrow tool coloured precedent and dependent cells in addition to grouping logically related cells. Again the visualization is superimposed on the spreadsheet display, hence bringing in the problems associated with Microsoft Excel’s precedents/dependents tracer tool.
Online data dependency diagrams, as spreadsheet visualization tools, are based on flow-chart-like diagrams (see Fig. 2.3). Distinctive symbols are used to represent cells according to whether they function as inputs, outputs, decision variables or parameters of formulas. Arrows are used to show data dependencies amongst the cells by connecting the symbols. The visualization produced is not superimposed on the spreadsheet display as in the other tools explained in the preceding paragraphs. Instead, the tool displays the spreadsheet in a window on one side of the screen and the diagram in a separate window on the other side as in Fig. 2.3. However, it has to be noted that the visualization is statically generated. As a result, the tool’s author suggested that if this visualization could be produced automatically, it could serve as a practical spreadsheet auditing tool because one could produce it when needed. Davis further states that the visualization was statically generated because, at the time the visualization was proposed, graph drawing algorithms were not good enough. This is no longer the case and therefore we would like to exploit the availability of robust graph drawing algorithms for automatic (dynamic) generation of such visualizations.
Figure 2.3: A spreadsheet with its corresponding online data dependency diagram
On a related note, Vemuri et al. [64] conducted an experimental study on the usefulness of online data-dependency diagrams for visualizing spreadsheets. Although their study did not conclude that online data-dependency diagrams were useful, it indicated optimism among users that such diagrams would be useful for maintaining larger spreadsheets.
Figure 2.4: An animated presentation of fluid-like flow of data in a spreadsheet by Igarashi et al.
Igarashi et al. [34] also developed a visualization tool that depicts a fluid-like flow of data in a spreadsheet as illustrated in Fig. 2.4. The main emphasis in this visualization tool is the visualization of the hidden data-flow structure behind the tabular layout of a spreadsheet. Transient local views are used to visualize data-flow structures associated with individual cells, and it is also possible to view the data-flow structure of the entire spreadsheet at once. A user is also able to navigate through the data-flow structure interactively and it is possible to construct formulas using graphical editing techniques, hence the provision of visual editing. However, the main drawback of the tool is that it fails to scale to spreadsheets containing more than 400 used cells, beyond which there is noticeable degradation in performance. This limits the application of the technique to larger spreadsheets.
Sajaniemi [53] developed the S2 and S3 spreadsheet visualization tools in which logical areas or semantic units in a spreadsheet are highlighted and data-flow between logical areas is indicated through arrows. A screenshot of the S2 visualization tool is given in Fig. 2.5 and the corresponding formula view of the spreadsheet is given in Fig. 2.6. The S3 visualization is a slight improvement on the S2 visualization. Highlighted areas in the visualization describe the plan structure of the spreadsheets and deviations from this structure show clearly in the visualization, hence helping in the spreadsheet debugging process. Both tools have the disadvantage that they are also superimposed on the spreadsheet display, hence introducing cluttering of the display. In addition to this, the overall data-flow graph cannot be generated in a single step.
Figure 2.5: A screenshot of the S2 visualization by Sajaniemi.
Figure 2.6: A formula view of the spreadsheet given in Fig. 2.5.
On the same line, Sajaniemi further suggested that spreadsheet visualization tools should satisfy the following salient features [53]:
• The visualization should be superimposed on the spreadsheet so as to reduce cognitive overhead in mapping between two display windows. However, we note that superimposing the visualization graph on the spreadsheet display clutters the view of the spreadsheet. Therefore our work shall avoid this problem by providing a separate window for the visualization.
• The visualization can be constructed dynamically since users would probably prefer tools that require less user intervention. Our visualization tool shall satisfy this requirement since the visualization shall be dynamically generated upon the click of an appropriate button on the spreadsheet interface.
Ayalew et al. [7] proposed a graphical spreadsheet visualization model that is based not only on a data-flow graph but also on visualizing logical and physical areas in spreadsheets. They proposed that such a visualization should be generated automatically with little or no intervention from the spreadsheet programmer. They also proposed that the visualization should allow zooming into specific areas of the generated graph without losing the global view of the graph, using fisheye views. Their proposed visualization model was to serve three purposes:
• shortening the trial-and-error process of developing solutions for real-world problems through support for problem understanding, since problem understanding can be supported by the graphical representation of the spreadsheet model.
• helping in the maintenance of existing spreadsheets, since a visualization can help in the understanding of spreadsheet programs developed by others.
• enabling comparison of spreadsheets at the level of the spreadsheet model based on model properties such as data-flow, physical and logical areas and not just cell values.
Clermont et al. [16] developed a spreadsheet visualization toolkit that partitions a spreadsheet into logical areas known as equivalence classes. The equivalence classes are mainly based on structural similarity of formulas. Identified equivalence classes are then highlighted in the original spreadsheet as in Fig. 2.7. The toolkit has three components:
• A structure browser which displays the generated equivalence classes.
• A dependency viewer that displays the data-flow graph between the dependencies of the cells in the equivalence class that is currently highlighted in the structure browser.
• The spreadsheet itself, which gives feedback to the user/programmer by highlighting the cells that are in the equivalence class (logical area) that is currently selected in the structure browser.
Figure 2.7: A spreadsheet with highlighted logical areas (equivalence classes) by Clermont et al.
With large spreadsheets (e.g. having more than 5000 used cells), the number of equivalence classes becomes too large and hence they devised a further abstraction mechanism called semantic classes. Semantic classes are represented as nodes in a generated graph and data-flow between cells in different semantic classes is represented by directed edges as in Fig. 2.8. These graphs are not dynamically generated since information about a spreadsheet (e.g. cell dependencies) is extracted and processed separately from the spreadsheet.
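The idea of grouping cells by the structural similarity of their formulas, as in the equivalence classes of Clermont et al., can be illustrated with the following Python sketch. It is only an approximation under stated assumptions: two formulas are treated as equivalent when they are identical once every cell reference is rewritten relative to the host cell (as happens when a formula is copied), which is a simplification of the classes defined in [16]. The function names are hypothetical, and absolute references (`$B$2`), sheet names and function names that look like cell addresses are not handled.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches plain relative references such as B2 or AA10 (assumed A1 notation).
CELL_REF = re.compile(r"\b([A-Z]{1,3})([0-9]+)\b")


def col_to_index(col: str) -> int:
    """Convert a column label to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def split_address(address: str) -> Tuple[int, int]:
    """Split an address such as 'D2' into (column index, row number)."""
    match = CELL_REF.fullmatch(address)
    if match is None:
        raise ValueError(f"not a cell address: {address!r}")
    return col_to_index(match.group(1)), int(match.group(2))


def relative_form(address: str, formula: str) -> str:
    """Rewrite each reference in `formula` as an offset from `address`."""
    col, row = split_address(address)

    def to_offset(match: "re.Match[str]") -> str:
        d_row = int(match.group(2)) - row
        d_col = col_to_index(match.group(1)) - col
        return f"R[{d_row}]C[{d_col}]"

    return CELL_REF.sub(to_offset, formula)


def equivalence_classes(formulas: Dict[str, str]) -> List[List[str]]:
    """Group formula cells whose relative forms are identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for address, formula in formulas.items():
        groups[relative_form(address, formula)].append(address)
    return [sorted(cells) for cells in groups.values()]


sheet_formulas = {
    "D2": "=B2*C2",
    "D3": "=B3*C3",
    "D4": "=B4*C4",
    "D5": "=SUM(D2:D4)",
}
print(equivalence_classes(sheet_formulas))  # [['D2', 'D3', 'D4'], ['D5']]
```

In this example the three copied product formulas fall into one class while the total forms a class of its own, which corresponds to the kind of logical areas a user would expect to see highlighted.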
Figure 2.8: A data-flow graph of semantic classes as proposed by Clermont et al.
Ballinger et al. [9] developed a spreadsheet visualization tool that would first statically extract artefacts from spreadsheets and then convert this information into visualizations such as spreadsheet data-flow diagrams. A sample spreadsheet data-flow diagram is depicted in Fig. 2.9. Ballinger et al. also used a hyperbolic viewer to view the generated spreadsheet data-flow graphs in an attempt to deal with the problem of cluttering in graphs with a large number of nodes and edges (see Fig. 2.10). Unfortunately, hyperbolic viewing does not provide views in which the displayed nodes match logical areas in the corresponding spreadsheet. We want to produce graph views which match logical areas in the spreadsheet.
2.5 Summary
The aforementioned spreadsheet visualization tools and techniques indeed offer very useful insights about data-flow as well as data patterns in spreadsheets which would not have been possible by just analysing the “data value” view of a given spreadsheet.
Figure 2.9: A sample spreadsheet data-flow graph by Ballinger et al.
Figure 2.10: Hyperbolic view of a spreadsheet data-flow graph by Ballinger et al.
However, as already pointed out, there are some drawbacks with the aforementioned approaches. For example, in some of the approaches (e.g. the Microsoft precedents/dependents tracer tool, the arrow tool [17] and the S2 and S3 [54] visualization), the generated arrows with highlighted areas are superimposed on the spreadsheet display, which introduces cluttering on the display. In the other approaches which produce computational data-flow graphs, the problem of visualizing large spreadsheets (hence large graphs) is not adequately handled. For instance, the fluid visualization tool of Igarashi et al. [34] can only handle spreadsheets with not more than 400 used cells. Hyperbolic viewing of spreadsheet data-flow graphs as in the work of Ballinger et al. [9] has the problem that the viewing context generated by the fisheye views employed does not necessarily match logical areas in the corresponding spreadsheet. Another drawback with some of the approaches is that the visualizations produced are statically generated and hence cannot be useful for real-time spreadsheet-visualization interactivity. A case in point is the work of Clermont et al. [16, 38] and that of Ballinger [9] where information about a spreadsheet (e.g. cell dependencies) is extracted and processed separately from the spreadsheet. Online data dependency diagrams as proposed by Davis [17] are also processed statically from the spreadsheet.
To sum up, we can identify the following problems with the aforementioned spreadsheet visualization approaches:
• Some of the approaches introduce cluttering of the spreadsheet display, hence reducing spreadsheet understanding
• Some of the approaches do not adequately handle the visualization of large spreadsheets
• In some of the approaches which produce data-flow graphs, the generated graphs do not match logical areas in the corresponding spreadsheet
• Some of the visualizations produced are statically generated and hence cannot be useful for real-time spreadsheet-visualization interactivity
Our research work will be an attempt to address these problems.
Chapter 3 Graph-based Visualization
3.1 Introduction
A graph consists of two finite sets, V and E. Each element of V is called a vertex or a node. The elements of set E are called edges and these are unordered pairs of the vertices in set V. A graph may be used to abstractly represent properties of a system by modelling and simulation if the vertices can be identified as objects and edges can be identified as relations between the objects [29]. In our approach, we model spreadsheet data-flow graphs where vertices (nodes) represent spreadsheet cells and the set of edges represents the dependencies between spreadsheet cells as defined by cell references in formulas.
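To make the model concrete, the following Python sketch derives the vertex set V and the edge set E from a small spreadsheet given as a mapping from cell addresses to cell contents. It is a minimal illustration under stated assumptions, not the prototype described in this dissertation (which works with Microsoft Excel and VBA): the function names are hypothetical, only plain A1-style references and ranges on a single sheet are recognized, and each edge is stored as an ordered pair (referenced cell, formula cell) so that the direction of data flow is kept.

```python
import re
from typing import Dict, Set, Tuple

# A single reference (B2) or a range (D2:D4); absolute references, sheet
# names and function names resembling addresses (e.g. LOG10) are not handled.
REF_OR_RANGE = re.compile(r"\b([A-Z]{1,3})([0-9]+)(?::([A-Z]{1,3})([0-9]+))?\b")


def col_to_index(col: str) -> int:
    """Convert a column label to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def index_to_col(index: int) -> str:
    """Convert a 1-based column index back to its label (27 -> AA)."""
    label = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def referenced_cells(formula: str) -> Set[str]:
    """Return every cell address a formula refers to, expanding ranges."""
    cells: Set[str] = set()
    for c1, r1, c2, r2 in REF_OR_RANGE.findall(formula):
        if not c2:  # single reference
            cells.add(f"{c1}{r1}")
            continue
        cols = sorted((col_to_index(c1), col_to_index(c2)))
        rows = sorted((int(r1), int(r2)))
        for col in range(cols[0], cols[1] + 1):
            for row in range(rows[0], rows[1] + 1):
                cells.add(f"{index_to_col(col)}{row}")
    return cells


def build_dataflow_graph(sheet: Dict[str, str]) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Build (V, E): cells as vertices, (precedent, dependent) pairs as edges."""
    vertices: Set[str] = set(sheet)
    edges: Set[Tuple[str, str]] = set()
    for cell, content in sheet.items():
        if not content.startswith("="):
            continue  # constants have no precedents
        for source in referenced_cells(content):
            vertices.add(source)
            edges.add((source, cell))
    return vertices, edges


sheet = {
    "B2": "10", "C2": "3", "D2": "=B2*C2",
    "B3": "20", "C3": "2", "D3": "=B3*C3",
    "D4": "=SUM(D2:D3)",
}
V, E = build_dataflow_graph(sheet)
print(sorted(E))
```

Even this seven-cell sheet yields six edges, which hints at how quickly the graph of a realistic spreadsheet grows.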
Using different graph drawing techniques, one can generate the data-flow graph of any given spreadsheet. However, the large number of nodes clutters the graph, which makes it difficult to comprehend and navigate. Grouping of the nodes into clusters is a viable solution to this problem. This technique is known as graph clustering.
3.2 The need for graph clustering
Consider the spreadsheet given in Fig. 3.1 whose formula view and corresponding data-flow graph are given in Fig. 3.2 and Fig. 3.3 respectively. The spreadsheet is used to track income and expenditure on several projects being run by some company. It is worth noting that it is very difficult to comprehend the data-flow graph of the spreadsheet. There are problems of readability and navigation, to mention but a few. This is a general problem of visualizing large graphs [1, 28] since large graphs contain a large number of nodes. As already stated above, graph clustering offers a possible solution to this problem. Graph clustering is the process of separating the nodes of a graph into components/groups based on some classification criteria. The separated components are then interpreted as clusters. Clustering is important because it helps us to view a manageable subset of the generated graph at a time.
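As a deliberately simple illustration of separating the nodes of a graph into groups, the Python sketch below uses mere connectivity as the classification criterion: two cells fall in the same cluster when a chain of dependencies links them, ignoring edge direction. This is an assumption made for illustration only and is not the clustering algorithm adopted in this work; the function name `connected_component_clusters` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def connected_component_clusters(nodes: Iterable[str],
                                 edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Split nodes into clusters of mutually reachable cells (undirected)."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    unvisited = set(nodes)
    clusters: List[Set[str]] = []
    for start in sorted(unvisited):
        if start not in unvisited:
            continue  # already placed in an earlier cluster
        unvisited.discard(start)
        cluster = {start}
        queue = deque([start])
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt in unvisited:
                    unvisited.discard(nxt)
                    cluster.add(nxt)
                    queue.append(nxt)
        clusters.append(cluster)
    return clusters


cells = ["A1", "A2", "A3", "B1", "B2"]
deps = [("A1", "A3"), ("A2", "A3"), ("B1", "B2")]
print(connected_component_clusters(cells, deps))
```

Connectivity alone is usually too coarse for real spreadsheets, where most cells end up linked to a common total, which is one reason a dedicated clustering algorithm is needed.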
The process of coming up with clusters is known as cluster analysis, and it can be broken down into a series of steps [29]. However, when applying a cluster analysis procedure, a number of questions need to be answered [29, 66]:
• What are the entities to be clustered?
• When are two entities said to be similar? This is the classification criterion that determines whether two entities fall under one cluster.
• What is the basis for evaluating the classification criterion? This is important because it is normally desirable to have a classification criterion that produces clusters which are as “natural” as possible.
• What clustering algorithm should be applied? This is important because a clustering algorithm is required in order to perform the actual cluster analysis. In addition, clustering algorithms vary in their effectiveness depending on the application area for which they are used. In our case, we need a clustering algorithm that finds clusters that correspond to the logical areas in the given spreadsheet. In other words, the clustering mechanism has to find “natural” clusters in the spreadsheet data-flow graph.
Figure 3.1: A sample Project Accounting spreadsheet. Adapted from [38].
3.3 An overview of clustering algorithms
There are many clustering algorithms. However, they can roughly be divided into the following categories [29, 66]:
• optimization algorithms
• construction algorithms
• hierarchical algorithms
• graph theoretical algorithms
This categorization is not exhaustive, as some algorithms do not fall into any of the listed categories and others are hybrids of several algorithms. It is also important to note that an algorithm may produce either disjoint or overlapping clusters. In disjoint clustering, a sample of entities is split into non-overlapping subsets, so every element of the sample is attached to exactly one cluster. In overlapping clustering, on the other hand, a sample of objects may be split into overlapping subsets.
Figure 3.2: The formula view of the Project Accounting spreadsheet
Figure 3.3: A data flow graph of the given Project Accounting spreadsheet generated by the Graphael graph drawing software.
Clustering algorithms can also be either supervised or unsupervised [66]. Supervised algorithms are provided a priori knowledge while this is not the case with unsupervised algorithms. An example of a priori knowledge could be the number of clusters that need to be generated.
3.3.1 Optimization algorithms
These algorithms are based on a clustering criterion that needs to be optimized. The clustering criterion is expressed as a quality function. Clusters are produced at the optimal value of the quality function. The main drawback with optimization algorithms is the computation time needed to find an optimum of the quality function.
3.3.2 Construction algorithms
In construction algorithms, the cluster construction process starts with an arbitrary choice of some elements which are believed to be “typical” representatives of some clusters. These representative elements are called kernels. Using an iterative process, all elements which are geometrically nearest to each kernel are attached to the kernel’s group. The process stops if the clusters become too heterogeneous. During the iterative process, elements which get closer to the geometric centre of a new group than to the centre of the group they previously have been attached to are reclassified. This reclassification implies heavy computations.
3.3.3 Hierarchical algorithms
Hierarchical algorithms build a hierarchy of clusterings as in a genetic tree (dendrogram), whereby each level in the hierarchy contains the same clusters as the level immediately below it, except for two clusters which are joined to form one cluster. Hierarchical algorithms may be categorized into two types:
• Agglomerative or bottom-up algorithms
• Divisive or top-down algorithms
In agglomerative algorithms, clusters at a higher level are formed by the fusion of clusters which are at a lower level in the hierarchy. The starting point is the set of single-membered clusters at the lowest level of the hierarchy. Divisive algorithms, on the other hand, work in the opposite direction. They start the clustering process with all entities contained in one cluster. Thereafter, in each iterative step, a cluster is split into two clusters until the lowest level of the hierarchy contains single-membered clusters. Agglomerative algorithms offer an advantage over divisive algorithms because it is computationally cheaper to perform a bottom-up clustering process than a top-down clustering process.
3.3.4 Graph theoretical algorithms
Graph theoretical algorithms work on graphs whereby nodes represent entities and edges represent entity relations. These algorithms do not start from the individual nodes but they try to find subgraphs which will form clusters. Examples of subgraphs include connected components and spanning trees. The algorithms used to find these subgraphs are based on graph theory. Some graph theoretical algorithms reduce the number of nodes in a graph by merging them into aggregate nodes which can be interpreted as nodes or can be used as input for a new iteration resulting in higher level aggregates.
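The simplest case mentioned above, taking connected components as clusters, can be sketched in Python as follows. This is an illustrative breadth-first search under assumed names (`connected_components`, `vertices`, `edges`), not an algorithm taken from [29] or [66]; the example cells are borrowed from formulas quoted later in this work.

```python
from collections import deque


def connected_components(vertices, edges):
    """Group vertices into connected components using breadth-first search."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            v = queue.popleft()
            component.add(v)
            for w in neighbours[v] - seen:  # unvisited neighbours only
                seen.add(w)
                queue.append(w)
        components.append(component)
    return components


vertices = {"I5", "I6", "E6", "F6", "B6", "B17", "B20"}
edges = [("I5", "I6"), ("E6", "I6"), ("F6", "I6"), ("B6", "B20"), ("B17", "B20")]
components = connected_components(vertices, edges)
```

Cells that are linked through a chain of references end up in the same component, while unrelated groups of cells are separated.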
3.4 Choice of clustering algorithm
In our case, we need a clustering algorithm that finds clusters that correspond to logical areas in spreadsheets. A logical area in a spreadsheet may be defined as a group of cells that, from the spreadsheet creator’s perspective, form a logical unit due to the semantics of the spreadsheet [35, 36]. The semantics of a spreadsheet define what the spreadsheet is all about (the meaning of the spreadsheet). Therefore the clustering algorithm has to find “natural” clusters in the spreadsheet data-flow graph.
Based on our experiments, we found out that the Markov Clustering (MCL) algorithm [61, 62] finds “natural” clusters in spreadsheet data-flow graphs. We present a detailed description of these experiments in Chapter 4. Generally, natural clusters in a graph are characterised by the presence of many edges between the members of that cluster and one expects that random walks on the graph will infrequently go from one natural cluster to another [61, 62]. Due to its ability to find natural clusters, the MCL algorithm has also been used in many advanced applications. For example, the algorithm has been reliably used for the assignment of proteins into families based on precomputed sequence similarity information [22].
3.4.1 An overview of the MCL algorithm
The Markov Clustering (MCL) algorithm is a graph clustering algorithm that uses column stochastic (Markov) matrices to simulate random walks through a graph. A column stochastic matrix is a matrix whose columns are probability vectors, i.e. all entries are non-negative and the entries in each column sum to 1.
The first step of the algorithm is to associate a given input graph with some column stochastic matrix, M, such that entry M_ij indicates the probability of moving from node j to node i in the input graph (note that we start column-wise). Then two operations known as expansion and inflation are performed iteratively, starting with the associated stochastic matrix, thus simulating random walks through the input graph.
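The association of a graph with a column stochastic matrix can be sketched in Python as follows. The function name `column_stochastic` is hypothetical, and the optional self-loops are an assumption of this sketch (a common preprocessing step for MCL) rather than something stated in the text above.

```python
def column_stochastic(nodes, edges, add_self_loops=True):
    """Return M (list of rows) where M[i][j] is the probability of
    moving from node j to node i in one step of a random walk."""
    index = {name: k for k, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        A[i][j] = A[j][i] = 1.0          # undirected edge
    if add_self_loops:                   # assumption of this sketch
        for k in range(n):
            A[k][k] = 1.0
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    return [[A[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


nodes = ["B6", "B17", "B20"]
M = column_stochastic(nodes, [("B6", "B20"), ("B17", "B20")])
```

Here column 0 describes a walker standing on B6: it stays on B6 or moves to B20 with probability 1/2 each, so `M[2][0]` is 0.5.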
An expansion operation is carried out by taking the power of the associated stochastic matrix using the normal matrix product. An inflation operation involves taking the Hadamard power of the matrix that results from the expansion operation. The Hadamard power of a matrix is computed by raising each matrix entry to the given power; this power is specified through what is known as the inflation operator. The resulting matrix is then normalized (scaled) so that it is a stochastic matrix again. Expansion and inflation are repeated jointly until we get a doubly-idempotent matrix, that is, a matrix that does not change under further expansion and inflation operations.
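Written out, the two operations take the following form. The exponent symbol r is notation introduced here for illustration (the text uses Γ for both the operator and its parameter), and n denotes the number of nodes.

```latex
% Expansion: ordinary matrix product
\mathrm{Exp}(M) = M \cdot M

% Inflation with exponent r > 1: entry-wise power, then column rescaling
\bigl(\Gamma_r M\bigr)_{ij} \;=\; \frac{(M_{ij})^{r}}{\sum_{k=1}^{n} (M_{kj})^{r}}

% Worked example for one column with r = 2:
% (1/2, 1/4, 1/4) -> squares (1/4, 1/16, 1/16), column sum 3/8
%                 -> inflated column (2/3, 1/6, 1/6)
```

In the worked column the largest entry grows from 1/2 to 2/3 while the smaller entries shrink, which is the boosting effect that inflation is meant to have.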
Expansion computes random walks of higher length. That is, for any pair of nodes we obtain a probability value for a higher-length path between the two nodes. Since there are more higher-length paths within clusters than between different clusters, node pairs in the same cluster will have large probabilities, as there are many ways of going from one node to the other. These probabilities are further boosted by applying the inflation operation. Thus inflation boosts the probabilities of intra-cluster walks and demotes inter-cluster walks, so intra-cluster walks are favoured over inter-cluster walks.
The process of jointly iterating expansion and inflation results in a very sparse stochastic matrix which is interpreted as the separation of the input graph into different connected components, which are in turn interpreted as clusters. An example graphical representation of the MCL cluster separation process is given in Fig. 3.4. An MCL cluster would therefore be characterized by the following attributes:
• the presence of many edges between members of a cluster
• the number of higher-length paths between two arbitrary nodes in the cluster is larger than between two arbitrary nodes from different clusters
• if one takes a random walk through a dense cluster, then the random walker will likely not leave the cluster until many of its nodes have been visited.
Figure 3.4: An example MCL cluster separation process from van Dongen [61].
The basic MCL algorithm is given in Algorithm 1 below.
Algorithm 1 The basic MCL algorithm
G is the input graph
set M1 to be the associated matrix of random walks on graph G
set the inflation operator Γ to some value
repeat
    M2 = M1 ∗ M1 //this is expansion
    M1 = Γ(M2) //this is inflation
    change = difference(M1, M2)
until (change = 0) //zero matrix
set clusters as the components of M1
It is important to note that the inflation operator can be altered using the parameter Γ. Increasing this parameter has the effect of making the inflation operator stronger, and this increases the granularity or tightness of clusters. In addition to this, it is also important to note that the MCL algorithm has been proven to converge quadratically. In practice, the algorithm starts to converge noticeably after 3 to 10 iterations [62].
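A minimal pure-Python sketch of Algorithm 1 is given below for illustration; it is not the Graphael implementation used in this work. Several details are assumptions of the sketch: self-loops are added before normalization, the loop stops when successive iterates differ by less than a tolerance `tol` (rather than comparing against the expanded matrix), and clusters are read off from the non-zero rows of the limit matrix. The function names and the example graph (two triangles joined by a single edge) are likewise made up for the example.

```python
def normalise_columns(M):
    """Rescale every column of M so that it sums to 1."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(M):
    """Expansion: ordinary matrix product M * M."""
    n = len(M)
    return [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(M, r):
    """Inflation: entry-wise power r followed by column normalization."""
    return normalise_columns([[x ** r for x in row] for row in M])


def mcl(adjacency, r=2.0, tol=1e-9, max_iter=100):
    n = len(adjacency)
    # Assumption: add self-loops before building the stochastic matrix.
    M = normalise_columns([[adjacency[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        previous = M
        M = inflate(expand(M), r)
        change = max(abs(M[i][j] - previous[i][j])
                     for i in range(n) for j in range(n))
        if change < tol:
            break
    # Each non-zero row collects the nodes (columns) attracted to it.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if M[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adjacency = [[0.0] * 6 for _ in range(6)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adjacency[a][b] = adjacency[b][a] = 1.0
clusters = mcl(adjacency, r=2.0)
```

Calling `mcl` with a larger value of `r` corresponds to the stronger inflation discussed above and tends to produce smaller clusters.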
3.5 Choice of graph drawing software
Graph-based visualization is a way of representing structural information as diagrams of abstract graphs and networks. It is tedious to draw such graphs by hand. Therefore, they are drawn automatically using graph drawing software, which usually offers a variety of graph layout algorithms. Graph drawing software has been used in a wide variety of important applications in software engineering, database and web design, networking, and in visual interfaces for many other domains.
We investigated two open-source Java-based graph drawing programs in this work: ZGRViewer [47] and Graphael [26, 30]. Each program was used in collaboration with the Microsoft Excel spreadsheet application. We used open-source programs not only because they are free in terms of monetary cost but, most importantly, because we were able to modify their source code to suit our needs.
3.5.1 Experiments with the ZGRViewer graph drawing software
ZGRViewer is a 2.5D graph visualizing program implemented in Java. It is specifically aimed at displaying graphs expressed in the DOT graph modelling language using the GraphViz [20, 31] graph drawing library. ZGRViewer is designed to handle large graphs, and offers a zoomable user interface (ZUI), which enables smooth zooming and easy navigation in the visualized structure. A screenshot of ZGRViewer displaying a spreadsheet data-flow graph is given in Fig. 3.5. A zoomed in screenshot of the same spreadsheet data-flow graph is given in Fig. 3.6.
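To give an idea of the kind of input such a viewer expects, the following Python sketch writes a small data-flow graph as DOT text. The helper `to_dot` and the graph name `dataflow` are hypothetical, and drawing each edge from the referenced cell to the formula cell is an assumption of this sketch; it is not the export code used in this work.

```python
def to_dot(edges, name="dataflow"):
    """Serialise (referenced cell, formula cell) pairs as a DOT digraph."""
    lines = [f"digraph {name} {{"]
    for source, target in sorted(edges):
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Edges taken from the formula B20=B6*B17 discussed in Chapter 5.
dot_text = to_dot([("B6", "B20"), ("B17", "B20")])
```

The resulting text can be saved to a `.dot` file and opened with a DOT-aware viewer.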
Despite the fact that ZGRViewer is able to efficiently handle large graphs through smooth and continuous geometric zooming (zooming in/out), as illustrated in Fig. 3.6, it has some shortcomings:
• The whole context of the graph is lost as one zooms in to get a detailed view of a part of the graph (see Fig. 3.6).
• To deal with the problem of visualizing large graphs, graph clustering becomes a potential solution. Graph clustering allows us to view a subset of the whole graph at a particular time. Unfortunately, the graph drawing libraries which ZGRViewer uses do not implement any graph clustering algorithm.
We therefore experimented with another graph drawing program, Graphael.
Figure 3.5: A screenshot of the ZGRViewer graph drawing software displaying an unzoomed data-flow graph of a spreadsheet.
Figure 3.6: A screenshot of a zoomed-in spreadsheet data-flow graph in ZGRViewer.
3.5.2 Experiments with the Graphael graph drawing software
The Graphael program has a number of graph clustering algorithms, including a geometric graph clustering algorithm as well as the MCL algorithm. A geometric clustering algorithm clusters nodes according to their spatial locality given an initial layout of the entire graph.
Needless to say, the geometric clustering algorithm does not produce clusters that match with logical areas in the corresponding spreadsheet. On the other hand, our experiments with the MCL algorithm showed that the clusters produced would in most cases match with logical areas in the corresponding spreadsheet. A detailed discussion of our experiments with the MCL algorithm is given in Chapter 4.
Chapter 4 The MCL Algorithm and Logical Areas in Spreadsheets
4.1 Introduction
We conducted experiments on several spreadsheets to determine the performance of the MCL algorithm in finding “natural clusters” in spreadsheet data-flow graphs. We also determined whether the “natural clusters” match with logical areas in the corresponding spreadsheet. In this chapter, we present details of the experiments and our findings.
4.2 Generating spreadsheet data-flow graphs using Graphael
For the Project Accounting spreadsheet given in Fig. 4.1, with its corresponding formula view given in Fig. 4.2, we generated the data-flow graph using the Graphael program. The spreadsheet is used to track income and expenditure for some projects being run by some company. To avoid the problem of graph cluttering, Graphael provides the MCL algorithm to generate clusters. Conceptually, the generated clusters are hierarchically arranged in a cluster tree. An illustration of a cluster tree is given in Fig. 4.3.
The leaves of a cluster tree are the actual nodes of the generated graph, while the rest of the higher-level nodes of the cluster tree represent clusters. The root of the tree is the highest-level cluster of the graph. In the Graphael program, navigation through the cluster tree is achieved by using compound fisheye views and treemaps [1]. Fisheye views are a graph visualization technique which allows one to view a graph as a whole while at the same time letting the viewer see detailed parts of the graph without losing the overall context of the graph. Compound fisheye views are a fisheye view technique provided by Graphael that enables one to view members of a particular cluster while at the same time showing any relationships between the cluster members and the rest of the clusters in the cluster tree. Treemaps, on the other hand, are a visualization technique in which hierarchical information is displayed within nested rectangles, with each level of nesting corresponding to a level of hierarchical decomposition. In our case, the cluster tree is also displayed using nested rectangles.
Figure 4.1: The sample Project Accounting spreadsheet
Figure 4.2: Formula view of the Project Accounting spreadsheet.
Figure 4.3: An illustration of a cluster tree
Figure 4.4: A top-most level view of the cluster tree of the Project Accounting spreadsheet data-flow graph as displayed using Graphael.
Using the Graphael program, the cluster tree of the spreadsheet data-flow graph is visualized using two windows which are displayed side by side as in Fig. 4.4. The right-side window is the cluster window while the left-side window is a treemap window. The cluster window is displaying the root node of the cluster tree which is represented by a dot. The treemap window is an important complementary cluster tree navigation aid because it not only helps in determining the level we are at while navigating the cluster tree in the cluster window but it also indicates the number of nodes which are in a selected cluster. We know the level we are at when using a treemap window by counting the number of thickened rectangular borders from the outermost border to the currently highlighted thickened border.
Clicking on the root node of the cluster tree as depicted in the cluster window in Fig. 4.4 leads to the display of the nodes (clusters) at the next lower level of the cluster tree as depicted in Fig. 4.5. On the other hand, right-clicking on any node in the currently displayed cluster leads to viewing of nodes which are at the next higher level in the cluster tree (going up the cluster tree).
In our case, a look at the corresponding treemap in Fig. 4.5 shows that the next lower-level nodes are leaf nodes. Therefore, clicking on any node (cluster) in Fig. 4.5 should lead to leaf nodes in that particular cluster. For example, in Fig. 4.6, we have a cluster containing cells D6, F6, G6 and H6 and these are depicted by labelled nodes. The unlabelled nodes indicate clusters which are not currently under selection. Fig. 4.7 depicts a cluster with cells F10, G10 and H10.
Figure 4.5: Second level view of the cluster tree.
Figure 4.6: An MCL cluster containing cells D6, F6, G6, H6
Compound fisheye views help us to see how the cluster members currently being viewed relate to other cluster members and to unselected clusters. For example, in Fig. 4.6, cluster member G6 is linked to three nodes: F6, D6 and an unlabelled cluster. We can view these details without losing the overall context of clusters which are at a particular level in the cluster tree.
Figure 4.7: An MCL cluster containing cells F10, G10 and H10
4.3 Determining the inflation operator for the MCL algorithm
The size of MCL clusters is dependent on the value of the inflation operator [62]. According to the MCL algorithm, the inflation operator, Γ, has to be greater than 1 (Γ > 1). As Γ values get larger, we would expect tighter (smaller-sized) clusters. It is also expected that as Γ becomes smaller, we should get bigger-sized clusters.
We then set up an experiment to find the value of the inflation operator, Γ, that gives us MCL clusters that most closely match the logical areas in a given spreadsheet. We used the same Project Accounting spreadsheet given in Fig. 4.1 for our experiments. The formula view of the spreadsheet is also given in Fig. 4.2. The results of the experiment are given below.
With Γ = 1.1 (Γ > 1): The corresponding treemap and cluster tree is depicted in Fig. 4.8. The resulting clusters are summarized in Table 4.1. Clearly, we do not get any meaningful MCL clusters (compare the tabulated clusters with the formula view of the spreadsheet in Fig. 4.2).
Figure 4.8: Treemap and cluster tree with Γ = 1.1
Cluster No.   Member Cells
1.            B5:B14
2.            the rest of the spreadsheet
Table 4.1: MCL clusters for the Project Accounting spreadsheet with Γ = 1.1
With Γ = 1.5 (Γ > 1): The corresponding treemap and cluster tree is given in Fig. 4.9. Refer to Table 4.2 for a summary of identified clusters. A comparison analysis of the tabulated MCL clusters with the formula view of the spreadsheet in Fig.4.2 shows a mismatch with most logical areas.
Figure 4.9: Treemap and cluster tree with Γ = 1.5
Cluster No.   Member Cells
1.            B5:B10
2.            B11:B14
3.            F5, F15
4.            H5, H10, H15
5.            E5:E15, F6:F14, I5:I14
6.            D6:D9, D11:D14, G5:G15, H6:H9
Table 4.2: MCL clusters for the Project Accounting spreadsheet with Γ = 1.5
With Γ = 2.0 (Γ > 1): The corresponding treemap and cluster tree is depicted in Fig. 4.10. Identified MCL clusters with Γ = 2.0 are listed in Table 4.3. A comparison of the identified clusters with the formula view of the spreadsheet shows matches with most logical areas in the spreadsheet.
Figure 4.10: Treemap and cluster tree with Γ = 2.0
Cluster No.   Member Cells
1.            D6, F6, G6, H6
2.            D7, F7, G7, H7
3.            D8, F8, G8, H8
4.            D9, F9, G9, H9
5.            F10, G10, H10
6.            D11, F11, G11, H11
7.            D12, F12, G12, H12
8.            D13, F13, G13, H13
9.            D14, F14, G14, H14
10.           E5:E15, I14
11.           F5, F15
12.           G5, G15
13.           H5, H15
14.           I5, I6
15.           I7
16.           I8
17.           I9
18.           I10
19.           I11
20.           I12
21.           I13
22.           B5:B8
23.           B9, B10
24.           B11:B14
Table 4.3: MCL clusters for the Project Accounting spreadsheet with Γ = 2.0
With Γ = 2.5 (Γ > 1): Consider the treemap (left window) in Fig. 4.11. It is clear from the treemap that we have many MCL clusters with only one, two or three members. For example, one single-membered cluster contains cell F7. An example of an identified two-membered cluster contains cells E10 and I10. An example of a three-membered cluster is a cluster containing cells B12, B13 and B14. Clearly, we cannot get any meaningful clusters.
Figure 4.11: Treemap and cluster tree with Γ = 2.5
With Γ = 3.0 (Γ > 1): We get the treemap and cluster tree as in Fig. 4.12. Again, we have so many smaller-sized clusters.
With Γ = 5.0 (Γ > 1): We get the treemap and cluster tree as in Fig. 4.13. Again, we have so many smaller-sized clusters. The largest cluster in this case has got only three member cells (see the treemap).
Figure 4.12: Treemap and cluster tree with Γ = 3.0
Figure 4.13: Treemap and cluster tree with Γ = 5.0
With Γ = 7.0 (Γ > 1): We get the treemap and cluster tree as in Fig. 4.14. Again, we have so many smaller-sized clusters. The largest cluster in this case has got only two member cells (see the treemap).
Figure 4.14: Treemap and cluster tree with Γ = 7.0
4.3.1 Discussion of experiment results
Using the same analysis technique as with the Project Accounting spreadsheet, we extended our experiment with more spreadsheets. Our results show that the inflation operator, Γ = 2, gives clusters that better match with logical areas in the spreadsheet. Values less than 2 (Γ < 2) give us bigger-sized clusters which do not match with logical areas in the spreadsheet. On the other hand, values greater than 2 (Γ > 2) give us many smaller (tighter) clusters which are not useful either.
For the Project Accounting spreadsheet, the MCL clusters identified when Γ = 2, as listed in Table 4.3, are highlighted with different cell background colours and cell border styles in the spreadsheet, as shown in Fig. 4.15 and Fig. 4.16.
Figure 4.15: The Project Accounting spreadsheet showing highlighted MCL clusters (when Γ = 2)
Figure 4.16: The formula view of the Project Accounting spreadsheet with highlighted MCL clusters (when Γ = 2)
4.4 Testing the efficacy of the MCL algorithm on more spreadsheets
To test the efficacy of the MCL algorithm, we ran the algorithm on one more spreadsheet while maintaining the inflation operator, Γ = 2. The sample spreadsheet used is the Consolidated Balance Sheet depicted in Fig. 4.17. The formula view of the spreadsheet is given in Fig. 4.18. A treemap and cluster tree for the spreadsheet, showing a cluster with cell members F34, F35, F36, F37, F38, F39 and F40, is given in Fig. 4.19.
Table 4.4 is a summary of identified MCL clusters for the spreadsheet. For each cluster, we also determine the degree of conformance. We define the degree of conformance in terms of the number of cells in an MCL cluster and the number of cells which are supposed to be in the corresponding logical area in a spreadsheet. For example, the degree of conformance for cluster 2 is 6/8. This is interpreted as follows: the corresponding logical area for this cluster is supposed to have 8 cells, but cluster 2 contains only 6 of the 8 cells. A similar interpretation goes for all the other clusters listed in Table 4.4. The identified MCL clusters for the Consolidated Balance Sheet spreadsheet are then highlighted as in Fig. 4.20 and Fig. 4.21.
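The degree of conformance can be computed mechanically once a cluster and its intended logical area are written down as sets of cells. The sketch below is illustrative: the function name `degree_of_conformance` is hypothetical, counting only the cluster cells that lie inside the logical area is an assumption of the sketch, and the logical area for cluster 2 is reconstructed from the comment in Table 4.4 (the cluster's six cells plus F40 and F60).

```python
def degree_of_conformance(cluster, logical_area):
    """Return (cells of the logical area found in the cluster,
    cells expected in the logical area), e.g. (6, 8) for 6/8."""
    cluster, logical_area = set(cluster), set(logical_area)
    return len(cluster & logical_area), len(logical_area)


cluster_2 = ["F42", "F43", "F44", "F45", "F46", "F61"]
logical_area_2 = cluster_2 + ["F40", "F60"]   # reconstructed from Table 4.4
found, expected = degree_of_conformance(cluster_2, logical_area_2)
print(f"{found}/{expected}")
```

Keeping the result as a pair rather than a reduced fraction preserves the 6/8 form used in Table 4.4.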
4.4.1 Discussion of experiment results
The MCL algorithm was able to identify clusters that match with logical areas in the Consolidated Balance Sheet spreadsheet. Referring to Table 4.4, minor deviations in clusters 2, 5, 8 and 12 occur because a cell can belong to only one MCL cluster, namely the one in which it has the higher probability of being visited in a random walk. Based on the degree of conformance as a measure, we can conclude that the performance of the MCL algorithm is satisfactory.
Figure 4.17: The Consolidated Balance Sheet spreadsheet from the EUSES spreadsheet corpus [25]
Figure 4.18: The formula view of the Consolidated Balance Sheet spreadsheet
Figure 4.19: A treemap and cluster tree for the Consolidated Balance Sheet depicting a cluster with cell members, F34, F35, F36, F37, F38, F39 and F40
Cluster No.   Member Cells      Degree of conformance   Comments
1.            F50, F54:F60      8/8
2.            F42:F46, F61      6/8                     F40 and F60 are left out because they have been put in other clusters
3.            F34:F40           7/7
4.            E21:E23           3/3
5.            E19, E25:E29      6/8                     E17 and E23 are left out because they have been put in other clusters
6.            E9:E17            9/9
7.            F21:F23           3/3
8.            F19, F25:F29      6/8                     F17 and F23 are left out because they have been put in other clusters
9.            F9:F17            9/9
10.           E50, E54:E60      8/8
11.           E34:E40           7/7
12.           E42:E46, E61      6/8                     E40 and E60 are left out because they have been put in other clusters
Table 4.4: MCL clusters for the Consolidated Balance Sheet spreadsheet
Figure 4.20: The Consolidated Balance Sheet with highlighted (shaded) MCL clusters
Figure 4.21: Formula view of the Consolidated Balance Sheet with highlighted (shaded) MCL clusters
Chapter 5 Comprehending and Debugging Spreadsheets Using MCL Clusters
5.1 Introduction
One of the goals of our spreadsheet visualization tool is to aid in the comprehension and debugging of spreadsheets. In this chapter, we demonstrate how identified MCL clusters can be used to serve that purpose through a process of cluster member verification. Cluster member verification involves verifying whether the members of the identified clusters belong to their respective logical areas. The aim of this process is to comprehend a spreadsheet as well as to identify errors (if any) in it. We use two different spreadsheets in our experiments.
5.2 Analysis of the Project Accounting spreadsheet
We again consider the Project Accounting spreadsheet, shown in Fig. 5.1, with its corresponding formula view given in Fig. 5.2 below. Identified MCL clusters for the spreadsheet are given in Table 5.1. The Project Accounting spreadsheet with highlighted MCL clusters is also given in Fig. 5.3 with a captured Microsoft Excel error message.
Figure 5.1: The Project Accounting spreadsheet
Figure 5.2: The formula view of the Project Accounting spreadsheet.
5.2.1 Verification of MCL clusters for the Project Accounting spreadsheet
Referring to Table 5.1 in conjunction with the formula view of the Project Accounting spreadsheet in Fig. 5.2, clusters 1 to 9 have members which, from the user’s point of view, indeed fall in their respective logical areas. For example, for cluster 1, cells D6, F6, G6 and H6 are in a logical area relating to “Ted” (see Fig. 5.2). We have similar cases for cluster 2 to cluster 9.
Cluster No.   Member Cells
1.            D6, F6, G6, H6
2.            D7, F7, G7, H7
3.            D8, F8, G8, H8
4.            D9, F9, G9, H9
5.            F10, G10, H10
6.            D11, F11, G11, H11
7.            D12, F12, G12, H12
8.            D13, F13, G13, H13
9.            D14, F14, G14, H14
10.           E5:E15, I14
11.           F5, F15
12.           G5, G15
13.           H5, H15
14.           I5, I6
15.           I7
16.           I8
17.           I9
18.           I10
19.           I11
20.           I12
21.           I13
22.           B5, B6, B7, B8
23.           B9, B10
24.           B11, B12, B13, B14
Table 5.1: MCL clusters for the Project Accounting spreadsheet
Figure 5.3: Microsoft Excel displays an error message for a cell in MCL cluster number 5 in the Project Accounting spreadsheet.
Cluster 5 may draw some interest since its members do not follow the pattern of its neighbouring clusters. This is because, as the formula view in Fig. 4.16 shows, the formulas of these cells are structurally different from those of the cells in neighbouring clusters. This is not an error, despite the fact that Microsoft Excel produces an error-warning message (see Fig. 5.3).
Cluster 10 has cell range E5:E15 and cell I14. Cell I14 seems to be the odd one out in this cluster. But this should not be the case. Cell I14 belongs to this cluster through the cell dependency as defined by the formula I14=I13+E14-F14 (see formula view in Fig. 5.2). This is an example of a case where it may not be obvious to the user that cluster members belong to the same logical area. We have this phenomenon because a cell may belong to more than one logical area e.g. column-wise and row-wise. However, it will only belong to one and only one MCL cluster at a time. The cell will belong to the cluster where there is higher probability of being visited in a random walk as defined by the MCL algorithm. In this case, cell I14 has to belong to cluster 10.
Cluster 11, cluster 12 and cluster 13 are also in their respective logical areas. However, one would expect, for example, that cluster 11 would have cell range F5:F14 as part of the cluster, yet cluster 11 has cells F5 and F15 only. Cells F6 to F14 are members of other clusters. The reason for this phenomenon is that a cell can only belong to one cluster at a time, and it will belong to the cluster where it has the higher probability of being visited in a random walk as defined by the MCL algorithm. The same explanation goes for clusters 12 and 13.
Cluster 14 has two cells, cell I5 and cell I6, which are connected by the formula I6=I5+E6-F6. Clusters 15 to 21 are all single-membered. A look at the formula view suggests that they should belong to one logical area at least from the user’s perspective.
From the user’s perspective, clusters 22, 23 and 24 should belong to one logical area containing the cell range B5:B14. However, the MCL algorithm has unnecessarily split the logical area into three different clusters. This is an example of a case where an MCL cluster could sometimes not necessarily match with the user’s perspective of a logical area.
From Table 5.1, we could say that clusters 1 to 21 match the user’s perspective of logical areas while clusters 22, 23 and 24 provide a mismatch. This represents a success rate of 21/24 = 87.5% for the MCL algorithm.
5.3 Analysis of the IPO spreadsheet
We also consider the IPO spreadsheet given in Fig. 5.4. The spreadsheet is used to calculate income after tax for some company. This spreadsheet has been seeded with errors. We list in Table 5.2, the clusters defined by the MCL algorithm for the IPO spreadsheet. Identified MCL clusters for the IPO spreadsheet are highlighted in Fig. 5.5. The formula view of the spreadsheet with highlighted clusters is given in Fig. 5.6.
Figure 5.4: A sample IPO spreadsheet sourced from Ray Panko’s spreadsheet research website [43]
Cluster No.   Member Cells
1.            B6, B20
2.            C6, C20
3.            B4, B8:B15, B17, B18, B19, B21
4.            C4, C8:C15, C17, C18, C19, C21
Table 5.2: MCL clusters for the IPO spreadsheet given in Fig. 5.4
Figure 5.5: The IPO spreadsheet with highlighted MCL clusters.
Figure 5.6: The formula view of the IPO spreadsheet
Figure 5.7: IPO spreadsheet with a Microsoft Excel warning message
5.3.1 Verification of MCL clusters for the IPO spreadsheet
Identified MCL clusters presented in Table 5.2 are highlighted in Fig. 5.5. Cluster 1 has two cells, B6 and B20. A look at the formula view of the IPO spreadsheet in Fig. 5.6 indicates that the two cells are connected by the formula B20=B6*B17. According to the MCL algorithm, these two cells belong to the same cluster. However, it is up to the user to ask whether the cells should really belong to the same cluster.
In this case, a user should notice that what was probably intended was for “Taxes” (B20) to be calculated from “Corporate income tax rate” (B5) and “Sales Revenues” (B17), and not from “Depreciation rate” (B6) and “Sales Revenues”. The end-user would thus recognize this as an error and correct it so that cell B20 has the formula B20=B5*B17. Hence the MCL clustering technique can help the end-user in debugging a spreadsheet. All the end-user needs to do is to verify whether the members of a particular MCL cluster logically belong together. A similar analysis applies to cluster 2, which has cell members C6 and C20.
Notice that in Fig. 5.7, Microsoft Excel produces a warning message about cell B20. We feel that the message is not very helpful because the hint it offers towards a solution is not appropriate. On the other hand, it is easy to deduce the source of the error from the MCL clusters by simply verifying whether the members of a cluster logically belong together.
It is easy to see that cluster 3 and cluster 4 belong to logical areas that match the user’s perspective. However, it is interesting to note that cells B5, C5, B7 and C7 do not belong to any cluster. A closer look at Fig. 5.5 shows that these cells have not been highlighted. A spreadsheet developer would therefore ask why this is the case and, in trying to answer this question, would realize that these cells have not been used in any calculation in the spreadsheet. This is a potential error in the spreadsheet, and thus the MCL clustering technique can help the end-user find such errors, thereby aiding the spreadsheet debugging process.
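The check for numeric cells that no formula uses can be illustrated with a short Python sketch. It is not part of the prototype (which is written in VBA and Java); the dictionary layout, the function names and the cell values are assumptions made purely for illustration. Only the cell addresses and the seeded formula B20=B6*B17 come from the discussion above.

```python
import re

# A1-style reference with optional $ markers and an optional range end (e.g. B8:B15).
# Caveat of this sketch: function names that end in digits (e.g. LOG10) would also match.
CELL_REF = re.compile(r"\$?([A-Z]+)\$?(\d+)(?::\$?([A-Z]+)\$?(\d+))?")

def col_to_num(col):
    """Convert a column label such as 'B' or 'AA' to a 1-based number."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def num_to_col(n):
    """Convert a 1-based column number back to its label."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label

def referenced_cells(formula):
    """Return the set of cell addresses a formula refers to, expanding ranges."""
    cells = set()
    for c1, r1, c2, r2 in CELL_REF.findall(formula):
        if not c2:                      # single cell reference
            cells.add(f"{c1}{r1}")
            continue
        for c in range(col_to_num(c1), col_to_num(c2) + 1):
            for r in range(int(r1), int(r2) + 1):
                cells.add(f"{num_to_col(c)}{r}")
    return cells

def unused_numeric_cells(sheet):
    """Return numeric cells that are not referenced by any formula in the sheet."""
    used = set()
    for content in sheet.values():
        if isinstance(content, str) and content.startswith("="):
            used |= referenced_cells(content)
    return sorted(addr for addr, content in sheet.items()
                  if isinstance(content, (int, float)) and addr not in used)

# Hypothetical fragment of the IPO spreadsheet; the numeric values are made up.
sheet = {
    "B5": 0.25,         # "Corporate income tax rate"
    "B6": 0.10,         # "Depreciation rate"
    "B17": 1000.0,      # "Sales Revenues"
    "B20": "=B6*B17",   # the seeded error discussed above
}
print(unused_numeric_cells(sheet))  # ['B5']
```

In this toy fragment the tax rate in B5 is reported as unused, which is the same clue that the missing highlight in Fig. 5.5 gives to the user.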
5.4 Summary of experiment results
The MCL algorithm will most often produce clusters that match with the user’s perspective of logical areas in spreadsheets. The identified MCL clusters have also been shown to aid in the spreadsheet debugging process through the process of cluster membership verification and identification of unused numeric spreadsheet cells.
Chapter 6 Implementation
6.1 Introduction
We implemented the visualization tool using the Microsoft Excel spreadsheet system in conjunction with the Graphael graph drawing software. Microsoft Excel was chosen because it is a commonly used spreadsheet system. We chose the Graphael program because it already has an implementation of the MCL algorithm, our algorithm of choice for clustering. We did our programming on the Microsoft Excel side using Visual Basic for Applications (VBA). On the Graphael side, we had to modify its open-source Java code to suit our needs.
6.2 Software architecture of the visualization tool
A conceptual architecture of the spreadsheet visualization tool is given in Figure 6.1. In this conceptual architecture, the system is viewed as an aggregation of cooperating components which are represented by boxes and arrows. Screenshots of the prototype of the visualization tool are given in Figure 6.2 and Figure 6.3.
In the architecture, whenever a user initiates the cluster generation process by invoking the appropriate command from a dropdown menu in the spreadsheet system interface (the “Visualize” menu provides the list of commands), the spreadsheet parser module is run. The spreadsheet parser module is coded in Microsoft Visual Basic for Applications (VBA). Spreadsheet dependencies are extracted by iterating row-wise through all used spreadsheet cells. The cell dependency information is written to a text file in the Graph Modelling Language (GML) [33] file format. The algorithm for the spreadsheet parser module is given in Algorithm 2 in Appendix A. A sample GML file is also given in Appendix B.
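As a rough illustration of the parsing step, the following Python sketch extracts dependencies row by row and serializes them as GML. It is not the VBA module of Algorithm 2: the function names, the dictionary representation of the sheet, the choice to direct each edge from the referenced cell to the formula cell, and the exact set of GML attributes written are all assumptions. Range references such as B8:B15 are not expanded in this sketch.

```python
import re

# Single A1-style cell reference, with optional $ markers for absolute addressing.
CELL = re.compile(r"\$?([A-Z]+)\$?(\d+)")

def row_major_key(addr):
    """Sort key that orders addresses row by row, then column by column."""
    col, row = CELL.fullmatch(addr).groups()
    return (int(row), len(col), col)

def extract_edges(sheet):
    """Visit used cells row-wise; emit (referenced cell, formula cell) pairs."""
    edges = []
    for addr in sorted(sheet, key=row_major_key):
        content = sheet[addr]
        if isinstance(content, str) and content.startswith("="):
            for col, row in CELL.findall(content):
                edges.append((col + row, addr))
    return edges

def to_gml(edges):
    """Serialize the edge list as a directed graph in GML text form."""
    labels = []
    for src, dst in edges:
        for cell in (src, dst):
            if cell not in labels:
                labels.append(cell)
    ids = {label: i for i, label in enumerate(labels)}
    lines = ["graph [", "  directed 1"]
    for label, i in ids.items():
        lines.append(f'  node [ id {i} label "{label}" ]')
    for src, dst in edges:
        lines.append(f"  edge [ source {ids[src]} target {ids[dst]} ]")
    lines.append("]")
    return "\n".join(lines)

# Hypothetical three-cell sheet; A1 holds the formula =B1+$C$2.
sheet = {"A1": "=B1+$C$2", "B1": 5, "C2": 7}
print(to_gml(extract_edges(sheet)))
```

Writing the returned string to a text file would give the graph drawing program a file it can parse; the layout of the actual files produced by the tool is the one shown in Appendix B.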
Figure 6.1: Conceptual architecture of the spreadsheet visualization tool
The graph drawing application, Graphael, is then invoked through a Windows Application Programming Interface (API) command. The invoked Graphael program then parses the GML text file for syntactical conformance to the Graph Modelling Language (GML). This is done using the GML graph parser which is part of the Graphael program. After successful parsing, a graph is generated in the form of an adjacency matrix. An adjacency matrix is a square matrix, M, such that M_ij = 1 if and only if (i, j) is an edge in graph G and M_ij = 0 otherwise. An adjacency matrix can be implemented programmatically as an array [1...n, 1...n], where n is the number of nodes, or using any other appropriate data structure.
Figure 6.2: A screenshot of the prototype for the visualization with a “Balance Sheet” spreadsheet, a cluster window (top-right window) and a treemap window (bottom-right window).
Figure 6.3: A screenshot of the prototype showing the formula view of the “Balance Sheet” spreadsheet.
From the generated graph, the Markov clustering (MCL) algorithm is run to produce the required graph clusters. After the MCL graph clusters have been produced, they are displayed by organizing them in a cluster tree. The leaves of the cluster tree are the nodes (cells) of a particular cluster. The top-most level view of the cluster tree is a single node. The cluster tree is displayed at different levels in the cluster display window as in Fig. 6.2. A right-click or left-click of the mouse over a node navigates up or down the cluster tree. Nodes which have already been visited are distinguished by colour: green indicates an already visited node (see Fig. 6.4).
Cell members of the currently selected cluster are labelled with the cell information they represent. Nodes representing clusters which are not currently selected are left unlabelled. Linkages between members of the currently selected cluster and other clusters are indicated by edges between them. These linkages are shown because a cell may belong to more than one logical area in the corresponding spreadsheet. This technique is known as compound fisheye views. Compound fisheye views help us to view the members of a particular cluster while showing their linkages with other clusters.
To help in the navigation of the cluster tree, we also use a treemap window in coordination with the cluster window as in Fig. 6.2. While navigating the cluster tree in the cluster window, we use the treemap to see the depth we are at in the cluster tree. The treemap also shows the number of nodes in a particular cluster without our having to navigate to that cluster.
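The adjacency matrix and the clustering step described in this section can be sketched in Python as follows. This is a minimal textbook-style rendering of Markov clustering (alternating expansion and inflation on a column-stochastic matrix), not the Graphael implementation. The function names, the inflation value of 2, the added self-loops, the symmetrization of the data-flow edges and the convergence tolerance are all assumptions of the sketch.

```python
def adjacency_matrix(n, edges):
    """Build an n x n matrix with M[i][j] = 1 for each edge (i, j).

    The edges are stored in both directions here, because this sketch
    treats the data-flow graph as undirected for clustering purposes.
    """
    m = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        m[i][j] = m[j][i] = 1.0
    return m

def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-9):
    """Return clusters (sorted lists of node indices) found by a basic MCL loop."""
    n = len(adj)
    m = [row[:] for row in adj]
    for i in range(n):
        m[i][i] = 1.0                      # self-loops keep every column non-empty
    normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: multiply the matrix by itself (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize each column.
        inflated = normalize_columns([[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows with a positive diagonal entry act as attractors; the non-zero
    # entries in such a row are the members of that attractor's cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)

# Hypothetical graph: a triangle (0, 1, 2) and a separate pair (3, 4).
adj = adjacency_matrix(5, [(0, 1), (1, 2), (0, 2), (3, 4)])
print(mcl(adj))  # [[0, 1, 2], [3, 4]]
```

The dense nested-list matrix keeps the sketch close to the array [1...n, 1...n] view of the adjacency matrix; a practical implementation for large graphs would normally use a sparse representation.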
Figure 6.4: A screenshot of the prototype showing the “Balance Sheet” spreadsheet with highlighted logical areas.
As the user selects members of a particular cluster, the cluster members are written into a text file. On demand, the members of the currently selected cluster in the cluster window can be highlighted in the spreadsheet. This is done by invoking the appropriate command from a dropdown menu in the spreadsheet system interface. Behind the scenes, the spreadsheet cluster/logical area highlighter module, written in VBA, is run; it uses the cluster member text file to highlight the currently selected cluster cells as in Fig. 6.2. The algorithm for the spreadsheet highlighter module is given in Algorithm 3 in Appendix A. As the user navigates through the clusters in the cluster window, the user may repeat the spreadsheet highlighting process, which leads to all logical areas being highlighted with different colours in the spreadsheet as in Fig. 6.4. The user is also aided in navigating clusters in the cluster window by the distinct colouring of visited nodes, which helps the user to access only those clusters that still need to be visited.
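The highlighting step can be pictured with the following Python sketch. It is not Algorithm 3: the file layout (one cell address per line), the colour palette and the function names are assumptions, and instead of colouring real spreadsheet cells the sketch only records which colour each (row, column) position would receive.

```python
import re

PALETTE = ["yellow", "cyan", "orange", "pink"]  # hypothetical highlight colours

def parse_address(addr):
    """Convert an address such as 'B20' to (row, column) = (20, 2)."""
    col, row = re.fullmatch(r"([A-Z]+)(\d+)", addr).groups()
    col_num = 0
    for ch in col:
        col_num = col_num * 26 + ord(ch) - ord("A") + 1
    return int(row), col_num

def read_cluster_members(lines):
    """Read cell addresses, one per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]

def highlight(shading, members, cluster_index):
    """Record the colour of one cluster; each cluster gets its own colour."""
    colour = PALETTE[cluster_index % len(PALETTE)]
    for addr in members:
        shading[parse_address(addr)] = colour
    return shading

# Clusters 1 and 2 of Table 5.2, as they might appear in the member file.
shading = {}
highlight(shading, read_cluster_members(["B6\n", "B20\n"]), 0)
highlight(shading, read_cluster_members(["C6\n", "C20\n"]), 1)
print(shading[(20, 2)])  # yellow
```

Repeating the call once per visited cluster mirrors how repeated highlighting in the tool ends with every logical area shown in its own colour.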
6.3 Summary
In this chapter, we presented how we implemented the prototype of the spreadsheet visualization tool. We paid particular attention to demonstrating how the visualization tool works with reference to its conceptual architecture.
Chapter 7 Discussion
7.1 Introduction
In this research work, we focussed on producing a graph-based spreadsheet visualization tool that would not only aid in the understanding of a spreadsheet but also aid in the debugging and maintenance of spreadsheets. However, as in any visualization tool, we also needed to incorporate human-computer interaction (HCI) aspects in trying to achieve the goals of the visualization tool. In the sequel, we present how we attempted to address these issues in our work.
7.2 Spreadsheet understanding and comprehension
Understanding and comprehension of spreadsheets can be a daunting task, especially when one just uses the superficial numerical (value) view of a spreadsheet. The need for comprehension arises in particular when one has to work with a spreadsheet developed by others. One of the purposes of our spreadsheet visualization tool was to use information from the underlying spreadsheet data-flow graph to aid spreadsheet programmers/users in understanding their spreadsheets.
We achieved this through the use of a graph clustering algorithm that produces graph clusters that match with logical areas in the original spreadsheet. The identified clusters are then highlighted using different cell background colours on the original spreadsheet.
Hence, instead of looking at the whole spreadsheet at once, the user focusses his/her attention on one highlighted logical area at a time. The spreadsheet understanding process is therefore properly guided since the focus area matches what the user might perceive to be a logical area. In addition, the user also has the option of analyzing the cell members of a particular cluster in the graph cluster window rather than in the spreadsheet.
7.3 The spreadsheet debugging process
Debugging a spreadsheet is also a daunting task since the numerical (value) view of a spreadsheet hides the computational details. Although one can access details of how spreadsheet computations are done through the formula view of a spreadsheet, arbitrary cell-by-cell inspection of cell formulas is also challenging. Examining the corresponding data-flow graph provides a solution to this problem. We, therefore, used information from a spreadsheet data-flow graph in the spreadsheet debugging process by first generating graph clusters using the MCL algorithm and then highlighting the identified clusters in the original spreadsheet. The clusters correspond to logical areas in the spreadsheet. We then demonstrated how, through a process of cluster member verification, one can identify some types of errors in the spreadsheet. Cluster member verification involves analysing whether a cell belongs to a particular logical area or not. Cell formulas in a particular logical area are also analysed through the same process. Unused numerical cells in the spreadsheet are also easily identified since they are not highlighted in the spreadsheet. This is because they are not referenced by any cell formula and are therefore not part of the spreadsheet data-flow graph.
However, we note that there are other types of spreadsheet errors which we cannot identify using our visualization tool. For example, if a user enters a wrong value in a cell due to a typographical error, it might be difficult to isolate such an error. We therefore propose that the tool be used together with other existing spreadsheet debugging techniques such as the use of assertions in spreadsheets [12]. Assertions help in making sure that numerical cells have expected values.
7.4 Spreadsheet maintenance
Non-trivial spreadsheets undergo maintenance cycles as in conventional software. However, spreadsheets get larger as they undergo such maintenance routines. We therefore also need a spreadsheet visualization tool that is able to handle large spreadsheets. Large spreadsheets result in large data-flow graphs, which lead to problems of graph navigability. We handled this problem through a graph clustering algorithm that is designed to scale to large graphs. In particular, we used the MCL algorithm. Our experiments indicated that it scaled well to large graphs.
In addition, the spreadsheet tool might also be used in creating spreadsheet documentation artefacts since one can capture and store cell dependency information at a particular time through the use of external graph definition text files. The documentation artefacts could then be used in tracking changes to spreadsheets as the spreadsheet evolves.
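To make the change-tracking idea concrete, the following Python sketch compares two stored snapshots of cell dependency information. The tool itself does not provide such a comparison; the function name and the representation of a snapshot as a list of (referenced cell, formula cell) pairs are assumptions. The example pairs are taken from the correction of cell B20 discussed in Section 5.3.1.

```python
def diff_dependencies(old_edges, new_edges):
    """Report dependencies that appear in only one of two snapshots."""
    old, new = set(old_edges), set(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

# Snapshot before the fix: B20 = B6*B17. Snapshot after the fix: B20 = B5*B17.
before = [("B6", "B20"), ("B17", "B20")]
after = [("B5", "B20"), ("B17", "B20")]
print(diff_dependencies(before, after))
# {'added': [('B5', 'B20')], 'removed': [('B6', 'B20')]}
```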
7.5 Addressing HCI aspects
We were able to generate spreadsheet data-flow graphs of any given spreadsheet and then display the graph separately from the spreadsheet window. We separated the data-flow graph from the spreadsheet because we believe that superimposing the graph over the spreadsheet clutters the view of the spreadsheet. However, we tried to maintain the mapping between the spreadsheet and the graph by labelling graph nodes with corresponding familiar spreadsheet cell addresses such as “A1”. All spreadsheet cells with formulas also had their corresponding graph nodes labelled with their formula definitions.
The graphs can also be regenerated any time the user wishes by clicking a command button on the spreadsheet system interface. This dynamism in graph generation is important because it allows the user to work on a spreadsheet while accessing the corresponding graph alongside it, thus achieving real-time spreadsheet-graph interactivity.
To deal with the problem of visualizing large spreadsheet graphs (which come from large spreadsheets), we employed the MCL algorithm, which was satisfactorily able to find “natural” clusters in the graphs; these clusters are then highlighted in the corresponding spreadsheet. It was important to find a clustering algorithm whose clusters match logical areas in the corresponding spreadsheet. An algorithm that produces “meaningless” clusters would make the spreadsheet harder to comprehend, thus defeating one of the purposes of the spreadsheet visualization tool.
To help in the navigation of a generated data-flow graph, we employed two complementary windows for the visualization: the cluster window and the treemap window. The generated MCL clusters are arranged in a cluster tree which is displayed in the cluster window. The cluster window displays the generated MCL clusters as nodes, with the root of the cluster tree represented as a node which is used as the starting point for graph navigation. A treemap, on the other hand, is a visualization technique in which hierarchical information is displayed within nested rectangles, with each level of nesting corresponding to a level of hierarchical decomposition. In our case, the cluster tree is a hierarchical decomposition of the data-flow graph, and as a result the treemap is a complementary navigation aid as one accesses the graph in the cluster window. Treemaps not only help to visualize the depth we are at while navigating a cluster tree but also indicate the number of member nodes in a selected cluster. However, we note that the inclusion of the treemap window introduces three different windows on the display (i.e. the spreadsheet window, the cluster window and the treemap window). This might lead to the problem of information overload on the part of the viewer, which we have to investigate further.
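The relationship between the cluster tree and the treemap can be sketched in Python with the simple slice-and-dice layout, in which the split direction alternates at each level and every rectangle's area is proportional to its number of leaf cells. This is not the layout code of Graphael; the nested-list representation of the cluster tree, the function names and the unit-square canvas are assumptions of the sketch.

```python
def leaf_count(tree):
    """Number of cells (leaves) below a node; a leaf is a cell address string."""
    if isinstance(tree, str):
        return 1
    return sum(leaf_count(child) for child in tree)

def treemap(tree, x=0.0, y=0.0, w=1.0, h=1.0, depth=0):
    """Slice-and-dice layout.

    Returns one tuple (depth, leaf_count, x, y, width, height) per node,
    parents before children. Even depths split along x, odd depths along y.
    """
    rects = [(depth, leaf_count(tree), x, y, w, h)]
    if isinstance(tree, str):
        return rects
    total = leaf_count(tree)
    offset = 0.0
    for child in tree:
        share = leaf_count(child) / total
        if depth % 2 == 0:
            rects += treemap(child, x + offset * w, y, share * w, h, depth + 1)
        else:
            rects += treemap(child, x, y + offset * h, w, share * h, depth + 1)
        offset += share
    return rects

# Hypothetical two-level cluster tree built from clusters 1 and 2 of Table 5.2.
tree = [["B6", "B20"], ["C6", "C20"]]
for rect in treemap(tree):
    print(rect)
```

Each tuple carries the two pieces of information the treemap window conveys during navigation: the depth in the cluster tree and the number of member nodes under the rectangle.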
7.6 Summary
In this chapter, we discussed some of the issues addressed in our graph-based spreadsheet visualization. We discussed how our visualization attempts to address the problems of spreadsheet understanding and comprehension as well as debugging. We also highlighted how several human-computer interaction (HCI) concerns are addressed.
Chapter 8 Conclusion
8.1 A summary of the research work
In this research work, we presented a graph-based spreadsheet visualization that can simplify the task of debugging and understanding (hence maintenance) of spreadsheets. In particular, we tried to address the following three important aspects:
(i) Provision of a graph-based visualization that is in a separate window from the original spreadsheet. The main purpose of separating the graph from the spreadsheet is to avoid information overload on the user due to the cluttering of the spreadsheet view. However, we note that an issue that can be raised is the difficulty of mapping between the spreadsheet and the graph. We handled this by dynamically generating the graph from the spreadsheet. In addition, we showed the link between spreadsheet cells and the corresponding graph nodes by labelling graph nodes using cell addresses.
(ii) Application of a clustering algorithm to handle the visualization of large spreadsheets (which lead to large data-flow graphs). This has been handled with the MCL algorithm, which is one of the algorithms specifically developed to handle the visualization of large graphs. We tried to improve the navigability of graph clusters using compound fisheye views and treemaps.
(iii) Provision of a clustering algorithm which identifies graph clusters that match logical areas in the original spreadsheet. We achieved this by using the MCL algorithm. We observed through our experiments that the algorithm satisfactorily identifies graph clusters that match logical areas in spreadsheets. This is a novel way of finding logical areas in spreadsheets since the logical areas are found without necessarily looking at the structural similarity of cell formulas.
8.2 Our contribution
The following are the main contributions of this research work:
(i) We have developed a prototype tool that dynamically generates spreadsheet data-flow graphs which are separated from the spreadsheets. This is in contrast to graph-based spreadsheet visualization techniques that process spreadsheet graphs statically and separately from the original spreadsheet.
(ii) We have used a novel way of visualizing large spreadsheet data-flow graphs by successfully employing the MCL algorithm to find clusters in the data-flow graphs that match logical areas in spreadsheets (logical areas are not necessarily derived from structural similarity of cell formulas).
(iii) We have also demonstrated how the graph-based visualization using the MCL algorithm can assist in understanding and debugging spreadsheets.
8.3 Limitations
We have used two different software applications hand in hand to produce the visualization tool. This in itself is a disadvantage because the software applications use incompatible programming languages, i.e. VBA for Microsoft Excel and Java for the Graphael program. This meant that we had to find a VBA-Java application programming interface (API) implementation. Unfortunately, we found none that could provide for real-time spreadsheet-visualization interaction. We therefore had to use text files as the means of communication between Microsoft Excel and the Graphael program. We also had to use a text file to submit the input graph to the graph drawing software because the graph drawing software currently accepts graph definitions only in files, which are parsed before a graph is generated. Similarly, for the cluster highlighting process, cluster member names are written to a text file by the Graphael program, after which the file is read by Microsoft Excel and the cluster members are highlighted in the spreadsheet.
Writing and reading text files introduces a computation overhead which affects the spreadsheet-visualization interaction response time. The response time would have been better if the graph drawing procedure had been implemented as part of Microsoft Excel. Unfortunately, VBA is not powerful enough to handle advanced algorithms like the MCL algorithm, which need complex data structures.
8.4 Future work
In order to address some of the limitations as well as to introduce new features in our work, we plan to go in the following research directions:
(i) We plan to port the MCL algorithm and other graph drawing procedures into the Microsoft Excel spreadsheet system by using compatible programming languages such as Microsoft C#. This will eliminate the need for separate graph drawing software, which we envisage might improve the spreadsheet-visualization response time.
(ii) We also plan to conduct experiments to investigate the computation overhead of the visualization tool.
(iii) We also plan to conduct trials of the visualization tool with spreadsheet users to gauge the usefulness and usability of the tool.
Bibliography
[1] Abello, J., Kobourov, S. G., and Yusufov, R. Visualizing Large Graphs with Compound-Fisheye Views and Treemaps. In Proceedings of the 12th International Symposium on Graph Drawing (2004), pp. 431–441.
[2] Abraham, R., and Erwig, M. Header and Unit Inference for Spreadsheets through Spatial Analyses. In IEEE International Symposium on Visual Languages and Human-Centric Computing (2004), IEEE, pp. 165–172.
[3] Abraham, R., and Erwig, M. Goal-Directed Debugging of Spreadsheets. In Proceedings of the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC’05) (Washington, DC, USA, 2005), IEEE Computer Society, pp. 37–44.
[4] Abraham, R., and Erwig, M. How to Communicate Unit Error Messages in Spreadsheets. In Proceedings of the First Workshop on End-User Software Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.
[5] Abraham, R., and Erwig, M. Type Inference for Spreadsheets. In Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (New York, NY, USA, 2006), ACM, pp. 73–84.
[6] Ayalew, Y. A User-Centred Approach for Testing Spreadsheets. International Journal of Computing and ICT Research 1, 1 (2007), 77–85.
[7] Ayalew, Y., Clermont, M., and Mittermeir, R. T. Detecting Errors in Spreadsheets. In Proceedings of the 1st European Spreadsheet Risks Interest Group Symposium: Spreadsheet Risks, Audit and Development Methods (London, UK, 2000).
[8] Ayalew, Y., and Mittermeir, R. T. Spreadsheet Debugging. In Proceedings of the 4th European Spreadsheet Risks Interest Group Symposium (Dublin, Ireland, 2003).
[9] Ballinger, D., Biddle, R., and Noble, J. Spreadsheet Structure Inspection Using Low Level Access and Visualisation. In Proceedings of the Fourth Australasian Conference on User Interfaces (Darlinghurst, Australia, 2003), Australian Computer Society, Inc., pp. 91–94.
[10] Blackwell, A. F. What is Programming? In Proceedings of the 14th Workshop of the Psychology of Programming Interest Group, J. Kuljis, L. Baldwin, and R. Scoble, Eds. PPIG, London, UK, 2002, pp. 204–218.
[11] Burnett, M., Atwood, J., Djang, R. W., Gottfried, H., Reichwein, J., and Yang, S. Forms/3: A First Order Visual Language to Explore the Boundaries of the Spreadsheet Paradigm. Journal of Functional Programming 11, 2 (2001), 155–206.
[12] Burnett, M., Cook, C., Pendse, O., Rothermel, G., Summet, J., and Wallace, C. End-User Software Engineering with Assertions in the Spreadsheet Paradigm. In ICSE ’03: Proceedings of the 25th International Conference on Software Engineering (Washington, DC, USA, 2003), IEEE Computer Society, pp. 93–103.
[13] Burnett, M., Cook, C., and Rothermel, G. End-User Software Engineering. Commun. ACM 47, 9 (2004), 53–58.
[14] Chen, T. Y., Kuo, F.-C., and Zhou, Z. Q. An Effective Testing Method for End-User programmers. In Proceedings of the First Workshop on End-user Software Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.
[15] Clermont, M. Heuristics for the Automatic Identification of Irregularities in Spreadsheets. In Proceedings of the First Workshop on End-user Software Engineering (New York, NY, USA, 2005), ACM, pp. 1–6.
[16] Clermont, M., Hanin, C., and Mittermeir, R. T. A Spreadsheet Tool Evaluated in an Industrial Context. In Proceedings of the 3rd European Spreadsheet Risks Interest Group Symposium (Cardiff, Wales, 2002).
[17] Davis, J. S. Tools for Spreadsheet Auditing. International Journal of Human-Computer Studies 45, 4 (1996), 429–442.
[18] Deligiannidis, L., Kochut, K. J., and Sheth, A. P. User-centered Incremental Data Exploration and Visualization. Tech. rep., LSDIS Lab and Computer Science, University of Georgia, Athens, USA, 2006.
[19] Di-Battista, G., Eades, P., Tamassia, R., and Tollis, I. G. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice–Hall, Upper Saddle River, New Jersey, USA, 1999.
[20] Ellson, J., Gansner, E., Koutsofios, L., North, S. C., and Woodhull, G. Graphviz Open Source Graph Drawing Tools. In Graph Drawing, Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2002, pp. 594–597.
[21] Engels, G., and Erwig, M. ClassSheets: Automatic Generation of Spreadsheet Applications from Object-Oriented Specifications. In ASE ’05: Proceedings of the 20th IEEE/ACM international Conference on Automated Software Engineering (New York, NY, USA, 2005), ACM, pp. 124–133.
[22] Enright, A. J., Van Dongen, S., and Ouzounis, C. A. An Efficient Algorithm for Large-Scale Detection of Protein Families. Nucleic Acids Research 30, 7 (2002), 1575–1584.
[23] Erwig, M., Abraham, R., Cooperstein, I., and Kollmansberger, S. Automatic Generation and Maintenance of Correct Spreadsheets. In Proceedings of the 27th IEEE/ACM International Conference on Software Engineering (2005), pp. 136–145.
[24] Fisher, M., Cao, M., Rothermel, G., Cook, C. R., and Burnett, M. Automated Test Case Generation for Spreadsheets. In Proceedings of the 24th International Conference on Software Engineering (New York, NY, USA, 2002), ACM, pp. 141–153.
[25] Fisher, M., and Rothermel, G. The EUSES Spreadsheet Corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. SIGSOFT Software Engineering Notes 30, 4 (2005), 1–5.
[26] Forrester, D., Kobourov, S. G., Navabi, A., Wampler, K., and Yee, G. V. Graphael: A System for Generalized Force-Directed Layouts. In Graph Drawing, J. Pach, Ed., vol. 3383 of Lecture Notes in Computer Science. Springer, 2004, pp. 454–464.
[27] Galletta, D. F., Hartzel, K. S., Johnson, S., Joseph, J., and Rustagi, S. An Experimental Study of Spreadsheet Presentation and Error Detection. In HICSS ’96: Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS) Volume 2: Decision Support and Knowledge-Based Systems (Washington, DC, USA, 1996), IEEE Computer Society, p. 336.
[28] Gansner, E. R., Koren, Y., and North, S. C. Topological Fisheye Views for Visualizing Large Graphs. IEEE Transactions on Visualization and Computer Graphics 11, 4 (2005), 457–468.
[29] Godehardt, E. Graphs as Structural Models – 2nd Edition. In Advances in Systems Analysis, D. P. F. Moller, Ed. Vieweg, Germany, 1990.
[30] Graphael Home Page. URL: http://graphael.cs.arizona.edu/, Access date: 1st August, 2007.
[31] GraphViz Home Page. URL: http://www.graphviz.org/, Access date: 1st August, 2007.
[32] Hendry, D., and Green, T. Creating, Comprehending and Explaining Spreadsheets: A Cognitive Interpretation of What Discretionary Users Think of the Spreadsheet Model. International Journal of Human-Computer Studies 40, 6 (June 1994), 1033–1065.
[33] Himsolt, M. GML: a portable Graph File Format. Tech. rep., Universität Passau, 94030 Passau, Germany, 1996. URL: http://www.infosun.fim.uni-passau.de/Graphlet/GML/gml-tr.html, Access date: 10th August, 2007.
[34] Igarashi, T., Mackinlay, J. D., Chang, B. W., and Zellweger, P. T. Fluid Visualization of Spreadsheet Structures. In Proceedings of the IEEE Symposium on Visual Languages (1998), pp. 118–125.
[35] Kankuzi, B., and Ayalew, Y. A Dynamic Graph-Based Visualization for Spreadsheets. In Proceedings of the 3rd IASTED Conference on Human-Computer Interaction (Innsbruck, Austria, 2008), pp. 198–203.
[36] Kankuzi, B., and Ayalew, Y. A User-Centered Graph-Based Visualization for Spreadsheets. In Proceedings of the 4th International Workshop on End-User Software Engineering (WEUSE ’08) (Leipzig, Germany, 2008), ACM Press, pp. 86–90.
[37] Ko, A. J. Barriers to Successful End-User Programming. In End-User Software Engineering, M. H. Burnett, G. Engels, B. A. Myers, and G. Rothermel, Eds., no. 07081 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
[38] Mittermeir, R., and Clermont, M. Finding High-Level Structures in Spreadsheet Programs. In WCRE ’02: Proceedings of the Ninth Working Conference on Reverse Engineering (WCRE’02) (Washington, DC, USA, 2002), IEEE Computer Society, p. 221.
[39] Myers, B. A., Burnett, M. M., Wiedenbeck, S., and Ko, A. J. End user software engineering: CHI 2007 special interest group meeting. In CHI ’07 Extended Abstracts on Human Factors in Computing Systems (New York, NY, USA, 2007), ACM, pp. 2125–2128.
[40] Myers, B. A., Ko, A. J., and Burnett, M. M. Invited Research Overview: End-User programming. In CHI ’06 Extended Abstracts on Human factors in Computing Systems (New York, NY, USA, 2006), ACM, pp. 75–80.
[41] Nardi, B., and Miller, J. The Spreadsheet Interface: A Basis for End User Programming. Hewlett-Packard, 1990.
[42] Nardi, B. A. A small matter of programming: perspectives on end-user computing. The MIT Press, 1993.
[43] Panko, R. R. Ray Panko’s Spreadsheet Research website. URL: http://panko.shidler.hawaii.edu/SSR/index.htm, Accessed: 29th September, 2007.
[44] Panko, R. R. Applying Code Inspection to Spreadsheet Testing. Journal of Management Systems 16, 2 (1999).
[45] Panko, R. R. Spreadsheet Errors: What We Know. What We Think We Can Do. In Proceedings of the European Spreadsheet Risk Interest Group Symposium (2000).
[46] Panko, R. R., and Sprague, R. H. Hitting the Wall: Errors in Developing and Code Testing a Simple Spreadsheet Model. Decision Support Systems 22, 4 (1998).
[47] Pietriga, E. A Toolkit for Addressing HCI Issues in Visual Language Environments. IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) 00 (2005), 145–152.
[48] Randolph, N., Morris, J., and Lee, G. A Generalised Spreadsheet Verification Methodology. In ACSC ’02: Proceedings of the Twenty-Fifth Australasian Conference on Computer science (Darlinghurst, Australia, 2002), Australian Computer Society, Inc., pp. 215–222.
[49] Ronen, B., Palley, M. A., and Lucas, H. C., Jr. Spreadsheet Analysis and Design. Commun. ACM 32, 1 (1989), 84–93.
[50] Rothermel, G., Burnett, M., Li, L., Dupuis, C., and Sheretov, A. A Methodology for Testing Spreadsheets. ACM Transactions on Software Engineering and Methodology 10, 1 (2001), 110–147.
[51] Ruthruff, J. R., and Burnett, M. Six Challenges in Supporting End-User Debugging. In Proceedings of the First Workshop on End-user Software Engineering (WEUSE I) (New York, NY, USA, 2005), ACM, pp. 1–6.
[52] Ruthruff, J. R., Prabhakararao, S., Reichwein, J., Cook, C., Creswick, E., and Burnett, M. Interactive, Visual Fault Localization Support for End-User Programmers. Tech. rep., School of Electrical Engineering and Computer Science, Oregon State University USA, 2004.
[53] Sajaniemi, J. Modeling Spreadsheet Audit: A Rigorous Approach to Automatic Visualization. Journal of Visual Languages and Computing 11, 1 (2000), 49–82.
[54] Sajaniemi, J. Modeling Spreadsheet Audit: A Rigorous Approach to Automatic Visualization. Journal of Visual Languages and Computing 11, 1 (2000), 49–82.
[55] Scaffidi, C., Shaw, M., and Myers, B. An Approach for Categorizing End User Programmers to Guide Software Engineering Research. In WEUSE I: Proceedings of the First Workshop on End-user Software Engineering (New York, NY, USA, 2005), ACM, pp. 1–5.
[56] Segal, J. Two principles of end-user software engineering research. In Proceedings of the First Workshop on End-user Software Engineering (WEUSE I) (New York, NY, USA, 2005), ACM, pp. 1–5.
[57] Seta, K., Ikeda, M., Kakusho, O., and Mizoguchi, R. Capturing a Conceptual Model for End-User Programming: Task Ontology as a Static User Model. In User Modeling: Proceedings of the Sixth International Conference, UM97, A. Jameson, C. Paris, and C. Tasso, Eds. Springer Wien New York, Vienna, New York, 1997, pp. 203–214.
[58] Sjoberg, D. I. K., Dyba, T., and Jorgensen, M. The Future of Empirical Methods in Software Engineering Research. In FOSE ’07: 2007 Future of Software Engineering (Washington, DC, USA, 2007), IEEE Computer Society, pp. 358–378.
[59] Tichy, W. F. Should Computer Scientists Experiment More? IEEE Computer 31, 5 (1998), 32–40.
[60] Tollis, I. G. Graph Drawing and Information Visualization. ACM Computing Surveys (1996), 19.
[61] van Dongen, S. MCL - an algorithm for clustering graphs. URL: http://micans.org/mcl/, Access date: 1st August, 2007.
[62] van Dongen, S. Graph Clustering by Flow Simulation. PhD thesis, Centre for Mathematics and Computer Science, University of Utrecht, The Netherlands, 2000.
[63] Vemula, V. R., Ball, D., and Thorne, S. Towards a Spreadsheet Engineering. In Proceedings of the 2006 European Spreadsheet Risks Interest Group (2006). URL: http://www.eusprig.org/2006/vemula-towards-spreadsheet-engineering.pdf, Accessed on 14th August, 2007.
[64] Vemuri, S., Sengupta, S., and Davis, J. S. Data Dependency Diagrams for Spreadsheet Applications. In Proceedings of the 30th Annual Southeast Regional Conference (New York, NY, USA, 1992), ACM, pp. 467–470.
[65] Wang, Y., Carzaniga, A., and Wolf, A. L. Four Enhancements to Automated Distributed System Experimentation Methods. In ICSE ’08: Proceedings of the 30th International Conference on Software Engineering (New York, NY, USA, 2008), ACM, pp. 491–500.
[66] Wiggerts, T. A. Using Clustering Algorithms in Legacy Systems Remodularization. In Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE ’97) (Washington, DC, USA, 1997), IEEE Computer Society, pp. 33–43.
[67] Wilson, A., Burnett, M., Beckwith, L., Granatir, O., Casburn, L., Cook, C., Durham, M., and Rothermel, G. Harnessing Curiosity to Increase Correctness in End-User Programming. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, 2003), ACM, pp. 305–312.
Glossary

Spreadsheet-related Terminology
Spreadsheet system: application software which allows computations to be defined through cells and formulas which refer to cells. A spreadsheet system usually contains a two-dimensional grid of cells, known as a spreadsheet, as well as an accompanying programming language which allows the development of third-party applications that extend the generic functionality of the spreadsheet system. A popular spreadsheet system is Microsoft Excel, which comes with the programming language Visual Basic for Applications (VBA).
Spreadsheet: usually a two-dimensional grid of cells, made up of rows and columns, where data is entered and calculations are specified through formulas. In Microsoft Excel, a spreadsheet is called a worksheet.
Cell: the intersection of a column and a row. One can enter text, a number or a formula in a cell.
Spreadsheet program: We use spreadsheet program synonymously with spreadsheet. See spreadsheet.
Spreadsheet model: a categorization of programming in which computations are specified through cells and formulas.
Spreadsheet template: a spreadsheet containing standardized content and/or formatting that one can use as a basis for developing other spreadsheets.
Cell formula: an entry that produces a calculated result, usually based on a reference to one or more cells. The result of a formula changes if one changes the contents of a cell referenced in the formula. An example formula in Microsoft Excel would be cell A1 having the formula =B1+$C$2.
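As an illustration of the references contained in a formula, the following minimal Python sketch extracts the cell addresses from a formula string. The regular expression and the function name referenced_cells are assumptions made for this example only; the sketch ignores sheet names and would misread function names that end in digits (such as LOG10).

```python
import re

# A1-style reference with optional $ markers for absolute addressing,
# e.g. B1 or $C$2. Assumption: no sheet-qualified references.
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?([0-9]+)")


def referenced_cells(formula):
    """Return the cell addresses a formula refers to, with $ markers removed."""
    return [col + row for col, row in CELL_REF.findall(formula.upper())]


print(referenced_cells("=B1+$C$2"))  # ['B1', 'C2']
```

For the example above, cell A1 depends on cells B1 and C2, so a change to either of them changes the result shown in A1.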
Graph-related Terminology
Graph: a graph G consists of two finite sets, V and E. Each element of V is called a vertex or a node. The elements of set E are called edges, and these are unordered pairs of vertices. For example, the set V might be {1, 4, 7, 8, 9} and set E might be {{1, 4}, {4, 9}, {1, 8}, {4, 7}}. Together, V and E are a graph G.
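The example graph can be written down directly as two sets. The sketch below is only one possible representation; the helper neighbours is a hypothetical name introduced for illustration.

```python
# Vertex set V and edge set E from the example; each edge is an
# unordered pair, so it is stored as a frozenset.
V = {1, 4, 7, 8, 9}
E = {frozenset(pair) for pair in [(1, 4), (4, 9), (1, 8), (4, 7)]}


def neighbours(v, edges):
    """Return the vertices that share an edge with v."""
    return {u for edge in edges if v in edge for u in edge if u != v}


print(sorted(neighbours(4, E)))  # [1, 7, 9]
```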
Connected graph: a graph is connected if every pair of nodes can be joined by a path.
Tree: a connected graph that contains no cycles. Graph-theoretic trees resemble trees in nature in that they have no cycles, just as the branches of trees in nature do not split and rejoin.
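The two definitions above can be checked mechanically. The following sketch assumes a non-empty graph whose edges are given as a list of distinct vertex pairs, and it relies on the standard fact that a connected graph with n vertices is a tree exactly when it has n - 1 edges. The function names are chosen for this example only.

```python
def is_connected(vertices, edges):
    """Return True if every pair of vertices can be joined by a path."""
    vertices = set(vertices)
    if not vertices:
        return True
    adjacent = {v: set() for v in vertices}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    # Depth-first search from an arbitrary start vertex.
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for u in adjacent[v] - seen:
            seen.add(u)
            stack.append(u)
    return seen == vertices


def is_tree(vertices, edges):
    """A connected graph with n vertices and n - 1 edges has no cycles."""
    return is_connected(vertices, edges) and len(edges) == len(vertices) - 1


V = {1, 4, 7, 8, 9}
E = [(1, 4), (4, 9), (1, 8), (4, 7)]
print(is_connected(V, E), is_tree(V, E))  # True True
```

The example graph from the Graph entry is therefore a tree; adding the edge {7, 9} would keep it connected but introduce a cycle.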
Cluster: a grouping of nodes (vertices) in a graph according to some criterion, such as structural or geometric proximity.
Fisheye view: a technique that shows a picture or diagram as a whole while letting the viewer see selected parts in detail without losing the overall context. Fisheye views are important in graph drawing because they make it possible to display a complex graph on the limited screen area of most computers.
Treemap: a visualization technique in which hierarchical information is displayed within nested rectangles, with each level of nesting corresponding to a level of hierarchical decomposition. Cluster trees of complex graphs may also be visualized using treemaps.
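To make the idea of nested rectangles concrete, the sketch below lays out a small hierarchy with the simple slice-and-dice scheme, alternating the split direction at each level. It is an illustrative assumption, not the layout used by any particular tool; the tree format (name, weight, children) and the function names are hypothetical.

```python
def size(node):
    """Weight of a leaf, or the total weight of an inner node's children."""
    name, weight, children = node
    return sum(size(child) for child in children) if children else weight


def slice_and_dice(node, x, y, w, h, split_width=True, out=None):
    """Map every node name to a rectangle (x, y, width, height)."""
    if out is None:
        out = {}
    name, _, children = node
    out[name] = (x, y, w, h)
    total = size(node)
    offset = 0.0
    for child in children:
        share = size(child) / total if total else 0.0
        if split_width:
            # Children sit side by side, each taking its share of the width.
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:
            # Children are stacked, each taking its share of the height.
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out


tree = ("root", 0, [("a", 1, []), ("b", 0, [("b1", 1, []), ("b2", 2, [])])])
layout = slice_and_dice(tree, 0, 0, 100, 100)
print(layout["a"])  # (0.0, 0, 25.0, 100)
```

Each level of the hierarchy corresponds to one level of nesting: the rectangles of b1 and b2 lie inside the rectangle of b, which in turn lies inside the rectangle of the root.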
Appendix A

Algorithm 2 The algorithm for the spreadsheet parser module
Require: a non-empty spreadsheet
  open a GML graph definition text file
  for all used cells in spreadsheet do
    if cell has formula then
      extract cell dependency information
      write extracted cell dependencies to GML graph definition file
    end if
  end for
  close GML graph definition file
  invoke graph drawing software
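The parser algorithm can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the module described in this work: the spreadsheet is stood in for by a plain dictionary mapping cell addresses to contents, ranges are assumed to be written top-left to bottom-right, the GML text is returned as a string rather than written to a file, and the final step of invoking the graph drawing software is omitted. All function names are hypothetical.

```python
import re

# Single reference (F34, $C$2) or range (F34:F39); assumption: no sheet names.
REF = re.compile(r"\$?([A-Z]+)\$?(\d+)(?::\$?([A-Z]+)\$?(\d+))?")


def col_num(col):
    """Column letters to a 1-based number (A -> 1, Z -> 26, AA -> 27)."""
    n = 0
    for ch in col:
        n = n * 26 + ord(ch) - ord("A") + 1
    return n


def col_name(n):
    """1-based column number back to letters."""
    name = ""
    while n:
        n, rem = divmod(n - 1, 26)
        name = chr(ord("A") + rem) + name
    return name


def is_formula(content):
    return isinstance(content, str) and content.startswith("=")


def precedents(formula):
    """Cells a formula depends on, with ranges expanded cell by cell."""
    cells = []
    for c1, r1, c2, r2 in REF.findall(formula.upper()):
        if not c2:  # single reference: treat it as a one-cell range
            c2, r2 = c1, r1
        for c in range(col_num(c1), col_num(c2) + 1):
            for r in range(int(r1), int(r2) + 1):
                cells.append(col_name(c) + str(r))
    return cells


def to_gml(cells):
    """Build a GML graph definition from a dict of address -> content."""
    ids, edges = {}, []
    for addr, content in cells.items():
        if is_formula(content):
            sources = [ids.setdefault(s, len(ids) + 1) for s in precedents(content)]
            target = ids.setdefault(addr, len(ids) + 1)
            edges += [(s, target) for s in sources]
    lines = ["graph [", "  directed 0"]
    for addr, node_id in ids.items():
        content = cells.get(addr, "")
        formula = content if is_formula(content) else ""
        lines.append(f'  node [ id {node_id} label "{addr}{formula}" ]')
    lines += [f"  edge [ source {s} target {t} ]" for s, t in edges]
    lines.append("]")
    return "\n".join(lines)


sheet = {f"F{row}": row for row in range(34, 40)}
sheet["F40"] = "=SUM(F34:F39)"
print(to_gml(sheet))
```

For this input the sketch produces seven nodes and six edges, each edge running from one of the cells F34 to F39 to the formula cell F40.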

Algorithm 3 The algorithm for the spreadsheet highlighter module
Require: a non-empty cluster member text file
  open a cluster member text file
  for all cluster members in text file do
    determine cell address of cluster member
    generate random colour for the cell
    highlight cell background with the generated random colour
  end for
  close cluster member text file
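The highlighter algorithm can be sketched in the same way. The format of the cluster member text file is not specified here, so the sketch assumes one cell address per line. Painting the cell is delegated to a caller-supplied function, since the real module would set the cell background in the spreadsheet system; the names highlight_members and paint are hypothetical.

```python
import random


def highlight_members(member_lines, paint, rng=None):
    """Give each listed cell a random RGB background colour.

    member_lines: iterable of text lines, one cell address per line (assumed).
    paint: callable(address, (r, g, b)) that applies the colour.
    """
    if rng is None:
        rng = random.Random()
    colours = {}
    for line in member_lines:
        address = line.strip()
        if not address:
            continue  # skip blank lines
        colour = tuple(rng.randrange(256) for _ in range(3))
        paint(address, colour)
        colours[address] = colour
    return colours


painted = {}
result = highlight_members(["F34\n", "F35\n", "\n"], painted.__setitem__)
print(sorted(result))  # ['F34', 'F35']
```

With a real file, the lines could be supplied by opening it in a with block, which also takes care of closing the file at the end.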
Appendix B

A sample GML graph definition file:

graph [
  directed 0
  node [ id 1 label "F34 " ]
  node [ id 2 label "F35 " ]
  node [ id 3 label "F36 " ]
  node [ id 4 label "F37 " ]
  node [ id 5 label "F38 " ]
  node [ id 6 label "F39 " ]
  node [ id 7 label "F40=SUM(F34:F39)" ]
  edge [ source 1 target 7 ]
  edge [ source 2 target 7 ]
  edge [ source 3 target 7 ]
  edge [ source 4 target 7 ]
  edge [ source 5 target 7 ]
  edge [ source 6 target 7 ]
]
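A file of this shape can be read back with a few lines of Python. The sketch below is a minimal reader for the flat subset of GML used in the sample only (nodes with an id and a label, edges with a source and a target, labels in straight double quotes); it is not a general GML parser, and the function name read_simple_gml is hypothetical.

```python
import re

NODE = re.compile(r'node\s*\[\s*id\s+(\d+)\s+label\s+"([^"]*)"\s*\]')
EDGE = re.compile(r"edge\s*\[\s*source\s+(\d+)\s+target\s+(\d+)\s*\]")


def read_simple_gml(text):
    """Return ({node id: label}, [(source id, target id), ...])."""
    nodes = {int(node_id): label for node_id, label in NODE.findall(text)}
    edges = [(int(s), int(t)) for s, t in EDGE.findall(text)]
    return nodes, edges


sample = """
graph [
  directed 0
  node [ id 1 label "F34 " ]
  node [ id 2 label "F35 " ]
  node [ id 7 label "F40=SUM(F34:F39)" ]
  edge [ source 1 target 7 ]
  edge [ source 2 target 7 ]
]
"""
nodes, edges = read_simple_gml(sample)
print(nodes[7])  # F40=SUM(F34:F39)
print(edges)     # [(1, 7), (2, 7)]
```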