Proceedings of International Symposium on Future Software Technology 1998 (ISFST'98), SEA, Hangzhou, China, pp. 185-190, 1998

Supporting Incremental Learning with Active Accumulative and Adaptable Documentation

Yunwen Ye
Department of Computer Science, University of Colorado at Boulder
Campus Box 430, Boulder, CO 80309-0430, U.S.A.
Email: [email protected] Tel: +1-303-492-8136

ABSTRACT

Acquiring knowledge about library routines is necessary for a programmer to use most programming languages effectively. Support for this learning activity is rare, while the learning effort required is huge. This paper proposes a conceptual framework for a new documentation system that helps programmers incrementally learn the use of library routines. The system draws on active information delivery, where information relevant to programmers’ work is volunteered by the system; an accumulative information space, where the system mediates peer-to-peer learning; and an adaptable information space, where information is displayed according to the increasing skill level of programmers.

KEYWORDS

Documentation system, software reuse library, learning on demand

1. INTRODUCTION

As the Bell Labs proverb “library design is language design” [1] indicates, learning how to use library routines is a necessity for programmers who want to master a programming language fully, in the sense of using it effectively and efficiently. While a programming language mostly focuses on providing a set of concepts with which the programmer specifies actions to be executed by the computer, library routines provide a vocabulary that is at a higher level of abstraction and closer to human beings and problem domains, for the programmer to use when thinking about what can be done. Beginning programmers know very little about library components; hence, compared with more experienced programmers, their power to express themselves in the programming language is diminished. Unlike the syntax of a programming language, which is usually learned through schooling or tutoring, the mastery of library routines is usually left to programmers themselves. The sheer volume of information about library routines poses a great challenge to programmers’ learning capability.

This paper is concerned with the problem of how to help programmers acquire knowledge about library routines incrementally while they are working with the programming language. After analyzing why current online documentation systems fall short of supporting programmers’ mastery of library routines, it proposes a conceptual framework for a new documentation system that facilitates incremental learning by making the documentation system active, accumulative and adaptable.

2. PROBLEMS OF CURRENT DOCUMENTATION SYSTEM FOR LIBRARY

Documents that accompany library routines are meant to help programmers acquire knowledge about those routines. However, from the perspective of supporting programmers’ work and learning, current online documentation systems for library routines suffer from the following problems.

Documentation systems are passive. Information contained in current documentation systems passively waits for programmers to discover it. Such systems are built on the assumption that programmers already know what kinds of components exist in the library. Empirical studies have shown that people do not ask for help when they do not know that suitable components exist [2, 3]. Detailed studies of users of systems such as UNIX, Emacs and Lisp have found that, on average, only a small fraction of the functionality of complex systems is used, and that there are different levels of users’ knowledge about a system’s information space (Fig. 1). In Figure 1, D1 is the subset of concepts that users know and use frequently with confidence. D2 is the subset of concepts that they use only occasionally and whose details they do not fully know; documents are important for users to gradually master the concepts in this part. D3 is the user’s mental model of the system, i.e. the set of concepts that s/he thinks exist in the system. D4 represents the actual system [4]. For components that fall in the domain (D4 - D3), current online documentation offers little help.

Figure 1. Levels of users' knowledge about a system (regions D1, D2, D3 and D4)

Documentation systems are unresponsive to programmers’ various needs. Programmers consult documents for library routines for different purposes. Some search for a library routine with a specific requirement in mind; some come only for the exact spelling of a name; and some want to check whether they remember the order and types of the parameters correctly. Even the same programmer may look for different information about a library routine for different purposes at different times. To all of them, current documentation systems present the same information again and again.

Documentation systems are designer-oriented. Documents for library routines are designed and written by the designers of the library routines, who usually have little idea of the needs of application programmers. Like many other document writers, they often “concentrate more on the intricacies of the product (library routines) than on answering users’ (programmers) questions [5].”

Documentation systems are static. Information contained in the documentation system is fixed at design time. In real use, because of the wide variety of application situations and programmer backgrounds, it cannot address all the concerns that programmers have. There is no way for programmers to change it during use to make it more suitable for their particular situations.

Documentation systems do not facilitate knowledge sharing among peer programmers. They disseminate information vertically, forcing programmers into the role of passive learners only; they do not mediate horizontal learning, in which programmers share their acquired experience and knowledge about library routines through the documentation system.

The computer, as a new medium, differs from other media in its computing power. The information structure of online documentation should be enriched by taking advantage of this computing power, and should go beyond the giftwrapping [6] techniques of the current online documentation style, which merely retrofits paper documents and degenerates into yet another book on screen [5].

3. REQUIREMENTS FOR A NEW DOCUMENTATION SYSTEM

Because the programmer works with a computer, the computer has the opportunity to capture the programmer’s work and intention. An ideal computer-based online documentation system should therefore be able to start the information-hunting process autonomously, based on its observation of the programmer’s work, and to provide information pertinent to the programmer’s current task and to the programmer himself or herself. In particular, a documentation system for library routines must focus on two aspects: helping programmers perform their task (programming) with appropriate information actively delivered in an appropriate context; and helping programmers mature into expert programmers by enlarging their D1 toward D4 so that they take full advantage of library routines. To achieve this goal, the documentation system must satisfy the following requirements.

• It should be able to actively deliver relevant information to the programmer when the system recognizes what s/he is doing or wants to do.
• The information provided should be guided by the principle of helping programmers perform their tasks. Information access must be seamlessly integrated into the working environment to avoid unnecessary distractions.
• Programmers are not only consumers of the documentation system. The system must provide means for programmers to take part in producing and shaping it. It should enable programmers to enrich the system’s information space by extending, updating and restructuring it, so that the system remains sustainable.
• From the perspective of a software development organization, the knowledge held by individual programmers is a valuable asset. The organization benefits more if this knowledge is shared organization-wide. The documentation system should be able to behave like an organizational memory system [7] that captures and disseminates the knowledge about library routines created by individual programmers in the organization.
• The documentation system should serve as a mediator that helps programmers form their own “expertise network” [8], which facilitates the transfer of knowledge from expert programmers to less advanced programmers and helps the latter mature.

The following sections present a conceptual framework for a new documentation system that is active, accumulative and adaptable (referred to as A3D hereinafter) and fulfills the requirements above. Section 4 explains how the system can actively deliver information to programmers to raise their awareness of the reusable components described in the documentation; Section 5 explains in more detail why a documentation system should act as an organizational memory and how A3D supports this; Section 6 focuses on the personalization of the information space in the A3D documentation system.

4. ACTIVE DELIVERY OF INFORMATION

A documentation system for library routines is an information system that exists to serve programmers’ needs: working and learning. For programmers to take advantage of the information in the documentation system, a communication channel must be set up between the two parties. Current online documentation systems only allow the communication channel to be set up explicitly, i.e. the information becomes accessible only after a programmer issues a search. If the programmer does not know that the documentation system holds useful information for him/her, the communication breaks down. As a ramification of this breakdown, the programmer writes code that could have been replaced with a simple call. The programmer’s overall productivity thus drops, and the quality of the program may also drop, because library routines are usually better than the code most programmers write.

The communication channel can also be set up implicitly if the information provider (the documentation system) and the information receiver (the programmer) have a shared workspace. Through sharing the workspace, the documentation system is able to recognize the programmer’s information needs and then actively deliver the relevant information, which it finds through a search that the system initiates automatically.

The A3D system supports active information delivery by employing techniques from information retrieval [9] and signature matching. The system is built on a free-text indexing mechanism, which is briefly explained in the next section, and uses programmers’ partial work as a query to initiate the search for relevant information.

4.1 Overview of Information Retrieval Approach

An information retrieval system helps users find relevant information in a collection of documents. The key issues for information retrieval systems are how to represent documents (and queries) in a form suitable for computer processing, and how to define the relevance between a query and a document.

The document indexing process creates document representatives that reflect the documents and on which computers can operate. An index language, whose elements are called index terms, is used to describe the document representatives. A3D uses the vector-space model as its document representative. In the vector-space model, a document is represented by a t-dimensional vector, where t is the number of index terms and each component of the vector is the frequency of occurrence of the corresponding term in the document. The terms used in a term vector can be single words or phrases extracted directly from documents, or controlled terms that are constructed independently beforehand (either automatically or manually). Deriving a term vector from a document involves two basic tasks: removal of high-frequency words and suffix stemming. High-frequency words are grammatical function words, or words that appear in almost every document; they contribute little to characterizing the content of documents. Suffix stemming rests on the belief that reducing the variation in the morphological forms of words increases recall, because words that originate from the same root often have the same meaning.

Traditionally, the effectiveness of information retrieval is measured by two parameters: recall and precision. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the whole collection. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Ideally, everything relevant is retrieved while everything non-relevant is rejected, so that both recall and precision equal 1. In reality, a compromise has to be made because these two parameters conflict with each other.

A retrieval query is also treated as a document, and its vector is formed in the same way as for documents. This vector is then compared against the indexed documents, and a correlation coefficient, which indicates the similarity between query and document, is computed for each query-document pair based on their corresponding vectors. A threshold value, usually decided empirically, is needed to cut off non-relevant documents.

4.2 Active Information Delivery Based on Programmer’s Intention

The editor that programmers use serves as the shared workspace that makes it possible to set up the implicit communication channel between programmers and A3D. A programmer’s input to the editor is fed into the A3D system in real time. The A3D system incrementally parses the input character stream on the fly and extracts comments and procedure names to use as queries. A3D assumes that programmers write a comment stating the purpose of a procedure before they define it. As soon as the comment is captured, the A3D system starts the process of active delivery, using the comment as the query. If the programmer does not write a comment, the procedure name chosen by the programmer is considered meaningful enough to reveal the intended purpose of the procedure, and is used as the query that triggers the automatic search. The underlying philosophy is this: if the descriptions of some library routines in the documentation are highly similar to the programmer’s intention, as revealed through a comment or a name, there is a high possibility that the programmer is about to reinvent the wheel, unaware of reusable components that serve his purpose. By actively bringing relevant components to the programmer’s attention, unnecessary rework may be avoided, and the programmer is given an opportunity to learn more about the library that is closely integrated with and tuned to the working situation.

A short scenario illustrates how A3D works. Joe is a programmer who is new to the C programming language and the UNIX environment. He does not know much about the reusable library routines provided by his working environment. He is developing a file backup system and needs a procedure that determines whether two files are the same. He has a good programming style and starts with a comment, as follows.

    /* Are they same files? */
    int sameFiles(char* fileName1, char* fileName2)
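
The query-extraction step in this scenario can be sketched as follows. This is a minimal illustration in Python, not A3D's implementation: the function name extract_query and the regular expressions are assumptions made for the example, and the sketch rescans the text typed so far instead of parsing the character stream incrementally.

```python
import re

# A C block comment, and the identifier that precedes an opening parenthesis.
COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)
PROCEDURE_NAME = re.compile(r"([A-Za-z_]\w*)\s*\(")
# Pieces of an identifier: "sameFiles" -> same, Files; "same_files" -> same, files.
WORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def extract_query(typed_so_far):
    """Return the text to use as a query, or None if nothing usable was typed."""
    comments = COMMENT.findall(typed_so_far)
    if comments:
        # Prefer the most recent completed comment.
        return comments[-1].strip()
    header = PROCEDURE_NAME.search(typed_so_far)
    if header:
        # Fall back to the procedure name, split into lower-case words.
        return " ".join(w.lower() for w in WORD.findall(header.group(1)))
    return None


print(extract_query("/* Are they same files? */\nint sameFiles("))
# Are they same files?
print(extract_query("int sameFiles(char* fileName1, char* fileName2)"))
# same files
```

A completed comment takes priority; the procedure name, split into words, is used only when no comment is present.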

A3D extracts the comment “Are they same files?” after Joe types it in, and interprets it as Joe’s informal conceptual intention [10] for the immediate programming task. Using this comment as a query, the A3D system searches through its repository of library routines. The routine cmp, which originates from the UNIX command of the same name and whose textual description states “The cmp command compares two files”, is picked out and shown to Joe because its similarity to the stated comment is higher than 0.50 (on a scale where 1.0 is the closest possible match). Other routines such as diff are also presented to Joe, arranged in decreasing order of similarity. Joe reads through the descriptions and finds that cmp does exactly what he wants. He decides to use cmp directly instead of writing a routine himself.

Depending on the user’s preference, A3D starts to volunteer information at one of two times. The first is when the programmer has finished typing the comment but has not started to define the procedure (the end of the first line in the example above), as in the scenario just described. The second is just after the programmer has finished the prototype of the procedure (the end of the second line). In this case, the signature of the procedure (char* x char* -> int in the example) is also used as a search criterion and is matched against the signatures of library routines using the signature matching advocated in [11]. Signature matching works in two ways: if the result returned by comment matching contains too many candidates, relatively strict signature matching (exact matching) is adopted to filter out less related items; if it contains too few, relatively loose signature matching is adopted to add more likely candidates.

In addition to supporting programmers who do not know that a reusable routine exists, the A3D system also works as a prompter for programmers who know that the library routine they want to use exists but may have forgotten some aspects of it.
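
The retrieval and prompting steps described so far can be sketched together. The following Python sketch is written under stated assumptions and is not A3D's implementation: cosine similarity over raw term counts stands in for the correlation coefficient of Section 4.1, a comparison that ignores parameter order stands in for the relaxed signature matches of [11], and the names Routine, retrieve and prompt, the diff and strcpy descriptions, and the cmp and diff signatures and prototypes are invented for illustration. Only the cmp description and the strcpy prototype come from the text.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

# Tiny illustrative stop list; a real index would use a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "they", "to"}


def term_vector(text):
    """Term frequencies after stop-word removal and a crude plural-stripping stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w[:-1] if w.endswith("s") and len(w) > 3 else w
                   for w in words if w not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values()) * sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


@dataclass(frozen=True)
class Routine:
    name: str
    description: str
    signature: tuple  # parameter types followed by the return type
    prototype: str


def loosely_matches(a, b):
    """Relaxed match: same return type, same parameter types in any order."""
    return a[-1] == b[-1] and sorted(a[:-1]) == sorted(b[:-1])


def retrieve(query, signature, library, threshold, min_hits=1, max_hits=3):
    """Rank routines by similarity to the query, then adjust with signatures."""
    q = term_vector(query)
    scored = sorted(((cosine(q, term_vector(r.description)), r) for r in library),
                    key=lambda pair: pair[0], reverse=True)
    hits = [r for score, r in scored if score >= threshold]
    if len(hits) > max_hits:      # too many: keep exact signature matches only
        hits = [r for r in hits if r.signature == signature]
    elif len(hits) < min_hits:    # too few: add loosely matching signatures
        hits += [r for _, r in scored
                 if r not in hits and loosely_matches(r.signature, signature)]
    return hits


def prompt(name, library):
    """Prototype to display when `name` is typed in a call context, else None."""
    return next((r.prototype for r in library if r.name == name), None)


# Hypothetical repository; signatures and prototypes of cmp and diff are invented.
LIBRARY = [
    Routine("cmp", "The cmp command compares two files",
            ("char*", "char*", "int"), "int cmp(char *file1, char *file2)"),
    Routine("diff", "The diff command reports differences between two files",
            ("char*", "char*", "int"), "int diff(char *file1, char *file2)"),
    Routine("strcpy", "The strcpy function copies one string to another",
            ("char*", "const char*", "char*"), "char *strcpy(char *s1, const char *s2)"),
]

joe = retrieve("Are they same files?", ("char*", "char*", "int"), LIBRARY, threshold=0.25)
print([r.name for r in joe])      # ['cmp', 'diff']
print(prompt("strcpy", LIBRARY))  # char *strcpy(char *s1, const char *s2)
```

With raw term counts over one-line descriptions, the cosine values are small (about 0.32 for cmp and 0.27 for diff here), so the demo uses a threshold of 0.25 rather than the 0.50 of the scenario; a production index would likely weight terms and choose its threshold empirically.
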
Empirical studies show that even programmers who can name a reusable component a priori may still make mistakes in recalling the exact spelling, the order of the parameters and so on [12, 13]. They then need to look up the documents, which is tedious and distracts them from their real work. The A3D system benefits these more advanced programmers by actively showing them prompts, so that they do not need to interrupt their current work to look up the documents. In this case, the procedure name is used as the query. For example, Joe has already used strcpy(s1, s2) several times, but he cannot remember whether s1 is copied to s2 or the reverse. When A3D recognizes that Joe has just typed strcpy in a function call context, it can help him by displaying just “char *strcpy(char *s1, const char *s2)”, which shows that s1 is the destination.

4.3 Adaptive Indexing

Free-text indexing has to struggle with the problem of recall. The disparity between the vocabulary used by users (programmers in this case) and that used by the system (documents) contributes to a low recall rate. The ideal for a documentation system that makes information retrieval easy and effective is an “indexing scheme similar to the knowledge structures possessed by most programmers [2]”. To achieve this ideal, many vocabulary mappings are needed to bridge the gap between the situation model (the way that programmers perceive and describe their problems and solutions) and the system model (the language that documents use) [14]. However, as empirical studies have suggested, the variety of programmers’ backgrounds and implementation domains makes it very difficult to compile a complete list of mappings by anticipating all the words that programmers might use. Experiments show that the probability of two people choosing the same term to describe common objects, cooking recipes, editor commands and other items is only between 10 and 20% [15].

Adaptive indexing [16] is used in A3D to improve indexing effectiveness incrementally while programmers use the documentation system. Suppose a programmer describes his/her intention in his/her own words, for example “draw a CD-ROM”, and the repository system fails to provide relevant information. By other means, such as asking other programmers or searching with other names, s/he may finally find the function XDrawArc (a library routine in X11) that helps him/her draw it. The A3D system will provide a mechanism that allows the programmer to add an alias from “draw a CD-ROM” to “XDrawArc” to the system’s index. This alias benefits other programmers, especially those working on the same or similar projects, because they are likely to use the same description. In this example, two learning activities take place, one in each direction: the programmer learns the word used by the document, and the system learns a new way of expressing the same idea.

5. ENRICHING THE REPOSITORY BY CONTINUAL ACCUMULATION

As knowledge-intensive organizations, software development organizations rely heavily on the experience and knowledge possessed by each member. Capturing, storing and disseminating that experience and knowledge within the organization is critical if the organization is to remain competitive. A library of reusable components is an asset that originally results from a community of practice, whether it is produced within an organization (an in-house library) or in society at large (a standard or commercially available library). When it is introduced into a software development organization, it must take on the role of organizational memory so that the knowledge it contains can be shared among programmers.

A static organizational memory is not sustainable, because it only maintains the status quo instead of supporting a constant flux of new knowledge acquired by members of the organization. When a documentation system is viewed as organizational memory, its failure to support its own growth becomes immediately apparent. A documentation system must grow because the information contained in each document is incomplete, and this incompleteness can and should be compensated for by programmers’ contributions.

The incompleteness of documents stems from the fact that the information contained in the documents for reusable library routines is determined by library designers prior to real use. The content of documents is fixed at design time. While the description written by designers may be perfect from their own point of view, it is not complete for programmers who face various kinds of problem domains and have different backgrounds. It is difficult for document designers, at library design time, to anticipate all the information needs of programmers who use the library in different situations. Therefore, the documents often fall short of providing the information needed in many situations.

Another factor that contributes to the incompleteness of documents for library routines is that, in designing a reusable component, library designers make many tradeoffs to balance the conflicting requirements of correctness, flexibility, efficiency, extensibility, portability, consistency, simplicity, completeness, ease of use, generality and so on, in order to make the component generally reusable. It is not reasonable to assume that designers can explicitly incorporate into the documents the rationale for all these design decisions, whether made consciously or unconsciously.

Many users of the library documentation system are very skilled programmers. When they use library routines in real situations, they may find pitfalls that are not described in the documents. They may also come up with tips for using the library routines in certain situations, concerning, for example, performance constraints or portability issues. Currently, some organizations mainly use email, newsgroups or the WWW to enhance the exchange of expertise among programmers.
However, knowledge disseminated in this way is decontextualized, and is difficult for programmers to access within their working environment when they really need it.

Making the documentation system serve as organizational memory can reduce the decontextualization problem in the exchange of expertise about library routines. When a programmer comes up with new knowledge about a library routine and wants to share it with other programmers, he adds it to the document associated with that library routine in the documentation system. When other programmers visit the document for the same library routine, the knowledge added by the knowledge producer becomes available to them as knowledge consumers. Thus, the knowledge consumer and producer share at least a part of the context of the knowledge, which is mediated through the library routine accessed by both.

Making the component repository system an organizational memory system also enhances peer-to-peer learning by giving programmers the opportunity to build up their own expertise network about library routines. Studies in some organizations have shown that one major reason that good programmers are more productive is that they know whom to ask about particular problems and are able to quickly refer to peer colleagues [8, 17]. In A3D, programmers who add their knowledge to the repository leave their names and dates in the repository. Later, when other programmers visit the documentation system and have difficulty understanding and using a library routine, they can trace these footprints to find the experts to whom they should turn for help. Thus, the opportunities for programmers to interact with and stimulate each other increase. As Brooks argues in [18], these opportunities are very important for growing programmers.

6. PERSONALIZING THE REPOSITORY

Although all programmers are created equal, they are not equal in their needs for information. Programmers vary in their past experience and in their levels of skill. When they consult the same document, a paragraph that precisely addresses one programmer’s need may be a distracting string of words to another programmer who already knows its content. Unlike a textbook, which people usually study systematically and seldom re-read, documents for library routines fall into the category of reference resources, such as dictionaries, which people look up whenever the need arises. Programmers usually come to documents for library routines for a particular task on which they are working. Once their needs are met, they usually do not bother to master everything written there, because getting the “real” work done has higher priority. It is not uncommon for programmers to consult the same document again and again. This is particularly true for components that fall in D2 and D3 in Figure 1. Programmers who access a document for a library routine of which they already know some aspects come for new information that has escaped their notice before.

To make documents less distracting and keep them focused on programmers’ needs, the content displayed should match the user’s increasing knowledge, by allowing programmers to customize it. By customizing the contents of documents, programmers can evolve the documents gradually along with their own changing skills, and obtain a personalized version of the documentation system.

To support customization, documents have to be decomposed to a finer granularity. A document for a routine in the A3D system is divided into fields, which include name and quick description, syntactic information, detailed description, examples, member contributions and so on. Fields that have more than one paragraph, such as the detailed description, are further divided into paragraphs. Programmers can deselect a field or a paragraph while they are reading it. The system remembers each programmer’s preferences for each document. When the same programmer accesses the same document later, the deselected parts are no longer shown; each is replaced with a small icon that can be clicked to expand it into the full content if the programmer wants to read it again. Programmers can also restructure a document by moving the fields that they consider most important to the front.
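
One way to model this decomposition and the per-programmer preferences is sketched below in Python. The class View, the function render and the field texts are assumptions for illustration; the field names follow the list given above.

```python
from dataclasses import dataclass, field

ICON = "[+]"  # stands in for the small clickable icon


@dataclass
class View:
    """One programmer's preferences for one document."""
    order: list                               # field names in display order
    hidden: set = field(default_factory=set)  # field names or (field, paragraph index)

    def deselect(self, unit):
        self.hidden.add(unit)

    def reselect(self, unit):
        self.hidden.discard(unit)

    def move_to_front(self, field_name):
        self.order.remove(field_name)
        self.order.insert(0, field_name)


def render(document, view):
    """Return display lines, showing an icon in place of each deselected part."""
    lines = []
    for name in view.order:
        if name in view.hidden:
            lines.append(f"{name}: {ICON}")
            continue
        for index, paragraph in enumerate(document[name]):
            shown = ICON if (name, index) in view.hidden else paragraph
            lines.append(f"{name}: {shown}")
    return lines


# Illustrative document: field name -> list of paragraphs.
STRCPY_DOC = {
    "name and quick description": ["strcpy: copy a string"],
    "syntactic information": ["char *strcpy(char *s1, const char *s2)"],
    "detailed description": ["Copies the string s2 into s1.",
                             "The two strings should not overlap."],
    "examples": ["strcpy(buffer, name);"],
}

view = View(order=list(STRCPY_DOC))
view.deselect(("detailed description", 1))  # hide one paragraph
view.move_to_front("examples")              # most important field first
for line in render(STRCPY_DOC, view):
    print(line)
```

Keeping the preferences in a separate View object leaves the document itself untouched, so reselecting a part restores it exactly.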

Because the repository system is used by a group of programmers, one programmer’s customization should not interfere with another’s. A3D therefore separates the structure of documents from their contents. Two databases are set up in A3D: one keeps the content, and the other keeps the customized structure for each document and each programmer. When displaying a document D for a programmer P, the system first looks up the structure specialized for D and P in the structure database, and then fills it with content from the content database. For example, assume that a document D1 has contents C1, C2 and so on, and that its default structure consists of slots S1, S2 and so on, with S1 filled with C1, S2 filled with C2, and so on. This default structure applies to all first-time visitors. Programmer P1 looks at D1 and realizes that he no longer needs C2, and that leaving it there distracts his attention from the information he really needs, so he clicks off C2. The system then creates a structure for P1 and D1 in the structure database, in which C2 is marked as deselected. The next time P1 accesses D1, the system shows it to him with C2 replaced by an icon.
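
The separation of structure from content can be sketched with two dictionaries standing in for the two databases. This is a minimal Python illustration, assuming three content items for D1 and invented texts; the function names are hypothetical and not part of A3D.

```python
ICON = "[icon]"  # placeholder for the clickable icon

# Content database: document id -> content id -> text (texts are invented).
content_db = {"D1": {"C1": "quick description", "C2": "syntax", "C3": "details"}}
# Default structure: the slots S1, S2, ... filled with C1, C2, ... in order.
default_structure = {"D1": ["C1", "C2", "C3"]}
# Structure database: (document id, programmer id) -> list of (content id, shown).
structure_db = {}


def structure_for(doc, programmer):
    """Personal structure if one exists, otherwise the default for first-time visitors."""
    personal = structure_db.get((doc, programmer))
    if personal is not None:
        return personal
    return [(content_id, True) for content_id in default_structure[doc]]


def click_off(doc, programmer, content_id):
    """Record that this programmer no longer wants to see one content item."""
    structure_db[(doc, programmer)] = [
        (cid, shown and cid != content_id)
        for cid, shown in structure_for(doc, programmer)
    ]


def display(doc, programmer):
    """Fill the programmer's structure with content, or an icon where deselected."""
    return [content_db[doc][cid] if shown else ICON
            for cid, shown in structure_for(doc, programmer)]


click_off("D1", "P1", "C2")
print(display("D1", "P1"))  # ['quick description', '[icon]', 'details']
print(display("D1", "P2"))  # ['quick description', 'syntax', 'details']
```

Because a personal structure is created only when a programmer first customizes a document, P2 still sees the default view after P1 clicks off C2.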

7. SUMMARY

This paper described a conceptual framework for a new documentation system that helps programmers incrementally learn how to use library routines effectively. Although the discussion focused only on documentation systems for library routines, the problem and solution delineated here also hold for reusable software component repositories, of which documentation systems for library routines are a special case. Because library routines are designed for black-box reuse and the routines reused in a program are integrated automatically by linkers, the manuals of library routines by themselves constitute a reusable software component repository. As further development, the framework presented in this paper is being extended to support reuse repositories in general.

ACKNOWLEDGMENTS

The author thanks Gerhard Fischer, Brent Reeves and Jonathan Ostwald, as well as other members of L3D, the Center for LifeLong Learning and Design at the University of Colorado at Boulder, for their insights and comments.

REFERENCES

1. Stroustrup, B., The C++ Programming Language. 2nd ed. 1995: Addison-Wesley Publishing Company.
2. Curtis, B., H. Krasner, and N. Iscoe, A Field Study of the Software Design Process for Large Systems. CACM, 1988. 31(11): p. 1268-1287.
3. Reeves, B.N., Locating the Right Object in a Large Hardware Store -- An Empirical Study of Cooperative Problem Solving among Humans, Technical Report CU-CS-523-91, Department of Computer Science, University of Colorado, 1991, Boulder, CO.
4. Fischer, G., A.C. Lemke, and H. Nieper-Lemke, Enhancing Incremental Learning Processes with Knowledge-Based Systems (Final Project Report), Technical Report CU-CS-392-88, Department of Computer Science, University of Colorado, 1988, Boulder, CO.
5. Hackos, J.T. Online Documentation: The Next Generation. in ACM SIGDOC'97. 1997. Snowbird, Utah.
6. Fischer, G., et al., Making Argumentation Serve Design, in Design Rationale: Concepts, Techniques, and Use, T. Moran and J. Carroll, Editors. 1996, Lawrence Erlbaum and Associates: Mahwah, NJ. p. 267-293.
7. Ackerman, M.S. and T.W. Malone. Answer Garden: A Tool for Growing Organizational Memory. in Proceedings of the ACM Conference on Office Information Systems. 1990. Cambridge MA.
8. Terveen, L.G., P.G. Selfridge, and M.D. Long, Living Design Memory: Framework, Implementation, Lessons Learned. Human-Computer Interaction, 1995. 10(1): p. 1-37.
9. Salton, G. and M.J. McGill, Introduction to Modern Information Retrieval. 1983: McGraw-Hill.
10. Biggerstaff, T.J., B.G. Mitbander, and D.E. Webster, Program Understanding and the Concept Assignment Problem. CACM, 1994. 37(5): p. 72-83.
11. Zaremski, A.M. and J.M. Wing, Signature Matching: A Tool for Using Software Libraries. ACM Transactions on Software Engineering and Methodology, 1996. 4(2): p. 146-170.
12. Lange, B.M. and T.G. Moher. Some Strategies of Reuse in An Object-oriented Programming Environment. in Human Factors in Computing Systems 1989. 1989. Austin, Texas: ACM Press.
13. Houde, S. and R. Sellman. In Search of Design Principles for Programming Environments. in CHI'94. 1994. Boston, Massachusetts: ACM Press.
14. Fischer, G., S. Henninger, and D. Redmiles. Cognitive Tools for Locating and Comprehending Software Objects for Reuse. in Proceedings of the 13th ICSE. 1991. Austin, Texas.
15. Furnas, G.W., et al., The Vocabulary Problem in Human-System Communication, in CACM. 1987. p. 964-971.
16. Furnas, G.W. Experience With an Adaptive Indexing Scheme. in Human Factors in Computing Systems 1985. 1985. New York.
17. Berlin, L.M. Beyond Program Understanding: A Look at Programming Expertise in Industry. in Empirical Studies of Programmers: Fifth Workshop. 1993. Palo Alto, CA: Ablex Publishing Corporation.
18. Brooks, F.P., No Silver Bullet: Essence and Accidents of Software Engineering. IEEE Computer, 1987. 20(4): p. 10-19.