Using Data Mining and Recommender Systems to Scale up the Requirements Process

Jane Cleland-Huang
Center for Systems and Requirements Engineering, DePaul University, 243 S. Wabash, Chicago, IL 60604, (+1) 312-362-8863
[email protected]

Bamshad Mobasher
Center for Web Intelligence, DePaul University, 243 S. Wabash, Chicago, IL 60604, (+1) 312-362-5174
[email protected]
ABSTRACT
Ultra-Large-Scale (ULS) software system projects are anticipated to be highly complex and to involve thousands, or even hundreds of thousands, of stakeholders. Unfortunately, numerous accounts of recent failures and challenges in industrial and governmental projects have demonstrated that current requirements elicitation and prioritization practices do not scale adequately to address the needs of large projects. This position paper addresses this problem by proposing an open, inclusive, and robust elicitation and prioritization process that uses data-mining and recommender technologies to facilitate the active involvement of many thousands of stakeholders. We believe that the approach described in this paper is a fundamental building block towards addressing the higher-level requirements problems facing ULS systems.
Categories and Subject Descriptors
D.2.1 [Software Engineering]: Requirements/Specifications - Elicitation methods; H.2.8 [Information Systems]: Database Management - Database Applications - Data Mining; H.3.3 [Information Storage and Retrieval]: Information Filtering

General Terms
Algorithms, Management, Human Factors.

Keywords
Ultra-Large-Scale Systems, requirements elicitation, requirements prioritization, recommender systems, data mining, clustering.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ULSSIS’08, May 10-11, 2008, Leipzig, Germany. Copyright 2008 ACM 1-58113-000-0/00/0004…$5.00.

1. INTRODUCTION
The anticipated scale and complexity of ultra-large-scale (ULS) software systems challenges almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, specifying, and managing requirements. Unfortunately, examples abound of large-scale projects in which the requirements processes have not adequately supported the inclusion of even a few thousand stakeholders. This suggests that current practices may not scale well to support the ULS systems of the future [16], which are expected to include thousands, or even hundreds of thousands, of stakeholders. As an example, consider the well-documented $170 million FBI Virtual Case File project, which involved almost six months of requirements gathering in the form of Joint Application Design (JAD) sessions and ultimately produced a bloated 800-page requirements specification [8]. This project was ultimately written off as a total failure, attributed at least partially to problems in eliciting, managing, and prioritizing requirements. Similarly, NASA routinely manages projects with many thousands of stakeholders. In a deliberate exaggeration, which is nonetheless very revealing, NASA engineer Kristin Farry, PhD, reported that the paperwork produced whilst eliciting Space Station requirements was so extensive that it could almost have been used to build a stairway all the way into space and eliminate the need for a rocket launch [10]!

In this position paper we propose the use of data-mining techniques to analyze and process massive amounts of unstructured or semi-structured data, and the use of recommender systems to facilitate and semi-automatically manage broad stakeholder participation in the elicitation and prioritization process. Our approach is designed to scale up the requirements elicitation and prioritization process in order to support ULS systems and subsystems. These systems will be characterized not only by the complexity of the problems they are designed to solve, but also by the fact that requirements may often not be fully knowable upfront, and are expected to emerge over time as stakeholders use and interact with the systems [16].
In the face of unstable requirements, it is likely that systems will need to be built and delivered incrementally, in a way that returns early and optimal value to the stakeholders. The requirements process therefore not only needs to scale up to include thousands, or even hundreds of thousands, of stakeholders, but also needs to be conducted in carefully sized increments in order to respond more nimbly to changing needs.

Current requirements methods were never designed to meet such challenges. More traditional waterfall approaches, or even iterative ones such as the Unified Process, assume that requirements are knowable upfront and can therefore be elicited, analyzed, specified, and baselined during early phases of the software development lifecycle. In contrast, agile approaches are designed to embrace change by relying upon practices such as short iterations, minimal upfront design, and face-to-face communication between developers and onsite customers, which clarifies stakeholders’ needs immediately prior to development. However, agile processes are best suited for small-scale projects, and attempts to scale them up usually result in re-adoption of more traditional practices. ULS systems expose the limitations of all of these approaches.

Figure 1. An overview of the proposed requirements framework

Most current techniques also involve identifying a carefully selected subgroup of ‘CRACK’ stakeholders, considered to be Collaborative, Representative, Authorized, Committed, and Knowledgeable. However, in very large projects, where knowledge may be distributed across thousands of stakeholders with unique perspectives and needs, reliance on such a limited and select group is unlikely to be feasible or effective. Scalable requirements processes of the future therefore need to be fundamentally more open and inclusive, and to incorporate a much broader spectrum of stakeholders. This change represents a significant paradigm shift from current practice and will require high levels of automated support.

Open and inclusive processes will enable requirements elicitation to be scaled up to include large numbers of stakeholders. They will facilitate the capture of a more complete set of requirements, and enable exploration of options in greater depth and consideration of more perspectives. Stakeholder buy-in to the project is likely to increase, and tradeoffs and conflicts will emerge earlier in the software development lifecycle. Unfortunately, the potential problems are also significant. They include information overload, redundant requirements, conflicting opinions, unmanageable discussions, and deliberate attempts by stakeholders to control discussion threads and manipulate prioritization schemes to promote their own win conditions.
Inclusive elicitation processes have been modeled in a rudimentary way in open source projects, where feature requests and discussions are hosted in collaborative forums. However, open source projects rarely go through an intense elicitation phase, such as would be expected at the onset of a major change or modification to subsystems in a ULS system. Furthermore, our analysis of open source discussion threads has suggested that discussions are often incomplete and ad hoc, and that uninformed users frequently introduce ‘new’ requests that have already been discussed and addressed.
2. DYNAMIC THEME DETECTION
Our solution involves the use of data mining techniques and recommender systems to gather statements of needs, wants, and desires from stakeholders, to automatically generate highly specialized topical forums, and to dynamically assign stakeholders to appropriate forums where they can work collaboratively to transform statements of need into sets of articulated and prioritized requirements. Specifically, we use unsupervised clustering techniques to identify emerging themes [5,6,7] from the stakeholders’ statements of needs, and then employ content-based and collaborative recommender systems [18,20] to place stakeholders into appropriate forums.

We have compared the effectiveness of several different methods for clustering stakeholders’ needs. Algorithms considered include bisecting, agglomerative hierarchical, K-means [5,6], self-organizing maps, Latent Dirichlet Allocation (LDA), and probabilistic latent semantic analysis (PLSA) [7]. We have found that although many of the generated clusters were reasonably cohesive, a significant number of them contained requirements that were only loosely related. This problem creates a serious challenge, as human users will quickly lose trust in a system that creates haphazard discussion forums. Furthermore, all of these techniques failed to identify cross-cutting concerns, which often represent globally significant requirements that must be discovered early in the software development process.

Our goal is to consistently generate clusters that human users perceive to have a clear and distinct theme winding through each and every requirement. To accomplish this goal we are investigating a wide variety of techniques and combinations of techniques. For example, we have recently investigated the use of a probabilistic latent semantic analysis (PLSA) method to identify themes and classify stakeholders’ needs. Early results have suggested that this approach improves the cluster quality and also identifies cross-cutting themes.
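As an illustration of the theme-detection step, the following is a minimal sketch of clustering stakeholder need statements with TF-IDF weighting and a spherical K-means variant. It is not the implementation evaluated above: the stop list, the farthest-first seeding, the function names, and the sample need statements are all assumptions chosen to keep the example small and deterministic.

```python
import math
from collections import Counter

# Tiny illustrative stop list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "after", "and", "by", "for", "in", "must", "of", "the", "to", "with"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def normalize(vec):
    """Scale a sparse term -> weight mapping to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector per need statement."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = {t: c * (math.log(len(docs) / doc_freq[t]) + 1.0) for t, c in counts.items()}
        vectors.append(normalize(vec))
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cluster_needs(docs, k, max_iter=20):
    """Group need statements into k themes; returns one cluster label per statement."""
    vectors = tfidf_vectors(docs)
    # Deterministic farthest-first seeding: start with the first statement, then
    # repeatedly add the statement least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(candidates,
                         key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds)))
    centroids = [vectors[s] for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [vectors[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = normalize(total)
    return labels

needs = [
    "Users must log in with a password",
    "Password reset by email for locked users",
    "Export sales report to PDF",
    "Schedule a weekly sales report",
    "Lock the account after failed password attempts",
    "Export report data to a spreadsheet",
]
labels = cluster_needs(needs, k=2)
print(labels)  # -> [0, 0, 1, 1, 0, 1]
```

On this toy input the password-related statements and the report-related statements fall into separate clusters. Real stakeholder input is far noisier, which is where the loosely related clusters described above tend to appear.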
3. RECOMMENDER TECHNOLOGIES
In ULS systems it is likely that the number and scope of discussion forums will be large and diverse. Our proposed framework therefore proactively manages participation in the forums through the use of recommender systems. Recommendation technologies have traditionally been used in e-commerce domains to recommend products to customers, and in information systems where content is dynamically targeted to one or more users [1,20]. They offer scalable solutions capable of supporting hundreds of thousands of users. The recommendation problem is typically formulated as a prediction task in which a predictive model is built from prior training data and then used in conjunction with the dynamic profile of a new user to predict that user's level of interest in a target item.

Recommender systems generally fall into three categories: content-based systems [18], which make recommendations based on the semantic content of data; collaborative-filtering systems [20], which make recommendations by examining a user's past interactions with the system and identifying other users with similar interests; and knowledge-based systems, which make recommendations based on knowledge of the user and pre-established heuristics.

To evaluate the use of recommender systems in large-scale requirements processes we have adopted a feature-augmentation approach which incorporates both content-based and collaborative filtering techniques. A prototype has been built and used to evaluate the effectiveness of recommender systems on a large set of feature requests extracted from SugarCRM, an open source project for managing customer relationships. In our initial informal results the generated discussion forums were found to be more cohesive with respect to standard coupling and cohesion metrics than the ad hoc ones created by users in the SugarCRM forums.
Furthermore, early experimental results have shown that the recommender systems were able to recommend a significant number of relevant forums to stakeholders. Despite these promising results, significant work is needed to evaluate the effectiveness of using recommender technologies within the requirements domain.
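One possible reading of a hybrid that combines content-based and collaborative techniques can be sketched as follows. This is not the prototype's design: the content-based step (Jaccard overlap between a stakeholder's terms and a forum's theme terms), the similarity-weighted collaborative step, and all names and data are assumptions made for illustration.

```python
import math

def jaccard(a, b):
    """Term overlap between two term collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def augment_profile(stakeholder_terms, forum_terms, observed):
    """Content-based step: produce a score for every forum.
    Observed participation is kept; unseen forums get a term-overlap score."""
    return {forum: observed.get(forum, jaccard(stakeholder_terms, terms))
            for forum, terms in forum_terms.items()}

def cosine(p, q):
    """Cosine similarity of two profiles defined over the same forums."""
    dot = sum(p[f] * q[f] for f in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def recommend(target, profiles, observed, top_n=1):
    """Collaborative step over the augmented profiles: score each forum the target
    has not joined by a similarity-weighted average of the other profiles."""
    scores = {}
    for forum in profiles[target]:
        if forum in observed[target]:
            continue
        num = den = 0.0
        for other, profile in profiles.items():
            if other == target:
                continue
            sim = cosine(profiles[target], profile)
            num += sim * profile[forum]
            den += sim
        scores[forum] = num / den if den else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical forums (theme terms) and stakeholders (terms from their stated needs).
forum_terms = {
    "security": {"password", "login", "account"},
    "reporting": {"report", "export", "pdf"},
    "billing": {"invoice", "payment", "tax"},
}
stakeholder_terms = {
    "alice": {"password", "login", "export"},
    "bob": {"login", "account", "invoice"},
    "carol": {"report", "export"},
}
observed = {
    "alice": {"security": 1.0},
    "bob": {"security": 1.0, "billing": 1.0},
    "carol": {"reporting": 1.0},
}
profiles = {name: augment_profile(stakeholder_terms[name], forum_terms, observed[name])
            for name in stakeholder_terms}

print(recommend("alice", profiles, observed, top_n=2))  # -> ['billing', 'reporting']
```

Here the content-based scores densify the otherwise sparse participation data, so that similarity between stakeholders can be computed even when they have joined few forums in common.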
4. GROUP COLLABORATION
Several researchers and commercial organizations have developed techniques and related tools for supporting collaborative requirements processes [9,17]. These tools and techniques could be adopted within the forums of the open requirements process. However, the decentralized model of the proposed open elicitation framework also introduces several subtle challenges. For example, if discussion forums spawn new discussion threads, diverge entirely from their initial discussion topic, or encroach on areas covered by other forums, these scenarios need to be dynamically detected by the underlying data-mining algorithms and trigger dynamic restructuring of forums. Any restructuring must be balanced against the need for stability within the forums, and must neither be overly disruptive to stakeholders nor diminish their sense of ownership or control of their own forums.

Other potential problems in decentralized forums stem from the fact that group leaders will not be present to facilitate and supervise the forums all of the time. This introduces a number of challenges related to leadership and group dynamics, such as forceful or manipulative stakeholders taking control of discussions, groupthink among forum members, or low commitment to contribute and discuss ideas. These problems can be partially alleviated by tracking participation, measuring the activity level of the forum, and implementing role-based access controls that limit tasks and influence levels according to authorization levels and subdomains. These areas need further investigation in order for the decentralized forum-based approach to be seen as a feasible and competitive option in large software projects.
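The detection of topic drift and of overlap between forums could be approached in many ways. The sketch below assumes a simple bag-of-words comparison with cosine similarity; the thresholds, function names, and sample posts are hypothetical and would need to be tuned and validated against the stability concerns raised above.

```python
import math
from collections import Counter

def term_vector(posts):
    """Bag-of-words counts over a list of post strings."""
    words = (w.strip(".,;:!?") for w in " ".join(posts).lower().split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity of two Counters (missing terms count as zero)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def has_drifted(seed_posts, recent_posts, threshold=0.3):
    """Flag a forum whose recent posts no longer resemble its founding posts."""
    return cosine(term_vector(seed_posts), term_vector(recent_posts)) < threshold

def overlapping_forums(recent_by_forum, threshold=0.6):
    """Return pairs of forums whose recent discussions look alike."""
    names = sorted(recent_by_forum)
    vectors = {name: term_vector(recent_by_forum[name]) for name in names}
    return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
            if cosine(vectors[x], vectors[y]) >= threshold]

seed = ["password reset by email", "login with password"]
on_topic = ["password reset link expired", "login password rules"]
off_topic = ["export report to pdf", "weekly sales report"]

print(has_drifted(seed, on_topic))   # -> False
print(has_drifted(seed, off_topic))  # -> True
print(overlapping_forums({
    "auth": on_topic,
    "accounts": ["password rules for login", "reset password"],
    "reports": off_topic,
}))  # -> [('accounts', 'auth')]
```

A flag raised by such a check would be a prompt for review rather than an automatic trigger, since restructuring a forum on a single noisy signal would undermine the stability and sense of ownership discussed above.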
5. REQUIREMENTS PRIORITIZATION
Resource limitations and timing constraints mean that requirements must be carefully prioritized [4,13], a fact that is especially true in ULS systems, which are expected to be built incrementally. Although there are many different prioritization techniques, most of them assume a more centralized process than will be feasible for ULS systems. One of the most common methods involves stakeholders placing requirements into categories such as mandatory, desirable, or inessential, or else quantitatively ranking them [4], while more sophisticated methods combine the preferences or decisions made by multiple stakeholders. Although these approaches are relatively scalable in terms of effort, they are problematic in large projects simply because no single group of stakeholders holds a global perspective, and prioritization decisions are therefore biased by the limited perspective of each stakeholder.

A second class of prioritization techniques, such as binary search trees or the Analytic Hierarchy Process (AHP), is based on the relative value of requirements and produces a strict prioritization. For example, AHP [11] uses a pair-wise comparison matrix to compute the relative value and cost of individual requirements with respect to one another. Although provably more accurate than simple categorization methods, these comparative approaches do not scale well, and are therefore not very practical to apply to large-scale projects.

In the proposed framework, stakeholders work in decentralized forums to collaboratively author and prioritize requirements. Individual stakeholders may have conflicting opinions concerning the importance of each requirement, and these differences should not be averaged out by a voting scheme that returns only one prioritization value. For example, stakeholder S1 might rank requirement R1 as ‘high’ and R2 as ‘low’, while stakeholder S2 might rank R1 as ‘low’ and R2 as ‘high’. Another stakeholder might provide no input for requirements R1 and R2 at all. This type of conflict and partial perspective reflects the normal differences of opinion and interests that are expected in the requirements process. ULS systems will therefore require new prioritization techniques that adopt more sophisticated decision support mechanisms, such as voting methods that propel win-win discussions rather than attempting to make final decisions, and techniques that infer global priorities from the partial information provided by the individual forums.
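To make the scaling limitation of pair-wise comparison concrete, the following sketch computes AHP-style priorities and counts the comparisons required. It assumes the row geometric-mean method, a common approximation to the eigenvector calculation used in AHP, rather than the procedure of any cited work; the function names and the comparison matrix are illustrative.

```python
import math

def comparisons_needed(n):
    """Number of pair-wise judgements for n requirements: n(n-1)/2."""
    return n * (n - 1) // 2

def ahp_priorities(matrix):
    """Approximate the AHP priority vector with normalized row geometric means.
    matrix[i][j] states how many times more valuable requirement i is than j,
    so matrix[j][i] should be its reciprocal and the diagonal should be 1."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical judgements: R1 is twice as valuable as R2 and four times R3;
# R2 is twice as valuable as R3.
judgements = [
    [1.0, 2.0, 4.0],
    [1 / 2, 1.0, 2.0],
    [1 / 4, 1 / 2, 1.0],
]
priorities = ahp_priorities(judgements)
print([round(p, 3) for p in priorities])  # -> [0.571, 0.286, 0.143]

print(comparisons_needed(20))    # -> 190
print(comparisons_needed(1000))  # -> 499500
```

Three requirements need only three judgements, but the count grows quadratically, so a single stakeholder comparing 1000 requirements would face 499,500 judgements. This is the practical obstacle to applying comparative methods in large-scale projects.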
6. ROBUST RECOMMENDATIONS
All human-intensive systems are vulnerable to individual stakeholders and groups manipulating the system to achieve their own personal agendas. If a human participant is able to figure out how a prioritization algorithm works, then that person can also find ways to manipulate the results. This phenomenon is well known in requirements prioritization. For example, prioritization schemes that award stakeholders a certain number of voting points [21] often result in collaborative schemes to affect the outcome of the prioritization in a way favorable to individual interests [4]. Sophisticated algorithms are also vulnerable to more complicated gaming moves by stakeholders [2,19].

Even assuming that stakeholders enter a requirements process with good intentions, it is human nature to try to find ways to promote personal interests. For example, in the elicitation stage, a stakeholder may attempt to manipulate forum content in order to affect the content-based recommendations made to other users. Similarly, a group of users entering bogus profiles into the system can manipulate collaborative recommendations in an attempt to promote a particular requirement, a phenomenon known as shilling [12].

In prior work, Mobasher and colleagues [2,14,15,19] identified and characterized several attack models that can be effective against various collaborative recommender system technologies. Their work has studied the robustness of different recommendation algorithms against different attack types, and has resulted in detection algorithms based on the statistical properties of various attack types. Clearly, the scale and complexity of ULS systems will introduce new possibilities for manipulation. We are therefore extending this work by identifying different types of vulnerabilities through which stakeholders might be able to manipulate the outcomes of the requirements specification and prioritization processes, and by developing appropriate countermeasures to mitigate these problems.
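The effect of bogus profiles on an averaged priority, and the general idea of detection from statistical properties, can be illustrated with a toy example. The deviation score and cutoff below are assumptions made for this sketch and are not the detection algorithms of the cited work; the ratings data are invented.

```python
from statistics import mean, median

def item_means(ratings):
    """Mean rating per requirement across all profiles."""
    by_item = {}
    for profile in ratings.values():
        for item, score in profile.items():
            by_item.setdefault(item, []).append(score)
    return {item: mean(scores) for item, scores in by_item.items()}

def deviation_score(profile, means):
    """Average absolute distance of a profile's ratings from the item means."""
    return mean(abs(score - means[item]) for item, score in profile.items())

def flag_suspicious(ratings, factor=1.5):
    """Flag profiles whose deviation exceeds factor x the median deviation."""
    means = item_means(ratings)
    scores = {user: deviation_score(p, means) for user, p in ratings.items()}
    cutoff = factor * median(scores.values())
    return sorted(user for user, s in scores.items() if s > cutoff)

def priority(ratings, item, exclude=()):
    """Mean rating of one requirement, optionally ignoring some profiles."""
    return mean(p[item] for user, p in ratings.items()
                if user not in exclude and item in p)

# Hypothetical ratings on a 1-5 scale; s1 and s2 are injected to promote R2.
ratings = {
    "u1": {"R1": 4, "R2": 2, "R3": 3},
    "u2": {"R1": 4, "R2": 2, "R3": 3},
    "u3": {"R1": 5, "R2": 2, "R3": 3},
    "u4": {"R1": 4, "R2": 1, "R3": 3},
    "u5": {"R1": 4, "R2": 2, "R3": 4},
    "s1": {"R1": 1, "R2": 5, "R3": 3},
    "s2": {"R1": 1, "R2": 5, "R3": 3},
}

flagged = flag_suspicious(ratings)
print(flagged)  # -> ['s1', 's2']
print(round(priority(ratings, "R2"), 2))                   # -> 2.71
print(round(priority(ratings, "R2", exclude=flagged), 2))  # -> 1.8
```

A heuristic this simple would also flag honest stakeholders who hold a minority view, and a coordinated group large enough to shift the item means could evade it. Both weaknesses illustrate why more careful attack models and countermeasures are needed.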
7. CONCLUSIONS
This position paper has described a feasible approach that uses data mining and recommender systems to scale up the fundamental processes of requirements elicitation and prioritization. We believe that if these fundamental problems can be solved, we will be in a much stronger position to tackle some of the higher-level problems identified in the ULS Systems report [16], such as those related to unstable requirements, emergent requirements, and variable trade-offs that occur across different instantiations of otherwise similar products.
ACKNOWLEDGMENTS
The work described in this paper was partially funded by NSF grants CCR-0306303, CCR-0447594, and IIS-0430303.
REFERENCES
[1] Basu, C., Hirsh, H., & Cohen, W. “Recommendation as Classification: Using Social and Content-Based Information in Recommendation”, National Conference on Artificial Intelligence, (Madison, WI, 1998) 714-720.
[2] Burke, R., Mobasher, B., Williams, C., & Bhaumik, R. “Detecting Profile Injection Attacks in Collaborative Recommender Systems”, IEEE Joint Conf. on E-Commerce Technology and Enterprise Computing, (Palo Alto, 2006) 23.
[3] Cleland-Huang, J., Settimi, R., Zou, X., & Solc, P. “Automated Detection and Classification of Quality Requirements”, Requirements Engineering Journal, Springer-Verlag, (August, 2007) 36-45.
[4] Davis, A., Dieste, O., Hickey, A., Juristo, N., & Moreno, A. “Effectiveness of Requirements Elicitation Techniques”, IEEE International Requirements Engineering Conference, (Minneapolis, MN, Sept. 2006) 179-188.
[5] Duan, C., & Cleland-Huang, J. “A Clustering Technique for Early Detection of Dominant and Recessive Cross Cutting Concerns”, Early Aspects, (Minneapolis, MN, 2007).
[6] Duan, C., & Cleland-Huang, J. “Clustering Support for Automated Traceability”, Automated Software Engineering, (Atlanta, Georgia, 2007) 244-253.
[7] Duan, C. Clustering and its Application in Requirements Engineering, Technical Report #08-001, School of Computing, DePaul University, Available online at http://www.cs.depaul.edu (Chicago, Feb. 2008).
[8] Goldstein, H. “Who Killed the Virtual Case File?”, IEEE Spectrum, 42, 9 (2005) 24-35.
[9] Gruenbacher, P. “Integrating Groupware and CASE Capabilities for Improving Stakeholder Involvement in Requirements Engineering”, Euromicro, 2 (2000) 2232.
[10] Hooks, I. F., & Farry, K. Creating Successful Products Through Smart Requirements Management. New York: AMACOM. (2001).
[11] Karlsson, J., & Ryan, K. “A Cost-Value Approach for Prioritizing Requirements”, IEEE Software, 5 (1997) 67-75.
[12] Lam, S., & Riedl, J. “Shilling Recommender Systems for Fun and Profit”, Conf. on the World Wide Web, (2004) 393.
[13] Mead, N. R. “Requirements Prioritization Introduction”, Software Eng. Inst. web pub., Carnegie Mellon Univ., (2006).
[14] Mobasher, B., Burke, R., & Sandvig, J. “Model-based collaborative filtering as a defense against profile injection attacks”, Proceedings of the 21st National Conference on Artificial Intelligence, (Boston, MA, 2006).
[15] Mobasher, B., Burke, R., Bhaumik, R., & Williams, C. “Towards trustworthy recommender systems: An analysis of attack models and algorithm robustness”, ACM Transactions on Internet Technology, 7, 4 (2007).
[16] Northrop, L., Feiler, P., Gabriel, R., Goodenough, J., Linger, R., Longstaff, T., Kazman, R., Klein, M., Schmidt, D., Sullivan, K., & Wallnau, K. Ultra-Large-Scale Systems: The Software Challenge of the Future, Technical Report, Software Engineering Institute, Carnegie Mellon, (June 2006).
[17] Nunamaker, J. F., Briggs, R., & Mittleman, D. “Lessons from a Decade of Group Support Systems Research”, Hawaii International Conference on System Sciences, (Hawaii, 1996) 418-427.
[18] Pazzani, M., & Billsus, D. “Content-Based Recommendation Systems”. In P. Brusilovsky, A. Kobsa, & W. Nejdl, The Adaptive Web: Methods and Strategies of Web Personalization. Berlin Heidelberg New York: Springer-Verlag. (2007).
[19] Sandvig, J. J., Mobasher, B., & Burke, R. “Attacks and Remedies in Collaborative Recommendation”, Expert Sys., Special Issue on Recommender Sys., 22, 3 (2007).
[20] Schafer, J. B., Frankowski, D., & Shilad, S. “Collaborative Filtering Recommender Systems”. In P. Brusilovsky, A. Kobsa, & W. Nejdl, The Adaptive Web: Methods and Strategies of Web Personalization. New York: Springer-Verlag. (2007).
[21] Wiegers, K. E. Software Requirements, Microsoft Press, Redmond, WA, (1999).