
Proceedings of the IEEE World Congress on Computational Intelligence, Hong Kong, June 1-6, 2008.

Computational Intelligence Virtual Community: Framework and Implementation Issues

Jacek M. Zurada, Janusz Wojtusiak, Fahmida Chowdhury, James E. Gentle, Cedric J. Jeannot, and Maciej A. Mazurowski

Abstract—This paper discusses the framework for virtual collaborative environment for researchers, practitioners, users and learners in the areas of computational intelligence and machine learning (CIML) that is currently developed by our group. It also outlines main features of the community portal under construction that will support communication and sharing of computational resources. In particular, selected aspects of structure of the portal such as common formats of data, models, software, publications and software documentation are discussed. The preliminary portal is available at URL: www.cimlcommunity.org.

I. INTRODUCTION

Virtual collaborative organizations and communities are geographically distributed groups of people who collaborate using various modern media. The importance of such organizations is constantly growing. This can be attributed to several factors, including the growth of the Internet as the main medium for distributing scientific resources, and the related need to organize such resources. Researchers from many scientific fields have already created virtual communities to share findings, data, methods, and other resources. Among the most prominent are efforts in bioinformatics and medicine, but there are also notable efforts in other domains. Researchers in these fields have established standards for reporting results, sharing data and methods, and online collaboration. Recent years have seen numerous attempts to develop virtual collaboration in computational intelligence and machine learning (CIML). Many individuals, ad-hoc groups, research teams, and national or international organizations have established a virtual presence in the field by designing and deploying a wide variety of repositories or collaborative sites. Their

Manuscript received December 10, 2007. This work was supported in part by the National Science Foundation under Grant CBET 0742487. Opinions and statements expressed below neither reflect the position of the sponsoring organization nor are those of the US Government. J. M. Zurada is with the University of Louisville, Louisville, KY, USA (e-mail: [email protected]). J. Wojtusiak is with George Mason University, Fairfax, VA, USA (e-mail: [email protected]). F. Chowdhury is with the University of Louisiana, Lafayette, LA, USA (e-mail: [email protected]). J. E. Gentle is with George Mason University, Fairfax, VA, USA (e-mail: jgentle@gmu.edu). C. J. Jeannot is with the University of Louisville, Louisville, KY, USA (e-mail: [email protected]). M. A. Mazurowski is with the University of Louisville, Louisville, KY, USA (e-mail: [email protected]).

contributions to global research and learning, as well as their overall impact, can be evaluated using a variety of criteria, and any such evaluation is subjective from the perspective of a particular user. This is because each of these sites has its unique scope and vision, or promotes specific techniques and results. Despite their diversity and richness, most of these sites remain confined to narrow communities, and their activities have inherent barriers; they therefore are, and will remain, of limited reach. Also, none of the currently existing collaborative sites, including those of the best quality, reflects a unifying and global CIML effort.

The most popular are pages maintained by individuals, research groups and centers that serve as useful repositories of publications and software. These sites, however, present only the perspective of a single group or individual. For example, [1] presents information about the page owner’s published books, software and machine learning tools. Website [2] is an example of an initiative driven by a handful of researchers that is dedicated to a narrow CIML sub-area (support vector and kernel machines). Another example of a focused site is [3], maintained by a research team at the Helsinki University of Technology (the institution where the WEBSOM technology was invented, and where it has since been perfected and put to good use). The tool displays an ordered map of a semantic space, in which similar documents lie near each other. The Weka site [4] also contains important resources. It offers open-source, machine learning-based data mining tools developed in Java. They were initially implemented by two authors, and later developed by other members of the community built around this common objective. Weka offers tools for data pre-processing, classification, regression, clustering, association, rule extraction and data visualization, and its software remains very popular.

Sites with data repositories are also very useful. Among others, a comprehensive list of links to such websites is offered by the Standards Committee of the IEEE Computational Intelligence Society (CIS) at [5]. The most popular data sites are the University of California at Irvine repository [6] and the Carnegie Mellon University repository [7], along with repositories of specific types of data, e.g. the digital database for screening mammography at the University of South Florida [8].

As of today, there are several portals that enable a limited degree of collaborative research effort in areas close to CIML. The most prominent of these sites have already initiated several geographically distributed (but not quite virtual) communities permitting uniform collaborative access to hundreds of researchers. One such organization


is the PASCAL network [9], the leading Europe-wide community, whose scope is similar to that of the CIML portal described in this paper. It is a flat organization of researchers working in the areas of pattern analysis, statistical modeling and computational learning. It involves over 50 research teams, called nodes, represented by different institutions, with each node receiving funding from the European Union. According to the opinions of participants and observers, PASCAL’s main benefit is that it enforces collaboration; on the other hand, it has limited usefulness (and access) for practitioners.

The CIML portal facilitates interaction between the following groups: members (participants of the project), developers, users and learners, as shown in Fig. 1, where the groups are arranged in layers. While members of the network constitute its kernel, the outside community of developers is expected to be spontaneous, self-organizing and truly virtual. The benefits of the virtual community in CIML will transcend both the research boundaries and the traditional scope of CIML, and will extend beyond the community members and developers. The most important benefit will be open access to computing and educational resources for the communities of users and learners that surround the network kernel, as shown in Fig. 1. This layer of the network is expected to be the largest in number and will include the bulk of the CIML communities.

In this paper we describe the framework for the virtual community in CIML, and selected aspects of building its portal. Some of these aspects, such as communication between members, are general and apply to any virtual community portal. Other aspects, such as common formats for resource sharing, are more specific to the portal’s field. The described framework and solutions are part of an ongoing effort supported by the National Science Foundation.

Fig.1: Interactions within the CIML community.

II. COMPUTATIONAL INTELLIGENCE AND MACHINE LEARNING PORTAL

Collaboration between scattered researchers in the CIML field can be facilitated by an Internet portal that allows its users to share software, algorithms and other resources. Our group has undertaken the development of such a portal (available at http://www.cimlcommunity.org). The primary objective of the portal is to support the advancement of universal practices in collaborative research in CIML and to offer research resources to other research areas. The portal should also provide open source resources divided into five groups: algorithms, software, application tools, models, and data.

The role of the portal is to provide a forum for exchanging materials in the field; to allow communication within the community (between the members, users and the public); to offer a discussion forum on topics related to the community’s areas of interest; to host software, tools and data for CIML research; and to guide individuals visiting the portal through large repositories of knowledge, software, tools, and data. The last function is particularly relevant to people who are not experts in the field but who want to explore useful tools that can be applied to their own problems, survey the research options available, or simply extend their knowledge of the field. This would include students at different levels of preparation. In addition, the portal’s training and educational listings are helpful for communities of users and learners. They benefit from postings of prerecorded lectures and demos, as well as from prepared informational pages and references to publications and other resources.

An important issue in the development of any web portal is its user interface, which is one of the most important factors contributing to a portal’s success and popularity. We designed the CIML portal to be content-driven; that is, users should be able to easily access relevant information from different pages within the portal. At the same time we plan to keep the interface as simple as possible and show only relevant and important information.

The CIML portal will be organized into several interconnected sections. These sections are computational resources, which include data, algorithms, software, tools and models, together with a collection of articles, educational packages, frequently asked questions and a discussion forum. Interactions between computational resources are depicted in Fig. 2. For example, when accessing a download site for a machine learning program, a user can immediately access a description of its algorithm, a list of relevant publications, articles about the program hosted at the portal, educational materials, etc.


Fig. 2: Computational resources provided by the portal (Algorithms, Software, Models, Tools, Data).

Fig. 3: Communication and learning environment of the portal (Discussion Forum, Educational Packages, Articles, Direct Communications).

Fig. 3 illustrates connections between different parts of the communication and learning environment within the portal that will be made available to all community layers. The following sections discuss selected aspects of the portal: articles and communication between its users; the data, software and tools repository; and the collection of publications, articles and documentation.

A. Articles and Communication between Members

There are different possible levels of communication between users of the CIML virtual community portal. At one end there is the traditional way of exchanging information through articles on the portal’s website, publications, etc., and at the other there is direct communication between the portal’s users through chats, emails, teleconferences and video conferences, virtual desktops, etc. In between are all other forms of communication, such as discussion lists.

For articles in the portal we assume a model similar to the one used in Wikipedia [10]. This model, in which registered users create articles in the portal, has been demonstrated to work well. In the virtual community portal, in contrast to Wikipedia, users’ rights to create and edit articles would be approved by the portal’s Editors. For example, articles created by students and some new members may need to go through a review process, while established members have a direct right to create and edit articles (web pages of the portal). On the other hand, students should have the full right to ask questions and participate in discussions on the portal’s forums.
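The editing rules described above can be summarized in a few lines of Python. This is only an illustrative sketch: the role names, the `requires_review` function and the choice of which roles edit directly are assumptions made for the example, since in the described model these rights are granted case by case by the portal's Editors.

```python
from enum import Enum


class Role(Enum):
    STUDENT = "student"
    NEW_MEMBER = "new member"
    ESTABLISHED_MEMBER = "established member"


class Action(Enum):
    EDIT_ARTICLE = "edit article"
    POST_TO_FORUM = "post to forum"


# Assumed policy: only established members publish article edits directly.
DIRECT_EDIT_ROLES = {Role.ESTABLISHED_MEMBER}


def requires_review(role, action):
    """Return True if a contribution should be held for Editor review."""
    if action is Action.POST_TO_FORUM:
        # Questions and discussion are open to every registered role.
        return False
    return role not in DIRECT_EDIT_ROLES
```

Under these assumptions, `requires_review(Role.STUDENT, Action.EDIT_ARTICLE)` is true, while a student's forum post is accepted without review.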

B. Sharing Resources: Data and Software

The portal will provide computational resources such as data and software. To ensure the high quality of available resources, they need to go through a review process similar to the one used in scientific journals. Sharing resources also requires imposing common formats on the shared data, software, etc. It is not uncommon that the most time-consuming part of research is experimental comparison with state-of-the-art methods, testing algorithms on commonly used benchmark problems or datasets, or simply applying different previously presented methods to a new problem. Thus, one of the basic goals of the virtual community is to provide standards that make this part of research easier. This includes common formats not only for stored datasets, but also for software modules with their interfaces, models, algorithms, etc.

1) Data

Probably the most common data format in machine learning research is the one used by the UCI Machine Learning Repository [6]. The format is compatible with the c4.5 decision tree learning program [11] and several other programs. This data format has, however, strong limitations, including restrictions on attribute types, no possibility of defining background knowledge, and the use of only one flat data table with no support for relational data. Many programs, such as AQ21 [12], have added the capability to read this data format so that they can be easily applied to common test problems. However, they have very limited functionality (e.g. no possibility of defining parameters) when working with c4.5 files. AQ21, like other programs, has its own data format. For example, the WEKA system [13] uses its native ARFF format. All of the above formats use comma-separated data, but differ in the way the representation space and tasks are defined. Comma-separated data has many advantages: it is platform-independent, easy to use without any special software, and can be loaded using virtually any statistical package, database system, or spreadsheet. It also has very strong limitations. Also, many programs are equipped with import tools that can read popular data formats.

All data submitted to and stored in the virtual community repositories need to be standardized to follow the requirements of the common format. To allow compatibility with previously developed and frequently used programs, a set of tools can be used to transform these data to the formats required by particular applications. Similarly, users should be able to convert files from different formats to the common format. These tools can be offered either as standalone programs, downloadable from the community website, or embedded into the portal. With the latter option, the user will simply request data in a given format, and all needed conversions will be done automatically.

Because of the flexibility required of the common data format, XML seems to be an appropriate language for describing the data. The XML data files for the common format are structured into two mandatory sections that correspond to the definition of the representation space and the


actual data tables. The representation space is defined by specifying tables, attributes and their domains, and relations between the tables. The data section lists values of attributes in the defined tables. The use of XML, in contrast to a fixed relational database structure, gives the flexibility needed by many computational intelligence and machine learning applications. For example, it allows defining meta-values (a generalization of missing values) or aggregated values (used for meta-analysis), and defining data in forms other than relational tables (predicates, rules, examples of different lengths, etc.).

Data stored in the widely used c4.5 format implicitly define the learning task by specifying the output attribute (always the last column in the data table). The XML representation of data makes it possible to separate data from learning tasks, similarly to Weka’s ARFF format. The same data can be used for concept learning, clustering, optimization, and other tasks with many different parameter settings. Moreover, input to different applications may allow specifying different forms of background knowledge, models, structures, etc. Tasks to be performed by learning programs are also encoded using XML, and can be included in the same file as the data or defined separately.

The CIML portal will allow submitting results of applying different methods to the available datasets. Users interested in downloading and using different datasets will be able to check results from different methods, relevant publications, models, etc.

2) Algorithms, Programs and Tools

A more intricate problem is how to define a common format for algorithms, programs and tools available to the virtual community’s users. One of the important roles of the community portal is to provide free access to a repository of software developed by different researchers, groups, companies and organizations. After careful review, such software, as well as older well-known software, will be added to the portal’s repository.
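The two-section layout of the common XML data format (a representation space followed by the data tables) and the role of conversion tools can be illustrated with a short Python sketch. The paper does not specify the schema, so every element and attribute name below (`dataset`, `representation`, `table`, `attribute`, `row`, `domain`) is a hypothetical choice made for this example, and the c4.5-style output is a simplified rendering of the `.names` and `.data` files. The output attribute is passed in separately, reflecting the separation of data from the learning task.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; element and attribute names are illustrative only.
EXAMPLE_XML = """
<dataset>
  <representation>
    <table name="weather">
      <attribute name="outlook" type="nominal" domain="sunny,overcast,rain"/>
      <attribute name="temperature" type="continuous"/>
      <attribute name="play" type="nominal" domain="yes,no"/>
    </table>
  </representation>
  <data>
    <table name="weather">
      <row>sunny,85,no</row>
      <row>overcast,83,yes</row>
      <row>rain,70,yes</row>
    </table>
  </data>
</dataset>
"""


def load_table(xml_text, table_name):
    """Read one table: its attribute definitions and its rows."""
    root = ET.fromstring(xml_text)
    spec = root.find(f"./representation/table[@name='{table_name}']")
    attributes = [(a.get("name"), a.get("type"), a.get("domain"))
                  for a in spec.findall("attribute")]
    names = [name for name, _, _ in attributes]
    body = root.find(f"./data/table[@name='{table_name}']")
    rows = [dict(zip(names, r.text.strip().split(",")))
            for r in body.findall("row")]
    return attributes, rows


def to_c45(attributes, rows, output_attribute):
    """Render simplified c4.5-style .names and .data text.

    The output attribute is chosen by the caller and written as the last
    column, so the same data can serve several learning tasks.
    """
    domains = {name: domain for name, _, domain in attributes}
    if not domains.get(output_attribute):
        raise ValueError("this sketch assumes a nominal output attribute")
    inputs = [a for a in attributes if a[0] != output_attribute]
    names_lines = [", ".join(domains[output_attribute].split(",")) + "."]
    for name, kind, domain in inputs:
        values = "continuous" if kind == "continuous" else ", ".join(domain.split(","))
        names_lines.append(f"{name}: {values}.")
    data_lines = [",".join([row[n] for n, _, _ in inputs] + [row[output_attribute]])
                  for row in rows]
    return "\n".join(names_lines), "\n".join(data_lines)
```

Calling `to_c45(*load_table(EXAMPLE_XML, "weather"), "play")` and then again with `"outlook"` as the output attribute yields two different learning tasks from the same stored data, which a fixed last-column convention cannot express without duplicating the file.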
One issue concerning a standard format for software is that these programs should be able to work with applicable datasets as defined in the previous section. By applicable we mean that the data can be used for certain tasks; for example, a concept learning program should work with all types of data stored in tables, but not necessarily with a set of predicates. This can be achieved either by loading XML data directly into the programs, or by using appropriate tools for data conversion. It is not realistic to modify older programs to work with the standard XML data format available in the repository. For example, well-known concept learning programs such as c4.5, RIPPER, CN2 and AQ17 have their own data formats. For all these programs, appropriate conversion tools should be developed.

Providing a common format for test problems in some CI areas, such as evolutionary computation, is much harder than for usual ML tasks such as the one described in the previous paragraph. In the case of evolutionary computation, problems are defined by specifying search spaces, fitness functions that define optimization objectives (often

complicated and requiring execution of external simulators), sets of problem constraints, relevant background knowledge, and possibly also sets of already known solutions (e.g. as starting populations for evolutionary computation methods). Such problems therefore cannot be represented by simple data files, but rather by software modules able to compute fitness and constraints. In this case, submissions should be compatible with a common programming interface standard. The interface should allow users to integrate computational problems without much programming effort. In addition to the programming interface, a possibility is to allow problem-dependent modules to be executed as separate programs, with communication through files. Although computationally inefficient, this method would give full flexibility, and is independent of the programming language.

Another issue is the compatibility of various pieces of software and their platform independence. A standard has to be set that allows the uploaded modules to be interfaced with each other (and with users’ preexisting software) and to be used independently of the user’s hardware and operating system. It is not realistic to assume that all software submitted to the repository will be fully compatible with all other modules, will be platform-independent, and will have the same programming interface. Even in commercially developed software, these issues are often not resolved and different modules simply do not work together. In the scientific community, the situation is more complicated because different researchers and groups have already developed a vast amount of software, modules, libraries, etc. without following common standards. While most currently developed software in computational intelligence is implemented using relatively recent programming languages and tools, this is not always the case. For example, in numerical domains, software developed in FORTRAN is still present, although Matlab is gaining popularity. Considering this diversity, we recommend certain standards for software submitted to the repository, while allowing for flexibility and for special cases not fully compatible with these standards. Otherwise, many users would not share their software with the repository, but would post it on their own websites.

One possible approach to the unification issue is to define a common standard of programming in Java. Several sites feature selected computational intelligence and machine learning programs in Java, such as Weka [13], the Stuttgart Neural Network Simulator [14], and the Self-Organizing Map [15]. A preliminary neural network builder in Java is featured on site [16]. Moreover, many programming languages and environments, e.g. Matlab, can be interfaced with Java modules. Another solution is to perform an upfront translation to the common format after submission. This will be transparent to the user, will allow the portal to store only one format, and will provide compatibility between the modules.
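The two integration routes discussed in this section for problems such as those in evolutionary computation, a common programming interface and problem-dependent modules run as separate programs that communicate through files, can be sketched as follows. The sketch is written in Python for brevity rather than in the Java environment proposed above, and all names (`Problem`, `ExternalProblem`, the JSON file layout, the stand-in simulator script) are assumptions made for illustration, not part of any portal standard.

```python
import json
import subprocess
import sys
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Problem(ABC):
    """Hypothetical common interface: a problem scores one candidate solution."""

    @abstractmethod
    def evaluate(self, candidate):
        """Return a dict with 'fitness' (lower is better) and 'feasible'."""


class SphereProblem(Problem):
    """In-process module: sphere function with a box constraint."""

    def evaluate(self, candidate):
        return {
            "fitness": sum(v * v for v in candidate),
            "feasible": all(-5.0 <= v <= 5.0 for v in candidate),
        }


class ExternalProblem(Problem):
    """Problem-dependent module run as a separate program, exchanging files."""

    def __init__(self, command, workdir):
        self.command = list(command)
        self.workdir = Path(workdir)

    def evaluate(self, candidate):
        in_path = self.workdir / "candidate.json"
        out_path = self.workdir / "result.json"
        in_path.write_text(json.dumps({"x": candidate}))
        # The external program receives the input and output paths as arguments.
        subprocess.run(self.command + [str(in_path), str(out_path)], check=True)
        return json.loads(out_path.read_text())


# Stand-in for an external simulator; it could be written in any language.
EXTERNAL_SCRIPT = """
import json, sys
with open(sys.argv[1]) as f:
    x = json.load(f)["x"]
result = {"fitness": sum(v * v for v in x),
          "feasible": all(-5.0 <= v <= 5.0 for v in x)}
with open(sys.argv[2], "w") as f:
    json.dump(result, f)
"""


def best_feasible(problem, candidates):
    """Return the feasible candidate with the lowest fitness, or None."""
    scored = [(problem.evaluate(c), c) for c in candidates]
    feasible = [(r["fitness"], c) for r, c in scored if r["feasible"]]
    return min(feasible)[1] if feasible else None


def demo():
    """Run the same search code against both kinds of problem module."""
    candidates = [[1.0, 2.0], [6.0, 0.0], [0.5, -0.5]]
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "simulator.py"
        script.write_text(EXTERNAL_SCRIPT)
        external = ExternalProblem([sys.executable, str(script)], workdir)
        return (best_feasible(SphereProblem(), candidates),
                best_feasible(external, candidates))
```

The search code in `best_feasible` depends only on the interface, so it does not change when the in-process module is replaced by the file-based one. The file-based route starts one process per evaluation, which illustrates the computational inefficiency noted above.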


C. Publications, Citations and References Databases

An important service for any scientific community is to offer access to all available relevant information in a given context. For example, when browsing the portal’s content relevant to a particular symbolic machine learning method, users should be able to follow links to relevant publications describing the method, hosted either by the portal or elsewhere. The user should also be able to find references to more general information, such as a list of the most relevant conferences with lists of their papers, relevant journals, etc. Most of this information is already available on the Internet, but is distributed over different sites. For example, DBLP [17] is one of the most popular sources of publications in the field. Instead of keeping a list of publications for a given machine learning conference, it may be sufficient to provide a link to the appropriate DBLP record. Other initiatives to create publication databases in the field of computer science include Citeseer [18] and CSB [19]. Thus, it is not our intention to duplicate existing efforts, but rather to interface with them. Whenever publications are protected by publishers’ copyrights and it is not possible to host them in the portal and distribute them to its members, the portal will provide links to publishers’ websites.

In addition to links to existing repositories of papers, the portal may offer technical reports for materials not otherwise publishable. User guides, guides to programming interfaces, detailed descriptions of data formats and similar material are often not found in scientific journals, which are more interested in methodologies, algorithms, and results. Technical reports of the CIML portal will provide an opportunity to publish vital technical information about products available on the community’s portal. Such technical reports, with clearly identifiable authors, will play a complementary role to web pages, which in the portal’s model may have many authors and can be constantly updated.

III. CONCLUSIONS

This paper presented a framework and selected implementation issues of an Internet portal for the virtual community in computational intelligence and machine learning that is currently being developed by our group. The main goals of the portal are to provide channels of communication between members of the community, to standardize the ways in which resources are shared, and to build a repository of these resources. Our goal is to give users of the portal easy access to various types of resources together with relevant software, references, datasets, methods, and their descriptions. An important and challenging part of the portal is the standardization of data, model, and software formats. The solutions proposed here are based on the common XML data format, a set of tools for data conversion, and a standardized Java programming environment for software. Software submitted to the portal should be compatible with these standards, either directly or through conversion tools. To ensure the high quality of the content of the portal, most resources submitted by users, such as articles, data, software, models, and results, will go through a peer-review process similar to the one currently used in scientific journals.

ACKNOWLEDGMENT

The development of the CIML virtual organization portal is supported by the National Science Foundation grant CBET 0742487.

REFERENCES

[1] V. Kecman, Author’s site: http://www.engineers.auckland.ac.nz/~vkec001/
[2] SVM and Kernel Machines site: http://www.kernel-machines.org/
[3] Helsinki University of Technology WEBSOM site: http://websom.hut.fi/websom/
[4] I.H. Witten and E. Frank, Machine Learning Weka site: http://www.cs.waikato.ac.nz/~ml/
[5] Standards Committee IEEE CIS Repository: http://ieeecis.org/standards/benchmarks/
[6] A. Asuncion and D.J. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 2007.
[7] Carnegie-Mellon Data Repository: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/
[8] Digital Database for Screening Mammography, University of South Florida website: http://marathon.csee.usf.edu/Mammography/Database.html
[9] The PASCAL Network: http://www.pascal-network.org/Network/
[10] Wikipedia website: http://www.wikipedia.org
[11] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., 1993.
[12] J. Wojtusiak, R.S. Michalski, K. Kaufman and J. Pietrzykowski, "The AQ21 Natural Induction Program for Pattern Discovery: Initial Version and its Novel Features," Proceedings of The 18th IEEE International Conference on Tools with Artificial Intelligence, Washington D.C., November 13-15, 2006.
[13] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005.
[14] Stuttgart Neural Networks Simulator: http://wwwra.informatik.unituebingen.de/software/JavaNNS/welcome_e.html
[15] Self-Organizing Feature Map Generator: http://javasom.sourceforge.net/
[16] Neural Networks Designer: http://www.jooneworld.com/
[17] M. Ley, DBLP website: http://www.informatik.uni-trier.de/~ley/db/
[18] Citeseer website: http://citeseer.ist.psu.edu/
[19] The Collection of Computer Science Bibliographies website: http://liinwww.ira.uka.de/bibliography/index.html