The Handbook of Probability in Computing: Interactive Media for Research

Wray L. Buntine
RIACS & NASA Ames Research Center
Mail Stop 269-2
Moffett Field, CA 94035-1000, USA

Abstract

The paper outlines a Handbook being developed for distribution on the World-Wide Web. The Handbook combines an encyclopaedia with an annotated bibliography, tutorials, and links to community events and software. The Encyclopaedia and Annotated Bibliography form the centerpiece of the Handbook. Their development will be distributed amongst the community, driven by community reviewing and authorship. This is an experiment in network-based research cooperation that we hope will be a model for future publication and education media. This paper outlines the proposed Handbook, discusses the benefits to the uncertainty community, and discusses some of the development issues involved. Finally, this paper calls for community feedback and involvement.

Keywords: world-wide web, research, uncertainty community, encyclopaedia, annotated bibliography.

1 Introduction

This paper is a proposal for an online Handbook on the application and supporting theory of probability in the computing sciences. Computational applications of probability are being developed at the interface of artificial intelligence, computer science, statistics, decision theory, neural networks and related fields. The members of the proposed community include users of probability in the uncertainty community in artificial intelligence, Bayesians within machine learning and neural networks, some theoretical computer scientists familiar with probability, users of probability in application areas such as natural language, speech, vision, and robotics, and statisticians developing computational methods such as Monte Carlo methods and statistical software. Graphical models, for instance Bayesian and Markov networks, have recently become recognized as a unifying language for some of the techniques spread across the different areas, so the uncertainty community in artificial intelligence is a reasonable center for the wider community of users of probability in computing.

The online Handbook proposed here is an ideal match for this community of applied users of probability. The community is fairly well agreed on basic principles and coupled applied methods, at least when compared with other communities. Its members are computer literate, are already heavy users of the Internet, and are often users of interactive systems. Their methods have solid theoretical and mathematical foundations that are strongly coupled with a growing variety of applications.

The Handbook proposed is a community project, representing a record of tutorial and established expertise, and knowledge of the literature. It is hoped the Handbook would grow with the community and represent, in an integrated manner, more concepts and algorithms than any one small group of individuals could ever hope to master (and thus publish in a book). We also believe the Handbook can potentially accelerate the development of the community as a whole, and help promote access to the community's efforts for other researchers.

This paper has several goals:

• to outline the overall concept, in Sections 2 and 3,
• to place an initial set of proposals before the community, i.e. by describing each section and a mechanism for maintaining it, in Sections 4, 5, and 6, and
• to enlist community support.

Because the Handbook is essentially a community project, it is important to gain support from within the community and to ensure that an appropriate set of mechanisms is put in place that suits the community as a whole. The appendices describe various issues such as the roles of editors, authors, etc.

2 An overview of the proposed Handbook

The home page for the Handbook is available for perusal¹, with some sections partly functioning. The Handbook has four main content sections: News and Notes, Tutorials, the Encyclopaedia, and the Annotated Bibliography. The News and Notes section collects together conference announcements, calls for papers, information on software, and other community related events and services. A fifth section, the Contributors section, gives information for editors, authors, reviewers and other contributors. The development of the Handbook is to be both distributed and interactive, in many cases guided by forms constructed in the Mosaic forms facility, for instance, for reader feedback, reviews, etc. The Contributors section therefore provides both a style guide and instructions to facilitate community involvement.

As with all encyclopaedias, graduate textbooks, and similar publications, print or electronic, the Encyclopaedia will not contain new research ideas but rather include material that has in some sense been settled by the experts, and is written up in some semi-uniform notation. The Encyclopaedia also serves as an organizational platform for the Annotated Bibliography, providing the background information often assumed in research publications. The Annotated Bibliography is more informative than the simple listings of references usually found in texts. The annotations are to serve as a road map to guide researchers through the literature, to prepare them, direct them, and warn them. The Tutorials serve more as an entry point into the Handbook proper. More detailed descriptions of each of these Handbook sections are given in subsequent sections below.

Different classes of users would be allowed access to the Handbook: editors, authors, reviewers and general readers, whose roles are more or less described by their title. Reviewers, for instance, would be any active member of the research community with some publication experience. General readers constitute the Internet community. There would also be some few dozen area editors.

The Handbook would come in different releases. The current public release would be accessible to the Internet user in electronic form. A current development release would contain all other sections currently under construction. Only editors, authors and reviewers would be allowed access to the development release, and this would contain more detailed comments, feedback, access statistics, etc. Depending on the status of the Encyclopaedia and copyright, a print version may also be made available.

¹ I tell a lie. It will be available, hopefully in early March. The URL will be announced to the program committee at that time.

3 Intended functionality

Before discussing the format, first consider some of the roles the Handbook would hopefully fulfill.

Not a medium for original research publication:

The Handbook is not a record of original research; that role is played by journals. Rather, it is a support medium for research intended to facilitate access to research and to help researchers integrate their research with others.

A guide to the research literature:

It is difficult for a newcomer to a research community to extract from the literature the best version of a new technique. This occurs for many reasons: there may be a flurry of papers on the subject and it is difficult to tell which is the most useful; a good paper may appear in an obscure journal but nevertheless have been distributed amongst the select few; the forward chain of subsequent articles will be difficult to track; some techniques may have appeared in other disciplines, etc. The Handbook will address this problem both through the Annotated Bibliography with its threaded links and annotations pointing the reader to subsequent developments and their use, and through the Encyclopaedia with its (hopefully) more rapid publication of reviews and summaries.

A moderated forum for discussion about papers:

A popular and successful variation on the standard research article is the article with invited comments and author responses. This is found for instance in Computational Intelligence and Journal of the Royal Statistical Society. The Annotated Bibliography will serve this purpose.

Addressing frequently asked questions:

The Encyclopaedia explicitly addresses multidisciplinary questions and frequently asked questions as might be found on the Internet news groups. The interactive nature of the medium makes it possible for new readers to provide feedback on the kinds of questions that might occur to them. This is important to make the material accessible to potential users of the technology from other fields.

Linking tutorials with research:

The Handbook can draw on summer schools and tutorials to develop material, and so make it available to a wider audience. A tutorial represents a particular collection of material, together with a particular navigation through it. Hopefully, material covered in the tutorial will be present in the Encyclopaedia so that tutorials can serve as a guided entry into the Handbook proper.

Integration of research within and across fields:

Linked electronic documents such as the Handbook can explicitly show the rich interconnecting web of relationships that typically exists between different methods, techniques, and research communities. The Handbook can record the integration of new ideas into some standard notation or from the language of another community (for instance, translating some of the key ideas of neural networks into Bayesian networks and the language of probability). Connections between communities (such as the observation that Pearl's stochastic simulation is a special case of Gibbs sampling [Hry90]) often go unreported in the formal research journals, but in the Handbook they could get recorded in some cases merely by the addition of a single sentence and a hyper-text link.

Community feedback on drafts:

The Connectionist mailing list and the associated Neuroprose archive in the neural networks community have had the effect that authors can receive detailed feedback on their technical reports and drafts from many different members of the community, and thereby discover new research colleagues, etc. While this would not replace conference or journal reviewing, it does improve the quality of the final paper considerably. The 11th Conference on Uncertainty in Artificial Intelligence, for instance, could encourage its submitting authors to co-enter their submissions to the Handbook for concurrent community review through the Annotated Bibliography.

Tapping a wider pool of authors:

The Handbook can be built up in small pieces, rather like skins of an onion. When an author adds a new entry, there is no requirement to explain the basic notation or introduce auxiliary techniques because these will already be contained. An author could discuss a single issue, for instance, "the Ravens paradox", that would normally be a few pages in a book. The long 40 page journal articles that take many months' preparation would not be the standard entry in the Handbook. Therefore, the Handbook can be readily extended in small chunks requiring a smaller part of the author's time. This means the experience of many authors can be tapped, authors who would not usually write a book or large survey article because they don't have the months of free time.

The key advantage electronic documents such as the Handbook offer over existing hardcopy encyclopaedias and graduate textbooks is that the turn-around time is far quicker and the possibility exists that the resultant material will be better integrated and more refined due to the community development process. Thus, it is hoped that the Handbook would provide a medium that could better facilitate researchers making their knowledge available to the community as a whole, and so provide better integration and access to research and material.

4 Tutorials

Tutorials are a recognized part of any major conference program. We propose including tutorials in the Handbook by maintaining the author's original format extended with appropriate hyper-text links into related sections of the Encyclopaedia and the Annotated Bibliography. Hopefully, each major section of a tutorial would have relevant expanded entries in the Encyclopaedia. The author retains copyright of their tutorial, and gives the Handbook permission for electronic publication and the potential for subsequent hardcopy publication. The author would be required to assist the Handbook in generating a suitable HTML² version of the tutorial, for instance by providing LaTeX source or some other form readily converted to HTML. Tutorials could not be accessed in Postscript or some other display language since this would prevent integrating the tutorial in the hyper-text structure of the Handbook.

The tutorial section consists of a master index to the individual tutorials, together with a forms facility for reader feedback. Individual tutorials could include multi-media in the form of images and speech, where these have been recorded at the original presentation of the tutorial. As the World-Wide Web develops, we expect tutorials to incorporate more types of multi-media. Extended speech and movies are not practical at present.

² http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer

5 The Encyclopaedia

We envisage that each entry in the Encyclopaedia would be a small self-contained description of a concept, with pointers to prerequisite, related, and more advanced concepts included in the Handbook together with relevant entries in the Bibliography. The entry may be a few pages or more, and may contain links to a simple simulation with plots, perhaps with user-adjustable parameters, that demonstrates some of the aspects of the concept. For instance, an introduction to importance sampling may allow the user to select a one dimensional distribution and importance sampler and subsequently graph the quantities concerned such as convergence and variance. Standard systems such as Mathematica and versions of Bayesian network software such as IDEAL [SB90] could be necessary to implement these interactive sections.

Another useful integration in the Encyclopaedia would be a C++ library making simple versions of algorithms and methods accessible to those wishing to experiment. These would be linked in to the relevant sections describing the methods. Existing public domain collections might be used for this purpose. Presumably, the distinction between commercial, research and academic uses of code will have to be made to allow inclusion of some standard libraries.

Each small Encyclopaedia entry would be grouped with related sections, to form a larger chapter, and the whole body organized in a hierarchical fashion suitable for printing linearly. As with programming in a large team project, it should help to have each unit as small and self contained as possible.

To give an idea of the breadth and scope of a typical entry to the Encyclopaedia, some examples are given below. These include introductory material, links to other disciplines, standard paradoxes and misconceptions, some frequently asked questions, etc. The uncertainty community has sufficient agreement on fundamentals, at least by comparison with other fields, for the development of this to be realistic. The example entry topics below are grouped as fundamentals, paradoxes, independence models, etc., and separated by semicolons.

Fundamentals: the univariate Gaussian distribution, with links to derived distributions such as Student's t and chi-squared, and some C++ code; probability density functions; marginalizing; maximizing expected utility; etc.

Paradoxes and their resolution: the Ravens paradox; modern statistical paradoxes; the nonlinear utility of money; the St. Petersburg paradox; influential quotes by famous scientists misrepresenting probability.

Important concepts: the value of information; the value of computation; the value of clairvoyance; the average case analysis of CSP algorithms; probabilistic models of game playing and search.

Independence models: Bayesian networks; Markov networks; similarity networks [Hec90]; chain graphs [Fry91]; and plates [Bun94].

Operations on independence models: the definition of a chordal graph, its basic properties, and pointers to where chordal graphs are used; join trees; converting a Bayesian network to a chordal graph; and converting a Markov network.

Inference methods: Gibbs sampling (with a simple simulation, and links to approximate versions such as the EM algorithm); simulated annealing; Markov processes; Gibbs sampling on Bayesian networks as a special case; cross validation and empirical Bayes; Kalman filters with Gaussian influence diagrams; dynamic programming; hypothesis testing; sufficient statistics and the exponential family.

Human-computer interaction: assessing probabilities; assessing utilities; building fuzzy control systems.

Historical perspectives and justifications: origins of probability in Physics (Bayesians, Jeffreys, etc.) versus Biology (Frequentists, Fisher, etc.); a short history of subjective probability; Dutch books; axiom schemes for entropy.

Cross disciplinary connections: probabilistic interpretations of fuzzy logic; representing feedforward neural networks, Boltzmann machines and hierarchical mixtures of experts with probability networks; methods for plausible, qualitative and default reasoning.

The following mechanisms are envisaged to maintain and integrate the Encyclopaedia.

• Each new section is published in the development release of the Encyclopaedia and any reader classified as a reviewer is able to submit comments in an open review process. Comment types can be requests for hyper-text links, general comments to the author, or requests that a footnote be inserted at the specified point for community discussion. The new section must be placed in the linear hierarchy illustrated previously to facilitate later hardcopy publication.
• Standards are encouraged for uniformity of notation, and for the careful integration of a new section into the Encyclopaedia. A notation guide will be developed, but a glossary will not be required since standard terms will each have their own Encyclopaedia entry. Part of the labor in developing the Encyclopaedia would be the preparation of this foundational material.
• Contributions are solicited from the community by the editors. A single section may change several times as additions and corrections are made, as indicated by knowledgeable readers through the open review process. Errors discovered in the development version of the Encyclopaedia are marked in situ, controlled by comment forms.
• Once a new entry is deemed stable by the area editor, it is transferred to the public release of the Handbook. Authorship of each section is recorded with the entry, and copyright is assigned to the Handbook. Electronic copyright is retained by the Handbook, although hardcopy copyright may be assigned to a publisher at a later date.
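
The interactive importance sampling entry suggested earlier in this section could be prototyped along the following lines. This is only a minimal sketch in Python, not the Mathematica or C++ tooling the proposal has in mind. The target N(0, 1), the proposal N(0, 2^2), the integrand x^2, and every function name are illustrative assumptions rather than part of the Handbook.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling(f, target_pdf, proposal_pdf, proposal_draw, n, seed=0):
    """Estimate E_target[f(X)] using n draws from the proposal.

    Returns (estimate, standard_error, running_estimates), where
    running_estimates[i] is the estimate after i + 1 draws.
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    running = []
    for i in range(1, n + 1):
        x = proposal_draw(rng)
        weight = target_pdf(x) / proposal_pdf(x)  # importance weight
        term = weight * f(x)
        total += term
        total_sq += term * term
        running.append(total / i)
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance / n), running


# Assumed example: E[X^2] under N(0, 1), sampled from the wider N(0, 2^2).
estimate, std_error, running = importance_sampling(
    f=lambda x: x * x,
    target_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    proposal_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    proposal_draw=lambda rng: rng.gauss(0.0, 2.0),
    n=20000,
)
```

The `running` list is the quantity an interactive entry would plot to show convergence, and `std_error` summarises the variance of the estimator. A reader could swap in a different proposal to see how the variance changes.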

6 The Annotated Bibliography

The Bibliography will in some ways be a standard online bibliography, with keyword and abstract search, etc. The default access to the Bibliography will be as collected references at the end of sections in the Encyclopaedia. Access through the Annotated Bibliography will be through a Mosaic forms interface for keyword search, and as an alphabetical listing. Entries found by the search can be formatted for HTML or text, or returned as entries for BibTeX.

Bibliographic entries will take two forms, the short form giving the usual citation, and an extended form which includes: (1) keywords, (2) abstract, (3) annotations and the moderator's or author's summary, and (4) statistics on access and reader opinions. This extended form corresponds to a "home page" for the reference (term due to Mark Torrance, MIT).

The main innovation of the Bibliography, however, will be the annotations present in the extended bibliographic form. These annotations can be submitted by readers, collected and integrated by moderators, and included with the expanded bibliographic listings. Typical material in the annotation might include recommended prerequisites, relationship with other work, important citations included in the work, controversial material or errors if any, and subsequent work. Annotations are intended to be hypertext links and comments particular to that paper. More general summaries should of course be kept for the Encyclopaedia.

The annotation facility is intended to assist the community in establishing and recording a publication's contribution after the publication has been made. Through time, a paper that might first introduce an important algorithm will be superseded by refined versions, or various modifications to the algorithm may be proposed. In this case, the annotations will provide forward links so the reader can rapidly follow subsequent developments. In addition, annotations may include minor errors, shifts in opinion, or subsequent connections discovered with related fields. Compare, for instance, with articles in the Journal of the Royal Statistical Society, which come with 300 word invited comments from some 20 members of the community, and a final response by the author.
In addition, the annotation facility is intended to provide a community reviewing mechanism whereby papers appearing in an early publication form such as a technical report or research note can be subject to wider community review. These can be included in the Bibliography and linked in to the relevant Encyclopaedia entries and related papers in the Bibliography. Our intent is that authors can receive feedback from their peers, a potentially much larger community than the usual three reviewers. We will ask various conferences and journal committees to endorse this as "good in principle", in order to encourage entries, but stress that the Handbook is not a publication medium for these drafts since the Handbook will merely give a citation (and possibly a URL, for instance, or some other way of locating the draft).

The annotation facility will work as follows. Entries are submitted as BibTeX entries or through a corresponding forms interface. We hope that authors will submit their own papers so they can install entries in the extended form of the Annotated Bibliography: abstract, keywords, etc. Important background material from other fields, for instance, important graduate texts such as [GMW81], will be entered due to their frequent citation. Journals and conferences can be entered via a standard list of abbreviations using the crossref facility of BibTeX. The various annotation fields specific to the Annotated Bibliography can be entered at that time, and will presumably be updated continuously as readers use the Bibliography. The URL for the extended form, the "home page", would be assigned a unique BibTeX cite key, used for all subsequent references to the publication by the Handbook. This cite key used in accessing the Bibliography (for instance, with the TeX \cite command) could also be generated automatically by the system according to a standard format.

Readers who are accredited reviewers can submit comments to the Annotated Bibliography. Comments can be in one of the following forms: (1) comments directly to the author, for instance, presentation problems and minor errors, (2) anonymous comments to the moderator for inclusion in the moderator's summary, and (3) signed comments of not more than 300 words included as is. A second Mosaic forms interface will exist for this purpose, made available whenever the bibliographic entry is accessed. Authors who register with the Handbook can choose to have all comments regarding their paper submitted directly to them.

Each paper in the Bibliography involved in the annotations facility will be assigned a primary moderator whose task it is to integrate the received comments into a cohesive whole. In general, we hope that this role will be played by the authors themselves. To help prevent moderation from becoming a burden, individual comments will be restricted to 300 words, and the annotations section should cover points of significant impact, and be presorted into categories such as prerequisites, related work, etc., as well as comments that are attached to a particular section in the paper. Grammatical or presentation problems should be excluded from the annotation.
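
As a rough illustration of the extended bibliographic form and the three comment types described in this section, the following Python sketch models an entry and routes reviewer comments. The class names, field names, and routing labels are hypothetical. Only the 300 word limit, the three comment forms, and the four parts of the extended form are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List

MAX_COMMENT_WORDS = 300  # limit stated in the text

# The three comment forms listed in the text (labels are assumptions).
COMMENT_KINDS = ("to_author", "anonymous_to_moderator", "signed")


@dataclass
class Comment:
    kind: str
    text: str
    reviewer: str = ""  # required only for signed comments

    def __post_init__(self):
        if self.kind not in COMMENT_KINDS:
            raise ValueError(f"unknown comment kind: {self.kind}")
        if len(self.text.split()) > MAX_COMMENT_WORDS:
            raise ValueError("comment exceeds 300 words")
        if self.kind == "signed" and not self.reviewer:
            raise ValueError("signed comments need a reviewer name")


@dataclass
class ExtendedEntry:
    """Hypothetical 'home page' record for one reference."""
    cite_key: str
    citation: str                                        # the short form
    keywords: List[str] = field(default_factory=list)    # part (1)
    abstract: str = ""                                   # part (2)
    moderator_summary: str = ""                          # part (3)
    signed_comments: List[Comment] = field(default_factory=list)
    moderator_queue: List[Comment] = field(default_factory=list)
    access_count: int = 0                                # part (4)

    def submit(self, comment: Comment) -> str:
        """Route a comment and report where it went."""
        if comment.kind == "signed":
            self.signed_comments.append(comment)  # included as is
            return "published"
        if comment.kind == "anonymous_to_moderator":
            self.moderator_queue.append(comment)  # folded into the summary later
            return "queued_for_moderator"
        return "forwarded_to_author"  # not stored with the entry
```

A forms interface would construct a `Comment` from the submitted fields and call `submit`, rejecting over-length comments before they ever reach the moderator.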

7 Further issues

Mosaic and its various tools, in their current form, provide sufficient functionality to implement most of the Handbook interactions proposed: forms, user authentication, control of remote programs, LaTeX to HTML and RTF to HTML converters, etc. Implementation also requires that the various issues of copyright, publishers, and an HTTP site (the electronic home for the Handbook) be settled, and that editors and reviewers be solicited from the several disciplines applying probability in computing.

Acknowledgments

These ideas have been developed in cooperation with Bruce D'Ambrosio, Max Henrion, Barney Pell, and Michael Frank.

References

[Bun94] W.L. Buntine. Learning with graphical models. In preparation, 1994.
[Fry91] M. Frydenberg. The chain graph Markov property. Scandinavian Journal of Statistics, 1991.
[GMW81] Philip E. Gill, Walter Murray, and Margaret H. Wright. Practical Optimization. Academic Press, San Diego, 1981.
[Hec90] D. Heckerman. Probabilistic similarity networks. Networks, 20:607-636, 1990.
[Hry90] T. Hrycej. Gibbs sampling in Bayesian networks. Artificial Intelligence, 46:351-363, 1990.
[SB90] S. Srinivas and J. Breese. IDEAL: A software package for analysis of influence diagrams. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, 1990.

A Editorial issues

The editor's role is to:

• suggest new subjects for entries and solicit entries to the Encyclopaedia,
• try to maintain standards of notation,
• assign moderators for each bibliographic entry requiring annotation where the author has not volunteered to moderate,
• monitor the reviews and comments submitted to an author to ensure they are adequately addressed, and subsequently decide when an Encyclopaedia entry is stable and transfer it to the public release.

As many editors as possible would be best, to distribute the workload. That is, I do not see it as unreasonable that a large proportion of the authors are editors, and I may be an editor for some of Peter Cheeseman's entries, whereas Peter Cheeseman may be an editor for some of mine. Or is this too incestuous? Editors could come from various communities: uncertainty in artificial intelligence, probability types in machine learning and neural networks, statistical physics, computational methods in statistics, and of course the appropriate applied communities such as vision and speech, where we would have to chase up names.

B Reviewer issues

Reviewers are basically readers qualified to read and comment on the development release. Reviewers should be any active and reasonably competent member of the research community. Any author or editor should be a reviewer. All people who have published in a relevant conference or workshop, and any reasonable graduate student, should automatically be made a reviewer. The main hope is that their feedback will be less than 30% noise.

C Author issues

An author is someone well-qualified to write an entry on a particular section. They should not necessarily be the most revered world expert on the topic, but hopefully be someone current in the research literature and able to give a balanced opinion. An author or set of authors would have their authorship recorded on each entry they write.

D Copyright and publication issues

Tutorial copyright might be best retained by the authors. For instance, Peter Cheeseman has some great introductory material on probability theory that would be suitable. We should solicit tutorials from all active Summer School and AAAI tutorial presenters.

Copyright for the Encyclopaedia will be retained by the Handbook (and whatever company represents it). Perhaps hardcopy copyright could be given to a publisher at a later date. Authorship of each Encyclopaedia entry would be maintained, and joint authors encouraged.

Copyright for the Bibliographic annotations is a difficult issue to settle. I'm not sure the annotations would have any hardcopy use, except maybe as a moderated literature guide. Perhaps the copyright should be retained by the moderators, with the Handbook having both electronic and hardcopy rights to publish free of charge.

Steve Minton's JAIR works as follows: the AI Access Foundation retains electronic copyright and hardcopy copyright goes to Morgan Kaufmann. I believe a similar setup could be arranged for the Handbook. Once a quality Handbook is established, finding publishers should not be a problem. I expect hardcopy publication will be important to have the Handbook distributed in libraries. When the Internet sets up billing, access to the Handbook should be sold; however, that would be a matter for discussion. Any profits made can presumably go directly to maintenance, computer support, funding for graduate students, and funding for generating and interfacing software libraries to individual Encyclopaedia entries.

E Bibliographic standards

To make the bibliography work smoothly in a distributed fashion, the following needs to be done:

• Create a standard for the cite key (opening field of a BibTeX entry). These can be generated automatically from a BibTeX entry with a modified style file. For instance: first 7 letters of the author's name, followed by an abbreviation of the journal/conference name, then the year.
• Create a format for a bibliographic home page for each reference including: author's contact details, abstract, keywords, URL links to source in whatever formats are available, URL links to moderated commentaries, and moderator's email (if different from author). The home page would be listed in bibliographies for any document so that readers wishing to find out more about the reference can refer to the home page by following the hypertext link.
• Create standard entries in the bibliography for all major conferences and journals so cross referencing can be done uniformly.
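
The cite key rule suggested in the first item above can be sketched in a few lines of Python. The function name, the stripping of punctuation, the absence of separators, and the venue abbreviation "AIJ" are all assumptions made for illustration; the text fixes only the order of the three parts and the 7 letter limit.

```python
import re


def make_cite_key(author_surname, venue_abbreviation, year):
    """Build a cite key: first 7 letters of the author's name,
    then the venue abbreviation, then the year."""
    # Drop apostrophes, spaces and hyphens before truncating to 7 letters.
    letters = re.sub(r"[^A-Za-z]", "", author_surname)[:7]
    venue = re.sub(r"[^A-Za-z0-9]", "", venue_abbreviation)
    return f"{letters}{venue}{year}"


# "AIJ" is a hypothetical abbreviation for the journal Artificial Intelligence.
example_key = make_cite_key("Hrycej", "AIJ", 1990)
```

In practice the abbreviation would come from the standard list of conference and journal entries, so that two contributors entering the same paper arrive at the same key.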

F Tutorial issues

Tutorials need to be made available in HTML so that links to the appropriate subjects in the Encyclopaedia and Bibliography can be made. Authors would have to give the Handbook permission to modify their electronic source so that these links can be included, and continually updated. Hopefully, authors would participate in this process. Likewise, some terminology may need to be standardized at least somewhat (a vain hope, impossible, get real!). For instance, where an author uses a particular form, they would hopefully refer to other standard forms in the initial introduction. I envisage each initial mention of Bayesian networks looking something like this:

... Bayesian networks (also known as causal networks, directed Markov fields, probability networks, causal probabilistic networks, probabilistic causal networks, Bayesian belief networks, belief networks, similar to influence diagrams, similar to ...)

With the LaTeX to HTML translator, LaTeX is a suitable source for tutorials. Images, figures and mathematical symbols can be readily included. The LaTeX would have to be as vanilla as possible. A suitable style is a short page of width 6in. and height 4.5in. in 11pt. font. This looks good when printed or photocopied at 1.3 magnification. A style file "tute.sty" is available from Wray. A similar setup should be possible with FrameMaker using the Frame to HTML translators. It is not clear how to handle PowerPoint at the moment, although Word does have a translator to HTML so PowerPoint text can be translated via RTF.