The TVGUIDE Application Project Report
Pontus Johansson
[email protected]
2002-05-25
Java Development for HLT, GSLT 2002
1 Introduction
The purpose of this course project has been to build a knowledge source management system for the domain of movies, with a natural language interface. Specifically, users can post natural language queries in English about all movies available at the Internet Movie Database (Imdb) web site [1]. TVGUIDE is a question/answering (Q/A) system rather than a dialog system. This simplifies the system but, not surprisingly, also makes it less powerful, since the user (and the system) cannot rely on elliptical expressions or access information incrementally. The remainder of this section describes the open source and programming/design techniques used in the project. Section 2 describes the practical work of incrementally constructing the TVGUIDE system. In section 3, both the system and the project as a whole are discussed, and the paper concludes with an outline of the next obvious steps for the TVGUIDE application.
1.1 Open Source and Programming
The TVGUIDE project has from the start been an open source project available at Sourceforge [2], under the umbrella of the NlpFarm Sourceforge project. Even though there was some trouble getting CVS to work properly in the beginning, the source code has been (and still is) available throughout the iterations. There is an accompanying web page [3] with information and documentation. The programming technique advocated in this course is Extreme Programming (XP). More information about XP can be found on the resources and links page of this project’s web site. Section 3.2 discusses how XP worked in practice in this project.
1.2 Design Patterns
TVGUIDE is implemented with the Java software development kit (SDK), version 1.3. Effort has been put into finding and employing various design patterns. The key pattern used in TVGUIDE is the Phased Process (PP) pattern, which is explained in this section. Other patterns, such as Singleton and Factories, have also been used for several components in TVGUIDE and are explained elsewhere (Gamma et al., 1994). Naturally, “internal” design patterns from the Java SDK (e.g. Adapter and Model-View-Controller) influence the design of all Java systems, especially those using javax.swing. These patterns are not described in this report.
In situations where a set of operations must proceed in an orderly fashion, the most obvious approach might be to code each operation so that it calls the next operation, similar to a one-way linked list. While this approach works, its main drawback is that it requires each operation to know about the next, which makes future extensions and modifications hard to add. One could also use the State pattern, which is tailored to objects that must alter their own behavior when their internal state changes (Gamma et al., 1994). However, when one object controls other, separate objects that are to be kept loosely coupled (while still being able to transfer control to one another), a different solution is proposed: the Phased Process pattern. This pattern is especially useful because it (Benton, 2002:1):
- Decouples sequential operations
- Unifies the API for phased processes
- Provides a flexible dispatch mechanism

[1] http://us.imdb.com/
[2] http://sourceforge.net/
[3] http://nlpfarm.sourceforge.net/tvguide/index.html
In the TVGUIDE application the PP pattern is used as shown in Figure 1. Note that only the classes associated with the PP pattern flow are shown in Figure 1. The “rest of the system” is encapsulated in the blue box, which includes all utility classes that handle data access etc. One central class is the Palette, a data structure that each of the three phases writes to and reads from. The mechanism for progressing from one phase to the next is extracted from the phases themselves and put in a general domain-independent package. A uniform next() call in each of the phase classes is handled by this mechanism. The Parse phase uses a Parser Controller (which in fact works as a client to the JavaChart chart parser) to parse the user utterance. In this case the user utterance is written to the TvPalette by a text field in the GUI (not visible in Figure 1). The parser output is written to the TvPalette and used by the Knowledge Controller, which is called from the Domain knowledge management (Dkm) phase and handles the information in that phase. This is the phase where access to Imdb is handled. The outcome, an Imdb result set, is written to the TvPalette. It is used in the Generation phase, where a Response Controller formats the system answer, which is displayed to the user in the GUI.
[Figure 1 diagram: the Palette, shared by three phases and their controllers: Parse Phase (Parser Controller), Dkm Phase (Knowledge Controller), Generation Phase (Response Controller)]
Figure 1. The three phases of the TvGuide Phased Process Pattern.
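The flow described above can be illustrated with a minimal Java sketch. It is not the TVGUIDE or tractor source: all class names (SketchPalette, SketchPhase, SketchPhasedProcess and the three phase classes), the palette keys and the placeholder strings are assumptions made for illustration, and the phase bodies only stand in for the real parser, Imdb access and response formatting. What it shows is the idea that each phase calls a uniform next() and that only the dispatch mechanism knows the phase order.

```java
import java.util.HashMap;
import java.util.Map;

/** Shared data structure that every phase reads from and writes to. */
class SketchPalette {
    private final Map<String, Object> slots = new HashMap<String, Object>();

    void put(String key, Object value) { slots.put(key, value); }

    Object get(String key) { return slots.get(key); }
}

/** A phase does its own work and then hands control back via process.next(). */
interface SketchPhase {
    void execute(SketchPalette palette, SketchPhasedProcess process);
}

/** Domain-independent dispatch mechanism: only this class knows the phase order. */
class SketchPhasedProcess {
    private final SketchPalette palette;
    private final SketchPhase[] phases;
    private int position = 0;

    SketchPhasedProcess(SketchPalette palette, SketchPhase... phases) {
        this.palette = palette;
        this.phases = phases;
    }

    /** Starts the next phase, if any; phases call this when they are done. */
    void next() {
        if (position < phases.length) {
            phases[position++].execute(palette, this);
        }
    }
}

class SketchParsePhase implements SketchPhase {
    public void execute(SketchPalette palette, SketchPhasedProcess process) {
        // Placeholder for the Parser Controller / chart parser call.
        palette.put("parse", "PARSED(" + palette.get("utterance") + ")");
        process.next();
    }
}

class SketchDkmPhase implements SketchPhase {
    public void execute(SketchPalette palette, SketchPhasedProcess process) {
        // Placeholder for the Knowledge Controller and the Imdb access.
        palette.put("result", "RESULT_FOR(" + palette.get("parse") + ")");
        process.next();
    }
}

class SketchGenerationPhase implements SketchPhase {
    public void execute(SketchPalette palette, SketchPhasedProcess process) {
        // Placeholder for the Response Controller formatting the answer.
        palette.put("answer", "Answer: " + palette.get("result"));
        process.next();
    }
}

class PhasedProcessDemo {
    /** Runs the three phases in order and returns what the GUI would display. */
    static String answer(String utterance) {
        SketchPalette palette = new SketchPalette();
        palette.put("utterance", utterance);
        new SketchPhasedProcess(palette, new SketchParsePhase(),
                new SketchDkmPhase(), new SketchGenerationPhase()).next();
        return (String) palette.get("answer");
    }

    public static void main(String[] args) {
        System.out.println(answer("who directed \"Star Wars\""));
    }
}
```

Because no phase names its successor, reordering or replacing a phase only touches the line that constructs the process.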
2 The TVGUIDE Iterations
This section describes the five iterations that TVGUIDE has gone through to reach its current state.
2.1 Iteration 1:
The goals for iteration 1 were to get a running system that gave an idea of what the “machinery” of TVGUIDE should look like, and to present a GUI model of the interface. For the “machinery”, the PP pattern plays a key role. At the end of iteration 1, the system had the required functionality to fetch a user utterance from a text field, send the user’s string to Imdb’s native search engine, retrieve the resulting html page and display it in the GUI. The GUI model is based on three “entities”: the user, the system, and the domain or result. Previous experience with dialog systems has shown that the separation between system and domain is often necessary to clearly give the user the impression of an on-going dialog. Since the eventual goal is for TVGUIDE to grow into a dialog system, this design decision was made already in iteration 1.
2.2 Iteration 2:
The second iteration mainly involved rearranging the package structure and incorporating classes and methods from the tractor package, which is also available from the NlpFarm project. As a result of the new structure, the phased process pattern was generalized (to an abstract form) in the domain-independent tractor package and implemented in the TVGUIDE package. Instead of simply displaying the Imdb search results, work on the information extraction (IE) algorithms commenced. In iteration 2, extraction worked for the properties genre, director and writer. The IE mechanisms used regular expressions, utilizing a 3rd party regexp package. A third issue for iteration 2 was testing. A separate class for testing the IE utility was written. It is a simple GUI that allows the tester to select which properties should be returned for a specific Imdb query. The reason for using a test application such as this is that the parsing/feature structure problem can be completely ignored, and the results for particular properties can be viewed more easily without having to invoke the phased process for each single query.
2.3 Iteration 3:
Taking the parallel work on the quaks.access package into consideration, the information source access patterns were rewritten in iteration 3 to fit the structure of Quaks. Like Tractor, Quaks is domain-independent and simply utilized by TVGUIDE. Specifically, the access package provides interfaces for general information source access, and these were implemented in the TVGUIDE code. The work of building the Imdb knowledge source framework in turn provided the quaks.access package with added functionality for Internet connections, with classes for handling URL connections and connection pools.
2.4 Iteration 4:
With the quaks.access package integrated, the fourth iteration mainly consisted of 1) bridging the gap between the chart parser (JavaChart) and the rest of TVGUIDE, and 2) more work on extracting information from the returned Imdb pages. The parsing connection is acceptable, but little more than that, mainly because the available version of JavaChart did not compile as-is. External resources include a language-specific morphological lexicon and a domain- and language-specific grammar and lexicon, which are compiled by methods in the JavaChart package. The grammar is basically “key word spotting” in this version, with only a few grammatical constructions (handling utterances such as “what is [TITLE] about”). At this stage, the system allowed users to ask for certain key words or properties of both movies (actors, directors, poster/image, genres, production year, and a few others) and persons (birth date, birth place, and picture).
One problem is separating user queries from the search items that go directly to the Imdb search engine. For testing, the user text field was divided into two separate fields: the user types the NL query (e.g. "show actor and image for") in the first field, and the search term (e.g. "star wars episode II") in the second. This is obviously not acceptable in a “release” of the system, but it serves its purpose in this iteration. The future aim is to have the parser phase handle the separation of information retrieval (IR), i.e. searching for title or person pages at Imdb, from information extraction (IE), i.e. extracting the relevant properties from the retrieved Imdb pages.
The IE algorithms still operate by pattern matching with regular expressions: the methods all parse the html code of the returned page and use regular expressions to search for both html tags and phrases. A new regexp package (Jakarta) has, however, been employed due to licensing issues.
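As an illustration of this kind of regular-expression extraction, the following sketch pulls a director and a list of genres out of an html string. Several things here are assumptions: the markup patterns are invented for the example and do not reproduce the actual Imdb page layout, the class and method names are hypothetical, and the sketch uses the standard java.util.regex package rather than the third-party regexp packages the project actually employed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractionSketch {
    // Invented markup: a bold label followed by a link holding the name.
    private static final Pattern DIRECTOR = Pattern.compile(
            "<b>Directed by</b>\\s*<a[^>]*>([^<]+)</a>", Pattern.CASE_INSENSITIVE);

    // Invented markup: every genre is a link whose target starts with /genre/.
    private static final Pattern GENRE_LINK = Pattern.compile(
            "<a href=\"/genre/[^\"]*\">([^<]+)</a>", Pattern.CASE_INSENSITIVE);

    /** Returns the first director name found, or null if the pattern is absent. */
    static String getDirector(String html) {
        Matcher m = DIRECTOR.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Returns all genre names in page order; empty if none are found. */
    static List<String> getGenres(String html) {
        List<String> genres = new ArrayList<String>();
        Matcher m = GENRE_LINK.matcher(html);
        while (m.find()) {
            genres.add(m.group(1).trim());
        }
        return genres;
    }
}
```

Patterns like these are tied to the page layout, so any change to the html served by the site would require the expressions to be revised.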
One separate method for each property users can ask for has been added (such as getActors(), getImage(), etc.). These are included in the class ImdbResultSet. The issue of multiple hits from the Imdb search engine continues to be a problem, since such exceptional results are usually handled by dialog, which is not present in TVGUIDE as of now. The (far from optimal) approach taken is to simply assume that the top hit is the hit the user wants. This is the main reason for surprising results: e.g. the top hit for the query "Julia Roberts" is for some obscure reason "Erik Roberts", and "the lord of the rings" gives the old 1978 cartoon version instead of the new trilogy, and so forth. The separation between titles and persons can be problematic, since a search for "Madonna" may yield titles containing "madonna" instead of the actress Madonna. Clearly, dialog with clarification capabilities for multiple (or empty) search hits is needed in future iterations.
2.5 Iteration 5:
The fifth (and current) iteration includes a more robust treatment of the object-property structure. First, a more robust identification (and handling) of objects in the domain knowledge-handling phase has been employed. Second, properties are classified for both kinds of objects (e.g. both person and movie can have the property actor associated with it). Third, more extraction methods for both persons and movies have been added. Together, these three improvements give the system more advanced and robust domain knowledge management. From the interaction point of view, the two text fields have been merged into one, making the interaction less “forced”. There is still a requirement that proper names (both titles and names of persons) be enclosed in quotation marks. This is still a “hack” for separating the NL queries from the Imdb search strings. The parser still cannot handle this, so the separation happens in TVGUIDE’s native parser client. Small improvements to the documentation have been made, and this report is part of the fifth iteration as well.
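The quotation-mark convention, combined with simple keyword spotting, can be sketched as follows. This is only an assumed reconstruction of the idea, not the code of TVGUIDE’s parser client: the class name, the method names and the keyword list are hypothetical, and the sketch only recognizes the exact singular keywords it lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedQuerySketch {
    // A proper name is whatever stands between a pair of double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+)\"");

    // Hypothetical keyword list; the real grammar and lexicon are richer.
    private static final List<String> PROPERTY_KEYWORDS =
            Arrays.asList("actor", "director", "image", "genre", "year");

    /** Returns the search string inside the first pair of quotes, or null. */
    static String searchTerm(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the NL part of the input, i.e. the input without quoted segments. */
    static String naturalLanguagePart(String input) {
        return QUOTED.matcher(input).replaceAll(" ").trim().replaceAll("\\s+", " ");
    }

    /** Spots property keywords in the NL part, in order of first appearance. */
    static List<String> spotProperties(String nlPart) {
        List<String> found = new ArrayList<String>();
        for (String token : nlPart.toLowerCase().split("\\W+")) {
            if (PROPERTY_KEYWORDS.contains(token) && !found.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }
}
```

For the input `show actor and image for "star wars episode II"`, the quoted part would go to the search engine and the remainder would be inspected for the properties to extract.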
3 Discussion
This section discusses the TVGUIDE system, the development process, and the future directions of the project.
3.1 System Evaluation
No formal evaluation of the system has been conducted. However, it is the author’s belief that the system in its current state is ready for a first end-user evaluation. Informal evaluations have shown that users are positively surprised by the system, and the author himself has on several occasions used the system for its intended purpose (e.g. to find out which actors appear in a certain movie and whether they have acted in any other movies). “The usefulness” from an NLP point of view is a different story. The research community is not impressed by functionality in a particular domain (such as simply extracting information from a database). Since there is no dialog history and there are no sub-dialogs, the dialog aspect is obviously missing. Furthermore, the grammar does not fully exploit the power of natural language, since in its current state it mostly relies on keyword spotting. However, as grammars and lexicons are further refined and dialog capabilities are added, this system will contribute to NLP research. In particular, this domain requires handling of so-called open worlds, in that there are no limits to the domain (new actors, movies etc. enter the domain constantly). As pointed out, this presents a problem for the parser. It will present an even greater problem when speech recognition is added, since we cannot load a grammar covering all proper names of persons and titles. Including domain and world models brings the research field of ontologies to mind. The research in this field should be utilized and, hopefully, contributed to.
3.2 The Development Process
In evaluating the XP method and the open source programming technique, this project has suffered somewhat from the fact that there has so far been a single developer, and a “community” consisting of one person. The XP practice of pair programming has thus not been evaluated or even tried out. On the other hand, even with this “simplification” of the community and the feedback given, working iteratively has proved successful. TVGUIDE is a fairly large project (considering the man-hours available), and without an iterative approach and continuous feedback the developer would probably still be sitting around doing design sketches!
3.3 Future Pointers
There are several interesting and important issues for future work. The suggested agenda is:
1. Incorporating more knowledge sources
2. Dialog management
3. User modeling
4. Speech recognition
Additional knowledge sources are of two types: information/data sources and reasoning knowledge sources. A new database of TV tableaus with channels and times would be a data source. For a TVGUIDE worthy of its name, this sort of data should obviously be available to the user. For a richer system, it would need to be integrated with the already existing Imdb data source. As soon as tableaus (with dates and times) are introduced, at least one reasoning knowledge source is desirable: a temporal reasoner. Users’ often vague notions of time need to be accommodated.
For truly usable interactions, some sort of dialog history is desirable. Users then do not need to “restart” at each query, but can incrementally learn more about a specific person or movie by having a conversation with the system. Dialog management also supports clarification and sub-dialogs, which could help user and system resolve ambiguities and exceptional query results.
Dialog management is a prerequisite for user modeling. Endowing TVGUIDE with the capability to “know its users” and recommend movies that a user wishes to see may be boosted by incrementally building a user model through dialog.
Current interaction is (for reasons of simplicity) written. Future full-fledged dialog systems should of course use speech for interaction instead, especially in the relaxed context of TV information. With speech recognition, porting the system to a mobile and/or phone platform becomes possible.
For these extensions, the PP pattern has great potential. Adding phases (such as a speech recognition phase before the parsing phase) can be done smoothly, without having to change already existing components. For example, the parser still reads the text string from the palette and need not be concerned with whether the string was typed or recognized in the first place.
This approach can also prove useful in the open source community, since developers skilled in one area (e.g. speech recognition engines and APIs) can contribute to the project without actually being concerned with other areas of the project (e.g. the bits and pieces of TV program information extraction).
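The extension argument can be made concrete with a small sketch in which a speech phase is placed in front of an unchanged parse phase. Everything in it is hypothetical: the PalettePhase interface, the palette keys "audio", "utterance" and "parse", and the placeholder strings standing in for a recognizer and a parser are assumptions for illustration, and the palette is reduced to a plain map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExtensiblePhasesSketch {
    /** A phase only sees the shared palette, never its neighbours. */
    interface PalettePhase {
        void execute(Map<String, String> palette);
    }

    /** Existing parse step: reads "utterance" no matter which component wrote it. */
    static final PalettePhase PARSE = new PalettePhase() {
        public void execute(Map<String, String> palette) {
            palette.put("parse", "PARSED(" + palette.get("utterance") + ")");
        }
    };

    /** New step: a stand-in for a recognizer that turns audio into text. */
    static final PalettePhase SPEECH = new PalettePhase() {
        public void execute(Map<String, String> palette) {
            palette.put("utterance", "recognized:" + palette.get("audio"));
        }
    };

    /** Runs the phases in list order on the given palette. */
    static Map<String, String> run(List<PalettePhase> phases, Map<String, String> palette) {
        for (PalettePhase phase : phases) {
            phase.execute(palette);
        }
        return palette;
    }

    /** Spoken input: the only change is one extra phase at the front of the list. */
    static String parseSpoken(String audio) {
        List<PalettePhase> phases = new ArrayList<PalettePhase>();
        phases.add(PARSE);
        phases.add(0, SPEECH);
        Map<String, String> palette = new HashMap<String, String>();
        palette.put("audio", audio);
        return run(phases, palette).get("parse");
    }
}
```

The PARSE phase is identical for typed and spoken input, which is the property that lets a contributor work on one phase without touching the others.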
4 References
Benton, J. (2002). Evolve your apps with the Phased Process pattern. JavaWorld. Retrieved April 5, 2002 from http://www.javaworld.com/javaworld/jw-04-2002/jw-0405-process_p.html.
Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.