Providing Answers to Questions from Automatically Collected Web Pages for Intelligent Decision Making in the Construction Sector

Milos Kovacevic¹; Jian-Yun Nie²; and Colin Davidson³

Abstract: The construction sector is notorious for the dichotomy between (a) its intensive use of information in its decision-making processes and (b) its limited access to, and insufficient use of, the pertinent information that is potentially available. Given the potential availability of valid information on the Web, we have developed a question-and-answer system that enables construction practitioners to seek information by posing questions in English or French, instead of entering a list of keywords. Based on a modular systems approach using natural language, relevant answers to questions are selected and presented in a convivial way, thus improving and speeding up the classical Web search procedure. Our system consists of two main components: an intelligent robot that traverses a Web space and decides whether a page is construction oriented or not, and a question-answer (Q-A) component (comprising three modules) that uses a domain-specific thesaurus to extract meaningful parts of a question and to detect, process, and present paragraph passages extracted from relevant Web pages. The robot is trained on positive and unlabeled examples using a machine learning approach, while the Q-A component uses natural language processing techniques. In our experiments, we show that our automatically collected database contains approximately 16% noise, while the performance of the Q-A component is 65.45% in mean reciprocal rank.

DOI: 10.1061/(ASCE)0887-3801(2008)22:1(3)

CE Database subject headings: Classification; Information retrieval; Knowledge-based systems; Internet; Construction management; Decision making.

¹Ph.D. Candidate, School of Civil Engineering, Chair of Management and the Technology of Building, Univ. of Belgrade, Bul. Kralja Aleksandra 73, 11000 Belgrade, Serbia. E-mail: [email protected]
²Professor, Dept. of Informatics and Operations Research, Univ. of Montreal, P.O. Box 6128 Main Post Office, Montreal, Quebec H3C 3J7, Canada. E-mail: [email protected]
³Architect, ACSA Distinguished Professor, School of Architecture, Univ. of Montreal, P.O. Box 6128 Main Post Office, Montreal, Quebec H3C 3J7, Canada (corresponding author). E-mail: [email protected]

Note. Discussion open until June 1, 2008. Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on August 14, 2006; approved on March 26, 2007. This paper is part of the Journal of Computing in Civil Engineering, Vol. 22, No. 1, January 1, 2008. ©ASCE, ISSN 0887-3801/2008/1-3–13/$25.00.

Introduction: The Context

A few years ago, a senior university administrator was discussing building science with a senior researcher in a public research agency. Suddenly, the researcher interrupted the (presumably) serious conversation to interject: "Architects never tell lies . . ." and, after a pause charged with conflicting emotions, continued: "because they don't know what the truth is." Beyond the irony of this story lies the very real problem of knowing (or rather: not knowing); how can the engineer (or the architect, or the city planner, . . .) possess all the information that is required to make professionally correct decisions? Indeed, it has been shown that among the causes of loss of productivity in the building design and construction process, lack of effective access to information is the single most significant factor (Mohsini and Davidson 1991). Indeed, Leslie and McKay (1995, p. 23) write:

The importance of up-to-date information cannot be in doubt but practitioners constrained by lack of time, money, and human resources and perhaps unaware of their knowledge gaps, tend to rely on old familiar material for solutions. Rarely, if at all, would a project be delayed while a search is undertaken for additional, but unknown, information.

In other words, under current conditions, better information is sought only where there is a pressing need for it. Project procedures, contractual arrangements, and even fee structures can act to discourage a practitioner from seeking new and possibly better solutions. In the past, the busy practitioner could consult his/her preferred technical librarian—though there is evidence that this was not often the case [see, for example, Bardin et al. (1993)]. Currently, the Web offers a seductive source of information. It is free and it contains a vast amount of information, but that information is not structured systematically, much of it is unreliable, and most of it is probably irrelevant. Indeed, unlike in the library, information on the Web is not accompanied by metadata. Furthermore, the typical search process yields a set of annotated links, from which a selection has to be made, and potentially interesting links have to be followed through and then read—yet in the construction sector, reading is not considered to be a productive use of time (Davidson 2004).

Previous research (MacKinder 1982; Bardin and Blachère 1992; BRANZ 1984) has shown that the search for information most often stops after one or two informal steps for, among other reasons, a fear of research terminology or "jargon" (King 1984). As a consequence, problems are solved by "muddling through" on the basis of previous experience, thus bypassing opportunities for innovation. In other words, there is a major and insufficiently recognized information problem.

The research described in this paper contributes to solving this problem by developing a "question-and-answer" system, whose aim is to provide a direct answer (in natural language) to questions posed (also in natural language) by participants in the building industry, instead of providing a list of references, as is the case with most online information retrieval systems (for details, see Robert et al. 2006; Lizarralde et al. 2005; Zhang et al. 2004). The initial premise is that activities such as reading references, selecting the seemingly relevant ones, reading the content, comparing the information from different sources, detecting and resolving contradictions, combining pieces of information, and assembling a relevant and reliable response take too long and cost too much for a busy user, even if construction is "an information-intensive industry" (Rezgui 2006). Indeed, earlier research has shown that the median acceptable wait time is half a day (Bardin and Blachère 1992). This approach presupposes the possibility of selecting, from within the universe of the Web, a set of potentially relevant sites, and then of categorizing them to help the question-answer (Q-A) system in its selection of the best elements of the response, prior to presenting them to the "practitioner client."

The paper is organized as follows: first, the problem is stated and the research goals are formulated; the research plan brings together a presentation of the components of the system (an overview of the intelligent acquisition component, the principles of the Q-A component and its constituent modules, and the design of the interface); the status of experiments using the main modules of the system is described; and finally, some brief conclusions are drawn.

Problem Statement

According to Zhang et al. (2004):

The goal of question-answering (Q-A) is to provide a direct answer to a question of a user, instead of a reference which may contain such an answer [. . .]. This problem is becoming increasingly important due to the huge number of documents available on the Web. The low accuracy of the current [information retrieval] IR systems and search engines also make some groups of professionals reluctant to use search engines in their professional activities. Professionals in the construction sector are part of them. These professionals have their own working habits, and they consider browsing on the Web as an inefficient way to find professional information. In order to meet the requirements of these professionals, two aspects of the current search engines have to be improved: (1) search engines have to be specialized in the documents they provide; professionals in the construction sector would like to have specialized search engines that only provide construction-related documents; and (2) search engines should provide accurate answers to their questions [. . .].

The hypothesis justifying our research is that better decisions are made (in building design and construction) when better information is used for making them.

In the research we present here, we tackle the problems posed by accessing general information (information that is not generated for a particular project but is most often the product of a research institution, a codes body, or a commercial firm). Accessing general information is bedeviled by three problems: (a) language: for example, "Research is heavy jargon . . . after the first page we start falling asleep" (King 1984); (b) the production of general information is not concurrent with the practitioners' specific (and usually urgent) need for it; and (c) reading is not seen as a productive use of time (as, indeed, Leslie and McKay suggest in the quotation above).

Our goal, therefore, is to provide a direct answer to the user's question, instead of providing a list of references (as is the case with most online information retrieval systems), which the user has to "open, scan, and read" to see whether they really do provide an answer, all of which takes time. The objective, therefore, is to provide easy access to reliable information on the Web. The priorities concern (1) conviviality, in order to circumvent the endemic reticence of building industry professionals to spend time looking for information; and (2) pertinence of the information provided for decision making, since only potentially relevant preselected Web resources are consulted by the question-and-answer system.

Research Plan

Using a construction-related metaphor, the research can be said to comprise an "infrastructure," a "structure," and a "superstructure." The infrastructure is designed to provide a database of relevant Web pages; the structure enables the Q-A component to be developed; and the superstructure involves the design of the users' convivial interface.

Infrastructure: The Intelligent Acquisition Component

One of the goals of our research is to establish and maintain a database of relevant Web pages, potentially interesting for practitioners in the construction sector, that will be used by the Q-A module. Since the Web space is so huge and changes so frequently (Lawrence and Giles 1998), we do not expect that this task could be accomplished by a pool of human experts. The reasons are obvious: (1) experts are not able to examine efficiently large portions of the everyday-changing Web; and (2) experts are very expensive. In this research, we propose a different approach to acquiring data from the Web, using a topic-specific automated crawler that is capable of discovering and indexing construction pages on the Web, using novel technologies from the machine learning field (Mitchell 1997).

One could argue that the Q-A database might be created using the services offered by general-purpose search engines (SEs) such as Google or MSN. There are several reasons why such a solution would be questionable. Here, we wish to emphasize the following ones: information relevance, information freshness, and neighborhood information.

Information relevance is the first criterion that a searcher uses to judge whether a piece of information is useful. General SEs index and store a huge number of documents in very different areas. They do not focus on a particular specialization area, such as construction. The answers that they provide can be in any area, provided that they contain the keywords included in a user's query or question. For example, for the question "How can I modify building structure?" the documents that SEs return can be any that contain the very common words "modify," "building," and "structure."


Fig. 1. Crawling strategies: (a) an ordinary crawler starts from a page S and then follows links, visiting the tree in breadth-first order (numbers indicate the order in which the nodes are visited); (b) gray circles denote on-topic pages, white ones are not interesting at all; pages pointed to by black arrows will be visited for further evaluation purposes. A focused crawler visits the tree in a best-first manner, following only "promising" links [the black arrows in (b)].

This is the first reason that motivates our project to develop a construction-specific search system that only returns answers relevant to the construction sector.

Information freshness is defined as the average age of all indexed pages. According to Searchenginewatch (2003), when one searches the Web, one "look[s] into the past," since the average freshness of an SE-generated database varies from 30 days (Google) to 7 months (Gigablast). Indeed, a general-purpose SE tends to collect all the pages on the Web, and hence the interval between two consecutive visits (for indexing) is not negligible. In many applications, especially those involving decision support, information should be up to date, and one should not have to tolerate broken links. With the development of parallel architectures and increased bandwidth, the information freshness problem will become smaller, but since the number of pages constantly grows, it will still be present. An SE that periodically revisits pages from only one or several communities can have a more up-to-date index. Such a topic-oriented SE would revisit its community more often, and newly added pages would be detected faster.

Neighborhood information (information about "who links to whom") is not accessible if the crawling task is not performed for a specific purpose. Google, for example, offers queries about linkage, but that service is limited in the number of requests from the same internet protocol (IP) address. The neighborhood information in the particular community can reveal valuable social information and can be used to calculate the authority of the page source.

Architecture of the Acquisition Crawler

A Web crawler is a software component that visits Web pages and stores them in a local database for further analysis; it performs this task by starting from a seed set of initial addresses and, following the links that connect neighboring pages, tries to collect as many pages as possible. Obviously, all the collected pages have to be reachable from the initial seed set. Web crawlers of this sort are an essential part of the architecture of general-purpose search engines (Brin and Page 1998). If a crawler fetches only the pages that belong to a specified topic or set of topics, it is called a topical or focused crawler (Chakrabarti et al. 1999). Focused crawlers are used in a variety of applications, such as vertical search engines (McCallum et al. 2000), competitive intelligence (Chen et al. 2002), and digital libraries (Pant et al. 2004; Qin et al. 2004). The difference between an ordinary and a focused crawler is illustrated in Fig. 1.

Focused crawlers exploit the fact that the Web is a "social network" and that pages mostly reference similar pages (Davison 2000). Therefore, when a visited page is judged to be relevant, its links are followed too; otherwise they are not. In some work, this rule is relaxed in order to traverse a certain number of irrelevant pages before reaching a relevant one (Diligenti et al. 2000). There are two main issues in the design of a successful focused crawler. The first is how to recognize whether a page is on topic or not (this is named the page classification problem). The second is the problem of reordering the links [uniform resource locators (URLs)] extracted from on-topic pages in such a way that the crawler visits the more promising pages before potentially less relevant ones [this is named frontier reordering, where a frontier is the set of URLs, extracted from the on-topic pages, that have not yet been visited; see Pant and Srinivasan (2005)].

After defining and downloading a set of starting pages, our system extracts all the links from them and enters the main crawling loop. Fig. 2 depicts how the system works. When a page is fetched, it is classified with the relevance classifier (RC) and labeled as being from the construction sector or not. If the page is relevant, it is further processed with the categories classifier (CC). The CC decides to which subcategories the fetched page belongs [our general information concept is divided into four subcategories that are potentially significant for members of the construction sector: design methods and designs (including buildings and engineering works, and design processes); science and technology (including research findings, techniques, and product information); business and management; and current events] (Davidson 2004).

Fig. 2. System architecture (simplified view). Arrows represent the flow of data structures, e.g., a page.

After labeling by the CC, the page is stored on disk and a metarecord about it is put in a metadata database. We keep information about the time of retrieval, size, page title, page URL, relevant subcategories, etc. Before storing the page, we extract its links (URLs), check whether these URLs have already been visited, and, if not, put them in a corresponding priority queue. Each link is labeled with the relevance of its parent page. If a page is more relevant (scored by the RC output with a relevance score from 0 to 1), then its links go to a higher-priority queue. Here, the assumption is that more relevant pages generate more important URLs to visit. If the page is not relevant [the RC output is less than the threshold (set at 0.5)], then we stop crawling that path. This is a reasonable heuristic, since we assume that if a page is not interesting, it is less probable that it contains links to other interesting sites.

Before starting the operative work, an expert should establish a list of examples with their URLs and, for each example, specify the subcategories to which it belongs. These examples are used for teaching our RC and CC classifiers to recognize the concepts of construction-oriented pages and their subcategories. The system has been designed to support administrative tasks, such as creating subcategories, defining crawling and learning parameters, defining additional starting points on the Web, and so on. After pages are collected, one is able to query both databases (metadata and page databases) and to review the acquisition results. The whole system is implemented in Java, using open source tools. The architecture of the system is simple enough to run on standard PC configurations.
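The crawling loop just described can be condensed into a short sketch. The deployed system is written in Java; the following Python sketch is purely illustrative, and the callables it takes (fetch_page, extract_links, rc_score, cc_labels, store) are hypothetical placeholders rather than the system's actual API. It captures the two decisions that matter: extracted URLs inherit the relevance score of their parent page as a crawl priority, and a path is abandoned when the RC score falls below the 0.5 threshold.

```python
import heapq
from urllib.parse import urljoin

RELEVANCE_THRESHOLD = 0.5  # RC scores below this stop the crawl path

def crawl(seed_urls, fetch_page, extract_links, rc_score, cc_labels, store):
    """Focused-crawl loop: best-first traversal of the frontier.

    fetch_page, extract_links, rc_score (relevance classifier, output in
    [0, 1]), cc_labels (subcategory classifier), and store are assumed to
    be supplied by the surrounding system.
    """
    frontier = []                 # max-heap emulated by negated priorities
    visited = set()
    for url in seed_urls:
        heapq.heappush(frontier, (-1.0, url))  # seeds get top priority

    while frontier:
        _neg_priority, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)

        page = fetch_page(url)
        if page is None:
            continue

        relevance = rc_score(page)
        if relevance < RELEVANCE_THRESHOLD:
            continue                          # irrelevant: do not expand links

        store(url, page, relevance, cc_labels(page))  # metadata + page DB

        # Links inherit the parent page's relevance as their queue priority.
        for link in extract_links(page):
            absolute = urljoin(url, link)
            if absolute not in visited:
                heapq.heappush(frontier, (-relevance, absolute))
```

A single heap ordered by parent relevance stands in here for the discrete priority queues described above; for the purposes of the sketch, the behavior is the same.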
In the next section, we give a brief overview of the machine-learning methods that are used to build the logic of the RC and CC classifiers.

Machine-Learning Framework for Web Page Classification

A machine-learning approach to automated document classification is not unknown in the construction field. A prototype classification system was developed to help organize the large number of construction project documents (Caldas et al. 2002). In addition, the authors compared several machine-learning classification schemes to find the method best suited to construction project corpora.

First, we define the task of classification of Web pages. Let P ⊆ W be a subset of all Web pages W, and let C = {c_1, c_2, . . . , c_n} be a set of n class labels that correspond to some predefined classes. A function Φ : P × C → {−1, 1} is called a classification if for each p_i ∈ P and c_j ∈ C it holds that Φ(p_i, c_j) = 1 if p_i belongs to the class whose label is c_j, and Φ(p_i, c_j) = −1 if not. When a page has a label c_j, we say that it belongs to class c_j, and vice versa. In practice, one only has a limited set of labeled examples of the form (p_k, c_k, y_k), k = 1, . . . , n, where p_k ∈ P, c_k ∈ C, and y_k ∈ {−1, 1}. The labeled examples form the training set for the classification problem at hand. The machine-learning approach tries to find a function Φ̂ that is a good approximation of the real, unknown function Φ, using only the examples from the training set and a specific learning method such as neural networks, decision trees, or support vector machines (Sebastiani 2002).

If a page belongs exclusively to one class, we deal with single-label classification; otherwise, it is multilabel classification. In practice, it is common for a page to belong to more than one of the predefined subcategories simultaneously. Assuming that a page belongs to just one class and that there are only two classes (C = {c_1, c_2}), we speak of a binary classification model. Usually c_1 and c_2 are called the positive and negative classes, respectively. Each multilabel classification task with n classes can be modeled as a sequence of n binary tasks. In our work, we use the 1-versus-rest approach, in which we train n binary classifiers, one for each of the n classes. The training set for classifier k consists of two classes: positive examples belong to c_k and negative examples belong to C \ c_k. The efficiency of a learning method is evaluated on a separate set of examples having the same form as the training set: the test set. Class labels in the test set are also known in advance, but the test examples are not presented to the learning method during the training phase. In the operative phase of the 1-versus-rest approach, the ensemble of n already-trained classifiers produces a binary output vector in which the ith coordinate is either 1 or −1, depending on whether the page belongs to class c_i or not.

Vector Representation of a Web Page

In order to learn the unknown classification function Φ, the learning method takes as its inputs the available training data. For that purpose, the pages should be represented in an appropriate numerical form. A common approach is to construct a vocabulary V consisting of all distinct words, i.e., features (features are not just ordinary words but any sequence of symbols separated by separator characters such as a space or punctuation), that appear in the training examples, and then to represent each page from P relative to the space of features from V. In such a representation, each page p_i is transformed into a feature vector x_i, such that the kth coordinate of the vector represents the frequency of the kth feature from V in the page p_i. For example, if V = {representation, page, vector, frequency, building}, then we represent the previous sentence as ⟨1, 2, 2, 1, 0⟩. Geometrically speaking, each page is represented as a point in the |V|-dimensional space of words (Salton 1989). The training set is now of the form (x_i, y_i), i = 1, . . . , n.

In the vocabulary construction process, each page is parsed in order to remove HTML tags and extract the body text. To reduce the dimensionality of the feature space, we remove all the words from the English and French stop lists (words that do not have discriminative meaning, such as articles, pronouns, etc.) and all the words that appear in fewer than two documents in the training set (these are considered statistically irrelevant). There are many different vocabulary reduction methods, and a good comparison of them is given in Yang and Pedersen (1997). In classification practice, feature vectors are normalized in order to minimize the effect of document length and to improve the numerical stability of some learning methods. We injected words that appear in <meta> and <title> tags with doubled frequency, since these words, when present, carry significant semantic meaning.
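To make the representation concrete, here is a minimal sketch of the page-to-vector transformation, assuming pages are already tokenized. It is in Python for illustration (the system itself is Java), and it applies the stop-list filter, the minimum document frequency of two, the doubled <meta>/<title> frequency, and the length normalization described above; the tiny stop list is a stand-in for the full English and French lists.

```python
from collections import Counter
import math

STOP_WORDS = {"the", "a", "of", "le", "la"}  # stand-in for the full EN/FR stop lists

def build_vocabulary(token_lists):
    """Features seen in at least two training documents, minus stop words."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return sorted(t for t, n in df.items() if n >= 2 and t not in STOP_WORDS)

def page_vector(body_tokens, meta_title_tokens, vocab):
    """Term-frequency vector; <meta>/<title> words get doubled frequency."""
    tf = Counter(body_tokens)
    for t in meta_title_tokens:
        tf[t] += 2                          # "injected with doubled frequency"
    vec = [tf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]          # length-normalized, as in the text

# The worked example from the text: with V = {representation, page, vector,
# frequency, building}, the example sentence yields <1, 2, 2, 1, 0> before
# normalization.
```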
Linear Two-Class Support Vector Machines

Support vector machines (SVMs) (Vapnik 1995; Joachims 1998) are state-of-the-art learning methods in pattern classification, and they deal with binary classification models. According to Caldas et al. (2002), the SVM is the best-performing method, compared to other commonly used approaches, when dealing with diverse construction project documents. Here, we briefly describe a subclass of SVMs with so-called linear kernels, or linear SVMs. Linear SVMs have been shown to classify textual data very well (Yang and Liu 1999).

Let (x_i, y_i), x_i ∈ R^n, y_i ∈ {−1, 1}, i = 1, . . . , m be the training set of m examples represented relative to a vocabulary V of n distinct features. The basic principles of the linear SVM are explained in Fig. 3. Without loss of generality, we assume that the feature vectors are points in a two-dimensional plane. White and gray squares represent pages from a training set comprising two distinct classes. The SVM learning algorithm tries to construct a separating line h (or hyperplane, in the higher-dimensional case) that best separates the examples of the two classes. One chooses h : w · x + b = 0 to be the best separating hyperplane such that the following condition is satisfied (Chen et al. 2005):

$$\max_{\mathbf{w},b}\;\min\,\{\,\|\mathbf{x}-\mathbf{x}_i\| \;:\; \mathbf{w}\cdot\mathbf{x}+b=0,\; i=1,2,\ldots,m\,\} \tag{1}$$

Fig. 3. Linear two-class SVM: construction of a separating hyperplane in a two-dimensional case

In the above expression, x · y denotes the dot product between two vectors, and ‖x‖ the vector length. The notion of the best separation can be formulated as finding the maximum margin M = 2/‖w‖ that separates the data of both classes. Since in reality the classes are not linearly separable (the circled examples in Fig. 3), the learning process also tries to minimize the sum of the distances of points on the wrong side of the separating line (hyperplane) while maximizing the margin M. The best separating hyperplane can then be found by solving the following nonlinear convex programming problem (Chen et al. 2005): find w, b such that

$$\min_{\mathbf{w},b}\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \varepsilon_i \quad \text{w.r.t.}\quad 1-\varepsilon_i - y_i(\mathbf{w}\cdot\mathbf{x}_i+b) \le 0,\quad -\varepsilon_i \le 0,\quad i=1,2,\ldots,m \tag{2}$$

Note from Eq. (1) that maximizing the margin is equivalent to minimizing ‖w‖²/2. The constant C models the penalty for misclassified points in the training set. The optimization problem [Eq. (2)] is usually solved in its dual form, and the solution is

$$\mathbf{w}^* = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i, \qquad C \ge \alpha_i \ge 0,\quad i=1,\ldots,m \tag{3}$$

Note that the solution w* for the optimal hyperplane is a linear combination of the training examples [for methods of solving such optimization problems, see Fletcher (1987)]. However, it can be shown that w* is a linear combination of only those vectors x_i for which the corresponding α_i are nonzero. Those vectors are called support vectors. Support vectors for which C > α_i > 0 holds belong either to h_1 or h_{−1} (depending on y_i). Let x_a and x_b be two support vectors (C > α_a, α_b > 0) for which y_a = 1 and y_b = −1 hold. Then b* = −w* · (x_a + x_b)/2, and the classification function finally becomes

$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i\in\{j\mid\alpha_j\neq 0\}} \alpha_i y_i\left(\mathbf{x}_i\cdot\left(\mathbf{x}-\frac{1}{2}\mathbf{x}_a-\frac{1}{2}\mathbf{x}_b\right)\right)\right) \tag{4}$$

If one removed all the training data that are not support vectors and retrained the classifier, one would obtain the same solution. A linear SVM classifier thus discovers the important documents in the training set, those that best describe the document classes. This property has a significant impact on the generalization capacity of the SVM. A detailed review of SVMs for pattern classification can be found in Chen et al. (2005).

Our CC consists of four two-class linear support vector machines trained in the 1-versus-rest manner (recall that we modeled general construction information as four categories). In our experiments, we used standard information retrieval measures, namely microprecision and microrecall, to evaluate the performance of the CC. Let TP_i be the number of pages correctly classified as positive by the ith SVM classifier (classifier SVM_i, trained on pages from class i as positives and all other pages as negatives). Further, let FP_i be the number of pages incorrectly classified as positive, and FN_i the number of pages incorrectly classified as negative, by SVM_i. Microprecision (mp) and microrecall (mr) are defined as (Sebastiani 2002)

$$\mathrm{mp} = \frac{\sum_{i=1}^{n}\mathrm{TP}_i}{\sum_{i=1}^{n}\mathrm{TP}_i + \sum_{i=1}^{n}\mathrm{FP}_i}, \qquad \mathrm{mr} = \frac{\sum_{i=1}^{n}\mathrm{TP}_i}{\sum_{i=1}^{n}\mathrm{TP}_i + \sum_{i=1}^{n}\mathrm{FN}_i} \tag{5}$$

Microprecision and microrecall are measures that are hardly possible to maximize at the same time. Therefore, for performance evaluation, a microaveraged F1 (mF1) is used to combine them in the following way (Sebastiani 2002):

$$\mathrm{mF1} = \frac{2\,\mathrm{mp}\,\mathrm{mr}}{\mathrm{mp}+\mathrm{mr}} \tag{6}$$
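For readers who wish to reproduce this setup, the sketch below trains four linear SVMs in the 1-versus-rest manner and computes the mp, mr, and mF1 of Eqs. (5) and (6). It uses Python with scikit-learn as a stand-in; the authors' classifiers are their own Java implementation, and the sketch assumes single-label data for brevity (the real task is multilabel).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, n_classes, C=1.0):
    """One linear SVM per class; class k's positives are pages labeled k."""
    return [LinearSVC(C=C).fit(X, np.where(y == k, 1, -1))
            for k in range(n_classes)]

def micro_f1(classifiers, X_test, y_test):
    """Microprecision, microrecall, and microaveraged F1, Eqs. (5)-(6)."""
    tp = fp = fn = 0
    for k, clf in enumerate(classifiers):
        pred = clf.predict(X_test)            # +1 / -1 decisions of SVM_k
        true = np.where(y_test == k, 1, -1)
        tp += np.sum((pred == 1) & (true == 1))
        fp += np.sum((pred == 1) & (true == -1))
        fn += np.sum((pred == -1) & (true == 1))
    mp = tp / (tp + fp)
    mr = tp / (tp + fn)
    return mp, mr, 2 * mp * mr / (mp + mr)   # Eq. (6)
```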

Learning without Negative Data

Our acquisition system is designed to collect pages from C ⊂ W, where C is the set of all pages that are potentially interesting for the construction sector and W is the set of all Web pages. However, a user has only a limited set of labeled examples Cl ⊂ C, where |Cl| ≪ |C| (|C| denotes the cardinality of the set C). To the best of our knowledge, all classifier-oriented approaches used in the focused crawling field are based on the binary classification model, in which a user must specify training examples from both the positive (on-topic) and negative (off-topic) classes. While it is not difficult to specify examples of the on-topic class, it is extremely difficult to model the negative class, since almost all the pages on the Web are irrelevant for a specific application. Therefore, the accuracy of the classifier is greatly influenced by the choice of negative examples. In our work, we decided to train our RC without labeled negative data.

The RC is built upon the positive examples based learning (PEBL) framework (Yu et al. 2004). PEBL exploits the fact that, apart from having a limited set of positive examples Cl, one can freely download many pages from archives like Yahoo (www.yahoo.com) or the Open Directory Project (www.dmoz.org). Suppose that we downloaded a set U ⊂ W, |U| ≪ |W|, of pages randomly sampled among different categories of an archive. Such a set is called an unlabeled dataset, since it is not known in advance whether a page from U is a positive or a negative example for our learning setting. Actually, we could download millions of pages from an archive, belonging to diverse categories, in order to form U, but we would not have the expert time available to evaluate all the collected examples.

The idea behind the PEBL approach is to detect strong negative examples from U automatically and to train an initial two-class SVM. This SVM (SVM-0) uses the strong negatives and the positives from Cl to build a separating hyperplane between them. Then SVM-0 is used to classify the pages from U \ {strong_negatives}. The newly obtained negatives are appended to the strong negatives, thus forming N1; then an SVM-1 is trained using N1 and the positives from Cl. In the kth iteration, an SVM-k is trained on Nk and Cl. The SVM-k extracts a new set of negatives from U, which is appended to Nk, thus forming Nk+1. The process finishes when it is not possible to extract more negatives from U. At the end, a final SVM is trained on all the negative examples extracted from U and the positives from Cl.

The initial set of strong negative pages used for SVM-0 is extracted using the definition of strong positive features. Here, a vocabulary is built using all the distinct features from both the labeled and unlabeled pages. Yu et al. (2004) define positive features in the following way: let dfl_i and dfu_i be the document frequencies of the ith feature from the vocabulary in the labeled set Cl and the unlabeled set U, respectively (where the document frequency is the number of documents in which the corresponding feature occurs at least once). If dfl_i / dfu_i > 1, then feature i is positive; otherwise, it is negative. A strong negative is then defined to be a page from U that does not contain any of the positive features from the vocabulary. Obviously, if a positive feature is present in a page, it raises the probability that the page might be from the positive class. In our experiments with PEBL, we used the ratio dfl_i(%) / dfu_i(%) to decide whether a feature is positive, where dfl_i(%) and dfu_i(%) are the relative document frequencies of the feature in Cl and U, respectively. Using relative document frequencies of features, we adapt better to the sizes of both Cl and U, since it usually holds that |U| ≫ |Cl|.
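Under our reading of Yu et al. (2004), the PEBL iteration can be sketched as follows. The sketch is illustrative Python: train_svm is a placeholder for any two-class SVM trainer, and documents are represented simply as collections of tokens.

```python
from collections import Counter

def positive_features(pos_docs, unl_docs):
    """Features whose relative document frequency in the labeled positive
    set Cl exceeds that in the unlabeled set U (the ratio variant above)."""
    def rel_df(docs):
        df = Counter()
        for tokens in docs:
            df.update(set(tokens))
        return {t: n / len(docs) for t, n in df.items()}
    df_l, df_u = rel_df(pos_docs), rel_df(unl_docs)
    return {t for t, f in df_l.items() if f > df_u.get(t, 0.0)}

def pebl(pos_docs, unl_docs, train_svm):
    """PEBL loop: grow the negative set until no more pages are rejected.
    train_svm(pos, neg) is a placeholder returning a classifier whose
    predict(doc) yields 1 (positive) or -1 (negative)."""
    pos_feats = positive_features(pos_docs, unl_docs)
    # Strong negatives: unlabeled pages with no positive feature at all.
    negatives = [d for d in unl_docs if not (pos_feats & set(d))]
    remaining = [d for d in unl_docs if pos_feats & set(d)]
    while remaining:
        clf = train_svm(pos_docs, negatives)
        rejected = [d for d in remaining if clf.predict(d) == -1]
        if not rejected:
            break                              # no new negatives: stop iterating
        negatives.extend(rejected)
        remaining = [d for d in remaining if clf.predict(d) == 1]
    return train_svm(pos_docs, negatives)      # final SVM on all extracted negatives
```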

Structure: The Principles of the Q-A Component and Its Constituent Modules

This research task implies developing three modules:
• Analyzing the types of questions that practitioners pose, in terms of their content and their form.
• Updating the thesauri—the bilingual French-English and English-French thesauri (IF Research Group 1978, 1979).
• Developing the Q-A module and testing it on a corpus of documents.

Analyzing the Types of Questions

Two studies clarified the kinds of questions that practitioners faced, for which the answer was not available within their own offices:
• A survey of over 70 respondents indicated that practitioners' information needs extend beyond the sole domain of technical information on design and construction, and concern aspects such as office management, legal procedures, marketing, and so on [these findings confirm those reported by Davidson (2004)].
• A collection of 60 questions and answers from online discussion forums revealed that when submitting a query in natural language to another person (even through a computer), users rarely compose a single question; rather, they usually ask several subquestions, typically including four elements: (1) the context (e.g., the reasons underlying the request, the location, and the circumstances wherein the problem takes place); (2) a formula for writing the question (where can I find, what is, where, etc.); (3) the question itself; and (4) details of the question [interestingly, most of the words used—about 75%—concern the context of the question rather than the question itself (Moulet 2004)]. For example, "I am renovating my basement (context); I would like to know (form of question) which material I should use for the floor (the question) in order to increase the thermal insulation (accompanying details)."

Updating the Thesauri

The use of a structured controlled vocabulary is essential for the Q-A system to respond successfully to many of the types of questions posed by practitioners in the construction sector (Rezgui 2006; El-Diraby et al. 2005). A megathesaurus comprising two pairs of bilingual thesauri [English with French translations and French with English translations: the Canadian thesaurus of construction science and technology and the Canadian urban thesaurus (IF Research Group 1978, 1979)] was used. This megathesaurus has a strict set of structural rules, based on treating all candidate terms through a set of algorithms or "logical propositions." Its hierarchy includes (i) Broader_Term (BT) and Narrower_Term (NT) and (ii) Whole_Term (WT) and Part_Term (PT) relationships, plus (iii) the classical Related_Term (RT), Use (US), and Use_For (UF) relationships. For the express purposes of this research (where the vocabulary was to serve for tagging the questions and the responses rather than for document indexing), the megathesaurus was enriched by the inclusion of verbs, the "jargon" of everyday practice, and acronyms.

An artificial intelligence program implemented in Common LISP was custom developed for the purpose, allowing for the verification and automatic correction of link symmetry and of typographical errors. These two features required a careful balance between strict term matching and tolerance of the slight spelling variations existing between terms legitimately representing different concepts. Indeed, too strict a term matching had a tendency to create unwanted new links whenever typographical errors were present, while too slack a matching tolerance allowed unjustified spelling variations to remain in the thesaurus. An iterative process allowed for the gradual weeding out of linking and spelling errors, while maintaining an optimal balance between these two opposing constraints. Adjustments to this balance were made upon examination of the output produced at each step. Input and output were made through standard spreadsheet file formats, making the inspection of the outcome at the various iterations of the verification process very easy.

Developing the Q-A Module

In general, a Q-A system must cope with (1) passage retrieval: the identification of a set of document passages (i.e., paragraphs) that may contain an answer to a given question; and (2) postprocessing: comparing these passages, and the sentences in them, with the question in order to verify whether the required element (e.g., a person, an address, and so on) is in the passage and how strongly it is related to the other elements of the query (e.g., other keywords).

General Q-A systems initially focused on answering factoid questions of certain forms, the answer to which is usually a "named entity (NE)": a date, a person's name, an organization, and so on. For example, the answer to the question "who is the president of the USA?" is a person's name. Named entities usually follow some rules in writing. For example, a person's name usually starts with a capital letter, possibly after a title (such as Dr., Mr., etc.). They can also be formed from known words in a closed set, e.g., months. Therefore, two common approaches to recognizing NEs are based on rules (which can be learned automatically from examples or be set up manually) or on gazetteers. A gazetteer contains a set of terms that are known to be of a certain type of NE. For example, one can gather all the names of countries, provinces, and cities, and put them into a location gazetteer. There are several gazetteers available on the Web, for example, www.world-gazetteer.com/. These rules and gazetteers are used to help recognize NEs of different types.

In a construction-specific search, the general Q-A approaches are also useful: they allow the user to find answers to questions such as "where is RAIC located?" (RAIC is the Royal Architectural Institute of Canada). In this question, we can identify that the answer type should be a location because of the "where" question type. However, the common NEs are not sufficient for answering more domain-specific questions such as "what materials are best suited for construction on permafrost?" In such a case, the answer is not an NE but an object in a specific category (e.g., materials), and to answer such domain-specific questions, the use of a thesaurus as a domain-specific knowledge base is essential, as mentioned above. Indeed, by using the thesaurus, it is possible (i) to move from words to concepts (El-Diraby et al. 2005) and (ii) to recognize domain-specific concepts (through representative terms) during text and question analyses. In the above question, it is necessary to recognize "materials" as a domain-specific concept, and then "wood," "brick," etc., as subconcepts of it. These latter can be part of a possible answer to the question. This process is possible only if we possess structured domain-specific knowledge, i.e., the concepts and the relations between concepts contained in the thesaurus.

The construction-domain-specific information retrieval system, in addition to using key terms, also uses the concept-to-concept relationships (Rezgui 2006) suggested by the thesaurus, tagged in both documents and questions. In a question such as "what material . . .," the required concept "material" will be considered as a category. Any subconcept under this category in the thesaurus can constitute a possible answer to the question. Therefore, we tag the subconcepts with the same category to make them comparable during search. Subconcepts are identified by the narrower term/broader term or the part term/whole term relations, such as "materials NT organic materials" and the reciprocal "organic materials BT materials." By combining both relations, we can recognize "wood" as a subconcept of "materials," and thus as a possible answer to a question about the latter.

There have been a large number of studies on the recognition of common NEs. GATE (Cunningham et al. 2002), and in particular its ANNIE module, is an open source system that allows us to do this. This module can process a text and tag all the common NEs recognized, such as dates, persons, locations, and so on. In this study, we use GATE as our module for NE recognition during the preprocessing of documents and questions. To recognize domain-specific concepts and terms, we make use of the domain-specific thesaurus.
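As an illustration of this thesaurus-based typing, the sketch below assigns a term its semantic type by walking broader-term (BT) links upward until a concept designated as a semantic type is reached. The toy entries and names are our own examples for illustration, not the megathesaurus content or the system's code.

```python
# Toy BT (broader-term) links; the real megathesaurus also carries WT/PT,
# RT, and US/UF relations over thousands of terms.
BROADER = {
    "wood": "organic materials",
    "brick": "inorganic materials",
    "organic materials": "materials",
    "inorganic materials": "materials",
}

# Concepts at the abstraction levels designated as semantic types.
SEMANTIC_TYPES = {"organic materials", "inorganic materials", "materials"}

def semantic_type(term):
    """Climb BT links until a term designated as a semantic type is found."""
    t = BROADER.get(term) if term not in SEMANTIC_TYPES else term
    while t is not None and t not in SEMANTIC_TYPES:
        t = BROADER.get(t)
    return t

assert semantic_type("wood") == "organic materials"   # as in the example above
```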
The concept-recognition module identifies all the concepts occurring in a document or a question. If a concept is identified as the question word (e.g., materials in the question "what materials . . .?"), then a semantic type (materials) is determined as the question type. Such a question is considered a question of that semantic type, i.e., its answer should be a domain-specific concept in that category. The semantic type assigned to a concept corresponds to a concept at a certain abstraction level, i.e., at certain high levels of the thesaurus hierarchy. For example, specific concepts such as wood or brick are not recognized as semantic types. Rather, the more abstract concept organic materials is used as their type. This choice allows us to standardize the semantic types of concepts, which facilitates comparing concepts (if they are of the same semantic type).

The work flow of the Q-A module is shown in Fig. 4. Both NE and concept recognition are performed in a preprocessing step, which aims to tag the elements in the questions and the document collection by their NE or concept types.

Fig. 4. Work flow of the Q-A module [adapted from Zhang et al. (2004)]

The next step tries to determine the set of passages (paragraphs) that may contain an answer to the question. This step uses the traditional IR approach based on the vector space model (Salton 1989). In this model, a passage P_i and a question Q are represented as vectors P_i and Q of term weights. The weights are determined by the classical tf*idf scheme, which combines term frequency (the frequency of occurrences of the term in the passage/question) and inverse document frequency [see Salton (1989) for details]. The similarity between a passage P_i and a question Q is calculated as cos(φ), where φ is the angle formed by the two vectors, i.e.,

$$\mathrm{Score}_{\mathrm{IR}}(P_i, Q) = \frac{\mathbf{P}_i \cdot \mathbf{Q}}{\|\mathbf{P}_i\|\,\|\mathbf{Q}\|} \tag{7}$$

where ‖P_i‖ and ‖Q‖ are the lengths of the passage and question vectors, respectively. In our case, we use an open source retrieval system, Lucene (2006), as the basic retrieval tool. This system implements the vector space model, together with full indexing capability (i.e., determining the keywords in a document and assigning tf*idf weights to them). What distinguishes our utilization of Lucene for passage retrieval from general passage retrieval is that we also use NE and semantic types as additional keywords. For example, for the question "What materials . . .?" the type materials is used as a keyword. This allows us to favor the passages containing concepts of type material, which can be possible answers to the question.
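The passage scoring of Eq. (7) can be reproduced in a few lines. The sketch below substitutes scikit-learn's TfidfVectorizer for Lucene's indexing (an assumption made for illustration; the deployed system uses Lucene itself), and the appended type tags are shown simply concatenated to the text. The sample passages are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented passages; "TYPE_materials" is a concept-type tag appended as an
# extra keyword, as described above.
passages = [
    "Organic materials such as wood perform well on permafrost TYPE_materials",
    "The building code requires periodic inspection of foundations",
]
question = "What materials are best suited for construction on permafrost TYPE_materials"

vectorizer = TfidfVectorizer()               # tf*idf weighting, as in Eq. (7)
P = vectorizer.fit_transform(passages)       # one row vector per passage
Q = vectorizer.transform([question])
scores_ir = cosine_similarity(P, Q).ravel()  # cos(phi) for each passage
ranked = scores_ir.argsort()[::-1]           # best-scoring passages first
```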


The final step of the Q-A module is postprocessing. This step tries to select the best passages, those most likely to contain the required answer element. To do this, we have to consider the structure of the passage with respect to the type of the question. We deal with three types of questions in this study: "named entity (NE) questions" ("who?" "what?" "where?"), "category questions" ("what materials . . ."), and "keyword" questions.

In NE search, a paragraph is retained (or reranked higher) if it contains a sentence that provides the required type of NE and is closely connected with the other NEs or keywords in the question. As in most studies on Q-A, a heuristic similarity measure Score_NE is computed according to the number of the required NEs and keywords; the distance in the passage between the NEs and the other keywords of the question (the smaller the distance, the higher the score); and the n-grams (sequences of two or three words) in the candidate passages that match those of the question. The n-gram criterion is used to favor passages containing a sequence of words similar to the question; that is, "building structure" is preferred to "structure building" for the question "How can I modify building structure?" The final similarity score of the passages is then calculated as a linear combination of the passage retrieval and NE matching scores, as follows (where P_i is a passage and Q the question):

$$\mathrm{Score}(P_i, Q) = \alpha\,\mathrm{Score}_{\mathrm{IR}}(P_i, Q) + (1-\alpha)\,\mathrm{Score}_{\mathrm{NE}}(P_i, Q) \tag{8}$$

The parameter α is set empirically (at 0.2).
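A minimal sketch of the reranking of Eq. (8); scores_ne stands in for the heuristic NE-matching scores described above, computed elsewhere.

```python
ALPHA = 0.2  # weight of the IR score for NE questions, set empirically

def final_scores(scores_ir, scores_ne, alpha=ALPHA):
    """Eq. (8): linear combination of retrieval and NE-matching scores."""
    return [alpha * s_ir + (1 - alpha) * s_ne
            for s_ir, s_ne in zip(scores_ir, scores_ne)]
```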

This approach is similar to other studies on Q-A in the literature. The same strategy is extended to questions on specific semantic categories: an answer paragraph has to contain a sentence that provides a concept in the specific semantic category, in connection with the other concepts and keywords specified in the question. On this basis, another similarity score, Score_Sem, is calculated. The final score is calculated as follows:

$$\mathrm{Score}(P_i, Q) = \beta\,\mathrm{Score}_{\mathrm{IR}}(P_i, Q) + (1-\beta)\,\mathrm{Score}_{\mathrm{Sem}}(P_i, Q) \tag{9}$$

where β is another parameter set empirically (0.3 in our case). An example of a processed category question is:

Question: What are the common thermosetting foams used in frame construction?
Keywords: thermosetting, foams, frame, construction
Compound terms: frame_construction
Category type: product_forms

The type of this question is "category," and the required category is "product_forms," which is a more general concept than thermosetting foams at a certain level of the thesaurus hierarchy. If a question does not correspond to any of the above specific types, then the answer is directly determined by passage retrieval, i.e., no postprocessing is applied.

To evaluate the quality of our Q-A process, we use the standard mean reciprocal rank (MRR) measure of the correct answers, which is defined as follows:

$$\mathrm{MRR} = \frac{1}{N}\sum_i RR_i \quad \text{where} \quad RR_i = \sum_j \frac{1}{\mathrm{Rank}_{ij}} \tag{10}$$

where N is the number of test questions and Rank_ij is the rank of the jth correct answer to the ith test question in the answer list. 1/MRR can be seen as a sort of average rank of the correct answers in the list; the higher the MRR, the better.
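Eq. (10) translates directly into code; a minimal sketch:

```python
def mean_reciprocal_rank(correct_ranks_per_question):
    """Eq. (10): each entry lists the 1-based ranks of the correct answers
    returned for one test question; RR_i sums their reciprocals."""
    rr = [sum(1.0 / rank for rank in ranks)
          for ranks in correct_ranks_per_question]
    return sum(rr) / len(rr)

# E.g., a correct answer at rank 2 for question 1, and correct answers at
# ranks 1 and 5 for question 2, give MRR = (0.5 + 1.2) / 2 = 0.85.
```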

The Q-A module was initially tested with 100 test questions on a closed corpus containing documents from the Canadian Building Digests (1990). Among these questions, 40 ask for categories, 42 for named entities, and 18 do not correspond to either of these types. The correct answers to these questions were first determined manually by experts. Our experiments show that the postprocessing for NEs and categories can greatly improve the MRR of the search. Compared to passage retrieval, searches using the named-entities approach yielded an overall improvement for 33% of the 42 test questions, no change for 48%, and a loss of performance for 19%. For NE search, we obtained an MRR of 76.98% versus 66.63% with passage retrieval only, that is to say, a relative improvement of 15.5%. Searches using the semantic categories on 40 questions led to an improvement in 35% of cases, no change in 55%, and a decrease in 10%, leading to an MRR of 55.0%, a relative improvement of 14.8% over the 47.89% obtained with passage retrieval. Overall, for all 100 questions, the MRR is 65.45%, meaning that on average the correct answer can be found at rank 1.5 (i.e., 1/0.6545), roughly between the first and the second answer in the list, which is encouraging. Without our postprocessing, the MRR is 58.26%. Thus, we can conclude that the postprocessing on NEs and semantic categories, though simple, is quite effective, representing a relative improvement of 12.3% in the ranking of correct answers.

Superstructure: The Design of the Users' Convivial Interface

The objective of the research project, as has been stated, was, and is, to provide easy access to reliable information on the Web. The reliability aspect is supported by the pertinence of the information provided for decision making, since only potentially relevant preselected Web resources are consulted by the Q-A system. The conviviality priority guides the design of the user-system interface, in order to circumvent the endemic reticence of building industry professionals to spend time looking for information. According to Robert et al. (2006):

Three interface prototypes for the Q-A system were designed and tested, in order to select the most appropriate interface for the users. The goal of the tests was twofold: first, to test the capacity of each prototype to induce the user to write a clear and self-sufficient question in natural language so as to facilitate its analysis by the system and eventually improve the response; second, to test the usability of each prototype, where the usability of a product is defined as "the extent to which the product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use" (ISO 9241 [. . .]). In the context of this project, given the constraints and the stakes of the building industry, and the diversity of the actors involved, the quality of usability is essential for acceptance of the application by the intended users.

This led to approaching the design of the interface on the basis of the "e-mail" metaphor, that is to say, giving the interface an air of familiarity or "déjà vu," as if one were sending an e-mail to a human being, a librarian. Within this framework, a series of prototypes was tested. The first interface prototype included a prompting message to guide the user, a dialog box for choosing a prefabricated form of question (e.g., what is, who, where, how, . . .), a dialog box for entering the question (in natural language), a dialog box for presenting the answer (also in natural language), and "O.K." and "ERASE" buttons.

The second prototype added a dialog box for describing the context of the question (the importance of the context emerged from the survey described above). The third prototype also added four prompts to the system, indicating the domains considered most pertinent for guiding the search for a response to the user's query: (1) design process/designs; (2) science, technology; (3) business, management; and (4) news (recall that our focused crawler classifies construction pages into these four subcategories; the class labels are used here to filter the response presented to the user).

For test purposes, the Web interfaces were presented to 11 architecture or engineering students working individually. In a first scenario, questions were imposed and the use (or nonuse) of the features in the different interfaces was observed; in a second scenario, the questions were not imposed, the process was observed, and the clarity of the ensuing question was evaluated. In 70% of the queries, one of the "prefabricated" forms of question was used (and in 86% of the clear questions), confirming their utility. However, the separate dialog box for describing the context was not appreciated by the users and did not lead to better questions. The addition of the prompt boxes (third interface) did not appear to lead to better questions, but they helped in the retrieval of potentially useful responses. The final interface is shown in Fig. 5.

Fig. 5. Interface of the Q-A system. Note the "e-mail" metaphor, with (from top to bottom): addressee (Cibât-International Building Center in Montreal), subject box, query box (with optional menu of "prefabricated" forms of question), categories boxes (for optionally choosing the nature of the response being sought), "send," "cancel," and "print" buttons, and space for displaying the answer.

Experimental Results

Experiments performed at different stages of our research allowed us to evaluate the performance of our focused crawler, in terms of noise in the collected database, and the performance of the Q-A module and its improvement over standard search procedures.

Experiments with the Focused Crawler

We tested our intelligent crawler on the Web in December 2006, using an initial set of 200 positive examples of construction-oriented Web pages provided by experts (set A). The pages were evenly distributed among the four categories, mentioned previously, that are potentially significant for members of the construction sector. This initial set of positive examples contained nearly equal numbers of English and French pages. We also collected nearly 7,500 pages from the Open Directory Project archive (www.dmoz.org) to create the unlabeled set U. The RC was trained using the PEBL approach with sets A and U, while the CC was trained using only set A.

The system was configured with 30 parallel fetchers for downloading the pages. The acquisition of data lasted 12 h, and during that time the system visited 407,400 pages (9.4 pages per second). The total number of relevant pages discovered and downloaded was 61,240, or 15% (1.4 relevant pages per second). The average relevance score of the relevant pages was 0.85. The size of the downloaded data was 1.39 GB, while the average size of a relevant page was 23.7 KB.

To evaluate the real relevance of the collected data, we would have to check a large portion of the downloaded pages manually. Since this would be a time-consuming task for domain experts, we decided to approach the verification problem in a machine-learning spirit. We created a new set of initial examples (set B) with 315 relevant pages. The distribution of the pages among classes and languages was similar to that of set A, but the pages were taken from different sources in order to achieve independence between A and B. The idea of the approach is to train two classifiers, relevance classifier-human (RCH) and categories classifier-human (CCH), each simulating an independent expert, to verify the collected data. RCH is trained on sets B and U, while CCH is trained on B.


Table 1. Performance of the Relevance Classifier (RC) and Categories Classifier (CC), evaluated by the classifiers RCH and CCH after the first crawl. RC precision is the ratio between the number of pages judged relevant by both RC and RCH and the number of pages judged relevant by RC alone. CC performance is evaluated by comparing CC decisions against CCH decisions (considered correct) for those pages judged relevant by RCH. We report microprecision (mp), microrecall (mr), and microaveraged F1 (mF1).

Relevance level   Relevant by RC   Relevant by both RC and RCH   RC precision   CC mp   CC mr   CC mF1
0.9-1.0           31,096           27,475                        0.884          0.74    0.62    0.68
0.8-0.9            8,788            6,102                        0.694          0.65    0.66    0.65
0.7-0.8            7,659            4,128                        0.539          0.64    0.66    0.65
0.6-0.7            6,969            3,267                        0.469          0.63    0.67    0.65
0.5-0.6            6,728            2,777                        0.413          0.62    0.67    0.64
The results of the verification process are given in Table 1. From this table, it is clear that different relevance levels contain different levels of estimated noise (e.g., level "0.9-1.0" contains 11% noisy pages). If we concatenate levels "0.9-1.0" and "0.8-0.9" and reject pages from the other relevance levels, we end up with 16% noise in a database constructed in this way.

In order to test our approach further, we decided to perform another crawl, using all 515 positive examples (the union of sets A and B) and the 7,500 unlabeled examples. It is reasonable to assume that the noise rate of the acquired data would not be higher, and that the classification into four classes would be no less correct than in the first experiment. The rationale for this assumption is that we doubled the number of teaching examples, thus better modeling the space of words for the domain of interest. The pages collected in the second crawl were used to create the database for our Q-A module. After examining the results of the previous crawl, we decided to retain five relevance levels but to consider as relevant only the pages belonging to levels "0.9-1.0" and "0.8-0.9." The lower relevance queues were retained in the crawling machine, since the less relevant pages could generate a certain number of links to more relevant ones. In the second crawl, we visited 2,135,955 pages and found 263,187 pages to be relevant for our Q-A module (126,491 English and 136,696 French pages).
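The 16% figure can be checked directly against Table 1, as in this small sketch:

```python
# Levels 0.9-1.0 and 0.8-0.9 from Table 1.
relevant_by_rc = 31_096 + 8_788
confirmed_by_rch = 27_475 + 6_102
noise = 1 - confirmed_by_rch / relevant_by_rc
print(f"estimated noise: {noise:.1%}")   # ~15.8%, i.e., approximately 16%
```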


Experiments with the Q-A Component


Tests of the individual modules of the Q-A component have been described above. Tests of the Q-A component as a whole are currently under way, using the pages collected by the second crawl. Two approaches are envisaged: (1) using the same set of questions (mentioned previously) for comparison purposes; and (2) opening the Q-A system up to unrehearsed questions, posed first by engineering and architecture students and then by practitioners. Depending on the outcome of these tests, the system will then be made available at the β level.
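For approach (1), runs over a fixed question set can be scored with mean reciprocal rank, a standard question-answering measure. The minimal scorer below, and its input format, are an illustrative sketch, not the evaluation harness used in our tests.

```python
def mean_reciprocal_rank(ranks):
    """ranks: for each question, the 1-based rank of the first correct
    passage returned by the system, or None if no correct passage appeared."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

# e.g., first correct answers found at ranks 1 and 2, and one question missed:
print(mean_reciprocal_rank([1, 2, None]))  # (1 + 0.5 + 0) / 3 = 0.5
```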


Conclusions

The research was initiated in an attempt to automate the question-and-answer process while hopefully preserving the conviviality and reliability of "asking one's friendly reference librarian." This initiative is doubly important in the context of an industry (the construction industry) where (a) searching for general information is not seen as a productive use of time, and (b) adopting innovation depends on the broadest circulation and uptake of pertinent information. As Nie et al. (2005) put it:

The true test is to show that decision makers in the building sector, rather than relying on intuition or bluffing their way past a "lack-of-knowledge" barrier, would accept spending a little time seeking the information they would otherwise not bother to obtain, because of the "friendly" nature of the system now available to them. Early tests and initial feedback seem to justify our gamble.

In other words, the main objective of this research is to lessen the burden faced by busy practitioners as they confront the need to make rapid decisions for which they may confess to a tacit need for more information to complete their pertinent knowledge. A more "subversive" objective is to suggest to them that obtaining information is less burdensome than it otherwise might have been, and potentially less "costly" than proceeding without it.

This research is based on interdisciplinary teamwork, involving skills in computer science, information science, software ergonomics, and construction project management, all coupled to a good understanding of the decision processes in construction practice. Each contributing team member developed an innovative module, and the success of the project flows from the coordination of their efforts.

Acknowledgments

The writers are indebted to our research colleagues and fellow team members: Lyne Da Sylva (Library and Information Science) and Gonzalo Lizarralde (Environmental Design) at the University of Montreal, and Jean-Marc Robert (Mathematics and Industrial Engineering) at the Polytechnic School of Montreal. We thank Philippe Davidson (EDHEC, Nice) and also the students who took part in the project: T. De Bellefeuille, N. El Khoury, F. Jin, M. Léger-Rousseau, L. Liu, G. Lizarralde, L. Moulet, M. Ngendahayo, L. Shi, Q. Zhang, and Z. Zhang, without whose cross-disciplinary participation the project could not have succeeded. This work was made possible thanks to funding provided to C.H.D. by the Bell University Laboratories and by the Natural Sciences and Engineering Research Council of Canada.

References

Bardin, S., and Blachère, G. (Collab. Davidson, C.). (1992). Amélioration de l'efficacité de la diffusion des résultats de la recherche en bâtiment, Auxirbat, Paris.
Bardin, S., Blachère, G., and Davidson, C. H. (1993). "Are research results used in practice?" Build. Res. Inf., 21(6), 347–354.


Brin, S., and Page, L. (1998). "The anatomy of a large-scale hypertextual web search engine." Comput. Netw., 30(1–7), 107–117.
Building Research Association of New Zealand (BRANZ). (1984). Abbreviated Rep. of Information Dissemination Strategies, Porirua, New Zealand.
Caldas, C., Soibelman, L., and Han, J. (2002). "Automated classification of construction project documents." J. Comput. Civ. Eng., 16(4), 234–243.
Canadian building digests. (1990). Series of 250 documents (from 1960 to 1990), National Research Council, Institute for Research in Construction, Ottawa, Canada.
Chakrabarti, S., van den Berg, M., and Dom, B. (1999). "Focused crawling: A new approach to topic-specific web resource discovery." Comput. Netw., 31(11–16), 1623–1640.
Chen, H., Chau, M., and Zeng, D. (2002). "CI spider: A tool for competitive intelligence on the web." Decision Support Sys., 34(1), 1–17.
Chen, P.-H., Lin, C.-J., and Schölkopf, B. (2005). "A tutorial on ν-support vector machines." Applied Stochastic Models in Bus. and Ind., 21, 111–136.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., and Ursu, C. (2002). "The GATE User Guide." 〈http://gate.ac.uk/〉.
Davidson, C. H. (2004). "Agenda 21: Information and documentation—A research agenda." Limited circulation Rep., International Council for Research and Innovation in Building and Construction–CIB, Rotterdam, The Netherlands.
Davison, B. D. (2000). "Topical locality in the web." Proc., 23rd Annual Int. Conf. on Research and Development in Information Retrieval, Athens, Greece, 272–279.
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). "Focused crawling using context graphs." Proc., 26th Int. Conf. on Very Large Data Bases, Cairo, Egypt, 527–534.
El-Diraby, T. A., Lima, C., and Feis, B. (2005). "Domain taxonomy for construction concepts: Toward a formal ontology for construction knowledge." J. Comput. Civ. Eng., 19(4), 394–406.
Fletcher, R. (1987). Practical methods of optimization, 2nd Ed., Wiley, New York.
IF Research Group. (1978). Thesaurus—Canada/Construction—Science, Dept. of Industry, Trade and Commerce, Government of Canada, Ottawa, Canada, 2 vols.
IF Research Group. (1979). Canadian urban thesaurus/Thésaurus urbain canadien, Secretary of State for Urban Affairs, Government of Canada, Ottawa, Canada.
Joachims, T. (1998). "Text categorization with support vector machines: Learning with many relevant features." Proc., ECML-98, 10th European Conf. on Machine Learning, Chemnitz, Germany, 137–142.
King, J. (1984). "Research in practice: Generation, use and communication." Architectural research, J. Snyder, ed., Van Nostrand Reinhold, New York.
Lawrence, S., and Giles, C. L. (1998). "Searching the world wide web." Science, 280, 98–100.
Leslie, H. G., and McKay, D. G. (1995). Managing information to support project decision making in the building and construction industry, CSIRO, Melbourne, Australia.
Lizarralde, G., Da Sylva, L., Nie, J.-Y., and Davidson, C. H. (2005). "Innovation through better information—A question-and-answer service shows the way." Proc., Information and Knowledge Management in a Global Economy: Challenges and Opportunities for Construction Organizations, Instituto Superior Tecnico, Lisbon, Portugal, 129–138.

Lucene. (2006). "An open source text search engine." 〈http://lucene.apache.org/〉.
MacKinder, M. (1982). Design decision making in architectural practice, Research Paper 19, Institute of Advanced Architectural Studies, Univ. of York, York, U.K.
McCallum, A., Nigam, K., Rennie, J., and Seymore, K. (2000). "Automating the construction of internet portals with machine learning." Inf. Retrieval, 3(2), 127–163.
Mitchell, T. M. (1997). Machine learning, McGraw-Hill, New York.
Mohsini, R., and Davidson, C. H. (1991). "Building procurement—Key to improved performance." Build. Res. Inf., 19(2), 106–113.
Moulet, L. (2004). "Comment guider l'utilisateur lors de l'interrogation en langage naturel d'une base de données informatique sur le Web dans le domaine du bâtiment?" MS thesis, Dept. of Industrial Engineering, Polytechnic School of Montreal, Quebec, Canada.
Nie, J.-Y., Da Sylva, L., Lizarralde, G., and Davidson, C. H. (2005). "A question and answer system for the construction sector—Facilitating on-line access to information." Proc., Information and Knowledge Management in a Global Economy: Challenges and Opportunities for Construction Organizations, Instituto Superior Tecnico, Lisbon, Portugal, 119–128.
Pant, G., and Srinivasan, P. (2005). "Learning to crawl: Comparing classification schemes." ACM Trans. Inf. Syst., 23(4), 430–462.
Pant, G., Tsioutsiouliklis, K., Johnson, J., and Giles, C. L. (2004). "Panorama: Extending digital libraries with topical crawlers." Proc., 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, New York, 142–150.
Qin, J., Zhou, Y., and Chau, M. (2004). "Building domain-specific web collections for scientific digital libraries: A meta-search enhanced focused crawling method." Proc., 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, New York, 135–141.
Rezgui, Y. (2006). "Ontology-centered knowledge management using information retrieval techniques." J. Comput. Civ. Eng., 20(4), 261–270.
Robert, J.-M., Lizarralde, G., Moulet, L., Davidson, C. H., Nie, J.-Y., and Da Sylva, L. (2006). "Finding out: A system for providing rapid and reliable answers to questions in the construction sector." Constr. Innovation, 6, 250–261.
Salton, G. (1989). Automatic text processing, Addison-Wesley, Reading, Mass.
Searchenginewatch. (2003). "Freshness showdown." 〈http://www.searchengineshowdown.com/stats/freshness.shtml〉.
Sebastiani, F. (2002). "Machine learning in automated text categorization." ACM Comput. Surv., 34(1), 1–47.
Vapnik, V. (1995). The nature of statistical learning theory, Springer, New York.
Yang, Y., and Liu, X. (1999). "A reexamination of text categorization methods." Proc., 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Berkeley, Calif., 42–49.
Yang, Y., and Pedersen, J. O. (1997). "A comparative study on feature selection in text categorization." Proc., ICML-97, 14th Int. Conf. on Machine Learning, Nashville, Tenn., 412–420.
Yu, H., Han, J., and Chang, K. (2004). "PEBL: Web page classification without negative examples." IEEE Trans. Knowl. Data Eng., 16(1), 1–12.
Zhang, Z., Da Sylva, L., Davidson, C., Lizarralde, G., and Nie, J.-Y. (2004). "Domain-specific Q-A for the construction sector." Proc., IR4QA Workshop (Information Retrieval for Question Answering), SIGIR '04, Sheffield, U.K., September 29, 65–71.
