Vol.12 No.5 2007 769-772
Article ID: 1007-1202(2007)05-0769-04 DOI 10.1007/s11859-007-0001-4
A Survey of Web Information System and Applications
HAN Yanbo1, LI Juanzi2, YANG Nan3, LIU Qing3, XU Baowen4, MENG Xiaofeng3
1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China;
2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
3. School of Information, Renmin University of China, Beijing 100872, China;
4. School of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China
Abstract: The fourth international conference on Web information systems and applications (WISA 2007) received 409 submissions and accepted 37 papers for publication in this issue. The papers cover broad research areas, including Web mining and data warehouse, Deep Web and Web integration, P2P networks, text processing and information retrieval, as well as Web services and Web infrastructure. After briefly introducing the WISA conference, the survey outlines the current activities and future trends concerning Web information systems and applications based on the papers accepted for publication.
Key words: Web mining; data warehouse; Deep Web; Web integration; Web services; P2P computing; text processing; information retrieval; Web security
CLC number: TP 391
Received date: 2007-01-04
Biography: HAN Yanbo (1962-), male, Professor, research direction: distributed systems integration, service computing and business process automation. E-mail: [email protected]
Wuhan University Journal of Natural Sciences Vol.12 No.5 2007
0 Introduction
With the prevalence of Web information systems and applications, many new challenges have emerged, raising open questions for research and development. To meet the need for sharing and exchanging research achievements and development experience between the academic and industrial communities, the Office Automation and Electronic Government Society of the China Computer Federation (CCF) launched the Web Information Systems and Applications conference, WISA. The goal of the WISA conference is to provide a forum for researchers, practitioners, and industrial professionals to share their experience in the rapidly growing area of Web information systems and applications. The first WISA conference was held in Wuhan, China (2004)[1], the second in Shenyang, China (2005)[2], and the third in Nanjing, China (2006)[3]. WISA 2007, organized by Renmin University of China, will be held in Beijing in September 2007.
WISA 2007 received 409 submissions, of which only 37 papers were accepted for publication in this special issue. They can be categorized into the following five research areas: Web mining and data warehouse, Deep Web and Web integration, Web services and peer-to-peer networks, text processing and information retrieval, as well as Web infrastructure and security issues. The subsequent sections summarize the papers according to the individual categories.
1 Web Mining and Data Warehouse
With the development of the Internet, Web mining[4] and data warehousing[5] have become increasingly important as key techniques. Current work covers both theoretical and applied research.
In the theoretical research on Web usage mining, some novel algorithms are proposed. The paper “A Novel Incremental Mining Algorithm of Frequent Patterns for Web Usage Mining” proposes IPARUC, a novel algorithm for updating global frequent patterns. It introduces a new structure called the AFP-tree, which can hold all items of transactions and is adjusted dynamically during insertion so that node sharing in the tree approaches the optimum.
Some research addresses Web structure mining, especially in the field of link analysis. The paper “Extracting the Cores of Implicit Communities” focuses on the discovery of Web communities. By improving the trawling method, it proposes a new method based on edge removal to extract cores from a Web graph. The paper “A New Generalized Similarity-Based Topic Distillation Algorithm” presents a query expansion-based algorithm to improve the quality of topic distillation. A personalized root set and base set are constructed with query expansion, utilizing Web logs to avoid topic drift, and HITS is then used to evaluate authority pages on related topics and return them to end users. The paper “Mining Representative Subset Based on Fuzzy Clustering” proposes a new model for identifying a representative set from a large database. Information-theoretic measures such as mutual information and relative entropy are used to measure the representativeness of the representative set. A greedy algorithm and a heuristic algorithm are designed to deliver better performance.
In the application of Web mining, the paper “Semantic Session Analysis for Web Usage Mining” presents a novel semantic session analysis method for partitioning Web usage logs. A Markov chain model based on semantic measurement is used to identify which active session a request should belong to, and a competitive method is applied to determine the end of sessions. Compared with other algorithms, more sessions can be detected successfully with the help of semantic outlier analysis.
In the field of data warehousing, there are two papers. The paper “Multi-Dimensional Customer Data Analysis
in Online Auctions” applies OLAP and data mining technology to the business of online auction companies. A customer-centered data warehouse system is designed. With this model, online auction companies can make full use of their customer data and obtain rich customer information, which helps them set customer strategy and increase customer loyalty and purchase amounts. The paper “Study and Implementation of a New SQL-Based ETL Approach” proposes a new ETL approach for data warehousing, E-LT (Extraction, Loading & Transformation). The E-LT approach applies a database mapping technique so that the loading and transformation stages of the ETL process are performed at the same time after the extraction stage. It can thus use SQL commands to complete loading and transformation, and it eliminates the staging area that precedes loading in the traditional ETL process. Finally, a real case of the E-LT method in marine data warehousing is discussed to illustrate the validity of the proposed method.
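One way to picture loading and transforming in a single SQL step is the minimal Python sketch below. It is not the paper's implementation: the use of SQLite, the attached database `src` standing in for the mapped source, and the table and column names (`extracted`, `fact_bids`, `customer`, `amount`) are all assumptions made for illustration.

```python
import sqlite3

def run_elt(extracted_rows):
    """Load and transform extracted rows with one SQL statement.

    Hypothetical setup: the main database plays the warehouse, and an
    attached database `src` plays the source mapped into it.
    """
    conn = sqlite3.connect(":memory:")                 # stand-in for the warehouse
    conn.execute("ATTACH DATABASE ':memory:' AS src")  # stand-in for the mapped source
    conn.execute("CREATE TABLE src.extracted (customer TEXT, amount TEXT)")
    conn.execute("CREATE TABLE fact_bids (customer TEXT, amount REAL)")

    # Extraction result: raw, untyped rows as they arrive from the source.
    conn.executemany("INSERT INTO src.extracted VALUES (?, ?)", extracted_rows)

    # Loading and transformation happen together in one INSERT ... SELECT,
    # so no separate staging step is needed inside the warehouse.
    conn.execute(
        """
        INSERT INTO fact_bids (customer, amount)
        SELECT UPPER(TRIM(customer)), CAST(amount AS REAL)
        FROM src.extracted
        WHERE TRIM(amount) <> ''
        """
    )
    conn.commit()

    result = conn.execute(
        "SELECT customer, SUM(amount) FROM fact_bids "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result

rows = [(" alice ", "10.5"), ("bob", "3"), ("Alice", "4.5"), ("carol", "")]
totals = run_elt(rows)
```

In this sketch the cleaning rules (trimming, upper-casing, casting, dropping empty amounts) live entirely in SQL, which is the property the E-LT description emphasizes; a production setting would map a real source system rather than an in-memory table.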
2 Deep Web and Web Integration
Retrieving and integrating the right information from the right sources among Web databases is essential for today’s network-based information processing, and it poses many new challenges at the same time. The term Deep Web is often used to refer to the new paradigm related to this kind of Web integration. Eight papers in this volume are closely related to the topic.
The paper “Query Translation on the Fly in Deep Web Integration” proposes an approach for automatically translating queries from an integrated interface to a Web database interface, and verifies the effectiveness and efficiency of the approach in an empirical study. The paper “A Deep Web Query Interfaces Classification Method Based on RBF Neural Network” proposes a query interface classification method in which a Radial Basis Function (RBF) neural network is used to classify the query interfaces. Given that traditional cache mechanisms are not efficient for Deep Web queries, the paper “KDS-CM: A Cache Model Based on Top-k Data Source for Deep Web Query” proposes a cache model based on Top-k data sources and obtains some positive results. It is of practical importance to extract the result schema while extracting result data from a particular Web database; this is exactly the topic of the paper “Extracting Result Schema Based on Query Instances in the Deep Web”.
XML is a de facto standard for information representation and exchange on the Internet. The other three papers deal with Web-based XML documents. The paper “Hisc: A Hybrid XML Index Composing Structure-Encoded with Cluster” deals with the indexing of semi-structured XML documents. It improves an existing dynamic index method for querying XML data by tree structure and proposes a hybrid XML index that combines structure encoding with clustering; experimental results indicate that the index is efficient and achieves high precision. The paper “Functional Dependencies and Its Axiom System in XML” handles the problem of expressing and deducing functional dependencies caused by data redundancies. The paper “TwigStack+: Holistic Twig Join Pruning Using Extended Solution Extension” focuses on improving the performance of the existing twig query approach, TwigStack, by avoiding redundant function calls.
3 Web Services and Peer-to-Peer Computing
Due to the open and dynamic nature of Web services, it is common for different providers to offer Web services that are functionally equivalent and can replace one another. The choice between them can be dictated by non-functional properties, referred to as Quality of Service (QoS) attributes. These attributes reflect users’ requirements and their degrees of satisfaction. Research on how to describe the QoS of Web services is presented in the paper “A Fuzzy Directed Graph-Based QoS Model for Service Composition”. In the paper “Performance Predictable ServiceBSP Model for Grid Computing”, the computation services are selected with QoS in mind so as to guarantee the QoS of the applications, and the performance of a computing application in the ServiceBSP model can be predicted according to the performance prediction model. The paper “Automatic Question Answering from Web Documents” proposes a passage retrieval strategy for Web-based question answering (QA) systems.
The remaining three papers in this category concentrate on improving query or search efficiency on a peer-to-peer topology. The paper “An Efficient Multi-Keyword Query Processing Strategy on P2P Based Web Search” proposes a benefit-based strategy for query routing in a P2P environment. Experiments on real-world data yield some promising results. The paper “Resource Search in Unstructured Peer-to-Peer System Based on Multiple-Tree Overlay Structure” deals with resource discovery in unstructured P2P systems. Peers that share similar characteristics are grouped into a tree-like cluster structure, naturally forming a multiple tree-like overlay system. Key issues in utilizing the overlay structure for resource discovery are discussed. In the paper “ISS: Efficient Search Scheme Based on Immune Method in Modern Unstructured Peer-to-Peer Networks”, the authors propose a search scheme for unstructured P2P networks that aims at improving the efficiency of existing dynamic query techniques by eliminating redundant messages.
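As an aside on the QoS-based choice among functionally equivalent services described at the start of this section, the sketch below ranks candidate services by a weighted sum of min-max normalized QoS attributes. This is a deliberately simple stand-in, not the fuzzy directed graph model or the ServiceBSP selection of the surveyed papers; the attribute names, weights, and provider names are hypothetical.

```python
# Hypothetical attributes where a smaller value means better quality.
LOWER_IS_BETTER = {"response_ms", "cost"}

def normalize(values, lower_is_better):
    """Min-max normalize values to [0, 1], where 1 is always the best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # all candidates are equal on this attribute
    if lower_is_better:
        return [(hi - v) / (hi - lo) for v in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_service(candidates, weights):
    """Return the best candidate name and the score of every candidate."""
    names = list(candidates)
    scores = dict.fromkeys(names, 0.0)
    for attr, weight in weights.items():
        column = [candidates[name][attr] for name in names]
        normalized = normalize(column, attr in LOWER_IS_BETTER)
        for name, value in zip(names, normalized):
            scores[name] += weight * value
    best = max(names, key=lambda name: scores[name])
    return best, scores

candidates = {
    "provider_a": {"response_ms": 120, "availability": 0.99, "cost": 5.0},
    "provider_b": {"response_ms": 80, "availability": 0.95, "cost": 8.0},
    "provider_c": {"response_ms": 200, "availability": 0.999, "cost": 2.0},
}
weights = {"response_ms": 0.5, "availability": 0.3, "cost": 0.2}
best, scores = select_service(candidates, weights)
```

The weights encode the user's requirements; changing them changes which provider wins, which is the sense in which QoS attributes reflect user satisfaction.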
4 Text Processing and Information Retrieval
With the development of the Internet and the World Wide Web, the available information resources have been growing quickly in number and scope. Web pages, documents, advertisements, wikis, and BBS are typical examples. At the same time, different Internet users have different requirements for the information they want to get from the Internet. How do we extract important information and organize it? How do we find the specific information users want and perform personalized search? These are challenges for current Web information systems. Advanced text processing and information retrieval target these problems, and they are thus among the main research topics of WISA 2007. Six papers relevant to these topics have been accepted for publication in this volume.
Four of these papers are about information retrieval. The papers “Research on the User Interest Modeling of Personalized Search Engine” and “A Novel personalized Web Search Model” aim at providing effective personalized search. The former proposes a user interest model together with its construction and updating, while the latter proposes a personalized search method that expands the user query using a user interest model. The paper “A High-Performance Extraction Method for Public Opinion on Internet” addresses the important and urgent task of analyzing public opinion on the Internet by employing a class space model to reflect the relationship between words and categories. The paper “Combining Block and Corner Features for Content-based Trademark Retrieval” supports the search of registered trademarks by using both
the block hit statistic and corner Delaunay. The other two papers concern problems of text processing. The paper “A New Feature Selection Method for Text Clustering” focuses on feature selection for text clustering. It uses expectation maximization and cluster validity to perform feature selection effectively and efficiently. The paper “Keyword Extraction Based on tf/idf for Chinese News Documents” gives a specification of keywords in Chinese news documents and proposes a new keyword extraction method based on the linguistic characteristics of news documents.
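For readers unfamiliar with the tf/idf weighting named in the last title, the sketch below ranks the terms of one document by term frequency times inverse document frequency. It shows only the textbook weighting, not the paper's method, which additionally relies on linguistic characteristics of news documents. The function name `top_keywords`, the whitespace tokenization (Chinese text would need word segmentation first), and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

def top_keywords(docs, doc_index, k=3):
    """Rank the terms of docs[doc_index] by tf * idf (textbook variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

corpus = [
    "the auction site logs the auction bids",
    "the warehouse stores the data",
    "the peers route the queries",
]
keywords = top_keywords(corpus, 0, k=2)
```

Here "the" is frequent in the first document but occurs in every document, so its idf is zero and it never ranks as a keyword, whereas "auction" is both frequent and distinctive.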
5 Infrastructure and Security Issues
Testing is a fundamental aspect of software engineering. In today’s fast-paced Web application development, it becomes increasingly important to ensure the correctness of Web applications through a validation and verification (V&V) process. The paper “Model Checking-Based Testing of Web Applications” proposes a formal model to represent the navigation behavior of a Web application.
Software cost estimation plays an important role in successful software project management. The paper “Grey Prediction Based Software Stage-effort Estimation” presents a grey prediction based method for month effort prediction during software development.
Traditional database systems are not well suited to biological data management. The paper “A Personalized Microbe Biodiversity Dataspace Information System” discusses how to manage a large amount of complex microbe biodiversity data with an object deputy database system, which can provide rich semantics and sufficient flexibility.
Security assurance and privacy control are among the major concerns in a multi-stakeholder environment. Four papers in this volume are concerned with the topic. The paper “Research of Anti-plagiarism Monitoring System Model” presents models and algorithms for anti-plagiarism scenarios. The model is expected to be useful for anti-piracy, reprint monitoring, and information security purposes. The paper “Fuzzy Privacy Decision for Context-aware Access Personal Information” discusses access control models that can protect personal privacy. Users are assigned roles with disclosure policies and situational considerations. In addition to intrusion detection, network anomaly detection is an alternative way to ensure network security. The paper
“Anomaly Detection with Artificial Immune Network” proposes an adaptive anomaly detection approach based on an artificial immune network and shows some promising results. The last paper in this group, “Resolution for Conflicts of Inter-Operation in Multi-Domain Environment”, focuses on secure interoperation in distributed system environments. Four types of conflicts in such interoperation are identified, and a method for conflict detection and resolution is then proposed.
6 Conclusion
WISA 2007 is the fourth in the Web Information System and Application conference series. The event has gained increasing popularity. Remarkable technological advances in the related fields can be observed from the enthusiastic submissions and participation from all over China. We believe that the accepted papers are representative and can bring forth new advances with regard to Web techniques, information systems, electronic government and office automation. We would like to express our deep appreciation to the authors of all submitted papers for their interest in the conference themes and to all those who have contributed to the success of WISA 2007.
References
[1] XU Baowen, XU Lei, MENG Xiaofeng, et al. Web-Based Information Systems and Applications: A Survey[J]. Wuhan University Journal of Natural Sciences, 2004, 9(5): 537-541.
[2] MENG Xiaofeng, XU Baowen, LIU Qing, et al. A Survey of Web Information Technology and Application[J]. Wuhan University Journal of Natural Sciences, 2006, 11(1): 001-005.
[3] WANG Guoren, XU Lizhen, XU Baowen, et al. A Survey of Web Information Systems and Applications[J]. Wuhan University Journal of Natural Sciences, 2006, 11(5): 1059-1064.
[4] Cooley R, Mobasher B, Srivastava J. Web Mining: Information and Pattern Discovery on the World Wide Web[C]// 9th International Conference on Tools with Artificial Intelligence (ICTAI ’97). Newport Beach, CA, USA, Nov 3-8, 1997.
[5] Widom J. Research Problems in Data Warehousing[C]// Proceedings of the 1995 International Conference on Information and Knowledge Management. Baltimore, Maryland, USA, Nov 28-Dec 2, 1995.