Understanding the Hidden Web - Pierre Senellart

5 downloads 6047 Views 981KB Size Report
Jun 17, 2005 - Possible restriction to some domain (e.g. publications). .... Booking services. Mailing list .... With nonsense words. With domain words ...
Understanding the Hidden Web Pierre Senellart Introduction

Understanding the Hidden Web

Process description Discovery Wrappers

Pierre Senellart

Semantic Analysis Indexing and Querying

Summary

17 June 2005

The Hidden Web Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Definition (Hidden Web) The set of webpages (which may or may not be dynamically generated) not accessible from the hyperlinked structure of the World Wide Web.

Semantic Analysis Indexing and Querying

Summary

Size estimate (2001) : 500 times larger than the surface Web. How to understand it and benefit from its content?

The Hidden Web Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Definition (Hidden Web) The set of webpages (which may or may not be dynamically generated) not accessible from the hyperlinked structure of the World Wide Web.

Semantic Analysis Indexing and Querying

Summary

Size estimate (2001) : 500 times larger than the surface Web. How to understand it and benefit from its content?

The Hidden Web Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Definition (Hidden Web) The set of webpages (which may or may not be dynamically generated) not accessible from the hyperlinked structure of the World Wide Web.

Semantic Analysis Indexing and Querying

Summary

Size estimate (2001) : 500 times larger than the surface Web. How to understand it and benefit from its content?

Understanding the Hidden Web Understanding the Hidden Web Pierre Senellart Introduction Process description

Purpose Intensional indexing of the Hidden Web High-level queries ⇒ a semantic search engine over the Hidden Web

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

In a fully, unsupervized, way!

Difficult and broad problem. Possible restriction to some domain (e.g. publications).

Understanding the Hidden Web Understanding the Hidden Web Pierre Senellart Introduction Process description

Purpose Intensional indexing of the Hidden Web High-level queries ⇒ a semantic search engine over the Hidden Web

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

In a fully, unsupervized, way!

Difficult and broad problem. Possible restriction to some domain (e.g. publications).

Understanding the Hidden Web Understanding the Hidden Web Pierre Senellart Introduction Process description

Purpose Intensional indexing of the Hidden Web High-level queries ⇒ a semantic search engine over the Hidden Web

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

In a fully, unsupervized, way!

Difficult and broad problem. Possible restriction to some domain (e.g. publications).

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

World Wide Web

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

World Wide Web

discovery

HTML form UDDI WSDL

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart

World Wide Web

discovery

HTML form UDDI WSDL

Introduction Process description

wrappers

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

WSDL

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart

World Wide World Wide Web Web

discovery

HTML form HTML form UDDI UDDI WSDL WSDL

Introduction

wrappers

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Analyzed Web Services analysis

WSDL WSDL

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart

World Wide World Wide Web Web

discovery

HTML form HTML form UDDI UDDI WSDL WSDL

Introduction

wrappers

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Analyzed Web Services analysis

Summary

indexing

Web Services Index

WSDL WSDL

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart

World Wide World Wide Web Web

discovery

HTML form HTML form UDDI UDDI WSDL WSDL

Introduction

wrappers

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Analyzed Web Services analysis

WSDL WSDL

Summary

indexing Queries Web Services Index

querying Results

Understanding the Hidden Web Pierre Senellart

1

Introduction

2

Process description Web Service Discovery Wrapping Web Service Descriptions Web Service Semantic Analysis Web Service Indexing and Querying

3

Summary

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Web Service Discovery Understanding the Hidden Web Pierre Senellart

Crawling the World Wide Web for: HTML forms implementing a Web Service

Introduction

UDDI registries

Process description

WSDL descriptions

Discovery Wrappers Semantic Analysis Indexing and Querying

Other resources (XML, HTML, Web as a full-text index. . . )

Summary

Only interested in Web Services with no side effects: Ok

Not Ok

Yellow Pages

Booking services

Publication databases

Mailing list management

...

...

Web Service Discovery Understanding the Hidden Web Pierre Senellart

Crawling the World Wide Web for: HTML forms implementing a Web Service

Introduction

UDDI registries

Process description

WSDL descriptions

Discovery Wrappers Semantic Analysis Indexing and Querying

Other resources (XML, HTML, Web as a full-text index. . . )

Summary

Only interested in Web Services with no side effects: Ok

Not Ok

Yellow Pages

Booking services

Publication databases

Mailing list management

...

...

Web Service Discovery Understanding the Hidden Web Pierre Senellart

Crawling the World Wide Web for: HTML forms implementing a Web Service

Introduction

UDDI registries

Process description

WSDL descriptions

Discovery Wrappers Semantic Analysis Indexing and Querying

Other resources (XML, HTML, Web as a full-text index. . . )

Summary

Only interested in Web Services with no side effects: Ok

Not Ok

Yellow Pages

Booking services

Publication databases

Mailing list management

...

...

Web Service Discovery Understanding the Hidden Web Pierre Senellart

Crawling the World Wide Web for: HTML forms implementing a Web Service

Introduction

UDDI registries

Process description

WSDL descriptions

Discovery Wrappers Semantic Analysis Indexing and Querying

Other resources (XML, HTML, Web as a full-text index. . . )

Summary

Only interested in Web Services with no side effects: Ok

Not Ok

Yellow Pages

Booking services

Publication databases

Mailing list management

...

...

Understanding the Hidden Web Pierre Senellart

1

Introduction

2

Process description Web Service Discovery Wrapping Web Service Descriptions Web Service Semantic Analysis Web Service Indexing and Querying

3

Summary

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Analyzing HTML forms Understanding the Hidden Web

Analyzing the structure of HTML forms.

Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues What are the relevant form fields? What is the concrete type of each field? What is the label of each field?

Analyzing HTML forms Understanding the Hidden Web

Analyzing the structure of HTML forms.

Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues What are the relevant form fields? What is the concrete type of each field? What is the label of each field?

Analyzing HTML forms Understanding the Hidden Web

Analyzing the structure of HTML forms.

Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues What are the relevant form fields? What is the concrete type of each field? What is the label of each field?

Probing HTML forms Understanding the Hidden Web Pierre Senellart Introduction

Probing HTML forms to retrieve sample HTML answer pages: With dictionary words

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

With nonsense words

Summary

With domain words

Probing HTML forms Understanding the Hidden Web Pierre Senellart Introduction

Probing HTML forms to retrieve sample HTML answer pages: With dictionary words

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

With nonsense words

Summary

With domain words

Probing HTML forms Understanding the Hidden Web Pierre Senellart Introduction

Probing HTML forms to retrieve sample HTML answer pages: With dictionary words

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

With nonsense words

Summary

With domain words

Query-answer webpages Understanding the Hidden Web

Extract data from query-answer webpages.

Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues What part of the webpage contains the answer? How to extract structured content? How to label this structured content?

Query-answer webpages Understanding the Hidden Web

Extract data from query-answer webpages.

Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues What part of the webpage contains the answer? How to extract structured content? How to label this structured content?

Query-answer webpages Understanding the Hidden Web

Extract data from query-answer webpages.

Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues What part of the webpage contains the answer? How to extract structured content? How to label this structured content?

R OAD RUNNER Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Example: R OAD RUNNER information extraction engine

Understanding the Hidden Web Pierre Senellart

1

Introduction

2

Process description Web Service Discovery Wrapping Web Service Descriptions Web Service Semantic Analysis Web Service Indexing and Querying

3

Summary

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Conceptual Model Understanding the Hidden Web Pierre Senellart

IsA ontology of concepts (simple DAG) Thing

Introduction Process description Discovery

Person

Publication

Wrappers Semantic Analysis Indexing and Querying

Man

Woman

Proceedings

Article

Summary

n-ary typed roles AuthorOf(Person,Publication) HasName(Person,Name)

Book

Conceptual Model Understanding the Hidden Web Pierre Senellart

IsA ontology of concepts (simple DAG) Thing

Introduction Process description Discovery

Person

Publication

Wrappers Semantic Analysis Indexing and Querying

Man

Woman

Proceedings

Article

Summary

n-ary typed roles AuthorOf(Person,Publication) HasName(Person,Name)

Book

Semantic representation of a service Understanding the Hidden Web Pierre Senellart

What is a service described by?

Introduction

A n-uple of typed input parameters

Process description

A complex (= nested) type of its output

Discovery Wrappers

Semantic relations between inputs and outputs

Semantic Analysis Indexing and Querying

Summary

Definition (Complex types) S: set of concepts T ←− S||T ∗

Semantic representation of a service Understanding the Hidden Web Pierre Senellart

What is a service described by?

Introduction

A n-uple of typed input parameters

Process description

A complex (= nested) type of its output

Discovery Wrappers

Semantic relations between inputs and outputs

Semantic Analysis Indexing and Querying

Summary

Definition (Complex types) S: set of concepts T ←− S||T ∗

Semantic representation of a service Understanding the Hidden Web Pierre Senellart

What is a service described by?

Introduction

A n-uple of typed input parameters

Process description

A complex (= nested) type of its output

Discovery Wrappers

Semantic relations between inputs and outputs

Semantic Analysis Indexing and Querying

Summary

Definition (Complex types) S: set of concepts T ←− S||T ∗

Semantic representation of a service Understanding the Hidden Web Pierre Senellart

What is a service described by?

Introduction

A n-uple of typed input parameters

Process description

A complex (= nested) type of its output

Discovery Wrappers

Semantic relations between inputs and outputs

Semantic Analysis Indexing and Querying

Summary

Definition (Complex types) S: set of concepts T ←− S||T ∗

Services and queries Understanding the Hidden Web Pierre Senellart Introduction Process description

Example Service giving authors from publication titles A*← WrittenBy(P,A),HasTitle(P,T),Input(T)

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Example Query: *← WrittenBy(P,A), Article(P), HasTitle(P,T), KeywordOf(“xml”,P)

Services and queries Understanding the Hidden Web Pierre Senellart Introduction Process description

Example Service giving authors from publication titles A*← WrittenBy(P,A),HasTitle(P,T),Input(T)

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Example Query: *← WrittenBy(P,A), Article(P), HasTitle(P,T), KeywordOf(“xml”,P)

Managing extensional information Understanding the Hidden Web Pierre Senellart Introduction

How to represent extensional information (i.e. documents) in this formalism?

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Definition A document is a service with no input.

Summary

Complex types: natural representation of a DTD. (Disjunctions a|b simulated by (a?,b?)).

Semantic Interpretation of a Service Understanding the Hidden Web Pierre Senellart

How to analyze a Web Service?

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Field labels, variable names, tag names Concrete type descriptions (e.g. \d{4}-\d{2}-\d{2} is a date) Linguistic analysis of plain text descriptions and pages linking to the service Note: Extracting concepts is easier than extracting relations between concepts.

Semantic Interpretation of a Service Understanding the Hidden Web Pierre Senellart

How to analyze a Web Service?

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Field labels, variable names, tag names Concrete type descriptions (e.g. \d{4}-\d{2}-\d{2} is a date) Linguistic analysis of plain text descriptions and pages linking to the service Note: Extracting concepts is easier than extracting relations between concepts.

Semantic Interpretation of a Service Understanding the Hidden Web Pierre Senellart

How to analyze a Web Service?

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Field labels, variable names, tag names Concrete type descriptions (e.g. \d{4}-\d{2}-\d{2} is a date) Linguistic analysis of plain text descriptions and pages linking to the service Note: Extracting concepts is easier than extracting relations between concepts.

Semantic Interpretation of a Service Understanding the Hidden Web Pierre Senellart

How to analyze a Web Service?

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Field labels, variable names, tag names Concrete type descriptions (e.g. \d{4}-\d{2}-\d{2} is a date) Linguistic analysis of plain text descriptions and pages linking to the service Note: Extracting concepts is easier than extracting relations between concepts.

Semantic Interpretation of a Service Understanding the Hidden Web Pierre Senellart

How to analyze a Web Service?

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Field labels, variable names, tag names Concrete type descriptions (e.g. \d{4}-\d{2}-\d{2} is a date) Linguistic analysis of plain text descriptions and pages linking to the service Note: Extracting concepts is easier than extracting relations between concepts.

Understanding the Hidden Web Pierre Senellart

1

Introduction

2

Process description Web Service Discovery Wrapping Web Service Descriptions Web Service Semantic Analysis Web Service Indexing and Querying

3

Summary

Introduction Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Web Service Indexing and Querying Understanding the Hidden Web Pierre Senellart Introduction Process description

Given a query, represented as an Analyzed Web Service, how to know which known web services to query?

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues Subsumption of input/output parameters Missing input parameters Composition of webservices

Web Service Indexing and Querying Understanding the Hidden Web Pierre Senellart Introduction Process description

Given a query, represented as an Analyzed Web Service, how to know which known web services to query?

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues Subsumption of input/output parameters Missing input parameters Composition of webservices

Web Service Indexing and Querying Understanding the Hidden Web Pierre Senellart Introduction Process description

Given a query, represented as an Analyzed Web Service, how to know which known web services to query?

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues Subsumption of input/output parameters Missing input parameters Composition of webservices

Web Service Indexing and Querying Understanding the Hidden Web Pierre Senellart Introduction Process description

Given a query, represented as an Analyzed Web Service, how to know which known web services to query?

Discovery Wrappers Semantic Analysis Indexing and Querying

Summary

Issues Subsumption of input/output parameters Missing input parameters Composition of webservices

Differences with classical database querying Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Three main differences: Information can be queried only through views (Local As View)

Semantic Analysis Indexing and Querying

Summary

Nested types Incomplete information Three sources of complexity!

Differences with classical database querying Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Three main differences: Information can be queried only through views (Local As View)

Semantic Analysis Indexing and Querying

Summary

Nested types Incomplete information Three sources of complexity!

Differences with classical database querying Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Three main differences: Information can be queried only through views (Local As View)

Semantic Analysis Indexing and Querying

Summary

Nested types Incomplete information Three sources of complexity!

Differences with classical database querying Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Three main differences: Information can be queried only through views (Local As View)

Semantic Analysis Indexing and Querying

Summary

Nested types Incomplete information Three sources of complexity!

Differences with classical database querying Understanding the Hidden Web Pierre Senellart Introduction Process description Discovery Wrappers

Three main differences: Information can be queried only through views (Local As View)

Semantic Analysis Indexing and Querying

Summary

Nested types Incomplete information Three sources of complexity!

Web Service Semantic Interpretation Process Understanding the Hidden Web Pierre Senellart

World Wide World Wide Web Web

discovery

HTML form HTML form UDDI UDDI WSDL WSDL

Introduction

wrappers

Process description Discovery Wrappers Semantic Analysis Indexing and Querying

Analyzed Web Services analysis

WSDL WSDL

Summary

indexing Queries Web Services Index

querying Results