Semantic Annotation Tools Survey

Pedro Oliveira, João Rocha
School of Engineering, Polytechnic of Porto
Porto, Portugal
{ped, jsr}@isep.ipp.pt

Abstract—The Web has come a long way since its invention by Berners-Lee, when it focused essentially on the visualization and presentation of content for human consumption (Syntactic Web), to a Web providing meaningful content that facilitates the integration between people and machines (Semantic Web). This paper presents a survey of tools that enrich the Web with machine-understandable annotations, in order to make its content available and interoperable between systems. Semantic Annotation tools can be grouped along several dimensions: dynamicity, storage, information extraction process, scalability and customization. The analysis of the different annotation tools shows that semi-automatic and automatic systems are not yet efficient enough to operate without human intervention and will continue to evolve to meet that challenge. Microdata, RDFa and the new HTML5 standard will certainly bring new contributions to this issue.

Keywords—Semantic Web; Semantic Annotation; Semantic Annotation Tools.
I. INTRODUCTION
Despite recent developments, the Web remains essentially composed of documents written in HTML (HyperText Markup Language), a language focused on the visualization and presentation of information to be consumed mainly by humans. Machines can only interpret the information available on the Web if it is provided in a language that both humans and machines understand. This is precisely the aim of the Semantic Web: with machine-understandable semantics, even basic mechanisms could facilitate people's searching and browsing. The inventor of the Web, Tim Berners-Lee, views the Semantic Web as a logical extension of the current Web rather than an entirely new Web. To achieve this logical extension he states: "The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web – a web of data that can be processed directly or indirectly by machines." [1]. The potential of the Semantic Web will be truly realized when applications can search for information on the Web in different ways, process that information and share results with other programs [2]. Obviously, the effectiveness of these applications depends on the proliferation of structured data that can be accessed and understood by machines. Understanding between machines requires a specific vocabulary describing the requirements of a given field, together with a set of logical axioms that express the semantic meaning intended by the words of the vocabulary,
which can be achieved through ontologies. Ontologies formally represent knowledge as a set of concepts and the relations between them within a given field. They play an important role in achieving interoperability between organizations and in the Semantic Web, because they aim to capture domain knowledge and to make generic semantics explicit, providing a basis for agreement within a domain.
II. SEMANTIC ANNOTATION
The process of attaching semantic concepts to natural language is referred to as Semantic Annotation. This process can be seen as the dynamic creation of bidirectional relationships between ontologies and unstructured or semi-structured documents [3]. From a technological point of view, Semantic Annotation is the process of inserting metadata, namely concepts of an ontology (classes, instances, properties and relations), into Web resources in order to assign semantics to them. The success of the Semantic Web depends essentially on the proliferation of annotated Web content. Annotating data can help provide better search facilities, since queries can be based not only on traditional keywords but also on the well-defined concepts described by the ontology of the domain in which we want to search for information [4]. A minimal sketch of such an annotation is given below.
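As a concrete illustration (not taken from any of the surveyed tools), the following Python sketch uses the rdflib library to attach a hypothetical ontology concept to a Web resource and serialize the resulting annotation as RDF; the ex: namespace, the City class and all URIs are invented for the example.

```python
# A minimal sketch of ontology-based annotation (hypothetical ontology and URIs).
# Requires: pip install rdflib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/ontology#")   # hypothetical domain ontology

g = Graph()
g.bind("ex", EX)

# The Web resource being annotated and the entity it mentions.
page = URIRef("http://example.org/pages/travel-guide.html")
oporto = URIRef("http://example.org/resource/Oporto")

# Assign semantics: "Oporto" is an instance of the ontology class ex:City,
# and the page mentions it.
g.add((oporto, RDF.type, EX.City))
g.add((oporto, RDFS.label, Literal("Oporto")))
g.add((page, EX.mentions, oporto))

print(g.serialize(format="turtle"))
```

Tools that keep annotations in an external repository would store this graph on a server, while embedded approaches would serialize it into the page itself (e.g. as RDFa); this is exactly the storage dimension discussed below.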
According to [5], we can distinguish several annotation dimensions. In this document we focus on the dynamicity of annotation, storage, the information extraction (IE) process, scalability and customization.

The dynamicity of annotation is concerned with the nature of the annotated content and distinguishes static annotation from dynamic annotation [5]. Static annotation is the traditional form of annotation and is applied to static content, normally content with a very small probability of changing. Dynamic annotation, also known as rule annotation, is the result of a query or filter applied automatically according to the content of the document and the annotation method.

The storage dimension is concerned with how annotations are stored so that they can be accessed later. Annotations can be kept in a local or a remote repository; the advantages and drawbacks of each are discussed in [5].

The information extraction process is concerned with the method of identifying the important parts of the document to annotate and the content of the annotation [5]. We can subdivide this dimension into the method type and the level of automation. These two sub-dimensions are related, as the degree of automation of the IE process is influenced by the method type: pattern matching methods are normally easy to automate, while Natural Language Processing (NLP) and ontology-based methods are harder to automate.

The level of automation of a method ranges from manual to fully automated. Manual annotation can be accomplished using diverse authoring tools that normally provide an integrated environment for authoring and annotating text. However, relying on humans for the annotation process is too expensive to carry out without some kind of automation: it is almost impossible to deal with the volume of documents existing on the Web, and manual annotation often introduces errors, mainly due to the following factors [6]: the use of highly complex coding schemas, inconsistencies in labeling among different annotators, and varying familiarity with the domain. Manual annotation has thus turned knowledge acquisition into a bottleneck that hinders the dissemination of the Semantic Web. To overcome this bottleneck, systems that carry out the process automatically have been proposed. Known systems are mainly semi-automatic approaches, while fully automatic systems remain a challenge. These systems provide the scalability needed to annotate the existing documents of the deep Web and facilitate the annotation of new documents. They also facilitate the use of multiple ontologies to annotate the same document.

The method type can be organized according to the classification presented in [7]:

• Pattern matching – based on regular expressions, either defined manually before searching the content of the document or obtained by pattern discovery, mostly following the basic method outlined by [8]: an initial set of entities is defined and the corpus is searched to discover patterns in which the entities occur. New entities are discovered along with new patterns, and the process continues recursively until the method discovers no further entities or the user stops it (a minimal sketch of this bootstrapping loop is given at the end of this section). This method should be used on pages that do not change often.

• Rule-based methods – the rules must be defined manually before searching the content of documents. This approach does not require training data and gives good results on structured documents with clear patterns.

• Wrapper induction methods – defined by [9] as the task of learning a procedure for extracting tuples from a particular information source from examples provided by the user.

Hybrid approaches are normally used to take advantage of the strengths of methods from different categories, based on the target domain, the semantic complexity of the annotations and the availability of trained human annotators and/or language engineers [3].

Scalability and customization concern the ability to process millions of documents and the ease of adapting to new ontologies and/or domains.
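The following self-contained Python sketch illustrates the bootstrapping loop behind pattern discovery [8], under the simplifying assumption that a "pattern" is just the pair of words immediately surrounding an entity occurrence; the corpus and the seed entity are toy examples.

```python
# Toy sketch of Brin-style entity/pattern bootstrapping (simplified:
# a "pattern" is just the pair of tokens around an entity occurrence).
import re

corpus = [
    "cities such as Oporto attract visitors",
    "cities such as Lisbon attract visitors",
    "people love Lisbon deeply",
    "people love Braga deeply",
]

def patterns_for(entity, texts):
    """Collect (left_word, right_word) contexts in which the entity occurs."""
    found = set()
    for text in texts:
        for m in re.finditer(r"(\w+) " + re.escape(entity) + r" (\w+)", text):
            found.add((m.group(1), m.group(2)))
    return found

def entities_for(pattern, texts):
    """Find new entities occurring inside a known context pattern."""
    left, right = pattern
    found = set()
    for text in texts:
        for m in re.finditer(re.escape(left) + r" (\w+) " + re.escape(right), text):
            found.add(m.group(1))
    return found

entities, patterns = {"Oporto"}, set()
while True:  # recurse until nothing new is discovered (or the user stops it)
    new_patterns = set().union(*(patterns_for(e, corpus) for e in entities)) - patterns
    patterns |= new_patterns
    new_entities = set().union(*(entities_for(p, corpus) for p in patterns)) - entities
    entities |= new_entities
    if not new_patterns and not new_entities:
        break

print(entities)  # e.g. {'Oporto', 'Lisbon', 'Braga'} (set order varies)
```

Starting from the single seed "Oporto", the loop discovers the pattern ("as", "attract"), which yields "Lisbon", which in turn yields the pattern ("love", "deeply") and hence "Braga": exactly the recursive entity/pattern growth described above.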
III. SEMANTIC ANNOTATION TOOLS SURVEY
Semantic Annotation faces the challenge of delivering tools capable of fully automatic annotation. This work in progress is an additional effort towards a complete survey, as improved versions are released and completely new prototypes emerge constantly. The present survey highlights the tools most commonly referenced in the recent Semantic Annotation literature, rather than an exhaustive collection of all tools. Comparing Semantic Annotation tools is also a difficult task, because the tools differ in the semantic models adopted, the document formats supported and the target domains. The tools are presented in alphabetical order:

• AeroDAML [10] is a knowledge markup tool that automatically generates DAML annotations for web pages after applying natural language extraction techniques. AeroDAML maps proper nouns and common relationships to classes and properties in DAML ontologies. AeroDAML has two modes of use. The web-enabled version supports annotation with a default generic ontology of commonly found words, classes and relationships: the user enters a URI and AeroDAML responds with the DAML annotation for the page at that URI. The client-server version supports annotation with customized ontologies: the user enters a file name and AeroDAML returns the DAML annotation for the text document.

• AKTiveMedia [11][12] is a user-centric system for annotating documents, supporting text, images and HTML documents (containing both text and images) with ontology-based and free-text annotations. Both author and reader can create annotations, and different ontologies can be used. The annotations are not stored in the document but separately, with authorship, allowing users to share comments and annotations with other members of the community through a centralized server. Most annotations are done manually, but various techniques are available to reduce the annotation effort. AKTiveMedia is the successor of AktiveDoc [13].

• Armadillo [14][15] is a system for creating automatic domain-specific annotations on large repositories in an unsupervised way. This tool implements an adaptive information extraction algorithm that uses a pattern-based approach to find entities from a handful of seed examples provided by the user and discovers new facts from those examples. Learning is seeded for information extraction from redundant information repositories, such as databases and digital libraries, or from a user-defined lexicon. The retrieved information is used in part to annotate a set of new documents, and the newly annotated documents are then used to bootstrap learning. The user can repeat this process until the annotations reach the expected quality.
Armadillo uses diverse techniques, from keyword-based search to adaptive IE and information integration.

• Cerno [14][16] is a framework for semi-automatic Semantic Annotation of text documents according to a domain-specific semantic model. Cerno uses lightweight techniques and tools for code analysis and markup, requiring limited human effort to adapt to a new domain. The framework comprises a process for defining keywords and grammar-based rules to identify instances of concepts in a textual document, and an architecture that applies the rules to annotate and extract the instances identified in a document.

• CREAM [17][18] (CREAtion of Metadata) is a framework for creating annotations, in particular relational metadata. CREAM supports metadata creation both during and after the authoring of Web pages. It includes inference services, a crawler, a document management system, ontology guidance/fact browsing, document editors/viewers, and a meta ontology. Annotations can be made manually – by typing or in a drag-and-drop style, associating instances with the concepts displayed in the ontology browser – or semi-automatically, using wrappers and information extraction components. OntoAnnotate and OntoMat Annotizer are two implementations of the CREAM framework. OntoMat Annotizer is a user-friendly tool that explores the inner structure of HTML documents to infer annotations, helping system analysts gather knowledge from different documents and Web pages and populate an ontology with metadata. OntoMat expects a collection of highly structured and consistent documents, so that it can rapidly induce extraction rules from a small human-annotated corpus. CREAM is well suited for highly structured web documents; for annotating HTML pages of a less structured nature, SemTag is more appropriate.

• Drupal [19] is one of the top three open source Content Management Systems (CMS) in terms of market share [20]. These systems manage textual and multimedia content, enriched with meta-information about the site's structure and content. Drupal enables site administrators to export their site content model and data to the Web of Data without deep knowledge of Semantic Web technologies. It can create RDFa annotations, making it possible to map site data to existing ontologies. In short, Drupal is capable of generating and injecting RDFa annotations within HTML pages, based on the structure of the content within the system.

• GoNTogle [4] offers a framework for ontology-based Semantic Annotation. It can annotate several document formats (doc, pdf, txt, rtf, odt, sxw, etc.), either whole documents or fragments of them. The framework supports manual and automatic annotation; automatic annotation relies on a learning method that explores past
annotations made by the user, together with textual information, to make annotation suggestions automatically. Annotations are stored in a centralized ontology server, keeping them separate from the document. GoNTogle provides advanced search facilities through a flexible combination of keyword-based and semantic-based search over the different document formats.

• KIM [21][22][23] (Knowledge and Information Management) is a platform containing an ontology, a knowledge base, and an automatic Semantic Annotation, indexing and retrieval server. Similarly to SemTag, KIM focuses on linking the entities in the corpus to their semantic descriptions, provided by the KIMO ontology which, apart from containing named entity classes and their properties, is pre-populated with a large number of instances. KIM offers an infrastructure for scalable and customizable information extraction as well as annotation and document management, based on GATE (the General Architecture for Text Engineering). To provide a basic level of performance and allow easy bootstrapping of applications, KIM is equipped with an upper-level ontology and a knowledge base providing extensive coverage of entities of general importance. From a technical point of view, the platform allows KIM-based applications to use it for automatic Semantic Annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases. Created annotations are linked to the type of entity and to the exact individual in the knowledge base.

• KnowItAll [24] is an unsupervised, domain-independent system that automates the extraction of large collections of facts from the Web. KnowItAll assesses candidate facts using pointwise mutual information (PMI) statistics. The PMI measure can be presented, in general terms, as the ratio between the number of hits obtained by querying a search engine with a discriminator phrase (e.g. "Oporto is a city") and the number of hits obtained by querying with the extracted fact alone (e.g. "Oporto"); see the sketch after this list. KnowItAll does not need an initial set of seeds, relying instead on automatically generated domain-specific extraction rules.

• KnowWe [25][26] is a Semantic Wiki for knowledge engineering that helps build decision-support systems. Each page of the wiki represents a distinct concept of the ontology, which is described using text and may include multimedia content. The annotation process uses the wiki ontology, linking text phrases to ontology properties. KnowWe also makes it possible to define rules/models to derive instances of particular concepts.

• Lixto [27] provides techniques for supervised wrapper generation and automated web information extraction. Lixto assists users in creating wrapper programs in a semi-automatic way through a visual and interactive interface. It scans Web pages continuously and extracts
relevant information from Web pages with dynamic content.

• Magpie [28] works within a web browser, avoiding the need for manual annotation by automatically annotating web resources: text strings found in the web page are associated with an ontology chosen by the user. It is similar to Thresher, using wrappers to produce RDF in "real time" as users explore web pages.

• MnM [29][30] is an annotation tool that can annotate pages automatically or semi-automatically with semantic metadata. MnM is an ontology-based information extraction engine based on unsupervised learning: it can learn extraction rules from a training corpus and apply these rules to unseen news articles to populate an ontology. MnM integrates a web browser with an ontology editor and provides open APIs for connecting to ontology servers and for integrating with information extraction tools. MnM is a good example of an ontology editor because it is web-based and provides semantic tagging facilities and mechanisms for large-scale automatic tagging of web pages with convenient metadata.

• OpenCalais [31] is a web service offered by Reuters, which automatically creates rich semantic metadata from unstructured text sources. OpenCalais performs natural language processing (English and French) and also uses machine-learning techniques to identify entities in text. Entities are divided into named entities (people, companies, books, albums, etc.), facts (political events, etc.) and events (sports, changes of command). Using this information it is possible to build maps (or graphs or networks) linking documents to people, companies, places and various other entities. OpenCalais is offered for free, but with a daily limit of requests.

• PANKOW [32] (Pattern-based Annotation through Knowledge on the Web) is an unsupervised, pattern-based method for creating instances based on an ontology. PANKOW does not rely on a seed corpus, since it uses linguistically motivated regular expressions to discover relations in the document. It exploits the knowledge existing on the Web, proposing annotations based on counting Google hits of instantiated linguistic patterns.

• Semantic MediaWiki [33] is an extension of MediaWiki (the wiki application that powers Wikipedia), a traditional wiki of essentially plain text. It adds Semantic Annotations, allowing the wiki to serve as a collaborative database. Each article corresponds to a class or property of an ontology. Semantic MediaWiki provides additional markup for wiki text, which simplifies the structure of the wiki, enables users to find more information in less time, and improves the quality and consistency of the wiki. Semantic MediaWiki can also serve as a data source for other applications, because annotations can be exported in different formats.
• Semantic Wikipedia [34][35] is a framework for extracting new semantic information about a subject from plain text. The subject must have a structured description of the domain to which it belongs. The strategy is based on disambiguating plain text using both a domain ontology and linguistic pattern matching methods, and has three main steps: TOC extraction from the original page, annotation of the content of each section, and generation of the Semantic Wiki.

• SemTag [36][37] is the semantic annotation component of a platform called Seeker, a large-scale semantic tagging tool that facilitates the annotation of the deep web. SemTag uses structural analysis to annotate web documents, annotating text with terms from a standard ontology (TAP) in an automated fashion. SemTag is designed to operate as a centralized application with access to database records and the corresponding metadata, features that give it advantages over local taggers. The TAP ontology contains lexical and taxonomic information about a large variety of named entities, for instance locations, movies, authors, musicians, autos and others. SemTag can detect occurrences of the named entities in web pages, disambiguating them with the Taxonomy Based Disambiguation (TBD) algorithm. SemTag stores the generated annotations separately from the original document. It is seen as a tool for experts rather than knowledge workers.

• Thresher [38][39] allows end users, instead of content providers, to unwrap the semantic structures nested inside Web pages. Thresher presents a web interface that lets non-technical users easily mark up examples of a particular class. Thresher learns from these examples, automatically inducing wrappers that can be applied to the same page or to "similar" web pages. Thresher is aimed at Web pages that present similar content (the same type of object), normally web pages fed by relational data through a template, and extracts the corresponding information by analogy. The Thresher system is similar to Magpie.

• Zemanta [3] is an online annotation tool that processes unstructured documents and recommends relevant links to diverse content on the Web (tags, categories, images, articles and named entities). The content is analyzed with a proprietary algorithm for natural language and semantic processing, combined with machine-learning techniques to constantly refine its recommendations.
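KnowItAll's PMI assessment, described above, boils down to a ratio of search-engine hit counts. The following sketch computes that ratio; the hit counts and the get_hit_count function are hypothetical stand-ins for real search-engine queries and are not part of KnowItAll itself.

```python
# Hypothetical sketch of KnowItAll-style PMI scoring from search-engine
# hit counts. get_hit_count stands in for a real search-engine query.
HIT_COUNTS = {                      # made-up numbers for illustration
    '"Oporto is a city"': 9_400,
    '"Oporto"': 2_100_000,
    '"Oporto is a fruit"': 3,
}

def get_hit_count(query: str) -> int:
    """Stand-in for querying a search engine and reading the hit count."""
    return HIT_COUNTS.get(query, 0)

def pmi_score(fact: str, discriminator: str) -> float:
    """Ratio of hits for the discriminator phrase to hits for the bare fact.

    A higher ratio means the discriminator (e.g. "is a city") co-occurs
    often with the extracted fact, supporting the candidate class.
    """
    hits_fact = get_hit_count(f'"{fact}"')
    if hits_fact == 0:
        return 0.0
    hits_phrase = get_hit_count(f'"{fact} {discriminator}"')
    return hits_phrase / hits_fact

print(pmi_score("Oporto", "is a city"))   # comparatively high
print(pmi_score("Oporto", "is a fruit"))  # comparatively low
```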
IV. SURVEY SUMMARY
The different tools reviewed are summarized in TABLE I according to the dimensions presented earlier in this document.
TABLE I. SURVEY SUMMARY

Tool | Dynamicity of Annotation | Annotation Storage | Level of Automation | Method Type | Scalable/Customizable
AeroDAML | Dynamic | RDF knowledge base | Automatic | Rule based | Scalable with diverse ontologies.
AKTiveMedia | Dynamic/Static | Annotation server | Automatic/Manual | Wrapper induction | Difficult to scale despite having diverse techniques available to reduce the annotation effort.
Armadillo | Dynamic | RDF knowledge store | Automatic | Pattern matching | Needs seeds to scale. Uses a cyclical annotating process with different techniques.
Cerno | Dynamic | External database | (Semi-)Automatic | Wrapper induction | Can adapt to new domains.
CREAM | Dynamic/Static | Embedded in webpage | Automatic/Manual | Rule based/wrappers | Difficult to scale but well suited for highly structured web documents.
Drupal | Dynamic | Embedded in webpage | Automatic | Rule based | Supports diverse ontologies and adapts easily to new domains.
GoNTogle | Dynamic/Static | Ontology server | Automatic/Manual | Wrapper induction | Automatic annotation requires a learning method over past annotations. Supports diverse document formats.
KIM | Dynamic | RDF knowledge base | Automatic | Rule based | Ontology pre-populated with a large number of instances.
KnowItAll | Dynamic | RDF knowledge store | Automatic | Rule based | Automatically generated domain-specific extraction rules.
KnowWe | Dynamic | RDF knowledge base | Manual/(Semi-)Automatic | Rule based | Annotations are made via a special markup. Each article corresponds to a separate knowledge base containing all the annotated relations and knowledge markups.
Lixto | Dynamic/Static | Real time | (Semi-)Automatic | Wrapper induction | Scans Web pages continuously and extracts relevant information.
Magpie | Dynamic/Static | Real time | Automatic | Wrapper induction | Works within the web browser. User can choose the ontology.
MnM | Dynamic/Static | Embedded in webpage | (Semi-)Automatic | Wrapper induction | Learns extraction rules from a training corpus. Offers an API for connection to ontology servers.
OpenCalais | Static | Embedded in webpage | Automatic | Rule based | Uses NLP and machine-learning techniques. Free but with a daily request limit.
PANKOW | Dynamic | RDF knowledge base | Automatic | Pattern matching | Based on counting Google hits of instantiated linguistic patterns.
Semantic MediaWiki | Dynamic/Static | Embedded in webpage (can be exported) | Manual/(Semi-)Automatic | Rule based | Annotations are made via a special markup. Ontologies can be imported. Every article corresponds to exactly one ontological element.
Semantic Wikipedia | Dynamic/Static | Embedded in webpage | Automatic | Pattern matching | Domain must have a structured description.
SemTag | Dynamic | RDF knowledge base | Automatic | Rule based | Performs structural analysis. Can access database records and metadata.
Thresher | Dynamic/Static | Real time | Automatic/Manual | Wrapper induction | Aimed at Web pages with similar content.
Zemanta | Dynamic/Static | Real time | (Semi-)Automatic | Rule based | Uses machine-learning techniques. Targeted at unstructured documents. The algorithm for natural language and semantic processing is proprietary.

We can notice that, in the dynamicity of annotation dimension, there is only one tool, OpenCalais, that is completely static; it is offered as a Web service by Reuters. All the other tools are dynamic, which can be justified by the constant evolution of the documents present on the Web. The types of storage used by the different tools show that different paths have been taken. We can divide storage into embedded, external and real time.
Examples of tools that embed annotations in Web pages are CREAM, Drupal, MnM, OpenCalais, Semantic MediaWiki and Semantic Wikipedia. On the other hand, some tools store annotations in external sources: AeroDAML, AKTiveMedia, Armadillo, Cerno, GoNTogle, KIM, KnowItAll, KnowWe, PANKOW and SemTag. Other tools generate annotations as users explore pages (embedded or external annotations): Lixto, Magpie, Thresher and Zemanta. We think this dimension will continue to receive special attention in Semantic Annotation, where annotations
made on external sources will facilitate information extraction.

All the tools presented in this survey aim to overcome the challenge of developing automatic systems, but the level of automation is subjective, because some human intervention is normally still necessary. The tools analyzed focus mainly on one method type, except CREAM, which uses both rules and wrappers. The method type used partly limits scalability: pattern matching and wrapper induction are not good choices for scalable tools. Tools like AKTiveMedia, Armadillo, CREAM, GoNTogle, MnM, PANKOW, Semantic Wikipedia and Thresher are difficult to scale because they need a training corpus (wrapper induction) or pages with similarly structured content (pattern matching). Only a few tools offer the capability of using different ontologies in the annotation process (AeroDAML, Drupal, Magpie and MnM) or facilitate adaptation to new domains (Cerno, Drupal, etc.), although we can observe that more or less all tools offer some kind of customization. We can conclude that challenges still exist in this research area, where Web pages with different characteristics demand different kinds of methods: Web pages fed by relational data through a template may achieve good results with a rule-based approach, whereas Web pages containing large portions of free text might not be very suitable for this method.
V. FUTURE PERSPECTIVES

The question at this moment is whether the challenge of semantically annotating pages is solved, and we know it still has a long road ahead. As our survey demonstrates, systems are evolving from manual systems, which have high costs and are error-prone, towards fully automatic systems, but these still need some human intervention in their annotation process. Many researchers have dedicated their studies to implementing semi-automatic systems (where human intervention is necessary), delivering systems with better accuracy. These are relevant because we continue to use an essentially syntactic web. The Semantic Web would ideally help web users find and use information because computers would be capable of knowing, for example, that a person is a person rather than just a set of tags, making it easy to add that person to our contact list.

The new HTML5 standard will be part of the solution, along with microdata, a new lightweight semantic meta-syntax that allows the definition of nestable groups of name-value pairs of data, generally based on the page's content. It provides a whole new way to add semantic information and, depending on its proliferation, can possibly help take another step towards the Semantic Web. A great number of sites are generated from structured data, often stored in databases, but once the data has been incorporated into an HTML page it becomes difficult to recover the original structured data. The introduction of microdata and the use of collections of schemas, i.e., HTML tags used to mark up pages, will accelerate the process of extending the syntactic Web. It is essential that webmasters start to mark up their pages so that search providers feel the effort is worthwhile. A minimal sketch of microdata markup and its extraction is given below.
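As a concrete illustration (the page fragment is invented), the sketch below embeds name-value pairs in an HTML fragment using a schema.org type (discussed just below) and recovers them with Python's standard-library HTML parser, which is roughly what a search provider's extractor would do at much larger scale.

```python
# Illustrative sketch: a page fragment annotated with HTML5 microdata
# (schema.org Person vocabulary), and a tiny stdlib parser that recovers
# the name-value pairs a consumer could use.
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Pedro Oliveira</span> works at
  <span itemprop="affiliation">Polytechnic of Porto</span>.
</div>
"""

class MicrodataParser(HTMLParser):
    """Collects itemprop values; ignores item nesting for simplicity."""
    def __init__(self):
        super().__init__()
        self.current_prop = None
        self.items = {}

    def handle_starttag(self, tag, attrs):
        # Remember which property (if any) the next text node belongs to.
        self.current_prop = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self.current_prop:
            self.items[self.current_prop] = data.strip()
            self.current_prop = None

parser = MicrodataParser()
parser.feed(PAGE)
print(parser.items)  # {'name': 'Pedro Oliveira', 'affiliation': 'Polytechnic of Porto'}
```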
Schema.org is a project that brings together the major search providers (Bing, Google and Yahoo!), in the spirit of sitemaps.org, to provide a shared collection of schemas that webmasters can use. The major search providers rely on these schemas to improve the display of search results, making it easier for people to find relevant information on the web and avoiding results about unrelated subjects. The markup will also enable new applications that make use of the available semantic information. This project is seen as a threat to the proliferation of RDF and to Semantic Web adoption, as it can affect their business model.

Another feature of HTML5 with great potential for Semantic Annotation is Web Storage, a technology that enables web pages to store named key/value pairs locally, within the client web browser. The stored data is persistent, similarly to cookies, but is never transmitted to the remote server (cookies are transmitted). We can envision a Web page having its own database, with the advantage that displaying structured information and accessing it can be done in a much more standard way. Applications will be developed to use the data kept in Web storage.
REFERENCES

[1] T. Berners-Lee, Weaving the Web. Harper, 1999, p. 244.
[2] M. M. Taye, "Understanding Semantic Web and Ontologies: Theory and Applications," arXiv preprint arXiv:1006.4567, vol. 2, no. 6, pp. 182–192, 2010.
[3] K. Bontcheva and H. Cunningham, "Semantic Annotation and Retrieval: Manual, Semi-Automatic and Automatic Generation."
[4] G. Giannopoulos, N. Bikakis, T. Dalamagas, and T. Sellis, "GoNTogle: A Tool for Semantic Annotation and Search," The Semantic Web: Research and Applications, pp. 376–380, 2010.
[5] N. Bettencourt, P. Maio, A. Pongó, N. Silva, J. Rocha, and R. D. A. B. de Almeida, "Systematization and clarification of semantic web annotation terminology," in International Conference on Knowledge Engineering and Decision Support, 2006.
[6] P. S. Bayerl, H. Lungen, U. Gut, and K. I. Paul, "Methodology for reliable schema development and evaluation of manual annotations," in Workshop on Knowledge Markup and Semantic Annotation at the Second International Conference on Knowledge Capture (K-CAP), 2003.
[7] L. Reeve and H. Han, "Survey of semantic annotation platforms," in Proceedings of the 2005 ACM Symposium on Applied Computing (SAC '05), p. 1634, 2005.
[8] S. Brin, "Extracting Patterns and Relations from the World Wide Web," in WebDB '98: Selected Papers from the International Workshop on The World Wide Web and Databases, vol. 1590, no. 2, pp. 172–183, 1999.
[9] N. Kushmerick, "Wrapper induction for information extraction," Citeseer, 1997.
[10] P. Kogut and W. Holmes, "AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages," in First International Conference on Knowledge Capture (K-CAP 2001), Workshop on Knowledge Markup and Semantic Annotation, 2001, vol. 21, p. 3.
[11] A. Chakravarthy, F. Ciravegna, and V. Lanfranchi, "Cross-media document annotation and enrichment," in Proc. 1st Semantic Web Authoring and Annotation Workshop (SAAW2006), 2006.
[12] N. Bikakis, G. Giannopoulos, T. Dalamagas, and T. Sellis, "Integrating keywords and semantics on document annotation and search," On the Move to Meaningful Internet Systems, OTM 2010, pp. 921–938, 2010.
[13] V. Lanfranchi, F. Ciravegna, and D. Petrelli, "Semantic web-based document: editing and browsing in AktiveDoc," ESWC, vol. 3532, pp. 623–632, 2005.
[14] N. Kiyavitskaya, N. Zeni, J. R. Cordy, L. Mich, and J. Mylopoulos, "Cerno: Light-weight tool support for semantic annotation of textual documents," Data & Knowledge Engineering, vol. 68, no. 12, pp. 1470–1492, Dec. 2009.
[15] A. Dingli, F. Ciravegna, and Y. Wilks, "Automatic semantic annotation using unsupervised information extraction and integration," in Proceedings of the SemAnnot 2003 Workshop, 2003.
[16] N. Kiyavitskaya, N. Zeni, L. Mich, and J. Cordy, "Annotating Accommodation Advertisements using Cerno," Information and Communication Technologies in Tourism 2007, pp. 389–400, 2007.
[17] S. Handschuh and S. Staab, "Authoring and annotation of web pages in CREAM," in Proceedings of the Eleventh International Conference on World Wide Web (WWW '02), p. 462, 2002.
[18] S. Handschuh, S. Staab, and F. Ciravegna, "S-CREAM – semi-automatic creation of metadata," Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pp. 165–184, 2002.
[19] S. Corlosquet, R. Delbru, and T. Clark, "Produce and Consume Linked Data with Drupal!," The Semantic Web …, vol. 1380, pp. 751–766, 2009.
[20] R. Shreves, "Open source CMS market share," White paper, Water & Stone, 2011.
[21] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, and A. Kirilov, "KIM – a semantic platform for information extraction and retrieval," Natural Language Engineering, vol. 10, no. 3–4, pp. 375–392, 2004.
[22] S. K. Malik, N. Prakash, and S. Rizvi, "Semantic Annotation Framework For Intelligent Information Retrieval Using KIM Architecture," International Journal, vol. 1, no. October, pp. 12–26, 2010.
[23] B. Popov, A. Kiryakov, A. Kirilov, and D. Manov, "KIM – Semantic Annotation Platform," Engineering, pp. 834–849, 2003.
[24] O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates, "Unsupervised named-entity extraction from the Web: An experimental study," Artificial Intelligence, vol. 165, no. 1, pp. 91–134, Jun. 2005.
[25] J. Baumeister, J. Reutelshoefer, and F. Puppe, "KnowWE: a Semantic Wiki for knowledge engineering," Applied Intelligence, vol. 35, no. 3, pp. 323–344, Mar. 2010.
[26] J. Baumeister, J. Reutelshoefer, F. Haupt, and K. Nadrowski, "Capture and refactoring in knowledge wikis: coping with the knowledge soup," 2008.
[27] R. Baumgartner, O. Frölich, and G. Gottlob, "The Lixto Systems Applications in Business Intelligence and Semantic Web," The Semantic Web: Research and Applications, pp. 16–26, 2007.
[28] M. Dzbor, E. Motta, and J. Domingue, "Opening Up Magpie via Semantic Services," pp. 635–649, 2004.
[29] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna, "MnM: Ontology driven semi-automatic and automatic support for semantic markup," Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pp. 213–221, 2002.
[30] M. Vargas-Vera, E. Moreale, A. Stutt, E. Motta, and F. Ciravegna, "MnM: semi-automatic ontology population from text," Ontologies, pp. 373–402, 2007.
[31] C. Batista and D. Schwabe, "LinkedTube: informações semânticas em objetos de mídia da Internet," Simpósio Brasileiro de Sistemas …, 2009.
[32] P. Cimiano, S. Handschuh, and S. Staab, "Towards the self-annotating web," in Proceedings of the 13th Conference on World Wide Web (WWW '04), p. 462, 2004.
[33] M. Krötzsch, D. Vrandečić, and M. Völkel, "Semantic MediaWiki," The Semantic Web – ISWC 2006, pp. 935–942, 2006.
[34] A. Pipitone and R. Pirrone, "A framework for automatic semantic annotation of Wikipedia articles."
[35] V. Nastase and M. Strube, "Decoding Wikipedia Categories for Knowledge Acquisition," Artificial Intelligence, vol. 8, pp. 1–6, 2008.
[36] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, et al., "SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation," in Proceedings of the 12th International Conference on World Wide Web, 2003, pp. 178–186.
[37] S. Dill, "A case for automated large-scale semantic annotation," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 1, no. 1, pp. 115–132, Dec. 2003.
[38] A. Hogue and D. Karger, "Thresher: automating the unwrapping of semantic content from the World Wide Web," in Proceedings of the 14th International Conference on World Wide Web, 2005, pp. 86–95.
[39] D. Huynh, D. Karger, and D. Quan, "Haystack: A platform for creating, organizing and visualizing information using RDF," in Semantic Web Workshop, 2002.