State of the Art - Homer Project

Version: Final
File name: 20131021stateoftheartvF.doc
Document status: Completed
Edited by: François Xavier Cardi & Elena Garshina - Mapize
Validated by: Ef - collectivité territoriale de corse
Date of validation: October 21, 2013
Licence: Creative commons by nc

Version history:
V.0 - January 21, 2013
V.0.7 - February 14, 2013
V.0.8 - February 15, 2013
V.1 - February 28, 2013
V.1.1 - June 29, 2013
V.1.3 - September 26, 2013

CONTENTS

Introduction
Disclaimer
1. What is open data ?
2. Open data management. Models and architectures
3. Content management systems for open data portals
4. Data search and indexing
5. Dataset management
6. Data catalog vocabularies
7. Multi-language vocabularies
8. Licensing issues
9. Linked data
10. Open data portals
11. Metadata Templates
Glossary terms

The present State of the Art has been developed within the HOMER WP5 project and will be further elaborated and updated.

HOMER SOCIO - ECONOMIC IMPACT STUDY

INTRODUCTION

The present State of the Art aims to provide a global view of the standards and technical solutions used by the HOMER partners, or used and recommended by experts in the field of Open Data in Europe.

The purpose of this document is to articulate the main principles and specifications related to the Open Data initiative in Europe. Various solutions that can be implemented by HOMER partners, whether Open Data portals, metadata or file formats, are specified in the present State of the Art. This State of the Art has been drawn up based on the information received from the HOMER partners (members of WP5) and from experts in related fields, as well as on different studies and publications in the field of Open Data. This document aims to help partners get a better understanding of the Open Data related technical environment and to lay down the foundations of the future technical specifications for the HOMER project.

The main goals of the present State of the Art are as follows:
»» Give a global vision of Open Data portal architecture types
»» Provide an overview of existing technical solutions for existing Open Data platforms
»» Reference various studies and publications that can be useful for deepening understanding of certain topics

DISCLAIMER

This State of the Art has been drawn up based on the feedback received from the HOMER partners: documents, recommendations on specific topics or feedback on the existing open data portals. All other data used come from different studies and publications (see the “References” chapter). The present State of the Art is subject to modifications and updates throughout the duration of the HOMER project, in line with partners’ comments and with the evolution of the different Open Data standards. HOMER partners’ feedback is crucial to elaborate the present State of the Art and make it really useful for all the members. There are different ways of giving feedback:
»» Post your comments on HOMER WIKI: http://www.homerproject.eu/forum
»» Send your remarks and comments by email to the following address : [email protected]
»» Fix a phone or skype meeting via: [email protected]

1. WHAT IS OPEN DATA ?

“Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.” OpenDefinition.org

The Open Data Definition sets out in detail the requirements for ‘openness’ in relation to content and data. The key features are:
»» Availability and Access : the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
»» Reuse and Redistribution : the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets.
»» Universal Participation : everyone must be able to use, reuse and redistribute – there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

Figure 1 : Open Data Ecosystem


Why Open Data?

Why should data be open? The answer, of course, depends somewhat on the type of data. However, there are common reasons such as :

Transparency. In a well-functioning, democratic society citizens need to know what their government is doing. To do that, they must be able freely to access government data and information and to share that information with other citizens. Transparency isn’t just about access, it is also about sharing and reuse: often, to understand material it needs to be analyzed and visualized, and this requires that the material be open so that it can be freely used and reused.

Releasing social and commercial value. In a digital age, data is a key resource for social and commercial activities. Everything from finding your local post office to building a search engine requires access to data, much of which is created or held by government. By opening up data, government can help drive the creation of innovative business and services that deliver social and commercial value.

Participation and engagement – participatory governance or for business and organizations engaging with your users and audience. Much of the time citizens are only able to engage with their own governance sporadically, maybe just at an election every 4 or 5 years. By opening up data, citizens are enabled to be much more directly informed and involved in decision-making. This is more than transparency: it’s about making a full “read/write” society, not just about knowing what is happening in the process of governance but being able to contribute to it.

How to Open Up Data

If you are looking for practical, more detailed advice on how to open up data, have a look at the Open Data Handbook, which discusses the legal, social and technical aspects of opening up data. Here we provide some short suggestions for initial steps.

3 Key Rules

There are three key rules we recommend following when opening up data :

Keep it simple. Start out small, simple and fast. There is no requirement that every dataset must be made open right now. Starting out by opening up just one dataset, or even one part of a large dataset, is fine; of course, the more datasets you can open up the better. Remember this is about innovation. Moving as rapidly as possible is good because it means you can build momentum and learn from experience. Innovation is as much about failure as success, and not every dataset will be useful.

Engage early and engage often. Engage with actual and potential users and re-users of the data as early and as often as you can, be they citizens, businesses or developers. This will ensure that the next iteration of your service is as relevant as it can be. It is essential to bear in mind that much of the data will not reach ultimate users directly, but rather via ‘infomediaries’. These are the people who take the data and transform or remix it to be presented. For example, most of us don’t want or need a large database of GPS coordinates, we would much prefer a map. Thus, engage with infomediaries first. They will re-use and repurpose the material.


Address common fears and misunderstandings. This is especially important if you are working with or within large institutions such as government. When opening up data you will encounter plenty of questions and fears. It is important to (a) identify the most important ones and (b) address them at as early a stage as possible.

Open Data and PSI

While the terms PSI (Public Sector Information) and Open Data are quite often used without distinction (and overlap most of the time), a strict definition of PSI according to the PSI Directive reveals certain discrepancies between the two. Moreover, one should keep in mind that both the PSI Directive and the so-called Open Data Movement provide a core of rules and principles that may be implemented in slightly different ways within different countries and different existing legal frameworks. That being said, comparing the PSI Directive provisions with the widely acknowledged Open Data features leads to the following points:

- PSI refers to «documents held by public sector bodies». While the PSI Directive encourages public sector bodies to make any of their documents available for re-use, it also sets some access and re-use restrictions on such documents: firstly, the Directive doesn’t contain an obligation to allow re-use, thus leaving each EU State or public sector body to decide whether a document should be re-usable or not; secondly, the Directive doesn’t change the national rules for access to documents, so that each EU State can maintain its own access restrictions (usually due to privacy or national security concerns). In addition, the PSI Directive currently doesn’t apply to documents held by public service broadcasters, educational and research establishments, and cultural establishments.
- Open Data refers to «data» as a potentially much broader term which may involve any kind of work, knowledge, data (both public data and privately funded data) or information with no given source limitations. Access restrictions are conceived mainly for data affecting privacy, confidentiality or public security.
- PSI can be made available at a charge for re-use. The PSI Directive sets the upper limit of charges at the recovery of the total costs of collecting, producing, reproducing and disseminating documents, together with a reasonable return on investment, while leaving the right to ask for lower charges or no charges at all. In addition, the Directive encourages making documents available at charges that do not exceed the marginal costs of reproducing and disseminating the documents.
- Open Data is traditionally meant to be available at no more than a reasonable reproduction cost. Yet online availability without charge is the first-choice option.
- PSI itself does not affect the existence or ownership of intellectual property rights of public sector bodies: while public sector bodies might be encouraged by the Directive to exercise their copyright in a way that facilitates re-use, the default rule adopted by the Directive seems to be the traditional "all rights reserved" copyright rule. Therefore, should a public sector body have any intellectual property right on its information, it is up to the public sector body itself to decide how broadly its information is licensed.
- Open Data experts specifically require data to adopt an Open License (e.g. Creative Commons, Open Government License) in order to be disseminated in a truly open fashion, thus aspiring to a "some rights reserved" copyright rule.


2. OPEN DATA MANAGEMENT. MODELS AND ARCHITECTURES

Context & purpose :

This section gives a global view of an Open Data portal architecture. Figure 2 shows the model of a technical architecture for an Open Data platform. Figure 3 shows a model of an Open Data solution environment. Figure 4 illustrates the process of publishing internal raw data, up to the moment this data is consumed by final users. Figure 5 illustrates the Open Data workflow.

Structure components :

- Content Management System
- Indexing & Search
- Datasets & Metadata
- Data Catalog Vocabularies
- Language Vocabularies
- Linked Data (SPARQL & RDF Index)

Each chapter contains a visual support showing the part of the structure that the chapter is dedicated to.

Model of an open data technical architecture :

Figure 2 : Model of an open data technical architecture

Figure 2 shows the technical environment of an Open Data portal and the different parts composing it. This chart serves as a support for the present State of the Art, with each chapter describing one component of the whole structure.

Model of an Open Data environment :

Figure 3 : Model of an open data environment

Figure 3 represents an abstraction of an Open Data solution and of the environment necessary for communication with external platforms. The chart shows different questions related to catalog federation, which is one of the most crucial subject matters of the HOMER Open Data project.


Model of data publishing process :

Figure 4 : Model of data publishing process

In this model (Figure 4), data from the database is exported, transformed to an open, readable format (e.g. CSV), properly named and stored on the web server. This means entrepreneurs can get all your data, load it into their own system and design their API according to their use case. Also, high loads will hit their own infrastructure without affecting other apps.

The Open Data Workflow

Figure 5 shows the process of using Open Data, which resembles a traditional data processing workflow. First, data has to be imported from heterogeneous sources. This may include proprietary file formats as well as already machine-readable data sources. Thereafter, the data has to be integrated and filtered so that further processing is possible with rather generic approaches. Second, the so-called business logic takes place. This includes normalizations, aggregations and further data enrichment. The final goal is then to present the data and its clues to the user or audience in an appropriate way, for example with powerful visualizations.

3. CONTENT MANAGEMENT SYSTEMS FOR OPEN DATA PORTALS

Context & purpose :

In this section we specify different Content Management Systems (CMSs) used for Open Data portals. We are specifically interested in open source CMS solutions that include specific Open Data modules. We have selected the CMSs that can be considered reference solutions in Open Data in general, and particularly among HOMER partners.

State of the art of existing solutions :

We list and detail the open source CMS solutions most widely used for Open Data portals. We leave aside the CMSs whose complexity does not allow adding specific open data modules. You can find a complete list of open source CMS solutions for Open Data portals in the section «References».

Drupal :

The software is ready to use upon download and also includes a Web-based installer and add-on modules. The software supports content management, collaborative authoring, newsletters, podcasts, image galleries, peer-to-peer networking, file uploads/downloads and more.

Joomla :

Joomla 2.5’s ability to let users extend Joomla with libraries, in addition to components, plugins and modules, will create a platform that is much more extensible and allows for greater collaboration between developers. We think that the new Joomla platform, as well as the CMS of course, will be ideal for cleansing and manipulating open data.

TYPO3 :

TYPO3 is a free and open source web content management framework based on PHP. TYPO3 is, along with Drupal, Joomla! and WordPress, among the most popular content management systems worldwide; however, it is more widespread in Europe than in other regions. Due to its features, scalability and maturity, TYPO3 is used to build and manage websites of different types and size ranges, and especially Open Data projects.

EzPublish :

eZ Publish is an open source enterprise PHP content management system developed by the Norwegian company eZ Systems. The eZ Publish range of features includes professional and secure development of web applications. Functional areas include content versioning, media library, role-based rights management, mobile development, sitemaps, search and printing. This CMS is used because it allows specific or generic extensions to be added, which makes the organization’s information system more flexible.

WordPress :

WordPress is a free and open source blogging tool and content management system (CMS) based on PHP and MySQL. It has many features including a plug-in architecture and a template system. WordPress is used by over 16.7% of Alexa Internet’s «top 1 million» websites and as of August 2011 manages 22% of all new websites. WordPress is currently the most popular blogging system in use on the Web. WordPress provides the functionality to customise CMS options and to set up custom content types or import vocabularies.

Each CMS platform has a module highlighting Open Data base applications.

Figure 5 : Data Search engines used by Homer Partners


Apps Discovery (App Store)

It is a complementary element of a traditional CMS solution, giving access to the applications created with datasets from an Open Data portal. It is aimed at referencing the applications (Web, mobile or desktop) developed based on the files from data catalogs. The Apps Discovery can be considered both as a showcase of an Open Data portal and as a support for entrepreneurs and developers.

CKAN Extension

A developer community has programmed multiple extensions making the most popular CMSs (WordPress, Drupal) compatible with the CKAN API. For instance, we can mention the CG33 initiative and the WordPress CKAN API extension (links in «Links to related studies & publications»).

Feedback from HOMER partners :

CSI Piedmont - January 28, 2013

CSI Piedmont added the description of the Open Data portal solution it adopted. As an alternative to the Drupal CMS with CKAN, they explain the solution adopted in Piedmont Region, Emilia-Romagna Region and Milan Municipality. The solution is built with the open source CMS Joomla for the portal and a data search engine based on the open source solutions Apache Solr and Lucene. The model implements an open source API layer compliant with the standard CKAN model, so that it can interoperate with metadata catalogs.

Tips and recommendations taken from different studies on Open Data :

A CMS is not mandatory to develop an Open Data portal. It is entirely possible to integrate just a data catalog. The CMS provides the editorial content management and multimedia features often associated with an Open Data portal. Drupal and WordPress are open source CMSs with multiple gateways to Open Data. The CKAN API or similar indexing systems can be easily integrated. Drupal is usually considered a reference CMS solution for Open Data portals.

Links to related studies & publications :

Typo3 http://typo3.org/
Drupal http://drupal.org/
Joomla http://www.joomla.org/
EzPublish http://ez.no/fr/
Wordpress http://wordpress.com/
WordPress extension for CKAN ckan.org/2011/04/04/wordpresser-extension-released/
WordPress CKAN API extension https://github.com/okfn/ckanext-wordpresser
Ckan http://ckan.org/
CG33 initiative https://github.com/datalocale/drupal-datalocale

Figure 6 : List of CMSs used for Open Data Portals


4. DATA SEARCH AND INDEXING

Context & purpose :

This section addresses indexing, catalog and dataset search-related issues, which are key parts of an Open Data platform. «Full text» search technologies are the simplest ones to integrate into an Open Data portal, thanks to various custom modules. In this chapter we present different indexing and search solutions and give HOMER partners’ feedback on the creation of a federated index.

State of the art of existing solutions :

Two catalog search and indexing technologies are the most widely used. A third one (SPARQL) is frequently used in the context of Linked Data (addressed in chapter 9).

Apache Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.

CKAN API

CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones). CKAN is the place to search for open knowledge resources as well as register your own. The CKAN REST API is for reading and writing metadata about packages and tags. The output of the API is JSON. CKAN is a powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available.

Technology | Use
Apache SOLR customisation (for search engine) | PACA / Piemont / Milan Municipality / Sicilia during 2013
CKAN API (for data sharing) | Veneto / Piemont
SPARQL (for LOD query) | Piedmont / Emilia-Romagna Region / Sicilia / Veneto

Figure 7 : Data search engines used by Homer Partners

Feedback from HOMER partners :

Piedmont region :

Piedmont region proposes a search index federated across all the platforms using the infrastructure made available by the Piedmont region (Open Data Piedmont) and a search based on the CKAN API.

Vision of federated search by Piedmont :
»» integrate the partner portals in order to allow searchability beyond national borders and to implement the federation techniques
»» build a unique federated index by harvesting the metadata catalogs of open data portals that use the same platform (dati.piemonte.it), a CKAN-like platform (using the CKAN API), a Geoportal (using the CSW protocol) or a customized open data platform (in this case custom APIs can be developed for integration if necessary)
»» expose the metadata form with an API interface
»» search on the federated index
»» for the front-end functions, run external searches on other platforms (CKAN) not integrated in the federated index
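As an illustration of the federated index idea described above, here is a minimal in-memory sketch. It is not the Piedmont implementation: the class names, the record fields (`id`, `title`, `description`) and the portal labels are assumptions made for the example, and a real deployment would rely on a search engine such as Apache Solr instead of a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One dataset description harvested from a partner portal (hypothetical fields)."""
    portal: str
    dataset_id: str
    title: str
    description: str = ""


@dataclass
class FederatedIndex:
    """In-memory stand-in for the unique federated index."""
    records: dict = field(default_factory=dict)

    def harvest(self, portal, datasets):
        """Add or refresh every dataset exposed by one portal."""
        for d in datasets:
            key = (portal, d["id"])
            self.records[key] = Record(portal, d["id"], d["title"],
                                       d.get("description", ""))

    def search(self, query):
        """Return records whose title or description contains every query word."""
        words = query.lower().split()
        hits = []
        for rec in self.records.values():
            text = f"{rec.title} {rec.description}".lower()
            if all(w in text for w in words):
                hits.append(rec)
        return sorted(hits, key=lambda r: (r.portal, r.dataset_id))


# Two portals are harvested into one index, then searched together.
index = FederatedIndex()
index.harvest("piemonte", [{"id": "1083", "title": "Road network",
                            "description": "Regional roads"}])
index.harvest("veneto", [{"id": "a1", "title": "Road accidents 2012"}])
results = index.search("road")
print([(r.portal, r.dataset_id) for r in results])
```

Keying each record by portal and identifier means that harvesting the same portal again refreshes its entries instead of duplicating them, which is the behaviour a periodic harvest would need.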


Please find below the federated schema implemented in the present solution involving Piedmont and Emilia-Romagna Region. It summarizes how the federated engine works : the API can interoperate with the data portal of a CKAN instance, with the API of a generic open data portal and with the catalog service of a geoportal using the CSW protocol.

Figure 8 : Vision of federated search by Piedmont

The solution proposed by Piedmont is very efficient, and the search engine indexer can work with open data portals that expose metadata in 3 different ways:

1. Portals like www.dati.piemonte.it, using 2 URLs:
a. one returns an XML file with the list of the data ids (i.e. http://www.dati.piemonte.it/index.php?option=com_rd&view=pceli_list2&format=xml&layout=xml)
b. one returns an XML file with the attributes of a single data item (i.e. http://www.dati.piemonte.it/index.php?option=com_rd&view=pceli_item2&format=xml&layout=xml&itemid=1083)

2. CKAN-like portals, using 2 URLs:
a. one returns a JSON file with the list of the data ids (i.e. http://it.ckan.net/api/rest/package)
b. one returns a JSON file with the attributes of a single data item (i.e. http://it.ckan.net/api/rest/package/{id})

3. Geoportals exposing metadata with the CSW protocol: the engine exposes an API that gets the CSW file and inserts it into the federated index in CKAN format (i.e. the call to http://dev-psi.csi.it:8080/rpapisrv/api/2/search/csw?getRecordsUrl=http:// returns metadata in CKAN format)

Evolution of functionalities: publication of LOD, creation of a SPARQL endpoint for LOD queries and transition to a semantic search engine.

Veneto region :

Other solutions can also be used. Here is the Veneto region’s feedback :
- Avoid centralizing the storage of catalogs with a single partner.
- Adopt API schemes based on international standards.
- Evaluate the opportunity to extend the federation to a cluster of native PostgreSQL; this would avoid federation through harvesting by the API, thus reinforcing the concept of metadata-driven federation.
- Implement a robust and independent solution, regardless of the user management and workflow processes.
- An initial thought on federation architecture is that each open data portal could have an institutional role but could also aggregate data sets published on other open data portals (e.g. partners’ portals). For example, CKAN provides several interoperability and harvesting options (ckan.org/features/federate/). This concept could lead to a decentralized federation scheme.

Tips and recommendations taken from different studies on Open Data :

“Indexing and dataset search engines” is one of the key topics of the HOMER project. There are several existing approaches, and choosing the techniques appropriate to each of them is crucial to define the level of interoperability between the different partners. CSI Piedmont proposes a centralized federated index which responds to the problem of multi-portal indexing. It should be studied and possibly combined with other, more decentralized solutions. The specific development needed to implement a federated index remains to be solved for each of the partners.

Links to related studies & publications :

CKAN API http://docs.ckan.org/en/latest/api.html
Apache Solr http://lucene.apache.org/solr/
SPARQL http://www.w3.org/TR/rdf-sparql-query/
Open Data and Federation Technical specific from CSI http://www.homerproject.eu/deliverables/wp5/finish/18-wp5/64-open-data-and-federation-technical-specific-from-csi
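The two-URL harvesting pattern that the Piedmont feedback describes for CKAN-like portals (one call for the list of ids, then one call per id) can be sketched as follows. This is only an assumption-laden illustration: the `/api/rest/package` path is taken from the URLs quoted in this section and may differ on other portals or CKAN versions, `portal.example` is a placeholder host, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode its JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def harvest_ckan(base_url, fetch=fetch_json):
    """Yield one metadata dict per dataset of a CKAN-like portal.

    Step a: fetch the list of dataset ids.
    Step b: fetch the attributes of each dataset, one call per id.
    """
    list_url = f"{base_url}/api/rest/package"
    for dataset_id in fetch(list_url):
        yield fetch(f"{list_url}/{dataset_id}")


# Offline demonstration: a dictionary stands in for a real portal.
FAKE_PORTAL = {
    "http://portal.example/api/rest/package": ["roads", "schools"],
    "http://portal.example/api/rest/package/roads":
        {"name": "roads", "title": "Road network"},
    "http://portal.example/api/rest/package/schools":
        {"name": "schools", "title": "School locations"},
}
harvested = list(harvest_ckan("http://portal.example",
                              fetch=FAKE_PORTAL.__getitem__))
print([d["title"] for d in harvested])
```

Passing the download function as a parameter keeps the harvesting logic testable without network access; the same structure would apply to the XML-based dati.piemonte.it interface with a different parser.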


5. DATASET MANAGEMENT

Context & purpose :

In this section we specify different file formats that can be used for Open Data datasets. The process of dataset collection in every organization should be guided by a search for the most structured data, in an open format that can be easily reused. A dataset is a collection of data records for computer processing. A dataset is described by metadata. The following state of the art covers the different types of existing file formats as well as the metadata describing a dataset.

State of the art of existing solutions :

Here is the list of different formats used in existing Open Data platforms.

CSV : A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form.
XLS : (Microsoft Excel file format) Main spreadsheet format which holds data in worksheets, charts, and macros.
PDF : Portable Document Format (PDF) is a file format used to represent documents in a manner independent of application software, hardware, and operating systems.
DOC : (an abbreviation of ‘document’) is a filename extension for word processing documents, most commonly in the Microsoft Word Binary File Format.
XML : Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
RDF : The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model.
KML : Keyhole Markup Language (KML) is an XML notation for expressing geographic annotation and visualization within Internet-based, two-dimensional maps and three-dimensional Earth browsers.
SHP : The Esri shapefile, or simply a shapefile, is a popular geospatial vector data format for geographic information system software.
ODS : The Open Document Format for Office Applications (ODF), also known as OpenDocument, is an XML-based file format for spreadsheets, charts, presentations and word processing documents.
KMZ : KML files are very often distributed in KMZ files, which are zipped files with a .kmz extension.
JSON : JavaScript Object Notation is a text-based open standard designed for human-readable data interchange.
TXT : A text file (sometimes spelled «textfile»; an old alternate name is «flatfile») is a kind of computer file that is structured as a sequence of lines of electronic text.
HTML : HyperText Markup Language (HTML) is the main markup language for creating web pages and other information that can be displayed in a web browser.
TIFF : (originally standing for Tagged Image File Format) is a file format for storing images, popular among graphic artists, the publishing industry, and both amateur and professional photographers in general.
JPEG : Joint Photographic Experts Group is the most common image format used by digital cameras and other photographic image capture devices.

Metadata : Data that serves to provide context or additional information about other data. For example, information about the title, subject, author, typeface, enhancements, and size of the data file of a document constitutes metadata about that document. It may also describe the conditions under which the data stored in a database was acquired, its accuracy, date, time, method of compilation and processing, etc.

Metadata elements can be subdivided into three basic categories:

Descriptive Metadata :
»» Describes and identifies information resources
»» At the local (system) level, enables searching and retrieving (e.g., searching an image collection to find paintings of animals)
»» At the Web level, enables users to discover resources (e.g., searching the Web to find digitized collections of poetry)

Structural Metadata :
»» Facilitates navigation and presentation of electronic resources
»» Provides information about the internal structure of resources including page, section, chapter numbering, indexes, and table of contents
»» Describes relationships among materials (e.g., photograph B was included in manuscript A)
»» Binds the related files and scripts (e.g., File A is the JPEG format of the archival image File B)

Administrative Metadata :
»» Facilitates both short-term and long-term management and processing of digital collections
»» Includes technical data on creation and quality control
»» Includes rights management, access control and use requirements
»» Preservation action information

Different metadata vocabularies will be described in chapters 6 and 7.
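To make the three metadata categories more tangible, the sketch below groups a dataset description into descriptive, structural and administrative parts and serializes it to JSON. The field names and sample values are assumptions chosen for the example; they do not come from a HOMER template or from any specific metadata vocabulary.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DescriptiveMetadata:
    """Identifies the resource so that it can be searched and discovered."""
    title: str
    subject: str
    author: str


@dataclass
class StructuralMetadata:
    """Describes the internal structure and links to related files."""
    file_format: str
    related_files: list = field(default_factory=list)


@dataclass
class AdministrativeMetadata:
    """Supports management: creation data, rights and access control."""
    created: str            # ISO 8601 date, e.g. "2013-10-21"
    licence: str
    access: str = "public"


@dataclass
class DatasetMetadata:
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata

    def to_json(self):
        """Serialize the nested record as indented JSON text."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = DatasetMetadata(
    DescriptiveMetadata("Road network", "transport", "Example Region"),
    StructuralMetadata("CSV", related_files=["roads.kml"]),
    AdministrativeMetadata("2013-10-21", "CC-BY"),
)
print(record.to_json())
```

Keeping the three categories as separate objects makes it easy to see which fields serve discovery, which describe structure and which support rights and preservation management.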


Feedback from homer partners : No feedback for this version.

Tips and recommendations taken from different studies on Open Data :

In essence, Tim Berners-Lee’s 5-star system (Figure 9) is a re-user’s Maslow pyramid, wherein the first star reflects the re-user’s basic needs and the fifth star its finest hour. Although not specifically targeted at PSI (but rather at ‘open data’), the star system will likely help public sector bodies (PSBs) to achieve their ambitions of facilitating re-use, since it allows them to assess how favourable their data is to re-use and gives them a guiding star for structured improvement.

Multimedia data such as videos or sounds are not yet optimised for referencing in Open Data platforms. Of course, it is possible to add metadata to multimedia files in order to make them accessible to data search. There are some ongoing studies that aim at obtaining a higher level of accuracy for metadata. For instance, we can mention the M3O initiative (The Multimedia Metadata Ontology) (http://ceur-ws.org/Vol-539/paper_9.pdf) or COMM (Core Ontology for Multimedia), whose goal is to make multimedia metadata more accessible.

Links to related studies & publications :
Open Metadata Handbook http://en.wikibooks.org/wiki/Open_Metadata_Handbook/Open_Metadata
Details of Tim Berners-Lee’s 5-star scheme http://5stardata.info/

Tim Berners-Lee’s 5-star scale :

Figure 9 : Tim Berners-Lee’s 5-star scale

Here is the list of different formats used in existing Open Data platforms.

File formats : CSV, XLS, PDF, DOC, XML, RDF, KML, SHP, ODS, KMZ, JSON, TXT, HTML, TIFF, JPEG

Figure 10 : List of file formats based on Tim Berners-Lee’s star scale
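As an illustration of how a portal might rate its files against the 5-star scale, here is a minimal Python sketch. The star descriptions follow the scheme published at 5stardata.info. The format-to-star mapping is an assumption limited to the four examples given there (PDF, XLS, CSV, RDF) and does not reproduce the recommendations of Figure 10. The names `STAR_LEVELS`, `FORMAT_STARS`, `star_rating` and `describe` are hypothetical.

```python
from typing import Optional

# The five levels of the scale, paraphrased from 5stardata.info.
STAR_LEVELS = {
    1: "available on the Web under an open licence, in any format",
    2: "available as machine-readable structured data",
    3: "available in a non-proprietary open format",
    4: "uses URIs to denote things (e.g. RDF)",
    5: "linked to other data to provide context",
}

# Assumption: only the example formats named by 5stardata.info are rated here.
# RDF reaches the fifth star only when the data also links to other datasets.
FORMAT_STARS = {"PDF": 1, "XLS": 2, "CSV": 3, "RDF": 4}


def star_rating(file_format: str) -> Optional[int]:
    """Return the star level for a file format, or None if it is not rated."""
    return FORMAT_STARS.get(file_format.strip().upper())


def describe(file_format: str) -> str:
    """Return a one-line, human-readable rating for a file format."""
    stars = star_rating(file_format)
    if stars is None:
        return f"{file_format}: not rated in this sketch"
    return f"{file_format}: {'*' * stars} ({STAR_LEVELS[stars]})"


if __name__ == "__main__":
    for fmt in ("PDF", "XLS", "CSV", "RDF", "SHP"):
        print(describe(fmt))
```

A partner could extend `FORMAT_STARS` with the other formats listed above once the rating of each format has been agreed.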


6. DATA CATALOG VOCABULARIES

Context & purpose : Every Open Data portal includes a data catalog, which needs to be described in order to improve interoperability between different datasets. Each dataset can be described by metadata to ease its reuse. Several vocabularies exist, so we will try to establish the state of the art of these vocabularies, which can help adapt data catalogs for Open Data portals.

State of the art of existing solutions :

DCAT vocabulary : Machine-readable data catalogs
DCAT is an RDF vocabulary well-suited to representing government data catalogs such as Data.gov and data.gov.uk. DCAT defines three main classes:
»» dcat:Catalog represents the catalog
»» dcat:Dataset represents a dataset in a catalog
»» dcat:Distribution represents an accessible form of a dataset, for example a downloadable file, an RSS feed or a web service that provides the data.

VOID vocabulary : Cataloging five star (cf. Figures 9 & 10) government data
The Vocabulary of Interlinked Datasets (VoID) is similar to DCAT, in that it is an RDF-based metadata schema. However, while DCAT can be used to describe any data catalogue, VoID is used to describe Linked datasets. With VoID the discovery and usage of linked datasets can be performed both effectively and efficiently. A Linked dataset is a collection of data, published and maintained by a single provider, available as RDF, and accessible, for example, through dereferenceable HTTP URIs or a SPARQL endpoint.

In addition to common metadata schemas, common models of how to present and structure data are required to avoid creating islands of open data in Europe, which could lead to fragmented open data initiatives.

Joinup is a new collaborative platform created by the European Commission and funded by the European Union via the Interoperability Solutions for Public Administrations (ISA) programme.

Catalog Service for the Web (CSW) is a standard for exposing a catalogue of geospatial records on the Internet. CSW is one part of the OGC Catalog Service, which defines common interfaces to discover, browse, and query metadata about data, services, and other potential resources. The catalogue is made up of metadata records that describe these types of data :
»» geospatial data (e.g. KML)
»» geospatial services (e.g. WMS)
»» other related resources
CSW 2.0.0 version integrates INSPIRE recommendations related to catalogue of geospatial records.
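To make the idea of querying a CSW catalogue concrete, the following sketch builds a GetRecords request URL using the key-value-pair encoding. It only assembles the URL and sends nothing over the network. The endpoint address and the helper names `build_getrecords_url` and `query_params` are hypothetical. The protocol version 2.0.2 and the queryable name `csw:AnyText` are assumptions that should be checked against the capabilities document of the target server.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getrecords_url(endpoint: str, keyword: str, max_records: int = 10) -> str:
    """Build a CSW GetRecords request (KVP encoding) for a full-text search."""
    params = {
        "service": "CSW",
        "version": "2.0.2",  # assumption: the server speaks CSW 2.0.2
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "summary",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        # Assumption: full-text queryable; the keyword must not contain quotes.
        "constraint": f"csw:AnyText LIKE '%{keyword}%'",
        "maxRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


def query_params(url: str) -> Dict[str, List[str]]:
    """Decode the query string of a URL back into a dictionary for inspection."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    # Hypothetical endpoint used for illustration only.
    print(build_getrecords_url("https://catalogue.example.org/csw", "water"))
```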

Feedback from homer partners : Vocabulary systems used by HOMER partners :

Vocabulary    Recommendations
DCAT          must
VOID          should
Joinup        -
CSW           must

Figure 11 : Vocabulary systems used by HOMER partners

The namespace for DCAT is http://www.w3.org/ns/dcat#. However, it should be noted that DCAT makes extensive use of terms from other vocabularies, in particular Dublin Core. DCAT itself defines a minimal set of classes and properties of its own.
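The three DCAT classes (catalog, dataset, distribution) and the reuse of Dublin Core terms can be sketched as a small Turtle generator. This is a minimal illustration and not a complete DCAT profile. The function names `turtle_literal` and `describe_dataset`, the distribution URI pattern and all example.org addresses are hypothetical. Only the two namespaces and the terms `dcat:dataset`, `dcat:distribution`, `dcat:downloadURL` and `dct:title` come from the vocabularies themselves.

```python
DCAT_NS = "http://www.w3.org/ns/dcat#"
DCT_NS = "http://purl.org/dc/terms/"  # Dublin Core terms, reused by DCAT


def turtle_literal(text: str) -> str:
    """Quote a single-line string as a Turtle literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def describe_dataset(catalog_uri: str, dataset_uri: str,
                     title: str, download_url: str) -> str:
    """Return a Turtle description of one catalog, dataset and distribution."""
    # Assumption: the distribution URI is derived from the dataset URI.
    distribution_uri = f"{dataset_uri}/distribution/1"
    lines = [
        f"@prefix dcat: <{DCAT_NS}> .",
        f"@prefix dct: <{DCT_NS}> .",
        "",
        f"<{catalog_uri}> a dcat:Catalog ;",
        f"    dcat:dataset <{dataset_uri}> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset ;",
        f"    dct:title {turtle_literal(title)} ;",
        f"    dcat:distribution <{distribution_uri}> .",
        "",
        f"<{distribution_uri}> a dcat:Distribution ;",
        f"    dcat:downloadURL <{download_url}> .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_dataset(
        "http://example.org/catalog",
        "http://example.org/dataset/bus-stops",
        "Bus stops",
        "http://example.org/files/bus-stops.csv",
    ))
```

In production a dedicated RDF library would normally handle serialisation and escaping. Plain string building is used here only to keep the sketch free of dependencies.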

GFOSS
GFOSS is using ADMS.SW to describe software ‘assets’ developed by its members. The ADMS.SW RDF description has been created using a spreadsheet and Google Refine [1], and the HTML+RDFa version of the description was generated using the visualisation functionality of the ADMS.SW validator [2] (example: joinup.ec.europa.eu/svn/adms_foss/admssw_federation/Ellak.gr/).

Tips and recommendations taken from different studies on Open Data : Implementing different existing vocabularies is crucial for an Open Data portal. In particular, the DCAT and CSW vocabularies are essential for reading data catalogs. Since the integration of INSPIRE in CSW, it has become crucial to use CSW to describe geospatial data.

Links to related studies & publications :
DCAT http://www.w3.org/2011/gld/wiki/Data_Catalog_Vocabulary
VOID http://semanticweb.org/wiki/VoID
Joinup http://joinup.ec.europa.eu/taxonomy/terms/5340
CSW http://en.wikipedia.org/wiki/Catalog_Service_for_the_Web

7. MULTI-LANGUAGE VOCABULARIES

Context & purpose : In this section we will talk about different multi-language vocabularies used in Europe. Language compatibility between different Open Data portals is the major purpose of the HOMER initiative. To meet this requirement, it is necessary to establish the state of the art of existing vocabularies and to understand their area of use.

State of the art of existing solutions : The SKOS format allows combining all necessary vocabulary and thesaurus families described below. SKOS is now the format for publishing thesauri over the web, as it is an RDF vocabulary specific to the terminology and structure of thesauri. In the SKOS modeling, preferred and non-preferred terms are all labels of the same concept, and this applies to all available languages (Isaac et al, 2009). In other words, a thesaurus is transformed into a set of concepts hierarchically organized by the usual BT/NT (broader/narrower) relationships, and all terms in the thesaurus in all languages are considered as labels of the same concept.
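The SKOS modeling described above, with one concept carrying preferred and non-preferred labels in several languages plus a broader/narrower hierarchy, can be sketched with a small data structure. This is an illustrative model only and not an implementation of the SKOS specification. The class `Concept`, its fields and the two example concepts with their example.org URIs are hypothetical and do not come from any of the thesauri discussed in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """A SKOS-style concept: one identifier, labels in many languages."""
    uri: str
    pref_labels: Dict[str, str]  # language tag -> preferred term
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # non-preferred terms
    broader: Optional["Concept"] = None  # BT (broader term) relationship

    def label(self, lang: str, fallback: str = "en") -> str:
        """Return the preferred label in `lang`, or in the fallback language."""
        return self.pref_labels.get(lang, self.pref_labels[fallback])

    def has_term(self, term: str, lang: str) -> bool:
        """True if `term` is a preferred or non-preferred label in `lang`."""
        candidates = [self.pref_labels.get(lang, "")] + self.alt_labels.get(lang, [])
        return term.lower() in (c.lower() for c in candidates if c)


# Hypothetical concepts used for illustration only.
environment = Concept(
    "http://example.org/concept/1",
    {"en": "environment", "fr": "environnement", "it": "ambiente"},
)
water = Concept(
    "http://example.org/concept/2",
    {"en": "water", "fr": "eau", "it": "acqua"},
    alt_labels={"en": ["H2O"]},
    broader=environment,
)

if __name__ == "__main__":
    print(water.label("fr"), "is narrower than", water.broader.label("fr"))
```

Because every term is attached to the same concept, a dataset tagged with the concept URI can be found through a search in any of the supported languages.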


AGROVOC is managed by the Food and Agriculture Organization of the United Nations (FAO), and covers all its areas of interest, such as agriculture, forestry, fisheries, food and related domains. It is available in 21 languages, with an average of 40,000 terms per language. AGROVOC is available in SKOS (with close to 32,000 concepts), and published as Linked Data.

EUROVOC is managed by the European Union, and covers all areas of interest of the European Union, with special attention to parliamentary subjects. It is available in 24 languages. EUROVOC is available as a SKOS resource (Smedt, 2009), with close to 7,000 concepts.

GEMET, the General Multilingual Environmental Thesaurus, covers the domain of environment, and it is available in 29 languages. It is published and managed by the European Environment Information and Observation Network. Its SKOS version consists of over 5,000 concepts, and it is also available as Linked Data.

The LCSH (Library of Congress Subject Headings) Thesaurus is the monolingual thesaurus (English) of subject headings, created for and maintained by the Library of Congress of the U.S.A. Its SKOS version consists of 30,000 concepts, and it is also available as Linked Data.

NALT, the National Agricultural Library Thesaurus, covers topics related to agriculture and is maintained by the National Agricultural Library of the U.S., USDA, and the Inter-American Institute for Cooperation on Agriculture (IICA) through the Orton Memorial Library, the Mexican Network of Agricultural Libraries (REMBA), as well as other Latin American agricultural institutions belonging to the Agriculture Information and Documentation Service of the Americas (SIDALC). It is available in two languages (English, Spanish). A SKOS version exists (consisting of some 30,000 concepts), but is not available as Linked Data.

RAMEAU (Répertoire d’Autorité-Matière Encyclopédique et Alphabétique Unifié, from the French National Library) covers a variety of areas (such as geography, proper names, collective bodies and titles) and is available in French only. A SKOS version is available, which consists of about 150,000 concepts, and an experimental Linked Data service is available.

STW (Standard-Thesaurus Wirtschaft), the Thesaurus for Economics, is a bilingual (English, German) thesaurus of the German National Library of Economics. It covers law, sociology, politics, and geography. It is available as a SKOS resource, also published as Linked Data, and includes about 6,500 concepts (Neubert, 2009).

Thesaurus    # Concepts    Number of languages    Recommendation
AGROVOC      32 000        20                     May
EUROVOC      7 000         24 (EU centric)        Must
GEMET        5 300         28                     May
LCSH         30 700        1 (EN)                 -
NALT         30 300        2 (EN/ES)              -
RAMEAU       16 500        1 (FR)                 -
STW          1 100         2 (EN/DE)              -

Figure 12 : List of most used thesauri in Europe

Feedback from homer partners : No feedback for this version.

Tips and recommendations taken from different studies on Open Data : EUROVOC seems to have become the reference thesaurus in the framework of the HOMER project, as it covers multiple concepts and the languages spoken by HOMER partners.

Links to related studies & publications :
SKOS http://www.w3.org/2004/02/skos/
AGROVOC http://aims.fao.org/website/AGROVOC-Thesaurus/sub
EUROVOC http://eurovoc.europa.eu/drupal/?q=fr
GEMET http://www.eionet.europa.eu/gemet/
LCSH http://id.loc.gov/authorities/subjects.html
NALT http://www.nal.usda.gov/news/NALT_LOD.shtml
Rameau http://rameau.bnf.fr/informations/rameauenbref.htm
STW http://zbw.eu/stw/versions/latest/about


8. LICENSING ISSUES

Context & purpose : Works that are published without an explicit license are usually subject, by default, to the copyright laws of the jurisdiction they are published in. These laws typically give several exclusive rights to the copyright holder – including the right to produce copies, and to produce derivative works. These rights prohibit unauthorised re-distribution and re-use by third parties – and can remain in effect until 70 years after the death of the author. While the protections offered by copyright laws are appropriate in many circumstances, there are also circumstances in which these protections may be unnecessarily restrictive. Open licenses enable creators to allow more freedom in what others can do with their works. Benefits of this freedom include:
»» allowing others to circulate the work freely – potentially giving it a greater circulation than if a single group or individual retained an exclusive right to distribute;
»» not forcing users to apply for permission every time they wish to circulate a copy of the work in question – which can be a time-consuming affair, especially if the work has many authors;
»» encouraging others to continuously improve and add value to a work;
»» encouraging others to create new works based on or derived from the original work – e.g. translations, adaptations, or works with a different scope or focus.

According to the meeting in Bologna in December 2012, choosing the licence for HOMER partners’ OD portals remains a high-priority topic. Please find below the state of the art for the use of different licences. Even though many partners have not made their choice yet, there is a rather complete table showing different partners’ choices.

State of the art of existing solutions :

Figure 13 : State of the art of existing licensing solutions

Zooming on licence interoperability complexity : Terms that may be used for a derivative work or adaptation
»» Creating a derivative work and licensing it under the new licence is possible (maybe just adding simple notes about the original data/work).
»» The creation of derivative work is arguably possible, but there is uncertainty (e.g. about licensed rights) or other problems. “Attribution stacking” issues are typically present.
»» Creation of derivative work under the proposed licence is impossible (as long as it is a “derivative work” in the sense of copyright law, i.e. it would infringe upon original ©).


Feedback from homer partners :

The Directorate for Research Innovation and Competitiveness of Piedmont Region
Please check out the table in the part concerning your member institution.

Legend :
CC = Creative Commons
BY = Attribution
SA = Share Alike
LO = License Ouverte
ODbL = Open Data Commons Open Database License
tbd = to be defined/completed
[FN0] INSERT HERE THE VARIOUS DOMAINS COVERED BY THE PROJECT.
[FN1] French national license, compatible with CC BY/ODC BY.
[FN2] Italian national license, compatible with CC BY/ODC BY.
[FN3] Not being itself a public administration, the Slovenian partner is not required to open up any data, but they plan to do so nevertheless.
[FN4] Translation/adaptation of the licensing statement of Regione Piemonte

Note about licensing choices
During the meeting, we suggested using Creative Commons (CC) BY 2.5 national licenses with an explicit licensing statement encompassing the database sui generis right (or a CC BY 3.0 Unported/International license with the same licensing statement and a summary in the national language, if a CC BY 2.5 national license is not available, as in Montenegro). Using CC0 is also perfectly compatible, as per the interoperability table discussed during the meeting; using CC BY 3.0 licenses in their “ported” European versions is also perfectly compatible, since they are just more permissive with respect to databases (they are almost equivalent to the use of CC0 on the sui generis database right, with an ordinary CC BY license managing copyright). Most national licenses (such as the French License Ouverte, the Italian Open Data License 2.0, the British Open Government License) are also OK for our purposes, since they are explicitly compatible with CC BY and ODC BY licenses (and arguably with other national attribution licenses). Open Data Commons (ODC) BY licenses are also, in principle, acceptable attribution licenses.

Corsica Region
The Etalab licence (French Open Licence) is equivalent to Creative Commons BY (attribution), enabling any type of re-use by anyone (including for commercial purposes), on the sole condition of acknowledging the source of the data.

Tips and recommendations taken from different studies on Open Data : Experience feedback from the Directorate for Research Innovation and Competitiveness of Piedmont Region is very valuable: it provides crucial elements for choosing the licence that best fits each partner’s needs. The whole study can be found in the links to related studies & publications below.

Links to related studies & publications :
License Interoperability http://www.homerproject.eu/deliverables/wp5/finish/18-wp5/65-license-interoperability
State Of The Art http://www.homerproject.eu/deliverables/wp5/finish/18-wp5/66-state-of-the-art


Figure 14 : state of art of the licences used by HOMER partners


9. LINKED DATA

Context & purpose : The Semantic Web is one of the most important web challenges of the coming years. In the field of Open Data it comes with different techniques and tools enabling data providers to make their datasets available in a semantic format. Even though it is not a priority, it is still important to know what Linked Data is and how it can be integrated into the Open Data approach.

State of the art of existing solutions : The Linked Data concept is based on mature technologies whose integration is simplified by modules enabling interconnection with Open Data platforms. Among these technologies we can mention RDF as well as the SPARQL query language. Resource Description Framework (RDF) is a graph model for formally describing Web resources and their metadata, in such a way that these descriptions can be processed automatically. SPARQL consists of a query language, a means of conveying a query to a query processor service, and the XML format in which query results will be returned. These technologies make it possible to add a semantic search engine to the traditional «full text» search (see Figure 15). Today, there are multiple «Linked Data» datasets (see Figure 16). One of the best examples is DBpedia.

A typical case of a large Linked Dataset is DBpedia, which essentially makes the content of Wikipedia available in RDF. The importance of DBpedia is not only that it includes Wikipedia data, but also that it incorporates links to other datasets on the Web, e.g. to Geonames. Thanks to those extra links (expressed as RDF triples), applications can exploit the extra (and possibly more precise) knowledge from other datasets; by integrating facts from several datasets, an application may provide a much better user experience.

Figure 15 : Querying data over the Web. We can see a natural language query over two search engines; the corresponding SPARQL representation; and the semantic gap between the user’s information needs and the data representation.
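The two sides of SPARQL mentioned above, the query language and the XML format of the results, can be illustrated with a short sketch. It defines a query written against the DBpedia ontology and parses a SPARQL Query Results XML document using only the standard library. No network request is made. The sample response, including the example.org city and its population value, is invented for illustration, and the function name `parse_results` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace of the W3C SPARQL Query Results XML Format.
SPARQL_RESULTS_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# A query that could be sent to a SPARQL endpoint exposing the DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
}
LIMIT 5
"""

# Invented response, shaped like what an endpoint would return for QUERY.
SAMPLE_RESPONSE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="city"/><variable name="population"/></head>
  <results>
    <result>
      <binding name="city"><uri>http://example.org/resource/CityA</uri></binding>
      <binding name="population"><literal>1000</literal></binding>
    </result>
  </results>
</sparql>"""


def parse_results(xml_text: str) -> List[Dict[str, str]]:
    """Turn a SPARQL XML result document into a list of variable/value rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", SPARQL_RESULTS_NS):
        row = {}
        for binding in result.findall("sr:binding", SPARQL_RESULTS_NS):
            value = binding[0]  # the <uri>, <literal> or <bnode> child element
            row[binding.get("name")] = value.text
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in parse_results(SAMPLE_RESPONSE):
        print(row["city"], row["population"])
```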

Figure 16 : This chart shows the datasets available in Linked Data and the links between them (legend categories: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences). The chart is constantly updated by Richard Cyganiak: http://richard.cyganiak.de/2007/10/lod/

Feedback from homer partners : At present, only the Piedmont region is working on creating a “SPARQL endpoint” for their Open Source solution.

Tips and recommendations taken from different studies on Open Data : Linked Data is not a priority when creating an Open Data portal, as many technical issues should be solved before integrating modules or datasets in Linked Data. Nevertheless, it is a medium-term outlook, as it seems evident that in 2013 the Open Data initiative should take semantic issues into account.

Links to related studies & publications :
SPARQL Query Language for RDF http://www.w3.org/TR/rdf-sparql-query/
Linked Data http://www.w3.org/DesignIssues/LinkedData
Linked Data: Principles and State of the Art http://www.w3.org/2008/Talks/WWW2008-W3CTrack-LOD.pdf

10. OPEN DATA PORTALS

Context & purpose : This chapter aims at giving a global and clear view of existing Open Data portals created by the HOMER partners who have already launched an Open Data initiative. Some regions and organizations have proposed that other partners re-use their existing Open Source solutions for Open Data portals. There are different Open Source software packages matching the needs of certain organizations, each one having its advantages and disadvantages.

State of the art of existing solutions :

Solution name         CMS       Organization using this solution                                           Licences
Open Data Piemonte    Joomla    Piedmont, Emilia-Romagna Region, Sicilia during 2013 / Milan Municipality  GNU GPL
Lutece                Custom    Paris                                                                      BSD
CKAN software         Drupal    http://open-data.europa.eu/, data.gov.uk                                   GNU GPL
In Cité Solution      Typo3     Rennes, Nantes                                                             GNU GPL

Figure 17 : Open source solutions for Open Data portals

These solutions, which often include a CMS as well as search engine and API modules, have a number of advantages :
»» Open source data portal
»» Easy to use
»» Easy to install
»» Easy to re-use
But also some drawbacks :
»» Limited technical choices
»» Possible oversizing with respect to partners’ needs
Choosing a complete Open Data solution can be relevant after a thorough analysis of needs and restrictions related to the information system used by the organization. Other regions have chosen to develop their own search engine module or API system based on different Open Source CMS solutions and on data cataloging, search and indexing services that fit their information systems.

SaaS Open Data Platforms
Most existing Open Source OD platforms are listed in the ad hoc section of the State of the Art. In addition, many companies offer different SaaS (Software as a Service) solutions. These solutions can be used by partners seeking to get a platform from scratch, including hosting, maintenance and support provided by an external service provider. Besides, this kind of solution constantly evolves in line with the latest technical developments in the field of Open Data. Here are the 3 major actors in this market:
»» Open Data Soft Portal
»» Microsoft Datalab (OGDI)
»» Socrata

Focus on the European Commission Open Data Portal
The European Commission has launched its open data portal in public beta. At launch, the portal provides access to 5800+ data sets, mostly coming from Eurostat. With this portal the European Commission intends to lead by example in opening up public sector information pro-actively for free re-use in Europe, as part of the European open data strategy. The European Commission portal is for now aimed at publishing the European Commission’s own data, but is at the same time a first step towards a pan-European data portal that will provide access to all underlying national (and regional, local) data portals across the 27 Member States. The portal has a SPARQL endpoint to provide linked data, and will also point to applications that help work with the data.

Core services
Current data catalogue services:
Reading
»» Human-facing (via Drupal and CKAN)
»» Machine-facing in the API (via CKAN – some elements not yet available)
Writing
»» Human-facing (via Drupal and CKAN)
»» Import scripts from the Office of National Statistics (ONS) and Data4NR, occasional bulk imports
Searching
»» Human-facing structured search (from SOLR via Drupal)
»» Machine-facing in the API (from CKAN – not structured)
Linked Data
»» Access to SPARQL endpoints for key datasets
»» API for Linked Data
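The machine-facing reading and searching services provided through CKAN can be illustrated with a short sketch based on the CKAN action API. It builds a `package_search` URL and extracts dataset titles from a JSON response. No request is sent. The portal address and the sample response are invented for illustration, the function names `build_search_url` and `dataset_titles` are hypothetical, and the sketch assumes a portal exposing version 3 of the action API.

```python
import json
from typing import List
from urllib.parse import urlencode


def build_search_url(portal_url: str, query: str, rows: int = 10) -> str:
    """Build a dataset search URL for a CKAN portal (action API, version 3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> List[str]:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    return [dataset["title"] for dataset in payload["result"]["results"]]


# Invented response with the general shape returned by package_search.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {"name": "bus-stops", "title": "Bus stops",
             "resources": [{"format": "CSV",
                            "url": "http://example.org/files/bus-stops.csv"}]},
        ],
    },
})

if __name__ == "__main__":
    # Hypothetical portal address used for illustration only.
    print(build_search_url("https://portal.example.org", "transport"))
    print(dataset_titles(SAMPLE_RESPONSE))
```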

Feedback from homer partners : Page 40 - 41

Tips and recommendations taken from different studies on Open Data : The maturity level of the different Open Data portals created by HOMER partners is heterogeneous. Some regions, such as Piedmont, Veneto or PACA, are already well advanced in their own Open Data approach and have provided rich experience feedback. However, most other partners have only started working on Open Data by identifying datasets corresponding to the 5 main federation topics. Their technical approach has not yet been developed.

Links to related studies & publications :
Existing Open Data Portals http://www.homerproject.eu/deliverables/wp5/finish/18-wp5/67-existing-open-data-portals
In Cité Solution http://opendata.in-cite.net/index.php?id=36
Open Data Piemonte http://sourceforge.net/projects/odpiemonte/
CKAN Software http://ckan.org/features/
Lutece http://fr.lutece.paris.fr/fr/jsp/site/Portal.jsp
Opendata Soft http://www.opendatasoft.com/?lang=en_us
Microsoft Datalab http://www.microsoft.com/government/en-ca/public-services/initiatives/Pages/open-government-data-initiative.aspx
Socrata http://www.socrata.com/
EC Open Data Portal http://open-data.europa.eu/open-data/


HOMER PARTNERS’ EXISTING OPEN DATA PORTALS :

The following table shows the state of the art of different projects, their current status as well as various technical specifications.

List of homer partners’ open data portals :

Partners :
»» Decentralized Administration of Crete
»» Sewerage Board of Limassol – Amathus
»» Provence Alpes Cote d’Azur Region
»» Geodetic Institute of Slovenia
»» Agencia de Gestion Agraria y Pesquera de Andalucia
»» Local Council Association Malta
»» Sociedad De Desarrollo Medioambiental De Aragon, S.A.U. (Sodemasa)
»» Greek Free / Open Source Software Society (GFOSS)
»» Piedmont Region, Innovation, Research, University Directorate
»» Sardegna Region, Direzione generale degli affari generali e della società dell’informazione
»» Emilia-Romagna Region, Region ICT Department
»» Veneto Region, Information System Department
»» Corsica Region

Questions :
»» Did you already identify data on the 5 topics of the federation?
»» Regional portal
»» Did you set up an interface that enables portal and/or data interoperability with other systems, e.g. did you expose an API set?
»» Did you develop/reuse an Open Data portal, or did you prepare a metadata form online?
»» Did you use an open source engine for indexing and search of the metadata?
»» Compliant with metadata standards: Dublin Core, INSPIRE, etc. ?
»» CKAN Interface ?
»» Static Website or CMS ?

Detailed answers given by partners include :
»» Data identified: 76 datasets, 106 datasets, 334 datasets; data collection in progress
»» Regional portals: 1) Andalusian portal 2) National portal 3) Basque Country portal 4) Catalonia portal
»» CKAN: “YES it will be incorporated within the CKAN”; “YES we are in the process of customising a CKAN Open Data tool to suit our needs”
»» Portal: “YES Geoportal”; work in progress; to be implemented
»» Metadata standards: Dublin Core; “YES Inspire for Geodata”; “NO but Inspire is mandatory for geodata portals”
»» Static website or CMS: CMS; static website

Figure 18 : list of homer partners’ open data portals

Figure 19 : Open Data Piemonte : http://www.dati.piemonte.it/
Figure 20 : Open Data PACA : http://opendata.regionpaca.fr/
Figure 21 : Open Data Veneto : http://www.dati.veneto.it/
Figure 22 : European Commission Open Data Portal : http://open-data.europa.eu/open-data/


11. METADATA TEMPLATES


GLOSSARY TERMS

API (Application Programming Interface)
Interface to one or more datasets that a program or application uses to access them. Datasets can be made available either by download (for data sets that are reasonably stable over time) or by API (for very large or very volatile data sets).

CKAN (Comprehensive Knowledge Archive Network)
CKAN is a powerful data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available. CKAN website

CMS (Content Management System)
A content management system is a computer program that allows publishing, editing and modifying content as well as maintenance from a central interface. Such systems of content management provide procedures to manage workflow in a collaborative environment. These procedures can be manual steps or an automated cascade. Wikipedia source

Dataset
A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. It lists values for each of the variables, such as height and weight of an object. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows. Nontabular data sets can take the form of marked up strings of characters, such as an XML file.

DBpedia
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to make sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. DBpedia website

EU PSI Directive
The Directive on the re-use of public sector information, 2003/98/EC, “deals with the way public sector bodies should enhance re-use of their information resources.” Legislative Actions – PSI Directive

EuroVoc
EuroVoc is a multilingual, multidisciplinary thesaurus covering the activities of the EU, the European Parliament in particular. It contains terms in 22 EU languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish), plus Croatian and Serbian. EuroVoc is managed by the Publications Office, which moved forward to ontology-based thesaurus management and semantic web technologies conformant to W3C recommendations as well as latest trends in thesaurus standards. Eurovoc website

IAR (Information Asset Register)
IARs are registers specifically set up to capture and organise meta-data about the vast quantities of information held by government departments and agencies. A comprehensive IAR includes databases, old sets of files, recent electronic files, collections of statistics, research and so forth. Source

INSPIRE The INSPIRE directive came into force on 15 May 2007 and will be implemented in various stages, with full implementation required by 2019. The INSPIRE directive aims to create a European Union (EU) spatial data infrastructure. This will enable the sharing of environmental spatial information among public sector organisations and better facilitate public access to spatial information across Europe. A European Spatial Data Infrastructure will assist in policy-making across boundaries. Therefore the spatial information considered under the directive is extensive and includes a great variety of topical and technical themes. INSPIRE website

Creative Commons CC BY-ND (Attribution-NoDerivs) This license allows for redistribution, commercial and non-commercial, as long as it is passed along unchanged and in whole, with credit to you.

SPARQL SPARQL is the query language for the Semantic Web recommended by the W3C. SPARQL queries hide the details of data management, which lowers costs and increases robustness of data integration on the Web. “Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL,” explained Tim Berners-Lee, W3C Director. There are already 14 implementations of the standard, which comprises three W3C Recommendations: SPARQL Query Language for RDF, SPARQL Protocol for RDF, and SPARQL Query Results XML Format. SPARQL blog
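To make the definition more concrete, here is a minimal sketch of how a SPARQL query can be prepared for an endpoint over HTTP. It only builds the request URL and sends nothing. The choice of the public DBpedia endpoint, the example query and the helper name `build_query_url` are assumptions made for illustration; the `query` parameter itself is the one defined by the SPARQL Protocol.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public endpoint; any SPARQL Protocol endpoint would do.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: ten resources typed as dbo:City, with their English labels.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?label WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""

def build_query_url(endpoint, query):
    """Encode a query as the `query` parameter of a SPARQL Protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})

url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; the result format (XML, JSON, CSV) is normally negotiated through the `Accept` header.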

Public Sector Information (PSI) The wide range of information that public sector bodies collect, produce, reproduce and disseminate in many areas of activity while accomplishing their Public Task. Source: APPSI definition from Web searches, dictionaries or panel member proposal

Dublin Core Dublin Core is an initiative to create a digital «library card catalog» for the Web. Dublin Core is made up of 15 metadata (data that describes data) elements that offer expanded cataloging information and improved document indexing for search engine programs.
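The sketch below shows what a record restricted to the 15 Dublin Core elements might look like. The element names are those of the Dublin Core Metadata Element Set; the helper `make_record` and the sample values are hypothetical and only illustrate the idea.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

def make_record(**fields):
    """Build a metadata record, rejecting names outside the 15 elements."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)

# Illustrative values only.
record = make_record(
    title="State of the Art",
    date="2013-10-21",
    language="en",
    rights="Creative Commons BY-NC",
)
for name, value in sorted(record.items()):
    print(f"dc:{name} = {value}")
```

All elements are optional and repeatable in Dublin Core; this sketch keeps one value per element for brevity.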

Public Domain (PD) The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

Creative Commons BY This license lets others distribute, remix, tweak, and build upon your work, even commercially, as long as they credit you for the original creation. This is the most accommodating of licenses offered. Recommended for maximum dissemination and use of licensed materials.

Creative Commons CC BY-NC (Attribution-NonCommercial) This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.

Creative Commons CC BY-NC-ND (Attribution-NonCommercial-NoDerivs) This license is the most restrictive of our six main licenses, only allowing others to download your works and share them with others as long as they credit you, but they can’t change them in any way or use them commercially.

Creative Commons CC BY-NC-SA (Attribution-NonCommercial-ShareAlike) This license lets others remix, tweak, and build upon your work non-commercially, as long as they credit you and license their new creations under the identical terms.

Creative Commons CC BY-SA (Attribution-ShareAlike) This license lets others remix, tweak, and build upon your work even for commercial purposes, as long as they credit you and license their new creations under the identical terms. This license is often compared to “copyleft” free and open source software licenses. All new works based on yours will carry the same license, so any derivatives will also allow commercial use. This is the license used by Wikipedia, and is recommended for materials that would benefit from incorporating content from Wikipedia and similarly licensed projects.

IODL The «Italian Open Data License» (IODL) is a license intended to allow users to easily share, edit, use and reuse different datasets, while ensuring the same freedom for others. This license is intended to facilitate the re-use of public information.

ODC : Open Data Commons http://opendatacommons.org/

Open Data Soft Portal

Through its integrated platform for Big Data management and smart data publishing, OpenDataSoft offers governments, local authorities and other public bodies a full-featured, simple solution for their open data engagement. It provides a turnkey back office for loading datasets in their original format (office files or business formats such as GIS, databases, business software and real-time data flows); user-friendly web forms for selecting data transformations (filtering, merging, renaming, geocoding, …) before automatic publication in various formats (.csv, .kml, JSON, REST API, WFS, …); and an easy-to-customise web portal (design, domain name) with integrated data visualisation modules (tables, maps, graphics, search engine). Several portals can be set up on the same data hub.

DATALAB OGDI

DataLab (OGDI) is a cloud-based Open Data Catalogue for organizations that seek to:
»» Give citizens access to government data, including the ability to browse, visualize, analyze and download it in multiple formats
»» Enable developers to access the data via open-standard Application Programming Interfaces (APIs)
»» Streamline the publishing of data from government systems or by government employees from their desktops
»» Reduce up-front infrastructure costs (servers, software, etc.) by moving to a cloud service
»» Ensure reliability and scalability (growing compute capacity as the catalogue grows) via the cloud
»» Have full access to the code in order to modify and customize the catalogue as they see fit
OGDI is used by a number of organizations, such as the Government of Colombia, Estonia and the European Union, the City of Medicine Hat (AB, Canada), the City of Regina (SK, Canada), most recently Niagara Region, and others. The older versions, OGDI v1/v2, are available on CodePlex and have been forked by the City of Nanaimo. DataLab / OGDI is written in C# on the .NET Framework and uses the Windows Azure Platform.

SOCRATA The Socrata Social Data Platform™ is a turnkey information delivery platform that reduces lifecycle management costs for government customers while boosting their ability to disseminate relevant information and data-driven services to a wide range of audiences including citizens, civic application developers, researchers, journalists and internal stakeholders. The cloud-based Socrata Open Data Platform™ transforms information assets – tabular data, geospatial data, unstructured content and real-time data from government transactional systems – into a consumption-optimized and socially-enriched experience, that is automatically accessible across multiple channels of interaction, to enhance governments’ ability to accomplish their mission at a reduced cost. The Socrata Open Data Platform™ is the most widely-adopted Open Data solution in Government. Socrata customers include Medicare, State of Washington and City of Seattle. Socrata was recently chosen to deliver the next-generation Open Data platform for Data.Gov and participating federal agencies, through its GSA reseller and fulfillment partner, Alamo City Engineering Services.

OpenData Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open data movement are similar to those of other «Open» movements such as open source, open hardware, open content, and open access. The philosophy behind open data has been long established (for example in the Mertonian tradition of science), but the term «open data» itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives such as Data.gov.

TABLE OF FIGURES

Figure 1: Open Data Ecosystem
Figure 2: Model of an open data technical architecture
Figure 3: Model of Open data environment
Figure 4: Model of data publishing process
Figure 5: Open data workflow
Figure 6: List of CMSs used for Open Data Portals
Figure 7: Data Search engines used by HOMER partners
Figure 8: Vision of federated search by Piedmont
Figure 9: Tim Berners-Lee’s 5-star scale
Figure 10: List of file formats based on Tim Berners-Lee’s star scale
Figure 11: Vocabulary systems used by HOMER partners
Figure 12: List of most used thesaurus in Europe
Figure 13: State of the art of existing licensing solutions
Figure 14: State of the art of the licences used by HOMER partners
Figure 15: Querying data over the Web
Figure 16: Linked Data
Figure 17: Open source solutions for Open Data portals
Figure 18: List of HOMER partners’ open data portals
Figure 19: Open Data Piemonte
Figure 20: Open Data PACA
Figure 21: Open Data Veneto
Figure 22: Open Data Veneto

REFERENCES

H = Feedback from HOMER partners

1. Open Data Handbook. HTTP://OPENDATAHANDBOOK.ORG/EN/
2. List of free and open source CMS. HTTP://WWW.SCRIPTOL.COM/CMS/LIST.PHP
3. DataPublic: Drupal installation profile/distribution for open government initiative. HTTP://DRUPAL.ORG/PROJECT/DATAPUBLIC
4. Open data development with Joomla. HTTP://WWW.SCOOP.IT/T/OPEN-DATA-DEVELOPMENT-WITH-JOOMLA
5. Integration of Apache Solr with crawlers. HTTP://LUCENEREVOLUTION.COM/SITES/DEFAULT/FILES/SLIDES/LUCENE%20REV%20PRESO%20BIALECKI%20SOLR-CRAWLERS-LR.PDF
6. SPARQL Query Language for RDF. HTTP://WWW.W3.ORG/TR/RDF-SPARQL-QUERY/
H 7. Piedmont Region’s Open Data presentation. HTTP://WWW.HOMERPROJECT.EU/DELIVERABLES/WP5/FINISH/18-WP5/72-PIEDMONT-REGION-SOPEN-DATA-PRESENTATION
H 8. Veneto Region’s feedback. HTTP://WWW.HOMERPROJECT.EU/DELIVERABLES/WP5/FINISH/18-WP5/68-VENETO-REGION-S-FEEDBACK
9. The State of Open Data. HTTP://WWW2012.WWWCONFERENCE.ORG/PROCEEDINGS/NOCOMPANION/WWWWEBSCI2012_BRAUNSCHWEIG.PDF
10. Tim Berners-Lee’s 5-star scale: details. HTTP://5STARDATA.INFO/
11. Making Government Data Discoverable with DCAT and VoID. http://richard.cyganiak.de/2011/gld/gld-dcat-and-void.pdf
12. Resource Description Framework (RDF) Resource Guide. http://planetrdf.com/guide/
13. Catalog Service for the Web. http://en.wikipedia.org/wiki/Catalog_Service_for_the_Web
14. INSPIRE Metadata Implementing Rules: Technical Guidelines. http://inspire.jrc.ec.europa.eu/documents/Metadata/INSPIRE_MD_IR_and_ISO_v1_2_20100616.pdf
15. Semantic access to INSPIRE. http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/Terra/paper7.pdf
16. Challenges for the Multilingual Web of Data. http://oa.upm.es/8848/1/Multiling.pdf
17. Linked Data - Design Issues. http://www.w3.org/DesignIssues/LinkedData.html
18. Linked Data - The Story So Far. http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf
19. RDF, SPARQL and Semantic Repositories. http://fr.slideshare.net/marin_dimitrov/rdf-sparql-and-semantic-repositories
20. Querying Heterogeneous Datasets on the Linked Data Web. http://www.edwardcurry.org/publications/freitas_IC_12.pdf
21. Datasets in the next LOD Cloud. http://wifo5-03.informatik.uni-mannheim.de/lodcloud/
22. Open Data Overview. http://data.fingal.ie/media/open-data-overview.pdf
23. WordPress extension for CKAN. ckan.org/2011/04/04/wordpresser-extension-released/
24. Public Sector Information - Raw Data for New Services and Products. http://ec.europa.eu/information_society/policy/psi/index_en.htm
25. Etalab license details. http://epsiplatform.eu/sites/default/files/Licence-Ouverte-Open-Licence-ENG.pdf

Linked Data Indexing Methods: A Survey. http://www.ksi.mff.cuni.cz/~svoboda/research/papers/2011-survey-otm-2011-08-22.pdf
Vocabularies and Linked Open Data. http://fr.slideshare.net/faoaims/vocabularies-and-linked-open-data
Linked Open Vocabularies (LOV). http://lov.okfn.org/dataset/lov/index.html
Open Metadata Handbook/Technical Overview. http://en.wikibooks.org/wiki/Open_Metadata_Handbook/Technical_Overview
Open Data and Metadata Standards: Should We Be Satisfied with “Good Enough”? http://odaf.org/papers/Open%20Data%20and%20Metadata%20Standards.pdf
How to Publish Linked Data on the Web. http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/

Linked Data. http://www.mkbergman.com/category/linked-data/
Towards Open Data for Linguistics: Linguistic Linked Data. HTTP://WWW.LEMON-MODEL.NET/PAPERS/OPEN-DATA-FOR-LINGUISTICS.PDF
The Linking Open Data Cloud Diagram. HTTP://RICHARD.CYGANIAK.DE/2007/10/LOD/
Open Data White Paper: Unleashing the Potential. HTTP://DATA.GOV.UK/SITES/DEFAULT/FILES/OPEN_DATA_WHITE_PAPER.PDF
Open Data and Federation Technical Specifications from CSI. HTTP://WWW.HOMERPROJECT.EU/DELIVERABLES/WP5/FINISH/18-WP5/69-OPEN-DATA-ANDFEDERATION-TECHNICAL-SPECIFICATIONS-FROM-CSI
Size and Structure of French PSI. HTTP://WWW.HOMERPROJECT.EU/DELIVERABLES/WP5/FINISH/18-WP5/70-SIZE-AND-STRUCTURE-OFFRENCH-PSI

CONCLUSION

In this paper we’ve specified the main components of an Open Data portal. We have also drawn up the state of the art of existing technologies necessary for creating these components. Thereby, the present State of the Art may constitute a solid foundation for the technical specifications for HOMER Open data portal, which can match the information system of every HOMER partner. Further discussions around this paper will help consolidate the elements for the Technical Guideline that could become a reference document for different HOMER partners in their Open Data approach. Even though the maturity level of HOMER partners’ Open Data portals is heterogeneous, the most important thing is that many regions have already launched their Open Data initiative. The work of Piedmont region is particularly interesting in the context of the data catalog federation. We invite HOMER partners to examine the present State of the Art so that it could be amended and updated.

Describing Linked Datasets. HTTP://WWW.HOMERPROJECT.EU/DELIVERABLES/WP5/FINISH/18-WP5/71-DESCRIBING-LINKEDDATASETS
The State of Open Data. HTTP://WWW2012.WWWCONFERENCE.ORG/PROCEEDINGS/NOCOMPANION/WWWWEBSCI2012_BRAUNSCHWEIG.PDF
Open Data Field Guide by Socrata. HTTP://WWW.SOCRATA.COM/OPEN-DATA-FIELD-GUIDE/
Creative Commons. HTTP://CREATIVECOMMONS.ORG/LICENSES/
Public Domain. HTTP://PUBLICDOMAINMANIFESTO.ORG/
Italian Open Data Licence. HTTP://WWW.FORMEZ.IT/IODL/
Open Data Commons. HTTP://OPENDATACOMMONS.ORG/LICENSES/BY/1-0/
