Mining Premium Multimedia Digital Libraries with PIVOTS: ‘Private Information Viewing Offering Total Safety’

ANGELOS YANNOPOULOS, ANASTASIOS DOULAMIS, DIMITRIS HALKOS, YIANNIS STAVROULAS, NIKOLAOS DOULAMIS, GEORGE MOURKOUSIS, THEODORA VARVARIGOU

National Technical University of Athens, Department of Electrical & Computer Engineering, Division of Computer Science, Telecommunications Laboratory, Athens 157 73, Greece

{ang, dhalk, ystavroulas, georgemr, dora}@telecom.ntua.gr, {adoulam, ndoulam}@cs.ntua.gr

Abstract―We present an operational paradigm and a corresponding system architecture which allows End-Users, via the services of Search Engines, to search premium multimedia digital libraries belonging to third-party Content Owners. The search is accomplished using any search technology the engines wish to provide, without needing to purchase the premium content beforehand. Furthermore, the search technology is also kept secret from the Content Owner when so desired. The business motivation of this technique is of course to assist End-Users in purchasing content which best suits their requirements – they are offered search results only, not actual content, through PIVOTS. Our objective is to counter the problems caused by the current segregation between content ownership and multimedia processing technology ownership, so that data and methods owned by different parties can nonetheless be used together. The technical essence of our work is guaranteeing undisputed security of the system (that is, the security level is totally determined by controllable system configuration, and very secure configurations exist for most applications) as well as allowing the flexibility to reach efficient implementations (since searching multimedia data is a process with huge computational requirements, and any significant inefficiencies imposed for security reasons would render the paradigm impracticable). The issue of efficiency is dealt with by creating a middleware service layer which is able to function under the PIVOTS paradigm and can perform computationally intensive multimedia pre-processing as an offline operation. The actual search performance is thus boosted and reaches acceptable real-time computational requirements. We are especially concerned with achieving good results even for large-scale multimedia libraries. We describe specific multimedia processing technologies in some detail, in order to illustrate the efficient exploitation of the architecture presented.
These include some new and innovative research results whose development was spurred by the PIVOTS point of view, but which also have an intrinsic value for mining multimedia digital libraries.

1. Motivation – State of the Art

Large-scale multimedia libraries present a significant challenge to the developers of content searching, indexing and analysis technologies. This is in fact a multidisciplinary field which is currently experiencing singularly active research development, being approached by database, knowledge engineering, pattern recognition and, of course, multimedia technologies experts.

While the commercial exploitation potential is vast, content owners cannot easily keep up with technological developments. The most impressive cutting-edge methods are far from mature, while matured methods are being rendered obsolete by rapid progress. The development of stable and useful standards, which is usually a pre-requisite for any large technological investment, is thus made difficult, and commercial adoption of new techniques is blocked. A way around this problem is to provide a flexible standard, where switching content searching technologies can be transparent to both end users and content owners. The ownership of content, algorithms and implementations is of course a critical parameter in this discussion:

• By its nature, multimedia content in large digital libraries has a high intrinsic value. Since the content owner cannot freely distribute the data, but is far more likely to demand a high premium for each content item to be transmitted to any client, no multimedia searching/analysis technology can currently be applied to the content unless the appropriate implementations are available to the content owner, or else the content items are purchased by a client before application of the technology.



• Conversely, competitive algorithms and serious, industrially useful implementations are equally valuable. Although content owners would often like their clients to be able to process (e.g. search, assess) the premium content with such methods before purchasing it, they do not seem willing to countenance the costs of purchasing the necessary technology, let alone of developing it themselves. This is hardly good motivation for third parties to invest in creating good implementations.

In general, the critical mass required to establish a healthy technology market in this area has not yet been reached. Bridging the gap between data ownership and search/analysis technology ownership was the topic of our research as recently published in [1]. There, we were not concerned with multimedia content. We focused primarily on textual content, while also considering image libraries.

The introductory example which we used for the first incarnation of our system, PIVOTS, is still useful as an illustration of our objectives: although we can physically visit a bookstore and browse through books before purchasing them, we cannot download an e-book for a similar examination. In the case of digital data, we cannot access it at all unless we first make a normal purchase. This is natural, since we can easily keep copies of any digital data we ever access. As a next-best solution, we would like to be able to apply advanced, custom text analysis technology to private text corpora, receiving as an output only the assessment result, and never accessing the actual corpora without making a purchase. The security implications of such a process were, naturally, the main consideration of our work.

Our motivation in addressing such a topic was our experience with language processing technology which could reasonably be substituted for direct browsing, assessing textual material for us and advising for or against a purchase. Such technology exists in research, but not in commercial implementations. In the case of multimedia content, methods to search for and assess content are already more effective than their language processing counterparts, despite being a considerably younger discipline. Also, multimedia is much harder to browse manually – video clubs, unlike bookstores, generally offer no previews.
We have just as much need for a system which can take our preferences as input and tell us which items of multimedia content to purchase access to – we could be a movie fan looking for scenery objects in films we have not yet purchased, or, more seriously, a doctor looking for certain symptoms in large video archives which nonetheless require special arrangements for each separate video to be released, or a creative designer in advertising wanting to buy footage with specific characteristics from a provider who otherwise charges huge fees for manually searching for appropriate content.

The security framework which is presented in [1] is in fact general enough to be applied directly – although quite naively – to multimedia content as well. The problem here would be abysmally sluggish performance due to the huge computational demands of multimedia processing. In order to make the system work in practice for multimedia libraries, we have extended it significantly: we partition the workload into that which is truly required for real-time operation and that which can be prepared offline as pre-processing; this requires a carefully customised service-level architecture. Interfacing between the series of applied algorithms is also systematically organised. Last but not least, we suggest specific video-processing techniques which make strong sense in the context of our security requirements, both as the foundation of an actual application and as an illustration of how to exploit the system presented in general.

The rest of this paper is structured as follows. The core architecture of PIVOTS [1], which deals with fundamental security issues, does not need to be changed. It is reviewed in Section 2. The service-level architecture, evolved in order to cater for multimedia libraries, is detailed next, in Section 3. We then turn our attention to specific multimedia processing issues in Section 4, showing how to apply this technology under the general framework presented. Finally, conclusions and some discussion in Section 5 close the paper.

2. System Architecture And Implementation Security

We first define the most important terms to which the rest of this section will make reference:

• Premium content items: This is the unit of information from the business perspective of the Content Provider. In our context of multimedia content, it would comprise files of audio-visual, graphic, image, speech, etc. information. Note that in our discussion of service-level security we assume very large file size, a constraint which is very easily met by video and quality audio content, but must be checked carefully in other cases.



• Search algorithm: In the PIVOTS system, a search algorithm is a function that processes Premium Content and assesses it, according to its own arbitrary criteria. This is executable code, which the Content Provider need not trust (and cannot analyse).



• Result rating/assessment: This is an array of bytes which is output from the Search Algorithm. Normally it translates to some algorithm-specific measure of the content’s relevance to the searcher’s requirements; however, there is no control as to what it actually contains. The size of the array is limited by an application-specific maximum length.



• Public metadata: This is content metadata which the Content Provider chooses to make public for free. This would be information the Content Provider might also advertise on a website: a title, a summary description, file format and size information, etc.



• Secret Metadata: This metadata is designed to significantly boost the performance of the Search Algorithm, by offering useful pre-processing results. We will discuss its use in Sections 3 and 4, since our current concern is security only.

2.1. Search Execution And Execution Platform Overview

The fundamental objective of PIVOTS is to allow the remote execution of the Search Engine’s code at the Content Provider’s site (we will refer to Content Owners who offer PIVOTS search functionality as Content Providers). If the Content Provider simply ran any code sent by the Search Engine, that code could, among other things, transmit premium information to illegal recipients.

We start our review of the PIVOTS security core by describing its basic usage scenario. First, a Search Engine sends a Search Algorithm and, optionally, a Search Filter to the Content Provider's site (see Fig. 1). The Search Filter can use Public Metadata to improve search performance – we will return to search performance issues in detail in Sections 3 and 4 – possibly even discovering search hits from the metadata alone. At the least, however, the Search Algorithm obtains a list of content items to be fully processed.

The secure processing phase begins at this point. The list of premium content items to be rated and the Search Algorithm are passed to the Connecting API. For each content item to be rated, the Connecting API creates a different Sandbox instance. The Search Algorithm object is placed into the Sandbox together with the full content item. The only communication allowed to the Search Algorithm object in the Sandbox is for it to return a rating to the Connecting API – beyond this, it has access to no other resources whatsoever.

The Content Provider's block hosts all content, public and premium, and manages execution flow through the Connecting API. This block is the only component that is specific to the particular data owner; an ad-hoc implementation of its specification is needed for each application. This cannot be avoided, as it must interface to the content provider’s digital library, which cannot possibly be specified in advance by PIVOTS. The Connecting API, by contrast, does not have to be implemented ad hoc.
However, it constitutes part of the Content Provider's block for two reasons: first, because it has to have access to the data sources, and second, because it is the security-critical component. For the latter reason, the Connecting API must be published as an open-source component before commercial deployment. The Content Provider can then inspect it before incorporating it in their own module. This will allow easier adoption of the system by Content Providers.

The Mobile Agent Platform functions as a front-end for the User, accepting the search programmes submitted. This component does not access private information. For the purpose of the current presentation, we can view it as an implementation convenience; more details can be found in [1].

Fig. 1 System-Level Architecture

2.2. Implementation Overview

There are some basic technology choices that have to be made in order to realise the design presented. The two fundamental system factors that shape these choices are:

• Sandboxing. The selected implementation technology should provide a safe method to create and manage the Sandbox.



• Code migration. PIVOTS is based on the execution of the user's search algorithms on the Content Provider's site. The user's code must be transferred to this site and incorporated into the PIVOTS system without manual intervention. It must be stressed that this is not a case of remote code execution, which could possibly be supported by Remote Procedure Call (RPC) technologies: there, the code is executed remotely and the input data are sent to the remote site. In PIVOTS, by contrast, the remote code is brought in and executed locally, while the data never leave the local site.

These factors led us to choose Java as the system's implementation language. Java supports an elaborate mechanism for sandboxing, primarily intended to support the safe execution of applets [6]-[8]. Because of applets' wide-spread use, this mechanism is extremely well-tested. Java supports security policies that define access rights to files, sockets, network, security, runtime operations, Java reflection and serialisation [6][17].

To create a sandbox, one needs only to instantiate a Java Virtual Machine (JVM) based on a security policy that denies it all access rights to the outside world. Any Java code executed in this JVM will be guaranteed to have no access to the system resource types mentioned above.
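As a concrete sketch of this step, the command line for such a restricted JVM might be assembled as follows. The class and file names (`SandboxLauncher`, `deny-all.policy`, `SearchAlgorithmRunner`) are hypothetical, and note that the SecurityManager mechanism shown here, standard at the time of the original implementation, has been deprecated in recent JDKs:

```java
import java.util.List;

public class SandboxLauncher {
    // Assemble (but do not yet run) the command line for a sandboxed JVM:
    // a fresh java process governed by a policy file that grants nothing.
    // The '==' syntax makes the given policy file REPLACE the default
    // policy rather than extend it, so the hosted code has no access
    // rights to files, sockets, or other system resources at all.
    static List<String> sandboxCommand(String policyFile, String mainClass) {
        return List.of("java",
                "-Djava.security.manager",
                "-Djava.security.policy==" + policyFile,
                mainClass);
    }

    public static void main(String[] args) {
        // "deny-all.policy" would simply contain:  grant { };
        List<String> cmd = sandboxCommand("deny-all.policy", "SearchAlgorithmRunner");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting process could then be fed the Search Algorithm class and content item, returning only the rating bytes over its standard output.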

2.3. The Agent Platform

Java makes code migration a relatively easy task. The Java compiler stores the bytecode for each class in a separate file. The Java Runtime Environment uses this bytecode to produce the executable code of the class on the fly. It is possible to load a class and manipulate it at runtime, i.e. without the class being known at compile time, using Java's reflection capabilities. To do this, the class's bytecode file should be placed in the JVM's classpath. The class is then loaded with a call to the static method java.lang.Class.forName(), which takes the class's name as a parameter. Any other classes referenced by the class being loaded are also loaded automatically.
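A minimal illustration of this loading mechanism follows; a JDK class stands in for a user-supplied Search Algorithm class whose bytecode file has been placed on the classpath:

```java
public class DynamicLoadDemo {
    public static void main(String[] args) throws Exception {
        // Load a class by name at runtime. In PIVOTS the name would be
        // that of the Search Algorithm class whose bytecode file was just
        // transported into the JVM's classpath.
        Class<?> cls = Class.forName("java.util.ArrayList");

        // Instantiate it reflectively, without compile-time knowledge of it.
        Object instance = cls.getDeclaredConstructor().newInstance();

        System.out.println("Loaded " + cls.getName()
                + ", instance created: " + (instance != null));
    }
}
```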

interface ContentItemSummary {
    public long getId();
    public String[] getKeywords();
    public String getTitle();
    public String getSubtitle();
    public String[] getClassificationNames();
    public String[] getArtists();
    public String getProducer();
    public String getContentSummary();
    public long getSize();
    public Date getDate();
}

interface FullContentItem extends ContentItemSummary {
    public byte[] getFullContent();
}

// final, so it cannot be subclassed to cheat!
final class SearchResults {
    static final byte N = 1;
    public byte[] results = new byte[N];
    // it is so simple, and should be!
    // Let the user interpret meaning of the bytes.
}

interface SearchFilter {
    // This method filters an array of summaries,
    // and returns an array of ids of Content Items
    // to be examined in full within the sandbox
    public long[] filter(ContentItemSummary[] summaries);
}

// classes implementing this interface
// are passed to the ConnectingAPI
interface SearchAlgorithm {
    public SearchResults assessFullCntnt(FullContentItem fullCntnt);
}

// this specifies the Connecting API
// actual implementations are subclasses
abstract class AlgorithmSandbox implements Runnable {
    // expected results size, for optimisation
    protected static final int initSize = 10;
    protected static DataProviderInterface dpInterface;

    public SearchResults[] performFullCntntSearch(
            SearchAlgorithm algorithm, SearchFilter filter) { ... }

    public abstract SearchResults[] performFullCntntSearch(
            SearchAlgorithm[] algorithms);
}

interface DataProviderInterface {
    public ContentItemSummary[] getContentItemSummaries(
            SearchFilter theFilter);
    public FullContentItem[] getFullContentItems(
            SearchAlgorithm theAlgorithm);
    // retrieve only content items specified in ids list
    public FullContentItem[] getFullContentItems(
            SearchAlgorithm theAlgorithm, String[] cntntIds);
}

Fig. 2 Simplified Java code of the main system elements
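To make the contract of Fig. 2 concrete, the following sketch shows what a deliberately trivial Search Engine submission might look like. The keyword criterion and the catalogue data are invented for illustration, and the Fig. 2 interfaces are re-declared in trimmed form so the snippet is self-contained:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Trimmed re-declarations of the Fig. 2 types used by this demo.
interface ContentItemSummary { long getId(); String[] getKeywords(); }
interface SearchFilter { long[] filter(ContentItemSummary[] summaries); }

public class SubmissionDemo {
    // A toy filter: keep only items whose public keywords mention
    // "sunset", so that only those are fetched in full and rated
    // inside the Sandbox.
    static class SunsetFilter implements SearchFilter {
        public long[] filter(ContentItemSummary[] summaries) {
            List<Long> hits = new ArrayList<>();
            for (ContentItemSummary s : summaries)
                if (Arrays.asList(s.getKeywords()).contains("sunset"))
                    hits.add(s.getId());
            return hits.stream().mapToLong(Long::longValue).toArray();
        }
    }

    // Helper to fabricate a summary for the demo catalogue.
    static ContentItemSummary summary(long id, String... keywords) {
        return new ContentItemSummary() {
            public long getId() { return id; }
            public String[] getKeywords() { return keywords; }
        };
    }

    public static void main(String[] args) {
        ContentItemSummary[] catalogue = {
            summary(1, "beach", "sunset"),
            summary(2, "city", "night"),
            summary(3, "sunset", "mountain")
        };
        long[] ids = new SunsetFilter().filter(catalogue);
        System.out.println(Arrays.toString(ids)); // items 1 and 3 survive
    }
}
```

Only the surviving IDs are passed on to the Connecting API; the filter itself never touches premium data, which is why it can run outside the Sandbox.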

Thus the code migration problem is reduced to transporting the required class files from the user's site to the Content Provider's site and storing them in the file system. The file transfer itself is trivial; managing the process, however, is not. To simplify this, we decided to use an off-the-shelf mobile agent platform [9]-[14]. Mobile Agent platforms are purpose-built to serve the migration of code between servers and its remote execution. They also support the authentication of the remotely communicating parties (this functionality can easily be replaced by formal authentication with certificates supported by Trusted Third Parties). In our Java implementation of PIVOTS we employed Grasshopper [18] as our Mobile Agent platform. Grasshopper offers a variety of options for code transport, including TCP and IIOP, with the option of SSL for security, which we used in our implementation. It has to be noted, though, that PIVOTS does not rely on any of the advanced agent features of Grasshopper. A straightforward implementation of PIVOTS over other platforms, such as web services, is perfectly possible.

Grasshopper realises the functionality of the MA block in Fig. 1. Search requests arrive at the platform in the form of a mobile agent carrying a Search Algorithm and optionally a Search Filter, both being Java classes. The two classes must implement the Java interfaces SearchAlgorithm and SearchFilter, respectively, shown in Fig. 2. The PIVOTS components use these interfaces to manipulate the classes without actually having to know their exact characteristics. Using interfaces is better than using abstract classes, in that no restriction is imposed on the class hierarchy employed by the user.

The Agent Platform is PIVOTS's gateway to the world. A static agent, which we call the Reception, manages the interaction with incoming mobile agents carrying search requests. The mobile agent delivers the two classes to the Reception, which carries out the initial filtering as mentioned above. Up to this point, no premium data have been used, so there is no need for security restrictions and sandboxing. Once the filtering is complete, the Reception delivers a list of content item IDs to be rated to the Connecting API, together with a SearchAlgorithm instance. The Connecting API asks the data source to retrieve these content items, using the DataProviderInterface of Fig. 2. This is effectively a wrapper which the Content Provider must implement to connect PIVOTS to the provider's data library through the provider's proprietary network, software and hardware configuration.
In the final processing step, the Connecting API creates an instance of the Sandbox, places the search algorithm class and the full content item in it, and takes out the result once execution terminates. Additional detail about the API can be found in [1]. One point of discussion which we omit completely here has to do with risks when processing medium-size files such as text documents. In the case of audio and video data, which interests us primarily here, the problem disappears – the discussion of Section 3 regarding security at the service level applies instead. If PIVOTS is applied to smaller-sized multimedia, such as pictures, the original perspective must again be taken into account.

3. Service Architecture And Application Security

3.1. Service-Level Security & Business Model for Multimedia Digital Libraries

The core PIVOTS platform allows arbitrary software, sent by the Search Engine, to execute remotely at the Content Provider’s site. This software is able to process premium content that the Search Engine has not purchased. The only allowed result of each code run is just a few bytes long and is intended to bear the search assessment result. In other words, the results take the form of a meaningful digest, e.g., a useful statistic, a relevance assessment, a search-hit quality measure, etc. (so it would not be possible to send results such as a sample frame to the end-user, although a frame ID could be sent and the end-user could request a preview of this frame through a previewing system unrelated to PIVOTS, if the Content Provider wished to offer such an option).

However, the actual content of the returned result/digest cannot be controlled by the Content Provider, so it could in fact contain a few bytes of the premium data itself. If a Search Algorithm were customised to return a successive part of the premium data after each execution, an uncontrolled application of the core platform of Section 2 alone would allow a malicious Search Engine to steal premium content at will, albeit quite slowly.

A significant part of the discussion in [1] is concerned with presenting an array of different measures to guarantee that such a threat is avoided. Space restrictions do not allow the less relevant of these techniques to be described again here, but they could be employed as additional safety measures if our current discussion were applied to extremely sensitive data. For multimedia content processing, computational complexity becomes so important that its role is decisive in terms of security too. In fact, the potential to extend the PIVOTS architecture is created by the fact that security now becomes easier to deal with. We rely on the simple and fundamental premise that trivial searches need not be facilitated by the system. Non-trivial searching in a multimedia digital library is a costly process. We will discuss these costs in more detail below and in Section 4. In laying down a business model for mining multimedia digital libraries, however, it is sufficient to determine that the Search Engine must pay a certain fee to the Content Provider for each search performed. The Content Provider must offer storage and processing power capable of serious multimedia processing; in a conventional model, the Search Engine would need to invest in this hardware; in PIVOTS, processing is remote, so it makes sense for the Search Engine to cover the cost of the service provided to it. In some applications, high charges may make sense, but as far as security is concerned:

– Even a tiny fee charged on a per-search basis guarantees security. Security here is meant in a “computational” or “pragmatic” sense: malicious Search Engines or End-Users are indirectly prevented from stealing data by the immense costs they would incur in doing so. Consider a charge of $0.01 (or, say, €0.01) per search.
If the result size per search is 10 bytes (which is enough for a relevance rating, a few frame references, a basic indication of content type, and perhaps more), then “stealing” a Kilobyte of data will in fact cost $1 (or €1). But 1 KB is a trivial size in the context of multimedia. In other words, using PIVOTS to steal would simply cost far more than making an actual purchase. On the other hand, normal use of the system remains quite cheap, since it is most unlikely that users will really need to perform hundreds of searches regularly. And since 10 KB is in fact just as trivially tiny as 1 KB, the cost per search could be cut tenfold and the security situation would not change. As far as security is concerned, the cost could be driven to such low levels that actually charging the End-User would become unnecessary – Search Engine funding could be drawn from banner advertising, from deals with the Content Provider for a small cut of sales originating from Search Engine referrals, etc. Clearly, the exact cost configuration of the system can be optimised for each application. All this does not mean that the charge per search could not be considerably higher, if that makes sense in the business model of a specific application.

– This discussion applies to the party receiving the search results. Since the Search Engine represents End-Users to Content Providers, the Search Engines would need to pay the Content Providers, and the End-Users, being the ones who actually receive the search results, would pay the Search Engines (or indirectly create revenue for them through their usage of the system). The Search Engines provide a service to End-Users and do not ever need to receive the actual search results. In fact, they must be prohibited from doing so.
Consider the following scenario: a malicious Search Engine with many users regularly performs a huge number of searches; the cost is low per user, but the number of users means that the total number of results retrieved is in fact very large; the search engine only uses half of the result encoding space to retrieve actual results for users, exploiting the other half to leak premium content from the Content Provider. Such a case must clearly be avoided, as it is an important threat. The malicious search engine could steal a high volume of premium data without incurring costs as a result.
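The arithmetic behind the per-search-fee argument is easily made explicit. The figures below are the illustrative ones used above ($0.01 per search, 10-byte results), not fixed system parameters:

```java
public class TheftCost {
    // Minimum fee incurred to exfiltrate 'bytes' bytes of premium data
    // through search results of 'resultSize' bytes each.
    static double exfiltrationCost(long bytes, int resultSize,
                                   double feePerSearch) {
        // Ceiling division: a partial final result still costs one search.
        long searchesNeeded = (bytes + resultSize - 1) / resultSize;
        return searchesNeeded * feePerSearch;
    }

    public static void main(String[] args) {
        // Stealing 1 KB takes 103 searches, i.e. just over $1.
        System.out.printf("1 KB: $%.2f%n",
                exfiltrationCost(1024, 10, 0.01));
        // Stealing 1 MB -- still a trivial amount of video -- costs
        // over a thousand dollars.
        System.out.printf("1 MB: $%.2f%n",
                exfiltrationCost(1L << 20, 10, 0.01));
    }
}
```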

The simple solution is to arrange for the search results to always be sent directly from the Content Provider to the End-User. The Search Engine simply sends to the Content Provider, together with each search specification, an XSL stylesheet (or equivalent) to specify its desired result format. The Content Provider uses the Search Engine’s specification, transforming the binary output that comes out of the Sandbox into results formatted for End-User consumption. The Content Provider then offers these results directly to the End-User. Making sure that the Search Engine does not enter the loop requires some configuration of the results transmission mechanism, but is not challenging to accomplish. For instance:

(1) If email feedback is required, the Search Engine provides the address to which each result must be sent; encrypted emails must be used; the Content Provider must test the Search Engine by occasionally creating user accounts and checking that the email addresses provided are correct, that the Search Engine does not ask for copies, etc.

(2) The Content Provider can host a web server to serve web pages presenting the search results; the Search Engine must redirect End-Users to the appropriate address after a search has been submitted; the Content Provider will likely require a simple structure for these pages, so that it can easily verify that no means are embedded in them to forward results to the Search Engine; the pages can be served to a client from a single IP address only, so that even without encryption, a Search Engine’s accessing a page would prevent the End-User from viewing the results; but of course encryption could be used as well.

The additional safety techniques presented in [1] could be relied upon to increase confidence.
Note that, in any case, a Search Engine would need to systematically compromise the system in order to steal valuable data; any temporary breach, co-operation with a handful of End-Users, etc., would not achieve the theft of adequate information volume; on the other hand, even a single instance of detected foul play on the part of the Search Engine could be punished severely.
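As a sketch of the stylesheet mechanism, the Content Provider's formatting step could use the standard Java XSLT API. The XML wrapping of the Sandbox output and the stylesheet contents are invented for illustration; PIVOTS does not prescribe them:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ResultFormatter {
    // The Search Engine ships this stylesheet with its search
    // specification; the Content Provider applies it to the Sandbox
    // output and sends the formatted result directly to the End-User.
    static final String STYLESHEET =
        "<?xml version='1.0'?>"
      + "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/result'>"
      + "Relevance of item <xsl:value-of select='@item'/>: "
      + "<xsl:value-of select='@score'/>/255"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String format(long itemId, int score) throws Exception {
        // Wrap the single rating byte in a minimal XML document ...
        String xml = "<result item='" + itemId + "' score='" + score + "'/>";
        // ... and run the Search Engine's stylesheet over it.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(format(42L, 197));
    }
}
```

Since the Content Provider executes the transformation itself, it can inspect the stylesheet and verify that the output contains nothing beyond the permitted result bytes.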

3.2. Why Pre-Process the Multimedia Content of Digital Libraries?

We address this issue in technical detail in Section 4, but preview the main concepts here, since they motivate the discussion which follows. Modern audio-video processing algorithms are almost invariably organised into separate processing stages. It is very rare for an algorithm to input a waveform or a three-dimensional matrix of pixel values and produce directly, in an ‘atomic’ (non-decomposable) step, final results describing the information content. A few examples from the plethora of intermediate steps commonly performed for video data are (roughly ordered from the conceptually simplest): downsampling, colour histogram calculation, texture descriptor extraction, data compression, edge detection, and human-figure location.

Some of these processes are trivial to understand, analyse and implement, but have a huge computational cost nonetheless – mostly due to the requirement to process the uncompressed, raw data, though algorithmic complexity issues can also arise. Others are far more sophisticated in scientific terms, such as the full content-description capabilities offered by the MPEG-4 and MPEG-7 standards, while again posing great computational requirements, but are nonetheless publicly available techniques. A proprietary algorithm will almost always make heavy use of such commonly available functionality, its uniqueness resting on the innovation present in a few steps only.

Finally, there may be processing steps in proprietary algorithms which, although innovative and proprietary themselves, produce intermediate results which (1) are always the same for given input content, regardless of the desired result, and (2) cannot be exploited to gain insight into the construction of the processing from which they resulted. Such steps also constitute good candidates for removal to a pre-processing phase.

If as many as possible of the above processing stages are performed in advance on a multimedia corpus, the results can be stored together with the raw (source) data. The outputs of pre-processing stages are rarely as large as the raw data input, and very rarely larger. Thus, storage requirements will not scale up worse than the requirements for the raw data itself. On the other hand, searching in real time is made feasible. Many search algorithms will be able to function without accessing the raw data at all, or accessing only carefully selected sections of it. Also, total operation costs will be far lower, since overall processing times will be drastically reduced.
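As a small illustration of such an offline stage, the sketch below reduces a raw frame to an 8-bin luminance histogram. The histogram is a stand-in for the richer descriptors discussed in Section 4, and the in-memory map stands in for the Secret Metadata database, keyed by content GUID:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PreprocessDemo {
    // Offline stage: reduce a raw frame (here, one byte per pixel) to
    // an 8-bin luminance histogram. Search algorithms can later consult
    // the histogram instead of re-reading the raw data.
    static int[] histogram8(byte[] rawFrame) {
        int[] bins = new int[8];
        for (byte b : rawFrame)
            bins[(b & 0xFF) >> 5]++; // 256 levels / 8 bins = 32 per bin
        return bins;
    }

    public static void main(String[] args) {
        // Stand-in for the Secret Metadata store, keyed by item GUID.
        Map<Long, int[]> secretMetadata = new HashMap<>();

        byte[] frame = {0, 40, 80, (byte) 200, (byte) 255};
        secretMetadata.put(42L, histogram8(frame));

        // 0 -> bin 0, 40 -> bin 1, 80 -> bin 2, 200 -> bin 6, 255 -> bin 7
        System.out.println(Arrays.toString(secretMetadata.get(42L)));
    }
}
```

The descriptor is orders of magnitude smaller than the frame it summarises, which is the scaling property the paragraph above relies on.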

3.3. Service-Level Architecture and Service Execution Flow

Figure 3 presents a high-level view of a complete PIVOTS-based multimedia digital library search system (note that this section departs significantly from [1]; instead of constantly interrupting the presentation to point out differences, an overall, conclusive comparison will be provided in Section 5). End-Users are depicted at the top right of the figure, Search Engines at the top left, and Content Providers in the bottom half. Note that several representatives of each role are shown, although only one of each actually participates in the interaction. We want to stress that, since the configuration is in no way bound to the details of any specific Search Engine’s or Data Owner’s systems, but relies on open interfaces, and Search Engine algorithms need not be trusted by the Content Provider anyway, individual representatives of each role are interchangeable. PIVOTS thus provides an open architecture, into which any party can plug the appropriate functionality. This openness may easily be restricted according to Policies followed by each party involved, however, as all interactions are based on authenticated, encrypted and role-based communications.

Fig. 3

Service-Level Architecture

We first describe the various “blocks” in the figure, before moving on to the possible interactions and processing that are carried out by the system.

• End-Users rely on thin clients to submit their searches and receive results. PIVOTS is mostly concerned with the interaction between Search Engines and Content Providers. The important feature of End-Users is that they need to be the direct recipients of search results. Thus, they cannot be totally omitted from the architecture.



• Search Engines host their interface for end-users as web-pages. However, they rely on powerful, proprietary multimedia processing technology which is not available to Content Providers (see the discussion in Section 1). — A “Search Algorithm Repository” is shown, in order to visually stress the multimedia technology which can be drawn upon, as well as the fact that a variety of different algorithms may be used by each Search Engine. These algorithms may be alternative methods with the same objective but different performance characteristics, or they may address different processing stages of the complete search procedure, some used for pre-processing and some for searching in real-time. Algorithms in the repository are free to change arbitrarily over time, since they are completely transparent to the rest of the system.



• Content Providers host the PIVOTS platform, so most action takes place there. — The Control Module corresponds to the Mobile Agent Platform block of our implementation as presented in Section 2. As described there, any code that is to access the Content Provider’s premium content must be passed to this block in order to be sandboxed. For clarity in Fig 3, this block is displayed as including some additional functionality which belongs to it conceptually but would not actually be implemented using Mobile Agents, for instance, the mechanism to return results to End-Users. — The square block entitled “Secure and Controlled Execution Environment” is the Sandbox itself. Here, it includes task prioritisation and monitoring functionality which can share computing resources as desired amongst search requests, monitoring the resource consumption of each of these – PIVOTS calls for no innovation here, so standard task-scheduling and -management practices can be relied upon. The monitoring capability can be used simply to offer equal processing opportunities to all search requests and to counter denial-of-service attacks by killing processes which exceed predefined limits. Alternatively, it can be used to charge Search Engines depending on the computational resources they spend. — The Video Content database stores the source data, i.e., the premium content owned by the Content Provider. Processes in the Sandbox are never granted rights to alter this data. Each content item has a GUID (globally unique identifier) referenced by entries in the other system databases. (Video, i.e. audiovisual data, is the medium we are centrally concerned with in this work, but any data whatsoever could be substituted, if the appropriate algorithms to search it are also available.) — The Secret Metadata database stores all pre-processing results (represented by the data chunk titled “metadata”).
These results may contain substantial parts of the source data, so they are protected exactly as the source data itself. The metadata is created by a Search Engine’s pre-processing code, which alone is granted write/modify access to it. Each metadata item is complemented by a “metadata description” file. This is an XML description of the metadata which is generated by the Search Algorithm before entering the Sandbox and accessing premium content (but see step “b4” below). This meta-meta-data specifies the content format and type of the metadata itself, so that real-time searching code can discover what pre-processing results are available to it and how to access them. The metadata description is placed in this database by the Content Provider’s Control Module and is read-only just as the source data. PIVOTS strictly specifies the format of the metadata description files with the XML schema shown in Fig 4. An example metadata description is shown in Fig 5. The specification ensures that pre-processing and real-time search algorithms can communicate with each other. Note that the lightly shaded fields in Fig 4 are intended to point human designers of search algorithms to appropriate specifications, while the rest are machine-readable. — The Public Metadata database contains metadata which the Content Provider itself uses to describe its content (title, summary description, file format, size information, etc). It also contains copies of the metadata description files from the Secret Metadata database, which publicly specify what pre-processing has been performed, so that search algorithms can be configured appropriately before arriving at the Content Provider’s site. This data is also always read-only.

Fig. 4

XML Schema specification for the metadata description files.

[Example metadata description (XML markup lost in extraction); recoverable field values: content GUID 123456789012345678901234567890; metadata GUID 789012345678901234567890123456; creator “NTUA Telecom Lab - PIVOTS Video Preprocessor”, http://telecom.ntua.gr/; authenticity certificate placeholder; metadata ID A9FD64E12C; format MPEG, specified at http://147.102.7.22/_multimedia_PIVOTS/mpeg_info.html; creation date 2004-05-22; parameters -a:7 -m:22; success flag 1.]

Fig. 5

Metadata description file example.
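A real-time search algorithm must parse such a description to discover what pre-processing results exist before deciding how to search. The sketch below illustrates this with Python's standard XML tooling; the element and attribute names are illustrative assumptions only, since the normative format is the schema of Fig 4.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the normative format is the Fig 4 schema.
example = """
<metadataDescription>
  <contentGUID>123456789012345678901234567890</contentGUID>
  <metadataGUID>789012345678901234567890123456</metadataGUID>
  <creator url="http://telecom.ntua.gr/">NTUA Telecom Lab - PIVOTS Video Preprocessor</creator>
  <format specification="http://147.102.7.22/_multimedia_PIVOTS/mpeg_info.html">MPEG</format>
  <created>2004-05-22</created>
  <parameters>-a:7 -m:22</parameters>
  <success>1</success>
</metadataDescription>
"""

def describe(xml_text):
    """Let a real-time search module discover what pre-processing
    results exist and how to read them."""
    root = ET.fromstring(xml_text)
    return {
        "content": root.findtext("contentGUID"),
        "format": root.findtext("format"),
        "spec_url": root.find("format").get("specification"),
        "succeeded": root.findtext("success") == "1",
    }

info = describe(example)
print(info["format"], info["succeeded"])  # prints: MPEG True
```

In the full system, this parsing happens before the search code enters the Sandbox, so the algorithm arrives already configured for the metadata it will find.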

There are three main operations involved in searching with PIVOTS: browsing public metadata, preprocessing content to create the private metadata, and executing real-time searches. Consequently there are three “paths” of interaction, each supporting a different kind of search-related operation. They are formed in Fig 3 by series of arrows, with each arrow’s point being annotated with a short tag (a1-a2, b1-b4 and c1-c7) to facilitate references in the description below.

• Referencing Public Metadata (a1-a2). The Search Engine can freely read the public metadata. Reading the owner-generated metadata is like browsing a content catalogue which could just as well have been published by the Content Provider anywhere on the Web, although the Content Provider will reasonably want to include more search-relevant information here (such as details on file formats provided and relevant encoding details). The middleware-generated metadata gives details about what pre-processing has been performed for each content item.



• Creating Secret Metadata (b1-b4). — b1: The Search Engine sends a Search Algorithm to the Content Provider as described in Section 2, but authenticating it as a pre-processing module. The Search Algorithm creates the metadata description before entering the sandbox, based on the processing steps which it intends to execute (but see step b4). — b2: The search algorithm is placed inside the Sandbox, being granted access to add a metadata file to the Secret Metadata database. The algorithm can access no external resources whatsoever (e.g. any attempts to communicate beyond the Sandbox, as with the arrow marked b? in the diagram, will be blocked). — b3: The search algorithm reads the source data for its chosen content item. Processing requirements may be very great, but performance is monitored by the Sandbox, and the Content Provider can kill over-expensive processes, or charge the Search Engine for the spent processor time, depending on the service-level agreement between them. — b4: The resulting metadata is placed into the Secret Metadata database once computed. The search algorithm is terminated, without returning any results to the Search Engine. It may specify a boolean success/failure return value, however, which the Content Provider inserts into the metadata description file. This return value is far too tight a channel to allow content theft, especially since (1) pre-processing in fact has higher computational requirements than searching and therefore probably incurs larger costs, (2) the total number of such boolean values in the entire database, equalling the number of content items times the average number of metadata files present for each, is still going to be fairly small and (3) declaring failure will have an important impact since the Data Source may discard the metadata, so this information cannot be used as a communication channel without serious, and instantly detectable, side effects.



• Performing the Real-Time Search (c1-c7). — c1: An End-User logs on to a Search Engine and specifies search requirements according to whatever options the Search Engine can offer. — c2: The Search Engine sends a Search Algorithm to the Content Provider as described in Section 2, providing the necessary details for sending the results directly to the End-User concerned.

— c3: The search algorithm is placed inside the Sandbox, being granted only read access to the content and metadata files it needs to examine. The algorithm can access no external resources whatsoever (e.g. any attempts to communicate beyond the Sandbox, as with the arrow marked c? in the diagram, will be blocked). — c4, c5: The search algorithm processes the metadata and/or source file. Performance is monitored by the Sandbox, and the Content Provider can kill over-expensive processes, or charge the Search Engine for the spent processor time, depending on the service-level agreement between them. — c6: Upon completion, the search algorithm calls the Connecting API, passing its results to it, and is terminated. — c7: The Content Provider forwards the search results directly to the End-User (as discussed at the beginning of this Section).
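The Sandbox behaviour in the b- and c-paths (running untrusted code under a resource budget and killing over-expensive processes) can be sketched with standard process control. This is an illustrative stand-in only: a real deployment would also confine memory, file-system, and network access, not just execution time.

```python
import subprocess
import sys

def run_search_job(code, cpu_seconds=2):
    """Run an untrusted search routine in a separate process and kill it
    if it exceeds its budget -- a stand-in for the Sandbox's task
    monitoring described above."""
    try:
        done = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True,
                              timeout=cpu_seconds)
        return ("ok", done.stdout.strip())
    except subprocess.TimeoutExpired:
        # Over-expensive process: terminated, and optionally billed to
        # the Search Engine under the service-level agreement.
        return ("killed", "")

well_behaved = "print('results: item A9FD64E12C')"
runaway = "while True: pass"

print(run_search_job(well_behaved))
print(run_search_job(runaway, cpu_seconds=1))
```

The same wrapper serves both pre-processing (b3) and real-time search (c4, c5); only the access rights granted to the child process differ.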

3.4. The Multimedia Pre-Processing Middleware Layer

The architecture we have discussed so far places no restrictions on which Search Engine can access which metadata. Of course, a system of custom access rights can easily be implemented, allowing, for example, Search Engines to create metadata that only they can exploit for subsequent searches – naturally, the Content Provider would then have reason to make special charges for the computational resources spent on these “private” files. A consequence of this is that a party could enter the system exactly as the Search Engines specified so far, but only ever perform pre-processing. This party would not offer any services to End-Users, but would perform standardised pre-processing in order to serve all other Search Engines in a well-organised manner. Although the PIVOTS design supports this as just another search engine, in terms of services provided it is in fact a separate entity, a Middleware Layer which lies conceptually between the Search Engines and the raw source data offered by the Content Providers. It translates the source data into meaningful metadata for the Search Engines, without, however, barring them access to the original data. A good example of such a middleware layer is one which relies on well-trusted international standards. For instance, representing all source data in the MPEG-4 or MPEG-7 format should not be undertaken by actual Search Engines (after all, which one would do the work first?), but by the middleware layer.

4. Multimedia Processing Technologies

In this Section, we describe and analyse specific multimedia processing technologies, starting with more general analysis tools and then progressing to search-specific ones. We thus illustrate how the efficient exploitation of the architecture which we have presented so far can be achieved, and present technical details in some depth to support our decision to introduce the pre-processing capability. The material of this section includes some new and innovative research results whose development was spurred by the PIVOTS point of view, but which also have an intrinsic value for mining multimedia digital libraries. We focus mainly on video data, as representative of the challenges faced in multimedia processing. Of course, algorithms to process any kind of data can be used with PIVOTS. It is especially interesting to address the processing of semantic visual information in digital libraries, as this is especially useful for automated search procedures.

4.1. Audio-Visual Content Description of Video Sequences

The most basic pre-processing which can be relied upon by subsequent real-time searches is the application of algorithms for audio-visual content description of video sequences. Raw pixel values cannot provide a semantic representation of visual information as humans perceive it. Extraction of high-level, semantic features to describe visual content has attracted much attention recently, especially in the context of the MPEG-4/7 standards [19], [20]. While some solutions exist for specific applications (e.g., videophone systems, news bulletins, etc. [21]-[25]) or as semi-automatic methods [26], [27], using, for example, specific object models, semantic object extraction remains one of the most challenging problems in the image analysis and computer vision community. Generic description of multimedia content has been addressed by the MPEG-7 standard [19], formally known as the “Multimedia Content Description Interface”. The standard uses a set of low-level audio-visual Descriptors (D) and Description Schemes (DS) to represent the rich variation of multimedia content. In addition, it specifies a standardized language for defining new DSs and Ds, as well as extending or modifying existing DSs and Ds: the MPEG-7 DDL, which is derived by extension of W3C XML Schema. While XML Schema has many of the capabilities needed by MPEG-7, it had to be extended to address other requirements specific to MPEG-7. The standard caters both for audio and video representation in the content domain. A) Audio Video Content Representation. MPEG-7 Audio provides structures — in conjunction with the Multimedia Description Schemes part of the standard — for describing audio content. Utilizing those structures are a set of low-level Descriptors, for audio features that cut across many applications, and high-level Description Tools that are more specific to a set of applications.
Those high-level tools include general sound recognition and indexing Description Tools, instrumental timbre Description Tools, spoken content Description Tools, an audio signature Description Scheme, and melodic Description Tools to facilitate query-by-humming. The low-level audio descriptors of MPEG-7 include the basic descriptors (instantaneous waveform, power, silence), the basic spectra (power spectrum, spectral centroid, spectral spread, spectral flatness), signal parameters (fundamental frequency, harmonicity), the spectral basis (used primarily for sound recognition), timbral temporal descriptors (log attack time, temporal centroid of a monophonic sound) and timbral spectral descriptors (features specific to the harmonic portions of signals – harmonic spectral centroid, spectral deviation, spectral spread, etc). In addition, descriptors cover background noise, channel cross-correlation, relative delay, balance, DC offset, bandwidth, transmission technology, errors in recordings (clicks, clipping, dropouts), and musical tempo. Some capabilities of the audio descriptors are spoken content search, sound recognition and indexing, instrument timbre search, melody search and query by humming (see part 4 of the MPEG-7 standard). B) Visual Video Content Representation. Color, texture, shape and motion information are adopted as the appropriate visual features by MPEG-7, each of which is characterized by appropriate descriptors. As far as color information is concerned, the scalable color descriptor (SCD) is defined as the color distribution over the entire image. In addition, the dominant color descriptor (DCD) aims at describing the local and global spatial distribution of the color. In contrast to the color histogram, in this approach, colors in a given region are clustered together into a small number of representative colors. Then, the descriptor consists of the representative colors, their percentage in a region, the spatial coherency of the color and the color variance.

Finally, the color layout descriptor (CLD) is used to describe the spatial distribution of colors in an arbitrary-shape image region. For texture information, two types of descriptors are extracted. Homogeneous texture descriptors aim to represent the directionality, coarseness, and regularity of patterns in a video frame. Non-homogeneous texture descriptors (the edge histogram) capture the histogram of the edges. The edge information is classified into five categories: vertical, horizontal, 45°, 135° and non-directional edges. Shape descriptors are categorized into two main classes: region-based shape descriptors and contour-based descriptors. The former use the entire shape region to extract meaningful information, which is useful when objects have similar spatial properties. The latter exploit only the boundary (contour) information of the objects to describe entities of similar contour characteristics. Finally, the motion activity descriptor represents the overall motion of a video segment. This descriptor describes whether a scene is likely to be perceived by a viewer as slow, fast paced, or action paced. The camera motion descriptor describes the movement of the camera or of a virtual point of the scene. All these descriptors would be made available to search algorithms through a direct application of the standard to the video database. Furthermore, Description Schemes can be used to represent more efficient combinations of the descriptors so as to allow a more semantic representation of the visual content. Descriptor/Description Scheme definition is supported by the standard through the standardized DDL language. Having described the image visual content, appropriate feature vectors, say f_i, are constructed, able to represent the content of the entire data with high efficiency. Vector f_i represents the content of the ith video frame of a given content item. These feature vectors are required by very many search techniques; their existence as metadata saves each search execution from dealing with pixel-level data to a large extent.
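As a toy illustration of constructing such a feature vector f_i, the sketch below concatenates crude stand-ins for two of the descriptor families above: a global colour histogram and a coarse edge-direction histogram. These are simplifications for illustration, not the standard's actual extraction procedures.

```python
import numpy as np

def frame_feature(frame, colour_bins=8, edge_bins=5):
    """Build a single feature vector f_i for one frame by concatenating
    simplified stand-ins for MPEG-7-style descriptors: a global colour
    histogram plus a coarse edge-direction histogram."""
    grey = frame.mean(axis=-1)
    gy, gx = np.gradient(grey)               # vertical / horizontal gradients
    angle = np.arctan2(gy, gx)               # edge orientation per pixel
    edge_hist, _ = np.histogram(angle, bins=edge_bins, range=(-np.pi, np.pi))
    colour_hist, _ = np.histogram(frame, bins=colour_bins, range=(0, 256))
    f = np.concatenate([colour_hist, edge_hist]).astype(float)
    return f / np.linalg.norm(f)             # unit norm for comparability

rng = np.random.default_rng(1)
frame = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
f_i = frame_feature(frame)
print(f_i.shape)
```

Pre-computing f_i for every frame, as metadata, is exactly the kind of expensive offline work Section 3.2 argues should not be repeated per search.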

4.2. Summarisation

Summarisation is the process of extracting a small but meaningful segment (summary) of each video sequence. This small summary attempts to represent the content of the entire sequence with few frames. Extraction of a small video summary is very important for visual content retrieval and indexing, since it permits quick but effective searching of the visual information. The filtering process (see below), which aims at discarding all useless information before applying the searching task, exploits the results of the summarization algorithm. Many algorithms have been proposed in the literature for video summarization. In particular, in [28], a single frame is extracted for each video shot to represent its visual content. However, single-frame extraction provides no information about the shot motion, and thus cannot sufficiently describe the content of shots of long duration. Other methods deal with the construction of compact image maps or image mosaics for each shot, which, however, are efficient only for video sequences with specific characteristics [29][30]. Furthermore, in [31] a method for analyzing video and building a pictorial summary has been presented, while in [32] a fuzzy visual content representation has been proposed with application to video summarization and content-based indexing and retrieval. Color and depth information have been appropriately combined in [33] to summarize stereoscopic video sequences, and a clustering method for video content summarization is adopted in [34].

As a complement to these techniques, we suggest the minimization of a cross-correlation criterion [34][35]. The concept of this approach is to extract as content representatives the most uncorrelated frames, as described by the respective feature vectors (i.e., summaries are computed utilising the results described in the previous subsection). These frames form a good summary since they represent high variations of the content activities. The cross-correlation criterion is defined as

R(a) = R_F(a_1, ..., a_K) = [2 / (K(K−1))] Σ_{i=1}^{K−1} Σ_{j=i+1}^{K} (ρ_{a_i, a_j})²    (1)

where a = [a_1 ... a_K]^T is the index vector of the most representative data and ρ_{a_i, a_j} the correlation coefficient of the ith and jth elements of vector a [36]. Thus, the most representative data are obtained by the following minimization problem:

â = (â_1, ..., â_K) = arg min_{a} R(a)    (2)

The complexity of an exhaustive search for estimating the minimum value of (1) would be unreasonably great, since all possible combinations of frames would need to be examined. For this reason, we propose a genetic algorithm (GA) approach in [37]. This scheme is able to find a solution close to the optimal one within a small number of iterations. Summarisation can greatly boost search performance by allowing a search algorithm to jump directly to the most informative frames of a video. By keeping the original index of the selected summary frames, a search algorithm can branch out into the original content if necessary. Thus, the potential head start does not come at the price of information loss, even though the summarisation process itself discards information.
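A minimal sketch of this approach follows: the criterion of Eq. (1), computed over synthetic feature vectors, is minimized by a toy genetic algorithm. The population size, operators and parameters here are illustrative only, not those of [37].

```python
import numpy as np

def R(indices, F):
    """Cross-correlation criterion of Eq. (1): mean squared correlation
    coefficient over all pairs of the K selected frames' feature vectors."""
    K = len(indices)
    total = 0.0
    for i in range(K - 1):
        for j in range(i + 1, K):
            rho = np.corrcoef(F[indices[i]], F[indices[j]])[0, 1]
            total += rho ** 2
    return 2.0 * total / (K * (K - 1))

def ga_summary(F, K=4, pop=30, gens=40, seed=0):
    """Toy GA: evolve K-frame index sets towards low R(a)."""
    rng = np.random.default_rng(seed)
    N = len(F)
    population = [rng.choice(N, size=K, replace=False) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda a: R(a, F))
        elite = population[: pop // 2]          # keep the fittest half
        children = []
        for parent in elite:
            child = parent.copy()
            child[rng.integers(K)] = rng.integers(N)   # point mutation
            # keep indices distinct; otherwise fall back to the parent
            children.append(child if len(set(child)) == K else parent)
        population = elite + children
    return min(population, key=lambda a: R(a, F))

rng = np.random.default_rng(2)
F = rng.normal(size=(60, 16))    # 60 frames' feature vectors (synthetic)
best = ga_summary(F)
print(sorted(int(x) for x in best), R(best, F))
```

The returned indices are the candidate summary frames; keeping them as indices into the original sequence preserves the branch-out capability described above.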

4.3. Hierarchical Summarization

The next step we suggest for the improvement of search performance is to provide a framework for the efficient navigation of the visual content. Video summarization is not always efficient for video navigation, since its goal is to extract a small “video abstract” by discarding visual information. On the contrary, in a video navigation scheme, the objective is clearly to assist the location of any frame in the video, if it turns out to contain content of interest. Suppose, for example, that a physician accesses remote medical archives of video sequences in order to find a set of frames, e.g., a shot, which he/she is interested in. In this case, application of a video summarization scheme may discard the frames of interest, and an attempt to retrieve them would need to go back to the basic, unorganised frame-sequence level. Instead, in a video navigation scheme, no information should be lost, since any video segment can be considered to be of particular user interest [38], [39]. In past work, we have designed a method where video decomposition is performed using a tree structure, the levels of which indicate the respective content resolution, while the nodes correspond to the segments that the video is partitioned into at this level [38]. In this framework, the user is able to select segments (tree nodes) of interest and reject segments of non-interest. For all selected segments, further visual decomposition is accomplished, resulting in a content hierarchy from the lowest (coarse) to the highest (fine) content resolution. In that work search was performed interactively, with a user guiding the process. In the context of implementing PIVOTS, we extended the work of [38] so that video decomposition is performed in an automatic way, and the retrieval problem is handled in an integrated manner. We adapted search algorithms to navigate the hierarchical structure in an automated fashion. Since the search is able to follow content context cues, the hierarchical video summary is capable of supporting stochastic branch-and-bound algorithms which discover desired content far more efficiently than a linear scan could hope to.

The hierarchical summary tree extends into four semantic levels: the shot representatives, the shots, the frame representatives and the frames (the leaves of the tree). Shot representatives are extracted by applying a clustering algorithm to all detected shots of a video sequence (as indicated by a shot cut detection algorithm). The second level comprises the shots of the respective cluster – for each shot, representative frames are extracted by minimizing a cross-correlation criterion as in the summarization stage. Next, the frames of each shot are grouped with respect to the extracted frame representatives to construct frame classes. The frames themselves, as stated, are the leaf nodes. This structure is depicted in Fig 6. This hierarchy is conveniently represented in XML as shown in Fig 7.

Fig. 6

An example of the tree-structure video content decomposition. [Figure: a four-level tree – shot representative level, shot level, frame representative level, frame level – partitioning a video sequence, with a selected path highlighted from the root down to individual frames.]

Fig. 7

An example of a tree-structure video content decomposition represented in XML.

4.4. Example Filtering and Searching Process

The purpose of PIVOTS is to enable proprietary search algorithms to mine private multimedia digital libraries. Of course, any published algorithm could be implemented directly by a Data Owner without the use of PIVOTS. Therefore, presenting and analysing an actual search mechanism is not a priority of this paper. We briefly describe an example search algorithm, including an appropriate filtering process, more as an example of how to make good use of the infrastructure we have described so far than out of an interest in its intrinsic filtering capabilities. Note that any process used at this level must of course be fully automated, i.e., it cannot be interactive or guided in any way by an end-user. The method we describe is divided into two steps: the filtering process and the searching process. (In comparison, the hierarchical video content decomposition scheme of [38] focuses only on the progressive representation of the visual content without dealing with the automatic hierarchical retrieval step.)

The first stage of the proposed searching algorithm for visual content is to apply a filtering process able to discard useless information. This way, searching efficiency is improved, since the search is applied only to a set of broadly relevant data. In particular, let us denote as f_q the feature vector constructed by the application of the content description algorithm to a query image file. Let us also denote as x_i^(k), i = 1, 2, ..., N^(k), the ith index frame of the summary of the kth video sequence in the database, where N^(k) is the number of frames of the summary of the kth video. Then, the goal of the filtering process is to compare the query feature vector f_q with all feature vectors of all video summaries, i.e., the f_{x_i^(k)}. If there exists a feature vector in a video summary similar to the query one, the respective video sequence is considered a candidate sequence of the user’s interest. On the contrary, if all frame indices of a video summary are far away from the query one, the respective video sequence is discarded from further searching. This is expressed as

S_k = { k : min_i d(f_q, f_{x_i^(k)}) < T, i = 1, 2, ..., N^(k) }    (2)

In equation (2), S_k is the set containing all video sequence indices for which at least one frame of the respective summary is close to the query one, T is an appropriately selected threshold and d(·) a distance measure. Thus, the set S_k consists of the indices of the video sequences selected as candidates for the searching process.

Having extracted a limited set of candidate video sequences for a given query image, the next stage is the actual searching process, which aims to retrieve and/or localize the frame/segment of the user’s interest. The hierarchical visual content representation presented in Section 4.3 is used in our case to increase the searching efficiency, while simultaneously reducing the computational complexity. That is, the searching process is applied in a hierarchical framework. In particular, let us denote as γ_i^(k) the index of the ith shot representative for the kth video sequence, that is, of the γ_i^(k) shot class. Then, a set of candidate shot classes, say SC, is constructed by discarding all clusters whose content representatives are far away from the query content, i.e.,

SC = { γ_i^(k) : d(f_q, f_{γ_i^(k)}) < T, ∀k ∈ S_k, ∀i }    (3)

Equation (3) means that the set SC contains all indices γ_i^(k) of shot representatives of a sequence k for which the distance between the query feature vector f_q and the shot representative feature vector f_{γ_i^(k)} is lower than a specified threshold. As a result, the set SC includes all candidate shot classes that may contain information relevant to the query. Similarly, for each selected shot class in SC, the most relevant shots are extracted. In particular, let us denote as s_{l,γ_i^(k)} the index of the lth shot in shot class γ_i^(k). Then, the set of possible shots, SS, is constructed as

SS = { s_{l,γ_i^(k)} : d(f_q, f_{s_{l,γ_i^(k)}}) < T, ∀k ∈ S_k, ∀γ_i^(k) ∈ SC }    (4)

In a similar manner, the respective frame classes and frames are extracted in a hierarchical framework, so that in the end the frames most relevant to the query among all video sequences are retrieved. The proposed hierarchical searching algorithm for visual content in potentially huge video databases significantly reduces computational complexity compared to traditional sequential search. This is because the proposed tree-structure content representation yields a reduction of the searched data of up to 16 times even in the worst case, i.e., the case in which content representatives are randomly selected. A much higher reduction is achieved with an appropriate selection of the frame/shot representatives. See [38] for experimental results which are quite similar to the ones produced by the method described here.
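The nested threshold filters of equations (2)-(4) can be sketched directly, using Euclidean distance as d(·) and toy feature vectors; frame classes and frames, which are handled identically, are omitted for brevity.

```python
import numpy as np

def d(a, b):
    """Euclidean distance as the d(.) of Eqs. (2)-(4)."""
    return float(np.linalg.norm(a - b))

def filter_videos(f_q, summaries, T):
    """Eq. (2): keep video k if any summary-frame feature is within T."""
    return {k for k, frames in summaries.items()
            if min(d(f_q, f) for f in frames) < T}

def hierarchical_search(f_q, summaries, shot_reps, shots, T):
    """Eqs. (2)-(4) as nested threshold filters:
    videos -> shot classes -> shots."""
    S_k = filter_videos(f_q, summaries, T)
    SC = [(k, i) for k in S_k
          for i, rep in enumerate(shot_reps.get(k, [])) if d(f_q, rep) < T]
    SS = [(k, i, l) for (k, i) in SC
          for l, shot in enumerate(shots.get((k, i), [])) if d(f_q, shot) < T]
    return S_k, SC, SS

f_q = np.zeros(4)
summaries = {0: [np.zeros(4)], 1: [np.ones(4) * 10]}   # video 1 is far away
shot_reps = {0: [np.zeros(4), np.ones(4) * 10]}        # one near, one far class
shots = {(0, 0): [np.zeros(4)]}
S_k, SC, SS = hierarchical_search(f_q, summaries, shot_reps, shots, T=1.0)
print(S_k, SC, SS)   # → {0} [(0, 0)] [(0, 0, 0)]
```

Each level prunes before the next is examined, which is where the claimed reduction over a linear scan comes from.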

5. Discussion and Conclusions

We have extended the PIVOTS system presented in [1] so as to make it applicable to searching large multimedia digital libraries. The technical details of constructing a PIVOTS-based system require quite a bit of detailed examination of possible security threats, and the file sizes of the premium content which must be protected play a very important role. The large size of multimedia content items, together with the expensive processing requirements implied by searching such high-volume data, actually makes dealing with security easier for the current application. Of course, without this intrinsic improvement, the application would have been impossible to achieve. By gaining flexibility due to the increased ease in guaranteeing security, we are able to significantly extend the PIVOTS model to allow what is, conceptually, or at the business-semantics level, a totally new middleware layer to enter the scene. The middleware allows source content items to be pre-processed, so that final search effectiveness can be brought within practicable bounds. The proposed model poses a partially new, and certainly quite interesting, environment for multimedia mining, since interactivity in the search process is hampered. Still, we focused our attention in quite some detail on purely video-processing considerations, and described how fully automated search algorithms can be derived from interactive ones, showing great promise to achieve high utility. Of course, all existing algorithms which are already non-interactive by nature can be used directly. Finally, we should note that there is rather little material to be found in the literature that could be considered “related work”. As we also discussed in [1]: “The closest to an alternative to our proposal can be found in the area of Digital Rights Management (DRM) [2]-[5].
One could suggest, for example, that the temporary browsing of content as in a bookstore can be achieved by DRM allowing potential customers access of limited duration to digital material being considered for purchase. It is therefore interesting to highlight the important differences between the two approaches. DRM involves a very complicated framework of technology (hardware as well as software), legislation and business models, all of which must work well before any commercial system can be deployed. PIVOTS is much simpler to implement, with much more specific objectives, and involves no legal issues. The risks behind distributing premium content secured under a DRM model are hard, if not impossible, to avoid, or even precisely quantify, since hackers have every opportunity to compromise the security of content which they are allowed to access. PIVOTS never allows any human user access to premium data, and the risk of theft is well-specified and can be reduced to any extent necessary by system configuration. If DRM achieves a status of maturity, it will offer the option of temporary personal browsing of premium content, unlike PIVOTS. Our system, however, by never allowing a human user to examine private data, is applicable to cases where the data is so sensitive that such a restriction is in fact a system requirement [for instance, an application dealing with medical records]. In conclusion, the functionality of PIVOTS is a well-focused subset of that promised by DRM. Importantly, PIVOTS can be deployed immediately for real applications. In contrast, DRM is still a controversial issue with high aspirations but also serious critics who believe it will be many years before its commercial maturity.”

ACKNOWLEDGEMENT

The development of new tools and algorithms for media content searching, indexing and analysis in digital libraries (DL) is one of the hottest research issues in database and knowledge engineering. This is also verified by the EU-funded Network of Excellence (NoE) DELOS, which aims to bring together researchers across Europe to cover all research issues of digital libraries. The DELOS Network of Excellence also supports the presented work.

References

[1]. A. Yannopoulos, Y. Stavroulas and T. Varvarigou “Moving E-Commerce with PIVOTS: Private Information Viewing Offering Total Safety,” IEEE Trans. Knowledge and Data Engineering, Vol. 16, No. 6, pp. 1-12, June 2004.

[2]. B. Rosenblatt, B. Trippe, and S. Mooney, Digital Rights Management: Business and Technology, New York: John Wiley & Sons, 2001.

[3]. P. Samuelson, “Good News and Bad News on the Intellectual Property Front”, Communications of the ACM, vol. 42, no. 3, 1999, pp. 335-342.

[4]. S. Kenny and L. Korba, “Applying digital rights management systems to privacy rights management”, Computers & Security, vol. 21, no. 7, 2002, pp. 648-664.

[5]. S. H. Kwok et al., "Digital rights management in Internet open trading protocol (IOTP)", Proc. Int'l Conf. Electronic Commerce (ICEC 2000), 2000, pp. 179-185.

[6]. L. Gong, Inside Java 2 Platform Security, Addison-Wesley, 1999.

[7]. T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Addison-Wesley, 2001.

[8]. http://java.sun.com/security/ (current Mar. 2003).

[9]. Agent Communication Language, FIPA 2000 Specification. Available at http://www.FIPA.org (current Mar. 2003).

[10]. G. Wagner, Foundations of Knowledge Systems with Applications to Databases and Agents, New York: Kluwer Academic Publishers, 1998.

[11]. H. Nwana and D. Ndumu, "A Perspective on Software Agents Research", The Knowledge Engineering Review, Cambridge University Press, vol. 14, no. 2, 1999, pp. 1-18.

[12]. G. Wagner, "Towards Agent-Oriented Information Systems", Technical Report, Freie Universitat Berlin, 1999. Available at http://www.inf.fu-berlin.de/~wagnerg/index.html

[13]. N. R. Jennings, K. Sycara and M. Wooldridge, "A Roadmap of Agent Research and Development", Autonomous Agents and Multi-Agent Systems, vol. 1, no. 1, 1998, pp. 7-38.

[14]. M. J. Huber, JAM Agents in a Nutshell. Available at http://members.home.net:80/marcush/IRS/ (current Dec. 2002).

[15]. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. Upper Saddle River, NJ: Prentice-Hall, 1999.

[16]. http://java.sun.com/j2se/1.3/docs/ (current Mar. 2003).

[17]. Sun Microsystems, Security and Signed Applets, 2001. Available at http://jsp2.java.sun.com/products/plugin/1.3/docs/netscape.html (current Mar. 2003).

[18]. http://www.ikv.de/products/grasshopper/ (current Sept. 2002); for newer versions, see http://www.ikv.de/content/Produkte/osp_e.htm (current Mar. 2003).

[19]. MPEG-7 Requirements Group, "MPEG-7: Context, Objectives and Technical Roadmap, V.12", Vancouver, July 1999, ISO/IEC SC29/WG11 N2861.

[20]. ISO/IEC JTC1/SC29/WG11, "MPEG-4 Video Verification Model Version 11.0", Doc. N2172, Mar. 1998.

[21]. N. Doulamis, A. Doulamis, D. Kalogeras and S. Kollias, "Very Low Bit-Rate Coding of Image Sequences Using Adaptive Regions of Interest", IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 8, pp. 928-934, Dec. 1998.

[22]. A. Doulamis, N. Doulamis and S. Kollias, "On-Line Retrainable Neural Networks: Improving the Performance of Neural Networks in Image Analysis Problems", IEEE Trans. Neural Networks, vol. 11, no. 1, pp. 137-155, Jan. 2000.

[23]. A. Doulamis, N. Doulamis, K. Ntalianis and S. Kollias, "An Efficient Fully-Unsupervised Video Object Segmentation Scheme Using an Adaptive Neural Network Classifier Architecture", IEEE Trans. Neural Networks, accepted for publication (to appear).

[24]. R. Castagno, T. Ebrahimi and M. Kunt, "Video Segmentation Based on Multiple Features for Interactive Multimedia Applications", IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 562-571, 1998.

[25]. L. Garrido, F. Marques, M. Pardas, P. Salembier and V. Vilaplana, "A Hierarchical Technique for Image Sequence Analysis", Proc. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp. 13-20, Louvain-la-Neuve, Belgium, June 1997.

[26]. C. Gu and M.-C. Lee, "Semiautomatic Segmentation and Tracking of Semantic Video Objects", IEEE Trans. Circuits and Systems for Video Technology, vol. 8, pp. 572-584, 1998.

[27]. F. Bremond and M. Thonnat, "Tracking Multiple Nonrigid Objects in Video Sequences", IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 585-591, Sept. 1998.

[28]. B.-L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Videos", IEEE Trans. Circuits and Systems for Video Technology, vol. 5, pp. 533-544, Dec. 1995.

[29]. M. Irani and P. Anandan, "Video Indexing Based on Mosaic Representation", Proceedings of the IEEE, vol. 86, no. 5, pp. 805-921, May 1998.

[30]. N. Vasconcelos and A. Lippman, "A Spatiotemporal Motion Model for Video Summarization", Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 361-366, Santa Barbara, USA, June 1998.

[31]. M. M. Yeung and B.-L. Yeo, "Video Visualization for Compact Presentation and Fast Browsing of Pictorial Content", IEEE Trans. Circuits and Systems for Video Technology, vol. 7, no. 5, pp. 771-785, Oct. 1997.

[32]. A. Doulamis, N. Doulamis, Y. Avrithis and S. Kollias, "A Fuzzy Video Content Representation for Video Summarization and Content-Based Retrieval", Signal Processing, Elsevier Press, vol. 80, pp. 1049-1067, June 2000.

[33]. N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis and S. Kollias, "Efficient Summarization of Stereoscopic Video Sequences", IEEE Trans. Circuits and Systems for Video Technology, vol. 10, no. 4, pp. 501-517, June 2000.

[34]. A. Hanjalic and H. Zhang, "An Integrated Scheme for Automated Abstraction Based on Unsupervised Cluster-Validity Analysis", IEEE Trans. Circuits and Systems for Video Technology, vol. 9, no. 8, pp. 1280-1289, Dec. 1999.

[35]. Y. Avrithis, N. Doulamis, A. Doulamis and S. Kollias, "A Stochastic Framework for Optimal Key Frame Extraction from MPEG Video Databases", Computer Vision and Image Understanding, Academic Press, vol. 75, nos. 1/2, pp. 3-24, July/Aug. 1999.

[36]. A. Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill, 1987.

[37]. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.

[38]. A. Doulamis and N. Doulamis, "Optimal Content-Based Video Decomposition for Interactive Video Navigation over IP-Based Networks", IEEE Trans. Circuits and Systems for Video Technology, June 2004, to appear.

[39]. J. R. Smith, "VideoZoom: Spatio-Temporal Video Browser", IEEE Trans. Multimedia, vol. 1, no. 2, pp. 157-171, June 1999.

[40]. B.-L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Videos", IEEE Trans. Circuits and Systems for Video Technology, vol. 5, pp. 533-544, Dec. 1995.

[41]. ISO/IEC JTC 1/SC 29/WG 11/N3964, N3966, "Multimedia Description Schemes (MDS) Group", Singapore, March 2001.

[42]. J. Nam and A. H. Tewfik, "Video Abstract of Video", Proc. IEEE Int. Workshop on Multimedia Signal Processing, pp. 117-122, Copenhagen, Denmark, Sept. 2000.