INTEGRATION OF A VOICE RECOGNITION-BASED INDEXING WITH MULTIMEDIA APPLICATIONS

Mikolaj Leszczuk, Zdzisław Papir
Department of Telecommunications, University of Mining and Metallurgy, al. Mickiewicza 30, 30-059 Kraków, Poland
e-mail: [email protected], [email protected]

ABSTRACT

The subject of the article is the development of multimedia search engines that use recognition mechanisms (face, text and speech recognition). The basics of building multimedia indexing engines are explained and an example application [18], which employs a speech recognition engine, is described.

1 INTRODUCTION

About three-fourths of computers worldwide are equipped with multimedia devices (Fig. 1), and most of them are now on-line. A breakthrough is coming: soon users will want to access multimedia information on-line. However, they will not be able to find such files without databases supported by advanced indexing systems.

Figure 1. Technology of tomorrow

The presented multimedia search engine is expected to fill that gap and to meet the requirements of tomorrow’s technology, in which the idea of the information superhighway becomes a reality.

2 STATE-OF-THE-ART

Indexing of multimedia databases has not yet been investigated in depth and deserves much more research effort because of its anticipated importance in the near future [15] [16] [17] [5]. The proposed indexing engine will be integrated into the Video Streaming Engine developed in the Department of Telecommunications under the Broadband Trial Integration (ACTS AC 362) project sponsored by the European Commission [11] [13] [14]. The Video Streaming Engine was designed to provide easy access to, and basic search of, digital video content through a WWW interface [13].

All predictions about the future shape of the Internet present it as a huge multimedia information superhighway accessible to everyone. These visions include both broadband access networks and advanced multimedia services. In such an environment, browsing multimedia databases will be as common as browsing non-multimedia HTML web pages is now. People will often watch television news programs, browse virtual videotape rental shops, listen to Internet radio stations or simply search for movie clips.

That is why browsing multimedia databases should be fast and efficient. After submitting a query, each database client should also get answers that fit his/her requirements as accurately as possible, unlike today, when a query to a popular HTML search engine returns thousands of unusable and out-of-date hits.

Figure 2. AltaVista Video Search

Browsing a video database with traditional methods is quite complicated. A multimedia database is not the same thing as a database with other types of records. Because textual databases are easy to index, multimedia databases should contain text descriptions of their content. In addition, browsing a text file extracted from a text database is easier than browsing video content. A modern multimedia database requires other methods of indexing and browsing. Unfortunately, most databases still do not have any advanced indexing and searching engines. In most cases, the database is searched through manually generated text descriptions.

Where, then, does the power of search engines like AltaVista come from? Popular search engines have a huge database of web pages, many of which contain multimedia files. Therefore, for AltaVista, the only problem in creating a multimedia search engine was to construct slightly more sophisticated queries to the database, something like the following (this is only an example):

SELECT * FROM web_pages WHERE page_keywords LIKE '%mars%' AND file_extension LIKE 'AVI';

Sometimes more advanced solutions are used. In these engines, searching is performed using the filenames or the text surrounding the multimedia object embedded in a web page. These solutions bypass the problem but still do not solve it. A further problem for the proposed multimedia indexing service is that multimedia files make up less than 5% of the whole Internet content, so only one in twenty fetched pages contains files of interest. To the best of the authors’ knowledge, no operational indexing and browsing engines that use recognition systems exist.

When about a thousand or more movies are stored on a video server, searching through its content is quite complicated. Even narrowing the criteria returns hundreds of answers in huge databases. Users have to stream all the movies to find the desired one, which costs both time and CPU load.
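The keyword-plus-extension filter quoted above can be made concrete with a small sketch. The Python example below runs such a query against an in-memory SQLite table. The table layout, the column names, the sample rows and the function name find_media_pages are assumptions made for illustration only; they are not taken from AltaVista or from the system described in this paper.

```python
import sqlite3

def find_media_pages(rows, keyword, extension):
    """Return URLs of pages whose keywords contain `keyword`
    and that embed a file with the given `extension`."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per (page, embedded file) pair.
    conn.execute(
        "CREATE TABLE web_pages (url TEXT, page_keywords TEXT, file_extension TEXT)"
    )
    conn.executemany("INSERT INTO web_pages VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT url FROM web_pages "
        "WHERE page_keywords LIKE ? AND file_extension LIKE ? "
        "ORDER BY url",
        (f"%{keyword}%", extension),
    )
    hits = [row[0] for row in cursor]
    conn.close()
    return hits

# Invented sample data.
SAMPLE_ROWS = [
    ("http://example.org/mars-rover", "mars rover landing", "AVI"),
    ("http://example.org/mars-text", "mars geology", "HTML"),
    ("http://example.org/venus-clip", "venus probe", "AVI"),
]

print(find_media_pages(SAMPLE_ROWS, "mars", "AVI"))
```

Such a query only filters on page keywords and file extensions; it says nothing about what the multimedia file itself contains, which is the limitation discussed in this section.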
The video must be well described to ease the task of video browsing as much as possible [11].

3 MAIN GOALS

Several methods could solve the problems described in the previous section. Above all, descriptions of multimedia files must be generated automatically. In most cases, the best solution is to integrate the search application with three recognition engines (voice, text and face). Each client of the application should be able to specify which recognition engine is used for a search.

Figure 3. “Onion” architecture of the DVL application

The application should consist of three main parts (Fig. 3):
1. The Internet web robot, which automatically finds multimedia objects on the network,
2. The recognition and indexing engine, which automatically generates descriptions of multimedia objects,
3. The multimedia server (for example, integrated with Oracle Video Server [4] or Real Networks Server) and database, which allow clients to search and view multimedia content.

3.1 Web Robot

Let us compare the Internet to a big spider web (a rather common comparison). A multimedia object can then be pictured as a butterfly caught in this web (see Fig. 4).

Figure 4. Web Spiders

To index this object (wave file, animation, movie) we have a spider. The spider is a program that waits in one place in the network and, when started, tries to locate a butterfly. It is an application that crawls through the web, finds “victims” (multimedia objects) and feeds them to the input of a voice recognition engine.

3.2 Voice Recognition

Speech (voice) recognition software is used by hundreds of thousands of people every day. It is well known that this technology is being implemented in customer service applications, dictation systems, etc. Why is it so popular now? There are two reasons. The first is that computer hardware able to take advantage of the technology is now available and affordable. The second, more important reason is that these systems used to work well only in lab conditions with constrained tasks (reduced vocabulary, speaker dependence, etc.), but are now starting to work in real environments.

In addition, for a wide range of applications the use of a GUI is not possible, e.g., when the user accesses the system through a voice call or a terminal without any pointing device. In such situations, it is essential to access multimedia data via spoken language interfaces or via an agent interface. It is also essential that media translation tools are available to match the client capabilities, e.g., text-to-speech conversion of textual material into a spoken form for presentation to the client.

Two of the leading voice recognition packages are the Microsoft SAPI 4.0a SDK Suite [9] and the IBM ViaVoice SDK [7]. The software is free to download as a trial. These packages have Java and C++ interfaces for use in one's own programs [2].

3.3 Video-Content Server

The last part of the development work was the integration of the voice recognition engine with the existing video-content server, a step towards a Digital Video Library with an automatic indexing system.
Creating hand-made descriptions is no longer necessary. The system was built from existing, well-established components. Each component is responsible for specific tasks connected with the use of the Video Streaming Engine [6]. The Video Streaming Engine is based on the Oracle Video Server platform.

4 IMPLEMENTATION

This section describes the implementation of all parts of the application:
1. The list of programs used to build the voice-indexing engine.
2. The programming languages used.
3. The detailed procedure of multimedia content indexing.

4.1 Voice Recognition-based Indexing

Voice recognition-based indexing needs several programs to operate. It uses “Main Actor” to perform some basic operations on multimedia files. Then “WinAmp” converts the audio from MPEG to WAV format. Finally, the “Audio Index” program (written using “Microsoft SAPI”) indexes the files.

4.1.1 Main Concept Main Actor v3.06

Main Actor is a professional video and animation editing suite consisting of Main Actor Sequencer, Main Actor Video Capture, Main Actor Video Editor and Main View.

4.1.2 Nullsoft WinAmp v2.60

Nullsoft WinAmp is a high-fidelity music player for Windows 95/98/NT. WinAmp supports MP3, CD and other audio formats, more than 5,000 skins and 150 audio visualization and effect plug-ins. WinAmp is freeware.

4.1.3 Microsoft SAPI v4.0a

This SDK provides the tools, data and samples needed to incorporate speech into one's own Windows applications. The Microsoft Speech SDK works with Windows 95/98 and Windows NT 4.0 or later (x86, Alpha).

4.2 Programming Languages Used

The criteria for choosing programming languages varied. Sometimes a language was picked for its flexibility (“Bash”) or ease of use (“C++”). Sometimes there was simply no choice (“CMD”, “Rexx”, “SQL”).

4.3 Procedure

The end-to-end procedure of indexing and accessing multimedia data through the Digital Video Library can be divided into four separate parts [17]:
1. Web-surfing,
2. Indexing,
3. Storing,
4. Accessing.
Each of these parts is described below. Please note that the Internet appears on both sides of the application: once as a content provider and once as a medium for accessing multimedia files.

Figure 5. Interworking of modules

It is important to keep in mind that the whole application actually runs on two servers. Both the web-surfing and indexing processes run on an NT machine. The storing and accessing procedures are performed on a UNIX server. Communication between the two parts of the application goes through files transported over Samba.

4.3.1 Web-surfing

The “Surf” application, which crawls the Internet, was written in Perl. Started with a base URL, “Surf” generates a list of “links” (new URLs embedded in the web pages whose addresses were given as input). Then a Perl script compares the input list with the output list and finds links that could lead to multimedia content. It uses file extensions to identify multimedia links. The files found are downloaded by another Perl script (Fig. 6). The currently supported extensions are listed in the appendix.
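The link-extraction and extension-filtering step can be illustrated with a short sketch. The original “Surf” scripts were written in Perl and are not reproduced in this paper; the Python example below is only an assumed equivalent. The class and function names, the sample page and the extension set (a subset of the formats listed in the appendix) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Hypothetical subset of the extensions listed in the appendix.
MEDIA_EXTENSIONS = {".avi", ".mpg", ".mpeg", ".mov", ".wav", ".mp2", ".au"}

class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def split_links(base_url, html):
    """Return (media_links, page_links) found in one HTML page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    media, pages = [], []
    for link in parser.links:
        # Classify by the (case-insensitive) extension of the URL path.
        extension = posixpath.splitext(urlparse(link).path)[1].lower()
        if extension in MEDIA_EXTENSIONS:
            media.append(link)
        else:
            pages.append(link)
    return media, pages

# The second anchor is deliberately left unclosed.
SAMPLE_HTML = (
    '<p><a href="clips/mars.AVI">clip</a> '
    '<a href="/news.html">news'
    '<a href="http://other.example/a.wav">sound</a>'
)
media_links, page_links = split_links("http://example.org/index.html", SAMPLE_HTML)
print(media_links)
print(page_links)
```

Media links would be passed to the downloader, while page links are candidates for further crawling. A tolerant parser is used here on purpose, because real pages often contain unclosed tags.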

Figure 6. Processing web pages for multimedia downloading

4.3.2 Indexing

The indexing script (“Index”, also written in Perl) starts by opening the files recently generated by “Surf” and directing them to “MainActor”. The “MainActor” software is then launched with a “Split” script (written in REXX) as a parameter. That script loads each multimedia file into “MainActor” and collects data about its content, such as the number of frames, timecodes, format, stream, width, height and so on. In the case of a video file, the script automatically saves the first frame of the file as a small preview picture in a portable format. If the file contains an audio track, “MainActor” extracts it and saves it as an MP2 file.

Once the MP2 files have been prepared, the “Index” script executes the “WinAmp” application, which converts the MP2 audio into ordinary Windows wave files. “WinAmp” can also be used to amplify speech frequencies and cut off background noise.

The next part of the “Index” job is the indexing itself. For each file, the script runs the “Audio Index” application, which reads the wave file and tries to recognize speech. The accuracy depends on the type of audio track. An example of recognition is shown below.

Speech recognition: “The passages and extract from beacon males will be improve loans. Beacon mail wanted advertising until one day he and his wife decided to move to the south from its. The book is a detailed and often amusing description of those she and then you have an account of the difficulties that forms may have when they tried to Simmons says into to.”

Reality: “The passage is an extract from Peter Mayle’s book 'A Year in Provence'. Peter Mayle worked in advertising in London until, one day, he and his wife decided to move to the south of France. The book is a detailed and often amusing description of the first year in their new home, and an account of the difficulties that foreigners may have when they try to assimilate themselves into a new culture.”

One hundred thirty-five words were recognized. Eighty of them are correct (almost 60%).
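The accuracy figure follows directly from the two counts given above: 80 / 135 ≈ 0.593. The sketch below shows this arithmetic together with one possible way of counting matching words automatically, by aligning the recognized and reference word sequences. The alignment method is an assumption made for illustration; the paper does not state how the eighty correct words were counted, and the sketch is not claimed to reproduce that count.

```python
from difflib import SequenceMatcher

def word_accuracy(correct_words, recognized_words):
    """Fraction of recognized words that are correct."""
    return correct_words / recognized_words

def count_matching_words(recognized, reference):
    """Count the words shared by both texts in the same order (case-insensitive)."""
    hyp = recognized.lower().split()
    ref = reference.lower().split()
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())

# Figures reported in the text: 80 correct words out of 135 recognized.
print(f"{word_accuracy(80, 135):.1%}")

# Small invented example: one substituted word out of seven.
print(count_matching_words(
    "he and his life decided to move",
    "he and his wife decided to move",
))
```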

4.3.3 Storing

The “Add” script was written to store the results in a database. It connects to the PostgreSQL database using the “Pg” module. In essence, the script is a loop that reads the records written to a description file by the previous script and processes them (Fig. 7).
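A minimal sketch of such a storing loop is given below, including the rule that a record whose URL is already stored is updated rather than inserted again. It is an illustration under stated assumptions and not the original “Add” script: it uses Python with an SQLite database as a stand-in for Perl with PostgreSQL, and the table name, the column names and the tab-separated "URL, description" line format are all hypothetical.

```python
import sqlite3

def store_descriptions(conn, lines):
    """Insert or update one record per 'url<TAB>description' line.

    Returns a tuple (inserted, updated).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS media (url TEXT PRIMARY KEY, description TEXT)"
    )
    inserted = updated = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines in the description file
        url, description = line.split("\t", 1)
        exists = conn.execute(
            "SELECT 1 FROM media WHERE url = ?", (url,)
        ).fetchone()
        if exists:
            # The URL is already known: refresh its description only.
            conn.execute(
                "UPDATE media SET description = ? WHERE url = ?",
                (description, url),
            )
            updated += 1
        else:
            conn.execute(
                "INSERT INTO media (url, description) VALUES (?, ?)",
                (url, description),
            )
            inserted += 1
    conn.commit()
    return inserted, updated

connection = sqlite3.connect(":memory:")
counts = store_descriptions(connection, [
    "http://example.org/x.avi\tfirst description",
    "http://example.org/y.wav\tanother file",
    "http://example.org/x.avi\tsecond description",
])
print(counts)
```

Keying the table on the URL keeps one row per file, so re-indexing a file refreshes its description without creating a duplicate record.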

Figure 7. Storing of content

When the script finds a file whose URL already appears in the database, it does not insert a new row but only updates the file's description. This avoids duplicate records in the database.

4.3.4 Accessing

Access to multimedia files is provided through a Web interface implemented as CGI scripts written in Perl. The main script, “media”, receives the requested keyword on its input as the variable “search”:

…/cgi-bin/media.cgi?search=mars

The whole process starts with filling in the HTML form. The user just types a keyword and clicks the “Search” button. This launches the same script with the parameters taken from the HTML form.
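How the keyword reaches the script can be sketched as follows. The example decodes the “search” parameter from a request URL using Python's standard library. It is an assumed illustration only; the original “media” script is a Perl CGI program, and the host name and the function name extract_search_keyword are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def extract_search_keyword(url):
    """Return the decoded value of the 'search' parameter, or None if it is missing."""
    query = urlparse(url).query
    # parse_qs drops parameters with empty values by default.
    values = parse_qs(query).get("search", [])
    if not values:
        return None
    keyword = values[0].strip()
    return keyword or None

print(extract_search_keyword("http://example.org/cgi-bin/media.cgi?search=mars"))
print(extract_search_keyword("http://example.org/cgi-bin/media.cgi?search=red+planet"))
```

Form submission encodes spaces as "+" and special characters as percent escapes, so the keyword should be decoded before it is used in a database query.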

Figure 8. Multimedia Search Engine

The “media” script automatically switches to the case-insensitive or case-sensitive mode. Then the script generates the HTML header required for compliance with the HTML standard. The script also generates a query to the database and executes it. As a result, it gets a table of rows containing the hits for the query (Fig. 8).

5 CONCLUSIONS

The authors’ conclusions relate to:
1. Web surfing,
2. Retrieving of media files,
3. Indexing of media files,
4. Storing of files in the database,
5. Web interface.

5.1 Web surfing

Unfortunately, web surfing turned out to be a quite unreliable module of the application. The web robot has to analyse web pages for links, extract them and decide whether or not to follow them. At the same time, it has to save multimedia links in an output file. Problems appeared almost everywhere. For example, HTML tags are often not closed correctly.

5.2 Retrieving

Each multimedia file must be retrieved (downloaded) from the Internet before indexing. This procedure usually takes a lot of time. Unfortunately, network bandwidth is still not broad enough for quick downloading of large files. The solution was to create an efficient multi-threaded download system. Of course, this loads the server and the network. Each downloading thread takes up to 5 megabytes of memory and allocates as much bandwidth as is currently available. On the other hand, this method improves downloading efficiency and shortens download time.

5.3 Indexing

The indexing process is also very slow, but nothing can be done to speed it up without a loss of speech recognition accuracy. Since the accuracy is still not very high, the only way to improve indexing speed is to use a faster indexing server.

5.4 Storing

The storing procedures work properly. They are not option-rich; however, for the start-up stage of the application this is definitely enough. Once the database becomes voluminous, it will probably be necessary to check its integrity. A special script will read each database record, connect to the addresses extracted from that record and check whether the file still exists on its server.

5.5 Web interface

Developing a web interface is a never-ending story. There is always a lot of development to do, as well as many bugs to fix. Fortunately, the web interface works well, although its perceived functionality is still not satisfactory. The most important problem is usually to create an interface that is rendered correctly by every type of Internet browser. Unfortunately, some of them do not support all multimedia extensions. For example, there are no problems with playing multimedia files through the Oracle Video Client web plug-in in browsers such as Microsoft Internet Explorer® or Netscape Navigator®, but the plug-in does not work with the Opera Browser®, even though the Oracle Video Client is a standard Netscape plug-in and the Opera Browser® should support it.

6 FURTHER WORK

Generally, the proposed multimedia search engine should evolve into a vertical multimedia Internet portal. To make the application more convenient to use, some development work remains to be done, especially on the application GUI. Besides, the Internet site has to be well marketed. The following features should be added: MP3 descriptions, advanced search, text recognition and face recognition.

6.1 MP3 descriptions

Each MP3 file can contain a special description (called an ID3 tag), which could be used for indexing that file. The available fields are: title, artist, album, year, genre and comment, plus other MPEG information such as size, length, MPEG layer, bit-rate, frequency, mono/stereo, privacy, CRC, copyright, original and emphasis (Fig. 9).
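Reading these fields is straightforward when the file carries an ID3 version 1 tag, which occupies the last 128 bytes of the file and starts with the marker "TAG". The sketch below is an illustration under that assumption; the paper does not say which ID3 version would be supported, and the function name parse_id3v1 and the sample data are hypothetical.

```python
import struct

# ID3v1 layout: "TAG", title, artist, album, year, comment, genre index.
ID3V1_LAYOUT = struct.Struct("<3s30s30s30s4s30sB")  # 128 bytes in total

def parse_id3v1(data):
    """Return the ID3v1 fields found at the end of `data`, or None if there is no tag."""
    if len(data) < ID3V1_LAYOUT.size:
        return None
    tag, title, artist, album, year, comment, genre = ID3V1_LAYOUT.unpack(
        data[-ID3V1_LAYOUT.size:]
    )
    if tag != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "genre": genre,  # numeric index into the ID3v1 genre list
    }

# Invented sample: some audio bytes followed by a hand-built tag.
sample = (
    b"\xff" * 10
    + b"TAG"
    + b"Mars".ljust(30, b"\x00")
    + b"Some Artist".ljust(30, b"\x00")
    + b"Some Album".ljust(30, b"\x00")
    + b"2000"
    + b"demo".ljust(30, b" ")
    + bytes([12])
)
print(parse_id3v1(sample))
```

The text fields returned this way could be stored as the file's description in the same way as the recognized speech.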

Figure 9. MPEG file info box + ID3 tag editor

The goal is to extract data from the ID3 tag and use them for indexing multimedia content.

6.2 Advanced search

At present, the method of searching for keywords is very simple. It uses the “match” (regular expression) operators in case-sensitive and case-insensitive variants. This means that a case-sensitive query matches:

'thomas' ~ '.*thomas.*'

A case-insensitive query matches:

'thomas' ~* '.*Thomas.*'

The goal is to create more advanced search mechanisms. A good example of an advanced text search engine is AltaVista. The main difference is the ability of its Advanced Search to use Boolean expressions. Boolean expressions use the words OR, AND, AND NOT and NEAR to create relationships among the keywords in the search query.

6.3 Text Recognition

Text recognition (sometimes known as "OCR", an abbreviation of "Optical Character Recognition") is the process whereby text is "extracted" from an image. Text recognition is about 20 to 25 times faster than manual retyping.

Figure 10. ABBYY Fine Reader 4.0 Professional

In future advanced engines, it will be possible to use other recognition engines, such as hand-written text recognition engines. Three well-known OCR engines are available: ABBYY Fine Reader 4.0 Professional, Recognita and Cognitive Cuneiform ’99 OCR. These software packages support batch processing as well as application programming interfaces (APIs) for C (and C++) programmers.

6.4 Face Recognition

Face recognition is a means by which people have recognized one another since the beginning of civilization. Now computers have the ability to recognize faces too [1]. Face recognition technology is a recent addition to the family of recognition technologies. This software allows a computer connected to a video camera to locate human faces in images, extract them from the rest of the image and identify or verify who they are by matching the facial patterns to records stored in a database. For many real-world applications, face recognition technology offers benefits that cannot be provided by other recognition technologies (e.g., fingerprint, iris scan).

It is also planned to implement a face recognition engine in the future multimedia indexing system. Very likely, a software package by Visionics, Face It DB, will be used. Face It DB can handle a huge database of faces in which recognized shapes are searched [3].

ACKNOWLEDGEMENTS

The authors would like to acknowledge the suggestions of many people, in particular the insightful comments of the staff of the IBM Almaden Research Center in California.

APPENDIX

List of supported file formats:
1. 8SVX – supports 8-bit sound, uncompressed or compressed with Fibonacci or Exponential delta compression.
2. AIFF – supports 8/16-bit uncompressed sound; an old audio format from Amiga machines.
3. AU – supports 8/16/32-bit linear sound, with A-Law and U-Law encoding.
4. AVI – supports 4/8/16/24-bit animation with sound, uncompressed or compressed with Microsoft (RLE or Video 1), Cinepak™ Radius, Intel Indeo™ (R up to 5.0 or Raw), Motion JPEG, Ultimotion™ IBM as well as Video for Windows System codecs.
5. DL – supports 8-bit (256 colours) animation, type 1 or 2; does not support sound.
6. FLI – supports 8-bit (256 colours) animation, uncompressed or compressed with Byte Run, Byte Line Coding as well as Clear Screen codecs; supports only 320x200 animation.
7. FLC – supports 8-bit (256 colours) animation, uncompressed or compressed with Byte Run, Byte Line Coding, Word Line Coding as well as Clear Screen codecs.
8. ANM – supports 8-bit (256 colours) animation, IFF-Anim 3, 5, 7, 9 and J formats, compressed with Byte Run, Delta3, Delta5, Delta7_16, Delta7_32, Delta8_16, Delta8_32 as well as DeltaJ codecs; supports HAM6/8 (Hold And Modify) and EHB (Enhanced Half Bright).
9. MPG – supports 24-bit animation with sound, compressed with MPEG-1 or MPEG-2 codecs; based on algorithms by MSSG.
10. MPEG – the same as .MPG; the only difference is that PC systems usually use three-character extensions.
11. MP2 – supports 16-bit sound, compressed with MPEG Audio Layers (the first, the second and the third).
12. MP3 – not yet fully recognized, but planned for the future; .MP3 is probably the most common audio format nowadays.
13. MOV – supports 8/16/24-bit animation with sound, uncompressed or compressed with Apple (Animation, Graphics or Video), Cinepak™ Radius, Intel Indeo™ (R up to 5.0 or Raw), Motion JPEG as well as Video for Windows System codecs; unfortunately does not support all QuickTime formats; based on algorithms by Apple Computer.
14. WAV – supports 8/16-bit sound, compressed with Pulse Code Modulation codecs.

In the future, some new multimedia formats will be added, such as VX, RA and RAM. VX comes from VXtreme streaming video and is a new streaming format for multimedia files (especially video). RA and RAM are well-known streaming and non-streaming formats supporting RealAudio and RealVideo. The only problem lies in finding appropriate decoders. They must handle the new format types and generate 8/16-bit mono wave files of at most 22 kHz on their output, because the speech recognition system supports only 8/16-bit mono wave files with a sampling frequency not higher than 22 kHz.

REFERENCES

[1] Kruizinga, Peter, The Face Recognition Home Page, University of Groningen, 2000, http://www.cs.rug.nl/~peterkr/FACE/face.html

[2] Microsoft Staff, Microsoft Speech SDK 4.0, Microsoft Corporation, 1995-1998, http://www.Microsoft.com/
[3] Visionics Staff, Face Recognition Technology, Visionics, 2000, http://www.visionics.com/
[4] Altman, Naomi, Department of Biometrics Home Page, Cornell University, 2000
[5] Harold S. Stone, “Image Libraries and the Internet”, IEEE Communications Magazine, January 2000
[6] AfB, Association for Biometrics Home Page, AfB, 2000, http://www.afb.org.uk/
[7] “IBM Software: Voice Systems”, http://www-4.ibm.com/software/speech/
[8] P. Pacyna, Z. Papir, J. Gozdecki, M. Leszczuk, “Video Retrieval Application for Distributed Education - Requirements, Implementation and Results”, International Symposium on Intelligent Multimedia and Distance Education – Advances in Intelligent Computation and Multimedia Systems ISIMADE'99, pp. 31-37, 2-7 August 1999, Baden-Baden, Germany
[9] Reddy, Saveen, An Introduction to C++, ACM, 2000, http://www.acm.org/crossroads/xrds1-1/ovp.html
[10] 13 Computer Dictation Systems comparison, Voice Recognition, Speech Recognition, http://www.dyslexic.com/dictcomp.htm
[11] M. Leszczuk, P. Pacyna, Z. Papir, “Video Content Streaming Service Using IP/RSVP Protocol Stack”, IEEE Workshop on Internet Applications WIAPP’99, pp. 89-93, San Jose (CA), USA, 26-27 July 1999
[12] P. Pacyna, Z. Papir, J. Gozdecki, M. Koszałka, J. Maliszewski, J. Kozicki, M. Leszczuk, H. Kordas, “Design and implementation of the software/hardware MPEG encoding station”, ACTS AC362 Broadband Trial Integration Milestone 3.2.0, 14 pages, December 1998, Krakow, Poland
[13] Niels Engell Andersen, “Applying QoS Control through Integration of IP and ATM”, IEEE Communications Magazine, June 2000
[14] J. Gozdecki, P. Pacyna, Z. Papir, R. Stankiewicz, A. Szymański, “Network-based digital video library system”, Packet Video 2000, 1-2 May 2000, Sardinia, Italy
[15] Chung-Sheng Li, “Digital Library Using Next Generation Internet”, IEEE Communications Magazine, January 2000
[16] Fred Mintzer, “Developing Digital Libraries of Cultural Content”, IEEE Communications Magazine, January 2000
[17] John R. Smith, “Digital Video Libraries and the Internet”, IEEE Communications Magazine, January 2000
[18] “MediaSearcher”, http://www.mediasearcher.com/