International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

Haar Wavelet Transform based Text extraction from complex videos

Dr Sathish Kumar Selvaperumal, Senior Lecturer, School of Engineering, Kuala Lumpur, Malaysia, [email protected]

Mr Chandrasekaharan Nataraj, Lecturer, School of Engineering, Kuala Lumpur, Malaysia, [email protected]

Mr Arun Seeralan, Lecturer, School of Engineering, Kuala Lumpur, Malaysia, arun@apu.edu.my

Mr Ravi Lakshamanan, Senior Lecturer, School of Engineering, Kuala Lumpur, Malaysia, ravi@apu.edu.my

Abstract—In this paper, a text extraction algorithm has been developed and simulated to extract text from a complex video scene by using Matlab software. Morphological operations and the Haar wavelet transform have been used to design the filter that removes the non-text pixels. An experimental setup has been built, and the results from the different tests of the algorithm have been obtained and analysed. The algorithm has been found to perform better on video files with higher resolution and with text that has a noticeable colour difference from its background, although the better performance at higher resolution comes at the expense of computation time. The video at 1280x720 required 4 hours and 47 minutes, the video at 800x600 required 1 hour and 43 minutes, and the video at 640x480 required only 30 minutes and 28 seconds; all three videos have the same duration of 1 minute and 17 seconds. The recall rate and precision rate are found to be 0.908 and 0.890 respectively, which is better than the existing algorithms.

Keywords—wavelet transform; Haar; text extraction

I. INTRODUCTION

At the current level of advanced technological systems, companies have been filling the world with a vast variety of digital electronic devices. These devices are capable of handling as many tasks as the personal computers currently being sold in the market. Among the tasks preferred to be done on digital mobile devices such as smartphones is the playback, streaming or sharing of multimedia content such as audio and video files. These files contain a lot of information, some of which is neglected because it is not presented in ways that draw the viewer's attention to its presence. A video file may contain audio content together with running images presented as frames; these make up the audio and visual aspects of the video respectively. The images in the video may contain different features that carry extra information. Information such as text appearing within scenes provides extra information that could be useful to the viewer [6,8]. The media industry has long been inventing new ways of delivering media content from its newsrooms and studios to the people. From over-the-air transmission to

broadcasting over the internet, the media networks have been operating with all major means of communication and keep growing day by day due to the demand for the service. The growing demand for multimedia content is highly influenced by the ongoing growth in the number of manufactured digital multimedia devices. Devices such as most current smartphones running Google's Android operating system, Apple's iOS or Microsoft's Windows Phone play a major role in the growth of the service by enabling playback of multimedia files such as audio and video files with rich embedded information. Video files with text embedded within the scenes have not yet been used to interact with the user in a friendlier way, whereas, when properly used, this text can be presented to the user by other means such as text-to-speech conversion software. This would bring benefits to many business sectors, households and the media industry [7,9]. This paper is based on the development of an algorithm to extract text from a complex video scene using the Haar wavelet. This includes checking its consistency and its relevance based on the need for and applicability of the project for business and as a problem-solving technique.

II. LITERATURE REVIEW

A lot of work has been done to tackle the problem of text extraction from images and videos. Different researchers have applied different image processing methods, or a combination of more than one method, to achieve the required task of text extraction from video clips. Most current work struggles to perform text extraction from images with differing features such as the level of light intensity, the font type, font orientation and font size [2,5]. Research done by C. Liu, C. Wang and R. Dai [1] made use of filters to perform the text extraction process. The algorithm divides the whole process into four main parts. The first part of the algorithm performs the text detection; the next step


performs the text localization, the third step performs the text extraction, and the last step executes the text recognition. An edge detection method is used to detect the area containing the text elements of the frame; the Sobel detector is used to perform the edge detection with all four directional gradient masks. Local thresholding and hysteresis edge recovery have been implemented in the text localization step. The area surrounding the text that is connected to the background is removed while preserving the text pixels by inward filling. This procedure takes into consideration the resemblance of the text pixels to the background pixels, labelling the text pixels to avoid removing any of them. This method has been put to the test and found to process video frames at a speed of 0.25 frames per second with a detection accuracy of 90.8%. While this method appears to be reliable for text detection in videos, the algorithm lacks the ability to detect scrolling text in the frame being processed. Due to the limitation of detecting only horizontally oriented text, the method is also unable to detect slanted text present in the frame being processed.

Li-Jie Li, Jin-Li and Lei Wang [3] proposed a text extraction method for video frames based on multiple frame integration and a stroke filter. The background of the image is obtained by text block filtering and integration while clearing text that has high contrast. Edge detection is done by the use of stroke filters while enhancing the text pixels, and a local region growing method is used to re-detect missed text. This method has been found to have a Character Extraction Rate (CER) of 82.7% and a Character Recognition Rate (CRR) of 84.2% while operating at a speed of 276 characters per second.

Z. Li, G. Liu, X. Qian, D. Guo and H. Jiang [10] proposed an efficient text extraction method using key text points. The key text points are areas of the image that strongly exhibit the characteristics of text. Three high-frequency sub-bands obtained by wavelet transformation are used to find the key text points. The text localization stage is done by an anti-texture projection method, and a connected components method is used to remove the background pixels for text segmentation. This method has been tested and found to operate with an average recall of 83.65% and a text detection precision of 70.13%. It performs well in extracting text from complex backgrounds with low contrast but suffers from detecting moving text.

P. Shivakumara, R. P. Sreedhar, Trung Quy Phan, Shijian Lu and C. L. Tan [4] suggested a method of text extraction from multi-oriented video scenes by applying a Bayesian classification method. Laplacian and Sobel operations are used to enhance the text pixels by computing their product, which is then classified by the Bayesian classifier. The classification is done by considering three probability matrices. Text strings are built by a boundary growing method using the detected text candidates, comparing them to their nearest neighbours. False positives are eliminated by using geometrical properties of the text blocks formed by the region growing method. The tests that were performed have shown this method to have a recall of 0.87, a precision of 0.72 and

an F-measure of 0.78. This method has been found to be very effective but has a major drawback of failing to handle false positives, and handling of non-horizontal text lines is found to be a limitation of this method when detecting text in video frames.

III. PROPOSED METHOD

The design of the text extraction algorithm is divided into stages according to the processes being performed at each stage to obtain the desired output.

A. Text Detection

This stage roughly separates regions with text from regions without text in the frame of the video for which text is to be detected. In this project, text in the frames is treated as being of two types: text that appears as texture and text of a uniform colour. Different filters are applied at this stage and their results compared to select the most desirable output that takes the least time to perform the task during image processing. Neural network training is performed to obtain the text pixels and the non-text pixels after the filters are applied to the frame of the video. This provides more guarantee of the accuracy of the region containing the text.

B. Text Localization

This stage computes the exact boundaries of the strings of text in the frame of the video. This involves breaking the frame down into regions or building regions up into an entire string of text. A region growing technique and other non-region-based methods are tested to find the better performing method, which boosts the speed of the algorithm. Other methods, including manipulation of the texture, colour or edge distribution, are tested as well for comparison purposes.

C. Text Extraction

The proposed method makes use of filters to process the images, with the filter parameters set to suppress all unrequired information contained within the images. This suppresses the background part of the frame, which belongs to the non-text region. In this stage, colour-based and stroke-based methods are tested. With different types of video clips to be tested, these methods help to determine how to overcome some of the limitations for videos with low brightness. A thresholding technique is used to segment the frame, and in the stroke-based method asymmetric filters are used to boost the stripe shapes while diminishing the other shapes. A flow chart of the text extraction algorithm is presented in Fig. 1.
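To make the segmentation step described above concrete, the following Matlab sketch applies a global Otsu threshold to a single frame. It is a minimal illustration under assumed parameters rather than the exact filter configuration of the proposed algorithm; the frame file name and the minimum blob area are placeholders chosen for the example.

% Minimal sketch of the thresholding-based segmentation step (illustrative only).
frame = imread('frame_0001.png');      % placeholder: one frame exported from the video
gray  = rgb2gray(frame);               % work on the intensity channel

level = graythresh(gray);              % Otsu's global threshold in [0,1]
bw    = im2bw(gray, level);            % binary mask of candidate text pixels

bw = bwareaopen(bw, 20);               % drop blobs below 20 pixels (assumed noise limit)
imshow(bw);                            % inspect the segmented frame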


The algorithm makes use of the Haar wavelet transform to obtain the low-level features of an image where a text pixel is most likely to be located. The Haar wavelet transform uses the Haar function, which is an orthonormal, rectangular pair that varies in both scale and position.

Considering an integer k, which indexes the rows of the transform matrix, its relation is such that

    k = 2^p + q - 1                                                    (1)

where p is the power element that does the scaling; it is the largest integer that satisfies the condition

    2^p <= k                                                           (2)

while q defines the position of the non-zero elements in the row. The index k operates in the domain

    0 <= k <= N - 1                                                    (3)

Another index, i, which operates over the columns, is defined within the range

    0 <= i <= N - 1                                                    (4)

such that x can be defined as x = i/N. The vectors of the Haar transform are as shown below:

    h_0(x) = 1/sqrt(N),   0 <= x < 1                                   (5)

    h_k(x) = (1/sqrt(N)) * {  2^(p/2),   (q-1)/2^p   <= x < (q-0.5)/2^p
                             -2^(p/2),   (q-0.5)/2^p <= x < q/2^p
                              0,         otherwise }                   (6)

In digital image processing, such ideal (optimal) transforms are used because of their ability to concentrate the signal energy in the smallest number of transform coefficients: the energy of a digital image lies in the low frequencies, while the details lie in the high frequencies. The image being transformed by the wavelet is two-dimensional, so the wavelet transform is performed along the rows and then along the columns, and the diagonal details are picked up from the transformation.

The correlation coefficient of two images expresses the mutual relationship between the two images and indicates the degree of their resemblance. The correlation coefficient r is 1 when the two images are identical, 0 when they are uncorrelated and -1 when they are negatives of each other:

    r = sum_i (x_i - x_m)(y_i - y_m) / sqrt( sum_i (x_i - x_m)^2 * sum_i (y_i - y_m)^2 )      (7)

where x_i is the intensity of the ith pixel in image 1, y_i is the intensity of the ith pixel in image 2, x_m is the mean intensity of image 1 and y_m is the mean intensity of image 2. Correlation coefficients are used to find the resemblance between the frames being processed, so that duplicate frames are neither saved to storage nor processed for text extraction.

Fig. 1. Text Extraction algorithm flow chart
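As a reference for the two building blocks above, the sketch below performs a one-level 2-D Haar decomposition and computes the correlation coefficient of eq. (7) for two frames in Matlab. dwt2 requires the Wavelet Toolbox and corr2 the Image Processing Toolbox; the frame file names are placeholders, not names used by the actual program.

% One-level 2-D Haar decomposition of a grayscale frame.
f = double(rgb2gray(imread('frame_0001.png')));   % placeholder frame
[cA, cH, cV, cD] = dwt2(f, 'haar');               % approximation + horizontal/vertical/diagonal details
% cH, cV and cD hold the high-frequency detail where text edges tend to concentrate.

% Correlation coefficient of two frames of the same size, following eq. (7).
g  = double(rgb2gray(imread('frame_0002.png')));  % placeholder frame
xm = mean(f(:));  ym = mean(g(:));
r  = sum(sum((f - xm) .* (g - ym))) / ...
     sqrt(sum(sum((f - xm).^2)) * sum(sum((g - ym).^2)));
% The built-in corr2(f, g) gives the same value.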


The algorithm is programmed in Matlab using the object-oriented programming paradigm, where classes of objects such as the boundary points of text pixels are stored within their respective objects. A graphical user interface is presented to the user on program execution. The interface is designed with buttons mapped to the blocks of code required to perform the operation indicated by each button's label, and outputs are displayed to the user through label handles and through an axes window used as an on-screen image viewer. The algorithm starts by requesting a file input from the user; the file is selected with the file button on the user interface. Once the file is loaded into the program, an initial reading of the video information is done to obtain global variables such as the file path, the frame rate of the video, the video resolution and the number of frames contained in the video file. The initial values obtained are displayed to the user before the text extraction process begins. A flag (FileSwitch) is set to 1 when an appropriate file is loaded into the program, else the flag is set to 0; this condition prevents the algorithm from performing any further processing unless a video file is loaded. A second flag (UserData) is used to ensure the overall process executes only once, avoiding an infinite loop; this flag is set to 1 when the program initializes and is set to 0 when the extraction process completes or after the stop button is pressed by the user.

The file is split into separate frames, labelled according to their respective frame numbers in the video. These frames are saved into a newly created folder named frames, located in the same directory as the source video. Splitting of the video file is performed by writing each selected frame from the sequence of frames in the video to the computer's storage device. Selection of the frames from the sequence is based on the frame rate of the video, as explained in the previous sections. The selected frames undergo a resemblance check by calculating the correlation coefficient of two consecutive selected frames. Tests performed by Palm and DeVolpi (1995), together with results from the algorithm, suggest that the best range of r for which two images are highly or completely identical is from 0.95 to 1. Applying the same principle, the algorithm is set to save frames to the hard disk only if their correlation coefficient is less than 0.95.

The algorithm then iterates through the directory containing the frames to be processed, and the mathematical computations are performed on each frame that has been detected to contain text pixels. This is done by getting the file name of each file in the folder to be processed and processing each file in turn. During the extraction process, a check is performed to detect the presence of any text by low-level feature detection, using a morphological dilation and erosion (bottom-hat) operation on the frame being processed. A flag, EmptyText, is set to 0 when text is detected, else it is set to 1. The text extraction process only extracts text from frames with an EmptyText value of 0. Once a text frame is detected, the frame is passed to the next stage to locate the required text. Bounding boxes are used to identify all objects present in the binary image, and a region growing method is used to build sentences from the single text characters obtained from the bounding boxes. Logical image operations are used to separate the boxes containing real text characters from the boxes containing noise. The bounding boxes are selected according to their positions relative to each other, their relative areas and their relative aspect ratios. Sentences are built by finding the best arrangement of boxes through the rules of writing, whereby most written text is arranged in a specific orientation; the algorithm checks for boxes arranged horizontally in a specific pattern with similar properties, such as size and aspect ratio, and groups them as text characters. The algorithm iterates through one line of text at a time and combines all sentences once the last sentence is reached. Only the sentences are selected and saved, without the background pixels. Before the final results are saved permanently in the output directory, a last verification check is run to make sure blank images are discarded, reducing storage size and increasing the algorithm's performance. The output results are saved in a separate directory for easier analysis and better organization of the data obtained from the experiment. A timer is used to measure the duration of the extraction process, and the elapsed time is also presented to the user while the extraction process is being performed.
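A simplified Matlab sketch of the text-presence check and bounding-box filtering described above is given below. The bottom-hat structuring element, the area limit and the aspect-ratio limits are illustrative assumptions rather than the values used by the authors, and the frame path is a placeholder.

gray = rgb2gray(imread('frames/frame_0015.png'));    % placeholder frame from the frames folder

% Bottom-hat filtering emphasises small dark details such as text strokes.
se = strel('rectangle', [9 9]);                      % assumed structuring-element size
bh = imbothat(gray, se);
bw = im2bw(bh, graythresh(bh));                      % binarise the enhanced detail

EmptyText = ~any(bw(:));                             % 1 when no candidate text pixels are found

if ~EmptyText
    stats = regionprops(bwlabel(bw), 'BoundingBox', 'Area');
    keep  = false(numel(stats), 1);
    for k = 1:numel(stats)
        b  = stats(k).BoundingBox;                   % [x y width height]
        ar = b(3) / b(4);                            % aspect ratio of the box
        % keep boxes whose area and aspect ratio look character-like (assumed limits)
        keep(k) = stats(k).Area > 15 && ar > 0.1 && ar < 3;
    end
    boxes = stats(keep);                             % candidate character boxes
    % boxes lying on roughly the same horizontal line can then be grouped into sentences
end

For bright text on a dark background, the complementary top-hat operation (imtophat) would be used in place of imbothat.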

This information is used to test the performance and characteristics of the algorithm, as discussed in the next section.

IV. SIMULATION RESULTS

The algorithm is simulated in Matlab R2014b. The simulation is carried out in two phases. The first simulation phase is done during the design stage of the algorithm to determine textual characteristics and boundary points; this simulation is done on each sample image separately and is used as a means of error identification and troubleshooting. The second simulation has been performed on the entire length of the video to find the overall performance of the designed system. The algorithm is tested on videos with a duration of 1 minute and 16 seconds, recorded at a frame rate of 30 frames per second. The sample video contains 2297 frames at the different resolutions stated in the previous sections. The tested videos are encoded with texts having different properties, such as different text colours and different edge colours. The tests are also performed on videos with different resolutions of 1280x720, 800x600 and 640x480 that have a changing complex background. A resolution of 320x240 has also been tested, but the results have been omitted because the video does not contain enough detail for the algorithm to differentiate the text pixels from the background. Some samples of source and result frames are presented in the results section of this report to compare the amount of detail lost relative to a video at a resolution of 1280x720 pixels.

The sample test video at a resolution of 1280x720 produced 64 frames, of which 38 remained after the text extraction process. The sample test video at a resolution of 800x600 produced 55 frames, of which 55 remained after the text extraction process. The sample test video at a resolution of 640x480 produced 55 frames, of which 53 remained after the text extraction process. The number of all detected objects is recorded and organized according to what they are by manually inspecting the results obtained. The data collected from the algorithm is analysed manually by counting the output results from the processed images. The numbers of text characters extracted by the algorithm, whether false positives, true positives or the total available characters in the processed frames, are recorded in a table and analysed. The precision and recall rates are calculated separately for each frame, for each video at a specific resolution and for each type of text colour and edge. The results obtained from this analysis are shown in Table II. The data obtained has been found to vary according to the type of text and the amount of detail contained in a frame; the amount of detail in a frame varies with the resolution of the video, and the higher the resolution of a video the more detail it contains. Bar charts are created to show


the performance of the algorithm relative to the resolution of the video, as shown in Fig. 2 and Fig. 3.
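The paper does not state the precision and recall formulas explicitly; assuming the standard definitions over the manually counted character detections, they would be computed as in the short Matlab sketch below, where the counts are purely illustrative and not taken from the experiment.

TP   = 90;    % correctly extracted characters (true positives), illustrative value
FP   = 10;    % non-text objects reported as text (false positives), illustrative value
N_gt = 100;   % total characters actually present in the processed frames, illustrative value

precision = TP / (TP + FP);   % fraction of extracted objects that are genuine text
recall    = TP / N_gt;        % fraction of the available characters that were extracted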

Fig. 2. Recall and precision rate per colour

Fig. 3. Recall and precision rate per total text extracted

The analysed data indicates that the values of recall and precision decrease when the resolution of the video is lowered. This has been observed to be caused by the loss of information from the missing pixels in the lower resolution videos. Very low recall and precision rates have been obtained for text surrounded by white edges, whereby black text produced the weakest results, followed by blue and red. This shows that the algorithm suffers more on bright backgrounds than on dark backgrounds. A final test has been conducted on a video with constant white text and a changing background. The results obtained support the findings, showing a recall rate of 0.908 and a precision rate of 0.890. A comparison of the recall and precision rates with previously designed algorithms is shown in Table I. The proposed algorithm is found to perform slightly better, with better recall and precision rates, although the slight difference may be caused by the difference in the source video files from which the text is being extracted.

TABLE I. COMPARISON OF THE PROPOSED ALGORITHM WITH THE EXISTING ALGORITHMS

Algorithm                              Recall    Precision
Proposed algorithm                     0.908     0.890
Palaiahnakote Shivakumara (2012)       0.870     0.720
Z. Li et al. (2011)                    0.836     0.701

V. CONCLUSION

Thus, Haar wavelet transform based text extraction from complex videos has been proposed using mathematical morphology. The results obtained form a pattern that shows a decrease in accuracy as the resolution of the video is reduced. The algorithm developed has been found to have a recall rate of 0.9078 and a precision of 0.8896 on videos with a resolution of 1280x720, a recall rate of 0.9050 and a precision of 0.5774 on videos with a resolution of 800x600, and a recall rate of 0.6325 and a precision of 0.4923 on videos with a resolution of 640x480. The change in video resolution has been found to affect the precision rate of the algorithm much more than it affects the recall rate, as seen from the large change in the precision values for the same change of resolution. Also, the proposed algorithm has been found to perform better on text with a high colour difference from its background, while performing poorly when the background and text colours are similar. The proposed algorithm suffers considerably from text with smooth edges that closely resembles its background; this resemblance may be difficult to perceive even for the human eye when the colour difference between two objects is very small. This limitation has been detected when dark text is overlaid on a dark background or white text is overlaid on a bright background. The similarity leads to the text being extracted with some noise, to only the noise being extracted, or to no object being detected at all.

REFERENCES

[1] C. Liu, C. Wang, and R. Dai, "Text detection in images based on unsupervised classification of edge-based features", Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition, pp. 610-614, 2005.
[2] H. Li and D. Doermann, "Automatic Identification of Text in Digital Video Key Frames", Language and Media Processing Laboratory, Proceedings of the Fourteenth International Conference on Pattern Recognition, Vol. 1, pp. 129-132, 1998.
[3] Li-Jie Li, Jin-Li and Lei Wang, "An Integration Text Extraction Approach in Video Frame", International Conference on Machine Learning and Cybernetics, IEEE, pp. 2115-2120, 2010.
[4] O. Marques, "Practical Image and Video Processing Using MATLAB", Hoboken, N.J.: Wiley / IEEE Press, 2011.
[5] P. Shivakumara, R. P. Sreedhar, Trung Quy Phan, Shijian Lu and C. L. Tan, "Multioriented Video Scene Text Detection Through Bayesian Classification and Boundary Growing", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, No. 8, pp. 1227-1235, 2012.
[6] R. J. Mooney and U. Y. Nahm, "Text Mining with Information Extraction", Multilingualism and Electronic Language Management: Proceedings of the 4th International MIDP Colloquium, Van Schaik Pub., pp. 141-160, 2003.
[7] T. Pratheeba, V. Kavitha and S. R. Rajeswari, "Morphology based text detection and extraction from complex video scene", International Journal of Engineering and Technology, Vol. 2(3), pp. 200-2006, 2010.
[8] W. Kim and C. Kim, "A New Approach for Overlay Text Detection and Extraction from Complex Video Scene", IEEE Transactions on Image Processing, Vol. 18, No. 2, pp. 1-8, 2009.
[9] X. Zhang, F. Sun and L. Gu, "A Combined Algorithm for Video Text Extraction", Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 5, pp. 2294-2298, 2010.
[10] Z. Li, G. Liu, X. Qian, D. Guo, and H. Jiang, "Effective and efficient video text extraction using key text points", IET Image Processing, Vol. 5, No. 8, pp. 671-683, 2011.

TABLE II. ANALYSED DATA FOR PRECISION AND RECALL PER COLOR AND PER RESOLUTION

DETECTION                 White       Black       Green       Blue        Red         Green       Blue        Red         Average Per
Video Resolution          with black  with white  with black  with black  with black  with white  with white  with white  Resolution

1280 x 720   Precision    0.965       0.500       0.815       0.938       0.921       0.904       0.848       0.958       0.856
             Recall       1.000       0.500       0.970       1.000       0.989       0.986       0.850       0.958       0.906

800 x 600    Precision    0.811       0.704       0.723       0.768       0.854       0.768       0.502       0.720       0.731
             Recall       1.000       0.780       0.890       0.979       0.968       0.970       0.561       1.000       0.893

640 x 480    Precision    0.829       0.791       0.706       0.525       0.406       0.614       0.528       0.413       0.601
             Recall       1.000       0.567       0.730       0.438       1.000       0.586       0.795       0.500       0.702

Average      Precision    0.868       0.665       0.748       0.744       0.727       0.762       0.626       0.697       0.730
Per Color    Recall       1.000       0.616       0.863       0.806       0.986       0.848       0.735       0.819       0.834
