Image Processing Based Extraction of Data From Graphical Representations

Viswanath K. Reddy, Kaushik C.M.
Dept. of ECE, Faculty of Engineering and Technology
M.S. Ramaiah University of Applied Sciences
Bangalore, India

Abstract: Raw data is commonly presented in graphical form in published literature, and authors very often need to take data from such publications. The raw research data is normally not available to fellow researchers, either because the data itself is unavailable or because the author has no interest in sharing it. Many software tools for extracting data from graphs are available, but the data points have to be selected manually by the user. The few automation features these tools provide are time consuming and need post-processing to eliminate extraction errors. An algorithm based on image processing is developed to automatically extract the data from graphs. With the developed algorithm, the data points extracted from the graph are close to the raw data, with an error of less than 1%. The results are encouraging enough to justify extending the algorithm with more features.

Keywords: automatic data extraction; graph digitization; image processing; algorithm

I. Introduction

Graphical representations such as plots, graphs and charts are used in publications to depict raw data generated by scientific experiments and observations. Publications are either readily available in digital form or can be made available in digital form through scanning. Only the trends of the raw data are revealed in the graphs and plots. Many authors would like to use such published results for benchmarking or validation purposes, but most of the time they are unable to source the raw data for various reasons. In such situations they are forced to use the available graphs, so the values corresponding to the raw data have to be derived from the graphs. Sometimes the authors take a print of the graph as large as possible, use a ruler to measure the individual points, and obtain the values with some calculations. This is a tedious and error-prone process which does not add any value to the researchers' work. Freely available software tools are also commonly used by authors to extract the data points [1]. If the number of points to be collected is large, collecting them is time consuming and the chances of introducing errors due to human fatigue are very high. The software tools also provide an automatic data extraction option. The raw data points extracted in automatic mode using the online tool WebPlotDigitizer are shown in Fig. 2. This tool also provides a good option to extract the raw data in a region selected by the user. If a region excluding the boundary is selected, the highlighted points would still be extracted. In this work, an image processing based approach is proposed to automatically extract the raw data points from the graph.

Fig. 1. Typical scanned image

Fig. 2. False detection

II. Literature Review

Nowadays many commercial and free software tools are available for extracting data from graphs. A very good comparison of a number of such online software tools is provided in [2]. The graph is displayed on the screen in digital form and the mouse is used to read individual data points. After the axes have been calibrated once, the mouse coordinates are translated automatically into plot coordinates by a short program [3]. Extracting data from digital documents was not reported in the literature until recently [4]. Data extraction from graphs in web documents has been tried [4]. As part of that work, the labels and numerical units of the x and y axes, the legend information and the data points on the curve are extracted. Data points are extracted for scatter plots and curve-fitted plots which contain certain geometrical shapes that act as data points in 2D plots. These data points are separated from the curves by assuming that the curve thickness is uniform and unique compared to the axes thickness. But many scanned and poor-quality images may not conform to this assumption, which would make the data extraction incorrect. Aaron Baucom [5] has extracted the data points from a scatter plot to represent the data in a different form. Data from scatter plots is digitized in [6]. 2D line plots may contain, in addition to the data points, data such as labels on the x and y axes and legend information in the form of text. Automatic extraction of this data has been attempted [7] so that search engines can efficiently dig out the relevant data.

III. Proposed Solution

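A typical scanned image is shown in Fig. 1. The image has axes $(X_I, Y_I)$, while the graph has axes $(X_G, Y_G)$, and the graph axes may be rotated by an angle $\theta$. Rotation may be introduced while scanning a page with the graph. The image itself might be noisy in the hard copy of the paper, or noise might be added to the graph while scanning.

Let the center of the Graph axis be $(x_0, y_0)$. Any point $(x_I, y_I)$ from the Image axis can be mapped on to $(x_G, y_G)$ in the Graph axis as per (1) and (2) after considering translation and rotation of axis.

$$x_G = (x_I - x_0)\cos\theta + (y_I - y_0)\sin\theta \qquad (1)$$

$$y_G = -(x_I - x_0)\sin\theta + (y_I - y_0)\cos\theta \qquad (2)$$

These two equations can be written in matrix form as in (3).

$$\begin{bmatrix} x_G \\ y_G \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_I - x_0 \\ y_I - y_0 \end{bmatrix} \qquad (3)$$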
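The translation and rotation mapping can be sketched in a few lines of Python. This is an illustration only, not the authors' MATLAB code: the names `x0`, `y0` (graph origin expressed in image coordinates) and `theta` (rotation angle in radians) are assumptions, as is the sign convention chosen for the rotation.

```python
import math

def image_to_graph(x_img, y_img, x0, y0, theta):
    """Map an image-axis point to graph-axis coordinates.

    Assumes the graph origin sits at (x0, y0) in image coordinates and the
    graph axes are rotated by theta radians relative to the image axes.
    """
    dx = x_img - x0  # translate so the graph origin becomes (0, 0)
    dy = y_img - y0
    x_g = dx * math.cos(theta) + dy * math.sin(theta)
    y_g = -dx * math.sin(theta) + dy * math.cos(theta)
    return x_g, y_g
```

With `theta = 0` the mapping reduces to a pure translation, which is the case the rest of the paper relies on, since rotated images are left for future work.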

Once the pixel coordinates of the data points are mapped to the Graph coordinates, the value corresponding to each pixel coordinate of the data has to be extracted. To achieve this, additional input is required from the user. Feeding this input is the calibration step. The required inputs are as follows and are highlighted in Fig. 3:

• $c_1$, $c_2$ - Columns of two points on the X-axis of the graph
• $x_{max}$, $x_{min}$ - Maximum and minimum values of the X-axis
• $r_1$, $r_2$ - Rows of two points on the Y-axis of the graph
• $y_{max}$, $y_{min}$ - Maximum and minimum values of the Y-axis

Here $c_1$ and $c_2$ are taken as the columns at which the X-axis has the values $x_{min}$ and $x_{max}$, and $r_1$ and $r_2$ as the rows at which the Y-axis has the values $y_{min}$ and $y_{max}$. Knowing these values, the value $(x, y)$ of any point at row $r$ and column $c$ in the image and corresponding to the Graph can be computed as shown in (4) and (5).

$$x = x_{min} + \frac{c - c_1}{c_2 - c_1}\,(x_{max} - x_{min}) \qquad (4)$$

$$y = y_{min} + \frac{r - r_1}{r_2 - r_1}\,(y_{max} - y_{min}) \qquad (5)$$

The algorithm has been implemented using MATLAB as explained in the following sections, with the following assumptions.

A. List of Assumptions

• The data plotted is in the first quadrant of the Cartesian plane.
• A single data set is encoded in the points.
• The graph should have a linear scale.
• Only one graph/plot should be present.
• Charts do not contain heavy gridlines or other plot types (line, bar, etc.).
• Text or any unusual shapes do not appear in the plotting region of the chart.
• Graph line color intensities should be different from the background color intensities.

IV. Software Implementation

MATLAB is used for implementing the algorithm. The major steps involved in the implementation are:

• Image dataset generation
• User input for calibration
• Data extraction through image processing
• Testing and validation

The software implementation flowchart is shown in Fig. 4.

Fig. 3. User inputs for calibration

Fig. 4. Software implementation flowchart (Start → Generate graph → Save generated graph as image → Read the saved image → User input → Image processing algorithm for extracting data → Validation → Stop)

Fig. 8. Synthesized real image

Fig. 9. Real image generation (Scan the printed document → Crop the graph and save as image)

A. Image dataset generation

Graphs taken from published literature are difficult to use for testing the developed software and validating the results, because the raw data is not available. Hence data is generated in-house to mimic real raw data for testing in the following ways.

1) Synthetic image: A known mathematical model is used to generate and display a graph using MATLAB, which is then saved as an image. Fig. 5 shows the procedure followed to generate a synthetic image, with an example synthetic image in Fig. 6. Such an image is not distorted by noise or axis rotation.

Fig. 5. Synthetic image generation (X, Y values → Generate figure using MATLAB → Save figure as image file)
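A minimal sketch of the synthetic-image idea in Python, assuming a binary raster stored as a list of rows rather than a MATLAB figure saved to a file. The function name `synthesize_image`, the image size and the quadratic model are hypothetical choices for illustration, and no axes or tick labels are drawn.

```python
def synthesize_image(xs, ys, width, height, x_range, y_range):
    """Rasterize (x, y) samples into a binary image (1 = curve, 0 = background).

    Row 0 is the top of the image, so larger y values map to smaller rows.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    image = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        if 0 <= row < height and 0 <= col < width:
            image[row][col] = 1
    return image

# Hypothetical model: y = x^2 sampled on [0, 1].
xs = [i / 200 for i in range(201)]
ys = [x * x for x in xs]
img = synthesize_image(xs, ys, width=101, height=101,
                       x_range=(0, 1), y_range=(0, 1))
```

Because the model behind the image is known exactly, any point later extracted from `img` can be checked against `y = x^2`, which is the property the synthetic dataset is meant to provide.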

2) Synthesized Real Image: The synthetic image file is printed on a white sheet and the print is scanned. The graph region is cropped and saved as a digital image. This image is closer to a real image and mimics the image obtained from scanned papers, but the noise characteristics would be known to the user. The process of generating a synthesized real image is shown in Fig. 7. An example synthesized real image is shown in Fig. 8.

3) Real Image: Fig. 1 shows an example real image taken from a scanned document. The graph portion of the document is cropped and saved as the real image. This process of generating a real image is shown in Fig. 9.

B. User input for calibration

The steps involved in calibration are as follows:

• Load the image into MATLAB and display it
• Select two points each on the X-axis and Y-axis
• Get the row and column of the selected points
• Specify the values at the selected points
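The calibration steps above amount to a linear map from pixel index to axis value, given two reference points per axis. The Python sketch below shows one way to express this under the paper's linear-scale assumption. The helper `make_axis_map` and the calibration numbers are hypothetical and are not taken from the authors' MATLAB tool.

```python
def make_axis_map(p1, p2, v1, v2):
    """Return a function mapping a pixel index to an axis value.

    p1, p2: pixel indices (columns for X, rows for Y) of two calibration points.
    v1, v2: axis values the user specified at those two points.
    Assumes a linear axis scale.
    """
    if p1 == p2:
        raise ValueError("calibration points must differ")
    scale = (v2 - v1) / (p2 - p1)  # axis units per pixel
    return lambda p: v1 + (p - p1) * scale

# Hypothetical calibration: columns 50 and 450 correspond to x = 0 and x = 10,
# rows 400 and 40 correspond to y = 0 and y = 1 (rows grow downwards).
col_to_x = make_axis_map(50, 450, 0.0, 10.0)
row_to_y = make_axis_map(400, 40, 0.0, 1.0)
x, y = col_to_x(250), row_to_y(220)
```

Handling each axis with its own map keeps the row direction explicit: image rows increase downwards, so the Y-axis map simply ends up with a negative scale.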

C. Data extraction through image processing

The graph image is fed to the image processing algorithm, in which the image is preprocessed to smooth out any noise present. The smoothed image is binarized and the curve is converted to single-pixel width through a thinning operation. The pseudocode of the implementation is as follows:

1. Apply a smoothing filter to reduce noise
2. Binarize the smoothed image using thresholding
3. Perform an image thinning operation to reduce the curve width to a single pixel
4. Extract the raw data corresponding to the pixels in the thinned image
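The four steps can be sketched in plain Python as follows. This is an illustrative stand-in, not the authors' MATLAB implementation: the 3x3 mean filter, the fixed threshold of 128 and all function names are assumptions. In particular, morphological thinning is replaced here by a simpler column-wise reduction that keeps the middle curve pixel in each column, which is only valid for a single-valued curve on an image with axes and frame already removed.

```python
def smooth(img):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, h))
                      for cc in range(max(c - 1, 0), min(c + 2, w))]
            out[r][c] = sum(window) / len(window)
    return out

def binarize(img, threshold):
    """Mark pixels darker than the threshold as curve (1), the rest as 0."""
    return [[1 if v < threshold else 0 for v in row] for row in img]

def thin_columns(binary):
    """Keep one pixel per column: the middle row of the curve pixels there."""
    points = []
    for c in range(len(binary[0])):
        rows = [r for r in range(len(binary)) if binary[r][c]]
        if rows:
            points.append((rows[len(rows) // 2], c))
    return points

def extract_pixels(gray, threshold=128):
    """Run smoothing, binarization and thinning; return (row, col) pairs."""
    return thin_columns(binarize(smooth(gray), threshold))
```

The returned `(row, col)` pairs are pixel coordinates only. Converting them to data values is the job of the calibration step.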

Fig. 6. Synthetic image

Fig. 7. Synthesized real image generation (Print the synthetic image file → Scan the printed document → Crop the graph and save as image)

Fig. 10. Binarized and thinned image

The binarized image is thinned to reduce the graph curve to single-pixel width, as shown in Fig. 10. The coordinates of the pixels in the thinned image are mapped to the raw data points.

D. Testing and validation

The data points extracted by the image processing algorithm are validated in the following ways. The synthetic image and the synthesized real image are generated from a mathematical model, so the extracted data points are validated against that model. For the real image without rotation, validation is performed in the same manner.

Fig. 11. Validation process (blocks: Mathematical model for graph generation; Save the graph as image; Read the saved image; Image processing algorithm for data extraction; Subjective comparison)

V. Results and Discussion

Fig. 12 shows the data points extracted from the image plotted over the reference data points obtained from the mathematical model. A zoomed version is shown in Fig. 13. The extracted data points are very close to the raw data points: they deviate from the reference data by a maximum of 0.008 over all the data points.

Fig. 12. Raw data extracted from a synthetic image

Fig. 13. Accuracy of the extracted data points

The developed algorithm is tested using the synthesized real image shown in Fig. 8. The extracted data points are plotted against the reference data points as shown in Fig. 14. The results are found to be very comparable to the real image.

The real image in Fig. 1, when subjected to the developed algorithm, yields the extracted data points plotted in Fig. 14. Reference data points are not available for this graph image. When this plot is compared visually with the original graph image, the extracted values seem to be quite accurate.

Fig. 14. Extracted data for real image

VI. Conclusions

A MATLAB-based tool is developed for automatically extracting data from graphs in digital form. The user has to feed in the data corresponding to the plots for initial calibration. Once calibrated, the data points are automatically extracted and stored in a file for later use. The tool is currently capable of handling images that satisfy a few assumptions. The results are better than those of the software tools available for the purpose. The accuracy of data extraction depends on the scale of the axes: for the same pixel resolution, a smaller axis range gives a more accurate extraction. For real images, the algorithm has to be modified to handle rotated images. The scope of the algorithm can be extended to handle more complex images in terms of noise and graph features such as legends, multiple plots and line styles.
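The link between axis range and extraction accuracy can be illustrated with a small calculation. Assuming a linearly scaled axis, each pixel step spans a fixed number of axis units, and a value read from the nearest pixel can be off by up to half of that step. The function name and the pixel counts below are hypothetical and serve only as a worked example.

```python
def axis_resolution(v_min, v_max, n_pixels):
    """Axis units spanned by one pixel step for a linearly scaled axis."""
    return (v_max - v_min) / (n_pixels - 1)

# Hypothetical axis drawn across 401 pixel columns.
wide = axis_resolution(0.0, 100.0, 401)   # range 0..100 over the same pixels
narrow = axis_resolution(0.0, 10.0, 401)  # range 0..10 over the same pixels

# Worst-case rounding error when a value is snapped to the nearest pixel.
wide_error = wide / 2
narrow_error = narrow / 2
```

Under these assumptions the axis with the smaller range has a tenfold smaller step per pixel, so the same image size yields a proportionally smaller quantization error.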

References

[1] B. Söderström, K. Hedlund, L. E. Jackson, T. Kätterer, E. Lugato, I. K. Thomsen, and H. B. Jørgensen, "What are the effects of agricultural management on soil organic carbon (SOC) stocks?," Environmental Evidence, vol. 3(2), 2014.

[2] A. Gross, S. Schirm, and M. Scholz, "Ycasd - a tool for capturing and scaling data from graphical representations," BMC Bioinformatics, vol. 15(1), p. 219, 2014.

[3] P. Uwer, "EasyNData: A simple tool to extract numerical values from published plots," arXiv preprint arXiv:0710.2896, 2007.

[4] S. Kataria, W. Brouwer, P. Mitra, and C. Giles, "Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents," AAAI, vol. 8, 2008.

[5] A. Baucom and C. Echanique, "ScatterScanner: Data extraction and chart restyling of scatterplots," 2013.

[6] T. Poisot, "The digitize package: extracting numerical data from scatterplots," The R Journal, vol. 3.1, pp. 25-26, 2011.

[7] X. Lu, J. Z. Wang, P. Mitra, and C. L. Giles, "Automatic extraction of data from 2-d plots in documents," Ninth International Conference on Document Analysis and Recognition, vol. 1, 2007.
