High Dimensional Data Visualization Using 3-D Icons

Dr. Ping Chen, CMS Dept., University of Houston-Downtown, One Main St., Houston, TX 77002

Dr. Chenyi Hu CS Dept. Central Arkansas University

Dr. Heloise Lynn, Yves Simon Lynn Incorporated 14732 F Perthshire Rd. Houston, TX 77079

Abstract

In this paper we present a method for visualizing large amounts of high dimensional data. High dimensional data visualization is very important to data analysts since it gives a direct and natural view of the data. In our method we use one icon to represent one group of dimensions, and then choose features of the icon to display the dimensions within each group. We have performed experiments on a real data set from the oil industry, and the results are encouraging, although some open questions remain.

1. Background

The rapid emergence of electronic data management methods has led some to call recent times the "Information Age". Powerful database systems for collecting and managing data are in use in virtually all large and midrange companies and public organizations, and there is hardly a transaction that does not generate a computer record somewhere. Each year more operations are computerized, and all of them accumulate data on operations, activities and performance. These data hold valuable information, e.g., trends and patterns, which could be used to improve business decisions and optimize success. However, today's databases contain so much data that it is almost impossible to analyze them manually for valuable information. In many cases, many independent attributes need to be considered simultaneously in order to model the behavior of a system accurately. Humans therefore need assistance in their analysis. This need for automated extraction of useful knowledge from huge amounts of data is now widely recognized, and has led to a rapidly developing market for automated analysis and discovery tools. Knowledge discovery and data mining are techniques for discovering strategic information hidden in very large databases. Automated discovery tools can analyze the raw data and present the extracted high-level information to the analyst or decision maker, rather than having the analyst find it for himself or herself.
Human insight is very important for extracting high-level information from a data set. Visualization plays an important role in making the discovered knowledge understandable and interpretable by human beings. Moreover, the human eye-brain system remains the best pattern recognition device known. Visualization techniques range from simple scatter plots and histograms through parallel coordinates to 3D movies.

This paper is organized as follows: Section 2 describes our visualization method, Section 3 shows our experimental results, and Section 4 gives future directions.

2. Data Visualization Using 3-D Icons

Visualization is the graphical presentation of a data set, with the goal of giving the viewer a qualitative understanding of its information content in a natural and direct way. Direct display of data with more than three dimensions is impossible, so such data has to be transformed in some way before it can be rendered. Users have to be aware of this transformation, and be able to reverse the transformed display to recover the original picture in their minds. Because a human being has to perform it, the reversing process can’t be too complex; otherwise the visualization is useless, since the user can’t understand what the display represents. Visualization methods may use different transformation techniques. There is no universal transformation technique for all fields, data sets or users.

After the transformation we need to map the transformed values onto properties of visual objects. The visual objects could be:
• point
• line
• polyline
• glyph
• 2-D or 3-D surface
• 3-D solid
• image
• text

For each object we may choose from the following features:
• color/intensity
• location
• style/texture/shade
• size
• angle
• relative position/motion

Our technique uses icons to represent the original data. For each icon we can choose some features to represent the dimensions. The position of an icon can represent 3 dimensions. Suppose the total number of icons is N, and the number of features used from each icon is M; then the number of dimensions we can display is:

3 + M × N

It seems that we can display a data set with as many dimensions as we like, as long as we increase M and/or N. But if we want the user to be able to associate the features of the icons with the original dimensions easily, M and N can’t be large. Our estimate is that M should be less than 10 and N less than 5, so our method can work with a data set of fewer than approximately 50 dimensions.

To choose appropriate icons we need to keep in mind that we will display the icons in three-dimensional space, and that we will zoom, move or rotate them. So we have to make sure that different icons look different from all angles. For performance reasons we prefer icons with planar faces rather than curved surfaces. We choose the color, size and angle of each icon to represent dimensions, although other choices are possible. Since we use the size of each icon to represent one dimension, we can’t use perspective projection for our display. Instead, we can only use orthographic projection, and the viewer has to be aware that icons with the same size represent the same amount no matter how far from or close to the viewer they are.

All visualization techniques have to transform or “distort” the original data in some way to display it in a two-dimensional or three-dimensional space. In our method both size and angle are numeric values, but color is categorical. So it is possible to assign color in a more specific way: we could show all the data ranges we are interested in with specific colors. For example, given a temperature data set, we could show all temperatures below a low threshold or above a high threshold in red, since both mean danger and need attention.

3. Experiment

We have performed one experiment using our visualization method. The experiment was run on a PC with a Pentium III 1 GHz CPU, 256 MB RAM, and a 16 MB video card. The data comes from 9 SGY files; each SGY file includes some headers and 6172871 one-dimensional records. These records are data samples from 111×111 locations within 2 seconds after exploration, with a sampling interval of 4 ms.
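As a rough illustration of the encoding described in Section 2, the capacity estimate 3 + M × N and the categorical color rule from the temperature example can be sketched as below. This is only a minimal sketch, not the authors' implementation: the function names, the threshold arguments and the "gray" color for the normal range are assumptions introduced here.

```python
def displayable_dimensions(features_per_icon: int, icon_count: int) -> int:
    """Return 3 + M * N: three position axes plus M features on each of N icons."""
    if features_per_icon < 0 or icon_count < 0:
        raise ValueError("M and N must be non-negative")
    return 3 + features_per_icon * icon_count


def alert_color(value: float, low: float, high: float) -> str:
    """Categorical color: values outside [low, high] on either side share one alert color."""
    if value < low or value > high:
        return "red"
    return "gray"  # hypothetical neutral color for the normal range


if __name__ == "__main__":
    # Three icons, each using size, angle and color (M = 3, N = 3).
    print(displayable_dimensions(3, 3))
    # Largest integer values allowed by "M less than 10, N less than 5".
    print(displayable_dimensions(9, 4))
    # A reading below the low threshold is flagged exactly like one above the high threshold.
    print(alert_color(-12.0, low=0.0, high=80.0))
```

With M = 9 and N = 4 the formula gives 39 dimensions, which is consistent with the paper's rough bound of fewer than 50.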
These 9 SGY files represent the following properties. Each group of three properties is shown by one icon, and the first two columns give the icon and the icon feature used for each property:

Interval Velocity
  parallelogram  size   Fast Interval Velocity
  parallelogram  angle  Azimuth of the fast interval velocity
  parallelogram  color  (Fast-Slow) Interval Velocity

Amplitude of the 5-45 degree angles of incidence
  box            size   Large Amplitude Variation with Offset (“AVO”) Gradient
  box            angle  Azimuth of the large AVO Gradient
  box            color  Azimuthal variation in the Gradient (Large minus small)

Amplitude of the 35-55 degree angles of incidence
  pyramid        size   Large Amplitude Variation with Offset (“AVO”) Gradient
  pyramid        angle  Azimuth of the large AVO Gradient
  pyramid        color  Azimuthal variation in the Gradient (Large minus small)

We don’t display any records with fast interval velocity equal to 0, so the number of records we need to show is 4262747. The loading time is 149 seconds, and view rendering (move, rotate, zoom) is done in real time. Here are two screenshots from the experiment:

Figure 1 Rendering the whole data set with sampling rate 1:20000

Figure 2 A part of the display after zooming out
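The record selection used in the experiment can be sketched as follows, under stated assumptions. The filter drops records whose fast interval velocity is 0, as described above. The "sampling rate 1:20000" in the caption of Figure 1 is read here as "keep every 20000th record", which is an assumption, as are the `Record` type and all function names. The `ICON_FEATURES` table restates the icon-to-property assignment of the experiment.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")

# Icon -> feature -> property, following the assignment used in the experiment.
ICON_FEATURES: Dict[str, Dict[str, str]] = {
    "parallelogram": {
        "size": "fast interval velocity",
        "angle": "azimuth of the fast interval velocity",
        "color": "(fast - slow) interval velocity",
    },
    "box": {
        "size": "large AVO gradient, 5-45 degrees",
        "angle": "azimuth of the large AVO gradient, 5-45 degrees",
        "color": "azimuthal variation in the gradient, 5-45 degrees",
    },
    "pyramid": {
        "size": "large AVO gradient, 35-55 degrees",
        "angle": "azimuth of the large AVO gradient, 35-55 degrees",
        "color": "azimuthal variation in the gradient, 35-55 degrees",
    },
}


@dataclass(frozen=True)
class Record:
    """Hypothetical record; only the field used by the filter is modeled."""
    fast_interval_velocity: float


def visible_records(records: Iterable[Record]) -> Iterator[Record]:
    """Skip records whose fast interval velocity equals 0."""
    return (r for r in records if r.fast_interval_velocity != 0)


def subsample(items: Iterable[T], step: int) -> Iterator[T]:
    """Keep every step-th item, starting with the first (assumed meaning of 1:step)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return islice(items, 0, None, step)


if __name__ == "__main__":
    demo = [Record(v) for v in (0.0, 1.5, 2.0, 0.0, 3.0)]
    shown = list(subsample(visible_records(demo), 2))
    print([r.fast_interval_velocity for r in shown])
```

Both helpers are lazy, so in this sketch the filter and the subsampling could be chained over a stream of records without holding the full data set in memory.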

Based on our experiment, in which 4262747 records were loaded in 149 seconds and then rendered in real time, we consider our method effective and efficient.

4. Future Work

Although data visualization is a very powerful technique for data mining, combining it with other data mining techniques may provide more information for data analysts. In our visualization method, we see a need to integrate other data analysis methods, especially in the following two respects:

1. Non-uniform data distribution: Within a data set, it is common for the data values to be clustered, so that the distribution is not uniform. A non-uniform distribution can hurt our visualization efforts, since small differences between data values become invisible. For example, in Figure 1 most icons are blue, which means that most data values fall into the range represented by blue. We cannot tell these values apart since all of them are drawn in blue, and the information in their small differences is lost after rendering. One option is to increase the number of colors, but that number cannot be very large because of the limits of the human eye. Another option is to run some statistical analysis first and assign the colors according to the actual data distribution. A non-uniform data distribution causes problems not only for categorical properties (such as color), but also for numerical properties (such as size and angle). If the data values are clustered and represented by numerical properties, human beings can’t see the small differences either. One possible solution is similar to the one for categorical properties: we first determine the actual data distribution, and then assign a bigger rendering range to the value ranges that contain many data values. With this assignment we deliberately distort the display: an icon that is twice as big as another no longer necessarily represents a value that is twice as big.

2. Non-uniform knowledge/information distribution: This is an even more important issue. In visualization we want to show differences between data values visually: the more different the data values are, the more different they should look. But we have to keep in mind that our goal in visualization is to help data analysts extract information and knowledge from a data set by transforming and displaying the original data. In some data sets or fields, a small difference within a specific range can be highly significant, which means that knowledge and information are not distributed uniformly within the data set either. A user would naturally like a visualization system to show these meaningful differences clearly. To put it another way, two differences of equal magnitude need not be rendered as the same difference on the screen. Instead, the difference that carries more information should be displayed more clearly, to get more attention from the user.
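One possible reading of "assign the colors according to actual data distribution" is equal-frequency (quantile) binning, and one possible reading of "assign a bigger rendering range to the actual range with lots of data values" is to scale numeric features by the empirical cumulative distribution. The sketch below illustrates both under those assumptions; the paper does not specify a particular statistical method, and all names here are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_edges(values: Sequence[float], bins: int) -> List[float]:
    """Return bins - 1 cut points so each color bin holds about the same number of values."""
    if bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    ordered = sorted(values)
    n = len(ordered)
    # Heavily tied data can yield repeated edges, which leaves some bins empty.
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def color_index(value: float, edges: Sequence[float]) -> int:
    """Index of the color bin for value, in the range 0 .. len(edges)."""
    return bisect_right(edges, value)


def equalized_fraction(value: float, ordered: Sequence[float]) -> float:
    """Fraction of the data at or below value; usable as a normalized size or angle."""
    return bisect_right(ordered, value) / len(ordered)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # clustered values plus one outlier
    edges = quantile_edges(data, 5)
    counts = [0] * 5
    for v in data:
        counts[color_index(v, edges)] += 1
    print(edges, counts)
    print(equalized_fraction(9, sorted(data)))
```

With five equal-width color ranges over 1 to 100, nine of these ten values would share one color. With quantile edges each color covers two values, at the cost that equal color steps no longer correspond to equal value steps, which is the distortion noted above.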

