Visual data mining over a video wall

Laura P. R. Rivera, Sergio V. Chapa Vergara, Amilcar Meneses Viveros
[email protected] [email protected] [email protected]
CINVESTAV-IPN, Computer Science Department

Abstract

This paper explains how a visual data mining method was implemented to maximize the advantages provided by the use of distributed objects over a video wall, in order to facilitate the assessment of large amounts of data. It seeks an alternative to the growing problem of having information and wanting to learn something from it, while minimizing the costs in time of transport, processing and visualization. For this reason we selected a fundamental technique of the mining process, the Pearson correlation coefficient, and for the management of the distributed objects we opted to take advantage of the Mac platform. The main objective of this project is to create a system that allows large amounts of data to be displayed in a friendly way.

1. Introduction

Exploring and analyzing this volume of information is becoming increasingly difficult. Visualization and visual data mining can help treat such a flow of information. The advantage of visual data exploration is that the user is involved in the data mining process [9]. For data mining to be effective it is important to include humans in the data exploration process; it is generally desirable to combine human knowledge with the enormous storage capacity and processing power of machines. Visual data exploration aims at integrating humans into the exploration process, combining their abilities with those of computer systems. The goal is to build on the human capacities of deduction and intuition, allowing the mining to obtain more complete results by taking into account the capabilities inherent in our species. To support this type of exploration, the use of a video wall is proposed, which allows a wide view of the data while the mining process is carried out. That is why the ultimate goal is to create a system to visualize scientific databases and display graphs obtained from their attributes on a video wall. It is worth noting that the video wall is an effective technological medium that has not yet been fully developed and standardized, so research involving its management requires computing knowledge that is still being built up; so far we have only found examples of use, together with ideas and suggestions for implementation. There are several research displays for databases, which vary in how they display or manipulate the data. Regarding the area in which we deploy the information, we know that the larger the area available to show results, the better the appreciation of the data, especially when the data have more than two dimensions. The purpose of this project is to show how the handling of a video wall can be exploited for the better functioning of any visual data mining technique.

2. Statement of the problem

For some time now, various areas of knowledge have been generating large amounts of information. Sometimes the information is so extensive that its management becomes difficult [5]. We face the problem of locating patterns that help us see how the data behave, and then of trying to understand the information they contain. There is also the problem of transporting it in a relatively short time. It should be mentioned that the processing time increases in proportion to the amount of data present. If we add up all these problems, we end up with a problem that basically consists of obtaining knowledge from large scientific databases while reducing, as far as possible, the computational cost this represents. As a solution to these problems we propose the following.


We can take advantage of the display to make the assessment of the data easier and better. By displaying the graphs in a larger area, such as a video wall, we can see more than one chart at a time, which allows a better assessment of the behavior of the data. If we let users manipulate the graphic objects throughout the deployment area, educated guesses can be made by visual comparison. We propose the Pearson correlation as the method that will show evidence of similar behavior in the data, and the management of distributed objects to avoid a centralized interpretation. We have found different works that use specialized technology, such as optical networks and large amounts of RAM, to obtain information from databases; the idea of this project is to build, as far as possible, on the distributed object management facilities of the Mac platform. The idea is to implement a system that consists of 4 main parts, which together solve the problem described above:

• Part 1. Obtaining information from the database. We must take into account the database manager, create a connection to the server that takes the shortest possible time, and verify how the data are handled.

• Part 2. The operations performed to obtain the Pearson correlation (see the sketch after this list). The data can be manipulated from a file once the query has finished. We need a safe way to perform the operations and prevent overflows, taking into account that we do not know in advance how big the database to be handled will be.

• Part 3. The part that manages the distributed objects. It has a particular characteristic, because the objects to be manipulated are graphs. It uses a client-server protocol, in which the servers make their objects available to the client, which handles them transparently once the connection is working.

• Part 4. OpenGL graph generation, the part that helps the user appreciate the information. It is a very important part of the solution.

If we join these parts we obtain a system that extracts information from large databases, generates their correlations and charts showing their behavior, and allows a better assessment. The result is a system for managing scientific databases that generates, for their attributes, the correlation of the data over 18 different variables. The results of the correlations are displayed in a color matrix, from which the graphs of the pairs of variables that the user wants are generated. The graphs are placed in OpenGL containers and can be moved through the area of the screens that make up the visualization cluster.
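As an illustration of Part 2, the following is a minimal sketch of how the Pearson coefficient can be computed in Objective-C, accumulating in double precision and centering the data so that the partial sums stay small. It is our own example, not the code of the system, and the function name pearson is ours.

// A minimal sketch of the Pearson correlation (Part 2); our own example.
#import <Foundation/Foundation.h>
#include <math.h>

// Returns a value in [-1, 1]: near 1 the sets are directly related,
// near -1 inversely related, near 0 unrelated.
static double pearson(const double *x, const double *y, NSUInteger n)
{
    if (n < 2) return 0.0;                    // not enough data to correlate
    double meanX = 0.0, meanY = 0.0;
    for (NSUInteger i = 0; i < n; i++) {      // first pass: means
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (NSUInteger i = 0; i < n; i++) {      // second pass: centered sums
        double dx = x[i] - meanX;
        double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0) return 0.0; // a constant column carries no correlation
    return sxy / sqrt(sxx * syy);
}

The result of this function is what determines the color of each button in the matrix described in Section 5.3.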


3. Historical context

We have found various works that use specialized technology, such as optical networks and large amounts of RAM, to obtain information from databases, which brings with it problems of cost and facilities that are not always feasible. The exploration of heterogeneous information spaces requires mining methods that are as efficient as their visual interfaces. Most existing systems have long focused either on mining algorithms or on visualization techniques [1][8][6]. That is why, in 2002, a working environment for visual data mining was described that combined analysis and visualization of the data to enable a better understanding of the information space. Another example is the OptIPuter, an architecture of parallel optical networking infrastructure for coupled data exploration, visualization and collaboration technologies designed for multi-gigabit IP speeds. This project highlights that research efforts are directed towards models and abstractions that simplify distributed applications: high-speed protocols and communication layers that assist the performance of the network, dynamic optical networks, configuration management, communication abstractions (parallel and multicast), APIs that allow scalable computing elements to establish wide-area connections, storage and file systems that support high-speed access to remote data, and security protocols and models for environments with high bandwidth, large latency and millions of resources.

4. Visualization and data mining

The exploration of heterogeneous information spaces requires mining methods that are as efficient as the visual interfaces. For some time most systems concentrated either on mining algorithms or on visualization techniques. The disadvantage of separating analysis from visualization is that, in general, one does not support the other; very different applications have to be bound together, in contrast to an application whose design involves the use of both from the start. That is why, in 2002, a working environment for visual data mining was described that combined data analysis and data visualization to enable a better understanding of the information space. Visualization techniques were developed for complex data, where one of the main features of the system was a new paradigm for visualizing information structures regardless of their frame of reference. Other visual interfaces have been developed to visualize and interact with similar structures, such as Disc Trees or Cone Trees. The aim of the pre-processing is to help determine the relevant information for display, with basically the following advantages: reducing the number of dimensions and filtering the data. This algorithm looks for similarities between information objects, using the Euclidean distance or the correlation coefficient.

5. Visual data mining

Visual data mining is a process that can extract knowledge without any prior history of the initial data. It is a time-consuming process, it requires complex processing technology, and a lot of noise can be found in the data. In the data mining process we must use an effective way to observe the distribution and structure of the data clearly, understanding the interrelationships and the behavior of the data. Visual data mining combines data visualization and data mining in order to provide an effective way to solve this problem. Seeing is knowing, but merely seeing is occasionally not enough; when you understand what you see, you start to believe it. For some time scientists have found that seeing and understanding at the same time allows people to learn more about a subject with deeper insight, especially when dealing with large amounts of data. This approach integrates the exploratory skills of the human mind with the enormous processing power of computers. Display technologies and analysis processes have been developed in various disciplines, including scientific visualization, data mining, statistics and machine learning, to handle very large data sets with many variables in multiple dimensions. The methodology is based on two features, characterizing the structures and displaying the data, building on the human capacity to perceive patterns and relationships.

5.1 System design

The system was designed around the advantages to be gained from the use of video walls for data mining. One of these advantages is the deployment of several graphs at the same time; in this first prototype the graphs have a fixed size. The second advantage is the parallel processing of the information. Other advantages were also tested, such as the scalability of the video wall: we started with a wall of 6 machines and ended with 12, without significant changes in the implementation. The data mining is carried out with the Pearson coefficient, since the initial idea is to take large amounts of information that have not been examined before and, through the various combinations of the variables, obtain three types of results with this coefficient:

• 1 if the data sets are directly related
• 0 if they are not related
• -1 if they are inversely related

This makes it possible to know whether two sets are related and to study them further with other, more specific data mining techniques. It should be noted that this project is only a first approximation of what can be done with this technology; unfortunately, in this case time did not allow testing more interesting databases, and we only tested with a database of students, so the results are not as rich as we would have wanted.

5.2 System architecture

The initial system was implemented on 3 machines and from there was scaled up to the 12 machines it has today. Each machine has a 23-inch display; the wall is complemented by a Mac video server and a G5 database server. The database manager employed is PostgreSQL. The application was developed in Objective-C with the libpq library (for the communication with PostgreSQL) and the OpenGL framework (for the generation of the graphics). The system consists of a main interface located on the video server, where the correlation of the variables is computed. Each Mac mini hosts a client process that allows the manipulation of the deployment within its screen area [11].
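To give an idea of how the video server could obtain a column of values through libpq, the following sketch opens a connection, runs a query and copies one attribute into a C array. It is a hedged example: the connection string, the query and the helper name fetchColumn are placeholders of ours, not the actual code of the system.

// Hypothetical libpq access used to feed the correlation stage.
#import <Foundation/Foundation.h>
#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>

// Returns a malloc'd array with one numeric column, or NULL on error.
static double *fetchColumn(const char *conninfo, const char *sql, int *outCount)
{
    PGconn *conn = PQconnectdb(conninfo);          // e.g. "host=dbserver dbname=sinac"
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return NULL;
    }
    PGresult *res = PQexec(conn, sql);             // e.g. "SELECT calificacion FROM inscripciones2"
    if (PQresultStatus(res) != PGRES_TUPLES_OK) {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return NULL;
    }
    int n = PQntuples(res);
    double *values = malloc(sizeof(double) * (size_t)n);
    for (int i = 0; i < n; i++)
        values[i] = atof(PQgetvalue(res, i, 0));   // column 0 of each row, returned as text
    *outCount = n;
    PQclear(res);
    PQfinish(conn);
    return values;
}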



Figure 1. This picture shows the system architecture

5.3 System flow

This system allows the handling of a variety of databases and works with at least 18 variables. The system flow can be described by 4 modules that are executed one after the other, with some dependencies; for example, the connect module must necessarily be executed before the correlation, and the place method always runs after the generate method has been used at least once. The connect method receives the name of the database and of the tables and attributes that are to be correlated. Within correlate, the 18 variables are taken, the operations are carried out and the buttons are colored depending on the outcome of the correlation: red when the correlation is 1, unpainted when it is 0, and black when it is -1. The generate method creates an array that will contain the visual objects; to generate these objects it needs their size, the arrays of values to chart and the position of the origin. The place method is executed to move the objects throughout the deployment, and requires as parameters the index of the visual object and the position to which it should move (Figure 2).
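The four modules can be pictured as the methods of a controller object called in the fixed order just described. The outline below is only a sketch: the method names connectToDatabase:tables:attributes:, correlate, generateGraphs and placeGraphAtIndex:toX:y: are ours, since the paper does not give the real signatures.

// Hypothetical outline of the flow (the names are ours, not the system's).
#import <Foundation/Foundation.h>

@interface MiningController : NSObject
// Module 1 (connect): open the connection, given database, tables and attributes.
- (BOOL)connectToDatabase:(NSString *)name
                   tables:(NSArray *)tables
               attributes:(NSArray *)attributes;
// Module 2 (correlate): compute the correlations of the 18 variables and
// color the button matrix (red = 1, unpainted = 0, black = -1).
- (void)correlate;
// Module 3 (generate): build the array of visual objects from their size,
// the arrays of values to chart and the position of the origin.
- (void)generateGraphs;
// Module 4 (place): move one visual object inside the deployment area.
- (void)placeGraphAtIndex:(NSUInteger)index toX:(double)x y:(double)y;
@end

// The dependencies in the text translate into a fixed calling order:
//   connect -> correlate -> generate -> place (any number of times).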

Figure 2. This picture shows the flow of the processes carried out in the system for its operation

Figure 3. This picture shows the system architecture

5.4 System handler

In this case we use an implementation in the Xcode environment provided by the Mac platform, defining the graphs as NSView objects inside a window object, so that we can take advantage of these facilities to manage the system. Everything starts after the two main configurations. The first is the configuration of the database, which allows us to obtain the data that we will deploy and to manage the set of attributes of a table. The second, no less important, is the configuration of the cluster network, to indicate which machines act as nodes of the system. Figure 3 shows the interaction between the objects displayed on the video server: the execution starts with the settings, then all the correlations are generated and the results are depicted in the matrix of buttons. When a button of the color matrix is pushed, a graph is generated that contains information about each of the sets (mean, variance and standard deviation). Once the graphs are displayed they can be moved to arrange the display at will.
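As a small illustration, the statistics shown with each graph could be computed as follows; this is a hypothetical helper of ours (the name computeStats and the use of the population variance are assumptions, since the paper does not specify them).

// Hypothetical helper for the figures attached to each graph.
#import <Foundation/Foundation.h>
#include <math.h>

typedef struct {
    double mean;
    double variance;   // population variance (an assumption of this sketch)
    double stddev;
} ColumnStats;

static ColumnStats computeStats(const double *values, NSUInteger n)
{
    ColumnStats s = {0.0, 0.0, 0.0};
    if (n == 0) return s;
    for (NSUInteger i = 0; i < n; i++) s.mean += values[i];
    s.mean /= n;
    for (NSUInteger i = 0; i < n; i++) {
        double d = values[i] - s.mean;
        s.variance += d * d;
    }
    s.variance /= n;
    s.stddev = sqrt(s.variance);
    return s;
}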

5.5 Tests

One of the first tests performed on the system was to allow the manipulation of the video wall; this was done by isolating only the part that enables the visual management of the distributed objects. To verify scalability, the implemented system was tested on 3, 6 and 12 machines, with minimal changes in the client program on each of them. The failure of one machine does not alter the information held by the others, although it is obviously noticed, since this is a visualization tool. The fact that a machine stops working does not mean that the image cannot be shown; we only have to move it to the part of the wall where the machines work properly. It only reduces the deployment area, but no information is lost.

A more interesting test developed in this project was the implementation of the distributed visual objects, which contain objects drawn with the 3D OpenGL library we created. The graphs showing the data correlated by the system are created as this type of object, and they will allow deployments that promise to be more complex in the future. When the initial network was installed with 3 and 6 machines over Ethernet, the handling delay was relatively short; the number of machines was then increased to 12, with the last 6 machines connected through the wireless network, and a decrease in speed was noticed in the manipulation of the video wall during initialization. Once the system is running the delay is not significant, and the cost is given rather by the number of messages that are sent. The installation of the first 6 machines allowed us to use a bus scheme for the management of the messages; when the other 6 were added we changed to a switch scheme, although in the strictest sense both are switched, since the server is always the one that sends the messages to the interpreters. This system has delays in the sending of the messages due to design features of its hardware, which is why a test was made in which most of the information is sent at initialization and, during handling, only the index that characterizes each object is sent; in this way, even if the transmission rate is low, the weight of the information for each interpreter is small.

When the solution was run without a global file system it was certainly necessary to send replicated information to each machine, but it turned out that the delay is not significant and that the information is stored and backed up on each of the machines, which seems the best option in case of a massive failure. During the execution of the system the delay of the messages and the deployment speed of each machine can be perceived, which prevents full transparency of the system, but as a first test it is considered a good approximation to this feature. The reliability of the system was tested by making some machines fail, which does not prevent the system from continuing to operate and causes no data loss, only a reduction of the deployment area. The union of the distributed objects with objects derived from the NSView class works properly, allowing a better management of the graphics on the wall. The test of creating a complex model was not carried out, since the data being entered are 2D, but it will be possible to display more complicated models later.

Figure 5. This picture shows how the system works

Figure 4. This picture shows how the system works

Figure 6. Graphs obtained from visual mining of the Sinac database

5.6 Sinac case

The tables that we use in this case to obtain results from the Sinac database are two: the first is called inscripcionesDet2 and the second inscripciones2. What is proposed is to take only the information that we can correlate on the display, i.e., the grades obtained per quarter in the previous tables. By taking one table per quarter, a database containing those tables is generated so that the viewer can obtain the data and use them. This mining shows that the grade values do not vary significantly across the different quarters when trying to decide whether one of the quarters is the most difficult for the master's students of CINVESTAV (Figure 6).


5.7 Management of distributed visual objects

A visualization cluster consists of a video server that is responsible for making the visual objects available, so that the client processes can access them, and of the nodes that interpret and deploy them [7][4][3]. In this system we defined distributed visual objects [10] as objects that can be handled within an extension of screens and that also participate in the interpretation of the graph. In addition, such an object must contain a graphic object that accepts mouse events, for ease of handling. The idea rests on the foundation that Cocoa provides to manage distributed objects, to which we add graphics that accept events.
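A hedged sketch of this idea using the Cocoa distributed objects classes: the video server vends a root object over the network, a node obtains a proxy for it, and the graph itself is an NSView subclass that accepts mouse events so it can be dragged around the deployment area. The protocol, class and method names (GraphService, DraggableGraphView, moveGraphAtIndex:toX:y:) are our own illustration, not those of the system, and memory management is omitted for brevity.

#import <Cocoa/Cocoa.h>

// Messages a node can send to the video server; only an index and small
// values travel over the wire, as in the tests described above.
@protocol GraphService
- (void)moveGraphAtIndex:(NSUInteger)index toX:(double)x y:(double)y;
@end

// The graphic part of a distributed visual object: a view that accepts
// mouse events so the user can drag it across the wall.
@interface DraggableGraphView : NSView {
    id remoteServer;        // proxy to the object vended by the video server
    NSUInteger graphIndex;
}
@end

@implementation DraggableGraphView
- (void)mouseDragged:(NSEvent *)event {
    NSPoint p = [self convertPoint:[event locationInWindow] fromView:nil];
    [remoteServer moveGraphAtIndex:graphIndex toX:p.x y:p.y];  // forwarded over DO
}
@end

// Server side (video server): vend the root object on the network.
static void vendServer(id<GraphService> server) {
    NSConnection *conn =
        [NSConnection connectionWithReceivePort:[[NSSocketPort alloc] init]
                                       sendPort:nil];
    [conn setRootObject:server];
    [conn registerName:@"GraphServer"
        withNameServer:[NSSocketPortNameServer sharedInstance]];
}

// Client side (node): obtain a proxy for the vended object.
static id<GraphService> connectToServer(NSString *host) {
    id proxy = [[NSConnection
        connectionWithRegisteredName:@"GraphServer"
                                host:host
                     usingNameServer:[NSSocketPortNameServer sharedInstance]]
        rootProxy];
    [proxy setProtocolForProxy:@protocol(GraphService)];
    return (id<GraphService>)proxy;
}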


6. Conclusions and future work

6.1 Data mining stage

Data mining is far from being a fully explored area, even though it already covers many very important aspects, such as the ways in which knowledge can be discovered within a dataset that has not been explored. The kinds of data that can be manipulated are so varied that it is necessary to find ways of managing each of them; for each medium we must find not only a way to identify it but also a way to process it and analyze it in a preliminary manner. We speak of memory, of processor time, and of techniques for distributing this burden over the network, a procedure that tends to become increasingly cumbersome with the size and type of the information. To illustrate these considerations it is enough to mention the manipulation of large images; if instead of images we have video, the processing and storage requirements grow even more. It should be noted that, within the management of various types of information, many problems have been solved through the use of metadata, which in turn raises its own questions about how it should be attached and used. The same applies when dealing with data that do not have a linear structure, avoiding wherever possible the ambiguities that such a structure can bring. Once the problem of the type of data being manipulated, and of the structure that contains it, is solved, there remains the problem of communication with the database application: how the queries should be made, which database manager can really benefit the community, and which protocol would be best for transferring the results. Another problem that arises is the search for the most appropriate model for the database; although most databases use the relational model, it is worth reviewing whether another model could be more beneficial in terms of what was previously mentioned.

6.2 Communication with the database stage

In this work we take advantage of the definition of distributed objects within the Objective-C language. For this reason, the communication between processes uses messages over the IP protocol, as provided by the language's distributed objects support. The way the processes communicate in this work does not cause problems, because the messages that are sent do not contain data bigger than an integer or a string, precisely because of the solution paradigm that was pursued. If we look at it from a broader perspective, however, in which the application actually sends bigger data such as images or video, the solution would have to change to a protocol that allows the transmission of larger data packets.


6.3 Distributed object management stage

We could improve this kind of messaging if we knew explicitly how it is managed; in another context we might try to implement this concept with another protocol, avoiding delays as much as possible, but this clearly requires extensive research. The characteristics that a distributed object must meet push to the limit what we can really offer: in the most general sense, a distributed object does not reside on a single machine but on several, and from this arise concurrency and the other problems inherent to the distributed concept. One idea that arises from the implementation carried out here suggests a way in which the object could really reside on several machines at once.

References

[1] C. Zhang, J. Leigh, and T. A. DeFanti. TeraScope: distributed visual data mining of terascale data sets over photonic networks. Future Generation Computer Systems (iGrid 2002 special issue), 19(6), August 2003.
[2] Apple Inc. Distributed Objects Programming Topics. Cocoa Interapplication Communication, Apple Inc., 2007.
[3] Apple Inc. NSOpenGLView Class Reference. Cocoa User Experience, Apple Inc., 2007.
[4] Apple Inc. NSView Class Reference. Cocoa Graphics and Imaging, Apple Inc., 2008.
[5] J. Hernández Orallo, M. J. Ramírez Quintana, and C. Ferri Ramírez. Introducción a la minería de datos. 2004.
[6] K. Tarassov and S. W. Michnick. iVici: interrelational visualization and correlation interface. Genome Biology, 6:R115, 2005. doi:10.1186/gb-2005-6-13-r115.
[7] P. M. Papadopoulos, C. A. Papadopoulos, M. J. Katz, W. J. Link, and G. Bruno. Configuring large high-performance clusters at lightspeed: a case study. IEEE Computer Graphics and Applications, 2005.
[8] R. Singh et al. SAGE: the Scalable Adaptive Graphics Environment. In WACE, 2004.
[9] P. C. Wong. Visual data mining. IEEE Computer Graphics and Applications, pages 2-3, 2002.
[10] L. P. Ramírez R. and S. Chapa V. DVO: model of distributed visual object. To appear in WCE 2011, London, U.K., 6-8 July 2011.
[11] A. Meneses Viveros and S. Chapa V. The CinvesWall. ISUM 2011.
