An Efficient Data Mining Dataset Preparation using Aggregation in ...

S. Brintha Rajakumari* and C. Nalini. Department of CSE, B. I. S. T. (Bharath University), Chennai, India.
Indian Journal of Science and Technology, Vol 7(S5), 44–46, June 2014

ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645

An Efficient Data Mining Dataset Preparation using Aggregation in Relational Database

Abstract

Preparing a data set from a relational database management system for data mining is a difficult and time-consuming task. The prepared data can then be used as input for data mining analysis. However, traditional Structured Query Language (SQL) aggregate functions return their results in one column per aggregated group. This paper presents a horizontal representation of data for dataset preparation in data mining analysis, which reduces memory space when evaluated with the cancer dataset.

Keywords: Aggregation, Data Mining

1. Introduction

Data mining is the discovery of models for data. A model can be one of several things: modelling can mean summarizing the data succinctly and approximately, or extracting the most prominent features of the data and ignoring the rest. Building a proper dataset for data mining is a time-consuming task, and each research discipline uses different methods to prepare a data set for analysis. This paper presents a horizontal representation of data for dataset preparation in data mining analysis and evaluates it with the cancer dataset.

2. Related Works

Aggregation is an important concept in database design, where composite objects can be modelled during the design of database applications. Therefore, maintaining the aggregation concept in database implementation is essential [2]. Aggregation is a composition (part-of) relationship, in which a composite object (“whole”) consists of other component objects (“parts”) [3]. The aggregation concept is a powerful tool in database design and, consequently, preserving aggregation in database implementation is essential.

The aggregation problem becomes especially acute in a Database Management System (DBMS), since such a system contains a large volume of data that could form aggregates that are more sensitive than their constituent parts. This problem has been investigated in the context of a database [1].

Since large-scale aggregation queries are typically used to get a “big picture” of a data set, a more attractive approach is to perform online aggregation, in which a progressively refined running estimate of the final aggregate values is continuously displayed to the user. The estimated proximity of a running estimate to the final result is indicated by means of an associated confidence interval. An online aggregation system must be optimized to provide useful information quickly, rather than to minimize the time to query completion [4].

*Author for correspondence
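The online aggregation idea summarized in the related work above (a running estimate with a confidence interval) can be illustrated with a minimal Python sketch. This is not the ripple join estimator of [4]; it assumes a single table scanned in random order and uses a simple normal-approximation interval for an average. The function name `online_mean`, the `z` value and the `seed` parameter are hypothetical choices made only for this illustration.

```python
import math
import random


def online_mean(values, z=1.96, seed=0):
    """Yield (rows_seen, running_mean, half_width) after each row.

    Assumption: rows are visited in random order, so the rows seen so far
    behave like a random sample and a normal-approximation interval applies.
    """
    order = list(values)
    random.Random(seed).shuffle(order)

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
        else:
            half_width = float("inf")  # no spread estimate from one row
        yield n, mean, half_width


if __name__ == "__main__":
    data = [float(i) for i in range(1, 101)]
    for n, mean, half in online_mean(data):
        if n % 25 == 0:
            print(f"rows={n:3d} estimate={mean:7.2f} +/- {half:.2f}")
```

The caller can stop consuming the generator as soon as the interval is narrow enough, which reflects the stated goal of giving useful information quickly rather than minimizing the time to query completion.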

3. Problem Definition

A cancer data set, presented in Table 1, is used for aggregation with SQL queries. In Table 1, the first column is the primary key and the remaining three columns contain the exposure, the gender, and the percentage of oral cavity and pharynx cancer. One of the columns of that table is passed to the standard SQL aggregations; the primary key column is not used for aggregation. Normally, SQL aggregation returns its results in a vertical layout. The percentage of oral cavity and pharynx cancer, which is a non-key field, is used for the analysis. First, we apply the SQL aggregation function to the main table, which gives a vertical layout with fewer rows than the original table. Then we reduce the size of the table further by using the pivot transformation function, which gives the result in a horizontal layout.

Table 1.  The oral cavity and pharynx cancer data set

S.No  Exposure                       Gender  Oral cavity and pharynx
1     Tobacco                        Male    69.5
2     Alcohol                        Male    37.3
3     Fruit and vegetables           Male    57.2
4     Meat                           Male    0
5     Fibre                          Male    0
6     Salt                           Male    0
7     Overweight and obesity         Male    0
8     Physical exercise              Male    0
9     Infections                     Male    12.3
10    Radiation - ionising           Male    0
11    Radiation - UV                 Male    0
12    Occupation                     Male    0.6
13    Tobacco                        Female  54.9
14    Alcohol                        Female  16.9
15    Fruit & vegetables             Female  53.6
16    Meat                           Female  0
17    Fibre                          Female  0
18    Salt                           Female  0
19    Overweight & obesity           Female  0
20    Physical exercise              Female  0
21    Post-menopausal hormones       Female  0
22    Infections                     Female  14
23    Radiation - ionising           Female  0
24    Radiation - UV                 Female  0
25    Occupation                     Female  0.2
26    Reproduction (breast feeding)  Female  0

4. Analysis of Cancer Data Set

Cancer of the mouth, or of the oral cavity and the oropharynx, is referred to as oral cancer. The sample dataset, collected from http://www.theguardian.com/news/datablog/2011/Dec/07/cancer-causes-list, is shown in Table 1. The SQL aggregation function has been applied to the sample data, and the resultant data set is shown in Table 2. The oral cavity and pharynx cancer table has 26 records with four attributes: record number, risk factor exposure, gender, and percentage of oral cavity and pharynx cancer. The risk factor percentage is given for tobacco, alcohol, fruit and vegetables, meat, fibre, salt, overweight and obesity, physical exercise, infections, ionizing and UV radiation, occupation, post-menopausal hormones and reproduction (breast feeding). The statistical survey on oral cancers reveals that more men than women are affected by the disease. We experimented with this method using MS SQL Server 2008 and measured the size of each table. Tables 1 and 2 show that the horizontal layout representation is smaller than the vertical representation (14 records instead of 26), so the resultant table occupies less memory space in the database.

Table 2.  Data after horizontal layout

S.No  Exposure                       Male Oral cavity and pharynx  Female Oral cavity and pharynx
1     Tobacco                        69.5                          54.9
2     Alcohol                        37.3                          16.9
3     Fruit and vegetables           57.2                          53.6
4     Meat                           0                             0
5     Fibre                          0                             0
6     Salt                           0                             0
7     Overweight and obesity         0                             0
8     Physical exercise              0                             0
9     Infections                     12.3                          14
10    Radiation – ionizing           0                             0
11    Radiation – UV                 0                             0
12    Occupation                     0.6                           0.2
13    Post-menopausal hormones       0                             Null
14    Reproduction (breast feeding)  0                             Null
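The vertical-to-horizontal transformation described in this paper can be sketched as follows. This is a hedged illustration, not the authors' MS SQL Server 2008 queries, which the paper does not list: it uses Python's built-in sqlite3 module, and because SQLite has no PIVOT operator, the horizontal layout is produced with conditional aggregation (SUM over CASE) instead. The table name `cancer`, the column names, and the function `build_layouts` are assumptions made for this example, and only eight rows of Table 1 are loaded.

```python
import sqlite3

# A subset of Table 1: (s_no, exposure, gender, pct).
ROWS = [
    (1, "Tobacco", "Male", 69.5),
    (2, "Alcohol", "Male", 37.3),
    (9, "Infections", "Male", 12.3),
    (12, "Occupation", "Male", 0.6),
    (13, "Tobacco", "Female", 54.9),
    (14, "Alcohol", "Female", 16.9),
    (22, "Infections", "Female", 14.0),
    (25, "Occupation", "Female", 0.2),
]


def build_layouts(rows):
    """Return (vertical, horizontal) aggregation results for the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cancer ("
        "s_no INTEGER PRIMARY KEY, exposure TEXT, gender TEXT, pct REAL)"
    )
    conn.executemany("INSERT INTO cancer VALUES (?, ?, ?, ?)", rows)

    # Standard aggregation: one result row per (exposure, gender) group.
    vertical = conn.execute(
        "SELECT exposure, gender, SUM(pct) FROM cancer "
        "GROUP BY exposure, gender ORDER BY exposure, gender"
    ).fetchall()

    # Horizontal layout: one row per exposure, one column per gender.
    # SUM over CASE stands in for the PIVOT operator, which SQLite lacks.
    horizontal = conn.execute(
        "SELECT exposure, "
        "SUM(CASE WHEN gender = 'Male' THEN pct END) AS male_pct, "
        "SUM(CASE WHEN gender = 'Female' THEN pct END) AS female_pct "
        "FROM cancer GROUP BY exposure ORDER BY exposure"
    ).fetchall()

    conn.close()
    return vertical, horizontal


if __name__ == "__main__":
    vertical, horizontal = build_layouts(ROWS)
    print(len(vertical), "rows in the vertical layout")
    print(len(horizontal), "rows in the horizontal layout")
    for row in horizontal:
        print(row)
```

In this subset the vertical result has one row per exposure and gender pair, while the horizontal result has one row per exposure, so the row count is halved. The primary key column `s_no` is not passed to the aggregation, matching the description in the problem definition.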


5. Conclusion

This paper presented a new approach that reduces the storage space used in the database by means of a horizontal layout representation, and evaluated it on a cancer data set of 26 records. In the future, the approach can be applied to a larger data set.

6. References

1. Thomas H. Hinke, Inference Aggregation Detection in Database Management Systems, IEEE, 1988.
2. Johanna Wenny Rahayu and David Taniar, Preserving Aggregation in an Object-Relational DBMS, Springer-Verlag Berlin Heidelberg, pp. 1–10, 2002.
3. Rumbaugh, J. et al., Object-Oriented Modelling and Design, Prentice-Hall, 1991.
4. Peter J. Haas and Joseph M. Hellerstein, Ripple Joins for Online Aggregation, ACM, 1999.
