Open source data mining tools for audit purposes

Nádia Valls de Almeida

Isabel Pedrosa

ISCAC - Polytechnic Institute of Coimbra, Portugal Quinta Agricola - Bencanta 3040-316 Coimbra

ISCAC - Polytechnic Institute of Coimbra, Portugal Quinta Agricola - Bencanta 3040-316 Coimbra


A quick look shows that RapidMiner is user-friendly software (see Figure 1), with a powerful but intuitive graphical user interface that uses a workspace to start a project, which holds all the information needed.

ABSTRACT In a fast-growing market such as software development, open source applications are an increasing trend, offering choice and reduced costs. The need for open source audit tools among a growing population of auditors is the target of this poster: this work presents a list of open source software available for data mining as an audit tool.

The program window is composed of the main process window, the operators' directory and the repositories that store data and projects, where the data and metadata processing is handled. One reported characteristic is that it is the only tool supporting on-the-fly error recognition and quick fixes.

Keywords Data Mining, Open Source Software, CAATT

Categories and Subject Descriptors H.2.8 [Database Applications] - Data mining. H.4.4 [Information Systems Applications – Miscellaneous] Computer Aided audit tools and techniques.

General Terms Algorithms, Management, Measurement.

1. DATA MINING TOOLS AND AUDIT

Audit is the systematic process of objectively obtaining and evaluating evidence about the correspondence between information, situations or procedures and established criteria [1]. Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful, in that they lead to some advantage, usually an economic one [2]. Combining both definitions, data mining could be used as an audit tool to obtain, in a semi-automatic way, evidence in the form of patterns. Data mining techniques extract useful evidence from large and varied amounts of data. It has been shown recently that data mining techniques can be applied to audit in fraud detection, forensic accounting and security evaluation.

Figure 1. RapidMiner example for sales report [6]

This data mining software can be used both for research and for real-world data mining tasks, allowing experiments on a large number of operators, which are described in XML files and assembled with the graphical user interface. Its data mining and machine learning procedures include features such as data loading and transformation (ETL), data preprocessing and visualization, data integration, modeling, evaluation, deployment and analysis. The software is developed in Java and has a large community contributing code and forum discussion. The open source RapidMiner edition is called Community Edition, which is a toolkit for data mining. This tool can define analytical steps and generate graphs similar to those of MS Excel.

2. DATA MINING OPEN SOURCE TOOLS

2.1. RapidMiner

RapidMiner claims to be the “unquestionable world-leading open-source system for data mining” [7].

2.2. Orange

From Slovenia comes Orange, a component-based data mining and machine learning software suite written in C++ and Python [8]. It presents a visual programming GUI for data analysis and visualization with a complete set of components for data preprocessing, scoring, filtering, modeling, model evaluation and exploration techniques.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. OSDOC'11, July 11, 2011, Lisbon, Portugal. Copyright 2011 ACM 978-1-4503-0873-1/11/07...$10.00


models and views. This software is also written in Java and supports extension through plugins, which allow users to add modules for text, image and time series mining. Its core version already includes several modules for data integration, transformation, analysis and visualization.

Figure 2. Orange Classification tree viewer

The application features visual programming, visualization, interaction and data analytics, a large toolbox and a scripting interface. Visual programming lets the user design the data analysis process, remembering the user's choices and suggesting the most frequently used combinations and which communication channels to use. The visualization feature offers several kinds of display: scatterplots, bar charts, trees and networks. The interaction and data analytics features provide actions between data. The toolbox contains over 100 widgets. Finally, the scripting interface in Python allows users to program new algorithms and develop data analysis procedures.

Figure 4. Knime application window

3. DATA MINING TOOLS AND AUDIT

All the open source data mining tools described above can be helpful in audit processes, especially as data analysis tools that detect data patterns and act as decision support tools. Data analysis is one of the categories of Computer Assisted Audit Tools and Techniques (CAATTs). Several tools, such as Audit Command Language (ACL) and IDEA, are already used in large audit companies for extensive data analysis: they support databases of several exabytes, with almost a million records in each table, and processing time is not a concern. However, it is not common for audit companies to use open source software for these tasks. Additionally, auditors usually do not take advantage of data mining software to find data patterns. Nevertheless, open source data mining tools can perform extensive data analysis for audit purposes, taking advantage of data mining functionalities, methods and algorithms.
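As an illustration of the kind of data pattern an auditor might look for, the following is a minimal Python sketch that flags payments sharing the same vendor, invoice number and amount. It is not taken from any of the tools discussed here: the function name `find_duplicate_payments`, the record fields and the sample records are all hypothetical and only serve to make the idea concrete.

```python
from collections import defaultdict


def find_duplicate_payments(payments):
    """Group payments by (vendor, invoice, amount) and keep groups with 2+ entries.

    `payments` is a list of dicts with the hypothetical keys
    "id", "vendor", "invoice" and "amount".
    """
    groups = defaultdict(list)
    for payment in payments:
        key = (payment["vendor"], payment["invoice"], payment["amount"])
        groups[key].append(payment["id"])
    # A key that occurs more than once is a candidate duplicate payment.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample records, for illustration only.
sample = [
    {"id": 1, "vendor": "V01", "invoice": "INV-10", "amount": 1200.00},
    {"id": 2, "vendor": "V02", "invoice": "INV-11", "amount": 310.50},
    {"id": 3, "vendor": "V01", "invoice": "INV-10", "amount": 1200.00},
    {"id": 4, "vendor": "V03", "invoice": "INV-12", "amount": 75.00},
]

duplicates = find_duplicate_payments(sample)
for key, ids in duplicates.items():
    print(key, "appears in payment ids", ids)
```

A flagged group is only a candidate for review. Deciding whether it is an error or a legitimate repeated payment remains the auditor's judgment.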

2.3. Weka

Like RapidMiner, Weka (Waikato Environment for Knowledge Analysis) is written in Java. It is a solid, supported and well-known machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization and feature selection [4]. It also provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query.
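The idea of analysing the result of a database query can be sketched in a few lines. Weka itself does this in Java through JDBC. The sketch below does not use Weka: it uses Python's built-in `sqlite3` module with an in-memory database, purely to illustrate the general pattern. The `invoices` table, its columns and its rows are hypothetical.

```python
import sqlite3


def load_query_result(conn, query):
    """Run a SQL query and return (column_names, rows) for further analysis."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


def build_demo_db():
    """Create an in-memory database with a small, made-up invoices table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, vendor TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO invoices (vendor, amount) VALUES (?, ?)",
        [("A", 100.0), ("B", 250.0), ("A", 50.0)],
    )
    conn.commit()
    return conn


conn = build_demo_db()
columns, rows = load_query_result(
    conn,
    "SELECT vendor, SUM(amount) AS total FROM invoices "
    "GROUP BY vendor ORDER BY vendor",
)
print(columns)
for row in rows:
    print(row)
```

The returned column names and rows play the role of the dataset that a data mining tool would then preprocess, cluster or classify.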

4. REFERENCES

[1] Morais, Georgina. Martins, Isabel. 2007. Auditoria interna função e processo. Áreas Editora.
[2] Witten, Ian H. Frank, Eibe. Hall, Mark A. 2011. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Ed. Elsevier.
[3] KNIME (Konstanz Information Miner), http://www.knime.org/, accessed June 1, 2011.
[4] WEKA 3, Data Mining with open source machine learning software in Java, http://www.cs.waikato.ac.nz/ml/weka/, accessed June 2, 2011.

Figure 3. Weka explorer window [5]

[5] Mloss.org Machine Learning open source software, http://mloss.org/media/screenshot_archive/weka_explorer_screenshot.png, accessed June 2, 2011.

Its main user interface is the WEKA Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.

[6] Rapid-I, http://rapid-i.com/content/view/181/190/lang,en/#data_sheet, accessed June 1, 2011.

2.4. Knime

Knime stands for Konstanz Information Miner. This platform [3] allows data integration, processing, analysis and exploration in a very comprehensive way. It works as follows: to execute a data analysis, the user creates data flows (pipelines) and can then analyze the results,

[7] Rapid-I, Report the future, http://rapid-i.com/content/view/181/190/, accessed June 1, 2011.
[8] Orange, http://www.ailab.si/orange, accessed June 3, 2011.
