Data Preparation Tool for Exploration in Data Mining

Lee, Hock Heng (2007) Data Preparation Tool for Exploration in Data Mining. Masters thesis, University of Malaya.

Chapter 3 - Aspects of DP.pdf (104kB)
Chapter 2 Literature review.pdf (148kB)
Chapter 4 - Development Methodology.pdf (74kB)
Chapter 5 System Development.pdf (114kB)
Chapter 6 - The Versatile DP tool.pdf (76kB)
Chapter 7 Conclusion.pdf (52kB)
Table of Contents.pdf (24kB)
Appendix A Deploying DP tool - Procedures.pdf (22kB)
Appendix B - use cases and diagrams.pdf (46kB)
Appendix C Object-analysis Artefacts.pdf
Appendix D2 - Object Design Artefacts.pdf
Appendix D1 Object-design Artefacts.pdf
Appendix E - User Interface classes.pdf
Appendix F - stored procedures.pdf
Appendix G - Tool Usage.pdf
Appendix H - Organisation of Source Files.pdf
Appendix I - Comparison of Software Tools.pdf
Appendix J - DP Tool Evaluation.pdf
References.pdf
Cover Page.pdf
acknowledgement.pdf
An Abstract.pdf
Chapter 1 introduction.pdf

Abstract

Data preparation is an essential part of data mining, which consists of preparing, surveying and modelling data. It prepares the data as well as the miner, so that when the prepared data is used, better models are produced more quickly. Much of this important step can be automated, and this led to the development of a data preparation tool (the DP tool) for data mining.

Data preparation involves looking at the data variables individually as well as at the set of variables as a whole. Certain variable characteristics cause problems in data mining, including “sparse” variables, “compact” variables, monotonic variables, and outliers. For some modelling methods, these problems may affect the speed of modelling and/or the value of the model. Fortunately, techniques are available to solve them before the data is mined, and some are used when performing simple data transformations on a data set with the DP tool. When preparing a data set, two areas need attention: getting enough data and exposing its information content. Getting enough data is known as capturing data set variability. The estimated confidence measures of each variable are compared with the computed ones to ensure that a particular data collection set has enough data to build useful models. In the process, a variable status report is prepared. The data collection set may contain very complex relationships, which are often known beforehand by the business expert. Giving the mining tool such knowledge at the outset speeds up its work. One such case is the aggregation of transaction details to the customer level, which is performed when building a data set.

The DP tool is based on a visual mining project carried out by a cellular phone company. The project aimed to identify the customer churn rate and to determine what actions would reduce it. Descriptive models provide not only the trend in customer churn but also the profiles of churned customers. The project data sets serve as test data for the DP tool. Before any data can be prepared, they have to be extracted by downloading them from their sources into an exploratory database. The DP tool provides a module to extract online data from different database servers, both local and remote. Another module provides scrollable editing for different data “types”, such as first-load data, which are reloaded after corrections. Table records can be edited, added or deleted. Once the collected data have been cleaned and verified, a data set is created. The data set then undergoes data transformations, which are categorised into discrete items, continuous items and computed items. A housekeeping module known as database maintenance is also provided.

The DP tool is developed as a client/server implementation of a two-tier “plus many” architecture. The client and server reside on the same host, a laptop. The main server is linked to other server instances for data access. SQL Server 2000 provides high reliability, high security, and a powerful SQL programming language, which is used to implement all the data preparation tasks. The other development tool used is JBuilder (Borland), which provides a visual programming environment for building the user-friendly interface, consisting of frames and dialogs. The Java user-interface classes reside on the client, while the data preparation stored procedures reside in the server database.
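
The aggregation of transaction details to the customer level mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis, which implements its data preparation tasks as SQL Server stored procedures; the Python function `aggregate_to_customer` and the field names `customer_id` and `minutes` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate_to_customer(transactions):
    """Roll transaction-level rows up to one summary row per customer.

    `transactions` is an iterable of dicts with the (hypothetical) keys
    "customer_id" and "minutes".
    """
    # Running count and sum per customer.
    totals = defaultdict(lambda: {"n_calls": 0, "total_minutes": 0.0})
    for row in transactions:
        summary = totals[row["customer_id"]]
        summary["n_calls"] += 1
        summary["total_minutes"] += row["minutes"]

    # Derive the per-customer mean once all rows have been seen.
    result = {}
    for customer_id, summary in totals.items():
        result[customer_id] = {
            "n_calls": summary["n_calls"],
            "total_minutes": summary["total_minutes"],
            "mean_minutes": summary["total_minutes"] / summary["n_calls"],
        }
    return result


# Hypothetical call-detail records, invented for this example.
transactions = [
    {"customer_id": "C001", "minutes": 3.0},
    {"customer_id": "C002", "minutes": 10.0},
    {"customer_id": "C001", "minutes": 5.0},
]
customer_level = aggregate_to_customer(transactions)
```

Each customer ends up with a single row of summary values, which is the level at which a churn model would typically be built. In a database setting the same roll-up would usually be expressed as a `GROUP BY` query over the transaction table.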

Item Type: Thesis (Masters)
Subjects: Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Depositing User: MS NOOR ZAKIRA ZULRIMI
Date Deposited: 10 Jul 2013 06:21
Last Modified: 10 Jul 2013 06:21
URI: http://repository.um.edu.my/id/eprint/110
