Dealing with Missing Data and Uncertainty in the Context of Data Mining

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

Open Access permissions: Open

Abstract

Missing data is an issue in many real-world datasets, yet robust methods for dealing with it appropriately still need development. In this paper we investigate how some methods for handling missing data perform as the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository, we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR), both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifier on the basis of complete case analysis and simple imputation, and then study the performance of the algorithms that can handle missing data themselves. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible as missing data increases, particularly for high-dimensional data. We also find that increasing missing data has a negative effect on the performance of all the algorithms tested, but that the algorithms, whether combined with preprocessing in the form of simple imputation or handling the missing data themselves, do not show a significant difference in performance.
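The effect described in the abstract, where complete case analysis discards most records once missingness grows in high-dimensional data, can be illustrated with a small sketch. This is not the authors' experimental code. The helper names (inject_mcar, complete_cases, mean_impute), the synthetic Gaussian data, and the choice to mask individual cells uniformly at random are all assumptions made for illustration, and mean imputation stands in for "simple imputation" only as one plausible reading.

```python
import random
from statistics import mean


def inject_mcar(rows, fraction, seed=0):
    """Return a copy of rows with `fraction` of all cells set to None.

    Cells are picked uniformly at random, independent of their values,
    which is the defining property of MCAR. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = [list(r) for r in rows]  # copy so the input stays untouched
    for i, j in rng.sample(cells, round(fraction * len(cells))):
        masked[i][j] = None
    return masked


def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing value."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Simple imputation: fill each gap with its column's observed mean."""
    col_means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


# Synthetic stand-in for a benchmark dataset: 200 records, 20 attributes.
data_rng = random.Random(42)
data = [[data_rng.gauss(0, 1) for _ in range(20)] for _ in range(200)]

masked = inject_mcar(data, fraction=0.10, seed=1)
kept = complete_cases(masked)
imputed = mean_impute(masked)

print(len(kept), "of", len(masked), "records survive complete case analysis")
print(len(imputed), "records remain after mean imputation")
```

If each cell were missing independently with probability p, a record with d attributes would be complete with probability (1 - p)^d, so the share of usable records falls quickly as d grows. Imputation keeps every record, at the cost of replacing missing values with estimates.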

Details

Original language: English
Title of host publication: Hybrid Artificial Intelligent Systems
Subtitle of host publication: 13th International Conference, HAIS 2018, Oviedo, Spain, June 20-22, 2018, Proceedings
Publisher: Springer
Chapter: 24
Pages: 289-301
Number of pages: 13
ISBN (Electronic): 978-3-319-92639-1
ISBN (Print): 978-3-319-92638-4
State: Published - 2018

Publication series

Name: Hybrid Artificial Intelligent Systems
Volume: 10870
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

ID: 139434597

Related by author
  1. Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF)

    Research output: Chapter in Book/Report/Conference proceeding (Other chapter contribution)

  2. Heuristic Ensemble of Filters for Reliable Feature Selection

    Research output: Contribution to conference (Paper)

  3. Non-linear dimensionality reduction for privacy-preserving data classification

    Research output: Chapter in Book/Report/Conference proceeding (Chapter)

  4. Generalised Decision Level Ensemble Method for Classifying Multi-media Data

    Research output: Contribution to conference (Paper)

  5. Decision Level Ensemble Method for Classifying Multi-media Data

    Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)