Journal Article


The effect of data complexity on classifier performance

Abstract

The research area of Software Defect Prediction (SDP) is both extensive and popular, and is often treated as a classification problem. Improvements in classification, pre-processing and tuning techniques, (together with many factors which can influence model performance) have encouraged this trend. However, no matter the effort in these areas, it seems that there is a ceiling in the performance of the classification models used in SDP. In this paper, the issue of classifier performance is analysed from the perspective of data complexity. Specifically, data complexity metrics are calculated using the Unified Bug Dataset, a collection of well-known SDP datasets, and then checked for correlation with the defect prediction performance of machine learning classifiers (in particular, the classifiers C5.0, Naive Bayes,  Artificial Neural Networks, Random Forests, and Support Vector Machines). In this work, different domains of competence and incompetence are identified for the classifiers. Similarities and differences between the classifiers and the performance metrics are found and the Unified Bug Dataset is analysed from the perspective of data complexity. We found that certain classifiers work best in certain situations and that all data complexity metrics can be problematic, although certain classifiers did excel in some situations.

Attached files

Authors

Eberlein, Jonas
Rodriguez, Daniel
Harrison, Rachel

Oxford Brookes departments

School of Engineering, Computing and Mathematics

Dates

Year of publication: 2024
Date of RADAR deposit: 2024-10-29


Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 International License


Related resources

This RADAR resource is the Version of Record of The Effect of Data Complexity on Classifier Performance

Details

  • Owner: Daniel Croft (removed)
  • Collection: Outputs
  • Version: 1 (show all)
  • Status: Live
  • Views (since Sept 2022): 111