MSc. Thesis Defense: Suat Akkaş
Effect of Dataset Reduction Techniques on Computational Complexity and Predictive Performance of Classification Problem
Suat Akkaş
Data Science, MSc. Thesis, 2024
Thesis Jury
Asst. Prof. Ezgi Karabulut Türkseven (Thesis Advisor), Asst. Prof. Sinan Yıldırım,
Prof. Dr. Cem İyigün
Date & Time: 17th of December, 2024 – 09.30 AM
Place: FMAN G062
Keywords: random sampling, classification, similarity metrics, representativeness
Abstract
The use of big data in industry is increasing day by day, and the financial industry is no exception. In the financial sector, big data has led to substantial improvements on problems such as credit scoring. However, using big data also greatly increases computation time and the consumption of available resources, which makes it inefficient in some applications and situations.
To address this inefficiency, this study focuses on sampling methods. By applying row-wise sampling algorithms and column-wise data reduction, we aimed to reduce the computation time for solving credit scoring problems. Our aim, however, is not only to reduce computation time but also to examine the performance of the credit scoring model when big data is used. We also used feature selection and transformation algorithms to observe their effect on predictive power across different sample sizes. Moreover, to validate whether a sample dataset represents the main dataset, we used a set of similarity metrics suited to the different data types present in the dataset.
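As a rough illustration of row-wise sampling followed by a representativeness check, the sketch below draws a simple random sample and compares it with the full data. The abstract does not name the sampling algorithms or similarity metrics used in the thesis, so everything here is an assumption made for illustration: the two-sample Kolmogorov-Smirnov statistic stands in for a numeric-column metric, total variation distance stands in for a categorical-column metric, and the data is synthetic.

```python
import random
from bisect import bisect_right
from collections import Counter


def sample_rows(rows, fraction, seed=0):
    """Simple random sample (without replacement) of a fraction of the rows."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


def total_variation(a, b):
    """Total variation distance between two categorical frequency distributions."""
    pa, pb = Counter(a), Counter(b)
    return 0.5 * sum(
        abs(pa[c] / len(a) - pb[c] / len(b)) for c in set(pa) | set(pb)
    )


# Synthetic stand-in for a dataset: one numeric and one categorical column.
rng = random.Random(42)
full = [(rng.gauss(50, 10), rng.choice(["A", "B", "C"])) for _ in range(10_000)]

sample = sample_rows(full, fraction=0.1)

# Values near 0 suggest the sample resembles the full data on that column.
ks = ks_statistic([r[0] for r in full], [r[0] for r in sample])
tv = total_variation([r[1] for r in full], [r[1] for r in sample])
print(f"sample size={len(sample)}  KS={ks:.4f}  TV={tv:.4f}")
```

Both metrics lie in [0, 1], with 0 meaning identical empirical distributions. Using a separate metric per data type mirrors the idea in the abstract, although the specific choices here are hypothetical.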
Using this methodology, we observed the relationship between computation time, predictive power, and data representativeness for different sample sizes. According to our findings, it is possible to preserve the predictive power of models down to a certain sample size while significantly reducing the amount of computation. By demonstrating the trade-off between computation time and predictive power across different sample sizes and feature reduction methods, we aim to propose a choice of sample size and feature reduction method suited to one's main concerns.
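The kind of experiment described above can be sketched as a loop over sample fractions that records training time and test accuracy for each. This is only a minimal sketch under stated assumptions: the data is synthetic, and a pure-Python nearest-centroid classifier is used as a hypothetical stand-in, since the abstract does not say which classification models the thesis evaluates.

```python
import random
import time


def make_data(n, seed):
    """Synthetic two-class data with two numeric features (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid 'training': the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in rows:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


train = make_data(20_000, seed=1)
test = make_data(2_000, seed=2)

rng = random.Random(0)
results = []
for fraction in (0.01, 0.1, 0.5, 1.0):
    subset = rng.sample(train, round(len(train) * fraction))
    start = time.perf_counter()
    model = fit_centroids(subset)
    elapsed = time.perf_counter() - start
    results.append((fraction, elapsed, accuracy(model, test)))

for fraction, elapsed, acc in results:
    print(f"fraction={fraction:<5} train_time={elapsed:.5f}s accuracy={acc:.3f}")
```

Plotting accuracy against training time for each fraction gives the trade-off curve from which a sample size could be chosen. On this easy synthetic problem the accuracy barely changes with sample size, which will not necessarily hold for real credit scoring data.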