MSc. Thesis Defense: Suat Akkaş

Effect of Dataset Reduction Techniques on Computational Complexity and Predictive Performance of Classification Problem

 

 

Suat Akkaş

Data Science, MSc. Thesis, 2024

 

 

                                   Thesis Jury

Asst. Prof. Ezgi Karabulut Türkseven (Thesis Advisor), Asst. Prof. Sinan Yıldırım,

Prof. Dr. Cem İyigün

 

 

Date & Time: 17th of December, 2024 – 09.30 AM

Place: FMAN G062

 

 

Keywords: random sampling, classification, similarity metrics, representativeness

 

 

                                     Abstract

 

The use of big data in industry grows day by day, and the financial sector is no exception. There, big data has led to substantial improvements on problems such as credit scoring. However, it also greatly increases computation time and the consumption of available resources, which makes big data inefficient to use in some applications and situations.

To address this inefficiency, this study focuses on sampling methods. By applying row-wise sampling algorithms and column-wise data reduction, we aimed to reduce the computation time needed to solve credit scoring problems. Our aim, however, is not only to reduce computation time but also to examine the performance of credit scoring models when big data is used. We also applied feature selection and transformation algorithms to observe their effect on predictive power at different sample sizes. Moreover, to validate whether a sample represents the main dataset, we used a set of similarity metrics suited to the different data types in the dataset.
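As a minimal sketch of the representativeness check described above, the following Python example draws simple random row samples and compares each sample with the full data. The abstract does not name the similarity metrics used in the thesis, so the two chosen here (the two-sample Kolmogorov-Smirnov statistic for a numeric column and total variation distance for a categorical column) are assumptions, as are the synthetic data and the function names.

```python
import random
from bisect import bisect_right
from collections import Counter


def sample_rows(rows, fraction, seed=0):
    """Simple random sampling without replacement (row-wise reduction)."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def total_variation(a, b):
    """Total variation distance between two categorical frequency distributions."""
    freq_a, freq_b = Counter(a), Counter(b)
    categories = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a[c] / len(a) - freq_b[c] / len(b)) for c in categories)


# Synthetic stand-in for a credit dataset: (income, employment_type) rows.
# These columns are hypothetical and are not taken from the thesis.
data_rng = random.Random(42)
employment_types = ["salaried", "self-employed", "retired"]
full = [(data_rng.gauss(50_000, 15_000), data_rng.choice(employment_types))
        for _ in range(10_000)]

for fraction in (0.01, 0.1, 0.5):
    sample = sample_rows(full, fraction)
    ks = ks_statistic([r[0] for r in full], [r[0] for r in sample])
    tv = total_variation([r[1] for r in full], [r[1] for r in sample])
    print(f"fraction={fraction:.2f}  KS(income)={ks:.3f}  TV(employment)={tv:.3f}")
```

Both metrics lie between 0 and 1, with values near 0 suggesting that the sample's distribution for that column is close to the full dataset's.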

Using this methodology, we observed the relation between computational time, predictive power, and data representativeness for different sample sizes. According to our findings, the predictive power of the models can be preserved down to a certain sample size while significantly reducing the amount of computation. By demonstrating the relation between computational time and predictive power across different sample sizes and feature reduction methods, we aim to guide the choice of sample size and feature reduction method according to one's main concerns.
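The kind of measurement described above can be illustrated with a small experiment that records accuracy and elapsed time at several sample sizes. This is only a sketch under stated assumptions: the abstract does not say which classifiers or datasets the thesis uses, so the 1-nearest-neighbour classifier, the synthetic two-class data, and all names below are hypothetical choices made for illustration.

```python
import random
import time


def make_data(n, rng):
    """Two overlapping Gaussian classes in 2-D (synthetic; not credit data)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.5 * label
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def predict_1nn(train, point):
    """Return the label of the closest training point (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def evaluate(train, test):
    """Return (accuracy, elapsed seconds) for classifying every test point."""
    start = time.perf_counter()
    correct = sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(test), time.perf_counter() - start


rng = random.Random(0)
full_train = make_data(2000, rng)
test = make_data(200, rng)

results = []
for fraction in (0.05, 0.25, 1.0):
    # Row-wise random sample of the training data at the given fraction.
    subset = rng.sample(full_train, int(len(full_train) * fraction))
    accuracy, seconds = evaluate(subset, test)
    results.append((fraction, accuracy, seconds))
    print(f"fraction={fraction:.2f}  accuracy={accuracy:.3f}  time={seconds:.3f}s")
```

Plotting accuracy against time for each fraction gives a trade-off curve from which a sample size can be picked according to whether speed or predictive power matters more.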