Cite this article
  • 林泳昌,朱晓姝.一种基于SMOTE的不均衡样本KNN分类方法[J].广西科学,2020,27(3):276-283.
  • LIN Yongchang,ZHU Xiaoshu.A SMOTE based KNN Classification Method for Unbalanced Samples[J].Guangxi Sciences,2020,27(3):276-283.
DOI:10.13656/j.cnki.gxkx.20200707.001
Foundation item: Supported by the National Natural Science Foundation of China (61762087), the Guangxi Natural Science Foundation (2018JJA170175), and the College Students' Innovation and Entrepreneurship Training Program (201810606014).
A SMOTE based KNN Classification Method for Unbalanced Samples
LIN Yongchang, ZHU Xiaoshu
(School of Computer Science and Engineering, Yulin Normal University, Yulin, Guangxi, 537000, China)
Abstract:
To address the problem that the predictions of the K-nearest Neighbor (KNN) method are biased toward the majority class when the data samples are unbalanced, this paper proposes a KNN classification optimization method for unbalanced samples (KSID) based on the synthetic minority oversampling technique (SMOTE). The method proceeds as follows. First, the SMOTE method is used to balance the unbalanced training set, and a logistic regression model is trained on the balanced data. Second, the logistic regression model is used to predict the training set, the samples predicted as positive are collected, these positive samples are balanced with the SMOTE method, and a KNN model is trained on them. Finally, the test set is fed into this KNN model combined with the logistic regression method for prediction to obtain the final result. On six unbalanced data sets, KSID is compared with the logistic regression, KNN, support vector machine (SVM), and decision tree methods. The results show that KSID outperforms the comparison methods on the four performance indicators of accuracy, recall, precision, and F1 score. By introducing SMOTE, the KSID method overcomes the classification bias that arises when the KNN model encounters an unbalanced sample data set, and provides a reference for further research on the optimization and application of the KNN method.
Key words:  unbalanced sample  KNN  SMOTE  KSID  logistic regression  classification
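
The abstract describes the KSID pipeline only at a high level. The following Python sketch gives one plausible reading of it, using scikit-learn and imbalanced-learn. The synthetic data set, the parameter values, and in particular the assumed combination rule at prediction time (logistic regression screens every test sample and the KNN model re-classifies only the samples flagged as positive) are illustrative assumptions, not details taken from the paper.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative unbalanced two-class data set (roughly 10% positive samples).
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: balance the training set with SMOTE and train a logistic regression model.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Step 2: predict the original training set with the logistic regression model,
# keep the samples it predicts as positive, balance them with SMOTE again
# (using their true labels), and train a KNN model on the result.
pos_mask = lr.predict(X_train) == 1
X_pos, y_pos = SMOTE(random_state=0).fit_resample(X_train[pos_mask], y_train[pos_mask])
knn = KNeighborsClassifier(n_neighbors=5).fit(X_pos, y_pos)

# Step 3 (assumed combination rule): logistic regression screens the test set and
# the KNN model re-classifies only the samples flagged as positive; the remaining
# samples keep the logistic regression decision.
y_pred = lr.predict(X_test)
flagged = y_pred == 1
if flagged.any():
    y_pred[flagged] = knn.predict(X_test[flagged])

# The four indicators reported in the paper: accuracy, recall, precision, F1.
print("accuracy :", accuracy_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))

Fixing random_state throughout only makes the sketch reproducible; the paper's own data sets, parameter settings, and the exact way the two models are combined at prediction time may differ.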