A random forest algorithm based on similarity measure and dynamic weighted voting

ZHAO Shu-xu, MA Qin-jing, LIU Li-jiao

（School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China）

Abstract： The random forest model is easy to understand and widely applicable, and it is often used for classification and prediction. However, it combines all of its decision trees non-selectively and decides the final result by simple majority voting, which ignores the differences in strength among the trees and reduces the prediction accuracy of the model. To address these defects, an improved random forest model based on the confusion matrix (CM-RF) is proposed. During model construction, the decision tree cluster is built selectively by means of a similarity measure, and in the final voting stage the result is produced by a dynamic weighted voting fusion method. Experiments show that the proposed CM-RF reduces the influence of low-performance decision trees on the output, thereby improving the accuracy and generalization ability of the random forest model.

Key words： random forest; confusion matrix; similarity measure; dynamic weighted voting

CLD number： TP312.8 Document code： A

Article ID： 1674-8042（2019）03-0277-08 doi： 10.3969/j.issn.1674-8042.2019.03.011
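
The abstract names two mechanisms, selective tree construction through a similarity measure and dynamic weighted voting, but does not give their formulas. The following is only a minimal sketch under stated assumptions, not the authors' CM-RF implementation: similarity is taken to be the cosine similarity between two trees' flattened validation confusion matrices, selection is a greedy pass from the most to the least accurate tree, and a tree's vote is weighted by its validation precision for the class it predicts. All function names and the `max_similarity` threshold are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def cosine_similarity(matrix_a, matrix_b):
    """Assumed similarity measure: cosine of the flattened matrices."""
    a = [v for row in matrix_a for v in row]
    b = [v for row in matrix_b for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_trees(matrices, max_similarity=0.98):
    """Greedy selection: visit trees from most to least accurate and keep
    a tree only if it is not too similar to any tree already kept."""
    order = sorted(range(len(matrices)),
                   key=lambda i: accuracy(matrices[i]), reverse=True)
    kept = []
    for i in order:
        if all(cosine_similarity(matrices[i], matrices[j]) < max_similarity
               for j in kept):
            kept.append(i)
    return kept


def class_precision(matrix, labels):
    """Per-label precision: how often a vote for that label was correct."""
    precision = {}
    for c, label in enumerate(labels):
        predicted = sum(row[c] for row in matrix)
        precision[label] = matrix[c][c] / predicted if predicted else 0.0
    return precision


def weighted_vote(votes, precisions):
    """votes[k] is tree k's predicted label; precisions[k] maps labels to
    that tree's precision. Each vote counts with the precision of the
    label it names, so the weight changes with the prediction."""
    scores = defaultdict(float)
    for vote, precision in zip(votes, precisions):
        scores[vote] += precision[vote]
    return max(scores, key=scores.get)


# Toy validation data: three trees, the second duplicates the first.
labels = ["a", "b"]
y_val = ["a", "a", "b", "b"]
tree_preds = [
    ["a", "a", "b", "b"],
    ["a", "a", "b", "b"],
    ["a", "b", "b", "b"],
]
matrices = [confusion_matrix(y_val, p, labels) for p in tree_preds]
kept = select_trees(matrices)  # [0, 2]: the duplicate tree is dropped
precisions = [class_precision(matrices[i], labels) for i in kept]
print(kept, weighted_vote(["a", "b"], precisions))  # prints: [0, 2] a
```

With these weights, one reliable tree can outvote several unreliable ones, which is the effect the abstract attributes to weighted voting; a plain majority vote would treat all trees equally.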

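A random forest algorithm based on similarity measure and dynamic weighted voting

ZHAO Shu-xu (赵庶旭), MA Qin-jing (马秦靖), LIU Li-jiao (刘李姣)

（School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou, Gansu 730070）

Abstract： The random forest model is easy to understand and broadly applicable, and it is commonly used for problems such as classification and prediction. However, it relies on non-selective ensembling and a simple majority-rule vote to determine the final result, ignoring the differences in strength among the decision trees in the model and thereby lowering its prediction accuracy. To address this defect, an improved random forest model based on the confusion matrix (Random forest model based on confusion matrix, CM-RF) is proposed. While the model is being built, decision tree clusters are formed selectively through a similarity measure, and in the final voting stage a dynamic weighted voting fusion method produces the output. Experimental results show that the method reduces the influence of low-performance decision trees on the output and thus improves the accuracy and generalization ability of the random forest model.

Key words： random forest; confusion matrix; similarity measure; dynamic weighted voting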

Citation format： ZHAO Shu-xu, MA Qin-jing, LIU Li-jiao. A random forest algorithm based on similarity measure and dynamic weighted voting. Journal of Measurement Science and Instrumentation, 2019, 10（3）： 277-284. ［doi： 10.3969/j.issn.1674-8042.2019.03.011］
