[资料]

GBDT算法原理和模型训练

2019-1-23 14:38:58 4295 GBDT 决策树

0 算法原理再讲GBDT之前先给大家讲个故事，有一个年轻的阿姨今年50岁，现在我们不知道她的真实年龄，我们想通过他的皮肤、穿着打扮、头发颜色、言行举止、面部特征来推测她的真实年龄，假如我们根据这些输入特征首先猜测她今年40岁，然后计算拟合残差为50-40=10，与真实年龄偏差了10岁。这时我们还是以他的皮肤、穿着打扮、头发颜色、言行举止、面部特征作为输入，以拟合残差10作为我们要预是这位年轻阿姨的真实年龄，这就是GBDT算法的原理。 GBDT（Gradient Boosting Decision Tree）梯度提升决策树算法，其核心思想其实是一种梯度下降的近似算法，利用损失函数（拟合残差）的负梯度在当测的值，我们再次推测偏差了6岁，这时重新计算拟合残差10-6=4，这时我们进行第三次推测假设偏差为2，重新计算拟合残差4-2=2，最后再次推测偏差为2，重新计算拟合残差2-2=0，这时停止推测。我们把这四次推测加和，40+6+2+2=50，刚好就前模型的值作为回归树算法残差的近似值，递归的拟合残差值，直到达到迭代步数或者残差趋于零。那么到底这个梯度体现在哪里呢？对于回归树而言，我们通常使用的损失函数是平方损失，即L(y,y1) = 1/2(y-y1)(y-y1)，其中y为实际值，y1为预测值。所以我们希望最小化损失函数，即让预测值和实际值尽量接近。其实求梯度就是求偏导，如果让损失函数对y求偏导，@l = y-y1，所以我们需要每次拟合残差值（也就是负梯度值）。从损失函数的角度看，GBDT的损失函数是平方损失（即所有样本点的残差平方和）：L(y,y1) = 1/2(y-y1)(y-y1)，其中y取值1或-1（代表二分类的类别标签），这也是GBDT可以用来解决分类问题的原因。模型训练代码地址 https://github.com/qianshuang/ml-exp def train(): print("start training...") # 处理训练数据 train_feature,train_target = process_file(train_dir, word_to_id, cat_to_id) # 模型训练 model.fit(train_feature, train_target) def test(): print("start testing...") # 处理测试数据 test_feature, test_target = process_file(test_dir,word_to_id, cat_to_id) # test_predict = model.predict(test_feature) # 返回预测类别 test_predict_proba = model.predict_proba(test_feature) # 返回属于各个类别的概率 test_predict = np.argmax(test_predict_proba,1) # 返回概率最大的类别标签 # accuracy true_false = (test_predict == test_target) accuracy = np.count_nonzero(true_false) / float(len(test_target)) print() print("accuracy is %f" % accuracy) # precision recall f1-score print() print(metrics.classification_report(test_target, test_predict, target_names=categories)) # 混淆矩阵 print("Confusion Matrix...") print(metrics.confusion_matrix(test_target, test_predict)) if not os.path.exists(vocab_dir): # 构建词典表 build_vocab(train_dir, vocab_dir) categories, cat_to_id = read_category() words, word_to_id = read_vocab(vocab_dir) # kNN # model = neighbors.KNeighborsClassifier() # decision tree # model = tree.DecisionTreeClassifier() # random forest # model = ensemble.RandomForestClassifier(n_estimators=10) # n_estimators为基决策树的数量，一般越大效果越好直至趋于收敛 # AdaBoost model = ensemble.AdaBoostClassifier(learning_rate=1.0) # learning_rate的作用是收缩基学习器的权重贡献值 # GBDT model = ensemble.GradientBoostingClassifier(n_estimators=10) train() test() 运行结果： read_category... read_vocab... start training... start testing... accuracy is 0.923000 precision recall f1-score support 财经 0.97 0.94 0.96 115 房产 0.94 0.98 0.96 104 家居 0.83 0.84 0.84 89 教育 0.90 0.84 0.87 104 体育 0.98 0.97 0.98 116 娱乐 0.96 0.96 0.96 89 时政 0.78 0.88 0.83 94 游戏 0.95 0.94 0.95 104 时尚 0.94 0.89 0.92 91 科技 0.96 0.97 0.96 94 avg / total 0.92 0.92 0.92 1000 Confusion Matrix... [[108 1 1 1 0 0 4 0 0 0] [ 0 102 0 1 0 0 1 0 0 0] [ 0 4 75 3 0 1 3 2 0 1] [ 0 0 6 87 1 0 9 1 0 0] [ 0 0 1 0 113 0 0 0 2 0] [ 0 0 2 0 0 85 1 0 1 0] [ 3 1 1 4 1 0 83 1 0 0] [ 0 0 0 1 0 0 4 98 1 0] [ 0 0 2 0 0 3 1 1 81 3] [ 0 0 2 0 0 0 0 0 1 91]] 社群 QQ交流群微信公众号了解更多干货文章，可以关注小程序八斗问答 0
举报淘帖0 只看该作者相关推荐 • 【大语言模型：原理与工程实践】大语言模型的预训练 605 • 训练好的ai模型导入cubemx不成功怎么解决？ 468 • 探索一种降低ViT模型训练成本的方法 978 • 在Ubuntu上使用Nvidia GPU训练模型 2078 • 人工智能基本概念机器学习算法 2303 • 用S3C2440训练神经网络算法 1564 • 教程：如何使用FPGA加速广告推荐算法 2862 • 医疗模型人训练系统是什么？ 1832 • 算法原理与模型训练 3336 • Pytorch模型训练实用PDF教程【中文】 4545 1条评论发表评论只看该作者