算法原理与模型训练

算法原理

再讲GBDT之前先给大家讲个故事，有一个年轻的阿姨今年50岁，现在我们不知道她的真实年龄，我们想通过他的皮肤、穿着打扮、头发颜色、言行举止、面部特征来推测她的真实年龄，假如我们根据这些输入特征首先猜测她今年40岁，然后计算拟合残差为50-40=10，与真实年龄偏差了10岁。这时我们还是以他的皮肤、穿着打扮、头发颜色、言行举止、面部特征作为输入，以拟合残差10作为我们要预是这位年轻阿姨的真实年龄，这就是GBDT算法的原理。
GBDT（Gradient Boosting Decision Tree）梯度提升决策树算法，其核心思想其实是一种梯度下降的近似算法，利用损失函数（拟合残差）的负梯度在当测的值，我们再次推测偏差了6岁，这时重新计算拟合残差10-6=4，这时我们进行第三次推测假设偏差为2，重新计算拟合残差4-2=2，最后再次推测偏差为2，重新计算拟合残差2-2=0，这时停止推测。我们把这四次推测加和，40+6+2+2=50，刚好就前模型的值作为回归树算法残差的近似值，递归的拟合残差值，直到达到迭代步数或者残差趋于零。

那么到底这个梯度体现在哪里呢？对于回归树而言，我们通常使用的损失函数是平方损失，即L(y,y1)
= 1/2(y-y1)*(y-y1)，其中y为实际值，y1为预测值。所以我们希望最小化损失函数，即让预测值和实际值尽量接近。其实求梯度就是求偏导，如果让损失函数对y求偏导，@l =
y-y1，所以我们需要每次拟合残差值（也就是负梯度值）。
从损失函数的角度看，GBDT的损失函数是平方损失（即所有样本点的残差平方和）：L(y,y1)
= 1/2(y-y1)*(y-y1)，其中y取值1或-1（代表二分类的类别标签），这也是GBDT可以用来解决分类问题的原因。

模型训练

代码地址 https://github.com/qianshuang/ml-exp

def train():
print("start training...")
# 处理训练数据
train_feature,train_target = process_file(train_dir, word_to_id, cat_to_id)
# 模型训练
model.fit(train_feature, train_target)

def test():
print("start testing...")
# 处理测试数据
test_feature, test_target = process_file(test_dir,word_to_id, cat_to_id)
# test_predict = model.predict(test_feature)  # 返回预测类别
test_predict_proba = model.predict_proba(test_feature) # 返回属于各个类别的概率
test_predict = np.argmax(test_predict_proba,1)  # 返回概率最大的类别标签

# accuracy
true_false = (test_predict == test_target)
accuracy = np.count_nonzero(true_false) / float(len(test_target))
print()
print("accuracy is %f" % accuracy)

# precision recall  f1-score
print()
print(metrics.classification_report(test_target, test_predict, target_names=categories))

# 混淆矩阵
print("Confusion Matrix...")
print(metrics.confusion_matrix(test_target, test_predict))

if not  os.path.exists(vocab_dir):
# 构建词典表
build_vocab(train_dir, vocab_dir)

categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_dir)

# kNN
# model = neighbors.KNeighborsClassifier()
# decision tree
# model = tree.DecisionTreeClassifier()
# random forest
# model = ensemble.RandomForestClassifier(n_estimators=10) # n_estimators为基决策树的数量，一般越大效果越好直至趋于收敛
# AdaBoost
model = ensemble.AdaBoostClassifier(learning_rate=1.0)  # learning_rate的作用是收缩基学习器的权重贡献值
# GBDT
model = ensemble.GradientBoostingClassifier(n_estimators=10)

train()
test()

运行结果：
read_category...
read_vocab...
start training...
start testing...

accuracy is 0.923000

      precision recall f1-score support

财经       0.97    0.94    0.96    115
房产       0.94    0.98    0.96    104
家居       0.83    0.84    0.84       89
教育       0.90    0.84    0.87    104
体育       0.98    0.97    0.98    116
娱乐       0.96    0.96    0.96       89
时政       0.78    0.88    0.83       94
游戏       0.95    0.94    0.95    104
时尚       0.94    0.89    0.92       91
科技       0.96    0.97    0.96       94

avg / total    0.92    0.92    0.92    1000

Confusion Matrix...
[[108 1 1 1 0 0 4 0 0 0]
[  0 102 0 1 0 0 1 0 0 0]
[  0 4  75 3 0 1 3 2 0 1]
[  0 0 6  87 1 0 9 1 0 0]
[  0 0 1 0 113 0 0 0 2 0]
[  0 0 2 0 0  85 1 0 1 0]
[  3 1 1 4 1 0  83 1 0 0]
[  0 0 0 1 0 0 4  98 1 0]
[  0 0 2 0 0 3 1 1  81 3]
[  0 0 2 0 0 0 0 0 1  91]]

社群

QQ交流群