Naive_Bayes

Data description

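Naive Bayes is widely used for text classification tasks, including categorizing online news and filtering spam email. This article uses the classic 20 Newsgroups text collection as its experimental data.
Fetching the data: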

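# Import the news data fetcher fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
# Unlike the pre-stored datasets used earlier, fetch_20newsgroups downloads the data from the internet on demand
news = fetch_20newsgroups(subset='all')
# Check the size and details of the data
print(len(news.data))
print(news.data[0])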

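Output:

The dataset contains 18,846 news posts in total.
These texts come with neither predefined features nor any numeric representation, so they need further processing before being handed to a Naive Bayes classifier for training.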
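One common form of this processing is bag-of-words counting, where each document becomes a vector of word counts. The sketch below illustrates the idea on a hypothetical three-sentence corpus (the sentences and variable names are assumptions for illustration, not part of the 20 Newsgroups data).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the 20 Newsgroups data
docs = ["the cat sat", "the dog sat", "the cat ate the fish"]

toy_vec = CountVectorizer()
counts = toy_vec.fit_transform(docs)  # sparse matrix, one row per document

# Recover the vocabulary in column order (column index -> word)
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
dense = counts.toarray()

print(vocab)  # ['ate', 'cat', 'dog', 'fish', 'sat', 'the']
print(dense)  # each row holds the word counts of one document
```

Each column corresponds to one vocabulary word, so the third document has a count of 2 in the "the" column.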

Preparing the training and test data

# Import train_test_split from sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# Randomly sample 25% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)
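To see what test_size and random_state do without downloading anything, the split can be tried on a small stand-in dataset. This is a minimal sketch; the eight placeholder documents and their labels are made up for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 8 placeholder documents with binary labels
samples = ["doc%d" % i for i in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# test_size=0.25 holds out a quarter of the samples for testing
tr_x, te_x, tr_y, te_y = train_test_split(
    samples, labels, test_size=0.25, random_state=33)

# The same random_state reproduces exactly the same split
tr_x2, te_x2, tr_y2, te_y2 = train_test_split(
    samples, labels, test_size=0.25, random_state=33)

print(len(tr_x), len(te_x))  # 6 2
print(te_x == te_x2)         # True
```

Fixing random_state makes the experiment repeatable, which matters when comparing the accuracy of different models on the same split.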

Predicting categories with Naive Bayes

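# Import the module that converts text into feature vectors from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# Learn the vocabulary from the training set only, then reuse it for the test set
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)
# Import the Naive Bayes model from sklearn.naive_bayes
from sklearn.naive_bayes import MultinomialNB
# Initialize the Naive Bayes model with its default configuration
mnb = MultinomialNB()
# Estimate the model parameters from the training data
mnb.fit(X_train, y_train)
# Predict the categories of the test samples and store the result in y_predict
y_predict = mnb.predict(X_test)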
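The vectorize-then-fit steps above can also be chained with scikit-learn's make_pipeline, which guarantees that the test text is transformed with the vocabulary learned during training. The following is only a sketch on a hypothetical four-message corpus (the messages and the spam/ham labels are assumptions for illustration), not the author's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: label 1 = spam-like, label 0 = ordinary message
train_docs = ["free money now", "win free prize",
              "meeting at noon", "lunch meeting today"]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(train_docs, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
toy_pred = toy_model.predict(["free prize money", "meeting today"])
print(list(toy_pred))  # [1, 0]
```

Because the pipeline accepts raw strings, there is no separate transform step on the test data to forget.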

Performance evaluation

# Import classification_report from sklearn.metrics for a detailed report of classification performance
from sklearn.metrics import classification_report
# Overall accuracy on the test set
print('The accuracy of Naive Bayes Classifier is', mnb.score(X_test, y_test))
# Per-class precision, recall and F1 score
print(classification_report(y_test, y_predict, target_names=news.target_names))

Output:

[Figure: classification report for the 20 newsgroup categories]

The output shows a classification accuracy of about 83.977%.
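Accuracy is simply the fraction of test samples whose predicted label matches the true label. The sketch below checks this on hypothetical labels (the two lists are invented for illustration) by comparing a manual count with scikit-learn's accuracy_score.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels for six samples in three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Manual accuracy: number of matches divided by number of samples
manual_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
acc = accuracy_score(y_true, y_pred)

print(manual_acc, acc)  # both equal 4/6
# Per-class precision, recall and F1 give a finer picture than accuracy alone
print(classification_report(y_true, y_pred))
```

For a classifier, the score method used above reports this same mean accuracy, while classification_report breaks performance down per class.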

Analysis

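Naive Bayes models are widely used for classifying large volumes of internet text. Their strong assumption that features are conditionally independent given the class reduces the number of parameters to estimate from exponential to linear in the number of features, which greatly saves memory and computation time.
The same strong assumption has a cost: training cannot take relationships between features into account, so the model performs poorly on classification tasks where the features are strongly correlated.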
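The drop from exponential to linear can be made concrete with a small count. The sketch below assumes binary features (an assumption chosen to keep the arithmetic simple) and counts only the class-conditional parameters, ignoring the class priors; the function names are hypothetical.

```python
def full_joint_params(n_features, n_classes):
    # Without the independence assumption, each class needs a probability
    # for every one of the 2**n_features feature combinations
    # (minus one, because the probabilities must sum to 1).
    return n_classes * (2 ** n_features - 1)

def naive_bayes_params(n_features, n_classes):
    # With conditional independence, each class needs only one
    # probability per binary feature.
    return n_classes * n_features

for d in (10, 20, 30):
    print(d, full_joint_params(d, 20), naive_bayes_params(d, 20))
```

With 20 classes and only 10 binary features, the full joint model already needs 20,460 parameters against 200 for Naive Bayes, and the gap doubles with every additional feature.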