Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Author: AI研習社-譯站

2020-09-30 10:04

Lead: BERT performs slightly better than the earlier models and recognizes somewhat more Tech news than they do.

Bilingual source (subtitle team): NLP之文本分類：「Tf-Idf、Word2Vec和BERT」三種模型比較

English original: Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Translation: 雷鋒字幕組 (關山, wiige)

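Summary

In this article, I will use NLP and Python to explain 3 different strategies for multiclass text classification: the old-fashioned Bag-of-Words (with Tf-Idf), the well-known Word Embedding (with Word2Vec), and the state-of-the-art Language Models (with BERT).

NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text data according to its content.

There are different techniques to extract information from raw text data and use it to train a classification model. This tutorial compares the traditional Bag-of-Words approach (used with a simple machine learning algorithm), the popular Word Embedding model (used with a deep learning neural network), and the state-of-the-art Language Models (used with transfer learning from attention-based transformers), which have completely reshaped the NLP landscape.

I will present some useful Python code that can be easily applied to other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example (link to the full code below).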

mdipietro09/DataScience_ArtificialIntelligence_Utils

I will use the "News category dataset", which provides news headlines from 2012 to 2018 obtained from HuffPost. The task is to assign each headline to the right category, which makes this a multiclass classification problem (link to the dataset below).

News Category Dataset

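In particular, I will go through:

Setup: import packages, read data, preprocessing, partitioning.
Bag-of-Words: feature engineering, feature selection and machine learning with scikit-learn, testing and evaluation, explainability with lime.
Word Embedding: fitting a Word2Vec with gensim, feature engineering and deep learning with tensorflow/keras, testing and evaluation, explainability with the Attention mechanism.
Language Models: feature engineering with transformers, transfer learning from a pre-trained BERT with transformers and tensorflow/keras, testing and evaluation.

Setup

First of all, we need to import the following libraries: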

## for data
import json
import re
import pandas as pd
import numpy as np
## for plotting
import matplotlib.pyplot as plt
import seaborn as sns
## for text preprocessing
import nltk
## for bag-of-words
from sklearn import feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics
## for explainer
from lime import lime_text
## for word embedding
import gensim
import gensim.downloader as gensim_api
## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
## for bert language model
import transformers

The dataset is contained in a JSON file, so we first read it into a list of dictionaries with json and then convert it into a pandas DataFrame.

lst_dics = []
with open('data.json', mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic) )
## print the first one
lst_dics[0]

The original dataset contains over 30 categories, but for the purposes of this tutorial I will work with a subset of 3: Entertainment, Politics, and Tech.

## create dtf
dtf = pd.DataFrame(lst_dics)
## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH']) ][["category","headline"]]
## rename columns
dtf = dtf.rename(columns={"category":"y", "headline":"text"})
## print 5 random rows
dtf.sample(5)

The class distribution (plotted in the original post) shows that the dataset is imbalanced: the share of Tech news is very small compared with the other categories, which will make it hard for the models to recognize Tech news.
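A quick way to check the class balance yourself is to look at the share of each label. The sketch below uses a tiny made-up DataFrame (`dtf_demo`, a hypothetical stand-in for the article's `dtf` with the same "y" and "text" columns), so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the article's `dtf` (columns "y" and "text").
dtf_demo = pd.DataFrame({
    "y": ["POLITICS"] * 6 + ["ENTERTAINMENT"] * 3 + ["TECH"],
    "text": ["headline"] * 10,
})

# Share of each class, largest first.
class_share = dtf_demo["y"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same `value_counts(normalize=True)` call on `dtf["y"]` should make the small Tech share visible without a plot.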

Before explaining and building the models, I am going to give an example of preprocessing: cleaning the text, removing stop words, and applying lemmatization. We will write a function and apply it to the whole dataset.

## needed by this function
import re
import nltk

def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to preprocess
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in
                    lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text

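The function removes a given set of words from the corpus, if one is provided. With nltk we can create a list of generic stop words for the English vocabulary (we could edit this list by adding or removing words).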

lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords

Now I will apply the function to the whole dataset and store the result in a new column named "text_clean", so that you can choose to work with the raw corpus or the preprocessed text.

dtf["text_clean"] = dtf["text"].apply(lambda x:
    utils_preprocess_text(x, flg_stemm=False, flg_lemm=True,
    lst_stopwords=lst_stopwords))
dtf.head()

If you are interested in a deeper text analysis and preprocessing, the original post links to a separate article on that topic. I am going to split the dataset into a training set (70%) and a test set (30%) in order to evaluate the models' performance.

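## split dataset
dtf_train, dtf_test = model_selection.train_test_split(dtf, test_size=0.3)
## get target
y_train = dtf_train["y"].values
y_test = dtf_test["y"].values

Let's get started!

Bag-of-Words

The Bag-of-Words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document. In other words, each word in the vocabulary becomes a feature, and a document is represented by a vector with the same length as the vocabulary (a "bag of words"). For instance, take 3 sentences and represent them with this approach:

[Figure] Shape of the feature matrix: number of documents x length of the vocabulary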
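To make the documents x vocabulary shape concrete, here is a minimal sketch with scikit-learn's `CountVectorizer`. The three sentences are made up for illustration and are not necessarily the ones in the original figure:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentences only (not taken from the dataset).
docs = ["I like this article", "I like this dog", "this dog is nice"]

bow = CountVectorizer()
X_bow = bow.fit_transform(docs)   # sparse matrix: documents x vocabulary
vocab = sorted(bow.vocabulary_)   # feature names in column order

print(vocab)
print(X_bow.toarray())
```

Note that with the default settings the vectorizer lowercases the text and drops one-character tokens such as "I", so the vocabulary here has six entries and the matrix is 3 x 6.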

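As you can imagine, this approach causes a serious dimensionality problem: the more documents you have, the larger the vocabulary, so the feature matrix will be a huge sparse matrix. Therefore, the Bag-of-Words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality problem.

Term frequency is not necessarily the best representation of a text. In fact, some common words occur very frequently in the corpus yet have little predictive power for the target variable. To address this, there is an advanced variant of Bag-of-Words that uses term frequency-inverse document frequency (Tf-Idf) instead of simple counts. Basically, the value of a word increases proportionally to its count in a document, but it is inversely proportional to the word's frequency across the corpus.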
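The effect of the inverse document frequency can be seen on a toy corpus. In this sketch (made-up headlines, default `TfidfVectorizer` settings), the word "new" appears in every document, so within the first document it should receive a lower weight than "phone", which appears only there:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines: "new" occurs in all of them, the other words in one each.
docs = ["new phone launch", "new movie release", "new senate vote"]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Tf-Idf weights of the words present in the first document.
row0 = X_tfidf[0].toarray().ravel()
weights = {word: row0[idx] for word, idx in tfidf.vocabulary_.items()
           if row0[idx] > 0}
print(weights)
```

All three words occur once in the first document, so the difference in weight comes only from how many documents contain each word.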

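Let's start with feature engineering, the process of creating features by extracting information from the data. I am going to use the Tf-Idf vectorizer with a limit of 10,000 words (so the length of the vocabulary will be 10k), capturing unigrams (i.e. "new" and "york") and bigrams (i.e. "new york"). The code for the classic count vectorizer is shown as well:

## Count (classic BoW); the start of this line is truncated in the source, so the CountVectorizer call is a reconstruction
vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2))
## Tf-Idf (advanced variant of BoW)
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))

Now I will use the vectorizer on the preprocessed corpus of the training set to extract a vocabulary and create the feature matrix.

corpus = dtf_train["text_clean"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The feature matrix X_train has a shape of 34,265 (number of documents in training) x 10,000 (length of the vocabulary) and it is pretty sparse: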

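sns.heatmap(X_train.todense()[:,np.random.randint(0,X_train.shape[1],100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')

Random sample from the feature matrix (non-zero values in black)

To find the position of a certain word, we can look it up in the vocabulary:

word = "new york"
dic_vocabulary[word]

If the word exists in the vocabulary, this command prints a number N, meaning that the Nth feature of the matrix is that word.

In order to drop some columns and reduce the dimensionality of the matrix, we can carry out some Feature Selection, the process of selecting a subset of relevant variables. I will proceed as follows:

treat each category as binary (for example, for the "Tech" category a headline is 1 if it is Tech news and 0 otherwise);
perform a Chi-Square test to determine whether a feature and the (binary) target are independent;
keep only the features with a certain p-value from the Chi-Square test.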
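The three steps above can be sketched on a toy corpus. The texts and labels below are made up; the point is only that a word concentrated in one class gets a smaller Chi-Square p-value than a word seen once:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Made-up mini corpus with two classes.
texts = ["senate vote bill", "senate election vote",
         "movie star award", "movie premiere star"]
labels = np.array(["POLITICS", "POLITICS", "ENTERTAINMENT", "ENTERTAINMENT"])

vec = CountVectorizer()
X = vec.fit_transform(texts)
# Feature names in column order, read from the fitted vocabulary.
names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Step 1: one-vs-rest binary target. Step 2: Chi-Square test per feature.
is_politics = labels == "POLITICS"
scores, p_values = chi2(X, is_politics)
p_by_word = dict(zip(names, p_values))

# Step 3: rank features by p-value (smaller means stronger association).
for word, p in sorted(p_by_word.items(), key=lambda t: t[1]):
    print(word, round(p, 3))
```

In the article's code the threshold is applied to `1 - p`, so a limit of 0.95 corresponds to keeping features with p < 0.05.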

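from sklearn import feature_selection

y = dtf_train["y"]
## get_feature_names_out() is the name in recent scikit-learn; older versions used get_feature_names()
X_names = vectorizer.get_feature_names_out()
p_value_limit = 0.95
dtf_features = pd.DataFrame()
for cat in np.unique(y):
    chi2, p = feature_selection.chi2(X_train, y==cat)
    ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    dtf_features = pd.concat([dtf_features, pd.DataFrame(
                   {"feature":X_names, "score":1-p, "y":cat})])
    dtf_features = dtf_features.sort_values(["y","score"],
                    ascending=[True,False])
    dtf_features = dtf_features[dtf_features["score"]>p_value_limit]
X_names = dtf_features["feature"].unique().tolist()

I reduced the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Let's print some: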

for cat in np.unique(y):
   print("# {}:".format(cat))
   print("  . selected features:",
         len(dtf_features[dtf_features["y"]==cat]))
   print("  . top features:", ",".join(
dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))
   print(" ")

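We can refit the vectorizer on the corpus by giving this new set of words as input. That will produce a smaller feature matrix and a shorter vocabulary.

vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The new feature matrix X_train has a shape of 34,265 (number of documents in training) x 3,152 (length of the given vocabulary). Let's see whether the matrix is less sparse:

Random sample from the new feature matrix (non-zero values in black)

It's time to train a machine learning model and test it. I recommend using a Naive Bayes algorithm: a probabilistic classifier that uses Bayes' theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related. This algorithm is well suited to such a large dataset, as it considers each feature independently, calculates the probability of each category, and then predicts the category with the highest probability.

classifier = naive_bayes.MultinomialNB()

I am going to train this classifier on the feature matrix and then test it on the transformed test set. To that end, I need a scikit-learn pipeline: a sequential application of a list of transformations followed by a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline lets us transform and predict the test data in a single step.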
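As a small self-contained illustration of the pipeline idea, the sketch below fits a Tf-Idf vectorizer and a Multinomial Naive Bayes classifier on four made-up headlines and then predicts a new raw string. The texts, labels, and the `toy_model` name are assumptions for the example, not part of the article's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training headlines.
train_texts = ["senate passes bill", "president signs bill",
               "actor wins award", "new movie premiere"]
train_labels = ["POLITICS", "POLITICS", "ENTERTAINMENT", "ENTERTAINMENT"]

# The pipeline applies the vectorizer first, then the classifier.
toy_model = Pipeline([("vectorizer", TfidfVectorizer()),
                      ("classifier", MultinomialNB())])
toy_model.fit(train_texts, train_labels)

# Raw text goes in; the vectorizer step transforms it before prediction.
pred = toy_model.predict(["senate bill vote"])
proba = toy_model.predict_proba(["senate bill vote"])
print(pred[0], proba.round(2), toy_model.classes_)
```

The benefit is that test data never has to be vectorized by hand: calling `predict` on raw strings reuses the vocabulary learned during training.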

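## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer),
                           ("classifier", classifier)])
## train classifier
model["classifier"].fit(X_train, y_train)
## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)

We can now evaluate the performance of the Bag-of-Words model with the following metrics:

Accuracy: the fraction of predictions the model got right.
Confusion Matrix: a summary table that breaks down the number of correct and incorrect predictions by class.
ROC: a plot of the true positive rate against the false positive rate at various threshold settings. The area under the curve (AUC) is the probability that the classifier ranks a randomly chosen positive observation higher than a randomly chosen negative one.
Precision: the fraction of retrieved instances that are relevant, TP / (TP + FP).
Recall: the fraction of relevant instances that are retrieved, TP / (TP + FN).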
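Precision and recall are easy to verify by hand on a small example. The labels below are invented to mimic an imbalanced case where "TECH" is the rare class; the counts are compared against scikit-learn's own functions:

```python
from sklearn import metrics

# Invented labels: 3 TECH and 5 POLITICS headlines.
y_true = ["TECH", "TECH", "TECH", "POLITICS", "POLITICS",
          "POLITICS", "POLITICS", "POLITICS"]
y_pred = ["TECH", "POLITICS", "POLITICS", "POLITICS", "POLITICS",
          "POLITICS", "POLITICS", "TECH"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "TECH" and p == "TECH" for t, p in pairs)  # correctly flagged TECH
fp = sum(t != "TECH" and p == "TECH" for t, p in pairs)  # wrongly flagged TECH
fn = sum(t == "TECH" and p != "TECH" for t, p in pairs)  # missed TECH

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = metrics.accuracy_score(y_true, y_pred)
print(precision, recall, accuracy)
```

Here the accuracy looks acceptable while the recall on the rare class is poor, which is the pattern to watch for with the Tech category.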

from sklearn import metrics

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
## the keyword is truncated in the source ("multi_"); multi_class="ovr" is an assumed value
auc = metrics.roc_auc_score(y_test, predicted_prob,
                            multi_class="ovr")
print("Accuracy:",  round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues,
            cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes,
       yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i],
                           predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3,
              label='{0} (area={1:0.2f})'.format(classes[i],
                              metrics.auc(fpr, tpr))
               )
## the last argument is truncated in the source ("line"); linestyle='--' is an assumed value
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05],
          xlabel='False Positive Rate',
          ylabel="True Positive Rate (Recall)",
          title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(
                 y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3,
               label='{0} (area={1:0.2f})'.format(classes[i],
                                  metrics.auc(recall, precision))
              )
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall',
          ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

The BoW model got 85% of the test set right (accuracy is 0.85), but it struggles to recognize Tech news (only 252 predicted correctly).

Let's try to understand why the model assigns news to a certain category and assess the explainability of these predictions. The lime package can help us build an explainer. To illustrate, I will take an observation from the test set and see what the model predicts and why.

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))
## show explanation
explainer = lime_text.LimeTextExplainer(class_names=
             np.unique(y_train))
explained = explainer.explain_instance(txt_instance,
             model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)

That makes sense: the words "Clinton" and "GOP" pointed the model in the right direction (Politics news), even though the word "Stage" is more common among Entertainment news.

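Word Embedding

Word Embedding is the collective name for feature learning techniques in which words from the vocabulary are mapped to vectors of real numbers. These vectors are calculated from the probability distribution of each word appearing before or after another. To put it another way, words that share the same context usually appear together in the corpus, so they will be close in the vector space as well. For instance, take the 3 sentences from the previous example:

Word embeddings in a 2D vector space

In this tutorial, I am going to use the first model of this family: Google's Word2Vec (2013). Other popular word embedding models are Stanford's GloVe (2014) and Facebook's FastText (2016).

Word2Vec produces a vector space, typically of several hundred dimensions, with each unique word in the corpus placed so that words sharing common contexts in the corpus are located close to one another in the space. That can be done with 2 different approaches: starting from a single word to predict its context (Skip-gram), or starting from the context to predict a word (Continuous Bag-of-Words).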
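To see what "predicting the context from a word" means in terms of training data, the sketch below generates the (center word, context word) pairs that a Skip-gram model would learn from. The helper `skipgram_pairs` is a hypothetical illustration, not gensim's internal code; real implementations add refinements such as negative sampling:

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["new", "york", "city", "mayor"], window=1)
print(pairs)
```

Continuous Bag-of-Words uses the same windows in the opposite direction: the context words are the input and the center word is the target.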

In Python, you can load a pre-trained Word Embedding model from gensim-data like this:

nlp = gensim_api.load("word2vec-google-news-300")

Instead of using a pre-trained model, I am going to fit my own Word2Vec on the training corpus with gensim. Before fitting the model, the corpus needs to be transformed into a list of lists of n-grams. In this particular case, I will try to capture unigrams ("york"), bigrams ("new york"), and trigrams ("new york city").

corpus = dtf_train["text_clean"]
## create list of lists of unigrams
lst_corpus = []
for string in corpus:
   lst_words = string.split()
   lst_grams = [" ".join(lst_words[i:i+1])
               for i in range(0, len(lst_words), 1)]
   lst_corpus.append(lst_grams)
## detect bigrams and trigrams
## (gensim 3.x API; gensim 4 expects delimiter=" " as a str)
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus,
                 delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus],
            delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

When fitting the Word2Vec, you need to specify:

the target size of the word vectors, set here to 300;
the window, i.e. the maximum distance between the current and predicted word within a sentence, set here to the mean length of the texts in the corpus;
the training algorithm, here skip-grams (sg=1), as in general it gives better results.

## fit w2v
## (gensim 3.x API; gensim 4 renamed size to vector_size and iter to epochs)
nlp = gensim.models.word2vec.Word2Vec(lst_corpus, size=300,
            window=8, min_count=1, sg=1, iter=30)

We now have our embedding model, so we can pick any word from the corpus and transform it into a 300-dimensional vector.

word = "data"
## (gensim 4 requires nlp.wv[word])
nlp[word].shape

We can even use it to visualize a word and its context in a lower-dimensional space (2D or 3D) by applying a dimensionality reduction algorithm (e.g. TSNE).

word = "data"
fig = plt.figure()
## word embedding
tot_words = [word] + [tupla[0] for tupla in
                 nlp.most_similar(word, topn=20)]
X = nlp[tot_words]
## t-SNE to reduce dimensionality from 300 to 3
## (recent scikit-learn may require perplexity < number of points, 21 here; lower it if you get an error)
pca = manifold.TSNE(perplexity=40, n_components=3, init='pca')
X = pca.fit_transform(X)
## create dtf
dtf_ = pd.DataFrame(X, index=tot_words, columns=["x","y","z"])
dtf_["input"] = 0
dtf_["input"].iloc[0:1] = 1
## plot 3d
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dtf_[dtf_["input"]==0]['x'],
           dtf_[dtf_["input"]==0]['y'],
           dtf_[dtf_["input"]==0]['z'], c="black")
ax.scatter(dtf_[dtf_["input"]==1]['x'],
           dtf_[dtf_["input"]==1]['y'],
           dtf_[dtf_["input"]==1]['z'], c="red")
ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[],
       yticklabels=[], zticklabels=[])
for label, row in dtf_[["x","y","z"]].iterrows():
    x, y, z = row
    ax.text(x, y, z, s=label)

That's pretty cool, but how can word embedding be useful for predicting the news category? The word vectors can be used in a neural network as weights. This is how:

First, transform the corpus into padded sequences of word ids to get a feature matrix.
Then, create an embedding matrix so that the vector of the word with id N is located at the Nth row.
Finally, build a neural network with an embedding layer that weighs every word in the sequences with the corresponding vector.
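The link between id sequences and the embedding matrix can be shown with plain NumPy indexing. This is a toy sketch: the vocabulary, the 3-dimensional random vectors, and the names `vocab_demo` and `embeddings_demo` are made up, whereas the article uses 300-dimensional Word2Vec vectors:

```python
import numpy as np

# Toy vocabulary: word -> id (id 0 is reserved for padding).
vocab_demo = {"i": 1, "like": 2, "this": 3, "article": 4}
emb_dim = 3

# Embedding matrix: row N holds the vector of the word with id N.
rng = np.random.default_rng(0)
embeddings_demo = np.zeros((len(vocab_demo) + 1, emb_dim))
for word, idx in vocab_demo.items():
    embeddings_demo[idx] = rng.normal(size=emb_dim)

# A padded id sequence for "i like this article" with maximum length 6.
seq = np.array([1, 2, 3, 4, 0, 0])

# What an embedding layer does: use each id as a row index.
looked_up = embeddings_demo[seq]
print(looked_up.shape)
```

The result has one row per position in the sequence, and the padded positions map to the all-zero row 0.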

Let's start with feature engineering by transforming the same preprocessed corpus (list of lists of n-grams) given to the Word2Vec into a list of sequences with tensorflow/keras:

## tokenize text
tokenizer = kprocessing.text.Tokenizer(lower=True, split=' ',
                     oov_token="NaN",
                     filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(lst_corpus)
dic_vocabulary = tokenizer.word_index
## create sequence
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_train = kprocessing.sequence.pad_sequences(lst_text2seq,
                    maxlen=15, padding="post", truncating="post")

The feature matrix X_train has a shape of 34,265 x 15 (number of sequences x maximum sequence length). Let's visualize it:

sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)
plt.show()

Feature matrix (34,265 x 15)

Every text in the corpus is now an id sequence of length 15. For instance, if a text has 10 tokens, the sequence is made of 10 ids followed by 5 zeros, the padding element (while the id for a word not in the vocabulary is 1). Let's print how a text from the training set is transformed into a padded sequence of ids:
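The padding rule itself is simple enough to write by hand. The function below is a hypothetical pure-Python sketch of "post" padding and "post" truncation (the same options passed to `pad_sequences` in the article), not the Keras implementation:

```python
def pad_post(seq, maxlen, pad_id=0):
    """Cut the sequence to its first `maxlen` ids, then append `pad_id` up to `maxlen`."""
    seq = list(seq)[:maxlen]
    return seq + [pad_id] * (maxlen - len(seq))

short = pad_post([5, 8, 2], maxlen=5)      # shorter than maxlen: zeros appended
long = pad_post(range(1, 8), maxlen=5)     # longer than maxlen: tail dropped
print(short, long)
```

With a maximum length of 15, a 10-token headline therefore becomes 10 ids followed by 5 zeros, as described above.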

i = 0
## list of text: ["I like this", ...]
len_txt = len(dtf_train["text_clean"].iloc[i].split())
print("from: ", dtf_train["text_clean"].iloc[i], "| len:", len_txt)
## sequence of token ids: [[1, 2, 3], ...]
len_tokens = len(X_train[i])
print("to: ", X_train[i], "| len:", len(X_train[i]))
## vocabulary: {"I":1, "like":2, "this":3, ...}
print("check: ", dtf_train["text_clean"].iloc[i].split()[0],
      " -- idx in vocabulary -->",
      dic_vocabulary[dtf_train["text_clean"].iloc[i].split()[0]])
print("vocabulary: ", dict(list(dic_vocabulary.items())[0:5]), "... (padding element, 0)")

Before moving on, don't forget to do the same feature engineering on the test set as well:

corpus = dtf_test["text_clean"]
## create list of n-grams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
## detect common bigrams and trigrams using the fitted detectors
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## text to sequence with the fitted tokenizer
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_test = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15,
             padding="post", truncating="post")

X_test (14,697 x 15)

We now have X_train and X_test, so we need to create the embedding matrix that will be used as the weight matrix in the neural network classifier.

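## start the matrix (length of vocabulary x vector size) with all 0s
embeddings = np.zeros((len(dic_vocabulary)+1, 300))
for word,idx in dic_vocabulary.items():
    ## update the row with vector
    try:
        embeddings[idx] =  nlp[word]
    ## if word not in model then skip and the row stays all 0s
    except KeyError:
        pass

That code generates a matrix of shape 22,338 x 300 (length of the vocabulary extracted from the corpus x vector size). It can be navigated by word id, which is obtained from the vocabulary.

word = "data"
print("dic[word]:", dic_vocabulary[word], "|idx")
print("embeddings[idx]:", embeddings[dic_vocabulary[word]].shape,
      "|vector")

It's finally time to build a deep learning model. I am going to use the embedding matrix in the first Embedding layer of the neural network, which I will then train to classify the news. Each id in the input sequence is used as the index to access the embedding matrix. The output of this Embedding layer is a 2D matrix with one word vector for each word id in the input sequence (sequence length x vector size). Take the sentence "I like this article" as an example: each of its word ids selects one row of the embedding matrix.

My neural network is structured as follows:

an Embedding layer that, as described above, takes the sequences as input and the word vectors as weights.
a simple Attention layer that won't affect the predictions but captures the weights of each instance, so that it can serve as a nice explainer (it isn't necessary for the predictions, just for explainability, so you can skip it). The Attention mechanism was introduced in a 2014 paper (linked in the original post) as a solution to the limitations of sequence models (e.g. LSTM), to understand which parts of a long text are actually relevant.
two layers of Bidirectional LSTM to model the order of the words in a sequence in both directions.
two final Dense layers that predict the probability of each news category.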

## code attention layer
def attention_layer(inputs, neurons):
    x = layers.Permute((2,1))(inputs)
    x = layers.Dense(neurons, activation="softmax")(x)
    x = layers.Permute((2,1), name="attention")(x)
    x = layers.multiply([inputs, x])
    return x

## input
x_in = layers.Input(shape=(15,))
## embedding
x = layers.Embedding(input_dim=embeddings.shape[0],
                     output_dim=embeddings.shape[1],
                     weights=[embeddings],
                     input_length=15, trainable=False)(x_in)
## apply attention
x = attention_layer(x, neurons=15)
## 2 layers of bidirectional lstm
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2,
                         return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2))(x)
## final dense layers
x = layers.Dense(64, activation='relu')(x)
y_out = layers.Dense(3, activation='softmax')(x)
## compile
model = models.Model(x_in, y_out)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

Now we can train the model and check its performance on a subset of the training set used for validation, before testing it on the actual test set.

## encode y
dic_y_mapping = {n:label for n,label in
                 enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])
## train
training = model.fit(x=X_train, y=y_train, batch_size=256,
                     epochs=10, shuffle=True, verbose=0,
                     validation_split=0.3)
## plot loss and accuracy
## (named lst_metrics so that it does not shadow sklearn's metrics module)
lst_metrics = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)
ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()
ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in lst_metrics:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

Nice! In some epochs the accuracy reached 0.89. To complete the evaluation of the Word Embedding model, let's predict the test set and compare the same metrics used before (the code for the metrics is the same as before).

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in
             predicted_prob]

The model performs about as well as the previous one. In fact, it also struggles to classify Tech news.

But is it explainable as well? Yes, it is! I put an Attention layer in the neural network to extract the weights of each word and understand how much they contributed to classifying an instance. So I will try to use the Attention weights to build an explainer (similar to the one seen in the previous section):

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
### 1. preprocess input
lst_corpus = []
for string in [re.sub(r'[^\w\s]','', txt_instance.lower().strip())]:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0,
                 len(lst_words), 1)]
    lst_corpus.append(lst_grams)
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## (the source passes `corpus`, i.e. the whole test set; lst_corpus holds the single instance)
X_instance = kprocessing.sequence.pad_sequences(
              tokenizer.texts_to_sequences(lst_corpus), maxlen=15,
              padding="post", truncating="post")

### 2. get attention weights
layer = [layer for layer in model.layers if "attention" in
         layer.name][0]
func = K.function([model.input], [layer.output])
weights = func(X_instance)[0]
weights = np.mean(weights, axis=2).flatten()

### 3. rescale weights, remove null vector, map word-weight
weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)
weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx
           != 0]
dic_word_weigth = {word:weights[n] for n,word in
                   enumerate(lst_corpus[0]) if word in
                   tokenizer.word_index.keys()}

### 4. barplot
top = 10  ## number of words to plot; undefined in the source, so this value is an assumption
if len(dic_word_weigth) > 0:
   ## (named dtf_weights so that the dataset in dtf is not overwritten)
   dtf_weights = pd.DataFrame.from_dict(dic_word_weigth, orient='index',
                                columns=["score"])
   dtf_weights.sort_values(by="score",
           ascending=True).tail(top).plot(kind="barh",
           legend=False).grid(axis='x')
   plt.show()
else:
   print("--- No word recognized ---")

### 5. produce html visualization
## (the span's style attribute appears to have been stripped in the source, so words are only bolded here)
text = []
for word in lst_corpus[0]:
    weight = dic_word_weigth.get(word)
    if weight is not None:
         text.append('<b><span >' + word + '</span></b>')
    else:
         text.append(word)
text = ' '.join(text)

### 6. visualize on notebook
print("\033[1m"+"Text with highlighted words")
from IPython.core.display import display, HTML
display(HTML(text))

Just like before, the words "clinton" and "gop" activated the neurons of the model, but this time "high" and "benghazi" were also considered slightly relevant for the prediction.

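Language Models

Language Models, or Contextualized/Dynamic Word Embeddings, overcome the biggest limitation of the classic Word Embedding approach: polysemy disambiguation. In a classic embedding, a word with different meanings (e.g. "bank" or "stick") is identified by just one vector. One of the first popular ones was ELMO (2018), which doesn't apply a fixed embedding but, using a bidirectional LSTM, looks at the entire sentence and then assigns an embedding to each word.

Enter Transformers: a new modeling technique presented in Google's paper Attention is All You Need (2017), which showed that sequence models (like LSTM) can be completely replaced by Attention mechanisms, even obtaining better performance.

Google's BERT (Bidirectional Encoder Representations from Transformers, 2018) combines ELMO's context embedding with several Transformers, and it is bidirectional (a big novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence, so a word can have different vectors depending on the context. Let's try it with the input "bank river":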

txt = "bank river"
## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')
## return hidden layer with embeddings
input_ids = np.array(tokenizer.encode(txt))[None,:]
embedding = nlp(input_ids)
embedding[0][0]

If we change the input text to "bank money", the output is different (both outputs are shown in the original post).

In order to complete a text classification task, you can use BERT in 3 different ways:

train it all from scratch and use it as a classifier.
extract the word embeddings and use them in an embedding layer (as I did with Word2Vec).
fine-tune the pre-trained model (transfer learning).

I am going with the latter and do transfer learning from a pre-trained lighter version of BERT, called Distil-BERT (66 million parameters instead of 110 million).

## distil-bert tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

As usual, before fitting the model there is some feature engineering to do, but this time it is a little trickier. To illustrate, take the sentence "I like this article" again; it must be transformed into 3 vectors (Ids, Mask, Segment):

Shape: 3 x sequence length

First of all, we need to select the maximum sequence length. This time I am going to choose a much larger number (i.e. 50), because BERT splits unknown words into sub-tokens until it finds a known unigram. For example, given a made-up word like "zzdata", BERT would split it into ["z", "##z", "##data"]. Moreover, we have to insert special tokens into the input text and then generate the mask and segment vectors. Finally, we put them all together in a tensor to get the feature matrix, which has the shape of 3 (ids, masks, segments) x number of documents in the corpus x sequence length.
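The mask and segment vectors can be illustrated without loading any model. The helper below, `build_bert_inputs`, is a hypothetical pure-Python sketch that works on an already tokenized sentence; it mirrors the article's convention of incrementing the segment counter after each "[SEP]", so padding positions end up in segment 1:

```python
def build_bert_inputs(tokens, maxlen):
    """Return padded tokens, attention mask, and segment ids for one sentence."""
    # Special tokens: [CLS] at the start, [SEP] at the end.
    tokens = ["[CLS]"] + tokens[: maxlen - 2] + ["[SEP]"]
    n_pad = maxlen - len(tokens)

    # Mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(tokens) + [0] * n_pad
    padded = tokens + ["[PAD]"] * n_pad

    # Segments: the counter increases after every [SEP].
    segments, seg = [], 0
    for tok in padded:
        segments.append(seg)
        if tok == "[SEP]":
            seg += 1
    return padded, mask, segments

padded, mask, segments = build_bert_inputs(["i", "like", "this", "article"], maxlen=8)
print(padded)
print(mask)
print(segments)
```

The ids vector is then obtained by mapping each entry of `padded` to its id in the tokenizer's vocabulary.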

Please note that I am using the raw text as the corpus (so far I have been using the text_clean column).

corpus = dtf_train["text"]
maxlen = 50

## add special tokens
## (np.int was removed from recent NumPy; the built-in int is equivalent here)
maxqnans = int((maxlen-20)/2)
## (no trailing space after [SEP], so that splitting on " " does not create an empty token)
corpus_tokenized = ["[CLS] "+
             " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '',
             str(txt).lower().strip()))[:maxqnans])+
             " [SEP]" for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(
           txt.split(" "))) for txt in corpus_tokenized]

## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

## generate idx
## (convert_tokens_to_ids maps the already tokenized sequence without adding special tokens again)
idx = [tokenizer.convert_tokens_to_ids(seq.split(" ")) for seq in txt2seq]

## generate segments
segments = []
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
             i += 1
    segments.append(temp)

## feature matrix
X_train = [np.asarray(idx, dtype='int32'),
           np.asarray(masks, dtype='int32'),
           np.asarray(segments, dtype='int32')]

The feature matrix X_train has a shape of 3 x 34,265 x 50. We can check an observation from the feature matrix:

i = 0
print("txt: ", dtf_train["text"].iloc[0])
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])
print("idx: ", X_train[0][i])
print("mask: ", X_train[1][i])
print("segment: ", X_train[2][i])

You can run the same code on dtf_test["text"] to get X_test.

Now, I am going to build the deep learning model with transfer learning from the pre-trained BERT. Basically, I will summarize the output of BERT into one vector with Average Pooling and then add two final Dense layers to predict the probability of each news category.
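Average pooling here just means taking the mean over the sequence axis, which turns one vector per token into one vector per document. A minimal NumPy sketch with a made-up tensor (2 documents, 4 tokens, 3 hidden units, far smaller than BERT's real output):

```python
import numpy as np

# Made-up "BERT output": batch of 2 documents x 4 tokens x 3 hidden units.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Global average pooling over the token axis: one vector per document.
pooled = hidden.mean(axis=1)
print(pooled.shape)
print(pooled[0])
```

As far as I know, Keras' `GlobalAveragePooling1D` computes this same mean over the steps axis; when no mask is passed to it, padded positions are included in the average.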

If you want to use the original version of BERT, here is the code (remember to redo the feature engineering with the right tokenizer):

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
segments = layers.Input((50,), dtype="int32", name="input_segments")
## pre-trained bert
nlp = transformers.TFBertModel.from_pretrained("bert-base-uncased")
## ([0] selects the sequence output; the source unpacked a 2-tuple instead)
bert_out = nlp([idx, masks, segments])[0]
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks, segments], y_out)
for layer in model.layers[:4]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

As I said, I am going to use the lighter version instead, Distil-BERT:

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2,
           attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)),
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()

Finally, let's train, test, and evaluate the model (the evaluation code is the same as before).

The performance of BERT is slightly better than that of the previous models; in fact, it recognizes more Tech news than the others.

Conclusion

This article has been a tutorial to demonstrate how to apply different NLP models to a multiclass classification use case. I compared 3 popular approaches: Bag-of-Words with Tf-Idf, Word Embedding with Word2Vec, and Language Models with BERT. For each one I went through feature engineering and selection, model design and testing, evaluation and explainability, comparing the 3 models at each step (where possible).
