My TensorFlow implementation of "A Neural Conversational Model"

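Author: AI研習社

Editor: 賈智龍

2017-09-07 10:26

Summary: A chatbot based on deep learning.

雷鋒網 (Leiphone) editor's note: the original author of this article is 小灰灰, and it was originally published in a column. Leiphone has obtained the author's permission to republish it.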

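Overview

This work tries to reproduce the results of the paper A Neural Conversational Model (aka the Google chatbot).
It uses a recurrent neural network (seq2seq model) for sentence prediction. It is written in Python with TensorFlow.

The loading part of the program is modeled on the Torch project neuralconvo by macournoyer.

DeepQA currently supports the following dialogue corpora: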

Cornell Movie Dialogs corpus (default). Already included when cloning the repository.
OpenSubtitles (thanks to Eschnou). Much bigger corpus (but also noisier). To use it, follow those instructions and use the flag --corpus opensubs.
Supreme Court Conversation Data (thanks to julien-c). Available using --corpus scotus. See the instructions for installation.
Ubuntu Dialogue Corpus (thanks to julien-c). Available using --corpus ubuntu. See the instructions for installation.
Your own data (thanks to julien-c) by using a simple custom conversation format (See here for more info).
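The custom conversation format is not described in this article. Purely as an illustration, the sketch below assumes a plain-text layout with one utterance per line and a separator line between conversations. The separator string `===` and the function names `parse_conversations` and `to_pairs` are assumptions made for this example, not something the article confirms; check the repository's own instructions for the real format.

```python
from typing import List, Tuple

# Assumed conversation delimiter; hypothetical, not confirmed by the article.
SEPARATOR = "==="


def parse_conversations(text: str, separator: str = SEPARATOR) -> List[List[str]]:
    """Split raw text into conversations, each a list of utterances."""
    conversations: List[List[str]] = []
    current: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == separator:
            # A separator closes the conversation collected so far.
            if current:
                conversations.append(current)
            current = []
        elif line:
            current.append(line)
    if current:
        conversations.append(current)
    return conversations


def to_pairs(conversations: List[List[str]]) -> List[Tuple[str, str]]:
    """Turn each pair of consecutive utterances into a (question, answer) sample."""
    pairs: List[Tuple[str, str]] = []
    for conversation in conversations:
        for question, answer in zip(conversation, conversation[1:]):
            pairs.append((question, answer))
    return pairs


sample = """Hi
Hi.
What is your name ?
===
How old are you ?
thirty-five.
"""

for question, answer in to_pairs(parse_conversations(sample)):
    print("Q:", question, "| A:", answer)
```

Pairing every utterance with the one that follows it is one common way to build seq2seq training samples from a dialogue; whether the project pairs utterances exactly this way is not stated here.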

To speed up the training, it's also possible to use pre-trained word embeddings (thanks to Eschnou). More info here.

Installation

The program requires the following dependencies (easy to install using pip: pip3 install -r requirements.txt):

python 3.5
tensorflow (tested with v1.0)
numpy
CUDA (for using GPU)
nltk (natural language toolkit, used to tokenize the sentences)
tqdm (for the nice progress bars)

You may also need to download additional data for nltk to work properly.

python3 -m nltk.downloader punkt

The Cornell dataset is already included. For the other datasets, see the readme files in their respective folders (inside data/).

The web interface requires some additional packages:

django (tested with 1.10)
channels
Redis (see here)
asgi_redis (at least 1.0)

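A Docker installation is also supported. For more detailed instructions, see here.

Running

Chatbot

To train the model, simply run main.py. Once training is done, you can test the results with main.py --test
(results are written to 'save/model/samples_predictions.txt') or with main.py --test interactive (more fun).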

Here are some flags which could be useful. For more help and options, use python main.py -h:

--modelTag: gives a name to the current model so that models can be told apart when testing/training.
--keepAll: use this flag when training if, when testing, you want to see the predictions at different steps (it can be interesting to watch the program change its name and age as the training progresses). Warning: this can quickly take a lot of storage space if you don't increase the --saveEvery option.
--filterVocab 20 or --vocabularySize 30000: limit the vocabulary size to optimize performance and memory usage. Words used fewer than 20 times are replaced by a special out-of-vocabulary token, and a maximum vocabulary size is set.
--verbose: when testing, prints the sentences as they are computed.
--playDataset: shows some dialogue samples from the dataset (can be used together with --createDataset if this is the only action you want to perform).
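The vocabulary filtering described for --filterVocab and --vocabularySize can be illustrated with a small standalone sketch. This is not the project's code: the function names, the token string `<unk>`, and the choice to apply both limits in one pass are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Set

# Hypothetical placeholder for out-of-vocabulary words.
UNKNOWN_TOKEN = "<unk>"


def build_vocab(sentences: Iterable[List[str]],
                min_count: int = 20,
                max_size: int = 30000) -> Set[str]:
    """Keep words seen at least `min_count` times, capped at `max_size` words."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # most_common() orders words from most to least frequent.
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return set(kept[:max_size])


def replace_rare(sentence: List[str], vocab: Set[str]) -> List[str]:
    """Replace every word outside the vocabulary with the unknown token."""
    return [word if word in vocab else UNKNOWN_TOKEN for word in sentence]


corpus = [["hi", "there"], ["hi", "you"], ["hi", "there"]]
vocab = build_vocab(corpus, min_count=2, max_size=30000)
print(replace_rare(["hi", "you"], vocab))
```

With a toy threshold of 2, "you" appears only once in the sample corpus and is therefore replaced. A smaller vocabulary shrinks both the embedding table and the output layer, which is where the memory savings come from.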

To visualize the computational graph and the cost with TensorBoard, just run tensorboard --logdir save/.

By default, the network architecture is a standard encoder/decoder with two LSTM layers (hidden size of 256) and a vocabulary embedding size of 32. The network is trained with ADAM. The maximum sentence length is set to 10 words, but it can be increased.
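To get a feel for the size of the default architecture, the following sketch estimates the number of trainable weights in the recurrent layers from the figures above (2 layers, hidden size 256, embedding size 32). It assumes a standard LSTM cell (four gates, one bias per gate, no peepholes or projection) and separate encoder and decoder weights. Both are assumptions, and the embedding table and output layer are left out because they depend on the vocabulary size.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Weights of one standard LSTM layer: 4 gates, each with a weight
    matrix over [input, hidden] and a bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def stacked_lstm_params(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Total weights of a stack where each layer feeds the next one."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += lstm_layer_params(layer_input, hidden_size)
        layer_input = hidden_size  # deeper layers read the previous hidden state
    return total


EMBEDDING_SIZE = 32   # from the article
HIDDEN_SIZE = 256     # from the article
NUM_LAYERS = 2        # from the article

per_stack = stacked_lstm_params(EMBEDDING_SIZE, HIDDEN_SIZE, NUM_LAYERS)
print(per_stack)      # 821248 weights in one 2-layer stack
print(2 * per_stack)  # 1642496 if encoder and decoder do not share weights
```

Under these assumptions the recurrent part is fairly small (about 1.6 million weights).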

Results

Of course, this network is not very good at chatting:

Q: Hi
A: Hi.

Q: What is your name ?
A: Laura.

Q: What does that mean ?
A: I dunno.

Q: How old are you ?
A: thirty-five.

Q: Will Google hire me ?
A: No.

Q: Tell me the alphabet
A: Fuck you.

Q: That's not nice
A: Yeah.

Here are some cases where it fails to answer correctly:

Q: Who is Laura ?
A: My brother.

Q: Say 'goodbye'
A: Alright.

Q: What is cooking ?
A: A channel.

Q: Can you say no ?
A: No.

Q: Two plus two
A: Manny...