From Principles to Practice: NVIDIA Teaches You to Build an RNN with PyTorch (Part 2)

Author: 三川

2017-05-06 20:36

Summary: With dynamic computation graphs on its side, does PyTorch have an edge over TensorFlow?

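Leiphone editor's note: This is the second half of "From Principles to Practice: NVIDIA Teaches You to Build an RNN with PyTorch"; the first half was published separately. The article originally appeared on the NVIDIA blog and was translated by Leiphone.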

Code Walkthrough

Before I start building the network, I need to set up a data loader. In deep learning it is common to run a model on batches of examples, which speeds up training through parallelism and gives a smoother gradient at each step. I want to do the same here; later in this article I explain how the stack-manipulation process described in the first half can be batched. The PyTorch text library has a built-in system that automatically produces batches by grouping examples of similar length, and the Python code below loads some data with it. After this code runs, train_iter, dev_iter, and test_iter hold iterators that cycle through batches of the SNLI training, validation, and test splits.

from torchtext import data, datasets

# Fields describe how each column of the dataset is processed
TEXT = datasets.snli.ParsedTextField(lower=True)
TRANSITIONS = datasets.snli.ShiftReduceField()
LABELS = data.Field(sequential=False)

# Load the SNLI train / dev / test splits
train, dev, test = datasets.SNLI.splits(
    TEXT, TRANSITIONS, LABELS, wv_type='glove.42B')
TEXT.build_vocab(train, dev, test)

# Iterators that batch together examples of similar length
train_iter, dev_iter, test_iter = data.BucketIterator.splits(
    (train, dev, test), batch_size=64)
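
To make the idea of grouping examples of similar length concrete, here is a minimal sketch in plain Python. It is a simplified illustration only, not torchtext's actual algorithm; the function name bucket_batches and its seed parameter are hypothetical.

```python
import random

def bucket_batches(examples, batch_size, seed=0):
    """Group variable-length examples into batches of similar length."""
    # Sort by length so that neighbours in the list need little padding.
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    # Shuffle the order of the batches, not their contents, so that
    # training does not always see short sentences first.
    random.Random(seed).shuffle(batches)
    return batches

# Six toy "sentences" of different lengths
sentences = [["a"] * n for n in (7, 2, 9, 3, 8, 2)]
for b in bucket_batches(sentences, batch_size=2):
    print([len(s) for s in b])
```

Each printed batch pairs sentences of neighbouring lengths, which is what keeps padding, and therefore wasted computation, small.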

You can find the rest of the setup code, including the training loop and the accuracy metrics, in train.py. Now for the model. As described in the first half, a SPINN encoder contains a parameterized Reduce layer and an optional recurrent Tracker that keeps track of sentence context by updating a hidden state every time the network reads a word or applies Reduce. The code below shows that creating a SPINN simply means creating these two submodules and storing them in a container for later use.

import itertools  # used by forward() and unbatch() below
import torch
from torch import nn
# subclass the Module class from PyTorch’s neural network package
class SPINN(nn.Module):
   def __init__(self, config):
       super(SPINN, self).__init__()
       self.config = config
       self.reduce = Reduce(config.d_hidden, config.d_tracker)
       if config.d_tracker is not None:
           self.tracker = Tracker(config.d_hidden, config.d_tracker)

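SPINN.__init__ is called once, when the model is created. It allocates and initializes parameters but performs no neural network operations and builds no computation graph. The code that runs on each new batch of data is defined in SPINN.forward; "forward" is the standard PyTorch name for the user-implemented method that defines a model's forward pass. It is effectively an implementation of the stack-manipulation algorithm described earlier, written in ordinary Python and operating on a batch of buffers and stacks, one of each per example. I iterate over the set of "shift" and "reduce" operations contained in transitions, run the Tracker if it exists, and go through each example in the batch either to apply the "shift" operation or to add the example to a list of those that need the "reduce" operation. Then I run the Reduce layer on every example in that list and push the results back onto their respective stacks.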

   def forward(self, buffers, transitions):
       # The input comes in as a single tensor of word embeddings;
       # I need it to be a list of stacks, one for each example in
       # the batch, that we can pop from independently. The words in
       # each example have already been reversed, so that they can
       # be read from left to right by popping from the end of each
       # list; they have also been prefixed with a null value.
       buffers = [list(torch.split(b.squeeze(1), 1, 0))
                  for b in torch.split(buffers, 1, 1)]
       # we also need two null values at the bottom of each stack,
       # so we can copy from the nulls in the input; these nulls
       # are all needed so that the tracker can run even if the
       # buffer or stack is empty
       stacks = [[buf[0], buf[0]] for buf in buffers]
       if hasattr(self, 'tracker'):
           self.tracker.reset_state()
       for trans_batch in transitions:
           if hasattr(self, 'tracker'):
               # I described the Tracker earlier as taking 4
               # arguments (context_t, b, s1, s2), but here I
               # provide the stack contents as a single argument
               # while storing the context inside the Tracker
               # object itself.
               tracker_states, _ = self.tracker(buffers, stacks)
           else:
               tracker_states = itertools.repeat(None)
           lefts, rights, trackings = [], [], []
           batch = zip(trans_batch, buffers, stacks, tracker_states)
           for transition, buf, stack, tracking in batch:
               if transition == SHIFT:
                   stack.append(buf.pop())
               elif transition == REDUCE:
                   rights.append(stack.pop())
                   lefts.append(stack.pop())
                   trackings.append(tracking)
           if rights:
               reduced = iter(self.reduce(lefts, rights, trackings))
               for transition, stack in zip(trans_batch, stacks):
                   if transition == REDUCE:
                       stack.append(next(reduced))
       return [stack.pop() for stack in stacks]
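
The control flow of the method above may be easier to follow on a single example with no tensors involved. The sketch below is an assumption-laden toy: the string values of SHIFT and REDUCE and the function shift_reduce are hypothetical, and a tuple stands in for the learned Reduce layer.

```python
SHIFT, REDUCE = "shift", "reduce"  # hypothetical stand-in constants

def shift_reduce(tokens, transitions):
    """Build a binary parse by following shift/reduce transitions."""
    buffer = list(reversed(tokens))  # reversed so pop() reads left to right
    stack = []
    for transition in transitions:
        if transition == SHIFT:
            # move the next word from the buffer onto the stack
            stack.append(buffer.pop())
        elif transition == REDUCE:
            # combine the top two stack entries into one phrase
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))  # stand-in for the Reduce layer
    return stack.pop()

tokens = ["the", "cat", "sat"]
transitions = [SHIFT, SHIFT, REDUCE, SHIFT, REDUCE]
# n words always need n shifts and n - 1 reduces: 2n - 1 steps in total
assert len(transitions) == 2 * len(tokens) - 1
print(shift_reduce(tokens, transitions))  # (('the', 'cat'), 'sat')
```

The batched version differs only in that it defers every "reduce" in a step to one call of the Reduce layer, so the expensive arithmetic runs once per step rather than once per example.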

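Calling self.tracker or self.reduce runs the "forward" method of the Tracker or Reduce submodule, respectively, each of which takes a list of examples on which to apply the operation. All of the math-heavy, GPU-accelerated operations that benefit from batching take place inside Tracker and Reduce. It therefore makes sense for the main "forward" method to operate on the examples independently, keeping a separate buffer and stack for each example in the batch. To write those functions more cleanly, I use some helpers that turn these lists of examples into batched tensors and back again.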

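I want the Reduce module to batch its arguments automatically to speed up computation, then unbatch them so they can be pushed and popped independently later. The composition function that combines each pair of left and right sub-phrases into a representation of the parent phrase is a TreeLSTM, a variant of the ordinary LSTM. This composition function requires the state of every child to consist of two tensors, a hidden state h and a memory cell state c. The function is defined by two parts: two linear layers (nn.Linear) that operate on the children's hidden states, and a nonlinear combination function tree_lstm that combines the output of the linear layers with the children's memory cell states. SPINN extends this by adding a third linear layer that operates on the Tracker's hidden state.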


def tree_lstm(c1, c2, lstm_in):
   # Takes the memory cell states (c1, c2) of the two children, as
   # well as the sum of linear transformations of the children’s
   # hidden states (lstm_in)
   # That sum of transformed hidden states is broken up into a
   # candidate output a and four gates (i, f1, f2, and o).
   a, i, f1, f2, o = lstm_in.chunk(5, 1)
   c = a.tanh() * i.sigmoid() + f1.sigmoid() * c1 + f2.sigmoid() * c2
   h = o.sigmoid() * c.tanh()
   return h, c

class Reduce(nn.Module):
   def __init__(self, size, tracker_size=None):
       super(Reduce, self).__init__()
       self.left = nn.Linear(size, 5 * size)
       self.right = nn.Linear(size, 5 * size, bias=False)
       if tracker_size is not None:
           self.track = nn.Linear(tracker_size, 5 * size, bias=False)

   def forward(self, left_in, right_in, tracking=None):
       left, right = batch(left_in), batch(right_in)
       tracking = batch(tracking)
       lstm_in = self.left(left[0])
       lstm_in += self.right(right[0])
       if hasattr(self, 'track'):
           lstm_in += self.track(tracking[0])
       return unbatch(tree_lstm(left[1], right[1], lstm_in))
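
Read off the two listings above, the composition can be written as the equations below. The notation is introduced here for illustration and does not appear in the original: W_left, W_right, and W_track are the weights of the three nn.Linear layers, b is the bias of the left layer (the other two are created with bias=False), h_1 and h_2 are the children's hidden states, c_1 and c_2 their memory cells, e is the Tracker's hidden state, sigma is the sigmoid, and the circled dot is element-wise multiplication.

```latex
\begin{aligned}
[\,a,\ i,\ f_1,\ f_2,\ o\,] &= W_{\text{left}}\, h_1 + b + W_{\text{right}}\, h_2 + W_{\text{track}}\, e \\
c &= \tanh(a) \odot \sigma(i) + \sigma(f_1) \odot c_1 + \sigma(f_2) \odot c_2 \\
h &= \sigma(o) \odot \tanh(c)
\end{aligned}
```

The first line corresponds to lstm_in being split into five chunks; the W_track term is present only when a Tracker is configured.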

Because the Reduce layer and the similarly implemented Tracker both work with LSTMs, the batch and unbatch helper functions operate on pairs of hidden and memory states.

def batch(states):
   if states is None:
       return None
   states = tuple(states)
   if states[0] is None:
       return None
   # states is a list of B tensors of dimension (1, 2H)
   # this returns two tensors of dimension (B, H)
   return torch.cat(states, 0).chunk(2, 1)

def unbatch(state):
   if state is None:
       return itertools.repeat(None)
   # state is a pair of tensors of dimension (B, H)
   # this returns a list of B tensors of dimension (1, 2H)
   return torch.split(torch.cat(state, 1), 1, 0)
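
The shape bookkeeping in these two helpers can be checked without tensors. The sketch below mirrors them on nested Python lists; batch_lists and unbatch_lists are hypothetical names, and the real helpers operate on torch tensors rather than lists.

```python
def batch_lists(states):
    """Turn B rows of length 2H into two B x H lists (h, c)."""
    half = len(states[0]) // 2
    h = [row[:half] for row in states]   # first H entries of each row
    c = [row[half:] for row in states]   # last H entries of each row
    return h, c

def unbatch_lists(state):
    """Inverse of batch_lists: rejoin each example's h and c into one row."""
    h, c = state
    return [h_row + c_row for h_row, c_row in zip(h, c)]

states = [[1, 2, 10, 20], [3, 4, 30, 40], [5, 6, 50, 60]]  # B=3, H=2
h, c = batch_lists(states)
print(h)  # [[1, 2], [3, 4], [5, 6]]
print(c)  # [[10, 20], [30, 40], [50, 60]]
assert unbatch_lists((h, c)) == states  # the round trip is lossless
```

Storing h and c side by side in one row per example is what lets each stack entry be a single object that can be pushed and popped independently.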

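That is all there is to the walkthrough. The rest of the code, including the Tracker, is in spinn.py. The classifier layers, which compute an SNLI category from the two sentence encodings and compare it with the target to produce the final loss variable, are in model.py. The "forward" code of SPINN and its submodules produces an extraordinarily complex computation graph (illustrated by a figure in the original article) that culminates in the loss. Its details are completely different for every batch in the dataset, yet each time it can be backpropagated automatically, with very little overhead, simply by calling loss.backward(). loss.backward() is built into PyTorch and performs backpropagation from any point in a graph.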

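The model and hyperparameters in the full code match the performance reported in the original SPINN paper, but training on a GPU is several times faster because the implementation takes full advantage of batching and of PyTorch's efficiency. The original SPINN took 21 minutes to compile its computation graph (which means the debugging cycle during implementation was at least that long) and about five days to train. The version described here has no compilation step and trains in about 13 hours on a Tesla K40 GPU, or about 9 hours on a Quadro GP100.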


Integrating Reinforcement Learning

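The version of the model described above without a Tracker is actually well suited to tf.fold, TensorFlow's new domain-specific language for special cases of dynamic computation graphs. The version with a Tracker would be much harder to implement. The reason is that adding a Tracker means switching from the recursive approach to the stack-based approach, which, as in the code above, is most naturally expressed with conditional branches that depend on the values of the input. Fold has no built-in conditional branch operation, so the graph structure of a model built with it can depend only on the structure of the input, not on its values. In addition, building a SPINN whose Tracker decides how to parse the input sentence would be effectively impossible, because graph structures in Fold, although they depend on the structure of an input example, must be completely fixed once that example is loaded.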

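Researchers at DeepMind and Google Brain have explored exactly such a model. They used reinforcement learning to train a SPINN's Tracker to parse input sentences without any external parsing data. In essence, such a model starts from random guesses and learns by rewarding itself whenever its parses yield good accuracy on the overall classification task. The researchers wrote that they "use batch size 1 since the computation graph needs to be reconstructed for every example at every iteration depending on the samples from the policy network [Tracker]". With PyTorch, however, they could use batched training even on a network like this one, whose structure is complex and varies stochastically.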

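PyTorch is also the first framework to build reinforcement learning (RL) into the library, in the form of stochastic computation graphs. This makes policy gradient RL as easy to use as backpropagation. To add it to the model described above, you only need to rewrite the first few lines of the main SPINN for loop as shown below, letting the Tracker define the probability of each kind of parser transition.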

# nn.functional contains neural network operations without parameters
from torch.nn import functional as F
# Note: this relies on the stochastic-node API of PyTorch as of 2017;
# later releases moved sampling to torch.distributions.
transitions = []
for i in range(len(buffers[0]) * 2 - 3):  # we know how many steps
   # obtain raw scores for each kind of parser transition
   tracker_states, transition_scores = self.tracker(buffers, stacks)
   # use a softmax function to normalize scores into probabilities,
   # then sample from the distribution these probabilities define
   transition_batch = F.softmax(transition_scores).multinomial()
   transitions.append(transition_batch)

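Then, once the batch has run all the way through and the model knows how accurately it predicted the categories, I can send reward signals back through these stochastic computation graph nodes, in addition to backpropagating through the rest of the graph in the traditional way: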

# losses should contain a loss per example, while mean and std
# represent averages across many batches
rewards = (-losses - mean) / std
for transition in transitions:
    transition.reinforce(rewards)
# connect the stochastic nodes to the final loss variable
# so that backpropagation can find them, multiplying by zero
# because this trick shouldn’t change the loss value
loss = losses.mean() + 0 * sum(transitions).sum()
# perform backpropagation through deterministic nodes and
# policy gradient RL for stochastic nodes
loss.backward()
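
To show what a policy gradient step computes for one sampled transition, here is a plain-Python sketch of the standard score-function (REINFORCE) estimator. It is not PyTorch's internal implementation; the helper names are hypothetical, and treating mean as the running average of the negated loss is an assumption about the snippet above.

```python
import math
import random

def softmax(scores):
    """Normalize raw scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def reinforce_grad(scores, action, reward):
    """Gradient of -reward * log p(action) with respect to the scores."""
    probs = softmax(scores)
    # d log p(action) / d score_k = 1[k == action] - p_k
    return [-reward * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]

rng = random.Random(0)
scores = [0.5, -0.5]                  # raw scores for (shift, reduce)
action = sample(softmax(scores), rng)
loss, mean, std = 0.3, -0.7, 0.2      # assumed: mean is the average of -loss
reward = (-loss - mean) / std         # same normalization as the snippet above
print(action, reinforce_grad(scores, action, reward))
```

With a positive reward, a gradient descent step on this quantity raises the score of the sampled transition and lowers the others, which is how good parses become more likely.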

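The Google researchers reported results for SPINN plus reinforcement learning that were slightly better than those the original SPINN obtained on SNLI, even though the reinforcement learning version used no precomputed parse trees. Deep reinforcement learning for natural language processing is a brand-new field with wide-open research problems. By building reinforcement learning into the framework, PyTorch dramatically lowers the barrier to entry.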

Via NVIDIA; translated by Leiphone.
