Accelerating PyTorch with NVIDIA DALI: a 4x Training Speedup

Author: skura

2020-02-03 10:56

Lead: Train on a single GPU in just a few hours!

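This article presents some techniques for improving DALI's resource usage and for creating a completely CPU-based pipeline. These techniques stabilize long-term memory usage and allow batch sizes about 50% larger than the example CPU & GPU pipelines. Tested on a Tesla V100 accelerator, PyTorch+DALI can reach processing speeds of nearly 4000 images/s, roughly 4x faster than native PyTorch.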

Introduction

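The last few years have seen tremendous progress in deep learning hardware. NVIDIA's latest offerings, the Tesla V100 & GeForce RTX series, contain dedicated tensor cores to accelerate the operations commonly used in neural networks. The V100 in particular has enough power to train neural networks at thousands of images per second, which brings single-GPU training on the ImageNet dataset down to a few hours. By comparison, training the AlexNet model on ImageNet in 2012 took 5 days!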

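Such powerful GPUs put the data preprocessing pipeline under strain. To address this, TensorFlow released a new data loader: tf.data.Dataset. Its pipeline is written in C++ and uses a graph-based approach in which preprocessing operations are chained together to form a pipeline. PyTorch, on the other hand, uses a data loader written in Python on top of the PIL library. This is great for ease of use and flexibility but not so great for speed, although the PIL-SIMD library does improve the situation somewhat.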

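The NVIDIA Data Loading Library (DALI) is designed to remove the data preprocessing bottleneck so that training can run at full speed. DALI is primarily intended for preprocessing on the GPU, but most of its operations also have a fast CPU implementation. This article focuses on PyTorch, but DALI also supports TensorFlow, MXNet and TensorRT. The TensorRT support is particularly good: it allows training and inference to use exactly the same preprocessing code. Frameworks such as TensorFlow and PyTorch typically differ slightly in their data loaders, which can end up affecting accuracy.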

Here are some useful resources for getting started with DALI:

DALI Home: https://developer.nvidia.com/DALI
Fast AI Data Preprocessing with NVIDIA DALI: https://devblogs.nvidia.com/fast-ai-data-preprocessing-with-nvidia-dali/
DALI Developer Guide: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html
Getting Started: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/getting%20started.html

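In the rest of this article, I'll assume basic familiarity with ImageNet preprocessing and the DALI ImageNet example. I'll discuss the problems I ran into while using DALI and how I got around them. We'll look at both the CPU and GPU pipelines.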

DALI Long-Term Memory Usage

The first problem I ran into with DALI is that RAM usage increases with every training epoch, which eventually causes OOM errors (even on a VM with 78GB of RAM). It has been flagged (278, 344, 486) but has not yet been fixed.

The only solution I could come up with isn't pretty: re-import DALI and recreate the training and validation pipelines every epoch:

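import gc
import importlib

import torch
import dali  # the local module that defines the pipelines and iterators

# Drop every reference to the old loaders and pipelines, then free memory
del self.train_loader, self.val_loader, self.train_pipe, self.val_pipe
torch.cuda.synchronize()
torch.cuda.empty_cache()
gc.collect()

# Re-import the module so the pipeline classes are created afresh
importlib.reload(dali)
from dali import HybridTrainPipe, HybridValPipe, DaliIteratorCPU, DaliIteratorGPU

# <rebuild DALI pipeline>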
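
The teardown-and-rebuild pattern above can also be looked at in isolation. The following is a minimal sketch, not the author's implementation: `EpochResources` and `build` are hypothetical names, and the standard-library module `colorsys` merely stands in for the local `dali` module so that the snippet runs without DALI or a GPU.

```python
import colorsys  # stand-in for the article's local `dali` module (assumption)
import gc
import importlib


class EpochResources:
    """Hypothetical holder for objects that are rebuilt every epoch."""

    def __init__(self, module, build_fn):
        self.module = module
        self.build_fn = build_fn
        self.loaders = build_fn(module)

    def reset(self):
        # 1. Drop every reference to the old objects so they can be freed.
        self.loaders = None
        gc.collect()
        # 2. Re-import the module that defines the pipelines.
        self.module = importlib.reload(self.module)
        # 3. Build fresh objects from the reloaded module.
        self.loaders = self.build_fn(self.module)


builds = []  # records one entry per (re)build


def build(module):
    """Toy factory standing in for pipeline and loader construction."""
    builds.append(module.__name__)
    return {"train": object(), "val": object()}


resources = EpochResources(colorsys, build)
for epoch in range(3):
    # ... train and validate with resources.loaders here ...
    resources.reset()

print(len(builds))  # one initial build plus one rebuild per epoch
```

The order matters: references are dropped and collected before the reload, so the old objects are not kept alive while the new ones are being created.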

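Note that even with this workaround, DALI still needs a lot of RAM to get the best results. Given how cheap RAM is nowadays, this isn't much of a problem; GPU memory is the real issue. As the table below shows, the maximum batch size when using DALI can be as much as 50% lower than with TorchVision: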

[Table: maximum batch size, DALI vs. TorchVision; not preserved in this copy]

In the following sections, I'll cover some approaches for reducing GPU memory usage.

Building a Completely CPU-Based Pipeline

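A CPU-based pipeline is useful when peak throughput isn't required (e.g. when working with medium and large models like ResNet50). However, the example CPU training pipeline only performs the decoding and resizing operations on the CPU; the CropMirrorNormalize operation still runs on the GPU. This is significant: I found that even just transferring the output to the GPU with DALI uses a lot of GPU memory. To avoid this, I modified the example CPU pipeline to run entirely on the CPU: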

import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline


class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop,
                 mean, std, local_rank=0, world_size=1,
                 dali_cpu=False, shuffle=True, fp16=False,
                 min_crop_size=0.08):
        # As we're recreating the Pipeline at every epoch, the seed must be -1 (random seed)
        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=-1)

        # Enabling read_ahead slowed down processing ~40%
        self.input = ops.FileReader(file_root=data_dir,
                                    shard_id=local_rank, num_shards=world_size,
                                    random_shuffle=shuffle)

        # Let user decide which pipeline works best with the chosen model
        if dali_cpu:
            decode_device = "cpu"
            self.dali_device = "cpu"
            self.flip = ops.Flip(device=self.dali_device)
        else:
            decode_device = "mixed"
            self.dali_device = "gpu"

            output_dtype = types.FLOAT
            if self.dali_device == "gpu" and fp16:
                output_dtype = types.FLOAT16

            self.cmn = ops.CropMirrorNormalize(device="gpu",
                                               output_dtype=output_dtype,
                                               output_layout=types.NCHW,
                                               crop=(crop, crop),
                                               image_type=types.RGB,
                                               mean=mean,
                                               std=std)

        # To be able to handle all images from full-sized ImageNet, this padding sets
        # the size of the internal nvJPEG buffers without additional reallocations
        device_memory_padding = 211025920 if decode_device == 'mixed' else 0
        host_memory_padding = 140544512 if decode_device == 'mixed' else 0
        self.decode = ops.ImageDecoderRandomCrop(device=decode_device,
                                                 output_type=types.RGB,
                                                 device_memory_padding=device_memory_padding,
                                                 host_memory_padding=host_memory_padding,
                                                 random_aspect_ratio=[0.8, 1.25],
                                                 random_area=[min_crop_size, 1.0],
                                                 num_attempts=100)

        # Resize as desired. To match torchvision data loader, use triangular interpolation.
        self.res = ops.Resize(device=self.dali_device, resize_x=crop, resize_y=crop,
                              interp_type=types.INTERP_TRIANGULAR)

        self.coin = ops.CoinFlip(probability=0.5)
        print('DALI "{0}" variant'.format(self.dali_device))

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")

        # Combined decode & random crop
        images = self.decode(self.jpegs)

        # Resize as desired
        images = self.res(images)

        if self.dali_device == "gpu":
            output = self.cmn(images, mirror=rng)
        else:
            # CPU backend uses torch to apply mean & std
            output = self.flip(images, horizontal=rng)

        self.labels = self.labels.gpu()
        return [output, self.labels]

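The DALI pipeline now outputs an 8-bit tensor on the CPU. We need to use PyTorch to do the CPU->GPU transfer, the conversion to floating point, and the normalization. The last two operations are done on the GPU because, in practice, they are very fast there and doing so reduces the CPU->GPU memory bandwidth requirement. I tried pinning the tensor before transferring it to the GPU, but got no performance gain from doing so.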
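
To see why the tensor is kept as 8-bit integers until it reaches the GPU, it helps to compare the size of one batch in each format. This is a rough back-of-the-envelope sketch: the batch size of 512 matches the benchmarks later in the article, while the 224x224 crop is an assumed, typical ImageNet input size.

```python
def batch_bytes(batch_size, height, width, channels, bytes_per_value):
    """Size in bytes of one dense image batch."""
    return batch_size * height * width * channels * bytes_per_value


BATCH, CROP, CHANNELS = 512, 224, 3  # the crop size is an assumption

uint8_bytes = batch_bytes(BATCH, CROP, CROP, CHANNELS, 1)  # 8-bit integers
fp16_bytes = batch_bytes(BATCH, CROP, CROP, CHANNELS, 2)   # half precision
fp32_bytes = batch_bytes(BATCH, CROP, CROP, CHANNELS, 4)   # single precision

for name, size in [("uint8", uint8_bytes), ("fp16", fp16_bytes), ("fp32", fp32_bytes)]:
    print(f"{name}: {size / 1e6:.1f} MB per batch")
```

Under these assumptions, a uint8 batch is about 77 MB while the same batch in fp32 is about 308 MB, so converting after the transfer moves a quarter of the bytes across the CPU->GPU link.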

Putting it together with a prefetcher:

import queue
import threading
from threading import Event

import torch


def _preproc_worker(dali_iterator, cuda_stream, fp16, mean, std,
                    output_queue, proc_next_input, done_event, pin_memory):
    """
    Worker function to parse DALI output & apply final preprocessing steps
    """
    while not done_event.is_set():
        # Wait until main thread signals to proc_next_input --
        # normally once it has taken the last processed input
        proc_next_input.wait()
        proc_next_input.clear()

        if done_event.is_set():
            print('Shutting down preproc thread')
            break

        try:
            data = next(dali_iterator)

            # Decode the data output
            input_orig = data[0]['data']
            target = data[0]['label'].squeeze().long()  # DALI should already output target on device

            # Copy to GPU and apply final processing in separate CUDA stream
            with torch.cuda.stream(cuda_stream):
                input = input_orig
                if pin_memory:
                    input = input.pin_memory()
                    del input_orig  # Save memory
                input = input.cuda(non_blocking=True)

                input = input.permute(0, 3, 1, 2)

                # Input tensor is kept as 8-bit integer for transfer to GPU, to save bandwidth
                if fp16:
                    input = input.half()
                else:
                    input = input.float()

                input = input.sub_(mean).div_(std)

            # Put the result on the queue
            output_queue.put((input, target))

        except StopIteration:
            print('Resetting DALI loader')
            dali_iterator.reset()
            output_queue.put(None)


# DaliIterator is the base class (not shown here); it must provide self._dali_iterator
class DaliIteratorCPU(DaliIterator):
    """
    Wrapper class to decode the DALI iterator output & provide iterator that
    functions in the same way as TorchVision.
    Note that permutation to channels first, converting from 8-bit integer to
    float & normalization are all performed on GPU

    pipelines (Pipeline): DALI pipelines
    size (int): Number of examples in set
    fp16 (bool): Use fp16 as output format, f32 otherwise
    mean (tuple): Image mean value for each channel
    std (tuple): Image standard deviation value for each channel
    pin_memory (bool): Transfer input tensor to pinned memory, before moving to GPU
    """
    def __init__(self, fp16=False, mean=(0., 0., 0.), std=(1., 1., 1.),
                 pin_memory=True, **kwargs):
        super().__init__(**kwargs)
        print('Using DALI CPU iterator')
        self.stream = torch.cuda.Stream()

        self.fp16 = fp16
        self.mean = torch.tensor(mean).cuda().view(1, 3, 1, 1)
        self.std = torch.tensor(std).cuda().view(1, 3, 1, 1)
        self.pin_memory = pin_memory

        if self.fp16:
            self.mean = self.mean.half()
            self.std = self.std.half()

        self.proc_next_input = Event()
        self.done_event = Event()
        self.output_queue = queue.Queue(maxsize=5)
        self.preproc_thread = threading.Thread(
            target=_preproc_worker,
            kwargs={'dali_iterator': self._dali_iterator,
                    'cuda_stream': self.stream, 'fp16': self.fp16, 'mean': self.mean,
                    'std': self.std, 'proc_next_input': self.proc_next_input,
                    'done_event': self.done_event, 'output_queue': self.output_queue,
                    'pin_memory': self.pin_memory})
        self.preproc_thread.daemon = True
        self.preproc_thread.start()

        # Ask the worker to start preparing the first batch
        self.proc_next_input.set()

    def __next__(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        data = self.output_queue.get()
        self.proc_next_input.set()
        if data is None:
            raise StopIteration
        return data

    def __del__(self):
        self.done_event.set()
        self.proc_next_input.set()
        torch.cuda.current_stream().wait_stream(self.stream)
        self.preproc_thread.join()
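
The handshake between the iterator and its worker thread is easier to follow with the CUDA and DALI parts removed. The sketch below keeps only the threading logic: one event asks the worker for the next item, a second event requests shutdown, and `None` on the queue marks the end of an epoch. `Prefetcher` and `_worker` are illustrative names, and a plain list stands in for the DALI iterator.

```python
import queue
import threading


def _worker(source, out_q, next_evt, done_evt):
    """Produce one item per signal; put None when the source is exhausted."""
    it = iter(source)
    while not done_evt.is_set():
        next_evt.wait()          # sleep until the consumer asks for more
        next_evt.clear()
        if done_evt.is_set():
            break
        try:
            out_q.put(next(it))
        except StopIteration:
            it = iter(source)    # rewind, like dali_iterator.reset()
            out_q.put(None)      # sentinel: end of epoch


class Prefetcher:
    """Iterator that prepares the next item in a background thread."""

    def __init__(self, source):
        self._q = queue.Queue(maxsize=5)
        self._next = threading.Event()
        self._done = threading.Event()
        self._thread = threading.Thread(
            target=_worker,
            args=(source, self._q, self._next, self._done),
            daemon=True,
        )
        self._thread.start()
        self._next.set()         # start preparing the first item right away

    def __iter__(self):
        return self

    def __next__(self):
        item = self._q.get()
        self._next.set()         # request the following item while we work
        if item is None:
            raise StopIteration
        return item

    def close(self):
        self._done.set()
        self._next.set()         # wake the worker so it can see the stop flag
        self._thread.join()


loader = Prefetcher([1, 2, 3])
first_epoch = list(loader)
second_epoch = list(loader)
loader.close()
print(first_epoch, second_epoch)
```

Because the consumer signals the worker as soon as it takes an item, the next item is prepared while the current one is being used, which is the overlap the prefetcher is meant to provide.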

GPU-Based Pipeline

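In my testing, the new full CPU pipeline detailed above is about twice as fast as the TorchVision data loader while achieving nearly the same maximum batch size. The CPU pipeline works great with large models like ResNet50; however, with small models like AlexNet or ResNet18, the CPU pipeline still can't keep up with the GPU. For these cases, the example GPU pipeline performs best. The trouble is that the GPU pipeline reduces the maximum possible batch size by 50%, which limits throughput.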

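One way to significantly reduce GPU memory usage is to keep the validation pipeline off the GPU until it is actually needed at the end of an epoch. This is easy to do, because we are already re-importing the DALI library and recreating the data loaders every epoch.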
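
The idea of deferring the validation pipeline can be sketched without DALI at all. In the illustration below, `LazyLoaders`, `make_train` and `make_val` are hypothetical names, and the factories only record that they were called; in a real setup they would construct the DALI pipelines.

```python
class LazyLoaders:
    """Hypothetical holder that builds the validation loader only on demand."""

    def __init__(self, make_train, make_val):
        self._make_val = make_val
        self.train_loader = make_train()  # needed right away
        self.val_loader = None            # deferred: occupies no memory yet

    def prep_for_val(self):
        # Build the validation loader just before it is used.
        if self.val_loader is None:
            self.val_loader = self._make_val()
        return self.val_loader

    def reset(self):
        # End of epoch: release the validation loader again.
        self.val_loader = None


created = []  # records which loaders have been constructed


def make_train():
    created.append("train")
    return "train-loader"


def make_val():
    created.append("val")
    return "val-loader"


loaders = LazyLoaders(make_train, make_val)
created_before_val = list(created)  # only the training loader exists so far
loaders.prep_for_val()
created_after_val = list(created)
print(created_before_val, created_after_val)
```

During training only the training loader exists, so whatever memory the validation loader would claim stays available for a larger training batch.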

More Tips

A few more tips for working with DALI:

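For validation, a batch size that evenly divides the dataset size works best. For example, with a validation set of 50000 images, a batch size of 500 is better than 512, because it avoids a partial batch at the end of the validation dataset.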

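As with the TensorFlow and PyTorch data loaders, the TorchVision and DALI pipelines don't produce exactly the same output, so you will see slightly different validation accuracies. I found that this is due to the different JPEG image decoders. There used to be an issue with resizing as well, but that has now been fixed. On the other hand, DALI supports TensorRT, which allows exactly the same preprocessing to be used for training and inference.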

For peak throughput, try setting the number of data loader workers close to the number of virtual CPU cores (2 virtual cores = 1 physical core); the benchmarks below use 10 workers on 12 vCPUs.

If you want the absolute best performance and don't need the output to be similar to TorchVision's, try turning off triangular interpolation in the DALI image resize operation.

Don't forget about disk IO. Make sure you have enough RAM to cache the dataset, as well as a really fast SSD. DALI can pull up to 400Mb/s from disk!
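
The batch-size tip above comes down to simple remainder arithmetic. The helper names below are illustrative only; the numbers are the ones from the tip.

```python
def leftover(dataset_size, batch_size):
    """Number of samples in the final, partial batch (0 if there is none)."""
    return dataset_size % batch_size


def largest_even_batch(dataset_size, max_batch_size):
    """Largest batch size <= max_batch_size that divides dataset_size evenly."""
    for candidate in range(max_batch_size, 0, -1):
        if dataset_size % candidate == 0:
            return candidate


VAL_SIZE = 50000
print(leftover(VAL_SIZE, 512))            # 336 samples left over
print(leftover(VAL_SIZE, 500))            # 0, every batch is full
print(largest_even_batch(VAL_SIZE, 512))  # 500
```

A search like `largest_even_batch` is handy when the validation set size is not a round number and a suitable batch size is not obvious.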

Putting It All Together

To make these modifications easy to integrate, I created a data loader class that incorporates all of the changes described here, with both DALI and TorchVision backends. Usage is simple. Instantiate the data loader:

dataset = Dataset(data_dir,
                  batch_size,
                  val_batch_size,
                  workers,
                  use_dali,
                  dali_cpu,
                  fp16)

Then get the training and validation data loaders:

train_loader = dataset.get_train_loader()
val_loader = dataset.get_val_loader()

Reset the data loader at the end of every training epoch:

dataset.reset()

Optionally, the validation pipeline can be recreated on the GPU just before model validation:

dataset.prep_for_val()
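
Put together, one epoch follows the order train, prepare validation, validate, reset. The loop below is only a sketch of that call order: `FakeDataset` is a stand-in that records calls so the snippet runs anywhere, and fetching the loaders again in every epoch is an assumption made here because `reset()` recreates them.

```python
class FakeDataset:
    """Stand-in exposing the same four methods as the article's Dataset class."""

    def __init__(self):
        self.calls = []

    def get_train_loader(self):
        self.calls.append("get_train_loader")
        return range(2)  # pretend there are two training batches

    def get_val_loader(self):
        self.calls.append("get_val_loader")
        return range(1)  # pretend there is one validation batch

    def prep_for_val(self):
        self.calls.append("prep_for_val")

    def reset(self):
        self.calls.append("reset")


dataset = FakeDataset()
for epoch in range(2):
    for batch in dataset.get_train_loader():
        pass  # training step goes here
    dataset.prep_for_val()  # optional: build the validation pipeline late
    for batch in dataset.get_val_loader():
        pass  # validation step goes here
    dataset.reset()  # recreate the loaders for the next epoch

print(dataset.calls)
```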

Benchmarks

Here are the maximum batch sizes I was able to use with ResNet18:

[Table or figure: maximum batch sizes with ResNet18; not preserved in this copy]

So, by applying these modifications, the maximum batch size DALI can use in both CPU and GPU modes increased by about 50%!

Here are some throughput figures for ShuffleNet V2 0.5 with a batch size of 512:

[Table or figure: throughput for ShuffleNet V2 0.5 at batch size 512; not preserved in this copy]

And here are some results from using the DALI GPU pipeline to train various networks included in TorchVision:

[Table or figure: DALI GPU pipeline results for various TorchVision networks; not preserved in this copy]

All tests were run on a Google Cloud V100 instance with 12 vCPUs (6 physical cores) and 78GB of RAM, training with Apex FP16. To reproduce these results, use the following arguments:

--fp16 --batch-size 512 --workers 10 --arch "shufflenet_v2_x0_5 or resnet18" --prof --use-dali
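
For readers unfamiliar with this flag style, here is how such arguments are typically parsed in Python. This is a hypothetical parser written for illustration, not the one from the article's training script: the defaults and types are assumptions, and the quoted `--arch` value presumably means that one of the two model names is passed per run.

```python
import argparse

# Hypothetical parser; flag names are taken from the argument list above,
# while the defaults and types are assumptions.
parser = argparse.ArgumentParser(description="Sketch of the benchmark options")
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--batch-size", type=int, default=256)
parser.add_argument("--workers", type=int, default=4)
parser.add_argument("--arch", default="resnet18")
parser.add_argument("--prof", action="store_true")
parser.add_argument("--use-dali", action="store_true")

# Parse one concrete run: ResNet18 with the settings used in the benchmarks
args = parser.parse_args(
    ["--fp16", "--batch-size", "512", "--workers", "10",
     "--arch", "resnet18", "--prof", "--use-dali"]
)
print(args.arch, args.batch_size, args.workers, args.use_dali)
```

Note that argparse converts dashes in flag names to underscores, so `--batch-size` and `--use-dali` become `args.batch_size` and `args.use_dali`.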

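So, with DALI, a single Tesla V100 can reach processing speeds of nearly 4000 images per second! That is just over half of what NVIDIA's super-expensive DGX-1 with 8 V100 GPUs can do. For me, being able to train on ImageNet on a single GPU in a few hours has been a productivity game changer. Hopefully it will be for you too!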
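
A quick calculation shows how the throughput figure translates into training time. This is a rough estimate under stated assumptions: the epoch count is chosen purely for illustration, and validation and other overheads are ignored.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k (ILSVRC 2012) training set size
IMAGES_PER_SECOND = 4000           # approximate throughput quoted above
EPOCHS = 30                        # assumption, for illustration only

seconds_per_epoch = IMAGENET_TRAIN_IMAGES / IMAGES_PER_SECOND
hours_total = seconds_per_epoch * EPOCHS / 3600

print(f"{seconds_per_epoch / 60:.1f} minutes per epoch")
print(f"{hours_total:.1f} hours for {EPOCHS} epochs")
```

Under these assumptions, an epoch takes a little over five minutes and the whole run finishes in under three hours, which is consistent with the "few hours" figure.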

The code presented in this article is available at: https://github.com/yaysummeriscoming/DALI_pytorch_demo

via: http://t.cn/A6PlsjM1

Copyrighted article of Leiphone (雷峰網). Reproduction without authorization is prohibited. See the reprint notice for details.
