Text Generation: (Distil)GPT-2

Andrew Fogarty

7/16/2020

# load python
library(reticulate)
use_condaenv("my_ml")
# load packages
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as datautils
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset, random_split
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, LineByLineTextDataset
from transformers import get_linear_schedule_with_warmup, AdamW
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.metrics import roc_auc_score
import time, os, datetime, random, re
from torch.cuda.amp import autocast, GradScaler
from sklearn.model_selection import train_test_split

SEED = 15
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
## <torch._C.Generator object at 0x000000001FA5B090>
torch.backends.cudnn.deterministic = True

# tell pytorch to use cuda
device = torch.device("cuda")

1 Introduction

Language models are trained to predict the probability of the next “token” given the tokens that precede it. A token can be a word, a letter, or a subcomponent of a word. When generating text with a language model, we typically provide a starting sequence, like “The Taliban launched an attack”, and the model then outputs a probability for every token in its vocabulary as the candidate next token.
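
As a minimal sketch of this idea, the snippet below asks an off-the-shelf DistilGPT-2 (the model we fine-tune later in this guide) for the most likely next tokens after a prompt; the variable names here are purely illustrative.

# minimal sketch: next-token probabilities from a pretrained DistilGPT-2
sketch_tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
sketch_model = GPT2LMHeadModel.from_pretrained('distilgpt2').eval()
ids = torch.tensor(sketch_tokenizer.encode("The Taliban launched an attack")).unsqueeze(0)
with torch.no_grad():
    logits = sketch_model(input_ids=ids)[0]  # shape: (1, sequence_length, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the full vocabulary
top_probs, top_ids = next_token_probs.topk(5)
print([sketch_tokenizer.decode([i.item()]) for i in top_ids])  # five most likely next tokens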

In this guide, we use a decoder-only transformer language model to generate text from our novel insurgent propaganda corpus. GPT-2 (Generative Pre-trained Transformer 2) accepts up to 1024 byte-pair-encoded tokens and, in its small version, consists of 12 decoder layers with 12 attention heads each and a 768-dimensional hidden state. GPT-2 was pre-trained on WebText, long contiguous runs of text scraped from the web. Here we use DistilGPT-2, a distilled version of the small model with 6 decoder layers.
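
These sizes can be verified from the model configuration; a quick optional check (DistilGPT-2 keeps GPT-2 small's head count, hidden size, and context length but halves the layer count):

# optional check of the architecture sizes cited above
cfg = AutoConfig.from_pretrained('distilgpt2')
print(cfg.n_layer, cfg.n_head, cfg.n_embd, cfg.n_positions)  # 6 layers, 12 heads, 768 dims, 1024 positions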

2 Preparing the Data

Unlike the other guides, we need to process our data a little differently so that we can create long contiguous inputs. The code below does minor cleanup on the data set and creates a train and validation split, which are then saved as text files.

# prepare and load data
def prepare_df(pkl_location):
    # read pkl as pandas
    df = pd.read_pickle(pkl_location)
    # remove excess white spaces
    df['body'] = df['body'].apply(lambda x: " ".join(x.split()))
    # remove excess spaces near punctuation
    df['body'] = df['body'].apply(lambda x: re.sub(r'\s([?.!"](?:\s|$))', r'\1', x))
    # split and shuffle data
    train, valid = train_test_split(df['body'], test_size=0.2)
    return train.reset_index(drop=True), valid.reset_index(drop=True)


# instantiate shuffled train and validation
train, valid = prepare_df('C:\\Users\\Andrew\\Desktop\\df.pkl')

# save to text for transformers TextDataset
np.savetxt('C:\\Users\\Andrew\\Desktop\\train.txt', train, fmt="%s")
np.savetxt('C:\\Users\\Andrew\\Desktop\\valid.txt', valid, fmt="%s")

2.1 Tokenizing

Next, we instantiate the GPT-2 tokenizer and add special tokens for the beginning of a sequence, the end of a sequence, and padding. However, I am not certain that these tokens are all that necessary; note that if you do use them, the model's token embeddings must be resized to match the enlarged vocabulary (see the note after the model is instantiated below).

# instantiate GPT2 tokenizer, byte-level encoding
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

# add special tokens that otherwise all share the same id
tokenizer.add_special_tokens({'bos_token': '<bos>',
                              'eos_token': '<eos>',
                              'pad_token': '<pad>'})
## 3

# check token ids (the call above returned 3, the number of special tokens added)
tokenizer.eos_token_id
## 50258
tokenizer.bos_token_id
## 50257
tokenizer.unk_token_id
## 50256
tokenizer.pad_token_id
## 50259

2.2 Instantiating Contiguous Data

Next, we use TextDataset from the transformers package to build our contiguous data for language modeling.

# transformers TextDataset -- tokenizes each file, concatenates the tokens, and
# slices them into fixed-length blocks; block_size=1025 so that after the
# one-token offset used in training, inputs and targets are each 1024 tokens
# (DistilGPT-2's maximum context)
train_set = TextDataset(tokenizer=tokenizer,
                        file_path='C:\\Users\\Andrew\\Desktop\\train.txt',
                        block_size=1025)

valid_set = TextDataset(tokenizer=tokenizer,
                        file_path='C:\\Users\\Andrew\\Desktop\\valid.txt',
                        block_size=1025)
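
Each item returned by TextDataset is a one-dimensional tensor of token ids of length block_size. A quick sanity check (the exact counts depend on your corpus):

# sanity check: number of blocks and the shape and content of one example
print(len(train_set))                       # number of 1025-token blocks
print(train_set[0].shape)                   # torch.Size([1025])
print(tokenizer.decode(train_set[0][:20]))  # the first 20 token ids decoded back to text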

2.3 Instantiating the Model

As with most transformer models, we instantiate the model and move it to the GPU. Since the full GPT-2 is a large model, we use the distilled version, DistilGPT-2.

# instantiate model GPT2 transformer with a language modeling head on top
model = GPT2LMHeadModel.from_pretrained('distilgpt2').cuda()  # to GPU
## Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias']
## You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
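
Note that because we added three special tokens in section 2.1, the tokenizer now has 50,260 entries while the model's embedding matrix still has 50,257 rows. The added tokens are never inserted into the training text in this guide, but if you do intend to feed them to the model, resize the embeddings first (a one-line fix):

# grow the input (and tied output) embeddings to match the enlarged tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))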

Next, the data loaders are prepared. Because our data are contiguous, be sure to use SequentialSampler rather than a random sampler.

2.4 Data Loaders

# prepare data loaders
train_dataloader = datautils.DataLoader(dataset=train_set,
                                        sampler=SequentialSampler(train_set),
                                        batch_size=3,
                                        drop_last=True,
                                        shuffle=False)


valid_dataloader = datautils.DataLoader(dataset=valid_set,
                                        sampler=SequentialSampler(valid_set),
                                        batch_size=3,
                                        drop_last=True,
                                        shuffle=False)

Helper training functions are then defined. It is important to note that in training we offset each batch by one token, yielding two 1024-length sequences (the maximum length for DistilGPT-2). Training a language model means training the model to predict the next token: the target y is the input x shifted one position to the right, so that the “target” for each token in x is the token immediately to its right. Offsetting the inputs by one position is standard practice for language modeling; a toy illustration follows.
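
To make the offset concrete, here is a toy illustration with made-up token ids:

# toy illustration of the one-token offset used in the training loop below
toy_batch = torch.tensor([[10, 11, 12, 13, 14]])
toy_x = toy_batch[:, :-1]  # tensor([[10, 11, 12, 13]]) -- the model input
toy_y = toy_batch[:, 1:]   # tensor([[11, 12, 13, 14]]) -- the target: each input token's right neighbor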

2.5 Training and Helper Functions

# time function
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))


def train(model, dataloader, optimizer):

    # capture time
    total_t0 = time.time()

    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, epochs))
    print('Training...')

    # reset total loss for epoch
    train_total_loss = 0

    # put model into training mode
    model.train()

    # for each batch of training data...
    for step, batch in enumerate(dataloader):

        # progress update every 40 batches.
        if step % 40 == 0 and not step == 0:

            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(dataloader)))

        # Unpack this training batch from our dataloader and copy each tensor
        # to the GPU with .cuda().
        #
        # `batch` is a single tensor of token ids; slicing off the last token
        # gives the inputs (x) and slicing off the first gives the targets (y).
        x = batch[:, :-1].cuda()
        y = batch[:, 1:].cuda()

        # clear previously calculated gradients
        optimizer.zero_grad()

        # runs the forward pass with autocasting.
        with autocast():
            # forward propagation (evaluate model on training batch)
            logits = model(input_ids=x)[0]

            loss = criterion(logits.flatten(0, 1), y.flatten(0))
            # sum the training loss over all batches for average loss at end
            # loss is a tensor containing a single value
            train_total_loss += loss.item()

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # unscale the gradients so that clipping operates on their true values
        scaler.unscale_(optimizer)

        # clip the gradient norm to 1.0 to reduce exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

        # update the learning rate
        scheduler.step()

    # calculate the average loss over all of the batches
    avg_train_loss = train_total_loss / len(dataloader)

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'Train Loss': avg_train_loss
        }
    )

    # training time end
    training_time = format_time(time.time() - total_t0)

    # print result summaries
    print("")
    print("summary results")
    print("epoch | trn loss | trn time ")
    print(f"{epoch+1:5d} | {avg_train_loss:.5f} | {training_time:}")

    return training_stats


def validating(model, dataloader):

    # capture validation time
    total_t0 = time.time()

    # After the completion of each training epoch, measure our performance on
    # our validation set.
    print("")
    print("Running Validation...")

    # put the model in evaluation mode
    model.eval()

    # track variables
    total_valid_loss = 0

    # evaluate data for one epoch
    for batch in dataloader:

        # Unpack this validation batch from our dataloader and copy each tensor
        # to the GPU with .cuda().
        #
        # `batch` is a single tensor of token ids; slicing off the last token
        # gives the inputs (x) and slicing off the first gives the targets (y).
        x = batch[:, :-1].cuda()
        y = batch[:, 1:].cuda()

        # tell pytorch not to bother calculating gradients
        # as its only necessary for training
        with torch.no_grad():
            # forward propagation (evaluate model on validation batch)
            logits = model(input_ids=x)[0]

            loss = criterion(logits.flatten(0, 1), y.flatten(0))
            # sum the validation loss over all batches for average loss at end
            # loss is a tensor containing a single value
            total_valid_loss += loss.item()

    # calculate the average loss over all of the batches.
    global avg_val_loss
    avg_val_loss = total_valid_loss / len(dataloader)

    # Record all statistics from this epoch.
    valid_stats.append(
        {
            'Val Loss': avg_val_loss,
            'Val PPL.': np.exp(avg_val_loss)
        }
    )

    # capture end validation time
    validation_time = format_time(time.time() - total_t0)

    # print result summaries
    print("")
    print("summary results")
    print("epoch | val loss | val ppl | val time")
    print(f"{epoch+1:5d} | {avg_val_loss:.5f} | {np.exp(avg_val_loss):.3f} | {validation_time:}")

    return valid_stats
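
The 'Val PPL.' statistic recorded above is perplexity, the exponential of the mean per-token cross-entropy loss: PPL = exp(loss). For example, a validation loss of about 3.47 corresponds to a perplexity of exp(3.47) ≈ 32, meaning the model is, on average, about as uncertain as a uniform choice among 32 tokens.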

2.6 Training Preparation

Now we are almost ready to train. A few other preparatory objects are created: the gradient scaler for mixed precision, the number of epochs, the loss criterion, the optimizer, and the learning rate scheduler.


# create gradient scaler for mixed precision
scaler = GradScaler()

# training length
epochs = 8

# loss function
criterion = nn.CrossEntropyLoss()

# optimizer: Adam w/ Weight Decay Fix
# set to optimizer_grouped_parameters or model.parameters()
optimizer = AdamW(model.parameters(),
                  lr=2e-5)


# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

2.7 Train

Now we are ready to train our model.

# create training result storage
training_stats = []
valid_stats = []
best_valid_loss = float('inf')

# for each epoch
for epoch in range(epochs):
    # train
    train(model, train_dataloader, optimizer)
    # validate
    validating(model, valid_dataloader)
    # check validation loss
    if valid_stats[epoch]['Val Loss'] < best_valid_loss:
        best_valid_loss = valid_stats[epoch]['Val Loss']
        # save best model for use later
        torch.save(model.state_dict(), 'gpt2-model1.pt')  # torch save
        model_to_save = model.module if hasattr(model, 'module') else model
        model_to_save.save_pretrained('./model_save/gpt2/')  # transformers save
        tokenizer.save_pretrained('./model_save/gpt2/')  # transformers save
## 
## ======== Epoch 1 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     1 | 3.70191 | 0:03:05
## [{'Train Loss': 3.701909479125658}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     1 | 3.56233 | 35.245 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 2 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     2 | 3.62447 | 0:03:08
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     2 | 3.51926 | 33.759 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 3 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     3 | 3.59240 | 0:03:03
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}, {'Train Loss': 3.5924020607545026}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     3 | 3.49755 | 33.034 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}, {'Val Loss': 3.4975468048818934, 'Val PPL.': 33.03431285932696}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 4 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     4 | 3.57474 | 0:03:03
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}, {'Train Loss': 3.5924020607545026}, {'Train Loss': 3.574735596478526}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     4 | 3.48494 | 32.620 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}, {'Val Loss': 3.4975468048818934, 'Val PPL.': 33.03431285932696}, {'Val Loss': 3.4849378885389704, 'Val PPL.': 32.620400946173945}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 5 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     5 | 3.56358 | 0:03:06
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}, {'Train Loss': 3.5924020607545026}, {'Train Loss': 3.574735596478526}, {'Train Loss': 3.5635797964466462}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     5 | 3.47629 | 32.340 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}, {'Val Loss': 3.4975468048818934, 'Val PPL.': 33.03431285932696}, {'Val Loss': 3.4849378885389704, 'Val PPL.': 32.620400946173945}, {'Val Loss': 3.476289682671926, 'Val PPL.': 32.33950935815459}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 6 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     6 | 3.55630 | 0:03:09
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}, {'Train Loss': 3.5924020607545026}, {'Train Loss': 3.574735596478526}, {'Train Loss': 3.5635797964466462}, {'Train Loss': 3.5563041921535983}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     6 | 3.47130 | 32.179 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}, {'Val Loss': 3.4975468048818934, 'Val PPL.': 33.03431285932696}, {'Val Loss': 3.4849378885389704, 'Val PPL.': 32.620400946173945}, {'Val Loss': 3.476289682671926, 'Val PPL.': 32.33950935815459}, {'Val Loss': 3.4713046940729075, 'Val PPL.': 32.17869842605189}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 7 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     7 | 3.55258 | 0:03:08
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}, {'Train Loss': 3.5924020607545026}, {'Train Loss': 3.574735596478526}, {'Train Loss': 3.5635797964466462}, {'Train Loss': 3.5563041921535983}, {'Train Loss': 3.5525805800882746}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     7 | 3.46844 | 32.087 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}, {'Val Loss': 3.4975468048818934, 'Val PPL.': 33.03431285932696}, {'Val Loss': 3.4849378885389704, 'Val PPL.': 32.620400946173945}, {'Val Loss': 3.476289682671926, 'Val PPL.': 32.33950935815459}, {'Val Loss': 3.4713046940729075, 'Val PPL.': 32.17869842605189}, {'Val Loss': 3.4684418701327866, 'Val PPL.': 32.086708216550335}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## ======== Epoch 8 / 8 ========
## Training...
##   Batch    40  of  1,102.
##   Batch    80  of  1,102.
##   Batch   120  of  1,102.
##   Batch   160  of  1,102.
##   Batch   200  of  1,102.
##   Batch   240  of  1,102.
##   Batch   280  of  1,102.
##   Batch   320  of  1,102.
##   Batch   360  of  1,102.
##   Batch   400  of  1,102.
##   Batch   440  of  1,102.
##   Batch   480  of  1,102.
##   Batch   520  of  1,102.
##   Batch   560  of  1,102.
##   Batch   600  of  1,102.
##   Batch   640  of  1,102.
##   Batch   680  of  1,102.
##   Batch   720  of  1,102.
##   Batch   760  of  1,102.
##   Batch   800  of  1,102.
##   Batch   840  of  1,102.
##   Batch   880  of  1,102.
##   Batch   920  of  1,102.
##   Batch   960  of  1,102.
##   Batch 1,000  of  1,102.
##   Batch 1,040  of  1,102.
##   Batch 1,080  of  1,102.
## 
## summary results
## epoch | trn loss | trn time 
##     8 | 3.54971 | 0:03:07
## [{'Train Loss': 3.701909479125658}, {'Train Loss': 3.624465454944465}, {'Train Loss': 3.5924020607545026}, {'Train Loss': 3.574735596478526}, {'Train Loss': 3.5635797964466462}, {'Train Loss': 3.5563041921535983}, {'Train Loss': 3.5525805800882746}, {'Train Loss': 3.5497063517354146}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     8 | 3.46751 | 32.057 | 0:00:19
## [{'Val Loss': 3.5623276340030827, 'Val PPL.': 35.24513952738014}, {'Val Loss': 3.5192573451641325, 'Val PPL.': 33.759347609035665}, {'Val Loss': 3.4975468048818934, 'Val PPL.': 33.03431285932696}, {'Val Loss': 3.4849378885389704, 'Val PPL.': 32.620400946173945}, {'Val Loss': 3.476289682671926, 'Val PPL.': 32.33950935815459}, {'Val Loss': 3.4713046940729075, 'Val PPL.': 32.17869842605189}, {'Val Loss': 3.4684418701327866, 'Val PPL.': 32.086708216550335}, {'Val Loss': 3.4675112727849458, 'Val PPL.': 32.05686230040219}]
## ('./model_save/gpt2/vocab.json', './model_save/gpt2/merges.txt', './model_save/gpt2/special_tokens_map.json', './model_save/gpt2/added_tokens.json')
## 
## C:\Users\Andrew\Anaconda3\envs\my_ml\lib\site-packages\torch\optim\lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
##   "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

2.8 Beam Search: Text Generation

While there are a couple of different ways to generate predictions, I found that beam search (here combined with sampling) provided the best results. There are several parameters you should experiment with to get good generations; after the list, I also sketch how to reload the best saved checkpoint before generating:

  1. temperature – very low values, roughly below 0.5, tended to produce nearly unintelligible text, while values above the default of 1.0 seemed better. This value rescales the next-token probability distribution: higher values flatten it, lower values sharpen it.

  2. top_k – values below the default of 50 seemed to work better. top_k is the number of highest-probability vocabulary tokens kept for top-k filtering.

  3. top_p – values below the default of 1.0 seemed to work better. top_p is the cumulative-probability cutoff for nucleus sampling: only the smallest set of highest-probability tokens whose probabilities sum to at least top_p is kept.

  4. num_beams – open for experimentation. num_beams is the number of candidate sequences (beams) maintained during the search; the sequences with the highest overall probability are returned.

  5. num_return_sequences – how many generated sequences you want returned.
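
Since the best-performing checkpoint was saved during training with save_pretrained, you may want to reload it before generating rather than rely on whatever state the model is in after the final epoch. A minimal sketch, assuming the ./model_save/gpt2/ directory created above:

# optionally reload the best saved checkpoint and its tokenizer before generating
model = GPT2LMHeadModel.from_pretrained('./model_save/gpt2/').cuda()
tokenizer = GPT2Tokenizer.from_pretrained('./model_save/gpt2/')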

model.eval();
## GPT2LMHeadModel(
##   (transformer): GPT2Model(
##     (wte): Embedding(50257, 768)
##     (wpe): Embedding(1024, 768)
##     (drop): Dropout(p=0.1, inplace=False)
##     (h): ModuleList(
##       (0): Block(
##         (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
##         (attn): Attention(
##           (c_attn): Conv1D()
##           (c_proj): Conv1D()
##           (attn_dropout): Dropout(p=0.1, inplace=False)
##           (resid_dropout): Dropout(p=0.1, inplace=False)
##         )
##         (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
##         (mlp): MLP(
##           (c_fc): Conv1D()
##           (c_proj): Conv1D()
##           (dropout): Dropout(p=0.1, inplace=False)
##         )
##       )
##       ... blocks (1) through (5) are identical in structure to block (0) ...
##     )
##     (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
##   )
##   (lm_head): Linear(in_features=768, out_features=50257, bias=False)
## )
text = "The Afghan National Army reported"
ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0).cuda()
generated_ids = model.generate(
                        input_ids=ids,  # input
                        max_length=45,  # default 20
                        min_length=0,  # default 0
                        do_sample=True,  # don't use greedy decoding
                        early_stopping=False,  # if True, stop once num_beams finished sequences exist
                        temperature=2.45,  # default 1.0
                        top_k=45,  # default 50
                        top_p=0.7,  # default 1.0
                        repetition_penalty=2.0,  # rep. penalty
                        num_beams=6,
                        num_return_sequences=2,  # number of independently generated sequences to return
                        bos_token_id=tokenizer.bos_token_id
                        )
## Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
results = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]

for i in results:
    print(i, end='\n \n')
## The Afghan National Army reported that a mine explosion on the Kabul Ghouta district, at around 9:00am on Saturday night, killed at least 17 people and destroyed several properties. The blast took place as it was being
##  
## The Afghan National Army reported a bomb blast that killed at least six Afghan soldiers in Kabul yesterday morning, but the exact number is not known. However, officials of the Taliban have denied the attack and said no such incident took place
## 
# beam search
text = "In Helmand Province, the Taliban"
ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0).cuda()
generated_ids = model.generate(
                        input_ids=ids,  # input
                        max_length=45,  # default 20
                        min_length=0,  # default 0
                        do_sample=True,  # don't use greedy decoding
                        early_stopping=False,  # if True, stop once num_beams finished sequences exist
                        temperature=2.45,  # default 1.0
                        top_k=45,  # default 50
                        top_p=0.7,  # default 1.0
                        repetition_penalty=2.0,  # rep. penalty
                        num_beams=6,
                        num_return_sequences=2,  # number of independently generated sequences to return
                        bos_token_id=tokenizer.bos_token_id
                        )
## Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
results = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]

for i in results:
    print(i, end='\n \n')
## In Helmand Province, the Taliban have killed over 40 soldiers and more than 20 others in a brutal attack against security forces. On Monday, Afghan officials announced they had killed at least 22 policemen and dozens of other people in a
##  
## In Helmand Province, the Taliban and its supporters are fighting an Islamic-inspired attack in Kabul this month. (AP Photo/Umm Ismail Aliuddin)
## 

3 Sources