Summarization: T5

Andrew Fogarty

7/18/2020

# load python
library(reticulate)
use_condaenv("my_ml")
# load packages
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import (Dataset, TensorDataset, DataLoader, random_split,
                              RandomSampler, SequentialSampler)
from collections import Counter
from transformers import T5Tokenizer, T5ForConditionalGeneration, AdamW
import time, os, datetime, random, re
from transformers import get_linear_schedule_with_warmup
from torch.cuda.amp import autocast, GradScaler
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn as nn
import nlp

torch.cuda.amp.autocast(enabled=True)
## <torch.cuda.amp.autocast_mode.autocast object at 0x0000000032834FC8>
SEED = 15
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
## <torch._C.Generator object at 0x000000002050A850>
torch.backends.cudnn.deterministic = True

# tell pytorch to use cuda
device = torch.device("cuda")

1 Introduction

There are two types of summaries: (1) abstractive, which restates the source in new words, and (2) extractive, which stitches a summary together from spans of the existing text. Humans summarize mostly abstractively, while NLP systems are mostly extractive. In this guide we use T5, a pre-trained encoder-decoder Transformer whose base variant is roughly twice the size of BERT-base. T5, devised by Google, is an important advance among Transformer models because it achieves near human-level performance on a variety of benchmarks such as GLUE and SQuAD.

The guide proceeds by (1) preparing the data for text summarization with T5-small, a scaled-down version of T5-base, and (2) training the model in PyTorch.

1.1 Data Preparation

Some unique pre-processing is required when using T5 for summarization. Specifically, we need to prepend “summarize:” to all of the text that is to be summarized, and we need to shift our summaries rightward by one token, which we do by adding “<pad>” to the beginning of each summary. T5’s tokenizer in the transformers library handles the details from there.

# prepare and load data
def prepare_df(pkl_location):
    # read pkl as pandas
    df = pd.read_pickle(pkl_location)
    # just keep us/kabul labels
    df = df.loc[(df['target'] == 'US') | (df['target'] == 'Kabul')]
    # mask DV to recode
    us = df['target'] == 'US'
    kabul = df['target'] == 'Kabul'
    # apply mask
    df.loc[us, 'target'] = 1
    df.loc[kabul, 'target'] = 0
    # reset index
    df = df.reset_index(drop=True)
    return df


# load df
df = prepare_df('C:\\Users\\Andrew\\Desktop\\df.pkl')


# prepare data
def clean_df(df):
    # strip dash but keep a space
    df['body'] = df['body'].str.replace('-', ' ')
    # lower case the data
    df['body'] = df['body'].apply(lambda x: x.lower())
    # remove excess spaces near punctuation
    df['body'] = df['body'].apply(lambda x: re.sub(r'\s([?.!"](?:\s|$))', r'\1', x))
    # generate a word count for body
    df['word_count'] = df['body'].apply(lambda x: len(x.split()))
    # generate a word count for summary
    df['word_count_summary'] = df['title_osc'].apply(lambda x: len(x.split()))
    # remove excess white spaces
    df['body'] = df['body'].apply(lambda x: " ".join(x.split()))
    # lower case to summary
    df['title_osc'] = df['title_osc'].apply(lambda x: x.lower())
    # add summarize akin to T5 setup
    df['body'] = 'summarize: ' + df['body']
    # add pad token to summaries
    df['title_osc'] = '<pad>' + df['title_osc']
    # add " </s>" to end of body
    df['body'] = df['body'] + " </s>"
    # add " </s>" to end of review
    df['title_osc'] = df['title_osc'] + " </s>"
    return df


# clean df
df = clean_df(df)

1.2 Instantiate Tokenizer

Next, we instantiate the T5 tokenizer from transformers and check some special token IDs.
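A minimal sketch of this step, assuming the t5-small checkpoint, looks like the following; the IDs printed below correspond to T5’s </s>, <unk>, and <pad> tokens.

# instantiate the pre-trained T5 tokenizer (t5-small checkpoint)
tokenizer = T5Tokenizer.from_pretrained('t5-small')
# inspect the special token ids: </s>, <unk>, and <pad>
print(tokenizer.eos_token_id)
print(tokenizer.unk_token_id)
print(tokenizer.pad_token_id)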

## 1
## 2
## 0

1.3 Tokenize the Corpus

Then, we proceed to tokenize our corpus as usual. Notice that we effectively do this process twice: once for the corpus and once for the summaries.
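A sketch of this step follows; the len_tokens column matches the summary-length statistics printed below, while the max_length values passed to batch_encode_plus are assumptions informed by those statistics.

# count tokens per summary to help choose a sensible max_length
df['len_tokens'] = df['title_osc'].apply(lambda x: len(tokenizer.encode(x)))
print(df['len_tokens'].mean())
print(df['len_tokens'].median())
print(df['len_tokens'].max())
print(df[['len_tokens']].quantile(0.99))

# tokenize the corpus and the summaries separately; the max_length values
# here are assumptions informed by the statistics above
encoded_body = tokenizer.batch_encode_plus(df['body'].tolist(),
                                           max_length=512,
                                           pad_to_max_length=True,
                                           return_tensors='pt')
encoded_summary = tokenizer.batch_encode_plus(df['title_osc'].tolist(),
                                              max_length=30,
                                              pad_to_max_length=True,
                                              return_tensors='pt')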

## 19.956993529118964
## 20.0
## 44
## len_tokens    30.0
## Name: 0.99, dtype: float64

1.4 Prepare and Split Data

Next, we split our data into train, validation, and test sets.
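A sketch of the split follows, assuming an 80/10/10 partition and the encoded tensors from the previous step.

# bundle the four tensors that the training loop unpacks later
dataset = TensorDataset(encoded_body['input_ids'],
                        encoded_body['attention_mask'],
                        encoded_summary['input_ids'],
                        encoded_summary['attention_mask'])

# split into train / validation / test (the 80/10/10 proportions are illustrative)
train_size = int(0.8 * len(dataset))
valid_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - valid_size
train_dataset, valid_dataset, test_dataset = random_split(
    dataset, [train_size, valid_size, test_size])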

1.5 Instantiate Training Models

Now we are ready to prepare our training scripts, which follow the other guides closely. T5ForConditionalGeneration asks that we supply four inputs to the model’s forward function: (1) corpus token IDs, (2) corpus attention masks, (3) summary token IDs, and (4) summary attention masks.

def train(model, dataloader, optimizer):

    # capture time
    total_t0 = time.time()

    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch + 1, epochs))
    print('Training...')

    # reset total loss for epoch
    train_total_loss = 0
    total_train_f1 = 0

    # put model into training mode
    model.train()

    # for each batch of training data...
    for step, batch in enumerate(dataloader):

        # progress update every 40 batches.
        if step % 40 == 0 and not step == 0:

            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(dataloader)))

        # Unpack this training batch from our dataloader:
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using
        # the `.cuda()` method.
        #
        # `batch` contains four pytorch tensors:
        #   [0]: input tokens
        #   [1]: attention masks
        #   [2]: summary tokens
        #   [3]: summary masks
        b_input_ids = batch[0].cuda()
        b_input_mask = batch[1].cuda()
        b_summary_ids = batch[2].cuda()
        b_summary_mask = batch[3].cuda()

        # clear previously calculated gradients
        optimizer.zero_grad()

        # runs the forward pass with autocasting.
        with autocast():
            # forward propagation (evaluate model on training batch)
            outputs = model(input_ids=b_input_ids,
                            attention_mask=b_input_mask,
                            labels=b_summary_ids,
                            decoder_attention_mask=b_summary_mask)

            loss, prediction_scores = outputs[:2]

            # sum the training loss over all batches for average loss at end
            # loss is a tensor containing a single value
            train_total_loss += loss.item()

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

        # update the learning rate
        scheduler.step()

    # calculate the average loss over all of the batches
    avg_train_loss = train_total_loss / len(dataloader)

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'Train Loss': avg_train_loss
        }
    )

    # training time end
    training_time = format_time(time.time() - total_t0)

    # print result summaries
    print("")
    print("summary results")
    print("epoch | trn loss | trn time ")
    print(f"{epoch+1:5d} | {avg_train_loss:.5f} | {training_time:}")

    return training_stats


def validating(model, dataloader):

    # capture validation time
    total_t0 = time.time()

    # After the completion of each training epoch, measure our performance on
    # our validation set.
    print("")
    print("Running Validation...")

    # put the model in evaluation mode
    model.eval()

    # track variables
    total_valid_loss = 0

    # evaluate data for one epoch
    for batch in dataloader:

        # Unpack this validation batch from our dataloader:
        # `batch` contains four pytorch tensors:
        #   [0]: input tokens
        #   [1]: attention masks
        #   [2]: summary tokens
        #   [3]: summary masks
        b_input_ids = batch[0].cuda()
        b_input_mask = batch[1].cuda()
        b_summary_ids = batch[2].cuda()
        b_summary_mask = batch[3].cuda()

        # tell pytorch not to bother calculating gradients
        # as it's only necessary for training
        with torch.no_grad():

            # forward propagation (evaluate model on validation batch)
            outputs = model(input_ids=b_input_ids,
                            attention_mask=b_input_mask,
                            labels=b_summary_ids,
                            decoder_attention_mask=b_summary_mask)

            loss, prediction_scores = outputs[:2]

            # sum the validation loss over all batches for average loss at end
            # loss is a tensor containing a single value
            total_valid_loss += loss.item()

    # calculate the average loss over all of the batches.
    global avg_val_loss
    avg_val_loss = total_valid_loss / len(dataloader)

    # Record all statistics from this epoch.
    valid_stats.append(
        {
            'Val Loss': avg_val_loss,
            'Val PPL.': np.exp(avg_val_loss)
        }
    )

    # capture end validation time
    training_time = format_time(time.time() - total_t0)

    # print result summaries
    print("")
    print("summary results")
    print("epoch | val loss | val ppl | val time")
    print(f"{epoch+1:5d} | {avg_val_loss:.5f} | {np.exp(avg_val_loss):.3f} | {training_time:}")

    return valid_stats


def testing(model, dataloader):

    print("")
    print("Running Testing...")

    # capture the test start time
    t0 = time.time()

    # put the model in evaluation mode
    model.eval()

    # track variables
    total_test_loss = 0
    predictions = []
    actuals = []

    # evaluate data for one epoch
    for step, batch in enumerate(dataloader):
        # progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(dataloader), elapsed))

        # Unpack this test batch from our dataloader:
        # `batch` contains four pytorch tensors:
        #   [0]: input tokens
        #   [1]: attention masks
        #   [2]: summary tokens
        #   [3]: summary masks
        b_input_ids = batch[0].cuda()
        b_input_mask = batch[1].cuda()
        b_summary_ids = batch[2].cuda()
        b_summary_mask = batch[3].cuda()

        # tell pytorch not to bother calculating gradients
        # as it's only necessary for training
        with torch.no_grad():

            # forward propagation (evaluate model on test batch)
            outputs = model(input_ids=b_input_ids,
                            attention_mask=b_input_mask,
                            labels=b_summary_ids,
                            decoder_attention_mask=b_summary_mask)

            loss, prediction_scores = outputs[:2]

            total_test_loss += loss.item()

            generated_ids = model.generate(
                    input_ids=b_input_ids,
                    attention_mask=b_input_mask,
                    do_sample=True,  # sample rather than decode greedily
                    temperature=0.8,  # <1 sharpens the next-token distribution
                    top_k=45,  # sample only from the k most likely next tokens
                    top_p=0.9,  # nucleus sampling: smallest token set whose cumulative prob. exceeds p
                    max_length=18,  # slightly longer than the typical summary word count
                    min_length=14,  # don't emit EOS before reaching this length
                    num_beams=1,  # no beam search
                    repetition_penalty=2.5,
                    length_penalty=2.5,
                    early_stopping=False,  # only relevant with beam search
                    use_cache=True,
                    num_return_sequences=1
                    )

            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in b_summary_ids]
            predictions.extend(preds)
            actuals.extend(target)

    # calculate the average loss over all of the batches.
    avg_test_loss = total_test_loss / len(dataloader)

    # Record all statistics from this epoch.
    test_stats.append(
        {
            'Test Loss': avg_test_loss,
            'Test PPL.': np.exp(avg_test_loss),
        }
    )
    global df2
    temp_data = pd.DataFrame({'predicted': predictions, 'actual': actuals})
    df2 = df2.append(temp_data)

    return test_stats
    

# time function
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

Before training, several preparatory objects are instantiated: the model, the data loaders, the optimizer, the learning-rate scheduler, and the gradient scaler.

1.6 Prepare for Training
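A sketch of these objects follows; the t5-small checkpoint and the six epochs match the output below, while the batch size, learning rate, and warm-up settings are illustrative assumptions.

# instantiate the pre-trained model and move it to the GPU
model = T5ForConditionalGeneration.from_pretrained('t5-small')
model.cuda()

# data loaders (batch_size is an assumption)
batch_size = 16
train_dataloader = DataLoader(train_dataset,
                              sampler=RandomSampler(train_dataset),
                              batch_size=batch_size)
valid_dataloader = DataLoader(valid_dataset,
                              sampler=SequentialSampler(valid_dataset),
                              batch_size=batch_size)
test_dataloader = DataLoader(test_dataset,
                             sampler=SequentialSampler(test_dataset),
                             batch_size=batch_size)

# optimizer, linear schedule, and gradient scaler for mixed precision
epochs = 6
optimizer = AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(train_dataloader) * epochs)
scaler = GradScaler()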

## Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-small and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
## You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

1.7 Train and Validate

Finally, we are ready to train. Two containers are created to store the results of each training and validation epoch.
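A sketch of the loop that produces the log below follows; the ./model_save/t5/ directory matches the output, while the t5-model.pt file name and the save-on-improvement criterion are assumptions.

# containers for the per-epoch statistics
training_stats = []
valid_stats = []

# make sure the save directory exists
os.makedirs('./model_save/t5/', exist_ok=True)
best_valid_loss = float('inf')

for epoch in range(epochs):
    # train and validate for one epoch, printing the accumulated statistics
    print(train(model, train_dataloader, optimizer))
    print(validating(model, valid_dataloader))

    # checkpoint whenever validation loss improves (the file name is an assumption)
    if avg_val_loss < best_valid_loss:
        best_valid_loss = avg_val_loss
        torch.save(model.state_dict(), './model_save/t5/t5-model.pt')
        print(tokenizer.save_pretrained('./model_save/t5/'))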

## 
## ======== Epoch 1 / 6 ========
## Training...
##   Batch    40  of    503.
##   Batch    80  of    503.
##   Batch   120  of    503.
##   Batch   160  of    503.
##   Batch   200  of    503.
##   Batch   240  of    503.
##   Batch   280  of    503.
##   Batch   320  of    503.
##   Batch   360  of    503.
##   Batch   400  of    503.
##   Batch   440  of    503.
##   Batch   480  of    503.
## 
## summary results
## epoch | trn loss | trn time 
##     1 | 1.59530 | 0:01:52
## [{'Train Loss': 1.5952966689589483}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     1 | 0.73627 | 2.088 | 0:00:06
## [{'Val Loss': 0.7362715734375848, 'Val PPL.': 2.0881355227329945}]
## ('./model_save/t5/spiece.model', './model_save/t5/special_tokens_map.json', './model_save/t5/added_tokens.json')
## 
## ======== Epoch 2 / 6 ========
## Training...
##   Batch    40  of    503.
##   Batch    80  of    503.
##   Batch   120  of    503.
##   Batch   160  of    503.
##   Batch   200  of    503.
##   Batch   240  of    503.
##   Batch   280  of    503.
##   Batch   320  of    503.
##   Batch   360  of    503.
##   Batch   400  of    503.
##   Batch   440  of    503.
##   Batch   480  of    503.
## 
## summary results
## epoch | trn loss | trn time 
##     2 | 0.82822 | 0:01:53
## [{'Train Loss': 1.5952966689589483}, {'Train Loss': 0.828221511651223}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     2 | 0.65254 | 1.920 | 0:00:06
## [{'Val Loss': 0.7362715734375848, 'Val PPL.': 2.0881355227329945}, {'Val Loss': 0.6525381051358723, 'Val PPL.': 1.9204088481889185}]
## ('./model_save/t5/spiece.model', './model_save/t5/special_tokens_map.json', './model_save/t5/added_tokens.json')
## 
## ======== Epoch 3 / 6 ========
## Training...
##   Batch    40  of    503.
##   Batch    80  of    503.
##   Batch   120  of    503.
##   Batch   160  of    503.
##   Batch   200  of    503.
##   Batch   240  of    503.
##   Batch   280  of    503.
##   Batch   320  of    503.
##   Batch   360  of    503.
##   Batch   400  of    503.
##   Batch   440  of    503.
##   Batch   480  of    503.
## 
## summary results
## epoch | trn loss | trn time 
##     3 | 0.75328 | 0:01:53
## [{'Train Loss': 1.5952966689589483}, {'Train Loss': 0.828221511651223}, {'Train Loss': 0.7532829459808456}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     3 | 0.61503 | 1.850 | 0:00:06
## [{'Val Loss': 0.7362715734375848, 'Val PPL.': 2.0881355227329945}, {'Val Loss': 0.6525381051358723, 'Val PPL.': 1.9204088481889185}, {'Val Loss': 0.6150305460369776, 'Val PPL.': 1.8497131001001426}]
## ('./model_save/t5/spiece.model', './model_save/t5/special_tokens_map.json', './model_save/t5/added_tokens.json')
## 
## ======== Epoch 4 / 6 ========
## Training...
##   Batch    40  of    503.
##   Batch    80  of    503.
##   Batch   120  of    503.
##   Batch   160  of    503.
##   Batch   200  of    503.
##   Batch   240  of    503.
##   Batch   280  of    503.
##   Batch   320  of    503.
##   Batch   360  of    503.
##   Batch   400  of    503.
##   Batch   440  of    503.
##   Batch   480  of    503.
## 
## summary results
## epoch | trn loss | trn time 
##     4 | 0.71392 | 0:01:52
## [{'Train Loss': 1.5952966689589483}, {'Train Loss': 0.828221511651223}, {'Train Loss': 0.7532829459808456}, {'Train Loss': 0.713922281388497}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     4 | 0.59374 | 1.811 | 0:00:06
## [{'Val Loss': 0.7362715734375848, 'Val PPL.': 2.0881355227329945}, {'Val Loss': 0.6525381051358723, 'Val PPL.': 1.9204088481889185}, {'Val Loss': 0.6150305460369776, 'Val PPL.': 1.8497131001001426}, {'Val Loss': 0.5937370728878748, 'Val PPL.': 1.8107426642946385}]
## ('./model_save/t5/spiece.model', './model_save/t5/special_tokens_map.json', './model_save/t5/added_tokens.json')
## 
## ======== Epoch 5 / 6 ========
## Training...
##   Batch    40  of    503.
##   Batch    80  of    503.
##   Batch   120  of    503.
##   Batch   160  of    503.
##   Batch   200  of    503.
##   Batch   240  of    503.
##   Batch   280  of    503.
##   Batch   320  of    503.
##   Batch   360  of    503.
##   Batch   400  of    503.
##   Batch   440  of    503.
##   Batch   480  of    503.
## 
## summary results
## epoch | trn loss | trn time 
##     5 | 0.69332 | 0:01:53
## [{'Train Loss': 1.5952966689589483}, {'Train Loss': 0.828221511651223}, {'Train Loss': 0.7532829459808456}, {'Train Loss': 0.713922281388497}, {'Train Loss': 0.6933188264815519}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     5 | 0.58224 | 1.790 | 0:00:06
## [{'Val Loss': 0.7362715734375848, 'Val PPL.': 2.0881355227329945}, {'Val Loss': 0.6525381051358723, 'Val PPL.': 1.9204088481889185}, {'Val Loss': 0.6150305460369776, 'Val PPL.': 1.8497131001001426}, {'Val Loss': 0.5937370728878748, 'Val PPL.': 1.8107426642946385}, {'Val Loss': 0.5822378590939536, 'Val PPL.': 1.790039808684565}]
## ('./model_save/t5/spiece.model', './model_save/t5/special_tokens_map.json', './model_save/t5/added_tokens.json')
## 
## ======== Epoch 6 / 6 ========
## Training...
##   Batch    40  of    503.
##   Batch    80  of    503.
##   Batch   120  of    503.
##   Batch   160  of    503.
##   Batch   200  of    503.
##   Batch   240  of    503.
##   Batch   280  of    503.
##   Batch   320  of    503.
##   Batch   360  of    503.
##   Batch   400  of    503.
##   Batch   440  of    503.
##   Batch   480  of    503.
## 
## summary results
## epoch | trn loss | trn time 
##     6 | 0.68405 | 0:01:51
## [{'Train Loss': 1.5952966689589483}, {'Train Loss': 0.828221511651223}, {'Train Loss': 0.7532829459808456}, {'Train Loss': 0.713922281388497}, {'Train Loss': 0.6933188264815519}, {'Train Loss': 0.6840471257390843}]
## 
## Running Validation...
## 
## summary results
## epoch | val loss | val ppl | val time
##     6 | 0.57983 | 1.786 | 0:00:06
## [{'Val Loss': 0.7362715734375848, 'Val PPL.': 2.0881355227329945}, {'Val Loss': 0.6525381051358723, 'Val PPL.': 1.9204088481889185}, {'Val Loss': 0.6150305460369776, 'Val PPL.': 1.8497131001001426}, {'Val Loss': 0.5937370728878748, 'Val PPL.': 1.8107426642946385}, {'Val Loss': 0.5822378590939536, 'Val PPL.': 1.790039808684565}, {'Val Loss': 0.5798299776183234, 'Val PPL.': 1.7857347900558997}]
## ('./model_save/t5/spiece.model', './model_save/t5/special_tokens_map.json', './model_save/t5/added_tokens.json')
## 
## C:\Users\Andrew\Anaconda3\envs\my_ml\lib\site-packages\torch\optim\lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
##   "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Now we can present our training results nicely in a data frame.
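One way to assemble that table from the two statistics containers is sketched below.

# combine the per-epoch training and validation statistics
df_stats = pd.DataFrame(training_stats).join(pd.DataFrame(valid_stats))
# index the table by epoch and round for display
df_stats.index = range(1, len(df_stats) + 1)
df_stats.index.name = 'Epoch'
print(df_stats.round(3))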

##        Train Loss  Val Loss  Val PPL.
## Epoch                                
## 1           1.595     0.736     2.088
## 2           0.828     0.653     1.920
## 3           0.753     0.615     1.850
## 4           0.714     0.594     1.811
## 5           0.693     0.582     1.790
## 6           0.684     0.580     1.786

1.8 Test and Generate Summaries

While checking the loss on our held-out test data, we also generate predicted summaries to compare against the actual summaries so that we can evaluate them with ROUGE metrics.
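A sketch of this step follows; it reloads the checkpoint saved earlier (the file name is the assumption made above) and supplies the df2 and test_stats globals that testing() expects.

# containers the testing() function writes into
test_stats = []
df2 = pd.DataFrame(columns=['predicted', 'actual'])

# reload the best checkpoint and evaluate on the held-out test set
print(model.load_state_dict(torch.load('./model_save/t5/t5-model.pt')))
print(testing(model, test_dataloader))
print(df2.head())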

## <All keys matched successfully>
## 
## Running Testing...
##   Batch    40  of     63.    Elapsed: 0:00:56.
## [{'Test Loss': 0.6018462706179846, 'Test PPL.': 1.8254860322482742}]
##                                            predicted                                             actual
## 0  afghan taliban commentary says us never indica...  afghan taliban commentary no other options but...
## 1  taliban say six soldiers killed post captured ...  taliban say posts captured six soldiers killed...
## 2  taliban say two soldiers killed in blasts in a...  taliban report two army soldiers killed in bla...
## 3  taliban say four soldiers killed in attack in ...  taliban say four afghan soldiers killed in wes...
## 4  taliban say two soldiers killed three injured ...  taliban say two soldiers killed three injured ...

1.9 ROUGE Metrics

Summarization tasks are hard to evaluate and tune for because there are many criteria:

  1. Information satisfaction (answer query)
  2. Coverage (summarize corpus)
  3. Fluency (sounds natural)
  4. Concision (not redundant, no fluff)

In BLEU, we care about precision: how many n-grams in the candidate appear in the reference translation? In ROUGE, we care about recall: how many n-grams in the reference summaries appear in the candidate?

## <pandas.io.formats.style.Styler object at 0x0000000009DC7B48>
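The scores below can be computed with the nlp library imported at the top; this sketch assumes the metric returns the rouge_score library's aggregate objects, and the exact compute() signature may differ across versions.

# load the ROUGE metric and score the generated summaries against the references
rouge = nlp.load_metric('rouge')
scores = rouge.compute(predictions=df2['predicted'].tolist(),
                       references=df2['actual'].tolist())

# keep the mid (bootstrap median) estimate of precision, recall, and F1
results = pd.DataFrame({name: {'P': agg.mid.precision,
                               'R': agg.mid.recall,
                               'F': agg.mid.fmeasure}
                        for name, agg in scores.items()
                        if name in ['rouge1', 'rouge2', 'rougeL']})
print(results.round(3))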

The table below reports precision (P), recall (R), and F1 (F) for ROUGE-1, ROUGE-2, and ROUGE-L:

##    rouge1  rouge2  rougeL
## P   0.532   0.309   0.507
## R   0.492   0.289   0.469
## F   0.506   0.296   0.483

2 Sources