Regularization

Andrew Fogarty

10/06/2020

# load python
library(reticulate)
use_condaenv("my_ml")
# load packages
import sys
sys.path.append("C:/Users/Andrew/Desktop/Projects/Deep Learning/utils")  # this is the folder with py files
from tools import AverageMeter, ProgressBar #scriptName without .py extension; import each class
from radam import RAdam
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import get_linear_schedule_with_warmup, AdamW
from torch.utils.data import TensorDataset, random_split, DataLoader, RandomSampler, SequentialSampler
import time, datetime, random, re, os
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from torch.cuda.amp import autocast, GradScaler
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, Subset
from sklearn.preprocessing import LabelEncoder
from torchvision import transforms

# set seed and gpu requirements
SEED = 15
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# (mixed precision is enabled later via the autocast() context manager and GradScaler)

# set gpu/cpu
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

1 Introduction

We can think of regularization as a general means of reducing overfitting. In practice, regularizing techniques aim to reduce the model's capacity and/or the variance of its predictions. While the best way to reduce overfitting is to acquire more data, data augmentation (e.g., for images: random rotations, crops, etc.) is the next best thing. Barring those possibilities, we can reduce the capacity of the model through regularization methods.
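
As a quick illustration of image augmentation (the transforms and parameter values here are arbitrary and are not used in the MNIST example later in this post), a pipeline built from the torchvision transforms module imported above might look like:

# illustrative augmentation pipeline for PIL images
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),      # rotate randomly within +/- 10 degrees
    transforms.RandomCrop(size=28, padding=2),  # pad, then randomly crop back to 28x28
    transforms.ToTensor()                       # convert to a float tensor scaled to [0, 1]
])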

1.1 Reducing Model Capacity

We can reduce the capacity of a model in the following ways:

  1. Smaller architecture: fewer hidden layers and fewer neurons per layer

  2. Smaller weights: early stopping (not very common anymore) and L1/L2 norm penalties

  3. Adding Noise: dropout

1.1.1 L1/L2 Regularization

These methods shrink the weights and can be thought of as penalties against model complexity.

  • L1 Regularization – encourages sparsity, but it does not work well in practice because it is not smooth and is therefore hard to optimize.

  • L2 Regularization – shrinks the weights. A large penalty yields high bias, while a small penalty leaves high variance; we want to aim for something in between. The penalized cost functions are sketched below.
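
For reference, given an unregularized loss \(\mathcal{L}_0\), weights \(w_j\), \(n\) training observations, and regularization strength \(\lambda\) (the exact scaling constant is a matter of convention), the penalized cost functions are:

\[\mathcal{L}_{L1} = \mathcal{L}_0 + \frac{\lambda}{n} \sum_j |w_j| \qquad \qquad \mathcal{L}_{L2} = \mathcal{L}_0 + \frac{\lambda}{2n} \sum_j w_j^2\]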

1.1.2 L2 Regularization in PyTorch

In this section, we discuss how to apply L2 regularization both through PyTorch and manually. To apply L2 regularization through PyTorch, we simply set weight_decay in the optimizer instance. To apply it manually, we add the penalty term to our cost function before calling backward propagation and stepping the optimizer (as done in the MNIST example in section 1.2.1).

It is important to note that when you use weight_decay, PyTorch will also regularize the intercept (bias) terms. I do not think we should regularize the bias term: in linear regression, we include an intercept precisely because we want to consider all possible lines; without one, a line can only pass through the origin, which needlessly restricts the hypothesis space of our model.
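
As a minimal sketch of both points (the SGD optimizer, learning rate, and decay value below are illustrative, and model can be any nn.Module, such as the FF_NN defined later in this post): weight_decay shrinks every parameter, while parameter groups let us exempt the biases.

# weight_decay regularizes every parameter, biases included
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# to shrink only the weights, split the parameters into two groups
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if 'bias' in name else decay).append(p)

# per-group weight_decay: L2 on the weights, none on the biases
optimizer = torch.optim.SGD([{'params': decay, 'weight_decay': 1e-4},
                             {'params': no_decay, 'weight_decay': 0.0}],
                            lr=0.01)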

1.1.5 Dropout

Dropout randomly drops neurons during training, relying on Bernoulli sampling to choose which neurons to zero out. Dropout works very well because:

  1. Our model will learn not to rely heavily on any particular connection

  2. Our model will consider more connections

  3. The weight values will be more spread out, and often smaller, much like the effect of an L2 norm penalty

In practice, we can use different dropout probabilities for different layers, and we should probably assign dropout probabilities proportional to the number of neurons in a layer. PyTorch uses inverted dropout, which scales the surviving activation values by the factor \(\frac{1}{1-p}\) during training rather than at inference. Thus, we need to set model.train() and model.eval() appropriately in our training and validation functions; a small check of this behavior follows.
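
As a minimal check (the dropout probability and input tensor are illustrative), a dropout layer in training mode zeros some activations and scales the survivors by \(\frac{1}{1-p}\), while in evaluation mode it passes inputs through unchanged:

# inverted dropout: scaling happens at training time, not at inference
drop = torch.nn.Dropout(p=0.25)
x = torch.ones(8)

drop.train()    # training mode: surviving entries scaled by 1/(1 - 0.25) ~= 1.33
print(drop(x))  # a mix of zeros and ~1.33s

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # all ones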

1.1.6 Dropout Tips

In practice, do not use dropout if the model does not overfit. If your model does not overfit, however, the recommended approach is to increase the model's capacity until it does, so that you can then use dropout to retain the benefits of a larger-capacity model while constraining its ability to overfit.

1.2 Regularized Models in Practice

In this section, we will apply the regularized functions on the MNIST data set.

1.2.1 L2 Norm


# create Dataset
class CSVDataset(Dataset):
    """MNIST dataset."""

    def __init__(self, csv_file, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        # initialize
        self.data_frame = pd.read_csv(csv_file)
        # all columns but the last
        self.features = self.data_frame[self.data_frame.columns[:-1]]
        # the last column
        self.target = self.data_frame[self.data_frame.columns[-1]]
        # initialize the transform if specified
        self.transform = transform

    # get length of df
    def __len__(self):
        return len(self.data_frame)

    # get sample target
    def __get_target__(self):
        return self.target

    # get df filtered by indices
    def __get_values__(self, indices):
        return self.data_frame.iloc[indices]

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # pull a sample in a dict
        sample = {'features': torch.tensor(self.features.iloc[idx].values),
                  'target': torch.tensor(self.target.iloc[idx]),
                  'idx': torch.tensor(idx)}

        if self.transform:
            sample = self.transform(sample)

        return sample


class Pixel_Normalize():

    # retrieve sample and unpack it
    def __call__(self, sample):
        features, target, idx = (sample['features'],
                                 sample['target'],
                                 sample['idx'])

        # normalize each pixel to [0, 1]
        normalized_pixels = torch.true_divide(features, 255)

        # yield another dict
        return {'features': normalized_pixels,
                'target': target,
                'idx': idx}


# instantiate the lazy data set
csv_dataset = CSVDataset(csv_file='https://datahub.io/machine-learning/mnist_784/r/mnist_784.csv',
                         transform=Pixel_Normalize())

# set train, valid, and test sizes
train_size = int(0.8 * len(csv_dataset))
valid_size = int(0.1 * len(csv_dataset))
test_size = len(csv_dataset) - train_size - valid_size

# use random_split to create three data sets
train_ds, valid_ds, test_ds = torch.utils.data.random_split(csv_dataset, [train_size, valid_size, test_size])

# pull the first sample to confirm the Dataset returns what we expect
first_sample = train_ds[0]


# check the distribution of dependent variable; some imbalance
csv_dataset.__get_target__().value_counts()
## 1    7877
## 7    7293
## 3    7141
## 2    6990
## 9    6958
## 0    6903
## 6    6876
## 8    6825
## 4    6824
## 5    6313
## Name: class, dtype: int64


# prepare weighted sampling for imbalanced classification
def create_sampler(train_ds, csv_dataset):
    # get indices from the train split
    train_indices = train_ds.indices
    # generate class distributions [y1, y2, etc...]
    bin_count = np.bincount(csv_dataset.__get_target__()[train_indices])
    # weight gen
    weight = 1. / bin_count.astype(np.float32)
    # produce weights for each observation in the data set
    samples_weight = torch.tensor([weight[t] for t in csv_dataset.__get_target__()[train_indices]])
    # prepare sampler
    sampler = torch.utils.data.WeightedRandomSampler(weights=samples_weight,
                                                     num_samples=len(samples_weight),
                                                     replacement=True)
    return sampler


# create sampler for the training ds
train_sampler = create_sampler(train_ds, csv_dataset)

# create NN
# subclassing nn.Module gives us parameter tracking and backward propagation
class FF_NN(torch.nn.Module):
    def __init__(self, num_features, num_classes, drop_prob=0.0):
        super(FF_NN, self).__init__()
        # initialize 3 layers
        # first hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        # second hidden layer
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        # output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)
        # dropout layer; a no-op when drop_prob=0.0 (used in the dropout section below)
        self.dropout = torch.nn.Dropout(p=drop_prob)

    # define how and in what order model parameters should be used in forward prop.
    def forward(self, x):
        # run inputs through first layer
        out = self.linear_1(x)
        # apply relu, then dropout
        out = self.dropout(F.relu(out))
        # run inputs through second layer
        out = self.linear_2(out)
        # apply relu, then dropout
        out = self.dropout(F.relu(out))
        # run inputs through final classification layer
        logits = self.linear_out(out)
        # log-probabilities; argmax over these matches argmax over probabilities
        probs = F.log_softmax(logits, dim=1)
        return logits, probs


# load the NN model
num_features = 784
num_hidden_1 = 128
num_hidden_2 = 256
num_classes = 10
model = FF_NN(num_features=num_features, num_classes=num_classes).to(DEVICE)


# optimizer
optimizer = RAdam(model.parameters(), lr=0.1)

# set number of epochs
epochs = 4


# create DataLoaders (weighted sampler for the training set only)
train_dataloader = DataLoader(train_ds,
                              batch_size=100,
                              sampler=train_sampler,
                              shuffle=False)

valid_dataloader = DataLoader(valid_ds,
                              batch_size=100,
                              shuffle=True)

test_dataloader = DataLoader(test_ds,
                              batch_size=100,
                              shuffle=True)

# set LR scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=0.01,
                                                total_steps=len(train_dataloader)*epochs)

# create gradient scaler for mixed precision
scaler = GradScaler()

# train function
def train(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Training')
    train_loss = AverageMeter()
    model.train()
    for batch_idx, batch in enumerate(dataloader):
        b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
        optimizer.zero_grad()
        with autocast():
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target)
            # regularize loss -- but not the intercept
            LAMBDA, L2 = 0.5, 0.
            for name, p in model.named_parameters():
                if 'weight' in name:
                    L2 = L2 + (p**2).sum()
            loss = loss + 2./b_target.size(0) * LAMBDA * L2
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        #pbar(step=batch_idx, info={'loss': loss.item()})
        train_loss.update(loss.item(), n=1)
    return {'loss': train_loss.avg}


# valid/test function
def test(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Testing')
    valid_loss = AverageMeter()
    valid_acc = AverageMeter()
    valid_f1 = AverageMeter()
    model.eval()
    count = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target).item()
            pred = probs.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct = pred.eq(b_target.view_as(pred)).sum().item()
            # f1_score expects (y_true, y_pred)
            f1 = f1_score(b_target.to("cpu").numpy(), pred.view(-1).to("cpu").numpy(), average='macro')
            valid_f1.update(f1, n=b_features.size(0))
            valid_loss.update(loss, n=b_features.size(0))
            valid_acc.update(correct, n=1)
            count += b_features.size(0)
            #pbar(step=batch_idx)
    return {'valid_loss': valid_loss.avg,
            'valid_acc': valid_acc.sum /count,
            'valid_f1': valid_f1.avg}

# training
for epoch in range(1, epochs + 1):
    train_log = train(train_dataloader)
    valid_log = test(valid_dataloader)
    logs = dict(train_log, **valid_log)
    show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
    print(show_info)
## 
## Epoch: 1 -  loss: 1.5542 - valid_loss: 0.4248 - valid_acc: 0.8790 - valid_f1: 0.8705 
## 
## Epoch: 2 -  loss: 0.8890 - valid_loss: 0.3804 - valid_acc: 0.8974 - valid_f1: 0.8915 
## 
## Epoch: 3 -  loss: 0.8092 - valid_loss: 0.3693 - valid_acc: 0.9080 - valid_f1: 0.9015 
## 
## Epoch: 4 -  loss: 0.7494 - valid_loss: 0.3302 - valid_acc: 0.9173 - valid_f1: 0.9116 
## 

# evaluate on the held-out test set
test(test_dataloader)
## {'valid_loss': 0.3280821489436286, 'valid_acc': 0.918, 'valid_f1': 0.9128352114054326}

1.2.2 Dropout

# load the NN model
num_features = 784
num_hidden_1 = 128
num_hidden_2 = 256
num_classes = 10
dropout_prob = 0.2
model = FF_NN(num_features=num_features, num_classes=num_classes, drop_prob=dropout_prob).to(DEVICE)


# optimizer
optimizer = RAdam(model.parameters(), lr=0.1)

# set number of epochs
epochs = 4


# create DataLoaders (weighted sampler for the training set only)
train_dataloader = DataLoader(train_ds,
                              batch_size=100,
                              sampler=train_sampler,
                              shuffle=False)

valid_dataloader = DataLoader(valid_ds,
                              batch_size=100,
                              shuffle=True)

test_dataloader = DataLoader(test_ds,
                              batch_size=100,
                              shuffle=True)

# set LR scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=0.01,
                                                total_steps=len(train_dataloader)*epochs)

# create gradient scaler for mixed precision
scaler = GradScaler()

# train function
def train(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Training')
    train_loss = AverageMeter()
    model.train()
    for batch_idx, batch in enumerate(dataloader):
        b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
        optimizer.zero_grad()
        with autocast():
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        #pbar(step=batch_idx, info={'loss': loss.item()})
        train_loss.update(loss.item(), n=1)
    return {'loss': train_loss.avg}


# valid/test function
def test(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Testing')
    valid_loss = AverageMeter()
    valid_acc = AverageMeter()
    valid_f1 = AverageMeter()
    model.eval()
    count = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target).item()
            pred = probs.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct = pred.eq(b_target.view_as(pred)).sum().item()
            # f1_score expects (y_true, y_pred)
            f1 = f1_score(b_target.to("cpu").numpy(), pred.view(-1).to("cpu").numpy(), average='macro')
            valid_f1.update(f1, n=b_features.size(0))
            valid_loss.update(loss, n=b_features.size(0))
            valid_acc.update(correct, n=1)
            count += b_features.size(0)
            #pbar(step=batch_idx)
    return {'valid_loss': valid_loss.avg,
            'valid_acc': valid_acc.sum /count,
            'valid_f1': valid_f1.avg}

# training
for epoch in range(1, epochs + 1):
    train_log = train(train_dataloader)
    valid_log = test(valid_dataloader)
    logs = dict(train_log, **valid_log)
    show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
    print(show_info)
## 
## Epoch: 1 -  loss: 0.7804 - valid_loss: 0.2648 - valid_acc: 0.9176 - valid_f1: 0.9117 
## 
## Epoch: 2 -  loss: 0.1970 - valid_loss: 0.1802 - valid_acc: 0.9483 - valid_f1: 0.9461 
## 
## Epoch: 3 -  loss: 0.1286 - valid_loss: 0.1242 - valid_acc: 0.9643 - valid_f1: 0.9607 
## 
## Epoch: 4 -  loss: 0.0768 - valid_loss: 0.1024 - valid_acc: 0.9706 - valid_f1: 0.9691

# evaluate on the held-out test set
test(test_dataloader)
## {'valid_loss': 0.10206319407692978, 'valid_acc': 0.9702857142857143, 'valid_f1': 0.9684667952447602}

2 Sources