Regularization

Andrew Fogarty

10/06/2020

# load python
library(reticulate)
use_condaenv("my_ml")
# load packages
import sys
sys.path.append("C:/Users/Andrew/Desktop/Projects/Deep Learning/utils")  # this is the folder with py files
from tools import AverageMeter, ProgressBar #scriptName without .py extension; import each class
from radam import RAdam
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import get_linear_schedule_with_warmup, AdamW
from torch.utils.data import TensorDataset, random_split, DataLoader, RandomSampler, SequentialSampler
import time, datetime, random, re, os
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from torch.cuda.amp import autocast, GradScaler
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, Subset
from sklearn.preprocessing import LabelEncoder
from torchvision import transforms

# set seed and gpu requirements
SEED = 15
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# (mixed precision is enabled later via the autocast() context manager and GradScaler)

# set gpu/cpu
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

1 Introduction

We can think of regularization as a general means of reducing overfitting. In practice, regularizing techniques aim to reduce the model's capacity and/or the variance of its predictions. While the best way to reduce overfitting is to acquire more data, data augmentation (e.g., for images: random rotations, crops, etc.) is the next best thing. Barring those possibilities, we can reduce the capacity of the model through regularization methods.
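
As a quick illustration of image augmentation (the transforms and parameter values here are arbitrary and are not used in the MNIST example later in this post), a pipeline built from the torchvision transforms module imported above might look like:

# illustrative augmentation pipeline for PIL images
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),      # rotate randomly within +/- 10 degrees
    transforms.RandomCrop(size=28, padding=2),  # pad, then randomly crop back to 28x28
    transforms.ToTensor()                       # convert to a float tensor scaled to [0, 1]
])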

1.1 Reducing Model Capacity

We can reduce the capacity of a model in the following ways:

  1. Smaller architecture: fewer hidden layers and fewer neurons per layer

  2. Smaller weights: early stopping (not very common anymore) and L1/L2 norm penalties

  3. Adding Noise: dropout

1.1.1 L1/L2 Regularization

These methods shrink the weights and can be thought of as penalties against model complexity.

  • L1 Regularization – encourages sparsity, but it does not work well in practice because it is not smooth and is therefore hard to optimize.

  • L2 Regularization – shrinks the weights. A large penalty yields high bias, while a small penalty leaves high variance; we want to aim for something in between. The penalized cost functions are sketched below.
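
For reference, given an unregularized loss \(\mathcal{L}_0\), weights \(w_j\), \(n\) training observations, and regularization strength \(\lambda\) (the exact scaling constant is a matter of convention), the penalized cost functions are:

\[\mathcal{L}_{L1} = \mathcal{L}_0 + \frac{\lambda}{n} \sum_j |w_j| \qquad \qquad \mathcal{L}_{L2} = \mathcal{L}_0 + \frac{\lambda}{2n} \sum_j w_j^2\]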

1.1.2 L2 Regularization in PyTorch

In this section, we discuss how to apply L2 regularization both through PyTorch and manually. To apply L2 regularization through PyTorch, we simply set weight_decay in the optimizer instance. To apply it manually, we add the penalty term to our cost function before calling backward propagation and stepping the optimizer (as done in the MNIST example in section 1.2.1).

It is important to note that when you use weight_decay, PyTorch will also regularize the intercept (bias) terms. I do not think we should regularize the bias term: in linear regression, we include an intercept precisely because we want to consider all possible lines; without one, a line can only pass through the origin, which needlessly restricts the hypothesis space of our model.
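
As a minimal sketch of both points (the SGD optimizer, learning rate, and decay value below are illustrative, and model can be any nn.Module, such as the FF_NN defined later in this post): weight_decay shrinks every parameter, while parameter groups let us exempt the biases.

# weight_decay regularizes every parameter, biases included
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# to shrink only the weights, split the parameters into two groups
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if 'bias' in name else decay).append(p)

# per-group weight_decay: L2 on the weights, none on the biases
optimizer = torch.optim.SGD([{'params': decay, 'weight_decay': 1e-4},
                             {'params': no_decay, 'weight_decay': 0.0}],
                            lr=0.01)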

1.1.5 Dropout

Dropout randomly drops neurons during training, relying on Bernoulli sampling to choose which neurons to zero out. Dropout works very well because:

  1. Our model will learn not to rely heavily on any particular connection

  2. Our model will consider more connections

  3. The weight values will be more spread out, and often smaller, much like the effect of an L2 norm penalty

In practice, we can use different dropout probabilities for different layers, and we should probably assign dropout probabilities proportional to the number of neurons in a layer. PyTorch uses inverted dropout, which scales the surviving activation values by the factor \(\frac{1}{1-p}\) during training rather than at inference. Thus, we need to set model.train() and model.eval() appropriately in our training and validation functions; a small check of this behavior follows.
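
As a minimal check (the dropout probability and input tensor are illustrative), a dropout layer in training mode zeros some activations and scales the survivors by \(\frac{1}{1-p}\), while in evaluation mode it passes inputs through unchanged:

# inverted dropout: scaling happens at training time, not at inference
drop = torch.nn.Dropout(p=0.25)
x = torch.ones(8)

drop.train()    # training mode: surviving entries scaled by 1/(1 - 0.25) ~= 1.33
print(drop(x))  # a mix of zeros and ~1.33s

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # all ones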

1.1.6 Dropout Tips

In practice, do not use dropout if the model does not overfit. If your model does not overfit, however, the recommended approach is to increase the model's capacity until it does, so that you can then use dropout to retain the benefits of a larger-capacity model while constraining its ability to overfit.

1.2 Regularized Models in Practice

In this section, we will apply the regularized functions on the MNIST data set.

1.2.1 L2 Norm


# create Dataset
class CSVDataset(Dataset):
    """MNIST dataset."""

    def __init__(self, csv_file, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        # initialize
        self.data_frame = pd.read_csv(csv_file)
        # all columns but the last
        self.features = self.data_frame[self.data_frame.columns[:-1]]
        # the last column
        self.target = self.data_frame[self.data_frame.columns[-1]]
        # initialize the transform if specified
        self.transform = transform

    # get length of df
    def __len__(self):
        return len(self.data_frame)

    # get sample target
    def __get_target__(self):
        return self.target

    # get df filtered by indices
    def __get_values__(self, indices):
        return self.data_frame.iloc[indices]

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # pull a sample in a dict
        sample = {'features': torch.tensor(self.features.iloc[idx].values),
                  'target': torch.tensor(self.target.iloc[idx]),
                  'idx': torch.tensor(idx)}

        if self.transform:
            sample = self.transform(sample)

        return sample


class Pixel_Normalize():

    # retrieve sample and unpack it
    def __call__(self, sample):
        features, target, idx = (sample['features'],
                                 sample['target'],
                                 sample['idx'])

        # normalize each pixel to [0, 1]
        normalized_pixels = torch.true_divide(features, 255)

        # yield another dict
        return {'features': normalized_pixels,
                'target': target,
                'idx': idx}


# instantiate the lazy data set
csv_dataset = CSVDataset(csv_file='https://datahub.io/machine-learning/mnist_784/r/mnist_784.csv',
                         transform=Pixel_Normalize())

# set train, valid, and test sizes
train_size = int(0.8 * len(csv_dataset))
valid_size = int(0.1 * len(csv_dataset))
test_size = len(csv_dataset) - train_size - valid_size

# use random_split to create three data sets
train_ds, valid_ds, test_ds = torch.utils.data.random_split(csv_dataset, [train_size, valid_size, test_size])

# pull the first sample to confirm the Dataset returns what we expect
first_sample = train_ds[0]


# check the distribution of dependent variable; some imbalance
csv_dataset.__get_target__().value_counts()
## 1    7877
## 7    7293
## 3    7141
## 2    6990
## 9    6958
## 0    6903
## 6    6876
## 8    6825
## 4    6824
## 5    6313
## Name: class, dtype: int64


# prepare weighted sampling for imbalanced classification
def create_sampler(train_ds, csv_dataset):
    # get indices from the train split
    train_indices = train_ds.indices
    # generate class distributions [y1, y2, etc...]
    bin_count = np.bincount(csv_dataset.__get_target__()[train_indices])
    # weight gen
    weight = 1. / bin_count.astype(np.float32)
    # produce weights for each observation in the data set
    samples_weight = torch.tensor([weight[t] for t in csv_dataset.__get_target__()[train_indices]])
    # prepare sampler
    sampler = torch.utils.data.WeightedRandomSampler(weights=samples_weight,
                                                     num_samples=len(samples_weight),
                                                     replacement=True)
    return sampler


# create sampler for the training ds
train_sampler = create_sampler(train_ds, csv_dataset)

# create NN
# subclassing nn.Module gives us parameter tracking and backward propagation
class FF_NN(torch.nn.Module):
    def __init__(self, num_features, num_classes, drop_prob=0.0):
        super(FF_NN, self).__init__()
        # initialize 3 layers
        # first hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        # second hidden layer
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        # output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)
        # dropout layer; a no-op when drop_prob=0.0 (used in the dropout section below)
        self.dropout = torch.nn.Dropout(p=drop_prob)

    # define how and in what order model parameters should be used in forward prop.
    def forward(self, x):
        # run inputs through first layer
        out = self.linear_1(x)
        # apply relu, then dropout
        out = self.dropout(F.relu(out))
        # run inputs through second layer
        out = self.linear_2(out)
        # apply relu, then dropout
        out = self.dropout(F.relu(out))
        # run inputs through final classification layer
        logits = self.linear_out(out)
        # log-probabilities; argmax over these matches argmax over probabilities
        probs = F.log_softmax(logits, dim=1)
        return logits, probs


# load the NN model
num_features = 784
num_hidden_1 = 128
num_hidden_2 = 256
num_classes = 10
model = FF_NN(num_features=num_features, num_classes=num_classes).to(DEVICE)


# optimizer
optimizer = RAdam(model.parameters(), lr=0.1)

# set number of epochs
epochs = 4


# create DataLoaders (weighted sampler for the training set only)
train_dataloader = DataLoader(train_ds,
                              batch_size=100,
                              sampler=train_sampler,
                              shuffle=False)

valid_dataloader = DataLoader(valid_ds,
                              batch_size=100,
                              shuffle=True)

test_dataloader = DataLoader(test_ds,
                              batch_size=100,
                              shuffle=True)

# set LR scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=0.01,
                                                total_steps=len(train_dataloader)*epochs)

# create gradient scaler for mixed precision
scaler = GradScaler()

# train function
def train(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Training')
    train_loss = AverageMeter()
    model.train()
    for batch_idx, batch in enumerate(dataloader):
        b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
        optimizer.zero_grad()
        with autocast():
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target)
            # regularize loss -- but not the intercept
            LAMBDA, L2 = 0.5, 0.
            for name, p in model.named_parameters():
                if 'weight' in name:
                    L2 = L2 + (p**2).sum()
            loss = loss + 2./b_target.size(0) * LAMBDA * L2
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        #pbar(step=batch_idx, info={'loss': loss.item()})
        train_loss.update(loss.item(), n=1)
    return {'loss': train_loss.avg}


# valid/test function
def test(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Testing')
    valid_loss = AverageMeter()
    valid_acc = AverageMeter()
    valid_f1 = AverageMeter()
    model.eval()
    count = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target).item()
            pred = probs.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct = pred.eq(b_target.view_as(pred)).sum().item()
            # f1_score expects (y_true, y_pred)
            f1 = f1_score(b_target.to("cpu").numpy(), pred.view(-1).to("cpu").numpy(), average='macro')
            valid_f1.update(f1, n=b_features.size(0))
            valid_loss.update(loss, n=b_features.size(0))
            valid_acc.update(correct, n=1)
            count += b_features.size(0)
            #pbar(step=batch_idx)
    return {'valid_loss': valid_loss.avg,
            'valid_acc': valid_acc.sum /count,
            'valid_f1': valid_f1.avg}

# training
for epoch in range(1, epochs + 1):
    train_log = train(train_dataloader)
    valid_log = test(valid_dataloader)
    logs = dict(train_log, **valid_log)
    show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
    print(show_info)
## 
## Epoch: 1 -  loss: 1.5542 - valid_loss: 0.4248 - valid_acc: 0.8790 - valid_f1: 0.8705 
## 
## Epoch: 2 -  loss: 0.8890 - valid_loss: 0.3804 - valid_acc: 0.8974 - valid_f1: 0.8915 
## 
## Epoch: 3 -  loss: 0.8092 - valid_loss: 0.3693 - valid_acc: 0.9080 - valid_f1: 0.9015 
## 
## Epoch: 4 -  loss: 0.7494 - valid_loss: 0.3302 - valid_acc: 0.9173 - valid_f1: 0.9116 
## 

# evaluate on the held-out test set
test(test_dataloader)
## {'valid_loss': 0.3280821489436286, 'valid_acc': 0.918, 'valid_f1': 0.9128352114054326}

1.2.2 Dropout

# load the NN model
num_features = 784
num_hidden_1 = 128
num_hidden_2 = 256
num_classes = 10
dropout_prob = 0.2
model = FF_NN(num_features=num_features, num_classes=num_classes, drop_prob=dropout_prob).to(DEVICE)


# optimizer
optimizer = RAdam(model.parameters(), lr=0.1)

# set number of epochs
epochs = 4


# create DataLoaders (weighted sampler for the training set only)
train_dataloader = DataLoader(train_ds,
                              batch_size=100,
                              sampler=train_sampler,
                              shuffle=False)

valid_dataloader = DataLoader(valid_ds,
                              batch_size=100,
                              shuffle=True)

test_dataloader = DataLoader(test_ds,
                              batch_size=100,
                              shuffle=True)

# set LR scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=0.01,
                                                total_steps=len(train_dataloader)*epochs)

# create gradient scaler for mixed precision
scaler = GradScaler()

# train function
def train(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Training')
    train_loss = AverageMeter()
    model.train()
    for batch_idx, batch in enumerate(dataloader):
        b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
        optimizer.zero_grad()
        with autocast():
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        #pbar(step=batch_idx, info={'loss': loss.item()})
        train_loss.update(loss.item(), n=1)
    return {'loss': train_loss.avg}


# valid/test function
def test(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Testing')
    valid_loss = AverageMeter()
    valid_acc = AverageMeter()
    valid_f1 = AverageMeter()
    model.eval()
    count = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target).item()
            pred = probs.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct = pred.eq(b_target.view_as(pred)).sum().item()
            # f1_score expects (y_true, y_pred)
            f1 = f1_score(b_target.to("cpu").numpy(), pred.view(-1).to("cpu").numpy(), average='macro')
            valid_f1.update(f1, n=b_features.size(0))
            valid_loss.update(loss, n=b_features.size(0))
            valid_acc.update(correct, n=1)
            count += b_features.size(0)
            #pbar(step=batch_idx)
    return {'valid_loss': valid_loss.avg,
            'valid_acc': valid_acc.sum /count,
            'valid_f1': valid_f1.avg}

# training
for epoch in range(1, epochs + 1):
    train_log = train(train_dataloader)
    valid_log = test(valid_dataloader)
    logs = dict(train_log, **valid_log)
    show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
    print(show_info)
## 
## Epoch: 1 -  loss: 0.7804 - valid_loss: 0.2648 - valid_acc: 0.9176 - valid_f1: 0.9117 
## 
## Epoch: 2 -  loss: 0.1970 - valid_loss: 0.1802 - valid_acc: 0.9483 - valid_f1: 0.9461 
## 
## Epoch: 3 -  loss: 0.1286 - valid_loss: 0.1242 - valid_acc: 0.9643 - valid_f1: 0.9607 
## 
## Epoch: 4 -  loss: 0.0768 - valid_loss: 0.1024 - valid_acc: 0.9706 - valid_f1: 0.9691

# evaluate on the held-out test set
test(test_dataloader)
## {'valid_loss': 0.10206319407692978, 'valid_acc': 0.9702857142857143, 'valid_f1': 0.9684667952447602}

2 Sources