1 Introduction

In this guide, we will walk through feature normalization and weight initialization schemes in PyTorch. In short, we normalize our inputs for gradient descent because large weights will dominate our updates in our attempt to find global or local minima. Separately, we use custom weight initialization schemes to improve our ability to converge during optimization or to improve our ability to use certain activation functions.

1.1 Feature Normalization

Normalization is also known as standardization which means that our features will be scaled to have zero mean and unit variance. However, normalizing our inputs only affects the first hidden layer of a neural network. To solve this problem, researchers invented batch normalization.

1.1.1 Batch Normalization

Batch normalization normalizes the inputs of hidden layers which in turn: (1) reduce exploding/vanishing gradients, and (2) increases stability and convergence rate. Two parameters, \(\gamma\) and \(\beta\) are two parameters (variance and mean respectively) that can learn to perform standardization with zero mean and unit variance.

By using batch normalization, its \(\beta\) parameter makes intercepts redundant.

1.1.2 Batch Normalization in PyTorch

Batch normalization can be applied in two ways. The first and most common way is this:

Compute net inputs -> BatchNorm -> Activation Function -> Compute next layer net inputs

Another viable option, with some evidence suggesting better performance, is as follows:

Compute net inputs -> Activation Function -> BatchNorm -> Compute next layer net inputs

class FF_NN(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super(FF_NN, self).__init__()
        # initialize 3 layers
        # first hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        self.linear_1_bn = torch.nn.BatchNorm1d(num_hidden_1)
        # second hidden layer
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_2_bn = torch.nn.BatchNorm1d(num_hidden_2)
        # output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)

    # define how and what order model parameters should be used in forward prop.
    def forward(self, x):
        # run inputs through first layer
        out = self.linear_1(x)
        # apply relu
        out = F.relu(out)
        # apply batchnorm
        out = self.linear_1_bn(out)
        # run inputs through second layer
        out = self.linear_2(out)
        # apply relu
        out = F.relu(out)
        # apply batchnorm
        out = self.linear_2_bn(out)
        # run inputs through final classification layer
        logits = self.linear_out(out)
        probs = F.log_softmax(logits, dim=1)
        return logits, probs

1.2 Weight Initialization

Traditionally, we can initialize weights by sampling from a random uniform distribution in a range between [0, 1], or better, [-0.5, 0.5]. Alternatively, we could choose a Gaussian distribution with a mean of 0 and a small variance of 0.1 or 0.01. Separately, we can initialize all the intercepts to zeros.

In PyTorch, custom weight initialization looks like this:

# nn.Module tells PyTorch to do backward propagation
class FF_NN(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super(FF_NN, self).__init__()
        # initialize 3 layers
        # first hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        self.linear_1.weight.detach().normal_(0.0, 0.1)
        self.linear_1.bias.detach().zero_()
        # second hidden layer
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_2.weight.detach().normal_(0.0, 0.1)
        self.linear_2.bias.detach().zero_()
        # output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)
        self.linear_out.weight.detach().normal_(0.0, 0.1)
        self.linear_out.bias.detach().zero_()        

    # define how and what order model parameters should be used in forward prop.
    def forward(self, x):
        # run inputs through first layer
        out = self.linear_1(x)
        # apply relu
        out = F.relu(out)
        # run inputs through second layer
        out = self.linear_2(out)
        # apply relu
        out = F.relu(out)
        # run inputs through final classification layer
        logits = self.linear_out(out)
        probs = F.log_softmax(logits, dim=1)
        return logits, probs

1.2.1 Xavier Weight Initialization

To perform Xavier initialization, we first initialize weights from a Gaussian or uniform distribution and then we scale the weights proportional to the number of inputs to the layer. This means for that the first hidden layer, the number of inputs is the number of features. For the second hidden layer, the number of inputs is the number of neurons in the first hidden layer.

To apply this automatically in PyTorch, we add the following to our model:

def weights_init(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        torch.nn.init.xavier_uniform_(m.bias)
model.apply(weights_init)

PyTorch uses a weight initialization scheme similar to Xavier which is is as follows:

def reset_parameters(self):
    stdv = 1. / math.sqrt(self.weight.size(1))
    self.weight.data.uniform_(-stdv, stdv)
    if self.bias is not None:
        self.bias.data.uniform_(-stdv, stdv)

It is important to note that if we choose to use batch normalization, this means that initial feature weight choice is less important.

1.3 Feature Normalization and Weight Initialization in Practice

In this section, we will apply the feature normalization and weight initialization functions on the MNIST data set.

1.3.1 Batch Normalization

# create Dataset
class CSVDataset(Dataset):
    """MNIST dataset."""

    def __init__(self, csv_file, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        # initialize
        self.data_frame = pd.read_csv(csv_file)
        # all columns but the last
        self.features = self.data_frame[self.data_frame.columns[:-1]]
        # the last column
        self.target = self.data_frame[self.data_frame.columns[-1]]
        # initialize the transform if specified
        self.transform = transform

        # get length of df
    def __len__(self):
        return len(self.data_frame)

        # get sample target
    def __get_target__(self):
        return self.target

        # get df filtered by indices
    def __get_values__(self, indices):
        return self.data_frame.iloc[indices]

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # pull a sample in a dict
        sample = {'features': torch.tensor(self.features.iloc[idx].values),
                  'target': torch.tensor(self.target.iloc[idx]),
                  'idx': torch.tensor(idx)}

        if self.transform:
            sample = self.transform(sample)

        return sample


class Pixel_Normalize():

    # retrieve sample and unpack it
    def __call__(self, sample):
        features, target, idx = (sample['features'],
                              sample['target'],
                              sample['idx'])

        # normalize each pixel
        normalized_pixels = torch.true_divide(sample['features'], 255)

        # yield another dict
        return {'features': normalized_pixels,
                'target': target,
                'idx': idx}


# instantiate the lazy data set
csv_dataset = CSVDataset(csv_file='https://datahub.io/machine-learning/mnist_784/r/mnist_784.csv',
                         transform=Pixel_Normalize())

# set train, valid, and test size
train_size = int(0.8 * len(csv_dataset))
valid_size = int(0.1 * len(csv_dataset))

# use random split to create three data sets;
train_ds, valid_ds, test_ds = torch.utils.data.random_split(csv_dataset, [train_size, valid_size, valid_size])

# check the output
for i, batch in enumerate(train_ds):
    if i == 0:
        break


# check the distribution of dependent variable; some imbalance
csv_dataset.__get_target__().value_counts()


# prepare weighted sampling for imbalanced classification

## 1    7877
## 7    7293
## 3    7141
## 2    6990
## 9    6958
## 0    6903
## 6    6876
## 8    6825
## 4    6824
## 5    6313
## Name: class, dtype: int64

def create_sampler(train_ds, csv_dataset):
    # get indicies from train split
    train_indices = train_ds.indices
    # generate class distributions [y1, y2, etc...]
    bin_count = np.bincount(csv_dataset.__get_target__()[train_indices])
    # weight gen
    weight = 1. / bin_count.astype(np.float32)
    # produce weights for each observation in the data set
    samples_weight = torch.tensor([weight[t] for t in csv_dataset.__get_target__()[train_indices]])
    # prepare sampler
    sampler = torch.utils.data.WeightedRandomSampler(weights=samples_weight,
                                                     num_samples=len(samples_weight),
                                                     replacement=True)
    return sampler


# create sampler for the training ds
train_sampler = create_sampler(train_ds, csv_dataset)

# create NN
# nn.Module tells PyTorch to do backward propagation
class FF_NN(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super(FF_NN, self).__init__()
        # initialize 3 layers
        # first hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        self.linear_1_bn = torch.nn.BatchNorm1d(num_hidden_1)
        # second hidden layer
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_2_bn = torch.nn.BatchNorm1d(num_hidden_2)
        # output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)

    # define how and what order model parameters should be used in forward prop.
    def forward(self, x):
        # run inputs through first layer
        out = self.linear_1(x)
        # apply relu
        out = F.relu(out)
        # apply batchnorm
        out = self.linear_1_bn(out)
        # run inputs through second layer
        out = self.linear_2(out)
        # apply relu
        out = F.relu(out)
        # apply batchnorm
        out = self.linear_2_bn(out)
        # run inputs through final classification layer
        logits = self.linear_out(out)
        probs = F.log_softmax(logits, dim=1)
        return logits, probs


# load the NN model
num_features = 784
num_hidden_1 = 128
num_hidden_2 = 256
num_classes = 10
model = FF_NN(num_features=num_features, num_classes=num_classes).to(DEVICE)


# optimizer
optimizer = RAdam(model.parameters(), lr=0.1)

# set number of epochs
epochs = 4


# create DataLoaders with samplers
train_dataloader = DataLoader(train_ds,
                              batch_size=100,
                              sampler=train_sampler,
                              shuffle=False)

valid_dataloader = DataLoader(valid_ds,
                              batch_size=100,
                              shuffle=True)

test_dataloader = DataLoader(test_ds,
                              batch_size=100,
                              shuffle=True)

# set LR scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=0.01,
                                                total_steps=len(train_dataloader)*epochs)

# create gradient scaler for mixed precision
scaler = GradScaler()

# train function
def train(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Training')
    train_loss = AverageMeter()
    model.train()
    for batch_idx, batch in enumerate(dataloader):
        b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
        optimizer.zero_grad()
        with autocast():
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target)
            # regularize loss -- but not the intercept
            LAMBDA, L2 = 0.5, 0.
            for name, p in model.named_parameters():
                if 'weight' in name:
                    L2 = L2 + (p**2).sum()
            loss = loss + 2./b_target.size(0) * LAMBDA * L2
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        #pbar(step=batch_idx, info={'loss': loss.item()})
        train_loss.update(loss.item(), n=1)
    return {'loss': train_loss.avg}


# valid/test function
def test(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Testing')
    valid_loss = AverageMeter()
    valid_acc = AverageMeter()
    valid_f1 = AverageMeter()
    model.eval()
    count = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target).item()
            pred = probs.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct = pred.eq(b_target.view_as(pred)).sum().item()
            f1 = f1_score(pred.to("cpu").numpy(), b_target.to("cpu").numpy(), average='macro')
            valid_f1.update(f1, n=b_features.size(0))
            valid_loss.update(loss, n=b_features.size(0))
            valid_acc.update(correct, n=1)
            count += b_features.size(0)
            #pbar(step=batch_idx)
    return {'valid_loss': valid_loss.avg,
            'valid_acc': valid_acc.sum /count,
            'valid_f1': valid_f1.avg}

# training
for epoch in range(1, epochs + 1):
    train_log = train(train_dataloader)
    valid_log = test(valid_dataloader)
    logs = dict(train_log, **valid_log)
    show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
    print(show_info)

## 
## Epoch: 1 -  loss: 3.7701 - valid_loss: 0.4558 - valid_acc: 0.8626 - valid_f1: 0.8627 
## 
## Epoch: 2 -  loss: 0.9400 - valid_loss: 0.4367 - valid_acc: 0.8791 - valid_f1: 0.8729 
## 
## Epoch: 3 -  loss: 0.8238 - valid_loss: 0.2809 - valid_acc: 0.9397 - valid_f1: 0.9364 
## 
## Epoch: 4 -  loss: 0.6290 - valid_loss: 0.1779 - valid_acc: 0.9646 - valid_f1: 0.9622 
## 
## C:/Users/Andrew/Desktop/Projects/Deep Learning/utils\radam.py:60: UserWarning: This overload of add_ is deprecated:
##  add_(Number alpha, Tensor other)
## Consider using one of the following signatures instead:
##  add_(Tensor other, *, Number alpha) (Triggered internally at  ..\torch\csrc\utils\python_arg_parser.cpp:766.)
##   exp_avg.mul_(beta1).add_(1 - beta1, grad)

# testing
test_log = test(test_dataloader)
print(test_log)

## {'valid_loss': 0.18130239312137877, 'valid_acc': 0.9631428571428572, 'valid_f1': 0.9605992527642337}

1.3.2 Xavier Initialization

class FF_NN(torch.nn.Module):
    def __init__(self, num_features, num_classes):
        super(FF_NN, self).__init__()
        # initialize 3 layers
        # first hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)
        self.linear_1.weight.detach().normal_(0.0, 0.1)
        self.linear_1.bias.detach().zero_()
        # second hidden layer
        self.linear_2 = torch.nn.Linear(num_hidden_1, num_hidden_2)
        self.linear_2.weight.detach().normal_(0.0, 0.1)
        self.linear_2.bias.detach().zero_()
        # output layer
        self.linear_out = torch.nn.Linear(num_hidden_2, num_classes)
        self.linear_out.weight.detach().normal_(0.0, 0.1)
        self.linear_out.bias.detach().zero_()

    # define how and what order model parameters should be used in forward prop.
    def forward(self, x):
        # run inputs through first layer
        out = self.linear_1(x)
        # apply relu
        out = F.relu(out)
        # run inputs through second layer
        out = self.linear_2(out)
        # apply relu
        out = F.relu(out)
        # run inputs through final classification layer
        logits = self.linear_out(out)
        probs = F.log_softmax(logits, dim=1)
        return logits, probs

# load the NN model
num_features = 784
num_hidden_1 = 128
num_hidden_2 = 256
num_classes = 10
model = FF_NN(num_features=num_features, num_classes=num_classes).to(DEVICE)


# optimizer
optimizer = RAdam(model.parameters(), lr=0.1)

# set number of epochs
epochs = 4


# create DataLoaders with samplers
train_dataloader = DataLoader(train_ds,
                              batch_size=100,
                              sampler=train_sampler,
                              shuffle=False)

valid_dataloader = DataLoader(valid_ds,
                              batch_size=100,
                              shuffle=True)

test_dataloader = DataLoader(test_ds,
                              batch_size=100,
                              shuffle=True)

# set LR scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer,
                                                max_lr=0.01,
                                                total_steps=len(train_dataloader)*epochs)

# create gradient scaler for mixed precision
scaler = GradScaler()

# train function
def train(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Training')
    train_loss = AverageMeter()
    model.train()
    for batch_idx, batch in enumerate(dataloader):
        b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
        optimizer.zero_grad()
        with autocast():
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        #pbar(step=batch_idx, info={'loss': loss.item()})
        train_loss.update(loss.item(), n=1)
    return {'loss': train_loss.avg}


# valid/test function
def test(dataloader):
    #pbar = ProgressBar(n_total=len(dataloader), desc='Testing')
    valid_loss = AverageMeter()
    valid_acc = AverageMeter()
    valid_f1 = AverageMeter()
    model.eval()
    count = 0
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            b_features, b_target, b_idx = batch['features'].to(DEVICE),  batch['target'].to(DEVICE), batch['idx'].to(DEVICE)
            logits, probs = model(b_features)
            loss = F.cross_entropy(logits, b_target).item()
            pred = probs.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct = pred.eq(b_target.view_as(pred)).sum().item()
            f1 = f1_score(pred.to("cpu").numpy(), b_target.to("cpu").numpy(), average='macro')
            valid_f1.update(f1, n=b_features.size(0))
            valid_loss.update(loss, n=b_features.size(0))
            valid_acc.update(correct, n=1)
            count += b_features.size(0)
            #pbar(step=batch_idx)
    return {'valid_loss': valid_loss.avg,
            'valid_acc': valid_acc.sum /count,
            'valid_f1': valid_f1.avg}

# training
for epoch in range(1, epochs + 1):
    train_log = train(train_dataloader)
    valid_log = test(valid_dataloader)
    logs = dict(train_log, **valid_log)
    show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
    print(show_info)

## 
## Epoch: 1 -  loss: 0.6772 - valid_loss: 0.1668 - valid_acc: 0.9484 - valid_f1: 0.9455 
## 
## Epoch: 2 -  loss: 0.1306 - valid_loss: 0.1363 - valid_acc: 0.9583 - valid_f1: 0.9551 
## 
## Epoch: 3 -  loss: 0.0715 - valid_loss: 0.0803 - valid_acc: 0.9750 - valid_f1: 0.9732 
## 
## Epoch: 4 -  loss: 0.0297 - valid_loss: 0.0627 - valid_acc: 0.9814 - valid_f1: 0.9798

# testing
test_log = test(test_dataloader)
print(test_log)

## {'valid_loss': 0.06501132133749447, 'valid_acc': 0.9835714285714285, 'valid_f1': 0.9821693332461051}

Feature Normalization and Initialization

Andrew Fogarty

10/06/2020