Getting NaN values from the loss function with k-fold cross-validation

I am trying to implement an MNIST classifier using PyTorch Lightning, and I want to use k-fold cross-validation.

The problem is that the loss function returns NaN for at least one fold. As shown below, from the third fold onward the loss becomes NaN.

Epoch 19: 100%|█████████████████████████████████| 110/110 [00:03<00:00, 29.24it/s, loss=0.963, v_num=287]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 39.94it/s]

Epoch 19: 100%|█████████████████████████████████| 110/110 [00:04<00:00, 25.69it/s, loss=0.825, v_num=288]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 41.19it/s]

Epoch 19: 100%|███████████████████████████████████| 110/110 [00:03<00:00, 30.19it/s, loss=nan, v_num=289]
Testing: 100%|███████████████████████████████████████████████████████████| 28/28 [00:00<00:00, 42.15it/s]

Or the loss becomes very large (I terminated the run before completing all epochs):

Epoch 0:  44%|█████████████▉                  | 48/110 [00:02<00:02, 22.87it/s, loss=2.08e+23, v_num=295]
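One debugging step I have not added to the code below is PyTorch's anomaly detection, which should raise an error at the backward operation that first produces the NaN instead of letting it propagate silently:

import torch

# raises a RuntimeError at the backward op that first produces NaN/Inf,
# pointing at the forward operation responsible
torch.autograd.set_detect_anomaly(True)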

The code I use for data preparation, the k-fold split, and the trainer is given below:

import os
from torch.utils.data import ConcatDataset
from torchvision import transforms
from torchvision.datasets import MNIST

def prepare_data():
  # standard MNIST normalization; concatenate the official train and test
  # splits so k-fold can re-partition the full dataset
  transform = transforms.Compose([transforms.ToTensor(),
                                  transforms.Normalize((0.1307,), (0.3081,))])
  mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform)
  mnist_test = MNIST(os.getcwd(), train=False, download=True, transform=transform)
  dataset = ConcatDataset([mnist_train, mnist_test])
  return dataset


import torch
from sklearn.model_selection import KFold
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

k_folds = 5
epochs = 20

kfold = KFold(n_splits=k_folds, shuffle=True)

dataset = prepare_data()
model = LightningMNIST(lr_rate=0.01)  # LightningModule defined elsewhere

for fold, (train_idx, val_idx) in enumerate(kfold.split(dataset)):
  # samplers restrict each DataLoader to this fold's indices
  train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
  val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

  train_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=train_subsampler)
  val_loader = torch.utils.data.DataLoader(dataset, num_workers=8, batch_size=512, sampler=val_subsampler)

  model.apply(reset_weights)  # reset the model for every fold
  early_stopping = EarlyStopping('train_loss', mode='min', patience=5)
  model_checkpoint = ModelCheckpoint(dirpath=model_path,  # model_path is defined elsewhere
                                     filename='mnist_{epoch}-{train_loss:.2f}',
                                     monitor='train_loss', mode='min', save_top_k=3)
  trainer = pl.Trainer(max_epochs=epochs, profiler=False,
                       callbacks=[early_stopping, model_checkpoint],
                       default_root_dir=model_path)
  trainer.fit(model, train_dataloader=train_loader)
  trainer.test(test_dataloaders=val_loader, ckpt_path=None)
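The reset_weights helper is not shown above; it is just a generic re-initializer applied via model.apply, roughly like this sketch:

def reset_weights(module):
  # re-initialize every layer that defines reset_parameters
  # (Linear, Conv2d, ...) so each fold starts from fresh weights
  if hasattr(module, 'reset_parameters'):
    module.reset_parameters()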

The training step is given below:

  def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    logits = self.forward(x)
    # squeeze the trailing dim so logits match the target shape expected by the loss
    loss = self.error_loss(logits.squeeze(-1), y.float())
    self.log('train_loss', loss)
    return {'loss': loss}

I assume I am doing something wrong in the k-fold data preparation or in the training step; otherwise, NaN or very large loss values are not expected for such a simple problem and such a simple model.

I have gone through several posts like this, this, and that. Some of them suggest the loss can become NaN because the dataset contains NaN values (but I think MNIST does not, since it is downloaded directly from the torchvision module) or because the learning rate is too large (mine is 0.01, neither too big nor too small). Moreover, I believe this post is not a duplicate: here I am trying to use k-fold, even though the error looks the same.
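To rule out the data, a quick scan like this (not part of my original script) could check the normalized tensors for NaNs:

import torch

def dataset_has_nan(dataset):
  # iterate once over all samples and flag any NaN pixel value
  for i in range(len(dataset)):
    x, _ = dataset[i]
    if torch.isnan(x).any():
      return True
  return False

print(dataset_has_nan(prepare_data()))  # expected: False for MNIST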

Any suggestions?

Tags: python, machine-learning, deep-learning, pytorch, pytorch-lightning
