Monitor training/validation process in Caffe


Solution 1

1) You can use the NVIDIA DIGITS app to monitor your networks. It provides a GUI that covers dataset preparation, model selection, and learning-curve visualization. It also uses a Caffe distribution that supports multi-GPU training.

2) Alternatively, you can use the log parser that ships with Caffe. First, save the training output to a log file:

/pathtocaffe/build/tools/caffe train --solver=solver.prototxt 2>&1 | tee lenet_train.log

This saves the training log to "lenet_train.log". Then run:

python /pathtocaffe/tools/extra/parse_log.py lenet_train.log .

to parse the training log into two CSV files, "lenet_train.log.train" and "lenet_train.log.test", containing the train and test loss. You can then plot them with the following Python script:

import pandas as pd
import matplotlib.pyplot as plt

# CSV files written by parse_log.py
train_log = pd.read_csv("./lenet_train.log.train")
test_log = pd.read_csv("./lenet_train.log.test")

# Loss goes on the left axis, accuracy on a second y-axis sharing the same x-axis
_, ax1 = plt.subplots(figsize=(15, 10))
ax2 = ax1.twinx()
ax1.plot(train_log["NumIters"], train_log["loss"], alpha=0.4)  # train loss
ax1.plot(test_log["NumIters"], test_log["loss"], "g")  # test loss
# "loss" and "acc" must match the output names that appear in your own log
ax2.plot(test_log["NumIters"], test_log["acc"], "r")  # test accuracy
ax1.set_xlabel("iteration")
ax1.set_ylabel("loss")
ax2.set_ylabel("test accuracy")
plt.savefig("./train_test_image.png")  # save the figure as a PNG

Solution 2

Caffe writes a log every time you run training, and the log is stored in the temp folder (on both Linux and Windows).
I also wrote a plotting script in Python that you can use to visualize your loss and accuracy.
Just place your training logs, with a .log extension, next to the script and double-click it. You can also run it from a command prompt, but for ease of use it loads every log (*.log) it finds in the current directory when executed. It also shows the top 4 accuracies and the point in training at which each was achieved.
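If you are not sure which file in the temp folder is the latest training log, a small helper can pick it out. This is only a sketch: the function name `find_latest_log` and the `caffe*` file name pattern are my assumptions, so check how the log files are actually named on your machine and adjust the pattern.

```python
import glob
import os
import tempfile


def find_latest_log(directory=None, pattern="caffe*"):
    """Return the most recently modified file matching `pattern`, or None.

    `pattern` is an assumed naming scheme; adjust it to your log file names.
    """
    # Fall back to the system temp folder (for example /tmp on Linux)
    directory = directory or tempfile.gettempdir()
    candidates = [
        path
        for path in glob.glob(os.path.join(directory, pattern))
        if os.path.isfile(path)
    ]
    if not candidates:
        return None
    # The newest modification time wins
    return max(candidates, key=os.path.getmtime)
```

Calling `find_latest_log()` with no arguments searches the system temp folder. You can then copy the returned file next to the plotting script with a .log extension.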

You can find it here: https://gist.github.com/Coderx7/03f46cb24dcf4127d6fa66d08126fa3b
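To illustrate the "top accuracies" idea without the linked script, here is a minimal sketch that reads a test CSV like the one parse_log.py produces in Solution 1. It is not the code from the gist. The function name `top_accuracies` is hypothetical, and the column names "NumIters" and "acc" are assumptions taken from the plotting script above; yours may differ.

```python
import csv
import heapq


def top_accuracies(csv_path, k=4, iter_col="NumIters", acc_col="acc"):
    """Return the k highest (accuracy, iteration) pairs from a parsed test log."""
    with open(csv_path, newline="") as f:
        rows = [
            # int(float(...)) accepts iteration counts written as "500" or "500.0"
            (float(row[acc_col]), int(float(row[iter_col])))
            for row in csv.DictReader(f)
        ]
    # Tuples compare by accuracy first; on a tie the later iteration ranks higher
    return heapq.nlargest(k, rows)
```

For example, `top_accuracies("lenet_train.log.test")` returns up to four pairs, best accuracy first.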

Solution 3

python /pathtocaffe/tools/extra/parse_log.py lenet_train.log

Running this command with only the log file produces the following error:

usage: parse_log.py [-h] [--verbose] [--delimiter DELIMITER]
                logfile_path output_dir
parse_log.py: error: too few arguments

Solution:

To run "parse_log.py" successfully, pass both required arguments:

  1. the log file
  2. the path of the output directory

So the correct command is as follows:

python /pathtocaffe/tools/extra/parse_log.py lenet_train.log output_dir
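
If you call the parser from a script, a thin wrapper can make sure both arguments are always supplied and that the output directory exists. This is a sketch under a few assumptions: the helper names `build_parse_command` and `parse_caffe_log` are hypothetical, `python` on your PATH is assumed to be an interpreter that can run parse_log.py, and the output file names follow the ".train"/".test" naming used in Solution 1.

```python
import os
import subprocess


def build_parse_command(parse_log_script, logfile_path, output_dir, python_exe="python"):
    """Build the argument list: the log file first, then the output directory."""
    return [python_exe, parse_log_script, logfile_path, output_dir]


def parse_caffe_log(parse_log_script, logfile_path, output_dir=".", python_exe="python"):
    """Run parse_log.py and return the expected (train_csv, test_csv) paths."""
    # Create the output directory up front so the parser has somewhere to write
    os.makedirs(output_dir, exist_ok=True)
    command = build_parse_command(parse_log_script, logfile_path, output_dir, python_exe)
    # check=True raises CalledProcessError if the parser exits with an error
    subprocess.run(command, check=True)
    base = os.path.basename(logfile_path)
    return (
        os.path.join(output_dir, base + ".train"),
        os.path.join(output_dir, base + ".test"),
    )
```

For example, `parse_caffe_log("/pathtocaffe/tools/extra/parse_log.py", "lenet_train.log", "output_dir")` runs the same command as above.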
DucCuong
Author by

DucCuong

I am currently a junior college student. My interests are on database systems and web development.

Updated on June 06, 2022

Comments

  • DucCuong
    DucCuong almost 2 years

    I'm training the Caffe Reference Model to classify images. My work requires me to monitor training by plotting the model's accuracy after every 1000 iterations on the entire training set and validation set, which have 100K and 50K images respectively. Right now I'm taking the naive approach: I take a snapshot every 1000 iterations, then run the C++ classification code, which reads the raw JPEG images, forwards them through the net, and outputs the predicted labels. However, this takes too much time on my machine (with a GeForce GTX 560 Ti).

    Is there a faster way to get the accuracy graph for the snapshot models on both the training and validation sets?

    I was thinking about using LMDB format instead of raw images. However, I cannot find documentation/code about doing classification in C++ using LMDB format.