Train Tensorflow Object Detection on own dataset


Solution 1

This assumes the module is already installed. Please refer to their documentation if not.

Disclaimer

This answer is not meant to be the right or only way of training the object detection module. This is simply me sharing my experience and what has worked for me. I'm open to suggestions and to learning more about this, as I am still new to ML in general.

TL;DR

  1. Create your own PASCAL VOC format dataset
  2. Generate TFRecords from it
  3. Configure a pipeline
  4. Visualize

Each section of this answer consists of a corresponding Edit (see below). After reading each section, please read its Edit as well for clarifications. Corrections and tips were added for each section.

Tools used

LabelImg: A tool for creating PASCAL VOC format annotations.

1. Create your own PASCAL VOC dataset

PS: For simplicity, the folder naming convention of my answer follows that of Pascal VOC 2012

Taking a peek into the May 2012 dataset, you'll notice it has the following folder structure:

+VOCdevkit
    +VOC2012
        +Annotations
        +ImageSets
            +Action
            +Layout
            +Main
            +Segmentation
        +JPEGImages
        +SegmentationClass
        +SegmentationObject

For the time being, amendments were made to the following folders:

Annotations: This is where all the images' corresponding XML files will be placed. Use the suggested tool above to create the annotations. Do not worry about the <truncated> and <difficult> tags as they will be ignored by the training and eval binaries.
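For reference, a minimal annotation produced by such a tool looks roughly like this (the file name and box coordinates below are made-up illustrative values; LabelImg writes the full file for you):

<annotation>
  <folder>VOC2012</folder>
  <filename>2008_000008.jpg</filename>
  <size>
    <width>500</width>
    <height>442</height>
    <depth>3</depth>
  </size>
  <object>
    <name>aeroplane</name>
    <bndbox>
      <xmin>53</xmin>
      <ymin>87</ymin>
      <xmax>471</xmax>
      <ymax>420</ymax>
    </bndbox>
  </object>
</annotation>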

JPEGImages: Location of your actual images. Make sure they are of type JPEG, because that is all their provided TFRecord-creation script currently supports.
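If your source images are in another format, here is a quick conversion sketch (assuming Pillow is installed; the folder names are placeholders you would adapt):

import os
from PIL import Image

SRC_DIR = "raw_images"                     # hypothetical input folder
DST_DIR = "VOCdevkit/VOC2012/JPEGImages"

for name in os.listdir(SRC_DIR):
    base, ext = os.path.splitext(name)
    if ext.lower() in (".png", ".bmp", ".tiff"):
        # JPEG has no alpha channel, so convert to RGB first
        img = Image.open(os.path.join(SRC_DIR, name)).convert("RGB")
        img.save(os.path.join(DST_DIR, base + ".jpg"), "JPEG", quality=95)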

ImageSets->Main: This simply consists of text files. For each class, there exists a corresponding train.txt, trainval.txt and val.txt. Below is a sample of the contents of the aeroplane_train.txt in the VOC 2012 folder

2008_000008 -1
2008_000015 -1
2008_000019 -1
2008_000023 -1
2008_000028 -1
2008_000033  1

The structure is basically the image name followed by a flag (1 or -1) saying whether the corresponding object exists in that image or not. For example, image 2008_000008 does not contain an aeroplane, hence it is marked with a -1, but image 2008_000033 does.

I wrote a small Python script to generate these text files. Simply iterate through the image names and assign a 1 or -1 next to them for object existence. I added some randomness among my text files by shuffling the image names. A rough sketch of this follows below.

The {classname}_val.txt files consist of the validation dataset. Think of this as the test data used during training. You want to divide your dataset into training and validation sets. More info can be found here. The format of these files is similar to that of training.
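As promised above, here is a minimal sketch of such a generator (the paths, class list, and split fraction are assumptions you would adapt; it decides the 1/-1 flag by parsing each image's annotation XML):

import os
import random
import xml.etree.ElementTree as ET

ANNOTATIONS_DIR = "VOCdevkit/VOC2012/Annotations"   # hypothetical paths
OUTPUT_DIR = "VOCdevkit/VOC2012/ImageSets/Main"
CLASSES = ["myclass"]                               # your own class names
TRAIN_FRACTION = 0.8                                # rest becomes the val split

# Shuffle the image names for some randomness, then split train/val
image_ids = [f[:-4] for f in os.listdir(ANNOTATIONS_DIR) if f.endswith(".xml")]
random.shuffle(image_ids)
cut = int(len(image_ids) * TRAIN_FRACTION)
splits = {"train": image_ids[:cut], "val": image_ids[cut:]}

def flag(image_id, classname):
    # 1 if the annotation lists the class, -1 otherwise
    tree = ET.parse(os.path.join(ANNOTATIONS_DIR, image_id + ".xml"))
    names = set(obj.findtext("name") for obj in tree.iter("object"))
    return 1 if classname in names else -1

for classname in CLASSES:
    for split_name, ids in splits.items():
        out = os.path.join(OUTPUT_DIR, "%s_%s.txt" % (classname, split_name))
        with open(out, "w") as f:
            for image_id in ids:
                f.write("%s %2d\n" % (image_id, flag(image_id, classname)))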

At this point, your folder structure should be

+VOCdevkit
    +VOC2012
        +Annotations
            --(for each image, generated annotation)
        +ImageSets
            +Main
                --(for each class, generated *classname*_train.txt and *classname*_val.txt)
        +JPEGImages
            --(a bunch of JPEG images)


1.1 Generating label map

With the dataset prepared, we need to create the corresponding label maps. Navigate to models/object_detection/data and open pascal_label_map.pbtxt.

This file uses the protobuf text format (not JSON) to assign an ID and name to each item. Make amendments to this file to reflect your desired objects.
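For illustration, a two-class label map might look like this (the class names here are placeholders; as noted in the edits below, IDs should start at 1 because 0 is reserved):

item {
  id: 1
  name: 'myfirstclass'
}
item {
  id: 2
  name: 'mysecondclass'
}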


2. Generate TFRecords

If you look into their code, especially this line, they explicitly grab the aeroplane_train.txt only. For curious minds, here's why. Change this file name to any one of your class train text files.

Make sure VOCdevkit is inside models/object_detection then you can go ahead and generate the TFRecords.
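The invocation looks roughly like this (flag names as of the mid-2017 codebase; check the script's docstring if the repository has since moved things around):

python object_detection/create_pascal_tf_record.py \
    --label_map_path=object_detection/data/pascal_label_map.pbtxt \
    --data_dir=VOCdevkit --year=VOC2012 --set=train \
    --output_path=pascal_train.record

Run it again with --set=val and a different --output_path to produce the validation record.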

Please go through their code first should you run into any problems. It is self-explanatory and well documented.


3. Pipeline Configuration

Their instructions should be self-explanatory for this segment. Sample configs can be found in object_detection/samples/configs.

For those looking to train from scratch as I did, just make sure to remove the fine_tune_checkpoint and from_detection_checkpoint nodes. Here's what my config file looked like for reference.
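As an illustrative excerpt (not a complete config), the relevant part of the train_config block ends up looking something like this after those nodes are removed:

train_config: {
  batch_size: 1
  num_steps: 200000
  # Removed for training from scratch:
  # fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  # from_detection_checkpoint: true
}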

From here on you can continue with the tutorial and run the training process.
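Training is then kicked off with something like the following (placeholder paths, in the same ${...} convention as the TensorBoard command further down):

python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --train_dir=${PATH_TO_TRAIN_DIR}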


4. Visualize

Be sure to run the eval in parallel to the training in order to be able to visualize the learning process. To quote Jonathan Huang

the best way is to just run the eval.py binary. We typically run this binary in parallel to training, pointing it at the directory holding the checkpoint that is being trained. The eval.py binary will write logs to an eval_dir that you specify which you can then point to with Tensorboard.

You want to see that the mAP has "lifted off" in the first few hours, and then you want to see when it converges. It's hard to tell without looking at these plots how many steps you need.
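Concretely, that parallel eval run looks something like this (again with placeholder paths; ${PATH_TO_TRAIN_DIR} is the directory train.py writes checkpoints to):

python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}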


EDIT I (28 July '17):

I never expected my response to get this much attention so I decided to come back and review it.

Tools

For my fellow Apple users, you could actually use RectLabel for annotations.

Pascal VOC

After digging around, I finally realized that trainval.txt is actually the union of training and validation datasets.

Please look at their official development kit to understand the format even better.

Label Map Generation

At the time of my writing, ID 0 represents none_of_the_above. It is recommended that your IDs start from 1.

Visualize

After running your evaluation and pointing TensorBoard to your eval directory, it'll show you the mAP of each category along with each category's performance. This is good, but I like seeing my training data in parallel with the eval as well.

To do this, run tensorboard on a different port and point it to your train directory

tensorboard --logdir=${PATH_TO_TRAIN} --port=${DESIRED_NUMBER}

Solution 2

I wrote a blog post on Medium about my experience as well on how I trained an object detector (in particular, it's a Raccoon detector) with Tensorflow on my own dataset. This might also be useful for others and is complementary to eshirima's answer.



Comments

  • eshirima
    eshirima almost 4 years

    After spending a couple of days trying to achieve this task, I would like to share my experience of how I went about answering the question:

    How do I use TF Object Detection to train using my own dataset?

    • Michael Ramos
      Michael Ramos almost 7 years
      you're a boss, thanks!
  • AruniRC
    AruniRC almost 7 years
    it's great to hear that someone got the TF "training locally" example up and running and also modified it for their own dataset! could you let me know what version of Python you used? I am constantly running into Python2.7/Python3 code issues when just trying to run their example for training on PASCAL. thanks!
  • eshirima
    eshirima almost 7 years
    I used Python 2.7 and I believe their codebase was written with that in mind.
  • AruniRC
    AruniRC almost 7 years
    thanks. I ended up shifting to 2.7 as well and things were better.
  • gdelab
    gdelab almost 7 years
    How do you run eval.py simultaneously with train.py ? I have to do it on another device, for memory reasons, but I don't know how to specify the device...
  • gdelab
    gdelab almost 7 years
    And do you know how to get the same summaries in tensorboard for the evaluation as for the training ? (TotalLoss, global_step/sec, etc., to compare eval and train more precisely, and get the inference time)
  • eshirima
    eshirima almost 7 years
    @gdelab I have no idea how to run eval.py from another device. I ran mine locally by simply opening up a new terminal window. One way I can think of is to upload your results to a server, then pull them down on another device and run them there. To answer your second question, even I've been trying to get that working, but with no success. I'll update your GitHub issue should I find a way.
  • Michael Ramos
    Michael Ramos almost 7 years
    @eshirima , so we need to first create an image classification model? perhaps using inception?
  • eshirima
    eshirima almost 7 years
    @rambossa. No no no no.. Classification is different from object localization. Classification simply tells you if a certain object exists in your image without actually telling you where in the image. Object localization extends this to include the location of the detected object. If what you want is classification, this answer will help you. But if what you want is object detection, just follow the instructions of this answer.
  • Michael Ramos
    Michael Ramos almost 7 years
    @eshirima perfect thanks, Detection is definitely what I want, just wasn't sure if classification was a requirement. Thanks for the good and helpful work
  • eshirima
    eshirima almost 7 years
    @rambossa Anytime.. Don't forget to upvote the answer if it helps resolve your issue.. Happy coding. Cheers mate!!
  • Michael Ramos
    Michael Ramos almost 7 years
    @eshirima , how did you go about deciding on the size of the images in the training set? I currently have a set of large images, 2880X1800, but am worried these are too big. I would want the images to be as close to real-world resolutions as possible and am afraid that reducing the image size might hurt accuracy when detecting the actual objects. Thoughts?
  • eshirima
    eshirima almost 7 years
    2880X1800 is too big for sure. If you look at the config file under image_resizer, the object detector ends up resizing every image to 300X300. I feed it images of 618X816 though and it still does a good job of detecting my desired classes. I'd recommend resizing the images first before running the detection to see what scales still maintain a good visual of your objects (this is what I did as well). You could also tweak the image_resizer parameters, run your detector and compare results (see the config sketch at the end of this thread).
  • Michael Ramos
    Michael Ramos almost 7 years
    @eshirima thanks. So the resizer is also smart enough to adjust the annotations and bounding boxes drawn for the original images?
  • eshirima
    eshirima almost 7 years
    I can't give you a concrete answer to that, but at the core, the bounding boxes are estimates of the location of the pixels that make up the majority of your object's attributes/features. The final box that you see is actually the result of multiple closely-packed boxes grouped together. The issue with feeding the entire 2880X1800 is that you'd end up with too many features to hold in memory, and it would be computationally penalizing, resulting in a single layer's computation taking a long time.
  • eshirima
    eshirima almost 7 years
    The idea behind resizing is to find enough features such that they can be held in memory without being too penalizing computationally. Theoretically, once it has learnt all of these features, it should be able to find them in larger images as well. But processing large frames is still an ongoing problem in computer vision.
  • Shamane Siriwardhana
    Shamane Siriwardhana almost 7 years
    @eshirima I tried to train the model with just 5 example images. I converted my dataset to TFRecords with the Oxford-IIIT format (using the pet script), but my model gives errors when executing. It gives a warning about sparse matrices taking a lot of memory, so I thought of changing the batch size in the config file. But by default it is 1, so what should I change?
  • eshirima
    eshirima almost 7 years
    @ShamaneSiriwardhana Was it giving you errors or warnings? If errors, what were they? The batch size is just the number of images you want to feed in at a time. I'm not sure if that'll help resolve the warnings/errors. This answer will help you understand batch sizes.
  • Shamane Siriwardhana
    Shamane Siriwardhana almost 7 years
    @eshirima I know what batch size is. In the default config it says one. Normally in SGD we take a mini-batch, so why have they put a batch size of 1 in the config file? Yes, those are warnings (!!). This is the error: "The replica master 0 ran out-of-memory and exited with a non-zero status of 247".
  • eshirima
    eshirima almost 7 years
    @ShamaneSiriwardhana Which config file are you using? All of them, with the exception of the SSD extractors, have batch_size: 1. It does make sense why they'd do so, because they don't know the users' available memory, so they set it to the lowest batch size. Also, how big are your images? How much memory do you have? I did my training on a 16GB machine. Try tweaking the image size parameters in the config file as well. Or it may be that the model is just too big to keep in memory; train on MobileNet instead then.
  • Shamane Siriwardhana
    Shamane Siriwardhana almost 7 years
    I am using the cloud. The config file was the one in the pet detection tutorial. I actually trained the pet detector as given in the tutorial. Then I wanted to train it to detect two objects, so I took only 5 examples, converted them into TFRecords and uploaded them to the cloud as given in the tutorial. But I got errors saying "sparse matrices can take a lot of memory etc". I think it's a lack of data to train on the cloud. What do you think?
  • eshirima
    eshirima almost 7 years
    @ShamaneSiriwardhana I trained mine locally. It's funny you should mention the memory thing, because I received a warning as well: "UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory." But it never actually stopped or crashed.
  • Shamane Siriwardhana
    Shamane Siriwardhana almost 7 years
    @eshirima I think I found the error. It's my image size (nearly 3 MB) and the height and width. At what scale should I keep these things?
  • eshirima
    eshirima almost 7 years
    I actually looked at your real-time post as well and learnt a lot from it. A couple of questions/suggestions. 1: In the config file, do you have an idea of what num_hard_examples and num_examples represent? 2: For image annotations on a Mac, you could've used RectLabel. 3: I was actually about to explore training on my own dataset which isn't of Pascal VOC format. You beat me to the punch :)
  • Dat Tran
    Dat Tran almost 7 years
    Hey, thanks for the suggestions :) I had a look at RectLabel; looks pretty good, I will give it a try. Concerning your first question, num_hard_examples has something to do with the hard example miner. Have a look at this paper to understand it. num_examples has something to do with the evaluation: during the evaluation it fetches images, and you need to specify how many you have. They also used max_eval to limit the evaluation process. For number 3 :) Yeah, it doesn't matter, haha, it's not who comes first but learning from each other.
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    Hi, did you add hard negative samples when training the SSD? Because the paper says it will work better if we add hard negative samples.
  • Yirga
    Yirga over 6 years
    I read your blog @DatTran and I have one question! Can we train our dataset using a CPU?
  • Dat Tran
    Dat Tran over 6 years
    @Yirga sure but that can take a while.
  • Michael Ramos
    Michael Ramos over 6 years
    @DatTran if we wanted to train for rectangular detection, and trained images on such rectangles, do you think your method would be better at creating a model that recognizes those specific/actual rectangular boundings (vs. jumpy bounds like the raccoon's)?
  • Dat Tran
    Dat Tran over 6 years
    @rambossa if you care about the stability of those rectangles, you should have a look at ROLO.
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    @DatTran Hi, I am trying to train SSD-MobileNet in order to detect 13 classes. I also trained a Faster R-CNN-ResNet101. My training images have a resolution of 265x450 (most of them) and each class had 400 images. Then this weird thing happened: Faster R-CNN converged fast with a batch size of 1, but my SSD didn't; it's not converging at all. Here are the loss graphs: stackoverflow.com/questions/45633957/…
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    @eshirima I found a nice explanation about hard negative samples in the SSD research paper. Normally, when we compute the loss, most of the default boxes are negative, so there is an imbalance. So we select the negative boxes with the highest confidence loss (which means the topmost things the algorithm couldn't correctly identify as background): "we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training". (A sketch of this selection appears at the end of this thread.)
  • eshirima
    eshirima over 6 years
    @ShamaneSiriwardhana Which paper? Is it titled SSD: Single Shot MultiBox Detector?
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    @eshirima Yeah, that's the paper. Check those quotes. And did you get any false positive examples?
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    @eshirima I have this question: which dataset format is good, Oxford-IIIT or PASCAL? I think with Oxford-IIIT, when converting the data into TFRecords you can only have one class per training image, not multiple types.
  • eshirima
    eshirima over 6 years
    @ShamaneSiriwardhana I encountered some false positive detections after training. This is prone to happen because the model isn't guaranteed to always be 100% correct, since the loss never fully converges to 0. Regarding datasets, I used PASCAL because it was the industry standard before ImageNet, hence a larger community.
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    @eshirima There is not much difference between PASCAL and Oxford-IIIT. The given converting script for the Oxford dataset uses the image name as the label of the data, which means we can only draw one bounding box per image, and that box should belong to the class indicated by the image name. But in the given script for converting the PASCAL dataset, it clearly obtains the class from the object name in the XML file, so we can draw multiple objects in one image. Anyway, my image set had only one bounding box per image. What about you? Did you have many ground truth boxes?
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    @eshirima Refer to the code in this from the TF-OD API: github.com/tensorflow/models/blob/master/object_detection/core/…
  • Shamane Siriwardhana
    Shamane Siriwardhana over 6 years
    Do you think that can affect the accuracy?
  • Michael Ramos
    Michael Ramos over 6 years
    @DatTran do you have any insight on more accurate result bounding boxes (ones that can skew and rotate with the detected object) vs. the generic square around the result? I've seen some OpenCV examples that seem to do this...
  • Jundong
    Jundong over 6 years
    @ShamaneSiriwardhana You mentioned that your training images are 2880 by 1800. How big is the object you are trying to detect in each image? Did your training eventually succeed? I am building an object detector on my own data; all images are 1920 by 1080, and each object to be detected is only about 50 by 15. It is difficult for me to obtain decent results.
  • YuFeng Shen
    YuFeng Shen over 6 years
    May I know if TF Object Detection can be combined with transfer learning?
  • Ciprian Tomoiagă
    Ciprian Tomoiagă over 6 years
    @Jason Yes, you can load already-trained models and refine them on your dataset. Consider asking another question with more details.
  • Stepan Yakovenko
    Stepan Yakovenko almost 5 years
    How do I apply the model to test images? How can I get the rectangles? This is what's missing.
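
(Config sketch referenced from the image_resizer discussion in the comments above.) In an SSD-style config, the block being tweaked looks roughly like this; Faster R-CNN configs use a keep_aspect_ratio_resizer instead:

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}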
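(Sketch referenced from the hard-negative-mining discussion in the comments above.) A minimal NumPy illustration of the SSD paper's 3:1 selection heuristic, not the actual TF OD API code; conf_loss and is_positive are assumed per-default-box arrays (float losses and a boolean positive mask):

import numpy as np

def hard_negative_mask(conf_loss, is_positive, neg_pos_ratio=3):
    # Keep every positive default box, plus the negatives with the
    # highest confidence loss, capped at neg_pos_ratio per positive.
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    # Rank only the negative boxes by confidence loss, descending
    neg_losses = np.where(is_positive, -np.inf, conf_loss)
    hard_neg_idx = np.argsort(-neg_losses)[:num_neg]
    keep = is_positive.copy()
    keep[hard_neg_idx] = True
    return keep  # boolean mask over default boxes to include in the loss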