ML Speed Sign Recognition
Introduction
In our project, we were inspired by recent advancements in deep learning and self-driving technology to train a portable speed-limit sign detector. This problem is essential for self-driving cars because a vehicle must be aware of the speed limit on its current section of road or highway. If a self-driving car has no way to determine the speed limit, either through GPS and database referencing or through speed limit sign detection, it could incorrectly infer that it is traveling at the correct speed when it is actually going too fast or too slow. To make inference about the local speed limit fast enough, we used Google's MobileNet-SSD, a fast object detection network for mobile devices, together with TensorFlow's Object Detection API. We used two datasets, the LISA dataset and the Berkeley DeepDrive dataset, and made a small dataset of our own with Google Images.
Related Work
Using cuda-convnet, originally built for CIFAR-10 classification, on the LISA traffic sign dataset to detect four US traffic sign super-classes (no turn, speed limit, stop, and warning), [3] achieves 97% accuracy on the test data. The authors use data augmentation and dropout to avoid over-fitting, and hard negative mining to make the network learn from images it classified incorrectly (false positives).
In [4], a TSR system implemented on the NVIDIA Jetson TX1 is presented. The deep neural network is trained on the GTSRB (German Traffic Sign Recognition Benchmark), since this is a larger dataset. The system pairs a detection algorithm that uses both color and shape information with a recognition algorithm based on a CNN, achieving 96.2% accuracy.
In the case of infrequent objects, such as traffic signs, hand-labeling many hours of road scene video for an adequate dataset is impractical. In [5], the authors propose an iterative search-and-learn method capable of quickly creating object detection datasets numbering in the tens of thousands of real-world examples within days, using only a few hours of human labeling. Starting with a few hundred hand-labeled objects, they create a "search detector" with sufficient precision to find a larger collection of real-world examples in an extensive video library. These examples can then be used to train an improved search detector.
Other researchers, such as [6], propose methodologies to ensure that classification complies with regulatory standards and to determine physical degradation status. They use transfer learning to avoid the issues that arise with the small datasets they had available.
There is also research on handling variations in traffic signs across countries. [7] introduces a neural-network-integrable unit, the Dense Spatial Translation Network (DSTN), that compensates for complex intra-class variations in spatial appearance. This efficient unit is explicitly designed for this rectification task. Instead of increasing network capacity, the problem is tackled by sampling input feature maps augmented by intra-class variations and producing output feature maps that compensate for these variations, which should simplify the classification task.
Methodology
Tools
TensorFlow Object Detection API:
We use the TensorFlow Object Detection API to do transfer learning based on the pre-trained models listed below [8].
OpenCV:
We used OpenCV alongside TensorFlow to process videos and do real-time detection.
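As a reference for how these pieces fit together, here is a minimal sketch of the detection loop, assuming TensorFlow 1.x and a frozen graph exported by the Object Detection API; the graph and video file names are placeholders.

```python
import cv2
import tensorflow as tf

GRAPH_PATH = 'frozen_inference_graph.pb'  # exported detector (placeholder path)

# Load the frozen detection graph once.
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image_tensor = graph.get_tensor_by_name('image_tensor:0')
    det_boxes = graph.get_tensor_by_name('detection_boxes:0')
    det_scores = graph.get_tensor_by_name('detection_scores:0')

    cap = cv2.VideoCapture('drive_test.mp4')  # or 0 for a live camera
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # The API expects batched RGB images; OpenCV delivers BGR.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes, scores = sess.run([det_boxes, det_scores],
                                 feed_dict={image_tensor: rgb[None, ...]})
        h, w = frame.shape[:2]
        for box, score in zip(boxes[0], scores[0]):
            if score < 0.5:  # display threshold
                continue
            y1, x1, y2, x2 = box  # normalized [ymin, xmin, ymax, xmax]
            cv2.rectangle(frame, (int(x1 * w), int(y1 * h)),
                          (int(x2 * w), int(y2 * h)), (0, 255, 0), 2)
        cv2.imshow('detections', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
```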
ssd_mobilenet_v1_coco model:
The architecture is shown in Figure 1. The network uses MobileNet as the feature extractor and runs SSD detection to find interesting regions. The advantage of this network is speed: it can be trained in a short time, and the trained model can run on a smartphone. From Figure 2, we can tell that the feature extraction accuracy of MobileNet is low compared with other models.
faster_rcnn_inception_resnet_v2_atrous_coco model:
The architecture is shown in Figure 3. The network runs the image through a CNN to get a feature map and then through a Region Proposal Network, which outputs interesting regions. From Figure 2, we can tell this model is more accurate than the MobileNet-SSD model. However, we found that Faster RCNN models are slower than MobileNet models.
Figure 3: A diagram of Faster RCNN inception v2 [11], which we trained and used because of its accuracy.
Datasets
We chose the LISA dataset [1], developed by the University of California, San Diego. It is the most extensive available dataset of US traffic signs. According to section 3.3 of the dataset description document, it contains 47 US sign types with 7855 annotations on 6610 frames. The dataset was created from footage captured while driving around California, mostly in San Diego, with several different vehicles and cameras. The resolution of the full captured frames varies from 640x480 to 1024x522 pixels, and the annotations range from 6x6 to 167x168 pixels. Some images are in color and some in gray-scale; the authors argue that this is closer to reality, but it makes the dataset unsuitable for developing color-dependent detection algorithms.
According to section 3.4 of the dataset description, the size of the dataset is on par with some other datasets, like the STS dataset and the KUL dataset. However, it contains many sign types, and not all sign types are well represented. Depending on the learning algorithm used for a TSR system, we might need a large number of images, on the order of thousands. If the super-classes of the sign categories are considered, however, speed limit signs and warning signs are both well represented. The dataset is therefore more suitable for general speed limit sign detection than for classifying specific speed limit signs.
We have also used the Berkeley DeepDrive dataset [2] to extract speed limit sign images to expand our dataset. This is a much larger dataset. The Road Object Detection section contains 100,000 images annotated for buses, traffic lights, traffic signs, people, bikes, trucks, motorcycles, cars, trains, and riders.
Training Our Model Using the LISA Dataset
We initially trained our models using the LISA dataset, which contains 1376 speed limit signs in total. We were hoping for high-quality images, but the LISA images are of low quality, as shown in Figure 4. Of the 1376 speed limit signs, the two largest classes are 25 mph and 35 mph, which contain 349 and 538 signs, respectively.
We first prepared the data by writing a Python script to convert the labels from the LISA dataset into the standard TensorFlow label format compatible with our model. Second, we modified the MobileNet-SSD network configuration file. This change was necessary because the configuration was previously used to classify many different objects, such as people and kites. We updated the file so that the model recognized either one class or two classes. In the first case, a single generic speed-limit-sign class covered all 1376 training images; this case had the most training data, and we thought it would be the most promising. The second case used two classes, 25 mph and 35 mph speed limit signs, with 349 and 538 images, respectively. A sketch of the conversion script appears below.
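Here is a minimal sketch of such a conversion for the one-class case, assuming TensorFlow 1.x; the file paths and the semicolon-delimited column names of the LISA annotation CSV are assumptions to adapt to the actual dataset layout.

```python
import csv
import io
import tensorflow as tf
from PIL import Image

def bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

def float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

writer = tf.python_io.TFRecordWriter('lisa_train.record')
with open('allAnnotations.csv') as f:
    for row in csv.DictReader(f, delimiter=';'):
        if 'speedLimit' not in row['Annotation tag']:
            continue  # keep only speed limit annotations
        with open(row['Filename'], 'rb') as img_f:
            encoded = img_f.read()
        width, height = Image.open(io.BytesIO(encoded)).size
        # Standard feature keys expected by the Object Detection API;
        # box corners are normalized to [0, 1].
        example = tf.train.Example(features=tf.train.Features(feature={
            'image/height': int64_feature([height]),
            'image/width': int64_feature([width]),
            'image/filename': bytes_feature([row['Filename'].encode()]),
            'image/encoded': bytes_feature([encoded]),
            'image/format': bytes_feature([b'png']),
            'image/object/bbox/xmin': float_feature(
                [float(row['Upper left corner X']) / width]),
            'image/object/bbox/ymin': float_feature(
                [float(row['Upper left corner Y']) / height]),
            'image/object/bbox/xmax': float_feature(
                [float(row['Lower right corner X']) / width]),
            'image/object/bbox/ymax': float_feature(
                [float(row['Lower right corner Y']) / height]),
            'image/object/class/text': bytes_feature([b'speedLimit']),
            'image/object/class/label': int64_feature([1]),
        }))
        writer.write(example.SerializeToString())
writer.close()
```

The matching configuration change is then to set num_classes to 1 (or 2 for the 25 mph/35 mph case) in the pipeline config and point fine_tune_checkpoint at the pre-trained model.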
We then trained the ssd_mobilenet_v1_coco model using the TensorFlow Object Detection API on a team member's local server equipped with an NVIDIA GeForce GTX 960. Training took many hours with this graphics card, but the time frame was satisfactory.
Results
Testing Our Preliminary Models with Real Drive Tests in Blacksburg, VA
We collected our test dataset by driving around town during the daytime while recording with a 1080p fish-eye-lens dash cam. We manually went through the video recordings, counted the number of each speed limit sign, and used this count to judge our performance on both accuracy and false positive rate; we sketch this bookkeeping below. In post-editing, we fast-forwarded through any scene without a speed limit sign to save time, as shown in the YouTube demo below. No fish-eye lens correction was applied to our test video. We also show two short clips side by side in Figures 5 and 6.
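As a concrete illustration, detection rate and precision can be computed directly from these manual counts; the numbers below are placeholders, not our actual tallies.

```python
# Scoring a drive test from manual counts (illustrative values only).
signs_in_video = 20    # speed limit signs counted by hand in the footage
signs_detected = 17    # of those, signs the model labeled at least once
false_positives = 5    # non-signs the model labeled as speed limit signs

detection_rate = signs_detected / signs_in_video
precision = signs_detected / (signs_detected + false_positives)
print('detection rate: %.2f, precision: %.2f' % (detection_rate, precision))
```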
A 7 Minute Demo of Our Mobile Speed Limit Sign Detector
Our preliminary mobile speed limit sign detector worked well in Blacksburg. We were able to label speed limit signs in real-time around town.
Blacksburg isn't filled with the white rectangular signs that throw our preliminary model off. The Berkeley DeepDrive dataset, however, does contain examples of many busy streets.
In the next section, we test our model more rigorously. The results from the test encouraged us to train and test new models.
Using the Berkeley DeepDrive Dataset to Test Our Model and Improve Our Dataset
With the Berkeley DeepDrive dataset, we had access to much more data [2]. Around 100,000 labeled images are available from real-world drive tests, and the labeling consists of bounding boxes for traffic signs, traffic lights, cars, and motorcycles. The main issue with this dataset for our purposes is that it doesn't label speed limit signs individually; instead it groups them with other traffic signs. Thus, to make more sense of the data, we ran the network we had trained on the LISA dataset over these images to extract the ones it labeled as containing speed limit signs.
Of the 100,000 labeled images, around 57,000 contained traffic signs. These images were of interest to us since any speed limit signs would be among them. We initially took a small subset of 10,000 images, manually found a few speed limit signs, and checked the network against them. The results were quite poor: of around 30 images containing speed limit signs, only around 2 were labeled correctly, with the remaining images scoring too low to be labeled. Because of this, we ran inference on the full set of 57,000 images in two stages.
The first stage ran inference with a threshold of 0.1: if any object in an image was labeled with a score higher than 0.1, the image was stored in a separate folder for later review. At the end of stage 1, we had 1204 images in total. We then ran those 1204 images through stage 2, which consisted of running inference with a threshold of 0.5; images whose maximum score exceeded 0.5 were further separated into another folder. After stage 2, we manually separated the false positives from the correctly labeled images and stored them in separate folders. A sketch of this two-stage triage appears below.
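In the sketch below, max_score is a hypothetical helper that wraps one inference call and returns the highest speed-limit-sign score the detector assigns to an image.

```python
import os
import shutil

STAGE1_THRESHOLD = 0.1
STAGE2_THRESHOLD = 0.5

def triage(image_paths, max_score):
    """Sort images into folders by the detector's highest score."""
    os.makedirs('stage2', exist_ok=True)       # confident detections (> 0.5)
    os.makedirs('stage1_only', exist_ok=True)  # weak detections to review
    for path in image_paths:
        score = max_score(path)
        if score < STAGE1_THRESHOLD:
            continue  # discarded: nothing scored above 0.1
        dest = 'stage2' if score >= STAGE2_THRESHOLD else 'stage1_only'
        shutil.copy(path, dest)
```

After triage, the stage2 and stage1_only folders correspond to the sets we reviewed by hand for false positives and false negatives.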
After separating the correctly labeled images from the false positives, we found that we had 194 and 328 images, respectively. Of the stage 1 images that did not pass through stage 2, we had 682 images remaining, and 93 of these 682 were false negatives. The 328 false positives and 93 false negatives together mean around 421 of the 1204 images were labeled improperly. This count does not include the false negatives among the approximately 56,000 remaining images that did not pass stage 1, which are probably in the hundreds to low thousands.
As an example, we present a 50 mph speed limit sign correctly labeled in Figure 8 below. Looking over the set of false positives, the network appears to label mainly white rectangles as speed limit signs, as shown in Figures 9 and 10. These results are probably due to the LISA dataset being of such low quality that the main features the network learns for a speed limit sign are its rectangular shape and its color.
Our goals after finding these results were two-fold. First, we wanted to expand our training set with high-quality images, so we manually searched Google Images to grow our new dataset. Second, we wanted to retrain our object detector to reduce both false positives and false negatives.
Testing Our Updated Model
For our next experiment, we trained the Faster RCNN model on the same dataset plus additional images from Berkeley DeepDrive and Google Images, using the same setup as the previous experiment. With the added higher-resolution training and testing images, we hoped to improve our results.
A 7 Minute Demo of Our 2nd Generation Speed Limit Sign Detector
Unfortunately, Faster RCNN and the higher-resolution images did not straightforwardly improve on our old model, and we had mixed results. The network clearly starts to learn any white object with a black border as a speed limit sign, which indicates that it did not learn the speed-limit-sign features properly. However, it does seem to label speed limit signs for longer periods of time and more consistently.
Discussion
Training a speed limit sign detector is difficult because of a lack of data. Mislabeling occurs in both of our trained models because many traffic signs look similar to speed limit signs at low resolution. The Faster RCNN model seems to pick out speed limit signs and label them more consistently and for longer periods of time than MobileNet-SSD, but it also produces a large number of false positives. Thus, our improvements over the first model are mixed: on one hand, we label speed limit signs for longer periods and more consistently; on the other, the model responds more strongly to white traffic signs in general, so it produces more false positives than before.
We would improve our models in a number of ways if given the chance. First, we would train on the complete LISA dataset to reduce false positives. Our second experiment with Faster RCNN shows that the network is picking up undesirable features for speed limit signs. LISA is a complete traffic sign dataset, but we used only its speed limit sign portion; by learning all the different traffic sign types, the network might convert the current false positives into detections of other sign categories. We could also do transfer learning from an existing pre-trained model that has already learned most ordinary objects, such as windows.
Second, we would like to combine our best model with Optical Character Recognition (OCR). Although our network has a very high false positive rate, it also has a very high true positive rate. We could use computer vision techniques to post-process the cropped speed limit sign image produced by the network; this post-processing is essentially a filter on the neural network's output. We would return a positive result only if the OCR detects "SPEED LIMIT" and a number, which should effectively reduce the false positive rate. A sketch of such a filter appears below.
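The snippet below uses pytesseract as one possible OCR engine; we did not implement this step, so the choice of engine and the preprocessing are assumptions.

```python
import re
import cv2
import pytesseract

def is_speed_limit_sign(frame, box):
    """Accept a detection only if OCR finds 'SPEED' and 'LIMIT' plus a number."""
    y1, x1, y2, x2 = box                      # pixel coordinates of the crop
    crop = frame[y1:y2, x1:x2]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    # Otsu binarization helps Tesseract with the black-on-white sign face.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    text = pytesseract.image_to_string(binary).upper()
    return 'SPEED' in text and 'LIMIT' in text and bool(re.search(r'\d{2}', text))
```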
References
[1] "LISA dataset," UC San Diego. [Online]. Available: http://cvrr.ucsd.edu/LISA/lisa-traffic-sign-dataset.html. [Accessed: Nov 01, 2018].
Andreas Møgelmose, Mohan M. Trivedi, and Thomas B. Moeslund, "Vision based Traffic Sign Detection and Analysis for Intelligent Driver Assistance Systems: Perspectives and Survey," IEEE Transactions on Intelligent Transportation Systems, 2012.
[2] "Berkeley DeepDrive," UC Berkeley. [Online]. Available: http://bdd-data.berkeley.edu/. [Accessed: Nov 01, 2018].
[3] Y. Li, A. Møgelmose and M. M. Trivedi, "Pushing the “Speed Limit”: High-Accuracy US Traffic Sign Recognition With Convolutional Neural Networks," in IEEE Transactions on Intelligent Vehicles, vol. 1, no. 2, pp. 167-176, June 2016.
[4] Y. Han and E. Oruklu, "Traffic sign recognition based on the NVIDIA Jetson TX1 embedded system using convolutional neural networks," 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, 2017, pp. 184-187.
[5] G. Overett and W. Wang, "Iterative search & learn for sign detection in large datasets," 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, 2017, pp. 1399-1406.
[6] J. P. N. Acilo, A. G. S. Dela Cruz, M. K. L. Kaw, M. D. Mabanta, V. G. G. Pineda and E. A. Roxas, "Traffic sign integrity analysis using deep learning," 2018 IEEE 14th International Colloquium on Signal Processing & Its Applications (CSPA), Batu Feringghi, 2018, pp. 107-112.
[7] W. Zhu, J. Siegemund, and A. Kummert, "Dense Spatial Translation Network," 2018 IEEE International Conference on Vehicular Electronics and Safety (ICVES), 2018.
[8] “TensorFlow Object Detection API,” GitHub. [Online]. Available: https://github.com/tensorflow/models/tree/master/research/object_detection. [Accessed: Nov 02, 2018].
[9] "MobileNet-SSD Architecture," ResearchGate. [Online]. Available: https://www.researchgate.net/figure/MobileNet-SSD-AF-architecture-we-use-MobileNet-as-the-feature-extractor-network-and-SSD_fig7_324584455. [Accessed Dec 03, 2018].
[10] "Object Detection Speed and Accuracy Comparison," Medium. [Online]. Available: https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359. [Accessed: Dec 03, 2018].
[11] "From R-CNN to Mask R-CNN," Medium. [Online]. Available: https://medium.com/@umerfarooq_26378/from-r-cnn-to-mask-r-cnn-d6367b196cfd. [Accessed: Dec 03, 2018].