The goal of this project is to address pixel-wise identification of car and road/lane objects in camera images or video captured from Carla simulation environment. In this project, the model training is done on the efficient neural network called MobileNets using depth-wise separable convolutions combined in a encoder-decoder structure for the Semantic Segmentation Detection (SSD).
With light-weighted network architecture, the model could be trained using single GTX1080Ti GPU, 15k images captured from Carla, and 20 epoches in 10 - 20 hours. After the training, the model is able to produce good results: 97% IOU and 0.9179 FScore with high frame rate of 15 - 16 FPS using Lyft Challenge platform.
The backbone of encoder for SSD model uses MobileNet architecture with several modifications:
- The depthwise separable convolution is kept but the 1-, 2-, 4- dilated convolutions are added. In this way, the context module aggregate multi-scale contextual information. The detailed info could be found in Multi-Scale Context Aggregation by Dilated Convolutions.
Depthwise Separable Convolution
- The layers with depth of 1024 in the original MobileNet are removed. The output of the encoder is in depth of 512, it is to increase the speed of the model.
- The encoder is added following the MobileNet Encoder for the pixel-wise SSD. Also, the s16, s8, s4, and s2 layers are bypassed connected to the decoder.
The stree camera data could be generated using Carla Simulation Tools with python script. More than 15k images with ground truth labels generated using provided Carla simulation tools.
The following code is the setting in python code to generate the camera images and ground truth for SSD.
# The default camera captures RGB images of the scene.
camera0 = Camera('CameraRGB')
# Set image resolution in pixels.
camera0.set_image_size(800, 600)
# Set its position relative to the car in meters.
camera0.set_position(1.30, 0, 1.30)
settings.add_sensor(camera0)
camera1 = Camera('CameraSeg', PostProcessing='SemanticSegmentation')
camera1.set_image_size(800, 600)
camera1.set_position(1.30, 0, 1.30)
settings.add_sensor(camera1)
To preprocess data, only car's label and road/lane label are used for SSD training and final testing results. Therefore, the Carla generate ground truth labels need to be processed. Also, the car itself's hood is also in the image, in order to avoid the mis-training to include that in the SSD, the car labels of own car's hood need to be removed, by using the following code:
def preprocess_labels(label_image):
road_id = 7
lane_id = 6
car_id = 10
# Create a new single channel label image to modify
labels_new = np.copy(label_image[:, :, 0])
# Identify lane marking pixels (label is 6)
lane_marking_pixels = (label_image[:, :, 0] == lane_id).nonzero()
# Set lane marking pixels to road (label is 7)
labels_new[lane_marking_pixels] = road_id
# Identify all vehicle pixels
vehicle_pixels = (label_image[:, :, 0] == car_id).nonzero()
# Isolate vehicle pixels associated with the hood (y-position > 496)
hood_indices = (vehicle_pixels[0] >= 496).nonzero()[0]
hood_pixels = (vehicle_pixels[0][hood_indices], \
vehicle_pixels[1][hood_indices])
# Set hood pixel labels to 0
labels_new[hood_pixels] = 0
# Return the preprocessed label image
return labels_new
The following are the ground truth images after the label processing.
Car Label
Road Label
Background Label
Also, the augmentaions of flipping in horizontal and brightness changes are randomly added to the images for
The training data set is in size of 19k, and the training and validation sets are 80% and 20%, the batch size is 12 due to the limit of the GPU memory.
N_train: 15302 N_val:3826 Train steps: 1275 Val_steps: 318 Batch size: 12
The cost function for the optimization is the combination of cross entropy and the modified IOU with different weight of car's F score and road's F score. The weight for car and road are 80% and 20%, it is because road's variation from image to image is small and F score is always much better than car's F score. By adding more weight on the car, the training would improve more on the car's SSD results.
The score is calculated based on the provided equations:
The following code is the calculation of the loss for optimization. Since the higher the score the better SSD, then, then use difference between the max F score and the calculated weighted F score for the minimization process.
weights = [0.0, 0.8, 0.2]
ce_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=correct_label, logits=logits))
loss = ce_loss + (F_max - (weights[1] * F_car + weights[2] * F_road))
Also, the training status is saved during the training for monitoring the loss improvement using the tensorboard. The training paramteres are as following:
learning_rate_val = 0.001
epochs = 20
decay = learning_rate_val / (2 * epochs)
batch_size = 12
After 20 epochs of learning with learning rate of 0.001, the fine tuning is followed, with 1/10th of learning rate, and the first 10 layers are locked. The performance of the model could be improved a little further, but not very much.
Here are a few predictions @~97% validation IoU. More testing results could be found in the runs folder. The SSD model is able to label most of the road pixels and car pixels accurately and the according to the grader program on the Challenge workspace, the FPS reaches to 16 FPS, which is fast enough for real-time SSD applications.
Your program has run, now scoring...
Your program runs at 15.151 FPS
Car F score: 0.848 | Car Precision: 0.803 | Car Recall: 0.860 | Road F score: 0.988 | Road Precision: 0.988 | Road Recall: 0.986| Averaged F score: 0.918
To improve the SSD performance better with trade-off on the FPS, several models could be implemented: including MobileNetV2, MobileNetV2 Keras, deeplabV3+ with backbone on MobileNet V2 and Xception. I have tried to adapted those models into current training implementation, and I was able to train the model with very small batch size of 4 with current GTX1080Ti GPU. More simplifications on the models are needed for fast and high accuracy SSD.
Make sure you have the following is installed:
To train the model, please save the data under the ./Train folder. ./Train/Test folder is for final test. ./Train/episode_* folders are for data captured from Carla program. Under each ./Train/episode_* folder, ./Train/episode_number/CameraRGB are images for training, and ./Train/episode_number/CameraSeg are data of ground truth labels.
Command to train the model:
python main.py
The saved trained model is under foder ./checkpoint
To use the specific model for inference testing, change the 'model_path' variable in 'common.py' to the model for testing.
To run the model for grading:
grader 'python ./demo.py'
-
Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen link
-
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam link
-
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam. arXiv: 1802.02611. link
-
Xception: Deep Learning with Depthwise Separable Convolutions François Chollet link