Object Recognition with Videos


Supervisor: Dr. Kenneth Wong
Student: Liu Bingbin

Project Poster

Overview


Object recognition refers to the task of classifying objects in images or videos and has been an active research topic in recent years. Using deep convolutional neural networks (CNNs), great progress has been made in object recognition with still images; however, object recognition with videos remains underexplored. Due to complexities in videos such as pose and scale changes, directly applying still-image frameworks to videos does not achieve satisfactory performance. Instead, we should exploit the special characteristics of videos: they carry temporal information (a video can be treated as a sequence of images) and richer contextual information than still images, so taking advantage of this information should lead to better results.

In this project, we choose T-CNN as the baseline framework: a deep learning framework that combines object detection and object tracking, incorporating the temporal and contextual information in videos. The objective of this project is to improve on the baseline by exploring different modifications to it. Specifically, the performance of the enhanced framework will be evaluated on the ImageNet ILSVRC2015 VID dataset, using mean Average Precision (mAP) as the evaluation metric.
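As a concrete reference for the evaluation metric, below is a minimal Python sketch of how average precision might be computed for a single class, assuming detections have already been matched to ground-truth boxes by an IoU threshold. The function name and arguments are illustrative, not part of the official evaluation toolkit.

```python
# Illustrative sketch: average precision (AP) for one class, given
# detections already matched to ground truth by an IoU threshold.
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """scores: confidence of each detection; is_true_positive: bool flags
    from IoU matching; num_ground_truth: total annotated objects."""
    order = np.argsort(-np.asarray(scores))            # rank by confidence
    tp = np.asarray(is_true_positive)[order].astype(float)
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_ground_truth
    precision = tp_cum / (tp_cum + fp_cum)
    # Integrate precision over recall (area under the PR curve).
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# mAP is then the mean of the per-class APs over all VID categories.
```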

Methodology


Tentative modifications to the baseline framework

Selected MGP

It was shown that the original MGP (motion-guided propagation) produces too many duplicate detections, which increases the computation cost in later stages. To reduce duplicates, this modification propagates only a selected subset of detections to neighboring frames, as sketched below.
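A minimal sketch of the idea, assuming dense optical flow is available between adjacent frames; the function name, the `top_k` parameter, and the selection-by-score rule are illustrative assumptions, not the exact implementation:

```python
# Illustrative sketch of "selected" MGP: instead of propagating every
# detection (the original MGP), only the highest-scoring boxes are shifted
# to the adjacent frame, cutting the number of duplicates downstream.
import numpy as np

def select_and_propagate(boxes, scores, flow, top_k=20):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences;
    flow: dense optical flow of shape (H, W, 2) toward the next frame."""
    keep = np.argsort(-scores)[:top_k]        # keep only the top-K detections
    propagated = []
    for box in boxes[keep]:
        x1, y1, x2, y2 = box.astype(int)
        # Mean flow inside the box approximates the object's motion.
        mean_flow = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
        propagated.append(box + np.tile(mean_flow, 2))
    return np.array(propagated), scores[keep]
```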

Enhanced Feature Maps

The quality of the single-frame results is essential to the overall performance. This modification therefore attempts to incorporate contextual information from neighboring frames, expecting the enhanced feature maps to yield better proposals and more accurate classification.
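One simple way this could be realized is a weighted average of convolutional feature maps from adjacent frames; the sketch below makes that assumption (the weights and function name are illustrative). Ideally the neighboring maps would first be spatially aligned, e.g. by warping with optical flow, as noted under future work in the schedule.

```python
# Illustrative sketch: enhance a frame's feature map with a weighted
# average of its temporal neighbors, so the proposal and classification
# stages see context beyond a single frame. The weights are assumptions.
import numpy as np

def enhance_feature_map(feat_prev, feat_cur, feat_next, w_neighbor=0.25):
    """Each argument is a (C, H, W) feature map from the same CNN layer.
    The current frame dominates; the neighbors contribute context."""
    w_cur = 1.0 - 2.0 * w_neighbor
    return w_cur * feat_cur + w_neighbor * (feat_prev + feat_next)
```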

Temporal Loss

To enforce temporal consistency on detections from adjacent frames, a temporal loss will be considered in addition to the original loss function.
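A sketch of one possible form of such a loss, penalizing differences between the boxes and scores of the same tracked object in adjacent frames; the squared-difference form and the weight `lam` are assumptions for illustration, not a design the project has committed to:

```python
# Illustrative sketch of a temporal consistency loss: penalize changes in
# a tracked object's box and class score between adjacent frames, added
# on top of the detector's original loss. The weight `lam` is assumed.
import numpy as np

def temporal_loss(boxes_t, boxes_t1, scores_t, scores_t1, lam=0.1):
    """boxes_*: (N, 4) boxes of the same N tracked objects in frames t, t+1;
    scores_*: (N,) class confidences for those objects."""
    box_term = np.mean(np.sum((boxes_t - boxes_t1) ** 2, axis=1))
    score_term = np.mean((scores_t - scores_t1) ** 2)
    return lam * (box_term + score_term)

# total_loss = detection_loss + temporal_loss(...)  # added to the original loss
```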

Schedule


Task                                     Status       Start Date   End Date
Literature Review                        Completed    1/Sep/2016   30/Sep/2016
Programming environment setup            Completed    25/Sep/2016  10/Oct/2016
Coarse baseline on VOC                   Completed    10/Oct/2016  3/Jan/2017
Baseline on VID                          Completed    3/Jan/2017   15/Jan/2017
Selected MGP - 1                         Completed    15/Jan/2017  16/Jan/2017
Selected MGP - 2                         Completed    19/Jan/2017  19/Jan/2017
NMS Variants                             Completed    17/Jan/2017  10/Feb/2017
3D Convolution for feature maps          Completed    4/Feb/2017   10/Apr/2017
Propagated Proposals                     Completed    20/Mar/2017  10/Apr/2017
Documentation and Wrap-up                Completed    10/Apr/2017  16/Apr/2017
Presentation                             Completed    21/Apr/2017  21/Apr/2017
Merging feature maps using optical flow  Future work  --           --
Temporal loss                            Future work  --           --

Documentation