Object recognition refers to the task of classifying objects in images or videos, and it has been an active research topic in recent years. Deep convolutional neural networks (CNNs) have driven great progress in object recognition on still images, but object recognition on videos remains underexplored. Because of complexities in videos such as pose and scale changes, directly applying still-image frameworks to videos does not achieve satisfactory performance; instead, we should exploit the special characteristics of videos. For example, videos contain temporal information (in other words, a video can be treated as a sequence of images) and richer contextual information than still images. Taking advantage of this information should therefore lead to better results.
In this project, we choose T-CNN as the baseline framework. T-CNN is a deep learning framework that combines object detection and object tracking to incorporate temporal and contextual information in videos. The objective of this project is to improve on the baseline's performance by exploring different modifications to it. Specifically, the enhanced framework will be evaluated on the ImageNet ILSVRC2015 VID dataset, using mean Average Precision (mAP) as the evaluation metric.
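To make the evaluation metric concrete, the following is a minimal sketch of average precision for one class, averaged over classes to obtain the mean. It assumes each detection has already been matched against the ground truth and flagged as a true or false positive; the function names and the all-point interpolation of the precision-recall curve are illustrative choices, not the official ILSVRC evaluation code.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class.

    scores: confidence of each detection.
    is_true_positive: matching flags (assumed to be computed beforehand).
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")

    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, is_true_positive),
                    key=lambda pair: pair[0], reverse=True)

    recalls, precisions = [], []
    tp = fp = 0
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Interpolate: precision at a rank is the best precision at any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)
```

For instance, three detections ranked as hit, miss, hit against two ground-truth objects would be scored with `average_precision([0.9, 0.8, 0.7], [True, False, True], 2)`.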
Tentative modifications to the baseline framework
It was shown that the original motion-guided propagation (MGP) produces too many duplicates, which can increase the computation cost in later stages. To reduce duplicates, this modification
The quality of single-frame results is essential to the overall performance. This modification therefore attempts to incorporate contextual information from neighboring frames, expecting the enhanced feature maps to result in better proposals and more accurate classification.
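One simple way to picture mixing information from neighboring frames is a temporal filter applied at every feature position. The sketch below is only an illustration under strong assumptions: plain Python lists stand in for feature maps, and a fixed kernel (the hypothetical `weights`) stands in for weights that a network would learn. It is not the project's actual implementation.

```python
def aggregate_neighbor_features(frame_features, weights=(0.25, 0.5, 0.25)):
    """Blend each frame's feature vector with those of its temporal neighbors.

    frame_features: list of per-frame feature vectors (equal-length lists).
    weights: temporal kernel of odd length centered on the current frame
             (hypothetical values chosen for illustration).
    """
    if len(weights) % 2 == 0:
        raise ValueError("temporal kernel must have odd length")

    radius = len(weights) // 2
    last = len(frame_features) - 1
    enhanced = []
    for t in range(len(frame_features)):
        blended = [0.0] * len(frame_features[t])
        for k, weight in enumerate(weights):
            # At sequence boundaries, repeat the first or last frame.
            src = min(max(t + k - radius, 0), last)
            for c, value in enumerate(frame_features[src]):
                blended[c] += weight * value
        enhanced.append(blended)
    return enhanced
```

With the default kernel, a feature that fires only in the middle frame, as in `aggregate_neighbor_features([[0.0], [4.0], [0.0]])`, is spread to the two adjacent frames at reduced strength.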
To enforce temporal consistency on detections from adjacent frames, a temporal loss will be considered in addition to the original loss function.
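The exact form of the temporal loss is not specified here. As a hedged sketch, one possible form penalizes the squared change in class scores between adjacent frames and adds it to the original loss. The function names and the `temporal_weight` parameter are assumptions made for illustration.

```python
def temporal_consistency_loss(frame_scores):
    """Mean squared change in class scores between adjacent frames.

    frame_scores: list of per-frame score vectors for the same tracked
                  object (equal-length lists of floats).
    """
    total, count = 0.0, 0
    for previous, current in zip(frame_scores, frame_scores[1:]):
        for p, c in zip(previous, current):
            total += (c - p) ** 2
            count += 1
    # Fewer than two frames (or empty vectors) means nothing to penalize.
    return total / count if count else 0.0


def total_loss(original_loss, frame_scores, temporal_weight=0.1):
    """Original loss plus a weighted temporal term (weight is hypothetical)."""
    return original_loss + temporal_weight * temporal_consistency_loss(frame_scores)
```

Scores that stay constant across frames contribute nothing to the temporal term, so in that case `total_loss` reduces to the original loss.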
|Task||Status||Start Date||End Date|
|Programming environment setup||Completed||25/Sep/2016||10/Oct/2016|
|Coarse baseline on VOC||Completed||10/Oct/2016||3/Jan/2017|
|Baseline on VID||Completed||3/Jan/2017||15/Jan/2017|
|Selected MGP - 1||Completed||15/Jan/2017||16/Jan/2017|
|Selected MGP - 2||Completed||19/Jan/2017||19/Jan/2017|
|3D Convolution for feature maps||Completed||4/Feb/2017||10/Apr/2017|
|Documentation and Wrap-up||Completed||10/Apr/2017||16/Apr/2017|
|Merging feature maps using optical flow||Future work||--||--|
|Temporal loss||Future work||--||--|