COMP4801





Weakly Supervised Localization with Temporal Information for Assembly Video Understanding

Implement a supervisory system to teach, monitor and guide workers on assembly lines using deep learning technology


Introduction

The real-world motivation for our study is to implement a supervisory system that teaches, monitors and guides workers on assembly lines. Robot systems commonly learn from images annotated with category labels and bounding boxes, an approach that has been well explored in work on supervised object detection. However, annotating images with precise bounding boxes takes considerable effort, which makes it infeasible in this scenario. Demonstration videos with phase-based labels, by contrast, are more accessible and more common in this industry.

In light of that, our work makes use of the implicit temporal information in videos rather than focusing only on image-level spatial information. Inspired by prior work that uses temporal approaches [8, 6], we feed videos with only phase-based object labels to the networks. The intuition is to let implicit temporal information serve as a free supervision signal [9, 5] that compensates for the absence of localization ground truth. Ideally, after being trained on simply labeled demonstration videos, the system can recognize all assembly phases and localize the objects of interest in each phase, so as to guide and monitor assembly work.

Methodology

  • 1. Class Activation Mapping

    In a network architecture such as ResNet, adding a global average pooling layer before the output layer preserves the spatial features detected in earlier layers. Weighting those feature maps by the output-layer weights of a given class produces a heat map, whose colors show which image regions the network relied on to discriminate that class.

  • 2. Cycle Consistency of Time

    The CCT model is a free-supervision technique: it learns from the video itself, without manual labels. The deep feature space embedded by φ can be used to propagate masks specified in the first frame throughout the video. We propose to use the localization information generated by CAMs as the mask for propagation. In this way, enforcing time cycle consistency ensures connectivity between neighboring frames.

  • 3. Data Collection

    We will train the model on a self-collected dataset of videos that demonstrate how to assemble a hard drive, a power supply and a CD-ROM into a computer case. The demonstration videos will be labelled with object categories on a per-phase basis.

  • 4. Evaluation

    We plan to compare localization accuracy with models that make use of CAMs. We also propose to test accuracy across different assembly scenarios: for instance, we will experiment with different manufacturing tasks and distinct working backgrounds.
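The ideas in items 1, 2 and 4 can be illustrated with a minimal, dependency-free sketch. It is not our implementation: it uses plain Python lists in place of network tensors, and every function name, the toy feature maps, the one-hot embeddings standing in for φ, the softmax temperature and the 0.4 threshold are assumptions made for illustration. The sketch computes a class activation map as a weighted sum of feature maps, carries it to the next frame through feature-space affinities, checks the forward-backward round trip, and scores boxes with intersection over union (one common localization measure, assumed here).

```python
import math


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W), rescaled to [0, 1]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)] for y in range(height)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat map
    return [[(v - lo) / span for v in row] for row in cam]


def propagate_mask(mask, feats_src, feats_dst, temperature=0.1):
    """Carry a flat soft mask from a source frame to a destination frame.

    Each destination location takes a softmax-weighted average of the source
    mask, weighted by dot-product similarity between location embeddings.
    """
    out = []
    for fd in feats_dst:
        sims = [sum(a * b for a, b in zip(fd, fs)) / temperature for fs in feats_src]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]  # numerically stable softmax
        out.append(sum(e * v for e, v in zip(exps, mask)) / sum(exps))
    return out


def bounding_box(heatmap, threshold=0.5):
    """Tightest inclusive cell box (x0, y0, x1, y1) around values >= threshold."""
    cells = [(x, y) for y, row in enumerate(heatmap)
             for x, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    xs, ys = zip(*cells)
    return (min(xs), min(ys), max(xs), max(ys))


def iou(box_a, box_b):
    """Intersection over union of two inclusive cell boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0) + 1)
    ih = max(0, min(ay1, by1) - max(ay0, by0) + 1)
    inter = iw * ih
    area_a = (ax1 - ax0 + 1) * (ay1 - ay0 + 1)
    area_b = (bx1 - bx0 + 1) * (by1 - by0 + 1)
    return inter / (area_a + area_b - inter)


# Toy data, invented for illustration: two 3 x 3 feature maps and the
# output-layer weights of one hypothetical class that only uses the first map.
feature_maps = [
    [[4, 2, 0], [2, 1, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 2], [0, 2, 4]],
]
class_weights = [1.0, 0.0]
cam = class_activation_map(feature_maps, class_weights)

# Hypothetical embeddings: one distinct one-hot vector per grid cell in the
# first frame; in the next frame every cell's content sits one flat index later.
n = 9
feats_first = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
feats_next = feats_first[-1:] + feats_first[:-1]

flat_cam = [v for row in cam for v in row]
moved = propagate_mask(flat_cam, feats_first, feats_next)   # forward in time
returned = propagate_mask(moved, feats_next, feats_first)   # back again
cycle_error = sum(abs(a - b) for a, b in zip(flat_cam, returned)) / n

moved_grid = [moved[r * 3:(r + 1) * 3] for r in range(3)]
box_first = bounding_box(cam, threshold=0.4)
box_next = bounding_box(moved_grid, threshold=0.4)
print(box_first, box_next, cycle_error)
```

In a real system the feature maps and class weights would come from the trained classifier and the embeddings from the learned φ; a small round-trip error is what cycle consistency rewards, and the boxes would be compared against the few ground-truth boxes reserved for evaluation.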

Results and Deliverables

To be released

Timeline


Sep, 2019

● Milestone 1: Detailed project plan submission

● Project webpage goes live

● Literature review


Oct, 2019

● Literature review

● Dataset collection


Nov, 2019

● Test and improve weakly supervised localization on the assembly line dataset, namely by testing the CAM model.


Dec, 2019

● Test and apply techniques to enforce temporal smoothness of the localization results.

● Milestone 2: Provide a system using weakly supervised learning for localization.

● Submit interim report.


Feb, 2020

● Collect more data and extend our evaluation.


Mar, 2020

● Evaluate the system

● Milestone 3: Submit final report and poster design

● Prepare presentation.


Apr, 2020

● Final exhibition


Our Team

Send a message

If you are interested in our topic, feel free to reach out to us.