Methodology

Datasets Used
Training the various models in this project requires a number of datasets. Currently, I have selected the following:
- NYU Hand Pose dataset [4]: This dataset contains 72,757 training-set and 8,252 test-set frames of captured RGBD data with ground-truth hand-pose annotations. The images cover a wide variety of hand poses, and for each pose, Kinect data is captured from three different angles. This dataset is presently popular among hand pose researchers owing to the high variability of the poses it contains.
- ICVL Hand Pose dataset [5]: This dataset annotates 16 joint locations with (x, y, z) coordinates for each image. The x and y coordinates are measured in pixels, while the z coordinate is measured in mm.
- MSRA Hand Pose dataset [6]: This dataset contains images of the right hands of 9 subjects, captured using Intel's Creative Interactive Gesture Camera. Each subject performs 17 gestures, with about 500 frames per gesture.
- Multiview Hand Pose Dataset [7]: This dataset captures hand poses from multiple angles. It provides not only the 3D hand joint locations for each image but also bounding boxes for the hands.
- EgoGesture [8][9]: This dataset contains 2,081 RGB-D videos, 24,161 gesture samples and 2,953,224 frames from 50 distinct subjects. The authors define 83 classes of static or dynamic gestures focused on interaction with wearable devices.
- NVIDIA Dynamic Hand Gesture Dataset [10]: nvGesture is a dataset of 25 gesture classes intended for human-computer interfaces. It comprises 1,532 weakly segmented videos performed by 20 subjects in an indoor car simulator. The dataset is split into training (70%) and test (30%) sets, yielding 1,050 training and 482 test videos [11].
Currently, I have been focused on the hand pose estimation problem, and hence the majority of the datasets listed pertain to hand pose recognition. However, as the project progresses, other datasets focusing on gesture recognition will be added to the list.
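To illustrate how annotations of the ICVL style relate to 3D space, the per-joint (x, y) pixel coordinates and z depth in mm can be back-projected to 3D camera coordinates with the standard pinhole model. A minimal sketch follows; the focal lengths and principal point below are placeholder values for illustration, not taken from any of the datasets above:

```python
import numpy as np

# Hypothetical camera intrinsics (assumed values, NOT from the datasets).
FX, FY = 588.0, 587.0   # focal lengths in pixels
CX, CY = 320.0, 240.0   # principal point in pixels

def pixel_to_camera(u, v, z_mm):
    """Back-project an annotated joint (u, v in pixels, z in mm)
    to 3D camera-space coordinates in mm via the pinhole model."""
    x = (u - CX) * z_mm / FX
    y = (v - CY) * z_mm / FY
    return np.array([x, y, z_mm])

# A joint at the principal point maps to the optical axis.
print(pixel_to_camera(320.0, 240.0, 500.0))  # [  0.   0. 500.]
```

This conversion is only needed when a model regresses joints in metric camera space while the labels are stored in image space.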
Development Details: Division of Work into Two Phases
The project has been split into two phases: the first half focuses on hand pose estimation and the second half on hand gesture recognition. Currently, the project is in Phase 1. The details of the two phases are as follows:
Phase 1: Hand Pose Estimation Problem For this phase, I am attempting to create a lightweight hand pose estimation model. To this end, I spent the majority of the summer internship reading research papers on the problem. After gaining a suitable understanding of the state of the art, I arrived at my own insight into the problem: the movement of the hand joints is primarily hierarchical, and this fact can be exploited to refine the estimates made by the model. I then designed my first model and have been refining its design ever since. The implementation details of these models are as follows:
Implementation Details The salient feature common to all the designs is a branching structure in which each branch specializes in regressing the positions of a subset of the complete joint set. The architecture consists of three branches. The first branch regresses the position of the palm joint as well as the finger bases. The second branch uses the positions predicted by the first branch, in addition to its own feature extraction layer, to detect the center joint of each finger. Note that this branch also acts as a refining layer for the joints detected in the first stage: the corrections predicted by the second branch are added to the set of joints predicted by the first. Lastly, the third branch computes the positions of the fingertips using the output of the second branch in addition to its own feature extraction layer. Like the second branch, the third branch also refines the joints predicted by the two preceding branches, so its output is the complete set of joints in the hand.

Currently, I am on the sixth iteration of the design. In this iteration, the initial feature extraction stage is shared across the three sub-networks. This design significantly reduces the number of parameters that need to be trained and, as a consequence, the execution time of the model. Beyond the sixth iteration, I also have design ideas for future iterations. One of the main ideas is to embed a generic hand model into the architecture itself. This change would restrict the output latent space from purely 3D to 2.5D, and the reduced dimensionality of the output space should help with accuracy.
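The branch-and-refine data flow described above can be sketched as follows. This is a minimal NumPy illustration with untrained placeholder weights, not the actual implementation; the 16-joint layout (1 palm + 5 finger bases + 5 finger centers + 5 fingertips) and the feature size are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # size of the shared feature vector (assumed)

def linear(in_dim, out_dim):
    """Untrained placeholder weights standing in for a learned layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.01

# Stand-in for the shared feature extractor's output on one depth frame.
features = rng.standard_normal(D)

# Branch 1: palm + 5 finger bases -> 6 joints x 3 coordinates.
W1 = linear(D, 6 * 3)
stage1 = features @ W1                                 # (18,)

# Branch 2: consumes the features AND branch 1's joints; predicts the
# 5 finger-center joints plus a residual correction for branch 1.
W2 = linear(D + 18, 18 + 5 * 3)
out2 = np.concatenate([features, stage1]) @ W2
stage1_refined = stage1 + out2[:18]                    # additive refinement
stage2 = out2[18:]                                     # (15,)

# Branch 3: consumes the features AND all joints so far; predicts the
# 5 fingertips plus residual corrections for everything predicted earlier.
so_far = np.concatenate([stage1_refined, stage2])      # (33,)
W3 = linear(D + 33, 33 + 5 * 3)
out3 = np.concatenate([features, so_far]) @ W3
refined = so_far + out3[:33]
tips = out3[33:]

# Output of branch 3: the complete joint set of the hand.
hand = np.concatenate([refined, tips]).reshape(16, 3)
print(hand.shape)  # (16, 3)
```

The residual additions are what make the later branches double as refining layers: each stage only has to learn a correction on top of the coarser estimate it receives.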
Phase 2: Hand Gesture Recognition Problem After completing the hand pose estimation problem satisfactorily, I plan to build upon that model for hand gesture recognition. I am currently surveying state-of-the-art research on the subject and the different architectures employed for the task. My current plan is to use an RNN-based model that would use my Phase 1 model as a data preprocessor. This phase will be elaborated once Phase 1 is complete and I can start working on the corresponding problem.
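As a rough illustration of the Phase 2 plan, a vanilla RNN could consume the per-frame joint positions produced by the Phase 1 estimator and emit gesture-class scores. The sketch below uses untrained placeholder weights; the sizes (16 joints per frame, hidden size 32, and 25 classes, matching nvGesture's class count) are assumptions, not a finalized design:

```python
import numpy as np

rng = np.random.default_rng(1)

J, H, G = 16, 32, 25   # joints per frame, hidden size, gesture classes (assumed)

# Untrained placeholder weights for a vanilla RNN cell plus a classifier head.
Wxh = rng.standard_normal((J * 3, H)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((H, G)) * 0.01

def classify_sequence(poses):
    """poses: (T, J, 3) array of per-frame joint positions, i.e. the
    output of the Phase 1 estimator acting as a data preprocessor.
    Returns one score per gesture class, shape (G,)."""
    h = np.zeros(H)
    for frame in poses:
        # Recurrent update over the flattened joint coordinates.
        h = np.tanh(frame.reshape(-1) @ Wxh + h @ Whh)
    return h @ Why  # classify from the final hidden state

scores = classify_sequence(rng.standard_normal((30, J, 3)))
print(scores.shape)  # (25,)
```

In practice, a gated recurrent architecture and a learned pooling over time steps would likely replace the plain cell and final-state readout, but the overall pipeline (pose sequence in, gesture scores out) would remain the same.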
As mentioned earlier, the project is currently in Phase 1, which I expect to complete by mid-October. After that, I will first elaborate the details of the second phase and then move on to developing the hand gesture recognition system.
Challenges
I realize that there are a number of challenges for this project, chief among them:
- Finding a complete hand gesture recognition dataset: Even though a number of hand gesture datasets exist, each has its own shortcomings. Some have a very small number of unique gestures, while others suffer from poor lighting conditions. Another major shortcoming of most datasets is the low resolution of the videos, a consequence of file-size restrictions.
- Making a satisfactory hand pose estimator: Since my goal is to keep the hand pose estimator as lightweight as possible, I need to find a balance between accuracy and execution speed.