Scene classification

 

Project Background

Computer vision, a subdomain of artificial intelligence, has been widely applied in fields such as automated image organization, visual search, image and face recognition, and interactive marketing [1]. As hardware capacity grows (for example, GPU computing power has been increasing exponentially in recent years [2]) and software resources become more convenient and user-friendly (Google's TensorFlow, for instance, integrates the simpler Keras API for more intuitive use [3]), machine learning is considered a feasible and effective approach to many specific tasks. It performs well, at decreasing cost, in labor-intensive tasks such as image classification, object detection and autonomous driving.

One of the most popular computer vision tasks is image label prediction, which takes an image as input and outputs a corresponding description of its content. A great deal of valuable information is hidden in visual content, and extracting it usually requires manual effort. A computer vision solution automates this slow labelling work and can be integrated with image tagging and search services; its output can also serve as an intermediate product to be fed into another model in a larger pipeline.

Similarly, this computer vision project, conducted in collaboration with the University of Hong Kong Library (HKUL), involves developing a tool that uses AI techniques to automatically understand visual scenes, thereby offering a scalable way to capture visual relationships without additional human labor. The tool will be integrated with the HKU library image database, a digital archive of historical Hong Kong images featuring themes such as people, landscapes and architecture from the 1840s to the 1990s, open for free public consultation, in order to refine the search functionality and user experience. The database keeps growing as new images are added to document Hong Kong's evolving development, so human labor is currently needed to label features such as the location, facilities and keywords of the visual content for tagging and searching, which carries a considerable opportunity cost.

Therefore, this project aims to (1) construct a mapping from an input image to a set of semantic relationships (a scene graph) between the detected objects; and (2) process and learn from existing knowledge to develop a reliable scene graph prediction model, so that incoming images can be labelled automatically for a more accurate and user-friendly search service; the model's analysis of the correlation between visual and textual content may also be helpful for research purposes.

The task is challenging but worth exploring. Unlike an academic scenario, an industrial machine learning project may not have perfectly fitted data for effective training, and it may be divided into various sub-tasks depending on the nature of the work. Many decisions are discretionary, as the choice of processing protocols, models and configurations, objectives and data collected may have non-linear effects on the final result; in other words, trial and error is necessary to examine and adjust towards the best overall solution. Moreover, communication with the industry partner is important to establish a robust and responsive channel for exchanging information, as is delivering something usable in real application. In this case, the HKUL IT team is in the process of assigning controlled vocabularies to the images for later tasks, so bilateral interaction is required to keep the task status up to date.

Objective

The ultimate goal of the project is to develop an extensible and flexible tool that automates the labelling process for the growing collection of images in HKUL's Hong Kong Image Database, so as to optimize the browsing and searching experience of the service. According to HKUL, the final labels are mostly Library of Congress Subject Headings (LCSH), a set of headings from subject heading authority files commonly used in bibliographic record indexing, which describes the equivalence, hierarchical and associative relationships between and within multidisciplinary subjects [4], plus some local subject headings. However, since the HKUL IT team is still working on assigning descriptions to the existing images, the current focus is to discover the relationships implied between and within the images available.

Currently, 1495 images and their metadata are provided by HKUL for training the prototype model. The metadata includes the time, location, description text and remarks such as the buildings included, the highlight (collection) the image belongs to, and the title of the photo. Since the final product will translate an input image into a set of descriptive texts similar to the metadata given for training, it is advisable to construct a classification network to predict these features, especially the location, time, subject and highlight of the images. Such a network can be divided into several classifiers, each handling a specific task.

Figure 1: highlights of the current images

Figure 1 shows the distribution of image highlights in the given images: people-related highlights such as costume portraits and prominent figures make up only a small proportion of the dataset, while landscapes are the main theme. Since it is futile to detect a building in a portrait, or to recognize a famous person in a street view, a people/landscape classifier could be useful for distinguishing these two classes before further classification. (Figure 2 compares the portrait and landscape images.)

 


Figure 2: images from the prominent figures highlight (left) and the Hong Kong Tourist Association highlight (right)

 

After this initial filtering, it is then important to construct a subject classifier to identify the person or the building appearing in the image.

In addition, a location classifier could be informative: a human worker, if not given the location in the first place, may need ample experience to decide where a picture was taken based on the buildings shown and the era of the photograph.

Thus, in the first stage of the project, at least four types of classifiers will be developed to classify the people, landscape, subject and location of the images, and their accuracy is of the highest concern.

 

Methodology

In the first stage of the project, general image/scene classification methods will be applied to analyse the nature of the images for later study. Convolutional neural networks (CNNs) will be explored, as they are a common image classification method. In the second stage, more sophisticated approaches will be used for better generalization over the data.

Preprocessing

In the basic configuration, the images are resized and normalized to the input scale required by each network; for example, the original VGG16 accepts 224×224 three-channel images as input [5].
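As a rough illustration, the following sketch (assuming TensorFlow/Keras and JPEG files; the file path and target size are placeholders) resizes and normalizes an image to the VGG16 input scale:

    import tensorflow as tf

    def load_and_preprocess(path, target_size=(224, 224)):
        # Read and decode the image file into a 3-channel tensor.
        raw = tf.io.read_file(path)
        img = tf.image.decode_jpeg(raw, channels=3)
        # Resize to the input scale expected by the chosen network.
        img = tf.image.resize(img, target_size)
        # Normalize pixel values from [0, 255] to [0, 1].
        return tf.cast(img, tf.float32) / 255.0

    # Hypothetical usage: x = load_and_preprocess("images/hkul_0001.jpg")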

As additional information may improve the prediction result, techniques such as segmentation and morphological operations may be applied when building an object detection model for buildings.
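A minimal sketch of such morphological preprocessing is shown below (assuming OpenCV; the file path, thresholding scheme and kernel size are illustrative assumptions only):

    import cv2
    import numpy as np

    # Load a (hypothetical) image in grayscale and binarize it with Otsu's threshold.
    img = cv2.imread("images/hkul_0001.jpg", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological opening suppresses small noise before any segmentation step.
    kernel = np.ones((5, 5), np.uint8)
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)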

The label of each image is extracted from its metadata. For the subject label, the subjects field is used first (if the image depicts a person, the people field is used instead), and the keywords field is used if the preferred field is missing. For images with both fields missing, a label will be created manually based on the title and the visual content of the image. For the location label, the eighteen-districts field is used directly, and images without this field are discarded from training.
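A minimal sketch of this label-selection rule follows; the metadata field names ("subjects", "people", "keywords", "eighteen_districts") are assumptions about how the HKUL metadata might be keyed once loaded into Python dictionaries.

    def subject_label(record, depicts_person):
        # Prefer the people field for portraits, otherwise the subjects field,
        # and fall back to the keywords field if the preferred field is missing.
        if depicts_person and record.get("people"):
            return record["people"]
        if record.get("subjects"):
            return record["subjects"]
        if record.get("keywords"):
            return record["keywords"]
        return None  # to be labelled manually from the title and visual content

    def location_label(record):
        # Images without the eighteen-districts field are discarded from training.
        return record.get("eighteen_districts")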

Model selection

Since multiple small models will be used to distinguish people/object, face, subject and location in the first stage, different CNN architectures will be tested and applied to suit each purpose. Currently, ResNet, VGG, Inception and DenseNet are the candidate architectures for the first stage, as they are available as predefined applications in the Keras module [6]. For the famous-person recognition task, pre-trained face recognition models and their APIs, such as VGGFace2 [7], could be explored to reduce training time through transfer learning, while for the scene classification task, VGG16-Places365 [8] may be used as the starting point of training.
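A minimal transfer-learning sketch along these lines is shown below (assuming the Keras applications API; the choice of ResNet50, the frozen backbone and the number of classes are illustrative assumptions, not final decisions):

    import tensorflow as tf
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import ResNet50

    NUM_CLASSES = 18  # placeholder, e.g. one class per district

    # Load the ImageNet-pretrained backbone without its classification head.
    base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pre-trained convolutional layers

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])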

Training

In the initial stage, all selected models will be trained for 10 epochs on the HKU GPU farm using Python and TensorFlow to identify the potentially best model. The Adam optimizer with default settings will be used during training. As familiarity with the topic grows, tools like fast.ai [9] could be applied to tune the performance.
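A minimal training sketch matching this setup (Adam with default settings, 10 epochs) continues from the model sketch above; train_ds and val_ds are assumed to be tf.data datasets of preprocessed (image, label) pairs:

    import tensorflow as tf

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),  # default learning rate of 0.001
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    history = model.fit(train_ds, validation_data=val_ds, epochs=10)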

Metrics

The performance will be evaluated using accuracy and the confusion matrix. If two results have similar accuracy, precision and the F1-score will be compared to decide the better scheme.
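As a sketch of how these metrics could be computed (assuming scikit-learn; y_true and y_pred are integer class labels collected from the validation set):

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, f1_score)

    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    # Macro averaging treats every class equally, which suits imbalanced highlights.
    prec = precision_score(y_true, y_pred, average="macro")
    f1 = f1_score(y_true, y_pred, average="macro")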

 

Schedule and Milestones

The proposed schedule is as follows:

Date          Schedule
2019/09/30    Literature review and first deliverable
2019/10/31    Complete first stage of model training; update labelling data
2019/11/30    Refine basic model and start second stage of model training
2019/12/31    Finish second stage of model training
2020/01/17    First presentation
2020/02/02    Finish first implementation and interim report
2020/03/31    Finalize the model used and tuning
2020/04/19    Finish final report and implementation
2020/04/21    Final presentation

 

References

  1. Imagga (2019). The Top 5 Uses of Image Recognition. [online] Imagga Blog. Available at: https://imagga.com/blog/the-top-5-uses-of-image-recognition/ [Accessed 29 Sep. 2019].
  2. Tukora, B. and Szalay, T. (2008). High performance computing on graphics processing units. Pollack Periodica, 3, pp. 27-34. doi:10.1556/Pollack.3.2008.2.3.
  3. Zürn, J. (2019). What's new in TensorFlow 2.0?. [online] Medium. Available at: https://towardsdatascience.com/whats-new-in-tensorflow-2-0-ce75cdd1a4d1 [Accessed 29 Sep. 2019].
  4. Library of Congress (2018). LCSH Introduction. [online] Available at: https://www.loc.gov/aba/publications/FreeLCSH/LCSH40%20Main%20intro.pdf [Accessed 29 Sep. 2019].
  5. Neurohive (2018). VGG16 - Convolutional Network for Classification and Detection. [online] Available at: https://neurohive.io/en/popular-networks/vgg16/ [Accessed 29 Sep. 2019].
  6. Keras (2019). Applications - Keras Documentation. [online] Available at: https://keras.io/applications [Accessed 29 Sep. 2019].
  7. Cao, Q. et al. (2018). VGGFace2: A dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE.
  8. Kalliatakis, G. (2018). Keras-VGG16-places365. [online] GitHub. Available at: https://github.com/GKalliatakis/Keras-VGG16-places365 [Accessed 29 Sep. 2019].
  9. Fast.ai (2019). fast.ai · Making neural nets uncool again. [online] Available at: https://www.fast.ai/ [Accessed 29 Sep. 2019].