RefReasoning: Large Real-World Dataset Referring Expressions Visual Grounding (CVPR 2020)

Computer Vision and Pattern Recognition


Graph-Structured Referring Expression Reasoning in The Wild

Sibei Yang, Guanbin Li, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, [BibTex]

Code Release and RefReasoning Dataset Download (code and dataset)

Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression. In this paper, we propose a scene graph guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. In particular, we model the image as a structured semantic graph, and parse the expression into a language scene graph. The language scene graph not only decodes the linguistic structure of the expression, but also has a consistent representation with the image semantic graph. In addition to exploring structured solutions to grounding referring expressions, we also propose RefReasoning, a large-scale real-world dataset for structured referring expression reasoning. We automatically generate referring expressions over the scene graphs of images using diverse expression templates and functional programs. This dataset is equipped with real-world visual contents as well as semantically rich expressions with different reasoning layouts. Experimental results show that our SGMN not only significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also surpasses state-of-the-art structured methods on commonly used benchmark datasets. It can also provide interpretable visual evidences of reasoning.


Propagating Over Phrase Relations for One-Stage Visual Grounding

Sibei Yang, Guanbin Li, and Yizhou Yu
European Conference on Computer Vision (ECCV), 2020

Phrase level visual grounding aims to locate in an image the corresponding visual regions referred to by multiple noun phrases in a given sentence. Its challenge comes not only from large variations in visual contents and unrestricted phrase descriptions but also from unambiguous referrals derived from phrase relational reasoning. In this paper, we propose a linguistic structure guided propagation network for one-stage phrase grounding. It explicitly explores the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them. Specifically, we first construct a linguistic graph parsed from the sentence and then capture multimodal feature maps for all the phrasal nodes independently. The node features are then propagated over the edges with a tailordesigned relational propagation module and ultimately integrated for final prediction. Experiments on Flickr30K Entities dataset show that our model outperforms state-of-the-art methods and demonstrate the effectiveness of propagating among phrases with linguistic relations.


• Cross-view Correspondence Reasoning based on Bipartite Graph Convolutional Network for Mammogram Mass Detection

Y Liu, F Zhang, Q Zhang, S Wang, Y Wang, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

Mammogram mass detection is of great clinical significance due to the high proportion of breast cancers. The information from cross views (i.e., mediolateral-oblique and cranio-caudal) is highly related and complementary, and is helpful to make comprehensive decisions. However, unlike radiologists who can recognize masses with reasoning ability in cross-view images, most existing methods lack the ability to reason under the guidance of domain knowledge, thus it limits the performance. In this paper, we introduce the bipartite graph convolutional network to endow existing methods with cross-view reasoning ability of radiologists in mammogram mass detection. The bipartite node sets are constructed by cross-view images respectively to represent relatively consistent regions in breasts, while the bipartite edge learns to model both inherent crossview geometric constraints and appearance similarities between correspondences. Based on the bipartite graph, the information propagates methodically through correspondences and enables spatial visual features equipped with customized cross-view reasoning ability. Experimental results on DDSM dataset demonstrate the proposed algorithm achieves state-of-the-art performance. Besides, visual analysis shows the model has a clear physical meaning, which is helpful to radiologists in clinical interpretation.


Motion Guided Attention for Video Salient Object Detection

Haofeng Li, Guanqi Chen, Guanbin Li, and Yizhou Yu
International Conference on Computer Vision (ICCV), 2019, [BibTex]

Code Release (code and models)

Video salient object detection aims at discovering the most visually distinctive objects in a video. How to effectively take object motion into consideration during video salient object detection is a critical issue. Existing stateof-the-art methods either do not explicitly model and harvest motion cues or ignore spatial contexts within optical flow images. In this paper, we develop a multi-task motion guided video salient object detection network, which learns to accomplish two sub-tasks using two sub-networks, one sub-network for salient object detection in still images and the other for motion saliency detection in optical flow images. We further introduce a series of novel motion guided attention modules, which utilize the motion saliency subnetwork to attend and enhance the sub-network for still images. These two sub-networks learn to adapt to each other by end-to-end training. Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on a wide range of benchmarks. We hope our simple and effective approach will serve as a solid baseline and help ease future research in video salient object detection. Code and models are publicly available.


Dynamic Graph Attention for Referring Expression Comprehension

Sibei Yang, Guanbin Li, and Yizhou Yu
International Conference on Computer Vision (ICCV), 2019, [BibTex]

Referring expression comprehension aims to locate the object instance described by a natural language referring expression in an image. This task is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression. However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects without being aligned with the potential complexity of the expression. Thus it is hard for them to adapt to the grounding of complex referring expressions. In this paper, we explore the problem of referring expression comprehension from the perspective of language-driven visual reasoning, and propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression. In particular, we construct a graph for the image with the nodes and edges corresponding to the objects and their relationships respectively, propose a differential analyzer to predict a language-guided visual reasoning process, and perform stepwise reasoning on top of the graph to update the compound object representation at every node. Experimental results demonstrate that the proposed method can not only significantly surpass all existing state-of-the-art algorithms across three common benchmark datasets, but also generate interpretable visual evidences for stepwisely locating the objects referred to in complex language descriptions.


Cross-Modal Relationship Inference for Grounding Referring Expressions

Sibei Yang, Guanbin Li, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, [BibTex]

Grounding referring expressions is a fundamental yet challenging task facilitating human-machine communication in the physical world. It locates the target object in an image on the basis of the comprehension of the relationships between referring natural language expressions and the image. A feasible solution for grounding referring expressions not only needs to extract all the necessary information (i.e. objects and the relationships among them) in both the image and referring expressions, but also compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot extract multi-order relationships from the referring expressions accurately and the contexts they obtain have discrepancies with the contexts described by referring expressions. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) to adaptively highlight objects and relationships, that have connections with a given expression, with a cross-modal attention mechanism, and represent the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information in the structured relation graph. Experiments on various common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods.


Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification from the Bottom Up

Weifeng Ge*, Xiangru Lin*, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, [BibTex]

Given a training dataset composed of images and corresponding category labels, deep convolutional neural networks show a strong ability in mining discriminative parts for image classification. However, deep convolutional neural networks trained with image level labels only tend to focus on the most discriminative parts while missing other object parts, which could provide complementary information. In this paper, we approach this problem from a different perspective. We build complementary parts models in a weakly supervised manner to retrieve information suppressed by dominant object parts detected by convolutional neural networks. Given image level labels only, we first extract rough object instances by performing weakly supervised object detection and instance segmentation using Mask R-CNN and CRF-based segmentation. Then we estimate and search for the best parts model for each object instance under the principle of preserving as much diversity as possible. In the last stage, we build a bi-directional long short-term memory (LSTM) network to fuze and encode the partial information of these complementary parts into a comprehensive feature for image classification. Experimental results indicate that the proposed method not only achieves significant improvement over our baseline models, but also outperforms stateof-the-art algorithms by a large margin (6.7%, 2.8%, 5.2% respectively) on Stanford Dogs 120, Caltech-UCSD Birds 2011-200 and Caltech 256.


• Multi-Source Weak Supervision for Saliency Detection

Yu Zeng, Huchuan Lu, Lihe Zhang, Yunzhi Zhuge, Mingyang Qian, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To this end, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which learn to predict object categories and generate captions, respectively, meanwhile highlight the most important regions for corresponding tasks. An attention transfer loss is designed to transmit supervision signal between networks, such that the network designed to be trained with one supervision source can benefit from another. An attention coherence loss is defined on unlabelled data to encourage the networks to detect generally salient regions instead of task-specific regions. We use CNet and PNet to generate pixel-level pseudo labels to train a saliency prediction network (SNet). During the testing phases, we only need SNet to predict saliency maps. Experiments demonstrate the performance of our method compares favourably against unsupervised and weakly supervised methods and even some supervised methods.


Facial Landmark Machines: A Backbone-Branches Architecture With Progressive Representation Learning

Lingbo Liu, Guanbin Li, Yuan Xie, Yizhou Yu, Qing Wang, and Liang Lin
IEEE Transactions on Multimedia (TMM), Vol 21, No 9, 2019

Facial landmark localization plays a critical role in face recognition and analysis. In this paper, we propose a novel cascaded backbone-branches fully convolutional neural network (BB-FCN) for rapidly and accurately localizing facial landmarks in unconstrained and cluttered settings. Our proposed BB-FCN generates facial landmark response maps directly from raw images without any preprocessing. BB-FCN follows a coarse-to-fine cascaded pipeline, which consists of a backbone network for roughly detecting the locations of all facial landmarks and one branch network for each type of detected landmark for further refining their locations. Furthermore, to facilitate the facial landmark localization under unconstrained settings, we propose a large-scale benchmark named SYSU16K, which contains 16000 faces with large variations in pose, expression, illumination and resolution. Extensive experimental evaluations demonstrate that our proposed BB-FCN can significantly outperform the state-of-the-art under both constrained (i.e., within detected facial regions only) and unconstrained settings. We further confirm that high-quality facial landmarks localized with our proposed network can also improve the precision and recall of face detection.



Piecewise Flat Embedding for Image Segmentation

Chaowei Fang, Zicheng Liao, and Yizhou Yu
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol 41, No 6, 2019, [BibTex]

We introduce a new multi-dimensional nonlinear embedding -- Piecewise Flat Embedding (PFE) -- for image segmentation. Based on the theory of sparse signal recovery, piecewise flat embedding with diverse channels attempts to recover a piecewise constant image representation with sparse region boundaries and sparse cluster value scattering. The resultant piecewise flat embedding exhibits interesting properties such as suppressing slowly varying signals, and offers an image representation with higher region identifiability which is desirable for image segmentation or high-level semantic analysis tasks. We formulate our embedding as a variant of the Laplacian Eigenmap embedding with an L1,p (p between 0 and 1) regularization term to promote sparse solutions. First, we devise a two-stage numerical algorithm based on Bregman iterations to compute L1,1-regularized piecewise flat embeddings. We further generalize this algorithm through iterative reweighting to solve the general L1,p-regularized problem. To demonstrate its efficacy, we integrate PFE into two existing image segmentation frameworks, segmentation based on clustering and hierarchical segmentation based on contour detection. Experiments on four major benchmark datasets, BSDS500, MSRC, Stanford Background Dataset, and PASCAL Context, show that segmentation algorithms incorporating our embedding achieve significantly improved results.



Context-Aware Semantic Inpainting

Haofeng Li, Guanbin Li, Liang Lin, Hongchuan Yu, and Yizhou Yu
IEEE Transactions on Cybernetics (TCyb), 2019

Recently image inpainting has witnessed rapid progress due to generative adversarial networks (GAN) that are able to synthesize realistic contents. However, most existing GAN-based methods for semantic inpainting apply an auto-encoder architecture with a fully connected layer, which cannot accurately maintain spatial information. In addition, the discriminator in existing GANs struggle to understand high-level semantics within the image context and yield semantically consistent content. Existing evaluation criteria are biased towards blurry results and cannot well characterize edge preservation and visual authenticity in the inpainting results. In this paper, we propose an improved generative adversarial network to overcome the aforementioned limitations. Our proposed GAN-based framework consists of a fully convolutional design for the generator which helps to better preserve spatial structures and a joint loss function with a revised perceptual loss to capture high-level semantics in the context. Furthermore, we also introduce two novel measures to better assess the quality of image inpainting results. Experimental results demonstrate that our method outperforms the state of the art under a wide range of criteria.



Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning

Weifeng Ge, Sibei Yang, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, [BibTex] (PDF, Suppl)

Supervised object detection and semantic segmentation require object or even pixel level annotations. When there exist image level labels only, it is challenging for weakly supervised algorithms to achieve accurate predictions. The accuracy achieved by top weakly supervised algorithms is still significantly lower than their fully supervised counterparts. In this paper, we propose a novel weakly supervised curriculum learning pipeline for multi-label object recognition, detection and semantic segmentation. In this pipeline, we first obtain intermediate object localization and pixel labeling results for the training images, and then use such results to train task-specific deep networks in a fully supervised manner. The entire process consists of four stages, including object localization in the training images, filtering and fusing object instances, pixel labeling for the training images, and task-specific network training. To obtain clean object instances in the training images, we propose a novel algorithm for filtering, fusing and classifying object instances collected from multiple solution mechanisms. In this algorithm, we incorporate both metric learning and density-based clustering to filter detected object instances. Experiments show that our weakly supervised pipeline achieves state-of-the-art results in multi-label image classification as well as weakly supervised object detection and very competitive results in weakly supervised semantic segmentation on MS-COCO, PASCAL VOC 2007 and PASCAL VOC 2012.


Borrowing Treasures from the Wealthy: Deep Transfer Learning through Selective Joint Fine-tuning

Weifeng Ge and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, (Spotlight Paper) [BibTex] (PDF)

Code and Models are available at https://github.com/ZYYSzj/Selective-Joint-Fine-tuning

Deep neural networks require a large amount of labeled training data during supervised learning. However, collecting and labeling so much data might be infeasible in many cases. In this paper, we introduce a deep transfer learning scheme, called selective joint fine-tuning, for improving the performance of deep learning tasks with insufficient training data. In this scheme, a target learning task with insufficient training data is carried out simultaneously with another source learning task with abundant training data. However, the source learning task does not use all existing training data. Our core idea is to identify and use a subset of training images from the original source learning task whose low-level characteristics are similar to those from the target learning task, and jointly fine-tune shared convolutional layers for both tasks. Specifically, we compute descriptors from linear or nonlinear filter bank responses on training images from both tasks, and use such descriptors to search for a desired subset of training samples for the source learning task.

Experiments demonstrate that our deep transfer learning scheme achieves state-of-the-art performance on multiple visual classification tasks with insufficient training data for deep learning. Such tasks include Caltech 256, MIT Indoor 67, and fine-grained classification problems (Oxford Flowers 102 and Stanford Dogs 120). In comparison to fine-tuning without a source domain, the proposed method can improve the classification accuracy by 2%-10% using a single model.


Visual Saliency Based on Multiscale Deep Features

Guanbin Li and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, [BibTex], (PDF, Suppl)

Code Release (testing code and trained model), Dataset Download (Baidu Cloud, Google Drive), Project Webpage

Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be trained with multiscale features extracted using a popular deep learning framework, convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for extracting features at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotation. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.



Deep Contrast Learning for Salient Object Detection

Guanbin Li and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, [BibTex], (PDF )

Code Release (testing code and trained model), Dataset Download (saliency maps of 6 benchmark datasets), Project Webpage

Salient object detection has recently witnessed substantial progress due to powerful features extracted using deep convolutional neural networks (CNNs). However, existing CNN-based methods operate at the patch level instead of the pixel level. Resulting saliency maps are typically blurry, especially near the boundary of salient objects. Furthermore, image patches are treated as independent samples even when they are overlapping, giving rise to significant redundancy in computation and storage. In this paper, we propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The first stream directly produces a saliency map with pixel-level accuracy from an input image. The second stream extracts segment-wise features very efficiently, and better models saliency discontinuities along object boundaries. Finally, a fully connected CRF model can be optionally incorporated to improve spatial coherence and contour localization in the fused result from these two streams. Experimental results demonstrate that our deep model significantly improves the state of the art.



Instance-Level Salient Object Segmentation

Guanbin Li, Yuan Xie, Liang Lin, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, (Spotlight Paper) [BibTex] (PDF, Suppl)

Dataset Download (new dataset for salient instance segmentation)

Image saliency detection has recently witnessed rapid progress due to deep convolutional neural networks. However, none of the existing methods is able to identify object instances in the detected salient regions. In this paper, we present a salient instance segmentation method that produces a saliency mask with distinct object instance labels for an input image. Our method consists of three steps, estimating saliency map, detecting salient object contours and identifying salient object instances. For the first two steps, we propose a multiscale saliency refinement network, which generates high-quality salient region masks and salient object contours. Once integrated with multiscale combinatorial grouping and a MAP-based subset optimization framework, our method can generate very promising salient object instance segmentation results. To promote further research and evaluation of salient instance segmentation, we also construct a new database of 1000 images and their pixelwise salient instance annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks for salient region detection as well as on our new dataset for salient instance segmentation.



Stroke Controllable Fast Style Transfer with Adaptive Receptive Fields

Yongcheng Jing, Yang Liu, Yezhou Yang, Zunlei Feng, Yizhou Yu, Dacheng Tao, and Mingli Song
European Conference on Computer Vision (ECCV), 2018, PDF.

Fast Style Transfer methods have been recently proposed to transfer a photograph to an artistic style in real time. This task involves controlling the stroke size in the stylized results, which remains an open challenge. In this paper, we present a stroke controllable style transfer network that can achieve continuous and spatial stroke size control. By analyzing the factors that influence the stroke size, we propose to explicitly account for the receptive field and the style image scales. We propose a StrokePyramid module to endow the network with adaptive receptive fields, and two training strategies to achieve faster convergence and augment new stroke sizes upon a trained model respectively. By combining the proposed runtime control strategies, our network can achieve continuous changes in stroke size and produce distinct stroke sizes in different spatial regions within the same output image.



Contrast-Oriented Deep Neural Networks for Salient Object Detection

Guanbin Li and Yizhou Yu
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), Vol 29, No 12, 2018, [BibTex]

Deep convolutional neural networks have become a key element in the recent breakthrough of salient object detection. However, existing CNN-based methods are based on either patch-wis (region-wise) training and inference or fully convolutional networks. Methods in the former category are generally time-consuming due to severe storage and computational redundancies among overlapping patches. To overcome this deficiency, methods in the second category attempt to directly map a raw input image to a predicted dense saliency map in a single network forward pass. Though being very efficient, it is arduous for these methods to detect salient objects of different scales or salient regions with weak semantic information. In this paper, we develop hybrid contrast-oriented deep neural networks to overcome the aforementioned limitations. Each of our deep networks is composed of two complementary components, including a fully convolutional stream for dense prediction and a segment-level spatial pooling stream for sparse saliency inference. We further propose an attentional module that learns weight maps for fusing the two saliency predictions from these two streams. A tailored alternate scheme is designed to train these deep networks by fine-tuning pre-trained baseline models. Finally, a customized fully connected CRF model incorporating a salient contour feature embedding can be optionally applied as a post-processing step to improve spatial coherence and contour positioning in the fused result from these two streams. Extensive experiments on six benchmark datasets demonstrate that our proposed model can significantly outperform the state of the art in terms of all popular evaluation metrics.



High-Resolution Shape Completion Using Deep Neural Networks for Global Structure and Local Geometry Inference

Xiaoguang Han*, Zhen Li*, Haibin Huang, Evangelos Kalogerakis, and Yizhou Yu
International Conference on Computer Vision (ICCV) 2017, (PDF, Suppl) [BibTex]

We propose a data-driven method for recovering missing parts of 3D shapes. Our method is based on a new deep learning architecture consisting of two sub-networks: a global structure inference network and a local geometry refinement network. The global structure inference network incorporates a long short-term memorized context fusion module (LSTM-CF) that infers the global structure of the shape based on multi-view depth information provided as part of the input. It also includes a 3D fully convolutional (3DFCN) module that further enriches the global structure representation according to volumetric information in the input. Under the guidance of the global structure network, the local geometry refinement network takes as input local 3D patches around missing regions, and progressively produces a high-resolution, complete surface through a volumetric encoder-decoder architecture. Our method jointly trains the global structure inference and local geometry refinement networks in an end-to-end manner. We perform qualitative and quantitative evaluations on six object categories, demonstrating that our method outperforms existing state-of-the-art work on shape completion.



LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling

Zhen Li, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, and Liang Lin
European Conference on Computer Vision (ECCV), 2016, PDF.

Code Release (testing code and trained model)

Semantic labeling of RGB-D scenes is crucial to many intelligent applications including perceptual robotics. It generates pixelwise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) Model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specically, contexts in photometric and depth channels are, respectively, captured by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the contexts along the vertical direction from different channels, and perform bi-directional propagation of the fused vertical contexts along the horizontal direction to obtain true 2D global contexts. At last, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUNRGBD dataset and the NYUDv2 dataset, respectively.



Visual Saliency Detection Based on Multiscale Deep CNN Features

Guanbin Li and Yizhou Yu
IEEE Transactions on Image Processing (TIP), Vol 25, No 11, 2016, PDF

Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. The penultimate layer of our neural network has been confirmed to be a discriminative high-level feature vector for saliency detection, which we call deep contrast feature. To generate a more robust feature, we integrate handcrafted low-level features with our deep contrast feature. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-measure by 6.12% and 10.0% respectively on the DUT-OMRON dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 9% and 35.3% respectively on these two datasets.



HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition

Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu
IEEE International Conference on Computer Vision (ICCV), 2015, [BibTex], (PDF, Suppl)

In image classification, visual separability between different object categories is highly uneven, and some categories are more difficult to distinguish than others. Such difficult categories demand more dedicated classifiers. However, existing deep convolutional neural networks (CNN) are trained as flat N-way classifiers, and few efforts have been made to leverage the hierarchical structure of categories. In this paper, we introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a two-level category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. During HDCNN training, component-wise pretraining is followed by global fine-tuning with a multinomial logistic loss regularized by a coarse category consistency term. In addition, conditional executions of fine category classifiers and layer parameter compression make HD-CNNs scalable for largescale visual recognition. We achieve state-of-the-art results on both CIFAR100 and large-scale ImageNet 1000-class benchmark datasets. In our experiments, we build up three different two-level HD-CNNs, and they lower the top-1 error of the standard CNNs by 2.65%, 3.1%, and 1.1%.



Piecewise Flat Embedding for Image Segmentation

Yizhou Yu, Chaowei Fang, and Zicheng Liao
IEEE International Conference on Computer Vision (ICCV), 2015, [BibTex], (PDF, Suppl, Slides, VideoLecture )

Image segmentation is a critical step in many computer vision tasks, including high-level visual recognition and scene understanding as well as low-level photo and video processing. In this paper, we propose a new nonlinear embedding, called piecewise flat embedding, for image segmentation. Based on the theory of sparse signal recovery, piecewise flat embedding attempts to identify segment boundaries while significantly suppressing variations within segments. We adopt an L1-regularized energy term in the formulation to promote sparse solutions. We further devise an effective two-stage numerical algorithm based on Bregman iterations to solve the proposed embedding. Piecewise flat embedding can be easily integrated into existing image segmentation frameworks, including segmentation based on spectral clustering and hierarchical segmentation based on contour detection. Experiments on BSDS500 indicate that segmentation algorithms incorporating this embedding can achieve significantly improved results in both frameworks.

Related Paper: S. Bi, X. Han, and Y. Yu. An L1 Image Transform for Edge-Preserving Smoothing and Scene-Level Intrinsic Decomposition. SIGGRAPH 2015.



Harvesting Discriminative Meta Objects with Deep CNN Features for Scene Classification

Ruobing Wu, Baoyuan Wang, Wenping Wang, and Yizhou Yu
IEEE International Conference on Computer Vision (ICCV), 2015, (PDF)

Recent work on scene classification still makes use of generic CNN features in a rudimentary manner. In this paper, we present a novel pipeline built upon deep CNN features to harvest discriminative visual objects and parts for scene classification. We first use a region proposal technique to generate a set of high-quality patches potentially containing objects, and apply a pre-trained CNN to extract generic deep features from these patches. Then we perform both unsupervised and weakly supervised learning to screen these patches and discover discriminative ones representing category-specific objects and parts. We further apply discriminative clustering enhanced with local CNN finetuning to aggregate similar objects and parts into groups, called meta objects. A scene image representation is constructed by pooling the feature response maps of all the learned meta objects at multiple spatial scales. We have confirmed that the scene image representation obtained using this new pipeline is capable of delivering state-of-the-art performance on two popular scene benchmark datasets, MIT Indoor 67 and Sun397.



Action-Gons: Action Recognition with A Discriminative Dictionary of Structured Elements with Varying Granularity

Yuwang Wang, Baoyuan Wang, Yizhou Yu, Qionghai Dai, and Zhuowen Tu
Asian Conference on Computer Vision (ACCV), 2014, PDF.

This paper presents "Action-Gons", a middle level representation for action recognition in videos. Actions in videos exhibit a reasonable level of regularity seen in human behavior, as well as a large degree of variation. One key property of action, compared with image scene, might be the amount of interaction among body parts, although scenes also observe structured patterns in 2D images. Here, we study high-order statistics of the interaction among regions of interest in actions and propose a mid-level representation for action recognition, inspired by the Julesz school of n-gon statistics. We propose a systematic learning process to build an over-complete dictionary of "Action-Gons". We first extract motion clusters, named as action units, then sequentially learn a pool of action-gons with different granularities modeling different degree of interactions among action units. We validate the discriminative power of our learned action-gons on three challenging video datasets and show evident advantages over the existing methods.



Person Re-Identification Using Multiple Experts with Random Subspaces

Sai Bi, Guanbin Li, and Yizhou Yu
Journal of Image and Graphics, Vol.2, No.2, 2014, PDF.

This paper presents a simple and effective multi-expert approach based on random subspaces for person re-identification across non-overlapping camera views. This approach applies to supervised learning methods that learn a continuous decision function. Our proposed method trains a group of expert functions, each of which is only exposed to a random subset of the input features. Each expert function produces an opinion according to the partial features it has. We also introduce weighted fusion schemes to effectively combine the opinions of multiple expert functions together to form a global view. Thus our method overall still makes use of all features without losing much information they carry. Yet each individual expert function can be trained efficiently without overfitting. We have tested our method on the VIPeR, ETHZ, and CAVIAR4REID datasets, and the results demonstrate that our method is able to significantly improve the performance of existing state-of-the-art techniques.



SCaLE: Supervised and Cascaded Laplacian Eigenmaps for Visual Object Recognition Based on Nearest Neighbors

Ruobing Wu, Yizhou Yu and Wenping Wang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, PDF.

Recognizing the category of a visual object remains a challenging computer vision problem. In this paper we develop a novel deep learning method that facilitates examplebased visual object category recognition. Our deep learning architecture consists of multiple stacked layers and computes an intermediate representation that can be fed to a nearest-neighbor classifier. This intermediate representation is discriminative and structure-preserving. It is also capable of extracting essential characteristics shared by objects in the same category while filtering out nonessential differences among them. Each layer in our model is a nonlinear mapping, whose parameters are learned through two sequential steps that are designed to achieve the aforementioned properties. The first step computes a discrete mapping called supervised Laplacian Eigenmap. The second step computes a continuous mapping from the discrete version through nonlinear regression. We have extensively tested our method and it achieves state-of-the-art recognition rates on a number of benchmark datasets.



Learning Image-Specific Parameters for Interactive Segmentation

Zhanghui Kuang, Dirk Schnieders, Hao Zhou, Kwan-Yee K. Wong, Yizhou Yu and Bo Peng
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, PDF.

In this paper, we present a novel interactive image segmentation technique that automatically learns segmentation parameters tailored for each and every image. Unlike existing work, our method does not require any offline parameter tuning or training stage, and is capable of determining image-specific parameters according to some simple user interactions with the target image. We formulate the segmentation problem as an inference of a conditional random field (CRF) over a segmentation mask and the target image, and parametrize this CRF by different weights (e.g., color,texture and smoothing). The weight parameters are learned via an energy margin maximization, which is solved using a constraint approximation scheme and the cutting plane method. Experimental results show that our method, by learning image-specific parameters automatically, outperforms other state-of-the-art interactive image segmentation techniques.



Reconstruction of 3-D Symmetric Curves from Perspective Images without Discrete Features

Wei Hong, Yi Ma and Yizhou Yu
European Conference on Computer Vision (ECCV), 2004, PDF.

The shapes of many natural and man-made objects have curved contours. The images of such contours usually do not have sufficient distinctive features to apply conventional feature-based reconstruction algorithms. This paper shows that both the shape of curves in 3-D space and the camera poses can be accurately reconstructed from their perspective images with unknown point correspondences given that the curves have certain invariant properties such as symmetry. We show that in such cases the minimum number of views needed for a solution is remarkably small: one for planar curves and two for nonplanar curves (of arbitrary shapes), which is significantly less than what is required by most existing algorithms for general curves. Our solutions rely on minimizing the L2-distance between the shapes of the curves reconstructed via the "epipolar geometry" of symmetric curves. Both simulations and experiments on real images are presented to demonstrate the effectiveness of our approach.



Shadow Graphs and Surface Reconstruction

Yizhou Yu and Johnny Chang
European Conference on Computer Vision (ECCV), 2002, PDF.
An extended version appears in International Journal of Computer Vision (IJCV), Vol. 62, No. 1-2, 2005.

We present a method to solve shape-from-shadow using shadow graphs which give a new graph-based representation for shadow constraints. It can be shown that the shadow graph alone is enough to solve the shape-from-shadow problem from a dense set of images. Shadow graphs provide a simpler and more systematic approach to represent and integrate shadow constraints from multiple images. To recover shape from a sparse set of images, we propose a method for integrated shadow and shading constraints. Previous shape-from-shadow algorithms do not consider shading constraints while shape-from-shading usually assumes there is no shadow. Our method is based on collecting a set of images from a fixed viewpoint as a known light source changes its position. It first builds a shadow graph from shadow constraints from which an upper bound for each pixel can be derived if the height values of a small number of pixels are initialized properly. Finally, a constrained optimization procedure is designed to make the results from shape-from-shading consistent with the upper bounds derived from the shadow constraints. Our technique is demonstrated on both synthetic and real imagery.






Acknowledgment: part of the material on this webpage is based upon work supported by the National Science Foundation.