Best Deep Saliency Detection Models (CVPR 2015-2017)

Machine Vision and Pattern Recognition

Borrowing Treasures from the Wealthy: Deep Transfer Learning through Selective Joint Fine-tuning

Weifeng Ge and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, (Spotlight Paper) [BibTex]

Deep neural networks require a large amount of labeled training data during supervised learning. However, collecting and labeling so much data might be infeasible in many cases. In this paper, we introduce a deep transfer learning scheme, called selective joint fine-tuning, for improving the performance of deep learning tasks with insufficient training data. In this scheme, a target learning task with insufficient training data is carried out simultaneously with another source learning task with abundant training data. However, the source learning task does not use all existing training data. Our core idea is to identify and use a subset of training images from the original source learning task whose low-level characteristics are similar to those from the target learning task, and jointly fine-tune shared convolutional layers for both tasks. Specifically, we compute descriptors from linear or nonlinear filter bank responses on training images from both tasks, and use such descriptors to search for a desired subset of training samples for the source learning task.

Experiments demonstrate that our deep transfer learning scheme achieves state-of-the-art performance on multiple visual classification tasks with insufficient training data for deep learning. Such tasks include Caltech 256, MIT Indoor 67, and fine-grained classification problems (Oxford Flowers 102 and Stanford Dogs 120). In comparison to fine-tuning without a source domain, the proposed method can improve the classification accuracy by 2%-10% using a single model.
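The source-subset search described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the released code: the histogram descriptor, the L2 distance, the fixed neighbor count `k`, and all function names are hypothetical stand-ins for the filter-bank descriptors and search procedure used in the paper.

```python
import math

def histogram_descriptor(responses, bins=4, lo=0.0, hi=1.0):
    """Normalized histogram of filter responses (hypothetical low-level descriptor)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for r in responses:
        # Clamp to the valid bin range so out-of-range responses fall into the end bins.
        idx = min(bins - 1, max(0, int((r - lo) / width)))
        counts[idx] += 1
    total = float(len(responses))
    return [c / total for c in counts]

def l2_distance(a, b):
    """Euclidean distance between two descriptors (assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_source_subset(target_descs, source_descs, k):
    """Union of the k nearest source samples over all target samples."""
    selected = set()
    for t in target_descs:
        ranked = sorted(range(len(source_descs)),
                        key=lambda i: l2_distance(t, source_descs[i]))
        selected.update(ranked[:k])
    return sorted(selected)
```

The returned indices would identify the source images used alongside the target images when the shared convolutional layers are fine-tuned jointly.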

Codes and trained models are available at

Visual Saliency Based on Multiscale Deep Features

Guanbin Li and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, [BibTex], (PDF, Suppl)

Code Release (testing code and trained model), Dataset Download (Baidu Cloud, Google Drive), Project Webpage

Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be trained with multiscale features extracted using a popular deep learning framework, convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for extracting features at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotation. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.
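For readers unfamiliar with the two metrics quoted above, the sketch below shows how the F-measure and mean absolute error (MAE) are commonly computed for a saliency map against a binary ground-truth mask. The fixed threshold of 0.5 and the weight `beta_sq = 0.3` are assumptions taken from common practice in the saliency literature; the abstract does not state which settings were used.

```python
def mean_absolute_error(saliency, ground_truth):
    """Mean absolute difference between a saliency map in [0, 1] and a binary mask."""
    count = sum(len(row) for row in saliency)
    total = sum(abs(s - g)
                for s_row, g_row in zip(saliency, ground_truth)
                for s, g in zip(s_row, g_row))
    return total / count

def f_measure(saliency, ground_truth, threshold=0.5, beta_sq=0.3):
    """Weighted F-measure after binarizing the saliency map at `threshold`."""
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, ground_truth):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold
            if predicted and g:
                tp += 1
            elif predicted and not g:
                fp += 1
            elif not predicted and g:
                fn += 1
    if tp == 0:
        return 0.0  # No true positives: precision and recall are both zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

A higher F-measure and a lower MAE both indicate a saliency map closer to the annotation.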

Deep Contrast Learning for Salient Object Detection

Guanbin Li and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, [BibTex], (PDF)

Code Release (testing code and trained model), Dataset Download (saliency maps of 6 benchmark datasets), Project Webpage

Salient object detection has recently witnessed substantial progress due to powerful features extracted using deep convolutional neural networks (CNNs). However, existing CNN-based methods operate at the patch level instead of the pixel level. Resulting saliency maps are typically blurry, especially near the boundary of salient objects. Furthermore, image patches are treated as independent samples even when they are overlapping, giving rise to significant redundancy in computation and storage. In this paper, we propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The first stream directly produces a saliency map with pixel-level accuracy from an input image. The second stream extracts segment-wise features very efficiently, and better models saliency discontinuities along object boundaries. Finally, a fully connected CRF model can be optionally incorporated to improve spatial coherence and contour localization in the fused result from these two streams. Experimental results demonstrate that our deep model significantly improves the state of the art.

Instance-Level Salient Object Segmentation

Guanbin Li, Yuan Xie, Liang Lin, and Yizhou Yu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, (Spotlight Paper) [BibTex] (PDF)

Image saliency detection has recently witnessed rapid progress due to deep convolutional neural networks. However, none of the existing methods is able to identify object instances in the detected salient regions. In this paper, we present a salient instance segmentation method that produces a saliency mask with distinct object instance labels for an input image. Our method consists of three steps: estimating a saliency map, detecting salient object contours, and identifying salient object instances. For the first two steps, we propose a multiscale saliency refinement network, which generates high-quality salient region masks and salient object contours. Once integrated with multiscale combinatorial grouping and a MAP-based subset optimization framework, our method can generate very promising salient object instance segmentation results. To promote further research and evaluation of salient instance segmentation, we also construct a new database of 1000 images and their pixelwise salient instance annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks for salient region detection as well as on our new dataset for salient instance segmentation.

HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition

Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu
IEEE International Conference on Computer Vision (ICCV), 2015, [BibTex], (PDF, Suppl)

In image classification, visual separability between different object categories is highly uneven, and some categories are more difficult to distinguish than others. Such difficult categories demand more dedicated classifiers. However, existing deep convolutional neural networks (CNNs) are trained as flat N-way classifiers, and few efforts have been made to leverage the hierarchical structure of categories. In this paper, we introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a two-level category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. During HD-CNN training, component-wise pretraining is followed by global fine-tuning with a multinomial logistic loss regularized by a coarse category consistency term. In addition, conditional executions of fine category classifiers and layer parameter compression make HD-CNNs scalable for large-scale visual recognition. We achieve state-of-the-art results on both CIFAR100 and large-scale ImageNet 1000-class benchmark datasets. In our experiments, we build up three different two-level HD-CNNs, and they lower the top-1 error of the standard CNNs by 2.65%, 3.1%, and 1.1%.
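A rough sketch of how the two-level inference with conditional execution might look is given below. It is not the authors' implementation: the weighting of fine predictions by renormalized coarse probabilities, the `threshold` value, and the function name `hd_cnn_predict` are assumptions made for illustration.

```python
def hd_cnn_predict(coarse_probs, fine_classifiers, x, threshold=0.1):
    """Combine fine-classifier outputs, weighted by coarse-category probabilities.

    coarse_probs: one probability per coarse category for input x.
    fine_classifiers: one callable per coarse category; each maps x to a list
        of probabilities over all fine classes.
    Fine classifiers whose coarse probability falls below `threshold` are not
    executed at all (conditional execution).
    """
    active = [k for k, p in enumerate(coarse_probs) if p >= threshold]
    if not active:
        # Fall back to the single most likely coarse category.
        active = [max(range(len(coarse_probs)), key=lambda k: coarse_probs[k])]
    norm = sum(coarse_probs[k] for k in active)
    combined = None
    for k in active:
        fine = fine_classifiers[k](x)
        weight = coarse_probs[k] / norm  # Renormalize over executed classifiers.
        if combined is None:
            combined = [0.0] * len(fine)
        for i, p in enumerate(fine):
            combined[i] += weight * p
    return combined, active
```

Skipping low-probability branches is what keeps the cost of inference from growing with the number of fine category classifiers.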

LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling

Zhen Li, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, and Liang Lin
European Conference on Computer Vision (ECCV), 2016, PDF.

Code Release (testing code and trained model)

Semantic labeling of RGB-D scenes is crucial to many intelligent applications including perceptual robotics. It generates pixelwise and fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) Model that captures and fuses contextual information from multiple channels of photometric and depth data, and ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, contexts in photometric and depth channels are, respectively, captured by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the contexts along the vertical direction from different channels, and perform bi-directional propagation of the fused vertical contexts along the horizontal direction to obtain true 2D global contexts. Finally, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels in order to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUNRGBD dataset and the NYUDv2 dataset, respectively.

Visual Saliency Detection Based on Multiscale Deep CNN Features

Guanbin Li and Yizhou Yu
IEEE Transactions on Image Processing (TIP), Vol 25, No 11, 2016, PDF

Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. The penultimate layer of our neural network has been confirmed to be a discriminative high-level feature vector for saliency detection, which we call deep contrast feature. To generate a more robust feature, we integrate handcrafted low-level features with our deep contrast feature. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-measure by 6.12% and 10.0% respectively on the DUT-OMRON dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 9% and 35.3% respectively on these two datasets.

Piecewise Flat Embedding for Image Segmentation

Yizhou Yu, Chaowei Fang, and Zicheng Liao
IEEE International Conference on Computer Vision (ICCV), 2015, [BibTex], (PDF, Suppl, Slides, VideoLecture)

Image segmentation is a critical step in many computer vision tasks, including high-level visual recognition and scene understanding as well as low-level photo and video processing. In this paper, we propose a new nonlinear embedding, called piecewise flat embedding, for image segmentation. Based on the theory of sparse signal recovery, piecewise flat embedding attempts to identify segment boundaries while significantly suppressing variations within segments. We adopt an L1-regularized energy term in the formulation to promote sparse solutions. We further devise an effective two-stage numerical algorithm based on Bregman iterations to solve the proposed embedding. Piecewise flat embedding can be easily integrated into existing image segmentation frameworks, including segmentation based on spectral clustering and hierarchical segmentation based on contour detection. Experiments on BSDS500 indicate that segmentation algorithms incorporating this embedding can achieve significantly improved results in both frameworks.
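As a point of comparison, one plausible way to write the L1-regularized embedding objective is shown next to the classical Laplacian eigenmap objective. The symbols are assumptions introduced here for illustration: $W_{ij}$ is a pixel affinity, $D$ the corresponding degree matrix, and $y_i$ the embedding of pixel $i$ (row $i$ of $Y$). The exact formulation in the paper may differ.

```latex
% Laplacian eigenmaps: squared L2 penalty, which yields smoothly varying embeddings
\min_{Y} \; \sum_{i,j} W_{ij} \, \lVert y_i - y_j \rVert_2^2
\quad \text{s.t.} \quad Y^{\top} D Y = I

% Piecewise flat embedding (assumed form): L1 penalty, which promotes sparse differences
\min_{Y} \; \sum_{i,j} W_{ij} \, \lVert y_i - y_j \rVert_1
\quad \text{s.t.} \quad Y^{\top} D Y = I
```

Because the L1 term favors solutions in which most differences $y_i - y_j$ are exactly zero, the embedding tends to be nearly constant within a segment and to jump only at segment boundaries.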

Related Paper: S. Bi, X. Han, and Y. Yu. An L1 Image Transform for Edge-Preserving Smoothing and Scene-Level Intrinsic Decomposition. SIGGRAPH 2015.

Harvesting Discriminative Meta Objects with Deep CNN Features for Scene Classification

Ruobing Wu, Baoyuan Wang, Wenping Wang, and Yizhou Yu
IEEE International Conference on Computer Vision (ICCV), 2015, (PDF)

Recent work on scene classification still makes use of generic CNN features in a rudimentary manner. In this paper, we present a novel pipeline built upon deep CNN features to harvest discriminative visual objects and parts for scene classification. We first use a region proposal technique to generate a set of high-quality patches potentially containing objects, and apply a pre-trained CNN to extract generic deep features from these patches. Then we perform both unsupervised and weakly supervised learning to screen these patches and discover discriminative ones representing category-specific objects and parts. We further apply discriminative clustering enhanced with local CNN fine-tuning to aggregate similar objects and parts into groups, called meta objects. A scene image representation is constructed by pooling the feature response maps of all the learned meta objects at multiple spatial scales. We have confirmed that the scene image representation obtained using this new pipeline is capable of delivering state-of-the-art performance on two popular scene benchmark datasets, MIT Indoor 67 and SUN397.

Action-Gons: Action Recognition with A Discriminative Dictionary of Structured Elements with Varying Granularity

Yuwang Wang, Baoyuan Wang, Yizhou Yu, Qionghai Dai, and Zhuowen Tu
Asian Conference on Computer Vision (ACCV), 2014, PDF.

This paper presents "Action-Gons", a mid-level representation for action recognition in videos. Actions in videos exhibit a reasonable level of regularity seen in human behavior, as well as a large degree of variation. One key property of actions, compared with image scenes, might be the amount of interaction among body parts, although scenes also exhibit structured patterns in 2D images. Here, we study high-order statistics of the interaction among regions of interest in actions and propose a mid-level representation for action recognition, inspired by the Julesz school of n-gon statistics. We propose a systematic learning process to build an over-complete dictionary of "Action-Gons". We first extract motion clusters, called action units, then sequentially learn a pool of action-gons with different granularities modeling different degrees of interaction among action units. We validate the discriminative power of our learned action-gons on three challenging video datasets and show evident advantages over the existing methods.

Person Re-Identification Using Multiple Experts with Random Subspaces

Sai Bi, Guanbin Li, and Yizhou Yu
Journal of Image and Graphics, Vol.2, No.2, 2014, PDF.

This paper presents a simple and effective multi-expert approach based on random subspaces for person re-identification across non-overlapping camera views. This approach applies to supervised learning methods that learn a continuous decision function. Our proposed method trains a group of expert functions, each of which is only exposed to a random subset of the input features. Each expert function produces an opinion according to the partial features it has. We also introduce weighted fusion schemes to effectively combine the opinions of multiple expert functions together to form a global view. Thus our method overall still makes use of all features without losing much information they carry. Yet each individual expert function can be trained efficiently without overfitting. We have tested our method on the VIPeR, ETHZ, and CAVIAR4REID datasets, and the results demonstrate that our method is able to significantly improve the performance of existing state-of-the-art techniques.
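The multi-expert idea can be illustrated with the toy sketch below. It is an assumption-laden stand-in rather than the method from the paper: the expert here is a simple class-mean linear scorer, the fusion weights are training accuracies, and all function names are hypothetical. In a re-identification setting, each sample would be a feature vector for an image pair, with label 1 meaning "same person".

```python
import random

def sample_subspaces(num_features, num_experts, subspace_size, seed=0):
    """Draw one random feature subset per expert."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_features), subspace_size))
            for _ in range(num_experts)]

def train_expert(samples, labels, subspace):
    """Toy expert: a linear score built from class means on the selected features only."""
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    weights = [mean(pos, j) - mean(neg, j) for j in subspace]
    centers = [(mean(pos, j) + mean(neg, j)) / 2.0 for j in subspace]

    def score(x):
        # Continuous decision value; positive means "looks like the positive class".
        return sum(w * (x[j] - c) for w, c, j in zip(weights, centers, subspace))

    return score

def accuracy(expert, samples, labels):
    """Fraction of samples whose score sign agrees with the label."""
    hits = sum((expert(s) > 0) == (y == 1) for s, y in zip(samples, labels))
    return hits / len(samples)

def fused_score(experts, fusion_weights, x):
    """Weighted average of the experts' opinions."""
    total = sum(fusion_weights)
    return sum(w * e(x) for e, w in zip(experts, fusion_weights)) / total
```

Each expert sees only part of the feature vector, yet the fused score still draws on every feature that appears in at least one subspace.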

SCaLE: Supervised and Cascaded Laplacian Eigenmaps for Visual Object Recognition Based on Nearest Neighbors

Ruobing Wu, Yizhou Yu and Wenping Wang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, PDF.

Recognizing the category of a visual object remains a challenging computer vision problem. In this paper, we develop a novel deep learning method that facilitates example-based visual object category recognition. Our deep learning architecture consists of multiple stacked layers and computes an intermediate representation that can be fed to a nearest-neighbor classifier. This intermediate representation is discriminative and structure-preserving. It is also capable of extracting essential characteristics shared by objects in the same category while filtering out nonessential differences among them. Each layer in our model is a nonlinear mapping, whose parameters are learned through two sequential steps that are designed to achieve the aforementioned properties. The first step computes a discrete mapping called supervised Laplacian Eigenmap. The second step computes a continuous mapping from the discrete version through nonlinear regression. We have extensively tested our method and it achieves state-of-the-art recognition rates on a number of benchmark datasets.

Learning Image-Specific Parameters for Interactive Segmentation

Zhanghui Kuang, Dirk Schnieders, Hao Zhou, Kwan-Yee K. Wong, Yizhou Yu and Bo Peng
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, PDF.

In this paper, we present a novel interactive image segmentation technique that automatically learns segmentation parameters tailored for each and every image. Unlike existing work, our method does not require any offline parameter tuning or training stage, and is capable of determining image-specific parameters according to some simple user interactions with the target image. We formulate the segmentation problem as an inference of a conditional random field (CRF) over a segmentation mask and the target image, and parametrize this CRF by different weights (e.g., color, texture and smoothing). The weight parameters are learned via an energy margin maximization, which is solved using a constraint approximation scheme and the cutting plane method. Experimental results show that our method, by learning image-specific parameters automatically, outperforms other state-of-the-art interactive image segmentation techniques.

Subspace Segmentation with A Minimal Squared Frobenius Norm Representation

Siming Wei and Yizhou Yu
International Conference on Pattern Recognition (ICPR), 2012, PDF.

We introduce a novel subspace segmentation method called Minimal Squared Frobenius Norm Representation (MSFNR). MSFNR performs data clustering by solving a convex optimization problem. We theoretically prove that in the noiseless case, MSFNR is equivalent to the classical Factorization approach and always classifies data correctly. In the noisy case, we show that on both synthetic and real-world datasets, MSFNR is much faster than most state-of-the-art methods while achieving comparable segmentation accuracy.
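The noiseless equivalence can be made concrete with a short derivation. The notation is an assumption introduced here: $X$ is the data matrix with one sample per column, $Z$ is the self-representation matrix, and $X = U_r \Sigma_r V_r^{\top}$ is the rank-$r$ (skinny) SVD. The problem statement below is a plausible reading of the abstract, not a quotation from the paper.

```latex
\min_{Z} \; \lVert Z \rVert_F^2
\quad \text{s.t.} \quad X = X Z
\qquad \Longrightarrow \qquad
Z^{\star} = X^{+} X = V_r V_r^{\top}
```

The minimum-norm solution of the linear system $XZ = X$ is given by the pseudoinverse, and $V_r V_r^{\top}$ is the shape interaction matrix used by the classical factorization approach, whose entries vanish for samples drawn from independent subspaces.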

Reconstruction of 3-D Symmetric Curves from Perspective Images without Discrete Features

Wei Hong, Yi Ma and Yizhou Yu
European Conference on Computer Vision (ECCV), 2004, PDF.

The shapes of many natural and man-made objects have curved contours. The images of such contours usually do not have sufficient distinctive features to apply conventional feature-based reconstruction algorithms. This paper shows that both the shape of curves in 3-D space and the camera poses can be accurately reconstructed from their perspective images with unknown point correspondences given that the curves have certain invariant properties such as symmetry. We show that in such cases the minimum number of views needed for a solution is remarkably small: one for planar curves and two for nonplanar curves (of arbitrary shapes), which is significantly less than what is required by most existing algorithms for general curves. Our solutions rely on minimizing the L2-distance between the shapes of the curves reconstructed via the "epipolar geometry" of symmetric curves. Both simulations and experiments on real images are presented to demonstrate the effectiveness of our approach.

Two-Level Image Segmentation Based on Region and Edge Integration

Qing Wu and Yizhou Yu
Seventh International Conference on Digital Image Computing: Techniques and Applications, 2003, PDF.

This paper introduces a two-level approach for image segmentation based on region and edge integration. Edges are first detected in the original image using a combination of operators for intensity gradient and texture discontinuities. To preserve the spatial coherence of the edges and their surrounding image regions, the detected edges are vectorized into connected line segments which serve as the basis for a constrained Delaunay triangulation. Segmentation is first performed on the triangulation using graph cuts. Our method favors segmentations that pass through more vectorized line segments. Finally, the obtained segmentation on the triangulation is projected onto the original image and region boundaries are refined to achieve pixel accuracy. Experimental results show that the two-level approach can achieve accurate edge localization, better spatial coherence and improved efficiency.

Shadow Graphs and Surface Reconstruction

Yizhou Yu and Johnny Chang
European Conference on Computer Vision (ECCV), 2002, PDF.
An extended version appears in International Journal of Computer Vision (IJCV), Vol. 62, No. 1-2, 2005.

We present a method to solve shape-from-shadow using shadow graphs, which give a new graph-based representation for shadow constraints. It can be shown that the shadow graph alone is enough to solve the shape-from-shadow problem from a dense set of images. Shadow graphs provide a simpler and more systematic approach to represent and integrate shadow constraints from multiple images. To recover shape from a sparse set of images, we propose a method that integrates shadow and shading constraints. Previous shape-from-shadow algorithms do not consider shading constraints, while shape-from-shading usually assumes there is no shadow. Our method is based on collecting a set of images from a fixed viewpoint as a known light source changes its position. It first builds a shadow graph from shadow constraints, from which an upper bound for each pixel can be derived if the height values of a small number of pixels are initialized properly. Finally, a constrained optimization procedure is designed to make the results from shape-from-shading consistent with the upper bounds derived from the shadow constraints. Our technique is demonstrated on both synthetic and real imagery.
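The upper-bound step can be pictured with the small sketch below. It rests on assumptions not spelled out in the abstract: each graph edge is taken to be a triple `(occluder, shadowed, drop)` meaning the shadowed pixel must lie at least `drop` below the occluder (for example, horizontal distance times the tangent of the light's elevation angle), and bounds are propagated by simple repeated relaxation. The function name and data layout are hypothetical.

```python
def propagate_upper_bounds(num_pixels, edges, seeds):
    """Propagate height upper bounds through a shadow graph.

    edges: list of (occluder, shadowed, drop) triples, read as
        height[shadowed] <= height[occluder] - drop.
    seeds: dict mapping a few pixel indices to initialized heights.
    Pixels unreachable from any seed keep an infinite (unknown) bound.
    """
    upper = [float("inf")] * num_pixels
    for pixel, height in seeds.items():
        upper[pixel] = height
    # Relax every edge until no bound tightens (at most num_pixels - 1 passes).
    for _ in range(num_pixels - 1):
        changed = False
        for occluder, shadowed, drop in edges:
            candidate = upper[occluder] - drop
            if candidate < upper[shadowed]:
                upper[shadowed] = candidate
                changed = True
        if not changed:
            break
    return upper
```

When several shadow constraints reach the same pixel, the tightest (smallest) bound is kept, which is the value a subsequent shape-from-shading optimization would have to respect.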

Acknowledgment: part of the material on this webpage is based upon work supported by the National Science Foundation.