Video surveillance of large facilities such as airports usually requires a set of cameras to
ensure complete coverage of the scene. Tracking humans in such scenes requires
integrating video cues acquired from multiple cameras, as the targets typically walk through
a large area. A major task of a multiple-camera surveillance system is to reliably track a
person moving across cameras in the environment.
Most of the automatic surveillance approaches [1-6] require overlapping field of views
(FOVs) for tracking targets across multiple cameras. Jain and Wakimoto [1] used calibrated
cameras and an environmental model to obtain the 3D location of a person. Cai and Aggarwal
[2] used multiple calibrated cameras for surveillance; geometric and intensity features
were used to track the objects across multiple views. Collins et al. [3] developed a system
consisting of multiple calibrated cameras and a site model. They used region correlation
and location on the 3D site model for tracking. Bayesian Networks were used by
Dockstader and Tekalp [4] for tracking and occlusion reasoning across cameras with
overlapping views. Lee et al. [5] proposed an approach for tracking in cameras with
overlapping FOVs that does not require calibration: the camera calibration information
was recovered by matching motion trajectories obtained from different views, and plane
homographies were computed from the most frequent matches. Khan et al. [6] used
field-of-view (FOV) line constraints for tracking in cameras with overlapping views.
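Approaches such as [5] rest on estimating a plane homography from matched points on a common ground plane. As a minimal sketch of that step, the function below recovers a 3x3 homography from point correspondences with the standard Direct Linear Transform; this is a generic illustration of the technique, not the specific trajectory-matching procedure of [5].

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 plane homography H that maps src points to dst
    points (up to scale) via the Direct Linear Transform (DLT).
    src, dst: (N, 2) arrays of corresponding points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on the
        # nine entries of H (stacked row-wise into a vector h): A h = 0.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=float)
    # h is the null vector of A: the last right-singular vector of the SVD.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so H[2, 2] == 1
```

Given noiseless correspondences from a known planar mapping, the estimate matches the true homography up to numerical precision; with real trajectory matches one would use a robust estimator (e.g., RANSAC) on top of this.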
However, overlapping FOVs are usually not a practical assumption when covering a large
area. In addition, scene models and calibrated cameras are not always available in real
environments: cameras usually need re-calibration after pan-tilt-zoom (PTZ) actions, and
keeping the calibration of a large network of cameras up to date requires significant
maintenance effort.
We are therefore motivated to propose a framework for locating and tracking people using
uncalibrated cameras, which can have overlapping and/or non-overlapping FOVs, as
illustrated in Figure 1. A client-server architecture [7] is used to implement our approach.
At the single-camera level, a client program is attached to each camera to track [8-14] all
the targets in its FOV. Face recognition results help handle severely occluded targets,
which most existing tracking systems fail to track reliably. To enhance face recognition
performance, a face reconstruction method [15-17] is utilized to transform a human face
from an arbitrary pose back to the frontal pose that most face recognition algorithms
assume. The server then combines the face recognition results with human appearance and
space-time cues to establish stronger correspondences among all tracked targets. The
space-time constraints are derived during a training period through a self-calibration and
scene-modeling procedure.
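The server-side fusion step can be pictured as combining per-cue similarity scores into a single correspondence score and accepting the best candidate identity. The sketch below is an illustrative assumption, not the paper's exact model: the cue weights, the linear combination, and the acceptance threshold are hypothetical, and each cue score is assumed to be normalized to [0, 1].

```python
# Hypothetical weights for the three cues named in the text; the values
# and the weighted-sum form are illustrative assumptions.
W_FACE, W_APPEARANCE, W_SPACETIME = 0.5, 0.3, 0.2

def correspondence_score(face_sim, appearance_sim, spacetime_lik):
    """Combine per-cue similarities (each in [0, 1]) into one score."""
    return (W_FACE * face_sim
            + W_APPEARANCE * appearance_sim
            + W_SPACETIME * spacetime_lik)

def match_target(scores_per_candidate, threshold=0.6):
    """Pick the best-scoring candidate identity, or None if even the
    best candidate falls below the (assumed) acceptance threshold."""
    if not scores_per_candidate:
        return None
    best = max(scores_per_candidate, key=scores_per_candidate.get)
    return best if scores_per_candidate[best] >= threshold else None
```

For example, a candidate with strong face and appearance similarity would be matched, while one with uniformly weak cues would be rejected rather than forced into a correspondence.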