Tuan Hue, THI
Graduate Student, School of Computer Science and Engineering
University of New South Wales, Sydney, Australia
In this project we aim to develop a complete visual content-based video search system built on motion content. Traditional text-based video search engines rely solely on annotated text descriptions of a video to carry out similarity search. Due to the rapid growth of the video domain, manual annotation is normally labor-intensive and can also be error-prone. Our motivation is to develop a system that uses the motion content inside each video to carry out the matching task, then ranks video similarity according to the returned matching values. We are particularly interested in two aspects of the matching algorithm: first, it should work for realistic cases with cluttered backgrounds; second, it should be generic enough to work on any kind of action. Taking these into account, we turn our focus to Invariant Local Features and a Nonparametric Implicit Shape Model.
The general framework is illustrated in the following figure.
Basically, a video shot is decomposed into a sparse set of invariant features, and with suitable detection and description techniques we can represent the video as a collection of feature vectors obtained from these local patches. These invariant local features make the representation of video data more robust to illumination change and occlusion. Using an Implicit Shape Model implemented with the Hough Transform, we construct a Hough Space from the query video. All candidate videos are projected onto this Model Hough Space, and the Model Fitting Region is derived by searching for the projected region with the highest density. This approach relies on no particular parametric form of the action feature data and works solely on the motion shape of the action, which keeps it generic.
1. Local Feature Detection and Extraction
In our system, we use two different approaches for Local Feature Extraction and Selection: one with a Sparse Bayesian Kernel Filter over Space-Time Interest Points, and the other with Local Motion-Shape Features. The output of these two techniques is the set of most representative local features of the action data, which is then fed into the Implicit Shape Model to compute the matching value.
1.1. Space-Time Interest Point Approach
1.1.1. Sparse Bayesian Learning
Among all detected interest points from the video shots, there is usually motion noise from the cluttered background that does not contribute to the action motion. In fact, those points normally make the modeling computation much harder and in some cases can completely distract from the core parts of the action. In order to filter out these irrelevant elements, we develop an extended version of the Sparse Bayesian Kernel Machine from the object recognition work of Carbonetto et al.
For each interest point x_i, as notated in the previous section, there is an associated class label y_i ∈ {-1, +1}. The idea is to build a hierarchical Bayesian classifier model whose sparsity-inducing parameters retain only the interest points that are relevant to the action.
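As an illustration, the sketch below filters interest points with a sparsity-inducing Bayesian kernel model. It is only a minimal stand-in, substituting scikit-learn's ARDRegression for the hierarchical classifier described above; the pseudo-labels, kernel width, and pruning threshold are illustrative assumptions, not the actual system parameters.

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

def filter_interest_points(points, labels, gamma=0.5, weight_eps=1e-3):
    """points: (N, D) space-time interest point descriptors.
    labels: (N,) in {-1, +1}, +1 meaning the point belongs to the action.
    Returns indices of points whose kernel basis survives ARD pruning."""
    K = rbf_kernel(points, points, gamma=gamma)  # kernel design matrix
    model = ARDRegression()                      # sparsity-inducing priors
    model.fit(K, labels.astype(float))
    # ARD drives the weights of irrelevant basis points toward zero;
    # keep only the points that retain a non-negligible weight.
    return np.flatnonzero(np.abs(model.coef_) > weight_eps)
```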
1.2. Motion-Shape Features for Video
Human action, in our perspective, is represented as a sparse set of local Motion-Shape features. These are selective patches detected at different scales from the Motion History Image (MHI); each contains information about both the motion field and the shape of the actor, hence we loosely use the term Motion-Shape to describe them. By analyzing these local attributes and their global configuration, we can conceptually articulate the behavior of the formulated action. The following figure describes our feature extraction algorithm. From the query video (a), we compute a collection of MHIs. For each MHI, we detect the dominant motion region and its center point (the white rectangle and circle in figure (b)). This motion blob and its center are used as references for Motion-Shape searching. In our design, we call those center points Local Action Centers, as opposed to the Global Action Center, the mean position of all Local Action Centers in each video.
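For concreteness, here is a minimal numpy sketch of one MHI, assuming plain frame differencing with a fixed threshold and a linear decay over tau frames; both parameters are illustrative rather than the system's actual settings.

```python
import numpy as np

def motion_history_image(frames, tau=15, diff_thresh=30):
    """frames: list of grayscale uint8 arrays from one video segment.
    Returns an MHI in [0, 1] where recently moved pixels carry high
    values that decay linearly over tau frames."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames, frames[1:]):
        moved = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > diff_thresh
        mhi = np.where(moved, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau
```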
Sliding a search window at different scales in (c) produces a collection of Motion-Shapes, drawn as the thin colored circles. Each Motion-Shape m is characterized by the local patch it covers; in Section 2, each one is indexed by its cluster id and gradient orientation id.
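The multi-scale window search itself can be sketched roughly as follows; the window sizes, stride, search radius around the Local Action Center, and the motion-energy threshold are all chosen purely for illustration.

```python
import numpy as np

def extract_motion_shape_patches(mhi, center, scales=(16, 32, 64), stride=8):
    """mhi: 2D float array in [0, 1]; center: (cx, cy) Local Action Center.
    Returns candidate Motion-Shape patches as ((x, y, size), patch) pairs."""
    cx, cy = center
    shapes = []
    for s in scales:
        for y in range(max(cy - 2 * s, 0), min(cy + 2 * s, mhi.shape[0] - s), stride):
            for x in range(max(cx - 2 * s, 0), min(cx + 2 * s, mhi.shape[1] - s), stride):
                patch = mhi[y:y + s, x:x + s]
                if patch.mean() > 0.1:  # keep windows with enough motion energy
                    shapes.append(((x, y, s), patch))
    return shapes
```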
2. Implicit Motion-Shape Model
In order to produce a generic design for modeling human action, we extend the Generalized Hough Transform technique to detect the 3D action structure formed by distinctive Motion-Shapes. Using the Global Action Center as the reference point of the 3D structure, we build a Hough Space that quantitatively represents the relative position of every Motion-Shape in the action model, illustrated as the Lookup Table in figure (d). In that coordinate system, each Motion-Shape is indexed by a key pair I = (c, ω) consisting of its cluster id c and gradient orientation id ω. The 3-tuple entries (x, y, t) in the Lookup Table are filled with the attributes of the Motion-Shapes extracted from the model video, keyed by the index value.
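Assuming each model Motion-Shape carries its cluster id c, orientation id ω, and offset (dx, dy, dt) to the Global Action Center, the Lookup Table can be sketched as a simple dictionary keyed by (c, ω); the tuple layout is an illustrative assumption.

```python
from collections import defaultdict

def build_hough_table(model_shapes):
    """model_shapes: iterable of (c, omega, dx, dy, dt) tuples extracted
    from the query (model) video. Returns the Model Hough Space as a
    lookup table mapping (c, omega) -> list of relative positions."""
    table = defaultdict(list)
    for c, omega, dx, dy, dt in model_shapes:
        table[(c, omega)].append((dx, dy, dt))
    return table
```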
Given this Hough Space, the action matching task becomes projecting the Motion-Shapes collected from a candidate video into this space; the matching value is calculated as how well those projected points fit the model. The figure above illustrates the main steps of the matching task, starting with the MHI construction (figure (a)) to compute new Motion-Shapes in (b). Using the filled Model Hough Space (top left of (b)), a particular Motion-Shape m_o (circled in white) uses its index key (c, ω) to find its corresponding entries in the Lookup Table. In this example, there are 5 records at its index describing the relative position of the model reference point, each of which casts a vote for a candidate Action Center location.
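A hedged sketch of this voting step: each candidate Motion-Shape at position (x, y, t) looks up its (c, ω) key and casts one vote per stored offset for a possible Action Center location; the tuple layout matches the hypothetical table above.

```python
import numpy as np

def cast_votes(candidate_shapes, table):
    """candidate_shapes: iterable of (c, omega, x, y, t) from the candidate
    video. Returns an (M, 3) array of voted Action Center positions."""
    votes = []
    for c, omega, x, y, t in candidate_shapes:
        for dx, dy, dt in table.get((c, omega), ()):
            votes.append((x + dx, y + dy, t + dt))  # one vote per model record
    return np.asarray(votes, dtype=np.float32)
```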
In our design, we run mean-shift Parzen-window density estimation to find the Model Fitting Region, the projected region with the highest density of voted Action Centers (figure (c)). Our model adopts two kinds of reference points: the Global Action Center and Averaging Local Action Centers. The former runs a 3D volume density search over the 3-tuple (x, y, t) location of the Global Action Center, while the latter runs 2D area searches over the 2-tuple (x, y) locations of all Local Action Centers on multiple MHIs and averages the best matches. We discuss the effects of these two referencing systems in the next section. The search criterion for mean-shift is the ratio r of vote counts within the search region, and the matching score can then be interpreted in different ways using this ratio. In our video search system (described in the next section), we rank video relevance by this ratio in descending order: the best match has the highest r.
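A minimal sketch of the density search, substituting scikit-learn's flat-kernel MeanShift for the Parzen-window estimator; the bandwidth is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import MeanShift

def matching_score(votes, bandwidth=10.0):
    """votes: (M, 3) array of voted Action Center positions.
    Returns the ratio r of votes that fall into the densest mode,
    which stands in for the Model Fitting Region."""
    if len(votes) == 0:
        return 0.0
    labels = MeanShift(bandwidth=bandwidth).fit(votes).labels_
    best = np.bincount(labels).argmax()    # densest cluster of votes
    return float(np.mean(labels == best))  # fraction of supporting votes
```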
3. Experimental Results
We developed a complete video search system based on the proposed action matching algorithm; Figure 4 shows a snapshot of its user interface. On the left panel, the Query Video is loaded and processed to generate the codebook dictionary of Motion-Shapes (in this particular query, 289 clusters are obtained); a Segmentation thumbnail is also generated to illustrate the relative position of the nominated Action Center. The right panel shows the videos from the database that have the highest matching values, ranked by relevance.
The following Figure lists the most representative action elements obtained from 16 actions of both datasets.
A detailed snapshot of the actual classification process is also shown in the following Figure.
In order to attain a thorough evaluation of our technique, we use this Video Search system to run tests on two datasets: KTH (2400 video shots of 6 actions) and Weizmann (92 video shots of 10 actions). The common benchmark train/test protocol for these datasets is a 2/3 split on KTH (16 persons for training and 9 for testing) and leave-one-out on Weizmann (8 persons for training and 1 for testing). Since we are carrying out a search task, we only need one training sample per query. Therefore, in order to produce a fair comparison with reported works, we randomly select the search queries (with equal samples of each action) and run the search on the same testing amount.
In our evaluation process, we are also interested in understanding the accuracy-time tradeoff and the sensitivity of the matching to viewpoint changes.
Tests on the Weizmann dataset generally return better results than on KTH, which is reasonable since the backgrounds in Weizmann are static while the cameras in KTH are not stable. The results also show that using the Global Action Center as the reference for the 3D point-cloud search yields better results than averaging individual 2D Local Action Centers. The performance decline of the two NonMirrored techniques implies that our technique is sensitive to view change, which is expected since we rely on the shape of the motion field to analyze action behavior. This drawback can be overcome with the Mirrored method, at the cost of an overhead portion of processing time.
The black straight line in the figure above is called the Cut-off Line, which connects the two ends of True Positive Rate [0, 1] and True Negative Rate [1, 0]; the intersections of this line with the ROC curves mark the operating points at which the two rates are balanced.
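As a small illustration of how the Cut-off Line is used, the following sketch locates the point where an ROC curve (given as fpr/tpr arrays sorted by fpr) crosses the line TPR = 1 - FPR, i.e. where the two rates are balanced.

```python
import numpy as np

def cutoff_point(fpr, tpr):
    """fpr, tpr: 1D arrays describing an ROC curve, sorted by fpr.
    Returns the ROC point closest to the Cut-off Line TPR = 1 - FPR."""
    gap = np.abs(tpr - (1.0 - fpr))  # distance from the Cut-off Line
    i = int(np.argmin(gap))
    return float(fpr[i]), float(tpr[i])
```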
Using the average accuracy, we also conduct a comparison between our methods and reported state-of-the-art action recognition approaches, as shown in Table 9. Interestingly, while other works tend to do well on only one of the two datasets, our system performs equally well on both, obtaining the second-best classification score on both KTH and Weizmann.