Tuan Hue, THI
Graduate Student, School of Computer Science and Engineering
University of New South Wales, Sydney, Australia

Level 4, 223 Anzac Parade, Kensington, NSW, 2033, Australia
Phone: +61 2 8306 0447
Mobile: 0404 659 624
Email: huetuan1984@gmail.com
Web: http://huetuan.net

Implicit Motion-Shape Model and Approaches to Visual-based Video Searching
Tuan Hue THI, Jian Zhang (NICTA-Sydney), Li Cheng (TTI-Chicago), Li Wang (SEU-China), and Shin'ichi Satoh (NII-Tokyo)


In this project we aim to develop a complete visual-based video search system that uses motion content. Traditional text-based video search engines rely solely on the annotated text description of a video to carry out similarity search. Due to the fast growth of the video domain, manual annotation is normally labor-expensive and can also be error-prone. Our motivation is to develop a system that uses the motion content inside each video to carry out the matching task, then ranks video similarity according to the returned matching values. We are particularly interested in two aspects of the matching algorithm: firstly, it should work for realistic cases with cluttered backgrounds; secondly, it should be generic enough to work on any kind of action. Taking these into account, we turn our focus to Invariant Local Features and a Nonparametric Implicit Shape Model.

The general framework can be illustrated by the following figure.


Basically, a video shot is decomposed into a sparse set of invariant features, and with suitable detection and descriptor techniques we can represent the video as a collection of feature vectors obtained from these local patches. These invariant local features make the representation of the video data more robust to illumination change and occlusion. Using the Implicit Shape Model with a Hough Transform implementation, we construct a Hough Space from the query video. All video candidates are then projected onto this Model Hough Space, and the Model Fitting Region is derived by searching for the projected region with the highest density. This approach relies on no particular parameters of the action feature data and works solely on the motion shape of the action, so it remains generic across action types.
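To make the pipeline concrete, here is a minimal end-to-end sketch in Python. The helper names (extract_local_features, build_hough_space, fit_score, rank_candidates) are hypothetical placeholders for the components detailed in the following sections, stubbed with toy implementations so the outline runs; only the overall flow (extract features, build the model space from the query, project candidates, rank by how well they fit) mirrors the framework described above.

```python
# Minimal sketch of the search pipeline. Helper names are hypothetical stand-ins
# for the components described in Sections 1-2; they are stubbed so the outline runs.
import numpy as np

def extract_local_features(video):
    # Stand-in: pretend each "video" is already an (N, d) array of local descriptors.
    return np.asarray(video, dtype=float)

def build_hough_space(query_features):
    # Stand-in: the "model" here is simply the query feature cloud itself.
    return query_features

def fit_score(model, candidate_features):
    # Crude proxy for density-based fitting: fraction of candidate features that
    # land close to some model feature.
    d = np.linalg.norm(candidate_features[:, None, :] - model[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1) < 1.0))

def rank_candidates(query_video, candidate_videos):
    model = build_hough_space(extract_local_features(query_video))
    scores = {name: fit_score(model, extract_local_features(v))
              for name, v in candidate_videos.items()}
    # Higher score = better match, so sort in descending order.
    return sorted(scores.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(50, 3))
    candidates = {"clip_a": query + rng.normal(scale=0.2, size=(50, 3)),
                  "clip_b": rng.normal(loc=5.0, size=(50, 3))}
    print(rank_candidates(query, candidates))   # clip_a should rank first
```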

1. Local Feature Detection and Extraction

In our system, we use two different approaches for Local Feature Extraction and Selection: one with a Sparse Bayesian Kernel Filter of Space-Time Interest Points, and the other with Local Motion-Shape Features. The output of these two techniques is the set of most representative local features of the action data, which is fed into the Implicit Shape Model for matching.

1.1. Space-Time Interest Point Approach
1.1.1. Local Features as Video Representation

A video shot in this perspective is a complex set of local features under various configurations. Tackling action recognition this way, as discussed earlier, helps to lighten the dependency on view and scale variance of the action's visual appearance. We adopt the existing Space Time Interest Point (STIP) detection technique from Laptev et al. [13] to detect points with high motion change. In addition, observing that certain regions around these detected points also contribute to the action context, we refine the STIP detection results with a post-inpainting procedure whose idea is similar to the Image Inpainting described by Criminisi et al. [4]. The inpainting process starts on the boundary of connected STIP point regions and, based on the median scale and frequency of these STIP points, generates hypotheses about whether other points in the neighborhood should be included. The following figure illustrates the effect of our improved technique, inpainted Space Time Interest Points (iSTIP), over the traditional STIP. The surrounding areas of the detected pixels are then described using a concatenation of Histogram of Oriented Gradients (HOG) and Histogram of Optical Flow (HOF) [13]. In order to better organize the interest points in terms of their appearance, we use the agglomerative clustering scheme from Agarwal and Roth [1] to group similar interest regions based on pairwise Normalized Greyscale Correlation (NGC) values, which yields the highest similarity compactness of pixel appearance regardless of the number of clusters being produced. With this analysis, a video shot is now represented as a sparse set of interest points x_i(i, c), each having cluster identity i and 3D coordinate c.
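As an illustration of the clustering step, the sketch below groups interest-point patches by pairwise NGC using an off-the-shelf average-linkage agglomerative clustering from SciPy; this is not the exact Agarwal-Roth procedure, and the patch size and NGC cut-off threshold are arbitrary example values.

```python
# Illustrative agglomerative grouping of interest-point patches by NGC.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ngc(p, q):
    """Normalized greyscale correlation between two equally sized patches."""
    p = (p - p.mean()) / (p.std() + 1e-8)
    q = (q - q.mean()) / (q.std() + 1e-8)
    return float(np.mean(p * q))

def cluster_patches(patches, ngc_threshold=0.7):
    n = len(patches)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - ngc(patches[i], patches[j])
    Z = linkage(squareform(dist), method="average")
    # Cut the dendrogram at an average NGC distance of (1 - threshold).
    return fcluster(Z, t=1.0 - ngc_threshold, criterion="distance")

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.normal(size=(9, 9))
    patches = [base + rng.normal(scale=0.1, size=(9, 9)) for _ in range(5)] \
            + [rng.normal(size=(9, 9)) for _ in range(5)]
    print(cluster_patches(patches))   # the first five patches tend to share one cluster id
```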


1.1.2. Sparse Bayesian Learning

Among all detected interest points from the video shots, there is usually motion noise from the scattered background that does not contribute to the action motion. In fact, those points normally make the modeling computation much harder and in some cases can completely distract from the core parts of the action. In order to filter out these irrelevant elements, we develop an extended version of the Sparse Bayesian Kernel Machine from the Object Recognition work of Carbonetto et al. [3]. Each interest point x_i(i, c), as notated in the previous section, is associated with a class label y in {-1, 1}. The idea is to build a hierarchical Bayesian classifier model with parameters learned from the limited amount of available training data. Following [3], we adopt a sparse kernel machine for classification. The following figure shows our adoption of this Sparse Bayesian Machine for the task of feature labeling on the Weizmann [6] dataset: red circles represent noisy background points, while green circles are those positively labeled as parts of the action configuration, denoted as action elements in our system.
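The following sketch is a rough, hedged stand-in for this filtering step: it uses scikit-learn's ARDRegression (a sparse automatic-relevance-determination prior) on top of an RBF kernel expansion to label interest points as action elements (+1) or background noise (-1). It is not the hierarchical model of Carbonetto et al. [3], only a sparse Bayesian-flavoured proxy to show how such a filter plugs in.

```python
# Sparse Bayesian-flavoured stand-in for labeling interest points.
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

def fit_point_filter(train_points, train_labels, gamma=0.5):
    """train_points: (N, d) interest-point features; train_labels: +/-1."""
    K = rbf_kernel(train_points, train_points, gamma=gamma)   # kernel expansion
    model = ARDRegression()                                    # sparse ARD prior on weights
    model.fit(K, train_labels)
    return model, train_points, gamma

def label_points(classifier, points):
    model, basis, gamma = classifier
    K = rbf_kernel(points, basis, gamma=gamma)
    # Threshold the regression output at 0 to obtain +/-1 labels.
    return np.where(model.predict(K) >= 0.0, 1, -1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    action = rng.normal(loc=0.0, scale=0.5, size=(40, 4))      # coherent action points
    noise = rng.normal(loc=3.0, scale=0.5, size=(40, 4))       # scattered background
    X = np.vstack([action, noise])
    y = np.hstack([np.ones(40), -np.ones(40)])
    clf = fit_point_filter(X, y)
    print(label_points(clf, rng.normal(loc=0.0, scale=0.5, size=(5, 4))))  # mostly +1
```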


1.2. Motion-Shape Features for Video

Human action in our perspective is represented as a sparse set of local Motion-Shape features. These are the selective patches detected at different scales from the Motion History Image (MHI); each contains information about the motion field and the shape of the actor, hence we loosely use the term Motion-Shape to describe them. By analyzing these local attributes and their global configuration, we can conceptually articulate the behavior of the formulated action. The following figure describes our feature extraction algorithm. From the query video (a), we compute a collection of MHIs following [2]. For each MHI, the dominant motion region and its center point are detected (white rectangle and circle in Figure (b)). This motion blob and its center are used as references for Motion-Shape searching. In our design, we call those center points Local Action Centers, as opposed to the Global Action Center, the mean position of all Local Action Centers in each video.
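A minimal sketch of the MHI and Local Action Center computation is given below, assuming the video is already available as a (T, H, W) greyscale array; the decay duration tau, the motion threshold, and the centroid-of-motion approximation of the dominant blob are illustrative simplifications rather than the exact procedure of [2].

```python
# Minimal Motion History Image (MHI) and Local Action Center sketch.
import numpy as np

def motion_history_image(frames, tau=15, diff_threshold=20.0):
    """frames: (T, H, W) greyscale video volume."""
    frames = frames.astype(float)
    mhi = np.zeros_like(frames[0])
    for t in range(1, len(frames)):
        moving = np.abs(frames[t] - frames[t - 1]) > diff_threshold
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))  # refresh or decay
    return mhi

def local_action_center(mhi):
    ys, xs = np.nonzero(mhi > 0)
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())   # centroid of the recent-motion region

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    video = rng.uniform(0, 5, size=(30, 64, 64))          # near-static noisy background
    for t in range(30):
        video[t, 20:30, t:t + 10] = 255.0                 # a block sliding to the right
    print(local_action_center(motion_history_image(video)))
```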


Figure: Constructing the Hough Space for the Action Model. (a) Query Video: the Action Model is built from this video. (b) MHI: nominates the action search region with the Local Action Center. (c) Motion-Shape: extracted as m(x, y, t, c, omega); circle colors represent cluster identification. (d) Hough Space: represented as a Lookup Table; 3-tuple entries (x, y, t) are filled using the index pair (c, omega).

Sliding the window search at different scales in (c) produces a collection of Motion-Shapes, drawn as the thin colored circles. Each Motion-Shape m is defined by its relative position (x, y, t) to the Action Centers. Since each Motion-Shape is by nature a motion field, we easily compute the Histogram of Oriented Gradients (HOG), and also the orientation omega of the most dominant gradient in each patch. Using the HOG values obtained from each patch, we cluster the patches into different representative groups, each defined by a cluster identification number c. Figure (c) shows the cluster id of the Motion-Shape circles using different colors. At this stage, a video shot is genuinely decomposed into a set of Motion-Shapes m(x, y, t, c, omega), which will be encoded into our model, described in the next section.
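The sketch below illustrates this extraction step in simplified form: fixed-size windows slide over an MHI, a coarse gradient-orientation histogram stands in for the full HOG descriptor, the dominant orientation bin plays the role of omega, and k-means (rather than the clustering actually used) assigns the cluster ids c. Window size, stride, and the number of clusters are example values.

```python
# Simplified Motion-Shape extraction over a single MHI.
import numpy as np
from sklearn.cluster import KMeans

def patch_descriptor(patch, bins=8):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                   # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    omega = int(np.argmax(hist))                              # dominant orientation bin
    return hist / (hist.sum() + 1e-8), omega

def extract_motion_shapes(mhi, t_index, win=16, stride=8, n_clusters=5, seed=0):
    feats, shapes = [], []
    for y in range(0, mhi.shape[0] - win + 1, stride):
        for x in range(0, mhi.shape[1] - win + 1, stride):
            patch = mhi[y:y + win, x:x + win]
            if patch.max() == 0:                              # skip motionless patches
                continue
            hist, omega = patch_descriptor(patch)
            feats.append(hist)
            shapes.append({"x": x, "y": y, "t": t_index, "omega": omega})
    if not feats:
        return []
    labels = KMeans(n_clusters=min(n_clusters, len(feats)), n_init=10,
                    random_state=seed).fit_predict(np.array(feats))
    for s, c in zip(shapes, labels):
        s["c"] = int(c)
    return shapes                                             # each is m(x, y, t, c, omega)
```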

2. Implicit Motion-Shape Model

In order to produce a generic design for modeling human action, we extend the Generalized Hough Transform [2] technique to detect the 3D action structure formed by different distinctive Motion-Shapes. Using the Global Action Center as the reference point in our 3D structure, we build a Hough Space to quantitatively represent the relative position of all Motion-Shapes in the action model, illustrated as the Lookup Table in Figure (d). In that coordinate system, each Motion-Shape is indexed by a key pair I = (c, omega) consisting of its cluster id c and gradient orientation id omega. The 3-tuple entries (x, y, t) in the Lookup Table are filled with the Motion-Shape attributes extracted from the model video, based on the index key value.
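A minimal sketch of this Lookup Table, assuming Motion-Shapes are stored as small dicts m(x, y, t, c, omega): each shape contributes the offset of the Global Action Center relative to itself under its key pair I = (c, omega). The dict-of-lists layout is an illustrative choice, not a prescribed implementation.

```python
# Lookup Table (Hough Space) construction from the model video's Motion-Shapes.
from collections import defaultdict

def build_hough_table(motion_shapes, action_center):
    """motion_shapes: list of dicts m(x, y, t, c, omega); action_center: (cx, cy, ct)."""
    cx, cy, ct = action_center
    table = defaultdict(list)
    for m in motion_shapes:
        offset = (cx - m["x"], cy - m["y"], ct - m["t"])   # where the center lies, seen from m
        table[(m["c"], m["omega"])].append(offset)
    return table
```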


Given this Hough Space, the action matching task now becomes projecting the Motion-Shapes collected from a candidate video onto this space; the matching value is calculated as how well those projected points fit the model. The above figure illustrates the main steps of the matching task, starting with the MHI construction (Figure (a)) to calculate new Motion-Shapes in (b). Using the filled Model Hough Space (top left of (b)), a particular Motion-Shape m_o (circled in white) uses its index key (c, omega) to find its corresponding entries in the Lookup Table. In this example, at its located index there are 5 records of the relative position of the model reference point (Model Action Center), and all of them are considered possible locations for the new Action Center of this video. The search for this new Action Center determines how well this video's action matches the model action. Formally, for each hypothesis about the Action Center position we can factor the marginalization probability using the Motion-Shape evidence and its location l = (x, y, t) based on the index pair I = (c, omega); this factorization is adopted from the 2D Object Recognition work of Leibe et al. [14]. The quantitative matching score of an unknown motion compared to the action model is then defined as the summation over all Motion-Shape entities and Action Center locations.
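In code, the voting step can be sketched as follows, reusing the table built above: every candidate Motion-Shape looks up its key (c, omega) and casts one vote per stored offset for a hypothesised Action Center location; the density of these votes is what the next step searches.

```python
# Hough voting: candidate Motion-Shapes propose Action Center locations.
import numpy as np

def cast_votes(hough_table, candidate_shapes):
    votes = []
    for m in candidate_shapes:
        for dx, dy, dt in hough_table.get((m["c"], m["omega"]), []):
            votes.append((m["x"] + dx, m["y"] + dy, m["t"] + dt))
    return np.array(votes)          # (V, 3) array of hypothesised Action Center locations
```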

In our design, we run mean-shift Parzen window density estimation to find the Model Fitting Region, which is the projected region with the highest density of voted Action Centers (Figure (c)). In our model, we adopt two kinds of reference points: the Global Action Center and the average of Local Action Centers. While the former runs a 3D volume density search using the 3-tuple (x, y, t) location of the Global Action Center, the latter runs 2D area searches using the 2-tuple (x, y) locations of all Local Action Centers on multiple MHIs, and the average of all best matches is used. We discuss the effects of these two referencing systems in the next section. The search criterion for mean-shift is the ratio r of vote counts in the searched region. The matching score can then be interpreted in different ways using this ratio; in our video search system (described in the next section), we rank video relevancy in descending order of this ratio, so the best match has the highest r.
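A hedged sketch of this search, using scikit-learn's MeanShift as a stand-in for the mean-shift Parzen-window estimation described above: the densest cluster of voted Action Centers defines the Model Fitting Region, and the ratio r is the fraction of all votes that fall in it. The bandwidth value is illustrative.

```python
# Model Fitting Region search over the voted Action Centers.
import numpy as np
from sklearn.cluster import MeanShift

def model_fitting_score(votes, bandwidth=10.0):
    if len(votes) == 0:
        return 0.0, None
    ms = MeanShift(bandwidth=bandwidth).fit(votes)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    best = labels[np.argmax(counts)]
    r = counts.max() / len(votes)                  # vote-count ratio on the densest region
    return float(r), ms.cluster_centers_[best]     # (score, fitted Action Center)
```

Candidate videos are then ranked by this ratio r in descending order, as described above.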

3. Experimental Results

We developed a complete video search system based on the proposed action matching algorithm; Figure 4 shows a snapshot of its user interface. On the left panel, the Query Video is loaded and processed to generate the codebook dictionary of Motion-Shapes (in this particular query, 289 clusters are obtained); a Segmentation thumbnail is also generated to illustrate the relative position of the nominated Action Center. The right panel shows the videos from the database that match the query video most closely. There are three different views for each match (Original, Motion-Shapes, or Segmentation). The results are ordered according to the matching score r, and the relative matching correspondence is indicated by the yellow square Model Fitting Regions.


The following Figure lists the most representative action elements obtained from 16 actions of both datasets.


A detailed snapshot of the actual classification process is also shown in the following figure.


In order to attain a thorough evaluation of our technique, we use this Video Search system to run tests on two datasets: KTH [16] (2400 video shots of 6 actions) and Weizmann [6] (92 video shots of 10 actions). The common benchmark train/test split for these two datasets is a 2/3 split on KTH (16 persons for training and 9 persons for testing) and leave-one-out on Weizmann (8 persons for training and 1 person for testing). Since we are carrying out a search task, we only need one training sample per query. Therefore, in order to produce a fair comparison with reported works, we do a random selection (with an equal number of samples per action) of the search queries, and run the search on the same testing amount.
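The balanced query selection can be sketched as below; dataset loading is assumed, and the per-action count and random seed are arbitrary example values.

```python
# Balanced random selection of search queries, one (or more) per action class.
import random

def select_queries(videos_by_action, per_action=1, seed=42):
    """videos_by_action: dict mapping action name -> list of video ids."""
    rng = random.Random(seed)
    queries = []
    for action, vids in sorted(videos_by_action.items()):
        queries.extend(rng.sample(vids, per_action))
    return queries

if __name__ == "__main__":
    toy = {"walk": ["w1", "w2", "w3"], "run": ["r1", "r2"], "wave": ["v1", "v2"]}
    print(select_queries(toy))        # one randomly chosen query per action
```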

In our evaluation process, we are also interested in understanding the accuracy-time tradeoff and the sensitivity to view change. We conduct three independent test runs based on three principal methods, namely Global.Mirrored, Local.Mirrored, and Local.NonMirrored. The first term indicates whether the Global Action Center or the average of Local Action Centers is used as the reference point. The second term specifies whether the search is mirrored in the left-right direction: each video candidate is flipped horizontally to generate a mirror of itself, the test is run on both instances, and the maximum matching score is used. Empirical results show that, on average, the Global search requires double the processing time of the Local search, while the Mirrored algorithm is 1.5 times slower than NonMirrored. Figure 3 summarizes the performance of the three techniques on KTH and Weizmann using Receiver Operating Characteristic (ROC) curves.
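The Mirrored variant amounts to scoring both the original and the horizontally flipped candidate and keeping the larger ratio; in the sketch below, score_fn stands for whatever matching function is in use (for example, the voting plus mean-shift pipeline sketched earlier), and frames are assumed to be a (T, H, W) array.

```python
# Mirrored matching: evaluate the candidate and its left-right flip, keep the best score.
import numpy as np

def mirrored_score(score_fn, model, frames):
    flipped = frames[:, :, ::-1]                  # left-right flip of every frame
    return max(score_fn(model, frames), score_fn(model, flipped))
```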


Tests on the Weizmann dataset generally return better results than on KTH, which is reasonable since the backgrounds in Weizmann are static, while the cameras in KTH are not stable. It is also shown that using the Global Action Center as a reference for the 3D point cloud search yields better results than averaging the individual 2D Local Action Centers. The performance decline of the two NonMirrored techniques implies that our technique is sensitive to view change, which is expected since we rely on the shape of the motion field to analyze action behavior; this drawback can be overcome with the Mirrored method, at the cost of additional processing time. The black straight line in the above figure is called the Cut-off Line; it runs from True Positive Rate 0 (True Negative Rate 1) to True Positive Rate 1 (True Negative Rate 0). The intersections of this line with the ROC curves are called Cut-off points (represented as black circles), indicating the positions where Sensitivity is equal to Specificity; we use those points to analyze the average per-action performance of our system and to compare against other reported works on KTH and Weizmann. The performance of Global.Mirrored is elaborated per action at the Cut-off points using a Confusion Matrix with normalized sums on each row and column, as shown in the following figure.
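For reference, the Cut-off point of an ROC curve (where Sensitivity equals Specificity) can be located as sketched below, here using scikit-learn's roc_curve on toy labels and scores; this is a generic utility, not code from the system itself.

```python
# Locate the ROC Cut-off point: the threshold where sensitivity equals specificity.
import numpy as np
from sklearn.metrics import roc_curve

def cutoff_point(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # Sensitivity = tpr, specificity = 1 - fpr; pick the threshold where they are closest.
    idx = np.argmin(np.abs(tpr - (1.0 - fpr)))
    return thresholds[idx], tpr[idx], 1.0 - fpr[idx]

if __name__ == "__main__":
    y = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    s = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])
    print(cutoff_point(y, s))          # (threshold, sensitivity, specificity)
```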


Using the average accuracy, we also conduct a comparison survey between our methods and reported state-of-the-art action recognition approaches, as shown in Table 9. Interestingly, while other works tend to perform well on only one of the datasets, our system performs consistently well on both, achieving the second-best classification score on both KTH and Weizmann.



[1] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proc. European Conf. Computer Vision, pages 113–130, 2002.
[2] G. Bradski and J. Davis. Motion segmentation and pose recognition with motion history gradients. Machine Vision and Applications, 13:174–184, 2002.
[3] P. Carbonetto, G. Dorko, C. Schmid, H. Kuck, and N. de Freitas. Learning to recognize objects with little supervision. Intl. Journal of Computer Vision, 77(1-3):219–237, May 2008.
[4] A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13:1200–1212, 2004.
[5] A. Fathi and G. Mori. Action recognition by learning mid-level motion features. In Proc. Conf. Computer Vision and Pattern Recognition, pages 1–8, 2008.
[6] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, 2007.
[7] M. Grundmann, F. Meier, and I. Essa. 3D shape context and distance transform for action recognition. In Proc. Intl. Conf. Pattern Recognition, 2008.
[8] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In Proc. Intl. Conf. Computer Vision, pages 1–8, 2007.
[9] H. Jiang, M. Drew, and Z. Li. Successive convex matching for action detection. In Proc. Conf. Computer Vision and Pattern Recognition, pages II: 1646–1653, 2006.
[10] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Proc. Intl. Conf. Computer Vision, 2005.
[11] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In Proc. Intl. Conf. Computer Vision, pages 1–8, 2007.
[12] H. Kuck, P. Carbonetto, and N. de Freitas. A constrained semi-supervised learning approach to data association. In Proc. European Conf. Computer Vision, pages Vol III: 1–12, 2004.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. Conf. Computer Vision and Pattern Recognition, 2008.
[14] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV Workshop on Statistical Learning in Computer Vision, pages 17–32, 2004.
[15] J. Niebles, H. Wang, and F. Li. Unsupervised learning of human action categories using spatial-temporal words. Intl. Journal of Computer Vision, 79(3), September 2008.
[16] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In Proc. Intl. Conf. Pattern Recognition, pages III: 32–36, 2004.
[17] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. Intl. Conf. Computer Vision, 2003.
[18] Y. Wang and G. Mori. Human action recognition by semilatent topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 2009.
