Learning Human Actions via Information Maximization

Jingen Liu, Computer Vision Lab, University of Central Florida, email@example.com
Mubarak Shah, Computer Vision Lab, University of Central Florida, firstname.lastname@example.org

Abstract

In this paper, we present a novel approach for automatically learning a compact yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatiotemporal features called video-words, obtained by quantizing the 3D interest points (cuboids) extracted from the videos. Our proposed approach automatically discovers the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatiotemporal cuboids into video-words based on their appearance similarity, MMI clustering further groups the video-words that are highly correlated with particular groups of actions. To capture the structural information of the learnt optimal video-word clusters, we explore the correlation of the compact video-word clusters using a modified correlogram, which is not only translation and rotation invariant, but also somewhat scale invariant. We extensively test our proposed approach on two publicly available challenging datasets: the KTH dataset and the IXMAS multiview dataset. To the best of our knowledge, we are the first to apply a bag-of-video-words approach to the multiview dataset. We have obtained very impressive results on both datasets.

1. Introduction

Automatically recognizing human actions is critical for several applications, such as video indexing and video summarization. However, it remains a challenging problem due to camera motion, occlusion, illumination changes, and individual variations in object appearance and posture. Over the past decade, this problem has received considerable attention.
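As a concrete illustration of the bag-of-video-words representation described above, the following sketch quantizes cuboid descriptors into a video-word codebook with a toy k-means and builds a per-video histogram. The function names, descriptor dimensionality, and plain-NumPy Lloyd's iteration are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def kmeans(descriptors, k, iters=20, seed=0):
    """Cluster cuboid descriptors into k video-words (toy Lloyd's k-means)."""
    rng = np.random.default_rng(seed)
    # initialize centers from k randomly chosen descriptors
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bag_of_video_words(video_descriptors, centers):
    """Represent one video as a normalized histogram over the video-word codebook."""
    dists = np.linalg.norm(video_descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

In the paper's pipeline, histograms like these would then be compacted by MMI clustering, which merges video-words that carry redundant information about the action classes rather than merely looking similar in appearance.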
We can model human actions using either holistic or part-based information. One way to compare two actions is to compute the correlation of their spatiotemporal (ST) volumes. Shechtman et al. proposed a method that measures the degree of consistency by computing the correlation using local intensity variance. Similarly, Efros et al. extracted an optical flow field as a descriptor from the stabilized object ST volume, and computed the cross-correlation between the model and input optical flow descriptors. In another holistic approach, an action is considered as a 3D volume and features are extracted from this volume. For instance, Yilmaz et al. used differential geometry features extracted from the surfaces of their action volumes and achieved good performance, but this method requires robust tracking to generate the 3D volumes. Parameswaran et al. proposed an approach that exploits 2D invariance in 3D-to-2D projection, modeling actions using view-invariant canonical body poses and trajectories in 2D invariance space.
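The flow-correlation idea behind the holistic comparisons above can be sketched as a zero-mean normalized cross-correlation between two flow volumes. Treating the stabilized ST volumes as same-shape arrays and flattening them is an illustrative simplification, not the cited authors' exact descriptor:

```python
import numpy as np

def flow_correlation(model_flow, input_flow):
    """Zero-mean normalized cross-correlation between two optical-flow
    ST volumes of identical shape; 1.0 means identical motion patterns."""
    a = model_flow.astype(float).ravel()
    b = input_flow.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0
```

A model action would be matched against a query by sliding it over the query's flow volume and keeping the peak correlation; the sketch shows only the single-alignment score.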