NCJRS Virtual Library

The Virtual Library houses over 235,000 criminal justice resources, including all known OJP works.

Online Localization and Prediction of Actions and Interactions

NCJ Number
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume: 41, Issue: 2, Dated: 2019, Pages: 459-472
Khurram Soomro; Haroon Idrees; Mubarak Shah
Date Published
14 pages
This study proposes a person-centric and online approach to the challenging problem of localization and prediction of actions and interactions in videos.

Localization or recognition is typically performed offline, with all frames of a video processed together. This prevents timely localization and prediction of actions and interactions, an important consideration for many tasks, including surveillance and human-machine interaction. In the proposed approach, human poses were estimated at each frame, and discriminative appearance models were trained using the superpixels inside the pose bounding boxes. Because per-frame pose estimation is inherently noisy, the conditional probability of each pose hypothesis at the current time-step (frame) was computed from the pose estimates in the current frame and their consistency with poses in previous frames. Next, both the superpixel-based and pose-based foreground likelihoods were used to infer the location of actors at each time-step through a Conditional Random Field that enforces spatio-temporal smoothness in color, optical flow, motion boundaries, and edges among superpixels. Visual drift was handled by updating the appearance models and refining poses online, using motion smoothness on joint locations. For online prediction of action/interaction confidences, the study proposes an approach based on Structural SVM that operates on short video segments and is trained with the objective that the confidence of an action or interaction increases as time passes in a positive training clip. Lastly, the study quantified the performance of detection and prediction together and analyzed how prediction accuracy varied over time as a function of how much of the action/interaction had been observed, at different levels of detection performance. Experiments on several datasets suggest that, despite using only a few frames to localize actions/interactions at each time instant, the proposed technique obtained results competitive with state-of-the-art offline methods. (publisher abstract modified)
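
The training objective described for the prediction step, that confidence should increase as more of a positive clip is observed, can be illustrated with a simple ordering penalty. The following is a minimal sketch only, not the authors' Structural SVM formulation: it assumes a linear scoring function over per-segment feature vectors and penalizes any drop in confidence between consecutive segments. The names segment_scores and monotonic_confidence_loss, the margin parameter, and the feature vectors are all hypothetical.

```python
from typing import List, Sequence


def segment_scores(weights: Sequence[float],
                   segment_features: Sequence[Sequence[float]]) -> List[float]:
    """Linear confidence (dot product of weights and features) for each
    short video segment. The feature vectors are hypothetical placeholders."""
    return [sum(w * x for w, x in zip(weights, feats))
            for feats in segment_features]


def monotonic_confidence_loss(scores: Sequence[float],
                              margin: float = 0.0) -> float:
    """Hinge-style penalty on a positive clip: each later segment should
    score at least `margin` higher than the segment before it.
    This is an illustrative assumption, not the loss used in the study."""
    return sum(max(0.0, margin + earlier - later)
               for earlier, later in zip(scores, scores[1:]))
```

Under these assumptions, a clip whose segment scores rise steadily incurs zero penalty, while a clip whose confidence dips partway through is penalized in proportion to the size of the dip. A trainer would add this term to a classification loss and minimize it with respect to the weights.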