Online Temporal Action Localization (On-TAL) is a critical task that aims to identify action instances in untrimmed streaming videos as soon as each action concludes, a major leap from frame-based Online Action Detection (OAD). Yet the challenge of detecting overlapping actions is often overlooked, even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, which limits their flexibility.
This paper introduces "ActionSwitch", the first class-agnostic On-TAL framework capable of detecting overlapping actions. By removing the reliance on class information, ActionSwitch applies to a wider range of situations, including overlapping actions of the same class and scenarios where class information is unavailable. This approach is complemented by the proposed "Conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. ActionSwitch achieves state-of-the-art performance on complex datasets, including Epic-Kitchens 100, which targets the challenging egocentric view, and FineAction, which consists of fine-grained actions.
Overview of the ActionSwitch Framework: The state label is derived from the sum of the ids of the activated switches. For example, the state is labeled '3' between t2 and t3, when switches '1' and '2' are simultaneously active, whereas it is labeled '2' from t3 to t4, when only switch '2' is active. State changes signify action instance boundaries, and our "Conservativeness loss" minimizes state fluctuations to improve detection accuracy.
(a) State diagram of the ActionSwitch framework, and (b) implementation of the state-emitting OAD model.
A possible solution to catch multiple concurrent actions would be using multiple OAD models for mutually exclusive action instance detection. For instance, in a setup with two OAD models and a single action, one model should signal "no action" when the other detects the action. However, this interdependence of the models' decisions makes the implementation complicated.
Instead, we abstract the concept into a single machine with multiple switches. To be specific, consider the two-switch case. The finite state machine corresponding to this machine has four states: i) no switch activated, ii) switch 1 activated, iii) switch 2 activated, and iv) both switches activated. These states are illustrated in the State Table in (a).
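As a rough illustration of how a per-frame state sequence could be turned into action instances, the sketch below decodes the two-switch case. It assumes the state label is the sum of the active switch ids (0, 1, 2, or 3), as in the overview figure. The names `STATE_TO_SWITCHES` and `decode_states` are hypothetical and do not come from the paper's code, and the exclusive end index is an arbitrary convention chosen here.

```python
from typing import Dict, FrozenSet, List, Tuple

# Two-switch case: state label = sum of the ids of the active switches.
# (Hypothetical lookup table for illustration only.)
STATE_TO_SWITCHES: Dict[int, FrozenSet[int]] = {
    0: frozenset(),        # no switch activated
    1: frozenset({1}),     # switch 1 activated
    2: frozenset({2}),     # switch 2 activated
    3: frozenset({1, 2}),  # both switches activated
}


def decode_states(states: List[int]) -> List[Tuple[int, int, int]]:
    """Convert per-frame states into (switch_id, start, end) instances.

    `end` is exclusive. An instance is emitted at the frame where its
    switch turns off, i.e. as soon as the action concludes.
    """
    open_since: Dict[int, int] = {}  # switch id -> frame where it turned on
    instances: List[Tuple[int, int, int]] = []
    for t, state in enumerate(states):
        active = STATE_TO_SWITCHES[state]
        # Switches that were on but are now off: close their instances.
        for sw in sorted(set(open_since) - active):
            instances.append((sw, open_since.pop(sw), t))
        # Switches that just turned on: remember the start frame.
        for sw in sorted(active - set(open_since)):
            open_since[sw] = t
    # Close anything still open when the stream ends.
    for sw in sorted(open_since):
        instances.append((sw, open_since[sw], len(states)))
    return instances


if __name__ == "__main__":
    # Switch 1 is on for frames 1-4, switch 2 for frames 3-6 (overlapping).
    print(decode_states([0, 1, 1, 3, 3, 2, 2, 0]))
```

Because every state change maps to one or more switches turning on or off, overlapping instances fall out of the decoding without any class information.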
Action boundaries are typically far rarer than non-boundary frames in videos. Our approach encodes this prior directly into the loss function by using the previous step's prediction as a pseudo-label and applying a standard cross-entropy loss.
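The idea can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the paper's exact formulation: it takes the argmax of the previous step's logits as a hard pseudo-label, averages the cross-entropy over consecutive steps, and omits any weighting against the supervised loss. The function names are hypothetical. In an automatic-differentiation framework the pseudo-label would normally be treated as a constant, with no gradient flowing through it.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def conservativeness_loss(logits_seq: List[List[float]]) -> float:
    """Mean cross-entropy of each step against the previous step's argmax.

    Assumption: hard pseudo-labels and a plain average over steps.
    """
    if len(logits_seq) < 2:
        return 0.0  # no previous step to compare against
    total = 0.0
    for prev, cur in zip(logits_seq, logits_seq[1:]):
        # Pseudo-label: the state predicted at the previous step.
        pseudo = max(range(len(prev)), key=prev.__getitem__)
        # Standard cross-entropy of the current step against that label.
        total += -log_softmax(cur)[pseudo]
    return total / (len(logits_seq) - 1)


if __name__ == "__main__":
    stable = [[5.0, 0.0, 0.0, 0.0]] * 3
    flicker = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
    print(conservativeness_loss(stable), conservativeness_loss(flicker))
```

A sequence that keeps the same state incurs a small penalty, while one that flips states between consecutive frames is penalized heavily, which discourages spurious boundaries.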
@inproceedings{actionswitch,
author = {Kang, Hyolim and Hyun, Jungsuk and An, Joungbin and Yu, Youngjae and Kim, Seon Joo},
title = {ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}
}