SPATIO-TEMPORAL TRANSFORMERS FOR ACTION UNIT
CLASSIFICATION WITH EVENT CAMERAS
Luca Cultrera, Federico Becattini, Lorenzo Berlincioni, Claudio Ferrari, Alberto Del Bimbo
UNIVERSITY OF FLORENCE, UNIVERSITY OF SIENA, UNIVERSITY OF PARMA
ABSTRACT
As one of the most important applications of computer vision, face analysis has been studied from different angles in order to infer emotions, poses, shapes, and landmarks. Traditionally, research has relied on standard RGB cameras to collect and publish the relevant annotated data. For more fine-grained tasks, however, standard sensors might not suffice: their latency makes it impossible to record and detect the micro-movements that carry a highly informative signal, which is necessary for inferring the true emotions of a subject. Event cameras have been gaining increasing interest as a possible solution to this and similar high-frame-rate tasks. In this paper, we propose a novel spatio-temporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered one of the main causes of the existing gap in maturity between RGB and neuromorphic vision models. Gathering data is indeed harder in the event domain, since it cannot be crawled from the web, and labeling frames must take into account event aggregation rates and the fact that static parts of the scene might not be visible in certain frames.
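To make the two named mechanisms concrete, below is a minimal PyTorch sketch of Shifted Patch Tokenization and Locality Self-Attention as they are commonly described for small-dataset Vision Transformers. The patch size, embedding dimension, head count, and class names are illustrative assumptions, not the exact configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShiftedPatchTokenization(nn.Module):
    """Concatenate the input with four half-patch diagonal shifts,
    then split into patches and linearly project them to tokens."""

    def __init__(self, in_channels=1, patch_size=16, embed_dim=192):
        super().__init__()
        s = patch_size // 2
        # (left, right, top, bottom) paddings; negative values crop,
        # so each tuple shifts the image by half a patch diagonally.
        self.shifts = [(s, -s, s, -s), (-s, s, s, -s),
                       (s, -s, -s, s), (-s, s, -s, s)]
        # 5 stacked copies of the image: original + 4 diagonal shifts.
        self.proj = nn.Conv2d(in_channels * 5, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        shifted = [F.pad(x, p) for p in self.shifts]
        x = torch.cat([x] + shifted, dim=1)    # (B, 5C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)
        return self.norm(x)


class LocalitySelfAttention(nn.Module):
    """Multi-head self-attention with a learnable temperature and a
    diagonal mask that suppresses each token's attention to itself."""

    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # Learnable temperature replaces the fixed 1/sqrt(d) scaling.
        self.temperature = nn.Parameter(torch.tensor(head_dim ** -0.5))
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Mask the diagonal so tokens attend to their neighbours
        # rather than trivially to themselves.
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf')).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(out)
```

In a full model, a tokenizer of this kind would typically be followed by positional embeddings and a stack of transformer blocks using the LSA module in place of standard scaled dot-product attention.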