Temporal cues enhanced multimodal learning for action recognition in RGB-D videos

Link:
Author:
Year of publication:
2024
Media type:
Text
Keywords:
  • Co-learning
  • Human action recognition
  • Multimodal learning
  • Temporal modeling
Description:
  • Action recognition is an important and active research direction in computer vision, where temporal modeling is critical for action representation. Unimodal methods that use only the RGB or the skeleton modality for human action recognition have inherent limitations, e.g., the information redundancy and environmental noise of the RGB video modality, and the lack of spatial interaction cues in the skeleton modality. In this paper, we present a novel multimodal learning approach based on the RGB and skeleton modalities for action recognition in RGB-D videos. Specifically, we (1) transfer skeleton knowledge to the RGB video for effective video compression, producing an informative action image from the raw RGB video, (2) introduce a temporal cues enhancement module to adequately learn the spatiotemporal representation for action classification, and (3) propose a multi-level multimodal co-learning framework for human action recognition in RGB-D videos. Experimental results on the NTU RGB+D, PKU-MMD, and N-UCLA datasets demonstrate the effectiveness of the proposed multimodal learning method.
License:
  • info:eu-repo/semantics/closedAccess
Source system:
Research Information System of the UHH

Internal metadata
Source record
oai:www.edit.fis.uni-hamburg.de:publications/46b8fc02-b272-4634-9d94-2d2afd0ee456