Temporal cues enhanced multimodal learning for action recognition in RGB-D videos

Link:
Author:
Year of publication:
2024
Media type:
Text
Keywords:
  • Co-learning
  • Human action recognition
  • Multimodal learning
  • Temporal modeling
Description:
  • Action recognition is an important and active research direction in computer vision, where temporal modeling is critical for action representation. Unimodal methods that use only the RGB or the skeleton modality for human action recognition have inherent limitations, e.g., the information redundancy and environmental noise of the RGB video modality, and the lack of spatial interaction cues in the skeleton modality. In this paper, we present a novel multimodal learning approach based on the RGB and skeleton modalities for action recognition in RGB-D videos. Specifically, we (1) transfer skeleton knowledge to the RGB video for effective video compression, producing an informative action image from the raw RGB video, (2) introduce a temporal cues enhancement module to adequately learn the spatiotemporal representation for action classification, and (3) propose a multi-level multimodal co-learning framework for human action recognition in RGB-D videos. Experimental results on the NTU RGB+D, PKU-MMD, and N-UCLA datasets demonstrate the effectiveness of the proposed multimodal learning method.
License:
  • info:eu-repo/semantics/closedAccess
Source system:
Research Information System of the UHH

Internal metadata
Source record
oai:www.edit.fis.uni-hamburg.de:publications/46b8fc02-b272-4634-9d94-2d2afd0ee456