In this paper, we present an efficient and adaptive method for detecting and tracking multiple persons that runs in real time and is highly robust to outlier noise. Given an RGB-D image sequence, our algorithm combines two independent approaches to person detection: first, cluster-based segmentation and classification on RGB-D point clouds, and second, face detection on RGB images. The output of each method is post-processed by spatio-temporal filtering for tracking and sensitivity purposes. Our analysis and experimental results show that the combined approach performs significantly better than either individual solution and greatly reduces the number of false positives in situations where one detector fails.