Gesture recognition is an important task in Human-Robot Interaction (HRI), and research effort toward robust, high-performance recognition algorithms continues to grow. In this work, we present a neural network approach that learns an arbitrary number of labeled training gestures and recognizes them in real time. The gesture representation is hand-independent, and two-handed gestures are also supported. We use depth information to extract salient motion features and encode gestures as sequences of motion patterns. Preprocessed sequences are then clustered by a hierarchical learning architecture based on self-organizing maps. We present experimental results on two different data sets: command-like gestures for HRI scenarios and communicative gestures that include cultural peculiarities, which are often excluded in gesture recognition research. To improve recognition rates, noisy observations introduced by tracking errors are detected and removed from the training sets. The obtained results motivate further investigation of efficient neural network methodologies for gesture-based communication.
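To make the clustering stage concrete, the following is a minimal sketch of classic online self-organizing map (SOM) training, not the paper's actual hierarchical architecture. It assumes gestures have already been encoded as fixed-length motion feature vectors; the grid size, feature dimensionality, and all names (`train_som`, `GRID_H`, etc.) are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of online SOM training on motion feature vectors.
# All constants and names here are hypothetical, not from the paper.
import numpy as np

GRID_H, GRID_W = 10, 10   # size of the 2-D neuron lattice (assumed)
FEATURE_DIM = 6           # dimensionality of one motion feature vector (assumed)

rng = np.random.default_rng(0)
weights = rng.random((GRID_H, GRID_W, FEATURE_DIM))  # neuron prototype vectors

# Grid coordinates of every neuron, used by the neighborhood function.
coords = np.stack(np.meshgrid(np.arange(GRID_H), np.arange(GRID_W),
                              indexing="ij"), axis=-1).astype(float)

def train_som(samples, epochs=20, lr0=0.5, sigma0=3.0):
    """Classic online SOM training: find the best-matching unit (BMU)
    for each sample and pull nearby neurons toward it, with learning
    rate and neighborhood radius decaying over time."""
    n_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        for x in samples:
            t = step / n_steps
            lr = lr0 * (1.0 - t)               # decaying learning rate
            sigma = sigma0 * (1.0 - t) + 1e-3  # shrinking neighborhood radius
            # Best-matching unit: neuron whose prototype is closest to x.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood centered on the BMU in grid space.
            grid_d2 = np.sum((coords - np.array(bmu, dtype=float)) ** 2,
                             axis=-1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Move the BMU and its neighbors toward the input sample.
            weights += lr * h[..., None] * (x - weights)
            step += 1

# Example: cluster random stand-in "motion features"; a real pipeline
# would feed preprocessed depth-based motion descriptors instead.
train_som(rng.random((200, FEATURE_DIM)))
```

After training, each input sequence can be mapped to the trajectory of its best-matching units, which is one common way SOM-based systems turn variable-length motion sequences into discrete patterns for labeling.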