This work presents a novel self-supervised learning framework for deep visual odometry with stereo cameras. Recent work on deep visual odometry is often based on monocular vision; a common approach uses two separate neural networks that predict depth and ego-motion directly from raw images. This paper argues against the separate prediction of depth and ego-motion and instead emphasizes the advantages of optical flow and stereo cameras. Its central component is a deep neural network for optical flow prediction, from which both depth and ego-motion can be derived. Training of the network is governed by a 3D geometric constraint that enforces a consistent scene structure across consecutive frames and models both static and moving objects. This ensures that the network must predict optical flow as it would occur in the real world. The presented framework is evaluated on the KITTI dataset, where it outperforms most deep visual odometry algorithms and exceeds state-of-the-art results for depth estimation.
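The claim that both depth and ego-motion can be derived from optical flow rests on classical stereo geometry: the disparity between the left and right image (itself a horizontal flow) yields depth by triangulation, and the temporal flow between consecutive left frames then determines the camera motion through a perspective-n-point (PnP) problem. The following sketch illustrates this geometry with NumPy and OpenCV; the inputs (disparity, flow, intrinsics `K`, baseline) are hypothetical placeholders, and the sketch is an illustration of the underlying principle, not the paper's training pipeline.

```python
import numpy as np
import cv2


def depth_from_disparity(disparity, focal_px, baseline_m):
    """Depth from stereo disparity via Z = f * B / d (pinhole model)."""
    d = np.clip(disparity, 1e-6, None)  # guard against division by zero
    return focal_px * baseline_m / d


def ego_motion_from_flow(depth, flow, K):
    """Estimate rigid camera motion between consecutive left frames.

    depth: HxW depth map of frame t (e.g., from stereo disparity)
    flow:  HxWx2 optical flow from frame t to frame t+1
    K:     3x3 camera intrinsics
    Returns a rotation vector and translation vector (frame t -> t+1).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Back-project pixels of frame t into 3D using the stereo depth.
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts3d = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # The temporal flow gives the corresponding 2D locations in frame t+1.
    pts2d = np.stack([u + flow[..., 0], v + flow[..., 1]], axis=-1).reshape(-1, 2)

    # Robust PnP recovers the rigid motion consistent with the static scene;
    # pixels on moving objects are rejected as outliers by RANSAC.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64),
        K.astype(np.float64), distCoeffs=None)
    return rvec, tvec
```

In this reading, the optical flow network is the single learned component, while depth and ego-motion follow from geometry, which is consistent with the abstract's argument against predicting them with separate networks.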