The performance of an action recognition system depends heavily on learning distinctive representations for different classes of action sequences. To address this issue, we propose an ensemble network. We design two multilayer Long Short-Term Memory (LSTM) networks, the Spatial-distance Net (SdNet) and the Temporal-distance Net (TdNet), to capture the spatial and temporal dynamics of the entire sequence, respectively. More specifically, SdNet captures the spatial dynamics of joints within a frame, and TdNet explores the temporal dynamics of joints between frames along the sequence. Finally, the two nets are fused into a single ensemble network, the Spatio-Temporal distance Net (STdNet), which exploits both spatial and temporal dynamics. The proposed method is evaluated on two widely used datasets, UTD-MHAD and NTU RGB+D, on which STdNet achieves accuracies of 91.16% and 80.03%, respectively.
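To make the two kinds of distance input concrete, the following is a minimal sketch and not the implementation evaluated here. It assumes each sequence is an array of 3D joint coordinates with shape (frames, joints, 3), takes "spatial distance" to mean pairwise Euclidean distances between joints within a frame, and takes "temporal distance" to mean the displacement of each joint between consecutive frames. The function names and the score-averaging fusion are hypothetical and chosen only for illustration; the abstract does not specify the exact features or the fusion rule.

```python
import numpy as np


def spatial_distances(seq):
    """Pairwise joint distances within each frame (assumed SdNet-style input).

    seq: array of shape (T, J, 3) holding 3D joint coordinates.
    Returns an array of shape (T, J * (J - 1) // 2).
    """
    seq = np.asarray(seq, dtype=float)
    num_joints = seq.shape[1]
    # Difference between every pair of joints in the same frame: (T, J, J, 3).
    diff = seq[:, :, None, :] - seq[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (T, J, J)
    # Keep each unordered pair once (upper triangle, diagonal excluded).
    rows, cols = np.triu_indices(num_joints, k=1)
    return dist[:, rows, cols]


def temporal_distances(seq):
    """Per-joint displacement between consecutive frames (assumed TdNet-style input).

    seq: array of shape (T, J, 3).
    Returns an array of shape (T - 1, J).
    """
    seq = np.asarray(seq, dtype=float)
    return np.linalg.norm(seq[1:] - seq[:-1], axis=-1)


def fuse_scores(p_spatial, p_temporal):
    """Hypothetical late fusion: average the class probabilities of the two nets."""
    return (np.asarray(p_spatial, dtype=float) + np.asarray(p_temporal, dtype=float)) / 2.0
```

Under these assumptions, a sequence of 5 frames with 20 joints yields a spatial feature array of shape (5, 190) and a temporal feature array of shape (4, 20). Each array could then be fed frame by frame to its own multilayer LSTM, with the two class-score vectors combined at the end.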