 Open Access
 Authors : Saketh Kilaru, Anushka Shah, Swoichha Adhikari
 Paper ID : IJERTV12IS120040
 Volume & Issue : Volume 12, Issue 12 (December 2023)
 Published (First Online): 19-12-2023
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Enhancing Classification in UTD-MHAD Dataset: Utilizing Recurrent Neural Networks in Ensemble-Based Approach for Human Action Recognition
Saketh Kilaru
Birla Institute of Technology and Science
Anushka Shah
Birla Institute of Technology and Science
Swoichha Adhikari
Birla Institute of Technology and Science
Abstract: This study demonstrates an ensemble approach to Human Action Recognition on the UT Dallas Multimodal Human Action Dataset (UTD-MHAD), which covers 27 human actions. Our ensemble achieved an accuracy of 0.821 on the validation data, a substantial improvement over the baseline paper's accuracy of 0.672. The paper also reports the train-val performance of the other models we experimented with, using only the Inertial and Skeleton data. The code is available in the GitHub repository linked in the Computational Simulation section, and the dataset is described in [17].

I. INTRODUCTION
Human Action Recognition (HAR) is a research domain that has attracted interest from a diverse range of computer science disciplines since the 1980s, owing to its applications in fields such as sociology, human-computer interaction, and medicine. Data in the form of videos, inertial sensor readings (for example from accelerometers and gyroscopes), depth maps, and point clouds are often used to classify different human actions [1][2].
Traditional approaches are generally broken down into the following three steps:

1) Feature extraction using signal processing or computer vision techniques to capture relevant spatio-temporal features [3][4][5]

2) Designing a pipeline to combine the extracted features

3) Training a classifier, usually a Support Vector Machine (SVM) or Random Forest, on these features
This approach often requires deep domain knowledge to extract useful spatio-temporal features. An alternative is to use modern neural network architectures that perform feature extraction and model building automatically. Around 2014, two breakthrough papers, Single Stream and Two Stream, ignited research into this modern approach. The main difference between them is how the model architecture combines spatio-temporal information.
The first paper [6] uses a Single Stream Network (SSM), which explores several ways to fuse temporal information from consecutive frames using 2D pretrained convolutions. This led to popular methods such as Long-term Recurrent Convolutional Networks [7], 3D Convolutional Networks [8], and Conv3D with Attention [9].
The second paper uses a Two Stream Network [10], depicted in Figure 1, in which one stream captures spatial context with a pretrained network while the second captures motion information. Other variants soon followed, such as Two-Stream Network Fusion [11], Temporal Segment Networks (TSN) [12], Action Video-Level Aggregation (ActionVLAD) [13], Hidden Two-Stream Convolutional Networks [14], Two-Stream Inflated 3D ConvNet (Two-Stream I3D) [15], and Temporal 3D ConvNets (T3D) [16].
Fig. 1. Two-stream architecture of Simonyan and Zisserman [10]
This paper presents an ensemble approach that uses both convolutional and recurrent single-stream neural networks to recognize 27 different human actions in the UTD-MHAD dataset [17].


II. METHODS AND MODELING

A. Dataset
Collected with a Microsoft Kinect sensor, which includes a video camera, and a wearable inertial sensor equipped with an accelerometer and gyroscope, the UTD-MHAD dataset comprises 27 unique actions performed by eight subjects (four females and four males). Each subject performed each action four times. After the removal of three corrupted sequences, 861 sequences remain in total. The dataset has four modalities: Depth, Inertial, Skeleton, and RGB video. They are plotted in Fig. 2 to 5 respectively.
Fig. 2. Depth map data of a subject performing a tennis swing
Fig. 3. Inertial data of a subject performing a tennis swing
Fig. 4. Skeleton data of a subject performing a tennis swing
Fig. 5. Video data of a subject performing a tennis swing
In this paper, we use only the inertial and skeleton data for our models. We split the data into training and validation sets in the same way as the original paper [17]: subjects 1, 3, 5, and 7 are used for training and subjects 2, 4, 6, and 8 for validation, so that we can compare against the baseline.
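The subject-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the file-name pattern `a<action>_s<subject>_t<trial>_<modality>.mat` and the helper `split_by_subject` are assumptions made for the example.

```python
import re

TRAIN_SUBJECTS = {1, 3, 5, 7}
VAL_SUBJECTS = {2, 4, 6, 8}

# Assumed (hypothetical) naming scheme: a<action>_s<subject>_t<trial>_<modality>.mat
NAME_PATTERN = re.compile(r"a(\d+)_s(\d+)_t(\d+)_(\w+)\.mat$")


def split_by_subject(filenames):
    """Assign each recording to train or validation based on its subject id."""
    train, val = [], []
    for name in filenames:
        match = NAME_PATTERN.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        subject = int(match.group(2))
        if subject in TRAIN_SUBJECTS:
            train.append(name)
        elif subject in VAL_SUBJECTS:
            val.append(name)
    return train, val


files = [
    "a1_s1_t1_inertial.mat",
    "a1_s2_t1_inertial.mat",
    "a27_s7_t4_skeleton.mat",
    "a5_s8_t2_skeleton.mat",
]
train_files, val_files = split_by_subject(files)
```

Splitting by subject rather than by sequence keeps every recording of a given person on one side of the split, so validation measures generalization to unseen people.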

B. Data Preprocessing

Inertial Data: The inertial data was resampled to the mean period (sequence length) of 180 samples, which we found allowed the model to converge much more quickly. Amplitude normalization was also tried but subsequently removed, as it did not improve model training. A histogram of the period distribution is shown in Fig. 6.
Fig. 6. Histogram plot of the period distribution of Inertial Data
Histograms for the inertial data are shown in Fig. 7 to 10, giving the distribution of the maximum and minimum amplitudes of the 3-axial accelerometer and gyroscope.
Fig. 7. Histogram plot of the distribution of the maximum amplitude of the 3-axial accelerometer data
Fig. 8. Histogram plot of the distribution of the minimum amplitude of the 3-axial accelerometer data
Fig. 9. Histogram plot of the distribution of the maximum amplitude of the 3-axial gyroscope data
Fig. 10. Histogram plot of the distribution of the minimum amplitude of the 3-axial gyroscope data

Skeleton Data: To aid data fusion later, we also resampled the skeleton data to 180 frames.
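The paper does not state which resampling method was used. As a minimal sketch, assuming linear interpolation on a normalized time axis (the function name `resample_sequence` and the example shapes are hypothetical), resampling a variable-length recording to 180 time steps could look like this:

```python
import numpy as np


def resample_sequence(sequence, target_len=180):
    """Linearly resample a (time, channels) array to target_len time steps."""
    sequence = np.asarray(sequence, dtype=float)
    old_t = np.linspace(0.0, 1.0, num=sequence.shape[0])
    new_t = np.linspace(0.0, 1.0, num=target_len)
    # Interpolate every channel independently on the normalized time axis.
    channels = [
        np.interp(new_t, old_t, sequence[:, c]) for c in range(sequence.shape[1])
    ]
    return np.stack(channels, axis=1)


# Hypothetical inertial recording: 157 time steps, 6 channels
# (3-axis accelerometer + 3-axis gyroscope).
rng = np.random.default_rng(0)
inertial = rng.normal(size=(157, 6))
resampled = resample_sequence(inertial)
```

Bringing both modalities to the same number of time steps is what later allows them to be concatenated along the features axis.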


C. Models
During our experiments, we used the Adam optimizer with the following values: learning rate $\alpha = 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. To initialize the models' weights, we used the Glorot initialization method. To improve generalization, the batch size was set to 3, as suggested in [18]. In this paper, we experiment with the following models:

1) Simple LSTM
2) BiDirectional LSTM
3) Conv LSTM
4) UNet LSTM
5) Ensemble of Conv LSTM and UNet LSTM
The full architecture of the models can be found in the Appendix.
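For readers unfamiliar with Glorot initialization, the sketch below shows its uniform variant, where weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). Whether the authors used the uniform or the normal variant is not stated, so the choice here is an assumption, as are the function name and layer sizes.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
# Hypothetical Dense layer mapping 512 recurrent features to the 27 action classes.
weights = glorot_uniform(512, 27, rng)
limit = np.sqrt(6.0 / (512 + 27))
```

Scaling the range by the layer's fan-in and fan-out keeps the variance of activations and gradients roughly constant across layers at the start of training.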

Simple LSTM: We start with a Simple LSTM model, which consists of an LSTM layer with 512 hidden units followed by a Dense layer that classifies the 27 actions. LSTMs can handle local, distributed, real-valued, and noisy pattern representations, and have been reported to learn faster and more successfully than real-time recurrent learning, back-propagation through time, recurrent cascade-correlation, Elman networks, and neural sequence chunking. Each recurrent unit in the LSTM was set to have a dropout rate of 0.2.

BiDirectional LSTM: The next model adds a flipped copy of the LSTM unit, which processes the sequence in reverse, and concatenates its output with that of the original LSTM unit. This gives a bidirectional recurrent network, in which the output layer can draw on information from both past and future states at the same time [19]. A fully connected layer with softmax activation then classifies the 27 actions. The aim is to improve on the performance of the Simple LSTM.
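The forward/backward concatenation can be illustrated with a toy example. To keep it short, the sketch below swaps the LSTM cell for a plain tanh recurrent cell, so it shows only the bidirectional wiring, not the authors' model; all names and sizes are hypothetical.

```python
import numpy as np


def rnn_pass(x, w_in, w_rec):
    """Run a plain tanh recurrent cell over x (time, features); return all hidden states."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for step in x:
        hidden = np.tanh(step @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.stack(states)


def bidirectional(x, fwd_weights, bwd_weights):
    """Concatenate a forward pass with a pass over the time-reversed sequence."""
    forward = rnn_pass(x, *fwd_weights)
    # Process the reversed sequence, then flip back so time steps line up.
    backward = rnn_pass(x[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=-1)


def init_weights(rng, features, hidden_units):
    """Small random input and recurrent weight matrices (toy initialization)."""
    return (
        rng.normal(scale=0.1, size=(features, hidden_units)),
        rng.normal(scale=0.1, size=(hidden_units, hidden_units)),
    )


rng = np.random.default_rng(0)
features, hidden_units = 6, 4  # toy sizes, not the paper's 512 units
fwd_weights = init_weights(rng, features, hidden_units)
bwd_weights = init_weights(rng, features, hidden_units)
x = rng.normal(size=(180, features))
out = bidirectional(x, fwd_weights, bwd_weights)
```

At every time step the output holds one half summarizing the past and one half summarizing the future, which is why the feature dimension doubles.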

Conv LSTM: The third model uses a combination of 1D Convolutional and 1D Max-pooling layers to extract higher-dimensional features. The output is then fed into two LSTM units to capture the temporal information. The output of the LSTM units is then flattened, and a Dropout layer with a dropout rate of 0.5 is added to improve generalization. Finally, a fully connected layer with softmax activation classifies all 27 actions.
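To make the convolutional front end concrete, here is a NumPy sketch of one 1D convolution followed by 1D max-pooling over the time axis. The kernel width, number of filters, pool size, and ReLU activation are illustrative assumptions; the paper's actual layer settings are given only in its Appendix figures.

```python
import numpy as np


def conv1d_valid(x, kernels):
    """Valid 1D convolution with ReLU.

    x: (time, in_channels); kernels: (width, in_channels, out_channels).
    """
    width = kernels.shape[0]
    steps = x.shape[0] - width + 1
    # Gather every sliding window: (steps, width, in_channels).
    windows = np.stack([x[t:t + width] for t in range(steps)])
    out = np.einsum("twc,wco->to", windows, kernels)
    return np.maximum(out, 0.0)


def max_pool1d(x, pool=2):
    """Non-overlapping max-pooling along the time axis."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, x.shape[1]).max(axis=1)


rng = np.random.default_rng(0)
x = rng.normal(size=(180, 6))            # resampled inertial sequence (assumed shape)
kernels = rng.normal(size=(3, 6, 16))    # hypothetical: width 3, 16 filters
features = max_pool1d(conv1d_valid(x, kernels))
```

The feature sequence is shorter in time but wider in channels, and it is this sequence that would be passed on to the LSTM units.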

UNet LSTM: The last model is based on the UNet, a well-known architecture commonly used for image semantic segmentation. The UNet is a convolutional neural network with an encoder-decoder architecture whose contraction and expansion paths are symmetrical. In the contraction path, the input passes through a series of convolutions and max-pooling operations, which increases the number of feature maps and reduces the resolution, favoring the "what" over the "where". In the expansion path, the low-resolution, high-dimensional features are upsampled using convolutional kernels, which reduces the number of feature maps. A distinctive feature of the UNet is that it concatenates the high-dimensional features from the contraction path with the feature maps of the expansion layers.

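Ensemble of Conv LSTM and UNet LSTM: The Conv LSTM and UNet LSTM were found to perform well on the validation set, so we averaged their softmax activations to create an ensemble:
$$\hat{y}_j = \frac{1}{2}\left(\sigma(z^{\mathrm{Conv}})_j + \sigma(z^{\mathrm{UNet}})_j\right) \tag{1}$$
where $z^{\mathrm{Conv}}$ and $z^{\mathrm{UNet}}$ are the pre-softmax outputs of the two models and $\sigma(z)_j$ is the softmax output for an action $j$, given by:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{27} e^{z_k}} \tag{2}$$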
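The averaging of softmax outputs described above can be sketched in a few lines of NumPy. The function names and the three-class toy logits are hypothetical; in the paper the two inputs would be the 27-class outputs of the Conv LSTM and UNet LSTM.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax along the last axis."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ensemble_predict(logits_a, logits_b):
    """Average two models' softmax outputs and return probabilities and labels."""
    probs = 0.5 * (softmax(logits_a) + softmax(logits_b))
    return probs, probs.argmax(axis=-1)


# Toy example with 3 classes and one sample per model.
logits_conv = np.array([[2.0, 1.0, 0.0]])
logits_unet = np.array([[0.0, 1.0, 3.0]])
probs, labels = ensemble_predict(logits_conv, logits_unet)
```

In this toy case the first model prefers class 0 and the second prefers class 2 more strongly, so the averaged distribution selects class 2.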

III. COMPUTATIONAL SIMULATION
The code is written in Python, using Keras with a TensorFlow backend along with the NumPy, SciPy, and Matplotlib libraries. The code's GitHub repository is at: https://github.com/ notha99y/Multimodal human actions. The models were trained on Google Colaboratory notebooks.

IV. RESULTS

A. Inertial Data

Simple LSTM: The validation accuracy of our Simple LSTM was 0.238. The train-val accuracy and loss plots of the Simple LSTM are shown in Fig. 11 and 12 respectively. They show that the model quickly overfits after epoch 2. The confusion matrix is shown in Fig. 13.
Fig. 11. Train (blue) and val (green) accuracy plot of the Simple LSTM model
Fig. 12. Train (blue) and val (green) loss plot of the Simple LSTM model
Fig. 13. Confusion matrix of the Simple LSTM model
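As a reminder of how the reported validation accuracy relates to a confusion matrix, the sketch below builds the matrix from true and predicted labels and reads the accuracy off its diagonal. The labels are toy data, not results from the paper.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true, pred in zip(y_true, y_pred):
        matrix[true, pred] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    return np.trace(matrix) / matrix.sum()


# Toy labels for a 3-class problem (the paper's task has 27 classes).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = accuracy(cm)
```

Off-diagonal entries show which actions are confused with which, information that the single accuracy figure hides.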
BiDirectional LSTM: Similarly, the train-val accuracy and loss plots are shown in Fig. 14 and 15 respectively, with the confusion matrix in Fig. 16. The validation accuracy was 0.465.
Fig. 14. Train (blue) and val (green) accuracy plot of the BiDirectional LSTM
Fig. 15. Train (blue) and val (green) loss plot of the BiDirectional LSTM
Fig. 16. Confusion matrix of the BiDirectional LSTM model

Conv LSTM: The Conv LSTM was the first major breakthrough among our models, achieving a validation accuracy of 0.700. The train-val accuracy and loss plots are shown in Fig. 17 and 18, with the confusion matrix in Fig. 19.
Fig. 17. Train (blue) and val (green) accuracy plot of the Conv LSTM
Fig. 18. Train (blue) and val (green) loss plot of the Conv LSTM
Fig. 19. Confusion matrix of the Conv LSTM model

UNet LSTM: Lastly, our UNet LSTM achieved the highest single-model accuracy on the inertial data, 0.712. The relevant plots are shown in Fig. 20 to 22.
Fig. 20. Train (blue) and val (green) accuracy plot of the UNet LSTM model
Fig. 21. Train (blue) and val (green) loss plot of the UNet LSTM model
Fig. 22. Confusion matrix of the UNet LSTM

Ensemble of Conv LSTM and UNet LSTM: The confusion matrix of the ensemble is shown in Fig. 23. The ensemble reached a validation accuracy of 0.765.

Fig. 23. Confusion matrix of the ensemble of the Conv LSTM and UNet LSTM models

B. Inertial + Skeleton Data

Conv LSTM: We concatenated the Skeleton data with the Inertial data along the features axis and trained the Conv LSTM, achieving an accuracy of 0.784. The train-val accuracy and loss plots, together with the confusion matrix, are shown in Fig. 24 to 26.
Fig. 24. Train (blue) and val (green) accuracy plot of the Conv LSTM model on both Inertial and Skeleton Data
Fig. 25. Train (blue) and val (green) loss plot of the Conv LSTM model on both Inertial and Skeleton Data
Fig. 26. Confusion matrix of the Conv LSTM model on both Inertial and Skeleton Data
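Combining the two modalities along the features axis amounts to a single concatenation once both have been resampled to 180 time steps. In the sketch below the channel counts (6 inertial channels, and 20 joints with 3 coordinates each for the skeleton) are assumptions about the data layout, and the arrays are placeholders.

```python
import numpy as np

time_steps = 180
# Assumed layout: 3-axis accelerometer + 3-axis gyroscope per time step.
inertial = np.zeros((time_steps, 6))
# Assumed layout: 20 joints x 3 coordinates, flattened per frame.
skeleton = np.ones((time_steps, 60))

# Fuse along the last (features) axis; the time axis stays aligned.
fused = np.concatenate([inertial, skeleton], axis=-1)
```

Each time step of the fused array then carries both the wearable-sensor readings and the joint positions, so the same model can consume both without architectural changes beyond the input width.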

UNet LSTM: Similarly, the same was done with the UNet LSTM model, which achieved an accuracy of 0.742. The relevant plots are shown in Fig. 27 to 29.
Fig. 27. Train (blue) and val (green) accuracy plot of the UNet LSTM model on both Inertial and Skeleton Data
Fig. 28. Train (blue) and val (green) loss plot of the UNet LSTM model on both Inertial and Skeleton Data
Fig. 29. Confusion matrix of the UNet LSTM model on both Inertial and Skeleton Data

Ensemble of Conv LSTM and UNet LSTM: Lastly, we combined both models into an ensemble, which achieved an accuracy of 0.821. The confusion matrix is shown in Fig. 30.

Fig. 30. Confusion matrix of the ensemble of the Conv LSTM and UNet LSTM models on both Inertial and Skeleton Data
C. Summary
A summary of the results can be found in Table I.
TABLE I
AN OVERVIEW OF THE VALIDATION ACCURACY ACHIEVED BY VARIOUS MODELS.
| Data | SLSTM | BLSTM | CLSTM | ULSTM | Ensemble |
|---|---|---|---|---|---|
| Iner | 0.238 | 0.465 | 0.700 | 0.712 | 0.765 |
| Iner + skel | – | – | 0.784 | 0.742 | 0.821 |
V. CONCLUSION AND DISCUSSION
This research showcases our ensemble technique for Human Action Recognition (HAR), which achieved an accuracy of 0.821 on the validation set using Inertial and Skeleton data, outperforming the baseline paper's result of 0.672. This could be because the convolutional networks capture more generic, higher-dimensional features than the Collaborative Representation Classifier (CRC) method used in the UTD-MHAD baseline [17]. However, there is still room for improvement, and in the following subsections we explore other methods as future work to improve our validation accuracy.

A. Issue of model overfitting
Throughout our experiments, a persistent challenge was that our models overfit in the early epochs. We mitigated this by adding dropout layers and by ensembling the UNet LSTM with the Conv LSTM, which encourages exploration of alternative solution spaces [20]. Overfitting typically arises when the model attempts to learn high-frequency features that offer little utility. To counter this, we introduced Gaussian noise with zero mean, which has components across the entire frequency spectrum. We also observed substantial variability in the time sequences of different subjects, even for the same action. To address this, we augmented the data through time-scaling and translation, increasing the volume of training data and enabling better generalization. Reducing the models' complexity is a further option for mitigating overfitting while maintaining performance.
Fig. 31. Full model of Simple LSTM
Fig. 32. Full model of BiDirectional LSTM
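The paper does not give implementation details for these augmentations. A minimal sketch of the three operations named above, assuming linear interpolation for time-scaling and edge padding for translation (all function names and parameter values are hypothetical), could look like this:

```python
import numpy as np


def add_gaussian_noise(x, std, rng):
    """Add zero-mean Gaussian noise with standard deviation std."""
    return x + rng.normal(loc=0.0, scale=std, size=x.shape)


def time_scale(x, factor):
    """Linearly resample x (time, channels) to round(len * factor) time steps."""
    new_len = max(2, int(round(x.shape[0] * factor)))
    old_t = np.linspace(0.0, 1.0, x.shape[0])
    new_t = np.linspace(0.0, 1.0, new_len)
    return np.stack(
        [np.interp(new_t, old_t, x[:, c]) for c in range(x.shape[1])], axis=1
    )


def translate(x, shift):
    """Shift x by `shift` time steps (|shift| < len), repeating the edge sample."""
    if shift == 0:
        return x.copy()
    if shift > 0:
        pad = np.repeat(x[:1], shift, axis=0)
        return np.concatenate([pad, x[:-shift]], axis=0)
    pad = np.repeat(x[-1:], -shift, axis=0)
    return np.concatenate([x[-shift:], pad], axis=0)


rng = np.random.default_rng(0)
x = np.arange(10, dtype=float).reshape(5, 2)  # toy sequence: 5 steps, 2 channels
noisy = add_gaussian_noise(x, std=0.05, rng=rng)
stretched = time_scale(x, factor=1.2)
shifted_right = translate(x, 2)
shifted_left = translate(x, -1)
```

Note that a time-scaled sequence changes length, so in a pipeline that resamples everything to 180 steps it would need to be cropped or padded first; otherwise the resampling would undo the scaling.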

B. Integration of Depth and RGB Data for Data Fusion
Combining the Depth and RGB data in our fusion approach would increase the number of input features available for training the models, which could improve validation accuracy.

C. Collective Intelligence
At present, our ensemble simply takes the average of our models' softmax activations. We could further improve validation accuracy and reduce overfitting by exploring other ensemble learning methods [21], such as Boosting [22], Voting, and Bagging [23].
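As one illustration of the alternatives mentioned, hard (majority) voting replaces the averaging of probabilities with a count of each model's predicted label. This is a sketch of the general technique, not something evaluated in the paper; the three-model toy predictions and the tie-breaking rule are assumptions.

```python
import numpy as np


def majority_vote(predictions, num_classes):
    """Hard voting over integer labels of shape (models, samples).

    Ties are resolved in favor of the lowest class index.
    """
    predictions = np.asarray(predictions)
    num_samples = predictions.shape[1]
    votes = np.zeros((num_samples, num_classes), dtype=int)
    for model_preds in predictions:
        # Each model casts one vote per sample for its predicted class.
        votes[np.arange(num_samples), model_preds] += 1
    return votes.argmax(axis=1)


# Toy predictions from three hypothetical models on four samples.
preds = [
    [0, 1, 2, 2],
    [0, 2, 2, 1],
    [1, 1, 2, 0],
]
voted = majority_vote(preds, num_classes=3)
```

Unlike soft averaging, hard voting ignores how confident each model is, which can help when one model produces overconfident probabilities but needs an odd number of models or an explicit tie-breaking rule.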
VI. APPENDIX
Fig. 33. Full model of Conv LSTM
Fig. 34. Full model of UNet LSTM
REFERENCES
[1] T. Choudhury, G. Borriello, S. Consolvo, D. Haehnel, B. Harrison, B. Hemingway, J. Hightower, P. Klasnja, K. Koscher, A. LaMarca, J. A. Landay, L. LeGrand, J. Lester, A. Rahimi, A. Rea, and D. Wyatt, "The mobile sensing platform: An embedded activity recognition system," IEEE Pervasive Computing, vol. 7, no. 2, pp. 32-41, April 2008.
[2] S.-M. Lee, S. M. Yoon, and H. Cho, "Human activity recognition from accelerometer data using convolutional neural network," in 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), Feb 2017, pp. 131-134.
[3] H. Wang, A. Kläser, C. Schmid, and C. Liu, "Action recognition by dense trajectories," in CVPR 2011, June 2011, pp. 3169-3176.
[4] Laptev and Lindeberg, "Space-time interest points," in Proceedings Ninth IEEE International Conference on Computer Vision, Oct 2003, pp. 432-439 vol.1.
[5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Oct 2005, pp. 65-72.
[6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in CVPR.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," CoRR, vol. abs/1411.4389, 2014. [Online]. Available: http://arxiv.org/abs/1411.4389
[8] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: generic features for video analysis," CoRR, vol. abs/1412.0767, 2014. [Online]. Available: http://arxiv.org/abs/1412.0767
[9] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing Videos by Exploiting Temporal Structure," ArXiv e-prints, Feb. 2015.
[10] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," CoRR, vol. abs/1406.2199, 2014. [Online]. Available: http://arxiv.org/abs/1406.2199
[11] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," CoRR, vol. abs/1604.06573, 2016. [Online]. Available: http://arxiv.org/abs/1604.06573
[12] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, "Temporal segment networks: Towards good practices for deep action recognition," CoRR, vol. abs/1608.00859, 2016. [Online]. Available: http://arxiv.org/abs/1608.00859
[13] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. C. Russell, "ActionVLAD: Learning spatio-temporal aggregation for action classification," CoRR, vol. abs/1704.02895, 2017. [Online]. Available: http://arxiv.org/abs/1704.02895
[14] Y. Zhu, Z. Lan, S. D. Newsam, and A. G. Hauptmann, "Hidden two-stream convolutional networks for action recognition," CoRR, vol. abs/1704.00389, 2017. [Online]. Available: http://arxiv.org/abs/1704.00389
[15] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," CoRR, vol. abs/1705.07750, 2017. [Online]. Available: http://arxiv.org/abs/1705.07750
[16] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, and L. V. Gool, "Temporal 3D convnets: New architecture and transfer learning for video classification," CoRR, vol. abs/1711.08200, 2017. [Online]. Available: http://arxiv.org/abs/1711.08200
[17] C. Chen, "UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor," 09 2015.
[18] E. Hoffer, I. Hubara, and D. Soudry, "Train longer, generalize better: closing the generalization gap in large batch training of neural networks," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 1729-1739.
[19] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, 1997.
[20] R. Maclin and D. W. Opitz, "Popular ensemble methods: An empirical study," CoRR, vol. abs/1106.0257, 2011. [Online]. Available: http://arxiv.org/abs/1106.0257
[21] C. Zhang and Y. Ma, Ensemble Machine Learning: Methods and Applications. London: Springer, 2012, pp. 254-270.
[22] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," 1997.
[23] R. E. Schapire, "The strength of weak learnability." Kluwer Academic Publishers, 1990, pp. 197-227.