Human Activity Recognition using Machine Learning


Swati Patil, Harshal Lonkar, Ashitosh Supekar, Pranav Khandebharad, Mayur Limbore

Jaywantrao Sawant College of Engineering, Hadapsar, Pune

Abstract: Human Activity Recognition is a standard computer vision problem and has been well studied. It is an active research area in computer vision, with applications in security surveillance, healthcare and Human Computer Interaction. The fundamental goal is to analyze a video and identify the actions taking place in it. In this paper, we propose a Human Activity Recognition system using Machine Learning. The proposed system recognizes human activities including walking, running and sitting. While walking and running can be recorded as daily fitness activities, falling is detected as an anomalous situation, and alert messages are sent as required. This is done by capturing images and videos of the activity, identifying it, and classifying it into a specific category if needed.

Keywords: Human Activity Recognition; Human Computer Interaction; Machine learning; images; videos


    Human activity recognition is regarded as one of the most important problems in smart home environments, as it has a wide range of applications including healthcare, daily fitness recording, and alerting on anomalous situations. Recognizing simple human activities (walking, running, sitting, etc.) is essential for understanding complex human behavior. The detection of activities like walking and sitting can also provide useful information about a person's health condition.

    Human activity recognition has also proved useful in healthcare and in monitoring elderly citizens, where it helps ensure the wellbeing of children as well as elderly people.

    The camera module senses and captures the activity, and the captured data is sent to the server and the analysis model. The machine learning model analyzes the data, and the results are saved on the cloud. If the data indicates a suspicious or anomalous activity, an alert is sent to the Android application on the user's mobile phone, so that the harmful impact of the ongoing activity can be prevented or minimized.
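The alerting step described above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: it assumes a classifier has already produced one activity label per frame, and the names `ANOMALOUS`, `frames_to_alerts` and `min_consecutive` are hypothetical. Requiring several consecutive anomalous frames before alerting is our own assumption, added so that a single misclassified frame does not trigger a notification.

```python
# Hypothetical label set: the text names falling as the anomalous activity.
ANOMALOUS = {"falling"}


def frames_to_alerts(labels, min_consecutive=3):
    """Return the frame indices at which an alert would be raised.

    labels: sequence of per-frame activity labels from some classifier.
    An alert fires once when an anomalous label has persisted for
    min_consecutive frames, and can fire again only after a normal frame.
    """
    run = 0          # length of the current streak of anomalous frames
    alerts = []
    for i, label in enumerate(labels):
        if label in ANOMALOUS:
            run += 1
            if run == min_consecutive:
                alerts.append(i)
        else:
            run = 0  # a normal frame resets the streak
    return alerts


stream = ["walking", "walking", "falling", "falling",
          "falling", "falling", "sitting", "falling"]
print(frames_to_alerts(stream))
```

In a deployed system the returned indices would be the points at which the server pushes a notification to the mobile application; the delivery mechanism itself is outside this sketch.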

    With this system in day-to-day use, people can monitor and stay aware of activities in real time, and surveillance for security, healthcare and child monitoring can be managed effectively.

    Fig 1: System Architecture


    1. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

      (Kensho Hara, Yutaka Satoh, Hirokatsu Kataoka)

      This paper examines whether current video datasets contain enough data to train very deep Convolutional Neural Networks (CNNs). The choice of dataset has proven vital for CNN-based activity recognition. In recent years ActivityNet, a fairly large dataset, has been widely used for action detection tasks, but the number of these actions is still limited. 3D CNNs are now widely used, as they can be more effective than 2D CNNs for activity recognition.

      The HMDB-51 and UCF-101 datasets are currently among the most successful in activity detection. Human activity recognition can be pursued through many different approaches and datasets. Some major datasets have already set benchmarks in this field, but there is still considerable scope for further work.

      Training on different datasets can make activity detection more effective, and large-scale datasets, combined with a range of algorithms, make it possible to train deeper CNNs.

      Pros :

      • A simple 3D architecture pretrained on Kinetics outperforms complex 2D architectures.

      Cons :

      • ResNet-18 training does not overfit on Kinetics, but it does overfit on the smaller datasets (UCF-101, HMDB-51 and ActivityNet).
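To illustrate why a 3D convolution suits video, the sketch below applies a kernel that spans the time axis as well as the two spatial axes, so a single filter can respond to motion between frames. It is a plain-Python illustration over nested lists, not the ResNet architectures studied in the paper; the function name `conv3d_valid` and the toy clip are our own assumptions.

```python
def conv3d_valid(clip, kernel):
    """Valid (no padding) 3D cross-correlation.

    clip:   T x H x W nested lists (frames, rows, columns).
    kernel: t x h x w nested lists.
    Returns a (T-t+1) x (H-h+1) x (W-w+1) nested list.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(T - t + 1):
        plane = []
        for y in range(H - h + 1):
            row = []
            for x in range(W - w + 1):
                s = 0.0
                # Sum over the kernel extent in time and space.
                for dz in range(t):
                    for dy in range(h):
                        for dx in range(w):
                            s += clip[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


# A bright pixel moving across three 2x2 frames.
moving = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [0, 1]],
]
# A 2x1x1 temporal-difference kernel: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]

print(conv3d_valid(moving, temporal_diff))
```

A kernel of temporal size 1 reduces to an ordinary 2D convolution applied to each frame independently, which is why a 2D CNN alone cannot capture the change between frames.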

    2. ConvNet Architecture Search for Spatiotemporal Feature Learning.

      (Du Tran , Jamie Ray, Shih-Fu Chang , Manohar Paluri, Zheng Shou)

      This paper presents a ConvNet architecture for spatiotemporal feature learning. Image representations learned by ConvNets trained on large datasets such as ImageNet have improved visual understanding tasks including object identification, semantic segmentation and the recognition of motion patterns.

      The model is trained on the Sports-1M dataset rather than on the conventionally used datasets, and it is reported to be 2 times faster, 2 times smaller and more compact than other deep learning models for activity recognition.

      Pros :

      • The model trained on Sports-1M is used as a feature extractor for the other datasets.

      Cons :

      • Long-range temporal modeling remains a problem.

    3. Sequential Deep Learning for Human Action Recognition.

      (Moez Baccouche, Franck Mamalet, Christophe Garcia, Atilla Baskurt, Christian Wolf)

      This study proposes a fully automatic human activity recognition model that classifies activities without any prior knowledge. Interest in this field has grown in the last few years, with different approaches that include so-called deep models. These models can build high-level representations directly from raw input data.

      A Recurrent Neural Network (RNN) is trained to classify each video sequence, taking into account how the activity evolves at each time step. Another interesting aspect of these models is that they do not need to process the entire sequence of frames: they work on a short sub-sequence, which determines the nature of the activity in much less time and with much less data.

      Pros :

      • High level understanding and interpretation over raw data.

      • Time efficient.

      Cons :

      • Conv3D and LSTM cannot be trained together in a single step.
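The two ideas in this entry, working on a short sub-sequence of frames and updating the decision at each time step, can be sketched as below. This is a simplified stand-in and not the Conv3D and LSTM model of the paper: frames are picked at evenly spaced positions, and the recurrent classifier is replaced by a running sum of per-step class scores. The names `sample_indices` and `running_decision` and the example scores are hypothetical.

```python
def sample_indices(num_frames, clip_len):
    """Pick clip_len evenly spaced frame indices from num_frames frames."""
    if num_frames <= 0 or clip_len <= 0:
        return []
    if clip_len >= num_frames:
        return list(range(num_frames))
    step = num_frames / clip_len
    # Take the frame at the centre of each of the clip_len equal segments.
    return [int(i * step + step / 2) for i in range(clip_len)]


def running_decision(step_scores):
    """Return the predicted label after each time step.

    step_scores: list of dicts mapping label -> score for one time step.
    The decision at step k uses the scores accumulated over steps 0..k,
    a crude substitute for the hidden state of a recurrent network.
    """
    totals = {}
    decisions = []
    for scores in step_scores:
        for label, s in scores.items():
            totals[label] = totals.get(label, 0.0) + s
        decisions.append(max(totals, key=totals.get))
    return decisions


print(sample_indices(100, 5))

# Hypothetical per-step scores for a clip that starts ambiguous.
scores = [
    {"walking": 0.6, "running": 0.4},
    {"walking": 0.3, "running": 0.7},
    {"walking": 0.2, "running": 0.8},
]
print(running_decision(scores))
```

The decision can change as evidence accumulates, which is the behaviour the per-time-step classification in the paper is meant to capture.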

    4. Action Recognition using Visual Attention.

      (Shikhar Sharma, Ruslan Salakhutdinov, Ryan Kiros)

      The proposed model uses Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units, which are deep both spatially and temporally. The model is very effective at focusing on selected frames in a given sequence to identify the nature of the activity and classify it within a few glimpses.

      The proposed model is a soft attention model built on a recurrent model for activity recognition. The max pooling used in many earlier models is replaced with dynamic pooling, which has proven more effective. The model is also trained to detect and identify the important elements in video frames, which makes categorizing the activity easier and more reliable.

      The model is evaluated on the UCF-11, HMDB-51 and Hollywood2 datasets. The evaluation analyzes how the model focuses its attention not only on the activity but also on the scene in which the action is performed.

      Pros :

      • The model automatically focuses on the important parts of the video.
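The difference between max pooling and the attention-weighted pooling described in this entry can be sketched as below. Soft attention here means a softmax over per-location scores followed by a weighted sum of the location features. In the paper the scores are produced by the recurrent network; in this sketch they are simply given as input, and the function names and example values are our own assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Soft attention pooling.

    features: list of feature vectors, one per spatial location.
    scores:   one unnormalized relevance score per location.
    Returns (pooled_vector, weights), where weights sum to 1.
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return pooled, weights


def max_pool(features):
    """Element-wise maximum over locations, for comparison."""
    dim = len(features[0])
    return [max(f[d] for f in features) for d in range(dim)]


# Three locations with 2-dimensional features (hypothetical values).
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

print(max_pool(features))                            # ignores where values come from
print(attention_pool(features, [0.0, 10.0, 0.0])[0])  # dominated by the second location
```

With equal scores the attention pool reduces to a plain average over locations; as one score grows, the pooled vector approaches the feature at that location, which is how the model concentrates on the important parts of a frame.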

    5. Event Description From Video Stream For Anomalous Human Activity and Behaviour Detection.

      (Mohamad Hanif Md Saad, Wan Noor Aziezan Baharuddin, Aini Hussain, Nooritawati Md Tahir, Liang Xian Loong)

      This paper proposes a way to describe an activity or event so that the system can process it in a measurable and organized manner. The model includes monitoring for anomalous activity, to address security and safety in public places. Researchers have previously proposed Event Description Languages (EDLs), which have been used effectively to store such anomalous activities and retrieve them during activity recognition.

      The InVISS Event Description Language is used in this model to automatically recognize the activity in a video and flag it as anomalous if it is suspicious.

      The model is best suited to CCTV surveillance, where such activities can be recognized automatically in real time and alert notifications can be sent through the server to the application. This makes it convenient to track anomalous activities as they take place and to prevent damage to a large extent.

      The model can also help identify a person involved in malicious activity, so that the person can be traced later if not caught at the scene.

      Pros :

      • IEDL automatically identifies anomalous activity through CCTV.

      • User friendly and easy to understand.

      Cons :

      • The language structure needs to be improved and optimized.


    1. General System

      This system consists of the following components:

      1. Hardware:

        The hardware is a camera that captures images and video in real time. The frames of the captured video and images act as input and are sent to the server and the machine learning model for analysis.

      2. Cloud:

        All data is stored on internet servers rather than on a local hard drive, so it is remotely accessible over the internet. The mobile application accesses this data over the internet, and a notification is sent to the app if the ongoing activity is perceived to be anomalous.

    2. Android Application

    The Android application delivers notifications and guides the user to the location of the anomalous activity. The identity and an image of the person involved in the ongoing activity can also be made available in the application. The application can also be used for real-time monitoring and surveillance of the activity taking place.


    The Human Activity Recognition model captures the activity taking place, categorizes it as anomalous or not on the basis of the observed behavior, and notifies the user. The model is trained on several datasets, and this training is what enables the recognition of activities.

    Human activity recognition is still a very broad area with many new research opportunities and different approaches.

    Issues related to the background, adequate lighting and the alignment of the capturing device remain to be resolved in future work. There is also considerable scope for improving accuracy.


  1. Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh, (2018 April). Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? Retrieved from:

  2. Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri, (2017 August). ConvNet Architecture Search for Spatiotemporal Feature Learning. Retrieved from:

  3. Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt, (2011). Sequential Deep Learning for Human Action Recognition. Retrieved from: 8_4.pdf

  4. Shikhar Sharma, Ryan Kiros & Ruslan Salakhutdinov, (2016 February). Action Recognition using Visual Attention Retrieved from: recognition-using-visual-attention-nips2015.pdf

  5. Mohamad Hanif Md Saad, Aini Hussain , Liang Xian Loong , Wan Noor Aziezan Baharuddin, Nooritawati Md Tahir, (2011). Event Description From Video Stream For Anomalous Human Activity and Behaviour Detection Retrieved from:

  6. Ishan Agarwal, Alok Kumar Singh Kushwaha, Rajeev Srivastava,(2015). Weighted Fast Dynamic Time Warping Based Multiview Human Activity Recognition Using a RGB-D Sensor Retrieved from:

  7. Murat Cihan Sorkun, Ahmet Emre Danışman, Özlem Durmaz İncel, (2018). Human Activity Recognition With Mobile Phone Sensors: Impact Of Sensors And Window Size. Retrieved from:

  8. Neslihan Köse, Mohammadreza Babaee, Gerhard Rigoll, (2017). Multi-view human activity recognition using motion frequency. Retrieved from: df

  9. Hui Huang, Xian Li, Ye Sun, (2016 August). A triboelectric motion sensor in wearable body sensor network for human activity recognition Retrieved from: c_Motion_Sensor_in_Wearable_Body_Sensor_Network_for_Human


  10. Shuiwang Ji, Wei Xu, Ming Yang, Kai Yu, (2012). 3D Convolutional Neural Networks for Human Action Recognition Retrieved from:

  11. Karen Simonyan, Andrew Zisserman, (2014). Two-Stream Convolutional Networks for Action Recognition in Videos Retrieved from:

  12. Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici, (2015 March). Beyond Short Snippets: Deep Networks for Video Classification Retrieved from:
