The Literature Survey on Virtual Piano

This paper presents an efficient data-driven approach to track fingertips and detect finger tapping for a virtual piano using an RGB-D camera. We collect 7200 depth images covering the most common finger articulations for playing piano, and train a random regression forest using depth context features of randomly sampled pixels in the training images. In the online tracking stage, we first segment the hand from the plane it is in contact with by fusing information from both color and depth images. Then we use the trained random forest to estimate the 3D positions of the fingertips and wrist in each frame, and predict finger tapping based on the estimated fingertip motion. Finally, we build a kinematic chain and recover the articulation parameters for each finger. In contrast to existing hand tracking algorithms, which often require that hands are in the air and cannot interact with physical objects, our method is designed for hand interaction with planar objects, which is desirable for the virtual piano application. Using our prototype system, users can put their hands on a desk, move them sideways and then tap their fingers on the desk, as if playing a real piano. Preliminary results show that our method can recognize most of the beginner's piano-playing gestures in real-time for soothing rhythms.

I. INTRODUCTION
Recent years have witnessed rapid progress in hand pose tracking and hand motion analysis using consumer depth sensors. State-of-the-art techniques [Tagliasacchi et al. 2015] are able to accurately track hand motion and handle intricate geometric configurations with complex contact patterns among fingers in real-time.
However, most of them require that hands are in the air and cannot interact with physical objects. Such a requirement diminishes their utility for virtual instrument applications for two reasons. First, users quickly get tired when their hands are not supported by a physical object. Second, mid-air interactions do not provide the user with any feedback, so users may find it difficult to position their fingers and map them to the keys or strings of the virtual instrument. This paper aims at developing a virtual piano application that allows users to put their hands on a desk, move them sideways and then tap their fingers on the desk, as if playing a real piano. There are two major technical challenges in this application. First, the system must track the positions of the fingertips and detect their status, i.e., whether a finger is tapping or not. Due to the frequent interaction between fingers and desk, existing hand tracking algorithms often fail. Second, piano-playing gestures are usually fast and complex, involving highly flexible hand articulation and causing severe hand self-occlusion.
To tackle these challenges, we propose a method tailored to the virtual piano that tracks fingertips and detects finger tapping using an RGB-D camera in real-time. We first collect a training dataset of 7200 RGB-D images, covering the most common finger articulations for playing piano. After manually labeling the positions of seven hand joints (the five fingertips, the thumb MCP joint and the wrist center), we train a random regression forest to predict them using depth context features of spatial-voting pixels randomly sampled over the training images. During online testing, we first predict the positions of the hand joints from raw RGB-D images with the trained random forest. Then we use the trajectories of these joints to detect and locate finger tapping using support vector machine (SVM) classification. The virtual piano is registered onto the desk surface using the pre-detected normal vector and centroid of the desk surface.
Based on the locations of the fingertips and the finger tapping status, the system can determine which piano key is pressed and play the corresponding sound. Preliminary results show that our method can recognize the basic piano-playing gestures in real-time for soothing rhythms. Figure 1 illustrates our virtual piano application with a DepthSenser 325 sensor on top of the desk and in front of the user. We render the hand and the piano based on the coordinates of the desk surface and the hand pose detected from the RGB-D images.

II. RELATED WORK
Hand pose tracking and estimation is a fundamental problem in computer graphics and vision, and is central to many human-computer interfaces. Early gesture recognition systems resorted to data gloves or uniquely colored gloves/markers on hands or fingers [Aristidou and Lasenby 2010]. In recent years there has been growing interest in non-invasive setups using a single commodity RGB-D sensor, such as Microsoft Kinect, Intel RealSense, or purpose-designed hardware, e.g., the Leap Motion Controller. Such single-camera acquisition does not hinder user movements, and is therefore particularly suitable for VR applications. This section briefly reviews related work on hand pose tracking, finger action recognition and virtual musical instruments.

A. Hand Pose Tracking
Algorithms for vision-based hand pose tracking can be broadly categorized as generative model-fitting methods and discriminative methods. Each class of algorithms has its own merits and drawbacks. The model-fitting methods [Melax et al. 2013] [Tagliasacchi et al. 2015] reconstruct hand poses by fitting a 3D articulated hand model to depth images. These methods work well in controlled environments; however, they usually require calibration and their results are sensitive to initialization. The discriminative methods require an annotated dataset to learn a regressor offline, and then use it to predict the hand pose online. Such methods are robust to initialization, but their accuracy depends heavily on the size of the training dataset. Therefore, the dataset must be fairly large to cover the possible hand and finger articulations for a particular application. The latest methods (e.g., [Tang et al. 2013; Sun et al. 2015; Xu and Cheng 2013]) require that the hand is in the air and not interacting with other objects. The reason is that hand motion itself has many degrees of freedom and thus requires a large amount of training data to represent such flexibility. If the hand is interacting with unknown objects, there is additional unpredictable complexity, e.g., hand self-occlusion, occlusion between object and hand, and large appearance variations of both the hand and the interacting objects.
These difficulties hinder research in this area. Some previous work supports hands interacting with objects, but assumes that the geometry of the object is known, so that object and hand can complement each other to improve pose estimation. In [Rogez et al. 2014] the hand is allowed to interact with different objects, such as bottles, desk surfaces, etc., and the hand pose is estimated in a discriminative manner by training a multi-class cascade classifier on a dataset that covers many examples of interaction between hand and objects. However, in their hand pose estimation framework, the hand posture is only assumed to belong to a small set of predefined templates. This is far from what we need for playing the piano in the proposed application, where we need to track accurate articulated fingertip and wrist positions, so that the system can detect whether or not a finger tap is performed by the player. In [Xu and Cheng 2013], a regression forest is used to regress hand joint angles directly. With a pre-trained forest, a set of voting pixels cast their votes for each joint angle, and the votes are fused into several candidate hand poses. An extra model-matching stage is needed to find the optimal pose. In [Tang et al. 2013] a transductive regression forest is proposed to alleviate the discrepancy between synthetic and real-world data and improve prediction accuracy. A multi-modal prediction fusion algorithm has also been proposed that uses hand motion constraints to resolve ambiguous pose predictions from the random regression forest, so that infeasible hand postures can be avoided.
A hierarchical regression scheme has also been built upon the regression forest for hand pose estimation, in which the root joints of the hand skeleton are predicted first and the other joints are then predicted based on the root joints, which is shown to improve prediction accuracy considerably. While these methods work only for in-air hands, we propose to use the regression forest for hands interacting with planar objects.

B. Finger Action Recognition
Extracting discriminative features and finding effective learning models are the two key issues in every pattern recognition problem. Actions are spatio-temporal patterns, which require comprehensive features that gather information from the time domain as well as the spatial domain to describe the problem. It is commonly recognized that absolute 3D joints

C. Virtual Musical Instruments
There have been many research efforts to develop virtual and augmented musical instruments in the past decades. Virtual reality and/or augmented reality techniques are used to increase instrument accessibility, improve the user's psycho-pleasure and provide performance guidance. Digito [Gillian and Paradiso 2012] is a gesturally controlled virtual musical instrument which uses a 3D depth sensor to recognize hand gestures with machine learning algorithms, and triggers the note to be played by a "tap" style gesture with the tip of the index finger of the right hand. However, the user experience of Digito is very different from playing a real piano with different fingers. Some applications use the Leap Motion Controller to construct a virtual piano, using the 3D positions of the fingers to detect tapping [Heavers], but in these applications the user's hands are not allowed to interact with any object, which is unnatural and uncomfortable for a piano player. Han and Gold [2014] conducted a detailed examination of Leap Motion as the tracking device and algorithm for playing piano, which shows that although Leap Motion provides accurate tracking for free hand postures, when the hand does not interact with any object it is difficult for the player to determine the position and height of the virtual keyboard without prior practice. Our approach allows users to put their bare hands on a planar object and tap on it, as if playing a real piano.

III. OVERVIEW
We develop a virtual piano application with fingertip tracking and tapping gesture detection, which lets users play a virtual piano keyboard on any planar surface, with the surface itself providing force feedback. We design the application specifically for beginner-level piano players, who start with slow and simple practice songs. In such cases, the fingertip motions can be accurately tracked, and tapping can be identified relatively robustly based on hand joint trajectories only. Our application is developed with a DepthSenser 325 as the RGB-D sensor and consists of three components:
* Fingertip Tracking takes RGB-D images as input, extracts the hand from the reference plane, and computes the positions of the hand joints;
* Tapping Detection converts the five fingertip locations into the global coordinate system, computes their heights relative to the reference plane and the relative positions of each pair of fingertips, and finally generates tapping events based on the spatio-temporal features retrieved from the motion trajectory data;
* Rendering and Feedback takes the tapping event as input, triggers the virtual piano key event and the sound system, and finally provides visual and audio feedback to the user.
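The per-frame geometric quantities used by the tapping detection component can be sketched as follows. This is a minimal illustration rather than our implementation: the function name `tapping_features`, the array layout, and the idea of concatenating heights and pairwise offsets into one vector are assumptions made for the example. Temporal features would be built by stacking such vectors over consecutive frames.

```python
import numpy as np


def tapping_features(fingertips, plane_point, plane_normal):
    """Per-frame features for five fingertips (hypothetical layout).

    fingertips   : (5, 3) fingertip positions in the global coordinate system
    plane_point  : (3,) any point on the reference plane (e.g. its centroid)
    plane_normal : (3,) normal of the reference plane (need not be unit length)
    Returns a vector of 5 signed heights followed by 10 pairwise 3D offsets.
    """
    fingertips = np.asarray(fingertips, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)

    # Signed height of each fingertip above the reference plane.
    heights = (fingertips - np.asarray(plane_point, dtype=float)) @ n

    # Relative position of every unordered fingertip pair (i < j).
    i, j = np.triu_indices(len(fingertips), k=1)
    pairwise = fingertips[i] - fingertips[j]

    return np.concatenate([heights, pairwise.ravel()])
```

With five fingertips this yields a 35-dimensional vector (5 heights plus 10 pairs times 3 coordinates).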

IV. HAND SEGMENTATION
To ensure high-quality hand pose estimation, the hand region needs to be segmented accurately from the background in the depth image. To locate the hands, we perform per-pixel skin color detection [Hammer and Beyerer 2013; Li and Kitani 2013]. However, the detected skin mask is not always reliable, and background pixels can be misclassified into the hand region. To improve the results, we propose to first fit a plane to the desk surface using the RANSAC algorithm [Fischler and Bolles 1981] in the depth image and then separate the points that do not fit the plane as the hand region. However, as the hand occupies a large part of the foreground depth image, it introduces many outliers for 3D plane fitting. This large number of outliers affects the RANSAC algorithm, which needs many more iterations to find the best set of points that fit the plane. To address this problem, we first locate the hand region in the depth image using the skin color detection results, and then use RANSAC to fit a plane to the remaining points.
Based on the detected plane, the hand can be segmented more reliably in depth images. In addition, we use the normal vector centered on the desk as the origin of the coordinate system in which the virtual piano is constructed for interaction.
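The two-stage segmentation described in this section can be illustrated with a short sketch. It assumes the depth image has already been back-projected to an (N, 3) point cloud and that a boolean skin mask is available; the function names, the 1 cm inlier tolerance and the iteration count are illustrative assumptions, not values from our system.

```python
import numpy as np


def fit_plane_ransac(points, iters=200, tol=0.01, seed=0):
    """Basic RANSAC plane fit; returns (unit normal, point on plane)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    best_count, best_model = -1, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        count = int((np.abs((points - sample[0]) @ normal) < tol).sum())
        if count > best_count:
            best_count, best_model = count, (normal, sample[0])
    if best_model is None:
        raise ValueError("no non-degenerate sample found")
    return best_model


def segment_hand(points, skin_mask, tol=0.01, iters=200, seed=0):
    """Fit the desk plane to non-skin points, then mark off-plane points as hand."""
    points = np.asarray(points, dtype=float)
    # Removing skin-colored points first reduces the outliers RANSAC must reject.
    normal, p0 = fit_plane_ransac(points[~skin_mask], iters=iters, tol=tol, seed=seed)
    return np.abs((points - p0) @ normal) >= tol
```

Note that desk pixels wrongly marked as skin are still recovered as background in the second step, because the final decision depends only on the distance to the fitted plane.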

V. HAND POSE TRACKING
Once the hand is segmented, we can then use the random regression forest [Girshick et al. 2011] to predict the 3D positions of the seven joints of the hand. The regression forest is an ensemble of several random regression trees, each of which consists of a number of split nodes and leaf nodes. Each split node contains one split function learnt from the training data to branch to the child node based on the feature values of the descriptor of an input pixel i. Each leaf node contains the distributions over the 3D relative offsets to the joint positions, which are collected from the training samples.
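To make the prediction step concrete, the sketch below shows how voting pixels could be passed down each tree and how their votes could be combined. It is a simplified illustration under several assumptions: each leaf stores a single mean offset per joint instead of a full distribution, the split function is a threshold on one descriptor entry, and the votes are fused with a coordinate-wise median. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    """Split node (feature/threshold/children) or leaf node (offsets)."""
    feature: int = -1
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    offsets: Optional[np.ndarray] = None  # (J, 3) offsets to the J joints, leaf only


def descend(node, descriptor):
    """Follow split functions until a leaf is reached; return its offsets."""
    while node.offsets is None:
        go_left = descriptor[node.feature] < node.threshold
        node = node.left if go_left else node.right
    return node.offsets


def predict_joints(forest, pixel_xyz, descriptors):
    """Each voting pixel casts one vote per tree: its 3D position plus the leaf offset."""
    votes = [xyz + descend(tree, d)
             for xyz, d in zip(pixel_xyz, descriptors)
             for tree in forest]
    # Assumed fusion rule: coordinate-wise median over all votes.
    return np.median(np.stack(votes), axis=0)  # (J, 3)
```

A median is used here only because it is short and tolerant of stray votes; other fusion rules, such as mode seeking over the vote distribution, fit the same interface.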

VI. EXPERIMENTS & DISCUSSIONS
Experiment Setup. We implemented the fingertip tracking and tapping detection algorithm in C++/OpenCV and rendered the virtual piano using OpenGL. We placed a DepthSenser 325 sensor on top of the desk and in front of the user. The system was tested on a PC with an Intel i7 3.3 GHz CPU and 16 GB RAM. It is worth noting that processing one frame takes only 20 ms, which is efficient enough for real-time tracking. After training for 20 minutes, a user with little musical background can play a simple adagio melody with our virtual piano. See the accompanying video.
Training. To validate the effectiveness of the proposed hand pose tracking algorithm, we collected a dataset of real-world hand images consisting of around 7.2k depth images of two subjects performing various finger tapping postures to play the virtual piano. The resolution of these images is 320 × 240. The subjects can put their hand either over or on the desk. The collected hand poses cover the most frequent gestures for playing piano in the view of the depth camera, and the poses are independent of the music score. In each image we manually annotated the 3D positions of the seven joints of the hand. In this experiment we set the number of trees in the forest to 3. During training, we randomly sampled 150 pixels from each training image and generated 6000 candidate split functions to learn the tree structure. A tree stops growing if its depth exceeds 20 or the number of samples at a node is less than 50. During testing, 500 voting pixels are randomly sampled from the segmented hand region to predict the hand joint positions.
To collect the training data set for tapping, we manually label several sequences of RGB images, including over 100 taps for each finger. We label the tapping down moment frame and its previous 3 frames as TD frames, and label the tapping up moment frame and its following 3 frames as TU frames. The other frames are labeled as non-tapping frames. These annotated data are then used to train the SVM classifier for tap detection.
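The frame labeling rule above can be written down compactly. The sketch below is illustrative only: the integer label codes and the function name `label_frames` are assumptions, and it takes the manually marked tapping down and tapping up frame indices as given.

```python
import numpy as np

NON_TAP, TD, TU = 0, 1, 2  # hypothetical label codes


def label_frames(n_frames, down_frames, up_frames, span=3):
    """Label each frame of a sequence as non-tapping, TD or TU.

    down_frames : indices of tapping down moments (TD covers the moment
                  and the `span` frames before it)
    up_frames   : indices of tapping up moments (TU covers the moment
                  and the `span` frames after it)
    """
    labels = np.full(n_frames, NON_TAP, dtype=int)
    for f in down_frames:
        labels[max(0, f - span): f + 1] = TD
    for f in up_frames:
        labels[f: min(n_frames, f + span + 1)] = TU
    return labels
```

For a 12-frame sequence with a tapping down moment at frame 4 and a tapping up moment at frame 7, frames 1 to 4 are labeled TD and frames 7 to 10 are labeled TU.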
Performance. We perform 4-fold cross validation on this dataset to evaluate the performance of the proposed method. The prediction performance of a joint is evaluated in terms of the percentage of its predictions that are within a distance of DT centimeters from the ground truth in the test images. This metric is averaged over all seven joints to obtain the overall evaluation. To better understand the performance of the method, we present the results for different values of DT so that the distribution of the predictions over different intervals of DT can be observed, as shown in Figure 7. The average error between the ground truth hand joint positions and the predicted positions is 1.3 cm. Figure 8 shows the hand pose prediction results on some sample frames in the dataset. We can see that the proposed method can accurately recover the positions of the hand joints both when the fingers are in the air and when they are on the reference plane. In contrast, commercial products, such as Leap Motion, Intel RealSense and SoftKinetic, are not able to detect the hand joints in those cases. To test the tapping detection algorithm, we asked 2 users to perform a total of 100 taps with each finger, with around 1 second between taps. We consider a tapping down or tapping up event to be classified successfully if the finger which performed the action is correctly identified within 0.3 seconds. The result of tapping down detection is shown in Table 1.
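The joint accuracy metric described above can be computed as in the following sketch. The function name and the (N, J, 3) array layout, with N test frames and J joints in centimeters, are assumptions made for illustration.

```python
import numpy as np


def fraction_within(pred, gt, dt_cm):
    """Fraction of predictions within dt_cm of the ground truth.

    pred, gt : (N, J, 3) predicted / ground truth joint positions in cm
    Returns (per_joint, overall): a (J,) array and its mean over joints.
    """
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    per_joint = (err <= dt_cm).mean(axis=0)
    return per_joint, float(per_joint.mean())
```

Evaluating this function over a range of DT values gives one accuracy curve per joint, plus the averaged curve.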
Comparison. We compare our method with two state-of-the-art techniques: a model-based algorithm [Tagliasacchi et al. 2015] and the Leap Motion Controller, the leading commercial product for hand tracking. These methods are able to accurately track (multiple) hands when they are in the air; however, they fail when hands are interacting with physical objects. In contrast, our algorithm is specifically designed for hand interaction with planar objects, and therefore performs better and more accurately in the virtual piano application.
Limitations. Although our method can track most of the beginner's piano-playing gestures for soothing rhythms in real-time, our virtual piano has several limitations compared with playing a real piano. (1) The proposed tracking algorithm is not robust to hand-shape variations, e.g., the prediction accuracy drops when the shape and/or size of the player's hand differ significantly from those in the training dataset.
(2) Thumb under is a common gesture, where the thumb is brought under the hand in order to pass the 3rd or 4th finger when playing a scale. Due to severe occlusion, the depth sensor is not able to capture the thumb and our tracking algorithm cannot detect it either. (3) Our current implementation is not efficient and accurate enough to detect tapping events at a fast tempo. (4) Our method supports two-hand tracking. However, due to the limited viewing volume of the DepthSenser 325 sensor, users can only play with a single hand over about 2 octaves.

VII. CONCLUSION & FUTURE WORK
This paper presented a virtual piano application that allows users to play with bare hands on or near a planar surface. Taking RGB-D images as input, our method uses an offline-trained random regression forest to track the fingertips and detect finger tapping. In contrast to existing hand tracking algorithms, our method is designed for hand interaction with planar objects. Preliminary results show that our method can recognize most of the beginner's piano-playing gestures for soothing rhythms in real-time.
The system can be further integrated with a head-mounted display, such as Oculus Rift, to provide the user with an immersive visual and audio environment, which may further support remote learning and gamification in musical instrument learning. In a broader sense, our work provides a pipeline for hand interaction with planar objects and a general solution for this kind of application, which draws the community's attention to the limitations of current mid-air hand tracking techniques.
In the future, we will extend the gesture database for intermediate and advanced players and improve the accuracy of our tracking algorithm for allegro rhythms. To detect self-occluded gestures, a graphical machine learning model could be applied to predict occluded finger positions and tapping moments with the help of domain knowledge. We will also develop a hand normalization algorithm so that players whose hands are significantly different from those in the training dataset can use our system. Moreover, we will conduct a formal user study to evaluate the efficacy of the proposed system.