HUMAN ACTION POSELETS ESTIMATION VIA COLOR G-SURF IN STILL IMAGES

Human activity has been a persistent subject of interest over the last decade. On the one hand, video sequences provide a huge volume of motion information for recognizing active human actions. On the other hand, the spatial information about static human poses is valuable for human action recognition. Poselets were introduced as latent variables representing a configuration of the mutual locations of body parts and allowing descriptions from different views. In the current research, some modifications of Speeded-Up Robust Features (SURF) invariant to affine geometrical transforms and illumination changes were tested. First, a grid of rectangles is imposed on the object of interest in a still image. Second, a sparse descriptor based on Gauge-SURF (G-SURF) invariant to color/lighting changes is constructed for each rectangle separately. A common Spatial POselet Descriptor (SPOD) aggregates the SPODs of the rectangles, followed by random forest classification in order to obtain fast classification results. The proposed approach was tested on samples from the PASCAL Visual Object Classes (VOC) Dataset and Challenge 2010, providing an accuracy of 61-68% for all possible 3D pose locations and 82-86% for front pose locations with regard to nine action categories.


INTRODUCTION
Recognition of active human actions is employed in many computer vision tasks, e.g., video surveillance, human detection, activity analysis, scene analysis, image annotation, image and video retrieval, augmented reality, and human-computer interaction, among others. Conventional methods of human action recognition use motion information in videos (Laptev and Lindeberg, 2003; Zhen et al., 2013), which can be applied for object capturing and then for motion trajectory analysis as the temporal component in human action classification. During action classification, the spatial component also plays a significant role, which makes it reasonable to develop methods for human action recognition in still images.
Methods used to estimate human poses can be classified into four categories: model-based, example-based, pictorial structure-based, and poselet-based approaches. The model-based methods predefine a parametric body model and find the pose by matching based on labelled extracted features. A graphical model of human-object interactions was developed by Gupta et al. (Gupta et al., 2009), including reach motions, manipulation motions, and object reactions. Such models are often built on a silhouette-based representation of body parts or on edge information. The main disadvantage lies in the difficulty of designing the parametric body model and the pose sub-models.
The example-based methods do not use a global modelling structure and instead store a set of images with corresponding pose descriptions. In this framework, two problems arise: the choice of relevant descriptors and fast search. Three shape descriptors (Fourier descriptors, shape contexts, and Hu moments) were compared by Poppe and Poel (Poppe and Poel, 2006) for the representation of human silhouettes. They experimented with deformed silhouettes to test robustness to body size, viewpoint, and noise. Shakhnarovich et al. (Shakhnarovich et al., 2003) proposed a hashing-based search technique for pose estimation that retrieves relevant pose examples from a large database. A fast pruning method based on shape contexts, intended to speed up the search for similar body poses, was presented by Wang et al. (Wang et al., 2006). The high computational cost of searches in high-dimensional spaces inside large datasets is the major drawback of example-based methods.
The pictorial structure-based methods represent poses as cues using prior information about the human body structure (Felzenszwalb and Huttenlocher, 2005). In this approach, histogram-based methods prevail. These may be circular histograms with spatial and orientation binning (Ikizler et al., 2008) or the most popular Histogram of Oriented Gradients (HOG) (Dalal and Triggs, 2005) with multiple modifications. The last research was a pioneering investigation in pose descriptor construction based on non-negative matrix factorization. The action classes were represented by the HOGs of pose primitives, followed by a simple histogram comparison for action recognition. This technique works well in typical cases but fails under occlusions or significant changes of camera viewpoint. To overcome these problems, Delaitre et al. (Delaitre et al., 2010) proposed a Bag-of-Features (BoF) approach for human action recognition in still images in combination with Support Vector Machine (SVM) classification. They combined statistical and part-based representations, integrating a person-centric description in a cluttered background.
The poselet-based methods provide rich information about the locations of body parts. They are built on 3D pose images. The poselets were introduced by Bourdev and Malik (Bourdev and Malik, 2009) for person detection in natural settings. A detailed literature review of methods from the fourth category is presented in the following section. A better pose descriptor means wide invariance to various geometric transforms, lighting, viewpoints, and image warping in the general case. This difficult task is the subject of current and future investigations. Our contribution is the proposed color Gauge-SURF with selection of specially imposed grids on the human body image for human action recognition. The original Gauge-SURF was extended with invariance to color and lighting changes.
In the following, Section 2 gives a brief review of high-level and low-level evaluations of poselets. The G-SURF background is presented in Section 3. The proposed methodology of poselet estimation is explained in Section 4. The experimental results are presented and analyzed in Section 5. The paper is concluded in Section 6.

RELATED WORK
The poselet as a subject of interest is a 2D still non-segmented image of configured body parts in 3D space (head, shoulders, arms, torso, legs) that captures a part of the neighbouring background. Thus, the poselet of a single pose is a collection of 2D still images obtained from various shooting viewpoints, at different scales and under different lighting conditions. Sometimes the presence of articulation makes pose estimation harder. Although pose estimation is used for human action recognition, one may consider pose estimation as a separate computer vision task. Figure 1 depicts some examples of human actions.

Figure 1. Human pose examples
Since only spatial information is available in still images, one can represent the information as high-level cues and low-level cues/features. Descriptions of the human body, body parts, action-related objects, human-object interaction, and scene context are included in the high-level cues. Typical low-level features are the Dense sampling of Scale Invariant Feature Transform (DSIFT), HOG, Shape Context (SC), GIST, and some other features. Short surveys of high-level and low-level evaluation methods are given in Sections 2.1 and 2.2, respectively, while existing 3D-based poselet methods are discussed in Section 2.3.

High-level Evaluation
A human body image is an important cue in human action recognition; it can be detected automatically or labelled manually. Usually a bounding box is used to indicate the location of a person. Some approaches extract features in areas within or surrounding the human bounding boxes. Delaitre et al. (Delaitre et al., 2010) defined the person region in each image as one and a half times the size of the human bounding box. These regions are then resized so that the larger dimension is 300 pixels and analyzed using low-level features. Some methods extract contour information of the human body and body parts from still images. Wang et al. (Wang et al., 2006) exploited the overall coarse shape of the human body as a collection of edge points obtained via the Canny edge detector. The obtained features were then classified and categorized into different actions. Semantic features can also be used to describe the actions in images with the human body (Yao et al., 2011). The attributes were related to verbs in human language and remained visual words from the annotation system. In many scenarios, a person interacts with other objects, e.g., a phone, a ball, or an animal. These objects serve as a source of reliable information about the category of human action. Some approaches analyse individual objects separately, while other methods consider them as scene context. As a result, some methods were developed around Human Object Interaction (HOI), for example, the weakly supervised method proposed by Prest et al. (Prest et al., 2012).

Low-level Evaluation
The high-level evaluation is usually based on various low-level scores. The DSIFT features are extracted from many image patches and then clustered to obtain a limited number of "keywords", which are grouped in a codebook. Many methods use the DSIFT features because they can be used directly for the classification of human actions. One can mention the research of Delaitre et al. (Delaitre et al., 2010) and Yao et al. (Yao et al., 2011), among others.
The HOG descriptor is very popular for pedestrian detection (Dalal and Triggs, 2005). The HOG descriptor counts the occurrences of discrete gradient orientations within a local image patch, similarly to the edge orientation histogram, the SIFT, and the SC. The SC is useful for detecting and segmenting the human contour; in addition, this technique is crucial for the high-level cue representation of human body silhouettes.
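The counting of gradient orientations within a patch can be illustrated with a minimal sketch. It assumes central differences, unsigned orientations in the range 0-180 degrees, nine bins, and magnitude-weighted votes; the function name `orientation_histogram` is hypothetical, and the block normalization and vote interpolation of the full HOG descriptor are omitted.

```python
import math


def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `patch` is a 2D list of gray values. Only interior pixels are used,
    because central differences need both neighbours.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    height, width = len(patch), len(patch[0])
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]  # derivative along OX
            gy = patch[r + 1][c] - patch[r - 1][c]  # derivative along OY
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat pixel, no orientation to vote for
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / bin_width), bins - 1)
            hist[index] += magnitude
    return hist
```

For a patch whose intensity grows only along OX, all votes fall into the first bin; a patch growing only along OY votes into the bin that contains 90 degrees.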
The spatial envelope, or GIST, was proposed by Oliva and Torralba (Oliva and Torralba, 2001). A set of spatial properties of a scene can be computed by the GIST method, which provides abstract category representations of a scene based on integrated background information. This approach has been used by Gupta et al. (Gupta et al., 2009) and Prest et al. (Prest et al., 2012), among others. One can also mention other approaches based on SURF (Bay et al., 2008), Circular Histograms of Oriented Rectangles (CHORs) (Ikizler et al., 2008), and AdaBoost classifiers (Gupta et al., 2009).

Towards 3D Representation
The variety of poselet-based methods may be classified according to various criteria. Viewpoint dependence/independence in 2D/3D spaces, respectively, can be considered the main criterion. Bourdev and Malik (Bourdev and Malik, 2009) were the first to formulate the task of human pose estimation and recognition as 3D object representation. They constructed body part detectors trained from annotated data of the joint locations of people and based on patch similarities. As a result, the poselet activation vector, consisting of the poselets inside the bounding box, was introduced. The SVM classifier was used to recognize these patches. The distribution of these joints and person bounding boxes can be obtained for each poselet.
Russakovsky et al. (Russakovsky et al., 2012) developed an Object-Centric spatial Pooling (OCP) approach for detecting an object of interest. The local OCP information is used to pool the foreground and background features. Khan et al. (Khan et al., 2013) carried out a comprehensive evaluation of color descriptors in combination with shape features. Some methods deal with the detection of view-independent objects using 3D object models. Glasner et al. (Glasner et al., 2011) proposed a viewpoint estimation method for rigid 3D objects from 2D images that uses voting for efficient accumulation of evidence. This method was tested on rigid car data. Fidler et al. (Fidler et al., 2012) developed a method for localizing objects in 3D space by enclosing them within tightly oriented 3D bounding boxes. This model represents an object class as a deformable 3D cuboid with anchors of body parts in the 3D box. Hejrati and Ramanan (Hejrati and Ramanan, 2012) developed a two-stage model in which, first, a large number of effective views and shapes are modelled using a small number of local view-based templates and, second, these estimates are refined by an explicit 3D model of the shape and viewpoint.
Another issue is the computational speed required for view-independent object detection, because most methods evaluate classifiers at multiple locations and scales. Sometimes shared features are extracted to reduce the number of classifiers and, correspondingly, the runtime complexity. Razavi et al. (Razavi et al., 2010) used an extension of Hough-based object detection and built a shared codebook by jointly considering several viewpoints. Tosato et al. (Tosato et al., 2010) modelled a human image as a hierarchy of fixed overlapping parts. Each part was trained using a boosted classifier learned with the LogitBoost algorithm. Velaldi et al. (Velaldi et al., 2009) introduced a three-stage SVM classifier combining linear, quasi-linear, and non-linear kernels for object detection.

G-SURF BACKGROUND
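The original SURF detects interest points at locations where the determinant of the approximate Hessian matrix attains its maximum values (Bay et al., 2008). The Hessian matrix H(p, σ) at a point p and scale σ is defined by Equation 1:

\[ H(p, \sigma) = \begin{bmatrix} L_{xx}(p, \sigma) & L_{xy}(p, \sigma) \\ L_{xy}(p, \sigma) & L_{yy}(p, \sigma) \end{bmatrix} \qquad (1) \]

The convolution L_xx(p, σ) of the image I(p) with the Gaussian second-order derivative along the OX direction is determined by Equation 2:

\[ L_{xx}(p, \sigma) = \frac{\partial^2 g(p, \sigma)}{\partial x^2} * I(p) \qquad (2) \]

and similarly for L_xy(p, σ) and L_yy(p, σ) along the diagonal and OY directions, respectively.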
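The Hessian entries and their determinant can be illustrated with a minimal pure-Python sketch. It samples exact Gaussian second-derivative kernels truncated at three standard deviations, whereas SURF itself approximates them with box filters; the function names, the clamped border handling, and the truncation radius are assumptions made for illustration only.

```python
import math


def gaussian_second_derivative_kernels(sigma):
    """Sample the kernels g_xx, g_xy, g_yy of a 2D Gaussian on a square grid."""
    radius = int(math.ceil(3.0 * sigma))  # assumed truncation radius
    s2 = sigma * sigma
    norm = 1.0 / (2.0 * math.pi * s2)
    kernels = {"xx": {}, "xy": {}, "yy": {}}
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            g = norm * math.exp(-(dx * dx + dy * dy) / (2.0 * s2))
            kernels["xx"][(dx, dy)] = (dx * dx - s2) / (s2 * s2) * g
            kernels["yy"][(dx, dy)] = (dy * dy - s2) / (s2 * s2) * g
            kernels["xy"][(dx, dy)] = (dx * dy) / (s2 * s2) * g
    return kernels


def hessian_at(image, row, col, sigma):
    """Return (L_xx, L_xy, L_yy) at one pixel of a 2D list `image`.

    All three kernels are unchanged under (dx, dy) -> (-dx, -dy), so the
    weighted sum below equals the convolution. Borders are clamped.
    """
    height, width = len(image), len(image[0])
    responses = {}
    for name, kernel in gaussian_second_derivative_kernels(sigma).items():
        total = 0.0
        for (dx, dy), weight in kernel.items():
            r = min(max(row + dy, 0), height - 1)
            c = min(max(col + dx, 0), width - 1)
            total += weight * image[r][c]
        responses[name] = total
    return responses["xx"], responses["xy"], responses["yy"]


def hessian_determinant(image, row, col, sigma):
    """Determinant of the 2 x 2 Hessian at one pixel."""
    l_xx, l_xy, l_yy = hessian_at(image, row, col, sigma)
    return l_xx * l_yy - l_xy * l_xy
```

At the centre of a bright isotropic blob both L_xx and L_yy are negative and L_xy vanishes, so the determinant is positive, which is the kind of location a blob detector responds to.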
Alcantarilla et al. (Alcantarilla et al., 2013) developed a novel family of multi-scale local feature descriptors called Gauge-SURF (G-SURF). In this case, every pixel in the image is described by its 2D local structure. The multi-scale gauge derivatives are invariant to rotations and shifts. Additionally, they describe non-linear diffusion processes. The use of G-SURF makes blurring locally adaptive to the region, so that noise is blurred while details and edges remain unaffected. Such local structures are described by Equation 3:

\[ \vec{w} = \frac{1}{\sqrt{L_x^2 + L_y^2}} \, (L_x, L_y), \qquad \vec{v} = \frac{1}{\sqrt{L_x^2 + L_y^2}} \, (L_y, -L_x) \qquad (3) \]

where L_x and L_y are the first-order derivatives of L(p, σ). Using gauge coordinates, one can obtain a set of invariant derivatives of any order and scale. The second-order gauge derivatives L_vv(p, σ) and L_ww(p, σ) are of special interest. They can be obtained as a product of the gradients in the w and v directions and the 2 × 2 matrix of second-order derivatives (the Hessian matrix), as given by Equation 4:

\[ L_{ww} = \frac{1}{L_x^2 + L_y^2} \begin{pmatrix} L_x & L_y \end{pmatrix} \begin{bmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{bmatrix} \begin{pmatrix} L_x \\ L_y \end{pmatrix}, \qquad L_{vv} = \frac{1}{L_x^2 + L_y^2} \begin{pmatrix} L_y & -L_x \end{pmatrix} \begin{bmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{bmatrix} \begin{pmatrix} L_y \\ -L_x \end{pmatrix} \qquad (4) \]

The G-SURF descriptor is based on the original SURF descriptor. Other modifications include the Modified Upright SURF (MU-SURF) descriptor, based on Haar wavelet responses and two Gaussian weighting steps; the Center Surround Extremas (CenSurE), which approximate the bi-level Laplacian of Gaussian using boxes and octagons; and the Speeded Up Surround Extrema (SUSurE), a fast modification of the MU-SURF and CenSurE descriptors for mobile devices. These modifications describe edges, angles, and boundaries of unknown objects in an image well. However, they make little use of color information, which is useful in the analysis of still images.
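The second-order gauge derivatives can be computed directly from Cartesian derivatives. The sketch below assumes the usual closed forms L_ww = (L_x² L_xx + 2 L_x L_xy L_y + L_y² L_yy) / (L_x² + L_y²) and L_vv = (L_y² L_xx - 2 L_x L_xy L_y + L_x² L_yy) / (L_x² + L_y²), with w along the gradient and v perpendicular to it; the function name and the `eps` guard for vanishing gradients are illustrative choices.

```python
def gauge_second_derivatives(l_x, l_y, l_xx, l_xy, l_yy, eps=1e-12):
    """Return (L_ww, L_vv) from first- and second-order Cartesian derivatives.

    Returns None where the squared gradient magnitude is below `eps`,
    because the gauge frame (w, v) is undefined for a vanishing gradient.
    """
    grad_sq = l_x * l_x + l_y * l_y
    if grad_sq < eps:
        return None
    l_ww = (l_x * l_x * l_xx + 2.0 * l_x * l_xy * l_y + l_y * l_y * l_yy) / grad_sq
    l_vv = (l_y * l_y * l_xx - 2.0 * l_x * l_xy * l_y + l_x * l_x * l_yy) / grad_sq
    return l_ww, l_vv
```

A quick sanity check of these forms is that L_ww + L_vv equals the Laplacian L_xx + L_yy for any gradient direction, which reflects the rotation invariance mentioned in the text.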

ESTIMATION OF POSELETS
Poselet estimation is an identification task in which the number of classes is restricted to a finite set of human poses described in feature space by some descriptors. The complexity stems from another issue: the great variety of 3D views of a single pose that are mapped into 2D still images. This problem has not yet been solved, and many authors develop heuristic algorithms that are more or less successful.
The structure of the bounding boxes (in the general case, a grid) for poselet detection is discussed in Section 4.1. The proposed color G-SURF family is presented in Section 4.2. Section 4.3 provides a classification procedure based on random forest in order to identify a test poselet.

Imposed Grid for Poselet Detection
For poselet recognition, it is required to capture a salient part of a person's pose from a given viewpoint and impose a set of corresponding rectangular boxes at a given orientation, position, and scale. Many authors use predetermined aspect ratios, sometimes with normalization of the distance between hips and shoulders. Bourdev and Malik (Bourdev and Malik, 2009) used poselets with the following aspect ratios: 96 × 64, 64 × 64, 64 × 96, and 128 × 64 pixels. Their algorithm was trained using 300 poselets of each type of pose. Then a model predicting a bounding box for each poselet was fitted. Additionally, overlapping bounding boxes with an overlapping area of more than 20% were considered. These scores were added to the basic set. A 1200-dimensional vector for estimating the human pose was constructed; in addition, the poses of the head and torso were considered separately. This approach was developed further by Ko et al. (Ko et al., 2015), who used five aspect ratios (96 × 64, 64 × 64, 64 × 96, 64 × 128, and 128 × 64 pixels) in order to cover variations of human poses. These authors modified the selection algorithm of the action poselets using the modified Hausdorff distance with the Epanechnikov kernel. The descriptor is based on Oriented Center-Symmetric Local Binary Patterns (OCS-LBPs) due to their low computational complexity. Such an approach involves overlapping aspect ratios with preliminary rough segmentation of body parts. Background is included in this grid. The segmentation is often performed manually.
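The 20% overlap criterion for bounding boxes can be sketched as follows. The text does not state how the overlapping area is normalized, so intersection-over-union is assumed here; the box format (x_min, y_min, x_max, y_max) and the function names are also assumptions.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples; this normalization is an
    assumption, since the overlap measure is not specified in the text.
    """
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def overlaps_enough(box_a, box_b, threshold=0.2):
    """True when the overlap exceeds the 20% level mentioned in the text."""
    return overlap_ratio(box_a, box_b) > threshold
```

With a different normalization, for example the intersection divided by the area of the smaller box, only the denominator of `overlap_ratio` would change.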
It makes little sense to use either predefined or random aspect ratios. It is more reasonable to find the Regions Of Interest (ROIs), or pose primitives, accurately. If a video sequence is available, this problem is solved by detecting the ROIs in the temporal domain from previous frames (Favorskaya, 2012; Favorskaya et al., 2015). Hybrid methods based on optical flow and learning of salient regions are also possible (e.g., Eweiwi et al., 2015).
If only a single still image is available, then one of the pixel-based segmentation methods can extract large areas with identical texture, under the assumption that such large areas are body parts. These areas, represented by bounding boxes, are the basis of the SPOD constructions, which are later aggregated into a common SPOD for classification. This task is similar to automatic image annotation. We recommend the J-SEG algorithm or similar ones. If this assumption is not fulfilled, then additional human segmentation methods are required in order to obtain the cropped image of the human body.
Let us suppose that a cropped still image of the human body is obtained and that six aspect ratios are used, covering the head, torso, left arm, right arm, left leg, and right leg. Each of the arm and leg boxes is composed of two sub-boxes, which include the hands and feet. The number of imposed aspect ratios can be reduced depending on the human action and the human position. Moreover, the set of imposed aspect ratios serves as a weak classifier for human action categorization. For example, the active action "phoning" may include an image of the head and either arm, while the active action "running" may involve images of the torso and both legs.
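The idea that the set of imposed boxes acts as a weak classifier can be sketched as a simple lookup. Only the two examples given in the text ("phoning" and "running") are encoded; the rule format, the part names, and the function `candidate_actions` are hypothetical and serve only to illustrate the reasoning.

```python
BODY_PARTS = ("head", "torso", "left_arm", "right_arm", "left_leg", "right_leg")

# Each action maps to a list of groups; an action stays a candidate when at
# least one part from every group is present. Only the two examples from the
# text are encoded; the rule format itself is an illustrative assumption.
ACTION_RULES = {
    "phoning": [{"head"}, {"left_arm", "right_arm"}],
    "running": [{"torso"}, {"left_leg"}, {"right_leg"}],
}


def candidate_actions(visible_parts):
    """Return the actions whose required body-part boxes are all available."""
    visible = set(visible_parts)
    unknown = visible - set(BODY_PARTS)
    if unknown:
        raise ValueError("unknown body parts: %s" % sorted(unknown))
    return sorted(
        action
        for action, groups in ACTION_RULES.items()
        if all(visible & group for group in groups)
    )
```

Such a filter would only narrow down the action categories before the descriptor-based classification; it does not replace it.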

The Proposed Color G-SURF Family
Under various conditions of color/lighting changes in human pose images, it is important to develop a descriptor invariant to these changes. Our contribution is an extension of the original G-SURF, which is invariant to geometrical (affine) distortions. The family of proposed color G-SURF descriptors includes the following components.
The rg G-SURF (rgG-SURF) includes the chromaticity components r and g, which are invariant to scale and light changes. The rg histogram is based on the normalized RGB color model, where the components r and g describe the color information by Equation 5 (b is redundant as r + g + b = 1):

\[ r = \frac{R}{R + G + B}, \qquad g = \frac{G}{R + G + B}, \qquad b = \frac{B}{R + G + B} \qquad (5) \]

where R, G, B = Red, Green, Blue components in RGB color space, respectively
r, g, b = normalized components

Because of normalization, the components r and g are scale-invariant and invariant to lighting changes, shadows, and shading (Gevers et al., 2006). This descriptor has 45 dimensions.
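The normalization behind the rg components can be sketched in a few lines. The function name and the handling of black pixels (mapped to the neutral point r = g = 1/3) are assumptions, since the text does not discuss the zero-intensity case.

```python
def rg_chromaticity(red, green, blue):
    """Return the normalized (r, g) components of one RGB pixel.

    The b component is omitted because r + g + b = 1. A black pixel has no
    defined chromaticity; mapping it to (1/3, 1/3) is an assumption.
    """
    total = red + green + blue
    if total == 0:
        return (1.0 / 3.0, 1.0 / 3.0)
    return (red / total, green / total)
```

Multiplying all three channels by the same factor, as a change in light intensity would, leaves the result unchanged: `rg_chromaticity(10, 20, 30)` and `rg_chromaticity(20, 40, 60)` return the same pair.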
The Opponent G-SURF (OppG-SURF) analyzes three channels in the opponent color space using the G-SURF descriptor (Equation 6):

\[ O_1 = \frac{R - G}{\sqrt{2}}, \qquad O_2 = \frac{R + G - 2B}{\sqrt{6}}, \qquad O_3 = \frac{R + G + B}{\sqrt{3}} \qquad (6) \]

where O_1, O_2 = two channels in the opponent color space providing color invariance
O_3 = the channel in the opponent color space providing intensity invariance

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XL-5/W6, 2015. Photogrammetric techniques for video surveillance, biometrics and biomedicine, 25-27 May 2015, Moscow, Russia

The Hue G-SURF (HueG-SURF) is constructed by concatenation of a hue histogram in the Hue Saturation Value (HSV) color space with a histogram of G-SURF descriptors. Such a descriptor is scale-invariant and shift-invariant with respect to light intensity due to the hue histogram. In the hue histogram, the hue is weighted by the saturation of a pixel, which reflects the instabilities in hue. This histogram has 36 dimensions.
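Both color transforms of this paragraph can be sketched with the standard library. The opponent channels assume the commonly used definitions O_1 = (R - G)/√2, O_2 = (R + G - 2B)/√6, O_3 = (R + G + B)/√3, and the hue histogram assumes 8-bit RGB input with 36 equal-width bins; the function names are hypothetical.

```python
import colorsys
import math


def rgb_to_opponent(red, green, blue):
    """Map one RGB pixel to the opponent channels (O1, O2, O3)."""
    o1 = (red - green) / math.sqrt(2.0)
    o2 = (red + green - 2.0 * blue) / math.sqrt(6.0)
    o3 = (red + green + blue) / math.sqrt(3.0)
    return o1, o2, o3


def hue_histogram(pixels, bins=36):
    """Saturation-weighted hue histogram of 8-bit RGB pixels.

    Each pixel votes for its hue bin with a weight equal to its saturation,
    so nearly gray pixels, whose hue is unstable, contribute very little.
    """
    hist = [0.0] * bins
    for red, green, blue in pixels:
        hue, saturation, _ = colorsys.rgb_to_hsv(
            red / 255.0, green / 255.0, blue / 255.0
        )
        index = min(int(hue * bins), bins - 1)
        hist[index] += saturation
    return hist
```

Adding the same offset to all three channels leaves O_1 and O_2 unchanged, because the offsets cancel in the differences, while O_3 carries the intensity.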
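The transformed RGB color G-SURF (RGBG-SURF) descriptor is computed for each normalized RGB channel by Equation 7:

\[ R' = \frac{R - \mu_R}{\sigma_R}, \qquad G' = \frac{G - \mu_G}{\sigma_G}, \qquad B' = \frac{B - \mu_B}{\sigma_B} \qquad (7) \]

where \mu_R, \mu_G, \mu_B and \sigma_R, \sigma_G, \sigma_B = means and standard deviations of the distribution in RGB channels computed in a chosen region of image, respectively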
The RGBG-SURF is invariant to scale and shift with respect to light intensity (Van de Sande et al., 2009), while a classic RGB histogram is not invariant to changes in lighting conditions. The histogram has 45 dimensions. One can also mention additional color descriptors with some invariance to color/lighting conditions, such as the color moment histogram, the hue-saturation descriptor, the color names, and the discriminative color descriptor, among others. However, the experiments show that the rgG-SURF, the OppG-SURF, the HueG-SURF, and the RGBG-SURF provide better results in poselet estimation.
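The channel-wise normalization used by the transformed RGB representation can be sketched as follows, assuming that each channel of a region is shifted by its mean and divided by its (population) standard deviation. The function name and the treatment of constant channels are illustrative assumptions.

```python
import math


def transformed_rgb(pixels):
    """Normalize each RGB channel of a region to zero mean and unit deviation.

    `pixels` is a non-empty list of (R, G, B) tuples from the chosen region.
    A constant channel (zero deviation) is mapped to 0.0, which is an
    assumption made to avoid division by zero.
    """
    count = len(pixels)
    channels = list(zip(*pixels))
    means = [sum(channel) / count for channel in channels]
    deviations = [
        math.sqrt(sum((value - mean) ** 2 for value in channel) / count)
        for channel, mean in zip(channels, means)
    ]
    return [
        tuple(
            (value - mean) / deviation if deviation > 0 else 0.0
            for value, mean, deviation in zip(pixel, means, deviations)
        )
        for pixel in pixels
    ]
```

Because every channel is standardized independently, replacing each value v by a·v + b with a > 0 gives the same output, which is the scale and shift invariance referred to in the text.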

Classification Procedure
Various classifiers have been applied to object classification, such as SVM, boosting algorithms, and random forest. The SVM classifier is a well-proven technique for general classification. However, SVM is not suitable when the features have high dimensionality and the set of analysed images is huge. Boosting algorithms such as AdaBoost or GentleBoost are popular machine learning methods, but their performance depends critically on the choice of weak classifiers.
A random forest is one of the most popular tree-based classification approaches and is effective in a large variety of high-dimensional tasks such as object detection and object tracking (Ko et al., 2013). The random forest is an ensemble classifier of several randomized decision trees, which is capable of analyzing big data at high training and runtime speeds. Each tree is grown using some type of randomization. The structure of each tree is binary, and all trees are created in a top-down manner (Breiman, 2001). During the training stage, the random forest initially uses a random subset of the training data.

EXPERIMENTS AND DISCUSSIONS
[...] 3, where the scaled fragments of images are represented, this additional procedure helps to solve this problem partly. The analysis of the obtained results shows a common tendency: larger bounding boxes provide a larger number of feature points, which leads to better recognition results. Thus, phoning, playing photo, riding horse, and using computer are characterized by better precision results.
Tables 2 and 3 contain the false rejection rates and false acceptance rates. One can see that the errors have high values. This may be explained by the restrictions of the current problem statement. In the future, additional procedures, e.g., a graph of body parts, skin detection, skeleton representation, Kinect data analysis, and analysis of surrounding objects, among others, are expected to improve on the current results. The current research shows that relying on the analysis of bounding boxes alone limits the estimators significantly. False rejection rates are higher than false acceptance rates. This can be explained by the usual difficulties of object recognition in still images relative to video-based object recognition in the spatio-temporal domain.
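The two error measures can be computed per action category as in the sketch below. It assumes the usual definitions: the false rejection rate is the share of samples of a category that were assigned to another category, and the false acceptance rate is the share of samples of other categories that were assigned to it. The function name and the example labels are hypothetical.

```python
def error_rates(true_labels, predicted_labels, category):
    """Return (false rejection rate, false acceptance rate) in percent."""
    positives = negatives = rejected = accepted = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == category:
            positives += 1
            if prediction != category:
                rejected += 1  # genuine sample of the category was missed
        else:
            negatives += 1
            if prediction == category:
                accepted += 1  # foreign sample was taken for the category
    frr = 100.0 * rejected / positives if positives else 0.0
    far = 100.0 * accepted / negatives if negatives else 0.0
    return frr, far
```

For instance, with true labels `["phoning", "phoning", "running", "reading"]` and predictions `["phoning", "running", "phoning", "reading"]`, the category "phoning" has one of two genuine samples rejected and one of two foreign samples accepted.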

CONCLUSION
In this study, poselet estimation is proposed for the recognition of active human actions. Our approach relies on accurate segmentation of body parts in a still image, either through temporal data extracted from a video sequence, when available, or through a pixel-based segmentation method, in order to obtain the images of body parts. The experiments with G-SURF led to a descriptor invariant to color/lighting conditions. The SPOD is based on the color G-SURF family. Better results were obtained using OppG-SURF and RGBG-SURF. For classification of the common SPOD, random forest classification was used as a fast and effective procedure for working with big data representing various poselets of various active human actions. The PASCAL VOC Dataset and Challenge 2010 provided the material for the training and testing stages. The precision for nine action categories (phoning, walking, running, taking photo, playing instrument, riding bike, riding horse, reading, and using computer) reaches on average 61-68% for all 3D pose locations and 82-86% for front pose locations.
Yang et al. (Yang et al., 2010) developed a coarse example-based poselet representation in which each body part may have more than 20 poselets corresponding to different body poses. They constructed a set of four corresponding body parts L = {l_0, l_1, ..., l_{k-1}} denoting the upper body, legs, left arm, and right arm. A graph is often a good model for representing the relations between different body parts. Raja et al. (Raja et al., 2011) constructed a graphical model containing six nodes: the action label and five body parts corresponding to the head H, right hand RH, left hand LH, right foot RF, and left foot LF. The links between nodes encode action-dependent constraints on the relative positions of body parts.
Notation for Equations 1 and 2:
where p = (x, y)^T = a point in an image I
σ = a scale factor
L_xx(p, σ) = a convolution of an image I(p) at a point p with a Gaussian second-order derivative along direction OX (determined by Equation 2)

Notation for Equation 3:
where w, v = direction vectors
σ = a kernel's standard deviation or scale parameter
L(p, σ) = a convolution of image I(p) with 2D Gaussian kernel g(p, σ)

In the general case, at node n the training data D_n are split iteratively into left and right subsets using a threshold and the split function according to Equation 8:

\[ D_l = \{ v_i \in D_n \mid F(v_i) < th \}, \qquad D_r = D_n \setminus D_l \qquad (8) \]

where D_l, D_r = left and right subsets, respectively
th = a threshold
F(v_i) = a split function
v_i = ith feature vector

The threshold th is selected randomly in the range th ∈ (min F(v_i), max F(v_i)). The use of an ensemble of trees trained with small random subsets increases the speed of training and reduces the amount of overfitting. Because a random forest discards the spatial information of local regions (patches) within a detection window, it produces some false positives, especially when the background has a similar appearance. In this study, the common SPOD, in the form of the concatenated histograms (OX and OY) of color G-SURF descriptors in each bounding box, is used as a feature for random forest classification. In the training stage, each tree T is constructed based on a set of examples Ex_i of a poselet, Ex_i = (e_i, c_i), where e_i is a poselet example and c_i is the class label of the poselet example. The positive poselet examples are marked by the class label c_i = 1, and the negative poselet examples receive the class label c_i = 0.
Samples of the other body parts, including background, are treated as negative poselet examples. During tree construction, each leaf node L stores the class information C_L of the examples. When only positive poselet examples reach a leaf node L, then C_L = 1. The value of C_L is proportional to the number of positive poselet examples. A split function is assigned to each non-leaf node in such a manner that the uncertainty in class labels is reduced towards the leaves. In the testing stage, the descriptor of the current poselet element is checked against the created binary trees by sequential comparison with the value of the split function at each node.
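A single node split and the class information stored in a leaf can be sketched as follows. The sketch assumes that samples with a split-function value below the randomly drawn threshold go to the left subset and the rest to the right one, and that the leaf score is the fraction of positive examples; the function names and the choice of split function are illustrative, not the procedure used in the experiments.

```python
import random


def split_node(samples, split_function, rng):
    """Split the samples of one node with a randomly drawn threshold.

    The threshold is drawn uniformly between the smallest and the largest
    split-function value at this node. Returns (threshold, left, right).
    """
    values = [split_function(sample) for sample in samples]
    threshold = rng.uniform(min(values), max(values))
    left = [s for s, value in zip(samples, values) if value < threshold]
    right = [s for s, value in zip(samples, values) if value >= threshold]
    return threshold, left, right


def leaf_class_score(labels):
    """Fraction of positive examples (label 1) that reached a leaf."""
    return sum(labels) / len(labels) if labels else 0.0
```

A usage example under the same assumptions: `split_node(vectors, lambda v: v[0], random.Random(0))` splits feature vectors on their first component, and `leaf_class_score([1, 1, 0, 1])` gives 0.75 for a leaf reached by three positive examples and one negative example.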

Figure 2. The processed person layouts with detected G-SURF

Table 1. Precision results for nine action categories (%)

Table 1 includes the precision evaluation for nine action categories.

Table 2. False rejection rates for nine action categories (%)

Table 3. False acceptance rates for nine action categories (%)