Multimed Tools Appl https://doi.org/10.1007/s11042-018-6286-9
Egocentric visual scene description based on human-object interaction and deep spatial relations among objects
Gulraiz Khan 1 & Muhammad Usman Ghani 1,2 & Aiman Siddiqi 1 & Zahoor-ur-Rehman 3 & Sanghyun Seo 4 & Sung Wook Baik 5 & Irfan Mehmood 5
Received: 22 March 2018 / Revised: 27 May 2018 / Accepted: 15 June 2018 © Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract Visual scene interpretation has been a major research area in recent years. Recognition of human-object interaction is a fundamental step towards understanding visual scenes. Videos can be described through a variety of human-object interaction scenarios: both human and object are static (static-static), one is static while the other is dynamic (static-dynamic), or both are dynamic (dynamic-dynamic). This paper presents a unified framework for describing these interactions between humans and a variety of objects, with deep learning as the pivot methodology. Human-object interaction is extracted through classical machine learning techniques, while spatial relations are captured by a model trained with a convolutional neural network. We also address the recognition of human posture in detail to provide an egocentric visual description. After extracting visual features, sequential minimal optimization is employed for training our model. The extracted interaction, spatial relations and posture information are fed into a natural language generation module, along with the interacting object label, to generate the scene description. The proposed framework is evaluated on two state-of-the-art datasets, MSCOCO and the MSR 3D Daily Activity dataset, achieving 91.16% and 78% accuracy, respectively.

Keywords Scene description . Classification . Surveillance . Human-object interaction . Spatial relations . Deep neural network
* Irfan Mehmood
[email protected];
[email protected]
1 Al-Khwarizmi Institute of Computer Science, UET, Lahore, Pakistan
2 Department of Computer Science and Engineering, University of Engineering and Technology Lahore, Lahore, Pakistan
3 Department of Computer Science, COMSATS University Islamabad, Attock Campus, Pakistan
4 Department of Media Software, Sungkyul University, Anyang-si, South Korea
5 Department of Software, Sejong University, Seoul, South Korea
1 Introduction

Comprehensive visual scene interpretation becomes challenging if it depends solely on the individual components present in an image. Spatial and temporal relations among the visual contents of the image play a fundamental part in understanding the complete video sequence. In order to mimic human vision, the relationships between individual components within the video sequence should be considered and modelled in any visual scene interpretation paradigm [1]. The individually recognized image components, such as objects, humans and locations, need to be put together by an automated system to present an illustration of the entire scene. An object is closely related to its environment and to other spatiotemporal factors present in the visual scene.

Recent work on describing video contents in natural language involves a comprehensive description of the detected objects in the scene, the humans present along with the actions they perform, and also their posture. Understanding a scene also involves details of the interactions between the different living and non-living things identified in an image or a video. A classification of the requirements of scene description is shown in Fig. 1, which also shows the classes of actions, postures and object interactions targeted in this paper. Spatial relations between objects summarize the objects present with respect to their spatial coordinates, whereas temporal relations explain the presence of, and change in, the coordinates of the objects over time. Spatial relations between objects can be divided into three further categories: static-static interaction, static-dynamic interaction and dynamic-dynamic interaction. These three sets of interactions lead to a plethora of applications and play a vital role in the complete understanding of any video sequence. Object-to-object spatial relations and human-object interaction have importance and applications in surveillance, child monitoring, scene description, robotics, image retrieval, traffic control, monitoring systems and discipline control in institutions.

Knowledge of object interactions can be obtained using spatial features such as depth, horizontal and vertical distances between objects, and also the measure of their co-occurrence with a relation predefined in language [15, 18].
Fig. 1 Basic components of scene description
Most researchers have used a Bayesian approach and a 3D video dataset to analyse the relations between static and moving objects [3, 20]. In cases where a human is one of the objects, pose estimation has played a key role in understanding the action being performed and, consequently, the object in contact with the human.

This research work is an effort towards incorporating static-static and static-dynamic interactions between the contents of video sequences. Static-static interaction deals with still images as input. The methodology takes a segmented image of the interacting objects and passes the processed image through a convolutional neural network (CNN). Our CNN employs five convolution layers followed by three fully connected (FC) layers; the last FC layer predicts one of 11 spatial relations, and if no relation exists, none is returned. In the static-dynamic interaction module, spatial and temporal features across a group of frames are used. Spatial features include histogram of oriented gradients (HOG) features, mutual distances, the object label, centroid movement, the size of the bounding box and the region of maximum movement. A sequential minimal optimization (SMO) classifier is applied for classification of visual scenes, as it achieved the highest accuracies in our experiments.

In the literature, researchers have extracted static-static and static-dynamic relations individually [3, 18, 19]; very little effort has been made to join the two. The proposed system combines a deep learning based static-static relation module and a machine learning based static-dynamic relation module with the help of natural language generation (NLG) techniques. NLG makes the system more robust in terms of automatic sentence generation from individual words by placing them in the correct order and inserting missing connecting words.

Section 2 provides a literature survey of previous methodologies used for detecting interactions and spatial relations. The detailed methodology is described in Section 3 along with the complete framework. Sections 4 and 5 cover the experimental setup and evaluation of our system, followed by the conclusion in the last section.
2 Literature survey

Our work is motivated by the interactions between different categories of objects based on their static or dynamic nature. Considering the human as the most dynamic object, the static-dynamic relation directly maps to human-object interaction. Gupta et al. [3] proposed a Bayesian approach that depends upon various perceptual tasks to understand human-object interaction. They utilized spatial and functional constraints for comprehensive interpretation, and they also recognize actions in static images without using motion information. Yang and Tian [20] employed 3D joints recovered from the 3D depth data of an RGBD camera for human-object interaction identification. They calculated differences of skeleton joints to capture static posture, motion properties and human dynamics. A Naïve Bayes nearest neighbour classifier is used to classify actions on informative frames determined by accumulated motion energy. The system is evaluated on three datasets: MSR Action3D [10], the Cornell Human Activity dataset [16] and the UCF Kinect dataset [2]. Zanfir et al. [21] captured a moving pose descriptor containing information about motion along with temporal information on the speed and acceleration of human body joints. K-nearest neighbours is applied to the moving pose descriptor to classify frames. Their framework is real-time, scalable and tested on MSR Action3D and MSR Daily Activity 3D [17]. Xia et al. [19] calculated histograms of joint locations as reference posture information for human action detection.
Joint locations are extracted from the depth map, followed by linear discriminant analysis (LDA) applied to the calculated histograms; the results are finally clustered into k posture visual words representing different actions. These visual words are modelled by hidden Markov models for classification. The methodology is tested on their self-generated dataset and the MSR dataset. Aydemir et al. [1] worked on the basic spatial relations "on" and "in". They incorporated these two topological relations into a moving robot to impart artificial vision to it. They evaluated the effect of gravity by making use of the distance between the two involved objects, the depth of penetration (if any), the horizontal distance between the centre of gravity and the point of contact of the two objects, and the angle of inclination. Probabilistic models can also be employed to account for the co-occurrence of a pair of objects with a particular relation; this idea was used in [18]. They clustered objects based on the similarity of their spatial locality and the relation of the cluster with the objects in contact, e.g. a few stationary items "on" the table. The localization of objects is expressed as a probability density function (PDF). Already known relations between some pairs of objects are fed into the system, and the conditional probability of the occurrence of one object with respect to another then outputs this pre-entered relation between them.

Karpathy and Fei-Fei [8] directly used bi-directional recurrent neural networks (BRNN) for spatial relation recognition and scene description. They used images of the MS COCO, Flickr8k and Flickr30k datasets and passed them into the BRNN along with annotations generated through Amazon Mechanical Turk (AMT), where each image was accompanied by five descriptive sentences. Jain et al. [7] employed an information acquisition model using a feature set consisting of the size, color, position and count of the objects present in the image and the background of the objects. By ranking the images and comparing them with the trained images via the information acquisition model, they devised a means to generate textual descriptions of images.

Deep learning based architectures are currently useful in almost every field. A two-tier authentication system was proposed by Sajjad et al. [14], which guarantees a check on the secure login of the user and separates spoofed users from normal ones. They combined fingerprint, face recognition and palm vein print as the input to CNN-based training; the resulting model then acts as the core of the system. Hamza et al. [4] used a key-frame extraction technique to summarize a video and a 2D logistic map to create cryptographic keys; their algorithm can be applied from different security perspectives. Muhammad et al. [12] devised an efficient and robust method of image encryption. The input was taken from video summarization techniques in the form of key frames, which were then encrypted for privacy and security using a probabilistic encryption technique; speed tests justified the high speed of their encryption algorithm. Hamza et al. [5] proposed a video summarization system for outdoor patients going through the wireless capsule endoscopy (WCE) process. A light-weight video summary generation technique was employed to extract important frames.
These important integral images were utilized for feature computation, making their system more suitable for real-time summarization.
Table 1 Comparison of different methodologies with the proposed system

Paper | Features | Classifier | Dataset | Domain | Accuracy
[19] | LDA, Histogram of 3D joints (HOJ3D) | Hidden Markov model (HMM) | MSR Action 3D | Human-object interaction | 90.92%
[20] | Eigen Joints | Naïve Bayes Nearest Neighbour (NBNN) | Cornell university dataset, UCF Kinect dataset, MSR Action 3D | Human-object interaction | 83.3%
[8] | Images & textual description | B-RNN | MSCOCO, Flickr8k, Flickr30k | Scene description | 16, 15, 19 (B-4)
[7] | Size, color, position, background, number of objects | Information acquisition model | – | Scene description | –
[18] | Simulated data | Gaussian Mixture model (GMM) | – | Human-object interaction | 84.72%
Proposed method | HOG features, mutual distances, object label, centroid movement, size of bounding box, region of maximum movement | CNN, SMO | MS COCO, MSR Daily Activity, HMDB51 dataset, self-generated dataset | Spatial relations, human-object interactions, scene description | 91.16%, 88%
Table 1 presents a detailed comparison of the literature with the proposed system. No existing system combines static-static relations, static-dynamic relations and posture to describe a scene setting. The proposed system first analyses different aspects of a group of frames and then concatenates all aspects using NLG techniques to describe the scene. It uses an amalgam of machine learning and deep learning techniques to describe a scene comprehensively.
3 Methodology

The implemented system combines static-static object relations from single frames and static-dynamic relations between objects, including human-object interaction, from video sequences to generate a comprehensive description of the scene at hand. The final result is an NLG-based textual description. The overall framework is shown in Fig. 2.
3.1 Static-dynamic interaction

In a static-dynamic object interaction, one object is stationary with no independent movement and the other object moves independently. Among dynamic objects, the most significant and interesting one is the human; considering the human as the dynamic object, our static-dynamic interaction module mainly focuses on human-object interaction. Human-object interaction is mostly of two types: independent and dependent. In the former, only the human body moves, with no movement of the interacting object (e.g. a human sitting on a sofa), while in the latter the interacting object also moves (e.g. pointing a remote, calling on a phone, or eating something).
Fig. 2 Framework diagram of complete system showing sequence of steps from input video to textual description
The proposed human-object interaction work is divided into two parts: posture computation of the human and interaction with the object. Both constituents carry their own weight in determining the human-object interaction.
Posture calculation of the human Finding the posture of the human body is relatively simple in our case, as we consider six distinct classes: standing up, sitting down, laying down, standing, seated and laying. The first three represent a change of position from one posture to another; for example, standing up is detected if the human changes posture from seated to standing. Similarly, in sitting down and laying down, the human changes posture from standing to seated and from seated to laying, respectively. In seated, standing and laying there is no change in the posture of the human body. To determine the posture over a group of 25 frames, the posture status of each individual frame is computed using Eq. 1:

P = \begin{cases} 0, & HWR > 2 \\ 0.5, & SD > 0.5 \\ 1, & \text{otherwise} \end{cases}    (1)

where HWR and SD are defined by Eqs. 2 and 3, respectively:

HWR = \frac{H}{W}    (2)

SD = \frac{1}{N} \sum_{i=0}^{N} f(x_i)    (3)
where HWR is the height-to-width ratio of the human rectangle detected by the darknet deep learning framework You Only Look Once (YOLO) [13], H is the height and W is the width of the rectangle. In Eq. 3, SD is the silhouette density, N is the total number of pixels in a frame, x is a pixel value, and f(x), defined by Eq. 4, checks whether a pixel is bright, with 200 taken as the threshold for a bright pixel:

f(x) = \begin{cases} 1, & x > 200 \\ 0, & \text{otherwise} \end{cases}    (4)
The posture statuses of the 25 frames are combined to find the overall posture using Algorithm 1.
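As a concrete illustration, the following Python sketch computes the per-frame posture status from the detector box and the silhouette and combines the 25 statuses into an overall posture label. It assumes the thresholds of Eqs. 1-4 as given above; the combination step is only one plausible reading of Algorithm 1, and the mapping of status values to posture names is our assumption.

```python
import numpy as np

BRIGHT = 200  # Eq. 4: a pixel is treated as bright if its value exceeds 200

def posture_status(box, silhouette):
    """Per-frame posture status (Eqs. 1-3). box = (x, y, w, h) of the detected person;
    silhouette = 2-D grayscale silhouette image of the same frame."""
    x, y, w, h = box
    hwr = h / float(w)                                          # Eq. 2: height-to-width ratio
    sd = (silhouette > BRIGHT).sum() / float(silhouette.size)   # Eq. 3: silhouette density
    if hwr > 2:          # tall, narrow box -> standing-like status
        return 0.0
    if sd > 0.5:         # dense silhouette -> seated-like status
        return 0.5
    return 1.0           # otherwise laying-like status

def combined_posture(statuses):
    """Combine the 25 per-frame statuses: a change between the first and the last
    frame is read as a transition class, otherwise the static class is returned."""
    static = {0.0: "standing", 0.5: "seated", 1.0: "laying"}
    first, last = statuses[0], statuses[-1]
    if first == last:
        return static[last]
    transitions = {(0.5, 0.0): "standing up", (0.0, 0.5): "sitting down", (0.5, 1.0): "laying down"}
    return transitions.get((first, last), static[last])

# Example with a synthetic 25-frame window:
frames = [((100, 50, 80, 240), np.full((480, 640), 255, dtype=np.uint8))] * 25
print(combined_posture([posture_status(b, s) for b, s in frames]))   # -> "standing"
```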
3.2 Feature extraction

3.2.1 Static-static spatial relations between objects

The object boundaries returned by YOLO [13] are passed through an image segmentation step that keeps the detected object regions as foreground and treats the rest of the image as background. The resulting segmented image contains only the pixels of the two objects, with all remaining pixels set to black. A pixel is nominated as foreground on the basis of Eq. 5; Fig. 3 shows the segmentation results for two objects in each sample image.

g(x, y) = \begin{cases} f(x, y), & (x, y) \in Obj_1 \text{ or } (x, y) \in Obj_2 \\ 0, & \text{otherwise} \end{cases}    (5)
where g(x, y) is the segmented image, f(x, y) is the original image, and x and y are the horizontal and vertical coordinates, respectively. The pre-processed image is used to train our deep convolutional network model. The proposed system for detecting spatial relations takes a segmented image as input and passes it through a convolutional neural network.
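A minimal sketch of the segmentation step of Eq. 5, assuming the YOLO detections are available as (x, y, w, h) pixel rectangles (names and box format are illustrative, not the authors' code):

```python
import numpy as np

def segment_pair(image, box1, box2):
    """Eq. 5: keep only the pixels inside the two detected object rectangles and set
    everything else to black. Boxes are (x, y, w, h) rectangles in pixel units."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    for x, y, w, h in (box1, box2):
        mask[y:y + h, x:x + w] = True
    segmented = np.zeros_like(image)
    segmented[mask] = image[mask]        # g(x, y) = f(x, y) inside Obj1 or Obj2, 0 elsewhere
    return segmented

# Example: blank out everything except a person box and a chair box.
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
pair_image = segment_pair(frame, (30, 40, 120, 260), (200, 180, 150, 160))
```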
Fig. 3 Input image samples with segmented image
The network contains five convolution layers, each with rectified linear units (ReLU) as the activation function, which accelerates the convergence of the network. ReLU is a simple thresholding function, f(x) = max(0, x): it converts negative inputs to zero and leaves positive inputs unchanged. Feature extraction in the convolution layers is followed by pooling and normalization to fine-tune the extracted features. Max pooling selects the maximum value in a 3 × 3 window and is followed by local response normalization, which normalizes all values to a specific range. Features from the last convolution layer are passed through three fully connected layers that combine the features to produce the final spatial relation class. We use three fully connected layers to generate results for 12 classes.
3.2.2 Network architecture

The architecture for spatial relation extraction starts with an input image of 227 × 227 pixels with 3 channels. The input image is passed through five convolution layers and three fully connected layers. The complete architecture is shown in Fig. 4 and detailed in the list below (a code sketch follows the list):

1. In the first convolution layer, 96 kernels of size [3 × 11 × 11] are applied to the input image with a stride of 4, followed by ReLU and a 3 × 3 max pool layer with a stride of two pixels. After convolution, the output blob of [96 × 55 × 55] is reduced by pooling to [96 × 27 × 27].
2. The output of the first layer acts as input to the second layer, which takes an input of shape [96 × 27 × 27] and applies 256 filters of shape [96 × 5 × 5] at stride 1, producing a blob of shape [256 × 24 × 24]; a max pool layer with a [3 × 3] kernel and a stride of two then produces a [256 × 13 × 13] blob.
3. In the third convolution layer, 384 filters with kernel size [256 × 3 × 3] are applied to the [256 × 13 × 13] input with padding and stride both set to 1, followed by ReLU. This layer produces an output volume of [384 × 13 × 13].
4. In the fourth convolution layer, 384 kernels of size [384 × 3 × 3] with stride and padding set to 1 are applied, followed by ReLU. This layer produces an output volume of size [384 × 13 × 13].
5. The fifth layer applies 256 kernels of size [384 × 3 × 3] to the input blob of shape [384 × 13 × 13].
6. The convolution layers are followed by fully connected (FC) layers. The first fully connected layer (the sixth layer) receives input from the fifth convolution layer and contains 4096 neurons, followed by ReLU. A drop-out layer is added to avoid overfitting.
7. The seventh layer takes input from the previous FC layer and produces an output of 4096 neurons; ReLU and drop-out are applied to make training robust.
Fig. 4 Architecture diagram for finding spatial relation
8. The eighth layer of our architecture acts as the score-calculation layer for the spatial relations. It generates 12 outputs, which are further used for loss calculation during training.
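For illustration, the eight layers listed above can be written down as follows. PyTorch is used here only as a convenient notation; the padding values, the placement of the LRN/pooling layers, the assumed final pooling step and the inferred flattened size are assumptions, since the paper specifies only kernel counts, kernel sizes and strides.

```python
import torch
import torch.nn as nn

class SpatialRelationNet(nn.Module):
    """Sketch of the eight-layer network described in the list above."""
    def __init__(self, num_relations=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),      # layer 1
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),    # layer 2
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # layer 3
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # layer 4
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # layer 5
            nn.MaxPool2d(kernel_size=3, stride=2),  # assumed final pool to keep the FC input small
        )
        with torch.no_grad():  # infer the flattened size for a 227 x 227 input
            n_flat = self.features(torch.zeros(1, 3, 227, 227)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n_flat, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),        # layer 6 (FC)
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),          # layer 7 (FC)
            nn.Linear(4096, num_relations),                                         # layer 8: 12 relation scores
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

scores = SpatialRelationNet()(torch.zeros(2, 3, 227, 227))   # -> tensor of shape [2, 12]
```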
3.3 Human-object interaction features

Human-object interaction extraction incorporates multiple features, calculated from the detected human location and region, the human silhouette and the object rectangle. There are 12 basic features plus HOG features used for finding the interaction. The basic features include the object label detected by YOLO [13]; the label of the interacting object is one of the most telling features for human-object interaction (for example, in the case of a cell phone, a human can point, call or hold). Secondly, the average position of the object centroid over all 25 frames gives information about the object's spatial location; it is calculated with Eq. 6, where OR is the object rectangle with w and h as its width and height. After finding the centroid of the object, we calculate the distance moved by the centroid in the x and y directions separately, as in Eq. 7, where (C_{x1}, C_{y1}) and (C_{x25}, C_{y25}) are the centre points of the object in the first and last frames, respectively.

C(x, y) = \left( OR_x + \frac{OR_w}{2},\; OR_y + \frac{OR_h}{2} \right)    (6)

Dist(x, y) = \left( C_{x1} - C_{x25},\; C_{y1} - C_{y25} \right)    (7)
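The centroid features of Eqs. 6 and 7 reduce to a few lines; boxes are assumed to be (x, y, w, h) rectangles for the tracked object, one per frame of the 25-frame window.

```python
def centroid(box):
    """Eq. 6: centre of an (x, y, w, h) object rectangle."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def mean_centroid(boxes):
    """Average object position over the 25-frame window."""
    cs = [centroid(b) for b in boxes]
    return (sum(c[0] for c in cs) / len(cs), sum(c[1] for c in cs) / len(cs))

def centroid_displacement(boxes):
    """Eq. 7: movement of the centroid between the first and the last frame."""
    (cx1, cy1), (cx25, cy25) = centroid(boxes[0]), centroid(boxes[-1])
    return (cx1 - cx25, cy1 - cy25)

# Example: an object drifting to the right over 25 frames.
track = [(100 + 2 * t, 200, 40, 40) for t in range(25)]
print(mean_centroid(track), centroid_displacement(track))   # -> (144.0, 220.0) (-48.0, 0.0)
```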
After computing the centroid relationship, we find the object-human intersection area and region. These two parameters are important for detecting interactions with smaller objects such as remotes and cell phones. To find the region, we divide the whole human body into multiple regions and check which region overlaps most with the interacting object in each frame, as shown in Fig. 5. The rationale behind this division is a logical distribution of human body parts into lower torso, upper torso, left body region and right body region. The width, height and aspect ratio of the interacting object are also useful for finding the interaction between human and object.

Fig. 5 Human body regions division

After finding the intersection of the object with the human body, we also obtain the most-moving part of the human body, which in most cases is the hand region. A Gaussian mixture model (GMM) is employed to detect the moving part of the human body: assuming the same scene setting over a short period of 25 frames, some pixel values belong to the dynamic foreground (in the case of eating, calling, pointing, typing or cutting, the most dynamic part is the hand). Considering the time period of 25 frames as T, we have F_T = f_1 … f_25; for each new group of frames we approximate the foreground using a GMM with M components (Eq. 8):

P(f \mid F, BG + FG) = \sum_{n=1}^{M} \pi_n \, \mathcal{N}(f;\, \mu_n,\, \sigma_n^2 I)    (8)

where μ and σ represent the mean and variance, respectively, and I is the identity matrix. The most-moving part is obtained with an argmax over the extracted foreground. The intersection area of the computed foreground and the object region is used as a prime feature in human-object interaction.
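The moving-region feature can be sketched with OpenCV's GMM background subtractor, which stands in for the mixture model of Eq. 8 (the authors' exact GMM implementation is not given, so this is an illustrative substitute):

```python
import cv2
import numpy as np

def moving_overlap(frames, obj_box):
    """Moving-part feature: a GMM background model (OpenCV's MOG2, standing in for
    Eq. 8) marks the dynamic foreground over the 25-frame window, and the overlap of
    that foreground with the object rectangle is returned as the feature value."""
    mog = cv2.createBackgroundSubtractorMOG2(history=25, detectShadows=False)
    foreground = np.zeros(frames[0].shape[:2], dtype=np.uint8)
    for frame in frames:                         # accumulate the foreground mask
        foreground = cv2.bitwise_or(foreground, mog.apply(frame))
    x, y, w, h = obj_box
    return float((foreground[y:y + h, x:x + w] > 0).sum())   # moving pixels inside the object box
```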
Fig. 6 GUI for Scene Description

3.4 Classification of human-object interaction

The features obtained from our system are high-dimensional and require an efficient algorithm that handles such data effectively. SMO is one of the best options available for high-dimensional data. SMO is based on the quadratic programming problem used in SVMs, but it splits the problem into small pieces and analytically optimizes two Lagrange multipliers at a time, which decreases the complexity of the quadratic problem. The SVM dual quadratic problem can be considered the starting point for SMO and is given by Eq. 9:

\max_{\lambda} \left\{ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \, x_i \cdot x_j \right\}    (9)

where 0 < λ_i < C and \sum_{i=1}^{n} y_i \lambda_i = 0; λ is the Lagrange multiplier, x is the input data and y is the class label. SMO considers two Lagrange multipliers at a time, treating all the others as constant. Eq. 9 then reduces to Eq. 10, and all other multipliers can be computed in the same way:

\lambda_1 y_1 + \lambda_2 y_2 = -\sum_{i=3}^{n} \lambda_i y_i = c    (10)
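In practice, the SMO-trained classifier can be realised with scikit-learn's SVC, whose underlying LIBSVM solver is SMO-based. The training data below are random placeholders so the sketch runs; in the real system each row would be the 12 basic features concatenated with the HOG descriptor of one 25-frame window, and the kernel and C values are assumptions not reported in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 60 windows with 120-dimensional feature vectors and 6 interaction classes.
rng = np.random.default_rng(0)
X = rng.random((60, 120))
y = rng.integers(0, 6, size=60)

# SVC's LIBSVM solver is SMO-based, so it plays the role of the SMO classifier described above.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print(clf.predict(X[:3]))                 # predicted interaction classes for three windows
```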
3.5 Scene description

The proposed work aims to describe the environmental scene, incorporating the human-object interaction in the video and the spatial relations between objects. It attempts to specify what is going on in a video rather than giving only object- and texture-based scene predictions. We combine the information extracted from the interaction module, object detection, posture identification and spatial relations to generate a comprehensive textual description of the scene (Fig. 6). We use Eq. 11 to describe a scene from the visual frames:

S = H + \text{" is "} + p + \text{" and "} + I + O    (11)
where S is the scene description, H stands for the human, p is the posture, I is the interaction and O is the detected object in the frames. The spatial relation is used when there is no interaction between the human and the object; in that case the scene description involves the spatial relation, as in Eq. 12:

S = H + \text{" is "} + p + S.R + O    (12)
where S.R is the spatial relation between the human and the object. Groups of frames in which no human is detected are described with the spatial relation between objects, since no interaction is involved (Eq. 13):

S = O_1 + \text{" is "} + S.R + O_2    (13)
where O_1 and O_2 are the first and second objects, respectively.
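A minimal sketch of the sentence-assembly step of Eqs. 11-13 (the actual NLG module also reorders words and inserts missing function words, which is not reproduced here):

```python
def describe(human=None, posture=None, interaction=None, obj=None,
             spatial_relation=None, obj2=None):
    """Pick the template of Eq. 11, 12 or 13 depending on the available information."""
    if human and interaction:                        # Eq. 11: human-object interaction
        return f"{human} is {posture} and {interaction} {obj}"
    if human and spatial_relation:                   # Eq. 12: human-object spatial relation
        return f"{human} is {posture} {spatial_relation} {obj}"
    return f"{obj} is {spatial_relation} {obj2}"     # Eq. 13: object-object spatial relation

print(describe(human="person", posture="seated", interaction="holding", obj="book"))
# -> "person is seated and holding book"
print(describe(obj="cup", spatial_relation="on", obj2="table"))
# -> "cup is on table"
```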
4 Experimentation

We used two video datasets to evaluate human-object interaction and one dataset to evaluate spatial relations: the MSR 3D Daily Activity and HMDB datasets for human-object interaction, and MSCOCO for finding spatial relations in frames. We used 10-fold cross-validation to evaluate both techniques.
Fig. 7 GUI for annotation tool
4.1 Annotation

To train our model for the detection of static-static spatial relations between objects, we used the MS COCO dataset [11]. This dataset contains images with no details about the spatial relations between the objects present in an image, so we used supervised training to teach our deep learning model the relations present in each image. To enable supervised learning, we need an annotated relation for each pair of objects in an image. Annotating images manually is a tedious task that requires a lot of human effort to check and mark each relation; furthermore, since humans are prone to errors, an annotator might miss objects or record a wrong relation. To avoid such errors and reduce human effort, we developed a semi-automatic annotation tool that shows two marked objects in an image at a time. With this tool, the annotator does not need to deal with the number of objects, their labels or their locations; he or she only provides the label for the static relation between the highlighted objects. A text file with the same name as the image and a .txt extension is generated, which stores the location, label and relation of each pair of objects. Fig. 7 shows the GUI of our annotation tool, which has three sections: an image area with the marked pair of objects, a drop-down area for specifying the relation, and a button to store the relation and move to the next pair. The number of relations present in an image is defined by Eq. 14, with N as the number of relations and C as the number of objects detected by YOLO [13]. For the sample image in Fig. 8, three different objects are identified, so the number of possible relations is six. There are cases in which a relation holds in only one direction; for example, a tie is on a human, but there is no relation in the opposite direction.

Fig. 8 Sample annotated image with annotation file
Fig. 9 Sample images from MSCOCO 2014 dataset
Fig. 10 MSR sample dataset instances: (a) sample images for eating; (b) sample images for laying
Fig. 11 Sample video frames from the HMDB51 dataset (riding bike)
All possible relations for the sample image are shown in Table 3. The number of relations is

N = C(C - 1)    (14)

Each pair obtained from Eq. 14 is assigned one of the eleven possible relations if a relation exists; otherwise the human annotator marks it as none.
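A sketch of the pairwise bookkeeping behind Eq. 14: for C detected objects the tool iterates over all C(C − 1) ordered pairs and appends one line per pair to a .txt file named after the image. The exact file layout and the ask_relation callback are assumptions for illustration; the paper states only that the location, label and relation of each pair are stored.

```python
from itertools import permutations

def annotate_pairs(image_name, detections, ask_relation):
    """detections: list of (label, (x, y, w, h)); ask_relation returns one of the
    11 relations or 'none' for the highlighted ordered pair."""
    with open(image_name.rsplit(".", 1)[0] + ".txt", "w") as out:
        for (lab1, box1), (lab2, box2) in permutations(detections, 2):
            relation = ask_relation(lab1, box1, lab2, box2)
            out.write(f"{lab1} {box1} {relation} {lab2} {box2}\n")

# Example with three detections, as in the sample image of Fig. 8 (6 ordered pairs).
detections = [("person", (30, 40, 120, 260)), ("tie", (70, 90, 20, 60)), ("chair", (200, 180, 90, 110))]
annotate_pairs("sample.jpg", detections, lambda l1, b1, l2, b2: "none")
```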
4.2 Datasets

4.2.1 MS COCO 2014 objects dataset

For static-static spatial relations between objects, we used a subset of the MS COCO 2014 dataset [11]. The subset comprises 500 images chosen at random, which yields approximately 1100 pairwise spatial relations. The published dataset consists of 82,783 training images, 40,504 validation images and 40,775 test images.
Fig. 12 Sample video frames from the self-generated dataset: (a) sample frames for lifting a bag; (b) sample frames for pulling a chair
Table 2 Comparison of different methodologies for the MSR daily activity dataset

Method | Year | Classifier | Number of classes | Accuracy (%)
Sequential max-margin event detectors [9] | 2014 | Sequential max-margin event detectors | 12 | 73.2
The Moving Pose [15] | 2013 | KNN | 16 | 73.8
Actionlet Ensemble (3D pose only) [17] | 2012 | Actionlet Ensemble | 16 | 85.5
Our proposed approach | 2018 | SMO | 6 | 86.6
All images cover 80 object categories in their natural context. The images were gathered so that objects appear in their actual usage and in real-life scenarios, with varying camera orientation and scale. The annotated object labels are based on the names used by laypeople in their daily routines. The dataset has a large number of instances per class, which varies from one category to another; some sample images are shown in Fig. 9. The dataset provides the detected objects in an image, and using our customized annotation tool for spatial relations between pairs of objects, we used these data to classify the topological relations between objects in a static image. The proposed system is mainly concerned with real-life scene description based on static-static, static-dynamic and dynamic-dynamic relations, and we chose MS COCO because it is one of the best representatives of real scenes.
4.2.2 MSR daily activity dataset

The MSR Daily Activity 3D dataset is used for detecting human-object interaction in our system. The dataset was developed by Jiang Wang at the Microsoft Research Redmond lab in 2012 [17]. It contains 16 different activities: call cell phone, read book, write on a paper, eat, drink, use laptop, cheer up, use vacuum cleaner, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up and sit down.
Table 3 Confusion matrix for MS COCO topological relations

 | on | above | across from | behind | in | front of | inside | left of | right of | under | below | none
on | 71 | 4 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 1 | 0
above | 1 | 101 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0
across from | 0 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3
behind | 0 | 0 | 0 | 77 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0
in | 7 | 9 | 0 | 0 | 80 | 0 | 0 | 0 | 0 | 0 | 0 | 0
front of | 0 | 0 | 0 | 0 | 0 | 68 | 0 | 1 | 1 | 0 | 0 | 0
inside | 0 | 1 | 0 | 0 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0
left of | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 178 | 0 | 0 | 0 | 4
right of | 0 | 0 | 2 | 2 | 0 | 5 | 0 | 0 | 175 | 0 | 0 | 5
under | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 4 | 0
below | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 55 | 0
none | 4 | 0 | 0 | 2 | 0 | 1 | 0 | 5 | 3 | 1 | 3 | 146
Table 4 Confusion matrix for six classes of the MSR daily activity dataset

 | Eating | Drinking | Phone call | Reading | Typing | Holding
Eating | 8 | 3 | 0 | 0 | 0 | 0
Drinking | 2 | 7 | 0 | 0 | 0 | 0
Phone call | 0 | 0 | 9 | 0 | 0 | 0
Reading | 0 | 0 | 0 | 9 | 0 | 1
Typing | 0 | 0 | 1 | 0 | 10 | 0
Holding | 0 | 0 | 1 | 0 | 0 | 9
Most of the subjects were captured in two poses, sitting and standing. We chose this dataset because it contains the most common human-object interaction videos, and it is challenging because it comprises daily activities in a living room. There are 320 videos captured from 10 subjects at 640 × 480 resolution. We selected six interaction classes from the MSR dataset: eat, drink, read, call cell phone, use laptop (typing) and holding (the book-reading videos were split into holding and reading); we limited ourselves to six classes because of the limitations of our object detector. Fig. 10a and b show sample frames from the MSR daily activity dataset.
4.2.3 HMDB51 dataset

The HMDB51 dataset is used to evaluate our dynamic-dynamic interaction sub-module. It contains 51 action classes collected from videos downloaded from the internet; each action class consists of at least 101 video clips, for a total of 6766 videos [9]. Each video clip was validated by two human evaluators to ensure consistency. The dataset also includes annotations describing the camera viewpoint, video quality, camera motion (moving or static) and the number of actors involved in each video. Out of the 51 action classes, we used two for evaluating dynamic-dynamic interaction: riding bike and riding horse. Sample images from HMDB51 are shown in Fig. 11.
4.2.4 Self-generated dataset

A self-generated dataset is used for five object interactions that are not explicitly found in MSR and HMDB51: pushing a chair, pulling a chair, lifting a bag, dropping a bag and cutting with a knife. Ten videos were recorded for each class, giving 50 videos in our self-generated dataset; ten different subjects performed the interactions in a lab environment. Fig. 12a and b show sample frames from our self-generated dataset.
Table 5 Comparison of different methodologies for the HMDB51 dataset

Method | Year | Classifier | Number of classes | Accuracy (%)
Multi-View Super Vector [18] | 2014 | Probabilistic canonical correlation analyser | 51 | 55.9
Fisher Vectors [6] | 2013 | SVM | 51 | 54.8
Trajectory + Local features + VLAD [21] | 2013 | SVM | 51 | 52.1
Proposed approach | 2018 | SMO | 2 | 100
Table 6 Confusion matrix for two classes of the HMDB51 dataset

 | Riding bike | Riding horse
Riding bike | 10 | 0
Riding horse | 0 | 10
5 Results and discussion

For evaluation we employed three datasets for human-object interaction (the MSR 3D dataset, the HMDB51 dataset and our self-generated dataset), while MSCOCO is used for training and testing the spatial relation module. We selected these three interaction datasets because they closely resemble daily-life scenarios with cluttered backgrounds, and we tested our human-object interaction algorithm across datasets to make it more robust. As shown in Table 1, the proposed system performs more accurately than the other systems in the literature.

Table 2 shows the accuracy of different methods on the MSR daily activity dataset, sorted in ascending order of accuracy; we chose 6 interactions from this dataset. The first and second columns give the methodology and year, respectively, and the last three columns give the classification algorithm, the number of classes and the accuracy. The table clearly shows that our proposed system is more accurate than the others.

Table 3 presents the confusion matrix for the 11 topological relations along with the default class, none. The annotations contain a total of 1147 spatial relations for 500 images chosen at random from the MS COCO dataset. Permutations of objects were considered during annotation to ensure a two-way correspondence. The 11 relations are on, above, across from, behind, in, in front of, inside, left of, right of, under and below, plus the default class none. Relations such as under and below show a relatively large overlap due to their significant similarity: we assume negligible contact between the two objects for "under" and a greater vertical distance between the two objects for "below". On the same grounds, in and on overlap in the confusion matrix (Table 3). This similarity arises from variation in the camera orientation of the images in the MSCOCO dataset; in a top view of a container holding an object, an indecision may occur because the difference between the two classes (in and on) largely depends on the vertical overlap of the object pair. Classes such as left, right, front of, behind and inside are considerably accurate and show a minimal amount of confusion.

Table 4 shows the class distribution for all six classes selected from the MSR dataset. Since the accuracy depends heavily on the detected object, drinking videos are mostly confused with eating.
Table 7 Confusion matrix for the self-generated dataset

 | Lifting | Dropping | Pushing | Pulling | Cutting
Lifting | 8 | 3 | 0 | 0 | 0
Dropping | 2 | 7 | 0 | 0 | 0
Pushing | 0 | 0 | 9 | 0 | 0
Pulling | 0 | 0 | 1 | 10 | 0
Cutting | 0 | 0 | 0 | 0 | 10
Table 8 Confusion matrix of human-object interaction

 | Eat | Drink | Phone call | Lift | Drop | Read | Push | Pull | Type | Cut | Hold | Riding bike | Riding horse
Eating | 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Drinking | 2 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Phone call | 1 | 1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Lifting | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Dropping | 0 | 0 | 0 | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Reading | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 4 | 0 | 0
Pushing | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 2 | 0 | 0 | 0 | 0 | 0
Pulling | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8 | 0 | 0 | 0 | 0 | 0
Typing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0
Cutting | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0
Holding | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 6 | 0 | 1
Riding bike | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 8
Riding horse | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
Typing achieves 100% accuracy because it involves a laptop, which is present only in the typing videos. On the other hand, the videos of phone call, reading and holding an object slightly overlap with one another.

Table 5 compares the results of different methodologies on the HMDB51 dataset; the first column gives the method name, the second the year, the third and fourth the classifier and the number of target classes, and the last column the accuracy of each methodology. We achieve a high accuracy mainly because we selected only two classes. Table 6 shows the confusion matrix for the two HMDB51 classes and demonstrates the accuracy of the dynamic-dynamic interaction; we selected only two classes because our main focus is on combining all three relationships to generate a comprehensive description.

Lastly, Table 7 shows the results of our interaction algorithm on the self-generated dataset, which gives an accuracy of 88% on a complex dataset with a cluttered background. To obtain overall results, we combined all selected videos from MSR, HMDB and the self-generated dataset and tested our complete methodology on all 13 classes. Table 8 shows the complete confusion matrix for all interaction classes; as the table shows, the accuracy drops when all video categories are combined. To evaluate static-static relations between objects we used a subset of 500 images from the MSCOCO dataset [11] and achieved 91% accuracy, while for human-object interaction an accuracy of about 78% is achieved, as depicted in Table 8. In this confusion matrix, the rows represent predicted classes and the columns actual classes; there are 13 rows and columns, covering each individual class of our combined dataset. Eating and drinking overlap with each other, as both are hand-and-mouth interactions with an intersection region. Lifting and dropping share the same object (a bag), which makes them similar. Pushing and pulling are quite similar, the only difference being the direction of movement of the reference object.
Holding and reading overlap because the only difference is an open book in the case of reading. All other interactions are mostly unique.
6 Conclusion and future work

This paper addresses complete visual analysis in terms of spatial relations and human-object interactions from still images and videos. The system takes images for static relations and videos for dynamic interactions, extracts various features such as mutual distances, size and movement, applies classification algorithms such as SMO, and predicts the interaction between the human and the corresponding object. Furthermore, deep learning techniques are employed to extract spatial relations between objects, and the final result is a scene description in the form of sentences. The system has been evaluated on standard datasets, namely the MS COCO database, the MSR action dataset and the HMDB51 dataset, achieving an accuracy of 78% on interaction and 91.16% on static spatial relations. A self-generated dataset has also been employed to test the system in a customized environment (Table 7). To obtain results for topological relations on the MS COCO dataset, we annotated the dataset using a self-made annotation tool and used the partially automated annotations of pairwise relations between objects to train and test our system.

In the future, the system can be expanded to summarize the generated descriptions via a paragraph summarization module. Moreover, the classes for human-object interaction and spatial relations can be increased to provide an even more comprehensive description.

Acknowledgements This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 2016R1A2B4011712) and by IGNITE, National Technology Fund, Pakistan, for the project entitled "Automatic Surveillance System for Video Sequences".
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References

1. Aydemir A et al (2011) Search in the real world: active visual object search based on spatial relations. In: IEEE International Conference on Robotics and Automation (ICRA)
2. Ellis C, Masood S, Tappen M, Laviola J, Sukthankar R (2013) Exploring the trade-off between accuracy and observational latency in action recognition. Int J Comput Vis 101(3):420–436
3. Gupta A, Kembhavi A, Davis LS (2009) Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE Trans Pattern Anal Mach Intell 31(10):1775–1789
4. Hamza R et al (2017) Hash based encryption for keyframes of diagnostic hysteroscopy. IEEE Access
5. Hamza R et al (2017) Secure video summarization framework for personalized wireless capsule endoscopy. Pervasive and Mobile Computing 41:436–450
6. Huang D et al (2014) Sequential max-margin event detectors. In: European Conference on Computer Vision. Springer, Cham
7. Jain P et al (2015) Knowledge acquisition for language description from scene understanding. In: International Conference on Computer, Communication and Control (IC4). IEEE
8. Karpathy A, Li F-F (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
9. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: ICCV
10. Li W, Zhang Z, Liu Z (2010) Action recognition based on a bag of 3D points. In: IEEE CVPR Workshop on Human Communicative Behavior Analysis
11. Lin T-Y et al (2014) Microsoft COCO: common objects in context. In: European Conference on Computer Vision. Springer, Cham
12. Muhammad K et al (2018) Secure surveillance framework for IoT systems using probabilistic image encryption. IEEE Trans Industr Inform
13. Redmon J et al (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
14. Sajjad M et al (2018) CNN-based anti-spoofing two-tier multi-factor authentication system. Pattern Recognition Letters
15. Sjöö K, Aydemir A, Jensfelt P (2012) Topological spatial relations for active visual search. Robot Auton Syst 60(9):1093–1107
16. Sung J, Ponce C, Selman B, Saxena A (2012) Unstructured human activity detection from RGBD images. In: Proceedings of the International Conference on Robotics and Automation, pp 842–849
17. Wang J et al (2012) Mining actionlet ensemble for action recognition with depth cameras. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE
18. Welke K et al (2013) Grounded spatial symbols for task planning based on experience. In: 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE
19. Xia L, Chen C-C, Aggarwal JK (2012) View invariant human action recognition using histograms of 3D joints. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE
20. Yang X, Tian YL (2014) Effective 3D action recognition using EigenJoints. J Vis Commun Image Represent 25(1):2–11
21. Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: Proceedings of the IEEE International Conference on Computer Vision
Dr. Muhammad Usman Ghani Khan is an associate professor in Department of Computer Science & Engineering, University of Engineering and Technology, Lahore, Pakistan. His PhD (Sheffield University, UK) was concerned with statistical modelling for machine vision signals, specifically language descriptions of video streams. He is heading National Center for Artificial Intelligence in Al-Khwarizmi Institute of Computer Science, UET Lahore.
Zahoor-ur-Rehman has experience in both academia and research. He received his education and academic training at the University of Peshawar, Foundation University Islamabad and UET Lahore, Pakistan. He joined the COMSATS Institute of Information Technology as an assistant professor in early 2015. Along with his teaching responsibilities, he is an active researcher and a reviewer for various conferences and reputed journals.
Irfan Mehmood has been involved in the IT industry and academia in Pakistan and South Korea for over a decade. Currently, he is serving as an assistant professor in the Department of Software, Sejong University. His sustained contribution to various research and industry-collaborative projects gives him an extra edge in meeting the current challenges faced in the field of multimedia analytics. Specifically, he has made significant contributions in the areas of visual surveillance, information mining and data encryption.