Neural Computing and Applications https://doi.org/10.1007/s00521-018-03971-3


INTELLIGENT BIOMEDICAL DATA ANALYSIS AND PROCESSING

IoT-based 3D convolution for video salient object detection

Shizhou Dong1,2 · Zhifan Gao3 · Sandeep Pirbhulal1 · Gui-Bin Bian4 · Heye Zhang5 · Wanqing Wu1 · Shuo Li3

Received: 14 September 2018 / Accepted: 20 December 2018
© Springer-Verlag London Ltd., part of Springer Nature 2019

Abstract
Video salient object detection (SOD) is the first step for devices in the Internet of Things (IoT) to understand the environment around them. Video SOD needs the objects' motion information from contiguous video frames as well as spatial contrast information from a single video frame. The computing power of a large number of IoT devices is not sufficient to support the expensive computational complexity of existing SOD methods in motion estimation, because these devices often have low hardware configurations (e.g., surveillance cameras and smartphones). In order to model the objects' motion information efficiently for SOD, we propose an end-to-end video SOD algorithm with an efficient representation of the objects' motion information. This algorithm contains two major parts: a 3D convolution-based X-shape structure that directly and efficiently represents the motion information in successive video frames, and a 2D densely connected convolutional neural network (DenseNet) with pyramid structure that extracts the rich spatial contrast information in a single video frame. Our method not only maintains a small number of parameters, comparable to a 2D convolutional neural network, but also represents spatiotemporal information uniformly, which enables it to be trained end-to-end. We evaluate our proposed method on four benchmark datasets. The results show that our method achieves state-of-the-art performance compared with five other methods.

Keywords Internet of Things · Salient object detection · Video processing · Deep learning

Mathematics Subject Classification 68T45 · 68T10 · 68T05

Gui-Bin Bian (corresponding author) [email protected]
Shizhou Dong [email protected]
Zhifan Gao [email protected]
Sandeep Pirbhulal [email protected]
Heye Zhang [email protected]
Wanqing Wu [email protected]
Shuo Li [email protected]

1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
2 Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, China
3 Western University, London, Canada
4 State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
5 Sun Yat-Sen University, Guangzhou, Guangdong, China

1 Introduction

Video salient object detection (SOD) is often the first step for devices in the Internet of Things (IoT) to understand the environment around them. Most IoT devices rely on analyzing the videos recorded by their cameras to obtain information about the surrounding environment [2, 27, 32, 33]. An efficient way to analyze these videos is to apply a video SOD algorithm as the first step. A video SOD algorithm not only imitates the human attention mechanism to find the salient objects (i.e., the noticeable objects), but also effectively reduces the IoT devices' computational cost in video analysis by filtering out irrelevant objects. However, existing video SOD algorithms usually suffer from expensive computational complexity in motion information estimation. The computing power of many IoT devices is not sufficient to support this computational cost because they have low hardware configurations (e.g., surveillance cameras and smartphones). Therefore, it is difficult to deploy existing algorithms directly on IoT devices, and a video SOD algorithm with an efficient representation of the objects' motion information is necessary.

An efficient representation of the objects' motion information in videos is critical to video SOD algorithms. The motion information identifies the physical change of an object's position with respect to its background [35]. It has broad applications in intelligent video surveillance, traffic monitoring, event detection, people tracking and behavior analysis. Figure 1a, b shows two general representations of motion information: traditional optical flow and the 3D convolutional neural network, respectively. Although traditional optical flow represents the motion information precisely, its computational complexity is usually expensive [30, 42]. The 3D convolutional neural network can directly learn a task-specific representation of the objects' motion information from large amounts of video data [10], but it has too many parameters to avoid the overfitting problem. In contrast, the 2D convolutional neural network has made great progress in static image SOD [22] and surpasses numerous traditional algorithms on many public benchmark datasets. However, it cannot be applied directly to video SOD because it has no structure to represent the motion information. It is therefore a promising direction to develop a new style of neural network architecture that contains both 2D and 3D convolution in order to directly represent spatiotemporal information. Unfortunately, little work has been done in this direction.

This paper proposes a 3D convolution-based X-shape structure that can efficiently represent the motion information of objects in videos. The structure can be directly embedded in a 2D convolutional neural network, as illustrated in Fig. 1c. It accepts the feature maps of the successive frames as input. These feature maps then pass through a 3D convolution to discover motion information. Next, the output feature maps carrying the motion information are concatenated with the input feature maps. Finally, the concatenated feature maps are distributed back to each video frame. This 'collection' and 'distribution' structure resembles the letter 'X,' so we call it the '3D convolution-based X-shape structure.' The structure has two benefits. On the one hand, it can directly and efficiently represent the motion information of objects in videos. On the other hand, it preserves more spatial contrast information of each single frame: even if the motion information of the video is not accurate, it will not destroy the spatial contrast information of the single frame. We embed this structure into a 2D densely connected convolutional neural network (DenseNet [5]) with pyramid structure to construct a complete video SOD algorithm. We compare our algorithm with five state-of-the-art video SOD algorithms on four benchmark datasets. The results show that our proposed algorithm achieves state-of-the-art performance.

Fig. 1 Three different representations of motion information in successive video frames. a Optical flow. b The 3D convolutional neural network. c The 3D convolution-based X-shape structure, which is our proposed method in this paper. Best to be viewed in color (color figure online)


Our main contributions are summarized as follows:

1. We propose a 3D convolution-based X-shape structure to directly and efficiently represent motion information. This structure preserves more spatial contrast information of each input frame.
2. We utilize a 2D DenseNet with pyramid structure to effectively extract the spatial contrast information of a single frame.
3. We introduce an end-to-end video salient object detection neural network with efficient computation. It directly represents spatiotemporal information efficiently while maintaining a number of parameters comparable to a 2D convolutional neural network, which makes it possible to deploy on IoT devices.
4. We evaluate our proposed model on four benchmark datasets. Our proposed model achieves state-of-the-art performance compared with five other methods.

The rest of the paper is organized as follows: We give an overview of related work in Sect. 2. We explain our method in detail in Sect. 3. The implementation details and experimental analysis are presented in Sect. 4. Finally, we conclude our work in Sect. 5.

2 Related work

A large number of methods [8, 19] have been proposed for static image SOD in the past decade. Margolin et al. [26] proposed a model that uses the inner statistics of image patches to find unique patterns in an image in order to quickly compute distinctness. Similarly, Zhang et al. [44] used bottom-up saliency derived from natural image statistics as the self-information for pointwise mutual information. Judd et al. [12] proposed a top-down model based on all levels of features in the eye-tracking data they collected. Borji [1] proposed a method combining both low-level features and top-down cognitive visual features. Recently, with the development of deep learning, more and more convolutional neural network (CNN) models [16, 45] have been proposed. Wang et al. [36] proposed a model containing two deep convolutional neural networks: one determines the saliency value of each pixel, and the other predicts the saliency score of each object region based on global features. Li et al. [21] combined semantic segmentation and salient object detection to construct a multi-task neural network, demonstrating that the semantic information of an image helps salient object detection. Hu et al. [9] found that deep networks have difficulty discriminating pixels belonging to similar receptive fields around object boundaries, and they proposed a deep network that learns a level set function to produce more accurate boundaries and a uniform saliency map.

Li et al. [18] presented a multi-scale salient object detection neural network for detecting salient objects at multiple scales; they were also the first to generate salient object proposals to distinguish different salient objects. Wang et al. [38] proposed to augment feedforward neural networks with a novel pyramid pooling module and a multi-stage refinement mechanism for saliency detection. Luo et al. [24] fused features of all levels and designed contrast layers to enhance the feature contrast between salient and non-salient regions. Hou et al. [7] introduced short connections to the skip-layer structures in their networks and took advantage of features at all levels.

Compared with static image SOD, video SOD is more challenging due to the introduction of motion information. In earlier years, many models [23, 31] simply extended existing static image SOD methods by adding the motion information. For example, Guo et al. [6] proposed that adding an extra motion dimension to their static SOD method could represent spatiotemporal saliency. More recently, a variety of unsupervised methods [15, 25] have been proposed. Wang et al. [39] incorporated saliency as a prior via the computation of geodesic measurements. Li et al. [20] proposed an approach using saliency-guided stacked autoencoders, which can be constructed without supervision from previously extracted saliency cues. Deep learning techniques have also been introduced to the field of video SOD. Le and Sugimoto [14] presented a method for detecting salient objects in a video in which temporal information was fully taken into account in addition to spatial information. They used two sub-networks to extract spatial deep features: one was fed video blocks to extract spatial local features, and the other was fed video frame regions to extract spatial global features. Moreover, they refined the saliency maps using their proposed spatiotemporal conditional random field (STCRF), an extension of CRF to the temporal domain that formulates the relationship between neighboring regions both within a frame and across frames. STCRF yields more accurate boundaries of salient objects and less noise owing to its temporally consistent saliency maps over frames. However, STCRF is a time-consuming component. Wang et al. [41] proposed a deep model containing two modules to capture both spatial and temporal saliency information using fully convolutional networks. Their dynamic saliency model, unlike traditional models using optical flow, incorporated saliency results from the static saliency model to generate saliency maps directly. It not only outperformed most existing traditional methods by a large margin but also significantly improved the computational speed. However, in their method the result of the current frame depends on that of the previous frame.


More recently, Li et al. [17] introduced 'FlowNet' to model motion information for video SOD, but its architecture is somewhat complex. The methods mentioned above show the power and enormous potential of deep convolutional neural networks in video SOD. In this work, we also employ deep convolutional neural networks for end-to-end training and saliency detection. Different from all the methods above, we propose a 3D convolution-based X-shape structure in order to detect motion information while preserving more spatial saliency cues of each input frame. Benefiting from the advantages of deep learning, it can make full use of very large training data, so that both the temporal and spatial features, learned directly from data, have good generalization ability. It is also fast enough for real-time detection and can be deployed on various kinds of IoT devices.

3 Methodology

Figure 2 shows the framework of our proposed model in detail. It consists of two major kinds of components: four 2D DenseNets with pyramid structure and a 3D convolution-based X-shape structure. The four 2D DenseNets with pyramid structure act as static image sub-networks that share parameters; they are used to extract the spatial contrast information in a single video frame. The 3D convolution-based X-shape structure is adopted to represent the temporal information among the input video frames. When successive video frames are fed into the network, we first treat them as independent images and let them go through the corresponding static image sub-networks, which have identical parameters (equivalently, all input video frames go through the same static image sub-network separately to extract the spatial contrast information). Then, we gather the features of all frames together. Next, we feed these features into the 3D convolution-based X-shape structure and distribute the temporal information to each frame. After the 3D convolution-based X-shape structure, the features of each frame therefore include not only the spatial contrast information of a single video frame but also the effective temporal information among the input video frames. Finally, the features of each frame go through the last convolution layers and a sigmoid activation function to generate the final saliency maps. The 2D DenseNet with pyramid structure, the 3D convolution-based X-shape structure and the benefits of our method are explained in the following three sections, respectively.
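To make the 'collect and distribute' data flow concrete, the following toy sketch (TensorFlow/Keras) traces tensor shapes through the pipeline. The stand-in layers, channel widths and clip length T = 4 are illustrative assumptions rather than the released implementation: a single 2D convolution stands in for the DenseNet with pyramid structure, and a single 3D convolution stands in for the X-shape structure described in Sect. 3.2.

```python
import tensorflow as tf

T = 4                                              # frames per clip
frames = tf.random.uniform((1, T, 64, 64, 3))      # (batch, T, H, W, 3)

spatial_net = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')  # stand-in for the 2D DenseNet + pyramid
motion_conv = tf.keras.layers.Conv3D(8, 3, padding='same', activation='relu')   # the 3D convolution of the X-shape structure
head = tf.keras.layers.Conv2D(1, 1, activation='sigmoid')                       # last conv + sigmoid

# 1. Every frame passes through the same (parameter-shared) static-image sub-network.
per_frame = tf.stack([spatial_net(frames[:, i]) for i in range(T)], axis=1)     # (1, T, 64, 64, 16)
# 2. 'Collect': one 3D convolution over the stacked frames extracts motion cues.
motion = motion_conv(per_frame)                                                  # (1, T, 64, 64, 8)
# 3. 'Distribute': motion features are concatenated back onto the untouched spatial features.
fused = tf.concat([per_frame, motion], axis=-1)                                  # (1, T, 64, 64, 24)
# 4. One saliency map per frame.
saliency = tf.stack([head(fused[:, i]) for i in range(T)], axis=1)               # (1, T, 64, 64, 1)
```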

3.1 2D DenseNet with pyramid structure

DenseNet can effectively deal with the vanishing-gradient problem and substantially reduce the number of parameters [5]. In addition, the DenseNet we use has a pyramid structure.

Fig. 2 Detailed illustration of the framework of our proposed deep neural network, which includes 2D DenseNet with pyramid structure and the 3D convolution-based X-shape structure. Best to be viewed in color (color figure online)


Fig. 3 Detailed illustration of the architecture of the 2D DenseNet with pyramid structure and the Dense block. a The 2D DenseNet with pyramid structure. b Dense block in DenseNet. Notice that a '2D conv' denotes the sequential combination of 'BatchNorm + ReLU + 2D convolution.' Best to be viewed in color (color figure online)

In other words, we do not only utilize the features from the final DenseBlock, as in the traditional DenseNet architecture, but utilize the features from all DenseBlocks. The features of all DenseBlocks are extremely useful for extracting saliency cues from high to low levels. The 2D DenseNet with pyramid structure is illustrated in Fig. 3a, where it is used as the static image sub-network to extract the spatial contrast information. Different from the traditional 'FCN' with multi-scale features, we use the feature maps of almost every convolution layer in our network. These feature maps represent rich spatial information that helps our model infer more accurate saliency maps for the input video frames. A DenseBlock includes l '2D conv' layers with dense connectivity, as illustrated in Fig. 3b. The forward pass of the kth layer in the DenseBlock can be formulated as

$x_k = H(y_1, \ldots, y_{k-1})$   (1)

$y_k = \mathrm{Conv}(x_k)$   (2)

where $x_k$ and $y_k$ in Eqs. (1) and (2) are the input and output of the kth layer, respectively. The function $H(\cdot)$ is the concatenation of the outputs of the preceding k-1 layers. The function $\mathrm{Conv}(\cdot)$ denotes the sequential combination of 'BatchNorm + ReLU + 2D convolution' applied to its input, and it produces g feature map channels as output. In our network, we set l = 12 and g = 12. Between every two DenseBlocks there is a 1 x 1 convolution layer, regarded as an information 'bottleneck layer,' which halves the number of feature map channels. Meanwhile, we place a 2 x 2 pooling layer only after the first two DenseBlocks in order to keep more spatial information in the feature maps of the DenseBlocks. Notice that the DenseNet architecture we use consists of five different DenseBlocks, which can be regarded as five levels. The pyramid structure also has five levels, and the output of each DenseBlock is an input of the pyramid structure. After that, each level of the pyramid structure concatenates the feature maps of the higher levels for deeper and better feature extraction. The outputs from different DenseBlock layers are not always of the same size.


Therefore, during construction we start from the high-level, small-sized outputs and upsample the previous concatenation when necessary. The upsampling algorithm we use is bilinear interpolation. After the concatenation, the low-level feature maps contain the extracted features from the higher levels. With this, the pyramid structure is constructed.
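As a concrete reference, the following sketch (TensorFlow/Keras) implements a DenseBlock following Eqs. (1)-(2) with l = 12 layers and growth rate g = 12, the 1x1 transition that halves the channel count, and the top-down pyramid concatenation with bilinear upsampling. The 3x3 kernel size and the exact wiring of the pooling layers are assumptions, not values taken from the paper.

```python
import tensorflow as tf

def dense_block(x, num_layers=12, growth=12):
    # One DenseBlock (Eqs. 1-2): every '2D conv' is BatchNorm -> ReLU -> Conv2D(growth);
    # its input is the concatenation of the block input and all previous layer outputs.
    feats = [x]
    for _ in range(num_layers):
        h = tf.concat(feats, axis=-1)                              # x_k = H(y_1, ..., y_{k-1})
        h = tf.keras.layers.BatchNormalization()(h)
        h = tf.keras.layers.ReLU()(h)
        y = tf.keras.layers.Conv2D(growth, 3, padding='same')(h)   # y_k = Conv(x_k), g = 12 channels
        feats.append(y)
    return tf.concat(feats, axis=-1)

def transition(x):
    # 1x1 'bottleneck' between two DenseBlocks, halving the number of channels.
    return tf.keras.layers.Conv2D(int(x.shape[-1]) // 2, 1)(x)

def pyramid(levels):
    # levels: the five DenseBlock outputs, ordered from shallow (large maps) to deep (small maps).
    # Each level is concatenated with the bilinearly upsampled concatenation of all deeper levels.
    out, acc = [], None
    for feat in reversed(levels):            # start from the high-level, small-sized outputs
        if acc is not None:
            acc = tf.image.resize(acc, tf.shape(feat)[1:3], method='bilinear')
            feat = tf.concat([feat, acc], axis=-1)
        acc = feat
        out.append(feat)
    return list(reversed(out))               # back to shallow-to-deep order
```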

3.2 3D convolution-based X-shape structure

We now turn to the 3D convolution-based X-shape structure mentioned above. Its inputs are the concatenated feature maps at each level of the pyramid structure, which still have different resolutions. The pyramid structure we use contains five different levels to better represent the spatial contrast features in a single video frame, as can be seen in Fig. 2. The feature maps of the last four levels are upsampled to the resolution of the first level for subsequent use. Up to this point, all procedures are per-frame static image operations and are identical for each input video frame, and the spatial contrast information at different levels has already been discovered by the pyramid structure. Now the pyramid features of the different frames are brought together for the extraction of motion information. We place a 3D convolution layer here in order to make use of deeper 'pure' features from the original input frames. Figure 2 illustrates the process in detail. The pyramid features of all input frames are concatenated, and the feature maps of the mid-level are sent into the 3D convolution layer for the extraction of temporal features. We choose only the mid-level feature maps for motion information extraction because they are neither too coarse nor too refined, so the useful information can be captured as much as possible while also trying to prevent the over-fitting problem. Moreover, we are concerned that too much motion information may weaken the detection ability of our model, because it would 'pollute' the spatial contrast information of each video frame, especially when the motion information is unusual or hard to deal with. The mid-level motion information is then combined again with all the other high-level and low-level features. After this 'combine' stage, the combined feature maps are split and distributed to each video frame, completing the drawing of the 'X.' Finally, the feature maps of each video frame go through the last convolution layers and sigmoid functions to generate the final saliency maps.
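A minimal sketch of this structure is given below, assuming five pyramid levels per frame and treating the third level as the 'mid-level' input of the 3D convolution; the 3x3x3 kernel size and the number of output motion channels are assumptions, not values stated in the paper.

```python
import tensorflow as tf

def x_shape_3d(pyramid_per_frame, motion_channels=12):
    """pyramid_per_frame: list over T frames; each element is a list of the five
    pyramid-level tensors (batch, H_l, W_l, C_l), ordered level 1 (largest) to level 5."""
    T = len(pyramid_per_frame)
    target = tf.shape(pyramid_per_frame[0][0])[1:3]      # resolution of the first level

    # Bring levels 2-5 of every frame to the first level's resolution.
    frames = []
    for levels in pyramid_per_frame:
        ups = [levels[0]] + [tf.image.resize(l, target, method='bilinear')
                             for l in levels[1:]]
        frames.append(ups)                               # five same-resolution tensors per frame

    # 'Collect': stack the mid-level (3rd of 5) features of all frames along a
    # temporal axis and apply one 3D convolution to extract motion information.
    mid_stack = tf.stack([f[2] for f in frames], axis=1) # (batch, T, H, W, C_mid)
    motion = tf.keras.layers.Conv3D(motion_channels, 3, padding='same',
                                    activation='relu')(mid_stack)

    # 'Combine' and 'distribute': give every frame its temporal slice of the motion
    # features plus all of its own (unpolluted) spatial pyramid features.
    outputs = []
    for t, f in enumerate(frames):
        outputs.append(tf.concat(f + [motion[:, t]], axis=-1))
    return outputs                                       # one fused tensor per frame
```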

3.3 Benefits of our method

A strong point of the proposed model is the use of features from different levels throughout the network to construct a pyramid structure.


This helps our model learn the spatial contrast information of the video frame more comprehensively and increases its ability to detect salient objects of different sizes. Besides, our concatenation steps always start from the high-level features, which ensures that the feature maps of each pyramid level contain the saliency information from the levels above it; this keeps the discovered low-level features from being too coarse. The use of the 3D convolution-based X-shape structure for motion information extraction is another advantage. Compared with traditional methods using handcrafted features such as optical flow, our 3D convolution-based X-shape structure can directly use the outputs generated by the preceding networks for motion information extraction, and its inference stage is far more computationally efficient. It effectively preserves the per-frame spatial contrast information while handling the motion information through the connections between frames. In fact, our proposed model not only lets deeper and finer features be used in the temporal information detection stage, but also ensures that the per-frame spatial contrast information is not 'polluted' by motion information too early. Many pre-trained deep networks for static image SOD can easily be converted into video SOD methods by using this structure.

4 Experiments and results

4.1 Implementation details

We implement our method in Python with the TensorFlow library and optimize it by the mini-batch gradient descent algorithm with momentum 0.9. The loss function is defined as the sum of the cross-entropy, the MAE and the generalized Dice [34]. The weight decay is 0.0001. Our training process includes two steps. The first is pre-training the parameters of the 2D convolutions on three static image salient object datasets: DUT-TR (10,553 images) [37], DUT-OMRON (5168 images) [43] and MSRA10K (10,000 images) [3]. We use the DUT-TE dataset as the validation set to choose hyperparameters. The second is fine-tuning all parameters of our model on the GyGO dataset. In each fine-tuning iteration, we input 4 adjacent frames sampled from the GyGO dataset into our model. The whole fine-tuning process lasts 30 epochs without a validation set. The learning rate is 0.0002 initially, 0.00002 after 10 epochs, 0.000002 after 15 epochs, and 0.0000002 after 25 epochs. Similarly, at each iteration of the test process, we input 4 contiguous video frames from the test datasets into our model to predict 4 saliency maps. All the above training and testing steps are carried out on a Tesla P100 (12G) GPU card.
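The loss and schedule described above can be sketched as follows. The generalized Dice term of [34] is simplified here to an unweighted two-class Dice, and the 1e-4 weight decay would be attached via kernel regularizers (only indicated in a comment), so this is a sketch under those assumptions rather than the exact training code.

```python
import tensorflow as tf

def saliency_loss(y_true, y_pred, eps=1e-7):
    # Cross-entropy + MAE + simplified Dice; y_true/y_pred are saliency maps in [0, 1],
    # shape (batch, H, W, 1).
    bce = -tf.reduce_mean(y_true * tf.math.log(y_pred + eps)
                          + (1.0 - y_true) * tf.math.log(1.0 - y_pred + eps))
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    inter = tf.reduce_sum(y_true * y_pred)
    dice = 1.0 - (2.0 * inter + eps) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)
    return bce + mae + dice

def learning_rate(epoch):
    # Piecewise-constant schedule used for the 30-epoch fine-tuning stage.
    if epoch < 10:
        return 2e-4
    if epoch < 15:
        return 2e-5
    if epoch < 25:
        return 2e-6
    return 2e-7

# Mini-batch SGD with momentum 0.9; the 1e-4 weight decay would be added through
# per-layer kernel regularizers (not shown here).
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate(0), momentum=0.9)
```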


4.2 Datasets and evaluation metrics

We evaluate the performance of the proposed network on four public video benchmark datasets: the VOS dataset [20] together with its two subsets VOS-E and VOS-N, the Freiburg-Berkeley Motion Segmentation (FBMS) dataset [28], the Densely Annotated Video Segmentation (DAVIS) dataset [29] and the NTT dataset [4, 13].

VOS: The VOS dataset consists of 200 videos lasting 64 minutes in total, with 7650 manually annotated keyframes. The videos are divided into two subsets, VOS-E and VOS-N, where the videos in VOS-N are much more challenging than those in VOS-E. We carry out comprehensive experiments on VOS, VOS-E and VOS-N to fully validate the effectiveness of our model.

FBMS: The FBMS dataset contains 59 video sequences, 29 of which compose the training set and the rest the testset, with a total of 720 annotated frames. We only report the performance on the testset, because some of the methods we compare with used the training set for training.

DAVIS: The DAVIS dataset contains 50 video sequences about humans, animals, vehicles, objects and actions, each lasting about 2-4 s, with 3455 densely annotated frames.

NTT: The NTT dataset contains 10 video clips capturing outdoor scenes. Each video lasts 5-10 s.

We use recall, precision, Fβ and mean absolute error (MAE) as evaluation metrics. Let G be the ground truth and S the detection result; precision and recall are defined as

$\mathrm{Precision} = \frac{\#(\text{Nonzeros in } S \cap G)}{\#(\text{Nonzeros in } S)}$   (3)

$\mathrm{Recall} = \frac{\#(\text{Nonzeros in } S \cap G)}{\#(\text{Nonzeros in } G)}$   (4)

where $\#(\cdot)$ in Eqs. (3) and (4) counts the number of elements in a set. The MAE is computed as

$\mathrm{MAE} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} |S_{i,j} - G_{i,j}|$   (5)

where W and H denote the width and height of the detected result S. To balance the weight of each video in the comparison, we adopt the mean average precision (MAP) and mean average recall (MAR) proposed by [20]. In this way, the MAP, MAR and MAE over all videos are computed by averaging the average precision, recall and MAE of each video. Fβ is then defined as

$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{MAP} \cdot \mathrm{MAR}}{\beta^2 \cdot \mathrm{MAP} + \mathrm{MAR}}$   (6)

This calculation avoids the problem of the performance on short videos being overwhelmed by long videos, which is common with the usual metrics. The β² in Eq. (6) is set to 3.
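For reference, the per-frame metrics in Eqs. (3)-(6) can be computed as sketched below (NumPy); MAP, MAR and the overall MAE are then obtained by first averaging within each video and then across videos, as described above. The default β² follows the value stated in the text; the binarization of S before Eqs. (3)-(4) is assumed to happen beforehand.

```python
import numpy as np

def precision_recall(S, G):
    # Eqs. (3)-(4): S is a binarized saliency map, G the binary ground truth.
    S, G = S.astype(bool), G.astype(bool)
    inter = np.count_nonzero(S & G)
    precision = inter / max(np.count_nonzero(S), 1)
    recall = inter / max(np.count_nonzero(G), 1)
    return precision, recall

def mae(S, G):
    # Eq. (5): mean absolute error between a continuous saliency map and the ground truth.
    return np.abs(S.astype(float) - G.astype(float)).mean()

def f_beta(map_score, mar_score, beta2=3.0):
    # Eq. (6), computed from the per-video averaged MAP and MAR.
    return (1.0 + beta2) * map_score * mar_score / (beta2 * map_score + mar_score + 1e-12)
```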

4.3 Comparisons

To validate the effectiveness of the proposed model, we compare it with five state-of-the-art SOD methods on the four benchmark datasets mentioned above: geodesic distance-based saliency (SAG) [39], saliency using local gradient flow (GF) [40], saliency via fully convolutional networks (FCN) [41], saliency via absorbing Markov chains (MC) [11] and saliency-guided stacked autoencoders (SSA) [20]. All the methods are designed for video salient object detection except MC, and FCN is the only deep model among them.

The qualitative comparison between our method and the other state-of-the-art methods is shown in Fig. 4, where the first column shows video clips from the four datasets (from top to bottom: VOS dataset, DAVIS dataset, FBMS testset and NTT dataset) and the second column shows the corresponding ground truth saliency maps. It can be seen from the figure that the results of our model are always the most similar to the ground truth maps among all the models we tested, regardless of the kind of input frames. The results therefore also illustrate that our model is able to detect salient objects in different kinds of videos. In fact, our model exhibits competitive performance in difficult scenes in which the salient object is embedded in a complicated textured background, such as the videos in the VOS-N dataset, as shown in Table 1. It is worth noting that the other deep learning model, FCN, also obtains good results, though a little weaker than the proposed network; this further demonstrates that using deep learning for video SOD is a promising research direction. Due to the lack of motion information, the static image method MC does not generate promising results; however, the absence of the time-consuming optical flow computation saves it much time and makes it the fastest non-deep model here. The non-deep, optical flow-based video SOD methods SAG and GF suffer from the ambiguity between salient parts and the background because they lack deep features and prior knowledge learned from large training sets, and their computational load at inference time is very high. During the training stage, we always try to focus our model more on precision than on recall, and the results suggest that this effort is useful: only a few small non-salient regions are predicted as salient. Our saliency maps are usually quite 'slim,' especially compared with the results of traditional methods.


Fig. 4 Representative results of the six models we tested over VOS dataset, NTT dataset, FBMS testset and DAVIS dataset

Table 1 Performance comparison of our model, five other state-of-the-art models, and our network without 3D convolution layers (denoted by Our_0) and with two 3D convolution layers in the 3D convolution-based X-shape structure (denoted by Our_2)

          VOS-E                VOS-N                VOS                  DAVIS                FBMS                 NTT
Method    MAP   MAR   Fβ       MAP   MAR   Fβ       MAP   MAR   Fβ       MAP   MAR   Fβ       MAP   MAR   Fβ       MAP   MAR   Fβ
SAG       0.709 0.814 0.731    0.354 0.742 0.402    0.526 0.777 0.568    0.370 0.815 0.423    0.421 0.736 0.467    0.500 0.612 0.522
GF        0.712 0.798 0.730    0.346 0.738 0.394    0.523 0.767 0.565    0.399 0.741 0.446    0.454 0.674 0.491    0.453 0.593 0.479
FCN       0.804 0.840 0.812    0.572 0.803 0.613    0.685 0.821 0.712    0.528 0.848 0.578    0.592 0.749 0.622    0.673 0.748 0.689
MC        0.815 0.736 0.795    0.499 0.664 0.529    0.652 0.699 0.663    0.347 0.648 0.389    0.390 0.599 0.424    0.680 0.655 0.674
SSA       0.874 0.777 0.850    0.657 0.686 0.664    0.762 0.730 0.755    0.617 0.785 0.649    0.620 0.655 0.628    0.695 0.634 0.680
Our       0.867 0.903 0.875    0.675 0.823 0.704    0.768 0.862 0.788    0.668 0.813 0.696    0.677 0.748 0.693    0.755 0.755 0.755
Our_2     0.846 0.924 0.862    0.641 0.856 0.681    0.740 0.889 0.770    0.609 0.858 0.652    0.638 0.771 0.665    0.728 0.780 0.739
Our_0     0.820 0.923 0.842    0.563 0.802 0.604    0.687 0.861 0.721    0.497 0.838 0.549    0.588 0.772 0.622    0.687 0.754 0.701

The best three scores in each column are marked in red, green and blue. Note that VOS-E and VOS-N are two subsets of the VOS dataset. Best to be viewed in color


Quantitatively, Table 1 reports the MAP, MAR and Fβ scores on the four benchmark datasets for the five state-of-the-art methods, our model, and two variants of our model. The structures of the two variants are the same as the proposed one except that one has no 3D convolution layer and the other has two 3D convolution layers in the 3D convolution-based X-shape structure. As can be seen from the table, our model outperforms all the other methods, including the two variants. It is worth noting that although the variant with two 3D convolutions is placed in the top three by all the metrics and obtains almost all the best recalls except on the FBMS testset, whereas the proposed model misses the top three on two metrics, we still consider the proposed model the best because it wins in the precision metric on all datasets except VOS-E. Representative curves can be seen in Fig. 5. The reason lies in the fact that it is widely recognized in SOD research that improving precision is a much more important and difficult problem than improving recall, since recall and precision are to some extent inversely proportional; obtaining higher precision at the expense of slightly lower recall is a commonly used trade-off in SOD research.

Fig. 5 Comparisons between the proposed network and five state-of-the-art methods on four benchmark datasets. Datasets used for validation, from top to bottom, are the VOS dataset, DAVIS dataset, FBMS testset and NTT dataset. The subfigures from left to right are the average precision-recall curve, F-measure and average MAE


Fig. 6 The left figure is the comparison of the computational load of our model and four state-of-the-art SOD methods. The right figure is the qualitative comparison of the proposed network and its two variants on the VOS dataset, DAVIS dataset and FBMS testset, respectively. The rows from top to bottom show video clips, ground truth saliency maps, saliency results of the no-3D-convolution network, saliency results of the two-3D-convolution-layer network and saliency results generated by our proposed architecture. Notice that our model outperforms the two variants by a large margin; in particular, it always obtains good precision. Best to be viewed in color (color figure online)

Moreover, the computation of Fβ in almost all VOS research also puts more emphasis on precision than on recall to highlight the importance of precision, so a model with higher precision obtains a higher Fβ score more easily. For example, as can be seen, the proposed model outperforms all the other models on all datasets in terms of Fβ.

Beyond comparing the performance of the proposed model with the other five state-of-the-art models, another issue in this table is the ablation analysis. The fact that the variant without any 3D convolution does not perform as well as the networks with 3D convolution validates the effectiveness of our 3D convolution-based X-shape structure in capturing motion information. Since this variant is in fact no longer a video model, the comparison also demonstrates the superiority of video models over static image ones in video SOD research. Surprisingly, the proposed model outperforms the variant with two 3D convolution layers in the 3D convolution-based X-shape structure on most of the metrics, although intuitively a more complex architecture could capture more and deeper motion information. This may be because the more complicated model has many parameters and may overfit the training set. Another reason is that motion information can also be misleading, for example when the object is static and the video is shot with a moving camera. This also explains why the proposed network uses only one 3D convolution layer in the 3D convolution-based X-shape structure.

The left graph of Fig. 6 documents the computational load of our method and four state-of-the-art methods. These experiments are all executed on the same computer with an Intel Core i5 8250U, 8 GB RAM and an Nvidia GeForce TITAN X GPU. The input for each model is the same: 4 continuous 480p video frames. The time reported in the figure is the average time per frame, computed by dividing the total time by 4. Although very similar, our time is a little higher than that of FCN and MC, because they do not need to compute the motion information across adjacent input video frames, which saves them much time. However, our method greatly exceeds the non-deep, optical flow-based video methods SAG and GF, and our results are much better than those of FCN and MC. This suggests that our deep network is computationally efficient. In fact, it takes only about 0.63 s for our network to predict the saliency map of an 800 x 480 frame, which demonstrates that our model is fast enough to meet the needs of real-time salient object detection. The right graph of Fig. 6 shows the qualitative comparison of the proposed network and its two variants on the VOS dataset, DAVIS dataset and FBMS testset, respectively; our model outperforms the two variants by a large margin and, in particular, always obtains good precision.


5 Conclusion

In this paper, we have presented a novel deep video saliency detection model. The model consists of a 2D DenseNet with pyramid structure for per-frame spatial contrast information extraction and a 3D convolution-based X-shape structure for exploiting motion information. The 2D DenseNet with pyramid structure consists of four densely connected networks using feature pyramid structures for feature abstraction at all levels. The discovered pyramid features are then sent to the 3D convolution-based X-shape structure for extracting motion information.


The 3D convolution-based X-shape structure enables our model to detect motion information directly and quickly while preserving the spatial contrast information of each input frame as much as possible. Both the 3D convolution-based X-shape structure and the 2D DenseNet with pyramid structure reduce the number of parameters. Together with the fact that features at all levels are taken into consideration during detection, our network effectively reduces the risk of overfitting and can detect salient objects of different sizes. Our method is far more computationally efficient than traditional methods that use optical flow to express motion information. It also has good generalization ability because both the spatial contrast information and the motion information are learned directly from large-scale datasets. It can therefore be deployed on various kinds of IoT devices. In future work, we will integrate long-term motion information into our model.

Funding This study was funded by Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant No. 218165), Shenzhen Key Laboratory of Neuropsychiatric Modulation (CN) (Grant No. JCYJ20170307165309009).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest regarding this work.

References

1. Borji A (2012) Boosting bottom-up and top-down visual features for saliency estimation. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR), pp 438-445
2. Chen S, Xu H, Liu D, Hu B, Wang H (2014) A vision of IoT: applications, challenges, and opportunities with China perspective. IEEE Internet Things J 1(4):349-359. https://doi.org/10.1109/JIOT.2014.2337336
3. Cheng MM, Mitra NJ, Huang XL, Torr PHS, Hu SM (2015) Global contrast based salient region detection. IEEE TPAMI 37(3):569-582. https://doi.org/10.1109/TPAMI.2014.2345401
4. Fukuchi K, Miyazato K, Kimura A, Takagi S, Yamato J (2009) Saliency-based video segmentation with graph cuts and sequentially updated priors. In: 2009 IEEE international conference on multimedia and expo (ICME), pp 638-641
5. Gao H, Zhuang L, Laurens M, Kilian W (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2261-2269
6. Guo C, Ma Q, Zhang L (2008) Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In: 2008 IEEE conference on computer vision and pattern recognition (CVPR), pp 1-8
7. Hou Q, Cheng MM, Hu X, Borji A, Tu Z, Torr PHS (2018) Deeply supervised salient object detection with short connections. IEEE Trans Pattern Anal Mach Intell 1-1
8. Hsu KJ, Lin YY, Chuang YY (2017) Weakly supervised saliency detection with a category-driven map generator. In: British machine vision conference (BMVC)
9. Hu P, Shuai B, Liu J, Wang G (2017) Deep level sets for salient object detection. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR)
10. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221-231
11. Jiang B, Zhang L, Lu H, Yang C, Yang MH (2013) Saliency detection via absorbing Markov chain. In: 2013 IEEE international conference on computer vision (ICCV), pp 1665-1672
12. Judd T, Ehinger K, Durand F, Torralba A (2009) Learning to predict where humans look. In: 2009 IEEE 12th international conference on computer vision, pp 2106-2113
13. Kazuma A, Fukuchi K, Kimura A, Takagi S (2010) Fully automatic extraction of salient objects from videos in near real-time. CoRR 1-25
14. Le TN, Sugimoto A (2017) Spatiotemporal utilization of deep features for video saliency detection. In: 2017 IEEE international conference on multimedia and expo workshops (ICMEW). IEEE, pp 465-470
15. Lee YJ, Kim J, Grauman K (2011) Key-segments for video object segmentation. In: 2011 international conference on computer vision (ICCV), pp 1995-2002
16. Li G, Yu Y (2015) Visual saliency based on multiscale deep features. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 5455-5463
17. Li G, Xie Y, Wei T, Wang K, Lin L (2018) Flow guided recurrent neural encoder for video salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3243-3252
18. Li GB, Xie Y, Lin L, Yu YZ (2017) Instance-level salient object segmentation. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 247-256
19. Li J, Levine M, An X, He H (2011) Saliency detection based on frequency and spatial domain analyses. In: Proceedings of the British machine vision conference (BMVC). BMVA Press, pp 86.1-86.11
20. Li J, Xia C, Chen X (2018) A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection. IEEE Trans Image Process 27(1):349-364
21. Li X, Zhao LM, Wei L, Yang MH, Wu F, Zhuang YT, Ling HB, Wang JD (2016) DeepSaliency: multi-task deep neural network model for salient object detection. IEEE Trans Image Process 25(8):3919-3930
22. Liu N, Han J (2016) DHSNet: deep hierarchical saliency network for salient object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 678-686
23. Liu T, Zheng N, Wei, Yuan Z (2008) Video attention: learning to detect a salient object sequence. In: 2008 19th international conference on pattern recognition (ICPR), pp 1-4
24. Luo ZM, Mishra A, Achkar A, Eichel J, Li SZ, Jodoin PM (2017) Non-local deep features for salient object detection. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR)
25. Ma T, Latecki LJ (2012) Maximum weight cliques with mutex constraints for video object segmentation. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR), pp 670-677
26. Margolin R, Tal A, Zelnik-Manor L (2013) What makes a patch distinct? In: 2013 IEEE conference on computer vision and pattern recognition, pp 1139-1146
27. Mohammadi M, Al-Fuqaha A, Guizani M, Oh JS (2018) Semisupervised deep reinforcement learning in support of IoT and smart city services. IEEE Internet Things J 5(2):624-635. https://doi.org/10.1109/JIOT.2017.2712560
28. Ochs P, Malik J, Brox T (2014) Segmentation of moving objects by long term video analysis. IEEE Trans Pattern Anal Mach Intell 36(6):1187-1200
29. Perazzi F, Pont-Tuset J, McWilliams B, Gool LV, Gross M, Sorkine-Hornung A (2016) A benchmark dataset and evaluation methodology for video object segmentation. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 724-732
30. Rahtu E, Kannala J, Salo M, Heikkilä J (2010) Segmenting salient objects from images and videos. In: Proceedings of the 11th European conference on computer vision: part V (ECCV), ECCV'10. Springer, Berlin, pp 366-379
31. Seo PM (2009) Static and space-time visual saliency detection by self-resemblance. J Vis 9(12):15
32. Sezer OB, Dogdu E, Ozbayoglu AM (2018) Context-aware computing, learning, and big data in internet of things: a survey. IEEE Internet Things J 5(1):1-27. https://doi.org/10.1109/JIOT.2017.2773600
33. Stankovic JA (2014) Research directions for the internet of things. IEEE Internet Things J 1(1):3-9. https://doi.org/10.1109/JIOT.2014.2312291
34. Sudre CH, Li WQ, Vercauteren T, Ourselin S, Cardoso MJ (2017) Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, pp 240-248
35. Sumati M, Shanu S (2016) Analysis of computer vision based techniques for motion detection. In: Cloud system and big data engineering. IEEE, pp 445-450
36. Lijun W, Huchuan L, Xiang R, Ming-Hsuan Y (2015) Deep networks for saliency detection via local estimation and global search. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 3183-3192
37. Wang LJ, Lu HH, Wang YF, Feng MY, Wang D, Yin BC, Ruan X (2017) Learning to detect salient objects with image-level supervision. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR)
38. Wang T, Borji A, Zhang LH, Zhang PP, Lu HC (2017) A stagewise refinement model for detecting salient objects in images. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 4019-4028
39. Wang W, Shen J, Porikli F (2015) Saliency-aware geodesic video object segmentation. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 3395-3402
40. Wang W, Shen J, Shao L (2015) Consistent video saliency using local gradient flow optimization and global refinement. IEEE Trans Image Process 24(11):4185-4196
41. Wang W, Shen J, Shao L (2018) Video salient object detection via fully convolutional networks. IEEE Trans Image Process 27(1):38-49
42. Xiao X, Xu C, Rui Y (2010) Video based 3D reconstruction using spatio-temporal attention analysis. In: 2010 IEEE international conference on multimedia and expo (ICME), pp 1091-1096
43. Yang C, Zhang LH, Lu HH, Ruan X, Yang M (2013) Saliency detection via graph-based manifold ranking. In: 2013 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 3166-3173
44. Zhang L, Tong MH, Marks TK, Shan H, Cottrell GW (2008) SUN: a Bayesian framework for saliency using natural statistics. J Vis 8(7):32
45. Zhao R, Ouyang W, Li H, Wang X (2015) Saliency detection by multi-context deep learning. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 1265-1274

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
