Neurocomputing 295 (2018) 127–141
Correlational examples for convolutional neural networks to detect small impurities

Yue Guo a,b,∗, Yijia He a,b, Haitao Song a, Wenhao He a, Kui Yuan a

a Institute of Automation, Chinese Academy of Sciences, Beijing, PR China
b School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, PR China
∗ Corresponding author at: Institute of Automation, Chinese Academy of Sciences, Beijing, PR China.
Article info

Article history: Received 8 May 2017; Revised 22 February 2018; Accepted 6 March 2018; Available online 14 March 2018
Communicated by Xiang Bai

Keywords: Impurity detection; Multi-frame correlation; Convolutional neural network; Correlational example; Sequential training
Abstract

Convolutional neural networks have long been improving the performance of common object detection. However, targets across frames are usually detected independently in an image sequence, and object detection methods over multiple frames are generally divided into two main stages: object detection in every single frame and feature map association across frames. In this paper, a multi-frame detection framework is proposed to directly detect small impurities in opaque glass bottles filled with liquor. Specifically, a convolutional neural network trained with correlational examples simultaneously detects and correlates proposals, and then links them selectively to obtain robust detection results under challenging illuminations. Besides, the memory cost of patch pairs becomes extremely large compared with that of single patches, so a sequential training procedure is introduced to relax hardware requirements. Extensive experiments on impurity datasets demonstrate the superior performance of the multi-frame detection framework with convolutional neural networks over traditional single-frame models.
1. Introduction

Impurity detection and classification in transparent bottles may already be solved with traditional machine learning methods, and most researchers focus on studying the shapes and motions of impurities, since the backgrounds are relatively static visually. However, to the best of our knowledge, no research on impurity detection and classification in opaque glass bottles has been published, and in our engineering work, approaches based only on the motions and shapes of impurities have already proved ineffective, because backgrounds in opaque glass bottles are much more dynamic and complex. Different from impurity detection in transparent bottles, impurities cannot be observed from outside the bottle wall, so a camera is lowered directly into the bottle and samples images above the liquid level; one of the special characteristics of the opaque bottles in our task is that decorative patterns are carved on the bottle bottoms. Consequently, considering the locations of the cameras and the characteristics of the bottles themselves, impurity detection performance in an opaque glass bottle depends on several factors, including non-uniform light exposure of the bottle bottom, fragment motions of the carved bottom pattern, impurity-background discriminative
feature extraction, and the impurity detection method. Specifically, non-uniform exposures result from the varying bottom thicknesses and colors of each bottle; fragment motions of the carved bottom pattern are local shifts of parts of the design caused by tilts and fluctuations of the liquor level under intense lighting; and the performance of discriminative feature extraction and detection mainly depends on imaging quality and the impurity detection framework. The impurity detection problem in opaque glass bottles can also be characterized by the sampled images: firstly, impurities have only a few exclusive features; secondly, under the same illumination strength, the gray-level ranges of images from different bottles are unstable, because a camera may sometimes be put into a nearly transparent bottle and at other times into a completely dark one. Although lighting conditions have been improved with stronger lights, the thicknesses of bottle bottoms still vary. Therefore, impurity detection in opaque glass bottles remains challenging. To address this problem, a multi-frame detection framework with a convolutional neural network is proposed to directly detect and link object proposals in a finite number of frames, and a sequential training method for convolutional neural networks is additionally introduced for restricted memory. Moreover, detection performance for different bottle colors is intuitively analyzed through precision-recall visualization. The main contributions of this paper are summarized as follows:
• Only one convolutional neural network, trained with correlational examples, simultaneously correlates and detects impurities in bottles with liquor under inadequate lighting conditions.
• A multi-frame framework based on a convolutional neural network is applied to link predicted impurities and eliminate background fluctuations.
• A small dataset alone is far from adequate for a convolutional neural network, but after patch balancing and pairing, the correlational dataset becomes very large. To train and evaluate the convolutional neural network under restricted memory, a simple sequential training procedure is introduced to address this issue efficiently.

The rest of this paper is organized as follows. In Section 2, prior work on detection and matching is briefly introduced. In Section 3, a multi-frame detection framework based on a convolutional neural network is proposed, and training networks sequentially with correlational examples is detailed. Analyses of single-frame and multi-frame experiments are provided separately, and multi-frame experiments are then compared quantitatively with single-frame tests in Section 4. Finally, our conclusions and future work are presented in Section 5.

2. Related work

Impurity detection in transparent bottles is the application closest to our work (Section 2.1), and our research is mainly inspired by ideas from object detection (Section 2.2). Since FPGAs temporarily provide image patches and relevant information, methods based on image patches, such as image patch matching (Section 2.3), are also considered but applied differently in our task.

2.1. Impurity detection in transparent bottles

Impurity detection in opaque glass bottles may be the first work in this field, since no related public work has been found. Therefore, only impurity detection in transparent bottles is introduced, which mainly consists of the following cascaded steps.

Initially, moving impurities and backgrounds are separated over a continuous image sequence. Second-order differencing and accumulative energy were applied to separate moving impurities from backgrounds [1]. Moreover, fused image differencing was used to detect motion regions, and fuzzy c-means clustering combined with fuzzy support vector machines then divided impurities and backgrounds [2]. Similarly, fuzzy c-means clustering combined with least squares filtering suppressed backgrounds, and prior knowledge such as the area, gray values, and location of each proposal region was then considered to detect and track impurities [3].

Secondly, moving impurities are distinguished from bubbles. The maximum length of a moving blob divided by its minimum length was found to be the best feature for shape classification, and support vector machines then separated bubbles from impurities [1]. Trajectories of moving objects can also be constructed to differentiate bubbles from impurities in ampoule injections [4]. Features including the gray level, shape, and position of a blob were used to train fuzzy least squares support vector machines to classify impurities and bubbles [3]. Additionally, feature vectors including several shape features were generated to classify different impurities [4].

2.2. Object detection

Many object detection frameworks are applied in a single frame, and the majority of relevant researchers focus on large-scale image detection benchmarks. Mostafa Mehdipour Ghazi et al.
[5] used deep neural networks to identify plant species through transfer learning and discovered that simpler models such as AlexNet are easier to fine-tune than others like GoogLeNet and VGGNet. However, public datasets are sometimes quite different from specific detection tasks in special settings, and fine-tuning may not even make a network converge. For example, the real-time object detection framework You Only Look Once (YOLO v2) [6] achieves state-of-the-art performance on common object detection benchmarks, but when applied to our task it does not converge, whether fine-tuned or trained from scratch.

Famous benchmarks facilitate the development and application of network architectures. Christian Szegedy et al. added 1 × 1 convolutions to the Inception module to reduce dimensions and achieved state-of-the-art performance on large-scale datasets [7]. Meanwhile, even with a small amount of training data, convolutional neural networks may still perform well. Shiqi Yu et al. applied convolutional neural networks to classification in hyperspectral images, and their work demonstrates that a well-designed network can also generalize well with few training samples; specifically, their architecture has no max pooling layers or fully connected layers, and uses proper convolution kernels and larger dropout rates.

However, objects occupy large pixel areas in some public benchmarks such as ImageNet [8], so a trained network naturally prefers to detect big objects, whereas COCO [9] contains more tiny and occluded objects. Small object detection can be challenging when a single convolutional neural network is used directly, possibly for the following reasons: receptive fields in high-level layers are quite large, so firstly they do not encode sufficiently informative features if objects are too small, and secondly deeper layers may become less representative for tiny objects when they absorb more information from outside the regions of interest.

On the one hand, to alleviate such problems, object detection tasks can be divided into two steps: locating relatively notable target contexts and detecting small objects within these contexts. Junhua Sun et al. [10] proposed an automatic fault recognition system with convolutional neural networks. Specifically, a network in the first stage detects region proposals of side frame keys and shaft bolts, while a network in the second stage identifies faults from those region proposals; Region of Interest (RoI) pooling operations are applied to extract regions from the last convolutional feature maps. On the other hand, high-level feature maps can be concatenated with both original images and low-level feature maps to preserve small but informative features. To detect tiny faces, Chenchen Zhu et al. [11] fused high-level and low-level feature maps to generate region candidates and concatenated candidate features after RoI pooling operations in different layers; false positives can be further rejected with the help of the human body when face detections are unconvincing. Xiaodan Sui et al. [12] applied convolutional neural networks to segment choroidal boundaries. Specifically, a coarse-scale network learns global features, a mid-scale network concatenates output feature maps from the coarse-scale model with downsampled original images to obtain mid-level edge-cost maps, and finally a fine-scale network concatenates the original image with upsampled edge-cost maps to refine high-resolution boundaries.
Sampled images may inevitably be polluted by noise from their surroundings, which can lead to meaningless detections, and inspection models can refuse to detect noisy samples beforehand with convolutional neural networks. Honghan Chen et al. [13] proposed a cascaded spatial-temporal deep framework to inspect gastrointestinal tract diseases: a convolutional neural network first removes noisy content seen from a capsule endoscopy, and then another one classifies organs in the clear content.

Features of convolutional neural networks can also assist tracking tasks. Tao He et al. [14] applied convolutional neural
Fig. 1. Examples in an image sequence: assuming that an image sequence contains three frames, one impurity patch and several background patches are extracted in each frame (see the top left; an image patch is represented as a small square). On the left (“image”), only the impurities are augmented using rotations, and backgrounds remain the same. In the middle part (“image + adversarial”), adversarial examples are generated and combined with the augmented data to train models. On the right part, impurities and backgrounds from the left part are paired and labeled to make correlational examples; specific details are given in Fig. 2. Obviously, the set of correlational examples is much larger than the first two types of data, so background patches can be randomly sampled before data augmentation to alleviate this problem. Labeling for correlational examples differs from that used for image classification: only [impurity, impurity] is labeled as 1 (“impurity pair”), while [background, background], [background, impurity], and [impurity, background] are labeled as 0 (“background pair”). If we treat “impurity pair” and “background pair” as two different classes, the one-hot representations of [background, background], [background, impurity], [impurity, background], and [impurity, impurity] are [1, 0], [1, 0], [1, 0], and [0, 1].
networks to learn cell features, and a particle filter model outputs confidences for all candidates; the candidate with the highest confidence is then treated as the cell position in the subsequent frame, and multi-task learning shares network parameters to obtain joint features among related tasks.
2.3. Image patch comparison

Siamese [15] is a network with two branches that extract image features, and a contrastive loss is used to train the network. A similar architecture was successfully applied to disparity map estimation in stereo vision [16,17] and person re-identification [18–20]. However, the branch architectures and their corresponding weights are shared, which means the two branches could be unified into one with entangled correlations. Feature extraction and correlation should arguably be separated, since they are intuitively different concepts for humans. Wei Li et al. [21] proposed a filter pairing neural network to re-identify a person across disjoint camera views, and the network mainly consists of three steps: extracting features of input image patches from different camera views, matching responses of patch features, and recognizing identities. Specifically, a convolutional network first processes two cropped pedestrian images and outputs corresponding feature maps, patch matching layers then match local patches, and finally the network outputs two probabilities indicating whether the two pedestrian images belong to the same person.
Experiments have also been extended to feature correlations at different levels. Mauricio Perez et al. [22] combined optical flows and MPEG motion vectors with static image information for video pornography detection, and further found that fusing extracted convolutional features, or fusing decisions from separate branches, outperforms fine-tuned networks and models in which static and dynamic information is mixed before feature extraction. Zifeng Wu et al. [20] experimented with three different convolutional neural network architectures (matching local features at the input layer, matching mid-level features at the top layer, and matching global features at the top layer) to identify cross-view gaits, and the ensemble of gait energy images and temporal information achieves the best recognition performance, although these architectures perform similarly on their datasets. Sergey Zagoruyko et al. [23] proposed convolutional neural network architectures to compare image patches across images, and their experiments show that a two-channel convolutional neural network with two streams outperforms other methods on local image patch benchmarks. Specifically, the input of a two-channel convolutional neural network is an image with two channels, each of which contains an image patch to be compared, and the network outputs the similarity between the two patches. As for a two-stream network, the two streams take an original image patch pair and its central downsampled patch pair as inputs separately, and a single stream extracts features for one patch pair. Each network is trained using a hinge loss function with squared
Fig. 2. Correlational examples generated with augmented data: in Fig. 1, we take five background patches in the third frame and 15 augmented impurity patches in all the frames as an example. Each background patch is paired with every augmented impurity patch, and the network input has two channels, where a patch in the pair corresponds to a channel.
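To make the pairing and labeling rules of Figs. 1 and 2 concrete, the following minimal sketch builds correlational examples from per-frame patch lists; the array layout, the restriction to neighboring frames, and the function name are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def make_correlational_examples(frames):
    """frames: list of (patches, labels) per frame of one sequence, where
    patches is an (N, 20, 20) array of gray patches and labels is an (N,)
    array with 1 for impurity and 0 for background (illustrative layout)."""
    pairs, pair_labels = [], []
    for (patches_a, labels_a), (patches_b, labels_b) in zip(frames[:-1], frames[1:]):
        for pa, la in zip(patches_a, labels_a):
            for pb, lb in zip(patches_b, labels_b):
                # Two-channel network input: one patch per channel.
                pairs.append(np.stack([pa, pb], axis=-1))
                # Only [impurity, impurity] is a positive pair; any pair
                # containing a background patch is labeled 0.
                pair_labels.append(1 if (la == 1 and lb == 1) else 0)
    return np.asarray(pairs), np.asarray(pair_labels)
```

One-hot targets ([1, 0] for background pairs and [0, 1] for impurity pairs) can then be derived from pair_labels before training.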
L2-norm regularization and is evaluated by setting a similarity distance threshold in the feature space. Actually, the conclusions of [20,22] do not conflict with those of image patch comparison [23], since in image patch comparison the input contents lie on the same input feature manifold. Static input features and dynamic ones, however, may lie on different manifolds; therefore, even though image inputs contain diverse types of information, given the same targets, similar high-level semantic features will be extracted by convolutional neural networks. In our task, a convolutional neural network classifies all the image patch proposals in a frame and simultaneously correlates them across multiple frames. Moreover, since a two-channel convolutional neural network performs better than a one-channel network with two feature extraction branches [23], a convolutional neural network with a two-channel input is preferred in our application instead of a Siamese network. A two-channel convolutional neural network only learns to compare patches, whereas a convolutional neural network trained with correlational examples can classify proposals, and the main purpose of our model is to classify impurities and fluctuations. Therefore, instead of a hinge-based loss with binary labels for patch matching [23], or a triplet loss among high-level feature maps [24] with an additional redundant output branch for classifying patch pairs, a cross-entropy loss is simply used, which means the network can be intuitively viewed as a classification model with correlational constraints from the samples.

3. Multi-frame correlation and detection

Multi-frame correlation includes generating correlational examples (Section 3.1), sequentially training a convolutional neural network with correlational examples, and applying the trained model to proposal pairs (Section 3.2). Multi-frame detection consists of linking patch pairs and turning the probability shared between two patch pairs in different frames into a binary decision on whether every patch pair contains impurities (Section 3.3).

3.1. Correlational examples

Pairs of image patches and areas are prepared for convolutional neural network training and evaluation (see Fig. 3). Specifically, the generation of correlational examples can be divided into two parts: firstly, image patches and corresponding bounding box areas
are augmented to increase and balance binary samples, which is the same as single-frame data augmentation (see the left part of Fig. 1); secondly, a patch pair is automatically labeled positive when both image patches contain impurities, and labeled negative when at least one background patch is included. Besides, only one impurity is put into a bottle during data sampling, only one complete or major part of an impurity is labeled per sequence during human labeling, and patches are only paired within the same sequence. Therefore, correlations among dissimilar impurities are largely prevented: although different kinds of impurities are merged into a single impurity class, most correlations only happen between impurities with the same identity in different frames. Specific operations for generating correlational examples in an image sequence are illustrated on the right part of Fig. 1 (different stages of example generation are separated by two green vertical solid line segments).

3.2. Sequential training

Correlational examples are generated and automatically labeled, and then used for supervised learning of a convolutional neural network. Specifically, back-propagation calculates the gradients of the model weights, and Adam [25] provides adaptive learning rates to update them. After training, the model can be applied to real-time image patch sequences. When models run in real time, only the image patches and their areas are provided through FPGAs, and they are paired into proposal pairs, which are later used as inputs of the convolutional neural network, as shown in Fig. 3. After the network outputs two probabilities, thresholds Pcorr+ and Pcorr− are set to remove pairs that contain at most one impurity.

Correlational examples become extremely numerous after patch pairing. A minority of highly informative samples may benefit the decision boundary, but the majority of less informative samples help learn robust features when training data are sufficient [26]. Therefore, to train on a dataset of a specific bottle color, instead of active learning or online hard negative mining, sequential training is preferable for us under restricted computer memory. Specifically, the training dataset D is first randomly shuffled and divided into n_train subsets {D_1, D_2, ..., D_{n_train}}, and n_test subsets {D′_1, D′_2, ..., D′_{n_test}} are similarly obtained from the test dataset D′; then the training sets are sequentially combined as
{{D_1, D_2}, {D_1, D_3}, ..., {D_1, D_{n_train}},
 {D_2, D_3}, {D_2, D_4}, ..., {D_2, D_{n_train}},
 ...,
 {D_{n_train − 1}, D_{n_train}}},    (1)

where in {D_i, D_j}, D_i represents the ith subset used for training and D_j the jth subset used for validation. Test subsets are handled in a similar way. Generally, sequential training can be viewed as cross validation over combinations of one part of the data for training and a subsequent part for validation.
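As a minimal sketch of the subset pairing in Eq. (1), assuming the shuffled data are simply chopped into equal chunks (function and argument names are illustrative):

```python
import itertools
import random

def sequential_splits(dataset, n_subsets, seed=0):
    """Shuffle the dataset, split it into n_subsets chunks, and yield the
    (training chunk, validation chunk) combinations listed in Eq. (1)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    size = len(data) // n_subsets
    chunks = [data[i * size:(i + 1) * size] for i in range(n_subsets)]
    # {D_i, D_j} with i < j: D_i is used for training and D_j for validation.
    for i, j in itertools.combinations(range(n_subsets), 2):
        yield chunks[i], chunks[j]
```

Each yielded pair corresponds to one training/validation round, so only two subsets have to reside in memory at any time.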
3.3. Linking impurity sequences across frames

Image patch pairs satisfying the threshold values of the multi-frame convolutional neural network are connected into several sequences according to their maximum probabilities; the sequence containing the most patches is preserved as an impurity sequence, while patches in other sequences are predicted as backgrounds. For example, assume that, after thresholding the outputs of the convolutional neural network, five, one, and three image patches remain in the first, second, and third frames, respectively. Image patches in neighboring frames are paired, so there are five proposal pairs in the first two frames, while three pairs
[Fig. 3 diagram: 80 × 80 image patches from two sampled frames are centrally clipped (1/2) and scaled (1/2) to 20 × 20 and combined with normalized blob areas into proposal pairs as the network input; the network consists of Conv1 and Conv2 (3 × 3 × 32, stride 1), max pooling (2 × 2, stride 2), Conv3 and Conv4 (3 × 3 × 16, stride 1), max pooling (2 × 2, stride 2), Conv5 and Conv6 (3 × 3 × 16, stride 1), max pooling (2 × 2, stride 2), a fully connected layer with 240 units, and binary outputs.]
Fig. 3. Applying a convolutional neural network trained with correlational examples to neighboring frames: 40 × 40 image patches (blue and yellow rectangles in the original image) and blob areas are directly provided by FPGAs using three-frame differencing, morphological operations, and some area thresholding rules. The image patches are then rescaled to 20 × 20, while blob areas are normalized toward the normal distribution and limited to [0, 1]. Next, all the patches between the frames at the current and subsequent time steps are matched and sent to the convolutional neural network. When areas are considered as model inputs, the normalized areas are concatenated with the flattened convolutional feature maps of the two image patches, and the network outputs binary predictions. Besides, when preparing training data, 80 × 80 image patches (green and red rectangles in the original images) are centrally cropped for image patch augmentation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
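The layer labels recovered from Fig. 3 suggest the following two-channel model; padding, activation functions, and the exact way the normalized blob areas are concatenated are assumptions where the figure is ambiguous, so this Keras sketch is only one plausible reading of the architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_correlational_cnn():
    patch_pair = layers.Input(shape=(20, 20, 2), name="patch_pair")  # one patch per channel
    areas = layers.Input(shape=(2,), name="normalized_areas")        # normalized blob areas of the pair

    x = layers.Conv2D(32, 3, padding="same", activation="relu")(patch_pair)  # Conv1 3x3x32/1
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)            # Conv2 3x3x32/1
    x = layers.MaxPooling2D(2, strides=2)(x)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)            # Conv3 3x3x16/1
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)            # Conv4 3x3x16/1
    x = layers.MaxPooling2D(2, strides=2)(x)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)            # Conv5 3x3x16/1
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(x)            # Conv6 3x3x16/1
    x = layers.MaxPooling2D(2, strides=2)(x)
    x = layers.Flatten()(x)
    x = layers.Concatenate()([x, areas])           # area pair joins the flattened features
    x = layers.Dense(240, activation="relu")(x)    # fully connected layer with 240 units
    outputs = layers.Dense(2, activation="softmax")(x)  # one-hot binary prediction

    model = tf.keras.Model([patch_pair, areas], outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")  # Adam + cross-entropy, as stated in the text
    return model
```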
are in the last two frames. In the first two frames, the patch pair with the maximum probability of being an impurity pair is preserved, so only one proposal pair is left; in the last two frames, the pair with the maximum probability is similarly retained; finally, since the patch in the second frame is connected with the remaining patch in the first frame and with one in the third frame, these three patches are connected into a sequence. However, there are 15 proposal pairs between the first and third frames; if the chosen patches described above in the first and third frames happen to form the pair with the maximum probability, only one predicted impurity sequence exists, otherwise another patch may be linked to the current sequence, or a new short sequence connects two other nodes in the first and third frames. Impurity linking operations in an image sequence are shown in Fig. 4. Since there may exist many broken patch sequences (disconnected impurities, background fluctuations, or mixtures of both) in a single image sequence, final decisions are further made according to the largest number of impurities connected in an image sequence.

Although impurity trajectories could be further considered to refine detection results, no such experiments are currently conducted, due
to the following reasons. The most important is that a lack of impurity locations in many frames may lead to an unstable final decision, while enough impurity locations obtained throughout a sequence suggests lighting conditions under which simple detectors may already be efficient. Next, besides computation costs, a comprehensive analysis of impurity motions takes more memory, because one patch node in a frame may link to many nodes in other frames at the same time; for example, if there exist two different proposal nodes (p1,1 and p1,2, where pi,j is the jth node in the ith frame) in the first frame, while every other frame contains only one proposal node (p2,1, p3,1, ..., p11,1), then two sequences ({p1,1, p2,1, ..., p11,1} and {p1,2, p2,1, ..., p11,1}) have to be preserved instead of one, which incurs unnecessary memory costs.

4. Experiments

To compare the performance of convolutional neural networks using correlational examples with that of single-frame models, single-frame experiments including support vector machines
Fig. 4. Impurity sequence linking: relative to an image patch in the previous frame, the patch pair consisting of this patch and the patch in a subsequent frame that shares the maximum probability of being predicted as impurities by the model trained with correlational examples is preserved first; likewise, relative to an image patch in the next frame, the pair including a patch in one of the previous frames with the maximum probability is retained. In some cases, if proposals include the two ends of a fiber in the first frame, one proposal connects to one end of the fiber, but another proposal in a frame many time steps later might link to the other end of the fiber; at this point an impurity sequence is deformed into a circle.
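A simplified sketch of the greedy linking rule in Section 3.3 and Fig. 4, restricted to links between neighboring frames; the tuple layout (frame index, patch id, next-frame patch id, impurity-pair probability) is an illustrative assumption.

```python
def link_impurity_sequences(pairs):
    """pairs: list of (frame, i, j, prob), meaning patch i in `frame` and patch j
    in `frame + 1` passed the thresholds with impurity-pair probability `prob`.
    Per frame transition only the most probable pair is kept, surviving links
    that share a patch are chained, and the longest chain is returned."""
    best = {}
    for frame, i, j, prob in pairs:
        if frame not in best or prob > best[frame][3]:
            best[frame] = (frame, i, j, prob)

    sequences, current = [], []
    for frame in sorted(best):
        f, i, j, _ = best[frame]
        if current and current[-1] == (f, i):    # the link continues the open sequence
            current.append((f + 1, j))
        else:
            if current:
                sequences.append(current)
            current = [(f, i), (f + 1, j)]       # start a new sequence
    if current:
        sequences.append(current)

    return max(sequences, key=len) if sequences else []
```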
trained with histograms of oriented gradients [27,28], convolutional neural networks trained with augmented data [29], and convolutional neural networks trained with adversarial examples [30] are conducted and can be regarded as prior work. Moreover, all the convolutional neural networks based on image patches have the same architecture except for their inputs. The datasets are introduced first (Section 4.1); then, since network inputs in a single frame and those in multiple frames differ, separate test data are used to evaluate single-frame models (Section 4.2) and networks trained with correlational examples (Section 4.3); finally, all methods are compared in unified evaluations (Section 4.4). Convolutional neural networks trained with augmented data, with adversarial examples combined with augmented data, and with correlational examples are simply annotated as “image”, “image + adversarial”, and “correlation” (see Fig. 1), and “area” means the model takes the area of a potential impurity proposal as an additional network input. Specifically, networks trained with augmented data, models with area inputs, networks using adversarial examples combined with augmented data, region-based object detection methods, and their implementation details are respectively introduced in A.1, A.2, A.3, A.4, and Appendix B.

Additionally, experiments are evaluated for different bottle colors. On the one hand, the provided bottles are already packed according to their colors, so impurities in bottles of the same color are detected together; on the other hand, the exposure parameters of the camera vary widely between bottle colors and have a large influence on the sampled images. Moreover, even for the same color such as gray, lighting conditions always change. Therefore, in order to stabilize detection performance, a single network is temporarily chosen for each bottle color. Examples of detection results of each method in different bottle colors are shown in Figs. B.9–B.20.

4.1. Dataset descriptions

Images and bounding boxes in red, gray, and black bottles are included in our datasets. Specifically, we put only one impurity into each bottle to ease image labeling for the correlation experiments. Then, 12 frames per bottle from the FPGA are sampled as an image sequence, and bounding boxes of potential impurity proposals are also recorded after simple preprocessing on the FPGAs
Table 1
Details of the impurity dataset (# means the number of items).

Set         | Class | # sequences | # patches | # augmented patches
Train red   | pos   | 874  | 3623   | 72,460
Train red   | neg   | 874  | 35,168 | 70,336
Train gray  | pos   | 918  | 1533   | 45,990
Train gray  | neg   | 918  | 49,828 | 49,828
Train black | pos   | 450  | 1565   | 37,560
Train black | neg   | 450  | 36,407 | 36,407
Test red    | pos   | 274  | 1319   | 11,871
Test red    | neg   | 274  | 11,592 | 11,592
Test gray   | pos   | 300  | 522    | 8713
Test gray   | neg   | 300  | 8174   | 8174
Test black  | pos   | 150  | 525    | 12,600
Test black  | neg   | 150  | 12,148 | 12,148
using three-frame differencing, morphological operations, and some area thresholding rules. Generally, the image sequences used for training and test are sampled separately, because regardless of how efficiently the proposal selection method works, negative proposals (background fluctuations) are usually more numerous than positive ones (impurities) in each frame; when training and test sets both come from the same sampling, data balancing merely balances the sample numbers of impurities and backgrounds, but the diversity of sampled impurities still remains much lower than that of backgrounds, which means the data distributions of impurities for training and for test become so similar that overfitting may not be easily observed when networks predict impurities in the test sets. Details of the impurity datasets are listed in Table 1. Impurities prepared for the impurity datasets include plugs, plastics, ribbons, fibers, and cotton, but in this task we merge them into one impurity class.

Similar to the outputs of the region proposal network in [10], in which non-maximum suppression [27] removes heavily overlapped regions before fault classification, non-maximum suppression is applied to remove redundant proposals before detection, for the following reasons: firstly, handling excessive image pairs increases computation costs; secondly, when an actuator rotates the captured bottle within a certain speed range, impurities move more drastically than any other interfering objects; thirdly, to ensure transmission speed, the sampling intervals of uncompressed image patches are much larger than those in common video analysis. Therefore, the motions of impurities should be more drastic than those of background fluctuations, and most of the time, after appropriate image preprocessing, few tiny fluctuation blobs exist near true impurity blobs (see Figs. B.15 to B.20). Differing from the common non-maximum suppression operated after detection, ours prefers proposals with larger blob areas before detection, since proposal sizes remain constant. Besides, in real-time applications, non-maximum suppression has already been executed on each FPGA board, but when creating datasets, patches are sampled without non-maximum suppression, so the diversity of negative samples in the impurity datasets is preserved.

Datasets prepared for training and test follow the same rules: both are augmented to balance positive and negative examples. To train a classification model, in single-frame experiments the training dataset is randomly split into two parts (one for training and the other for validation), while in multi-frame experiments the training part is divided into several parts following the sequential training rules.

4.2. Evaluations of single-frame experiments

Single-frame impurity classification after simple proposal localization is evaluated first on the test datasets. Several methods
Fig. 5. Receiver Operating Characteristic curves in single-frame classification, where solid, dotted and dashed lines represent performances in red, gray and black bottles, respectively.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
including histograms of oriented gradients, typical convolutional neural networks, convolutional neural networks with area inputs, and convolutional neural networks trained with adversarial examples (combined with augmented data) are evaluated on the impurity datasets for different bottle colors, as shown in Fig. 5. Compared with traditional feature extraction methods such as histograms of oriented gradients, features generated with convolutional neural networks are significantly better. Also, with manually designed features, the better performance in red bottles than in the other colors suggests that bottle colors have a great impact on impurity detection, and the performances in gray and black are equally poor. Relatively, convolutional neural networks perform similarly across bottle colors, with results in red and gray slightly better than, but still close to, those in black. Besides, models with area inputs outperform those trained with adversarial examples and augmented data.

4.3. Evaluations of multi-frame experiments

Multi-frame impurity detection is similar to its single-frame counterpart except that the network inputs have two channels; evaluations on the training sets and test sets are illustrated in Fig. 6. As a result, an area pair added as another network input slightly improves the detection performance of the model trained with correlational examples for every bottle color.

4.4. Unified evaluations

Complete images for which FPGAs output proposals including truly visible impurities are used as the input of each framework. This choice is based on the following reasons: firstly, during image sampling, we put an impurity into every bottle, but under poor illumination invisible impurities sometimes cannot be labeled by humans; secondly, proposals in the two-stage framework are restricted to the simple computer vision techniques our FPGAs currently apply, which can find almost all the visible impurities along with additional background proposals in red bottles, but cannot crop all the impurity proposals in gray and black bottles; nevertheless, single-frame experiments and multi-frame experiments can always be fairly compared in our
Fig. 6. Receiver Operating Characteristic curves in multi-frame correlation, where solid, dotted and dashed lines represent detection performances in red, gray and black bottles, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
two-stage framework if the proposals provided in the first stage are the same; thirdly, end-to-end detectors are trained to detect visible impurities that humans can label, so they are sufficient to find the labeled impurity proposals that FPGAs output, but when end-to-end detectors produce ambiguous or invisible results that FPGAs do not provide, humans also cannot always judge their correctness with enough confidence. Therefore, to reduce evaluation uncertainties as much as possible, unified evaluations are compared on the same dataset that contains the visible impurities that FPGAs can find, instead of the whole dataset.

Single-frame and multi-frame experiments are evaluated on different training sets and test sets. Potential impurity proposals are already recorded through FPGAs and labeled by humans instead of bounding boxes being predicted in a whole image; when a provided proposal is detected as an impurity, its probability is set to 1, otherwise it becomes 0. However, mainstream detectors such as Faster R-CNN [31] and SSD [32] predict object proposals directly in a whole image, and these proposals only represent impurities. In other words, impurities and backgrounds are treated as two different classes (“impurity” and “background”) in the single-frame and multi-frame experiments, but there is only one class (“impurity”) in the predictions of mainstream detectors. To compare common object detectors with our models, on the same dataset for which FPGAs output proposals that include true impurities, all the networks based on image patches are evaluated with the same rules applied to the mainstream detectors. Specifically, common object detectors output no negative predictions; therefore, when they detect an impurity in a frame, the true bounding box btrue corresponding to the labeled impurity is always recorded as 1, and an output bounding box bpred is predicted as 1 or 0 (if IoU(bpred, btrue) > 0.2, the output label is 1, otherwise it becomes 0); when they detect no labeled impurities in a frame, a fake bounding box located at the true impurity is predicted as 0, while the true bounding box is recorded as 1.

Single-frame precisions and recalls in red bottles (red dots in Fig. 7) are acceptable, but most of the precisions and recalls for impurities in gray bottles (green dots in Fig. 7) are worse than those in red bottles, and the recalls for detection in black bottles (blue dots in Fig. 7) are poor.
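The matching rule for the end-to-end detectors can be summarized by the following sketch; the (x1, y1, x2, y2) box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detector_output_label(b_pred, b_true, threshold=0.2):
    """A predicted box counts as the labeled impurity (1) only if its IoU with
    the true box exceeds the 0.2 threshold used in Section 4.4."""
    return 1 if iou(b_pred, b_true) > threshold else 0
```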
Fig. 7. A scatter plot with precisions and recalls shows unified evaluations. Specifically, red, green and blue dots represent results in red, gray and black bottles, and abbreviations are presented for each method: in single-frame occasions, ”I”s are convolutional neural networks trained with augmented data without areas, ”I + AREA”s are networks trained with augmented data with area inputs, and ”I + ADV”s indicate networks trained with adversarial examples and augmented data without areas; while in multi-frame occasions, ”C”s are networks trained with correlational examples without areas, and ”C + AREA”s represent networks trained with correlational examples with area inputs. Datasets used in training and test are respectively represented with black and blue text colors.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 2
Precisions and recalls on the training datasets.

Method              | Bottle color | Train/Test | Precision | Recall | F1 score
Faster r-cnn        | Red   | Train | 92.82 | 90.68 | 91.73
ssd                 | Red   | Train | 93.36 | 91.02 | 92.17
Image               | Red   | Train | 96.40 | 88.50 | 92.28
Image + Adversarial | Red   | Train | 96.58 | 90.69 | 93.54
Correlation         | Red   | Train | 98.07 | 93.06 | 95.50
Correlation + Area  | Red   | Train | 96.58 | 95.05 | 95.81
Image + Area        | Red   | Train | 93.79 | 98.61 | 96.14
ssd                 | Gray  | Train | 62.43 | 81.67 | 70.77
Faster r-cnn        | Gray  | Train | 73.98 | 78.26 | 76.06
Correlation + Area  | Gray  | Train | 80.18 | 73.22 | 76.54
Image + Adversarial | Gray  | Train | 86.38 | 78.22 | 82.10
Image               | Gray  | Train | 82.16 | 82.90 | 82.53
Image + Area        | Gray  | Train | 76.01 | 93.19 | 83.72
Correlation         | Gray  | Train | 91.98 | 88.88 | 90.40
Image + Adversarial | Black | Train | 94.31 | 60.67 | 73.84
Image               | Black | Train | 91.52 | 62.40 | 74.21
Correlation         | Black | Train | 94.66 | 71.36 | 81.38
ssd                 | Black | Train | 83.26 | 85.58 | 84.40
Faster r-cnn        | Black | Train | 87.69 | 83.27 | 85.42
Correlation + Area  | Black | Train | 91.05 | 83.31 | 87.01
Image + Area        | Black | Train | 85.38 | 91.03 | 88.11
Convolutional neural networks with area inputs (“image + area”) perform best in recall, and convolutional neural networks trained with correlational examples (“correlation”) have the highest precisions on the training sets and on the test set of red bottles, while models trained with adversarial examples combined with augmented data (“image + adversarial”) are the most precise on the test sets of gray and black bottles. Except for the f1 score on the training set of gray bottles, all the models with the best f1 scores have additional area inputs, as shown in Tables 2 and 3. Performance evaluation may be complicated if numbers for different bottles are compared directly. Therefore, for each method in training or test, the ratio of image sequences of each bottle color
Table 3
Precisions and recalls on the test datasets.

Method              | Bottle color | Train/Test | Precision | Recall | F1 score
ssd                 | Red   | Test | 84.44 | 88.15 | 86.26
Image + Adversarial | Red   | Test | 94.74 | 85.41 | 89.83
Image               | Red   | Test | 93.73 | 87.14 | 90.31
Faster r-cnn        | Red   | Test | 89.62 | 93.88 | 91.70
Correlation         | Red   | Test | 96.10 | 89.14 | 92.49
Image + Area        | Red   | Test | 91.37 | 97.29 | 94.24
Correlation + Area  | Red   | Test | 94.84 | 93.94 | 94.39
ssd                 | Gray  | Test | 61.53 | 77.67 | 68.66
Faster r-cnn        | Gray  | Test | 84.23 | 73.56 | 78.53
Correlation + Area  | Gray  | Test | 83.99 | 85.45 | 84.68
Image + Adversarial | Gray  | Test | 93.92 | 77.87 | 85.15
Image               | Gray  | Test | 91.05 | 82.17 | 86.38
Correlation         | Gray  | Test | 89.19 | 85.10 | 87.10
Image + Area        | Gray  | Test | 83.76 | 90.78 | 87.13
ssd                 | Black | Test | 61.33 | 63.72 | 62.50
Faster r-cnn        | Black | Test | 74.32 | 66.46 | 70.17
Image + Adversarial | Black | Test | 92.92 | 51.47 | 66.25
Correlation         | Black | Test | 85.71 | 62.50 | 72.29
Image               | Black | Test | 90.21 | 62.92 | 74.14
Correlation + Area  | Black | Test | 82.76 | 76.19 | 79.34
Image + Area        | Black | Test | 86.06 | 85.24 | 85.65
to those in all the colors is used to calculate a weighted mean of each measure (precision, recall, or f1 score), as shown in Table 4. Area inputs have a large impact on convolutional neural networks trained with augmented data in all bottles and on convolutional neural networks trained with correlational examples in red and black bottles, but areas are uncommon in most other public datasets and are correlated with object motions (the blob areas come from three-frame differencing). If the model inputs are limited to a single frame, then convolutional neural networks trained with correlational examples (“correlation”) perform better in f1 score than traditional networks trained with augmented data (“image”) and common object detectors (“faster r-cnn” and “ssd”), as shown in Table 4.
Table 4
Average precisions and average recalls on the impurity datasets.

Method              | Train/Test | Precision | Recall | F1 score
ssd                 | Train | 78.67 | 86.10 | 81.85
Faster r-cnn        | Train | 84.08 | 84.11 | 84.05
Image               | Train | 89.59 | 80.97 | 84.66
Image + Adversarial | Train | 91.95 | 79.56 | 84.90
Correlation + Area  | Train | 88.75 | 83.76 | 86.15
Image + Area        | Train | 84.82 | 94.87 | 89.45
Correlation         | Train | 94.89 | 86.99 | 90.58
ssd                 | Test  | 70.16 | 78.75 | 74.04
Faster r-cnn        | Test  | 84.22 | 79.78 | 81.78
Image + Adversarial | Test  | 94.02 | 75.25 | 83.01
Image               | Test  | 91.89 | 80.06 | 85.33
Correlation         | Test  | 91.08 | 81.95 | 86.07
Correlation + Area  | Test  | 87.82 | 86.74 | 87.25
Image + Area        | Test  | 87.12 | 92.10 | 89.51
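The color-weighted averaging behind Table 4 can be reproduced with a short sketch; the weights are the per-color shares of image sequences (Table 1), and the dictionary layout is illustrative.

```python
def weighted_mean(metric_per_color, sequences_per_color):
    """Weight each bottle color by its share of image sequences."""
    total = sum(sequences_per_color.values())
    return sum(metric_per_color[c] * sequences_per_color[c] / total
               for c in metric_per_color)

# f1 scores of "correlation + area" on the test sets (Table 3), weighted by the
# 274/300/150 test sequences per color (Table 1); this matches the 87.25
# reported in Table 4.
print(weighted_mean({"red": 94.39, "gray": 84.68, "black": 79.34},
                    {"red": 274, "gray": 300, "black": 150}))
```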
The multi-frame detection framework runs fast in our specific application, because the rebuilt model capacity fits the data scale well and non-maximum suppression before detection significantly reduces the input data to the framework. Specifically, the overall detection speed of a convolutional neural network trained with correlational examples ranges from 2 to 13 bottles per second under different illuminations (using a single GTX 1080), which exceeds the speed at which bottles are rotated (around 2 bottles per second).
5. Conclusions and future works

In this paper, we propose a multi-frame detection framework that finds impurities in opaque glass bottles with liquor using a convolutional neural network trained with correlational examples. It classifies and correlates image patch proposals between frames at current and subsequent time steps, and then selectively links them together to obtain a predicted sequence of impurity patches. Extensive experiments on the impurity datasets demonstrate that convolutional neural networks trained with correlational examples perform better than traditional models applied independently in a single frame.

Recorded proposals from FPGAs are labeled and evaluated in our two-stage framework, which excludes many images with impurities that are hard to detect under challenging illuminations, so it actually reduces the detection difficulty. However, the task remains challenging when some impurities are invisible or ambiguous. Compared to end-to-end object detection methods, the proposal generation stage in our framework using FPGAs will be replaced with more flexible modules. Meanwhile, lighting conditions should be more adaptive to bottle thicknesses; we will further improve lighting until all common impurities in their bottles can be confidently labeled by humans, and integrate our methods into multi-frame baseline frameworks for better comparisons.

Acknowledgment

This work is supported by the National Natural Science Foundation (NNSF) of China under Grant 61421004. Besides, we gratefully acknowledge the GPU supply of NVIDIA Corporation and the underlying platform support of Pengcheng Package Machinery Corporation.

Appendix A. Single-frame detection

It would be the best option if objects were directly detected with convolutional neural networks in a whole image. However, instead of transmitting a complete image, chosen small patches (40 × 40) are received through sockets from the FPGAs. This compromise has the following reasons: firstly, the self-made image capture boards are unable to transmit whole images in real time (around 5 fps for a 480 × 480 gray image), because the output images processed by the FPGAs are currently uncompressed; secondly, FPGAs can also do some easy work such as proposal selection to lighten the burden on the server; specifically, in every FPGA, each frame is preprocessed with differencing among three frames and basic morphological operations to extract and localize region proposals; thirdly, proposal selection could also be done on a server, but for now no specialized cameras that can be put directly into a small bottle are available on the market.

A constant proposal size is preferred in our application. Since proposals are initially provided after morphological operations, some large impurities may be cut into several small pieces, and the centers of proposals may not even roughly coincide with those of true impurities; impurity size also varies, but compared with background fluctuations, the features of impurities remain much simpler than those of backgrounds. In other words, preprocessing may inevitably destroy the overall features of impurities, and resizing fluctuation examples may ruin the original feature distributions of the negative examples. Therefore, to preserve the true data distribution of backgrounds, image patch scaling is forbidden in proposal extraction. Since the input dimension of a convolutional neural network is fixed, we empirically choose 40 × 40 image patches.

A1. Convolutional neural networks for impurity detection
Overfeat [33] is a multi-scale object detection framework with sliding window search, and its computation costs are high when each frame is scanned completely. However, Overfeat treats detection as classification in a spatial manner, and if we select detected areas with proposal priors, small impurities become relatively larger in these areas compared with the same impurities in a whole image. Therefore, in our application, proposal locations are readily determined by prior information instead of directly predicting bounding box coordinates in an image.

As with other convolutional neural networks, the one applied to single-frame impurity detection consists of convolutional layers and fully connected layers, and it is the same as the models trained with correlational examples except for its one-channel input. After small proposals are localized, detection degenerates into classification, which basically has three benefits: firstly, in most cases the central locations of image patches happen to be close to those of potential impurities, so data augmentation can be simplified without translation; secondly, even under constant illumination, local lighting conditions vary according to the thickness distribution of the bottle bottoms and painting quality, so sufficient samples under various illuminations are naturally provided; thirdly, assuming that the influence factors mentioned above can be controlled, the network input dimensions can be slightly reduced, which speeds up the forward propagation, and translation invariance does not necessarily have to be learned to obtain high-level semantic information.

One-hot outputs of the convolutional neural network contain two nodes that are respectively responsible for predicting impurities and background fluctuations. Therefore, two relevant thresholds Pimg+ and Pimg− are set to remove potential background fluctuations, and we simply use the cross-entropy loss function in the binary case:
ŷ_i = f(x_i, θ),
L(x, y, θ) = − Σ_{i=1}^{n} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],    (A.1)
where L is the loss function of a network, x is the model input, y is the true model output, θ contains model weights, n is the
Fig. A.8. Adversarial examples with perturbations of different magnitudes: the ε values used in this figure are 0.025, 0.05, 0.1, and 0.125, since the effects of perturbations are easier to see with large magnitudes. Specifically, the subfigures from left to right show examples of fluctuations and impurities in red, gray, and black bottles, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
number of training samples, f represents the forward computation, and its output yˆ is the model prediction.
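As a minimal numerical illustration of Eq. (A.1) (NumPy-based, with a clipping constant added only to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Eq. (A.1): cross-entropy summed over the n training samples, where
    y_pred holds predicted impurity probabilities and y_true the 0/1 labels."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```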
A2. Convolutional neural networks with object size

Input images must be resized in all convolutional neural networks for object detection. However, size is sometimes an essential cue for humans; for example, we can easily distinguish planes from toy planes. Vision-only networks may not perform well in this respect, since even a toy plane can appear very large in an image if it is close to the camera. Fortunately, because of our mechanical design, the distance between a camera and the bottle it inspects is roughly constant. Therefore, unlike other large-scale benchmarks, the influence of object size on convolutional neural networks can be evaluated on our impurity datasets. Here, we further analyze the influence on single-frame detection models when object sizes are considered. The area of a bounding box (the cyan rectangle in the first image in Fig. 3) is approximately taken as the proposal size in impurity detection. After the highest-level convolutional feature maps are vectorized with a flattening operation, the area input, a one-dimensional vector (two-dimensional if used in a convolutional network trained with correlational examples), is normalized and concatenated with the flattened vector.
A3. Convolutional neural networks with adversarial examples

Deep convolutional neural networks are vulnerable to adversarial examples due to model linearity in high-dimensional feature spaces [34]; therefore, a convolutional neural network trained with adversarial examples combined with augmented data is developed to improve detection performance on ambiguous impurities in a single frame. Adversarial examples are generated using the fast gradient sign method [30]. Given a model parameterized by θ, a slightly perturbed image patch x̃ as the model input, and its corresponding label y as the model output, a perturbed adversarial loss function L̃ is minimized:
L̃(x, y, θ) = (1/m) Σ_{k=1}^{m} L(x̃_k, y, θ)
           = (1/m) Σ_{k=1}^{m} L(x + ε_k sign(∇_x L(x, y, θ)), y, θ),    (A.2)
where ε_k is the max-norm bound of the kth perturbation and m is the total number of perturbations. In this paper, perturbations constrained by different maximum norms are used, as shown in Fig. A.8, because the magnitudes of the changes in high-level features are unknown, and adversarial examples with hierarchical max-norm bounds can cover more continuous manifolds.
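A minimal TensorFlow sketch of generating the perturbed inputs in Eq. (A.2) with several max-norm bounds; the bound values, the clipping to [0, 1], and all names are illustrative assumptions.

```python
import tensorflow as tf

def fgsm_examples(model, patches, labels, epsilons=(0.0125, 0.025, 0.05)):
    """Fast gradient sign method: one adversarial copy of `patches` per
    max-norm bound epsilon, computed from the model's training loss."""
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    adversarial = []
    for eps in epsilons:
        with tf.GradientTape() as tape:
            tape.watch(patches)                      # patches: float tensor in [0, 1]
            loss = loss_fn(labels, model(patches, training=False))
        grad = tape.gradient(loss, patches)
        adversarial.append(tf.clip_by_value(patches + eps * tf.sign(grad), 0.0, 1.0))
    return adversarial
```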
Fig. B.9. Detection results of Faster R-CNN in red bottles: red rectangles represent the locations of detected impurities, and the corresponding blue text boxes above give the probabilities of impurity classification. Faster R-CNN performs impressively in red bottles except for some detection errors on the bottle bottom in the fourth image sequence.
A4. Faster R-CNN and SSD As for Faster R-CNN [31], on the last convolutional layer, region proposals are generated by anchor boxes with different scales and aspect ratios using small sliding windows, while object classifications and bounding box regressions are predicted with fully-connected layers, given low-dimensional inputs mapped from convolutional feature maps in the window. Besides, a region proposal network and a detector network are alternately trained. Different from Faster R-CNN, bounding box offsets of SSD [32] are relative to the default boxes located in each feature map, and object classifications are predicted for each default box, which is similar to anchor boxes in Faster R-CNN but applied on multi-scale feature maps.
Appendix B. Implementation details

B1. Faster R-CNN and SSD

Convolutional feature maps are extracted with ZF-net [35], and four scales (16 × 16, 32 × 32, 64 × 64, and 128 × 128) are applied to the anchors, while the other settings remain the same as in [31]. To make Faster R-CNN perform at its best on our datasets, we adjust the threshold values for red, gray, and black bottles to 0.8, 0.6, and 0.2, respectively.

SSD removes proposal generation and sliding-window feature map sampling, but its manually designed default boxes in feature maps of different resolutions lead to a longer training time (more than a day for SSD with two GTX 1080 cards compared with
Fig. B.10. Detection results of Faster R-CNN in gray bottles: impurities on the bottle bottom can be detected if they are visible in the dark bottle, and it is natural that detectors can not detect impurities in the fourth image sequence.
Fig. B.11. Detection results of Faster R-CNN in black bottles: sometimes it may not detect indistinct impurities with high confidences if motions are not considered in the fourth image sequence.
Fig. B.12. Detection results of SSD in red bottles: orange rectangles represent locations of detected impurities and the corresponding orange text boxes above provide probabilities of impurity classification. SSD performs well in red bottles except for some missed detections in the third image sequence.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. B.13. Detection results of SSD in gray bottles: small nonopaque blocks in the dark bottle bottom can not be detected in the fourth image sequence.
Fig. B.14. Detection results of SSD in black bottles: small nonopaque blocks moving around the bottle wall can not be detected in the fourth image sequence.
Fig. B.15. Detection results of a single-frame convolutional neural network with an area input in red bottles: red and blue rectangles are proposals from FPGAs after non-maximum suppression and proposals with human labels, respectively, while green numbers above rectangles indicate detected impurities and give the binary probabilities of the single-frame network. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. B.16. Detection results of a single-frame convolutional neural network with an area input in gray bottles: impurities can still be detected with single-frame convolutional networks in the fourth image sequence.
Fig. B.17. Detection results of a single-frame convolutional neural network with an area input in black bottles: sometimes impurities can also be easily detected with models in a single frame in the fourth image sequence.
Fig. B.18. Detection results of a convolutional neural network trained with correlational examples in red bottles: rectangles are proposals similar to the annotations in Fig. B.15, but green numbers above rectangles record the lengths of impurity sequences. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. B.19. Detection results of a convolutional neural network trained with correlational examples in gray bottles: impurities can still be detected even under extremely inadequate lighting conditions (the fourth image sequence).
a quarter of a day for Faster R-CNN using the same single card). Besides, to ensure that the network converges, the initial learning rate is changed to 0.0001. In the available code, the scale of the lowest layer (s_min) and that of the highest layer (s_max) are 0.2 and 0.9, respectively, but s_min seems too large for small objects and is modified to 0.03. As a result, locations of small objects are accurate most of the time, but their corresponding probabilities are quite small, and the trained SSD cannot detect many small objects with the default probability threshold (0.6) in [32]. Therefore, to make SSD achieve state-of-the-art performance on our datasets, the threshold values in red, gray, and black bottles after careful modification are 0.3, 0.2, and 0.15, respectively.
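For reference, a minimal sketch of the linear default-box scale rule from the SSD paper [32] with the modified s_min = 0.03 is given below, together with the per-color probability thresholds quoted above; the assumption of six feature maps follows the reference SSD configuration rather than this paper's implementation.

```python
def ssd_scales(s_min=0.03, s_max=0.9, num_maps=6):
    """Linearly spaced default-box scales, one per feature map (rule from the SSD paper)."""
    if num_maps == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * k / (num_maps - 1) for k in range(num_maps)]

# Per-bottle-color probability thresholds used when keeping SSD detections.
SSD_THRESHOLDS = {"red": 0.3, "gray": 0.2, "black": 0.15}

print(ssd_scales())  # approximately [0.03, 0.204, 0.378, 0.552, 0.726, 0.9]
```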
B2. Single-frame and multi-frame models

All the methods in a single frame and those in multiple frames are implemented on the TensorFlow framework [36]. The single-frame convolutional neural network and the multi-frame model share the same architecture except for their inputs, which are respectively an image patch with its area and an image patch pair with its area counterparts, but the weights of all models are trained independently from scratch. Areas have to be normalized before being fed into the networks, and typically the normalized values lie in [0, 1]. Statistically, the means and standard deviations in red, gray, and black bottles are (58, 179), (52, 177), and (33, 129), respectively.
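A minimal Keras sketch of the two-input structure described above is given below; the layer sizes, patch resolution, and the assumption that the area enters as a single normalized scalar are hypothetical illustrations, not the actual architecture reported in the main text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_single_frame_net(patch_size=32):
    """Hypothetical two-input network: an image patch plus its normalized area."""
    patch = layers.Input((patch_size, patch_size, 1), name="patch")
    area = layers.Input((1,), name="area")          # area assumed pre-scaled to [0, 1]

    x = layers.Conv2D(16, 3, activation="relu")(patch)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    x = layers.Concatenate()([x, area])             # fuse appearance and size cues
    out = layers.Dense(2, activation="softmax")(x)  # impurity vs. background
    return tf.keras.Model([patch, area], out)

# The multi-frame variant would take a patch pair and an area pair as inputs instead,
# with the same backbone but independently trained weights.
```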
Fig. B.20. Detection results of a convolutional neural network trained with correlational examples in black bottles: impurities in the same frame might be linked together due to similarities inside the defined impurity class in the second image sequence.
Random split ratios between training sets and validation sets are 0.7:0.3, and test sets remain complete, to prepare the single-frame dataset, while we set n_train = 8 and n_test = 1 to prepare each multi-frame dataset. Besides, bounding box areas are simply copied in data augmentation. Thresholds P_img+ and P_img- for network outputs are empirically configured as 0.9 and 0.15 in single-frame detection, while in multi-frame tasks the analogous thresholds P_corr+ and P_corr- are fixed as 0.85 and 0.15, respectively. Networks using correlational examples take much longer training time than single-frame networks because of different dataset sizes; therefore, the numbers of epochs for convolutional neural networks trained with correlational examples and for single-frame convolutional neural networks are set to 8 and 80, respectively. Additionally, the batch size used for training is constantly set to 1000.
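As a hedged illustration (the exact decision logic is defined in the main text), the helper below applies the thresholds as a three-way rule: accept a proposal above the positive threshold, reject it below the negative one, and leave it undecided in between; the function name and the handling of the middle interval are assumptions for this sketch.

```python
P_IMG_POS, P_IMG_NEG = 0.9, 0.15     # single-frame thresholds P_img+ / P_img-
P_CORR_POS, P_CORR_NEG = 0.85, 0.15  # multi-frame (correlation) thresholds P_corr+ / P_corr-

def decide(prob, p_pos, p_neg):
    """Three-way decision on a network output probability."""
    if prob >= p_pos:
        return "accept"
    if prob <= p_neg:
        return "reject"
    return "undecided"

# A single-frame proposal with probability 0.92 is accepted as an impurity.
print(decide(0.92, P_IMG_POS, P_IMG_NEG))
# A cross-frame correlation score of 0.5 is neither linked nor discarded.
print(decide(0.5, P_CORR_POS, P_CORR_NEG))
```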
References

[1] Z. Bowen, W. Yaonan, G. Ji, Z. Hui, A machine vision intelligent inspector for injection, in: Proceedings of the Pacific-Asia Workshop on Computational Intelligence and Industrial Application, vol. 2, 2008, pp. 511–516, doi:10.1109/PACIIA.2008.276.
[2] H.J. Liu, A novel vision based inspector with light, Appl. Mech. Mater. 268–270 (2012) 1916–1921, doi:10.4028/www.scientific.net/AMM.268-270.1916.
[3] B. Huang, S. Ma, Y. Lv, C. Liu, H. Zhang, The study of detecting method for impurity in transparent liquid, Optik - Int. J. Light Electron. Opt. 125 (1) (2014) 499–503, doi:10.1016/j.ijleo.2013.07.017.
[4] H. Zhang, B. Zhou, F. Weng, C. Ru, C. Zhou, M. Tan, Y. Sun, A system of automated detection of ampoule injection impurities, IEEE Trans. Autom. Sci. Eng. 99 (2015) 1–10.
[5] M.M. Ghazi, B. Yanikoglu, E. Aptoula, Plant identification using deep neural networks via optimization of transfer learning parameters, Neurocomputing 235 (2017) 228–235, doi:10.1016/j.neucom.2017.01.018.
[6] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, arXiv preprint arXiv:1612.08242 (2016).
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9, doi:10.1109/CVPR.2015.7298594.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255, doi:10.1109/CVPR.2009.5206848.
[9] T. Lin, C.L. Zitnick, P. Dollár, Microsoft COCO: common objects in context, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
[10] J. Sun, Z. Xiao, Y. Xie, Automatic multi-fault recognition in TFDS based on convolutional neural network, Neurocomputing 222 (2017) 127–136, doi:10.1016/j.neucom.2016.10.018.
[11] C. Zhu, Y. Zheng, K. Luu, M. Savvides, CMS-RCNN: contextual multi-scale region-based CNN for unconstrained face detection, arXiv preprint arXiv:1606.05413 (2016).
[12] X. Sui, S. Zhang, B. Wei, H. Bi, J. Wu, X. Pan, Y. Yin, Y. Zheng, Choroid segmentation from optical coherence tomography with graph-edge weights learned from deep convolutional neural networks, Neurocomputing 237 (2017) 332–341, doi:10.1016/j.neucom.2017.01.023.
[13] H. Chen, X. Wu, G. Tao, Q. Peng, Automatic content understanding with cascaded spatial-temporal deep framework for capsule endoscopy videos, Neurocomputing 229 (2017) 77–87, doi:10.1016/j.neucom.2016.06.077.
[14] T. He, H. Mao, J. Guo, Z. Yi, Cell tracking using deep neural networks with multi-task learning, Image Vis. Comput. 60 (2016) 142–153, doi:10.1016/j.imavis.2016.11.010.
[15] J. Bromley, J.W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah, Signature verification using a siamese time delay neural network, Int. J. Pattern Recognit. Artif. Intell. 07 (04) (1993) 669–688, doi:10.1142/S0218001493000339.
[16] J. Žbontar, Y. LeCun, Computing the stereo matching cost with a convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1592–1599, doi:10.1109/CVPR.2015.7298767.
[17] J. Žbontar, Y. LeCun, Stereo matching by training a convolutional neural network to compare image patches, J. Mach. Learn. Res. 17 (2) (2016) 1–32.
[18] L. Wu, C. Shen, A. van den Hengel, PersonNet: person re-identification with deep convolutional neural networks, arXiv preprint arXiv:1601.07255 (2016).
[19] X. Zeng, W. Ouyang, X. Wang, Multi-stage contextual deep learning for pedestrian detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 121–128, doi:10.1109/ICCV.2013.22.
[20] Z. Wu, Y. Huang, L. Wang, X. Wang, T. Tan, A comprehensive study on cross-view gait based human identification with deep CNNs, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2) (2017) 209–226, doi:10.1109/TPAMI.2016.2545669.
[21] W. Li, R. Zhao, T. Xiao, X. Wang, DeepReID: deep filter pairing neural network for person re-identification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014, pp. 152–159, doi:10.1109/CVPR.2014.27.
[22] M.L. Perez, Video pornography detection through deep learning techniques and motion information, Neurocomputing 230 (2017) 279–293, doi:10.1016/j.neucom.2016.12.017.
[23] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4353–4361, doi:10.1109/CVPR.2015.7299064.
[24] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Person re-identification by a multi-channel parts-based CNN with improved triplet loss function, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1335–1344, doi:10.1109/CVPR.2016.149.
[25] D.P. Kingma, J.L. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations, 2015, pp. 1–15.
[26] K. Wang, D. Zhang, Y. Li, R. Zhang, L. Lin, Cost-effective active learning for deep image classification, IEEE Trans. Circuits Syst. Video Technol. (2016), doi:10.1109/TCSVT.2016.2589879.
[27] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893, doi:10.1109/CVPR.2005.177.
[28] O. Déniz, G. Bueno, J. Salido, F. De La Torre, Face recognition using histograms of oriented gradients, Pattern Recognit. Lett. 32 (12) (2011) 1598–1603, doi:10.1016/j.patrec.2011.01.004.
[29] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the International Conference on Learning Representations, 2015, pp. 1–14.
[30] I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).
[31] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. (2015) 91–99.
[32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: single shot MultiBox detector, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 21–37.
[33] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, OverFeat: integrated recognition, localization and detection using convolutional networks, arXiv preprint arXiv:1312.6229 (2013).
[34] C. Szegedy, W. Zaremba, I. Sutskever, Intriguing properties of neural networks, arXiv preprint (2013).
[35] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 818–833, doi:10.1007/978-3-319-10590-1_53.
[36] GoogleResearch, TensorFlow: large-scale machine learning on heterogeneous systems, arXiv preprint arXiv:1603.04467 (2016).
Haitao Song received the B.S. degree in engineering from Xiamen University, Xiamen, China, in 2009, and the Ph.D. degree in engineering from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2014. He is currently an Assistant Research Fellow with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include machine vision and embedded systems.
Yue Guo received his B.S. degree in automation from Lanzhou University of Technology, Gansu, China, in 2013, and he is currently working toward the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision, image processing, and machine learning.
Wenhao He is an associate professor at the Institute of Automation, Chinese Academy of Sciences. His research interests involve reconfigurable computing, image processing, and high-performance computing.
Yijia He received the B.E. degree in automatic control science and engineering from Hunan University, Changsha, China, in 2013, and is currently pursuing the Ph.D. degree in control science and engineering at the Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and robotics.
Kui Yuan received his Ph.D. degree from Kyushu University, Japan, in 1988. Since then, he has been working at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests are in intelligent control, intelligent robots, and machine vision.