Context-aware Deep Convolutional Neural Networks

Weimer, D.; Benggolo, A.; Freitag, M.: Context-aware Deep Convolutional Neural Networks for Industrial Inspection. In: Leitner, J. ; Cherian, A.; Shirazi, S. (eds.): Australasian Joint Conference on Artificial Intelligence. Workshop on Deep Learning and its Applications in Vision and Robotics, Queensland University of Technology, Queensland, Australia, 2015, 4 pages

Context-aware Deep Convolutional Neural Networks for Industrial Inspection
Daniel Weimer, Ariandy Yoga Benggolo, and Michael Freitag
BIBA – Bremer Institut für Produktion und Logistik at the University of Bremen
Hochschulring 20, D-28359 Bremen, Germany
Email: {wei, beo, fre}@biba.uni-bremen.de

Abstract—This contribution presents the application of deep Convolutional Neural Networks (CNN) for surface defect detection in manufacturing processes. On a classification task with 12 defect categories, we analyze the impact of context-aware patch extraction on the detection result and compare different strategies for combining the original patch with its context. The results demonstrate the importance of context for reliable detection on strongly textured surfaces, using a publicly available industrial dataset.

I. INTRODUCTION

Optical Quality Control (OQC) and machine vision are vital processes in the manufacturing industry to satisfy customer requirements, which in return establish the sales continuity of the company. To deliver satisfactory products, one essential step in OQC is to ensure that the product is visually free of imperfections or defects, among other requirements. In many manufacturing industries, human inspection is still a critical element in the process. Although visual inspection is a trivial task for humans to solve, in fast-paced modern industry such a task can be highly repetitive and mind-numbing, which potentially leads to human error due to fatigue. However reliable human inspection may be, OQC performance uncertainty due to human error can be very expensive for the company. Moreover, with increasing production volume, the performance of human-based OQC does not scale well, as it is constrained by low inspection frequency, not to mention the cost. For these reasons, automated defect detection naturally emerges as a solution to this problem.

The aim of the defect detection process is to segment a possibly defective area from the background and classify it into predefined defect categories. In a controlled environment, characterized by stable lighting conditions, simple thresholding techniques are often sufficient to segment defects from the background. However, such methods are no longer applicable when dealing with surfaces with strong and/or complex textures. In these much more challenging applications, more elaborate methods are required to achieve stable and reliable defect detection results. As described by Xie [2] and Neogi et al. [3], existing methods can roughly be divided into four main categories: statistical, structural, filter based, and model based. In industrial defect detection systems, intensive studies have been performed in order to hand-craft the optimal feature representation of the data for a given problem.
Pernkopf and O'Leary [4], for example, performed a comparative study on feature selection from an exhaustive list of different feature encodings. Although such methods often yield satisfactory results, they do not always transfer to a new or different problem set, because each problem has its own characteristics that only respond to a certain kind of feature extractor. Thus, it is common practice in industrial environments that new features have to be manually engineered when a new problem set arises. Additionally, in contrast to other object detection problems, texture defects can occur in arbitrary size, shape and orientation. Therefore, standard feature descriptors for defect description often lead to insufficient classification results.

Given these considerations, this contribution investigates an alternative approach, namely Convolutional Neural Networks (CNN), in order to overcome the problem of redefining a specific feature representation for every new problem set. In the past few years, CNN has proven itself to be a robust algorithm that allows features to be learned from data, as opposed to being manually designed, which is naturally a lengthy procedure as it requires deep knowledge of the problem domain. Particularly in the visual defect detection domain, Masci [1] implements Max-Pooling CNN to classify defects in steel surface inspection. Motivated by Tompson et al. [5], we focus on deeper CNN architectures and investigate the impact of context-aware patch representation during network training and deployment.

II. ARCHITECTURE

CNN is a type of feed-forward neural network inspired by the working principle of the visual cortex of the brain. The architecture is composed of multiple feature extraction stages, each of which consists mainly of three basic operations: convolution, non-linear activation function, and feature pooling. Fig. 1 illustrates the basic architecture of a deep CNN. With an image as the input, feature map C1 is the result of a filter convolved over the whole image, followed by a non-linear activation function, e.g. sigmoid or Rectified Linear Unit (ReLU). Feature pooling, typically a max or average operation, is then performed to obtain a compressed feature representation S1. These layers can be stacked together, which results in a hierarchical, progressive development from raw pixel data towards powerful feature representations. For a classification task, the extracted features can be processed further by any available classifier, e.g. an SVM or a standard fully connected (FC)


Fig. 1. Typical Convolutional Neural Networks (CNN) model architecture. Shown is an example of a CNN for a classification task with 2 feature extraction stages, each consisting of convolution, non-linear activation function, and subsampling operations, producing feature maps (C1, S1, C2, S2). The extracted features are finally classified with a fully connected MLP.

TABLE I
DESIGN OF THE DEEP CNN ARCHITECTURE. The convolution layer parameter is denoted as conv<filter size>-<num. of kernels>. A ReLU activation function always follows a convolution layer, but is not shown for brevity. Additionally, pad denotes zero padding of the border region.

Stage-1:  conv3-32,      conv3-32 pad,  max-pooling
Stage-2:  conv3-64 pad,  conv3-64,      conv3-64,     max-pooling
Stage-3:  conv3-512,     conv3-512
MLP:      FC-1024,       FC-1024,       FC-12,        softmax

Multi-Layer Perceptron (MLP), normalized with softmax to approximate the posterior class probabilities.

Tab. I describes the CNN architecture for visual defect detection evaluated in this work. The network is divided into three stages of feature extraction and a final classifier using a fully connected MLP. The network design heuristics in this work are strongly inspired by Simonyan and Zisserman [6] (OxfordNet) due to its simplicity and effectiveness, i.e. the use of a small convolution filter size of 3 × 3 px with a stride of 1 px. Spatial pooling is performed with a max operation on non-overlapping 2 × 2 px neighborhoods (stride of 2 px), following some of the convolutional layers. In total, the CNN architecture consists of seven convolutional layers and a final MLP with two hidden layers.

III. CASE STUDY

A. Data Preparation and Training Setup

In this work, we evaluate the deep CNN architecture on the publicly available dataset from the DAGM 2007 Competition on "Weakly Supervised Learning for Industrial Optical Inspection" [8], referred to hereafter as the DAGM dataset. Although the dataset is artificially generated, its characteristics are similar to the real-world problem of texture defects formulated earlier. Fig. 2 shows the 6 classes of texture models of the dataset, each with a different defect model. Each class contains 1000 images of 512 × 512 px showing the background texture, and 150 images containing one labeled defect.
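As a sanity check of Tab. I, the spatial dimensions of a 32 × 32 px input patch can be traced through the layers with the standard output-size formula for convolution and pooling. A minimal sketch follows; the assignment of the pad markers to individual layers follows the table, while the conclusion that the patch collapses to a 1 × 1 × 512 feature vector before the MLP is our own reading, not stated explicitly in the text.

```python
def conv_out(size, kernel=3, pad=0, stride=1):
    # spatial output size of a convolution: (n + 2p - k) / s + 1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, stride=2):
    # non-overlapping 2 x 2 max pooling halves the size (floor)
    return size // stride

# Tab. I architecture traced on a 32 x 32 input patch
s = 32
s = conv_out(s)         # Stage-1: conv3-32      -> 30
s = conv_out(s, pad=1)  #          conv3-32 pad  -> 30
s = pool_out(s)         #          max-pooling   -> 15
s = conv_out(s, pad=1)  # Stage-2: conv3-64 pad  -> 15
s = conv_out(s)         #          conv3-64      -> 13
s = conv_out(s)         #          conv3-64      -> 11
s = pool_out(s)         #          max-pooling   -> 5
s = conv_out(s)         # Stage-3: conv3-512     -> 3
s = conv_out(s)         #          conv3-512     -> 1
print(s)  # 1: a 1 x 1 x 512 feature vector enters the FC-1024 MLP
```

The unpadded 3 × 3 convolutions shrink each side by 2 px per layer, which is exactly what reduces the 32 px patch to a single spatial position by the end of Stage-3.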

Fig. 2. Examples from the dataset used in this contribution. An ellipse coarsely labels the defect region.

As shown in Fig. 2, the actual defect covers only a very small area compared to the total area of the image. Furthermore, the number of images for the two classes (defect-free and defect) is clearly unbalanced, with very few defective image examples. For this reason, it is necessary to introduce a data augmentation step to obtain evenly distributed training data. To reach a fairly equal ratio, the augmentation process is performed only on defect images, using linear transformations, i.e. rotations and mirroring. The complete set of images is then shuffled randomly and split into training (70%), testing (15%), and evaluation (15%) sets for the CNN. From an image, the network operates on 32 × 32 px patches in order to build a heat map of defect detection. The network is therefore trained on image patches, with an additional context view, which is discussed further in the next section. Network training is performed with the backpropagation algorithm, minimizing a negative log-likelihood cost function using mini-batch Stochastic Gradient Descent (SGD). To avoid overfitting, and also to improve classification performance, a number of regularization techniques were implemented, such as L2 regularization, Dropout [7], and early stopping [9].
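The rotation-and-mirroring augmentation described above can be sketched as follows. Generating all four 90° rotations plus their mirror images (the eight symmetries of a square) is our assumption about which linear transformations are applied; the paper only names rotations and mirroring.

```python
import numpy as np

def augment(patch):
    """Return the 8 variants of a patch under 90-degree rotations
    and horizontal mirroring, used to rebalance the defect class."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)   # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored counterpart
    return variants

patch = np.arange(9).reshape(3, 3)
print(len(augment(patch)))  # 8 training examples from one defect patch
```

With 150 defect images against 1000 background images per class, an eight-fold multiplication of defect examples brings the two classes much closer to the fairly equal ratio mentioned above.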


Fig. 3. Different strategies for random patch extraction with its context view.

TABLE II
ACCURACY OF DETECTION WITH DIFFERENT CONTEXT STRATEGIES

Context type   Dimension   Accuracy
Strategy a     32 × 32     88.49%
Strategy b     32 × 32     92.31%
Strategy c     32 × 64     96.44%

B. Context-aware Patch Arrangement

The main contribution of this work is the investigation of the impact of context-aware patch extraction as the input for a CNN to detect visual defects. Fig. 3 shows the arrangement of three different strategies to extract patch information. The patch extraction takes place at random positions within the 512 × 512 px image. To avoid inter-class training effects during backpropagation, the CNN learns from randomly shuffled patch and context examples.

• Strategy a: In the first version, a standard patch of size 32 × 32 is extracted and used for training.
• Strategy b: Second, an additional patch of size 64 × 64 is extracted, having spatially the same center point as the patch. For integration into the training set, each 64 × 64 patch is downscaled to 32 × 32 and added as an additional example to the training set.
• Strategy c: The third version combines the original and the downscaled patch into a 32 × 64 patch representation. The motivation for the third version is the additional prior knowledge gained by combining context information and the original patch and representing them as one training instance.
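The three strategies can be sketched as a single extraction routine. The 2 × 2 block-average downscaling of the 64 × 64 context and the side-by-side fusion into the 32 × 64 representation are our assumptions; the paper specifies the sizes but not the resampling method or fusion layout.

```python
import numpy as np

def extract(image, cx, cy, strategy):
    """Build one training example around center (cx, cy),
    following the three strategies of Fig. 3."""
    patch = image[cy - 16:cy + 16, cx - 16:cx + 16]           # 32 x 32 patch
    context = image[cy - 32:cy + 32, cx - 32:cx + 32]         # 64 x 64 context
    scaled = context.reshape(32, 2, 32, 2).mean(axis=(1, 3))  # downscale to 32 x 32
    if strategy == "a":
        return patch                       # patch only
    if strategy == "b":
        return scaled                      # downscaled context as a separate example
    if strategy == "c":
        return np.hstack([patch, scaled])  # fused 32 x 64 representation
    raise ValueError(strategy)

image = np.random.rand(512, 512)
print(extract(image, 256, 256, "c").shape)  # (32, 64)
```

Note that under strategy c the network always sees the patch and its context as one instance, which is the prior knowledge referred to above.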

These three strategies are applied for training as well as for the evaluation of unknown data.

C. Detection Results

The experiments were performed based on the model from Tab. I. A property of SGD-based learning is that, up to a certain level, performance scales with the amount of training data. We can therefore cut the training time to a minimum by using only half of the total available training patches, i.e. 300,000, and still be able to assess the performance of the different input configurations. Tab. II summarizes the results of this experiment.

Fig. 4. Detection example and resulting heat map, with additional non-maxima suppression: a) input image (512 × 512), b) detection heat map (31 × 31), c) heat map scaled to 512 × 512, d) non-maxima suppression.

Overall, the results confirm that adding a context view helps the model learn a better representation during training. To look deeper into the results, we report the class-level accuracy in Fig. 5. It shows that although the context view has relatively little influence on the non-defect classes, the accuracy improvement on the defect classes is quite significant. The network also generally performs better when the input patch is paired with its corresponding context view (strategy c). Class 4 shows the most significant improvement when the context view is introduced. This class has the particular characteristic that the defect texture is visually similar to the background texture (lines), and the actual defect pixel area is also very small compared to the background. Adding the context view yields up to 35% better accuracy compared to patch-only input. However, in classes where the defect occupies a large area of the patch/context, e.g. Class 6, the improvement is not as significant. This shows that when a network is optimized to find certain objects, the addition of a context view provides an extended view of the object of interest in relation to its neighboring pixels, thus creating a more discriminative model. When there are no objects to look for, e.g. in the non-defect classes, results vary.

Fig. 4 shows an example from the dataset and the resulting detection heat map of the network. Fig. 4b shows the resulting heat map and Fig. 4c the same heat map upscaled to the original image size. To reduce false alarms, we additionally implemented a morphological operation. The final non-maxima suppressed detection result is shown in Fig. 4d.
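The 31 × 31 heat map of Fig. 4b is consistent with sliding a 32 × 32 px window over the 512 × 512 px image with a stride of 16 px, since (512 − 32) / 16 + 1 = 31. The stride is our inference from these sizes, not stated in the paper; the sketch below uses a placeholder scoring function in place of the trained CNN.

```python
import numpy as np

def heatmap(image, score, patch=32, stride=16):
    """Slide a patch-sized window over the image and record a defect
    score per position, producing the detection heat map."""
    n = (image.shape[0] - patch) // stride + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            window = image[i * stride:i * stride + patch,
                           j * stride:j * stride + patch]
            out[i, j] = score(window)   # here: the CNN's defect probability
    return out

img = np.zeros((512, 512))
print(heatmap(img, lambda w: w.mean()).shape)  # (31, 31)
```

The resulting map is then upscaled to the original image size and cleaned with a morphological operation and non-maxima suppression, as in Fig. 4c and 4d.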


Fig. 5. Defect detection results for the three different context strategies using the proposed CNN architecture. Especially strategy c, representing a fusion of patch and context, contributes to the detection of defects compared to a single-patch strategy (strategy a).

IV. CONCLUSION

In this work, an approach to visual defect detection using deep CNN is presented. The performance of the proposed approach is measured on the DAGM dataset, an artificially generated image set with visual defects occurring on heavily textured backgrounds. As opposed to hand-crafting features, with CNN we engineer the architecture by investigating the different hyperparameters involved in the process. In this way, systems for OQC or other applications can be developed with minimal prior knowledge of the problem domain. The CNN architecture allows knowledge (in the form of features) to be extracted from the different characteristics of multiple data classes with minimal human supervision, making it highly adaptable to different kinds of problems. This enables rapid development of automated visual defect detection systems when a new problem set arises. At the same time, it eliminates the need to generate hand-crafted features and the lengthy trial-and-error procedure required to determine the best-performing feature for a specific task.

In future work, we will investigate different hyperparameter settings and analyze their impact on the detection result. Furthermore, the findings will be applied in a micro-manufacturing scenario focusing on safety-relevant micro parts in a mass production process.

ACKNOWLEDGMENT

The authors gratefully acknowledge the financial support by the DFG (German Research Foundation) for Subproject B5 Sichere Prozesse (Reliable Processes) within the SFB 747

(Collaborative Research Center) Micro Cold Forming – Processes, Optimization, Characterization.

REFERENCES

[1] J. Masci, "Advances in Deep Learning for Vision, with Applications to Industrial Inspection", PhD Thesis, Università della Svizzera italiana, 2014.
[2] X. Xie, "A Review of Recent Advances in Surface Defect Detection Using Texture Analysis Techniques", in: Electronic Letters on Computer Vision and Image Analysis, 7(3), 2008.
[3] N. Neogi, D. K. Mohanta and P. K. Dutta, "Review of vision-based steel surface inspection systems", in: EURASIP Journal on Image and Video Processing, 50, 2014.
[4] F. Pernkopf and P. O'Leary, "Visual Inspection of Machined Metallic High-Precision Surfaces", in: EURASIP Journal on Applied Signal Processing, 2002(1), pp. 667–668, 2002.
[5] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation", in: Advances in Neural Information Processing Systems 27, ed. by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Curran Associates, Inc., 2014.
[6] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", in: CoRR abs/1409.1556, 2014.
[7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", in: Journal of Machine Learning Research 15, pp. 1929–1958, 2014.
[8] 29th Annual Symposium of the German Association for Pattern Recognition, Weakly Supervised Learning for Industrial Optical Inspection, 2007.
[9] L. Prechelt, "Early stopping – but when?", in: Neural Networks: Tricks of the Trade, Springer, pp. 55–69, 1998.
