Feature Extraction Using Deep Learning for Food Type Recognition

Muhammad Farooq and Edward Sazonov
Department of Electrical and Computer Engineering, University of Alabama, Tuscaloosa, AL 35487, USA
{mfarooq,sazonov}@eng.ua.edu

Abstract. With the widespread use of smartphones, people are taking more and more images of their foods. These images can be used for automatic recognition of the foods present and can potentially provide an indication of eating habits. Traditional methods compute a number of user-derived features from an image and then use a classification method to assign food images to different food categories. Pre-trained deep neural network architectures can instead be used to automatically extract features from images for different classification tasks. This work proposes the use of convolutional neural networks (CNN) for feature extraction from food images. A linear support vector machine classifier was trained using a 3-fold cross-validation scheme on the publicly available Pittsburgh fast-food image dataset. Features from 3 different fully connected layers of the CNN were used for classification. Two classification tasks were defined: the first was to classify images into 61 categories, and the second was to classify images into 7 categories. The best results were obtained using 4096 features, with accuracies of 70.13% and 94.01% for the 61-class and 7-class tasks, respectively. This shows an improvement over previously reported results on the same dataset.

Keywords: Deep learning · Transfer learning · Image recognition · Food recognition · Classification

1 Introduction

In the last few years, recognition of food items from images has become a popular research topic due to the availability of a large number of images on the internet and the interest of people in social networks. One of the challenging tasks in image-based food recognition is to determine which food items are present in a picture. This paper focuses on the task of food item recognition, assuming that it is already known that a given image contains food, and the algorithm is used to determine the food type. Food type recognition is a hard problem because the shape of different food items is not well defined and a single image of food can contain a variety of ingredients with varying textures. The color, shape, and texture of a given food type are defined by the ingredients and the way the food is prepared [1]. Even for a given food type, high intra-class variations in both shape and texture can be observed; for example, chicken burgers [1] can be prepared in a variety of ways, and the final texture, color, and shape may differ for the same food after preparation.


Researchers have proposed a number of algorithms for recognition of food items in images, employing different feature extraction and classification algorithms. The features computed from images and the choice of classifier play an important role in food type recognition systems. Yang et al. proposed a support vector machine (SVM)-based approach with pair-wise statistics of local features, such as distance and orientation, to differentiate between eight basic food materials [2] on the Pittsburgh fast-food image dataset [1]. On the same dataset, they further classified a given food into one of 61 food categories with a classification rate of 28.2% [1]. The authors in that work used four different feature groups, namely color histogram features, bag of scale-invariant feature transform (SIFT) descriptor features, semantic texton forest (STF) features, and joint features of orientation and midpoint (OM), for classification of food types. Another work on the same dataset proposed the use of local textural patterns and their global structure, obtained with the SIFT detector and Local Binary Patterns (LBP), to classify images. Joutou et al. proposed a visual recognition system to classify images of Japanese food into one of 51 categories [3]. They proposed a feature-fusion approach in which a SIFT-based bag of features, Gabor features, and color histogram features were combined using multiple kernel learning [4]. The authors in [5] used three image descriptors, Bag of Textons, SIFT, and PRICoLBP, to classify food images. Another work proposed a combination of bag-of-features with k-means clustering and final classification by a linear support vector machine to classify food items into 11 classes with an accuracy of 78% [6]. Random Forest has also been proposed for determining distinctive visual components in food images and using them for classification of food type [7]; the authors proposed a method to mine parts of images simultaneously for all image classes using random forest models. Other researchers have proposed systems that are able to recognize and segment different food items in images taken by people in real-world scenarios using smartphone cameras [8]. In that work, the authors proposed a segmentation procedure, computed local features from the segments, and classified each segment individually; the final decision was obtained by combining the decisions for the individual segments.

One of the most critical tasks for any machine learning problem is to extract useful and descriptive features. Feature engineering can be domain-specific and often requires domain knowledge, as can be seen from the different feature representations reported above. In recent years, deep learning algorithms have been successfully applied to a number of image recognition problems [9]. Deep learning architectures have seen increasing use in the literature because of the availability of large image datasets and of high-performance computing hardware and GPUs. An added advantage of deep learning algorithms is their ability to automatically extract useful representative features during the training phase [10]. A special class of deep learning algorithms called convolutional neural networks (CNN) has shown excellent performance on recognition tasks such as the Large Scale Visual Recognition Challenge and is considered the state of the art [11]. However, training a CNN requires large datasets and is computationally expensive.
Therefore, an alternative is to use a pre-trained CNN model for feature extraction, an approach called transfer learning [12], and then use another, simpler classifier such as an SVM to perform the final classification. The goal of this paper was to explore the use of a pre-trained CNN model for feature extraction for classification of food images into different food categories.


A secondary goal was to explore the classification ability of features extracted from different fully-connected layers of the CNN. In this work, multi-class SVM classifiers were trained on features extracted from a pre-trained CNN model to classify food images. This paper further compares the results of the proposed approach with previously reported results on the same image dataset.
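As a minimal, self-contained sketch of this two-stage idea (not the exact experimental code, which is described in Sect. 2), the pipeline can be expressed with a linear SVM from scikit-learn; the random arrays below merely stand in for real CNN features and food labels:

```python
# Minimal sketch of the two-stage pipeline: a frozen, pre-trained CNN
# supplies fixed-length features and a linear SVM does the classification.
# The random arrays are stand-ins for real FC-layer features and
# food-category labels, so the fragment runs on its own.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4096))    # stand-in for 4096-D CNN features
y = rng.integers(0, 7, size=120)    # stand-in for 7 food-category labels

clf = LinearSVC().fit(X[:90], y[:90])            # train the linear SVM
print("held-out accuracy:", clf.score(X[90:], y[90:]))
```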

2 Methods

2.1 Data

The algorithm designed in this work was tested on the Pittsburgh Fast-Food Image Dataset (PFID) [1]. The dataset comprises images of 61 different fast-food items captured in the laboratory. According to the authors, each food item was bought from a fast-food chain on 3 different days, and on each day 6 images were taken from different angles under different lighting conditions. The background was kept constant in each image, and the focus was on the food item. The dataset consists of a total of 1098 images in 61 categories; details are given in [1]. As suggested in [1], the data was divided into 3 folds for each food type, and 3-fold cross-validation was performed, where the 12 images from two days were used for training and the remaining 6 images were used for testing (a sketch of this split is given below). Figure 1 shows an example of two different food items (a burger and a salad): the first three rows present images of a chicken burger taken on 3 different days, and the last 3 rows show images of a salad taken on 3 different days. From this figure, variations in shape, texture, and color are visible for pictures of the same foods taken under different lighting conditions and from different angles. Further, the authors in [1] proposed dividing the foods into seven broader categories, since different food types might have similar ingredients and similar physical appearance, while the training and validation images were captured on separate days with different view angles. These categories were "(1) sandwiches including subs, wraps; (2) salads, typically consisting of greens topped with some meat; (3) meat preparations such as fried chicken; (4) bagels; (5) donuts; (6) bread/pastries; and (7) miscellaneous category that included variety of other food items such as soup and Mexican-inspired fast food" [1]. This approach resulted in two separate problems, one with 61 food categories and the second with 7 categories of food items. A separate classifier was trained for each problem, using the same feature computation and classification approach for both. Traditional classification methods employ user-computed features from images and then use linear or non-linear classifiers to classify food images; this work instead uses deep learning for feature computation, followed by a linear classifier. Figure 2 shows the flow of these two methods.
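The leave-one-day-out split can be sketched as follows; the `images` index of (path, label, day) tuples is hypothetical and would be built from the dataset's metadata:

```python
# Sketch of the leave-one-day-out protocol described above. The `images`
# index is hypothetical: a list of (path, label, day) tuples with
# day in {0, 1, 2}, built from the PFID metadata.
def day_folds(images):
    """Yield (train, test) pairs: two days for training, one day for testing."""
    for test_day in range(3):
        train = [(path, label) for path, label, day in images if day != test_day]
        test = [(path, label) for path, label, day in images if day == test_day]
        yield train, test
```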

2.2 Feature Extraction: Convolutional Neural Network

Convolutional neural networks (CNN) are the state of the art for many image recognition problems. CNNs are essentially multi-layer neural networks with multiple convolution and pooling layers. A convolution layer consists of small rectangular patches (filters), smaller than the original image, whose weights are learned during the training


Fig. 1. An example of image categories present in the PFID food database.

Fig. 2. Feature extraction and classification flow for user-defined features and CNN-extracted features. The first row presents the conventional machine learning approach, where user-selected features are used to train a classifier. The second row shows the deep learning approach, where features are extracted automatically by a CNN and a linear classifier is then used for classification.


phase. These filters, or kernels, are used to extract low-level details from input images; the filters of the first CNN layer extract basic information such as edges and blobs. The second type of layer used by a CNN is the pooling layer, which reduces the spatial size of the representation by applying an aggregation function, such as the maximum or the average, over a rectangular window. This reduces the number of parameters that need to be computed and hence the amount of computation at subsequent layers. In addition, a CNN architecture can have multiple fully-connected (FC) layers, which are similar to the layers of regular neural networks in that each unit has full connections to all activations in the previous layer. In this work, rather than training a CNN from scratch, a pre-trained convolutional neural network was used. Pre-trained networks can be used for feature extraction from a wide range of images. Here, a network pre-trained on the ImageNet dataset, called AlexNet, was used [11]. AlexNet consists of a total of 23 layers, and its input size is 227-by-227-by-3 (RGB images). Images in the PFID are of size 600-by-800-by-3 and were therefore re-sampled to 227-by-227-by-3 so that they could be used as input to the network. Figure 3 shows the filters used in the first convolution layer of AlexNet. AlexNet has 3 fully connected layers, denoted FC6, FC7, and FC8. Fully-connected layers learn higher-level image features and are better suited for image recognition tasks [13]. In AlexNet, FC6, FC7, and FC8 produce 4096, 4096, and 1000 features, respectively. In this work, features computed from these three layers were used separately for the classification task.
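As an illustration, a sketch of this feature extraction using torchvision's ImageNet-pretrained AlexNet is given below (torchvision ≥ 0.13 and its AlexNet variant are assumed; the original experiments may have used a different implementation, and whether to take the FC outputs before or after the ReLU is a design choice):

```python
import torch
from PIL import Image
from torchvision import models, transforms

# ImageNet-pretrained AlexNet; eval() disables dropout so features are deterministic.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

# Resample PFID images (600x800x3) to the network input size and normalize
# with the ImageNet statistics the pre-trained weights expect.
preprocess = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# torchvision's classifier stack is [Dropout, FC6, ReLU, Dropout, FC7, ReLU, FC8];
# slicing it at these indices yields the (pre-ReLU) FC6, FC7, or FC8 outputs.
STOP = {"fc6": 2, "fc7": 5, "fc8": 7}

def fc_features(image_path, layer="fc6"):
    """Return the chosen FC-layer activations for one image as a 1-D tensor."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = torch.flatten(model.avgpool(model.features(x)), 1)
        x = model.classifier[:STOP[layer]](x)   # run only up to the chosen layer
    return x.squeeze(0)
```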

Fig. 3. Example filters used by the first convolution layer in AlexNet [11]. Each of the 96 filters shown is of size 11 × 11 × 3. These filters are used to extract basic information such as edges, blobs, etc.
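Filters like those in Fig. 3 can be inspected directly from the first convolution layer's weight tensor. A sketch assuming torchvision's AlexNet (note that this variant has 64 first-layer filters rather than the 96 of the original network in [11]):

```python
import matplotlib.pyplot as plt
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
w = model.features[0].weight.data          # (64, 3, 11, 11) in torchvision's variant
w = (w - w.min()) / (w.max() - w.min())    # rescale weights to [0, 1] for display

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, filt in zip(axes.flat, w):
    ax.imshow(filt.permute(1, 2, 0))       # CHW -> HWC for imshow
    ax.axis("off")
plt.show()
```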

2.3 Classification: Support Vector Machine

For the classification task using features computed from deep neural networks, there are two possibilities. The first is to use an end-to-end deep network that performs both feature computation and classification; the second is to use the CNN for feature computation only and then use a linear classifier for the final classification. Using end-to-end


deep architectures requires tuning the parameters of the pre-trained network on the new image dataset and can suffer from overfitting [12]. This procedure also requires the availability of a large image dataset. The second procedure, a linear classifier applied to features extracted by the CNN architecture, can help avoid overfitting since the final classifier is not involved in feature computation [12]. In this work, linear SVM models trained on features computed by AlexNet were used to perform multiclass classification. Training and validation were performed using 3-fold cross-validation, where for each food type the images taken on two days were used for training and the images taken on the third day were used for validation; this process was repeated three times. Classification accuracies and per-food-type F-scores were computed from the confusion matrix. Features from all three fully-connected layers of AlexNet were used to train three separate linear SVM models, and these features were used for both the 61-class and the 7-class classification problems.
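The evaluation loop can be sketched as follows; synthetic stand-ins replace the real FC6 features, labels, and per-image day indices so the fragment is self-contained, and scikit-learn's `confusion_matrix` and per-class `f1_score` correspond to the metrics mentioned above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.svm import LinearSVC

# Synthetic stand-ins: 1098 images, 4096-D FC6 features, 61 classes, 3 days.
rng = np.random.default_rng(0)
X = rng.normal(size=(1098, 4096))
y = rng.integers(0, 61, size=1098)
day = rng.integers(0, 3, size=1098)

accuracies = []
for test_day in range(3):                          # leave-one-day-out folds
    train, test = day != test_day, day == test_day
    pred = LinearSVC().fit(X[train], y[train]).predict(X[test])
    accuracies.append((pred == y[test]).mean())
    cm = confusion_matrix(y[test], pred)           # per-fold confusion matrix
    per_class_f1 = f1_score(y[test], pred, average=None)
print("mean accuracy:", np.mean(accuracies))
```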

3 Results

Using features extracted from the three fully-connected layers of AlexNet to train linear SVM models resulted in different accuracies for classification of images into 61 categories: average classification accuracies were 70.13%, 66.39%, and 57.2% for features extracted from the FC6, FC7, and FC8 layers, respectively. For 7 classes, the accuracies obtained with features extracted from the FC6, FC7, and FC8 layers were 94.01%, 93.06%, and 89.73%, respectively. Tables 1, 2 and 3 show the confusion matrices for the seven-class classification based on features extracted from the FC6, FC7, and FC8 fully-connected layers of AlexNet. Confusion matrices for the 61 classes are harder to visualize and are therefore not presented.

Table 1. Confusion matrix; classification into seven food categories based on features extracted from FC6 layer of the AlexNet.


Table 2. Confusion matrix; classification into seven food categories based on features extracted from FC7 layer of the AlexNet.

Table 3. Confusion matrix; classification into seven food categories based on features extracted from FC8 layer of the AlexNet.

4 Discussion and Conclusions

This work presented an approach based on a convolutional neural network and linear SVM models to differentiate between categories of fast foods from the Pittsburgh dataset. Instead of computing user-defined features, AlexNet was used to automatically extract features from food images. The results suggest that the features extracted from the FC6 fully-connected layer, together with a linear SVM classifier, provided the best classification results on both the 61-class and the 7-class problems. The approach presented in this work improves on previously reported results on the same dataset under similar testing conditions. For example, for the 61-class problem, the previous best results were reported using a combination of Pairwise Rotation Invariant Co-occurrence Local Binary Pattern (PRI-CoLBPg) features with an SVM classifier, giving a classification accuracy of 43.1% [14], whereas the approach proposed in this work achieved a best accuracy of 70.13%, an improvement of about 27 percentage points. On average, the proposed approach consistently performs better than previous approaches, even when features from the other two layers are used


(accuracies of 66.39% and 57.2%). A possible reason is the ability of the CNN to extract local and global features that are more relevant to the classification task. PFID is a challenging dataset in which, for each food category, images were taken on 3 different days, and on each day images were taken from 6 different viewpoints. Because of the intra-class variations, the food types were also grouped into seven major categories, i.e. sandwiches, salads/sides, chicken, bread/pastries, donuts, bagels, and tacos. The previous best results for 7-category classification were obtained with a combination of PRI-CoLBPg features and an SVM classifier, with a classification accuracy of 87.3% [14], whereas in this work features extracted from the FC6 fully-connected layer with a linear SVM achieved a classification accuracy of 94.01%, an overall improvement of about 7 percentage points. The performance of the classifiers trained with features from the FC7 and FC8 layers is also better than previous results. In this work the image dataset was based on fast-food images taken in the laboratory. This work is also relevant because of the wide use of smartphones for taking images of foods: the approach presented here can be used to automatically recognize food images and categorize similar foods. One limitation of the approach is that the images contain only single food items; future work will focus on images containing multiple food items. Another relevant problem is the use of learning algorithms to differentiate between food and non-food images before recognizing food types; this will also be considered in future work.

In the last decade or so, several wearable sensor systems have been proposed for automatic detection of food intake by monitoring chewing and swallowing [15–18]. One example is the wearable system presented in [16], where a combination of a piezoelectric strain sensor and an accelerometer was used for detection and recognition of chewing related to eating and of the physical activities performed by the participants. One future direction is to use such systems to automatically detect eating episodes and then trigger a camera to capture images of the food being consumed. As a final step, the approach proposed here could be used to recognize the food type and retrieve relevant caloric information. There is also the possibility of estimating the volume of the food consumed using 3D models, such as the method proposed in [19].

Acknowledgement. Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases (grant number R01DK100796). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

1. Chen, M., Dhingra, K., Wu, W., Yang, L., Sukthankar, R., Yang, J.: PFID: Pittsburgh fast-food image dataset. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 289–292 (2009)
2. Yang, S., Chen, M., Pomerleau, D., Sukthankar, R.: Food recognition using statistics of pairwise local features. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2249–2256 (2010)


3. Joutou, T., Yanai, K.: A food image recognition system with multiple kernel learning. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 285–288 (2009)
4. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. J. Mach. Learn. Res. 7, 1531–1565 (2006)
5. Farinella, G.M., Allegra, D., Stanco, F.: A benchmark dataset to study the representation of food images. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 584–599. Springer, Cham (2015). doi:10.1007/978-3-319-16199-0_41
6. Anthimopoulos, M.M., Gianola, L., Scarnato, L., Diem, P., Mougiakakou, S.G.: A food recognition system for diabetic patients based on an optimized bag-of-features model. IEEE J. Biomed. Health Inform. 18, 1261–1271 (2014)
7. Food-101 – Mining Discriminative Components with Random Forests. https://www.vision.ee.ethz.ch/datasets_extra/food-101/
8. Zhu, F., Bosch Ruiz, M., Khanna, N., Boushey, C., Delp, E.: Multiple hypotheses image segmentation and classification with application to dietary assessment. IEEE J. Biomed. Health Inform. 19(1), 377–388 (2015)
9. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
10. Le, Q.V.: Building high-level features using large scale unsupervised learning. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8595–8598 (2013)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
12. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010)
13. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: a deep convolutional activation feature for generic visual recognition. arXiv:1310.1531 (2013)
14. Qi, X., Xiao, R., Li, C.G., Qiao, Y., Guo, J., Tang, X.: Pairwise rotation invariant co-occurrence local binary pattern. IEEE Trans. Pattern Anal. Mach. Intell. 36, 2199–2213 (2014)
15. Fontana, J.M., Farooq, M., Sazonov, E.: Automatic ingestion monitor: a novel wearable device for monitoring of ingestive behavior. IEEE Trans. Biomed. Eng. 61, 1772–1779 (2014)
16. Farooq, M., Sazonov, E.: A novel wearable device for food intake and physical activity recognition. Sensors 16, 1067 (2016)
17. Farooq, M., Fontana, J.M., Sazonov, E.: A novel approach for food intake detection using electroglottography. Physiol. Meas. 35, 739 (2014)
18. Farooq, M., Sazonov, E.: Segmentation and characterization of chewing bouts by monitoring temporalis muscle using smart glasses with piezoelectric sensor. IEEE J. Biomed. Health Inform. (2016)
19. Chae, J., Woo, I., Kim, S., Maciejewski, R., Zhu, F., Delp, E.J., Boushey, C.J., Ebert, D.S.: Volume estimation using food specific shape templates in mobile image-based dietary assessment. Proc. SPIE 7873, 78730K (2011)
