Robotics and Autonomous Systems 68 (2015) 129–139
3D perception from binocular vision for a low cost humanoid robot NAO

Samia Nefti-Meziani a, Umar Manzoor b,∗, Steve Davis a, Suresh Kumar Pupala a

a School of Computing, Science and Engineering, The University of Salford, Salford, Greater Manchester, United Kingdom
b Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
Highlights

• Implementation of a stereo vision system integrated in a humanoid robot is proposed.
• A low cost robotics vision system for 3D perception avoids expensive hardware costs.
• Cameras are highly utilized as they are easy to handle, cheap and very compatible.
Article info

Article history:
Received 27 June 2014
Received in revised form 26 November 2014
Accepted 5 December 2014
Available online 3 February 2015

Keywords:
Stereo vision system
Low cost 3D perception
Camera calibration
3D reconstruction
NAO robot
Abstract

Depth estimation is a classical problem in computer vision and, after decades of research, many methods have been developed for 3D perception, such as magnetic tracking, mechanical tracking, acoustic tracking, inertial tracking, and optical tracking using markers and beacons. A vision system allows 3D perception of the scene, and the process involves: (1) camera calibration, (2) image correction, (3) feature extraction and stereo correspondence, (4) disparity estimation and reconstruction, and finally, (5) surface triangulation and texture mapping. The work presented in this paper is the implementation of a stereo vision system integrated in a humanoid robot. Low cost is one of the aims of the vision system, to avoid expensive investment in hardware when it is used in robotics for 3D perception. In our proposed solution cameras are highly utilized because, in our opinion, they are easy to handle, cheap and very compatible compared to the hardware used in other techniques. The software for the automated recognition of features and the detection of the correspondence points has been programmed using the image processing library OpenCV (Open Source Computer Vision), and OpenGL (Open Graphics Library) is used to display the 3D models obtained from the reconstruction. Experimental results of the reconstruction and models of different scenes are shown. The results are evaluated by comparing the measured size of the reconstructed objects with the size calculated by the program.
1. Introduction

The concept of 3D reconstruction is an ongoing and paramount research topic in the area of computer vision [1,2]. Its high demand is understandable, as it is very useful in many applications ranging from robotics to virtual reality and media. Perception is essential for understanding and interacting with the environment; if robots are equipped with 3D perception, they can enter the service sector on a very large scale, and many tasks that people usually do not like to do can then be done with the help of robots [3,4].
∗ Corresponding author.
E-mail addresses: [email protected] (S. Nefti-Meziani), [email protected] (U. Manzoor).
In robotic-vision applications, scene reconstruction information can be used to find a target object or to avoid obstacles (i.e. to detect and grasp objects [5–7]). Besides, the 3D distance of a target object from the robot can be used in velocity computation for robot navigation, and it is also useful for tracking an object that is occluded and moving along the optical axis. Therefore, a 3D Euclidean distance can be more useful than a relative distance in robot vision and navigation systems [8]. The demand for social interaction of robots is increasing day by day, but the systems with which robots perceive the physical world are at this point not perfect: either they are very costly (a laser scanner costs around $50,000 [9]) or they are of poor quality and limited, like sonar and infrared systems [1,10–12]. It has long been recognized that a vision system similar to the human one would solve almost all of these problems, such as distance estimation and pattern recognition. Thus, 3D reconstruction
from stereo vision gains high importance. There are research theories related to this topic; however, less effort has been put into their practical implementation [8,13–18]. Traditionally, 3D reconstruction has been done with expensive hardware (such as laser range finders), but a cheaper alternative makes it more accessible to the average user; more people then find useful applications for it, encouraging further research in new directions. The demand for computer vision similar to the human vision system drives researchers' attention to 3D model reconstruction from stereo vision [16,19].

Microsoft has developed an efficient movement, voice, and gesture recognition device named Kinect [20] for application development. Kinect contains different types of sensors (i.e. an RGB camera, microphones and depth sensors) which provide 3D motion capture, facial recognition and voice recognition capabilities. The price of this device is approximately $250, so it can be considered a low cost 3D perception capture device. Kinect has been successfully used in various domains such as gaming, human computer interaction, healthcare and education. However, Kinect is vendor specific and works only in the Windows environment. Shahram Izadi et al. [21] proposed the 3D reconstruction of indoor scenes using the depth data provided by the Kinect camera. The authors tested the proposed system on a large number of test cases; the results show the effectiveness and efficiency of the proposed system. Jimmy Tran et al. [22] proposed low-cost 3D scene reconstruction for response robots using the Kinect device and used the spatial data gathered from this sensor to create the 3D model; furthermore, the authors also discussed several methods for 3D model creation. Marco A. Gutierrez et al. [9] proposed a cost-efficient 3D sensing system for mobile robots; the approach uses a regular commercial 2D laser range finder (starting price approximately $220), a step motor (approximately $20) and a camera (approximately $25), all controlled by an embedded circuit. The approach was tested in two scenarios, and several limitations were identified by the authors. Guoqiang Fu et al. [23] proposed a low-cost active 3D triangulation laser scanner for indoor navigation of mobile robots. The proposed system consists of two modules: (1) hardware, where a camera, a laser diode and a servo motor are integrated in the mobile robot to perceive the environment, and (2) software, which includes image processing (i.e. conversion of 2D image features to 3D world coordinates) and data post-processing (i.e. navigation algorithms for obstacle avoidance, etc.). The system was evaluated on different test cases; the experimental results were satisfactory and support the implementation of the system. To the best of our knowledge, our proposed system is the cheapest among the existing systems proposed so far for low cost 3D reconstruction.

Vision means the detection of light from the world: light emitted from some source (e.g., a light bulb or the Sun) is reflected from an object and enters our eyes, making that object visible [24,25]. It works similarly with cameras. The geometry of the ray's travel from the object, through the camera lens, and onto the image is of particular importance to practical computer vision (especially for estimating the 3D properties of that object). A simple but useful model of how this happens is the pinhole camera model.
A pinhole is an imaginary wall with a tiny hole in the center that blocks all rays except those passing through the aperture, which form the image [14,24,25]. Camera characteristics depend on and vary between manufacturers according to the investment in product development, so the market offers many kinds of cameras with different dimensions and lenses, which introduce errors with respect to the theoretical model [25,26]. The task of estimating distortion is to find a distortion model that allows easy un-distortion as well as satisfactory accuracy [16]. Therefore, camera calibration is an important and essential step in 3D reconstruction [27]; furthermore, the geometrical model of a pinhole camera is not valid without modeling these distortions.
Fig. 1. NAO head dimensions (left) and Microsoft LifeCam VX6000—Webcam model used (right).
The work explained in this paper aims at the practical implementation of stereo vision for 3D perception on a robot and at analyzing its feasibility in real-time interaction and navigation applications. The developed 3D reconstruction application must be implemented on a robot to analyze its feasibility in the real world, so NAO was chosen for this work: NAO is a low cost humanoid with highly advanced features, very suitable for research and teaching [28]. The methodology followed for the 3D reconstruction begins with the un-distortion and rectification of the images; then the stereo feature correspondence extraction and disparity estimation are done, and finally the epipolar geometry is applied to retrieve the 3D information. In 3D reconstruction, disparity is one of the strongest binocular depth cues; considering ideal pinhole cameras, the accuracy of the reconstructed 3D model is directly proportional to the accuracy of the disparity estimation [29–32]. Finally, the acquired 3D information is used to develop a 3D polygonal model onto which the texture is mapped.

2. Hardware description and model

To implement 3D reconstruction from stereo images, the essential equipment is a binocular camera set and different measurement tools, such as a chessboard for calibration. The NAO is used as the implementation platform for 3D perception from the binocular camera set. The tools used in this work are shown in Table 1. The characteristics of the principal devices used in the low cost 3D perception system are explained below.

2.1. Binocular camera set and fixation in NAO head

The binocular camera set is an arrangement of two cameras whose aperture planes are separated by a distance similar to the average human inter-eye distance (~6.5 cm). The NAO already has two inbuilt cameras, but these cameras are not useful for stereo vision as their fields of view do not overlap. So, considering the NAO head structure (Fig. 1), a frontal plate and its fixation elements were designed to hold the cameras. The NAO head dimensions were measured first in order to design the frame; the camera and lens system used in this experiment is the Microsoft LifeCam VX6000 (webcam) [quantity required: 2, cost £26 each]. The NAO head measurements and the Microsoft LifeCam VX6000 are shown in Fig. 1. The audio unit of the camera was removed as it was not required in the current project; only the image extraction unit (a small chip with lens) is used, to keep the model dimensions to a minimum. The front plate (Fig. 2) is an important part of the headset, as it holds the cameras; its width was made 130 mm, similar to the NAO head width. Two holes (15 mm diameter), separated by a distance of 60 mm, were made to hold the cameras. The thickness of the plate was made 10 mm to ensure that it is strong enough to mount the cameras, and finally four holes were made in the corners to fix it to the other part of the frame.
Table 1
Tools used in low cost 3D perception.

NAO: Autonomous humanoid robot. Height 0.57 m. Weight 4.5 kg. Body Mass Index (BMI) 13.5 kg/m2 [28]. Version: H25 Academic Edition.
Webcams: High Definition video (1.3 megapixels, 30 frames/s) and photos (5.0 megapixels interpolated).
Rulers: 2 m, 1 m, 50 cm, 30 cm and 15 cm rulers used for measuring distances.
Chessboard: Plane board of chess squares of dimensions 9 × 7, resulting in 8 × 6 inner corners.
Rod: Aluminum rod with a blue colored sliding strip designed for depth perception calculations and experiments.
Box: Cotton box used throughout the experimentation to calculate the accuracy of the 3D reconstruction.
Markers: Small items like balls, dice and other items used as markers.
Fig. 2. Perspective view of front plate model, the arrangement of cameras separated by 6 cm mounted on NAO head.
2.2. Workspace

The experiments are carried out in a square region of dimensions 240 cm × 240 cm, surrounded by a wall 35 cm high painted white to ease the image processing, as shown in Fig. 3.

3. Stereo vision model

The mathematical model used to extract the 3D coordinates of the correspondence points from a pair of images captured with a stereo vision system usually requires (1) the geometrical model of each camera and (2) the relative position between them (i.e. the stereo calibration).

3.1. Camera model

Camera models and calibration techniques are well known and widely described in the literature; for a detailed description of the Perspective Transformation Matrix (PTM) construction and camera calibration see [14,33–36]. We have used an ideal pin-hole model for the cameras, which relates the 3D coordinates of the scene points [X, Y, Z]^T with the 2D coordinates of the image points [x, y]^T through a homogeneous matrix called the Perspective Transformation Matrix (PTM), as shown in Eq. (1) [27]:

\[
\begin{bmatrix} s \cdot u \\ s \cdot v \\ s \end{bmatrix} = \mathrm{PTM} \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{1}
\]

where

\[
\mathrm{PTM} = \begin{bmatrix} m_{11} & \cdots & m_{14} \\ \vdots & \ddots & \vdots \\ m_{31} & \cdots & m_{34} \end{bmatrix}
\]
and s is a scale factor. The coordinate systems of the model that must be taken into account are shown in Fig. 4. These are: the global coordinate system of the scene (X, Y, Z) in millimeters, the local coordinate system of the camera (Xc, Yc, Zc) in millimeters, and the coordinate system of the sensor (u, v) in pixels. The components of the PTM (m_ij) are a linear combination of the intrinsic and extrinsic parameters of the camera. The intrinsic parameters depend on the camera and lens characteristics (pixel size, focal length f, and the (u, v) coordinates of the optic center of the image, Cu and Cv), whereas the extrinsic parameters are the rotation and translation that relate the global reference system of the scene to the local reference system of the camera. The ideal pin-hole model does not need any lens to project the image, but a real system works with a lens that often introduces distortion in the projection of the 3D points onto the image plane. The lens distortion can be classified as radial and tangential [16,25]; both kinds of distortion are modeled in Eq. (2):
\[
\begin{bmatrix} x_{corr} \\ y_{corr} \end{bmatrix} = \left( 1 + k_1 r^2 + k_2 r^4 + k_3 r^6 \right) \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} 2 p_1 x y + p_2 (r^2 + 2 x^2) \\ p_1 (r^2 + 2 y^2) + 2 p_2 x y \end{bmatrix} \tag{2}
\]
Fig. 3. Workspace where the experiments have been carried out.
Fig. 4. A pinhole camera model with front projection. The coordinate systems of the model, scene (X, Y, Z), camera (Xc, Yc, Zc) and sensor or image (u, v) are shown. The optical center of the camera (O) and the focal length (f) are also shown.

Here, (x, y) is the original location of the distorted point on the image, r is its radial distance from the image center (with r^2 = x^2 + y^2), and (x_corr, y_corr) is the new location resulting from the correction; k1, k2, k3 are the radial distortion coefficients and p1, p2 are the coefficients that model the tangential distortion. The distortion is usually dominated by the radial components, and especially by the first term [16]; it has also been found that too high an order may cause numerical instability. With the calibration of the camera, the distortion parameters and the components of the PTM are obtained, and from the extrinsic parameters of the PTM it is possible to relate the positions of the cameras in the stereo calibration.
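As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch projects a scene point with a 3 × 4 PTM and evaluates the radial and tangential distortion model; the matrix, the distortion coefficients and the test values are made-up placeholders, not the calibration results of this paper.

```python
# Minimal NumPy sketch of Eqs. (1)-(2): projection with a 3 x 4 PTM and evaluation of
# the radial/tangential distortion model. All numeric values are illustrative only.
import numpy as np

def project_point(ptm, point_3d):
    """Project a 3D scene point [X, Y, Z] to pixel coordinates (u, v) with Eq. (1)."""
    homogeneous = np.append(point_3d, 1.0)              # [X, Y, Z, 1]
    s_u, s_v, s = ptm @ homogeneous                      # [s*u, s*v, s]
    return s_u / s, s_v / s

def apply_distortion_model(x, y, k=(0.0, 0.0, 0.0), p=(0.0, 0.0)):
    """Evaluate the radial (k1..k3) and tangential (p1, p2) model of Eq. (2)
    at normalized image coordinates (x, y), with r^2 = x^2 + y^2."""
    k1, k2, k3 = k
    p1, p2 = p
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_corr = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_corr = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_corr, y_corr

# Example PTM = K [R | t] for a camera with f = 500 px and optic centre Cu = 320, Cv = 240.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])
PTM = K @ Rt
print(project_point(PTM, np.array([100.0, 50.0, 1000.0])))
print(apply_distortion_model(0.2, 0.1, k=(-0.25, 0.08, 0.0), p=(0.001, -0.0005)))
```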
3.2. Reconstruction model

The reconstruction model used in this paper is based on the assumption that the images are perfectly undistorted, that the cameras are aligned and their image planes are exactly coplanar, and that the optical axes are parallel. Other assumptions are: (1) the focal length is the same for both cameras (we have used identical cameras, so the focal lengths are very close), (2) the principal point of each camera has the same coordinates (Cx^left = Cx^right), and (3) the images are row-aligned (frontal parallel camera arrangement [25]).

From the calibration, the distortion parameters are known and the coordinates of the principal points of the cameras can be set to the same value. However, the cameras are not mounted with parallel optical axes and the sensor planes are not coplanar, so a process of image rectification is needed to achieve these conditions. Rectification is a mathematical process that enforces the assumptions required for a frontal parallel camera arrangement. In this process, the relative position between the cameras (rotation and translation) is taken into account to transform the coordinates of the image points into those of a frame that fits the theoretical assumptions. Once the images are rectified and fit the conditions described above, the depth is calculated as in Eq. (3):

\[
Z = \frac{f \cdot T}{x_l - x_r} \tag{3}
\]

where Z is the depth in the camera frame, f is the focal length, T is the distance between the principal axes of the cameras after the rectification process, and x_l, x_r are the x-coordinates of the same 3D point projected in the left and right image frames. After calculating the Z coordinate, the values of X and Y can be calculated as shown in Eqs. (4) and (5):

\[
X = Z \cdot \frac{x - C_x}{f} \tag{4}
\]

\[
Y = Z \cdot \frac{y - C_y}{f} \tag{5}
\]
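A small sketch of Eqs. (3)–(5) may help to clarify how a pair of rectified pixel coordinates is turned into a 3D point; the focal length, principal point and pixel values below are illustrative assumptions (the 60 mm baseline matches the camera spacing on the front plate), not calibrated values.

```python
# Minimal sketch of Eqs. (3)-(5): recovering (X, Y, Z) in the camera frame from the
# disparity of a feature seen in both rectified images. Numeric values are illustrative.
def reconstruct_point(x_left, x_right, y, f, T, cx, cy):
    """Return (X, Y, Z) from rectified pixel coordinates of the same feature."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    Z = f * T / disparity          # Eq. (3)
    X = Z * (x_left - cx) / f      # Eq. (4)
    Y = Z * (y - cy) / f           # Eq. (5)
    return X, Y, Z

# Example: f = 800 px, baseline T = 60 mm, principal point (320, 240).
print(reconstruct_point(x_left=420.0, x_right=380.0, y=260.0,
                        f=800.0, T=60.0, cx=320.0, cy=240.0))
```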
4. Methodology
The steps used for the 3D reconstruction are shown in Fig. 5. The first step is camera calibration, which includes extracting the intrinsic and extrinsic parameters of the stereo camera set. This step is followed by the image processing procedures, which begin with acquiring the stereo images from the cameras; these then undergo the un-distortion and rectification processes. After that, good features are extracted from both images in order to find stereo corresponding features. The disparity is calculated and then the depth estimation is performed. Delaunay's algorithm is used to divide the surface into triangles; the 3D model is then updated and the triangle areas are wrapped as the model texture. The details of each step involved in the 3D reconstruction are discussed below.
Fig. 5. 3D reconstruction algorithm flow chart.

Fig. 7. Left: 20-pixel boundary that can be cropped after un-distortion of the image. Center and right: non-overlapping regions of the images.
Fig. 6. Chessboard and corners detection during the camera calibration process.
4.1. Camera calibration

In the camera calibration, the parameters of the geometrical model of each camera are obtained, as well as the relative position between the cameras [34]. A self-calibration procedure has been programmed to calibrate the cameras using a chessboard (Fig. 6).

4.1.1. Data acquisition

The first step is the collection of the stereo image set. Calibration is done using this set, where the chessboard is used as the reference object for corner extraction. A stereo image pair consists of the left and right camera images of the calibration object taken at the same instant (i.e. with the object in the same position). In this way, images are taken with the chessboard located at different positions. Manual triggering, however, is error-prone and time consuming, so a program has been written that grabs both images and saves them whenever a key is pressed.

4.1.2. Image analysis

After the images of the chessboard are captured, the corner points are detected and the calibration procedure is applied in order to obtain the PTM components. First, each image is loaded and checked for chessboard corners (8 × 6 = 48); if all 48 corners are found, a sub-pixel algorithm is applied to each one to improve the location accuracy. The sub-pixel algorithm is based on the gradient of the gray level of the image pixels [25,37,38].

4.1.3. Stereo calibration

In the stereo calibration, the PTM of each camera is calculated. The PTM is composed of a linear combination of the intrinsic and extrinsic parameters of each camera, which are also calculated. From the extrinsic parameters of the cameras, the position of one camera is calculated in the frame of the other. This information is needed for the rectification process. Once the calibration is completed, the points are undistorted and the epipolar line of each corner point of one image is calculated in the other image. Each corner point should lie on the epipolar line calculated from the corresponding corner point in the other image; the deviation from this line is taken as the calibration error (average error = 0.63 pixels).

4.2. Image un-distortion and rectification

Once the stereo calibration is done, the next step towards obtaining the 3D position of the image features is computing the un-distortion and rectification maps [39]. The undistorted image is calculated by applying the distortion coefficients (radial and tangential) obtained in the calibration of each camera. For the rectification of the images, the spatial relation between the cameras' reference systems (the rotation and translation calculated in the stereo calibration) is used. The rectification transform of each image simplifies the stereo correspondence problem by making both camera image planes the same plane and, consequently, making all the epipolar lines parallel. After un-distortion and rectification, the image content is stretched towards the corners, leaving black regions of up to 20 pixels at the peripheries of both images. To eliminate these black regions, the selected ROI starts at (20, 20) with a width of 600 and a height of 440 pixels (Fig. 7, left). The non-overlapping regions of the two images are not useful for the reconstruction and cause extra computation in feature extraction and correlation [13,40]. To avoid this, the overlapping region is calculated and 75 pixels are cropped from the opposite sides of the two images (Fig. 7, center and right).
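A minimal OpenCV sketch of the calibration and rectification steps described in Sections 4.1 and 4.2 is given below. It assumes chessboard image pairs saved as left_*.png / right_*.png and a 25 mm square size; these file names and values are illustrative assumptions, not the actual files or settings used in this work.

```python
# Hedged OpenCV sketch of Sections 4.1-4.2: chessboard corner detection, stereo
# calibration and rectification. File names and numeric parameters are placeholders.
import glob
import cv2
import numpy as np

PATTERN = (8, 6)                                   # inner corners of the 9 x 7 chessboard
SQUARE = 25.0                                      # square size in mm (assumed value)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

# 3D coordinates of the chessboard corners in the board frame (Z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_points, left_points, right_points = [], [], []
for lf, rf in zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png"))):
    left = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(left, PATTERN)
    ok_r, corners_r = cv2.findChessboardCorners(right, PATTERN)
    if not (ok_l and ok_r):
        continue                                   # skip pairs where the 48 corners are not found
    # Sub-pixel refinement of the detected corners (Section 4.1.2).
    cv2.cornerSubPix(left, corners_l, (11, 11), (-1, -1), criteria)
    cv2.cornerSubPix(right, corners_r, (11, 11), (-1, -1), criteria)
    obj_points.append(objp)
    left_points.append(corners_l)
    right_points.append(corners_r)

size = left.shape[::-1]                            # (width, height); assumes at least one valid pair

# Individual calibration of each camera: intrinsics and distortion coefficients.
_, M1, d1, _, _ = cv2.calibrateCamera(obj_points, left_points, size, None, None)
_, M2, d2, _, _ = cv2.calibrateCamera(obj_points, right_points, size, None, None)

# Stereo calibration: relative rotation R and translation T between the two cameras.
err, M1, d1, M2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_points, left_points, right_points, M1, d1, M2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC, criteria=criteria)

# Rectification: make both image planes coplanar and row-aligned (Section 4.2).
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(M1, d1, M2, d2, size, R, T)
map1x, map1y = cv2.initUndistortRectifyMap(M1, d1, R1, P1, size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(M2, d2, R2, P2, size, cv2.CV_32FC1)
rect_left = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
rect_right = cv2.remap(right, map2x, map2y, cv2.INTER_LINEAR)
```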
4.3. Feature extraction and finding stereo correspondence

The feature extraction used for this application begins with the analysis of one of the images, extracting the corners with the
Fig. 8. Frames of the NAO and the camera set (left) and position of the cameras (right).
Fig. 9. Left: Aluminum rod designed for data collection. Right: Points measured at different depth values.
large eigenvalues. That is, the covariance matrix over the neighborhood of each pixel is calculated, Eq. (6), and the pixels with the highest eigenvalues are detected and extracted [37,38]:

\[
M = \begin{bmatrix}
\sum_{S(p)} \left( \frac{dI}{dx} \right)^2 & \sum_{S(p)} \frac{dI}{dx} \frac{dI}{dy} \\
\sum_{S(p)} \frac{dI}{dx} \frac{dI}{dy} & \sum_{S(p)} \left( \frac{dI}{dy} \right)^2
\end{bmatrix} \tag{6}
\]
where M is the gray level intensity covariance matrix of a pixel over the region or neighborhood S(p) in the x and y directions of the image. The information from the first image is used to identify the corresponding points in the other image by applying an algorithm that detects the flow of the features between the images. The algorithm used for flow detection is the sparse version of the Lucas–Kanade optical flow in pyramids [41].
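As a hedged OpenCV sketch of this step (not the exact code used in the paper), corners whose matrix in Eq. (6) has two large eigenvalues can be extracted with goodFeaturesToTrack and matched in the second image with the pyramidal Lucas–Kanade tracker; the file names and parameter values are placeholders.

```python
# Sketch of Section 4.3: minimum-eigenvalue corner extraction (Eq. 6) in the left image
# and stereo correspondence via sparse pyramidal Lucas-Kanade optical flow.
import cv2
import numpy as np

left = cv2.imread("rect_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rect_right.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi style corner extraction (minimum-eigenvalue criterion on Eq. 6).
corners = cv2.goodFeaturesToTrack(left, maxCorners=500, qualityLevel=0.01,
                                  minDistance=7, blockSize=7)

# Sparse pyramidal Lucas-Kanade "flow" from the left to the right image gives the
# stereo correspondence of each corner.
matched, status, err = cv2.calcOpticalFlowPyrLK(left, right, corners, None,
                                                winSize=(21, 21), maxLevel=3)

good_left = corners[status.ravel() == 1].reshape(-1, 2)
good_right = matched[status.ravel() == 1].reshape(-1, 2)
disparities = good_left[:, 0] - good_right[:, 0]     # x_l - x_r for each matched feature
```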
4.4. 3D extraction model and texture mapping

After the image rectification, the 3D coordinates of the identified feature points are calculated as shown in Eqs. (3)–(5), using the information provided by the stereo camera calibration, the feature extraction and the stereo correspondence finder algorithm. Once the 3D coordinates of the feature points have been extracted, the points are related using the Delaunay triangulation algorithm [42], which creates a triangulation of the set of points in 2D space. The 2D Delaunay triangulation ensures that no node lies in the interior of the circumscribed sphere (a circle in this case) of any simplex of the triangulation (the empty sphere property [43]). In this way, each point is connected to its natural neighbors. From the triangulation of the image, the 3D model update and the texture mapping are performed. OpenGL is used to render the three-dimensional objects and to update the object views according to the position of the point of view.

4.5. Integration with the NAO reference system

The coordinates of the points reconstructed with the stereo set are expressed in the reference system of the left camera, so they can be expressed in the NAO frame by applying the corresponding homogeneous transformation matrix, as shown in Fig. 8. The rotation and translation of the head frame with respect to the NAO global frame can be obtained using the NAO programming command "get position" with the part set to "Head". Finally, as the location of the cameras with respect to the head frame is known (position and orientation of the left camera frame), the homogeneous transformation matrix can be calculated and the reconstructed points can be expressed in the coordinate system of the NAO body or of the NAO supporting leg. Once the reconstructed points have been expressed in the NAO frame, they can be used directly for manipulations such as grasping or for navigating through the environment while avoiding obstacles.
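The sketch below illustrates Sections 4.4 and 4.5 under stated assumptions: the matched 2D features are triangulated with a standard Delaunay routine (SciPy here, rather than the OpenCV/OpenGL pipeline used in the paper) and the reconstructed 3D points are moved into the NAO frame with a homogeneous transformation. The pose and point values are made up; the real rotation and translation would come from NAO's "get position" interface for the head.

```python
# Hedged sketch of Sections 4.4-4.5: Delaunay triangulation of the 2D features and
# expression of the reconstructed 3D points in the NAO frame. All values are placeholders.
import numpy as np
from scipy.spatial import Delaunay

# 2D image coordinates of the matched features (placeholder values, in pixels).
features_2d = np.array([[20.0, 30.0], [200.0, 40.0], [120.0, 180.0], [300.0, 200.0]])
triangles = Delaunay(features_2d).simplices          # vertex indices of each triangle

# Reconstructed 3D points in the left-camera frame (placeholder values, in mm).
points_cam = np.array([[10.0, 5.0, 600.0], [60.0, 8.0, 620.0],
                       [30.0, 50.0, 640.0], [90.0, 55.0, 650.0]])

def to_nao_frame(points, R, t):
    """Apply the homogeneous transformation [R | t] of the camera frame in the NAO frame."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (T @ homogeneous.T).T[:, :3]

# Illustrative pose: camera axes aligned with the NAO frame, 450 mm above the torso origin.
R_cam = np.eye(3)
t_cam = np.array([50.0, 0.0, 450.0])
points_nao = to_nao_frame(points_cam, R_cam, t_cam)
```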
Fig. 10. Left: relation between the depth and the disparity: Depth is inversely proportional to disparity. Center: Error measurement in depth (Z coordinate) for the depth values measured. Right: relation between the error in depth measurements and the disparity.
Fig. 11. Results of the geometrical test: 3D coordinates of the recognized features on left hand side and Delaunay’s triangulation on the right hand side.
5. Tests and results

We have evaluated our proposed solution on a large number of test cases; in this section a few of them are presented and discussed in detail. First, the depth measurement tests are discussed, followed by the geometry test and the scene reconstruction. Illustrations and graphs are also provided for a better understanding of the experiments.

5.1. Depth measurement

A depth versus disparity test has been carried out, in which the calculated and measured depths are compared and plotted against the disparity. Depth is inversely proportional to disparity [25], and the error in the calculations is estimated using the measured values. The aluminum rod with a blue colored sliding strip designed for data collection is used for the depth measurements in this experiment. In total 40 measurements are taken, moving the rod by 10 cm each time and taking readings of two points, up to 230 cm away from the eye set, as shown in Fig. 9. The measurement results of the test are shown in Fig. 10. Depth is inversely proportional to disparity (approximately, depth proportional to disparity-2). In the close range (less than 50 cm) the measurement error remains under 1 cm, but at farther distances it reaches values of up to 3 cm. The disparity graph (Fig. 10, right) shows that high values of disparity (more than 150 pixels) are needed to keep the error under 1 cm. An error between 1 and 3 cm is acceptable for NAO navigation; however, if the error increases beyond 3 cm, there is a greater chance of collision with an obstacle. Finally, the average percentage of error (APE) and the average error (AE) are calculated in Eqs. (7) and (8), respectively:
\[
APE = \frac{1}{N} \sum_{i=1}^{N} \frac{D_i^{calculated} - D_i^{measured}}{D_i^{measured}} \cdot 100 = 1.06\% \tag{7}
\]
Fig. 12. The spatial representation of the box points (in red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
\[
AE = \frac{1}{N} \sum_{i=1}^{N} \left( D_i^{calculated} - D_i^{measured} \right) = -0.907\ \text{cm} \tag{8}
\]
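For illustration, the two error measures of Eqs. (7) and (8) can be computed with a few lines of NumPy; the depth values below are invented placeholders, not the 40 measurements reported above.

```python
# Tiny NumPy illustration of Eqs. (7)-(8); the depth values are invented placeholders.
import numpy as np

d_calc = np.array([50.4, 99.1, 151.2, 202.8])    # depths computed from disparity (cm)
d_meas = np.array([50.0, 100.0, 150.0, 200.0])   # depths measured with the rulers (cm)

ape = np.mean((d_calc - d_meas) / d_meas) * 100   # Eq. (7), average percentage of error
ae = np.mean(d_calc - d_meas)                     # Eq. (8), average error
print(f"APE = {ape:.2f}%  AE = {ae:.3f} cm")
```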
5.2. Geometry and scene reconstruction

In the geometry tests, the geometry of an object is measured and compared with the real-world geometry in order to determine the model building accuracy, as shown in Fig. 11.
Fig. 13. Automatic 3D reconstruction of the scene. The box and the elements are located and their position from the robot binocular system is measured (left). Finally the 3D model (Delaunay’s triangulation, center) is rendered with the image information (right).
Fig. 14. Another example of the programmed automatic 3D reconstruction of the scene. Measurement of the elements (left), reconstruction of the model (center) and rendering (right).
Fig. 15. Reconstruction of random targets in the scene.
The height (20 cm) and the length (45 cm) of the box have been calculated, resulting in errors lower than 0.3 cm. Also, the Z coordinates of points P1 and P2, located on the same vertical edge of the box, differ by less than 0.9 cm. Finally, the angle between P1P2 and P1P5 is 90° (see Fig. 12). Different scenes have been reconstructed and the 3D model has been extracted and rendered; the accuracy of the results is similar to that obtained in the previous test. In Figs. 13–15, the results of the 3D reconstruction of the box placed in different positions are shown, as well as different round-shaped elements spread around the scene. In the reconstructions shown below, markers have been used to highlight the corners of the box or to define spatial targets in the scene; in this case, the accuracy of the results is very sensitive to the marker shape and location. Another test carried out is the reconstruction of a chair (Fig. 16). In this case, the accuracy of the measurement depends on the contrast between the target features and the scene, but on the whole the system develops a good 3D model of almost any object whose features are distinguishable, as shown in Figs. 17 and 18.
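As a small worked example of the geometrical test above, the edge lengths and the corner angle can be computed directly from the reconstructed 3D points; the coordinates in the sketch are illustrative placeholders, not the points measured in the experiment.

```python
# Hedged sketch of the geometrical test: edge lengths and the corner angle of the box
# computed from reconstructed 3D points (placeholder coordinates, in cm).
import numpy as np

P1 = np.array([10.0, 0.0, 60.0])     # box corner
P2 = np.array([10.0, 20.0, 60.0])    # corner above P1, along the ~20 cm vertical edge
P5 = np.array([55.0, 0.0, 60.0])     # corner along the ~45 cm horizontal edge

height = np.linalg.norm(P2 - P1)                       # compared with the real 20 cm
length = np.linalg.norm(P5 - P1)                       # compared with the real 45 cm
cos_angle = np.dot(P2 - P1, P5 - P1) / (height * length)
angle_deg = np.degrees(np.arccos(cos_angle))           # should be close to 90 degrees
print(height, length, angle_deg)
```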
3D reconstruction of outdoor scenes differs from that of indoor scenes for several reasons: conditions such as light, wind, clouds and rain play an important role. Our outdoor experimentation was performed in sunlight and we assume that a background is available. Figs. 19–21 show that the accuracy of our proposed system is the same as for the indoor scenes. However, this may change if the conditions are not ideal; for example, the accuracy will be affected if it is cloudy or windy. To handle such conditions, the algorithms of the proposed technique would need to be modified. Furthermore, to grasp objects in an unknown environment, efficient image processing algorithms would need to be incorporated.

6. Discussion and conclusion

In this section, we briefly discuss the overall project and list the limitations and constraints that arose while completing this research. After that, we summarize the contribution of this research.
Fig. 16. Automatic reconstruction and rendering of a chair.
Fig. 17. Automatic reconstruction of the scene and different views of the resulting model.
Fig. 18. Automatic reconstruction of the previously measured box and different views of the resulting model (bottom: front view and side view).
6.1. Discussion

This paper presents the development of a low cost 3D perception system based on stereo vision. The system is integrated with the humanoid robot NAO, allowing its application in computing paths that avoid objects or in grasping target objects. The details of the system specification and design have been discussed in this paper. Briefly, the software used in this research comprises CMake, Matlab, OpenCV, OpenGL, the NAO SDK and Visual Studio, while the hardware comprises the binocular camera set, the NAO and the workspace. One of the most crucial factors in this research is camera calibration. If a good calibration has been done, all the results acquired are
highly accurate and precise. However, if the calibration is done carelessly, the results may remain distorted or the images may not be rectified accurately. The programmed application carries out the vision system calibration (self-calibration), obtaining the parameters of the geometrical model and of the distortion. For the self-calibration of the cameras a chessboard target has been used, avoiding the need for a high precision and expensive calibration object. Once the stereo images have been captured, the application automatically searches for features to be reconstructed and for the correspondence points of each feature in the images. The crucial steps of the 3D reconstruction are (1) camera calibration, (2)
Fig. 19. Outdoor example of automatic reconstruction.
Fig. 20. Outdoor example of bottles automatic reconstruction.
Fig. 21. Another outdoor example of automatic reconstruction.
feature extraction and (3) matching for disparity computation. The accuracy of the 3D models reconstructed from the stereo images is directly proportional to the accuracy of the camera calibration and feature extraction, as shown in Fig. 10. The camera head set is calibrated using 20 sets of binocular images; the matrices are calibrated with an average error of 0.63 pixels. The 3D models reconstructed using this application have been analyzed for depth and geometrical accuracy with different geometries. The results proved to be accurate at closer distances, and the accuracy of the model decreases with distance. Finally, the proposed system can be improved to enable NAO to grasp objects and navigate through an unknown area.

6.2. Conclusion

In this paper, we have attempted to obtain 3D perception from binocular vision and to implement it on NAO, our platform.
Although a few other methods have been developed for 3D perception, this paper has proposed a simple yet reliable method to achieve it. The proposed method has been tested and the results obtained are quite positive. The method is also straightforward and easy to understand. Furthermore, the main objective of this paper, obtaining 3D perception from binocular vision, can be considered achieved.

References

[1] M.A. Garcia, A. Solanas, 3D simultaneous localization and modelling from stereo vision, in: IEEE International Conference on Robotics and Automation, Vol. 1, 26 April–1 May 2004, pp. 847–853.
[2] Chung-Hsien Kuo, Hung-Chyun Chou, Shou-Wei Chi, Yu-De Lien, Vision-based obstacle avoidance navigation with autonomous humanoid robots for structured competition problems, Int. J. Humanoid Robot. (2013) http://dx.doi.org/10.1142/S0219843613500217.
[3] Daniel Maier, Cyrill Stachniss, Maren Bennewitz, Vision-based humanoid navigation using self-supervised obstacle detection, Int. J. Humanoid Robot. 10 (02) (2013) http://dx.doi.org/10.1142/S0219843613500163.
[4] Ho Seok Ahn, Dong-Wook Lee, Dongwoon Choi, Duk-Yeon Lee, Ho-Gil Lee, Moon-Hong Baeg, Development of an incarnate announcing robot system using emotional interaction with humans, Int. J. Humanoid Robot. 10 (02) (2013) http://dx.doi.org/10.1142/S0219843613500175.
[5] Umar Manzoor, Samia Nefti, Yacine Rezgui, Categorization of malicious behaviors using cognitive agents—CMBCA, Data Knowl. Eng. 85 (2013) 40–56.
[6] Naveed Ejaz, Umar Manzoor, Samia Nefti, Sung Wook Baik, A collaborative multi-agent framework for abnormal activity detection in crowded area, Int. J. Innov. Comput. Inf. Control 8 (5) (2012) 4219–4234.
[7] Umar Manzoor, Samia Nefti, iDetect: content based monitoring for complex network using mobile agents, Appl. Soft Comput. 12 (5) (2012) 1607–1619.
[8] S. Park, S. Lee, Fast distance computation with a stereo head–eye system, in: Proceedings of the First IEEE International Workshop on Biologically Motivated Computer Vision, 2000, pp. 434–443.
[9] Marco A. Gutierrez, E. Martinena, A. Sanchez, Rosario G. Rodríguez, P. Nunez, A cost-efficient 3D sensing system for autonomous mobile robots, in: XII Workshop of Physical Agents, Albacete, September 2011, pp. 1–8.
[10] S. Patnaik, A. Kumar, A.K. Mandal, Building 3-D visual perception of a mobile robot employing extended Kalman filter, J. Intell. Robot. Syst. 34 (1) (2002) 99–120.
[11] Francesco Rea, Samia Nefti-Meziani, Umar Manzoor, Steve Davis, Ontology enhancing process for a situated and curiosity-driven robot, Robot. Auton. Syst. 62 (12) (2014) 1837–1847.
[12] Photchara Ratsamee, Yasushi Mae, Kenichi Ohara, Tomohito Takubo, Tatsuo Arai, Human–robot collision avoidance using a modified social force model with body pose and face orientation, Int. J. Humanoid Robot. 10 (01) (2013) http://dx.doi.org/10.1142/S0219843613500084.
[13] C. Thompson, Depth perception in stereo computer vision, Technical Report, Stanford Univ., Dept. of Comp. Sci., October 1975.
[14] R.Y. Tsai, A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE J. Robot. Autom. 3 (4) (1987) 323–344.
[15] J. Neubert, et al., Automatic training of a neural net for active stereo 3D reconstruction, in: Proceedings IEEE International Conference on Robotics and Automation, Vol. 2, 2001, pp. 2140–2146.
[16] L. Ma, C. YangQuan, K.L. Moore, Flexible camera calibration using a new analytical radial undistortion formula with application to mobile robot localization, in: IEEE International Symposium on Intelligent Control, October 2003, pp. 799–804.
[17] M. Hansard, R. Horaud, Cyclopean geometry of binocular vision, J. Opt. Soc. Am. A 25 (9) (2008) 2357–2369.
[18] M.P. Meza, R. Montúfar-Chaveznava, Partial 3D reconstruction using evolutionary algorithms, Int. J. Appl. Sci. Eng. Technol. (2007) 107–112.
[19] S.G. Neto, et al., Experiences on the implementation of a 3D reconstruction pipeline, Int. J. Model. Simul. Pet. Ind. 2 (1) (2008) 7–8.
[20] Kinect for Windows, http://www.microsoft.com/en-us/kinectforwindows/ [Accessed 15 July 2013].
[21] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, Andrew Fitzgibbon, KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera, in: ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, October 16–19, 2011.
[22] Jimmy Tran, Alex Ufkes, Mark Fiala, Alexander Ferworn, Low-cost 3D scene reconstruction for response robots in real-time, in: Proceedings of the 2011 IEEE International Symposium on Safety, Security and Rescue Robotics, Kyoto, Japan, November 1–5, 2011, pp. 161–166.
[23] Guoqiang Fu, Arianna Menciassi, Paolo Dario, Development of a low-cost active 3D triangulation laser scanner for indoor navigation of miniature mobile robots, Robot. Auton. Syst. 60 (10) (2012) 1317–1326.
[24] D.A. Forsyth, J. Ponce, Computer Vision: A Modern Approach, Prentice Hall Professional Technical Reference, 2002.
[25] G. Bradski, A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly, Cambridge, 2008.
[26] P. Hillman, White Paper: Camera calibration and stereo vision, Square Eyes Software, Edinburgh, UK, October 2005.
[27] T.A. Clarke, J.G. Fryer, The development of camera calibration methods and models, Photogramm. Rec. 16 (91) (1998) 51–66.
[28] D. Gouaillier, et al., The NAO humanoid: a combination of performance and affordability, Computing Research Repository (CoRR), 2008.
[29] N. Qian, Binocular disparity review and the perception of depth, Neuron 18 (1997) 359–368.
[30] W.E.L. Grimson, Computing shape using a theory of human stereo vision (Ph.D. dissertation), Massachusetts Institute of Technology, Dept. of Mathematics, Massachusetts, 1980.
[31] D. Marr, T. Poggio, A computational theory of human stereo vision, Proc. R. Soc. Lond. 204 (1156) (1979) 301–328.
[32] W.E.L. Grimson, Why stereo vision is not always about 3D reconstruction, AI Memos, Massachusetts Institute of Technology, AIM-1435, July 1993.
[33] E.E. Hemayed, A survey of camera self-calibration, in: Proc. IEEE Conference on Advanced Video and Signal Based Surveillance, July 2003, pp. 351–357.
[34] J. Sun, H. Gu, Research of linear camera calibration based on planar pattern, World Acad. Sci. Eng. Technol. (2009) 627–631.
[35] J. Santolaria, et al., A one-step intrinsic and extrinsic calibration method for laser line scanner operation in coordinate measuring machines, Meas. Sci. Technol. 20 (4) (2007) 1–12.
[36] F.J. Brosed, et al., 3D geometrical inspection of complex geometry parts using a novel laser triangulation sensor and a robot, Sensors 11 (1) (2011) 90–110.
[37] A. Alexandrov, Corner detection overview and comparison, Comput. Vis. 558 (2002) 0–13.
[38] B.F. Alexander, K.C. Ng, Elimination of systematic error in subpixel accuracy centroid estimation, Opt. Eng. 30 (9) (1991) 1320–1331.
[39] Open Source Computer Vision Library, Intel Corporation, USA, 2001.
[40] M. Pollefeys, Tutorial notes of visual 3D modeling from images, Tutorial Notes, University of North Carolina, Chapel Hill, USA, 2002.
[41] J.-Y. Bouguet, Pyramidal implementation of the Lucas Kanade feature tracker, Intel Corporation, Microprocessor Research Labs, 2000, pp. 1–9.
[42] N.A. Golias, R.W. Dutton, Delaunay triangulation and 3D adaptive mesh generation, J. Finite Elem. Anal. Des. 25 (3–4) (1997) 331–341. Special issue: adaptive meshing part 2.
[43] B.N. Delaunay, Sur la sphere vide, Bull. Acad. Sci. USSR (6) (1934) 793–800.
Samia Nefti-Meziani received the M.Sc. degree in Electrical Engineering, the D.E.A. degree in Industrial Informatics, and the Ph.D. degree in Robotics and Artificial Intelligence from the University of Paris XII, Paris, France, in 1992, 1994, and 1998, respectively. In November 1999, she joined the Liverpool University, Liverpool, UK, as a Senior Research Fellow engaged with the European Research Project Occupational Therapy Internet School. Afterwards, she was involved in several projects with the European and UK Engineering and Physical Sciences Research Council, where she was concerned mainly with model-based predictive control, modeling, and swarm optimization and decision making. She is currently an Associate Professor of computational intelligence and robotics with the School of Computing Science and Engineering, The University of Salford, Greater Manchester, UK. Her current research interests include fuzzy- and neural-fuzzy clustering, neurofuzzy modeling, and cognitive behavior modeling in the area of robotics. She is a Full Member of the Informatics Research Institute, a Chartered Member of the British Computer Society, and a member of the IEEE Computer Society. She is a member of the international program committees of several conferences and is an active member of the European Network for the Advancement of Artificial Cognition Systems.
Umar Manzoor received the B.S. degree in Computer Science and the M.S. degree in Computer Science from the National University of Computer and Emerging Sciences, and the Ph.D. degree in Multi-Agent Systems from the University of Salford, Manchester, UK, in 2003, 2005, and 2011, respectively. In February 2006, he joined the National University of Computer and Emerging Sciences, Islamabad, Pakistan, as a Lecturer and was later promoted to Assistant Professor. In August 2012, he was promoted to Associate Professor; currently he is working at King Abdulaziz University, Jeddah, Saudi Arabia. He has published extensively in the areas of multi-agent systems, autonomous systems, behaviour monitoring and network management/monitoring, with work appearing in journals such as Expert Systems with Applications, Applied Soft Computing, Data and Knowledge Engineering and the Journal of Network and Computer Applications.
Steve Davis graduated from the University of Salford with a degree in Robotic and Electronic Engineering in 1998, and an M.Sc. in Advanced Robotics in 2000. He then became a Research Fellow gaining his Ph.D. in 2005 before moving to the Italian Institute of Technology in 2008. He returned to Salford in 2012 as a Lecturer in Manufacturing Automation and Robotics.