A Client/Server Architecture for Augmented Assembly on Mobile Phones

Charles Woodward(1), Mika Hakkarainen(1), Mark Billinghurst(2)
(1) VTT Technical Research Centre of Finland, (2) The HIT Lab New Zealand

ABSTRACT

In this article we present a client/server augmented reality (AR) system for viewing complex assembly models on mobile phones. The complex model information resides on a PC, which takes care of all of the heavy AR tracking and rendering computation. A camera phone is used as a client to show this information, augmented onto still images as an animated view. The mobile phone interface also supports correct masking of the assembly pieces as they come together, as well as a graphics only viewing mode intended to aid understanding of the assembly process. In addition to describing the mobile augmented assembly system, we present results from two pilot user studies evaluating elements of the user interface.
INTRODUCTION

In order to remain competitive, today's industry employs a growing number of product variations with shorter and shorter life cycles, requiring efficient ways to handle production planning and processing. Assembly line workers face increasingly complex tasks, and the new production needs place a growing memory load on them. Experienced workers may be able to handle several product variants at the same time, but new workers require training and guidance to handle the work tasks. Related activities such as maintenance and repair also require considerable knowledge of complex products.

Guidance in assembly tasks is commonly provided using printed blueprints or other documents. However, these are mainly two dimensional images or text, while the real assembly task is three dimensional, making the relationships between parts hard to understand. Computer based technologies such as electronic guides, multimedia training material and interactive documents have been developed, but these typically require the user to turn away from the assembly task to study a computer screen separate from the workspace.

In our research we explore how augmented reality (AR) technology can be used to provide more intuitive guides to assist with assembly, repair and training tasks. With AR technology, synthetic objects can be merged with the user's view of the real world so that the user can perform the assembly task without needing to look at a separate screen. The virtual objects appear in the correct position, so AR enables the creation of realistic looking 3D animated assembly "manuals".

Other researchers have developed prototype AR systems for supporting assembly tasks. These are typically based on head mounted displays (HMDs) connected to desktop or wearable computers. However, wearable computers are relatively expensive, they have a short battery life, and taking them into a harsh industrial environment is not always possible. Head mounted displays can also be bulky, they involve safety issues, and many workers are reluctant to wear them.

To overcome these disadvantages, we are interested in developing a reliable mobile phone based AR assembly system. Today's mobile phones have fast processors, significant memory, 3D graphics support, and high resolution displays. They contain various accessories that can be used for developing mobile AR applications, such as good quality cameras, connectivity by Bluetooth, WLAN and GPRS, and various
additional sensors, e.g. gyrometers. Mobile phones are widely available, they are unobtrusive, robust, and have a relatively long battery life.

In this article we present the augmented reality assembly system we developed for mobile phones. We first review related work, and then describe our prototype system based on a mobile phone client and PC server architecture. Next we provide results from two user studies identifying strengths and weaknesses of the system. In the final sections we provide directions for future work and conclusions.
BACKGROUND

Several other research groups have explored the use of AR for assisting with real world assembly tasks. One of the earliest of these efforts was the augmenting of wire bundle harnesses at Boeing (Caudell & Mizell, 1992), using wearable PCs with head mounted displays attached. With this interface, Boeing engineers were able to superimpose virtual images over the real world showing which real wires should be bundled together. Later, researchers developed AR assisted assembly applications for putting a lock into a car door (Reiners et al., 1998), for guided assembly (Sharma & Molineros, 1997), and for assembly of architectural structures (Webster et al., 1996), among others. In most cases these AR assembly interfaces were based around desktop computers, and the user needed to wear a head mounted display to view the AR content.

User studies have been conducted comparing performance with AR assembly interfaces against traditional approaches using manuals or other computer based technology. For example, Tang et al. (2003) compared the effectiveness of using (1) see-through HMD based AR, (2) a printed manual, (3) computer aided instruction (CAI) on a monitor, and (4) CAI in a see-through HMD with non-spatially registered information for an assembly task. In this case, the task was to complete a complex toy brick assembly with 56 steps. They found that subjects completed the assembly significantly faster in all three computer aided conditions (1, 3, 4) than with the printed manual (2). There was no difference in time between the AR condition and the other two CAI conditions; however, the AR condition did produce a significantly lower number of total errors and dependent errors than the other three conditions, and users felt that there was significantly less workload involved in using the AR technology.

Wiedenmaier et al. (2001) also conducted tests comparing performance in an assembly task using either printed material or AR technology. In this case, tasks with different degrees of difficulty were selected from an authentic assembly process. They found that AR proved more helpful than the paper manual for difficult tasks, while for easy tasks there was no difference between AR technology and the paper manual. Similarly, Baird (1999) conducted an experiment where subjects were asked to assemble a computer motherboard using four types of instruction: a paper manual, computer aided instruction, a video see-through AR display, and an optical see-through AR display. Baird found that both augmented reality conditions produced faster times and were more effective instructional aids for the assembly task than either the paper instruction manual or the computer aided instruction. Subjects also made fewer errors in the AR conditions than with the paper manual or computer aided instruction. However, many of the subjects indicated that both types of HMD were uncomfortable, and over half expressed concerns about poor image contrast with the see-through HMDs. Other assembly experiments showing that AR produces a significant performance benefit include the block assembly work of Pathomaree and Charoenseang (2005).

In our research we use mobile phones for providing an augmented assembly view, so our work is also related to previous work in handheld and mobile AR. Most of the first mobile AR assembly applications were based on wearable computers and head mounted displays. For example Klinker et al.
(2001) report on using a wearable computer and HMD to provide a mobile AR maintenance system for power plants. Harritos and Macchiarella (2005) developed a wearable computer based AR system for aerospace maintenance training. More recent augmented assembly solutions have been based on portable PC hardware, for example BMW’s augmented car maintenance system (Platonov et al., 2006). Currently
hand held ultra mobile PCs are a popular choice for portable computers; an example of a hand held PC based system for industrial AR applications is ULTRA (Riess et al., 2006).

Researchers have also used PDAs without HMDs to provide a handheld AR assembly experience. Among the first researchers to employ PDA devices for augmented reality were Geiger et al. (2001), in an application providing augmented operation instructions for home appliances. Their work was part of the AR-PDA project (Ebbesmeyer et al., 2002), which used a client/server approach where a camera image was captured on a PDA client and transferred to a remote PC server for processing and virtual graphics overlay. Similar work was presented by Pasman and Woodward (2004) for architectural applications. Christian et al. (2007) reported on using a remote PC to provide a PDA based AR e-Training environment for aircraft maintenance. The processing and graphics power of PDAs has progressed to the point that they can be used to provide stand alone handheld AR experiences, as shown by the work of Wagner and Schmalstieg (2003; 2006) and others, e.g. Pintaric et al. (2005). Riess and Stricker (2006) report on an AR PDA application for industrial maintenance, while Grafe et al. (2004) describe a PDA based system for representing complex assembly processes in the automotive industry. However, although there have been examples of handheld AR systems for supporting assembly tasks, there have been very few, if any, formal user studies conducted with these systems.

Most recently, smart phones have been used for mobile AR. The first of these solutions, AR-Phone (Cutting et al., 2003), used Bluetooth to send phone camera images to a remote server for processing and graphics overlay, taking several seconds per image. Henrysson ported ARToolKit to the Symbian phone platform (Henrysson & Ollila, 2004), while Moehring et al. (2004) developed an alternative computer vision tracking library that enabled stand alone AR applications to run entirely on the phone. Since then there have been other examples of mobile phone based AR interfaces, e.g. (Henrysson et al., 2005; Honkamaa et al., 2007). Henrysson et al. (2005) report on using mobile phones for scene assembly, and Andel et al. (2001) describe a collaborative interface for furniture arranging. However, in these cases the AR application is used just to show and manipulate entirely virtual content, not to assist with a real world assembly task. Similarly, there has been no formal evaluation of how effective a mobile phone based AR assembly application is.

Compared to previous research, our work is novel for several reasons. It is the first example of an AR assembly interface based on mobile phones. It also provides the first formal evaluation of a mobile phone based AR assembly interface. In addition, we are exploring unique interface elements that have not been demonstrated before in an AR assembly application. We believe the results presented will be helpful for others wanting to develop mobile phone based AR assembly and related applications.
DEMONSTRATION SYSTEM

Design Background

The goal of our research has been to provide an AR system on a mobile phone that can assist with real world assembly tasks. However, a number of challenges need to be addressed to provide an effective solution. Product model formats vary enormously, and the number of polygons in complex models can be very large (e.g. 500,000 polygons in Figure 2), making them difficult to render on mobile phones with their limited graphics capability. The 3D APIs on mobile phones are still limited, and the restricted input options may require cumbersome conversions from the original model format to a presentation format compatible with the mobile phone. The communication channels of mobile phones are not always sufficient for transferring the large
amounts of CAD data associated with the 3D assembly models. Finally, just the basic task of grabbing real time video for augmentation can be a major challenge on many mobile phone models.

To overcome these challenges we decided to adopt a still image based client/server approach. All of the complex model rendering, handling of model formats and image processing is performed on the PC server. In this way we can make use of a complex augmented reality rendering system that already exists in the PC environment and connect it to the mobile phone client with relatively small programming effort. This also considerably simplifies the mobile phone development. All the mobile phone client needs to do is take a still image of the real world scene, send it to the remote server, and display the composite AR image sent back from the server. The still image approach enables the system to be implemented even on low-end (e.g. Java ME enabled) camera phones.

Using still images is also often more ergonomic than viewing real time video. With still images the user can view the augmented content at his or her own pace without having to constantly point the mobile device towards the target workspace. The user can move the phone to the position from which s/he wants to view the model, take an image, and then move to a convenient position to view the results and assemble the real objects. The assembly station is typically static, so it is usually sufficient to have just the assembly information animated, rather than relying on a live video feed for AR.
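To make the division of labour concrete, the sketch below outlines the client's still image round trip in Python. It is only a conceptual outline: the function names (capture_still_image, send_to_server, display) are our own placeholders, and the real client described in the next section is a native Symbian C++ application.

```python
# Conceptual sketch of the still-image client loop (placeholder functions,
# not the actual Symbian C++ implementation).

def capture_still_image() -> bytes:
    # On the phone this would grab a 640x480 frame from the camera;
    # here we just return dummy pixel data.
    return b"\x00" * (640 * 480 * 3)

def send_to_server(image: bytes, command: str) -> bytes:
    # On the phone this goes over Bluetooth or WLAN to the PC server, which
    # performs marker tracking and rendering and returns the augmented view.
    # Here the image is simply echoed back as a stand-in.
    return image

def display(image: bytes) -> None:
    print("showing augmented view, %d bytes" % len(image))

# One assembly step: frame the workspace, request the next work phase,
# then study the augmented result at your own pace.
frame = capture_still_image()
augmented = send_to_server(frame, command="next")
display(augmented)
```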
User Interface

Our prototype system "Sagasse" (Symbian Augmented Assembly) is made up of two parts: (1) the PC server software and (2) a visualization/controller client on a Nokia smart phone. On the server side we employ a version of the augmented assembly software we previously implemented in the AugAsse project (Sääski et al., 2008; Salonen et al., 2009). The tracking system on the server side uses a customized version of ARToolKit 4.06. The rendering of the virtual assembly model on the server uses a scene graph and the OpenSceneGraph 2.0 library. The server also displays a graphics view; this is not strictly necessary, but it allows an external observer to follow what is happening with the system.

Figure 1 shows the pilot application from the AugAsse project, a hydraulic component assembly task at the Finnish tractor company Valtra Plc. The user looks at a real engine block that has been surrounded by ARToolKit markers and sees virtual parts added to the real scene, step by step. The AugAsse project normally uses a PC and HMD to display the AR view, but in this case we use the tracking and rendering on a PC server to create images that are delivered for display to the mobile client.
Figure 1. 3D part augmented on physical assembly. Image courtesy of Valtra Plc.
The client is used to take single images of the real assembly site, send them to the server, and show the augmented views to the user on the phone. The phone client software is a native Symbian application developed in C++ and is targeted at Nokia mobile phones. The client side rendering is based on native Symbian bitmaps (CFbsBitmap), and the VR mode is implemented using OpenGL ES. The client software is not aware of the scene graphs or tracking systems employed; it is essentially just a thin remote viewer and controller for the system. This approach allows developers to integrate more sophisticated or more time consuming tracking systems on the server side without affecting the client application. The GUI is built using basic native Symbian GUI components.

A key feature of the client is the ability to send data wirelessly to the PC server. Data transfer between the phone and the PC can use either Bluetooth (short range) or plain TCP/IP sockets over a WLAN connection. The TCP/IP version allows the server to be located anywhere, even on the other side of the world from the mobile phone, although in that case a GPRS or 3G connection would be used on the phone to provide data transfer.
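The wire format between client and server is not detailed in this article; a common and simple choice for this kind of exchange is one length-prefixed message per image. The Python sketch below (the real client is Symbian C++) shows such a framing purely under that assumption; the server address and port in the usage comment are placeholders.

```python
import socket
import struct

def send_blob(sock: socket.socket, payload: bytes) -> None:
    """Send one length-prefixed message (e.g. a captured still image)."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_blob(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message (e.g. the augmented image)."""
    header = _recv_exactly(sock, 4)
    (length,) = struct.unpack("!I", header)
    return _recv_exactly(sock, length)

def _recv_exactly(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return buf

# Example client-side round trip (address and port are placeholders):
# with socket.create_connection(("192.168.0.10", 5000)) as s:
#     send_blob(s, camera_image_bytes)
#     augmented = recv_blob(s)
```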
Operation

In our present implementation, the augmented reality environment is based on a set of markers, i.e. a "marker field" (Siltanen et al., 2007), attached to the assembly site. The assembly workflow is based on a number of steps that specify which virtual objects should be shown, in order, in the AR view to complete the real world assembly task. This workflow is stored on the server in an XML based file, where the assembly site information (e.g. the marker configuration), the models and the work phases are described. The client does not need any information about how the assembly instructions are actually defined or how the AR tracking is performed; these are all handled by the server.

To use the application, the user takes an image of the assembly site with the mobile phone client and sends it to the server. The user selects whether s/he wants to see an augmented view of the next work phase (normal operation), the previous work phase (a backstep if something went wrong), or the current one again (to verify it from a new view angle). The user controls the work phase steps by taking a new image, which happens by pressing buttons on the phone keypad. Pressing button '1', the user takes a new picture and asks the system to provide an augmented view of the previous work phase. Button '2' also takes a new picture, but this time the system returns a new augmented image of the current work phase; this allows the user to take new pictures from different viewpoints within the same work phase and gain more confidence about the selected item/block and assembly position. When the user is ready to move to the next work phase, s/he takes a new image by pressing button '3'. In this way the system provides step-by-step information about the ongoing assembly task. The user places a real part at the position shown by the AR model, and then takes a new photo to see what the next part and placement should be.

The server software augments the image with 3D instructions just as it would normally do with live video, with the still image taking the place of the video frame. The server sends the augmented information to the client, where it is shown to the user. The augmented information is properly masked by the real objects and shown in correct perspective.
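The exact schema of the workflow file is not specified here. Purely as an illustration of the kind of information it carries (marker configuration, models and ordered work phases), a hypothetical file and the way the server might read it could look as follows; all element names, file names and model names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow description; the real schema used by the server
# may differ, this only mirrors the information listed above.
WORKFLOW_XML = """
<assembly name="example-assembly">
  <markerfield config="markers.cfg"/>
  <phase id="1" model="part_a.osg"/>
  <phase id="2" model="part_b.osg"/>
  <phase id="3" model="part_c.osg"/>
</assembly>
"""

root = ET.fromstring(WORKFLOW_XML)
marker_config = root.find("markerfield").get("config")
phases = [(p.get("id"), p.get("model")) for p in root.findall("phase")]

print(marker_config)   # markers.cfg
print(phases)          # [('1', 'part_a.osg'), ('2', 'part_b.osg'), ('3', 'part_c.osg')]
```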
All the commands are available as command menu items, and the most often used commands also as keypad shortcuts. The phone client allows the user to digitally zoom the image to take a closer look at the new part and its position in the assembly site. The client software also supports a landscape viewing mode to make optimal use of the phone's small display. See Figures 2 and 4.
Figure 2. Pilot assembly on mobile phone.
Masking and Animation

As items are added to the assembly site, the server gains knowledge of how the previous parts were placed. Using this information, we use masking to provide correct occlusion between the real and virtual items. When the user moves from one work phase to another, s/he has placed the real items in place. Since the virtual items correspond to the real ones, the server is able to place the new virtual item "behind the real one", i.e. mask the new item with the z-buffer image of the real item.

We implemented an animation mode to provide a better understanding of which part is to be placed next and where it should go. Each time the server receives a new image, it creates an animation of the current part being placed in the AR scene by using a series of subimages placed on the main image. A subimage consists of the smallest image area (rectangle) containing the part, coupled with position information relative to the original image. Each subimage also contains a bit map defining which pixels actually show the part and which are transparent. On the mobile phone's small display, just a few subimages are required to generate the impression of smooth animation; see (Pasman & Woodward, 2003), where a similar approach was used on a PDA device with non-animated images. Each of the subimages is also masked with the real part. The server sends the subimages, along with their position and bitmap information, to the mobile phone, which is then able to display the simulated animation of the virtual part. Besides masking, the subimage sizes (perspective) can vary within an animation sequence, and the method supports not only translations but any other transformations as well. Figure 3 demonstrates masking and animation with assembly of the 3D wooden puzzle used in our user evaluation tests.
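On the phone, each animation frame amounts to pasting a masked subimage onto the still image at its stated offset. The sketch below shows that compositing step with NumPy arrays purely for illustration; the real client operates on native Symbian bitmaps.

```python
import numpy as np

def paste_subimage(base, sub, opaque, top_left):
    """Composite one animation subimage onto the augmented still image.

    base     : HxWx3 uint8 array, the full still image shown on the phone
    sub      : hxwx3 uint8 array, the smallest rectangle containing the part
    opaque   : hxw bool array, True where the pixel shows the part
               (False pixels stay transparent, e.g. masked by a real object)
    top_left : (row, col) position of the subimage within the base image
    """
    r, c = top_left
    h, w = opaque.shape
    region = base[r:r + h, c:c + w]
    region[opaque] = sub[opaque]      # copy only the visible part pixels
    return base

# Tiny synthetic example: paste a 40x60 white rectangle at (100, 130).
frame = np.zeros((240, 320, 3), dtype=np.uint8)
part = np.full((40, 60, 3), 255, dtype=np.uint8)
visible = np.ones((40, 60), dtype=bool)
frame = paste_subimage(frame, part, visible, (100, 130))
```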
Figure 3. Augmented assembly with masking and animations.
VR Graphics Mode

We provided a VR mode on the phone client which shows 3D graphics of the objects placed in the assembly task, to give fast feedback to the user. Representing complete assembled 3D virtual models on the mobile phone is generally not possible because of their complexity and the very limited mobile graphics capability, so in our current implementation we have tried a simplified approach. In the VR mode the user sees just the bounding box of the whole assembly project and the current item's bounding box. To aid object recognition, the faces of the current item's bounding box are textured with the orthogonal views of the item. In other words, the VR mode in effect shows the items as a "box off the shelf" on the assembly site. Initially the orientation of the bounding box matches the current image's viewpoint; however, the user may freely rotate the view to understand the assembly instructions better. Model rotation is controlled by keypad input: the left and right navigation keys are used, and each key press rotates the model 5 degrees about the vertical model axis. The usefulness of the VR mode depends on how complex the assembly case at hand is; we expect it to be more useful with complex models, where parts are more easily confused, than with simple 3D shapes.
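The rotation itself is ordinary geometry: the bounding box corners are turned about the vertical axis by an angle accumulated in 5 degree steps. The NumPy sketch below illustrates the computation for reference only; in the client the equivalent transform is applied with OpenGL ES.

```python
import numpy as np

def rotate_about_vertical(corners, angle_deg):
    """Rotate bounding-box corners (Nx3, centred on the box centre) about
    the vertical (y) axis by the accumulated view angle in degrees."""
    a = np.radians(angle_deg)
    rot = np.array([[ np.cos(a), 0.0, np.sin(a)],
                    [ 0.0,       1.0, 0.0      ],
                    [-np.sin(a), 0.0, np.cos(a)]])
    return corners @ rot.T

# Eight corners of a unit cube around the origin; each left/right navigation
# key press changes the accumulated angle by +/- 5 degrees.
corners = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float)
angle = 0.0
angle += 5.0                       # one 'right' key press
rotated = rotate_about_vertical(corners, angle)
```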
PERFORMANCE

The client software was implemented with the Symbian Series 60 OS and tested on a Nokia N95 phone. The N95 has a 262 MHz ARM 9 processor, graphics hardware, a 5 megapixel camera and a screen resolution of 320 x 240 pixels. The current implementation of the system supports sending 640 x 480 (24 bpp) images from the phone to the server, and the server sends back subimages of at most 320 x 240 (24 bpp) resolution. Using a PC with a 2.4 GHz processor, there was no noticeable delay in the PC server software
processing the image and getting it ready to send back to the mobile phone. This was because the virtual models used were generally quite simple. More significant is the delay due to the data transmission times between the phone and the PC server. To quantify this we compared the Bluetooth and WLAN speeds in our demo case by measuring the time it took to send an image to the server and receive it back (over 24 images). Creating the Bluetooth connection to the server takes a few seconds, but the bandwidth is smaller than with WLAN. Setting up an ad-hoc WLAN connection between the phone and the PC can take slightly longer, but after that the bandwidth is much higher than with Bluetooth. In our tests, the average time to send and receive the image was 19.1 seconds over Bluetooth and 3.4 seconds over WLAN, nearly six times faster. The median time was 16.7 seconds with Bluetooth and 2.3 seconds with WLAN.

It should be noted, however, that no special effort was made to optimize the data transmission times. For example, using 8 bits per pixel (8 bpp) for the images would probably be quite sufficient for our application. In fact, when using ARToolKit the images sent to the server could even be thresholded to black-and-white (1 bpp) images on the client side. Further optimizations could be carried out in a similar manner as explained in (Pasman & Woodward, 2004).
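As a rough back-of-the-envelope check, the uncompressed payload sizes implied by the image formats above, and the effective throughput implied by the measured average round-trip times, work out as follows. The calculation assumes a single full-size downstream image and ignores multiple subimages, compression, protocol overhead and connection set-up, so the figures are only indicative.

```python
# Uncompressed payload sizes for the image formats mentioned above.
up_24bpp = 640 * 480 * 3        # 921,600 bytes (~900 kB) phone -> server
down_24bpp = 320 * 240 * 3      # 230,400 bytes (~225 kB) server -> phone
up_8bpp = 640 * 480             # 307,200 bytes, one third of the 24 bpp image
up_1bpp = 640 * 480 // 8        #  38,400 bytes, thresholded b/w for ARToolKit

# Effective throughput implied by the measured average round-trip times.
total = up_24bpp + down_24bpp
print("WLAN      ~%.0f kB/s" % (total / 3.4 / 1000))    # ~339 kB/s
print("Bluetooth ~%.0f kB/s" % (total / 19.1 / 1000))   # ~60 kB/s
```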
USER EVALUATION

We conducted two small pilot studies to evaluate the effectiveness of the system, and in particular its various user interface elements. For these studies we chose an assembly task that involved correctly constructing a wooden 3D Soma cube puzzle made up of seven interlocking pieces. When assembled correctly the pieces form a perfect cube. Figure 4 shows one of the subjects solving the puzzle. A WLAN connection between the phone and the server was used in the experiments.
Figure 4. Subject solving the 3D cube puzzle.
Public Feedback

The system was first demonstrated during a HIT Lab NZ open house for attendees of a local computer science conference. Many people successfully used it to assemble the real puzzle, and eight of them answered a small informal survey. Each question was answered on a Likert scale of 1 to 7, where 1 = not very easy/helpful/enjoyable and 7 = very easy/helpful/enjoyable. Table 1 shows the average results from these eight users.
Question                                                      Average result
1: How easy was the system to use?                                      4.63
2: How helpful was the AR system to solving the 3D puzzle?              6.50
3: How easy was it to interact with the system?                         5.38
4: How enjoyable was it to use the system?                              5.13
Table 1. Pilot study survey results.

The small sample size and lack of conditions to compare against make it impossible to perform any statistical analysis on these results. However, it is clear that the users thought the system was beneficial; for example, in response to Q2, "How helpful was the AR system to solving the 3D puzzle", all of the respondents replied with a 6 or 7 (7 = very helpful). In general people enjoyed the application and were very interested in it. Some found it difficult to remember all the buttons, but they quickly learned the basic sequences. When told about real life case studies of how the technology could be used, they became more interested and found the main concept of the system very useful. The puzzle itself was slightly too "entertaining and simple" to show the real benefit of the system, but the participants enjoyed it as a proof of concept and were interested in how the technology could be applied to real life scenarios.
Evaluation of User Interface Elements

Following the informal testing at the HIT Lab NZ open house, we conducted a more formal study to evaluate the different interface elements. Using the same task with a different set of subjects, we wanted to explore performance in three conditions:

AR: Using the AR viewing mode with static pictures only (no animations).
AR+Animation: As with the AR mode, but adding animation to the augmented view.
AR+VR: As with the AR mode, but also adding the graphics only VR mode.

There were 11 subjects (8 men, 3 women, aged 21 to 47) who took part in the user evaluation. Each subject experienced all three conditions and solved a different puzzle in each condition. The puzzles were all cube puzzles, but in each case the cube was rotated to a different 90 degree orientation and the order of the blocks was changed to prevent learning effects. The order of the AR, AR+VR and AR+Animation conditions was also counterbalanced between users to remove any order effects. The time it took to complete each trial was measured, and after each condition subjects filled out a survey with the following eight Likert scale questions on a scale of 1 to 7 (1 = not very, 7 = very):

Q1: How easy was the system to use?
Q2: How helpful was the AR system to solving the 3D puzzle?
Q3: How helpful were the animated graphics to solving the 3D puzzle?
Q4: How helpful was the VR wire frame view to solving the 3D puzzle?
Q5: How helpful was the textured VR view to solving the 3D puzzle?
Q6: How easy was it to interact with the system?
Q7: How enjoyable was it to use the system?
Q8: How easy was it to understand where the puzzle blocks needed to go?

Some of these questions (Q3, Q4, Q5) were only asked in the appropriate condition. In addition, after all of the trials were finished, the subjects were asked to rank the three conditions in order (from best to worst) in response to the following ranking questions:

R1: Ease of use?
R2: Ease of understanding of where the blocks needed to go?
R3: How enjoyable was the system?
R4: In which condition did you perform best?

Subjects were also asked which was the most valuable feature of the system and which feature needed most improvement, and were given the opportunity to provide any other comments or feedback.
Results

All of the subjects were able to complete the task, and only one person made any errors (needing to backstep) while solving the puzzle. There was a significant difference in performance time across the three conditions. Table 2 shows the average time it took subjects to solve the puzzle in each of the three conditions. Using a one factor ANOVA we found a significant difference in the average time it took to solve the puzzles (F(2,27) = 4.48, p < 0.05). The AR+Animation condition was the fastest of the three, over 80 seconds faster than the AR condition alone. Using a Bonferroni test for pairwise comparisons, we found that there was a significant difference in time taken between the AR+Animation condition (M = 147, SD = 7.0) and both the AR condition (M = 216, SD = 20.0, p