Mechanical assembly assistance using marker-less augmented reality system

Yue Wang (Key Laboratory of Contemporary Design and Integrated Manufacturing Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China)
Shusheng Zhang (Key Laboratory of Contemporary Design and Integrated Manufacturing Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China)
Sen Yang (Key Laboratory of Contemporary Design and Integrated Manufacturing Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China)
Weiping He (Key Laboratory of Contemporary Design and Integrated Manufacturing Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China)
Xiaoliang Bai (Key Laboratory of Contemporary Design and Integrated Manufacturing Technology, Ministry of Education, Northwestern Polytechnical University, Xi’an, China)

Assembly Automation

ISSN: 0144-5154

Publication date: 5 February 2018

Abstract

Purpose

This paper aims to propose a real-time augmented reality (AR)-based assembly assistance system using a coarse-to-fine marker-less tracking strategy. The system automatically adapts to the tracking requirements when the topological structure of the assembly changes after each assembly step.

Design/methodology/approach

The prototype system’s process can be divided into two stages: the offline preparation stage and the online execution stage. In the offline preparation stage, planning results (assembly sequence, part positions, rotations, etc.) and image features [gradient and oriented FAST and rotated BRIEF (ORB) features] are extracted automatically from the assembly planning process. In the online execution stage, image features are likewise extracted and matched with those generated offline to compute the camera pose, and the planning results stored in XML files are parsed to generate the assembly instructions for manipulators. In the prototype system, the working range of the template matching algorithm LINE-MOD is first extended by using depth information; then, a fast and robust marker-less tracker that combines the modified LINE-MOD algorithm and an ORB tracker is designed to update the camera pose continuously. Furthermore, to track the camera pose stably, a tracking strategy based on the characteristics of the assembly is presented herein.

Findings

The tracking accuracy and time of the proposed marker-less tracking approach were evaluated, and the results showed that the tracking method could run at 30 fps and that its position and pose tracking accuracy was slightly superior to that of ARToolKit.

Originality/value

The main contributions of this work are as follows: First, the authors present a coarse-to-fine marker-less tracking method that uses a modified state-of-the-art template matching algorithm, LINE-MOD, to find the coarse camera pose. Then, a feature point tracker, ORB, is activated to calculate the accurate camera pose. The whole tracking pipeline needs, on average, 24.35 ms per frame, which satisfies the real-time requirement for AR assembly. On the basis of this algorithm, the authors present a generic tracking strategy based on the characteristics of the assembly and develop a generic AR-based assembly assistance platform. Second, the authors present a feature point mismatch-eliminating rule based on the orientation vector. By obtaining stable matching feature points, the proposed system can achieve accurate tracking results. The evaluation of camera position and pose tracking accuracy shows that the study’s method is slightly superior to ARToolKit markers.

Citation

Wang, Y., Zhang, S., Yang, S., He, W. and Bai, X. (2018), "Mechanical assembly assistance using marker-less augmented reality system", Assembly Automation, Vol. 38 No. 1, pp. 77-87. https://doi.org/10.1108/AA-11-2016-152

Publisher: Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited


1. Introduction

Augmented reality (AR) is a type of human–computer interaction technology that enhances the natural visual perception of a human user by means of computer-generated information (e.g. 3D models, annotations and text; Azuma, 1997). To augment the real world with this information seamlessly, the camera pose must be calculated. The techniques that achieve this goal can be classified into three groups: marker-based, sensor-based and marker-less tracking.

Marker-based tracking has been widely and deeply researched (Ng et al., 2013; Wang et al., 2013; Zhu et al., 2013). Westerfield et al. (2015) used ARToolKit markers to track motherboard models so that the manipulator could assemble the real element in the correct position. Uchiyama and Marchand (2011) adopted random dot markers for tracking; their approach was robust against occlusion and user interaction. Wang et al. (2005) explored tracking using color-based non-square visual markers for assembly guidance. Most AR prototype systems have used marker-based techniques because of their low computational cost and good performance even under poor illumination. However, marker-based tracking approaches also suffer from drawbacks, namely, visual pollution and tracking jitter owing to the absence of adequate feature points. In addition, external markers are not permitted in many industrial applications.

Sensor-based tracking is another popular tracking method (You and Neumann, 2001), but it is easily interfered with in industrial scenarios. For instance, magnetic tracking is easily disturbed by magnetic fields or metal in the workshop, ultrasonic tracking is susceptible to ultrasonic noise in the environment, and global positioning system tracking suffers from low precision and high latency.

Marker-less tracking, which has undergone impressive improvements in recent years (Crivellaro et al., 2015; Mottaghi et al., 2015; Engel et al., 2014), is seen as a desirable tracking approach for AR assembly. Representative approaches include feature point-based methods (Pauwels et al., 2013; Pressigout and Marchand, 2006; Kim et al., 2003), edge-based methods (Tombari et al., 2013; Damen et al., 2012; Payet and Todorovic, 2011), model-based methods (Wang et al., 2015; Hinterstoisser et al., 2012b; Prisacariu and Reid, 2012) and simultaneous localization and mapping (SLAM) methods (Engel et al., 2014; Caruso et al., 2015; Mur-Artal and Tardós, 2014). However, feature point-based methods are prone to tracking jitter or turbulence because mechanical parts for assembly are poorly textured. Edge-based methods are sensitive to cluttered backgrounds and are not robust to occlusions during the interaction process. Model-based methods have been widely used for marker-less tracking in AR-based assembly, but retrieving data from a large number of reference images to find the key frame leads to a vast search space and heavy computational load, which greatly reduces real-time performance and system availability. SLAM-based methods are limited to estimating a relative camera pose, which is not suitable for AR-based assembly; moreover, they are prone to tracking failure in dynamic scenes. Additionally, because the geometric appearance of the assemblage changes as assembly progresses, most existing marker-less tracking methods are no longer practical. Therefore, an immense body of research has focused on developing specific ad hoc tracking methods.

In this paper, we present a fast and robust marker-less tracking approach for AR mechanical assembly. The system generates instructions by superimposing a virtual model on the real parts, indicating the next step and how the part is assembled. The online assistance provided can reduce the time technicians spend searching for information in paper-based documentation, thus enabling faster and more reliable operation.

The main contributions of our work are as follows: First, we present a coarse-to-fine marker-less tracking method that uses a modified state-of-the-art template matching algorithm, LINE-MOD (Hinterstoisser et al., 2012a), to find the coarse camera pose. Then, we use a feature point tracker, oriented FAST and rotated BRIEF (ORB) (Rublee et al., 2011), to calculate the accurate camera pose. The whole tracking pipeline needs, on average, 24.35 ms per frame, which satisfies the real-time requirement for AR assembly. On the basis of this algorithm, we present a generic tracking strategy based on the characteristics of the assembly and develop an AR assembly assistance platform. Second, we present a feature point mismatch-eliminating rule based on the orientation vector. By obtaining stable matching feature points, our system can achieve accurate tracking results. The evaluation of camera position and pose tracking accuracy shows that our method is slightly superior to ARToolKit markers.

2. Related works

Marker-less tracking is an important but still challenging issue in AR systems, especially when the real elements in the scenario are texture-less. It has a long research history; here, we focus only on representative works that are closely related to ours.

2.1 Model-based tracking

Model-based tracking is one of the most widely used approaches for marker-less AR systems. Vacchetti et al. (2004) proposed a stable 3D tracking method that combined natural feature matching and the use of key frames. They used the model information to track every aspect of the target object and to keep following it even when it was occluded or only partially visible. However, their method needs manual initialization of the tracker and manual intervention when tracking fails. Gordon and Lowe (2006) presented a model-based tracking approach that used local image features. Their method did not require camera pre-calibration, prior knowledge of the scene geometry, manual initialization of the tracker or placement of special markers. Unfortunately, their approach was unable to handle proper occlusion of inserted virtual content by real objects, which is inappropriate for mechanical parts assembly. More recently, Hinterstoisser et al. (2012a, 2012b) proposed a model-based template-matching method, LINE-MOD, for tracking in heavily cluttered scenes. The algorithm is fast and has good real-time performance. Unfortunately, because LINE-MOD is neither scale invariant nor rotation invariant, it cannot be applied to vision systems with significant scale or rotation changes. Radkowski (2015) and Garrett et al. (2014) presented a marker-less tracking approach using an RGB-D camera for AR assembly, but their method was limited to tracking a pump shell that remained visible and unchanged in shape throughout the assembly process, which may not be applicable to all assembly situations. Wang et al. (2015) used 2D/3D correspondences between a known 3D object model and 2D scene edges in an image for tracking. As their method depended only on the contours of the object, the pose was ambiguous when the object had a symmetrical structure. Wang et al. (2016a, 2016b) proposed a hybrid marker-less tracking algorithm that used key features and achieved high tracking performance, but it required the user to initialize the pose and position of the real components or sub-assembly by selecting a set of points on the captured image in each step of the assembly.

2.2 Feature-based tracking

Feature-based tracking is another popular tracking approach and has been studied extensively. To improve tracking performance, numerous feature descriptors have been used, including the scale-invariant feature transform (SIFT) (Lowe, 2004), speeded-up robust features (SURF) (Bay et al., 2006), binary robust invariant scalable keypoints (BRISK) (Leutenegger et al., 2011) and ORB (Rublee et al., 2011). Lee and Hollerer (2009) proposed a marker-less camera tracking approach using SIFT features, which proved useful for real-time camera pose estimation without fiducial markers. However, to propagate the established coordinate system, the workspace had to be a flat surface providing at least a dominant plane, which limited wider application. Alvarez et al. (2011) designed a hybrid tracking method that combined an edge tracker, a point-based tracker and a 3D particle filter to update the camera pose continuously. Their AR disassembler could run in real time and deal with scenes that lacked texture. Chen and Li (2010) developed a keypoint tracking method that used the FAST algorithm to extract keypoint features and descriptors. To improve robustness, a Kanade–Lucas–Tomasi (KLT) tracker was added to deliver additional information for pose estimation. Nevertheless, their method could not distinguish different objects, an important issue for an AR assembly system.

Although the above methods demonstrated good tracking performance on their specific systems, most of them easily result in tracking jitter or turbulence owing to the lack of significant texture information on mechanical parts. Moreover, to realize 3D tracking, feature-based methods also need to process a large number of reference images obtained from multiple views in real time to acquire an accurate camera pose. Additionally, although most of the above methods are fast enough to process one frame, relying only on a feature-based method can hardly meet the real-time requirement of augmented reality owing to the vast search space and heavy computational load.

3. Overview of the study’s approach

In this paper, we propose a marker-less tracking approach, and based on this methodology, an AR assembly assistance system is presented. The system’s process can be divided into two stages: the offline preparation stage and the online execution stage. In the offline preparation stage, we use CAD software for assembly sequence generation, as previously described (Barnes et al., 2004). In this process, the position and rotation of each part to be assembled are defined relative to the base part, which is the first node of the generated sequence. Moreover, the assembly sequence and the position and rotation of each part and tool associated with the project are all stored in an XML file. At the same time, multi-view images after each assembly-planning step are generated automatically, and image features are extracted and stored in XML files (Section 3.1). In the online execution stage, the XML file of the assembly sequence is first parsed to identify the next part to be assembled and the tools to be used, and then the image features of the base part are extracted and matched with those generated offline. In this manner, the camera pose is calculated (Section 3.2), and guiding information loaded from the assembly management platform based on these parsing results is superimposed on the base part. Finally, the operator can follow this assistance information and install the part in the correct position on the base element. However, after the part is assembled on the base, the geometry of the assemblage changes and the tracking process may fail. Therefore, we propose a tracking strategy based on the characteristics of the assembly: when the “Next Step” button is pressed, the XML file that stores the image features is switched accordingly (see Figure 1).

3.1 Offline preparation

After each step of assembly planning, multi-view images (called “reference images” later in this paper) are generated automatically using the virtual camera in the CAD software, which is much more convenient and efficient than an online learning method (Kalal et al., 2010). Furthermore, in modern manufacturing, 3D models often exist before the real objects are created, so requiring 3D models beforehand is no longer a disadvantage.

The reference image sampling rules of our method in the CAD software are as follows. The assemblage is placed at the center of an icosahedron and sampled from each vertex of the triangles with a virtual camera (see Figure 2). To obtain all possible topology images of the assemblage, each triangle of the icosahedron is divided into four equilateral triangles, and this subdivision is iterated several times. In our experiment, for a good trade-off, we stop at 91 vertices on the upper hemisphere; as a result, two adjacent vertices are less than 20° apart. The virtual camera aims at the center throughout the sampling process, covering a pose range of 0∼360° around the assemblage and 0∼90° tilt rotation. To overcome the lack of rotation invariance of the LINE-MOD algorithm (Hinterstoisser et al., 2012a), four in-plane rotation angles between –45° and 45° are added; therefore, 455 views are obtained.
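To make the subdivision rule concrete, the following minimal C++ sketch (our illustration, not the authors' implementation) splits a spherical triangle into four by projecting its edge midpoints back onto the unit sphere; the unique vertices of the refined upper hemisphere serve as virtual-camera positions aimed at the assemblage center. The 20 initial icosahedron faces are assumed to be supplied.

```cpp
// Illustrative sketch of the viewpoint subdivision rule (not the authors'
// implementation). The initial 20 icosahedron faces are assumed given.
#include <array>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
using Triangle = std::array<Vec3, 3>;

static Vec3 normalized(const Vec3& v) {
    double n = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / n, v.y / n, v.z / n};
}

static Vec3 sphericalMidpoint(const Vec3& a, const Vec3& b) {
    return normalized({(a.x + b.x) / 2, (a.y + b.y) / 2, (a.z + b.z) / 2});
}

// One subdivision step: four smaller triangles on the unit sphere.
static std::vector<Triangle> subdivide(const Triangle& t) {
    Vec3 m01 = sphericalMidpoint(t[0], t[1]);
    Vec3 m12 = sphericalMidpoint(t[1], t[2]);
    Vec3 m20 = sphericalMidpoint(t[2], t[0]);
    return {Triangle{t[0], m01, m20}, Triangle{m01, t[1], m12},
            Triangle{m20, m12, t[2]}, Triangle{m01, m12, m20}};
}

// Iterate the subdivision; vertices with z >= 0 give the upper-hemisphere views.
std::vector<Triangle> refine(std::vector<Triangle> faces, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::vector<Triangle> next;
        for (const Triangle& t : faces)
            for (const Triangle& s : subdivide(t)) next.push_back(s);
        faces = next;
    }
    return faces;
}
```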

For each reference image generated by the method described above, we compute its gradient orientations, as described by Hinterstoisser et al. (2012a), and use standard ORB (Rublee et al., 2011) to extract 2D feature points and compute their 3D positions on the object model by the back-projection method (Lepetit et al., 2005). Finally, reference images labeled with their corresponding image features and poses are stored in XML format for later use.
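For the offline storage step, the sketch below shows how ORB keypoints and descriptors for one reference image could be extracted and written to an XML file with OpenCV (assumed available). The XML schema and the back-projection of keypoints to 3D used by the authors are not shown.

```cpp
// Sketch of offline ORB extraction for one reference image and storage in XML
// via OpenCV's FileStorage. Illustrative only; not the authors' exact schema.
#include <string>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>

void saveReferenceFeatures(const cv::Mat& refImage, const cv::Mat& pose,
                           const std::string& xmlPath) {
    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);         // up to 500 keypoints
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    orb->detectAndCompute(refImage, cv::noArray(), keypoints, descriptors);

    cv::FileStorage fs(xmlPath, cv::FileStorage::WRITE); // ".xml" -> XML output
    fs << "pose" << pose;                                 // camera pose of this view
    fs << "keypoints" << keypoints;
    fs << "descriptors" << descriptors;
    fs.release();
}
```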

3.2 Camera-pose calculation

In the online execution stage, image features are extracted and matched with those generated offline for camera pose estimation. To realize fast and accurate pose estimation, a coarse-to-fine tracking strategy is proposed. First, we use a modified template matching method LINE-MOD to obtain the reference image that is similar to the current viewpoint and then an accurate pose refinement method using a point tracker is presented.

3.2.1 Camera-pose rough estimation

The idea of this section is that the key frame and its corresponding rough camera pose can be obtained through template matching. In our research, we base our work on the LINE-MOD algorithm, which has good precision recall performance and computes feature vectors very fast. The LINE-MOD algorithm proposes a similarity measure that, for each gradient orientation on the object, searches in a neighborhood of the associated gradient location for the most similar orientation in the test image. It can be summarized in equation (1):

(1) $\varepsilon(I,T,c)=\sum_{r\in P}\Bigl(\max_{t\in R(c+r)}\bigl|\cos\bigl(\mathrm{ori}(O,r)-\mathrm{ori}(I,t)\bigr)\bigr|\Bigr)$
where ori(O, r) is the gradient orientation in radians at location r in a reference image O of an object to detect, and ori(I, t) is the gradient orientation at location t in the input image I. $R(c+r)=\bigl[c+r-\tfrac{\tau}{2},\,c+r+\tfrac{\tau}{2}\bigr]\times\bigl[c+r-\tfrac{\tau}{2},\,c+r+\tfrac{\tau}{2}\bigr]$ defines the neighborhood of size τ centered on location c shifted by r in the input image I. This setup makes the method robust to small shifts and deformations. P denotes the list of locations r to be considered in O; therefore, a template T is defined as a pair T = (O, P).

LINE-MOD uses a gradient feature for matching. At each location of a given image, LINE-MOD computes the gradient orientation of each color channel and picks the orientation with the greatest gradient magnitude. Then, the algorithm quantizes the gradient orientations into eight equally spaced bins so that each gradient orientation can be represented in 8 bits. Moreover, LINE-MOD speeds up the similarity calculation by spreading gradient orientations and pre-computing response maps.
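As a concrete illustration of this quantization step (our own sketch, not the LINE-MOD reference code), the function below maps a gradient to one of eight equally spaced orientation bins over [0°, 180°) and encodes the bin as a single set bit, so one byte represents one quantized orientation:

```cpp
// Illustrative quantization of a gradient orientation into one of eight
// equally spaced bins over [0, 180) degrees, encoded as a one-hot byte.
#include <cmath>
#include <cstdint>

uint8_t quantizeOrientation(float gx, float gy) {
    // Gradient direction; the sign is discarded because only orientation matters.
    float angle = std::atan2(gy, gx) * 180.0f / 3.14159265f; // (-180, 180]
    if (angle < 0.0f) angle += 180.0f;                        // fold to [0, 180)
    if (angle >= 180.0f) angle -= 180.0f;
    int bin = static_cast<int>(angle / 22.5f) & 7;            // 8 bins of 22.5 degrees
    return static_cast<uint8_t>(1u << bin);                   // one bit per bin
}
```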

In the execution of a template matching process, a LINE-MOD model first learns a list of templates for each object. Then, the algorithm searches the input image with a sliding window; a window is considered to contain a matching object if the similarity score between its gradient feature matrix and that of a template is above a certain threshold. The preliminary bounding box is then determined using the center of the sliding window and the bounding box of the template. At the end of this search, non-maximum suppression is performed on all potential bounding boxes with an overlap threshold of 0.5. The remaining bounding boxes are the locations of the predicted objects.
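The non-maximum suppression step can be sketched generically as follows, using the 0.5 overlap threshold mentioned above; the detection structure and scoring are our assumptions, not the authors' code.

```cpp
// Generic non-maximum suppression: keep the highest-scoring box and discard
// other candidates whose intersection-over-union with it exceeds 0.5.
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>

struct Detection { cv::Rect box; float score; };

std::vector<Detection> nonMaxSuppression(std::vector<Detection> dets,
                                         float iouThresh = 0.5f) {
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool suppressed = false;
        for (const Detection& k : kept) {
            float inter = static_cast<float>((d.box & k.box).area());
            float uni = static_cast<float>(d.box.area() + k.box.area()) - inter;
            if (uni > 0.0f && inter / uni > iouThresh) { suppressed = true; break; }
        }
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}
```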

However, the problem with the original LINE-MOD method is that although it is invariant to small distortions of images, it is not inherently scale invariant and only works over a 1∼2 scale range. For instance, if the object in the training image is fairly far away from the camera, while in a testing image it is much closer to the camera, then even if the object in the testing image has the same orientation as that in the training image, its gradient features around a certain location will be far more spread out than those in the training image. This causes difficulties for the template-matching problem. Our goal for the following modification is to make LINE-MOD scale invariant by using depth information obtained from the Softkinetic sensor. The similarity measure equation is as follows:

(2) $\varepsilon(I,T,c_I)=\sum_{(c_o+r')\in P}\Bigl(\max_{t\in R(S_I(c_I,r'))}\bigl|\cos\bigl(\mathrm{ori}(O,S_o(c_o,r'))-\mathrm{ori}(I,t)\bigr)\bigr|\Bigr),\qquad S_x(c_x,y)=c_x+\frac{D(c_x)}{D(c_o)}\,y$
where $\mathrm{ori}(O,S_o(c_o,r'))$ is the gradient orientation in radians at location $S_o(c_o,r')$ in a reference image O of an object to detect, and ori(I, t) is the gradient orientation at location t in the input image I. $c_I$ denotes the shift from the center point of the matching object to the origin of the input image I, and $c_o$ denotes the shift from the center point of the object to be matched to the origin of the reference image O. $r'$ denotes the shift from the patch center, and $D(c_x)$ is the depth value at location $c_x$, which can be acquired from the Softkinetic depth sensor.

In equation (2), we incorporate depth information obtained from the Softkinetic sensor and scale the computed gradient feature matrix using this depth data. As objects are trained on a half-dome with fixed positions, the distance between the objects and the camera in the training stage is known. Therefore, according to the depth information, the scale ratio of the template can be determined (Figure 3). We then store the scaled gradient feature matrices in each template and perform template matching with the input image. During this process, a sliding window is slid over the input image for the similarity calculation between the target object and the template. In the sliding window, we traverse all locations in the location list P and sum up the similarity at each location. When the similarity value reaches its maximum, the corresponding reference image is identified as the key frame. In this way, our method enlarges the working range (scale space) of the LINE-MOD algorithm.
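The depth-driven scaling of the template's gradient feature locations can be illustrated as follows. This is our reading of the idea under a pinhole-camera assumption (image offsets scale inversely with depth); see equation (2) for the paper's own formulation. The data structures and function names are illustrative, not the authors' code.

```cpp
// Illustrative sketch of scaling a template's gradient-feature offsets by a
// depth-derived ratio before matching. Not the authors' implementation.
#include <cstdint>
#include <vector>
#include <opencv2/core.hpp>

struct TemplateFeature {
    cv::Point2f offset;   // location relative to the template center
    uint8_t orientation;  // quantized gradient orientation (one-hot byte)
};

std::vector<TemplateFeature> scaleTemplate(const std::vector<TemplateFeature>& features,
                                           float trainingDepth, float observedDepth) {
    // Under a pinhole model, image offsets grow as the object comes closer
    // (smaller observed depth), so the stored offsets are spread accordingly.
    float scale = trainingDepth / observedDepth;
    std::vector<TemplateFeature> scaled;
    scaled.reserve(features.size());
    for (const TemplateFeature& f : features)
        scaled.push_back({f.offset * scale, f.orientation});
    return scaled;
}
```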

3.2.2 Camera-pose refinement

After template matching, the key frame similar to the current input image is obtained. To refine the camera pose further, a feature-based tracking method is incorporated into our algorithm. The feature descriptors of the input image are matched with that of the key frame. As a result, the corresponding 2D/3D point set is obtained, which provides the necessary initialization data for precise camera position computation.

Owing to the lack of adequate textures, point feature matching methods such as those found in a few studies (Lowe, 2004; Mikolajczyk and Schmid, 2002) often fail. Therefore, corner features on the assemblage are used in this paper. ORB is an excellent descriptor that fuses a FAST corner detector and a BRIEF descriptor with many modifications to enhance performance; readers can refer to Rublee et al. (2011) for details. The corresponding 2D/3D points are obtained by matching 2D feature points between the input image and the key frame. Correct point matching is the essential prerequisite for pose estimation.
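A minimal sketch of this 2D/3D correspondence step using OpenCV's brute-force Hamming matcher is shown below. The ratio test and its 0.8 threshold are our own additions for illustration; the paper's pipeline instead relies on RANSAC and the orientation-vector rule described next.

```cpp
// Sketch of matching run-time ORB descriptors against the key frame's stored
// descriptors and collecting 2D/3D correspondences for the pose solver.
// keyPoints3d is assumed to be aligned row-by-row with keyDesc.
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>

void matchToKeyFrame(const cv::Mat& liveDesc, const std::vector<cv::KeyPoint>& liveKps,
                     const cv::Mat& keyDesc, const std::vector<cv::Point3f>& keyPoints3d,
                     std::vector<cv::Point2f>& image2d, std::vector<cv::Point3f>& model3d) {
    cv::BFMatcher matcher(cv::NORM_HAMMING);          // Hamming distance for ORB
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(liveDesc, keyDesc, knn, 2);
    for (const auto& m : knn) {
        // Lowe-style ratio test to discard ambiguous matches (threshold assumed).
        if (m.size() == 2 && m[0].distance < 0.8f * m[1].distance) {
            image2d.push_back(liveKps[m[0].queryIdx].pt);
            model3d.push_back(keyPoints3d[m[0].trainIdx]);
        }
    }
}
```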

However, in practical use, mismatches occur between some point pairs. To eliminate mismatches from noisy data and find the correct matches for camera-pose calculation, a hypothesis-testing strategy, random sample consensus (RANSAC), is first used for preliminary screening, and then the orientation vectors of the matching-point pairs are used to further eliminate outliers. The orientation vector outlier rejection method proceeds as follows.

First, according to the characteristics of mechanical components, a density-based spatial clustering method is used to cluster the matching-point pairs. Then, for each cluster of matching points, the following steps are conducted.

M is a matrix composed of the unit orientation vectors of the matching-point pairs:

(3) $M=\begin{bmatrix} m_1 & m_2 & m_3 & \cdots & m_n \end{bmatrix}^{T}$

The similarity between the direction vectors $m_i$ and $m_j$ is measured using the inner product:

(4) $d_{ij}=\langle m_i, m_j\rangle$

The similarity between $m_i$ (1 × 2) and all elements of $M^{T}$ (2 × n) is defined as follows:

(5) $G=\frac{m_i M^{T}}{n-1}$

All the similarity results in G are sorted in descending order; if an element of G is smaller than a given threshold δ (in practice we use 0.985), the corresponding matching-point pair is treated as an outlier.
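Our reading of this rejection rule is sketched below (illustrative C++, not the authors' code): each match contributes a unit vector from its key-frame location to its live-image location, and a match whose average inner product with the other vectors in its cluster falls below δ = 0.985 is discarded. The density-based clustering is assumed to have been applied beforehand, so the function operates on a single cluster.

```cpp
// Orientation-vector outlier rejection for one cluster of matching-point pairs.
// Illustrative sketch only; the unit-vector definition is our interpretation.
#include <cmath>
#include <vector>
#include <opencv2/core.hpp>

std::vector<bool> rejectByOrientation(const std::vector<cv::Point2f>& keyPts,
                                      const std::vector<cv::Point2f>& livePts,
                                      float delta = 0.985f) {
    const size_t n = keyPts.size();
    std::vector<cv::Point2f> m(n);                    // unit orientation vectors
    for (size_t i = 0; i < n; ++i) {
        cv::Point2f d = livePts[i] - keyPts[i];
        float len = std::sqrt(d.x * d.x + d.y * d.y);
        m[i] = (len > 1e-6f) ? d * (1.0f / len) : cv::Point2f(0.f, 0.f);
    }
    std::vector<bool> inlier(n, true);
    if (n < 2) return inlier;
    for (size_t i = 0; i < n; ++i) {
        float sum = 0.f;
        for (size_t j = 0; j < n; ++j)
            if (j != i) sum += m[i].dot(m[j]);         // inner product d_ij
        if (sum / static_cast<float>(n - 1) < delta)   // average similarity G_i
            inlier[i] = false;
    }
    return inlier;
}
```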

In this paper, the point pairs obtained from our outlier rejection method are compared with those obtained using the RANSAC method. The results can be seen in Figure 4. There are many mismatched points in Figure 4(a). After using the RANSAC algorithm, the mismatched points are reduced significantly but still exist [Figure 4(b)]. Figure 4(c) shows our method, which uses RANSAC and the orientation vector to remove the outliers. It can be seen that one additional mismatched point pair is removed, which makes the point-pair matching process more accurate. Moreover, the time spent on the orientation vector outlier removal process is less than 2 ms.

After the above procedure, the correct matching-point pairs are obtained. Then, a perspective-n-point (PnP) algorithm is used to calculate the accurate camera pose. However, a traditional iterative PnP algorithm such as the Lu–Hager–Mjolsness (LHM) algorithm (Lu et al., 2000) has very high computational complexity, which greatly reduces the efficiency of the algorithm. Moreover, the accuracy of iterative methods degrades significantly when no redundant reference points (n ≤ 5) are available. To improve the efficiency, accuracy and robustness of our tracking method, a non-iterative PnP solution, the robust perspective-n-point (RPnP) algorithm (Li et al., 2012), is used in our system for camera pose calculation; it runs at much lower computational cost and achieves more accurate results than the iterative algorithms when no redundant reference points are available (n ≤ 5).
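The final pose computation from the filtered 2D/3D correspondences can be sketched with OpenCV as below. RPnP is not shipped with OpenCV, so the non-iterative EPnP solver is used here as a stand-in, and the calibrated intrinsic matrix K is assumed to be known; this is an illustration rather than the system's actual solver.

```cpp
// Sketch of camera pose estimation from 2D/3D correspondences with OpenCV.
// EPnP substitutes for RPnP (Li et al., 2012), which OpenCV does not provide.
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>

bool estimatePose(const std::vector<cv::Point3f>& model3d,
                  const std::vector<cv::Point2f>& image2d,
                  const cv::Mat& K, const cv::Mat& distCoeffs,
                  cv::Mat& rvec, cv::Mat& tvec) {
    if (model3d.size() < 4) return false;              // PnP needs at least 4 points
    return cv::solvePnP(model3d, image2d, K, distCoeffs, rvec, tvec,
                        false, cv::SOLVEPNP_EPNP);     // non-iterative solver
}
```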

4. Experimental results

4.1 Experimental setup

Our experiments were implemented on a standard notebook with an Intel(R) Core(TM) i3-4130 CPU at 3.4 GHz and 4 GB of RAM. A Softkinetic DS325 depth sensor was used, which has a color sensor (resolution: 720p) and a depth sensor (resolution: 320 × 240). In our experiments, the image size was set to 640 × 480. The system setup can be seen in Figure 5. The software was developed based on the game engine Unity3D, and virtual elements such as 3D models were generated with the 3D Studio Max modeling software. The tracking algorithm was written in C++ and imported into Unity3D as a dynamic link library (DLL).

4.2 Assembly assistance experiments

In this section, we describe a precision machine tool assembly assistance system developed on the basis of our marker-less tracking method. The first-person view of the machine tool assembly-guiding task can be seen in Figure 6. When the “Start” button was touched, the camera pose was calculated using our marker-less tracking method, while a 3D model of the part to be assembled was loaded from the remote server and superimposed on the real assembly scene to indicate where and how to assemble the part. The manipulator could then follow this guiding information and assemble the part in the correct position. After completing this step, the manipulator triggered the “Next” button for the following assembly assistance information. If the manipulator wanted to see the information again, pressing the “Prev” button generated an animation replay. During this assembly process, our method could still track automatically when the part of interest changed its appearance [Figure 6(c)].

4.3 Evaluation of the algorithm

4.3.1 Scale-invariance evaluation

To evaluate the impact of the depth factor on the scale invariance of our tracking method, a set of comparison tests was conducted, and the results are presented in Figure 7. Figure 7(a) to (d) represents tracking with our method, with the adaptation of the depth factor described in Section 3.2.1. Figure 7(e) and (f) represents tracking without this adaptation. As can be seen from Figure 7(a) to (d), the scale of the object in the frame changes, but good tracking performance is still achieved. This is mainly because the LINE-MOD algorithm was adapted into a scale-invariant descriptor with depth information and its working range was expanded, which helps to quickly obtain the key frame most similar to the current input frame through template matching; thereby, the coarse camera pose can be estimated. After that, the scale-invariant ORB descriptor helps to calculate the accurate camera pose based on the key frame. However, when the working range of the original LINE-MOD algorithm (1∼2 scale) is exceeded, the coarse camera pose estimation cannot be completed, so the tracking process fails [Figure 7(e) and (f)].

4.3.2 Tracking accuracy evaluation

To evaluate the tracking accuracy, the most commonly used marker system, ARToolKit, was adopted for comparison with our method. It was reported by Abawi et al. (2004) at the International Symposium on Mixed and Augmented Reality (ISMAR) that ARToolKit showed high tracking accuracy for rotations of 30∼80° and distances of 20∼90 cm. Therefore, we conducted the comparative tests within those ranges. In the experiments, the camera was mounted on a high-precision mechanical pan-tilt unit and adjusted so that the center of the lens and the center of the marker were at the same height, as was previously done (Abawi et al., 2004). We adopted the “one-factor-at-a-time” approach. Here, the distance of the camera to the marker was changed from 20 up to 90 cm in steps of 10 cm, while the other input factor, the rotation angle, was maintained at 35°. In another set of experiments, the angle of the camera was changed only around the y-axis in the range from 30° up to 80° in steps of 10°, while the Z distance was maintained at 30 cm. The experimental results can be seen in Figure 8. As can be seen in Figure 8(a) and (b), the overall translation and rotation errors of both our method and ARToolKit increased with increasing Z distance, but ARToolKit’s overall errors were greater than ours. From Figure 8(c) and (d), we can see that the overall translation and rotation errors of ARToolKit increase with the rotation angle, whereas the translation and rotation errors of our method remain essentially unchanged as the rotation angle increases. From these four figures, we conclude that our tracking method has an advantage over ARToolKit in the accuracy of position and pose tracking.

To prove the effectiveness of our proposed outlier point-pair rejection method, a supplementary experiment was also conducted. Method 1 represents our method without the orientation vector rejection process. Figure 8 indicates that the orientation vector outlier rejection process plays a positive role in camera position and pose tracking accuracy.

4.3.3 Time analysis

Table I shows the computation time of our proposed method. From Table I, we can see that the template matching process to find the key frame needs 41 ms. After the key frame is obtained, the tracker can run at 30 fps until the number of matching points falls below a threshold (in practice we use five matching points). Even when the template matching process is included, our method meets the real-time requirement (≥15 fps) for an AR system.

5. Conclusions and future work

Mechanical assembly using AR technology has been widely reported in the literature, but most systems are based on artificial markers, which suffer not only from visual pollution but also from operational complexity. In this paper, we proposed a coarse-to-fine marker-less tracking strategy. First, we adapted the LINE-MOD algorithm into a scale-invariant descriptor to find a key frame labeled with its corresponding pose; then, a feature-based tracking method using an ORB descriptor was exploited to refine the pose. To eliminate mismatched points in feature-based tracking, an orientation vector outlier rejection method was presented, which could remove additional mismatched outlier points after the RANSAC algorithm at little computational cost. An AR assembly assistance system was developed with our proposed marker-less tracking approach. The assembly results showed that our method automatically adapted to the tracking requirements when the part of interest “changes” its appearance. In addition, the tracking accuracy and real-time performance of the algorithm were investigated. The results showed that our tracking method could run at 30 fps and that its camera position and pose tracking accuracy was slightly superior to that of ARToolKit. Because of the working distance and field-of-view range of the Softkinetic depth sensor, the working scope and user experience of our approach are limited. In our future research, we will dedicate our efforts to solving the scale-invariance problem of LINE-MOD with an image processing approach. Furthermore, 3D feature descriptors will be explored to improve the robustness of our method.

Figures

Figure 1 Automatic AR assembly assistance overview

Figure 2 Off-line sampling sketch map

Figure 3 Template matching process using the improved LINE-MOD algorithm

Figure 4 Point pairs matching result with (a) original match, (b) RANSAC and (c) RANSAC + orientation vector

Figure 5 The system setup

Figure 6 Precision machine tool maintenance guiding experiment

Figure 7 Scale-invariance evaluation: (a)-(d) modified LINE-MOD (with depth factor) + ORB; (e) and (f) LINE-MOD (without depth factor) + ORB

Figure 8 Tracking accuracy of ARToolKit and our approach (Method 1 represents our method without the orientation vector outlier rejection process)

Table I The computation time of our method

Actions                                   Our method (ms)
LINE-MOD template matching                41
ORB feature extraction                    10
ORB feature matching                      7.35
Pose estimation                           7
Total time (matching points N < σ)        65.35
Total time (matching points N ≥ σ)        24.35

References

Abawi, D.F., Bienwald, J. and Dörner, R. (2004), “Accuracy in optical tracking with fiducial markers: an accuracy function for ARToolKit”, Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, Arlington, pp. 260-261.

Alvarez, H., Aguinaga, I. and Borro, D. (2011), “Providing guidance for maintenance operations using automatic markerless augmented reality system”, Virtual Reality, Vol. 29 No. 1, pp. 181-190.

Azuma, R.T. (1997), “A survey of augmented reality”, Presence: Teleoperators and Virtual Environments, Vol. 6 No. 4, pp. 355-385.

Barnes, C.J., Jared, G.E.M. and Swift, K.G. (2004), “Decision support for sequence generation in an assembly oriented design environment”, Robotics and Computer-Integrated Manufacturing, Vol. 20 No. 4, pp. 289-300.

Bay, H., Tuytelaars, T. and Van Gool, L. (2006), “SURF: speeded up robust features”, European Conference on Computer Vision (ECCV), Graz, pp. 404-417.

Caruso, D., Engel, J. and Cremers, D. (2015), “Large-scale direct slam for omnidirectional cameras”, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, pp. 141-148.

Chen, Z. and Li, X. (2010), “Markless tracking based on natural feature for augmented reality”, International Conference on Educational and Information Technology (ICEIT), Chongqing, pp. V2-126-V2-129.

Crivellaro, A., Rad, M., Verdie, Y., Yi, K.M., Fua, P. and Lepetit, V. (2015), “A novel representation of parts for accurate 3d object detection and tracking in monocular images”, Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, pp. 4391-4399.

Damen, D., Bunnun, P., Calway, A. and Mayol-Cuevas, W. (2012), “Real-time learning and detection of 3D texture-less objects: a scalable approach”, The British Machine Vision Conference (BMVC), University of Surrey, Guildford, pp. 1-12.

Engel, J., Schöps, T. and Cremers, D. (2014), “LSD-SLAM: large-scale direct monocular SLAM”, European Conference on Computer Vision (ECCV), Zürich, pp. 834-849.

Garrett, T., Debernardis, S., Radkowski, R. and Oliver, J.H. (2014), “Rigid object tracking algorithms for low-cost AR devices”, ASME 2014 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Buffalo, p. V01BT02A043.

Gordon, I. and Lowe, D.G. (2006), “What and where: 3D object recognition with accurate pose”, Lecture Notes in Computer Science, Vol. 4170, pp. 67-82.

Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P., Navab, N. and Fua, P. (2012a), “Gradient response maps for real-time detection of textureless objects”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34 No. 5, pp. 876-888.

Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K. and Navab, N. (2012b), “Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes”, Asian Conference on Computer Vision (ACCV), Daejeon, pp. 548-562.

Kalal, Z., Matas, J. and Mikolajczyk, K. (2010), “Pn learning: bootstrapping binary classifiers by structural constraints”, 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, pp. 49-56.

Kim, S., Kweon, I. and Kim, I. (2003), “Robust model-based 3d object recognition by combining feature matching with tracking”, IEEE International Conference on Robotics and Automation (ICRA), Taipei, Vol. 2, pp. 2123-2128.

Lee, T. and Hollerer, T. (2009), “Multithreaded hybrid feature tracking for markerless augmented reality”, IEEE Transactions on Visualization and Computer Graphics, Vol. 15 No. 3, pp. 355-368.

Lepetit, V., Lagger, P. and Fua, P. (2005), “Randomized trees for real-time keypoint recognition”, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, Vol. 2, pp. 775-781.

Leutenegger, S., Chli, M. and Siegwart, R.Y. (2011), “BRISK: Binary robust invariant scalable keypoints”, 2011 International Conference on Computer Vision (ICCV), Barcelona, pp. 2548-2555.

Li, S., Xu, C. and Xie, M. (2012), “A robust O (n) solution to the perspective-n-point problem”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34 No. 7, pp. 1444-1450.

Lowe, D.G. (2004), “Distinctive image features from scale-invariant keypoints”, International Journal of Computer Vision, Vol. 60 No. 2, pp. 91-110.

Lu, C.P., Hager, G.D. and Mjolsness, E. (2000), “Fast and globally convergent pose estimation from video images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22 No. 6, pp. 610-622.

Mikolajczyk, K. and Schmid, C. (2002), “An affine invariant interest point detector”, European Conference on Computer Vision (ECCV), Copenhagen, pp. 128-142.

Mottaghi, R., Xiang, Y. and Savarese, S. (2015), “A coarse-to-fine model for 3D pose estimation and sub-category recognition”, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 418-426.

Mur-Artal, R. and Tardós, J.D. (2014), “ORB-SLAM: tracking and mapping recognizable features”, Workshop on Multi VIew Geometry in RObotics (MVIGRO), Wheeler Hall at UC Berkeley.

Ng, L.X., Wang, Z.B., Ong, S.K. and Nee, A.Y.C. (2013), “Integrated product design and assembly planning in an augmented reality environment”, Assembly Automation, Vol. 33 No. 4, pp. 345-359.

Pauwels, K., Rubio, L., Diaz, J. and Ros, E. (2013), “Real-time model-based rigid object pose estimation and tracking combining dense and sparse visual cues”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Oregon Convention Center in Portland, Vol. 9 No. 4, pp. 2347-2354.

Payet, N. and Todorovic, S. (2011), “From contours to 3d object detection and pose estimation”, 2011 International Conference on Computer Vision (ICCV), Barcelona, pp. 983-990.

Pressigout, M. and Marchand, E. (2006), “Real-time 3d model-based tracking: combining edge and texture information”, Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA), Florida, pp. 2726-2731.

Prisacariu, V.A. and Reid, I.D. (2012), “Pwp3d: real-time segmentation and tracking of 3d objects”, International Journal of Computer Vision, Vol. 98 No. 3, pp. 335-354.

Radkowski, R. (2015), “A point cloud-based method for object alignment verification for augmented reality applications”, ASME 2015 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Boston, p. V01BT02A059.

Rublee, E., Rabaud, V., Konolige, K. and Bradski, G. (2011), “ORB: an efficient alternative to SIFT or SURF”, 2011 International Conference on Computer Vision (ICCV), Barcelona, pp. 2564-2571.

Tombari, F., Franchi, A. and Di Stefano, L. (2013), “Bold features to detect texture-less objects”, Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, pp. 1265-1272.

Uchiyama, H. and Marchand, E. (2011), “Deformable random dot markers”, 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Basel, pp. 237-238.

Vacchetti, L., Lepetit, V. and Fua, P. (2004), “Stable real-time 3d tracking using online and offline information”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26 No. 10, pp. 1385-1391.

Wang, X., Ong, S.K. and Nee, A.Y.C. (2016a), “Multi-modal augmented-reality assembly guidance based on bare-hand interface”, Advanced Engineering Informatics, Vol. 30 No. 3, pp. 406-421.

Wang, X., Ong, S.K. and Nee, A.Y.C. (2016b), “Real-virtual components interaction for assembly simulation and planning”, Robotics and Computer-Integrated Manufacturing, Vol. 41, pp. 102-114.

Wang, X., Kotranza, A., Quarles, J. and Lok, B. (2005), “A pipeline for rapidly incorporating real objects into a mixed environment”, IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 170-173.

Wang, Z.B., Ng, L.X., Ong, S.K. and Nee, A.Y.C. (2013), “Assembly planning and evaluation in an augmented reality environment”, International Journal of Production Research, Vol. 51 Nos 23/24, pp. 7388-7404.

Wang, G., Wang, B., Zhong, F., Qin, X. and Chen, B. (2015), “Global optimal searching for textureless 3D object tracking”, The Visual Computer, Vol. 31 Nos 6/8, pp. 979-988.

Westerfield, G., Mitrovic, A. and Billinghurst, M. (2015), “Intelligent augmented reality training for motherboard assembly”, International Journal of Artificial Intelligence in Education, Vol. 25 No. 1, pp. 157-172.

You, S. and Neumann, U. (2001), “Fusion of vision and gyro tracking for robust augmented reality registration”, IEEE Proceedings of Virtual Reality, IEEE, pp. 71-78.

Zhu, J., Ong, S.K. and Nee, A.Y.C. (2013), “An authorable context-aware augmented reality system to assist the maintenance technicians”, The International Journal of Advanced Manufacturing Technology, Vol. 66 Nos 9/12, pp. 1699-1714.

Further reading

Baillot, Y., Davis, L. and Rolland, J. (2001), “A survey of tracking technology for virtual environments”, Fundamentals of Wearable Computers and Augmented Reality, pp. 67-112.

Engel, J., Stückler, J. and Cremers, D. (2015), “Large-scale direct SLAM with stereo cameras”, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, pp. 1935-1942.

Acknowledgements

This work was supported by “The Fundamental Research Funds for the Central Universities” of China (3102015BJ(II)MYZ21).

Corresponding author

Shusheng Zhang can be contacted at: zssnet@sina.cn