SEMI-AUTOMATIC IMAGE-BASED CO-REGISTRATION OF RANGE IMAGING DATA WITH DIFFERENT CHARACTERISTICS

Currently, enhanced types of active range imaging devices are available for capturing dynamic scenes. By using intensity and range images, data derived from different or the same range imaging devices can be fused. In this paper, an automatic image-based co-registration methodology is presented which uses a RANSAC-based scheme for the Efficient Perspective-n-Point (EPnP) algorithm. For evaluating the methodology, two different types of range imaging devices have been investigated, namely Microsoft Kinect and PMD [vision] CamCube 2.0. The data sets captured with the test devices have been compared to a reference device with respect to the absolute and relative accuracy. As the presented methodology can cope with different configurations concerning measurement principle, point density and range accuracy, it shows a high potential for automated data fusion for range imaging devices.


INTRODUCTION
The capturing of 3D information about the local environment is still an important topic, as this is a crucial step for a detailed description or recognition of objects within the scene. Most of the current approaches are based on the use of image or range data. By using passive imaging sensors like cameras, the respective 3D information is obtained indirectly via textured images and stereo- or multiple-view analysis with a high computational effort. These procedures are widely used, but they depend on scenes with adequate illumination conditions and opaque objects with textured surfaces. Besides, the distances between sensor and object, between the different viewpoints of an imaging sensor and, in the case of a stereo camera, between the sensors of the stereo rig should be sufficiently large in order to obtain reliable 3D information.
In contrast to the photogrammetric methods, terrestrial laser scanner (TLS) devices allow for a direct and illumination-independent measurement of 3D object surfaces (Shan & Toth, 2008; Vosselman & Maas, 2010). These scanning sensors capture a sequence of single range values on a regular spherical scan grid and thus accomplish a time-dependent spatial scanning of the local environment. Hence, the scene contents as well as the sensor platform should be static in order to reach an accurate data acquisition.
For an adequate capturing of dynamic scenes, given for instance by moving objects or a moving sensor platform, it is essential to obtain dense 3D information about the whole local environment at the same time. Recent developments show that enhanced types of active imaging sensors have started to meet these requirements. Suitable for close-range perception, these sensors allow for simultaneously capturing a range image and a co-registered intensity image while still maintaining high update rates (up to 100 releases per second). However, the non-ambiguity range of these sensors is only several meters and depends on the modulation frequency. This problem can currently only be tackled by using active imaging sensors based on different modulation frequencies (Jutzi, 2009; Jutzi, 2011). Besides, the measured intensity strongly depends on the wavelength (typically near infrared) of the laser source as well as on the surface characteristics. Various studies on range imaging focus on hardware and software developments (Lange, 2000), geometric calibration (Reulke, 2006; Kahlmann et al., 2007; Lichti, 2008) and radiometric calibration (Lichti, 2008).
Nowadays, many approaches for capturing single 3D objects are still based on the use of coded structured light. In Salvi et al. (2004), different strategies for pattern codification are summarized and compared. In general, all these strategies are based on the idea of projecting a coded light pattern on the object surface and viewing the illuminated scene. Such coded patterns allow for a simple detection of correspondences between image points and points of the projected pattern. These correspondences are required to triangulate the decoded points and thus obtain the respective 3D information. For real-time applications or dynamic scene acquisition, it is essential to avoid time-multiplexing methods, as these usually depend on the successive projection of different binary codes. Very simple patterns with inexpensive hardware requirements which are also suitable for dynamic scenes can, for example, be established via dot patterns. Using regular dot patterns for measuring surfaces of close-range objects by considering the images of several CCD cameras has been presented in Maas (1992) and offers advantages like redundancy, reliability and accuracy without the need of a priori information or human interaction. The idea of using dot patterns has further been improved and currently, new types of sensors (e.g. the Microsoft Kinect device developed by PrimeSense) use random dot patterns of projected infrared points for getting reliable and dense close-range measurements in real-time.
The new types of active imaging sensors are well-suited for dynamic close-range 3D applications, e.g. the autonomous navigation of robots, driver assistance, traffic monitoring or tracking of pedestrians for building surveillance. Therefore, it is important to further investigate the potentials arising from these sensor types.
In this paper, a method for semi-automatic image-based co-registration of point cloud data is proposed, as an accurate range measurement with a reference target for a large field-of-view is technically demanding and can be expensive. For an automatic image-based algorithm, various general problems have to be tackled, e.g. co-registration, camera calibration, image transformation to a common coordinate frame and resampling. With the range imaging devices (e.g. PMD [vision] CamCube 2.0 and Microsoft Kinect), test data is captured and compared to reference data derived by a reference device (Leica HDS6000). The general framework focuses on an image-based co-registration of the different data types, where keypoints are detected within each data set and the respective transformation parameters are estimated with a RANSAC-based approach to camera pose estimation using the Efficient Perspective-n-Point (EPnP) algorithm. Additionally, the proposed algorithm can as well be used to co-register data derived from different or the same ranging devices without adaptations. This allows for fusing range data in form of point clouds with different densities and accuracy. A typical application could be the completion and densification of sparse data with additional data in a common coordinate system. After this co-registration, the absolute and relative range accuracy of the range imaging devices are evaluated by experiments. For this purpose, the data sets captured with the test devices over a whole sequence of frames are considered and compared to the data set of a reference device (Leica HDS6000) transformed to the local coordinate frame of the test device. The results are shown and discussed for an indoor scene.
The remainder of this paper is organized as follows. In Section 2, the proposed approach for an image-based co-registration of point clouds and a final comparison of the measured data is described. The configuration of the sensors and the scene is outlined in Section 3. Subsequently, the captured data is examined in Section 4. The performance of the presented approach is tested in Section 5. Then, the derived results are evaluated and discussed in Section 6 and finally, in Section 7, the content of the entire paper is concluded and an outlook is given.

METHODOLOGY
For comparing the data captured with a range imaging device to the data captured with a laser scanner which serves as reference, the respective data must be transformed into a common coordinate frame. Therefore, the change in orientation and position, i.e. the rotation and translation parameters between the different sensors, has to be estimated. As illustrated in Figure 1, it is worth analyzing the data after the data acquisition. The laser scanner provides data with high density and high accuracy in the full range of the considered indoor scene, whereas the range imaging devices are especially suited for close-range applications. Hence, the rotation and translation parameters can be estimated via 3D-to-2D correspondences between 3D points derived from the TLS measurements and 2D image points of the respective range imaging sensor. These 3D-to-2D correspondences are derived via a semi-automatic selection of point correspondences between the intensity images of the laser scanner and the test device, and built by combining the 2D points of the test device with the respective interpolated 3D information of the laser scanner. In Moreno-Noguer et al. (2007) and Lepetit et al. (2009), the Efficient Perspective-n-Point (EPnP) algorithm has been presented as a non-iterative solution for estimating the transformation parameters based on such 3D-to-2D correspondences. As the EPnP algorithm takes all the 3D-to-2D correspondences into consideration without checking their reliability, it has furthermore been proposed to increase the quality of the registration results by introducing the RANSAC algorithm (Fischler & Bolles, 1981) for eliminating outliers and thus reaching a more robust pose estimation. Using the estimated transformation parameters, the reference data is transformed into the local coordinate frame of the test device. This part of the proposed approach is comparable to the coarse registration presented in Weinmann et al. (2011). Finally, the estimated transformation allows for comparing the captured data.
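The outlier rejection described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: for self-containment, a plain 6-point DLT estimate of the projection matrix stands in for EPnP as the solver inside the RANSAC loop, the data are synthetic, and all names (`project`, `dlt_projection`, `ransac_pose`, the 2 pixel threshold, the iteration count) are assumptions chosen for illustration.

```python
import numpy as np

def project(P, X):
    """Project Nx3 points with a 3x4 matrix P; returns Nx2 pixel coordinates."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

def dlt_projection(X, x):
    """Linear (DLT) estimate of the 3x4 projection matrix from >= 6 pairs.
    Stand-in for EPnP, which is what the paper uses at this step."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        p = [Xw, Yw, Zw, 1.0]
        rows.append(p + [0.0] * 4 + [-u * c for c in p])
        rows.append([0.0] * 4 + p + [-v * c for c in p])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 4)  # right singular vector of the smallest singular value

def ransac_pose(X, x, K, n_iter=200, thresh_px=2.0, seed=0):
    """RANSAC over 3D-to-2D correspondences; returns R, t and the inlier mask."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(X), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=6, replace=False)
        P = dlt_projection(X[idx], x[idx])
        err = np.linalg.norm(project(P, X) - x, axis=1)  # reprojection error
        inliers = err < thresh_px
        if inliers.sum() > best.sum():
            best = inliers
    P = dlt_projection(X[best], x[best])      # refit on the consensus set
    M = np.linalg.inv(K) @ P                  # proportional to [R | t]
    M /= np.cbrt(np.linalg.det(M[:, :3]))     # fixes scale and sign
    U, _, Vt = np.linalg.svd(M[:, :3])        # nearest rotation matrix
    return U @ Vt, M[:, 3], best

# Synthetic check: 40 consistent correspondences plus 15 gross outliers.
rng = np.random.default_rng(1)
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R_true = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.2, -0.1, 0.5])
X = rng.uniform([-2, -2, 4], [2, 2, 10], size=(55, 3))
x = project(K @ np.hstack([R_true, t_true[:, None]]), X)
x[40:] += 80.0  # shift the last 15 image points far from their true position

R, t, inliers = ransac_pose(X, x, K)
print(int(inliers.sum()), "of", len(X), "correspondences kept")
```

In this sketch the consensus set plays the role of the control points selected by RANSAC, and the reprojection error is the quantity thresholded to decide whether a correspondence is kept.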

CONFIGURATION
To validate the proposed methodology, a configuration concerning sensors and scene has to be utilized.

Sensors
For the experiments, two different range imaging devices were used as test devices and a terrestrial laser scanner as reference device.

Range imaging device - PMD [vision] CamCube 2.0
With a PMD [vision] CamCube 2.0, various types of data can be captured, namely the range and the intensity, where the intensity can be divided into active and passive intensity.
The measured active intensity depends on the illumination emitted by the sensor and the passive intensity on the background illumination (e.g. sun or other light sources). The data can be depicted as an image with an image size of 204 x 204 pixels. A field-of-view of 40° x 40° is specified in the manual.
Currently, the non-ambiguity range, which is sometimes called unique range, is less than 10 m and depends on the tunable modulation frequency. This range measurement restriction can be improved by image- or hardware-based unwrapping procedures in order to operate as well in far range (Jutzi, 2009; Jutzi, 2011).
For the experiments, the hardware-based unwrapping procedures were utilized, where modulation frequencies of 18 MHz and 21 MHz were selected for maximum frequency discrimination. The integration time was pushed to the maximum of 40 ms in order to gain a high signal-to-noise ratio for the measurement. In this case, saturation could appear in close range or arise from object surfaces with high reflectivity. All measurement values were captured in raw mode.
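The dependency of the non-ambiguity range on the modulation frequency can be made concrete with a short calculation. The sketch below uses the textbook relation for continuous-wave modulation, range = c / (2 f), and the common dual-frequency result that the combined range corresponds to the greatest common divisor of both frequencies. Whether the device firmware resolves the ambiguity exactly in this way is an assumption; the function name is hypothetical.

```python
from math import gcd

C = 299_792_458.0  # speed of light in m/s

def unambiguous_range(f_mod_hz):
    """Non-ambiguity range in metres for a modulation frequency in Hz."""
    return C / (2.0 * f_mod_hz)

f1, f2 = 18_000_000, 21_000_000  # modulation frequencies stated in the text

r1 = unambiguous_range(f1)
r2 = unambiguous_range(f2)
# Assumption: with two frequencies, the usable range is that of their
# greatest common divisor (3 MHz for 18 MHz and 21 MHz).
r_combined = unambiguous_range(gcd(f1, f2))

print(f"18 MHz: {r1:.2f} m, 21 MHz: {r2:.2f} m, combined: {r_combined:.2f} m")
```

Under these assumptions, each single frequency stays below the 10 m stated above, while the combination extends the range well beyond the depth of the observed scene.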

Range imaging device - Microsoft Kinect
The Microsoft Kinect device is a game console add-on which captures disparity and RGB images with a frame rate of 30 Hz. Originally, the disparity images are used to track full body skeleton poses of several players in order to control the game

Reference device - Leica HDS6000
The Leica HDS6000 is a standard phase-based terrestrial laser scanner with survey-grade accuracy (within mm range) and a field-of-view of 360° x 155°, and the full captured image size is 2500 x 1076 pixels. Hence, the angular resolution is approximately 0.14°.

Scene
A data set of a static indoor scene was recorded with the stationary sensors mentioned above. In Figure 3, a photo of the observed scene is depicted. For the environment, no reference data concerning the radiometry or geometry was available. Hence, the scene is more suited for investigating the quality of the test devices at different levels of distance, even beyond the sensor specifications, where it will be seen that the captured information might eventually still be suitable.
Figure 3. RGB image of the observed indoor scene.

DATA EXAMINATION
In this section, the semi-automatic feature extraction by an operator, the transformation of the data into a common coordinate system and finally, the resampling of the data into a proper grid are described.

Semi-automatic feature extraction
For an efficient registration process, it has proved to be suitable to establish pairs of points, each consisting of a 3D point representing information derived from the reference data and a 2D point representing the image coordinates measured in the image information of the test device (Weinmann et al., 2011).
Based on these 3D-to-2D correspondences, the co-registration can be carried out via the Efficient Perspective-n-Point (EPnP) algorithm which has recently been presented as a fast and accurate approach to pose estimation.
Hence, the image coordinates of the control points have been measured manually and with sub-pixel accuracy in the passive intensity image of the test devices, which has been undistorted and mapped to the depth image, as well as in the image of the reference device. Subsequently, the corresponding 3D object coordinates have been determined based on the reference data by interpolation, as the measured 3D information is only available on a regular grid.
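The interpolation step can be sketched as follows. The text does not state which interpolation scheme was applied, so bilinear interpolation is assumed here purely for illustration, and the function name and the toy grid are hypothetical.

```python
def bilinear_interpolate(grid, row, col):
    """Interpolate a 3D point at a fractional (row, col) position.

    grid[r][c] holds the (X, Y, Z) coordinates measured at grid cell (r, c).
    Assumes 0 <= row <= len(grid) - 1 and 0 <= col <= len(grid[0]) - 1.
    """
    r0, c0 = int(row), int(col)
    r1 = min(r0 + 1, len(grid) - 1)
    c1 = min(c0 + 1, len(grid[0]) - 1)
    fr, fc = row - r0, col - c0

    def lerp(a, b, w):
        # Linear blend of two 3D points with weight w on b.
        return tuple((1.0 - w) * ai + w * bi for ai, bi in zip(a, b))

    top = lerp(grid[r0][c0], grid[r0][c1], fc)
    bottom = lerp(grid[r1][c0], grid[r1][c1], fc)
    return lerp(top, bottom, fr)

# Toy 2 x 2 grid: three points on a surface at Z = 5 m, one at Z = 9 m.
grid = [
    [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)],
    [(0.0, 1.0, 5.0), (1.0, 1.0, 9.0)],
]
point = bilinear_interpolate(grid, 0.5, 0.5)
print(point)
```

In the toy grid, the interpolated depth falls between the two surfaces, which illustrates why a 3D value interpolated at a depth discontinuity may not correspond to any real surface point.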
The proposed approach consisting of EPnP and RANSAC has been used to estimate the exterior orientation of both test devices in relation to the reference data. Table 1 shows the resulting reprojection errors, the number of all determined control points and the number of control points selected by the RANSAC algorithm. The low percentage of utilized control points is caused less by a low quality of the manual 2D measurement than by the range information itself. As distinctive 2D control points, which are located at corners or blobs, are selected first, the respective interpolated 3D information may change abruptly and thus not always be reliable.

Converting range to depth images
Once the transformation parameters between reference and test device are estimated, it is possible to check how 3D points measured with the reference device are projected onto the image plane of a virtual camera with the same intrinsic parameters as the test device. Using homogeneous coordinates, this transformation can be formulated as

$\mathbf{x} = \mathbf{K} \, [\mathbf{R} \mid \mathbf{t}] \, \mathbf{X}$

where $\mathbf{X}$ is a 3D point and $\mathbf{x}$ its image point in homogeneous coordinates, $\mathbf{K}$ is the calibration matrix of the virtual camera, $\mathbf{R}$ the estimated rotation matrix and $\mathbf{t}$ the estimated translation vector. If a pixel in this virtual camera image is assigned more than one of the 3D points, the mean values of the respective points are used. Resulting from this, resampled synthetic depth images can be created, which are shown in Figure 4 for using the same calibration matrices as those of the two test devices. The absolute accuracy is given by the depth difference, which is calculated as the difference between the reference depth $z_{\mathrm{Ref}}$ derived from the reference device and the mean value $\bar{z}$ derived from at least 100 single measurements captured by the investigated range imaging device over a time sequence.
Then, the relative accuracy is given by the standard deviation of the depth difference $\sigma_z$.
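The creation of a synthetic depth image can be sketched in a few lines. The following is a minimal illustration and not the processing code used for the experiments: it assumes a calibration matrix without skew, rounds each projection to the nearest pixel, and averages the depth of all 3D points falling into the same pixel. Function names and the toy camera are hypothetical.

```python
from collections import defaultdict

def project_point(K, R, t, X):
    """Apply x = K [R | t] X to one 3D point; returns (u, v, depth) or None."""
    # Camera coordinates: Xc = R X + t
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0.0:
        return None  # point lies behind the virtual camera
    # Assumption: zero skew, so only K[0][0], K[1][1], K[0][2], K[1][2] matter.
    u = K[0][0] * xc[0] / xc[2] + K[0][2]
    v = K[1][1] * xc[1] / xc[2] + K[1][2]
    return u, v, xc[2]

def synthetic_depth(points, K, R, t, width, height):
    """Mean depth per pixel, returned as a dict {(row, col): depth}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for X in points:
        p = project_point(K, R, t, X)
        if p is None:
            continue
        col, row = int(round(p[0])), int(round(p[1]))
        if 0 <= col < width and 0 <= row < height:
            sums[(row, col)] += p[2]
            counts[(row, col)] += 1
    return {pix: sums[pix] / counts[pix] for pix in sums}

# Toy example: 5 x 5 pixel camera, identity pose.
K = [[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [
    (0.0, 0.0, 4.0),   # projects to the central pixel
    (0.0, 0.0, 6.0),   # same pixel, depths are averaged
    (0.04, 0.0, 4.0),  # one pixel to the right
    (0.0, 0.0, -1.0),  # behind the camera, skipped
]
depth = synthetic_depth(points, K, R, t, width=5, height=5)
print(depth)
```

Pixels that receive no 3D point simply remain absent from the result, which corresponds to the gaps that appear when the reference data is sampled more coarsely than the virtual camera.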

ANALYSIS RESULTS
First, more than 100 images of the static scene have been captured with both fixed devices, and these images are represented by a stack of images. Pixels for which noise effects yield fewer than 100 valid values are considered unreliable and have been masked out. The remaining reliable measurement values are utilized for further analysis. The number of reliable measurements depicted by gray values is shown in Figure 5. For the range imaging device PMD [vision] CamCube 2.0, a total number of 33835 reliable pixels (81%) meets our constraints. For the range imaging device Microsoft Kinect, the maximum raw disparity of 2047 (at 11 bits) has been masked out additionally, which yields a total number of 104478 reliable pixels (34%).
From the reliable values, the mean and the standard deviation of the depth have been calculated.
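The per-pixel evaluation can be summarized in a small sketch. It assumes that invalid measurements are marked as `None`, that the depth difference is taken as measured mean minus reference (so that negative values mean the test device measures too close), and that the sample standard deviation is used; none of these conventions is stated explicitly in the text, and the function name is hypothetical.

```python
from statistics import mean, stdev

MIN_VALID = 100  # a pixel needs at least this many valid values to be reliable

def pixel_statistics(samples, z_ref, min_valid=MIN_VALID):
    """Return (depth difference, standard deviation) for one pixel, or None.

    samples: depth values of this pixel over the image stack, None if invalid.
    z_ref:   depth of the reference device at this pixel.
    """
    valid = [z for z in samples if z is not None]
    if len(valid) < min_valid:
        return None  # masked out as unreliable
    z_mean = mean(valid)
    # Assumption: difference defined as measured mean minus reference depth.
    return z_mean - z_ref, stdev(valid)

# Reliable pixel: 100 valid values alternating between 9.8 m and 10.0 m.
reliable = pixel_statistics([9.8, 10.0] * 50, z_ref=10.0)
# Unreliable pixel: only 99 valid values in a stack of 100 images.
unreliable = pixel_statistics([10.0] * 99 + [None], z_ref=10.0)

print(reliable, unreliable)
```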

Range imaging device - PMD [vision] CamCube 2.0
In Figure 6a, the mean depth obtained with the PMD [vision] CamCube 2.0 is visualized. Unreliable measurement values, which are represented with white color, appear at the polished surfaces in the foreground, mainly on the left side where the incidence angle to the surface is steep, resulting in uncertainties (Figure 7a). Further unreliable measurement values can be observed on the dark colored and polished doors in the back of the room. These outliers occur due to the low reflectivity or specular surface characteristic, which can result in multipath measurements.
The depth values are spread over an interval from 4.16 to 24.94 m (Figure 6b).

Range imaging device - Microsoft Kinect
In Figure 8a, the mean depth obtained with the Microsoft Kinect is visualized. Obviously, the operation range has been exceeded in the selected scene. Hence, the wall at the back of the room is completely missing, because the maximum raw disparity values have been filtered out (compare Figure 5b to 8a). However, the remaining depth measurements still show varying distances to different rows of chairs, indicating the rough structure of the scene. The depth values are within an interval from 3.61 to 23.86 m (Figure 8b). This observation supports a use of this test device for densifying sparse depth measurements far beyond the sensor specification.
The size and surface direction of the object onto which the pattern is projected, together with the correlation window size, lead to limitations with respect to the spatial resolution of the depth image. For instance, there is no clear partition in depth for the more distant rows of chairs compared to the PMD [vision] CamCube 2.0, where depth stepping of rows can be resolved up to the last row.

EVALUATION AND DISCUSSION
Finally, the derived depth differences are evaluated and discussed by calculating the mean depth and the standard deviation of the depth. Figure 10 shows the depth difference between the data of the reference and test devices. The depth difference of the Microsoft Kinect is difficult to interpret, as no systematic error can be detected. Furthermore, a low point density is given at depths above 19 m, which could be interpreted as a limitation of the device. Concerning the scene contents, only the four nearest rows of chairs can be recognized within the image in Figure 10b. This is even more clearly presented within the density distribution in Figure 11b, following the vertical direction. Concerning the reliable pixels, 18322 depth difference values (17.5%) are within the interval [-3,0] m. The mean value depicted in Figure 12b shows the standard deviation of the depth difference, which could be roughly generalized. From this, it could be inferred that, for instance, at a depth of 10 m a measurement deviation of approximately 0.2 m can be expected, and at 15 m a measurement deviation of approximately 0.5 m.

CONCLUSION AND OUTLOOK
In this paper, an approach for co-registration of data captured by range imaging devices with different configurations has been proposed. This allows for evaluating the absolute and relative accuracy of the range imaging devices.
After registration, the depth difference and the standard deviation of the depth difference have been estimated for two range imaging devices, namely Microsoft Kinect and PMD [vision] CamCube 2.0.
Based on the established 3D-to-2D correspondences, the data captured with the test devices can be used to complete or densify sparse data captured with a reference device. Moreover, the point clouds captured with both devices do not necessarily have to provide the same density or accuracy. Hence, the test devices provide additional information about the local environment even beyond the sensor specifications, e.g. the different rows of chairs can still be distinguished and the rough structure of the scene can be recognized. However, in this case, the measured 3D coordinates are significantly less accurate for the Microsoft Kinect, whereas for the PMD [vision] CamCube 2.0, hardware-based unwrapping procedures using different modulation frequencies yield a measurement accuracy which approximately remains on a constant and relatively low level.
Concerning the utilized data, it can be stated that the intensity of the test data derived from the Microsoft Kinect does not always match the reference data, due to the different wavelengths of the devices. For a fully automatic approach, these different characteristics will cause the automatic detection of the point correspondences to fail.
In contrast to this, test data derived from the PMD [vision] CamCube 2.0 matches the reference data sufficiently well. First investigations show that an automatic registration between the different data types can reliably be established via keypoint detectors, e.g. by using SIFT features (Lowe, 2004). However, it has to be mentioned that this device shows limitations arising from its image size, but it can be expected that this will be improved in the near future.
Not yet investigated, another straightforward approach might be to consider the texture given by the range image instead of the above mentioned intensity image, because the geometric aspects are invariant to the utilized wavelength.However, the nearly monostatic configuration of the PMD [vision] CamCube 2.0 and the bistatic configuration of the Microsoft Kinect while capturing the data might lead to inconsistencies within the range image and this could be critical for processing.
The promising results of this paper show that the presented methodology has a high potential for automated co-registration of data captured with ranging devices which show different configurations concerning the measurement principle, point density and range accuracy.

Figure 1. Processing chain of the proposed approach.
device has an RGB camera, an IR camera and a laser-based IR projector which projects a known structured light pattern of random points onto the scene. IR camera and IR projector form a stereo pair. The pattern matching in the IR image is done directly on-board, resulting in a raw disparity image which is read out with 11 bit depth. Both RGB and disparity image have image sizes of 640 x 480 pixels. The disparity image has a constant band of 8 pixels width at the right side, which supports speculation (Konolige & Mihelich, 2010) of a correlation window width of 9 pixels used in the hardware-based matching process. For the data examination, this band has been ignored, which yields a final disparity image size of 632 x 480 pixels. Camera intrinsics, baseline and depth offset have been calibrated in order to transform the disparities to depth values and to register RGB image and depth image. The horizontal field-of-view of the RGB camera, at 63.2°, is wider than the field-of-view of the IR camera, at 56.2°. Considering the stereo baseline of 7.96 cm, known from calibration, the range is limited. The Kinect device is based on a reference design (1.08) from PrimeSense, the company that developed the system and licensed it to Microsoft. In the technical specifications of the reference design, an operation range for indoor applications from 0.8 to 3.5 m is given.
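Why a short baseline limits the range can be illustrated with the standard relation of a rectified stereo pair, z = f b / d. The sketch below is an illustration under assumptions and not the calibration model of the device: the focal length is derived from the stated field-of-view with an ideal pinhole model over 640 pixels, the disparity is taken in pixels (the raw 11 bit values of the device are offset-coded and would first have to be converted), and the disparity step is a free parameter. All function names are hypothetical.

```python
import math

def focal_length_px(fov_deg, width_px):
    """Focal length in pixels for an ideal pinhole camera."""
    return (width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

def depth_from_disparity(d_px, f_px, baseline_m):
    """Depth in metres from a disparity in pixels (rectified stereo pair)."""
    return f_px * baseline_m / d_px

def depth_step(z_m, f_px, baseline_m, disparity_step_px=1.0):
    """Approximate change in depth caused by one disparity step at depth z."""
    return z_m ** 2 / (f_px * baseline_m) * disparity_step_px

f_ir = focal_length_px(56.2, 640)  # IR camera field-of-view from the text
baseline = 0.0796                  # stereo baseline of 7.96 cm from the text

for z in (0.8, 3.5, 10.0):
    d = f_ir * baseline / z
    print(f"z = {z:4.1f} m: disparity {d:6.2f} px, "
          f"step per pixel {depth_step(z, f_ir, baseline):.3f} m")
```

Because the depth step grows with the square of the depth, doubling the distance quadruples the depth uncertainty for the same disparity resolution.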

Figure 4. Synthetic depth images: a) PMD [vision] CamCube 2.0, b) Microsoft Kinect.
Due to the lower angular resolution of the reference device (0.14°) in comparison to the test device Microsoft Kinect (0.09°), artifacts from resampling can be observed in the synthetic depth image in Figure 4b. The test device PMD [vision] CamCube 2.0 records the data with an angular resolution of 0.20°, which is lower than the angular resolution of the reference device. For that reason, the synthetic depth image in Figure 4a is without artifacts. Thus, the depth values of the different devices can easily be compared to the depth values of the reference device.
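The quoted angular resolutions follow from dividing the field-of-view by the number of pixels along the same direction. The short check below uses the values given in the sensor descriptions; for the Microsoft Kinect, the IR field-of-view of 56.2° over the 632 pixel wide disparity image is an assumption about how the value was derived.

```python
def angular_resolution_deg(fov_deg, n_pixels):
    """Mean angular sampling interval in degrees per pixel."""
    return fov_deg / n_pixels

devices = {
    "Leica HDS6000": angular_resolution_deg(360.0, 2500),
    "PMD [vision] CamCube 2.0": angular_resolution_deg(40.0, 204),
    "Microsoft Kinect": angular_resolution_deg(56.2, 632),  # assumed inputs
}

reference = devices["Leica HDS6000"]
for name, res in devices.items():
    # A test device that samples more finely than the reference leaves
    # pixels of the synthetic depth image without a reference point.
    finer = res < reference
    print(f"{name}: {res:.3f} deg per pixel, finer than reference: {finer}")
```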

Figure 6. Mean depth: a) gray-coded image, b) histogram.
The standard deviation corresponding to the mean depth mentioned above is shown in Figure 7, where most of the values are below $\sigma_z$ = 0.5 m. The standard deviation increases slightly with depth and a maximum of 4.62 m can be observed in the data.

Figure 8. Mean depth: a) gray-coded image, b) histogram.
As can be seen in Figure 9a, the standard deviation increases with depth, where most of the values are below $\sigma_z$ = 0.2 m and a maximum of 1.41 m is given.

Figure 10. Depth difference between the data of reference and test device: a) PMD [vision] CamCube 2.0, b) Microsoft Kinect.
Homogeneous areas can be observed for the PMD [vision] CamCube 2.0 in Figure 10a. These represent a systematic range shift, where the range measurement tends to be too close to the sensor. Concerning the reliable pixels over the scene depth, 25109 depth difference values (74%) are within the interval [-1,0] m. The standard deviation of the depth difference might depend on the signal-to-noise ratio of the measurement. Due to the inverse square law concerning range dependency of the received light

Table 1. Quantity and quality of the utilized control points.