Impact of geolocation data on augmented reality usability: A comparative user test

Abstract. While the use of location-based augmented reality (AR) for education has demonstrated benefits for participants' motivation, engagement, and physical activity, geolocation data inaccuracy causes augmented objects to jitter or drift, which degrades user experience. We developed a free and open source web AR application and conducted a comparative user test (n = 54) to assess the impact of geolocation data on usability, exploration, and focus. A control group explored biodiversity in nature using the system in combination with embedded GNSS data, and an experimental group used an external module for RTK data. During the test, eye tracking data, geolocated traces, and in-app user-triggered events were recorded. Participants answered usability questionnaires (SUS, UEQ, HARUS). We found that the geolocation data the RTK group was exposed to was less accurate on average than that of the control group. The RTK group reported lower usability scores on all scales, of which 5 out of 9 were significant, indicating that inaccurate data negatively predicts usability. The GNSS group walked more than the RTK group, indicating a partial effect on exploration. We found no significant effect on interaction time with the screen, indicating no specific relation between data accuracy and focus. While RTK data did not allow us to improve the usability of location-based AR interfaces, the results allow us to assess our system's overall usability as excellent, and to define optimal operating conditions for future use with pupils.


INTRODUCTION
This study is part of the ongoing BiodivAR project, which attempts to assess the potential benefits of using augmented reality (AR) for outdoor education on biodiversity. In AR interfaces, digital objects can be overlaid on top of users' field of view in real time, through the screen of a mobile device or a head-mounted display. When used sensibly in an educational setting, AR may convey the impression of an enriched environment and make the material more attractive, thus motivating students to learn (Geroimenko, 2020, Alnagrat et al., 2022). The most reported positive effects of AR in education are learning gains and motivation (Bacca et al., 2014). Our research is focused on the use of location-based AR in particular, where the position of augmented objects is computed based on their geographic coordinates relative to the user's location as estimated by the mobile device's GNSS. With this technology, augmented objects can be built remotely from any given geodata, as opposed to marker-based AR, which requires markers to be physically placed on target locations. Location-based AR especially promotes learning in context (Arvola et al., 2021, Chiang et al., 2014), ecological engagement (Bloom et al., 2010), and causes users to experience a positive interdependence with nature (O'Shea et al., 2011), which fosters improved immersion and learning. Last but not least, location-based AR shows positive effects on the physical activity of users across genders, ages, weight status, and prior activity levels (Rauschnabel et al., 2017). However, location-based AR requires steady and continuously accurate data to operate. While GNSS technology has evolved and improved in the past decades, it has been more of an evolution than a revolution. Usability issues have 1. The system should allow non-expert users to create AR experiences (Cubillo et al., 2015); 2. Users should be able to publish observations rather than being restricted to a passive viewing role; 3. The instability of augmented objects deteriorates usability.
Participants spent 88.5% of the time looking at the tablet rather than at the surrounding nature. This imbalance could be in part related to inaccurate geolocation data: participants were observed spending considerable time reorienting themselves (Ingensand et al., 2018).
In order to address these identified issues, we developed BiodivAR 1, a free and open source (GNU GPLv3.0) web application, using a user-centered design process (Mercier et al., 2023). It was built using the web framework A-Frame 2, for which we also created a custom library 3 for the creation of WebXR location-based objects. We used the Leaflet 4 library for the interactive maps. BiodivAR enables the creation and visualization of geolocated points of interest (POIs) in AR (see Figure 1) and provides a cartographic authoring tool for the collaborative management of AR environments (see Figure 2). These environments can be shared publicly with or without editing privileges. The application allows anyone without technological know-how to create AR environments by importing/exporting geospatial data and styling POIs by attaching media to them. Media can be location-triggered (visible/audible) according to various distance thresholds set by the author.
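To illustrate the location-triggering principle described above, the following Python sketch converts a POI's geographic coordinates into an east/north offset relative to the user and checks it against a trigger radius. It is a minimal sketch under stated assumptions, not the BiodivAR or LBAR.js implementation: the function names, the spherical Earth radius, and the equirectangular approximation are illustrative choices that do not come from the application's source code.

```python
import math

# Assumption: spherical Earth with the mean radius, adequate for short walks.
EARTH_RADIUS_M = 6_371_000.0


def local_offset_m(user_lat, user_lon, poi_lat, poi_lon):
    """Return the (east, north) offset in meters of a POI relative to the user.

    Uses an equirectangular approximation, which is reasonable over the
    tens to hundreds of meters covered during an outdoor activity.
    """
    d_lat = math.radians(poi_lat - user_lat)
    d_lon = math.radians(poi_lon - user_lon)
    north = d_lat * EARTH_RADIUS_M
    east = d_lon * EARTH_RADIUS_M * math.cos(math.radians(user_lat))
    return east, north


def is_triggered(user, poi, radius_m):
    """True when the user (lat, lon) is within radius_m of the POI (lat, lon)."""
    east, north = local_offset_m(user[0], user[1], poi[0], poi[1])
    return math.hypot(east, north) <= radius_m
```

Because the offset is recomputed from each new position estimate, any error in the user's estimated location translates directly into a displacement of the augmented object, which is why unstable geolocation data makes objects jitter or drift.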

RESEARCH GOALS
The purpose of our research overall is to assess the potential benefits of using this application in the context of biodiversity education. Before introducing the tool to pupils, it seemed important to ensure its usability. This comparative user test will allow us to define and guarantee the best possible conditions of use for a younger audience. The goals of this study can be synthesized as follows: 1. Assess the overall usability of the AR application. 2. Assess the impact of geolocation data accuracy on usability, exploration, and focus. 3. Gather user feedback for future improvements.
1 BiodivAR is accessible (no download required) at: https://biodivar.heig-vd.ch. The source code is available at https://github.com/MediaComem/biodivar.
2 https://github.com/aframevr/aframe (MIT License)
3 https://github.com/MediaComem/LBAR.js/ (MIT License)
4 https://github.com/Leaflet/Leaflet (FreeBSD License)
The literature review and the observations made during the first iteration led us to propose the following hypothesis: Inaccurate geolocation data negatively affects usability. Additionally, we are looking to investigate the impact that geolocation data accuracy may have on exploration and focus in location-based AR, about which we have not been able to find any literature. The resulting research questions are: Q1: Does geolocation data accuracy predict usability scores? Q2: Is geolocation data accuracy related to exploration? Q3: Is geolocation data accuracy related to focus?

Experimental design
The present study aims to measure and compare the usability of a location-based AR application used in combination with different geolocation data sources. Using our authoring tool, we created an AR environment with POIs on biodiversity in the surroundings of the School of Engineering and Management Vaud in Yverdon-les-Bains (Switzerland). After a brief introduction to the tool, all participants freely explored the AR environment for 15 minutes using a Samsung Galaxy Tab Active3 tablet with a SIM card for cellular data. As shown in Figure 3, the comparative user test (n = 54) includes two groups: GNSS: the control group received geolocation data coming from the GNSS sensor embedded in the mobile device. RTK: the experimental group received geolocation data coming from an external Ardusimple RTK kit.

Participants
The sample includes 54 participants ( = 21, = 33), with a mean age of M = 25.72 (SD = 4.80). They are students and collaborators of the School of Engineering and Management Vaud, and they each signed an informed consent form for the use of the data collected. Login credentials (identifier + password) were created for each participant to record their data separately and facilitate comparison. Among them, 47 agreed to wear eye-tracking glasses, of which 41 successfully recorded data. Participants were randomly assigned to one of the two groups. The control group's (GNSS) mean age is M = 27.5 (SD = 6.09), and it includes 12 and 15 . The experimental group's (RTK) mean age is M = 24.2 (SD = 2.22) and it includes 9 and 18 . The first participant eventually had to be excluded from the final results because they experienced numerous crashes due to a bug that was fixed for the subsequent participants. The treatment they received was therefore too different to compare.

Data collection and processing
The four main concepts our study seeks to connect are "geolocation data accuracy", "usability", "exploration", and "focus". The measurable observations we chose to represent those concepts are listed in Table 1. In our experiment, the two groups (or treatments) operationalize the concept of "geolocation data accuracy". This concept is represented by two variables: accuracy and continuity. The accuracy attribute is provided by the Geolocation API along with the horizontal location data as latitude and longitude 9. It denotes the accuracy level of the latitude and longitude coordinates in meters. We use the average accuracy participants were exposed to while in AR mode as the indicator for accuracy. However, in the specific context of location-based AR, sudden changes in data accuracy heavily impact the display of augmented objects in the interface. An indicator for continuity in the data is thus the number of outliers, i.e. the points that are visibly out of a user's trajectory (as shown in Figure 4). An additional indicator for continuity in the data is the standard deviation of the data accuracy the participants of each group were exposed to. The concept of "usability" is represented by a series of nine variables whose indicators are the different scales of the three questionnaires (SUS, HARUS, UEQ): overall usability, ease of handling, ease of understanding, attractability, user-friendliness, efficiency, dependability, motivation, innovativeness. The concept of "exploration" is represented by three variables: quantity, diversity, and ease. The distance walked is the indicator of the quantity of exploration. The number of POIs visited is the indicator of the diversity of exploration. Heavy use of the 2D map may indicate that participants required assistance in navigating. The number of times the 2D map was opened is thus the indicator of the ease users had exploring. Finally, the concept of "focus" in our study is represented by a screen interaction variable, whose indicator is the amount of time participants spent interacting with the tablet screen versus with the real world.

Geolocation data accuracy
During the test, participants' geographical coordinates were logged at 1 Hz. Each log also contains an attribute for location accuracy, a user ID, and a timestamp. The resulting users' trajectories can be visualized in the application (see Figure 4) and downloaded as GeoJSON files for further analysis. The color of the trajectory changes when the AR session is stopped and resumed again. We downloaded the data and calculated the mean location accuracy each participant was exposed to. As shown in Figure 4, the trajectories, in particular those of the RTK group, contained outliers, which were removed manually using the free and open source software QGIS to get a more accurate estimate of the actual distance travelled (as an indicator of our "exploration quantity" variable, see 4.3.3). The outliers were counted for each participant as the difference between the number of points before and after this manual processing. Once the data was cleaned, we calculated the total distance walked by each participant. Because there were variations in the duration of each participant's test (min = 9 min 14 s, max = 24 min 11 s), the data was normalized for a duration of 15 minutes. This allowed us to calculate: 1. The average geolocation data accuracy 2. The number of outliers in the data 3. The standard deviation of the geolocation data accuracy
9 https://w3c.github.io/geolocation-api
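The three indicators and the normalized distance can be sketched in Python as follows. This is an illustrative sketch, not the processing actually used in the study (outliers were removed manually in QGIS): the point fields (`lat`, `lon`, `accuracy`, `t`), the haversine distance, and the linear rescaling of the distance to the target duration are all assumptions made for the example.

```python
import math
from statistics import mean, stdev

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth, mean radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def summarize_trace(raw_points, cleaned_points, target_s=900):
    """Summarize one participant's trace.

    raw_points: all logged points, as dicts with the hypothetical keys
        'lat', 'lon', 'accuracy' (meters) and 't' (seconds).
    cleaned_points: the same trace after outlier removal.
    target_s: duration to normalize to (900 s = 15 minutes).
    """
    duration_s = raw_points[-1]["t"] - raw_points[0]["t"]
    distance_m = sum(
        haversine_m(p["lat"], p["lon"], q["lat"], q["lon"])
        for p, q in zip(cleaned_points, cleaned_points[1:])
    )
    accuracies = [p["accuracy"] for p in raw_points]
    return {
        "mean_accuracy_m": mean(accuracies),
        "sd_accuracy_m": stdev(accuracies),
        # Outliers are the points dropped during cleaning.
        "outliers": len(raw_points) - len(cleaned_points),
        # Assumption: distance scales linearly with test duration.
        "normalized_distance_m": distance_m * target_s / duration_s,
    }
```

Computing the distance on the cleaned trace matters because a single outlier adds two long spurious segments to the path, which would inflate the distance of the group with the noisier data.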

Usability
Immediately after the test, participants answered an online survey containing demographic questions (age, gender), an open question for qualitative feedback, and three usability questionnaires:
• SUS (System Usability Scale) is a generic, technology-independent 10-item questionnaire on a 5-point Likert scale, frequently used for generic evaluation of a system (Brooke, 1996). The Cronbach's alpha of the SUS questionnaire is 0.79, showing an appropriate internal consistency. In accordance with the instructions of the scale's authors, the SUS score is calculated as follows: 1 point was subtracted from the scores of the odd-numbered (positively phrased) items. The scores of the even-numbered (negatively phrased) items were subtracted from 5. The processed scores were added together and then multiplied by 2.5 to get an individual user's score on a scale of 100. While a comparison between two scores is self-explanatory, we used an adjective scale (Bangor, 2009) to qualify the results individually.
• HARUS (Handheld Augmented Reality Usability Scale) is a mobile AR-specific 16-item questionnaire (Santos et al., 2014) on a 7-point Likert scale that focuses on handheld devices and emphasizes perceptual and ergonomic issues. The Cronbach's alpha of the HARUS questionnaire is 0.798, showing appropriate internal consistency. It has two components: manipulability (the ease of handling the AR system) and comprehensibility (the ease of reading the information presented on screen). In accordance with the instructions of the scale's authors, the HARUS scores are calculated as follows: The scores of the odd-numbered (negatively phrased) items were subtracted from 7. 1 point was subtracted from the scores of the even-numbered (positively phrased) items. The processed scores for items 1 to 8 were added together, divided by 48, and multiplied by 100 to get the individual "manipulability" score on a scale of 100. Similarly, the processed scores for items 9 to 16 were added together, divided by 48, and multiplied by 100 to get the individual "comprehensibility" score on a scale of 100. HARUS was designed so that its scores are commensurable with SUS scores.
• UEQ (User Experience Questionnaire) is a 26-item questionnaire in the form of semantic differentials: each item is scored on a 7-point scale (from -3 to +3, with 0 as neutral) with two terms with opposite meanings at each extreme (e.g. attractive/unattractive). It provides a comprehensive measure of user experience (Laugwitz et al., 2008). It includes six scales, covering classical usability aspects such as efficiency (can users solve their tasks without unnecessary effort?), perspicuity (is it easy to learn how to use the application?), and dependability (does the user feel in control of the interaction?), as well as broader user experience aspects such as attractiveness (do users like the application?), novelty (is the application innovative and creative?), and stimulation (is it exciting and motivating to use the application?).
UEQ is routinely used to statistically compare two versions of a system to check which one has the better user experience. The UEQ evaluations of both systems, or both versions of a system, are thus compared on the basis of the scale means for each UEQ scale. Attractiveness is calculated by averaging the scores from items 1, 12, 14, 16, 24, and 25. Perspicuity is calculated by averaging the scores from items 2, 4, 13, and 21. Efficiency is calculated by averaging the scores from items 9, 20, 22, and 23. Dependability is calculated by averaging the scores from items 8, 11, 17, and 19. Stimulation is calculated by averaging the scores from items 5, 6, 7, and 18. Novelty is calculated by averaging the scores from items 3, 10, 15, and 26. Values range between -3 (horribly bad) and +3 (extremely good), but in general only values in a restricted range will be observed. The calculation of means over a panel of participants makes it extremely unlikely to observe values above +2 or below -2, as specified in the UEQ handbook (Schrepp, 2015). As per the handbook's interpretation, values between -0.8 and 0.8 correspond to a neutral evaluation of the corresponding scale and values greater than 0.8 represent a positive evaluation.
These questionnaires provided scores for the nine scales reported in Table 1 as indicators of our usability variables.
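The scoring rules described above for the three questionnaires can be written compactly. The following Python sketch is an illustration, not the scripts used in the study; the function names are hypothetical, and the UEQ function assumes that answers have already been recoded so that +3 always corresponds to the positive term of each item.

```python
from statistics import mean


def sus_score(answers):
    """SUS score (0-100) from 10 answers on a 1-5 scale, item 1 first."""
    if len(answers) != 10:
        raise ValueError("SUS expects 10 answers")
    total = 0
    for item, value in enumerate(answers, start=1):
        # Odd items are phrased positively, even items negatively.
        total += (value - 1) if item % 2 == 1 else (5 - value)
    return total * 2.5


def harus_scores(answers):
    """(manipulability, comprehensibility), each 0-100, from 16 answers on 1-7."""
    if len(answers) != 16:
        raise ValueError("HARUS expects 16 answers")
    # Odd items are phrased negatively, even items positively.
    processed = [
        (7 - value) if item % 2 == 1 else (value - 1)
        for item, value in enumerate(answers, start=1)
    ]
    manipulability = sum(processed[:8]) / 48 * 100
    comprehensibility = sum(processed[8:]) / 48 * 100
    return manipulability, comprehensibility


# Item numbers (1-based) per UEQ scale, as listed in the text.
UEQ_SCALES = {
    "attractiveness": [1, 12, 14, 16, 24, 25],
    "perspicuity": [2, 4, 13, 21],
    "efficiency": [9, 20, 22, 23],
    "dependability": [8, 11, 17, 19],
    "stimulation": [5, 6, 7, 18],
    "novelty": [3, 10, 15, 26],
}


def ueq_scale_means(answers):
    """Mean per UEQ scale from 26 answers coded from -3 to +3 (assumed recoded)."""
    if len(answers) != 26:
        raise ValueError("UEQ expects 26 answers")
    return {
        scale: mean(answers[i - 1] for i in items)
        for scale, items in UEQ_SCALES.items()
    }
```

For instance, a participant answering 3 to every SUS item obtains 50, and the most favourable answers on every HARUS item yield 100 on both components, which illustrates how the two scales are made commensurable.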

Exploration
During the test, various in-app, user-triggered events were recorded by the application. These included: when the AR session was initiated or exited, when the 2D map was opened or closed, and when the triggering radius of a POI was entered or exited. Each log also contains the coordinates the action took place at, the user ID, and a timestamp. The resulting users' action log can be visualized in the application and downloaded as GeoJSON files. Events are represented with red circles on the 2D map (see Figure 4). We downloaded the data and calculated the number of POIs each participant visited as well as how many times they opened the 2D map. These values (POIs visited, 2D map opened) were normalized for a test duration of 15 minutes. This allowed us to calculate: 1. The number of POIs visited 2. The number of times the 2D map was opened. The distance walked by each participant was calculated from the geolocation data (see 4.3.1).
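As a rough illustration, the two event-based indicators could be derived from an event log as in the Python sketch below. The event schema (`type`, `poi_id`), the event names, the choice to count each POI only once, and the linear normalization are assumptions made for the example; they are not taken from the application's actual log format.

```python
def exploration_indicators(events, duration_s, target_s=900):
    """Count visited POIs and 2D map openings, normalized to target_s seconds.

    events: list of dicts with a hypothetical 'type' key ('poi_enter',
        'poi_exit', 'map_open', ...) and, for POI events, a 'poi_id' key.
    duration_s: actual duration of the participant's test in seconds.
    """
    # Assumption: a POI entered several times counts as one visit.
    visited = {e["poi_id"] for e in events if e["type"] == "poi_enter"}
    map_opened = sum(1 for e in events if e["type"] == "map_open")
    scale = target_s / duration_s
    return {
        "pois_visited": len(visited) * scale,
        "map_opened": map_opened * scale,
    }
```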

Focus
The goal of using eye tracking glasses and data in our study is to determine for how long participants were looking at the tablet screen or away from it. 47 out of 54 participants were able (and agreed) to wear eye trackers (Tobii Pro Glasses 3), which recorded their gaze for the duration of the test. The 7 participants who did not either chose not to or could not because they wore prescription glasses. Despite rigorous implementation, 6 recordings did not work as expected and no files were saved. The 41 remaining recordings were imported into Tobii's analysis software. Unfortunately, its tools do not support tracking of moving areas of interest (i.e. the surface of the tablet). We exported the videos with the overlaid gaze point and extracted 10 frames per second, resulting in a dataset of 380K images, an instance of which is shown in Figure 5. We attempted to classify the data with OpenCV pattern recognition, but the variability in the data prevented us from obtaining any results. We resolved to train a deep learning multiclass image classifier model by fine-tuning a pretrained vision transformer (ViT) model with our dataset (Dosovitskiy et al., 2020). We first had to manually label a random selection of 10K frames with "in" or "out" labels corresponding to whether the gaze point was in or out of the tablet screen (see Figure 5). After training for only one epoch using Google Colaboratory and obtaining a satisfactory accuracy of 95%, we ran inference on the whole dataset, which provided a label for every frame. The labels were then used to calculate the ratio of time each user spent looking at the tablet screen versus outside of it, at the real world.
Figure 5. Eye tracking data sample. The user's gaze is located within the tablet screen area.
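Since frames were extracted at a constant rate, the share of frames labelled "in" is equivalent to the share of time spent looking at the screen. A minimal Python sketch of this last step is given below; the function name is hypothetical and the input is assumed to be the list of per-frame labels produced by the classifier for one participant.

```python
from collections import Counter


def screen_time_ratio(labels):
    """Fraction of frames whose gaze point falls within the tablet screen.

    labels: iterable of "in" / "out" strings, one per extracted frame.
    With a constant frame rate, this equals the fraction of time.
    """
    counts = Counter(labels)
    total = counts["in"] + counts["out"]
    if total == 0:
        raise ValueError("no labelled frames")
    return counts["in"] / total
```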

Data analysis
Statistical analyses were performed with the free and open platform Jamovi (The jamovi project, 2022). In the following subsections, we report descriptive statistics (M, SD), and compare our groups (GNSS versus RTK) using an independent-samples Student t-test to show to what extent the two groups differ on our variables of interest. In cases where the homogeneity of variances assumption is not met, we used a Welch t-test, which is more robust.
5.2 Geolocation data accuracy
5.2.1 Average geolocation data accuracy
As shown in Figure 6, the mean accuracy for the GNSS group is M = 11.0 (SD = 15.3), and M = 33.6 (SD = 24.8) for the RTK group. The values are in meters, meaning the data the GNSS group was exposed to was accurate within an 11-meter radius, whereas the RTK group got data accurate within a 33.6-meter radius. A Welch t-test was used. The results show a significant difference between the two groups (t(43.5) = -3.99, p < .001).
As shown in Figure 7, the GNSS group trajectories contained M = 7.2 (SD = 7.55) outliers, and those of the RTK group M = 46.8 (SD = 40.1). A Welch t-test was used. The results show a significant difference between the two groups (t(27.9) = -5.04, p < .001).
As shown in Figure 8, the accuracy of the data participants from the GNSS group were exposed to had a standard deviation of M = 32.0 (SD = 77.7), and that of the RTK group M = 168.3 (SD = 120.1). A Welch t-test was used. The results show a significant difference between the two groups (t(44.7) = -4.93, p < .001).
The data is available here: https://zenodo.org/record/7845707.
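For readers who wish to check such results from the reported summary statistics, the Student and Welch statistics can be computed as in the Python sketch below. It is an illustration only (the analyses were run in Jamovi); the function names are hypothetical, p values are not computed, and the group sizes used in the usage note (26 and 27) are an assumption, since the text only reports the degrees of freedom.

```python
import math


def student_t(m1, s1, n1, m2, s2, n2):
    """Student t statistic and degrees of freedom (equal variances assumed)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

With the rounded values reported for average accuracy, `welch_t(11.0, 15.3, 26, 33.6, 24.8, 27)` gives a t statistic of about -4.0 with about 43.5 degrees of freedom, close to the reported t(43.5) = -3.99; the small gap is expected because the inputs are rounded.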

Usability
The means of each group for all nine scales from the three usability questionnaires are reported in Table 2 along with the t-tests' p values for significance assessment.
Table 2. Usability results by group and t-tests.

SUS
As shown in Figure 9, the mean SUS score for the GNSS group is M = 81.7 (SD = 9.74). The mean SUS score for the RTK group is M = 74.4 (SD = 12). The results show a significant difference between the two groups (t(51) = 2.45, p = 0.018).

HARUS
On the manipulability scale (indicating ease of handling the AR system), the mean score for the GNSS group is M = 76.7 (SD = 13) and that of the RTK group is M = 68.1 (SD = 16.1), as shown in Figure 10. The results show a significant difference between the two groups (t(51) = 2.13, p = 0.038).
On the comprehensibility scale (indicating ease of understanding information presented in the AR interface), the mean score for the GNSS group is M = 78.3 (SD = 11.3) and that of the RTK group is M = 74.9 (SD = 12.9). The results do not show any significant difference between the two groups (t(51) = 1.01, p = 0.318).

UEQ
As shown in Figure 11, on the attractiveness scale, the mean score for the GNSS group is M = 1.72 (SD = 0.7) and that of the RTK group is M = 1.1 (SD = 0.98). The results show a significant difference (t(51) = 2.65, p = 0.011). On the perspicuity scale, the mean score for the GNSS group is 2.02 (SD = 0.64) and that of the RTK group is 1.45 (SD = 0.92). A Welch t-test was used. The results show a significant difference between the two groups (t(46.7) = 2.61, p = 0.012). On the efficiency scale, the mean score for the GNSS group is 1.24 (SD = 0.85) and that of the RTK group is 0.85 (SD = 0.94). The results do not show any significant difference (t(51) = 1.58, p = 0.121). On the dependability scale, the mean score for the GNSS group is 1.17 (SD = 0.68) and that of the RTK group is 1.02 (SD = 0.62). The results do not show any significant difference (t(51) = 0.87, p = 0.39). On the stimulation scale, the mean score for the GNSS group is 1.84 (SD = 0.84) and that of the RTK group is 1.31 (SD = 1.11). The results do not show any significant difference (t(51) = 1.93, p = 0.059). On the novelty scale, the mean score for the GNSS group is 1.8 (SD = 0.85) and that of the RTK group is 1.21 (SD = 0.89). The results show a significant difference (t(51) = 2.45, p = 0.018).

Focus
The GNSS group spent an average of M = 73.3% (SD = 9.81) of the time looking at the tablet screen. The RTK group spent an average of M = 69.2% (SD = 12.4) of the time looking at the tablet screen. The results do not show any significant difference (t(51) = 1.16, p = 0.251).

CONCLUSIONS
The purpose of the study was to assess the impact of geolocation data on the usability of our location-based AR system. To test our hypothesis, we exposed the participants to different geolocation data sources with significantly different accuracies. While we expected RTK data to be more accurate and to enable us to improve usability, the analysis shows that it was significantly less accurate and less continuous than GNSS data. This appears to be due to the fact that the embedded GNSS sensor contains filters that preprocess data and remove most of the outliers. In contrast, RTK data purposefully remains "raw", which is valuable for an advanced user. RTK data is very effective when used on an isolated basis (i.e. at a 2D map scale), but not particularly suitable for real-time continuous use (where location is measured several times per second) at a 1:1, three-dimensional scale, at least without any filters applied to it. Despite this contingency, both the accuracy and the continuity of the geolocation data the two groups were exposed to differed significantly, which is the essential premise for testing our hypothesis and addressing our research questions. Regarding our main research question, results reveal that the GNSS group, who used the AR application in combination with more accurate and continuous data, reported higher scores on all usability scales, of which five out of nine were statistically significant. This supports our initial hypothesis that poor data accuracy negatively impacts the usability of a location-based AR system. Future studies should however investigate whether RTK data with proper outlier processing may actually improve usability. Our results further highlight that the GNSS group walked more than the RTK group, revealing that the accuracy of geolocation data was partially related to exploration, at least for the quantity indicator. 
However, because outliers, which were significantly more frequent in the RTK group, were removed manually from the trajectories, the data could be biased. To rule out such a bias, it would be necessary to record a trajectory with both modalities, remove the outliers, and check that there is no significant difference between the measurements. The comparison on the exploration diversity indicator (number of POIs visited) showed no significant difference. Additionally, although the difference was not significant, the GNSS group opened the 2D map more often than the RTK group on average, suggesting the RTK group may have found it easier to explore. Our results further highlight that there was no significant difference in the ratio of time participants from each group spent interacting with the tablet screen, which would indicate that there is no particular relation between the accuracy of geolocation data and focus.
Although the two experiments cannot be properly compared, because the tests took place 5 years apart under different conditions, we note that participants spent 69.2%-73.3% of the time looking at the tablet screen, which seems to be meaningful progress from the measurement that was made on our 2017 proof-of-concept, where participants interacted with the screen for 88.5% of the time (Ingensand et al., 2018). While we are not aware of a method to determine the ideal proportion, this measure overall remains an interesting indicator of the importance of the tablet in this type of activity. In a wide review of mobile learning projects, technology was found to dominate the experience in a problematic way in 70% (28/38) of the cases (Goth et al., 2006). While using RTK data did not allow us to positively impact the usability of our system, our study nevertheless demonstrated the impact of varying geolocation data accuracy on usability and exploration. The immediate benefit of performing this comparative study is for us to define the most suitable conditions of use before offering our system to a young audience, as well as to ensure an adequate overall level of usability. The overall score reported by the GNSS group allows us to qualify the application's usability as "excellent" according to the SUS adjective scale (Bangor, 2009).