CROWDSOURCING BIG TRACE DATA FILTERING: A PARTITION-AND-FILTER MODEL

GPS traces collected through crowdsourcing are low-cost and informative, and they are becoming a new big data source for urban geographic information extraction. However, the precision of crowdsourcing traces in urban areas is very low because of low-end GPS devices and urban canyons formed by tall buildings, which makes it difficult to mine high-precision geographic information such as lane-level road information. In this paper, we propose an efficient partition-and-filter model for filtering trajectories, which consists of trajectory partitioning and trajectory filtering. In the partitioning phase, the partition with position and angle constraint algorithm partitions a trajectory into a set of sub-trajectories based on distance and angle constraints. In the filtering phase, the trajectory filtering with expected accuracy method filters the sub-trajectories according to the similarity between GPS tracking points and GPS baselines constructed by the random sample consensus algorithm. Experimental results demonstrate that the proposed partition-and-filter model can effectively filter high-quality GPS data with the expected accuracy from various crowdsourcing trace data sets.


INTRODUCTION
In the big data era, data is being generated, collected and analysed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies use volume, velocity, variety, value, and veracity to characterize the key properties of big data. Compared with volume, velocity, variety and value, the fifth 'V' of big data, veracity, is more important to knowledge mining because poor-quality data has serious consequences for the results of data analyses. Veracity is therefore increasingly recognized as essential for extracting value from big data and making it operational. Data cleansing and data quality management are pressing needs for veracity: they ensure that the data in a database represent the real-world entities to which they refer in a consistent, accurate, complete and unique way (Saha, et al., 2014). Without proper data quality management, even minor errors can accumulate, resulting in process inefficiency and failure to comply with industry and government regulations (the butterfly effect (Samimi, et al., 2012)).
Crowdsourcing big trace data is a kind of spatial positional big data and shares the five 'V' characteristics of other big data. Specifically, it provides an unprecedented window into the dynamics of urban areas. Such data has been analysed to uncover traffic patterns (Masciari et al., 2014, Castro et al., 2012), city dynamics (Tu et al., 2015, Tu, Li, et al., 2015), and urban hot-spots (Tang, et al., 2015). Crowdsourcing big trace data also makes it possible to mine urban information at different levels of detail, such as centreline-level, carriageway-level and lane-level road map refinement. However, the precision of crowdsourcing traces in urban areas is very low because of low-end GPS devices and urban canyons formed by tall buildings, which makes it difficult to mine high-precision geographic information such as lane-level road information (Tang, Yang, et al., 2015, Zhou, Li, et al., 2015).
In this paper, we propose an efficient partition-and-filter model for filtering trajectories, which consists of trajectory partitioning and trajectory filtering. In the partitioning phase, the partition with position and angle constraint algorithm partitions a trajectory into a set of sub-trajectories based on distance and angle constraints. In the filtering phase, the trajectory filtering with expected accuracy method filters the sub-trajectories according to the similarity between GPS tracking points and GPS baselines constructed by the random sample consensus algorithm. Unlike existing methods for data quality management of GPS traces, the main goal of this paper is to provide a way of filtering crowdsourcing big trace data that classifies data according to positional accuracy.

Crowdsourcing traces
Crowdsourcing traces are traces collected by soliciting contributions from a large group of people, rather than from traditional employees or suppliers. This is a low-cost and efficient way to obtain useful information. Traces collected through crowdsourcing record the location, time, and other movement characteristics of moving objects. The veracity of crowdsourcing big trace data mainly concerns the positional accuracy of GPS records. It is generally known that the quality of crowdsourcing big trace data is poor in urban areas because of low-end GPS devices, complicated surroundings, the crowdsourcing collection process and so on. For example, the positioning accuracy of a civil C/A code GPS receiver is about 10-15 m. This means that outliers and noise lower the overall precision of the data set, although some high-precision GPS tracking points are still mixed into the raw GPS traces (Bradford, et al., 1996).

Methodology of partition-and-filter model
Based on the analysis of crowdsourcing big trace data quality and its causes, this paper proposes a partition-and-filter model to filter GPS data according to positional accuracy. Unlike traditional GPS data filtering methods, the partition-and-filter model classifies GPS data by positioning accuracy rather than correcting or repairing it. Based on this model, we develop a trajectory partition algorithm, PPAC, and a trajectory filtering method, TFRA.
Given a set of trajectories T = {Tr1, Tr2, …, Trn}, let Tri be one of them, as shown in figure 1(a). First, the PPAC algorithm identifies the characteristic points CP = {cp1, cp2, …, cpm} of Tri, and these characteristic points partition Tri into a set of sub-trajectories, as shown in figure 1(b). The characteristic points are separated out as a cluster and treated as outliers because their position and angle differ from those of the other points in the trajectory. After this first step, filtering high-precision tracking points from the traces is relatively easy because interference from outliers is reduced.
Next, the TFRA algorithm filters high-precision points from the sub-trajectories. Filtering high-precision points from sub-trajectories without a positional reference is generally very difficult because we do not know which points satisfy the expected precision. Thus, the first step of TFRA is to construct a positional reference, called the GPS baseline. As shown in figure 1(c), for each sub-trajectory of Tri, the GPS baseline is constructed with the random sample consensus method, exploiting the high consistency of high-precision points in position and heading.
The GPS baseline then serves as the positional reference for filtering: the more similar a GPS point is to the GPS baseline, the more precise it is. To evaluate the similarity between a tracking point and the GPS baseline, a similarity evaluation model is presented. Moreover, we derive the similarity threshold used for filtering by analysing the relation between similarity and the positional accuracy of GPS points, as shown in figure 1(d). The rest of this section provides more details on the technical aspects of the partition-and-filter model.

Partition with position and angle constraint (PPAC):
Trajectory partition is a preliminary step in trajectory data mining (Rasetic, et al., 2005), and it is also the first phase of the partition-and-filter model. Most trajectory partition methods focus on trajectory location and sampling time interval, velocity constraints or other movement characteristics of moving objects (Lee, et al., 2008, Zhang and Wang, 2011). The partition-and-filter model, however, is concerned only with the positional accuracy of the filtered data. Although current research shows that the positional accuracy of GPS tracking points can be inferred from their location, velocity and sampling time interval (Krishnan, et al., 2015), the low sampling rate of crowdsourcing trace data leaves us with only position and angle constraints to consider. Hence, we propose a trajectory partition method, PPAC, with position and angle constraints. Moreover, we devise an adaptive partitioning termination threshold to meet the trajectory data processing requirements.
The key issue in partitioning Tri into a set of sub-trajectories is to find all of its characteristic points. In this section, we propose a new trajectory partitioning algorithm that aims to find the points where the behaviour of a trajectory changes rapidly. The main idea is to check the distance and angle of each point against their thresholds with respect to the current movement. The principle of the PPAC algorithm is described in Table 1. The parameters verdisj and angdisj denote the vertical distance from point pj to the start vector and the angle between vector pjpj+1 and the start vector, respectively. To obtain a better partitioning, appropriate thresholds must be determined. In this paper, the partitioning termination thresholds ao and do are determined by evaluating the complexity of the trajectory shape. We describe the shape complexity of trajectory Tri using the standard deviation of the distance between each point pi and its projection point pi′ on p1pn (figure 3a), the standard deviation of the angle between the directed line pipi+1 and p1pn (figure 3b), and the curvature of Tri. Specifically, the curvature ρ of trajectory Tri is calculated, and the values of ao and do are then calculated as in equation (3), where α and β can be set according to user demand. Overall, the adaptive termination threshold adapts flexibly to all types of trajectory segments and overcomes the disadvantage of other partition algorithms that use a fixed termination threshold.
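The partitioning idea can be illustrated with a minimal Python sketch. It is one possible reading of the description above, not the authors' Table 1: the function name `ppac_sketch`, the choice of the start vector as the first segment after the previous characteristic point, and the use of fixed thresholds `d0` and `a0` (instead of the adaptive ones) are all assumptions made for illustration. Points are assumed to be in a planar metric coordinate system.

```python
import math

def _angle_deg(u, v):
    """Unsigned angle between 2-D vectors u and v, in degrees."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0 or nv == 0:
        return 0.0
    cos = max(-1.0, min(1.0, (u[0] * v[0] + u[1] * v[1]) / (nu * nv)))
    return math.degrees(math.acos(cos))

def ppac_sketch(points, d0, a0):
    """Return indices of characteristic points (hypothetical PPAC reading).

    points: list of (x, y) tuples in metres; d0: distance threshold in
    metres; a0: angle threshold in degrees. Both thresholds are fixed here,
    which is a simplification of the adaptive thresholds in the text.
    """
    cps = []
    start, n = 0, len(points)
    while start < n - 2:
        s = points[start]
        # Assumed start vector: first segment after the last cut.
        sv = (points[start + 1][0] - s[0], points[start + 1][1] - s[1])
        norm = math.hypot(*sv)
        cut = None
        for j in range(start + 1, n - 1):
            pj, pnext = points[j], points[j + 1]
            if norm > 0:
                # Perpendicular distance from p_j to the start vector's line.
                verdis = abs(sv[0] * (pj[1] - s[1]) - sv[1] * (pj[0] - s[0])) / norm
            else:
                verdis = math.hypot(pj[0] - s[0], pj[1] - s[1])
            # Angle between segment p_j p_{j+1} and the start vector.
            angdis = _angle_deg(sv, (pnext[0] - pj[0], pnext[1] - pj[1]))
            if verdis > d0 or angdis > a0:
                cut = j
                break
        if cut is None:
            break
        cps.append(cut)
        start = cut
    return cps
```

On an L-shaped path such as `[(0,0),(1,0),(2,0),(3,0),(3,1),(3,2),(3,3)]` with `d0=0.5` and `a0=30.0`, this sketch reports the corner point (index 3) as the only characteristic point.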

Trajectory filtering with required accuracy (TFRA):
Trajectory filtering is the second phase of the partition-and-filter model, and it is the most critical step of trajectory classification based on the required precision. Unlike traditional GPS data filtering methods, TFRA classifies GPS data by positional accuracy rather than correcting or repairing it. In recent studies, GPS noise caused by complicated surroundings, low-end GPS devices, the crowdsourcing collection process and so on is generally regarded as outliers. Although many researchers have discussed outlier detection and removal for raw GPS traces (Lee, et al., 2008, Yu, et al., 2014, Gupta, et al., 2014), trajectory filtering according to different positional accuracy demands has been little investigated, because it is very difficult to identify the positional accuracy of crowdsourced GPS measurements without any spatial position reference. Thus, to identify the positional accuracy of GPS measurements, the first step of TFRA is to construct a positional reference, which is called the GPS baseline in this paper. TFRA then evaluates the similarity between the GPS points and the GPS baseline for further filtering.
Most moving objects keep moving along their route and rarely change their moving direction within a short time, and a GPS trajectory reflects the movement tendency of the object. High-precision data therefore shows high consistency in position and heading. For example, tracking points of vehicles with high positional accuracy cluster along the centreline of each lane, and their heading does not change much unless the vehicles are at an intersection or changing lanes. Based on this observation, we propose using the GPS baseline to help TFRA filter high-precision GPS measurements from raw GPS traces. Specifically, the GPS baseline is a directed line segment that lies on the straight line l, as shown in figure 4.
Therefore, constructing the GPS baseline amounts to extracting the straight line l. To avoid the adverse effect of noise, we use the random sample consensus (RANSAC) method to construct the GPS baseline (Yaniv, 2010).
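As an illustration of how a line can be extracted with RANSAC, the following Python sketch fits a 2-D line to the points of one sub-trajectory. It is a generic textbook RANSAC line fit under stated assumptions, not the authors' implementation: the function name `ransac_line`, the fixed iteration count `n_iter`, and the fixed random seed are hypothetical, and the sample size is assumed to be two points. The distance tolerance `tau` plays the role of the threshold that decides whether a point agrees with the model.

```python
import math
import random

def ransac_line(points, tau, n_iter=200, seed=0):
    """Fit a line to 2-D points with a basic RANSAC loop.

    Returns ((x0, y0), (ux, uy)), inlier_indices, where (x0, y0) is a point
    on the line and (ux, uy) is a unit direction vector. The fixed n_iter
    and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        # Minimal sample: two points define a candidate line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        dx, dy = x2 - x1, y2 - y1
        norm = math.hypot(dx, dy)
        if norm == 0:
            continue  # coincident points cannot define a line
        # Points whose perpendicular distance to the line is within tau.
        inliers = [
            i for i, (x, y) in enumerate(points)
            if abs(dx * (y - y1) - dy * (x - x1)) / norm <= tau
        ]
        if len(inliers) > len(best_inliers):
            best_model = ((x1, y1), (dx / norm, dy / norm))
            best_inliers = inliers
    return best_model, best_inliers
```

A directed baseline could then be obtained by orienting the unit direction so that it agrees with the moving direction of the sub-trajectory, for example from its first to its last inlier; that step is omitted here.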

Figure 4. Generation of GPS baseline based on RANSAC (legend: inliers, outliers, moving direction, the straight line l, GPS baseline)
In the RANSAC algorithm, the estimated model is denoted M*. The threshold that decides whether a GPS point pi agrees with model M* is τ. The number of iterations is N, and the number of data elements required to fit M* is denoted s. The basic principle of using RANSAC to generate the GPS baseline is described in (Yaniv, 2010). In this study, the GPS baseline generated by RANSAC is regarded as the positional reference for filtering high-precision GPS points: a GPS point with high similarity to the GPS baseline is classified as high-precision data. Here, we care only about the location accuracy of GPS points, not the sampling rate or the speed of moving objects. Therefore, to evaluate the similarity between GPS points and the GPS baseline, we present a similarity evaluation model with heading and distance constraints. The similarity measure of TFRA is defined as a linear weighting, where |pkpk′| is the distance between pk and its projection point pk′ on GPS baseline G, θk is the angle between the heading of pk and GPS baseline G, and ω1 and ω2 are the weights of the differences in vertical distance and angle, with ω1+ω2=1. In general, the similarity between GPS points and the GPS baseline ranges from 0 to 1.
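The exact form of the similarity measure is not recoverable from this text, so the following Python sketch shows only one plausible linear weighting that satisfies the stated properties (weights summing to 1, values in the range 0 to 1). The normalisation constants `d_max` and `theta_max` and the function name `similarity` are assumptions introduced for illustration; the default weights 0.98 and 0.02 are the values reported in the experiments.

```python
def similarity(dist_m, angle_deg, w1=0.98, w2=0.02, d_max=15.0, theta_max=180.0):
    """Linearly weighted similarity between a GPS point and its baseline.

    dist_m: distance |p_k p_k'| from the point to its projection on the
    baseline, in metres. angle_deg: angle between the point's heading and
    the baseline, in degrees. d_max and theta_max are hypothetical
    normalisation constants that map both terms into [0, 1].
    """
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Each term is 1 for a perfect match and 0 at or beyond the cap.
    d_term = 1.0 - min(abs(dist_m), d_max) / d_max
    a_term = 1.0 - min(abs(angle_deg), theta_max) / theta_max
    return w1 * d_term + w2 * a_term
```

With these assumptions, a point lying on the baseline and heading along it scores 1, and the score decreases as the point moves away from the baseline or its heading deviates from it.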
After the similarity computation, a similarity threshold must be set to filter data from the raw GPS traces, and each similarity threshold directly determines the positional accuracy of the filtered data. In our approach, the similarity threshold should correspond to the expected precision of the filtered data.
We assume that a functional relationship exists between the similarity and the positional accuracy of GPS traces.

EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we perform an extensive experimental evaluation of the method introduced in Section 3 on real-world mobility datasets.

Experimental data
In the follow-up experiments, the low-precision GPS traces in the data set are used as the experimental data, and the synchronized high-precision DGPS traces are used as ground truth to validate the effectiveness of the partition-and-filter model.

Visual results
In this section, we randomly selected some of the trajectories as experimental data to present the results filtered by the partition-and-filter model. The experimental traces (Figure 6a) were first partitioned by the PPAC algorithm, yielding a set of characteristic points, as shown in Figure 6b. Specifically, the constants α and β were set to 25 m and 30°, and the partitioning termination thresholds ao and do differed from trace to trace, so they are not listed individually. After trace partitioning, the characteristic points of a trajectory were clustered as outliers because their location or angle differed greatly from those of the other points. The sub-trajectories without characteristic points were treated as raw data to be classified by the TFRA algorithm. The main operations of the TFRA algorithm are GPS baseline construction, similarity evaluation and filtering threshold determination. The GPS baseline of each sub-trajectory was constructed with the RANSAC method, and its direction was the same as the moving direction of the sub-trajectory, as shown in figure 7. During GPS baseline construction with RANSAC, the parameter τ was set to 0.5 m according to the precision requirement, and other parameters such as N were self-adaptive.

Figure 7. GPS baseline construction
In this study, we assume that the GPS baseline represents the actual position of the moving object to some extent, so the more similar the GPS points of a sub-trajectory are to its GPS baseline, the more accurate they are. TFRA therefore uses the similarity evaluation model to compute the similarity between the GPS points and their GPS baseline. After much trial and error, the weights of distance and angle in the similarity evaluation model were set to 0.98 and 0.02, respectively. The similarity threshold for trajectory filtering was determined by the expected precision of the filtered data. The functional relationship between similarity and positional accuracy followed an exponential model. Based on extensive testing, the correlation coefficient of the exponential regression reaches its highest value, R = 0.946, when the parameters a, b and c of the exponential model for GPS traces with 10-15 m accuracy equal 1, -0.267 and 0, respectively. The similarity thresholds corresponding to the expected precision of the filtered data are shown in Table 2. The numbers of experimental points and filtered points are displayed in figure 8; the results show that the proportion of filtered data falls as the expected precision increases.
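To make the link between expected precision and similarity threshold concrete, the sketch below assumes the exponential model has the common three-parameter form Sim = a·exp(b·ε) + c. This form is an assumption; the text reports only that the model is exponential with parameters a = 1, b = -0.267 and c = 0. The function names are hypothetical, and the values this sketch produces are not claimed to match Table 2.

```python
import math

def similarity_threshold(xi_m, a=1.0, b=-0.267, c=0.0):
    """Similarity threshold for an expected precision xi_m (metres).

    Assumes the exponential model Sim = a * exp(b * eps) + c, evaluated at
    eps = xi_m. The functional form is an assumption; a, b and c default to
    the parameter values reported in the text.
    """
    return a * math.exp(b * xi_m) + c

def filter_points(similarities, xi_m):
    """Return indices of points whose similarity meets the threshold."""
    thr = similarity_threshold(xi_m)
    return [i for i, s in enumerate(similarities) if s >= thr]
```

Under this assumed form, a stricter expected precision (smaller ξ) gives a higher threshold, so fewer points pass the filter, which is consistent with the reported drop in the proportion of filtered data as the expected precision increases.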

Experimental results analysis
To estimate the validity of the partition-and-filter model, the positional accuracy of the filtered data was calculated against its ground truth. Table 3 shows the average value and standard deviation of the positional accuracy of the filtered data from different roads in the urban area. These statistics indicate that the partition-and-filter model can indeed filter GPS traces according to the expected precision. In addition, both the average value and the standard deviation of the positional accuracy of the filtered data increased as the expected precision was relaxed.
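The kind of statistic described here can be computed as in the following Python sketch. It assumes that each filtered GPS point has already been paired with its synchronized DGPS ground-truth point and that both are expressed in a planar metric coordinate system; the function names and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import math
import statistics

def positional_errors(gps_points, dgps_points):
    """Euclidean error (metres) of each GPS point against its DGPS pair."""
    return [
        math.hypot(gx - dx, gy - dy)
        for (gx, gy), (dx, dy) in zip(gps_points, dgps_points)
    ]

def accuracy_summary(gps_points, dgps_points):
    """Return (mean error, population standard deviation of the error)."""
    errors = positional_errors(gps_points, dgps_points)
    return statistics.mean(errors), statistics.pstdev(errors)
```

Comparing the mean error of the filtered points with the expected precision ξ is one simple way to check whether a filtering run met its target.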

CONCLUSION
In this paper, we have proposed a novel framework, the partition-and-filter model, for data quality management of crowdsourcing big trace data. Based on this framework, we have developed the trajectory partition algorithm, PPAC, and the trajectory filtering algorithm, TFRA. The main advantage of PPAC is its adaptive partitioning termination threshold. The partition results show that PPAC effectively segments a trajectory according to its shape complexity. In addition, the TFRA algorithm provides a new understanding of trajectory filtering, classifying crowdsourcing big trace data based on its positional accuracy. Overall, we believe that we have provided a new paradigm in trajectory quality management.

Figure 1. Methodology of partition-and-filter model. (a) A set of trajectories T = {Tr1, Tr2, …, Trn}, with Tri one of them. (b) Characteristic points CP = {cp1, cp2, …, cpm} of Tri identified by the PPAC algorithm, which partition Tri into a set of sub-trajectories. (c) GPS baseline construction for each sub-trajectory of Tri. (d) Determination of the similarity threshold for filtering.

Figure 2. Partition with position and angle constraint
Assume that the trajectory Tri contains a series of tracking points pi, i=1,2,…, n, as shown in figure 2. The partitioning thresholds of distance and angle are do and ao, respectively. The principle of the PPAC algorithm is described in Table 1.
1. Input: Trajectory: Tri(p1, p2, p3, …, pn);

Figure 3. Trajectory shape complexity. (a) and (b) indicate the computation of trajectory shape complexity.
In equation (5), Sim represents the similarity and ε indicates the positional accuracy. The threshold of trajectory filtering is then defined in terms of the parameter ξ, the expected precision, with ξ = 1 m, 2 m, …, h m, h < 1σ. In particular, to determine the relation between Sim and ε, we use the similarity evaluation model to compute the similarity between the GPS measurements and their ground truth, and then carry out a linear regression analysis of the similarity against the GPS measurement error. After much investigation, the functional relationship between similarity and positional accuracy was found to follow an exponential model. Equation (6) is therefore redefined as the exponential model of equation (7), whose parameters a, b and c are determined by the positional accuracy of the experimental data and by the weights of distance and heading in the similarity model. The threshold of trajectory filtering is redefined accordingly.

Figure 5. The collection of DGPS trajectories and synchronized GPS trajectories.

Figure 6. Trajectory partitioning based on PPAC. (a) The raw GPS points. (b) The final result of PPAC, where red points and black points represent raw GPS points and characteristic points, respectively.

Figure 8. The filtering results with different expected accuracy

Figure 9. The filtering results
Using the partition-and-filter model, we obtained the final partition and filtering results. Figure 9(a) and (b) show the partition results of all GPS points based on the PPAC algorithm and the filtered results of all sub-trajectories based on the TFRA algorithm, respectively. Details of the experimental results can be seen in the inset window of figure 9.

Table 1. The principle of trajectory partitioning

Table 2. The filtering threshold with different expected accuracy
According to table 1, we set different similarity thresholds to filter data from raw GPS traces.

Table 3. The precision of filtered data