K-NEAREST NEIGHBOUR QUERY PERFORMANCE ANALYSES ON A LARGE SCALE TAXI DATASET: POSTGRESQL VS. MONGODB
Keywords: spatial query, kNN, database, GIS, open-source
Abstract. The increasing volume of transport network data necessitates the use of a Database Management System (DBMS) to store, query and analyse data. There are two main types of DBMS: relational and non-relational. Many different DBMS are available on the market, but only some of them can handle spatial data. Therefore, determining which DBMS to use for operational purposes is of interest to researchers and analysts working in spatial information science. One of the most commonly used spatial queries in GIS is the k-Nearest Neighbour (kNN) query for a given point. This paper analyses the performance of the kNN query in PostgreSQL and MongoDB, which represent relational and NoSQL DBMS respectively. Two metrics are investigated to determine the performance: i) spatial accuracy and ii) run time. The Haversine and Vincenty formulas are used to calculate the distances between the query point and the neighbours returned by each DBMS, and these distances are then used to assess the spatial accuracy of the DBMS. A sensitivity analysis has been carried out by varying the value of k and recording the execution times. The experiments are carried out on New York City's openly available taxi dataset, which consists of millions of taxi pickup and drop-off points. The results indicate that MongoDB outperforms PostgreSQL in terms of both execution time and spatial accuracy, regardless of the value of k. To facilitate reproducibility of the results, the developed software is shared on GitHub.
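To illustrate the accuracy check described above, the following is a minimal sketch of a Haversine distance and a brute-force kNN reference against which neighbours returned by a DBMS could be compared. It is not the software shared by the authors: the function names `haversine_m` and `knn_brute_force`, the mean Earth radius constant, and the sample coordinates are assumptions made for illustration only.

```python
import heapq
import math

# Assumption: spherical Earth with the mean radius in metres.
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def knn_brute_force(query, points, k):
    """Return the k nearest points to `query` as (distance_m, (lat, lon)) pairs.

    Every point is scanned, so this is only a reference for checking
    the neighbours returned by an indexed kNN query, not a fast method.
    """
    qlat, qlon = query
    scored = ((haversine_m(qlat, qlon, lat, lon), (lat, lon)) for lat, lon in points)
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    # Hypothetical pickup points (lat, lon); not taken from the taxi dataset.
    pickups = [(40.0, -74.0), (40.1, -74.0), (41.0, -74.0)]
    for dist, point in knn_brute_force((40.0, -74.0), pickups, k=2):
        print(f"{point}: {dist:.1f} m")
```

In such a setup, the distances of the neighbours returned by each DBMS can be compared with the brute-force distances for the same k. The Vincenty formula, which works on an ellipsoid rather than a sphere, would replace `haversine_m` for the second distance measure.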