LARGE VECTOR SPATIAL DATA STORAGE AND QUERY PROCESSING USING CLICKHOUSE

: The exponential growth of geospatial data resulting from the development of earth observation technology has created significant challenges for traditional relational databases. While NoSQL databases based on distributed file systems can handle massive data storage, they often struggle to cope with real-time query. Column-storage databases, on other hand, are highly effective at both storage and query processing for large-scale datasets. In this paper, we propose a spatial version of ClickHouse that leverages R-Tree indexing to enable efficient storage and real-time analysis of massive remote sensing data. ClickHouse is a column-oriented, open-source database management system designed for handling large-scale datasets. By integrating R-Tree indexing, we have created a highly efficient system for storing and querying geospatial data. To evaluate the performance of our system, we compare it with HBase, a popular distributed, NoSQL database system. Our experimental results show that ClickHouse outperforms HBase in handling spatial data queries, with a response time approximately three times faster than HBase. We attribute this performance gain to the highly efficient R-Tree indexing used in ClickHouse, which allows for fast spatial data query.


INTRODUCTION
In recent years, the advancement of Earth observation technology has led to an increasing importance of remote sensing data in various fields such as urban construction, land resource survey, crop disease analysis, smart city, land use and environmental protection (Chi et al., 2016;Kucherov et al., 2017;Li et al., 2020;Shi et al., 2022;Song et al., 2020;Wang et al., 2022;Zhang et al., 2021;Zhu et al., 2021).The exponential increase in remote sensing data volume presents a considerable obstacle to the efficient organization and management of data.Furthermore, it hinders the capacity to analyse and respond promptly to spatial data queries and analyses.Moreover, as a result of the development of global positioning system and mobile intelligent terminals, high accuracy and low latency requirements have emerged for location services including precise positioning, regional query, route planning.In this era of massive spatial data, the key challenges in GIS development is to effectively organize, manage and store large-scale spatial data.
The use of relational databases for spatial data storage dates back to 1970 (Codd, 1970).Today they are widely used for spatial data storage, with popular databases such as Oracle Spatial and PostGIS for PostgreSQL featuring spatial data analysis capabilities.PostGIS, based on PostgreSQL for spatial expansion, has surpassed Oracle Spatial with query speeds that are 300% and 450% faster for spatial data querying and analysis, respectively (Shukla et al., 2016).PostgreSQL provides a simple development path to incorporate new space types, offering numerous features such as reliability, transaction integrity, support for SQL standards, pluggable type extension, and community-oriented development model, along with support for large GIS objects and R-Tree index as a general index structure.PostgreSQL mainly extends the geometry data type for the storage of spatial data, capable of effectively storing point, line, polygon, multipoint, multiline, and multipolygon.Despite its powerful spatial analysis function, traditional relational databases face significant performance bottlenecks.The efficient storage and access of massive data, high scalability, and availability of the database are the major gaps that a relational database cannot resolve.PostgreSQL struggles with decreasing query and processing efficiency as data volume increases.Additionally, deployment and installation of a single node result in limited data storage capacity and low security, and node failures can lead to data loss.
With the continuous growth of data volume and the arrival of the era of big data, data storage has begun to shift from centralized storage to distributed storage on multiple machines.There are three main types of data modes: (1) Tree-structured storage where data is organized in a folder containing file organization.(2) Block storage where data is grouped into blocks of the same size with each block having its own identifier, and (3) Object storage where data is stored in an object unit, and the metadata describing the data is stored in an independent database.In 2003, Google released its own distributed file storage system, Google File System (GFS), which uses a single master and multiple chunkservers to organize data.GFS divides files into chunks, each with a unique 64-bit identifier, and replicates each data block to multiple chunkservers to improve reliability (Ghemawat et al., 2003).Distributed data storage and replication mechanisms are capable of overcoming the limitations of traditional database security and storage scalability.NoSQL databases can be used on distributed clusters, providing strong scalability and storage capabilities, as well as low-latency query services (Wang et al., 2017).In recent years, there has been extensive exploration of NoSQL database storage for spatial data.
This paper proposes a spatial data storage strategy based on ClickHouse database that utilizes the superior performance of the R-Tree index for spatial data retrieval in traditional relational databases.The R-Tree index is still used as the index of spatial data in ClickHouse.R-Tree is a balanced tree similar to B-Tree but it solves the problem of fast search highdimensional space, unlike B-Tree, which solves the problem of fast search in one-dimensional space.R-Tree uses the minimum circumscribed rectangle as the boundary of the space geometry, and during the spatial retrieval, it filters out most of the areas that do not overlap with the search rectangle by gradually expanding the boundary (Guttman, 1984).Before storing vector data in ClickHouse, the proposed strategy build the data into an R-Tree, traverses the R-Tree in a hierarchical way during storage, and stores non-leaf nodes in the index table and leaf nodes in the detailed wide table.The data is partitioned based on the root node before storage, so that the non-search areas can be effectively isolated during the query, thereby speeding up the query process.

RELATED WORK
In recent years, there has been a surge of interest in using NoSQL databases for managing and storing large-scale datasets.Redis, Elasticsearch, MongoDB and HBase.Compared with these mainstream NoSQL databases, ClickHouse's architecture is highly scalable, fault-tolerant, and distributed, making it an ideal choice for use cases that require high throughput, low latency, and real-time analytics.In this related work section, we provide an overview of the relevant literature that discusses the use of Redis, Elasticsearch, MongoDB, HBase, and ClickHouse, highlighting their advantages and disadvantages in handling Geospatial data.

Redis
Redis is an open source, memory-based data structure storage system that supports a variety of use cases, including serving as a database, cache, message broker, and streaming engine.It offers support for various data structures such as strings, hashes, lists, sets, sorted sets, bitmaps, hyperloglogs, and geospatial indexs and streams.Redis has built-in replication, Lua scripting, LRU eviction and different levels of persistence.Additionally, it provides high availability through Redis Sentinel and Cluster.When storing geographical locations, Redis uses corresponding key to store the longitude and latitude and the name of the data using Geo Set data structure, which is similar to the Sorted Set data structure.Its implementation leverages Geohash technology to encode the longitude and latitude bits in a staggered manner, and divides the geographical space into gridshaped buckets with each grid having its own corresponding code (Makris et al., 2019).When querying a spatial range using the georadius method, it transforms the query of twodimensional space into the comparison of strings of onedimensional space, which makes it highly efficient for spatial range queries (Liu et al., 2014).(Hao Yu et al., 2012) proposed to use Redis to cache terabytes of geospatial data to cope with highly concurrent queries.However, Redis does not provide data security guarantees and improper design can lead to cache breakdown, cache penetration and cache avalanche.

Elasticsearch
Elasticsearch is a distributed, real-time search and data analysis engine that is highly scalable.It can perform full-text search, structured search, and analysis, making it a versatile tool for many use cases.Geographical location is represented in Elasticsearch using two different data types: geo_point and geo_shape.The former is used to represent geographical coordinate points with latitude and longitude, while the latter is used for complex geographical shapes using GeoJSON.Geo_points is useful for finding points within a certain range, calculating distances, and aggregating data display on the maps.On the other hand, geo_shape is used to filter the data and analyse whether two shapes intersect, contain or overlap in the geographical space.Elasticsearch uses GeoHash indexing for geographical location queries, ensuring high accuracy.One example of using Elasticsearch is for retrieving geographical location in digital corpus (Bartlett, 2019).

MongoDB
MongoDB database is a document-based database that organizes data in the form of BJSON.It offers highperformance data persistence, and an API query interface that allows for easy integration with external programs for CRUD operations.MongoDB is highly reliable and offers failover and redundancy mechanisms to prevent data loss as well as horizontal scalability through data partitioning across multiple machines.It supports multiple storage engines, including memory storage engine and WiredTiger storage engine.To efficiently retrieve geospatial coordinates, MongoDB employs two indexing methods: the 2D index for planar geometry and 2dsphere index for spherical geometry.GeoJSON format must be used for organizing geospatial data in MongoDB due to these indexing methods.MongoDB also provides the calculation of geospatial relationships such as geoWithin for inclusion, geoIntersects for intersection, and nearSphere for adjacency.Recent studies have explored different methods of storing remote sensing data in MongoDB.For instance, a remote sensing data management method was proposed based on MongoDB that stores metadata as documents, and image data in GridFS format (Wang et al., 2019) .

HBase
HBase is a highly scalable NoSQL database that enables the storage of massive amounts of data in a distributed environment.Leveraging the Hadoop Distributed File System (HDFS), HBase uses Namespaces for data space division and Regions for data organization, storing data in a columnar format.Rather than defining specific columns when creating a table, HBase only requires a declaration of the Column Family, providing dynamic field specifications when writing data.This makes Hbase more suitable for scenarios with varying fields, when compared to traditional relational databases.Though HBase does not support geospatial data storage natively, its flexible architecture has led many researchers to explore the use of HBase for this purpose.For example, (Wang et al., 2017) proposed a method of storing vector data in HBase by dividing it into grids using space Z curve filling with the number of space filling curves used as RowKey for storage, and the geometric objects within each grid used as columns (Wang et al.,2017).Similarly, (Wang et al., 2019) proposed an efficient spatial big data storage and query method in HBase, which leverages Hilbert spatial curve filling to convert highdimensional data into one-dimensional data, allowing for faster query processing.

ClickHouse
Clickhouse is an open-source column-oriented storage database (DBMS) that specializes in online analytical processing (OLAP) queries, allowing real-time generation of data analysis reports through SQL queries.ClickHouse has the following features: (1) ClickHouse supports various table-level storage engine, with over 20 engines in four categories, namely merge tree, log, interface and others enabling users to choose different storage engines according to specific requirements.(2) ClickHouse has a high throughput capacity and uses a structure similar to LSM tree to achieve sequential write performance, allowing it to write data in a sequential append manner during data import, and merge-sort multiple segments to optimize disk usage.ClickHouse's write performance is superior, with an official open benchmark test showing it can achieve a write throughput of 50MB-200MB/s, equivalent to a write speed of 50W-200W pieces/s (based on an estimated 100 bytes per line).( 3) ClickHouse divides the data into multiple partitions and further divides each partition into multiple index granularities, which can then be processed by multiple CPU cores for parallel data processing.This design allows a single query to utilize all the CPU of the entire machine, greatly reduces the query delay.Compared with other databases, ClickHouse has superior performance in various SQL queries, making it an excellent choice for low performance machines and outperforming MySQL which is the most popular open-source database at presently available (Wickramasekara et al., 2020).

Storage method of spatial data in ClickHouse
ClickHouse lacks the inherent capability to store geospatial data, but this study builds a logic layer on top of ClickHouse to enable it to do so.This section will focus on the design of this extended logic layer with Figure 1 displaying the comprehensive structure of ClickHouse's storage and retrieval of geospatial data.

Storage service:
The storage service in the ClickHouse shared cluster for spatial data performs the following operations: S1: Read the corresponding shapefile file from the corresponding file system based on the path carried in the application request.S2: Construct an R-Tree using the data from the shapefile file.R-Tree is preferred over the improved R * -Tree due to its proven ability of R-Tree to retrieve geospatial data has been fully proved in traditional relational databases.The construction time of R * -Tree is relatively long, and its query efficiency is not optimal when handling small data amounts.S3: Generate a separate table for each shapefile since they contain unique geographic information.To accelerate geospatial query performance, two tables are created: an index table and a detailed wide table.The index table stores non-leaf node information, while the detailed wide table stores leaf node information in the R-Tree.ClickHouse's ability to partition data allows for effective isolation of irrelevant data during queries, which is cleverly used when designing the index table.The table is built based on the root node with elements in the root node and the subtree below it stored in the corresponding partition as data of the same partition.The partition number is determined by the leftto-right position of the current element in the root node, while the partition number of the leaf node in the detailed wide table is the ID of the parent node, as shown in Figure 2. The level field records the level of the current node in the entire R-Tree.Since the index data is stored in the database, retrieval needs to be performed recursively with each retrieval taking the ID of the parent node as the filter condition is to isolate unrelated nodes in the same subtree.Fortunately, ClickHouse provides the Prewhere statement, which can filter the data in advance before the selection, and significantly improve efficiency, especially when the number of columns queried is significantly more than that filtered.The minimum bounding rectangle of the current node is defined by the min_x, max_ x, min_y and max_ y values.The is_leaf field identifies whether the current node is the last level of the index node and serves as the exit of the recursive query.
In the detailed wide table, the smallest granularity data of the current shapefile file is stored, with the ID generated by the snowflake algorithm used to identify and sort data.The index of ClickHouse itself is used to speed up retrieval efficiency when querying.The parent_id is used as a partition field, and the data of the same parent node is stored in a partition.The minimum limited rectangle can exclude the data that does not meet the requirements when querying.The_geom field records geometric data, while the remaining fields contain non-spatial attributes of the current data.

Figure 2. Use R-Tree to build the logic of index table and detail table.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-1-2023 39th International Symposium on Remote Sensing of Environment (ISRSE-39) "From Human Needs to SDGs", 24-28 April 2023, Antalya, Türkiye 3.1.2Retrieval service: Data retrieval services can be categorized into geospatial retrieval and conventional retrieval.Geospatial retrieval involves providing spatial geometry information to retrieve data, while conventional retrieval requires general retrieval conditions such as city name, street name, etc.For conventional retrieval, index query is avoided, and the corresponding information is directly queried in the detailed wide table based on the conditions.For example, if the geometric value of a street named "Atlantic commons" is needed, the system will retrieve the relevant information from the detailed wide table and return it to the application.In contrast, geospatial retrieval requires calculating the minimum restricted rectangle of the incoming geometry first.The index table is then retrieved from the root node.When R-Tree is stored in the table, parent_id of the root node is set to 0. Therefore, when searching, the filter condition can be set to parent_id=0 to find the root node.Then, determine the node that intersects with the minimum bounding rectangle of the spatial geometry being queried in the root node.The system then recursively searches the lower level of R-Tree in the index table, taking the id of the parent node as the filter condition at each recursion.When the leaf node of the index is reached, the current node id is used as the filter condition to retrieve the corresponding data in the detailed wide table.During the search in the detailed wide table, the system still uses the intersection of the minimum restricted rectangle as the filter condition.However, the intersection of the minimum restricted rectangle in R-Tree does not necessarily mean that the real geometry is intersected (as shown in Figure 3).Therefore, further judgment is required when the corresponding data is obtained.Finally, when the geometric information is retrieved and the geometric information in the detailed wide table meets the topological requirements of the retrieval, the corresponding results are returned to the upper application.intersects, but the polygons do not intersect.

ClickHouse Sharded Cluster:
ClickHouse provides two mechanisms for distributed data storage: replica and sharded cluster.The replica mechanism ensures the high availability of data by allowing the same data to be obtained from other servers if a ClickHouse node goes down.The write process of the replica mechanism is shown in Figure 4.However, it requires each server to accommodate the full amount of data, limiting horizontal expansion.To overcome this limitation, ClickHouse introduces sharding, which segments a complete piece of data and distributes different shards to different nodes.
The Distributed Table Engine acts as a middleware, routing distributed data from multiple nodes with different shards through distributed logical tables.Unlike the replica mechanism, the sharded cluster mechanism allows for horizontal expansion of massive data storage.To ensure both horizontal scalability and high availability, ClickHouse combines sharding and replica mechanisms.In this paper, a sharded cluster architecture is adopted, where three machines are used as the partition cluster.Figure 5 illustrates the design of this architecture.By leveraging sharding and replica mechanisms, ClickHouse provides a powerful solution for distributed data storage and processing.

Storage method of spatial data in HBase
To better evaluate the advantages of ClickHouse in querying spatial geographic data, this paper includes a comparative test with Hbase.The experiment expands HBase using R-Tree index while maintaining the independent storage mode for index and detailed wide table to speed up the query.This is achieved by dividing the data into two tables, namely index table and detail table .Table 3 shows the design of the index table while Table 4 shows the design of the detailed wide table.
HBase stores data using a single partition and employs the "parentRowKey_SnowflakesAlgorithm" design rule for Row_Key to ensure its uniqueness.The parameters min_x, max_x, min_y, and max_y represent the smallest circumscribed rectangle, while is_ Leaf identifies whether the current node is a leaf node of the R-Tree index.The Start_Row and Stop_Row parameters specify the range of Row_Key of the child node.The primary purpose of these design features is to accelerate query performance using scan operations.To optimize query performance, HBase stores data for different column families separately.In the design of the detailed wide table, two column families are used: "ordinary" to store non-geometric information and "geometry" to store geometric information.This approach isolates geometric and non-geometric queries, enabling each to be optimized independently.By separating the data into two column families, HBase can accelerate both geometric and non-geometric queries, resulting in faster query performance and better scalability.

EXPERIMENT AND RESULTS
This section is based on real world data and aims to provide a comprehensive evaluation of the experimental systems.To achieve this, section 4.1 presents the experimental environment used in the study, including details about the hardware and software configurations.Section 4.2 provides a detailed description of the experimental dataset.Finally, in section 4.3, the experimental results are presented and compared to provide insights into the strengths and weaknesses of the systems under test.By using real-world data and a rigorous experimental methodology, this study aims to provide valuable insights into the performance and scalability of the experimental systems, which can be used to inform decisions about their use in realworld applications.

Experiment setting
The experiments were performed on a virtual machine environment with a 4GB RAM and 2.80GHz Intel(R) Core(TM) i7-7700HQ CPU.ClickHouse was deployed in a single-node storage configuration, while HBase was set up as a three-node cluster with identical hardware and software configurations.

Dataset
To evaluate the performance and scalability of the systems under test, a range of real-world datasets were used in the experiments.Specifically, the datasets included community data, census data for New York ("New York Census Data," n.d.), building data, and land data for Papua New Guinea in 2018 ("Papua New Guinea Data," n.d.).The size and characteristics of each dataset are summarized in Table 6, which provides details such as the number of records, file size, and data format.By using real-world datasets, this study aims to provide insights into the systems' ability to handle complex and diverse data types, as well as their performance and scalability under realistic conditions.

Experiment results
To compare the efficiency of spatial queries between ClickHouse and HBase for different data volumes and query rates, we randomly selected 1% of the data from the tables for cross-checking during the experiments.The results, as shown in Figure 6 indicate that ClickHouse outperforms HBase in terms of query speed, particularly for large amounts of data.When the data volume in a single table data was 129 records, HBase's response time for a data query reached the second level, while ClickHouse returned the query results in milliseconds.When the data volume in a single table increased to 10,000 records, HBase's response time was about three times longer than ClickHouse's.Furthermore, with the increase in the volume of data, HBase's response time increased significantly.These findings demonstrate that ClickHouse is more efficient than HBase is spatial data query, especially for larger datasets.

CONCLUSIONS
This paper proposes a storage model for spatial data using ClickHouse and R-Tree as an index for spatial queries.To evaluate its performance, we compare the geospatial data queries with the HBase database using the same indexes.The results of the experiments demonstrate that ClickHouse outperforms HBase in geospatial data queries, even when using the same indexes.

Figure 1 .
Figure 1.The overall architecture of ClickHouse for storing and retrieving geospatial data.

Figure 3 .
Figure 3.The minimum bounding rectangle of two polygons intersects, but the polygons do not intersect.

Figure 6 .
Figure 6.The minimum bounding rectangle of two polygons intersects, but the polygons do not intersect.

Table 1 .
Table1shows the structure of the index table, while Table2shows the structure of the detailed wide table.The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-1-2023 39th International Symposium on Remote Sensing of Environment (ISRSE-39) "From Human Needs to SDGs", 24-28 April 2023, Antalya, Türkiye Structure and partial data of index table

Table 2 .
Structure of detailed wide table and partial data.

Table 3 .
Design of index table in HBase.

Table 5 .
Table 5 provides detailed specifications for each of the nodes in the HBase cluster.Node configuration for HBase