GEOSPATIAL BIG DATA ANALYTICS FOR SUSTAINABLE SMART CITIES

Growing urbanization causes environmental problems such as large volumes of carbon emissions and pollution all over the world. Smart Infrastructure and Smart Environment are two significant components of the smart city paradigm that can create opportunities for ensuring energy conservation, preventing ecological degradation, and using renewable energy sources. Since a great portion of the data contains location information, geospatial intelligence is a key technology for sustainable smart cities. A holistic framework is needed for the smart governance of cities, utilizing key technological drivers such as big data, Geographic Information Systems (GIS), cloud computing, and the Internet of Things (IoT). Geospatial big data applications offer predictive data science tools such as grid computing and parallel computing for efficient and fast processing to build a sustainable smart city ecosystem. Effective management of big data in the storage, visualization, analytics, and analysis stages can foster the green building, green energy, and net zero targets of countries. Parallel computing systems can scale up analysis on geospatial big data platforms, which is key for ocean, atmosphere, land, and climate applications. This study aims to create the necessary technical infrastructure for smart city applications with a holistic big data management approach. Thus, a smart city model framework is developed for the Smart Environment and Smart Governance components, and a performance comparison of the Dask-GeoPandas and Apache Sedona parallel processing systems is carried out. Apache Sedona performed better in the performance tests of read, write, join and clustering operations.


INTRODUCTION
Increasing urbanisation across the world makes cities more crowded and complex. This situation brings along many social, environmental and economic problems such as housing, traffic density and air pollution. In order to overcome these problems and manage cities effectively, there is a need for sustainable smart cities that utilise information and communication technologies (ICTs) (Batty, 2013; Giffinger et al., 2007; Huang, Yao, Krisp, and Jiang, 2021).
Handling geospatial big data is crucial for sustainable smart cities, since smart city services rely heavily on location-based data. Effective management of big data in the storage, visualization, analytics, and analysis stages can foster the green building, green energy, and net zero targets of countries. The geospatial data science ecosystem has many powerful open source software tools. According to the vision of Pangeo, a community of scientists and software developers working on big data software tools and customized environments, parallel computing systems can scale up analysis on geospatial big data platforms, which is key for ocean, atmosphere, land, and climate applications. These systems allow users to deploy clusters of compute nodes for big data processing.
A smart city can be defined as a modern city that utilises information technologies to improve its services and management and to solve problems affecting the city (Berst, 2012; Li, Batty, and Goodchild, 2020). Emerging after 1990, this concept has brought about a paradigm shift by prioritizing sustainable development as well as information systems. Over the last two decades, sustainable smart city practices have become increasingly widespread around the world, and smart cities have become a multi-stakeholder and sustainable urban phenomenon that includes social, environmental, economic and political approaches as well as technology, replacing the purely digital city concept (Ateş and Erinsel Önder, 2019).
Computers, mobile phones, sensors, and even humans generate massive amounts of data, the size of which is increasing day by day (Batty, 2013). Geospatial big data has the potential to contribute to the development of smart cities, the diagnosis of existing problems, the prediction of changes in cities and optimised decision-making. In addition, intensive spatial data from various data sources paves the way for the development of many innovative applications related to cities. In the literature, there are different use cases of geospatial big data, such as social network analysis, mobility analysis and urban planning with communication network data (Calabrese, Ferrari, and Blondel, 2014; Dong, Wang, and Liu, 2021; Huang, Cheng, and Weibel, 2019); urban analysis and mobility with GPS data; event detection, sentiment analysis, travel orientation analysis and modelling of urban functions with location-based social media data (Hu, Mao, and McKenzie, 2018; Wei and); transportation planning, intelligence and urban planning with smart transportation card data (Huang et al., 2021); shopping behaviour analysis, event management and building occupancy modelling with Wi-Fi and Bluetooth data (Mashuk, Pinchin, Siebers, and Moore, 2021; Trasberg, Soundararaj, and Cheshire, 2021; Versichele et al., 2014); and monitoring physical environments and human behaviour with camera images (Biljecki and Ito, 2021). The applications of the six basic components of smart cities contribute greatly to achieving the vision of sustainable smart cities.
The study has three main objectives. Firstly, it aims to implement the smart environment, smart building and smart infrastructure components based on the energy efficiency of buildings in sustainable smart cities, and to develop the necessary policies on a building basis. Secondly, it aims to ensure the effective management of geospatial big data, which is an important pillar of smart cities. Approaches other than traditional methods are required for the storage, visualisation, analysis and analytics of geospatial big data. In this context, different file formats such as GeoPackage, Shapefile, GeoJSON and GeoParquet are tested for performance in tasks such as reading, writing and spatial analysis, in order to determine the most suitable file format for geospatial big data. Finally, it aims to examine the effectiveness of the parallel processing tools needed for big data analysis and analytics, and to compare the performance of the Apache Sedona and Dask-GeoPandas scalable big data analysis systems.
A new approach is needed for the effective management of geospatial big data, which comes from many different sources and increases in size day by day. Geospatial big data tools play an important role in the successful implementation of components of sustainable smart cities, such as smart governance, smart infrastructure and smart buildings. This study aims to create the necessary technical infrastructure for smart city applications with a holistic big data management approach.

GEOSPATIAL BIG DATA
One of the most important resources feeding smart cities is big data. Big data can be defined as data that is diverse and rapidly increasing in volume (De Mauro, Greco, and Grimaldi, 2015; George, Haas, and Pentland, 2014; Schaffers et al., 2012; Villars, Olofson, and Eastwood, 2011). Due to the inadequacy of traditional data processing methods, different approaches are needed for all steps in the management of large-scale data, from storage to analysis and from analytics to visualization. Although big data is not a new concept, it has become widespread in recent years with the contribution of social media and sensor data, and has become an important information extraction and prediction tool. The software ecosystem developed for big data management includes widely used open source systems such as Hadoop and Spark, with which data storage, processing and analysis can be performed more efficiently.
Big data is characterised by Volume, Velocity, Variety, Veracity and Value, known as the 5Vs. Volume refers to the continuous flow of large quantities of data, which grows day by day. Velocity implies that data flows continuously and quickly. Variety refers to data in different formats from different sources; to ensure the interoperability of big data, data in different formats should be convertible. Veracity is essential for extracting meaningful information from big data: data should contain reliable and accurate information, and meaningless data should be removed. Value expresses the benefit that big data provides after processing.
It is commonly stated that approximately 80% of the world's data contains location information (Franklin and Hane, 1992; Williams, 1987). Today, with the increase in spatial data, the concept of geospatial big data has emerged. Geospatial big data needs to be managed effectively in smart cities. However, it is very difficult to store, process, analyse and visualize large volumes of spatial data with traditional GIS software and hardware. For this reason, high-performance and scalable infrastructures such as parallel processing and cloud computing should be used (Mete and Yomralioglu, 2021).
Parallel processing is a method that uses two or more processors (CPUs) or processor cores in parallel to process a computational task in partitions. Splitting the task into partitions and assigning one partition to each core greatly reduces the time it takes to process the data. Multi-core processors provide better performance and lower power consumption, and their processing power scales with the number of cores. Parallel processing is often used to perform complex computational tasks. Big data computing approaches such as parallel processing are needed for real-time or near real-time processing of data flowing from sources such as the Internet of Things (IoT). Cluster processing, on the other hand, is a method used to process data that does not fit in the memory of a single machine, providing faster computation and improved data integrity. In cluster processing, multiple pieces of hardware perform computational tasks in a cluster connected to a main processing unit. Cluster processing offers advantages in terms of cost, performance and resource utilization compared to traditional computing methods.
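The partition, process and combine pattern described above can be illustrated with a minimal Python sketch. The function names and the sum-of-squares workload are hypothetical and chosen only for illustration; they are not taken from the study. A thread pool is used here so that the sketch runs anywhere; for CPU-bound work in CPython, `ProcessPoolExecutor` from the same module would normally be the better choice because threads share one interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, n_partitions):
    """Split a list into n_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(items), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `remainder` partitions receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def partial_sum_of_squares(partition):
    """Work performed independently on a single partition."""
    return sum(value * value for value in partition)


def parallel_sum_of_squares(items, n_workers=4):
    """Map the partial computation over the partitions, then reduce."""
    partitions = split_into_partitions(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(partial_sum_of_squares, partitions))
    return sum(partial_results)
```

Cluster frameworks apply the same idea at a larger scale: the partitions live on different machines, and only the small partial results travel back to the coordinating process.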
The open source tools Dask-GeoPandas and Apache Sedona have been developed as scalable cluster processing systems in response to the need for effective management of geospatial big data, whose volume has increased in recent years. GeoPandas is an open source library for working with geographic data in Python. It extends the data types of the Pandas library to perform spatial operations on geometry types. Dask provides parallelism and distributed computing, and its dask.dataframe module is designed to scale Pandas dataframe operations. Dask-GeoPandas is a parallel computing project that combines the spatial capabilities of GeoPandas with the scalability of Dask. With its parallel computing approach, Dask-GeoPandas offers a significant performance improvement when processing large geographic data that does not fit in memory.
Apache Sedona, on the other hand, is a cluster processing system developed to display, query and analyze large-scale geospatial data. Sedona extends existing cluster processing systems such as Apache Spark and Apache Flink with distributed spatial datasets and spatial SQL tools to process large spatial data efficiently. Offering significant advantages such as high data processing speed and low power consumption through spatial indexing, partitioning and serialization, Apache Sedona enables data mining and spatial data analytics applications in local computing and cloud environments. Sedona offers API support for the Java, Scala, SQL, Python and R programming languages and can work with many vector and raster geographic data formats such as GeoJSON, Parquet, HDF and GeoTIFF.
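To give an intuition for why spatial partitioning and indexing speed up queries, the following is a minimal pure-Python sketch of a uniform grid index. It is not Sedona's implementation, and all names (`cell_of`, `build_grid_index`, `query_bbox`) are hypothetical. Each grid cell plays the role of one spatial partition, and a bounding-box query only inspects the cells that overlap the box instead of scanning every point.

```python
from collections import defaultdict


def cell_of(x, y, cell_size):
    """Return the integer grid cell that contains the point (x, y)."""
    return (int(x // cell_size), int(y // cell_size))


def build_grid_index(points, cell_size):
    """Group point ids by grid cell; each cell acts as one spatial partition."""
    index = defaultdict(list)
    for point_id, (x, y) in points.items():
        index[cell_of(x, y, cell_size)].append(point_id)
    return index


def query_bbox(index, points, cell_size, xmin, ymin, xmax, ymax):
    """Return ids of points inside the box, scanning only overlapping cells."""
    cx_min, cy_min = cell_of(xmin, ymin, cell_size)
    cx_max, cy_max = cell_of(xmax, ymax, cell_size)
    hits = []
    for cx in range(cx_min, cx_max + 1):
        for cy in range(cy_min, cy_max + 1):
            for point_id in index.get((cx, cy), []):
                x, y = points[point_id]
                # Exact check, because a cell may extend beyond the box.
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(point_id)
    return sorted(hits)
```

In a distributed setting, the same cell key could be used to decide which worker holds which records, so that spatially close features are processed together.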

GEOSPATIAL BIG DATA ANALYTICS FOR SUSTAINABLE SMART CITIES
Big data analysis is an important tool for smart cities with its ability to reveal past trends as well as predict the future. In smart city applications with geospatial big data, urban modelling, transportation/mobility, urban planning and human behaviour stand out as the main study topics (Calafiore, Palmer, Comber, Arribas-Bel, and Singleton, 2021; Chen, Gong, Yang, Ma, and Kan, 2020; Dong et al., 2021; Erdelić et al., 2021; Fan and Stewart, 2021; Hoseinzadeh, Liu, Han, Brakewood, and Mohammadnazar, 2020). Smart cities should be empowered with big data analytics and artificial intelligence for applications that support sustainability, such as environment and energy efficiency (Allam and Dhunny, 2019; Batty, 2018). Artificial Intelligence, Machine Learning, GIS, and Building Information Modelling (BIM) are also used in studies on the implementation of smart city components (Idowu, Saguna, Åhlund, and Schelén, 2016; Li et al., 2020; Yamamura, Fan, and Suzuki, 2017). However, geospatial big data management is not handled in a holistic manner in smart city applications. Although smart cities have a broad scope addressing many different urban issues, they are divided into six main components within the smart city strategy: smart economy, smart people, smart governance, smart transportation, smart environment and smart living (Cohen, 2012; Giffinger et al., 2007). In this study, a smart city model framework is developed for the Smart Environment and Smart Governance components, and the management of geospatial big data in smart cities with GIS, Artificial Intelligence and Machine Learning is discussed (Figure 1).

Figure 1. Geospatial Big Data Administration Model Framework for Sustainable Smart Cities
England and Wales, two countries of the United Kingdom, are selected as the study region because of the availability of openly licensed data (Figure 2). England and Wales are located on the island of Great Britain in the North Atlantic Ocean, with France to the southeast and Ireland to the west. In the application phase of the study, the Pandas, GeoPandas, Dask, Dask-GeoPandas, and Apache Sedona libraries are used in a Python Jupyter Notebook environment. In this context, we carried out a performance comparison of two cluster computing systems: Dask-GeoPandas and Apache Sedona. We also investigated the performance of the novel geospatial data format GeoParquet together with the well-known GeoPackage format.
In the United Kingdom, there is a common vision, together with policy recommendations and industry-wide actions, to achieve the 2050 net zero carbon emission scenario. Read, write, and spatial join operations are conducted on both Dask-GeoPandas and Apache Sedona in order to compare the performance of the two big spatial data frameworks.
In the analysis phase, the Dask-GeoPandas and Apache Sedona libraries are used to analyse spatial data with scalable cluster processing tools. The performance of these two big data processing tools is tested in operations such as spatial join and spatial clustering.
Spatial join is the process of combining the attribute information of features from different data layers based on their spatial relationship. It can be performed with different matching options such as intersects, within, contains, and near (search radius). In this study, the EPC (Energy Performance Certificate) and UPRN (Unique Property Reference Number) point vector datasets are merged on the UPRN ID column, and a spatial join is then performed with the OS Buildings dataset, which has the polygon data type. Thus, the energy performance attributes are added to the building dataset and transferred to the data analytics process together with the other attribute information.
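The two-step workflow described above, an attribute merge on a shared key followed by a point-in-polygon spatial join, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions and not the code used in the study: the record layout, the field names (`uprn`, `rating`, `polygon`) and the brute-force matching loop are hypothetical. Dask-GeoPandas and Apache Sedona provide their own indexed and distributed join operations for real workloads.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def merge_on_uprn(epc_rows, uprn_points):
    """Attribute join: attach coordinates to each EPC row via the UPRN key."""
    merged = []
    for row in epc_rows:
        location = uprn_points.get(row["uprn"])
        if location is not None:
            merged.append({**row, "x": location[0], "y": location[1]})
    return merged


def spatial_join(buildings, points):
    """Spatial join ('within'): copy point attributes to the enclosing building."""
    joined = []
    for building in buildings:
        for point in points:
            if point_in_polygon(point["x"], point["y"], building["polygon"]):
                joined.append(
                    {
                        "building_id": building["id"],
                        "uprn": point["uprn"],
                        "rating": point["rating"],
                    }
                )
    return joined
```

The nested loop compares every building with every point, which is only workable for small inputs; this cost is the reason spatial indexing and partitioning matter for datasets at national scale.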
Parallel computing systems enable much faster data handling than traditional approaches. To compare the performance of the frameworks, local computing hardware (11th Gen Intel Core i7-11800H 2.30 GHz CPU, 64 GB 3200 MHz DDR4 RAM) is used. Figure 3 shows the performance test comparing the geospatial big data parallel processing frameworks. According to the results, Dask-GeoPandas and Apache Sedona outperformed GeoPandas in read, write, and spatial join operations, and Apache Sedona performed better of the two. In addition, the GeoParquet file format was much faster and smaller in size than the GPKG data format. After the spatial join operation, the energy performance attributes are included in the building data.
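A comparison of this kind needs a repeatable way to time each operation. The sketch below shows one possible timing harness; it is an assumption about how such a test could be organised, not the procedure used in the study, and the names `benchmark` and `compare` are hypothetical. Each operation is wrapped in a zero-argument callable (for example, a function that reads a file with one of the frameworks), run several times, and summarised by its median wall-clock time to reduce the influence of outliers such as cold caches.

```python
import statistics
import time


def benchmark(operation, repeats=5):
    """Run a zero-argument callable several times; return median seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def compare(operations, repeats=5):
    """Benchmark each named operation; return (name, seconds) fastest first."""
    timings = {name: benchmark(op, repeats) for name, op in operations.items()}
    return sorted(timings.items(), key=lambda item: item[1])
```

With lazily evaluated frameworks, the callable should force execution (for instance by writing the result or counting rows); otherwise only the construction of the task graph is timed.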
After all necessary attributes are added to the building dataset by the spatial join operation, a clustering technique is used to group buildings according to their location and other attribute information. Cluster analysis is a machine learning approach that groups an unlabelled dataset based on the similarities between its members, using features such as size, shape and colour. Clustering algorithms are divided into five types: Partitional/Centroid-based, Hierarchical, Density-based, Distribution Model-based and Fuzzy (Mete and Yomralioglu, 2022). K-means clustering is the most widely used clustering algorithm.
As an unsupervised learning algorithm, k-means works on the principle of minimizing the variance of the data points within each cluster by performing center-based clustering. Center-based clustering groups data into non-hierarchical clusters. This type of clustering is sensitive to outliers and initial parameters.
Hierarchical clustering is a grouping method used for data with a hierarchical structure such as a taxonomy. With this approach, data is divided into clusters to form a tree-like structure (dendrogram).
Density-based clustering is based on grouping areas with high data density. This approach, which does not include outliers in clusters, does not perform well with data of varying density and high dimensions. DBSCAN is the most widely used density-based clustering method.
The distribution-based clustering approach assumes that the data follows a certain distribution pattern, such as a normal distribution. As the distance from the center of the distribution increases, the probability of a point belonging to the distribution decreases. If the distribution model of the dataset is not known, this approach may not give reliable results.
There is also the fuzzy clustering approach, which is a soft clustering method. In fuzzy clustering, a data point can belong to more than one cluster. As a result of the clustering analysis, each data point receives a membership coefficient for each cluster, indicating its degree of belonging.
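As a small illustration of the membership coefficients mentioned for fuzzy clustering, the sketch below computes the membership of a single point using the standard fuzzy c-means formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to cluster center i and m > 1 is the fuzzifier. The function name and the example values are hypothetical, and the study does not state that fuzzy c-means was applied; the sketch only shows what "belonging to more than one cluster" means numerically.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (sums to 1)."""
    distances = [math.dist(point, center) for center in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in distances):
        zero_count = sum(1 for d in distances if d == 0.0)
        return [1.0 / zero_count if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]
```

For a point at distance 1 from one center and 3 from another, with m = 2, the memberships are 0.9 and 0.1: the point belongs mostly, but not exclusively, to the nearer cluster.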
Within the scope of the study, spatial clustering analysis is performed to group variables with similar characteristics based on location, and smart city indicators are measured through cluster statistics. In order to analyse the energy efficiency level of cities spatially, clustering analysis is performed on the EPC dataset, which is shared on a building basis, and regions of similar energy efficiency level are delineated.
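The text does not state which clustering algorithm or feature set was used for this step. As one plausible illustration, the sketch below implements k-means (Lloyd's algorithm) in plain Python on feature tuples; in practice each tuple might hold a building's coordinates and an energy score, which is an assumption made here for illustration only. The function name and parameters are hypothetical, and features on different scales would normally be standardised before clustering.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers
```

Per-cluster statistics, such as the mean energy score of the members of each cluster, can then be computed from the returned labels and used as indicators.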
In order to observe regional energy efficiency patterns, SQL statements are used to filter the data according to the energy ratings. The query result is visualized using Datashader, which provides highly optimized rendering with distributed systems (Figure 4).
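The kind of SQL filtering described above can be sketched with Python's built-in `sqlite3` module as a stand-in for a spatial SQL engine. The table name, column names and sample rows below are hypothetical and serve only to show the shape of such a query; they are not taken from the EPC dataset or from the study's code.

```python
import sqlite3

# In-memory database with a hypothetical buildings table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE buildings (building_id TEXT, region TEXT, energy_rating TEXT)"
)
connection.executemany(
    "INSERT INTO buildings VALUES (?, ?, ?)",
    [
        ("b1", "North", "B"),
        ("b2", "North", "F"),
        ("b3", "South", "G"),
        ("b4", "South", "C"),
        ("b5", "South", "F"),
    ],
)


def low_efficiency_counts(conn, ratings=("F", "G")):
    """Count buildings per region whose rating is in the given set."""
    placeholders = ", ".join("?" for _ in ratings)
    query = (
        "SELECT region, COUNT(*) FROM buildings "
        f"WHERE energy_rating IN ({placeholders}) "
        "GROUP BY region ORDER BY region"
    )
    return conn.execute(query, tuple(ratings)).fetchall()
```

The rating values are passed as bound parameters rather than formatted into the query string, which keeps the filter reusable for other rating bands.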
After the analyses are completed, geospatial big data analytics is performed with spatial and non-spatial queries in order to interpret the results and to extract meaningful information. Spatial data analytics offers capabilities such as problem detection, prediction and decision optimisation in smart cities through its capacity to answer the questions "What happened?", "Where did it happen?", "Why did it happen?", "What can happen in the future?" and "What steps need to be taken?" (Evans and Lindner, 2012; Huang et al., 2021). In this study, in line with the four levels of data analytics, a system that can describe, diagnose, predict, and prescribe has been created by providing fast and easy access to the information required for smart governance. With the developed geospatial big data analysis and analytics framework for smart cities, it has become possible to inform future energy plans as well as to reveal past trends and the current situation.

Figure 4. Visualisation of Building-scale Energy Efficiency Analytics Using Datashader
Using geospatial big data in smart city applications, the design and implementation of the "Smart Environment", "Smart Infrastructure", "Smart Energy", "Smart Building" and "Smart Governance" components can be addressed, and performance indicators can be defined to measure cities' requirements and maturity targets specific to the relevant themes. Within the scope of the "Smart Environment" and "Smart Energy" applications of cities, strategies can be determined for targets such as energy efficiency and carbon neutrality, the current situation can be measured, and the necessary actions can be defined. In this study, the analyses required for the realization of smart city applications are determined, and the outputs of the conceptual model are obtained through the physical model.

CONCLUSION
Geospatial big data tools play an important role in the successful implementation of components of sustainable smart cities such as smart governance, smart infrastructure and smart buildings. A new approach is needed for the effective management of geographical data, which comes from many different sources and increases in size day by day. This study addresses the question "Can geospatial big data analytics tools foster sustainable smart cities?". The volume, value, variety, velocity, and veracity of big data require approaches different from traditional data handling procedures in order to reveal patterns, trends, and relationships. Using spatial cluster computing systems for large-scale data enables effective urban management in the context of smart cities. Furthermore, sustainable smart cities supported by geospatial big data instruments can help achieve energy policies and action plans such as decarbonization and net zero targets. This study aims to reveal the potential of big data analytics in the establishment of smart infrastructure and smart buildings using large-scale geospatial datasets on state-of-the-art cluster computing systems. In future studies, larger spatial datasets such as Planet OSM can be used on cloud-native platforms to test the capabilities of the geospatial big data tools.