HADOOP-BASED DISTRIBUTED SYSTEM FOR ONLINE PREDICTION OF AIR POLLUTION BASED ON SUPPORT VECTOR MACHINE
Keywords: Urban Air Pollution, Online Prediction, Big Data, Spatial Analysis, Distributed Computing, Support Vector Machine
Abstract. The critical impact of air pollution on human health and environment in one hand and the complexity of pollutant concentration behavior in the other hand lead the scientists to look for advance techniques for monitoring and predicting the urban air quality. Additionally, recent developments in data measurement techniques have led to collection of various types of data about air quality. Such data is extremely voluminous and to be useful it must be processed at high velocity. Due to the complexity of big data analysis especially for dynamic applications, online forecasting of pollutant concentration trends within a reasonable processing time is still an open problem. The purpose of this paper is to present an online forecasting approach based on Support Vector Machine (SVM) to predict the air quality one day in advance. In order to overcome the computational requirements for large-scale data analysis, distributed computing based on the Hadoop platform has been employed to leverage the processing power of multiple processing units. The MapReduce programming model is adopted for massive parallel processing in this study. Based on the online algorithm and Hadoop framework, an online forecasting system is designed to predict the air pollution of Tehran for the next 24 hours. The results have been assessed on the basis of Processing Time and Efficiency. Quite accurate predictions of air pollutant indicator levels within an acceptable processing time prove that the presented approach is very suitable to tackle large scale air pollution prediction problems.