INVESTIGATING THE RELATION BETWEEN PREVALENCE OF ASTHMATIC ALLERGY WITH THE CHARACTERISTICS OF THE ENVIRONMENT USING ASSOCIATION RULE MINING

The prevalence of allergic diseases has increased sharply in recent decades owing to contamination of the environment with allergy stimuli. A common treatment is to identify the allergy stimulus and then prevent the patient from being exposed to it. There are, however, many unknown stimuli of allergic diseases that are related to the characteristics of the living environment. In this paper, we focus on the effect of air pollution on asthmatic allergies and investigate the association between the prevalence of such allergies and those characteristics of the environment that may affect air pollution. For this purpose, spatial association rule mining is deployed to mine the association between the spatial distribution of allergy prevalence and air pollution parameters such as CO, SO2, NO2, PM10, PM2.5, and O3 (compiled by the air pollution monitoring stations), as well as living distance to parks and roads. The results for the case study (i.e., the Tehran metropolitan area) indicate that distance to parks and roads as well as CO, NO2, PM10, and PM2.5 are related to allergy prevalence in December (the most polluted month of the year in Tehran), while SO2 and O3 show no significant relation to it.


INTRODUCTION
The prevalence of allergic diseases has increased sharply in recent decades, especially among children, because modern living conditions have contaminated the environment with allergy stimuli, called allergens (Ng et al., 2009; Zöllner et al., 2005). Allergic patients have hypersensitive immune systems that react abnormally to harmless substances. Several factors cause allergic reactions, including genetics, lifestyle and habits, food, and the geography and conditions of the environment (Asher et al., 1995). A common treatment for allergic diseases is to identify the allergen and then prevent the patient from being exposed to it (Douglass and O'Hehir, 2006). There are, however, several unknown stimuli that may cause allergic diseases, many of which are related to the characteristics of the living environment. Therefore, analyzing the data collected about the living environment of allergic patients may help identify the role of environmental parameters in the prevalence of allergies.
As the patients are distributed in space and the relation varies with time, spatio-temporal data mining techniques seem well suited to this task. Spatial data mining concerns the development and application of novel computational techniques to analyze very large spatial databases (Buttenfield et al., 2001; Koperski et al., 1996). A major distinction of spatial data mining is that the attributes of neighboring objects influence each other and thus must be taken into account. Furthermore, the location and extent of spatial objects define implicit relations of spatial neighborhood (such as topological, distance and directional relations), which are used by spatial data mining algorithms (Miller and Han, 2001).
In this paper, we focus on the effect of air pollution on asthmatic allergies and investigate the relation between the prevalence of such allergies and those characteristics of the environment that may affect air pollution. The residence locations of a group of asthmatic allergic patients living in the Tehran metropolitan area, as well as spatial characteristics of the environment (e.g., the locations of parks, roads and air pollution monitoring stations), were placed on a map. We then deployed spatial association rule mining (one of the spatial data mining analyses) to extract the association between asthmatic allergy prevalence and air pollution parameters such as CO (carbon monoxide), SO2 (sulfur dioxide), NO2 (nitrogen dioxide), PM10 and PM2.5 (particulate matter with a diameter of <10 μm and <2.5 μm, respectively), and O3 (ozone), as well as living distance to parks and roads, as major sources of asthmatic allergy irritants.
The rest of the paper is organized as follows. Section 2 surveys previous work related to the topic of this paper, including data mining to study allergy prevalence and spatial association rule mining to discover relations between spatially related parameters. Section 3 introduces spatial association rule mining. In Section 4, the components of the research methodology are described in detail. The results for the case study are presented and discussed in Section 5. Finally, Section 6 contains concluding remarks and ideas for future research in this direction.

RELATED WORK
This section reviews previous research related to the topic of this paper, which is classified into: (i) using data mining techniques to study the prevalence of allergic diseases; and (ii) deploying spatial association rule mining to discover relations between spatially related parameters.
Data mining is an approach to determine valid, novel, useful and understandable patterns from the huge amount of data stored in a database (Miller and Han, 2001). To the best of our knowledge, three studies have used data mining to study allergy outbreaks. Ng et al. (2009) used data mining techniques to predict allergy symptoms among children in Taiwan. They used the allergy data of children under the age of 12 and considered 30 predictor variables including personal factors, health behavior factors, living condition factors, family factors, and allergy-inducing factors. They deployed three predictive models: neural networks, decision trees and support vector machines (SVM). Akinbami et al. (2010) assessed the association between chronic outdoor air pollution exposure and childhood asthma in metropolitan areas across the US. They compiled 12-month average air pollutant levels for SO2, NO2, O3 and PM and linked eligible children to the pollutant levels of the previous 12 months for their county of residence. Finally, logistic regression models were used to estimate asthma attacks. YoussefAgha et al. (2012) studied the application of data mining techniques to predict allergy outbreaks among elementary school children. They used binary logistic regression to determine whether there is any relation between the prevalence of allergies among elementary school children and daily upper-air observations (i.e., temperature, relative humidity, dew point, and mixing ratio) and daily air pollution (CO, SO2, NO2, PM10, PM2.5 and O3). The results of all these studies are plausible. Nevertheless, none of them considered either the spatial or the temporal characteristics of the data to study the prevalence of allergy.
On the other hand, discovering association rules from data stored in spatial databases has been considered in many studies. Mennis and Liu (2003) explored the spatio-temporal association rules among a set of variables characterizing the socioeconomic and land cover changes in the Denver, Colorado region from 1970 to 1990. Shua et al. (2008) used the Apriori algorithm to produce association rules from vegetation and climate change data of north-eastern China. Ladner et al. (2003) studied the correlations of spatially related data such as soil types and directional and geometric relationships. They combined spatial and fuzzy data mining to handle the spatial uncertainty of the data. Finally, Calargun and Yazici (2008) analyzed real meteorological data for Turkey recorded between 1970 and 2007 using a spatio-temporal data cube and the Apriori algorithm in order to generate fuzzy association rules. The results of the two approaches were then compared according to interpretability, precision, utility, novelty, directness, performance and visualization. They also visualized the association rules based on their significance and support values in order to provide a complete analysis tool for a decision support system in the meteorology domain.

SPATIAL ASSOCIATION RULE MINING
Association rule mining seeks interesting association or correlation relationships among a large set of data items, i.e., certain data items that often occur together (Agrawal et al., 1993; Han et al., 2011). An association rule is an implication of the form A → B, where A (the antecedent) and B (the consequent) are sets of predicates. For example, the rule "a person who lives in an area with a very high amount of NO2 and a very high park effect suffers from asthmatic allergy" is expressed as:
(NO2, very high), (park_efct, very high) → (asthmatic_allergy, yes) (1)
If there is only one type of predicate (e.g., park_efct), the association rule is one-dimensional, whereas in multi-dimensional association rules more than one type of predicate is involved.
In order to determine if a rule is significant, reliable and interesting, the concepts of support and confidence are used.
The support is the probability that an item in the database satisfies the set of predicates contained in both the antecedent and the consequent, and the confidence is the probability that an item that contains the antecedent also contains the consequent:
\[ \mathrm{support}(A \rightarrow B) = P(A \cup B) \tag{2} \]
\[ \mathrm{confidence}(A \rightarrow B) = P(B \mid A) \tag{3} \]
The association rules that meet the minimum support and confidence thresholds are called strong association rules and are considered in the decision-making process. A common and influential algorithm for association rule mining is the Apriori algorithm (Agrawal and Srikant, 1994).
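As a minimal sketch of these two measures, the following Python fragment estimates support and confidence as relative frequencies over a handful of records. The records and their values are invented for illustration only and are not taken from the case study.

```python
# Toy records: each patient is a set of (attribute, category) predicates.
# All values below are made up for illustration.
records = [
    {("NO2", "very high"), ("park_efct", "very high"), ("allergy", "yes")},
    {("NO2", "very high"), ("park_efct", "very high"), ("allergy", "yes")},
    {("NO2", "very high"), ("park_efct", "very high"), ("allergy", "no")},
    {("NO2", "low"), ("park_efct", "very high"), ("allergy", "no")},
    {("NO2", "low"), ("park_efct", "low"), ("allergy", "no")},
]


def support(itemset, records):
    """Fraction of records that contain every predicate in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)


def confidence(antecedent, consequent, records):
    """P(consequent | antecedent), estimated from the records."""
    both = support(set(antecedent) | set(consequent), records)
    return both / support(antecedent, records)


A = {("NO2", "very high"), ("park_efct", "very high")}
B = {("allergy", "yes")}
rule_support = support(A | B, records)        # 2 of the 5 records
rule_confidence = confidence(A, B, records)   # 2 of the 3 records matching A
```

Here the rule A → B has a support of 0.4 and a confidence of about 0.67, since three records satisfy the antecedent and two of them also satisfy the consequent.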
On the other hand, to reliably eliminate weak associations, a correlation factor is defined to measure the degree of relation between A and B (Han et al., 2011). Therefore, the extracted rules are evaluated as:
\[ A \rightarrow B \quad [\mathrm{support}, \mathrm{confidence}, \mathrm{correlation}] \tag{4} \]
The Kulczynski measure, which is used to evaluate the correlation, is defined as (Kulczynski, 1927):
\[ \mathrm{Kulc}(A, B) = \frac{1}{2}\left( P(A \mid B) + P(B \mid A) \right) \tag{5} \]
which is a value between 0 and 1. A larger Kulc indicates a stronger relation between A and B.
A spatial association rule contains at least one spatial relationship in an antecedent or consequent predicate (Koperski and Han, 1995). For example, distance_to(road, near) is a spatial predicate that results in a spatial association rule. There are two important issues in dealing with spatial association rules:
(i) Unlike the inputs of non-spatial association rule mining, which are explicitly encoded transactions, spatial relationships are typically embedded within the spatial framework of the geo-referenced data. Therefore, the patterns sought are implicit and the "spatial relationships must be extracted from the data prior to the actual association rule mining" (Shekhar and Chawla, 2003). Nevertheless, pre-processing and storing all combinations of the relations among a massive volume of spatial data is not practically possible. Therefore, there must be a trade-off between pre-processing and on-demand processing of spatial relationships among geographic objects (Klosgen and May, 2002).
(ii) Spatial predicates usually contain numeric data (e.g., metric distance), while conventional association rule mining can only deal with categorical (classified) data. A solution to this problem is to first classify the numeric data into ordinal categories and then mine these ordinal data for association rules (Piatetsky-Shapiro, 1991; Srikant and Agrawal, 1996). For example, metric distance may be categorized into 'very near', 'near', 'far', and 'very far'.

RESEARCH METHODOLOGY
This paper analyzes the risk of asthmatic allergy prevalence based on environmental characteristics by deploying spatial association rule mining to extract the association between the prevalence of asthmatic allergies and those characteristics of the environment that may affect air pollution. The case study is the Tehran metropolitan area. Figure 1 illustrates the research methodology.

Data Pre-processing
The air pollution parameters used are CO, SO2, NO2, PM10, PM2.5, and O3, compiled hourly in December 2013 by Tehran's air pollution monitoring stations (Figure 2). These data are cleaned by filling the gaps and filtering the noise. To reduce these voluminous data to monthly air pollution parameters, the monthly average of the daily maximum values of each parameter is calculated. These values are used to produce a monthly pollution map for each air pollution parameter through Kriging spatial interpolation (Wackernagel, 2003).
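The reduction from hourly readings to one monthly value per station can be sketched as follows. This is only an illustration under simplifying assumptions: the function name and the sample readings are hypothetical, and missing readings are simply skipped here rather than filled as in the cleaning step described above. The Kriging interpolation is not covered.

```python
from collections import defaultdict
from statistics import mean


def monthly_mean_of_daily_max(hourly):
    """Average of the daily maxima for one station and one pollutant.

    hourly: iterable of (day, hour, value); a missing reading is None
    and is skipped in this sketch.
    """
    daily_max = defaultdict(lambda: float("-inf"))
    for day, _hour, value in hourly:
        if value is not None:
            daily_max[day] = max(daily_max[day], value)
    return mean(daily_max.values())


# Hypothetical readings for three days (values are invented).
readings = [
    (1, 8, 40.0), (1, 9, 55.0), (1, 10, None),
    (2, 8, 30.0), (2, 9, 35.0),
    (3, 8, 70.0), (3, 9, 65.0),
]
monthly_value = monthly_mean_of_daily_max(readings)  # mean of 55, 35 and 70
```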

Figure 2. The map of Tehran's roads, parks and air pollution monitoring stations
To model the effect of distance to roads, a map is produced in which the distance to the nearest road is calculated for each location. The same process is applied to model the effect of parks, using Equation (6), which quantifies the effect of nearby parks and in which Tj = the effect of nearby parks at point j, Ai = the area of park i, and dij = the distance of park i from point j.
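The distance-to-nearest-road value of a single location can be sketched as below, assuming that roads are available as polylines in a planar (projected) coordinate system. The function names and coordinates are hypothetical, and the park effect of Equation (6) is deliberately not reproduced.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the line, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_nearest_road(p, roads):
    """roads: list of polylines, each a list of (x, y) vertices."""
    return min(
        point_segment_distance(p, a, b)
        for road in roads
        for a, b in zip(road, road[1:])
    )


# Two invented roads, coordinates in metres.
roads = [
    [(0.0, 0.0), (100.0, 0.0)],
    [(50.0, -50.0), (50.0, 50.0), (80.0, 90.0)],
]
d = distance_to_nearest_road((20.0, 30.0), roads)
```

Evaluating this function on every cell of a regular grid would give a raster comparable to the distance map described above.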

Data Conceptualization
As mentioned in Section 3, the inputs of association rule mining must be categorical values. Therefore, the data items assigned to the patients must be categorized. For this purpose, distance to roads is classified into "very near", "near", "medium" and "far" (Figure 3a). The effect of parks is also classified into "very highly affected", "highly affected", "moderately affected" and "lowly affected" (Figure 3b).
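Such a classification amounts to a lookup against ordered class limits, as in the sketch below. The class limits (in metres) are hypothetical placeholders, since the limits actually used for the maps are not stated in the text.

```python
from bisect import bisect_right

# Hypothetical class limits in metres; not the limits used in the case study.
ROAD_BREAKS = [100.0, 300.0, 600.0]
ROAD_LABELS = ["very near", "near", "medium", "far"]


def categorize(value, breaks, labels):
    """Map a numeric value to an ordinal label.

    labels[i] covers values strictly below breaks[i]; the last label
    covers everything at or above the last break.
    """
    return labels[bisect_right(breaks, value)]


road_classes = [
    categorize(d, ROAD_BREAKS, ROAD_LABELS)
    for d in (40.0, 250.0, 300.0, 1200.0)
]
```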
To categorize the air pollution parameters, the air quality index (AQI), which is an indicator of air quality, is used. As the categorization breakpoints used by the AQI vary from one air pollution parameter to another (Table 1), the following equation is used to normalize the measured values (Mintz, 2012):
\[ I_p = \frac{I_{Hi} - I_{Lo}}{BP_{Hi} - BP_{Lo}} \left( C_p - BP_{Lo} \right) + I_{Lo} \tag{7} \]
where C_p is the value measured for the air pollution parameter p, BP_Lo and BP_Hi are the break points that bracket C_p, and I_Lo and I_Hi are the air quality indices of those break points.
We merge the AQI air pollution categories into "very high", "high", "moderate" and "low" (Figure 4).
Finally, the residence locations of 1000 patients referred to the "Tehran Children's Medical Clinic" in December 2013 are placed on the map. For each patient, a data item is stored that shows whether he/she has asthmatic allergy. Moreover, by overlaying this map with the classified air pollution maps and the maps of distance to roads and parks, the air pollution parameters and the distances to roads and parks are assigned to each point as data items (attributes). Table 2 shows some of the data recorded in the database.
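The AQI normalization is a piecewise linear interpolation between break points, which may be sketched as follows. The break point table below is a placeholder in the style of the US EPA PM2.5 table and is an assumption of this example; the values of Table 1 should be substituted for each pollutant.

```python
# Rows are (BP_Lo, BP_Hi, I_Lo, I_Hi) for ONE pollutant.
# Placeholder values for illustration; use the break points of Table 1.
BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]


def sub_index(c_p, breakpoints=BREAKPOINTS):
    """Interpolate the index linearly between the bracketing break points."""
    for bp_lo, bp_hi, i_lo, i_hi in breakpoints:
        if bp_lo <= c_p <= bp_hi:
            return (i_hi - i_lo) / (bp_hi - bp_lo) * (c_p - bp_lo) + i_lo
    raise ValueError(f"concentration {c_p} is outside the break point table")


index_value = sub_index(40.0)  # falls in the (35.5, 55.4) row
```

With these placeholder rows, a concentration of 40.0 yields an index of about 112. Concentrations that fall between two rows (e.g., 12.05) raise an error in this sketch; in practice the measured value would first be rounded to the precision of the table.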

Spatial Association Rule Mining
Once the multidimensional dataset is constructed, the association rules between asthmatic allergy prevalence and the spatial characteristics of the environment (i.e., air pollution and distance to parks and roads) are extracted. As we are interested in antecedents that result in allergy, we only keep those rules whose consequent is "(allergy, yes)", such as:
[(PM2.5, very high), (park_efct, very high)] → (allergy, yes)
In our case, the minimum support and confidence thresholds are set to 5% and 40%, respectively.
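The extraction step can be illustrated with a simplified level-wise search in the spirit of Apriori. It is a sketch rather than the implementation used in this study: the function name, the `max_len` limit and the toy records are assumptions, and the Kulczynski measure is taken to be the average of the two conditional probabilities P(A|B) and P(B|A).

```python
def mine_rules(records, consequent, min_support=0.05, min_confidence=0.40, max_len=3):
    """Return (antecedent, support, confidence, kulc) for rules A -> consequent.

    records: list of sets of (attribute, category) predicates.
    Antecedents are grown one predicate at a time; a candidate is extended
    only if antecedent and consequent together are frequent.
    """
    n = len(records)
    target = [r for r in records if consequent in r]
    p_b = len(target) / n
    items = sorted({i for r in target for i in r if i[0] != consequent[0]})
    rules, frontier = [], [frozenset()]
    for _ in range(max_len):
        candidates = {a | {i} for a in frontier for i in items if i not in a}
        frontier = []
        for a in candidates:
            both = sum(a <= r for r in target) / n   # support of A and B together
            if both == 0 or both < min_support:
                continue                             # no superset can be frequent
            frontier.append(a)
            conf = both / (sum(a <= r for r in records) / n)
            if conf >= min_confidence:
                kulc = 0.5 * (conf + both / p_b)
                rules.append((sorted(a), both, conf, kulc))
    return rules


def record(no2, park, allergy):
    return {("NO2", no2), ("park_efct", park), ("allergy", allergy)}


# Ten invented patient records.
toy = (
    [record("very high", "very high", "yes")] * 4
    + [record("very high", "very high", "no")]
    + [record("very high", "low", "yes")]
    + [record("low", "low", "no")] * 4
)
rules = mine_rules(toy, ("allergy", "yes"))
```

On this toy dataset, the antecedent [(NO2, very high), (park_efct, very high)] is returned with a support of 0.4 and a confidence of 0.8, while (park_efct, low) alone is rejected because its confidence of 0.2 is below the 40% threshold.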

IMPLEMENTATION RESULTS
Applying the procedure described in Section 4 to the dataset yielded 60 association rules between the prevalence of asthmatic allergy and the characteristics of the environment in December 2013, some of which are shown in Table 3. For example, rule #5, with 6.55% support and 75.52% confidence, says that 6.55% of the statistical population live in locations where the amounts of NO2 and PM2.5 and the effect of nearby parks are very high and also suffer from asthmatic allergy; this is 75.52% of the population who live in such areas. Kulczynski's correlation measure between the antecedent and the consequent of this rule is 64%.
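The three figures reported for rule #5 can be related to one another by simple arithmetic. The derivation below assumes the usual definitions, i.e., support as P(A ∪ B), confidence as P(B | A), and the Kulczynski measure as the average of P(A | B) and P(B | A); since the reported 64% is rounded, the derived values are approximate and are implied quantities, not results reported by this study.

```latex
\begin{align*}
P(A) &= \frac{\mathrm{support}}{\mathrm{confidence}}
      = \frac{0.0655}{0.7552} \approx 0.087 \\
P(A \mid B) &= 2\,\mathrm{Kulc}(A,B) - P(B \mid A)
      \approx 2(0.64) - 0.7552 = 0.5248 \\
P(B) &= \frac{\mathrm{support}}{P(A \mid B)}
      \approx \frac{0.0655}{0.5248} \approx 0.125
\end{align*}
```

Under these assumptions, roughly 8.7% of the patients satisfy the antecedent of rule #5 and roughly 12.5% of the patients have asthmatic allergy.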
Based on the extracted rules, distance to parks and roads as well as CO, NO2, PM10 and PM2.5 affect asthmatic allergy prevalence in December, while SO2 and O3 have no significant relation to it. Moreover, the rules that include "(park_efct, very high)" and for which at least one of the air pollution parameters is high (e.g., rules #2, #4 and #6) have greater confidence than those that have only one of these components (e.g., rules #1, #3 and #7). Research on air pollution and asthmatic allergy confirms this: air pollution may itself trigger the allergy, but it also helps pollen enter the respiratory system (Bartra et al., 2007). On the other hand, rules #10 and #11, which contain "(road, very near)", show no significant increase in confidence, as the effect of this parameter is already manifested in the increase of the air pollution parameters.
Table 3. Some of the rules extracted for December through association rule mining

CONCLUSION
This paper deploys spatial association rule mining to investigate the relation between the prevalence of asthmatic allergies and those characteristics of the environment that may affect air pollution, and thereby maps the risk of asthmatic allergy prevalence based on environmental characteristics. The results for the case study (i.e., the Tehran metropolitan area) show that considering the spatial distribution of the patients as well as classified data items (i.e., attributes) makes it possible to extract more reliable associations, as their interpretation confirms. As the air pollution conditions and pollen vary from time to time, the rules extracted for December may not be applicable to other months.
Here, we only consider distance to parks and roads as parameters that may affect air pollution and asthmatic allergies. In the future, other characteristics of the environment may be taken into account.

Figure 1. Research methodology
Notation of the AQI normalization equation: Ip = the air quality index for the air pollution parameter p; Cp = the value measured for the air pollution parameter p; BPHi = the first break point greater than Cp; BPLo = the first break point less than Cp; IHi = the air quality index for BPHi; ILo = the air quality index for BPLo

Figure 3. The classified maps to show the effect of distance to (a) roads and (b) nearby parks

Figure 4. The categorized maps of different air pollutants in Tehran in December 2013

Table 2. Part of the multidimensional dataset used in association rule mining