AN INTERLINKING APPROACH FOR LINKED GEOSPATIAL DATA

Geospatial metadata from a metadata catalogue can be published as part of the Web of Data using Linked Data technologies. The published data can be referred to as linked geospatial metadata. A key issue in Linked Data technologies is creating links among datasets. There are three important types of RDF links: relationship links, identity links, and vocabulary links. This paper proposes a matching method to construct links between linked geospatial metadata and geospatial datasets in the Linking Open Data (LOD) cloud. The method uses semantic similarity to construct identity links. A matching algorithm using Tversky's contrast model and Jaro-Winkler distance is proposed and evaluated.


INTRODUCTION
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web (Bizer, 2009). Linked Data technologies use the Resource Description Framework (RDF) language and the HTTP protocol to publish structured data on the Web (Bizer, 2008), and have shown great promise for effectively sharing and interlinking Web resources (Berners-Lee, 2006). Semantic Web researchers and practitioners have started to make geospatial data available as Linked Data on the Web, which promotes the sharing and interlinking of geospatial data. For example, LinkedGeoData (Sören, 2009) makes OpenStreetMap data available as RDF.
Creating links is a key issue in Linked Data: links connect the data to an unbounded Web in which one can find all kinds of things (Berners-Lee, 2006). There are three important types of RDF links: relationship links, identity links, and vocabulary links (Heath & Bizer, 2011). Relationship links point from entities in one dataset to entities in another, adding further description to the source dataset. Identity links connect different URIs that identify the same entity. Vocabulary links map the relationships between terms from different vocabularies. This paper proposes a method based on semantic similarity to construct identity links between linked geospatial data. The method is based on Tversky's contrast model (Tversky, 1997), which determines semantic similarity by comparing the properties of two different instances. The Jaro-Winkler distance (Winkler, 1990), which measures the similarity between two strings, is used to compute the similarity of these properties. By combining the two, instances in two datasets can be linked according to their similarity values. The remainder of the paper is organized as follows. Section II introduces related work. Section III describes the method to construct identity links. The conclusion is given in Section IV.

RELATED WORK
The Semantic Web is about "making links, so that a person or machine can explore the Web of Data" (Berners-Lee, 2006).
The interlinking of Linked Data is an important factor for the success of the Semantic Web. Researchers have done much work to publish datasets in RDF on the Web according to the principles of Linked Data. These datasets are interlinked with each other. For example, the World Wide Web Consortium (W3C) Linking Open Data (LOD) community project has published various open datasets as RDF on the Web and set RDF links between data items from different data sources. Figure 1 shows the datasets that have been published and interlinked by the project. DBpedia, part of the Linking Open Data community project, is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. In recent years, an increasing number of data publishers have linked their datasets to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of Data (Auer et al., 2007). GeoNames, another part of the project, is a linked geospatial dataset about place names.
The GeoNames database stores latitude, longitude, elevation, population, and other information about places. In this paper, we use DBpedia as the target dataset to build links.
There are two kinds of approaches to building links: manual and automatic. Manual methods are suitable for small and static datasets. They are effective but require skilled human data publishers (Araújo et al., 2011). When it comes to big datasets, automatic or semi-automatic methods are needed. There are two main types of automatic approaches: key-based and similarity-based (Heath & Bizer, 2011). Several domains have domain-accepted identification codes, for example ISBN numbers in the publication domain. The code may be used as a property value of the resource or as part of its URI. By using the common codes as keys, links between resources can be established. Where there are no common identifiers among different datasets, similarity-based approaches are needed. Multiple properties of resources are usually selected for comparison, and a similarity score is calculated for each. These similarity scores are aggregated, and if the aggregated value is above a given threshold, the resources can be linked.
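The key-based approach described above can be illustrated with a minimal sketch. The record layout, the `isbn` field name, and the example.org URIs are assumptions made for illustration only and do not come from any of the cited systems.

```python
def link_by_key(source, target, key="isbn"):
    """Return (source_uri, target_uri) pairs whose `key` values are equal."""
    # Index the target dataset by the shared identifier.
    index = {}
    for uri, props in target.items():
        value = props.get(key)
        if value:  # skip records that lack the identifier
            index.setdefault(value, []).append(uri)

    # Look up every source record in the index.
    links = []
    for uri, props in source.items():
        for match in index.get(props.get(key), []):
            links.append((uri, match))
    return links


# Hypothetical records; the identifier values are made up.
source = {
    "http://example.org/a/book1": {"isbn": "978-0-00-000000-2"},
    "http://example.org/a/book2": {"isbn": None},
}
target = {
    "http://example.org/b/item9": {"isbn": "978-0-00-000000-2"},
}
links = link_by_key(source, target)
```

Records without a shared identifier, such as `book2` here, are exactly the cases that fall through to the similarity-based approaches.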
Researchers have proposed several linking approaches for linked geospatial data. Barnaghi et al. (2012) propose a platform called Sense2Web to publish Semantic Sensor Network data as linked data and link them manually to resources on the Web of Data. For example, when a user publishes a new sensor, Sense2Web uses the Jena API to query DBpedia and GeoNames for descriptive information such as location and sensor types, which users can then select manually to link with the sensor data. Pschorr et al. (2010) present an automatic approach to publish sensors as linked data. Longitude/latitude pairs are extracted from both the semantic sensor data and GeoNames using SPARQL queries, and links between the semantic sensor data and GeoNames are established automatically from these coordinates. The links support the discovery of sensors through two basic operations (Pschorr et al., 2010): find the named location closest to a given sensor, and find all sensors near a given named location. Yuan et al. (2013) propose an approach to publish geospatial data provenance from a catalog service into the Web of Data using the Linked Data approach. They compare the bounding box of data items from the linked geospatial data provenance with the spatial region of data items from LOD datasets and calculate the topological relation between the two geometries. Once the relation is determined, the dataset is linked to the data item using the corresponding geometric relation, for example 'within'. In this paper, we use the linked geospatial data mentioned above as our source dataset to construct more links to the LOD cloud.
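As an illustration of coordinate-based linking of the kind attributed to Pschorr et al. (2010), the operation "find the named location closest to a given sensor" can be sketched as follows. This is not their implementation: the haversine great-circle distance is an assumed choice of metric, and the place names and approximate coordinates are illustrative values only.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def closest_place(sensor, places):
    """Return the name of the place nearest to sensor = (lat, lon)."""
    lat, lon = sensor
    nearest = min(places, key=lambda p: haversine_km(lat, lon, p[1], p[2]))
    return nearest[0]


# Hypothetical named locations as (name, latitude, longitude).
places = [("Wuhan", 30.59, 114.31), ("Beijing", 39.90, 116.41)]
nearest = closest_place((30.50, 114.40), places)
```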

CONSTRUCTING LINKS TO THE LOD CLOUD
Many different providers may publish the same entities as linked data under different URIs. It is common practice to use the link http://www.w3.org/2002/07/owl#sameAs to state that two URI references refer to the same thing (Bizer, 2007). This section describes a method based on semantic similarity to construct such links between different datasets. Linked geospatial data from a catalogue service (Yuan, 2013) and DBpedia are used as the source and target datasets, respectively.
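For illustration, an identity link of this kind can be serialized as a single RDF triple. The sketch below emits one N-Triples line; the source URI under example.org is hypothetical, and the function name is ours rather than part of any library.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(source_uri, target_uri):
    """Return one N-Triples line stating that both URIs name the same thing."""
    return f"<{source_uri}> <{OWL_SAME_AS}> <{target_uri}> ."


# Hypothetical source URI for an agent in the linked geospatial dataset.
triple = same_as_triple(
    "http://example.org/agent/wuhan-university",
    "http://dbpedia.org/resource/Wuhan_University",
)
```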
Figure 2 describes the properties of an agent, which stands for the provider of geospatial data or services in the linked geospatial dataset. The property dc:title is the agent's name; pro:city is the city in which the agent is located; pro:province is the province; pro:country is the country; and pro:tel is the agent's telephone number. DBpedia provides more detailed information for the entity Wuhan University. Figure 3 shows part of the properties of Wuhan University in DBpedia (http://dbpedia.org/page/Wuhan_University). If links between the two datasets are constructed, more details about the agent in the linked geospatial dataset can be obtained by following the links to DBpedia. The links can be constructed by matching instances from the two datasets using Tversky's contrast model (Tversky, 1997). In the contrast model, the similarity between two entities, A and B, is expressed as a linear combination of the measures of their common and distinctive properties, as shown in Equation 1.

$$S(A, B) = \lambda f(A \cap B) - \alpha f(A - B) - \beta f(B - A) \tag{1}$$

The contrast model is composed of three disjoint set functions.

The term f(A ∩ B) measures the properties common to A and B. The terms f(A − B) and f(B − A) measure the properties that belong to A but not to B, and to B but not to A, respectively. The constants α, β, and λ are the weights of the commonalities and differences in the equation. In the case considered in this paper, the agent in the linked geospatial dataset has a limited set of properties, and most of them have counterparts among the properties in DBpedia. Therefore, only common properties are considered in this paper, and α and β are set to 0. In the source dataset, three properties are selected: dc:title, pro:province, and pro:city. Correspondingly, in the target dataset, i.e. DBpedia, dbpprop:name, dbpprop:province, and dbpprop:city are selected. The equation based on Tversky's contrast model for computing similarity is shown below.

$$S(a, b) = w_n\, n(a, b) + w_p\, p(a, b) + w_c\, c(a, b) \tag{2}$$

where n(a,b), p(a,b), and c(a,b) denote the similarity functions between dc:title and dbpprop:name, pro:province and dbpprop:province, and pro:city and dbpprop:city, respectively, and w_n, w_p, and w_c are the weights assigned to them. Each similarity function is computed with a string-similarity function, namely the Jaro-Winkler distance. The higher the Jaro-Winkler score for two strings, the more similar the strings are. The score is normalized such that 0 means no similarity and 1 is an exact match. Once property matching is done, the similarity between two instances is calculated using Equation 2. If the result is greater than 0.9, an identity link is constructed automatically and an RDF triple is created. If the result is between 0.8 and 0.9, the pair is sent to users for a manual decision. If the result is lower than 0.8, the instances are considered not matched. The matching workflow is shown in Figure 4. Detailed descriptions are provided as follows.
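Before the individual steps, the similarity computation of Equation 2 can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: the Jaro-Winkler function follows the common textbook definition (prefix scale 0.1, prefix length capped at 4, half-transpositions rounded down), the dictionary keys `name`, `province`, and `city` are hypothetical field names, and the default weights (0.6, 0.2, 0.2) are placeholders because the source does not state them.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    # Find characters that match within the sliding window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * scale * (1.0 - score)


def instance_similarity(a, b, weights=(0.6, 0.2, 0.2)):
    """Weighted sum in the form of Equation 2; the weights are assumed."""
    w_n, w_p, w_c = weights
    return (w_n * jaro_winkler(a["name"], b["name"])
            + w_p * jaro_winkler(a["province"], b["province"])
            + w_c * jaro_winkler(a["city"], b["city"]))


example = jaro_winkler("MARTHA", "MARHTA")  # about 0.961
```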

Extracting and preprocessing properties
In the results, some triples describe the resource with a literal, while others provide URIs as links to other resources. In the latter case, the URI is processed and its last part is extracted as the value of the property. This is reasonable because URIs are the names of resources on the Web according to the principles of Linked Data (Berners-Lee, 2006). For example, the last part of the resource http://dbpedia.org/resource/Wuhan is Wuhan, which can be used as the name of the resource. Thus, instances from Step A are preprocessed to extract values for each property.
To improve the accuracy of matching, underscores and commas in values are replaced by spaces. When all of the above steps are done, each instance with its properties is inserted into Postgres as a record, as shown in Figure 7.
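The preprocessing described above can be sketched as a single normalization function. The function name is hypothetical, and collapsing repeated whitespace after the replacement is an added assumption, not something the text specifies.

```python
def property_value(value):
    """Normalize a literal or URI property value into a comparable string."""
    # For URI values, keep only the last path segment as the name.
    if value.startswith(("http://", "https://")):
        value = value.rstrip("/").rsplit("/", 1)[-1]
    # Replace underscores and commas by spaces.
    value = value.replace("_", " ").replace(",", " ")
    # Assumption: collapse runs of whitespace left by the replacement.
    return " ".join(value.split())


name = property_value("http://dbpedia.org/resource/Wuhan")
```

Applied to http://dbpedia.org/resource/Wuhan, the function yields Wuhan, matching the example in the text.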

Matching instances
This step calculates the similarity of every record in the source database with all records in the target database using Equation 2. For each record in the source database, the target record with the maximum similarity value is selected. Then, the maximum value is compared with the thresholds to determine whether to construct a link between the two records.
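The matching step can be sketched as a best-match search followed by a threshold test. The thresholds 0.9 and 0.8 come from the text; the record layout, the labels `link`, `review`, and `no-match`, and the `toy_similarity` function, which merely stands in for Equation 2, are assumptions made for illustration.

```python
def classify(score, auto=0.9, manual=0.8):
    """Map a similarity score to an action using the paper's thresholds."""
    if score > auto:
        return "link"      # construct the identity link automatically
    if score >= manual:
        return "review"    # send to a user for a manual decision
    return "no-match"


def best_matches(source, target, similarity):
    """For each source record, pick the target record with maximum similarity."""
    results = []
    for s_id, s_rec in source.items():
        best_id, best_score = None, -1.0
        for t_id, t_rec in target.items():
            score = similarity(s_rec, t_rec)
            if score > best_score:
                best_id, best_score = t_id, score
        results.append((s_id, best_id, best_score, classify(best_score)))
    return results


def toy_similarity(a, b):
    """Stand-in for Equation 2: fraction of fields that are exactly equal."""
    keys = ("name", "province", "city")
    return sum(a[k] == b[k] for k in keys) / len(keys)


source = {"s1": {"name": "Wuhan University", "province": "Hubei", "city": "Wuhan"}}
target = {
    "t1": {"name": "Wuhan University", "province": "Hubei", "city": "Wuhan"},
    "t2": {"name": "Peking University", "province": "Beijing", "city": "Beijing"},
}
results = best_matches(source, target, toy_similarity)
```

Passing the similarity function as a parameter keeps the search independent of how the score is computed, so a Jaro-Winkler-based implementation of Equation 2 can be substituted without changing the loop.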
Table 3. Matching results

| Interval | [0.9, 1] | (0.8, 0.9) | [0, 0.8] |
|----------|----------|------------|----------|
| Count    | 21       | 3          | 3        |

Table 3 shows the matching results. Among the results, 11% of instances need to be decided manually by users, and 11% of instances cannot be matched. On further checking the records in the target database, some empty strings were found. This is because some properties, such as dbpprop:province, are null in DBpedia. That is also why the SPARQL query in Table II uses the term 'OPTIONAL' for the condition '{?resource dbProperty:province ?province.}', so that instances whose property value is null are also returned. Matching empty strings is meaningless. Therefore, when the property dbpprop:province or dbpprop:city is null, its weight is set to 0 and more weight is given to dbpprop:name. For example, if dbpprop:province is null, the weight for n(a,b) is adjusted to 0.8 and the p(a,b) term drops out of Equation 2.
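A minimal sketch of this weight adjustment is given below. The base weights (0.6, 0.2, 0.2) are an assumption: the source states only that the weight for n(a,b) becomes 0.8 when dbpprop:province is null, and these base values were chosen to be consistent with that figure. The field names are likewise hypothetical.

```python
def adjusted_weights(target, base=(0.6, 0.2, 0.2)):
    """Return (w_n, w_p, w_c), moving the weight of null properties to the name."""
    w_n, w_p, w_c = base  # assumed base weights, not given in the source
    if not target.get("province"):
        w_n, w_p = w_n + w_p, 0.0  # province is null or empty
    if not target.get("city"):
        w_n, w_c = w_n + w_c, 0.0  # city is null or empty
    return w_n, w_p, w_c


# Hypothetical DBpedia record whose province is missing.
weights = adjusted_weights({"name": "Wuhan University", "province": "", "city": "Wuhan"})
```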

CONCLUSION
This paper proposes a matching method based on semantic similarity to construct identity links. The matching algorithm is based on Tversky's contrast model, and the Jaro-Winkler distance is used to match property values. The method is demonstrated by constructing links between a linked geospatial dataset and DBpedia. Further work will explore how to construct identity links with spatial characteristics between geospatial linked data.
Figure 1. Datasets published and interlinked by the Linking Open Data project. By September 2011 there were already 295 datasets consisting of over 31 billion RDF triples, interlinked by around 504 million RDF links.