Complex track

General description

Complex alignments are more expressive than simple alignments as their correspondences can contain logical constructors or transformation functions of literal values.

For example, given two ontologies o1 and o2:

A correspondence stating that: Forall x, o1:AcceptedPaper(x) = Exists y, o2:acceptedBy(x,y) is a complex correspondence with a logical constructor (property restriction);
A correspondence stating that: Forall x,y, o1:authorOf(x,y) = o2:writtenBy(y,x) is a complex correspondence with a logical constructor (inverse property);
A correspondence stating that: The values of the o1:fullName property can be obtained by concatenating the values of o2:firstName and o2:lastName is a complex correspondence with a transformation function of literal values (string concatenation).
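
As a rough illustration of the three examples above, the following Python sketch derives o1-style facts from toy o2 data. All identifiers and values are invented for the illustration, and the single space used as separator in the concatenation is an assumption; the track itself expects correspondences in EDOAL, not code.

```python
# Toy o2 data; every identifier and value here is invented for illustration.
o2_accepted_by = {("paper1", "chairA"), ("paper3", "chairB")}  # o2:acceptedBy(x, y)
o2_written_by = {("paper1", "alice"), ("paper2", "bob")}       # o2:writtenBy(paper, author)
o2_names = {"alice": ("Alice", "Martin"), "bob": ("Bob", "Durand")}  # firstName, lastName


def accepted_papers(accepted_by):
    """o1:AcceptedPaper(x) = Exists y, o2:acceptedBy(x, y) (property restriction)."""
    return {x for (x, _y) in accepted_by}


def author_of(written_by):
    """o1:authorOf(x, y) = o2:writtenBy(y, x) (inverse property)."""
    return {(author, paper) for (paper, author) in written_by}


def full_names(names):
    """o1:fullName built by concatenating o2:firstName and o2:lastName.

    The space separator is an assumption of this sketch.
    """
    return {person: f"{first} {last}" for person, (first, last) in names.items()}


print(accepted_papers(o2_accepted_by))
print(author_of(o2_written_by))
print(full_names(o2_names))
```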

With this track, we evaluate systems which can generate such correspondences.

The complex track contains 7 datasets covering 5 domains: Conference and Populated Conference, Hydrography, GeoLink and Populated GeoLink, Populated Enslaved, and Taxon. Each dataset and its evaluation method are presented below.

The participants of the track should output their (complex) correspondences in the EDOAL format. This format is supported by the Alignment API. The evaluation will be supported by the MELT framework this year. The participants have to wrap their tool as described in the MELT framework documentation. The parameters needed to execute the tasks of each dataset (repository, suite-id, version-id) are listed in the boxes below.

The numbers of ontologies and of simple (1:1) and complex ((1:n) and (m:n)) correspondences for each dataset of this track are summarized in the following table.

Dataset	#Ontologies	#(1:1)	#(1:n)	#(m:n)
Conference	3	78	79	0
Populated Conference	5	111	86	98
Hydrography	4	113	69	15
GeoLink	2	19	5	43
Populated GeoLink	2	19	5	43
Populated Enslaved	2	15	0	83
Taxon	4	6	17	3

Schedule

The schedule is available at the OAEI main page.

Datasets and Evaluation Modalities

Conference dataset

Ontologies and correspondences

This dataset is based on the OntoFarm dataset [1] used in the Conference track of the OAEI campaigns. It is composed of 16 ontologies on the conference organisation domain and simple reference alignments between 7 of them. Here, we consider 3 out of the 7 ontologies from the reference alignments (cmt, conference and ekaw), resulting in 3 alignment pairs.

Conference Testsuite

Repository: http://oaei.webdatacommons.org/tdrs/
Suite-ID: conference
Version-ID: conference-v1

The correspondences were manually curated by 3 experts following the query rewriting methodology in [2]. For each pair o1-o2 of ontologies, the following steps were applied:

Create a simple equivalence consensus alignment: ra1 was slightly modified.
For each entity e1 of o1 not in a simple equivalence correspondence, find a semantically equivalent construction from o2 entities. If no equivalence can be found, look for the closest entity or construction from o2 subsumed by e1.
Repeat previous step for each entity of o2 (constructions from o1 entities)

4 experts assessed the curated correspondences to reach a consensus.

Evaluation modalities

The complex correspondences output by the systems will be manually compared to the ones of the consensus alignment.

For this first evaluation, only equivalence correspondences will be evaluated and the confidence of the correspondences will not be taken into account.

The systems can take the ra1 simple alignments as input. The ra1 alignments can be downloaded here.

Populated Conference dataset

In order to allow matchers which rely on instances to participate in the Conference complex track, we propose a populated version of the Conference dataset. 5 ontologies have been populated with more or less common instances, resulting in 6 datasets (6 versions on the repository: v0, v20, v40, v60, v80 and v100).

Populated Conference Testsuite

Repository: http://oaei.webdatacommons.org/tdrs/
Suite-ID: popconf
Version-ID: popconf-v0, popconf-v20, popconf-v40, popconf-v60, popconf-v80, popconf-v100

Evaluation modalities

The alignments will be evaluated based on Competency Questions for Alignment: basic queries that the alignment should be able to cover [6].

The queries are automatically rewritten using 2 systems:

the system from [7], which covers (1:n) correspondences with EDOAL expressions.
a system which compares the answers (sets of instances or sets of pairs of instances) of the source query and the source member of the correspondences and which outputs the target member if both sets are identical.
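
The second, instance-based rewriting system can be pictured with the minimal Python sketch below. It assumes that the answers of the source query and of each correspondence's source member have already been retrieved as sets; the function name and the toy data are hypothetical and do not reflect the actual implementation.

```python
def rewrite_by_instances(query_answers, correspondences):
    """Return the target members whose source member has exactly the query's answers.

    query_answers: set of instances (or of pairs of instances) returned by the
        source query.
    correspondences: iterable of (source_answers, target_member) pairs, where
        source_answers is the set retrieved by the correspondence's source member.
    """
    wanted = frozenset(query_answers)
    return [
        target
        for source_answers, target in correspondences
        if frozenset(source_answers) == wanted
    ]


# Toy data (invented): only the first correspondence has identical answers.
answers = {"paper1", "paper3"}
candidates = [
    ({"paper1", "paper3"}, "Exists y, o2:acceptedBy(x, y)"),
    ({"paper1"}, "o2:CameraReadyPaper(x)"),
]
print(rewrite_by_instances(answers, candidates))
```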

The best rewritten query scores are kept.

A precision score will be given by comparing the instances described by the source and target members of the correspondences.
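
One way to picture such an instance-based comparison is to classify how the two instance sets relate. The sketch below is only an illustration: the category names are assumptions, and how such a relation is turned into a precision score is described at the link given below, not here.

```python
def compare_members(source_instances, target_instances):
    """Classify the relation between the instance sets of the two members."""
    source, target = set(source_instances), set(target_instances)
    if source == target:
        return "equivalent"
    if not source & target:
        return "disjoint"
    if source < target:
        return "source subsumed by target"
    if source > target:
        return "source subsumes target"
    return "overlap"


print(compare_members({"p1", "p2"}, {"p1", "p2"}))
print(compare_members({"p1"}, {"p1", "p2"}))
print(compare_members({"p1", "p3"}, {"p1", "p2"}))
```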

Details on the population and evaluation modalities are given at: https://framagit.org/IRIT_UT2J/conference-dataset-population.

Hydrography dataset

Ontologies and correspondences

The hydrography dataset is composed of four source ontologies (Hydro3, HydrOntology_native, HydrOntology_translated, and Cree), each of which should be aligned to a single target ontology, the Surface Water Ontology (SWO). The source ontologies vary in their similarity to the target ontology: Hydro3 is similar in both language and structure, HydrOntology is similar in structure but is in Spanish rather than English, and Cree is very different in terms of both language and structure. All ontologies can be downloaded at once here.

Hydrography

Repository: http://oaei.webdatacommons.org/tdrs/
Suite-ID: hydrography
Version-ID: hydrography-v1

The alignments were created by a geologist and an ontologist, in consultation with a native Spanish speaker regarding HydrOntology, and consist of logical relations.

Tasks

There are three subtasks in the Hydrography complex alignment track:

Entity Identification
For each entity in the source ontology, the alignment system is asked to list all of the entities in the target ontology that are related to it in some way.

For example:
owl:equivalentClasses(ont1:A1 owl:intersectionOf(ont2:B1 owl:someValuesFrom(ont2:B2 ont2:B3)))

The goal in this task is to find the entities in ont2 that are most relevant to the class ont1:A1. In this case, the best output would be ont2:B1, ont2:B2, and ont2:B3.
Relationship Identification
For each alignment, the system should then endeavor to find the concrete relationships, such as equivalence, subsumption, intersection, value restriction, and so on, that hold between the entities. In terms of the example above, an alignment system needs to eventually determine that the relationship between the two sides is equivalence.
Full complex alignment Identification
This task is a combination of the two former steps.
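
To make the expected output of the entity identification subtask concrete, the sketch below pulls the ont2 entities out of the textual example above. It is a toy illustration that works on the string form of the expression with a regular expression; the helper name is hypothetical, and a real system would work on the ontologies themselves.

```python
import re


def entities_with_prefix(expression, prefix):
    """Return the distinct prefixed names using `prefix`, in order of appearance."""
    names = re.findall(rf"\b{re.escape(prefix)}:\w+", expression)
    return list(dict.fromkeys(names))  # drop duplicates, keep first-seen order


example = (
    "owl:equivalentClasses(ont1:A1 owl:intersectionOf("
    "ont2:B1 owl:someValuesFrom(ont2:B2 ont2:B3)))"
)
print(entities_with_prefix(example, "ont2"))  # ['ont2:B1', 'ont2:B2', 'ont2:B3']
```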

Evaluation modalities

After we collect the results from the matching systems, we plan to use relaxed precision and recall [5] as the metrics to evaluate performance on the three tasks. The full reference alignment can be downloaded from here.
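
As a rough sketch of the idea behind relaxed precision and recall [5], the Python code below replaces the exact-match count by an overlap score that sums a proximity between matched correspondences, then divides it by the size of the found alignment (precision) or of the reference alignment (recall). The proximity function, the greedy one-to-one matching, and the representation of a correspondence as a (source entity, set of target entities) pair are assumptions made for this illustration; they are not the measure used by the organizers.

```python
def jaccard_proximity(a, r):
    """Hypothetical proximity: 0 if sources differ, else Jaccard overlap of targets."""
    (src_a, tgt_a), (src_r, tgt_r) = a, r
    if src_a != src_r:
        return 0.0
    union = tgt_a | tgt_r
    return len(tgt_a & tgt_r) / len(union) if union else 1.0


def relaxed_precision_recall(found, reference, proximity=jaccard_proximity):
    """Relaxed precision = overlap / |found|; relaxed recall = overlap / |reference|."""
    # Score every pair, then greedily keep the best pairs so that each
    # correspondence is matched at most once (a simplification).
    scored = sorted(
        (
            (proximity(a, r), i, j)
            for i, a in enumerate(found)
            for j, r in enumerate(reference)
        ),
        reverse=True,
    )
    used_found, used_ref, overlap = set(), set(), 0.0
    for score, i, j in scored:
        if score > 0 and i not in used_found and j not in used_ref:
            used_found.add(i)
            used_ref.add(j)
            overlap += score
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# Toy alignments (invented): the system finds two of the three target entities.
reference = [
    ("ont1:A1", frozenset({"ont2:B1", "ont2:B2", "ont2:B3"})),
    ("ont1:A2", frozenset({"ont2:C1"})),
]
found = [("ont1:A1", frozenset({"ont2:B1", "ont2:B2"}))]
precision, recall = relaxed_precision_recall(found, reference)
print(round(precision, 3), round(recall, 3))  # 0.667 0.333
```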

GeoLink dataset

Ontologies and correspondences

This dataset is from the GeoLink project, which was funded under the U.S. National Science Foundation's EarthCube initiative. It is composed of two ontologies: the GeoLink Base Ontology (GBO) and the GeoLink Modular Ontology (GMO). All ontologies can be downloaded at once here. The GeoLink project is a real-world use case of ontologies, and its instance data is available. The alignment between the two ontologies was developed in consultation with domain experts from several geoscience research institutions. This alignment is a slightly simplified version of the one discussed in [4]. The relations that involve punning have been removed due to a concern that many automated alignment systems would not consider these as potential mappings. More details can be found in [4].

GeoLink

Repository: http://oaei.webdatacommons.org/tdrs/
Suite-ID: geolink
Version-ID: geolink-v1

Tasks

The same three subtasks as described for the hydrography dataset apply to this dataset also.

Evaluation modalities

The evaluation of the systems will be performed by computing relaxed precision and recall for the three tasks. The reference alignment can be downloaded from here.

Populated GeoLink dataset

Ontologies and correspondences

In order to allow alignment systems that rely on instance data to participate in the GeoLink complex track, we also generate a populated version of the GeoLink dataset. The instance data are real-world data collected from seven data repositories in the GeoLink project. The ontologies with the populated instance data can be downloaded (here). More details of the populated GeoLink dataset can be found in [8].

Populated GeoLink

Repository: http://oaei.webdatacommons.org/tdrs/
Suite-ID: popgeolink
Version-ID: popgeolink-v1

Tasks

The same three subtasks as described for the hydrography dataset apply to this dataset also.

Evaluation modalities

The evaluation of the systems will be performed by computing relaxed precision and recall for the three tasks. The reference alignment can be downloaded from here.

Populated Enslaved dataset

Ontologies and correspondences

This dataset is from the Enslaved project, which was funded by The Andrew W. Mellon Foundation. It is composed of two resources: the Enslaved Ontology and the Enslaved Wikidata Knowledge Graph. The ontologies with the populated instance data can be downloaded (here). The Enslaved project is a real-world use case of ontologies in the domain of people of the historic slave trade, and its instance data is available and already incorporated into the Wikidata repository. The alignment between the two resources was developed in consultation with domain experts from several historical research institutions. More details of the populated Enslaved dataset can be found in [9].

Populated Enslaved

Repository: http://oaei.webdatacommons.org/tdrs/
Suite-ID: popenslaved
Version-ID: popenslaved-v1

Tasks

The same three subtasks as described for the hydrography dataset apply to this dataset also.

Evaluation modalities

The evaluation of the systems will be performed by computing relaxed precision and recall for the three tasks. The reference alignment can be downloaded from here.

Taxon dataset

Dataset description

The OAEI Taxon dataset is composed of four taxonomic registers represented as knowledge bases that contain knowledge about plant taxonomy: AgronomicTaxon, AGROVOC, TAXREF-LD and DBpedia, each having somewhat different geographical or domain coverage. Let us note that the taxonomy domain is not easy to understand for non-experts. A taxon is a scientific hypothesis stating that a set of specimens (biological individuals) belongs to the same taxonomic group (that is to say the taxon) due to some common characteristics, such as similar physical traits. Some scientific names are associated with a taxon. However, the scientific consensus about taxonomy constantly evolves. In light of new scientific evidence, multiple types of recombinations may occur: two taxa could be merged into a single one, an existing taxon may be split into two separate taxa, or a species (a taxon rank) could move to a different genus (the parent rank of species). For example, taxonomists decided that “Delphinus capensis Gray, 1828” and “Delphinus delphis Linnaeus, 1758” are the same biological entity, based on morphological or molecular data. In addition to this, the Code of zoological nomenclature (nomenclature is the set of rules governing scientific names) specifies that this species must be called “Delphinus delphis Linnaeus, 1758” as per the principle of priority. Therefore, a taxon may have a preferred name, the one used to denote the taxon at a certain time, and some synonyms that record changes. All the names and their recombinations are published in the scientific literature.

Ontologies

The Taxon dataset is composed of 4 ontologies which describe the classification of species: AgronomicTaxon, Agrovoc, DBpedia and TaxRef-LD. All the ontologies are populated. The common scope of these ontologies is plant taxonomy. This dataset extends the one proposed in [3] by adding the TaxRef-LD ontology.

The four taxonomic registers in the OAEI Taxon dataset adopt somewhat different approaches to model nomenclatural and/or taxonomic information using the Semantic Web standards. TAXREF-LD models the taxon as an OWL class and the scientific names as instances of skos:Concept. Agrovoc and AgronomicTaxon mix taxonomy and nomenclature, representing the taxon and only its preferred scientific name with the same instance of skos:Concept (they propose an extension of the SKOS model to record taxon information and scientific name information). DBpedia models taxa in various and often inconsistent manners. A species may be an instance of the class Species, with its synonym names given by a specific property. For more information about these different models see [3] and [10].

To conclude, two main challenges make the alignment task difficult:

Challenge C1: The modelling discrepancies described above entail that alignments should be able to "cross" modelling perspectives, e.g. aligning an OWL class with an instance of skos:Concept.
Challenge C2: Situations occur where not all taxonomic registers agree on the taxonomic consensus. For instance, a register A may retain name X as the reference name, while register B may consider name X as a synonym of name Y. Therefore, an ideal alignment system should detect that taxon X in register A is closely related to taxon Y in register B.

Evaluation modalities

In 2022, we will manually evaluate the generated correspondences. The evaluation is blind.

Organizers

Florence Amardeilh (Elzeard.co, France), florence [.] amardeilh [at] elzeard [.] co
Franck Michel (Université Côte d'Azur, CNRS, Inria, France), fmichel [at] i3s [.] unice [.] fr
Catherine Roussey (INRAE Centre Clermont-ARA, laboratoire TSCF, France), catherine [.] roussey [at] inrae [.] fr
Cassia Trojahn (IRIT, Toulouse, France), cassia [.] trojahn [at] irit [.] fr
Ondřej Zamazal (University of Economics, Prague), ondrej [.] zamazal [at] vse [.] cz

References

[1] Ondřej Zamazal, Vojtěch Svátek. The Ten-Year OntoFarm and its Fertilization within the Onto-Sphere. Web Semantics: Science, Services and Agents on the World Wide Web, 43, 46-53. 2017.

[2] Élodie Thiéblin, Ollivier Haemmerlé, Nathalie Hernandez, Cassia Trojahn. Task-Oriented Complex Ontology Alignment: Two Alignment Evaluation Sets. In : European Semantic Web Conference. Springer, Cham, 655-670, 2018.

[3] Élodie Thiéblin, Fabien Amarger, Nathalie Hernandez, Catherine Roussey, Cassia Trojahn. Cross-querying LOD datasets using complex alignments: an application to agronomic taxa. In: Research Conference on Metadata and Semantics Research. Springer, Cham, 25-37, 2017.

[4] Lu Zhou, Michelle Cheatham, Adila Krisnadhi, Pascal Hitzler. A Complex Alignment Benchmark: GeoLink Dataset. In: International Semantic Web Conference. Springer, Proceedings, Part II, pp. 273-288, 2018.

[5] Marc Ehrig, Jérôme Euzenat. Relaxed precision and recall for ontology matching. In: K-CAP 2005 Workshop on Integrating Ontologies, Banff, Canada, 2005.

[6] Élodie Thiéblin. Do competency questions for alignment help fostering complex correspondences?. In EKAW Doctoral Consortium, 2018.

[7] Élodie Thiéblin, Fabien Amarger, Ollivier Haemmerlé, Nathalie Hernandez, Cassia Trojahn. Rewriting SELECT SPARQL queries from 1:n complex correspondences. In: Ontology Matching, pp. 49-60, 2016.

[8] Lu Zhou, Michelle Cheatham, Adila Krisnadhi, Pascal Hitzler. GeoLink DataSet: A Complex Alignment Benchmark from Real-world Ontology. In: Data Intelligence. MIT Press, Proceedings, 2019.

[9] Lu Zhou, Cogan Shimizu, Pascal Hitzler, Alicia M. Sheill, Seila Gonzalez Estrecha, Catherine Foley, Duncan Tarr, Dean Rehberger. The Enslaved Dataset: A Real-world Complex Ontology Alignment Benchmark using Wikibase. In: Conference on Information and Knowledge Management, ACM, 2020.

[10] F. Michel, O. Gargominy, S. Tercerie, C. Faron. A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. Application to the French Taxonomic Register, TAXREF. In Proceedings of the ISWC2017 workshop on Semantics for Biodiversity (S4BioDiv), Vol. 1933. Vienna, Austria: CEUR, 2017.