Ontology Alignment Evaluation Initiative - OAEI-2021 Campaign

Results for OAEI 2021 - Interactive Track


The growth of the ontology alignment area in the past ten years has led to the development of many ontology alignment tools. After several years of experience in the OAEI, we observed that the results have improved only slightly in terms of alignment quality (precision, recall and F-measure). This suggests that fully automatic ontology matching approaches are slowly reaching an upper bound on the alignment quality they can achieve. A study (Jimenez-Ruiz et al., 2012) showed that simulating user interactions with a 30% error rate during the alignment process led to the same results as non-interactive matching. Thus, in addition to the validation of automatically generated alignments by domain experts, we believe that there is further room for improving the quality of the generated alignments by incorporating user interaction. User involvement during the matching process has been identified as one of the challenges facing the ontology alignment community (Shvaiko et al., 2013), and user interaction with a system is an integral part of it.

At the same time, as ontology sizes increase, the alignment problem also grows. It is not feasible for a user to, for instance, validate all candidate mappings generated by a system, so tool developers should aim at reducing unnecessary user interventions. All effort required of the human has to be taken into account, and it has to be in appropriate proportion to the result. Thus, besides the quality of the alignment, other measures such as the number of interactions are interesting and meaningful for deciding which matching system is best suited for a certain matching task. All other OAEI tracks focus on fully automatic matching, and semi-automatic matching is not evaluated although such systems already exist (see, e.g., the overview in (Ivanova et al., 2015)). As long as the evaluation of such systems is not driven forward, it is hardly possible to systematically compare the quality of interactive matching approaches.


In this 9th edition of the Interactive track we use two OAEI datasets, namely Anatomy and Conference. The Anatomy dataset includes two ontologies (1 task), the Adult Mouse Anatomy (AMA) ontology and a part of the National Cancer Institute Thesaurus (NCI) describing the human anatomy. The Conference dataset covers 16 ontologies describing the domain of conference organization. We only use the test cases for which an alignment is publicly available (altogether 21 alignments/tasks).

The quality of the generated alignments in the Anatomy and Conference tracks has been constantly increasing, but in most cases only by a small amount (a few percentage points). For example, in the Conference track in 2013, the best system according to F-measure was YAM++ with 70% (Cuenca Grau et al., 2013). On the other hand, while the best result according to F-measure for the Anatomy track was achieved last year by AML with 94%, there has been very little improvement over the previous few campaigns (only a few percentage points). This shows that there is room for improvement, which could be filled by interactive means.

Experimental Setting

The interactive matching track was organized at OAEI 2021 for the ninth time. The goal of this evaluation is to simulate interactive matching (see (Paulheim et al., 2013), (Dragisic et al., 2016) and (Li et al., 2020)), where a human expert is involved to validate mappings found by the matching system. In the evaluation, we look at how interacting with the user improves the matching results. The SEALS client was modified to allow interactive matchers to ask an Oracle. The interactive matcher can present a correspondence to the Oracle, which then tells the system whether the correspondence is right or wrong. Since 2016, we have extended this functionality: a matcher can present several correspondences to the Oracle simultaneously. Since 2015, in addition to emulating the perfect user, we have considered domain experts with variable error rates, which reflects a more realistic scenario where a (simulated) user does not necessarily provide a correct answer. We experiment with three different error rates: 0.1, 0.2 and 0.3. The errors were randomly introduced into the reference alignment at the given rates.
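
The simulated Oracle described above can be sketched in Python as follows. This is a minimal illustration and not the SEALS client code: the class name `SimulatedOracle`, its methods, the tuple representation of a correspondence, and the choice to cache answers (so that a repeated question receives the same, possibly wrong, answer) are all assumptions made for the example.

```python
import random


class SimulatedOracle:
    """Hypothetical Oracle that validates correspondences with a given error rate."""

    def __init__(self, reference, error_rate, seed=None):
        self.reference = set(reference)  # correspondences assumed to be correct
        self.error_rate = error_rate     # e.g. 0.0, 0.1, 0.2 or 0.3
        self.rng = random.Random(seed)
        self.answers = {}                # assumption: answers are fixed once given

    def ask(self, correspondence):
        """Return True if the Oracle says the correspondence is right."""
        if correspondence not in self.answers:
            truth = correspondence in self.reference
            # With probability error_rate the Oracle gives the wrong answer.
            is_wrong = self.rng.random() < self.error_rate
            self.answers[correspondence] = (not truth) if is_wrong else truth
        return self.answers[correspondence]

    def ask_many(self, correspondences):
        """Validate several correspondences in a single request."""
        return {c: self.ask(c) for c in correspondences}


# Usage with made-up entity names.
reference = {("mouse:heart", "human:Heart", "=")}
oracle = SimulatedOracle(reference, error_rate=0.1, seed=42)
answers = oracle.ask_many([
    ("mouse:heart", "human:Heart", "="),
    ("mouse:heart", "human:Lung", "="),
])
```

With `error_rate=0.0` the sketch behaves like the all-knowing Oracle; higher rates flip individual answers at random.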

Each system was run ten times and the final result of a system for each error rate represents the average of these runs. This is the same configuration that was used in the non-interactive version of the Anatomy track and runtimes in the interactive version of this track are therefore comparable. For the Conference dataset with the ra1 alignment, we considered the macro-average of precision and recall over the different ontology pairs, while the number of interactions is the total number of interactions over all tasks. Finally, the ten runs are averaged.
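
As a rough illustration of the averaging described above, the following Python sketch computes precision and recall per task, macro-averages them over the tasks of one run, and then averages over runs. The function names and the set-based representation of an alignment are assumptions for the example; the actual evaluation code may differ.

```python
from statistics import mean


def precision_recall(found, reference):
    """Precision and recall of a found alignment against a reference alignment."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


def macro_average(pairs):
    """Unweighted mean of (precision, recall) pairs, one pair per task."""
    return mean(p for p, _ in pairs), mean(r for _, r in pairs)


def average_over_runs(runs):
    """runs: one list of per-task (precision, recall) pairs for each run."""
    per_run = [macro_average(run) for run in runs]
    return macro_average(per_run)


# Usage with made-up values: two runs, two tasks each.
runs = [
    [(0.8, 0.6), (0.9, 0.7)],
    [(0.7, 0.6), (0.9, 0.8)],
]
avg_precision, avg_recall = average_over_runs(runs)
```

In a macro-average every ontology pair counts equally, regardless of how many correspondences its reference alignment contains.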


The results are presented for each dataset separately. Columns "Precision Oracle", "Recall Oracle" and "F-measure Oracle" across all tables for all datasets that follow below contain the evaluation results "according to the Oracle", meaning against the Oracle's alignment (i.e., the reference alignment as modified by the randomly introduced errors). They represent the precision, recall, and F-measure of the matching tool as if the Oracle's answers were all correct and show the impact of the Oracle's errors on the tool. The "Total Requests" column represents the number of distinct user interactions with the tool, where each interaction can contain one or more mappings that could be analyzed simultaneously. The mappings that are not conflicting are counted individually; and if more than three mappings are given, they are all counted independently, regardless of whether they are conflicting. This value is given in the "Distinct Mappings" column. The next four columns represent the true positives, true negatives, false positives and false negatives out of all distinct mappings requests. The "Precision" and "Negative Precision" columns are the precision and negative precision of the Oracle itself, i.e., respectively the fraction of positive and negative answers given by the Oracle that are correct. The "Precision" value affects the true positives and false positives, and the "Negative Precision" value affects the true negatives and false negatives.
Measure | Description | Meaning for the evaluation
Precision, Recall, F-measure | Measure the tool's performance against the fixed reference alignment. | Show in absolute terms how the tool performed in the matching problem.
Precision Oracle, Recall Oracle and F-measure Oracle | Measure the tool's performance against the reference as modified by the Oracle's errors. | Show how the tool is impacted by the errors.
Precision and Negative Precision | Measure the performance of the Oracle itself. | Show what type of errors the Oracle made, and thus explain the performance of the tool when faced with these errors.
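
The Oracle's "Precision" and "Negative Precision" follow directly from the four counts of Oracle answers. The Python sketch below assumes the usual reading of those counts (a true positive is a positive answer to a correct mapping, a false negative is a negative answer to a correct mapping, and so on); the function names and the example counts are hypothetical.

```python
def oracle_precision(true_positives, false_positives):
    """Fraction of the Oracle's positive answers that are correct."""
    total = true_positives + false_positives
    return true_positives / total if total else None


def oracle_negative_precision(true_negatives, false_negatives):
    """Fraction of the Oracle's negative answers that are correct."""
    total = true_negatives + false_negatives
    return true_negatives / total if total else None


# Made-up counts for one run: 100 positive and 300 negative answers.
precision = oracle_precision(true_positives=90, false_positives=10)
negative_precision = oracle_negative_precision(true_negatives=270, false_negatives=30)
```

When a system asks mostly about incorrect mappings, the same error rate produces many false positives relative to the few true positives, which is one way to see why "Precision" can fall much faster than "Negative Precision".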

Anatomy Track

The four tables below present the results for the Anatomy dataset with four different error rates. The first two columns in each of the tables present the run time (in seconds) and the size of the alignment. The next eight columns present the precision, recall, F-measure and recall+ obtained from the interactive and non-interactive tracks for the Anatomy dataset. The measure recall+ indicates the amount of detected non-trivial equivalence correspondences. To calculate it, the trivial correspondences (those with the same normalized label, including labels in the oboInOwl namespace) have been removed, as well as correspondences expressing relations other than equivalence. The meaning of the other columns is described above at the beginning of the Results section. Fig. 1 shows the time intervals between the questions to the user/Oracle for the different systems and error rates for the ten runs (the runs are depicted with different colors).
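
The recall+ computation can be sketched in Python as follows. This is an illustration under assumptions: the normalization rule (lowercasing and treating underscores and hyphens as spaces), the label dictionaries, and all names are hypothetical, and the track's actual normalization is not specified here.

```python
def normalize(label):
    """Assumed normalization: lowercase, '_' and '-' as spaces, collapse whitespace."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def recall_plus(found, reference, source_labels, target_labels):
    """Recall restricted to non-trivial equivalence correspondences.

    found, reference: sets of (source, target, relation) tuples.
    source_labels, target_labels: dict mapping an entity to all of its labels.
    """
    nontrivial = set()
    for source, target, relation in reference:
        if relation != "=":
            continue  # only equivalence correspondences are kept
        src = {normalize(l) for l in source_labels.get(source, ())}
        tgt = {normalize(l) for l in target_labels.get(target, ())}
        if src & tgt:
            continue  # trivial: the entities share a normalized label
        nontrivial.add((source, target, relation))
    if not nontrivial:
        return 0.0
    return len(nontrivial & set(found)) / len(nontrivial)


# Usage with made-up entities: only m2-h2 is a non-trivial equivalence.
reference = {("m1", "h1", "="), ("m2", "h2", "="), ("m3", "h3", "<")}
source_labels = {"m1": ["Heart"], "m2": ["hind limb"], "m3": ["x"]}
target_labels = {"h1": ["heart"], "h2": ["Lower_Extremity"], "h3": ["y"]}
score = recall_plus({("m2", "h2", "=")}, reference, source_labels, target_labels)
```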

Error Rate 0.0

Error Rate 0.1

Error Rate 0.2

Error Rate 0.3

We first compare the performance of the three systems with an all-knowing Oracle (0.0 error rate), in terms of precision, recall, F-measure and recall+, to the results obtained in the non-interactive Anatomy track (columns "Precision", "Recall", "F-measure", "Recall+", "Precision Non Inter", "Recall Non Inter", "F-measure Non Inter" and "Recall+ Non Inter"). The effect of introducing interactions with the Oracle/user is most pronounced for the ALIN system, especially for its recall, which increases by 16.1 percentage points. At the same time, AML's recall increases by less than 1 percentage point, whereas LogMap's recall decreases slightly. User interactions affect precision differently: it increases for all systems, and most of all for LogMap, whose precision increases by around 7.3 percentage points. Consequently, ALIN's F-measure increases the most (from 0.835 to 0.934), while the F-measure of LogMap and AML changes only slightly. AML has the highest F-measure for this error rate as well as for the other error rates. It is worth noting that ALIN detects relatively few non-trivial correspondences in the non-interactive track ("Recall+ Non Inter" = 0.438), whereas introducing user interactions led to detecting more of them ("Recall+" = 0.702). Thus it seems that ALIN relies to some extent on user interaction to generate non-trivial mappings. The recall+ of AML improves slightly, by around 1.2 percentage points. In terms of the alignment size and the number of total requests, AML generates the largest alignment, with 1455 correspondences from 189 requests (7.69 times as many correspondences as requests). ALIN and LogMap show similar results: ALIN generates 1363 correspondences with 404 requests (3.37 times), and LogMap generates 1299 correspondences with 388 requests (3.34 times).

With the introduction of an erroneous Oracle/user and the move towards higher error rates, the systems' performance starts to deteriorate slightly in comparison to the all-knowing Oracle. We can observe that the change in the error rate influences the systems differently in comparison to the non-interactive results. AML's performance with an all-knowing Oracle is better on all measures. Comparing its interactive results with its non-interactive results, the F-measure drops below the non-interactive value in the 0.2 and 0.3 cases, while the recall stays higher than the non-interactive results for all error rates except 0.3. LogMap behaves similarly: its F-measure drops below the non-interactive results in the 0.3 case, while its precision stays higher for all error rates. ALIN's recall is higher than its non-interactive results for all error rates. Its F-measure stays higher than the non-interactive results for all error rates except 0.3, and its precision drops already in the 0.1 case. Overall, from error rate 0.0 to 0.3 the F-measure drops by about 18.4 percentage points for ALIN and by around 3.7 and 1.9 percentage points for LogMap and AML respectively. Regarding recall+ in the interactive and non-interactive tracks, ALIN has higher values in the interactive track for all error rates, which means that it generated more non-trivial mappings. AML's recall+ in the interactive track drops below that in the non-interactive track in the 0.2 case. LogMap's recall+ in the interactive track is slightly higher than in the non-interactive track only for the all-knowing Oracle. Now let us examine how precision and recall are affected when moving from error rate 0.0 to 0.3, in connection with the values in the last two columns, "Precision" and "Negative Precision". The "Precision" value drops more than the "Negative Precision" value for ALIN, AML, and LogMap. ALIN has drops of 58.5 and 11.9 percentage points for "Precision" and "Negative Precision" respectively, AML drops by 55.7 and 11.5, and LogMap by 56.1 and 12. Taking into account that "Precision" impacts precision and "Negative Precision" impacts recall, we would expect precision to drop more than recall for ALIN, AML, and LogMap. This is indeed observed: ALIN's precision and recall drop by 23 and 14.2 percentage points respectively, AML's by 3 and 0.8 percentage points, and LogMap's by around 4.9 and 2.7 percentage points. Another observation concerns the values in the "Precision Oracle" and "Recall Oracle" columns. ALIN's and AML's values are approximately constant, which means that the impact of the errors is linear, although ALIN shows a clear decrease in "Precision Oracle" at error rate 0.3. LogMap, as last year, has a noticeable drop in both "Precision Oracle" and "Recall Oracle", which indicates a supralinear impact of the errors.

The size of the alignments produced by AML and LogMap increases slightly with increasing error rates. For ALIN, it increases slightly from 0.0 to 0.2 but drops in the 0.3 case. The numbers of total requests and distinct mappings for AML increase as the error rate grows from 0.0 to 0.3. A different trend is observed for ALIN: there is a decrease from 0.0 to 0.1, a slight increase from 0.1 to 0.2, and a slight decrease again from 0.2 to 0.3. AML presents one mapping per user interaction and does not ask the user the same question twice, as its values for total requests and distinct mappings are almost equal; ALIN, in contrast, combines several mappings in one request to the user. The two numbers for LogMap are the same across the different error rates. Another difference is observed in the ratio of correct to incorrect requests made to the Oracle ("True Positives" and "True Negatives"). In the case of an all-knowing Oracle, all the systems make more incorrect than correct requests. The ratios of correct to incorrect requests for ALIN, AML and LogMap are 0.3, 0.3 and 0.32 respectively. AML, LogMap and ALIN keep almost the same ratios across the different error rates.

For an interactive system, the time intervals at which the user is involved in an interaction are important. Fig. 1 presents a comparison between the systems regarding the time intervals at which the system presents a question to the user. Across the ten runs and the different error rates, the mean request intervals of AML and LogMap are around 1 ms and 0 ms, respectively. The average of ALIN's mean intervals is 5 ms. While there are no significant deviations from this value for LogMap, several outliers are observed for ALIN and AML; for instance, the intervals between a few of the AML requests were between 150 ms and 250 ms. The run time does not change significantly between the different error rates for LogMap, AML or ALIN. LogMap and AML generate alignments in less than a minute.

The takeaway of the above analysis is that all systems perform better with an all-knowing Oracle than in the non-interactive Anatomy track in terms of F-measure. For ALIN, all three measures are higher in the interactive track at the 0.0 error rate than in the non-interactive track. For AML, all three measures (precision, recall, and F-measure) are higher in the interactive track with 0.0 and 0.1 error rates than in the non-interactive track. For LogMap, the recall in the 0.0 case is almost the same as its non-interactive recall and drops below it at the 0.1 error rate. The growth of the error rate impacts different measures in the different systems. The impact of the Oracle errors is linear for ALIN and AML and supralinear for LogMap. The ratio of correct to incorrect questions to the Oracle is different in the three systems and further changes with the increasing error rates, which could be an indicator of different strategies to handle user errors.

Comparison to 2020

AML and LogMap have participated in the track since it was established in 2013. ALIN has participated since 2016. AML and LogMap show results similar to those from 2020. For ALIN, there is a slightly increasing trend in terms of F-measure, recall+, the number of requests and the number of distinct mappings, but a decrease in precision. ALIN improves a new strategy to standardize terms with grammars, which contributes to the increase in the number of distinct mappings. As last year, the impact of the Oracle's errors is linear for AML and ALIN and supralinear for LogMap.

Fig. 1. The Y axis depicts the time intervals between the requests to the user/Oracle (whiskers: Q1 - 1.5 IQR, Q3 + 1.5 IQR, where IQR = Q3 - Q1). The labels under the system names show the average number of requests and the mean time between the requests for the ten runs.
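
The whisker bounds named in the caption can be computed as in the following Python sketch. The function names are hypothetical, the sample values are made up, and the quartile method (`method="inclusive"` in the standard `statistics.quantiles`) is an assumption, since the method used to draw the figure is not stated.

```python
from statistics import quantiles


def whisker_bounds(intervals_ms):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) for a list of request intervals."""
    q1, _median, q3 = quantiles(intervals_ms, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(intervals_ms):
    """Intervals that fall outside the whisker bounds."""
    low, high = whisker_bounds(intervals_ms)
    return [x for x in intervals_ms if x < low or x > high]


# Made-up request intervals in milliseconds, with one long pause.
sample = [1, 2, 3, 4, 5, 200]
low, high = whisker_bounds(sample)
long_pauses = outliers(sample)
```

Under this convention, a single long pause between two requests shows up as an outlier rather than stretching the whiskers.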

Conference Track

The four tables below present the results for the Conference dataset with four different error rates. The columns are described above at the beginning of the Results section. Fig. 2 shows the average request intervals per task (21 tasks in total per run) between the questions to the user/Oracle for the different systems and error rates, for all tasks and the ten runs (the runs are depicted with different colors). The first number under the system name is the average number of requests and the second number is the mean of the average request intervals over all tasks and runs.

Error Rate 0.0

Error Rate 0.1

Error Rate 0.2

Error Rate 0.3

Fig. 2. The Y axis depicts the average time between the requests per task in the Conference dataset (whiskers: Q1 - 1.5 IQR, Q3 + 1.5 IQR, where IQR = Q3 - Q1). The labels under the system names show the average number of requests and the mean time between the requests (calculated by taking the average of the average request intervals per task) for the ten runs and all tasks.

When systems are evaluated using an all-knowing Oracle, all the participants present a higher F-measure than the one they obtained non-interactively. ALIN shows the greatest improvement, with around 20 percentage points increase in F-measure, and also the highest F-measure in absolute terms when questioning the Oracle. AML and LogMap improved by 5.1 and 3.7 percentage points respectively. The substantial improvement of ALIN is mostly supported by gains in recall (around 26.2 percentage points), whereas LogMap and AML show more balanced improvements.

When an error rate is introduced, ALIN is the system that is most severely affected, losing around 10.7, 8.7 and 5.7 percentage points in F-measure with each increase in error rate. On the other hand, LogMap is resilient, with losses between 2.4 percentage points and 1 percentage point. In the tests with error rates 0.1, 0.2, and 0.3, AML produced the highest F-measure. When comparing the systems' performance across the different error rates with their non-interactive results, we note that the precision drops earlier than the recall. ALIN's recall is higher than its non-interactive results even in the 0.3 case (its relatively low recall in the non-interactive track also contributes to this result), but its precision drops already for the smallest error rate. AML's precision drops under the non-interactive results in the 0.2 case, while its recall drops under the non-interactive results in the 0.3 case. LogMap's precision drops in the 0.3 case while its recall stays almost the same as the non-interactive results. The larger drop in "Precision" than in "Negative Precision" values between the 0.0 and 0.3 error rates explains the larger decrease in precision (around 41.6, 18.6 and 9 percentage points) than in recall (around 9.6, 4.8 and 2.4 percentage points) for the three systems (ALIN, AML, and LogMap respectively). The impact of Oracle errors ("Precision Oracle" and "Recall Oracle" columns) on the systems' behavior for the Conference dataset is similar to the impact of the errors on their behavior for the Anatomy dataset: it is linear for ALIN and AML and supralinear for LogMap.

The number of total requests posed by each system is different. LogMap poses the lowest number (82 for every test), while AML poses around 245 on average and ALIN around 250 total requests. Analysing the values for "True Positives" and "True Negatives", we observe that the systems mainly ask the Oracle about incorrect mappings (true negatives): in the 0.0 case the ratio is 0.13, 0.24 and 0.25 for ALIN, AML, and LogMap respectively. This ratio stays almost the same for ALIN and LogMap when the error rate increases, whereas for AML it increases to 0.35 in the test with error rate 0.3, which indicates that AML starts to ask more positive questions as the error rate increases. The time between requests is low for all systems, and the total run time is almost the same across the error rates.

The takeaway of the above analysis is that all systems perform better on all measures with an all-knowing Oracle than in the non-interactive Conference track. At the 0.1 error rate, AML and LogMap still perform better on all measures, while ALIN's precision drops below its non-interactive precision. Its F-measure, however, stays higher than the non-interactive value for all error rates except 0.3, due to the improvement in recall.

Comparison to 2020

The trends in precision, recall, and F-measure between the interactive and non-interactive results are similar for each system. ALIN shows an increase on all measures. The influence of the different error rates on these measures is also similar. AML and LogMap have similar results in comparison to last year. ALIN has a slight decrease on all measures as the error rate increases.


While LogMap and AML make use of user interactions exclusively in the post-matching steps to filter their candidate mappings, ALIN can also add new candidate mappings to its initial set. LogMap and AML both request feedback only on selected mapping candidates (based on their similarity patterns or their involvement in unsatisfiabilities). LogMap presents three mappings to the user at a time, while AML presents only one. ALIN also employs a feature similar to LogMap's, analysing several mappings simultaneously, and can present up to three mappings together to the user if a full entity name in one candidate mapping is the same as an entity name in another candidate mapping. The difference between ALIN and LogMap is that the latter poses three questions (">", "<" and "=") for every candidate mapping.

With the all-knowing Oracle, all systems perform better than their non-interactive results for both datasets. Although the systems' performance deteriorates when moving towards larger error rates, there are still benefits from the user interaction: some of the systems' measures stay above their non-interactive values even for the larger error rates. For the Anatomy dataset, different measures are affected in the different systems in comparison to their non-interactive results; for the Conference dataset, the precision drops under the non-interactive values faster than the recall. With increasing error rates, the drop in precision and recall for all systems is larger for the Conference dataset than for the Anatomy dataset.

The impact of the Oracle's errors is linear for ALIN and AML and supralinear for LogMap for both datasets. One difference between the two datasets is the ratio of correct to incorrect requests to the Oracle. There is a clear trend for all three systems in the Conference dataset: the ratio stays around or under 0.5 for all error rates. In the Anatomy dataset the ratio changes differently for each system.

Two models for system response times are frequently used in the literature (Dabrowski et al., 2011): Shneiderman and Seow take different approaches to categorizing response times. Shneiderman takes a task-centred view and sorts response times into four categories according to task complexity: typing and mouse movement (50-150 ms), simple frequent tasks (1 s), common tasks (2-4 s) and complex tasks (8-12 s). He suggests that the user is more tolerant of delays as the complexity of the task at hand grows. Unfortunately, no clear definition is given for how to determine the task complexity. Seow's model looks at the problem from a user-centred perspective by considering the user's expectations towards the execution of a task: instantaneous (100-200 ms), immediate (0.5-1 s), continuous (2-5 s) and captive (7-10 s). Ontology alignment is a cognitively demanding task and can fall into the third or fourth category in both models. In this regard, the response times (request intervals, as we call them above) observed with both the Anatomy and Conference datasets fall within the tolerable and acceptable response times, and even into the first categories, in both models. The request intervals for LogMap stay around 1 ms for both datasets. ALIN's request intervals are 35 ms on average for the Anatomy dataset and less than 1 ms for the Conference dataset. AML's request intervals are around 4 ms for the Anatomy dataset and less than 1 ms for the Conference dataset. It could be the case, however, that the user cannot take advantage of very low response times, because the task complexity may result in a higher user response time (analogously, the time the user needs to respond to the system after the system is ready).
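
To make the comparison with Seow's categories concrete, the Python sketch below maps a request interval to a category. It is only an illustration: the category ranges are those quoted above, but treating each range's upper limit as a threshold (so that the gaps between ranges are assigned to the next category) is an assumption of this example, as are the names used.

```python
# Upper limits in milliseconds, taken from the ranges quoted for Seow's model.
# Assumption: an interval belongs to the first category whose upper limit it
# does not exceed, which also assigns the gaps between the quoted ranges.
SEOW_UPPER_LIMITS_MS = [
    ("instantaneous", 200),
    ("immediate", 1_000),
    ("continuous", 5_000),
    ("captive", 10_000),
]


def seow_category(interval_ms):
    """Return the assumed Seow category for a response time in milliseconds."""
    for name, upper_limit in SEOW_UPPER_LIMITS_MS:
        if interval_ms <= upper_limit:
            return name
    return "beyond captive"


# The request intervals reported above (1 ms, 4 ms, 35 ms) all fall into
# the first category under this reading.
categories = {ms: seow_category(ms) for ms in (1, 4, 35)}
```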


Ernesto Jimenez-Ruiz, Bernardo Cuenca Grau, Yujiao Zhou, Ian Horrocks. "Large-scale Interactive Ontology Matching: Algorithms and Implementation". ECAI 2012. [pdf]

Pavel Shvaiko, Jérôme Euzenat. "Ontology matching: state of the art and future challenges". Knowledge and Data Engineering 2013. [publisher page]

Valentina Ivanova, Patrick Lambrix, Johan Åberg. "Requirements for and evaluation of user support for large-scale ontology alignment". ESWC 2015. [publisher page]

Bernardo Cuenca Grau, Zlatan Dragisic, Kai Eckert, et al. "Results of the Ontology Alignment Evaluation Initiative 2013". OM 2013. [pdf]

Heiko Paulheim, Sven Hertling, Dominique Ritze. "Towards Evaluating Interactive Ontology Matching Tools". ESWC 2013. [pdf]

Zlatan Dragisic, Valentina Ivanova, Patrick Lambrix, Daniel Faria, Ernesto Jimenez-Ruiz and Catia Pesquita. "User validation in ontology alignment". ISWC 2016. [paper] [technical report]

Huanyu Li, Zlatan Dragisic, Daniel Faria, Valentina Ivanova, Ernesto Jimenez-Ruiz, Patrick Lambrix and Catia Pesquita. "User validation in ontology alignment: functional assessment and impact". The Knowledge Engineering Review, 2020. [DOI]

Jim Dabrowski, Ethan V. Munson. "40 years of searching for the best computer system response time". Interacting with Computers 2011. [publisher page]


This track is currently organized by Huanyu Li, Mina Abd Nikooie Pour, Ying Li, Daniel Faria, Ernesto Jimenez Ruiz, Patrick Lambrix, and Catia Pesquita. If you have any problems working with the ontologies or any suggestions related to this track, feel free to write an email to huanyu [dot] li [at] liu [dot] se


We thank Zlatan Dragisic and Valentina Ivanova for their contributions in setting up this track.

We thank Dominique Ritze and Heiko Paulheim, the organisers of the 2013 and 2014 editions of this track, who were very helpful in the setting up of the 2015 edition.

The track is partially supported by the Optique project.