This test case aims at providing a challenging task for ontology matchers in the domain of large directories.
The Precision, Recall and F-Measure of the systems on the web directories dataset are presented in the next three figures.
Precision for web directories matching task.
As in OAEI-2005, 7 matching systems were evaluated on the dataset. However, only one of them (Falcon) participated in both evaluations. In general, the systems achieved better results than in OAEI-2005. The average Recall of the systems increased from 22.23% to 25.82%. The highest Recall (45.47%) was achieved by the Falcon system, a relative increase of almost 50% over its result last year (31.17%).
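The "almost 50%" figure is a relative change, not a difference in percentage points. The short sketch below reproduces the arithmetic from the Recall values quoted above; the helper name `relative_increase` is introduced here for illustration only.

```python
def relative_increase(old, new):
    """Relative change from old to new, expressed in percent of old."""
    return (new - old) / old * 100.0

# Recall values quoted in the text (in percent).
falcon_gain = relative_increase(31.17, 45.47)   # Falcon, OAEI-2005 vs. this year
average_gain = relative_increase(22.23, 25.82)  # average over all systems

print(f"Falcon Recall: +{falcon_gain:.1f}% relative")
print(f"Average Recall: +{average_gain:.1f}% relative")
```

In absolute terms, Falcon's Recall rose by 14.30 percentage points and the average Recall by 3.59 percentage points.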
Recall for web directories matching task.
Despite this progress, the dataset remains difficult for the matching systems. The maximum and average values for Precision (40.5% and 34.5%), Recall (45.47% and 25.82%) and F-Measure (42.85% and 28.56%) are significantly lower than the corresponding values in, for example, the benchmark tests.
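For readers unfamiliar with the three measures, the following is a minimal sketch of how they are commonly computed from a set of returned mappings and a set of reference mappings. It assumes the usual definitions, with F-Measure taken as the harmonic mean of Precision and Recall; this section does not state the exact protocol used in the evaluation, and the function name and toy mappings are hypothetical.

```python
def precision_recall_f(found, reference):
    """Standard Precision, Recall and F-Measure over sets of mappings.

    Assumption: `reference` is treated as the complete set of correct
    mappings, which may differ from the protocol used in the evaluation.
    """
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy data: each mapping is a pair of node identifiers (hypothetical).
reference = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
found = {("a", "x"), ("b", "y"), ("e", "v")}

p, r, f = precision_recall_f(found, reference)
print(f"Precision={p:.2%} Recall={r:.2%} F-Measure={f:.2%}")
```

In the toy data, two of the three returned mappings are correct and two of the four reference mappings are found.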
F-Measure for web directories matching task.
The partitioning of positive and negative mappings according to the systems' results is presented in the two following charts.
Partition of the systems results on positive mappings.
As the figures show, 43% of the positive mappings were not found by any of the systems. At the same time, 22% of the negative mappings were found by all the matching systems (i.e., all the matching systems mistakenly returned them as positive). Moreover, only 10% of the positive mappings were found by all the matching systems.
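A partition of this kind can be obtained by counting, for each reference mapping, how many systems returned it. The sketch below illustrates the idea on toy data; the function name, the mapping identifiers and the data layout are assumptions and are not taken from the evaluation itself.

```python
from collections import Counter

def partition_by_agreement(mappings, system_results):
    """Share of `mappings` returned by exactly k systems, for k = 0..n."""
    results = [set(r) for r in system_results]
    total = len(mappings)
    if total == 0:
        return {k: 0.0 for k in range(len(results) + 1)}
    # For every mapping, count the systems whose output contains it.
    counts = Counter(sum(m in r for r in results) for m in mappings)
    return {k: counts.get(k, 0) / total for k in range(len(results) + 1)}

# Toy data: five positive mappings and the outputs of three systems.
positives = ["m1", "m2", "m3", "m4", "m5"]
systems = [{"m1", "m2"}, {"m1", "m3"}, {"m1", "m2", "m4"}]

shares = partition_by_agreement(positives, systems)
for k, share in shares.items():
    print(f"found by {k} system(s): {share:.0%}")
```

The entry for k = 0 corresponds to mappings found by none of the systems, and the entry for k = n to mappings found by all of them. Applying the same function to the negative mappings gives the share that every system mistakenly returned.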
Partition of the systems results on negative mappings.
Six of the 7 systems that participated in the evaluation presented results for only one of the dataset representations, namely the representation composed of 4639 node matching tasks. Only one system (H-Match) also presented results for the other representations. This can be interpreted as a lack of scalability.
The blind evaluation deprived some of the systems of the possibility to improve their final results after the disclosure of the preliminary results. For example, the final results of the COMA and PRIOR matching systems were slightly lower than their preliminary results. The final F-Measure of COMA dropped from 32.56% to 28.84%, while the F-Measure of PRIOR dropped from 28.32% to 28.29%.