- Global characteristics of results delivered by participants;
- (1) Evaluation based on reference alignments;
- (2) Evaluation based on manual labelling with sampling;
- (3) Evaluation based on Data Mining method;
- (4) Evaluation based on Logical Reasoning;

This year there were eight participants:

- *AgreementMaker (AgrMaker)*
- *AROMA*
- *ASMOV*
- *CODI*
- *Ef2Match*
- *Falcon*
- *GeRMeSMB*
- *SOBOM*

- All participants delivered all 120 alignments via the SEALS platform.
- The CODI matcher delivered only 'certain' correspondences; the other matchers delivered correspondences with graded confidence values between 0 and 1.

This year we evaluated the participants' results with the four evaluation methods listed above.

The table below shows the results of all eight participants
with regard to the reference alignment: the traditional precision
(P), recall (R), and F-measure (F-meas), computed for three different
thresholds (0.2, 0.5, and 0.7). The F-measure is the harmonic
mean of precision and recall.
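As a minimal sketch of how these three measures can be computed at a given threshold, assume an alignment is represented as a mapping from correspondences to confidence values and the reference alignment as a set of correct correspondences. The function name, the data layout, and the inclusive cut (`>=`) are illustrative assumptions, not the evaluation code used for the table.

```python
def precision_recall_f(alignment, reference, threshold):
    """Return (precision, recall, F-measure) for one alignment.

    alignment: dict mapping a correspondence (any hashable) to its confidence.
    reference: set of correct correspondences.
    threshold: correspondences with confidence >= threshold are kept
               (the inclusive cut is an assumption of this sketch).
    """
    selected = {c for c, conf in alignment.items() if conf >= threshold}
    true_positives = len(selected & reference)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data, not taken from the evaluation.
alignment = {("a", "x"): 0.9, ("b", "y"): 0.6, ("c", "z"): 0.3}
reference = {("a", "x"), ("b", "y"), ("d", "w")}
for t in (0.2, 0.5, 0.7):
    print(t, precision_recall_f(alignment, reference, t))
```

Raising the threshold keeps fewer correspondences, which in this toy case trades recall for precision.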

For a better comparison, we determined for each matcher the optimal threshold, i.e., the one yielding the highest average F-measure; see the table below. The dependency of the F-measure on the threshold can be seen in the figure below. The table gives precision, recall, and F-measure at the optimal threshold. An asterisk in the threshold column marks matchers that did not provide graded confidence values. The matcher with the highest average F-measure (62%) is CODI, which did not provide graded confidence values. Other matchers are very close to this score (ASMOV 60%, Ef2Match 60%, Falcon 59%). However, we should take into account that this evaluation was made over only a small part of all alignments (one fifth).
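The search for such a threshold can be sketched as follows, assuming each test case is a pair of a matcher's alignment (correspondence to confidence) and its reference alignment (a set). The candidate thresholds, the helper names, and the tie-breaking rule (the lowest threshold wins) are assumptions of this sketch, not details given in the evaluation.

```python
def f_measure(alignment, reference, threshold):
    """F-measure of the correspondences with confidence >= threshold."""
    selected = {c for c, conf in alignment.items() if conf >= threshold}
    true_positives = len(selected & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_threshold(test_cases, thresholds):
    """Return (threshold, average F-measure) maximising the average F-measure.

    test_cases: list of (alignment, reference) pairs for one matcher.
    Ties are resolved in favour of the first threshold tried.
    """
    best_t, best_avg = None, -1.0
    for t in thresholds:
        avg = sum(f_measure(a, r, t) for a, r in test_cases) / len(test_cases)
        if avg > best_avg:
            best_t, best_avg = t, avg
    return best_t, best_avg


# Hypothetical toy data: two test cases for one matcher.
test_cases = [
    ({1: 0.9, 2: 0.4, 3: 0.2}, {1, 2}),
    ({4: 0.8, 5: 0.3}, {4}),
]
print(best_threshold(test_cases, [0.2, 0.5, 0.7]))
```

For a matcher without graded confidence values every threshold selects the same correspondences, so the search is not informative; this is consistent with marking such matchers by an asterisk.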

We can also compare the participants' performance with that of the last two years (2008 and 2009). Three of this year's matchers also participated during that period. ASMOV participated in all three consecutive years with an increasing highest average F-measure: from 43% in 2008 and 47% in 2009 to 60% in 2010. AgreementMaker achieved a highest average F-measure of 57% in 2009 and 58% in 2010. Finally, AROMA achieved the same highest average F-measure in both 2009 and 2010.

This year we did not receive any alignments with subsumption relations, so we did not compute 'restricted semantic precision and recall'.

This year we take as the population for each matcher its most secure correspondences, i.e., those with the highest confidence. Specifically, we evaluate 100 correspondences per matcher, randomly chosen (sampled) from all correspondences with confidence 1.0 across all 120 alignments. Because AROMA, ASMOV, Falcon, GeRMeSMB and SOBOM do not have enough correspondences with confidence 1.0, we take their 100 correspondences with the highest confidence. For all of these matchers, except ASMOV (where we found exactly 100 correspondences with the highest confidence values), we sampled over their population. The table below shows the approximated precision for each matcher over its population of best correspondences. N is the size of the population of all the best correspondences for one matcher; n is the number of randomly chosen correspondences, i.e., 100 for each matcher; TP is the number of correct correspondences in the sample; and P* is the approximation of precision for the correspondences in each population. Additionally, a margin of error is computed as sqrt((N/n)-1)/sqrt(N), based on [4].
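The margin-of-error formula and the approximated precision P* = TP/n can be sketched as below. The function names, the fixed random seed, and the `is_correct` callback (standing in for the manual labelling) are assumptions made for illustration; the numbers in the example are hypothetical and not taken from the table.

```python
import math
import random


def margin_of_error(population_size, sample_size):
    """Margin of error sqrt((N/n) - 1) / sqrt(N) for a sample of n out of N."""
    return math.sqrt(population_size / sample_size - 1) / math.sqrt(population_size)


def approximate_precision(population, is_correct, sample_size=100, seed=0):
    """Sample the population and return (P*, margin of error).

    population: list of a matcher's best correspondences (size N).
    is_correct: callable standing in for the manual label of one correspondence.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(population))
    sample = rng.sample(population, n)
    true_positives = sum(1 for c in sample if is_correct(c))
    return true_positives / n, margin_of_error(len(population), n)


# Hypothetical numbers: N = 1000 best correspondences, n = 100 sampled.
print(margin_of_error(1000, 100))
# When N equals n, the whole population is evaluated and the margin is 0.
print(margin_of_error(100, 100))
```

The second call mirrors the situation described for ASMOV, where the population contains exactly 100 correspondences and no sampling error arises.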

From the table above we can conclude that CODI, Falcon and AgreementMaker have the best precision (higher than 90%) over their 100 most confident correspondences.

Data Mining techniques enable us to discover non-trivial findings about the participants' systems. These findings can answer *analytic questions*, such as:

- Which systems give higher/lower validity than others to the mappings that are deemed 'in/correct'?
- Which systems produce certain mapping patterns/correspondence patterns more often than others?
- Which systems are more successful on certain types of ontologies?

We formulated the above-mentioned and similar analytic questions as tasks for mining association rules using the *LISp-Miner* tool and its *4ft-Miner procedure*. In those tasks we also employed matching patterns [2] and correspondence patterns [1].
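To illustrate what such an association rule expresses, the sketch below builds a four-fold contingency table for a rule of the form "antecedent => succedent" and derives the rule's confidence from it. This is plain Python, not the LISp-Miner interface; the row layout, the attribute names `matcher` and `correct`, and the data are hypothetical.

```python
def four_fold_table(rows, antecedent, succedent):
    """Count rows into the four-fold table (a, b, c, d).

    a: antecedent and succedent hold     b: antecedent holds, succedent does not
    c: only the succedent holds          d: neither holds
    """
    a = b = c = d = 0
    for row in rows:
        ant, suc = antecedent(row), succedent(row)
        if ant and suc:
            a += 1
        elif ant:
            b += 1
        elif suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rule_confidence(a, b):
    """Share of rows satisfying the antecedent that also satisfy the succedent."""
    return a / (a + b) if a + b else 0.0


# Hypothetical rows: one per correspondence, with its matcher and manual label.
rows = [
    {"matcher": "A", "correct": True},
    {"matcher": "A", "correct": True},
    {"matcher": "A", "correct": False},
    {"matcher": "B", "correct": True},
    {"matcher": "B", "correct": False},
]
a, b, c, d = four_fold_table(
    rows, lambda r: r["matcher"] == "A", lambda r: r["correct"]
)
print((a, b, c, d), rule_confidence(a, b))
```

A question such as "which systems give higher validity to correct mappings" then corresponds to comparing such confidences across different antecedents.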

Here are the full descriptions and results of the four tasks, as output by the LISp-Miner tool:

- Task 1 - confidence
- Task 2 - origin of resources
- Task 3 - "neutral" matching patterns
- Task 4 - correspondence (matching) patterns

Results are briefly described in [6].

This evaluation was carried out by Christian Meilicke and Heiner Stuckenschmidt from the Computer Science Institute at the University of Mannheim, Germany.

Results are available in [6].

