Dataset used for ACL 2012 Student Research Workshop paper
Several approaches have been proposed for the automatic acquisition of multiword expressions from corpora. However, there is no agreement on which of them presents the best cost-benefit ratio, as they have been evaluated on distinct datasets and/or languages. To address this issue, we compare these techniques along the following dimensions: expression type (compound nouns, phrasal verbs), language (English, French) and corpus size.
This directory contains the evaluation results (mean average precision, MAP) for each tool and each configuration, as well as the execution times of each step of each tool. It also contains the corpora from which the candidates were extracted, the references used for evaluation, and the scripts used to run the experiments.
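For reference, MAP here is mean average precision over ranked candidate lists. Below is a minimal Python sketch of the computation; the data structures and the normalisation by the gold-standard size are illustrative assumptions, not a description of the evaluation scripts in this directory.

    # Minimal sketch: average precision of one ranked candidate list against
    # a gold-standard set, and MAP over several lists. Data structures and
    # the normalisation choice are illustrative only.

    def average_precision(ranked_candidates, gold):
        hits, precision_sum = 0, 0.0
        for rank, candidate in enumerate(ranked_candidates, start=1):
            if candidate in gold:
                hits += 1
                precision_sum += hits / rank  # precision at this rank
        return precision_sum / len(gold) if gold else 0.0

    def mean_average_precision(ranked_lists, gold):
        # 'ranked_lists' holds one ranked candidate list per configuration
        # (e.g. per association measure).
        return sum(average_precision(r, gold) for r in ranked_lists) / len(ranked_lists)

    # Example:
    # mean_average_precision([["hot dog", "by and large"], ["red tape"]],
    #                        {"hot dog", "red tape"})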
More details and an analysis of the results can be found in the paper:
Carlos Ramisch, Vitor De Araujo and Aline Villavicencio. 2012. A Broad Evaluation of Techniques for Automatic Acquisition of Multiword Expressions. In Proceedings of the ACL 2012 Student Research Workshop, Jeju, Republic of Korea, Aug. ACL.
./runEval.sh
This will generate directories with the results for each tool (localmaxs/, mwetk/, nsp/, ucs/), as well as a times/ directory containing the execution times for each step of each tool. The full run takes around 24 hours to complete, depending on the computer.
./timetable.sh
This script collects the time information generated by runEval.sh, as well as the evaluation (MAP) results, and prints them.
./intersections.sh
This script generates tables with the number of candidates common to each pair of tools, one table per corpus size, candidate type (noun compound or verb-particle) and language (en or fr).
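For illustration, the pairwise overlap counts could be computed along the lines of the Python sketch below; the per-tool candidate files and their one-candidate-per-line format are assumptions made for the sketch, not the actual input format of intersections.sh.

    # Sketch: count candidates shared by each pair of tools.
    # File names and the one-candidate-per-line format are assumptions.
    from itertools import combinations

    TOOLS = ["localmaxs", "mwetk", "nsp", "ucs"]

    def load_candidates(tool):
        with open(tool + "_candidates.txt", encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    candidates = {tool: load_candidates(tool) for tool in TOOLS}
    for a, b in combinations(TOOLS, 2):
        print(a, "&", b, ":", len(candidates[a] & candidates[b]), "common candidates")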