
The 2000-word test data has now been released.
The 2000-word training data can be found here.

Call for Participation for Shared Task

No formal registration is required. Just send us your results (a list of 2000 words) by May 9, 2014 (the deadline). If you would like to participate in the shared task but cannot attend the workshop, please let us know.

We invite participation in a shared task devoted to the problem of lexical access in language production, with the aim of providing a quantitative comparison between different systems.

Motivation

The quality of a dictionary depends not only on coverage, but also on the accessibility of the information. That is, a crucial point is dictionary access. Access strategies vary with the task (text understanding vs. text production) and the knowledge available at the very moment of consultation (words, concepts, speech sounds). Unlike readers who look for meanings, writers start from them, searching for the corresponding words. While paper dictionaries are static, permitting only limited strategies for accessing information, their electronic counterparts promise dynamic, proactive search via multiple criteria (meaning, sound, related words) and via diverse access routes. Navigation takes place in a huge conceptual lexical space, and the results are displayable in a multitude of forms (e.g. as trees, as lists, as graphs, or sorted alphabetically, by topic, by frequency).

To bring some structure into this multitude of possibilities, the shared task will concentrate on a crucial subtask, namely multiword association, in a form that allows quantitative comparisons between different systems. What we mean by multiword association in the context of this workshop is the following. Suppose we were looking for a word expressing the following ideas: 'superior dark coffee made of beans from Arabia', but could not remember the intended word 'mocha'. Since people always remember something concerning the elusive word, it would be useful to have a system that accepts this kind of input and proposes a number of candidates for the target word. Given the above example, we might enter 'dark', 'coffee', 'beans', and 'Arabia', and the system would be expected to come up with one or several associated words such as 'mocha', 'espresso', or 'cappuccino'.

Procedure

The participants will receive lists of five given words (primes) such as 'circus', 'funny', 'nose', 'fool', and 'fun' and are supposed to compute the word which is most closely associated with all of them. In this case, the word 'clown' would be the expected answer. Here are some more examples:

    • given words: gin, drink, scotch, bottle, soda
    • expected answer: whisky

    • given words: wheel, driver, bus, drive, lorry
    • expected answer: car

    • given words: neck, animal, zoo, long, tall
    • expected answer: giraffe

    • given words: holiday, work, sun, summer, abroad
    • expected answer: vacation

    • given words: home, garden, door, boat, chimney
    • expected answer: house

    • given words: blue, cloud, stars, night, high
    • expected answer: sky

We will provide a training set of 2000 sets of five input words (multiword stimuli), together with the expected target words (associative response). The participants will have five weeks to train their systems on this data. After the training phase, we will release a test set containing another 2000 sets of five input words, but without providing the expected target words.
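To make the task concrete, here is a minimal baseline sketch in Python. It is not the organizers' method: the sentence-level co-occurrence counts, the scoring rule (prefer the word that co-occurs with the most primes, breaking ties by total co-occurrence count), the toy sentences, and the names `build_cooccurrence` and `predict` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_cooccurrence(sentences):
    """Map each word to a Counter of the words it shares a sentence with."""
    cooc = defaultdict(Counter)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    cooc[word][other] += 1
    return cooc


def predict(primes, cooc):
    """Return the non-prime word that co-occurs with the most primes.

    Ties are broken by the total co-occurrence count with the primes,
    then alphabetically. Returns None if no word co-occurs with any prime.
    """
    primes = {p.lower() for p in primes}
    best_word, best_score = None, (0, 0)
    for candidate in sorted(cooc):
        if candidate in primes:
            continue
        neighbours = cooc[candidate]
        covered = sum(1 for p in primes if neighbours[p] > 0)
        total = sum(neighbours[p] for p in primes)
        if (covered, total) > best_score:
            best_word, best_score = candidate, (covered, total)
    return best_word


# Hypothetical toy corpus, standing in for a real corpus.
toy_sentences = [
    "the clown at the circus was funny",
    "a clown has a red nose",
    "the fool played the clown for fun",
    "the circus tent was big",
    "a funny joke",
]
cooc = build_cooccurrence(toy_sentences)
print(predict(["circus", "funny", "nose", "fool", "fun"], cooc))  # clown
```

On a realistic corpus, raw counts favour frequent function words, so a real system would likely need an association measure that corrects for word frequency.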

Participants will have five days to run their systems on the test data, thereby predicting the target words. For each system, we will compare the results to the expected target words and compute an accuracy. The participants will be invited to submit a paper describing their approach and the results.
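The accuracy computation can be sketched as follows. This sketch assumes that a prediction counts as correct only if it matches the expected target word exactly after lowercasing and trimming whitespace; the precise matching rules and the submission file format are not specified here, and the function name `accuracy` is hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of items whose predicted word equals the expected word.

    Assumption: matching is exact after lowercasing and stripping
    surrounding whitespace.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return hits / len(expected)


# Two of the four predictions match the expected answers.
predicted = ["whisky", "truck", "giraffe", "holiday"]
expected = ["whisky", "car", "giraffe", "vacation"]
print(accuracy(predicted, expected))  # 0.5
```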

For the participating systems, we will distinguish two categories:

    • Unrestricted systems: These systems can use any kind of data to compute their results.

    • Restricted systems: These systems are only allowed to draw on the freely available ukWaC corpus in order to extract information on word associations. This corpus comprises about 2 billion words and can be downloaded from the WaCky website.

Participants are allowed to compete in either category or in both.

Venue

The shared task will take place as part of the CogALex workshop, which is co-located with COLING 2014 (Dublin). The workshop date is August 23, 2014. Shared task participants who wish to have a paper published in the workshop proceedings will be required to present their work at the workshop.

Schedule for Shared Task

Training Data Release:
March 27, 2014

Test Data Release:
May 5, 2014

Final Results:
May 9, 2014

Deadline for Paper Submission:
June 8, 2014

Reviewers' feedback:
June 29, 2014

Camera-ready version:
July 7, 2014

Further information

For the instructions, please click here.

For the training data, please click here.

For the test data, please click here.

The training and test datasets were both derived from the Edinburgh Associative Thesaurus (Kiss et al., 1973), as described in the Proceedings of the CogALex-IV workshop.

Kiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973) An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press.

Registration for the shared task: Please send an e-mail to Reinhard Rapp, with Michael Zock in cc.