PARSEME: PARSing and Multi-word Expressions
Sourced from UniDive COST Action CA21167, licensed under CC BY-SA 4.0.
PARSEME Corpora
The PARSEME multilingual corpora are annotated with verbal multiword expressions (VMWEs). They serve, notably, as training and test data for shared tasks (Savary et al. 2017; Ramisch et al. 2018, 2020) aimed at improving the identification of VMWEs in written text. Currently at version 1.3, the PARSEME corpora cover 26 languages, including French. Together, the corpora comprise 455,629 sentences (about 9 million tokens) and 127,498 VMWEs. The French corpus contains 20,961 sentences (525,842 tokens) with 5,655 annotated VMWEs.
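To put these figures in proportion, the short sketch below derives a few ratios from the counts quoted above. The variable names are illustrative only and are not part of any PARSEME tooling; the approximate "9 million tokens" total is left out because it is not an exact count.

```python
# Counts quoted in the text for PARSEME corpora version 1.3.
ALL_LANGUAGES = {"sentences": 455_629, "vmwes": 127_498}
FRENCH = {"sentences": 20_961, "tokens": 525_842, "vmwes": 5_655}

# Share of the multilingual resource contributed by the French corpus.
french_sentence_share = FRENCH["sentences"] / ALL_LANGUAGES["sentences"]
french_vmwe_share = FRENCH["vmwes"] / ALL_LANGUAGES["vmwes"]

# Density figures for the French corpus alone.
tokens_per_sentence = FRENCH["tokens"] / FRENCH["sentences"]
vmwes_per_100_sentences = 100 * FRENCH["vmwes"] / FRENCH["sentences"]

print(f"French share of sentences: {french_sentence_share:.1%}")
print(f"French share of VMWEs:     {french_vmwe_share:.1%}")
print(f"Tokens per sentence:       {tokens_per_sentence:.1f}")
print(f"VMWEs per 100 sentences:   {vmwes_per_100_sentences:.1f}")
```

On these counts, French contributes a little under 5% of both the sentences and the VMWEs, with an average sentence length of about 25 tokens and roughly 27 annotated VMWEs per 100 sentences.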
The VMWE types are defined in the annotation guidelines.¹ The following types are annotated in French:
- IRV (inherently reflexive verbs): e.g. s’évanouir (lit. ‘to faint oneself’) ‘to faint’
- LVC.full (light verb constructions in which the verb is semantically totally bleached): e.g. faire une présentation ‘to make a presentation’
- LVC.cause (light verb constructions in which the verb adds a causative meaning to the noun): e.g. donner le droit ‘to grant the right’
- VID (verbal idioms): e.g. se faire des idées (lit. ‘make oneself ideas’) ‘to imagine something false’
- MVC (multi-verb constructions): e.g. ce mot veut dire autre chose (lit. ‘this word wants to mean something else’) ‘this word means something else’
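As an illustration of how these category labels might be tallied in annotated data, here is a minimal sketch. It assumes per-token tags shaped like those in PARSEME's `.cupt` files: `*` for a token outside any VMWE, `id:CATEGORY` on the first token of a VMWE, a bare `id` on its remaining tokens, and `;` between entries when a token belongs to several VMWEs. That tag shape, the function names, and the sample tags are assumptions made for this example; the annotation guidelines and corpus documentation are the authoritative description.

```python
from collections import Counter

# VMWE categories annotated for French, as listed above.
FRENCH_VMWE_CATEGORIES = {
    "IRV": "inherently reflexive verb",
    "LVC.full": "light verb construction, verb semantically bleached",
    "LVC.cause": "light verb construction, verb adds a causative meaning",
    "VID": "verbal idiom",
    "MVC": "multi-verb construction",
}


def categories_in_tag(tag):
    """Return the category labels introduced by one per-token tag.

    Assumed tag shape (hypothetical for this sketch): '*' or '_' for no
    VMWE, otherwise ';'-separated items that are either 'id:CATEGORY'
    (first token of a VMWE) or a bare 'id' (continuation token).
    """
    if tag in ("*", "_", ""):
        return []
    labels = []
    for item in tag.split(";"):
        _, separator, category = item.partition(":")
        if separator:  # only the first token of a VMWE carries a category
            labels.append(category)
    return labels


def count_categories(tags):
    """Count VMWEs per category over a sequence of per-token tags."""
    counts = Counter()
    for tag in tags:
        for category in categories_in_tag(tag):
            if category not in FRENCH_VMWE_CATEGORIES:
                raise ValueError(f"unexpected VMWE category: {category!r}")
            counts[category] += 1
    return counts


# Made-up tags: one four-token VID (as in "se faire des idées"),
# then a token that opens both an LVC.full and an IRV.
sample_tags = ["*", "1:VID", "1", "1", "1", "*", "2:LVC.full;3:IRV", "2"]
sample_counts = count_categories(sample_tags)
print(dict(sample_counts))
```

Counting only the tokens that carry a category label means each VMWE is counted once, however many tokens it spans.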
Footnotes
1. https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/