Coordination: Prof. Amina Mettouchi, Université de Nantes
Research partners: LLING (University of Nantes, A. Mettouchi), LLACAN (CNRS Villejuif, M. Vanhove), CREAM-LacNad (Inalco Paris, D. Caubet).
Experts: Prof. Bernard Comrie (MPI Leipzig & UCSB Santa Barbara), Prof. Shlomo Izre’el (Tel Aviv University).
Bernard CARON (IFRA) is part of the project for Hausa and Zaar, two Chadic languages spoken in Northern Nigeria.

This project operates within the general field of the collection, analysis and dissemination of oral corpora in non-European languages. Several French teams, within the CNRS and in different universities, work on Afroasiatic languages and have a certain amount of raw data at their disposal. Within these teams, some researchers have begun publishing their oral data online, duly transcribed and translated.

The number of spoken corpora of Afroasiatic languages worldwide is very small (see, however, the Semitisches Tonarchiv of the University of Heidelberg and the Corpus of Spoken Israeli Hebrew in Tel Aviv). There are therefore opportunities for new endeavours in the field.

The aim of this project is to establish a methodology for unifying and sharing spoken field data in one phylum, Afroasiatic. This methodology is based on the linguistic analysis of the prosodic and morphosyntactic structure of the languages studied in the project. We aim to compile a pilot corpus accessible online to the scientific community, in particular for typological studies. The term ‘corpus’ implies that we are not compiling an archive for conservation purposes, but a structured body of systematically unified transcripts, accompanied by morphosyntactic annotations, and associating sound and text. The corpus is grounded in the theoretical analysis of spoken field data. This effort to unify and share the data involves two levels of analysis and has both theoretical and practical stakes:

- the level of prosodic analysis: which units of spoken language are relevant for the languages under study, and on which principles are they founded (cognitive, phonological, pragmatic...)?

- the level of morphosyntactic analysis: how can we code in a unified manner the minimal segmental units of the languages, for the whole sample?

Through this project, we would like to contribute to answering the following questions:

- What are the units of spoken language?
- Do those units differ on the basis of the tonal or accentual nature of the intonation systems of the languages?
- How are prosody and morphosyntax articulated (especially at the level of information structure)?
- What is the optimal degree of unification of the annotations, in order both to respect the specificities of the languages and to provide a comparative basis for typology?

In order to answer those questions, we will compile a pilot corpus built according to the following criteria:
- it will be freely accessible online in XML format,
- it will consist of languages belonging to the Afroasiatic phylum, with one hour of recorded material per language,
- it will be segmented into prosodic units,
- it will contain, at a minimum, a transcript, a translation and interlinear glossing, and the sound (downloadable online) will be indexed to the texts.
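
As an illustration only, the criteria above could be met by an XML record along the following lines. This is a minimal sketch in Python, not the project's actual format: the element and attribute names (`text`, `unit`, `transcript`, `glosses`, `morph`, `translation`, `start`, `end`, `audio`) are assumptions made for the example, and the forms, glosses and file name are placeholders rather than real data. Each prosodic unit carries its transcript, its interlinear glosses and its translation, and is indexed to the sound by start and end times in seconds.

```python
import xml.etree.ElementTree as ET


def build_unit(unit_id, start_s, end_s, transcript, glosses, translation):
    """Build one prosodic unit as an XML element (hypothetical schema)."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    # The start/end attributes index the unit to the sound file.
    unit = ET.Element(
        "unit",
        {"id": unit_id, "start": f"{start_s:.3f}", "end": f"{end_s:.3f}"},
    )
    ET.SubElement(unit, "transcript").text = transcript
    gloss_tier = ET.SubElement(unit, "glosses")
    for form, gloss in glosses:
        # One element per minimal segmental unit, with its gloss label.
        morph = ET.SubElement(gloss_tier, "morph", {"gloss": gloss})
        morph.text = form
    ET.SubElement(unit, "translation", {"lang": "en"}).text = translation
    return unit


def build_text(language, audio_file, units):
    """Wrap prosodic units in a text element that names the recording."""
    root = ET.Element("text", {"language": language, "audio": audio_file})
    root.extend(units)
    return root


# Placeholder content only: not taken from any language in the sample.
unit = build_unit(
    "u1", 0.0, 1.25,
    "FORM1 FORM2",
    [("FORM1", "GLOSS1"), ("FORM2", "GLOSS2")],
    "placeholder translation",
)
doc = build_text("placeholder-language", "recording.wav", [unit])
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)
```

Storing the time offsets on the prosodic unit, rather than on individual words, reflects the choice of the prosodic unit as the basic segment of the corpus; a finer-grained alignment would need additional attributes.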

The languages represented in the project are: Berber (Taqbaylit, Tatserret), Cushitic (Beja, Gawwada, Ts’amakko, Ongota, Afar), Omotic (Wolaitta), Semitic (spoken Arabic (Morocco, Libya, Sudan), Maltese, spoken Hebrew, Dahalikt), and Chadic (Hausa, Bata, Zaar).

