
The translation has been generated automatically  (elia.eus)

Elhuyar News

Lexical Observatory: a Basque corpus of almost 60 million words

2017 | March 07

 


Euskaltzaindia has updated the Lexical Observatory with new texts, most of them from the period 2011-2016.

The corpus now contains 58,576,635 words in total. Most of the texts come from the media (newspapers, magazines, radio and television), although in recent years the sources have been diversified to include literary and educational texts.

The corpus can be consulted at the following address: euskaltzaindia.eus. All texts are classified (for example, by area of knowledge and register), and all words are automatically lemmatized to make searching easier and search results more useful. For example, if we search for “conciliation”, the system will show the occurrences of all the inflected forms of that word, not only the form that was typed.
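The lemma-based search described above can be illustrated with a minimal sketch. This is not the Lexical Observatory's actual implementation: the form-to-lemma table, the sample sentences and the function names `build_lemma_index` and `search` are all hypothetical, and a real corpus would use an automatic lemmatizer rather than a hand-written table. The example uses a few inflected forms of the Basque noun "etxe" (house).

```python
from collections import defaultdict

# Hypothetical, hand-written form-to-lemma table (illustration only).
# A real corpus relies on an automatic lemmatizer instead.
FORM_TO_LEMMA = {
    "etxe": "etxe",
    "etxea": "etxe",
    "etxeak": "etxe",
    "etxean": "etxe",
    "etxeko": "etxe",
}


def build_lemma_index(sentences):
    """Map each lemma to the (sentence number, word form) pairs where it occurs."""
    index = defaultdict(list)
    for sent_no, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            form = token.strip(".,;:!?\"'")
            if not form:
                continue
            # Forms missing from the table are treated as their own lemma.
            lemma = FORM_TO_LEMMA.get(form, form)
            index[lemma].append((sent_no, form))
    return index


def search(index, lemma):
    """Return every occurrence of any form that belongs to the given lemma."""
    return index.get(lemma, [])


# Made-up sample sentences, not taken from the corpus.
SENTENCES = [
    "Etxea handia da.",
    "Etxean nago.",
    "Etxeko atea zabalik dago.",
]

INDEX = build_lemma_index(SENTENCES)

for sent_no, form in search(INDEX, "etxe"):
    print(sent_no, form)
```

Because the index is keyed by lemma rather than by surface form, a single query for "etxe" retrieves "etxea", "etxean" and "etxeko" without the user having to list each form.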

Today, corpora are an indispensable tool in linguistic research and dictionary making. Euskaltzaindia uses this corpus as a source for its normative vocabulary, together with the corpus of the General Basque Dictionary and the Statistical Corpus of 20th-Century Basque.

UZEI, the IXA Group of the UPV/EHU and Elhuyar have been collaborating on this project since 2009 with the aim of collecting a representative sample of the current use of written Basque.

Corpus processing is semi-automated and relies on advanced language technology. Elhuyar contributes its corpus technology and its experience in lexicography to the project.