AGE: Amharic, Ge’ez and English Parallel Dataset

1Ethiopian AI Institute, 2Maharishi International University

Abstract

African languages are not well represented in Natural Language Processing (NLP), mainly because of a lack of resources for training models. Low-resource languages, such as Amharic and Ge’ez, cannot benefit from modern NLP methods because high-quality datasets are missing. This paper presents AGE, an open-source tripartite parallel dataset aligning Amharic, Ge’ez, and English. Additionally, we introduce 1,000 novel Ge’ez-centered sentences sourced from areas such as news and novels. Furthermore, we developed a model from a multilingual pre-trained language model, which achieves BLEU scores of 12.29 and 30.66 for English to Ge’ez and Ge’ez to English, respectively, and 9.39 and 12.29 for Amharic to Ge’ez and Ge’ez to Amharic, respectively. Our dataset and models are available at the AGE Dataset repository.

About Ge’ez (ግዕዝ)

Ge’ez (ግዕዝ), also known as Ethiopic, is one of the oldest Semitic languages (Tareke et al., 2002), and its alphabet is among the oldest still in use today. Furthermore, Ge’ez is one of four languages (along with Sabaean, Greek, and Arabic) that have been and continue to be used for ancient inscriptional arts. Ge’ez is currently neither an actively spoken language nor the native tongue of any people. Its use is limited to serving as the liturgical language of the Ethiopian Orthodox Tewahedo, Eritrean Orthodox Tewahedo, Ethiopian Catholic, and Eritrean Catholic Christians. It is also used during prayer and at regularly scheduled public religious feast celebrations. The Bible, including the Deuterocanonical books, dominates the literature. According to Molla & Tabor (2018), the language also has many medieval and early modern original texts. Most of the essential works correspondingly belong to the literature of the Ethiopian Orthodox Tewahedo Church. These works include Christian Orthodox liturgy (service books, prayers, hymns), hagiographies, and a range of Patristic literature. Around 200 texts were written about home-grown Ethiopian saints from the fourteenth to the nineteenth century. The religious orientation of Ge’ez literature stems from traditional education having been the responsibility of priests and monks. More information about the Ge’ez alphabet can be found in the appendix.

Related Works

One of the major challenges in developing machine translation (MT) models for Ge’ez is the lack of public data. There have been attempts to compile parallel corpora for Ge’ez to English and Ge’ez to Amharic MT tasks, but progress has been unsatisfactory.

Results

Our results reveal a clear gradient in BLEU scores across language pairs. For Amharic to Ge’ez and Ge’ez to Amharic, the model achieved BLEU scores of 9.03 and 12.26 in evaluation, rising slightly to 9.39 and 12.87 in the prediction phase. Scores were markedly higher for translation into English: the Ge’ez to English pair achieved the highest scores, 30.35 in evaluation and 30.66 in prediction, indicating that the model is most capable in this direction.
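For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU in plain Python. It is not the evaluation script used for the numbers above. It assumes whitespace tokenization, a single reference per hypothesis, and no smoothing; the function names `ngrams` and `corpus_bleu` are hypothetical, and standard tools apply their own tokenization, so their scores will generally differ from this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Simplifying assumptions: whitespace tokenization and no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each hypothesis n-gram count
            # by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder English strings, not drawn from the AGE dataset.
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

Because the score is a geometric mean over 1-gram to 4-gram precisions, a system that gets most words right but in the wrong order is penalized heavily, which is one reason scores for morphologically rich, low-resource targets tend to be low.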

Data collection

Our dataset, drawn from diverse sources, exhibited significant textual inconsistencies. Portions of the data were too disordered to use, so we removed them from the collection. The figure shows the general framework for the dataset development process.
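The source does not state the exact cleaning criteria, so the following is only an illustrative sketch of how disordered sentence pairs might be filtered. The helper names (`ethiopic_ratio`, `keep_pair`), the thresholds, and the rules themselves (non-empty text, a minimum share of Ethiopic-script characters on the Ge’ez side, a bounded length ratio) are all assumptions, not the authors' pipeline.

```python
import unicodedata


def ethiopic_ratio(text):
    """Fraction of non-space characters in the main Ethiopic block (U+1200-U+137F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u1200" <= c <= "\u137f" for c in chars) / len(chars)


def keep_pair(geez, english, min_ethiopic=0.8, max_len_ratio=3.0):
    """Return True if a (Ge'ez, English) pair passes the assumed sanity checks.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    geez = unicodedata.normalize("NFC", geez).strip()
    english = unicodedata.normalize("NFC", english).strip()

    # Rule 1: drop pairs where either side is empty.
    if not geez or not english:
        return False

    # Rule 2: the Ge'ez side should be mostly Ethiopic script.
    if ethiopic_ratio(geez) < min_ethiopic:
        return False

    # Rule 3: drop pairs whose word counts differ too much,
    # a common sign of misalignment.
    g_len, e_len = len(geez.split()), len(english.split())
    return max(g_len, e_len) / min(g_len, e_len) <= max_len_ratio


if __name__ == "__main__":
    # Placeholder strings for illustration, not rows from the AGE dataset.
    pairs = [("ሰላም ለኪ", "peace to you"), ("", "hello"), ("hello world", "hello world")]
    print([keep_pair(g, e) for g, e in pairs])
```

A rule-based filter like this is cheap and transparent, but thresholds would need tuning against real data, since Ge’ez and English differ considerably in how many words they use to express the same content.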



Sample Output

BibTeX

@inproceedings{
  ademtew2024age,
  author    = {Henok Biadglign Ademtew and Mikiyas Girma Birbo},
  title     = {{AGE}: Amharic, Ge{\textquoteright}ez and English Parallel Dataset},
  booktitle = {5th Workshop on African Natural Language Processing},
  year      = {2024},
  url       = {https://openreview.net/forum?id=tHNfskz2WG}
  }