HTREC 2022

Ancient Greek BERT

Load & use the Ancient Greek BERT pre-trained masked language model

ipavlopoulos

A subword-based BERT language model was trained on a varied corpus of Modern, Ancient and Post-classical Greek texts. The authors' experiments showed good perplexity and state-of-the-art performance for fine-grained POS tagging on treebanks of Classical and Medieval Greek and on a Byzantine Greek dataset.

This notebook loads the pre-trained model and shows how to use it to predict masked words. To work properly on a specific domain, the model needs to be fine-tuned on in-domain data.

For details of the model, read the work of Singh et al. (2021), or use their released model directly through this notebook.
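To illustrate what "subword-based" means here: BERT tokenisers split a word that is missing from the vocabulary into smaller known pieces, marking continuation pieces with `##`. The sketch below shows the usual greedy longest-match-first idea on a tiny hypothetical vocabulary. The function name `wordpiece` and the vocabulary are assumptions made for illustration; the real tokeniser has a far larger vocabulary and also lowercases and strips accents first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real one is much larger).
vocab = {"γινε", "##σθαι", "συγχ", "##ωρου", "##ν"}
print(wordpiece("γινεσθαι", vocab))
print(wordpiece("συγχωρουν", vocab))
```

A word for which no piece can be found is mapped to `[UNK]` as a whole, which is also how BERT-style tokenisers commonly behave.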

In [4]:
%%capture
!pip install transformers
!pip install flair
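In [5]:
%%capture
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokeniser = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
# AutoModelForMaskedLM includes the language-modelling head, so the model returns
# one score per vocabulary entry; the bare AutoModel returns hidden states only.
model = AutoModelForMaskedLM.from_pretrained("pranaydeeps/Ancient-Greek-BERT")

The next cell reuses the masked-word prediction example from GreekBERT, with Ancient Greek BERT in place of the original model.

In [24]:
import torch
input_ids = tokeniser.encode('τοῦ βίου τοῦ καθ ΄ εαυτοὺς πολλὰ γίνεσθαι συγχωροῦν [MASK]')
tokens = tokeniser.convert_ids_to_tokens(input_ids)
idx = tokens.index("[MASK]")  # position of the masked token
print(idx, tokens)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(torch.tensor([input_ids]))[0]  # shape: (1, sequence length, vocabulary size)
# Print the highest-scoring vocabulary entry at the masked position.
print(tokeniser.convert_ids_to_tokens(logits[0, idx].argmax().item()))
13 ['[CLS]', 'του', 'βιου', 'του', 'καθ', '΄', 'εαυτους', 'πολλα', 'γινε', '##σθαι', 'συγχ', '##ωρου', '##ν', '[MASK]', '[SEP]']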
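
The token list printed above contains subword pieces (those starting with `##`) and the special tokens `[CLS]` and `[SEP]`. When comparing the model's view of a sentence with the original words, it can help to glue the pieces back together. The following is a minimal sketch; the helper name `merge_wordpieces` is an assumption and not part of the transformers library.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]")):
    """Join '##' continuation pieces onto the preceding token and drop special tokens."""
    words = []
    for token in tokens:
        if token in special:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # append the piece without its '##' marker
        else:
            words.append(token)
    return words


tokens = ['[CLS]', 'του', 'βιου', 'του', 'καθ', '΄', 'εαυτους', 'πολλα',
          'γινε', '##σθαι', 'συγχ', '##ωρου', '##ν', '[MASK]', '[SEP]']
words = merge_wordpieces(tokens)
print(words)
```

The merged words are still lowercased and accent-free, because the tokeniser normalises the text before splitting it.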

Suggested next steps

  • Fine-tune the model on an in-domain dataset (e.g., the HTREC data).
  • Mask words that are likely mistaken (e.g., errors introduced by handwritten text recognition, HTR), then use the model to unmask them.
  • Use a word lexicon (e.g., based on this resource) to detect likely errors.
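
The last two steps can be combined along these lines: look each word up in a lexicon and replace the unknown ones with `[MASK]`, so that the model can propose a replacement. This is only a sketch under stated assumptions: the toy lexicon, the function names `normalise` and `mask_unknown_words`, and the whitespace-based word splitting are all hypothetical. The lexicon is assumed to hold lowercased, accent-free forms, matching the tokeniser output shown earlier.

```python
import unicodedata

MASK = "[MASK]"


def normalise(word):
    """Lowercase a word and strip its diacritics."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()


def mask_unknown_words(sentence, lexicon):
    """Replace every whitespace-separated word missing from the lexicon with [MASK]."""
    return " ".join(
        word if normalise(word) in lexicon else MASK
        for word in sentence.split()
    )


# Hypothetical toy lexicon of normalised word forms.
lexicon = {"του", "βιου", "πολλα"}
# The last word is deliberately misspelled, so it is not in the lexicon.
masked = mask_unknown_words("τοῦ βίου τοῦ πολλὰ γίνεσθα", lexicon)
print(masked)
```

Note that the model fills each `[MASK]` with a single vocabulary piece, so words that the tokeniser would split into several pieces may need more than one mask.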
