25.05.2026 - "Natural Language Processing" Seminar - 10:15
Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)
Abstract (by the author):
In this presentation, we will explore to what extent integrating morphological information can improve subword tokenization and, in turn, language modeling performance. We will focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies such as Byte Pair Encoding (BPE), we will demonstrate a linguistically grounded approach: training a tokenizer on morphologically segmented data. This is made possible by developing a semi-supervised segmentation model for Spanish and building gold-standard datasets to guide and evaluate it. This tokenizer can then be used to pre-train a masked language model, whose performance we assess on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.


