25.05.2026, "Natural Language Processing" Seminar, 10:15

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)


Abstract (by the author):

This presentation explores to what extent integrating morphological information can improve subword tokenization and, in turn, language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies such as Byte Pair Encoding (BPE), I will demonstrate a linguistically grounded approach: training a tokenizer on morphologically segmented data. This is made possible by developing a semi-supervised segmentation model for Spanish and building gold-standard datasets to guide and evaluate it. The resulting tokenizer is then used to pre-train a masked language model, which is evaluated on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.

