25.05.2026 - "Natural Language Processing" Seminar - Institute of Computer Science, Polish Academy of Sciences

25.05.2026 - "Natural Language Processing" Seminar - 10:15

Piotr Przybyła (Pompeu Fabra University / Institute of Computer Science, Polish Academy of Sciences)

Link to the meeting in MS Teams

Exploring morphology-aware tokenization: A case study on Spanish language modeling

Abstract (by the author):

In the presentation we will explore to what extent integrating morphological information can improve subword tokenization and, in turn, language modeling performance. We will focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), I will demonstrate a linguistically grounded approach: training a tokenizer on morphologically segmented data. This is made possible by developing a semi-supervised segmentation model for Spanish and building gold-standard datasets to guide and evaluate it. This tokenizer can then be used to pre-train a masked language model, whose performance is assessed on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.
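
To illustrate what "training a tokenizer on morphologically segmented data" could look like, here is a minimal sketch. It is not the speaker's implementation. It assumes that words arrive already split into morphemes with a hypothetical "+" separator, and it uses a toy BPE learner as a stand-in for whatever subword algorithm the actual work uses. The segmentations in the tiny corpus are hand-written for illustration, not taken from any gold-standard dataset. Each morpheme is treated as its own pre-token, so learned merges never cross a morpheme boundary.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(pretokens, num_merges):
    """Learn up to `num_merges` BPE merges from a Counter of pre-token frequencies."""
    vocab = {tuple(tok): freq for tok, freq in pretokens.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every pre-token is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = apply_merge(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def tokenize_word(segmented_word, merges, sep="+"):
    """Tokenize one morphologically segmented word, one morpheme at a time."""
    tokens = []
    for morpheme in segmented_word.split(sep):
        symbols = tuple(morpheme)
        for pair in merges:  # replay merges in the order they were learned
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Hypothetical, hand-segmented Spanish verb forms (stem + theme vowel + ending).
corpus = ["cant+a+mos", "cant+a+ba", "habl+a+mos", "habl+a+ba", "com+e+mos"]

# Morphemes, not whole words, are the units the BPE learner sees.
pretokens = Counter(m for word in corpus for m in word.split("+"))
merges = learn_bpe(pretokens, num_merges=20)

print(tokenize_word("cant+a+mos", merges))  # ['cant', 'a', 'mos']
```

On this toy corpus every morpheme ends up as a single token. With a realistic vocabulary budget, rare morphemes would instead be split into smaller pieces, but still never merged with material from a neighbouring morpheme.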
