Setting up training config for MWT expander fails with "No training data available..." #91

Open. Opened on Mar 29, 2025

I am helping my student @Jembda train a customized-mwt pipeline for Amharic, one of the languages you officially support.

After working through the problems described in #48 and #87, we managed to train a tokenizer, a parser-tagger and a lemmatizer. As the title says, however, setting up the training config for the MWT expander fails with "No training data available...".

Our train_conllu_fpath and dev_conllu_fpath point to exactly the same files that work fine for the other steps of the pipeline (except the lemmatizer, which needed some additional preprocessing to deal with tokens whose lemma is _).

My questions are the following:

1. Is the training data for the MWT expander supposed to be in a different format, and if so, where is that described?
2. Is there any way around training an MWT expander? Our focus is parsing, and as a last resort I suppose we could settle for training a parser that does not split MWTs, although that is far from ideal for Amharic.
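
In case it helps with the first question, here is a minimal sketch of the kind of check that could be run on the training file. It relies on one assumption that we have not been able to confirm from the documentation: that the MWT expander only finds training data when the CoNLL-U file contains multiword-token range rows (IDs such as 1-2). The function name count_conllu_rows and the toy sentence are hypothetical and are not part of the library.

```python
def count_conllu_rows(lines):
    """Count row types in CoNLL-U input given as an iterable of lines.

    Reports regular word rows, multiword-token range rows (ID like "1-2"),
    empty-node rows (ID like "1.1"), and word rows whose LEMMA is "_".
    """
    counts = {"words": 0, "mwt_ranges": 0, "empty_nodes": 0, "underscore_lemmas": 0}
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A well-formed CoNLL-U row has exactly 10 tab-separated columns.
        if len(cols) != 10:
            continue
        token_id, lemma = cols[0], cols[2]
        if "-" in token_id:
            counts["mwt_ranges"] += 1
        elif "." in token_id:
            counts["empty_nodes"] += 1
        else:
            counts["words"] += 1
            if lemma == "_":
                counts["underscore_lemmas"] += 1
    return counts


def _row(*cols):
    """Join columns with tabs so the toy sample is valid CoNLL-U."""
    return "\t".join(cols)


# Hypothetical toy sentence: one multiword token "ab" split into "a" and "b".
SAMPLE = [
    "# text = ab.",
    _row("1-2", "ab", "_", "_", "_", "_", "_", "_", "_", "_"),
    _row("1", "a", "a", "ADP", "_", "_", "2", "case", "_", "_"),
    _row("2", "b", "_", "NOUN", "_", "_", "0", "root", "_", "_"),
    _row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
]

print(count_conllu_rows(SAMPLE))
```

To run it on a real file, an open file handle can be passed in directly, for example `with open(path, encoding="utf-8") as f: print(count_conllu_rows(f))`, where path is whatever train_conllu_fpath points to. If mwt_ranges comes out as 0 for our data, that would at least be consistent with the assumption above, though it would not prove that this is what triggers the error.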
