I am helping my student @Jembda to train a customized-mwt pipeline for one of the languages you officially support, Amharic.
After struggling with the problems described in #48 and #87, we managed to train a tokenizer, a parser-tagger and a lemmatizer. As per the title of this issue, however, setting up the training config for the MWT expander fails with `No training data available...`.
Our `train_conllu_fpath` and `dev_conllu_fpath` point to the exact same files that seem to work perfectly fine for the other steps of the pipeline (except the lemmatizer, which needed some additional preprocessing to deal with tokens where the lemma is `_`).
My questions are the following:
- is the training data supposed to be in a different format, and if so, where is that described?
- is there any way around training an MWT expander? Our focus is parsing, and as a last resort I suppose we could settle on training a parser that does not split MWTs, although that is far from ideal for Amharic.
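
In case it helps with diagnosing this, here is a minimal sketch of a check we can run on our files. It counts the lines whose ID column is a range such as `3-4`, which is how CoNLL-U marks multiword tokens. The function names and the small Spanish sample are only illustrative, and whether the expander's error is actually tied to missing range lines is a guess on our part, not something we have confirmed.

```python
import re
from pathlib import Path

# In CoNLL-U, a multiword token line has a range ID such as "3-4" in column 1.
MWT_ID = re.compile(r"^\d+-\d+$")

# Illustrative sample (not from our Amharic data): Spanish "del" = "de" + "el".
SAMPLE = (
    "# text = del mar\n"
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_\n"
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_\n"
    "3\tmar\tmar\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def count_mwt_ranges(conllu_text: str) -> int:
    """Count multiword token range lines in a CoNLL-U string."""
    count = 0
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if MWT_ID.match(token_id):
            count += 1
    return count


def count_mwt_ranges_in_file(path: str) -> int:
    """Same check, applied to a file on disk (hypothetical helper)."""
    return count_mwt_ranges(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(count_mwt_ranges(SAMPLE))
```

Running `count_mwt_ranges_in_file` on the files behind `train_conllu_fpath` and `dev_conllu_fpath` would tell us whether they contain any range lines at all, which seems like the first thing to rule out.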