Setting up training config for MWT expander fails with "No training data available..." #91

@harisont


I am helping my student @Jembda to train a customized-mwt pipeline for one of the languages you officially support, Amharic.

After struggling with the problems described in #48 and #87, we managed to train a tokenizer, a parser-tagger and a lemmatizer. As the title of this issue says, however, setting up the training config for the MWT expander fails with "No training data available...".

Our train_conllu_fpath and dev_conllu_fpath point to the exact same files that work fine for the other steps of the pipeline (except for the lemmatizer, which needed some additional preprocessing to deal with tokens whose lemma is _).

My questions are the following:

  1. Is the training data supposed to be in a different format, and if so, where is that described?
  2. Is there any way around training an MWT expander? Our focus is parsing, and as a last resort I suppose we could settle for training a parser that does not split MWTs, although that is far from ideal for Amharic.
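
Regarding the first question, one thing that can be checked independently of the toolkit is whether the CoNLL-U files contain any multiword token range lines at all (IDs such as 3-4 in the first column, as defined by the CoNLL-U format). The following is a minimal sketch, not part of the library. It rests on the unconfirmed assumption that an MWT expander has nothing to learn from when no such lines exist, and the function names are hypothetical.

```python
import re

# In CoNLL-U, a multiword token is introduced by a line whose ID column
# is a range such as "3-4"; empty nodes use "3.1" and are not matched here.
MWT_ID = re.compile(r"^\d+-\d+$")


def count_mwt_ranges(lines):
    """Count multiword-token range lines in an iterable of CoNLL-U lines."""
    count = 0
    for line in lines:
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        first_col = line.split("\t", 1)[0]
        if MWT_ID.match(first_col):
            count += 1
    return count


def count_mwt_ranges_in_file(path):
    """Same as count_mwt_ranges, reading the lines from a UTF-8 file."""
    with open(path, encoding="utf-8") as f:
        return count_mwt_ranges(f)
```

Calling count_mwt_ranges_in_file on the files behind train_conllu_fpath and dev_conllu_fpath would show whether either of them has zero range lines. A zero count would be consistent with the error, but it would not prove that this is its cause.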
