My NMT project – A case study in failure


As some of you may know, my mother tongue is Romansh. One thing that has been bothering me, and probably also many other Romansh speakers, or really anyone who wants to provide a Romansh translation of anything, is the sheer lack of automated Romansh translation. Of course, it does make sense that Romansh isn’t at the top of the list for Google Translate, DeepL, or frankly any other provider of neural machine translation. And of course, overly enthusiastic and self-overestimating me thought I should give it a shot; how hard could it be anyway? (foreshadowing) I decided to use the open-source neural machine translation framework OpenNMT to try to get a very rudimentary translation system going between German and Romansh. It was intended to be a mere proof of concept, kind of similar to my “Maturaarbeit”, in which I trained a text-generation neural network on a mixed-idiom dataset. The results of that were completely unusable in practical terms, yet interesting to say the least. I can promise you that the first of those two statements also held up for this project.

Roadblock 1: Gathering a dataset

Whilst finding (or, if you are really masochistic, collecting) a large enough dataset for NMT is a hard enough task for bigger languages, it becomes a real nightmare with Romansh. A minority language spoken by fewer than 100k people (note that this is an estimate; we don’t have good numbers on it) simply generates less data. Now, the good news was that I found corpora containing pairs of translated sentences in Romansh and German, courtesy of opus.nlpl.eu. A good rule of thumb regarding dataset size is that you’d want at least several hundred thousand sentence pairs to have any chance of getting usable results out of the whole ordeal. Meanwhile, I was able to collect a grand total of 50 thousand. I probably could have gathered a few thousand more, but I figured it wouldn’t be worth the hassle, and I could always come back to that once I had a proof of concept going. And here is where the project really fell apart: I should never have gone forward with such a laughably small dataset. But I had no way of getting a large enough dataset, because one simply doesn’t exist (yet). With the data gathered, we set off on the task of setting up OpenNMT, a task that took me only a few hours in dependency hell, a process I won’t reiterate here, as I don’t want to relive such traumatizing memories. Hooray, we can finally start training the model.
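For the curious, the data preparation itself is not the hard part. Below is a minimal sketch (not my exact script) of what that step can look like, assuming the OpenNMT-py flavour of OpenNMT: the file names `opus.de-rm.de` and `opus.de-rm.rm` are hypothetical placeholders for the line-aligned files you get from opus.nlpl.eu, and the script simply deduplicates, shuffles, and splits the sentence pairs into the parallel train/validation text files (one sentence per line, source and target line-aligned) that OpenNMT-py expects.

```python
import random

# Hypothetical file names: OPUS corpora come as two line-aligned files, one per
# language, where line i of one file is the translation of line i of the other.
SRC_FILE = "opus.de-rm.de"   # German side
TGT_FILE = "opus.de-rm.rm"   # Romansh side
VALID_SIZE = 2000            # hold a couple of thousand pairs out for validation

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

src = read_lines(SRC_FILE)
tgt = read_lines(TGT_FILE)
assert len(src) == len(tgt), "source and target files must be line-aligned"

# Drop empty and duplicated pairs, then shuffle reproducibly.
pairs = sorted(set(p for p in zip(src, tgt) if p[0] and p[1]))
random.seed(42)
random.shuffle(pairs)

splits = {"valid": pairs[:VALID_SIZE], "train": pairs[VALID_SIZE:]}

for name, subset in splits.items():
    with open(f"src-{name}.txt", "w", encoding="utf-8") as f_src, \
         open(f"tgt-{name}.txt", "w", encoding="utf-8") as f_tgt:
        for s, t in subset:
            f_src.write(s + "\n")
            f_tgt.write(t + "\n")

print(f"{len(splits['train'])} training pairs, {len(splits['valid'])} validation pairs")
```

From there, the usual `onmt_build_vocab -config <config.yaml>` and `onmt_train -config <config.yaml>` commands take over, with the config pointing at the four files written above; if I remember correctly, leaving the `gpu_ranks` option out of that config is also what forces the CPU-only training that becomes relevant in a moment.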

Roadblock 2: GPU troubles

Except we can’t train the model. You see, my graphics card is an old GTX 970, and whilst that doesn’t bother me for anything else, be it productivity or gaming, it really is a hindrance for machine learning projects: when I tried to start training, the GPU would run out of memory and the process would crash. Upgrading the GPU would be an option in normal times, but we are currently in the middle of a massive semiconductor shortage, GPUs are out of stock nearly everywhere, and their prices have quite literally tripled. But hey, there is a way out. A tremendously painful and slow way out. No, I’m not talking about selling a kidney for an RTX 3080; I’m talking about dreadful CPU training. In hindsight, this should have been the second point at which I dropped the project. See, training a model once on the CPU is okay, but doing it over and over and over again to tweak parameters and perhaps get better results is mind-numbingly slow, and I quite honestly couldn’t be bothered to go through that under these conditions. So anyway, with my time efficiency now decimated, I let the model train for half a day until it stopped improving. After the whole ordeal was done, I went through the different checkpoints I had saved and saw a nice improvement: with increasing training time, the model started translating commonly used words correctly, which got me quite excited. So I skipped forward to the checkpoint with the smallest error and gave it a Romansh text from a newspaper to translate, and when I saw the results, I was dumbfounded:

  • “Herausgabe einer CD im Jahr 2000”
  • “Herausgabe einer CD im Jahr 2000”
  • “Herausgabe einer CD im Jahr 2000”
  • “Herausgabe einer CD im Jahr 2000”
  • “Herausgabe einer CD im Jahr 2000”
  • “Herausgabe einer CD im Jahr 2000”

It had translated every single sentence into the same sentence. It won’t surprise you that I dropped the project right there and then; I was not going to try again. “Release of a CD in 2000”: what does it mean? Is there any meaning behind it? I don’t know, and I never will. The only thing I do know is that this project was a total disaster, and it was primarily caused by me going forward despite knowing damn well that I had neither the machine learning knowledge to see such a project through nor anywhere near enough data to do it.
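If you want to quantify just how degenerate the output was, a check like the one below is enough. The prediction file name is a hypothetical placeholder for whatever file `onmt_translate` wrote out; a model collapsing onto a single frequent target sentence like this is a fairly classic symptom of training on far too little data.

```python
from collections import Counter

# Hypothetical path to the file onmt_translate produced, one translation per line.
PRED_FILE = "pred.txt"

with open(PRED_FILE, encoding="utf-8") as f:
    predictions = [line.strip() for line in f if line.strip()]

counts = Counter(predictions)
top_sentence, top_count = counts.most_common(1)[0]

print(f"{len(predictions)} translations, {len(counts)} unique")
print(f'most frequent output ({top_count}x): "{top_sentence}"')
# For the model above, expect exactly one unique output:
# "Herausgabe einer CD im Jahr 2000", repeated for every input sentence.
```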

But hey, at least it was a fun learning process, I guess. Stay tuned for the next complete failure of a project I’m gonna post about here.

