Robocox: Fine-tuning GPT-2 for rowing

Posted on February 10, 2020

At the moment, my main hobby is rowing. A rowing boat with eight rowers has a coxswain (cox) who steers the boat. The cox also has a microphone linked to speakers in the hull, which they use to communicate with and motivate the rowers.

I have made a bot that imitates the styles and phrases often used by coxes during races.

Generating the text takes around 3-5 minutes once you have pressed the button. Why not read more about how to make a bot like this below while you wait?
The coxswains in the training data often swear, so expect Robocox to do the same in its output.

How it all works

An organisation called OpenAI trained a machine learning model on a very large amount of text from the Internet.

We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.

The model they trained is called GPT-2, and some versions of it are open source and available for free.

The model is large (the one used here has 124 million parameters) and, as mentioned before, the amount of text used to train it is HUGE, so it would be quite difficult for someone like me to duplicate their work starting from nothing. EDIT: maybe not that difficult, just extremely expensive!

What someone like me can do is fine-tune the existing model on my own data. This is also called “transfer learning”, because some of the knowledge from the original network is “transferred” to a new problem.

This is easier to understand by considering an image classifier trained on ImageNet. Such a model is a deep neural network, so there are many layers. The top layers take input from layers lower down and use it to make the classification. What are the lower layers doing?

In an ImageNet classifier, the lower layers seem to recognise features in images like the edges of objects or the textures of surfaces. These features are then combined by the layers above to make the final prediction. The idea of transfer learning is to keep the lower layers that recognise these features but to retrain the final layers that combine them into a prediction.

For the ImageNet example, this might mean making a classifier for some new objects (e.g. cats and dogs) using the more general features of the original model.
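
To make that concrete, here is a minimal sketch of image transfer learning. It is not the code behind Robocox; it just illustrates the idea of freezing the pre-trained feature layers and training only a new classification head, assuming Keras and a MobileNetV2 base:

```python
# Minimal transfer-learning sketch (illustrative, not the code from this post):
# reuse a network pre-trained on ImageNet as a frozen feature extractor,
# then train only a small new head for the new cats-vs-dogs problem.
import tensorflow as tf

# include_top=False drops the original 1000-class ImageNet head.
base = tf.keras.applications.MobileNetV2(weights="imagenet",
                                         include_top=False,
                                         pooling="avg")
base.trainable = False  # keep the general-purpose features as they are

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # new head: cats vs dogs
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)  # train only the new head on your own data
```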

For the GPT-2 example here, the original model produces text that is representative of a broad swathe of the internet. This has all the problems you might expect:

72% of “why” questions in SQuAD to be answered “to kill american people”, and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts

From Universal Adversarial Triggers for Attacking and Analyzing NLP

I took this model and then “transferred” the learning onto coxing by fine-tuning it on transcripts of what coxes have actually said during races.

There is an open source project called GPT-2 Simple that made this very easy to do. One surprising thing is how little extra training data was required: only 1231 lines of text, representing just four races!
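
For reference, the fine-tuning workflow with gpt-2-simple looks roughly like the sketch below. The file name, step count, and prompt are my own illustrative assumptions, not the exact settings used for Robocox:

```python
# Rough sketch of fine-tuning GPT-2 with the gpt-2-simple package.
# "coxing.txt" (a plain-text file of cox transcripts), the step count and the
# prompt are illustrative assumptions, not the exact Robocox settings.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")   # fetch the small pre-trained GPT-2 model

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="coxing.txt",     # transcripts of real coxes during races
              model_name="124M",
              steps=1000)               # a modest amount of extra training

# Generate some Robocox-style output from the fine-tuned checkpoint.
gpt2.generate(sess, length=200, temperature=0.8,
              prefix="Come on, sit up and drive")
```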

I am slowly increasing the size of the training corpus which should improve results somewhat, but the current version isn’t bad for a weekend project!