{
"path": "/posts/2024/fine-tuning-connections/index",
"site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
"tags": [
"language_models",
"fine_tuning",
"connections"
],
"$type": "site.standard.document",
"title": "Fine-tuning gpt-3.5-turbo to learn to play \"Connections\"",
"updatedAt": "2024-01-13T02:58:01.000Z",
"publishedAt": "2024-01-13T02:58:01.000Z",
"textContent": "I started playing the NYTimes word game \"Connections\" recently, by the recommendation of a few friends.\nIt has the type of freshness that Wordle lost for me a long time ago.\nAfter playing Connections for a few days, I wondered if an OpenAI language model could solve the game (the objective is to group the 16 words into 4 categories of 4 words).\nI tried with gpt-4-32k and gpt-4-1106-preview, tweaking prompts for a few hours and wasn't able to make much progress.\nIt's certainly possible prompt engineering alone could solve this problem, but it wasn't easy for me for find a path forward.\nI imagine it will involve a bit of creativity.\nI decided this was as good a time as any to try and fine tune a model to do a thing I couldn't easily get it to do with prompts.\n\nGetting the dataset\n\nI remembered seeing at some point that Wordle had an API to return the day's word and it does at https://www.nytimes.com/svc/wordle/v2/yyyy-mm-dd.json.\nI figured Connections might have a similar JSON API and it does.\nI wrote a script to get all the solutions to past game and put the in the connections_data folder with the naming convention yyyy-mm-dd.json.\n\nIt turns out the first game was published on 2023-06-12.\nNext, I wrote some not pretty code to create a jsonl file to upload to OpenAI.\nThey describe the structure of the file to fine-tune the gpt-3.5-turbo model here.\n\nEstimating Price\n\nI estimated the price of the job by counting the number of tokens in my fine-tune file.\nThe OpenAI pricing page lists the price of the fine-tune at $0.0080 / 1K tokens.\nAdditionally, the fine-tuning guide notes\n\n> To estimate the costs for a specific fine-tuning job, use the following formula:\n>\n> \n>\n> For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40 USD.\n\nUsing tiktoken to count tokens\n\nwhich output\n\nRunning the fine-tune job\n\nI took 15 of the rows (number chosen pretty randomly) from the result and 
separated them out into a test file, so that OpenAI could use it to calculate loss for the fine-tuning (I think).\nI called these files connections_prompts_train.jsonl and connections_prompts_test.jsonl.\nI went to OpenAI and created a new fine-tuning job.\nI selected the gpt-3.5-turbo-1106 model, uploaded my training and test data, clicked \"create\", and off it went.\nAs the job ran, I could see this dashboard.\nHere's what it looked like when it finished.\n\n!Fine-tuning progress. A loss graph with two lines, slowly decreasing starting around 3 and down to around 0.2\n\nAnd it cost $0.90.\nAccording to the fine-tune metrics, the job ran for 3 epochs (given my estimated cost of $0.37), as mentioned in the fine-tuning guide, though I don't recall setting that or seeing it anywhere else.\n\n!Fine-tuning cost screenshot\n\nTesting the fine-tune\n\nWith the fine-tuning done, I went to the playground to try it out.\nI fetched the words for the next day's puzzle, which was mercifully already available.\nI forgot to make a validation set.\nShame on me.\nIt would have taken a bit longer to fine-tune again, and I was eager to see the results, since this was my first fine-tune and I didn't really understand the loss graph or know if it had worked.\n\nI copied the same system prompt from the training\n\nthen added the user message with the words\n\nand clicked submit.\nThe model outputted\n\nA definite bummer.\nI did a quick search in my project directory for \"words: level members\", and found it _everywhere_.\nOops!\nI built the dataset improperly.\n\nFixing a dataset bug\n\nAfter a bit of a reset, I found some issues in the prompt generation Python script and fixed them here\n\nI checked my data and validated the correct words for each category were in the output jsonl file.\nThis time around, I remembered to create an external validation set so I had a few cases to run once the fine-tune was ready.\nI split the data up: 60% to train, 20% to validate for the 
fine-tune and 20% to use myself to validate after the fine-tune.\n\nI put up my feet and waited for the fine-tune to run.\n\nThe first run of the fine-tune.\n\n!fine-tune run model output\n\nI'm very glad I had more validation data because I couldn't believe it when I saw it work.\nI ran several more and each time, it got the word groups correct.\nThough occasionally it seemed to struggle to get the category right with x ___ or ___ x categories.\nBut honestly, I was a little stunned.\nBefore this fine-tuning, the model could not come all that close to solving Connections puzzles and now it could generally get the 4 word groups right every time and the categories right most of the time.\nIt seemed too good to be true.\nIt _was_!\nCan you spot the problem?\nThe input words\n\nand the output categories\n\nare in the same order.\n\nWhen I change the input order of the words for this fine-tune\n\nit all falls apart.\n\nThe categories stop making sense.\nThe model hallucinates words.\nSome categories don't even have 4 words in them.\nThis fine-tune is toast.\n\nFixing another dataset bug\n\nBack to the code.\nI modified my dataset generation code, then split my data into three sets again.\nI used a stable random seed, so that the generation would be the same each time.\n\nI ran a few tests\n\nand\n\nThe results are _ok_, but not incredible.\nIt does get some groupings correct.\nThere are hallucinations, adding words that don't exist.\nAlso, most categories and groupings are not correct.\n\nEvaluating the results\n\nTo figure out if this fine-tune was worthwhile, I needed to benchmark against the performance of gpt-3.5-turbo-1106.\n\nI wrote some quick and dirty code to read the validation.jsonl, make an OpenAI call, parse the response and compare it to the known, correct answer.\nI decided to measure percentage of puzzles correct (all four categories) and percentage of categories correct (4 words correctly grouped).\nI also augmented the user prompt a bit to get the 
un-fine-tuned model to output its response the same way the fine-tuned model would for consistent parsing.\n\nI started by running the code against gpt-3.5-turbo-1106.\n\nNext, I ran it against my fine-tuned model.\n\nNot a huge difference.\nI did a few more runs and none of the results were too out of the ordinary.\nOther than consistent formatting, it's not clear the model got all that much better at the game after fine-tuning.\n\nWrap up\n\nThis experience was an interesting introduction to model fine-tuning.\nThe results weren't that amazing, but I learned a lot about easy pitfalls and mistakes one can make and had some fun.\n\nFuture work\n\nTo see if there might be more to explore for this project, I ran the validation set through gpt-4 with the following results\n\nThis is a notable improvement over gpt-3.5's ~20-25%.\nI requested access to fine-tune gpt-4, so we'll see if that comes through and I can try it out.",
"canonicalUrl": "https://www.danielcorin.com/posts/2024/fine-tuning-connections/index"
}