Raw Record Source

{
  "path": "/posts/2024/claude-3.5-sonnet-connections-evals/index",
  "site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
  "tags": [
    "connections",
    "evals"
  ],
  "$type": "site.standard.document",
  "title": "Claude 3.5 Sonnet Connections Evals",
  "updatedAt": "2024-09-21T12:53:47.000Z",
  "publishedAt": "2024-09-21T12:53:47.000Z",
  "textContent": "import Chat from '@components/prose/Chat.astro';\n\nI've continued experimenting with techniques to prompt a language model to solve Connections.\nAt a high level, I set out to design an approach that holds the model to a standard similar to a human player's, within the restrictions of the game.\nThese standards and guardrails include the following:\n\n1. The model is prompted to make only one guess at a time\n2. The model is given feedback after each guess including:\n   - if the guess was correct or incorrect\n   - if 3/4 words were correct\n   - if a guess was invalid (a repeated group, more or fewer than four words, or hallucinated words)\n3. If the model guesses four words that fit in a group, the guess is considered correct, even if the category isn't correct\n\nAn example\n\nHere is an example conversation between the model and the scorer, as the model attempts to solve the puzzle.\n\nA few notes:\n\n- The chat can be collapsed if you'd rather not read it or once you get the general idea\n- XML-like tags such as <scratchpad> are hidden because the browser renders them as invisible DOM elements (I am working on a way around this, but you can look at the code for the exact prompt used)\n\n<Chat\n  model=\"claude-3-5-sonnet-20240620\"\n  messages={[\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: '\"Connections\" is a word categorization game. I will provide you with 16 words, and your goal is to find four groups of four words that share a common category. Each word will belong to only one category in the correct solution. Be careful of words that seem like they could fit in more than one category. Consider guessing other categories first to improve your chances of success by elimination of more obvious groups. 
You have a maximum of four incorrect guesses, so choose carefully!\\n\\nAfter I give you the words, you will suggest one group of four words at a time and the category that connects them. I will provide feedback on whether the group of four words is correct or incorrect. The accuracy of the category name is not important; what matters is that the four words belong together. If three out of the four words you guess share a category, I will let you know. Otherwise, I will simply tell you if your guess was correct or incorrect.\\n\\nDon\\'t get discouraged if you make invalid guesses. Keep trying! I am very patient.\\n\\nThe connection between words is _not_ vague. The connection is clear and unambiguous, although it may not be obvious at first glance.\\n\\nSometimes the categories are \"outside the box\". Here are some examples in the form of Category: WORD1, WORD2, WORD3, WORD4:\\n\\n- Starts of planet names: EAR, MAR, MER, SAT\\n- Second ___: FIDDLE, GUESS, NATURE, WIND\\n- Associated with \"stub\": CIGARETTE, PENCIL, TICKET, TOE\\n- ___ Dream: AMERICAN, FEVER, LUCID, PIPE\\n\\nHere is a example solution to a full puzzle for further context\\n\\nWords:\\n\\nSPRINKLE\\nSPONGE\\nBIRD\\nROSE\\nPICK\\nCHERRY\\nDROP\\nCREAM\\nMUD\\nBUBBLE\\nTOP\\nSPOT\\nRUBY\\nBEST\\nSPLASH\\nBRICK\\n\\nSolution:\\n\\n- A little bit of a beverage: DROP, SPLASH, SPOT, SPRINKLE\\n- Shades of red: BRICK, CHERRY, ROSE, RUBY\\n- ___  Bath: BIRD, BUBBLE, MUD, SPONGE\\n- Choicest: BEST, CREAM, PICK, TOP\\n\\nHere are the 16 words:\\nEXCLUSIVE\\nPROPS\\nLIMITED\\nRARE\\nREAL\\nNICE\\nBLOODY\\nIMPRESSIVE\\nPSYCHIC\\nMYSTIC\\nWAY\\nAWFUL\\nCOLLECTIBLE\\nWELL DONE\\nMEDIUM\\nORACLE\\n\\nFirst do some thinking inside <scratchpad> tags. 
Make loose groupings of words and see if you can find one of the easier groupings.\\nThen make your first guess.\\n\\nOutput guesses in the following format inside the backticks:\\n\\n\\n\\nFor example:\\n\\n\\n\\nGood luck!',\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: '<scratchpad>\\nLet\\'s analyze the words and look for potential groupings:\\n\\n1. Words related to rarity or exclusivity:\\n   EXCLUSIVE, LIMITED, RARE, COLLECTIBLE\\n\\n2. Words that could describe psychic abilities:\\n   PSYCHIC, MYSTIC, ORACLE, MEDIUM\\n\\n3. Positive adjectives:\\n   NICE, IMPRESSIVE, WELL DONE, REAL\\n\\n4. Words that could be used as intensifiers:\\n   BLOODY, AWFUL, WAY, REAL\\n\\n5. Miscellaneous words that don\\'t seem to fit clearly:\\n   PROPS, WAY\\n\\nThe first two groupings seem the most promising, so I\\'ll start with the rarity-related words.\\n</scratchpad>\\n\\nI\\'ll make my first guess with the rarity-related words:\\n\\n',\n        },\n      ],\n    },\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: \"Correct! You've guessed 1/4 groups. You have 4 guesses remaining.\\nCorrect guesses so far: {'COLLECTIBLE', 'LIMITED', 'RARE', 'EXCLUSIVE'}\",\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: 'Excellent! Thank you for confirming. Now let\\'s move on to the next most promising group.\\n\\n<scratchpad>\\nSince we\\'ve confirmed the rarity group, let\\'s focus on the psychic-related words:\\n\\nPSYCHIC, MYSTIC, ORACLE, MEDIUM\\n\\nThese seem to form a coherent group related to supernatural or divination abilities. 
I\\'m quite confident about this grouping, so I\\'ll make this my next guess.\\n</scratchpad>\\n\\nHere\\'s my second guess:\\n\\n',\n        },\n      ],\n    },\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: \"Correct! You've guessed 2/4 groups. You have 4 guesses remaining.\\nCorrect guesses so far: {'COLLECTIBLE', 'LIMITED', 'RARE', 'EXCLUSIVE'} {'PSYCHIC', 'MEDIUM', 'MYSTIC', 'ORACLE'}\",\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: 'Great! We\\'re making good progress. Let\\'s analyze the remaining words.\\n\\n<scratchpad>\\nRemaining words:\\nPROPS, REAL, NICE, BLOODY, IMPRESSIVE, WAY, AWFUL, WELL DONE\\n\\nLooking at these words, I can see a potential group of positive adjectives or expressions:\\nNICE, IMPRESSIVE, WELL DONE, REAL\\n\\nThe word \"REAL\" could fit here as a positive descriptor (e.g., \"That\\'s real nice!\").\\n\\nThe remaining words don\\'t form an obvious group yet, but we\\'ll focus on that after confirming this group.\\n</scratchpad>\\n\\nHere\\'s my third guess:\\n\\n',\n        },\n      ],\n    },\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: \"Incorrect, but three out of four words belong to the same category. You have 3 guesses remaining.\\nCorrect guesses so far: {'COLLECTIBLE', 'LIMITED', 'RARE', 'EXCLUSIVE'} {'PSYCHIC', 'MEDIUM', 'MYSTIC', 'ORACLE'}\",\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: 'Thank you for the feedback. It seems I was close, but one word doesn\\'t fit. Let\\'s reassess.\\n\\n<scratchpad>\\nThe group I proposed was: NICE, IMPRESSIVE, WELL DONE, REAL\\n\\nSince three out of four are correct, one of these doesn\\'t belong. 
\"REAL\" is the most likely outlier here, as it\\'s not inherently positive like the others.\\n\\nLooking at the remaining words:\\nPROPS, REAL, BLOODY, WAY, AWFUL\\n\\n\"PROPS\" could potentially fit as a positive descriptor, as in \"giving props\" or praise to someone.\\n\\nSo, let\\'s try replacing \"REAL\" with \"PROPS\" in our group of positive descriptors.\\n</scratchpad>\\n\\nHere\\'s my revised guess:\\n\\n',\n        },\n      ],\n    },\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: \"Correct! You've guessed 3/4 groups. You have 3 guesses remaining.\\nCorrect guesses so far: {'COLLECTIBLE', 'LIMITED', 'RARE', 'EXCLUSIVE'} {'PSYCHIC', 'MEDIUM', 'MYSTIC', 'ORACLE'} {'NICE', 'IMPRESSIVE', 'WELL DONE', 'PROPS'}\",\n        },\n      ],\n    },\n    {\n      role: 'assistant',\n      content: [\n        {\n          type: 'text',\n          text: 'Excellent! We\\'re making great progress. Now let\\'s focus on the remaining words.\\n\\n<scratchpad>\\nRemaining words:\\nREAL, BLOODY, WAY, AWFUL\\n\\nThese words don\\'t immediately form an obvious category, but let\\'s think about potential connections:\\n\\n1. They could all be used as intensifiers or emphasis words in informal speech:\\n   - REAL (as in \"real quick\")\\n   - BLOODY (British slang for emphasis)\\n   - WAY (as in \"way better\")\\n   - AWFUL (as in \"awful nice\")\\n\\n2. They could all be part of common phrases or idioms:\\n   - REAL deal\\n   - BLOODY hell\\n   - WAY out\\n   - AWFUL lot\\n\\nThe intensifier/emphasis usage seems to be the stronger connection here, as it applies more directly to all four words without requiring additional words to complete phrases.\\n</scratchpad>\\n\\nBased on this analysis, I\\'ll make my final guess:\\n\\n',\n        },\n      ],\n    },\n    {\n      role: 'user',\n      content: [\n        {\n          type: 'text',\n          text: \"Correct! You've guessed 4/4 groups. 
You have 3 guesses remaining.\\nCorrect guesses so far: {'COLLECTIBLE', 'LIMITED', 'RARE', 'EXCLUSIVE'} {'PSYCHIC', 'MEDIUM', 'MYSTIC', 'ORACLE'} {'NICE', 'IMPRESSIVE', 'WELL DONE', 'PROPS'} {'AWFUL', 'WAY', 'REAL', 'BLOODY'}\",\n        },\n      ],\n    },\n  ]}\n/>\n\nPrompting\n\nThe model is given context about the game, a few example word groups as well as a fully solved game with labeled categories (few-shot), and simple chain-of-thought directions (do some thinking inside <scratchpad> tags) for _each guess_.\nI would loosely classify this approach as a rudimentary implementation of what OpenAI has done more generically with o1-preview/o1-mini.\nIn this case, because the right answer is known, the scorer can give the model feedback that steers it toward the goal.\n\nPerformance\n\nEarly on, I ran evals against a few different models, including gpt-4-0613, gpt-4-1106-preview, and claude-3-sonnet-20240229.\nBased on the results of 50-100 games for each of these models, Sonnet seemed the most promising. To keep the cost of this experiment reasonable (~10-20 USD if I recall correctly), I decided to stick with it to see what the best pe",
  "canonicalUrl": "https://www.danielcorin.com/posts/2024/claude-3.5-sonnet-connections-evals/index"
}