Preprocessing WhatsApp for style cloning: How to handle session gaps and multi-message blocks?
I am preparing a dataset to fine-tune a model on a specific chat style (Person Y) using WhatsApp exports. Most scripts pair messages 1:1, which loses context when one person sends multiple messages in a row.
I’m training on an RTX 3060 12GB. Here is the logic I’m using for the pipeline:
Phase 1: Grouping & Sessions
Block Merging: Consecutive messages from the same sender are merged into one block. (X X X → User block, Y Y → Assistant block).
60-Minute Gap: If a reply takes over an hour, it starts a new session_id.
Session Pairing: To avoid “hallucinated context,” I only pair a User block with an Assistant block if they share the same Session ID. If Y replies days later, that pair is skipped.
Cleaning: Stripping invisible Unicode characters (\u200e), <Media omitted>, and URLs.
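For reference, the Phase 1 steps above can be sketched roughly like this. It is only a minimal sketch, not my actual script: it assumes the export has already been parsed into time-sorted (timestamp, sender, text) tuples, the names clean, build_blocks, make_pairs and the sender label "Y" are placeholders, and the gap is measured between any two consecutive messages. The joiner is a parameter because the space-vs-newline choice is still open.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=60)  # threshold from the post; tune as needed

# Directional marks and BOM that WhatsApp exports often contain (assumed set).
INVISIBLE = re.compile(r"[\u200e\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
URL = re.compile(r"https?://\S+")
MEDIA = "<Media omitted>"


def clean(text):
    """Strip invisible characters, media placeholders and URLs."""
    text = INVISIBLE.sub("", text)
    text = text.replace(MEDIA, "")
    text = URL.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


@dataclass
class Block:
    sender: str
    session_id: int
    lines: list


def build_blocks(messages):
    """Merge consecutive same-sender messages; bump session_id on long gaps.

    messages: iterable of (datetime, sender, text), sorted by time.
    """
    blocks, session_id, prev_ts = [], 0, None
    for ts, sender, text in messages:
        if prev_ts is not None and ts - prev_ts > SESSION_GAP:
            session_id += 1
        prev_ts = ts
        text = clean(text)
        if not text:  # message was only media, a URL, or invisible chars
            continue
        last = blocks[-1] if blocks else None
        if last and last.sender == sender and last.session_id == session_id:
            last.lines.append(text)
        else:
            blocks.append(Block(sender, session_id, [text]))
    return blocks


def make_pairs(blocks, target="Y", joiner="\n"):
    """Pair each non-target block with the target block that directly
    follows it, but only when both belong to the same session."""
    pairs = []
    for user, reply in zip(blocks, blocks[1:]):
        if (user.sender != target and reply.sender == target
                and user.session_id == reply.session_id):
            pairs.append((joiner.join(user.lines), joiner.join(reply.lines)))
    return pairs


t0 = datetime(2024, 1, 1, 10, 0)
demo = [
    (t0, "X", "hey\u200e"),
    (t0 + timedelta(minutes=1), "X", "you there?"),
    (t0 + timedelta(minutes=5), "Y", "yes"),
    (t0 + timedelta(minutes=6), "Y", "<Media omitted>"),
    (t0 + timedelta(minutes=10), "X", "call me"),
    (t0 + timedelta(days=3), "Y", "sorry, was busy"),
]
pairs = make_pairs(build_blocks(demo))
print(pairs)
```

In the demo, the late "sorry, was busy" reply lands in a new session, so the "call me" block gets no pair, which is the skipping behaviour described above.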
Phase 2: Chunking
Word Limit: 500 words per block.
Sentence Splitting: If a block is over 500 words, it splits at the nearest sentence boundary (.!?) so thoughts aren’t cut in half.
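The chunking rule, again as a rough sketch under assumptions: chunk_block is a placeholder name, a "sentence boundary" is taken to be whitespace following ., ! or ?, and sentences are packed greedily up to the word limit. Two known limitations: a single sentence longer than the limit is kept whole as an oversized chunk, and sentences are rejoined with a space, so newlines between sentences inside a split block are lost.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_block(text, max_words=500):
    """Split an over-long block into chunks of at most max_words words,
    cutting only at sentence boundaries."""
    if len(text.split()) <= max_words:
        return [text]  # short blocks pass through untouched

    chunks, current, count = [], [], 0
    for sentence in SENTENCE_END.split(text):
        n = len(sentence.split())
        if n == 0:
            continue
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five! Six seven eight nine? Ten."
print(chunk_block(sample, max_words=5))
```

With max_words=5 the sample splits into two chunks of five words each, both ending on a sentence boundary.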
Questions:
Is 60 minutes a good threshold for a “conversation break” in personal chats? Some replies that still belong to the same conversation arrive more than an hour later, and I’m not sure how to handle those cases.
When merging messages, is it better to join them with a space or a newline (\n) for the model to learn the cadence?
Should I filter out low-signal pairs like “Ok” → “K”, or does that help the model sound more natural?
For Llama 3/Mistral, is there a preferred format for this kind of multi-message block data?
Looking for feedback on the logic before I start the training run.