{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreia64leml5zqajxa5j77ra4sgmurxkcvvz2k25arlh5vftkiygmwdy",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgvhi62k2tl2"
},
"path": "/t/preprocessing-whatsapp-for-style-cloning-how-to-handle-session-gaps-and-multi-message-blocks/174226#post_1",
"publishedAt": "2026-03-12T17:28:00.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I am preparing a dataset to fine-tune a model on a specific chat style (Person Y) using WhatsApp exports. Most scripts pair messages 1:1, which loses context when one person sends multiple messages in a row.\n\nI’m training on an **RTX 3060 12GB**. Here is the logic I’m using for the pipeline:\n\n**Phase 1: Grouping & Sessions**\n\n * **Block Merging:** Consecutive messages from the same sender are merged into one block (X X X → User block, Y Y → Assistant block).\n\n * **60-Minute Gap:** If a reply takes more than an hour, it starts a new `session_id`.\n\n * **Session Pairing:** To avoid “hallucinated context,” I only pair a User block with an Assistant block if they share the same `session_id`. If Y replies days later, that pair is skipped.\n\n * **Cleaning:** Invisible Unicode characters (`\\u200e`), `<Media omitted>` placeholders, and URLs are stripped.\n\n**Phase 2: Chunking**\n\n * **Word Limit:** 500 words per block.\n\n * **Sentence Splitting:** If a block is over 500 words, it is split at the nearest sentence boundary (`.!?`) so thoughts aren’t cut in half.\n\n**Questions:**\n\n 1. Is 60 minutes a good threshold for a “conversation break” in personal chats? Some replies that still belong to the same conversation arrive more than an hour later, and I’m not sure how to handle those.\n\n 2. When merging messages, is it better to join them with a space or a newline (`\\n`) so the model learns the cadence?\n\n 3. Should I filter out low-signal pairs like “Ok” → “K”, or do they help the model sound more natural?\n\n 4. For Llama 3/Mistral, is there a preferred format for this kind of multi-message block data?\n\nI’m looking for feedback on this logic before I start the training run.",
"title": "Preprocessing WhatsApp for style cloning: How to handle session gaps and multi-message blocks?"
}