Preprocessing WhatsApp for style cloning: How to handle session gaps and multi-message blocks?

Hugging Face Forums [Unofficial] March 12, 2026

I am preparing a dataset to fine-tune a model on a specific chat style (Person Y) using WhatsApp exports. Most scripts pair messages 1:1, which loses context when one person sends multiple messages in a row.

I’m training on an RTX 3060 12GB. Here is the logic I’m using for the pipeline:

Phase 1: Grouping & Sessions

  • Block Merging: Consecutive messages from the same sender are merged into one block. (X X X → User block, Y Y → Assistant block).

  • 60-Minute Gap: If a reply takes over an hour, it starts a new session_id.

  • Session Pairing: To avoid “hallucinated context,” I only pair a User block with an Assistant block if they share the same Session ID. If Y replies days later, that pair is skipped.

  • Cleaning: Stripping invisible Unicode characters (e.g. \u200e), <Media omitted> placeholders, and URLs.
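
To make the Phase 1 logic concrete, here is a minimal sketch, not my exact script. It assumes the export is already parsed into (timestamp, sender, text) tuples sorted by time. The names clean, build_blocks, pair_blocks, and the "Y" sender label are hypothetical. It also assumes any gap over 60 minutes between consecutive messages starts a new session, and that everyone other than Y counts as the User side.

```python
import re
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=60)  # threshold described above
TARGET = "Y"                         # hypothetical label for the person being cloned

# Cleaning patterns: a few invisible direction marks (including \u200e) and URLs.
INVISIBLE = re.compile(r"[\u200e\u200f\u202a-\u202e]")
URL = re.compile(r"https?://\S+")


def clean(text):
    """Strip invisible marks, the media placeholder, and URLs."""
    text = INVISIBLE.sub("", text)
    text = text.replace("<Media omitted>", "")
    text = URL.sub("", text)
    return text.strip()


def build_blocks(messages):
    """Merge consecutive same-sender messages into blocks, tagged with a session id."""
    blocks = []
    session_id = 0
    prev_time = None
    for ts, sender, text in messages:
        # The gap is measured on every message, even ones that clean to nothing.
        if prev_time is not None and ts - prev_time > SESSION_GAP:
            session_id += 1
        prev_time = ts
        text = clean(text)
        if not text:
            continue
        last = blocks[-1] if blocks else None
        if last and last["sender"] == sender and last["session"] == session_id:
            last["texts"].append(text)
        else:
            blocks.append({"sender": sender, "session": session_id, "texts": [text]})
    return blocks


def pair_blocks(blocks, joiner="\n"):
    """Pair a User block with the Assistant block right after it, same session only."""
    pairs = []
    for prev, cur in zip(blocks, blocks[1:]):
        if (prev["sender"] != TARGET and cur["sender"] == TARGET
                and prev["session"] == cur["session"]):
            pairs.append({
                "user": joiner.join(prev["texts"]),
                "assistant": joiner.join(cur["texts"]),
                "session": cur["session"],
            })
    return pairs


if __name__ == "__main__":
    t0 = datetime(2026, 3, 1, 10, 0)
    demo = [
        (t0, "X", "hey"),
        (t0 + timedelta(minutes=1), "X", "you there?"),
        (t0 + timedelta(minutes=2), "Y", "yes"),
        (t0 + timedelta(days=3), "Y", "sorry, late reply"),
    ]
    print(pair_blocks(build_blocks(demo)))
```

The joiner is a parameter because I have not settled on space versus newline yet.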

Phase 2: Chunking

  • Word Limit: 500 words per block.

  • Sentence Splitting: If a block is over 500 words, it splits at the nearest sentence boundary (.!?) so thoughts aren’t cut in half.
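
The chunking step could look roughly like the sketch below. The function name chunk_block is hypothetical, and I am reading "nearest sentence boundary" as "the last boundary that keeps the chunk at or under the limit" (a greedy split). Two caveats under these assumptions: a single sentence longer than the limit is kept whole, and splitting on whitespace after .!? replaces newlines inside an oversized block with spaces.

```python
import re

MAX_WORDS = 500  # word limit per block, as described above

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_block(text, max_words=MAX_WORDS):
    """Split an oversized block into chunks that end on sentence boundaries."""
    if len(text.split()) <= max_words:
        return [text]  # short blocks pass through untouched

    chunks, current, count = [], [], 0
    for sentence in SENTENCE_END.split(text.strip()):
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five! Six seven eight nine? Ten."
    for chunk in chunk_block(sample, max_words=5):
        print(repr(chunk))
```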

Questions:

  1. Is 60 minutes a good threshold for a “conversation break” in personal chats? Replies sometimes take longer than an hour, and I’m not sure how to handle those cases.

  2. When merging messages, is it better to join them with a space or a newline (\n) for the model to learn the cadence?

  3. Should I filter out low-signal pairs like “Ok” → “K”, or does that help the model sound more natural?

  4. For Llama 3/Mistral, is there a preferred format for this kind of multi-message block data?

Looking for feedback on the logic before I start the training run.
