Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreighfb6ku24pz3lln7btwji3ghbgzywa2i7pfrxcwgu2okzvpvlcha",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mfln73q4ahg2"
  },
  "path": "/t/llm-course-training-a-causal-language-model-from-scratch/173736#post_1",
  "publishedAt": "2026-02-24T05:39:39.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi, I’m working through the LLM course section on training a causal language model from scratch. I have two problems.\n\n  1. I’m running the notebook in Google Colab and when I get to\n\n`def tokenize(element):`\n`    outputs = tokenizer(`\n`        element[\"content\"],`\n`        truncation=True,`\n`        max_length=context_length,`\n`        return_overflowing_tokens=True,`\n`        return_length=True,`\n`    )`\n`    input_batch = []`\n`    for length, input_ids in zip(outputs[\"length\"], outputs[\"input_ids\"]):`\n`        if length == context_length:`\n`            input_batch.append(input_ids)`\n`    return {\"input_ids\": input_batch}`\n\n`tokenized_datasets = raw_datasets.map(`\n`    tokenize, batched=True, remove_columns=raw_datasets[\"train\"].column_names`\n`)`\n`tokenized_datasets`\n\nEven though I have the GPU selected, it says this will take over an hour to process. Is this a reasonable amount of time? I also keep getting a warning that I am not using the GPU. How can I get `Dataset.map()` to run on the GPU? I was under the impression that it should.\n\n  2. Second, there is a “Try it out!” exercise about modifying the `tokenize()` function above so that it packs the sequences together before chunking them (so you throw away less of the training data). My approach is to\n\n     1. Add the EOS token to the end of each sequence.\n\n     2. Tokenize all of the sequences.\n\n     3. Concatenate the lists (either through `+` or `append()`).\n\n     4. Chunk by iterating through the list like\n\n`chunks = []`\n`for i in range(0, len(concatenated_sequences), context_length):`\n`    chunks.append(concatenated_sequences[i:i+context_length])`\n\nBut I have a feeling that this is not the correct way to do it and that it will be incredibly slow. Does anyone have any pointers?\n\nThank you for any help!",
  "title": "LLM course - training a causal language model from scratch"
}