LLM course - training a causal language model from scratch

Hugging Face Forums [Unofficial] February 24, 2026

Hi, I’m working through the LLM course section on training a causal language model from scratch, and I have two problems.

  1. I’m running the notebook in Google Colab and when I get to

    def tokenize(element):
        outputs = tokenizer(
            element["content"],
            truncation=True,
            max_length=context_length,
            return_overflowing_tokens=True,
            return_length=True,
        )
        input_batch = []
        for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
            if length == context_length:
                input_batch.append(input_ids)
        return {"input_ids": input_batch}


    tokenized_datasets = raw_datasets.map(
        tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
    )
    tokenized_datasets

Even though I have the GPU selected, it says this will take over an hour to process. Is this a reasonable amount of time? I also keep getting a warning that I am not using the GPU. How can I get Dataset.map() to run on the GPU? I was under the impression that it should.

  2. Second, there is a “Try it out!” exercise about modifying the tokenize() function above so that it packs the sequences together before chunking them (so you throw away less of the training data). My approach is to:

    1. Add the EOS token to the end of each sequence.

    2. Tokenize all of the sequences.

    3. Concatenate the lists (either through + or extend())

    4. Chunk by iterating through the list like

        chunks = []
        for i in range(0, len(concatenated_sequence), context_length):
            chunks.append(concatenated_sequence[i:i + context_length])

But I have a feeling that this is not the correct way to do it and that it will be incredibly slow. Does anyone have any pointers?

Thank you for any help!
