LLM course - training a causal language model from scratch
Hi, I’m working through the part of the LLM course on training a causal language model from scratch. I have two problems.
- I’m running the notebook in Google Colab and when I get to
def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets
Even though I have the GPU selected, it says this will take over an hour to process (is this a reasonable amount of time?). I keep getting a warning that I am not using the GPU. How can I get Dataset.map() to run on the GPU (I was under the impression that it should)?
Second, there is a “Try it out!” exercise about modifying the tokenize() function above so that it packs the sequences together before chunking them (so you throw away less of the training data). My approach is to:
- Add the EOS token to the end of each sequence.
- Tokenize all of the sequences.
- Concatenate the lists (either through + or append()).
- Chunk by iterating through the list like
chunks = []
for i in range(0, len(concatenated_sequences), context_length):
    chunks.append(concatenated_sequences[i:i + context_length])
But I have a feeling that this is not the correct way to do it and that it will be incredibly slow. Does anyone have any pointers?
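To make the approach concrete, here is a minimal sketch of those steps in plain Python. The helper names pack_and_chunk and tokenize_packed are hypothetical (they are not from the course). The sketch assumes the tokenizer defines eos_token_id, and it appends the EOS id after tokenizing rather than adding the EOS string beforehand. The incomplete final chunk is dropped, mirroring the length check in the original tokenize().

```python
from itertools import chain


def pack_and_chunk(token_lists, eos_token_id, context_length):
    """Join tokenized sequences with EOS separators and split into fixed-size chunks."""
    # Append the EOS id to every sequence, then flatten into one long token stream.
    stream = list(chain.from_iterable(ids + [eos_token_id] for ids in token_lists))
    # Keep only full chunks; a trailing remainder shorter than context_length is dropped.
    usable = (len(stream) // context_length) * context_length
    return [stream[i : i + context_length] for i in range(0, usable, context_length)]


def tokenize_packed(element, tokenizer, context_length):
    # Hypothetical replacement for tokenize(): no truncation, so every token is kept.
    outputs = tokenizer(element["content"])
    chunks = pack_and_chunk(outputs["input_ids"], tokenizer.eos_token_id, context_length)
    return {"input_ids": chunks}


# Toy check with made-up token ids (0 stands in for the EOS id).
example = pack_and_chunk([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_token_id=0, context_length=4)
print(example)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

If this is on the right track, I assume it could be passed to map() with batched=True, remove_columns, and something like fn_kwargs={"tokenizer": tokenizer, "context_length": context_length}, so that packing happens within each batch rather than across the whole dataset.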
Thank you for any help!