Example: Training an LLM on the WikiText Dataset
Before moving forward with this tutorial, make sure you've followed our CLI Getting-Started Guide.
train.py
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset


def main():
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')

    # GPT-2 has no padding token by default; reuse the EOS token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load both the train and validation splits so the Trainer can evaluate
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

    # Drop empty lines so no example consists solely of padding
    dataset = dataset.filter(lambda example: len(example['text']) > 0)

    tokenized_datasets = dataset.map(
        lambda examples: tokenizer(
            examples['text'],
            padding="max_length",
            truncation=True,
            max_length=512,
        ),
        batched=True,
    )

    # For causal LM training, the collator copies input_ids to labels
    # and masks padding positions so they are ignored by the loss
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./results",
        eval_strategy="steps",
        eval_steps=10_000,
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        weight_decay=0.01,
        num_train_epochs=3,
        warmup_steps=500,
        save_steps=10_000,
        logging_dir="./logs",
        load_best_model_at_end=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
    )

    trainer.train()


if __name__ == "__main__":
    main()
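Before launching a full training run, you can sanity-check the tokenization on a single example. The snippet below is a minimal sketch; the sample text and the small max_length are illustrative only, not part of the training script.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # same padding setup as in train.py

# Tokenize a short sample the same way the training script does
encoded = tokenizer("Hello WikiText!", padding="max_length", truncation=True, max_length=16)
print(encoded["input_ids"])       # token IDs, padded with the EOS id (50256)
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding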
Dockerfile
# Use an official PyTorch image with CUDA support if GPU training is desired
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
WORKDIR /app
# Install additional Python packages
RUN pip install -U transformers datasets accelerate
# Copy the training script into the container
COPY train.py /app/train.py
# Command to run the training script
CMD ["python", "train.py"]
Creating the training job
We provide the above container image publicly at europe-docker.pkg.dev/perian/workloads/pytorch-llm.
Now that we have a suitable workload for our platform, we can create the job to run the training of our LLM. For this example, we will use an NVIDIA A100 accelerator.
perian job create --image europe-docker.pkg.dev/perian/workloads/pytorch-llm --accelerator-type A100
The above command will create the job on the Sky Platform and return its ID.
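If you want to reuse the job ID in later commands, you can capture it in a shell variable. This sketch assumes the command prints only the ID to stdout; adjust the parsing to the CLI's actual output format.

JOB_ID=$(perian job create --image europe-docker.pkg.dev/perian/workloads/pytorch-llm --accelerator-type A100)
echo "Created job $JOB_ID"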
Monitoring the job
As training progresses, we can take a look at the job's artifacts, e.g. its logs. We can use the ID returned by the previous command to fetch the job details directly, or first list all jobs.
perian job get --last
Running the above command will give us all the information regarding the training job we submitted, including the cloud provider and the instance type that the platform automatically selected to execute the job.
We can also inspect the job's full logs, which show that the training of our small LLM finished successfully.