Example: Training an LLM on the WikiText Dataset

Before moving forward with this tutorial, make sure you've followed our CLI Getting-Started Guide.

Our platform is optimized for AI and Big Data workloads. To showcase the workflow of our platform, we created a simple example use case that trains a GPT-2 model on the WikiText dataset. The following Python code shows how we use PyTorch together with the transformers and datasets packages from Hugging Face.

train.py

from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

def main():
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')

    # GPT-2 ships without a padding token, so we reuse the end-of-sequence token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    train_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split='train')
    eval_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split='validation')

    def tokenize(examples):
        return tokenizer(examples['text'], padding="max_length",
                         truncation=True, max_length=512)

    tokenized_train = train_dataset.map(tokenize, batched=True)
    tokenized_eval = eval_dataset.map(tokenize, batched=True)

    # For causal language modeling (mlm=False) the collator derives the
    # labels from the input IDs, which the model needs to compute a loss
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        weight_decay=0.01,
        num_train_epochs=3,
        warmup_steps=500,
        save_steps=10_000,
        logging_dir="./logs",
        load_best_model_at_end=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        data_collator=data_collator,
    )

    trainer.train()

if __name__ == "__main__":
    main()
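
Before containerizing the script, you can optionally sanity-check it locally. This assumes a Python environment with the same packages the Dockerfile below installs; note that a full pass over WikiText-103 is a long run, so a GPU is strongly recommended:

pip install torch transformers datasets accelerate

python train.py

For a quick smoke test, you can also shrink the data by slicing the split in load_dataset, e.g. split='train[:1%]', which is a standard feature of the datasets library.
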
From this code we created a simple Dockerfile:

Dockerfile

# Use an official PyTorch image with CUDA support if GPU training is desired
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install additional Python packages
RUN pip install -U transformers datasets accelerate

# Copy the training script into the container
COPY train.py /app/train.py

# Command to run the training script
CMD ["python", "train.py"]

Creating the training job

We provide the above container image publicly at europe-docker.pkg.dev/perian/workloads/pytorch-llm.
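
To verify the image locally before submitting a job, you can run it with Docker. The --gpus all flag assumes a local NVIDIA GPU and the NVIDIA Container Toolkit; without it, the container falls back to (very slow) CPU training:

docker run --rm --gpus all europe-docker.pkg.dev/perian/workloads/pytorch-llm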

Now that we have a suitable workload for our platform, we can create the job that runs the training of our LLM. For this example, we will use an NVIDIA A100 accelerator.

perian job create --image europe-docker.pkg.dev/perian/workloads/pytorch-llm --accelerator-type A100

The above command creates the job on the Sky Platform and returns its ID.

Monitoring the job

As the training progresses, we can take a look at the job artifacts, e.g. the logs. We can use the ID returned by the previous command to fetch the job details directly, or first list all jobs.

perian job get --last

Running the above command gives us all the information about the training job we submitted. We can see the cloud provider and the instance type that the platform automatically selected to execute the job.

Beyond that, we can inspect the full logs of the job, where we can see that the training of our small LLM finished successfully.