
My experiments with LLMs 2

2023/05/30

As I mentioned in my last post, I’ve been experimenting with LLMs.
I’ve fine-tuned a couple of my own models. This post goes over my experiments and their conclusions. Note that I’m not an expert, I’m just messing around.

Fine-tuning models

The largest one I’ve fine-tuned so far has been around 3B parameters. The model size limit has been due to GPU VRAM - training a model in mixed precision takes roughly 12 bytes of VRAM per parameter. A 3B model required about 33GB of VRAM, and that was the VRAM-optimized version: gradient checkpointing, the Adafactor optimizer, fp16 mixed precision, and a batch size of 1 with gradient accumulation. You can read more about those terms here. So, if you wanted to train a 3B model in fp16 on consumer hardware, you’d already need 2x RTX 3090 to pull it off. I rented an RTX A6000 for a couple of hours instead.
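
For reference, this is roughly how those VRAM-saving options map onto huggingface transformers’ TrainingArguments. It’s just a sketch - the output path and accumulation steps are placeholders, not the values I actually used.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='out',                 # placeholder
    fp16=True,                        # fp16 mixed precision
    gradient_checkpointing=True,      # recompute activations to save VRAM
    optim='adafactor',                # Adafactor instead of Adam's two extra states per parameter
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # placeholder; simulates a larger batch
)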

There are some ‘fresh from the oven’ solutions like QLoRA: Efficient Finetuning of Quantized LLMs, but I’m still a bit cautious about quantization and LoRAs. If I can make my models work without these ‘hacks’, it will take almost no work to adopt them later. The biggest issue is collecting high-quality data, and PEFT isn’t going to change that. However, if I dove head first into the latest parameter-efficient tech, I could draw wrong conclusions from something that is really a LoRA/quantization side effect. I’ll wait until more research is done. As far as I know, none of the commercial models use a LoRA, an adapter, or quantization. The tech is very new.

To train bigger models you need pretty crazy computing infrastructure. To train on multiple GPUs, the model layers are split across the GPUs and the outputs from each layer can be pipelined. As long as the biggest layer fits on a single GPU, this works. I think you can do about 30B parameters in fp16 on a single machine with 8x A100 80GB or something. By using a large batch size (in practice it’s gradient accumulation, I think) the pipeline inefficiency becomes acceptable. Too slow for pre-training, but okay for fine-tuning.
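
Using the same rough 12 bytes per parameter figure from above, a quick back-of-the-envelope check (my own arithmetic, not a measurement from a real run):

params = 30e9              # 30B parameter model
bytes_per_param = 12       # rough mixed-precision training footprint from above
num_gpus = 8               # 8x A100 80GB
per_gpu_gb = params * bytes_per_param / num_gpus / 1e9
print(per_gpu_gb)          # ~45 GB per GPU, which fits in the 80GB budget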

Prompt design

The pre-trained models don’t have any prompt structure; they just complete random text. The data on the internet has some structure, so it’s possible to do some prompt engineering (like the TLDR or Q:, A: tricks), but it’s inconsistent. That’s why I adopted a custom prompt structure for my fine-tuned models. The V1 and V2 versions were in the last post.

V1

###I###
Jhon talks to Mary about his dog.
happy
###P###
Jhon: Hi.
Mary: Hey, Jhon. How are you?
Jhon: I'm good. How are you?

V2

###C###
Howler is sensitive to sunlight.
Lex has a plan to make the world safe for war profits again.
###I###
People:
Howler {gender: ?, age: ?, tags: }
Lex {gender: male, age: ?, tags: }

Mode:
dialog

Keywords:
action, family, movie-corpus, sci-fi, adventure, fantasy
###P###
Howler {}: Close the curtains. The sun is hurting my eyes.
Lex {}: Of course it is. Do you know what the sun is? Why it's just one huge nuclear bomb. A bomb with enough radiation to incinerate an average man like . . .  that.
Howler {}: ...

Tags are special tokens that I intended to use to differentiate between personas. The same tags appear both in the ###I### section and in the curly braces before the ‘:’ in the dialog lines.

V2 has more context information. That might seem like a good thing, but it makes training data a lot harder to find, because all of your training data needs to be in this format. You could leave the context blank, but that might not help much. For example, let’s say you have a common-knowledge question-answering dataset, and you leave the context blank. The model could learn that if the context isn’t blank, then it shouldn’t answer questions, but should instead behave like the other datasets that had context. You probably included the question-answering dataset because you wanted the model to be better at answering questions, but instead, it learns the structure of your training data and ignores those skills unless you leave the context blank. The model has already learned English during pre-training; fine-tuning data should be a direct representation of what you want the model to output.

In my experience, the model does seem to transfer some knowledge, but training with partial context definitely doesn’t transfer too well because of the issues mentioned above. I don’t have many options other than to work with the data I can find, so I hope that I can do fine-tuning in two steps. First I train on a mixture of different partial-context datasets like question answering, instruction following, and chatting, and then I do another fine-tune with higher-quality data that uses them all at once. After the first pass the model still learns how to do those tasks, it just doesn’t know when to apply them; I hope the second fine-tune pass will teach it to use them together with less data. There is just no way that I’m gathering tens of thousands of examples myself. So far the results have been promising, but I don’t have anything to compare to yet. I will definitely train a model without this secondary pre-training step to see if it helps at all.

Another issue I ran into is that the model really struggles with understanding the prompt. It kind of recognizes what the sections are about, but it doesn’t make good use of the information. My guess is that the prompt is just too complex for the amount of data I have, or the model doesn’t have enough parameters to pick up those patterns. For example, if I ask the model its name, it still sometimes messes up or uses the name of the person it’s talking to. Harder questions like “How many people are in this conversation?” are even worse. I composed a small dataset with questions like “What’s your name?”, “What are the names of the people in this conversation?” and so on. It helped a bit, but not much. That’s why I decided to simplify the prompt structure in the next iteration.

Also, I figured the keywords do more harm than good, so I decided to remove them in the next iteration. I think they could potentially be useful for something like prompt tuning (some info below). However, that’s too complicated for now, so they just end up altering the model’s behavior. For example, if I have a movie script with an ‘action’ tag, the model kind of learns that ‘this isn’t action’ when the tag is missing, because during training, whenever it saw ‘action’, it learned to behave differently. I don’t want this kind of conditioning on the keywords to happen right now.

In prompt tuning you give up some tokens of context and replace them with learnable parameters instead. In the transformer architecture, all tokens are converted into embeddings through a lookup table. What you can do is skip this tokens -> embeddings step for part of the context and turn those embeddings into learnable parameters, which you can then fine-tune through backpropagation. This is often used to alter model behavior without full fine-tuning, but the model can’t really learn new knowledge this way.
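
Here’s a minimal PyTorch sketch of the idea (my own illustration, not the code of any particular library): a block of learnable embeddings gets prepended to the embedded input before it goes into the otherwise frozen model.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Learnable 'virtual token' embeddings prepended to the real token embeddings.
    def __init__(self, num_virtual_tokens, embed_dim):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds):  # token_embeds: (batch, seq_len, embed_dim)
        prompt = self.prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# Usage sketch: embed the real tokens yourself, prepend the soft prompt, and pass
# inputs_embeds to the frozen model so only self.prompt gets trained.
# token_embeds = model.get_input_embeddings()(input_ids)
# outputs = model(inputs_embeds=soft_prompt(token_embeds))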

Another thing I don’t like about the prompt is that it uses common ASCII characters for structure. This means that if you use things like a colon (:) or curly braces ({}) in a sentence, the model sometimes gets confused. The model even copied the curly braces along with its name sometimes when answering questions like “What’s your name?”, even though that never happened in the training data. It also leaves a lot of room for things like prompt injection. You would have to carefully escape all the input if you were to make the model publicly accessible.
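
As a rough illustration of the escaping problem, something like this hypothetical helper would be needed for the V1/V2 formats before user text goes into the prompt (the replacement characters are arbitrary choices, not what I actually do):

import re

def sanitize_user_text(text):
    # Strip the characters the V1/V2 prompt format treats as structure.
    text = re.sub(r'#{3,}', '', text)              # drop ###SECTION### style markers
    text = text.replace('{', '(').replace('}', ')')
    text = text.replace(':', ';')                  # crude, but keeps the role markers unambiguous
    return text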

V3

<CTX> fact1<SEP> fact2<SEP><PER> name<GENDER> gender<TAGS>tags<PER> name<GENDER> gender<TAGS>tags<MODE> mode<CHAT> name<UTTER> sentence<SEP> name2<UTTER> sentence2<SEP> ...

The main thought behind this structure is that computers don’t care about what looks good to us. Text laid out in two dimensions and separated by lines is easier for us to read, but computers don’t care. The symbols in angle brackets (<>) are special tokens. The spaces are intentional because they result in tokenization that is more similar to normal text, and they don’t actually use more tokens.
Tokens can be added during fine-tuning.
Example using huggingface transformers:

# Register the new special tokens and grow the embedding matrix to the new vocabulary size.
special_tokens_dict = {'additional_special_tokens': ['<CTX>', '<SEP>', '<PER>']}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

The results have been a mixed success.

Good things:

Bad things:

The biggest issue out of the bad things was the second one: the model became much worse at things like understanding its own name. I think it’s because training a new token embedding from scratch requires a lot more data. The model has seen the other tokens hundreds of billions of times during pre-training, but the new tokens have hardly any meaning to it. This is bad enough that I might have to go back to not using these special tokens. The lesson so far has definitely been to stay as close to the pre-training data (random internet text) as possible.
In hindsight, I should’ve probably copied the embeddings of existing tokens to the new tokens instead of using randomly initialized ones. For example, I could’ve used the embedding of the word “tag” for the <TAGS> token.
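
For reference, here’s a sketch of how that could look with huggingface transformers, assuming the tokenizer and model from the earlier snippet (the choice of ‘ tag’ as the source word is just an example):

import torch

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    tags_id = tokenizer.convert_tokens_to_ids('<TAGS>')
    src_ids = tokenizer.encode(' tag', add_special_tokens=False)
    # Initialize the new token from the average of the existing word's subword embeddings.
    emb[tags_id] = emb[src_ids].mean(dim=0)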

Weighting tokens

By default, huggingface transformers considers all tokens to be equal when computing the training loss. Since I’m working with smaller models, I felt like forcing the model to learn to generate the whole prompt is inefficient. I thought that reducing the loss for mispredicting prompt tokens could be more parameter-efficient and give better output for the number of parameters my model has. I had also noticed the “learning_rate_multiplier” parameter in the OpenAI fine-tune API, so I’m certainly not the first one to have the idea of weighting the training signal.

learning_rate_multiplier (number, optional, defaults to null)
The learning rate multiplier to use for training. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this value. By default, the learning rate multiplier is the 0.05, 0.1, or 0.2 depending on final batch_size (larger learning rates tend to perform better with larger batch sizes). We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results.

Of course, I’m not working with the OpenAI API, so I need to implement it myself. I add the weights to the tokenized training data and set

remove_unused_columns=False

in TrainingArguments.
From there, all I need is a custom loss function.

import torch.nn.functional as F
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(inputs['input_ids'])
        # Shift so that each position's logits are scored against the next token.
        shift_logits = outputs['logits'][:, :-1, :].contiguous()
        labels = inputs['labels'][:, 1:].contiguous()
        weights = inputs['weights'][:, 1:].contiguous()
        # Per-token cross-entropy, scaled by the per-token weights.
        loss_per_token = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)), labels.view(-1), reduction="none")
        loss = (loss_per_token * weights.view(-1)).mean()
        return (loss, outputs) if return_outputs else loss

I used a weight of 0.05 for prompt tokens, 0.1 for non-persona utterances, and 1.0 for the expected output tokens. I felt like this change made my 410M parameter model’s output more similar to the 1B parameter model’s. However, a feeling is not an objective measure, so draw your own conclusions. I’ll probably revisit the comparison at some point; it would be a shame to keep using this and have it actually hurt model performance. LLM evaluation is a hard problem, as seen by the ‘99% of ChatGPT performance’ claims made by some open-source models based on automated metrics or prompting GPT-4 for ratings. The best metric so far is human evaluation, but that’s expensive and time-consuming.
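
For completeness, here’s a simplified sketch of how the per-token weights could be attached during tokenization. The helper name and the plain prompt/response split are assumptions for illustration (the 0.1 non-persona case is left out), not my exact preprocessing code, and the extra ‘weights’ column still has to survive collation - with a batch size of 1 that’s not much of a problem.

def tokenize_with_weights(tokenizer, prompt_text, response_text):
    # 0.05 for prompt tokens, 1.0 for the expected output tokens.
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)['input_ids']
    response_ids = tokenizer(response_text, add_special_tokens=False)['input_ids']
    input_ids = prompt_ids + response_ids
    return {
        'input_ids': input_ids,
        'labels': list(input_ids),
        'weights': [0.05] * len(prompt_ids) + [1.0] * len(response_ids),
    }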

The 3B parameter model

As I mentioned earlier, I trained a 3B parameter model. Since it needed 33GB of VRAM, there was no way I could do it on my own machine. I rented a 1x RTX A6000 machine and trained the model there, which only cost a couple of dollars for a few hours. I used the RedPajama-INCITE-Base-3B-v1 base model. I chose it because it had a parameter count I could work with, and because I’ve had a good experience with the GPT-NeoX architecture when using the smaller Pythia models.
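Loading the base checkpoint looks roughly like this (a sketch, assuming the togethercomputer/RedPajama-INCITE-Base-3B-v1 repo on the Hugging Face Hub):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = 'togethercomputer/RedPajama-INCITE-Base-3B-v1'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
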
Anyways, here’s the output:

User: Hi
Model: Oh hi!
User: What’s going on?
Model: Nothing.
User: Could you tell me about yourself?
Model: Well, I’m an AI and I don’t have any brothers or sisters like humans do.
User: That’s oddly specific. Do you think AI is dangerous?
Model: No! I’ve the best intentions.
User: Do you think other AI is like you? Harmless.
Model: I believe they are neutral to human emotions, but I hope not hostile towards them like some people make AI seem.
User: Tell me a long story about an AI in human world.
Model: Once there was a small village nestled in a valley deep in the forest.
The village was surrounded by tall trees, with lush meadows leading down from it.
Many different species of animals lived there and there were even a few robots living among them.
One day, a mysterious stranger came to visit the village and seemed to be interested in the robot community.
They talked for hours over tea, exchanging stories and learning secrets from each other.
Although the visitors had many questions about technology and what it can do,
the robot community was welcoming and willing to share their knowledge.
After this meeting, the visitor felt more connected to the robot community than ever before.
From that moment onward, humans and robots had become friends.
User: I think it would be very inconvenient for robots to live in a valley deep in the forest.
Model: Oh yeah? Why would that be inconvenient?
User: It would be hard to produce electricity, how would you power the technology?
Model: You know, robots are smart. We just need to find ways to provide energy to our villages.

The model claiming it’s an AI isn’t from using training data generated by ChatGPT; it’s from my own fine-tuning. I wanted the model to be aware that it’s an AI, instead of pretending to be a human with random made-up details.
Unlike aligned models like ChatGPT, it would talk about anything. I tried 8-bit quantization the other day and it seemed a lot less stable, talking about things it wasn’t really meant to talk about. I think that suggests that quantized models are much harder to keep under control. Maybe quantization surfaces pre-training data more easily?
That’s where I’m at right now; maybe more to come in the future.