## GPT-2


### LM

A language model is a model that learns to predict the probability of a sequence of words. In simpler terms, a language model predicts the next word given some text. By training a language model on a specific text, it is possible to make the model learn the writing style of that text.

### Perplexity Intuition (and its derivation)

https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3

In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models: $$PP(W) = P(w_1w_2...w_N)^{-\frac{1}{N}}$$ The perplexity of a discrete probability distribution is $$2^{H(p)} = 2^{-\sum_x p(x)\log_2 p(x)}$$ where $H(p) = -\sum_x p(x)\log_2 p(x)$ is the entropy of the distribution $p(x)$ and $x$ is a random variable over all possible events.

**Perplexity is just an exponentiation of the entropy!**

• Entropy is the average number of bits to encode the information contained in a random variable, so the exponentiation of the entropy should be the total amount of all possible information, or more precisely, the weighted average number of choices a random variable has.
• For example, if the average sentence in the test set could be coded in 100 bits, the model perplexity is 2¹⁰⁰ per sentence
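The entropy-to-perplexity relationship can be checked numerically on a toy distribution:

```python
import math

# Toy distribution: a "language" whose next word is one of four choices.
p = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

# Entropy in bits: H(p) = -sum p(x) log2 p(x)
entropy = -sum(px * math.log2(px) for px in p.values())

# Perplexity is the exponentiation of the entropy.
perplexity = 2 ** entropy

print(entropy)     # 1.75
print(perplexity)  # about 3.36, the weighted average number of choices
```

So although there are four possible words, the skewed probabilities make the effective branching factor about 3.36, not 4.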

Definition: where

• p: the probability distribution we want to model. The training sample is drawn from p, whose true distribution is unknown.
• q: a proposed probability model — our prediction.

We can evaluate the prediction q by testing it against samples drawn from p; this is basically calculating the cross-entropy. In the derivation above, we assumed all words have the same probability (1 / number of words) in p.
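As a small sketch of that evaluation (the samples and model probabilities are illustrative):

```python
import math

# Test samples, assumed drawn from the unknown distribution p.
samples = ["the", "the", "cat", "sat"]

# q: our proposed model's probability for each word.
q = {"the": 0.4, "cat": 0.3, "sat": 0.2, "mat": 0.1}

# Cross-entropy of q against the samples: -(1/N) * sum log2 q(x_i)
cross_entropy = -sum(math.log2(q[w]) for w in samples) / len(samples)

# Perplexity is the exponentiation of the cross-entropy.
perplexity = 2 ** cross_entropy
print(round(perplexity, 2))
```

The better q matches the empirical distribution of the samples, the lower this cross-entropy and hence the lower the perplexity.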

Takeaway:

• Less entropy (a less disordered system) is favorable over more entropy, because predictable results are preferred over randomness. This is why people say low perplexity is good and high perplexity is bad: perplexity is the exponentiation of the entropy.
• A language model is a probability distribution over sentences, and the best language model is the one that best predicts an unseen test set.
• Why do we use perplexity instead of entropy? If we think of perplexity as a branching factor (the weighted average number of choices a random variable has), that number is easier to interpret than the entropy.

### code

In this blog, we will leverage the awesome HuggingFace transformers repository to train our own GPT-2 model on text from the Harry Potter books. We will provide a sentence prompt to the model, and the model will complete the text. To train the model, we will feed it all the Harry Potter books to learn from. We have cloned the HuggingFace repo and updated the code to correctly perform language model training and inference. Please follow along in my Github repo.

The first step is downloading all the Harry Potter books and preprocessing the text. We scraped the text from the first 4 books and merged it together. Then we wrote a short piece of code to remove unnecessary text, such as page numbers, from the merged text. Finally, the GPT-2 model needs both training and validation text, so we take the first 90% of the data as the training sample and the remaining 10% as the validation sample. The preprocessing code is here.
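A minimal sketch of that preprocessing (the page-number pattern is an assumption about the scraped files; the actual code is in the linked repo):

```python
import re

def preprocess(text, train_frac=0.9):
    """Remove page-number lines and split the text 90/10 into train/validation."""
    # Strip lines like "Page | 12" left over from the scrape
    # (the exact pattern depends on the source files).
    text = re.sub(r"^\s*Page\s*\|?\s*\d+\s*$", "", text, flags=re.MULTILINE)
    split = int(len(text) * train_frac)
    return text[:split], text[split:]

train, val = preprocess("Harry looked up.\nPage | 12\nThe owl hooted.\n" * 10)
```

The two returned strings would then be written to `train_harry.txt` and `val_harry.txt` for the training script.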

#### Training a GPT-2 model

To train the model we use the script — run_lm_finetuning.py. The script takes as input the model type and its size, as well as the preprocessed text. The script also provides a bunch of hyperparameters that can be tweaked in order to customize the training process. The code snippet for training is:

```shell
cd examples  # move to the examples directory
python run_lm_finetuning.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-medium \
    --do_train \
    --train_data_file='input_data/train_harry.txt' \
    --do_eval \
    --eval_data_file='input_data/val_harry.txt' \
    --overwrite_output_dir \
    --block_size=200 \
    --per_gpu_train_batch_size=1 \
    --save_steps 5000 \
    --num_train_epochs=2
```

The parameters used here are explained as follows:

• output_dir is the name of the folder where the model weights are stored.
• model_type is the model architecture; since we are training on the GPT-2 architecture, we use 'gpt2'.
• model_name_or_path defines the model size to be used ('gpt2' for small, 'gpt2-medium' for a medium model, and 'gpt2-large' for a large model).
• do_train is a flag that tells the script to train the model.
• train_data_file specifies the training file name.
• do_eval is a flag that tells the script to evaluate the model; without it, no perplexity score is calculated.
• eval_data_file specifies the validation file name.
• gradient_accumulation_steps defines the number of update steps to accumulate before performing a backward/update pass.
• overwrite_output_dir, when specified, overwrites the output directory with the new weights.
• block_size truncates the training dataset into blocks of this size for training.
• per_gpu_train_batch_size is the batch size per GPU/CPU for training.
• save_steps lets you periodically save weights every this many steps, before the final set of weights.
• num_train_epochs determines how many epochs are run.

We trained a medium GPT-2 model on the text of 4 Harry Potter books. This model took only 10 min to train on a GTX 1080 Ti. The perplexity score of the trained model was 12.71. Read this blog to learn more about the perplexity score. But remember: the lower the score, the better the model.
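The reported perplexity is just the exponentiation of the average validation loss (natural base, since PyTorch's cross-entropy uses the natural log). For example:

```python
import math

# run_lm_finetuning.py reports the average cross-entropy loss on the
# validation set; the perplexity it prints is exp(loss).
eval_loss = 2.5425  # illustrative value: ln(12.71) ≈ 2.5425
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # 12.71
```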

#### Inference Script

Once the model is trained, we can run inference with it. The inference script is run_generation.py.

For inference, the input text is first encoded by the tokenizer; the result is then passed through a generate function, which generates text based on parameters like temperature, top-p, and top-k values.

The code snippet for doing inference is:

```shell
cd examples
python run_generation.py \
    --model_type gpt2 \
    --model_name_or_path output \
    --length 300 \
    --prompt "Malfoy hadn’t noticed anything."
```

These parameters are explained below:

• model_name_or_path: the folder path where the weights of the trained model are stored.
• prompt: the input prompt from which the rest of the text is generated.
• length: the number of tokens to be generated in the output.

Some additional parameters that can be tweaked are:

• temperature: decides how adventurous the model gets with its word selection; lower values make sampling closer to greedy decoding, higher values make it more random.
• p (top-p, nucleus sampling): controls how broad a range of continuations is considered. Set it high to consider all continuations, low to consider only the likely ones. The overall effect is similar to temperature, but more subtle.
• k (top-k sampling): restricts sampling to the k most likely next tokens at each step. Higher values allow more variety but also more unlikely choices.
• seed: sets the random seed, so generations are reproducible.
• repetition_penalty: penalizes the model for repeating words it has already chosen.

### Conclusion

The advent of transformers has truly revolutionized many Natural Language Processing tasks, and language generation is one of them. The potential of a language generation model is huge: it can be leveraged in many applications like chatbots, long answer generation, automated report writing, and more. In this blog, we looked at how transformers work, how they are used in language generation, and some examples of how anyone can leverage these architectures to train their own language model and generate text.

I am extremely passionate about NLP, transformers, and deep learning in general. I have my own deep learning consultancy and love to work on interesting problems. I have helped many startups deploy innovative AI-based solutions.

## GPT-2 Chinese

1. Model convergence depends on the word-embedding dimension: the larger the dimension, the faster and better the convergence.
2. The model (context) length does not affect whether training runs, but it strongly affects how well the model learns; make it as large as you can.
3. Set the parameters and the batch size to even numbers.

• ctx:512
• embed:800
• layer:10
• positions:512

```
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:231: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
```

The LR must not exceed 0.0004; above that value, the loss oscillates around some level and cannot keep decreasing. The best LR I used was 0.0003, which got the loss down to 0.2. Whether it can go lower, I did not try.

For pretraining, disable the dynamic learning rate (the scheduler); otherwise the loss rises instead of falling.
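A minimal sketch of that choice (the model and schedule here are stand-ins, not the actual GPT2-Chinese training code):

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # the best LR found above

pretraining = True
if not pretraining:
    # Fine-tuning: a dynamic schedule (e.g. linear warmup) is fine.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / 100))

for step in range(200):
    optimizer.step()
    if not pretraining:
        scheduler.step()

# With the scheduler disabled, the optimizer keeps the constant LR.
print(optimizer.param_groups[0]["lr"])  # 0.0003
```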

The data-loading code (`device`, `n_ctx`, and `batch_size` come from the surrounding training script; `get_text` assumes each tokenized shard contains whitespace-separated token ids):

```python
import torch
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
from prefetch_generator import BackgroundGenerator


class DataLoaderX(DataLoader):
    """DataLoader that prefetches batches in a background thread."""
    def __iter__(self):
        return BackgroundGenerator(super().__iter__())


class MyDataset(Dataset):
    def __init__(self, num):
        self.char_len = num
        self.char = []
        self.get_text()
        self.sector = 0

    def __getitem__(self, index):
        self.sector = (index + 1) * self.char_len
        # Take a window of char_len tokens centred on the sector boundary.
        data = Variable(torch.LongTensor(
            self.char[(self.sector - (self.char_len // 2)):
                      (self.sector + (self.char_len // 2))])).to(device)
        return data

    def get_text(self):
        # Read the pre-tokenized training shards (assumed to hold
        # whitespace-separated token ids).
        for e in range(100):
            with open('E:/jupyter/gptch/data/tokenized/tokenized_train_{}.txt'.format(e), 'r') as f:
                self.char.extend(int(t) for t in f.read().split())

    def __len__(self):
        return (len(self.char) - (self.char_len // 2)) // self.char_len


train_loader = DataLoaderX(dataset=MyDataset(n_ctx), shuffle=True,
                           pin_memory=False, batch_size=batch_size)
```