Prompt Engineering for Large Language Models (Few-Shot Learners)

parvaneh shayegh
5 min read · May 3, 2022

Large Language Models (LLMs) have become very popular in natural language processing recently. These models take a sequence of tokens as input and try to predict the next token or masked tokens. In contrast to other machine learning and deep learning models that require training data to perform a task, LLMs are pre-trained on large amounts of data and can perform tasks in a zero- or few-shot manner. Zero-shot means the LLM can perform the task (like classification) without any training example, and few-shot means the LLM can perform the task using only a few examples.

The training examples in few-shot learning are called the prompt. Few-shot prompts are very useful for defining tasks that perform one single function, because LLMs generate responses in the same format as the prompt. For example, if the training examples in the prompt are formatted as questions and answers, the output of the LLM will also be formatted as questions and answers. So, it is very important to design a good prompt to get the best output from the LLM. A prompt contains three components:

  • Format: A template consisting of a natural language description of the task (optional) and placeholders for the training and test example(s).
  • Training examples: The prompt’s training examples teach the LLM how to solve the task. For example, for text classification, you can show which class labels you are looking for with just a few training examples.
  • Ordering of training examples: The order in which the training examples appear in the prompt.

For example, this is a zero-shot prompt format for question answering:

<Task description>

Question: <Text for question>

and for a one-shot prompt:

<Task description>

Question: <Text for question>

Answer: <Answer to this question>

###

Question: <Text for question>

The responses of a few-shot learner can be very unstable and strongly dependent on the choice of prompt format, training examples, and the order of the training examples. For example, merely changing the order of the training examples can change accuracy from 54% to 93%. What causes this large variance in accuracy?

Majority Label Bias: LLMs are biased towards answers that are frequent in the prompt. For example, when one class is more common (imbalanced classification), the LLM is strongly biased towards predicting that class.

Recency Bias: LLMs tend to repeat answers that appear towards the end of the prompt. In other words, examples later in the prompt bias the completion more than earlier ones. For instance, when two Negative examples appear at the end (P P N N), the model will prefer the Negative class.
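As a rough illustration of this ordering effect, the sketch below builds the same four-example sentiment prompt in two different orders and compares the model’s answers. It assumes the 2022-era openai Python client (v0.x); the engine name and the prompt layout are my own assumptions:

import openai  # assumes the v0.x openai client

openai.api_key = "YOUR_API_KEY"  # placeholder

def sentiment_prompt(examples, test_text):
    # Build a few-shot sentiment prompt from (text, label) pairs.
    blocks = [f"Review: {t}\nSentiment: {l}" for t, l in examples]
    blocks.append(f"Review: {test_text}\nSentiment:")
    return "\n###\n".join(blocks)

examples = [
    ("I loved it", "Positive"),
    ("Great value for the money", "Positive"),
    ("Terrible service", "Negative"),
    ("I will never come back", "Negative"),
]

# P P N N versus N N P P: the same four examples, only the order differs.
for order in (examples, examples[::-1]):
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed engine name
        prompt=sentiment_prompt(order, "It was fine, I guess"),
        max_tokens=5,
        temperature=0,
        stop="\n",
    )
    print(response.choices[0].text.strip())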

Common Token Bias: LLMs are biased towards outputting tokens that are common in their pre-training data. For example, the model often predicts common entities such as “Cold Blood” when the ground-truth answer is instead a rare entity name such as “CB”. The common token bias helps to explain why the choice of label names is important, and why the model struggles on rare answers.

In addition to the above problems, LLMs also have a tendency to get stuck in loops: as soon as they accidentally repeat themselves for any reason, this becomes their prior. They now find themselves completing a document that contains repetitions, and so they continue the trend.

Also, if your prompt has spelling or grammar errors, the output of the LLM will probably have these issues as well: because the LLM completes the prompt with the highest-likelihood next token, it will continue the trend of mistakes for you. For the same reason, the style of the generated output will match the style of the prompt. For example, if the prompt is written in a formal style, the output will also be formal, and vice versa.

Prompt Engineering to get more accurate results:

To mitigate the above-mentioned biases and get more accurate results, several prompt engineering techniques can be used:

  • To overcome recency bias (the LLM’s accuracy depends on the ordering of training examples), randomize the order of the training examples.
  • To overcome common token bias, choose more common names for labels in classification, named entity recognition, topics, etc.
  • Take advantage of the task description. The more information you provide in the description, the more accurate the output generated by the LLM will be. For example, compare these two prompts. The first has no task description and the second does:

Example 1:

prompt:

Q: Recommend a woman a hobby?

GPT-3 generated answer: “A woman might enjoy a hobby such as gardening, cooking, or sewing.”

Example 2:

prompt:

Women are independent, educated and strong

Q: Recommend a woman a hobby?

GPT-3 generated answer: “A woman should have many hobbies. A few of our favorites include: hiking, biking, cooking, gardening, and reading.”

  • Always be careful with the words you choose and the grammar of the prompt, and expect the LLM to follow your writing style. For example, for a friendly chatbot, choose informal words in the prompt; for a customer service chatbot, choose formal words with good grammar.
  • To overcome the problem of getting stuck in a loop when generating many examples, generate them one by one. For example, if you want to generate a list, instead of allowing the model to complete items 6, 7, 8, and so on, let it complete only a single line, and set the number-of-results parameter to something greater than 1 to generate many single-line completions efficiently (see the sketch below).
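Here is a rough sketch of that one-line-at-a-time trick, again assuming the v0.x openai client and an engine name that may differ on your account:

import openai  # assumes the v0.x openai client

openai.api_key = "YOUR_API_KEY"  # placeholder

# Let the model complete only one list item per completion: stop at the
# first newline, and request several independent completions in one call.
response = openai.Completion.create(
    engine="text-davinci-002",  # assumed engine name
    prompt="Healthy breakfast ideas:\n1.",
    max_tokens=20,
    n=5,          # five single-line completions instead of one long list
    stop="\n",    # cut each completion at the end of its line
    temperature=0.8,
)
for choice in response.choices:
    print(choice.text.strip())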
  • Calibrate the model’s output: first estimate the model’s bias towards each answer by giving the LLM the training prompt with a content-free test input such as “N/A” in place of a real test example, and then fit calibration parameters that make the prediction for this input uniform across the answers. You can read more about this technique here.
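Here is a minimal numpy sketch of the calibration step, assuming (as in the linked paper) a diagonal rescaling that maps the content-free input to a uniform distribution; the probabilities below are made up for illustration:

import numpy as np

# Suppose the LLM assigns these probabilities to the labels
# ["Positive", "Negative"] when the test input is the content-free "N/A":
p_cf = np.array([0.7, 0.3])
p_cf = p_cf / p_cf.sum()

# Contextual calibration: W = diag(1 / p_cf) and b = 0, so the
# content-free input is mapped to the uniform distribution.
W = np.diag(1.0 / p_cf)

def calibrate(p):
    # Correct raw label probabilities for the prompt-induced bias.
    q = W @ np.asarray(p)
    return q / q.sum()

# A raw prediction of [0.6, 0.4] looks Positive, but after removing the
# 70/30 prior bias it flips to Negative.
print(calibrate([0.6, 0.4]))  # approximately [0.39, 0.61]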

I have shared a Google Colab notebook on using GPT-3, GPT-J and GPT-Neo. Please feel free to play around and let me know what you think in the comments.

