Sander Schulhoff1∗ Michael Ilie1∗ Nishant Balepur1 Konstantine Kahadze1
Amanda Liu1 Chenglei Si3 Yinheng Li4 Aayush Gupta1 HyoJung Han1 Sevien Schulhoff1
Pranav Sandeep Dulepet1 Saurav Vidyadhara1 Dayeon Ki1 Sweta Agrawal11 Chau Pham12
Gerson Kroiz Feileen Li1 Hudson Tao1 Ashay Srivastava1 Hevander Da Costa1 Saloni Gupta1
Megan L. Rogers7 Inna Goncearenco8 Giuseppe Sarli8,9 Igor Galynker10
Denis Peskoff6 Marine Carpuat1 Jules White5 Shyamal Anadkat2 Alexander Hoyle1 Philip Resnik1
1 University of Maryland 2 OpenAI 3 Stanford 4 Microsoft 5 Vanderbilt 6 Princeton
7 Texas State University 8 Icahn School of Medicine 9 ASST Brianza
10 Mount Sinai Beth Israel 11 Instituto de Telecomunicações 12 University of Massachusetts Amherst
sschulho@umd.edu milie@umd.edu resnik@umd.edu
While prompting is a widespread and highly researched concept, conflicting terminology and a poor ontological understanding of what constitutes a prompt persist due to the area's nascency.
Transformer-based LLMs are widely deployed in consumer-facing, internal, and research settings (Bommasani et al., 2021).
Knowing how to effectively structure, evaluate,
and perform other tasks with prompts is essential
to using these models. Empirically, better prompts
lead to improved results across a wide range of
tasks (Wei et al., 2022; Liu et al., 2023b; Schulhoff, 2022). A large body of literature has grown
around the use of prompting to improve results
and the number of prompting techniques is rapidly
increasing.
However, as prompting is an emerging field, the
use of prompts continues to be poorly understood,
with only a fraction of existing terminologies and
techniques being well known among practitioners.
Additionally, we refine our focus to hard (discrete) prompts rather than soft (continuous) prompts and leave out papers that make use of gradient-based updates (i.e., fine-tuning).
Many multilingual and multimodal prompting techniques are
direct extensions of English text-only prompting
techniques.
As prompting techniques grow more complex,
they have begun to incorporate external tools, such
as Internet browsing and calculators. We use the
term "agents" to describe these types of prompting techniques (Section 4.1).
Finally, we apply prompting techniques in two
case studies (Section 6.1). In the first, we test a
range of prompting techniques against the commonly used benchmark MMLU (Hendrycks et al.,
2021). In the second, we explore in detail an example of manual prompt engineering on a significant,
real-world use case: identifying signals of frantic hopelessness, a top indicator of suicidal crisis, in the text of individuals seeking support (Schuck et al., 2019a). We conclude with a discussion of the nature of prompting and its recent development (Section 8).
1.1 What is a Prompt?
A prompt is an input to a Generative AI model that is used to guide its output (Meskó, 2023; White et al., 2023; Heston and Khun, 2023; Hadi et al., 2023; Brown et al., 2020). Prompts may consist of text, images, sound, or other media.
Prompt Template Prompts are often constructed via a prompt template (Shin et al., 2020b). A prompt template is a function that contains one or more variables which will be replaced by some media (usually text) to create a prompt. This prompt can then be considered to be an instance of the template.
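To make this concrete, here is a minimal Python sketch (ours, not from the cited work) of a single-variable prompt template and one instance of it; the template wording is purely illustrative.

```python
# A prompt template: a function of one variable whose filled-in result is a prompt instance.
TEMPLATE = ("Classify the sentiment of the following review as positive or negative.\n"
            "Review: {review}\nSentiment:")

def instantiate(review: str) -> str:
    """Return a prompt instance of TEMPLATE for one input."""
    return TEMPLATE.format(review=review)

print(instantiate("The battery died after two days."))
```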
Directive Many prompts issue a directive in the
form of an instruction or question.1 This is the
core intent of the prompt, sometimes simply called
the "intent". For example, here is an example of a
prompt with a single instruction:
Tell me five good books to read.(5)
Directives can also be implicit(5)
Examples Examples, also known as exemplars or
shots, act as demonstrations that guide the GenAI
to accomplish a task. A prompt that includes a single such example is a One-Shot (i.e. one-example) prompt.
Output Formatting It is often desirable for the
GenAI to output information in certain formats, for
example, CSVs or markdown formats (Xia et al.,
2024).
Style Instructions Style instructions are a type
of output formatting used to modify the output
stylistically rather than structurally (Section 2.2.2).
Additional Information It is often necessary to
include additional information in the prompt. For
example, if the directive is to write an email, you
might include information such as your name and
position so the GenAI can properly sign the email.
Additional Information is sometimes called "context".
Prompting Prompting is the process of providing a prompt to a GenAI.
Prompt Chain A prompt chain (activity: prompt
chaining) consists of two or more prompt templates
used in succession. The output of the prompt generated from the first template is used to parameterize the second template, continuing until all templates are exhausted (Wu et al., 2022).
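Below is a minimal sketch (ours) of a two-template prompt chain; the llm callable is a hypothetical stand-in for any text-generation model or API, and the template wording is illustrative.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in for any text-generation model or API

SUMMARIZE = "Summarize the following article in one sentence:\n{article}"
HEADLINE = "Write a headline for this summary:\n{summary}"

def prompt_chain(article: str, llm: LLM) -> str:
    """Two templates in succession: the first output parameterizes the second template."""
    summary = llm(SUMMARIZE.format(article=article))
    return llm(HEADLINE.format(summary=summary))
```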
Prompting Technique A prompting technique
is a blueprint that describes how to structure a
prompt, prompts, or dynamic sequencing of multiple prompts. A prompting technique may incorporate conditional or branching logic, parallelism, or
other architectural considerations spanning multiple prompts.
Prompt Engineering Prompt engineering is the
iterative process of developing a prompt by modifying or changing the prompting technique that you
are using (Figure 1.4).
Prompt Engineering Technique A prompt engineering technique is a strategy for iterating on a
prompt to improve it. In the literature, these are often automated techniques (Deng et al., 2022), but
in consumer settings, users often perform prompt
engineering manually.
Exemplar Exemplars are examples of a task being completed that are shown to a model in a
prompt (Brown et al., 2020).
The term Prompt Engineering appears to have come into existence more recently, from Radford et al. (2021) and then slightly later from Reynolds and McDonell (2021).
2.1 Systematic Review Process
In order to robustly collect a dataset of sources
for this paper, we ran a systematic literature review grounded in the PRISMA process (Page et al.,
2021) (Figure 2.1). We host this dataset on HuggingFace and present a datasheet (Gebru et al.,
2021) for the dataset in Appendix A.3. Our main
data sources were arXiv, Semantic Scholar, and
ACL. We queried these databases with a list of 44 keywords narrowly related to prompting and prompt engineering (Appendix A.4).
2.2.1 In-Context Learning (ICL)
ICL refers to the ability of GenAIs to learn skills and tasks by providing them with exemplars and/or relevant instructions within the prompt, without the need for weight updates/retraining (Brown et al., 2020; Radford et al., 2019b). These skills can be learned from exemplars (Figure 2.4) and/or instructions (Figure 2.5). Note that the word "learn" is misleading: ICL can simply be task specification; the skills are not necessarily new and may already have been included in the training data (Figure 2.6).
Few-Shot Prompting (Brown et al., 2020) is the
paradigm seen in Figure 2.4, where the GenAI
learns to complete a task with only a few examples (exemplars).
Few-Shot Learning (FSL) (Fei-Fei et al., 2006;
Wang et al., 2019) is often conflated with Few-Shot
Prompting (Brown et al., 2020).
Few-Shot Prompting Design Decisions
Exemplar Quantity
Exemplar Ordering
Exemplar Label Distribution As in traditional
supervised machine learning, the distribution of
exemplar labels in the prompt affects behavior. For
example, if 10 exemplars from one class and 2
exemplars of another class are included, this may
cause the model to be biased toward the first class.
Exemplar Label Quality Despite the general benefit of multiple exemplars, the necessity of strictly
valid demonstrations is unclear.
Exemplar Format The formatting of exemplars
also affects performance. One of the most common
formats is "Q: {input}, A: {label}", but the optimal format may vary across tasks.
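For illustration, the following sketch (ours) assembles a Few-Shot prompt in the common "Q: {input}, A: {label}" format; the exemplars are invented.

```python
# Assemble a Few-Shot prompt in the common "Q: {input} A: {label}" exemplar format.
EXEMPLARS = [
    ("I loved this film.", "positive"),
    ("The plot made no sense.", "negative"),
]

def few_shot_prompt(query: str) -> str:
    shots = "\n".join(f"Q: {x}\nA: {y}" for x, y in EXEMPLARS)
    return f"{shots}\nQ: {query}\nA:"

print(few_shot_prompt("Great acting, terrible script."))
```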
Exemplar Similarity
Self-Generated In-Context Learning (SG-ICL) (Kim et al., 2022) leverages a GenAI to automatically generate exemplars. While better than zero-shot scenarios when training data is unavailable, the generated exemplars are not as effective as actual data.
Prompt Mining (Jiang et al., 2020) is the process
of discovering optimal "middle words" in prompts
(effectively prompt templates) through large corpus
analysis.
2.2.2 Zero-Shot
In contrast to Few-Shot Prompting, Zero-Shot
Prompting uses zero exemplars.
> The difference between zero-shot and few-shot learning is as follows:
Zero-shot learning is when the model solves a task without any examples, based only on the task description in the prompt. The model must generalize its knowledge and apply it to a new task it has never seen.
Few-shot learning is when the model is given a small number of examples (e.g., 1-10) of the new task. The model uses these few examples to quickly adapt and solve the new task.
The main distinction is that zero-shot learning requires no examples, while few-shot learning uses a small number of examples for rapid adaptation to the new task. The zero-shot setting is harder for the model, but the few-shot setting lets it learn faster from limited data. (perplexity)
Role Prompting
Style Prompting
Emotion Prompting
System 2 Attention (S2A) (Weston and
Sukhbaatar, 2023) first asks an LLM to rewrite
the prompt and remove any information unrelated
to the question therein.
SimToM (Wilf et al., 2023) deals with complicated questions which involve multiple people or
objects. Given the question, it attempts to establish
the set of facts one person knows, then answer the
question based only on those facts. This is a two
prompt process and can help eliminate the effect of
irrelevant information in the prompt. (12)
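A rough sketch of this two-prompt idea is shown below (ours, not the authors' implementation); the llm callable and the prompt wording are assumptions.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-generation call

def simtom(scenario_question: str, person: str, llm: LLM) -> str:
    """Prompt 1 isolates what `person` knows; prompt 2 answers using only those facts."""
    facts = llm(
        f"From the following scenario, list only the facts that {person} knows:\n{scenario_question}"
    )
    return llm(
        f"Facts known to {person}:\n{facts}\n"
        f"Answer the question using only these facts:\n{scenario_question}"
    )
```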
Rephrase and Respond (RaR) (Deng et al., 2023)
instructs the LLM to rephrase and expand the question before generating the final answer.
Re-reading (RE2) (Xu et al., 2023) adds the
phrase "Read the question again:" to the prompt in
addition to repeating the question. Although it is a simple technique, it has shown improvement on reasoning benchmarks, especially with complex questions.
Self-Ask (Press et al., 2022) prompts LLMs to first decide if they need to ask follow-up questions for a given prompt. If so, the LLM generates these questions, then answers them, and finally answers the original question.
> According to the study, the standard approach to using LLMs, in which the user simply asks a question or gives an instruction, may be insufficient. Instead, the authors propose the following multi-step process:
Deciding whether follow-up questions are needed: first, the LLM must decide whether additional follow-up questions are needed to better address the original query.
Generating follow-up questions: if the LLM decides clarification is needed, it generates the corresponding questions.
Answering the follow-up questions: the LLM answers the questions it generated, giving itself additional information.
Answering the original query: with this added context, the LLM can then give a final answer to the original query.
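The following sketch (ours, not the authors' code) illustrates the Self-Ask control flow; the llm callable and prompt wording are assumptions, and the decomposition of follow-ups is simplified to one question per line.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-generation call

def self_ask(question: str, llm: LLM) -> str:
    """Decide whether follow-ups are needed, answer them, then answer the original question."""
    need = llm(f"Are follow-up questions needed to answer: '{question}'? Reply yes or no.")
    context = ""
    if need.strip().lower().startswith("yes"):
        followups = llm(f"List the follow-up questions needed to answer: '{question}'")
        for q in filter(None, (line.strip() for line in followups.splitlines())):
            context += f"Follow up: {q}\nIntermediate answer: {llm(q)}\n"
    return llm(f"{context}Question: {question}\nFinal answer:")
```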
Chain-of-Thought (CoT) Prompting (Wei et al.,
2022) leverages few-shot prompting to encourage the LLM to express its thought process before
delivering its final answer.5 This technique is occasionally referred to as Chain-of-Thoughts (Tutunov
et al., 2023; Besta et al., 2024; Chen et al., 2023d).
It has been demonstrated to significantly enhance
the LLM’s performance in mathematics and reasoning tasks. In Wei et al. (2022), the prompt includes
an exemplar featuring a question, a reasoning path,
and the correct answer (Figure 2.8).
Here is Chain-of-Thought (CoT) prompting explained in simple terms:
Imagine you give a large language model (LLM) a task, for example a math problem. Normally the LLM would simply produce an answer without explaining how it got there.
With CoT prompting it works differently: you give the LLM an example problem that shows not only the answer but the entire step-by-step reasoning process that led to it.
Having seen such an example, the LLM learns to generate a similar "chain of reasoning" itself when solving new problems. It does not just output an answer; it explains how it arrived at it.
This helps the LLM understand the task better and find more accurate solutions, especially for difficult tasks requiring multi-step reasoning. The model effectively "thinks out loud" before giving an answer.
CoT prompting thus makes interaction with the LLM more transparent and substantially improves its performance on tasks that require logical reasoning.
2.2.3.1 Zero-Shot-CoT
The most straightforward version of CoT contains
zero exemplars. It involves appending a thought-inducing phrase like "Let's think step by step." (Kojima et al., 2022) to the prompt.
Step-Back Prompting (Zheng et al., 2023c) is a
modification of CoT where the LLM is first asked
a generic, high-level question about relevant concepts or facts before delving into reasoning. This
approach has significantly improved performance on multiple reasoning benchmarks for both PaLM-2L and GPT-4.
Analogical Prompting (Yasunaga et al., 2023)
is similar to SG-ICL, and automatically generates
exemplars that include CoTs. It has demonstrated
improvements in mathematical reasoning and code
generation tasks.
Thread-of-Thought (ThoT) Prompting (Zhou
et al., 2023) consists of an improved thought inducer for CoT reasoning. Instead of "Let’s think
step by step," it uses "Walk me through this context
in manageable parts step by step, summarizing and
analyzing as we go."
Tabular Chain-of-Thought (Tab-CoT) (Jin and
Lu, 2023) consists of a Zero-Shot CoT prompt that
makes the LLM output its reasoning as a markdown table.
2.2.3.2 Few-Shot CoT
Contrastive CoT Prompting (Chia et al., 2023) adds exemplars with both correct and incorrect explanations to the CoT prompt in order to show the LLM how not to reason. This method has shown significant improvement in areas like arithmetic reasoning and factual QA.
Active Prompting (Diao et al., 2023) starts with
some training questions/exemplars, asks the LLM
to solve them, then calculates uncertainty (disagreement in this case) and asks human annotators to
rewrite the exemplars with the highest uncertainty.
2.2.4 Decomposition
Least-to-Most Prompting (Zhou et al., 2022a) starts by prompting an LLM to break a given problem into sub-problems without solving them. Then, it solves them sequentially, appending model responses to the prompt each time, until it arrives at a final result.
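A minimal sketch (ours) of this control flow is shown below; the llm callable and prompt wording are assumed, and the decomposition step is simplified to one sub-problem per line.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-generation call

def least_to_most(problem: str, llm: LLM) -> str:
    """Ask for sub-problems first, then solve them sequentially, appending each answer to the prompt."""
    plan = llm(f"Break this problem into sub-problems, one per line, without solving them:\n{problem}")
    context = f"Problem: {problem}\n"
    for sub in (line.strip() for line in plan.splitlines() if line.strip()):
        answer = llm(f"{context}Sub-problem: {sub}\nAnswer:")
        context += f"Sub-problem: {sub}\nAnswer: {answer}\n"
    return llm(f"{context}Final answer to the original problem:")
```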
Decomposed Prompting (DECOMP) (Khot et al., 2022) Few-Shot prompts an LLM to show it how to use certain functions. These might include things like string splitting or internet searching.
Plan-and-Solve Prompting (Wang et al., 2023f)
consists of an improved Zero-Shot CoT prompt,
"Let’s first understand the problem and devise a
plan to solve it. Then, let’s carry out the plan and
solve the problem step by step".
Tree-of-Thought (ToT) (Yao et al., 2023b), also
known as Tree of Thoughts (Long, 2023), creates a tree-like search problem by starting with an initial problem and then generating multiple possible steps in the form of thoughts (as from a CoT).
Recursion-of-Thought (Lee and Kim, 2023) is
similar to regular CoT. However, every time it encounters a complicated problem in the middle of its reasoning chain, it sends this problem to another prompt/LLM call.
Program-of-Thoughts (Chen et al., 2023d) uses
LLMs like Codex to generate programming code
as reasoning steps.
2.2.5 Ensembling
In GenAI, ensembling is the process of using multiple prompts to solve the same problem, then aggregating these responses into a final output.
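For illustration (ours), a simple ensembling scheme runs several prompt variants and aggregates answers by majority vote; the llm callable is a hypothetical stand-in.

```python
from collections import Counter
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical text-generation call

def ensemble(prompts: List[str], llm: LLM) -> str:
    """Run several prompt variants for the same problem and take a majority vote."""
    answers = [llm(p).strip() for p in prompts]
    return Counter(answers).most_common(1)[0][0]
```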
Max Mutual Information Method (Sorensen
et al., 2022) creates multiple prompt templates with
varied styles and exemplars, then selects the optimal template as the one that maximizes mutual
information between the prompt and the LLM’s
outputs.
2.2.6 Self-Criticism
When creating GenAI systems, it can be useful to
have LLMs criticize their own outputs (Huang et al.,
2022).
Self-Calibration (Kadavath et al., 2022) first
prompts an LLM to answer a question. Then, it
builds a new prompt that includes the question, the
LLM’s answer, and an additional instruction asking
whether the answer is correct.
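A minimal sketch (ours) of this two-step pattern follows; the llm callable and prompt wording are assumptions.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-generation call

def self_calibrate(question: str, llm: LLM) -> str:
    """Answer first, then ask the model whether its own answer is correct."""
    answer = llm(f"Question: {question}\nAnswer:")
    return llm(
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is the proposed answer correct? Answer True or False."
    )
```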
Chain-of-Verification (COVE) (Dhuliawala et al., 2023) first uses an LLM to generate an answer to a given question. It then creates a list of related questions that would help verify the correctness of the answer. Each question is answered by the LLM, then all of the information is given to the LLM to produce the final, revised answer. This method has shown improvements in various question-answering and text-generation tasks.
2.4 Prompt Engineering
In addition to surveying prompting techniques, we also review prompt engineering techniques, which are used to automatically optimize prompts.
Meta Prompting is the process of prompting an
LLM to generate or improve a prompt or prompt
template (Reynolds and McDonell, 2021; Zhou
et al., 2022b; Ye et al., 2023).
AutoPrompt (Shin et al., 2020b) uses a frozen
LLM as well as a prompt template that includes
some "trigger tokens", whose values are updated
via backpropagation at training time. This is a
version of soft-prompting.
Automatic Prompt Engineer (APE) (Zhou et al.,
2022b) uses a set of exemplars to generate a Zero-Shot instruction prompt. It generates multiple possible prompts, scores them, then creates variations of the best ones (e.g., by using prompt paraphrasing). It iterates on this process until some desiderata are reached.
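The sketch below (ours, heavily simplified) captures this propose-score-paraphrase loop; the llm and score callables are hypothetical stand-ins (e.g., score might be dev-set accuracy of a candidate prompt).

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]      # hypothetical instruction-generation call
Score = Callable[[str], float]  # hypothetical scorer for a candidate prompt

def ape(exemplars: List[Tuple[str, str]], llm: LLM, score: Score,
        rounds: int = 3, k: int = 4) -> str:
    """Generate candidate instructions from exemplars, keep the best, and paraphrase it."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in exemplars)
    candidates = [llm(f"Write an instruction that maps these inputs to their outputs:\n{shots}")
                  for _ in range(k)]
    for _ in range(rounds):
        best = max(candidates, key=score)
        candidates = [best] + [llm(f"Paraphrase this instruction:\n{best}") for _ in range(k - 1)]
    return max(candidates, key=score)
```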
Gradientfree Instructional Prompt
2.5 Answer Engineering
Answer engineering is the iterative process of developing or selecting among algorithms that extract
precise answers from LLM outputs.
Answer Shape, Answer Space
Answer Extractor
In cases where it is impossible to entirely control
the answer space (e.g. consumer-facing LLMs), or
the expected answer may be located somewhere
within the model output, a rule can be defined to
extract the final answer. This rule is often a simple
function (e.g. a regular expression), but can also
use a separate LLM to extract the answer.
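For example, a simple rule-based extractor (ours, illustrative) can pull the final number out of a free-form chain of thought.

```python
import re
from typing import Optional

def extract_final_number(output: str) -> Optional[str]:
    """Return the last number mentioned in the output, a common heuristic
    for math benchmarks where the answer appears at the end."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", output)
    return matches[-1] if matches else None

print(extract_final_number("First we add 3 and 2 to get 5, so the answer is 5."))  # -> "5"
```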
Separate LLM Sometimes outputs are so complicated that regexes won’t work consistently. In
this case, it can be useful to have a separate LLM
evaluate the output and extract an answer.
3.1 Multilingual
State-of-the-art GenAIs have often been predominantly trained on English data, leading to a notable disparity in output quality in languages other than English, particularly low-resource languages (Bang et al., 2023; Jiao et al., 2023; Hendy et al., 2023; Shi et al., 2022). As a result, various multilingual prompting techniques have emerged in an attempt to improve model performance in non-English settings.
Translate First Prompting (Shi et al., 2022) is
perhaps the simplest strategy and first translates
non-English input examples into English.
In-CLT (Cross-lingual Transfer) Prompting
(Kim et al., 2023) leverages both the source and
target languages to create in-context examples, diverging from the traditional method of using source
language exemplars. This strategy helps stimulate
the cross-lingual cognitive capabilities of multilingual LLMs, thus boosting performance on cross-lingual tasks.
PARC (Prompts Augmented by Retrieval Cross-lingually) (Nie et al., 2023) introduces a framework that retrieves relevant exemplars from a high-resource language. This framework is specifically designed to enhance cross-lingual transfer performance, particularly for low-resource target languages. Li et al. (2023g) extend this work to Bangla.
3.1.4 Prompt Template Language Selection
In multilingual prompting, the selection of the language of the prompt template can markedly influence model performance.
Task Language Prompt Template In contrast,
many multilingual prompting benchmarks such
as BUFFET (Asai et al., 2023) or LongBench
(Bai et al., 2023a) use task language prompts
for language-specific use cases. Muennighoff
et al. (2023) specifically studies different translation methods when constructing native-language
prompts. They demonstrate that human-translated prompts are superior to their machine-translated counterparts.
Chain-of-Dictionary (CoD) (Lu et al., 2023b)
first extracts words from the source phrase, then
makes a list of their meanings in multiple languages, automatically via retrieval from a dictionary (e.g. English: ‘apple’, Spanish: ‘manzana’).
Then, these dictionary phrases are prepended to the prompt, which asks the GenAI to use them during
translation.
Dictionary-based Prompting for Machine Translation (DiPMT) (Ghazvininejad et al., 2023)
works similarly to CoD, but only gives definitions
in the source and target languages, and formats
them slightly differently.
3.1.5.1 Human-in-the-Loop
Interactive-Chain-Prompting (ICP) (Pilault
et al., 2023) deals with potential ambiguities in
translation by first asking the GenAI to generate
sub-questions about any ambiguities in the phrase
to be translated. Humans later respond to these
questions and the system includes this information
to generate a final translation.
Iterative Prompting (Yang et al., 2023d) also
involves humans during translation. First, they
prompt LLMs to create a draft translation. This
initial version is further refined by integrating supervision signals obtained from either automated
retrieval systems or direct human feedback.
Negative Prompting allows users to numerically
weight certain terms in the prompt so that the
model considers them more/less heavily than others. For example, by negatively weighting the
terms “bad hands” and “extra digits”, models may
be more likely to generate anatomically accurate
hands (Schulhoff, 2022).
Definition of Agent In the context of GenAI, we
define agents to be GenAI systems that serve a
user’s goals via actions that engage with systems
outside the GenAI itself.7
Modular Reasoning, Knowledge, and Language
(MRKL) System (Karpas et al., 2022) is one of
the simplest formulations of an agent.
Reasoning and Acting (ReAct) (Yao et al., 2022) generates a thought, takes an action, and
receives an observation (and repeats this process)
when given a problem to solve. All of this information is inserted into the prompt so it has a memory
of past thoughts, actions, and observations.
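A minimal sketch (ours, not the authors' implementation) of this thought-action-observation loop follows; the llm callable, the tool interface, and the output format are assumptions. A production agent would add stricter output parsing and stop conditions.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]   # hypothetical text-generation call
Tool = Callable[[str], str]  # e.g. a search or calculator function you supply

def react(question: str, llm: LLM, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    """Thought -> action -> observation loop; everything is appended back into the prompt."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Next step (either 'Action: <tool> <input>' or 'Finish: <answer>'):")
        transcript += step + "\n"
        if "Finish:" in step:
            return step.split("Finish:", 1)[1].strip()
        if "Action:" in step:
            name, _, arg = step.split("Action:", 1)[1].strip().partition(" ")
            observation = tools.get(name, lambda _: "unknown tool")(arg)
            transcript += f"Observation: {observation}\n"
    return transcript  # fall back to the raw transcript if no final answer emerged
```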
Reflexion (Shinn et al., 2023) builds on ReAct,
adding a layer of introspection.
4.1.3.1 Lifelong Learning Agents
Work on LLM-integrated Minecraft agents has generated impressive results, with agents able to acquire new skills as they navigate the world of this open-world videogame. We view these agents not merely as applications of agent techniques to Minecraft, but rather as novel agent frameworks that can be explored in real-world tasks that require lifelong learning.
Voyager (Wang et al., 2023a) is composed of
three parts. First, it proposes tasks for itself to
complete in order to learn more about the world.
Second, it generates code to execute these actions.
Finally, it saves these actions to be retrieved later
when useful, as part of a long-term memory system.
This system could be applied to real-world tasks
where an agent needs to explore and interact with
a tool or website (e.g. penetration testing, usability
testing).
Ghost in the Minecraft (GITM)
4.1.4 Retrieval Augmented Generation (RAG)
In the context of GenAI agents, RAG is a paradigm
in which information is retrieved from an external
source and inserted into the prompt. This can enhance performance in knowledge-intensive tasks
(Lewis et al., 2021). When retrieval itself is used
as an external tool, RAG systems are considered to
be agents.
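A minimal sketch (ours) of this retrieve-then-prompt pattern is shown below; the retrieve and llm callables are hypothetical stand-ins, and the prompt wording is illustrative.

```python
from typing import Callable, List

LLM = Callable[[str], str]              # hypothetical text-generation call
Retriever = Callable[[str], List[str]]  # hypothetical retriever over an external source

def rag(question: str, retrieve: Retriever, llm: LLM, k: int = 3) -> str:
    """Retrieve passages from an external source and insert them into the prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(question)[:k])
    return llm(
        f"Use the following passages to answer the question.\n{context}\n"
        f"Question: {question}\nAnswer:"
    )
```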
Verify-and-Edit (Zhao et al., 2023a) improves on
self-consistency by generating multiple chains-of-thought, then selecting some to be edited.
Demonstrate-Search-Predict (Khattab et al.,
2022) first decomposes a question into sub-questions, then uses queries to solve them and combines their responses into a final answer. It uses few-shot prompting to decompose the problem and combine responses.
Interleaved Retrieval guided by Chain-of-Thought (IRCoT) (Trivedi et al., 2023) is a technique for multi-hop question answering that interleaves CoT and retrieval.
Iterative Retrieval Augmentation techniques,
like Forward-Looking Active REtrieval augmented
generation (FLARE) (Jiang et al., 2023) and Imitate, Retrieve, Paraphrase (IRP) (Balepur et al.,
2023), perform retrieval multiple times during long-form generation.
In-Context Learning is frequently used in evaluation prompts, much in the same way it is used in
other applications (Dubois et al., 2023; Kocmi and
Federmann, 2023a).
Role-based Evaluation is a useful technique for
improving and diversifying evaluations (Wu et al.,
2023b; Chan et al., 2024). By creating prompts
with the same instructions for evaluation, but different roles, it is possible to effectively generate
diverse evaluations.
Chain-of-Thought prompting can further improve evaluation performance (Lu et al., 2023c;
Fernandes et al., 2023).
Model-Generated Guidelines (Liu et al.,
2023d,h) prompt an LLM to generate guidelines
for evaluation. This reduces the insufficient
prompting problem arising from ill-defined scoring
guidelines and output spaces, which can result in
inconsistent and misaligned evaluations.
4.2.2 Output Format
The output format of the LLM can significantly
affect evaluation performance (Gao et al., 2023c).
Styling Formatting the LLM’s response using
XML or JSON styling has also been shown to improve the accuracy of the judgment generated by
the evaluator (Hada et al., 2024; Lin and Chen,
2023; Dubois et al., 2023).
Linear Scale A very simple output format is a
linear scale (e.g. 1-5).
Binary Score
Likert Scale Prompting the GenAI to make use
of a Likert Scale (Bai et al., 2023b; Lin and Chen,
2023; Peskoff et al., 2023) can give it a better understanding of the meaning of the scale.
> Likert Scale Prompting is a technique that helps generative models (GenAI) better understand and use a Likert scale when answering questions.
A Likert scale is a widely used method for measuring opinions, attitudes, and perceptions. It asks respondents to choose an answer from several options, typically ranging from "strongly disagree" to "strongly agree".
4.2.3 Prompting Frameworks
LLM-EVAL (Lin and Chen, 2023) is one of the
simplest evaluation frameworks. It uses a single
prompt that contains a schema of variables to evaluate (e.g., grammar, relevance, etc.), an instruction telling the model to output scores for each variable within a certain range, and the content to evaluate.
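A rough sketch (ours, not the authors' implementation) of such a single-prompt evaluator follows; the llm callable, the JSON output format, and the "coherence" criterion are assumptions.

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-generation call

SCHEMA = ["grammar", "relevance", "coherence"]  # "coherence" is an illustrative addition

def llm_eval(content: str, llm: LLM) -> dict:
    """Single prompt: schema of variables, score range, and the content to evaluate."""
    prompt = (
        "Score the following response on each criterion from 1 to 5 and reply "
        f"with a JSON object whose keys are {SCHEMA}.\nResponse:\n{content}"
    )
    return json.loads(llm(prompt))
```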
Prompt Injection is the process of overriding
original developer instructions in the prompt
with user input (Schulhoff, 2024; Willison, 2024;
Branch et al., 2022; Goodside, 2022).
Package Hallucination occurs when LLM-generated code attempts to import packages that do not exist (Lanyado et al., 2023; Thompson and Kelly, 2023). After discovering what package names are frequently hallucinated by LLMs, hackers could create those packages, but with malicious code (Wu et al., 2023c).
Detectors are tools designed to detect malicious
inputs and prevent prompt hacking. Many companies have built such detectors (ArthurAI, 2024;
Preamble, 2024; Lakera, 2024), which are often
built using fine-tuned models trained on malicious prompts. Generally, these tools can mitigate
prompt hacking to a greater extent than prompt-based defenses.
5.2.1 Prompt Sensitivity
Prompt Wording can be altered by adding extra
spaces, changing capitalization, or modifying delimiters.
Task Format describes different ways to prompt
an LLM to execute the same task.
Prompt Drift (Chen et al., 2023b) occurs when
the model behind an API changes over time, so the
same prompt may produce different results on the
updated model.
Overconfidence and Calibration
Verbalized Score is a simple calibration technique that generates a confidence score (e.g. “How
confident are you from 1 to 10”), but its efficacy
is under debate.
Sycophancy refers to the concept that LLMs will
often express agreement with the user.
5.2.3 Biases, Stereotypes, and Culture
LLMs should be fair to all users, such that no biases, stereotypes, or cultural harms are perpetuated
in model outputs (Mehrabi et al., 2021). Some
prompting techniques have been designed in accordance with these goals.
Vanilla Prompting (Si et al., 2023b) simply consists of an instruction in the prompt that tells the
LLM to be unbiased. This technique has also been
referred to as moral self-correction (Ganguli et al.,
2023).
Selecting Balanced Demonstrations (Si et al.,
2023b)
Cultural Awareness (Yao et al., 2023a) can be
injected into prompts to help LLMs with cultural
adaptation (Peskov et al., 2021). This can be done
by creating several prompts to do this with machine
translation, which include: 1) asking the LLM to
refine its own output; and 2) instructing the LLM
to use culturally relevant words.
AttrPrompt (Yu et al., 2023) is a prompting
technique designed to avoid producing text biased
towards certain attributes when generating synthetic data.
5.2.4 Ambiguity
Questions that are ambiguous can be interpreted in
multiple ways, where each interpretation could result in a different answer (Min et al., 2020).
Ambiguous Demonstrations (Gao et al., 2023a) are examples that have an ambiguous label set. Including them in a prompt can increase ICL performance.
Question Clarification (Rao and Daumé III,
2019) allows the LLM to identify ambiguous questions and generate clarifying questions to pose to
the user.
6.2 Prompt Engineering Case Study
Prompt engineering is emerging as an art that many
people have begun to practice professionally, but
the literature does not yet include detailed guidance on the process. As a first step in this direction,
we present an annotated prompt engineering case
study for a difficult real-world problem.
First, prompt engineering is fundamentally
different from other ways of getting a computer to
behave the way you want it to: these systems are
being cajoled, not programmed, and, in addition
to being quite sensitive to the specific LLM being
used, they can be incredibly sensitive to specific
details in prompts without there being any obvious reason those details should matter. Second,
therefore, it is important to dig into the data (e.g.
generating potential explanations for LLM "reasoning" that leads to incorrect responses). Relatedly, the third and most important takeaway is that prompt engineering should involve engagement between the prompt engineer, who has expertise in how to coax LLMs to behave in desired ways, and domain experts, who understand what those desired ways are and why.
Santu and Feng (2023) provide a general taxonomy of prompts that can be used to design prompts with specific properties to perform a wide range of complex tasks. Bubeck et al. (2023) qualitatively experiment with a wide range of prompting methods on an early version of GPT-4 to understand its capabilities. Meskó (2023) and Wang et al. (2023d) offer recommended use cases and limitations of prompt engineering in the medical and healthcare domains. Hua et al. (2024) use a GPT-4-automated approach to review LLMs in the mental health space. Moreover, we base our work on PRISMA (Page et al., 2021), a widely accepted standard for systematic literature reviews.