What separates AI that’s merely helpful from AI you can rely on for your work? Painstaking, ongoing prompt engineering.
Takeaways
- Significance of Prompt Engineering: Crafting precise prompts is crucial to ensure AI models like GPT-4 produce correct and dependable results. Inadequate prompts can lead to suboptimal or incorrect outputs, which is particularly problematic in legal settings.
- Common AI Limitations: AI models may generate plausible-sounding but incorrect information, known as "hallucinations." This underscores the need for meticulous prompt design and validation of AI-generated content.
- Advanced Prompting Techniques:
  - Retrieval-Augmented Generation (RAG): Enhances AI responses by integrating external, reliable data sources, thereby grounding outputs in factual information.
  - Few-Shot Prompting: Provides the AI with a few examples within the prompt to guide it toward generating the desired response.
  - Chain-of-Thought Prompting: Encourages the AI to articulate intermediate reasoning steps, improving problem-solving for complex tasks.
  - Self-Reflection: Prompts the AI to review and refine its own outputs, enhancing accuracy and coherence.
  - Prompt Chaining: Involves breaking down a complex task into a series of simpler prompts, allowing the AI to address each step sequentially for better overall performance.
Plenty of legal professionals have shared their experiences using ChatGPT or the large language model (LLM) GPT-4 alone to do legal work. While these powerful tools have advantages, they also bring problems—such as insufficient data security and privacy—and using them on their own delivers results that are at best suboptimal and at worst flat-out wrong. Just ask the lawyer who was sanctioned because he used ChatGPT for legal research without checking its work.
Here’s just one example of a GPT-4 response to a typical legal query, or “prompt”:
QUERY: In California, is a state habeas case governed by the rules of discovery for a civil case or by the rules of discovery for a criminal case?
RESPONSE: A state habeas case in California is generally governed by the rules of discovery for a criminal case. However, it's important to note that habeas corpus proceedings are unique and have their own specific rules and procedures, which can sometimes differ from standard criminal proceedings. It's always best to consult with a legal professional or refer to the specific rules of the court to understand the exact procedures that apply.
While GPT-4’s output seems reasonable, it’s incorrect. This excerpt from Jimenez v. Superior Court, 40 Cal.App.5th 824 (Cal. Ct. App. 2019) explains:

Habeas corpus “is a special proceeding and not entirely analogous to either [civil or criminal proceedings].” (Id. at p. 815, fn. 7, 129 Cal.Rptr.2d 605, 61 P.3d 402.) Thus, the statutory provisions governing discovery in criminal trials do not apply to habeas corpus matters (id. at p. 813, 129 Cal.Rptr.2d 605, 61 P.3d 402), “although they may provide guidance in crafting discovery orders on habeas corpus” (Pearson, supra, 48 Cal.4th at p. 572, 107 Cal.Rptr.3d 265, 227 P.3d 858). Nor has our Supreme Court indicated that the civil discovery scheme governs discovery in habeas proceedings. Instead, the Court has tasked trial courts with “fashion[ing] a fair discovery rule.” (Scott, at p. 814, 129 Cal.Rptr.2d 605, 61 P.3d 402.)
So how can CoCounsel—which is powered by LLMs like GPT-4's successor, GPT-4o—give correct answers? The answer lies in part in the different prompting techniques we on the machine learning and prompt engineering teams employ “behind the scenes” to ensure precise, accurate, and complete output. This work means that when you enter an ordinary prompt like the one above, CoCounsel doesn't take your language at face value. Instead, it does more sophisticated work with what you've submitted, in a sense “translating” it into a more complex query, sending that query to the LLM to get the best possible results, and then delivering the answer back to you in CoCounsel.
Though the actual prompts CoCounsel uses when it “communicates” with LLMs are much more complicated than the examples below, these are five of the techniques we use, how they work, and why they make the difference for your output: retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought prompting, self-reflection, and prompt chaining.
Retrieval-augmented generation (RAG)
When a user directly asks GPT-4 or GPT-4o a question, the models answer using the giant trove of information (from the internet and other sources) used to train them. But while these models have memorized far more information than any human could, they still make mistakes and fabricate plausible-sounding information, a failure known as “hallucination.” By connecting a model to a database of reliable legal sources, we’re able to ground its output in real-world knowledge rather than leaving it to rely only on its own memory.
Let’s revisit the habeas query above. When a user enters the query as written into CoCounsel, the query with “instructions” that actually makes its way to GPT-4o in the background looks something like the example below: our prompt engineering first retrieves relevant case law from our legal database, then tells the LLM to answer based only on that retrieved information. Grounding GPT-4o’s answers in real case law results in an answer that’s not only accurate, but also more in-depth:
QUERY: Consider the following legal research query: In California, is a state habeas case governed by the rules of discovery for a civil case or by the rules of discovery for a criminal case?
Draft an answer using only the information provided in the following snippets:
<Snippet from People v. Superior Court (Pearson)>
<Snippet from Hurd v. Superior Court>
…
<More relevant snippets>
RESPONSE (generated by CoCounsel's Legal Research Memo skill): There is no definitive answer to whether a state habeas case in California is governed by the rules of discovery for a civil case or a criminal case. However, several cases suggest that habeas corpus proceedings are not entirely analogous to either type of case, and that discovery in such proceedings is generally resolved on a case-by-case basis.
Several cases discuss the applicability of various discovery rules to habeas corpus proceedings in California. People v. Superior Court (Pearson) is particularly relevant, as it discusses the applicability of Proposition 115's discovery provisions to habeas corpus cases. The court ultimately concludes that the initiative does not prohibit post-conviction discovery.
Hurd v. Superior Court and Satele v. Superior Court both discuss the applicability of California Penal Code section 1054.9 to habeas corpus cases. Hurd holds that section 1054.9 authorizes a pre-habeas corpus motion for discovery of peace officer personnel records, while Satele clarifies that the good cause requirement does not apply to evidence held by the court.
Jimenez v. Superior Court and Board of Prison Terms v. Superior Court both discuss the procedures governing habeas corpus discovery. Jimenez notes that habeas corpus proceedings are not entirely analogous to either civil or criminal proceedings, while Board of Prison Terms states that the court has the power to order discovery when requested by a party.
Briggs v. Brown discusses whether discovery in a habeas corpus matter falls within the scope of a "criminal case" under a particular statute, and concludes that it does not because a habeas corpus matter is separate from the criminal case itself.
People v. Calhoun and Jackson v. Superior Court both discuss the applicability of civil rules to different types of proceedings. Calhoun holds that a commitment proceeding pursuant to the Sexually Violent Predator Act is a special proceeding of a civil nature, while Jackson notes that habeas corpus proceedings are not entirely analogous to either civil or criminal proceedings.
Finally, Rubio v. Superior Court discusses California's reciprocal pretrial discovery statutes, which may be relevant to the research request.
Taken together, these cases suggest that there is no definitive answer to whether a state habeas case in California is governed by the rules of discovery for a civil case or a criminal case. However, they do suggest that habeas corpus proceedings are not entirely analogous to either type of case, and that discovery in such proceedings is generally resolved on a case-by-case basis.
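To make the retrieve-then-answer pattern concrete, here is a minimal sketch of how a RAG prompt can be assembled in Python. It is not CoCounsel's implementation: `search_legal_database` is a hypothetical stand-in for a real retrieval system, and `call_llm` is a thin wrapper assuming the OpenAI Python SDK's chat-completions interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str) -> str:
    """Send a single user prompt to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def search_legal_database(query: str, top_k: int = 5) -> list[str]:
    """Hypothetical retrieval step: stands in for a real search system that
    returns the most relevant case-law snippets for the query."""
    return [
        "<Snippet from People v. Superior Court (Pearson)>",
        "<Snippet from Hurd v. Superior Court>",
    ][:top_k]


def answer_with_rag(question: str) -> str:
    # 1. Retrieve grounding material from a reliable source.
    snippets = search_legal_database(question)

    # 2. Build a prompt that restricts the model to the retrieved snippets.
    context = "\n\n".join(snippets)
    prompt = (
        f"Consider the following legal research query: {question}\n\n"
        "Draft an answer using only the information provided in the "
        f"following snippets:\n\n{context}"
    )

    # 3. The model answers grounded in real sources, not just its memory.
    return call_llm(prompt)
```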
Few-shot prompting
Though GPT-4 and GPT-4o can handle a wide variety of tasks, these models might not know what to do with specialized tasks they weren't trained on. In these cases, including examples of how to do the task in the prompt itself—called few-shot prompting—can dramatically improve performance. Here’s an example involving the interpretation of contract clauses:
QUERY: Is a party restricted from contesting the validity of the counterparty’s ownership of intellectual property or otherwise bringing a claim against the counterparty for matters unrelated to the contract, according to this clause?
Clause: RemainCo hereby covenants not to sue SpinCo under any Licensed RemainCo Know-How and under RemainCo's right, title and interest in and to Shared Library Materials and Foundational Software, including, without limitation, all applicable Patents, Copyrights, and Know-How, for infringement or misappropriation based upon any action that occurs in connection with the continued operation of the SpinCo Business and any future extensions of the SpinCo Business in any field other than the RemainCo Core Field after the Distribution Date.
Respond yes or no.
RESPONSE: No.
This answer is incorrect, as the clause is clearly a covenant by RemainCo not to sue SpinCo. This is precisely the kind of result we saw when we first began working with GPT-4 to build CoCounsel back in 2022, until we started engineering our prompts for optimal results.
Now let’s try again with a simplified example of few-shot prompting. This time we include similar clauses before the clause we have a question about, pairing each with the correct yes-or-no response to our initial query, effectively giving GPT-4 examples to guide its “thinking”:
QUERY: Is a party restricted from contesting the validity of the counterparty’s ownership of intellectual property or otherwise bringing a claim against the counterparty for matters unrelated to the contract?
Clause: Company agrees that it will not at any time contest the ownership or validity of any Reed's Intellectual Property or Deliverables, nor register or attempt to register any rights with respect to Reed's Intellectual Property, nor do anything that would jeopardize or diminish Reed's rights to or the value of Reed's Intellectual Property or Deliverables.
Response: Yes.
Clause: In addition, VerticalNet shall not now or in the future contest the validity of LeadersOnline's ownership of its Intellectual Property; provided, however, that VerticalNet may contest the validity of LeadersOnline's Intellectual Property in any proceeding brought against VerticalNet alleging infringement or misappropriation of LeadersOnline's Intellectual Property.
Response: Yes.
Clause: At no time during or after the Term of this Agreement shall a party challenge or assist others to challenge the other party's Intellectual Property or the registration thereof or attempt to register any trademarks, marks or trade names confusingly similar to those or the other party.
Response: Yes.
Clause: Xencor shall have the right, upon [***] ([***]) days' prior written notice to Aimmune, to cause an independent, certified international public accounting firm reasonably acceptable to Aimmune or reasonably acceptable to its Affiliates or Sublicensees, as applicable, to audit such records during Aimmune's, or its Affiliates’ or Sublicensees', as applicable, normal business hours to confirm the number of Product units sold, the gross sales and Net Sales of Product, the royalties payable, the method used to calculate the royalties payable, and the exchange rates used in accordance with Section 8.2
Response: No.
Clause: ACSI and Company are sometimes referred to collectively herein as the "Parties" and individually as a "Party."
Response: No.
Then, at the very end of this list of examples, we include the original clause this query was about. And when you use CoCounsel and enter the query as it originally appeared, with only the single clause in question, CoCounsel does all the above prompt refinement for you, in the background.
And now, because of our guidance, GPT-4 arrives at the correct answer, which CoCounsel then delivers to you:
Clause: RemainCo hereby covenants not to sue SpinCo under any Licensed RemainCo Know-How and under RemainCo's right, title and interest in and to Shared Library Materials and Foundational Software, including, without limitation, all applicable Patents, Copyrights, and Know-How, for infringement or misappropriation based upon any action that occurs in connection with the continued operation of the SpinCo Business and any future extensions of the SpinCo Business in any field other than the RemainCo Core Field after the Distribution Date.
Respond yes or no.
RESPONSE: Yes.
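In code, this pattern amounts to prepending labeled examples to the question before asking about the new clause. The sketch below is illustrative only: the example clauses are abbreviated versions of the ones quoted above, and the model call assumes the OpenAI Python SDK's chat-completions interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = (
    "Is a party restricted from contesting the validity of the counterparty's "
    "ownership of intellectual property or otherwise bringing a claim against "
    "the counterparty for matters unrelated to the contract?"
)

# A handful of (clause, correct answer) pairs to guide the model.
# Clause text is abbreviated here for readability.
EXAMPLES = [
    ("Company agrees that it will not at any time contest the ownership or "
     "validity of any Reed's Intellectual Property or Deliverables...", "Yes."),
    ("Xencor shall have the right ... to audit such records during Aimmune's "
     "normal business hours to confirm the number of Product units sold...", "No."),
]


def few_shot_prompt(new_clause: str) -> str:
    """Build a prompt that shows worked examples before the real question."""
    parts = [QUESTION, ""]
    for clause, answer in EXAMPLES:
        parts += [f"Clause: {clause}", f"Response: {answer}", ""]
    parts += [f"Clause: {new_clause}", "Respond yes or no."]
    return "\n".join(parts)


def classify_clause(new_clause: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": few_shot_prompt(new_clause)}],
    )
    return response.choices[0].message.content
```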
Chain-of-thought prompting
GPT-4 doesn’t always do well with problems requiring reasoning. Let’s take this example from the LSAT:
QUERY: The Venetian Renaissance painter Vittore Carpaccio used sumptuous reds in most of his paintings. Since the recently discovered Venetian Renaissance painting Erato Declaiming contains notable sumptuous reds, it is probably by Carpaccio.
Which one of the following contains a pattern of flawed reasoning most similar to that in the argument above?
A: Most Renaissance painters worked in a single medium, either tempera or oil. Since the Renaissance painting Calypso's Bower is in oil, its painter probably always used oil.
B: In Italian Renaissance painting, the single most common subject was the Virgin and Child, so the single most common subject in Western art probably is also the Virgin and Child.
C: Works of art in the Renaissance were mostly commissioned by patrons, so the Renaissance work The Dances of Terpsichore was probably commissioned by a patron.
D: The anonymous painting St. Sebastian is probably an early Florentine painting since it is in tempera, and most early Florentine paintings were in tempera.
E: Since late-Renaissance paintings were mostly in oil, the Venetian late-Renaissance painter Arnoldi, whose works are now lost, probably painted in oil.
RESPONSE: Most Renaissance painters worked in a single medium, either tempera or oil. Since the Renaissance painting Calypso's Bower is in oil, its painter probably always used oil.
This is incorrect (the correct answer is D), but we can improve this query by simply adding this single line at the end:
Before you answer, think through your reasoning step-by-step.
This is called chain-of-thought prompting, where we ask GPT-4 to explicitly state its problem-solving process. This technique gives GPT-4 a “scratch pad” where it can work through the problem step by step rather than trying to produce an answer immediately.
(New, correct) RESPONSE: The flawed reasoning in the original argument is that it assumes that because a certain characteristic (sumptuous reds) is common in the works of a certain artist (Carpaccio), any painting with that characteristic is probably by that artist. This is a flawed reasoning because it ignores the possibility that other artists could also use sumptuous reds in their paintings.
The answer choice that most closely mirrors this flawed reasoning is D: The anonymous painting St. Sebastian is probably an early Florentine painting since it is in tempera, and most early Florentine paintings were in tempera. This argument assumes that because a certain characteristic (being in tempera) is common in the works of a certain group of artists (early Florentine painters), any painting with that characteristic is probably by an artist from that group. This is flawed for the same reason as the original argument: it ignores the possibility that artists from other groups could also paint in tempera.
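In code, chain-of-thought prompting can be as simple as appending the step-by-step instruction to the question before sending it. Here is a minimal sketch, again assuming the OpenAI Python SDK as the model client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def answer_with_chain_of_thought(question: str) -> str:
    # Appending this single instruction gives the model a "scratch pad"
    # to reason through the problem before committing to an answer.
    prompt = (
        question
        + "\n\nBefore you answer, think through your reasoning step-by-step."
    )
    return call_llm(prompt)
```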
Self-reflection
Self-reflection is similar to chain-of-thought prompting and is often used in lieu of, or in conjunction with, that technique. It involves asking an LLM to reflect on its own outputs and make any corrections it deems necessary.
Using the same LSAT example as above, when GPT-4 returns the incorrect response, we then enter this prompt:
PROMPT: Double-check your answer and fix any problems you find.
RESPONSE: D: The anonymous painting St. Sebastian is probably an early Florentine painting since it is in tempera, and most early Florentine paintings were in tempera.
We see that GPT-4 is able to correct its own reasoning mistakes. As with chain-of-thought prompting, giving GPT-4 more “scratch room” to work increases its ability to handle complex tasks.
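As a sketch, self-reflection is just a second round trip: the model's first answer is sent back to it along with a request to double-check. The two-message conversation below is illustrative and assumes the OpenAI Python SDK's chat-completions interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_self_reflection(question: str) -> str:
    messages = [{"role": "user", "content": question}]

    # First pass: the model's initial answer.
    first = client.chat.completions.create(model="gpt-4o", messages=messages)
    first_answer = first.choices[0].message.content

    # Second pass: feed the answer back and ask the model to review it.
    messages += [
        {"role": "assistant", "content": first_answer},
        {"role": "user",
         "content": "Double-check your answer and fix any problems you find."},
    ]
    second = client.chat.completions.create(model="gpt-4o", messages=messages)
    return second.choices[0].message.content
```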
Prompt chaining
Self-reflection is a simple type of prompt chain, in which the output of one query is fed as part of the input to another. Like chain-of-thought prompting, prompt chaining involves breaking down complex tasks into steps, enabling us to use LLMs to process different types of documents and more complex information on a much larger scale.
Chains don’t have to be linear—they can branch, run in parallel, and then join up again. There are many ways to break complex tasks like legal research or database search into smaller components.
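A minimal illustration of the wiring appears below: each step's output becomes part of the next step's input. The particular steps (summarize, identify issues, draft an answer) are hypothetical and chosen only to show the structure, not CoCounsel's actual pipeline; `call_llm` again assumes the OpenAI Python SDK.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def research_chain(document: str, question: str) -> str:
    """Hypothetical three-step chain: each prompt consumes the previous output."""
    summary = call_llm(f"Summarize the key facts in this document:\n\n{document}")
    issues = call_llm(
        f"Given these facts:\n\n{summary}\n\n"
        f"List the legal issues relevant to this question: {question}"
    )
    answer = call_llm(
        f"Question: {question}\n\nRelevant facts:\n{summary}\n\n"
        f"Relevant issues:\n{issues}\n\nDraft a short, well-reasoned answer."
    )
    return answer
```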
While in practice each of these prompting techniques unfolds in far more complex ways than in the examples we’ve shared, at heart these are the techniques we use. And we continue to refine them every day, taking into account the feedback CoCounsel users give us, as our machine learning and prompt engineering experts continue to deepen their LLM expertise and optimize LLM output for CoCounsel.