In Ross v. United States, 331 A.3d 220 (D.C. 2025), the District of Columbia Court of Appeals joined the Eleventh Circuit Court of Appeals in addressing the use of large language models (LLMs) in litigation.
In Ross, the Court of Appeals was tasked with determining whether the evidence presented at trial was sufficient to uphold the appellant’s conviction for animal cruelty. The appellant had left her dog in a parked car on a hot day, and the court ultimately decided, in a 2-1 ruling, that the evidence was insufficient to support the conviction. Both the majority and dissenting opinions used ChatGPT to bolster their arguments, while a concurring opinion discussed the potential benefits and risks of the judiciary’s use of artificial intelligence, including LLMs.
A key issue in Ross was whether the trial court could rely on “common sense” or “common knowledge” to conclude that the appellant failed to provide adequate protection for her dog from the weather—as required by the relevant D.C. statute—when she left the dog in a car on a day when the temperature reached 98 degrees. The majority held that such reliance was inappropriate, as the government failed to provide evidence of the car’s internal temperature or signs of heat-related distress in the dog. Judge Deahl dissented, arguing that the trial court’s conclusion was rational based on the evidence and common-sense inferences.
To “scrutinize” what he considered adequate common knowledge, Judge Deahl turned to an LLM. He asked ChatGPT whether it was harmful to leave a dog in a car with the windows slightly open for an hour and 20 minutes in 98-degree weather. ChatGPT responded unequivocally that, yes, such conditions were very harmful. To further support his argument, Judge Deahl posed a similar question drawn from a previous D.C. Court of Appeals decision, Jordan v. United States, in which an animal cruelty conviction was reversed for insufficient evidence: whether leaving a German shepherd outside in 25-degree weather was harmful. ChatGPT’s response was more equivocal, aligning with Judge Deahl’s common-sense understanding that the conviction in Jordan was correctly reversed, while Ross’s conviction should be affirmed.
The majority responded to Judge Deahl’s use of ChatGPT with its own. In a footnote, the majority referenced a previous decision, Long v. United States, where the court held that a jury could not use common sense to determine that the value of a stolen 2002 Dodge Intrepid exceeded $1,000 without specific evidence. When the majority asked ChatGPT about the value of a 2002 Dodge Intrepid in 2012, ChatGPT estimated a value between $3,000 and $5,000, well above the $1,000 threshold at issue in Long. The majority argued that ChatGPT’s analysis did not align with at least one of the court’s sufficiency decisions.
Judge Deahl countered the majority’s argument in his own footnote, noting that ChatGPT’s valuation of the Intrepid included caveats about factors like mileage and maintenance, similar to the court’s reasoning in Long. He noted that he then asked a more targeted question: “Would you say, beyond a reasonable doubt, that an operable 2002 Dodge Intrepid would be worth more than $1,000 in 2012?” ChatGPT responded in roughly the same way the court did in Long—the car would likely be worth more than $1,000, but whether that could be said beyond a reasonable doubt depended on other factors. Conversely, Judge Deahl pointed out, when he asked ChatGPT whether it could “say, beyond a reasonable doubt, that leaving a dog in a car for an hour and 20 minutes in 98-degree heat would raise a plain and strong likelihood of harming the dog,” its answer was, “Yes, beyond a reasonable doubt,” with extensive elaboration.
The dueling uses of ChatGPT were not the only discussion of LLMs in Ross v. United States. Judge Howard noted that the majority’s and dissent’s use of an LLM followed a concurrence in which Judge Newsom of the Eleventh Circuit Court of Appeals discussed LLMs’ potential roles in litigation. In his own concurrence, Judge Howard then highlighted a few additional concerns that Judge Newsom had not addressed.
Judge Howard began by emphasizing that as AI tools become increasingly prevalent, courts must not ignore them. He cautioned, however, that while courts need to develop AI competency, they must proceed carefully.
Although noting issues of reliability and bias in AI, Judge Howard opined that “security and confidentiality of court information are particular concerns,” as AI systems often “learn” from user input, risking exposure of sensitive court documents and personal information. He then provided several hypotheticals to illustrate these concerns.
First, Judge Howard noted that the D.C. Court of Appeals announces its panel members a week before oral arguments. If a judge uses AI to summarize briefs well in advance, they risk “surrendering data which includes—at bare minimum—the submitted data, i.e., the briefs of the parties, and potentially personally identifying data, i.e., a username, IP address, and email address.” This could reveal the judge’s panel involvement before it becomes public.
Second, a judge might use AI to prepare a decision, including factual background, legal standards, and even some analysis. If the AI trains on this data, someone with access could obtain judicial deliberative information.
Third, judicial records and briefs often contain sensitive information typically subject to redaction, such as social security numbers, account numbers, and minors’ names. “If unredacted briefs or records were loaded into the AI tool, it would be an instant failure of the court’s duty to protect such information,” Judge Howard remarked, as LLMs have millions of users.
Judge Howard noted that the risks of improper AI use could be mitigated with tools offering robust security and privacy protections, and that companies are developing government-oriented solutions.
He concluded by noting that many court systems, including the District of Columbia’s, have established task forces to ensure the beneficial, secure, and safe use of AI technology.