December 1, 2025

Data Governance

How OpenAI Harmed Its Own Case

OpenAI’s real setback in the Authors Guild case stems not from using shadow-library data, but from deleting key datasets and giving shifting explanations that damaged its credibility and triggered a privilege waiver.

Amy Swaner

OpenAI had strong Fair Use arguments, but it made several decisions in 2022 that now look much worse after Bartz.

Executive Summary 

OpenAI’s main problem in the Authors Guild case is no longer whether training on shadow-library data was Fair Use, but that the company gave shifting explanations for why it deleted two large datasets built from LibGen. Judge Ona Wang found that OpenAI first claimed the deletion was routine “non-use,” then tried to hide the real reasoning behind attorney–client privilege, creating a classic sword-and-shield waiver. As a result, OpenAI must now turn over internal Slack communications and present its in-house lawyers for deposition, which could strengthen claims of willfulness and spoliation. The ruling shows that poor data-governance practices and inconsistent privilege strategies can turn a defensible copyright position into a serious credibility problem. 


Judges don’t usually accuse sophisticated litigants of turning privilege into a “moving target” unless something has gone badly wrong. 

That is essentially what happened to OpenAI in the Authors Guild litigation over its use of pirated book datasets “Books1” and “Books2.” In a recent opinion, Magistrate Judge Ona Wang ordered OpenAI to produce internal communications—including Slack channels named project-clear and excise-libgen—and to make its in-house lawyers available for deposition about why those datasets were deleted. 

On its face, this is an appropriate discovery ruling. But its true meaning goes deeper. For lawyers advising AI companies, it is a case study in how to turn a defensible (or at least arguable) copyright position into a willfulness and spoliation problem through poor data governance and worse privilege strategy. And for lawyers in general, it is a textbook case of spoliation, and of how to ruin your own best defense.

Earlier, I explained why OpenAI’s use of online material was appropriate and defensible under the Fair Use exception to the copyright protections enjoyed by authors in the United States. But with this latest discovery dispute, it’s clear that OpenAI’s biggest problem in this case isn’t proving that its use was acceptable under a Fair Use analysis. Now, thanks to decisions it made back in 2022, OpenAI’s biggest problem is credibility.

OpenAI’s Biggest Mistake Wasn’t Training on Pirated Books. It Was the Story They Told Afterward. 

Numerous cases have taught us that the problem, the true fiasco, is rarely the actual wrongdoing. It’s almost always the coverup that follows. OpenAI’s actions in this case are a good reminder of just how poisonous the coverup can be. The basic timeline of the case, and why OpenAI deserved to win on its Fair Use claims, can be found here.

Briefly, however, OpenAI needed a massive amount of data (written material, in this case) to train its LLM. OpenAI found just such massive amounts of information in ‘shadow libraries’ such as LibGen. LibGen is a controversial, large-scale file-sharing platform that provides free access to copyrighted books and academic articles. Think of it as “Napster” for written materials instead of music, if you’re old enough to remember Napster. Using LibGen and other shadow libraries, OpenAI created the Books1 and Books2 datasets, then used them to train the LLM that powers its ChatGPT tools.

In 2022, OpenAI executives discussed these no-longer-used datasets and decided to delete them. Fast forward to 2025 and the case of Authors Guild v. OpenAI. During discovery, OpenAI initially told plaintiffs that the datasets were deleted “due to their non-use,” framing the decision as routine operational cleanup. That could make sense, since these large datasets consumed substantial storage.

When Plaintiffs sought the underlying communications regarding deleting these datasets, OpenAI shifted positions. It claimed that the reasons for deletion—and much of the surrounding discussion—were protected by attorney–client privilege and work product. In other words, OpenAI tried to treat the decision as both a benign business judgment (when explaining itself to the court) and a privileged legal consultation (when resisting discovery). 

Judge Wang was not impressed. She held that OpenAI had waived privilege over much of this material by putting its “reason” for deletion at issue and then trying to retreat behind privilege when discovery got uncomfortable.  

This is classic sword-and-shield territory: 

A defendant can say, “We deleted Books1 and Books2 for neutral business reasons, and we did not willfully infringe.” 

Or it can say: 

“We did what our lawyers advised, and those communications are privileged.” 

What it cannot do is use the first story as a shield against willfulness while invoking the second story to block any inquiry into whether that story is true. 

By offering “non-use” as a business justification and then treating the underlying reasoning as off-limits, OpenAI effectively put its state of mind in play and tried to wall off the evidence that would confirm or contradict it. Judge Wang’s remedy was predictable.  She ordered production of internal Slack discussions and depositions of in-house counsel on the decision to delete. 

For purposes of willfulness and spoliation, that is potentially devastating. Those internal Slack discussions and the evidence to be revealed during depositions are likely to answer the two questions that matter most to Plaintiffs: 

  1. Did OpenAI know that Books1 and Books2 were built from obviously infringing sources such as LibGen? 

  2. Did it delete those datasets in part to reduce legal exposure once AI copyright litigation became foreseeable? 

If the documents and testimony come in badly on either point, the abstract doctrinal debate over fair use will likely not be OpenAI’s main problem. 

Bartz Shows Why Those Legacy Decisions Have Become So Dangerous 

When OpenAI deleted Books1 and Books2 in 2022, Bartz v. Anthropic did not yet exist. But Bartz now provides the doctrinal lens through which courts (and plaintiffs) will view those earlier choices. 

In Bartz, Judge Alsup drew a sharp distinction between two different acts: 

  • Using books, including unauthorized copies, in the course of training a large language model; and 

  • Building and maintaining a central library of pirated books drawn from shadow libraries. 

On the record before him, Alsup was prepared to treat the use of vast datasets as fair use. But he was equally prepared to treat the creation and retention of a “library of pirated books” as a straightforward act of infringement that could go to trial on liability, willfulness, and damages. 

Seen through that lens, the OpenAI discovery ruling looks even more consequential. Books1 and Books2 are alleged to be precisely the kind of shadow-library-derived corpus that Bartz isolates as a separate, freestanding wrong. Judge Wang’s order ensures that plaintiffs will be able to see: 

  • How those datasets were assembled and discussed internally; 

  • What OpenAI employees and lawyers said about their provenance; and 

  • Why, exactly, the decision was made to delete them when it was. 

From a strategy perspective, that is how OpenAI hurt its own case. 

Is All Lost for OpenAI? 

No, all is not lost for OpenAI—but Judge Wang’s order is more than a mere procedural hiccup. The ruling compels OpenAI to produce internal communications (including Slack) and to submit in-house lawyers for depositions about why the Books1/Books2 datasets—compiled from LibGen—were deleted, and holds that OpenAI waived privilege by giving a “non-use” business justification and then trying to reframe the decision as privileged once discovery pressure mounted. The Opinion expressly finds waiver as to communications about the “reasons” for deletion and OpenAI’s claimed good-faith, non-willful state of mind, though it stops short of applying the crime-fraud exception and grants the motion only in part. 

That’s damaging because it strengthens authors’ willfulness and spoliation narratives, but it does not decide liability, fair use, causation, class certification, or damages. OpenAI has already asked Judge Wang to stay the ruling and indicated it will appeal to Judge Stein, so the scope of waiver and the breadth of what must be produced could still be narrowed. Even if the order stands, ugly internal emails do not automatically translate into a verdict: OpenAI can still prevail on core legal questions (e.g., whether training-use is fair use, whether plaintiffs can prove market harm or output-based infringement, how far any “pirated library” theory extends) or resolve the exposure through settlement. In short, the discovery loss materially weakens OpenAI’s posture and raises the stakes of what those internal documents show, but it is far from a final, case-ending blow. 

Practical Takeaways for In-House Counsel and Lawyers in General 

The Wang ruling and the Books1/Books2 saga are not just cautionary tales for OpenAI. They are a good example of what not to do for any lawyer advising a company, or indeed for any lawyer advising clients. Here are five concrete lessons we can generalize from OpenAI’s missteps.

1. Treat Questionable Items as Questionable 

My criminal law professor taught us that if you are buying an expensive computer for a fraction of the cost, but you must pay in cash, and you are finalizing the transaction in the parking lot of an unused building, you are likely entering into a criminal transaction and could have liability. This should be common sense, but it’s not. 

It’s easy to get caught up in the need to meet goals and to build in the least expensive, most expeditious way possible. OpenAI fell into this mire. Eager to get ChatGPT trained and usable, it turned to shady ‘shadow libraries.’ But that alone was not as problematic as what came later: OpenAI deleted two datasets derived from the shadow libraries, and then gave questionable explanations to the court and opposing counsel.

For OpenAI and all other AI companies, the lesson might appear to be “if you or your engineers can recognize a source as a ‘shadow library,’ a court will, too.” But the real lesson here is “the coverup is worse than the original wrongdoing.”

For the rest of us lawyers, the ones not advising AI companies, the lesson goes back to what we learned as 1Ls in criminal law. If you are about to engage in a dodgy or questionable activity, don’t. And if you do, don’t try to cover it up.

2. Build a Real Data-Governance Framework—On Paper—And Live Up to It 

For OpenAI, the lesson is fairly narrow: if the only documentation of how you sourced data is in scattered Slack threads and one-off emails, your client will look disorganized at best and reckless at worst.

But this is a crucial lesson for all law firms and companies using AI tools, and probably the most important lesson most of us can take from this ruling. Courts are rapidly becoming unwilling to accept generic statements like “we followed industry practice,” and rightly so. They want to see: 

  • Written policies for data sourcing, vetting, and approval. 

  • Criteria for what is “off limits” (e.g., clearly infringing repositories, hacked databases, password-protected content). 

  • A documented process for escalating questions to legal for risk assessment. 

3. Don’t Improvise Deletions of Inconvenient Evidence 

This is first and foremost a lesson about avoiding spoliation, and anything that might look like a coverup if spoliation has occurred. It’s not worth the fallout to delete harmful evidence, and it’s worse to cover it up. Secondly, it’s a lesson directed mostly at in-house counsel, but not limited to those advising AI companies. Deleting a problematic document can be the right move. It can also look like spoliation (destruction of evidence) if done at the wrong time, for the wrong reasons, or with no record. 

Before any high-risk deletion, in-house counsel should ensure: 

  • There is a clear, contemporaneous business and compliance rationale (e.g., “we have determined this dataset violates our sourcing policy; we are replacing it with licensed or documented alternatives”). 

  • The decision is aligned with a neutral retention/deletion policy, not a one-off exception for the riskiest or most damning information in the building. 

  • You’ve considered whether a litigation hold is already in place or should be, given regulatory complaints, demand letters, press coverage, or other red flags. 

If you later need to defend the deletion, you want to be able to show that it was part of a disciplined governance program—not an ad hoc attempt to erase the worst evidence. 

4. Get Privilege Strategy and Your Privilege Log Right Before You Hand It Over to Opposing Counsel, Not After 

The core mistake by OpenAI that resulted in Wang’s opinion was not a subtle one: OpenAI put forward a simple “non-use” business justification, and then tried to recharacterize the underlying reasoning as privileged when plaintiffs asked to see it. 

In-house can avoid that trap by: 

  • Deciding early whether a given decision will be defended primarily as a business judgment or as a legal-driven compliance step. 

  • Making sure public-facing and court-facing explanations are consistent with that choice. 

  • Avoiding casual, off-the-cuff “reasons” in correspondence or hearings that could later be treated as waiving privilege on the entire decision. 

Once you affirmatively offer a “reason” for a contentious action, you should assume that the communications behind that reason may become discoverable. If you are not prepared to live with that, rethink how you are framing the decision. 

5. Clean Up Your Communications Before a Judge Does It for You 

In modern companies, the real corporate memory might be in Slack, Teams, internal wikis, and/or emails. That’s where judges and opposing counsel will go looking for: 

  • Discussions about obviously infringing sources (“let’s just use this/delete this; everyone does it”). 

  • Acknowledgments of legal or PR risk (“this might be illegal, but the gain is huge”). 

  • Candid talk around deletion and cleanup (“let’s get rid of this before plaintiffs start asking questions”). 

Counsel can’t—and shouldn’t—try to sterilize these systems after the fact. But you can: 

  • Provide training to engineers and product teams about how legal risk is assessed and why off-hand comments about “piracy” and “stealing” will be devastating in discovery. 

  • Set up separate, clearly labeled channels for legal advice, with counsel visibly involved, so you have at least a fighting chance of preserving privilege where it legitimately applies. 

  • Implement information-governance policies that limit how long high-volume chat data is retained, in a way that is neutral and defensible if later scrutinized. 

If your internal channels read like a live-blog of “we know this is piracy but we’re doing it anyway,” no amount of doctrinal fair-use argument will save you. 

Conclusion 

Judge Wang’s Books1/Books2 ruling is not a referendum on fair use. It is, however, a warning about evidence and judgment. OpenAI’s real problem was not just that its engineers used a shadow library; it was that the company later tried to explain away the deletion of those datasets while shielding the underlying reasoning behind shifting claims of privilege, just as courts are becoming more skeptical of “pirated libraries” after Bartz. OpenAI’s judgment call to delete the datasets and then obscure its reasoning makes it appear shady and wrongful. Had it owned up to the use, and to its rationale, OpenAI would be in a far better position now. 

For the rest of us, the message is even clearer. You cannot litigate your way out of bad data provenance or improvisational cleanup. If you quietly deleted damaging evidence, do not expect lofty talk about privilege to rescue you. The only durable defense is to act in accordance with your governance policy, and then stand by your decision rather than skulking around like the cat that ate the family bird. 

© 2025 Amy Swaner. All Rights Reserved. May be used with attribution and a link to this article. 
