Gracenote Sues OpenAI, and the Evidence Is in the Database Schema

Mar 14

On March 10, 2026, Nielsen’s Gracenote Media Services, LLC filed a copyright infringement lawsuit against OpenAI in the United States District Court for the Southern District of New York. Gracenote Media Services, LLC v. OpenAI Foundation, et al., No. 1:26-cv-01947-SHS (S.D.N.Y. filed Mar. 10, 2026). Represented by Susman Godfrey, Gracenote asserts claims for direct, vicarious, and contributory copyright infringement, as well as unjust enrichment.

Most AI copyright lawsuits to date have focused on the ingestion of creative works (e.g., books, articles, and images) to train large language models. Gracenote’s complaint is different. Its core theory is that OpenAI copied not just the content of Gracenote’s proprietary media metadata, but also the relational framework that organizes it: the structure, identifiers, and editorial linkages that make the data commercially valuable in the first place.

What Gracenote Is and What It Does

Gracenote maintains what it describes as one of the most comprehensive repositories of media content metadata in the world: the Gracenote Programs Database. The database is a proprietary, single-file relational database spanning over half a century of television shows and movies, containing millions of program elements that expand hourly. Compl. ¶¶ 25–26.

More than 1,000 Gracenote editors source, write, curate, and link content from over 100,000 sources worldwide. Compl. ¶ 26. Their work product falls into several categories:

1. Descriptive Records: Editors write original program descriptions in neutral, objective language; assign genres; and create proprietary video descriptors capturing attributes like mood, theme, scenario, setting, and subject. Compl. ¶¶ 28–29.

2. TMSIDs: Each program is assigned a unique, proprietary 14-character alphanumeric identifier (a “TMSID”) that places the program within an organized editorial schema. TMSIDs serve as primary keys linking together all metadata for a unique program, including its title, description, language, schedules, and showtimes. Even different versions of the same show (e.g., in a different language or format) receive distinct TMSIDs. Compl. ¶¶ 31–33.

3. Relational Structure: Editors determine how to connect and arrange program elements within the database—linking cast, crew, genres, moods, and themes across records. This interconnected structure powers content discovery, personalized recommendations, and efficient search across platforms. Compl. ¶¶ 30, 35–36.
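To make the three categories concrete, here is a minimal sketch of how a TMSID-keyed relational layout might look. It is not Gracenote’s actual schema: the table names, column names, mood labels, and TMSID values are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# In-memory database; every table, column, and row below is a hypothetical placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE program (
    tmsid       TEXT PRIMARY KEY CHECK (length(tmsid) = 14),
    title       TEXT NOT NULL,
    language    TEXT NOT NULL,
    description TEXT
);
CREATE TABLE mood (
    mood_id INTEGER PRIMARY KEY,
    label   TEXT NOT NULL UNIQUE
);
-- Link table: the editorial decision that a given program carries a given mood.
CREATE TABLE program_mood (
    tmsid   TEXT    NOT NULL REFERENCES program(tmsid),
    mood_id INTEGER NOT NULL REFERENCES mood(mood_id),
    PRIMARY KEY (tmsid, mood_id)
);
""")

# Two language versions of the same made-up show get distinct keys.
conn.executemany(
    "INSERT INTO program VALUES (?, ?, ?, ?)",
    [
        ("SH000000010000", "Example Show", "en", "Placeholder description."),
        ("SH000000020000", "Example Show", "pt", "Placeholder description."),
    ],
)
conn.executemany("INSERT INTO mood (label) VALUES (?)", [("tense",), ("dark",)])
conn.executemany(
    "INSERT INTO program_mood VALUES (?, ?)",
    [("SH000000010000", 1), ("SH000000010000", 2), ("SH000000020000", 1)],
)


def programs_sharing_mood(label: str) -> list[str]:
    """Return the TMSIDs of every program linked to the given mood label."""
    rows = conn.execute(
        """
        SELECT p.tmsid
        FROM program AS p
        JOIN program_mood AS pm ON pm.tmsid = p.tmsid
        JOIN mood AS m ON m.mood_id = pm.mood_id
        WHERE m.label = ?
        ORDER BY p.tmsid
        """,
        (label,),
    ).fetchall()
    return [row[0] for row in rows]


print(programs_sharing_mood("tense"))
```

In this sketch the descriptive text lives in the program rows, while the identifiers and link tables carry the editorial structure. The complaint’s theory is that both layers are protectable.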

Gracenote licenses this data to cable, satellite, and streaming distributors; device manufacturers; and, more recently, AI and machine learning companies for training and grounding. The entire Programs Database is registered with the U.S. Copyright Office. Compl. ¶¶ 38–42.

The Allegations: Training and Verbatim Reproduction

Gracenote alleges that OpenAI copied Gracenote Data without authorization to train its GPT models and to ground outputs in its products, including ChatGPT. Compl. ¶¶ 63, 69–72.

On the training side, the complaint points to the Common Crawl dataset, which comprised 60% of GPT-3’s training data by weight, and alleges that it includes web domains containing Gracenote Data, specifically tvlistings.gracenote.com. That data, Gracenote asserts, was taken in violation of its Terms of Use, which prohibit reproduction and distribution without prior consent. Compl. ¶¶ 67–69.

For GPT-4 and GPT-5, OpenAI has disclosed almost nothing about its training data. The complaint cites OpenAI’s own GPT-4 technical report, which stated that it “contains no further details about the architecture (including model size), hardware, training compute, dataset construction, [or] training method.” Compl. ¶ 48.

The Evidence: TMSIDs and Memorized Descriptions

The complaint’s most distinctive allegations concern the outputs. Gracenote tested multiple GPT models and found they could reproduce its proprietary data with striking specificity.

First, Gracenote tested TMSIDs. GPT-4.5-preview correctly generated thirteen exact TMSIDs for programs including Breaking Bad, Game of Thrones, Saturday Night Live, The Office, and The Big Bang Theory. GPT-4o and GPT-4.1 each generated the exact TMSID for Breaking Bad. Additional outputs from those models included near-exact matches with only one or two incorrect characters. Compl. ¶¶ 73–77. Gracenote notes that the probability of randomly guessing the 12-character sequence following a TMSID’s “SH” prefix is 1 in 1 trillion. Compl. ¶ 76.
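The 1-in-1-trillion figure is what one would get if each of the twelve characters after the prefix were a decimal digit. That is an assumption made here to reproduce the arithmetic, not something the complaint is quoted as saying. Under the same assumption, the sketch below also counts how many suffixes fall within one or two wrong characters of a target, which gives a rough sense of how unlikely the near-exact matches would be by chance.

```python
from math import comb

SUFFIX_LEN = 12  # characters following the two-character prefix
ALPHABET = 10    # assumption: each suffix character is a decimal digit


def exact_match_space(length: int = SUFFIX_LEN, alphabet: int = ALPHABET) -> int:
    """Number of equally likely suffixes under uniform random guessing."""
    return alphabet ** length


def within_k_mismatches(k: int, length: int = SUFFIX_LEN, alphabet: int = ALPHABET) -> int:
    """Count suffixes that differ from a fixed target in at most k positions."""
    return sum(comb(length, i) * (alphabet - 1) ** i for i in range(k + 1))


space = exact_match_space()
print(f"exact match: 1 in {space:,}")
for k in (1, 2):
    hits = within_k_mismatches(k)
    print(f"at most {k} wrong character(s): {hits:,} of {space:,} suffixes")
```

If the suffix could also contain letters, the space would be larger and a chance match even less likely, so the digit-only assumption is the conservative one.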

Second, Gracenote tested program descriptions. The complaint includes a table of side-by-side comparisons showing that GPT models reproduced Gracenote’s copyrighted descriptions virtually verbatim. For example, GPT-5.2’s output for Ready or Not matched Gracenote’s description word-for-word. The same was true for Game of Thrones, Sex/Life, and American Horror Story: 1984, among others. Compl. ¶ 80.

Third, Gracenote tested mood tags. When prompted for Gracenote’s mood tags for specific titles, GPT-4 and GPT-4.1 reproduced the identical tags, in both content and number, for dozens of shows and movies. Across forty titles, the models matched Gracenote’s assignments exactly, differing only in the order of the tags and, in one case, adding a single extra tag. Compl. ¶¶ 81–82.
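A comparison of this kind, which ignores tag order and reports any extra or missing tags, can be sketched as follows. The function name and the tag values are invented for illustration and are not taken from the complaint or from Gracenote’s data.

```python
from collections import Counter


def normalize(tags: list[str]) -> list[str]:
    """Trim whitespace and lowercase so trivial formatting differences are ignored."""
    return [tag.strip().lower() for tag in tags]


def compare_tags(reference: list[str], output: list[str]) -> dict:
    """Compare a model's tag list against a reference list, ignoring order."""
    ref, out = normalize(reference), normalize(output)
    ref_counts, out_counts = Counter(ref), Counter(out)
    return {
        "same_tags": ref_counts == out_counts,  # same content and same number
        "same_order": ref == out,
        "missing": sorted((ref_counts - out_counts).elements()),
        "extra": sorted((out_counts - ref_counts).elements()),
    }


# Hypothetical tags for a single title.
reference = ["tense", "dark", "gritty"]
reordered = ["gritty", "tense", "dark"]
with_extra = ["tense", "dark", "gritty", "suspenseful"]

print(compare_tags(reference, reordered))
print(compare_tags(reference, with_extra))
```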

Compilation Theory Applied to AI Training

Copyright protection for compilations whose selection, coordination, or arrangement reflects original editorial judgment is well-established. Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991); 17 U.S.C. § 103. Post-Feist decisions in the Second Circuit, where this case was filed, have recognized that creative choices in how to organize and present data can satisfy the originality threshold. See, e.g., CCC Information Services, Inc. v. Maclean Hunter Market Reports, Inc., 44 F.3d 61 (2d Cir. 1994). What distinguishes Gracenote’s complaint is how it deploys that doctrine: as evidence that AI models encoded the protectable structure of a compilation, not just the data values.

TMSIDs are not random serial numbers. Their first two characters indicate whether the content is a movie, television show, or other type of content. For television shows, the last four characters encode series and episode information. Editors exercise judgment in deciding how to classify franchise installments. For example, TMSIDs treat The Real Housewives of Atlanta and The Real Housewives of New Jersey as separate shows with distinct TMSIDs rather than seasons of the same show, and assign a Portuguese-language version of House a different TMSID than its English counterpart. Compl. ¶¶ 33–34.
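The layout described above can be illustrated with a small parser. This is a sketch under stated assumptions: the article confirms only the 14-character length, the two-character type prefix, and the four-character tail for television shows. The field names, the treatment of the middle eight characters as a single block, the reading of “SH” as a show prefix, and the sample identifier are all hypothetical.

```python
from dataclasses import dataclass

TMSID_LENGTH = 14

# Only "SH" appears in the article; reading it as a show prefix is an assumption.
PREFIX_LABELS = {"SH": "television show"}


@dataclass(frozen=True)
class ParsedTmsid:
    prefix: str  # first two characters: content type
    body: str    # middle eight characters (hypothetical grouping)
    tail: str    # last four characters: series/episode information for shows


def parse_tmsid(tmsid: str) -> ParsedTmsid:
    """Split a 14-character alphanumeric identifier into its positional fields."""
    if len(tmsid) != TMSID_LENGTH or not (tmsid.isascii() and tmsid.isalnum()):
        raise ValueError(f"expected {TMSID_LENGTH} ASCII alphanumeric characters: {tmsid!r}")
    return ParsedTmsid(prefix=tmsid[:2], body=tmsid[2:10], tail=tmsid[10:])


def content_type(parsed: ParsedTmsid) -> str:
    """Map the prefix to a label, falling back to 'unknown' for unlisted prefixes."""
    return PREFIX_LABELS.get(parsed.prefix, "unknown")


# Made-up identifier, not a real TMSID.
example = parse_tmsid("SH000000010002")
print(example, content_type(example))
```

Because each position carries meaning, a model that emits a correct identifier is reproducing classification decisions as well as an arbitrary string, which is the inference the complaint asks the court to draw.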

The complaint argues that this organizational schema, together with the relational mapping among works, people, genres, and descriptors, is itself copyrightable, and that the models’ structural recall of TMSIDs and their associations demonstrates that OpenAI’s training encoded not just content but Gracenote’s “expressive and organizational choices.” Compl. ¶ 101.

If the court finds that GPT’s reproduction of Gracenote’s identifiers and relational schema demonstrates copying of the protectable compilation, rather than mere extraction of data values, it would give compilation copyright holders a concrete evidentiary framework for proving AI training infringement.

Market Harm: Two Distinct Markets

No doubt anticipating a fair use affirmative defense, Gracenote identifies two markets in which OpenAI’s conduct causes harm. The first is the traditional market for media metadata, in which Gracenote licenses its data to content distributors, streaming platforms, and device manufacturers. The complaint alleges that OpenAI’s products can substitute for Gracenote’s offerings, and that companies have told Gracenote they can use LLMs trained on its data as a replacement for paid licenses. Compl. ¶ 94.

The second is the emerging market for licensing high-quality metadata to AI companies themselves. Gracenote has entered into licensing agreements with companies like Samsung and Google, under terms that restrict uses to prevent disintermediation of Gracenote’s core business. Compl. ¶¶ 41, 98–99. The complaint alleges that OpenAI’s unlicensed use bypasses these contractual protections entirely, enabling the same outputs through channels that mirror Gracenote’s licensed distribution pathways. Compl. ¶¶ 100–103.

Notably, Gracenote alleges it reached out to OpenAI to discuss licensing “many times over an extended time period,” and that OpenAI “rebuffed or ignored every single attempt.” Compl. ¶ 4 n.1.

Takeaway

Most AI copyright cases are about content: books, articles, images, code. Gracenote v. OpenAI is also, at its core, about structure. The complaint asks whether the editorial choices that go into organizing and relating data (the identifiers, the taxonomy, the relational logic) are themselves protectable works. The evidence that GPT models can reproduce Gracenote’s proprietary TMSIDs, each of which Gracenote says has a 1 in 1 trillion chance of being guessed at random, strongly suggests that the models were trained on Gracenote’s data. But the more consequential question may be whether a model’s structural recall of a database’s proprietary identifiers and relational schema is sufficient to prove that the protectable compilation (not just the data values) was copied during training.


David Sergenian