Gretchen A. Peck | for E&P Magazine
There is growing angst in the news media community about how its products, the journalism it creates at no small expense, are being used to train Generative AI large language models (LLMs). Publishers wonder whether copyright law will protect them and whether they should sue over copyright violations or agree to the licensing and compensation terms offered by AI developers. E&P sought to understand these dilemmas better, so we asked news media publishers and advocates how they think these relationships will play out.
In this new AI realm, Danielle Coffey, president and CEO of the News/Media Alliance, stressed the critical need for copyright registrations: “It signifies ownership. It gives you the ability to enforce the copyright protection. In order to enforce it in practice, in the marketplace, in the courts, it has to be registered.”
After a decade of advocacy, in late July the Copyright Office issued a new rule allowing group registration of updates to news websites.
But what assurances does copyright provide a publisher with respect to Generative AI?
“Right now, we do have copyright protections — full stop. But whether or not it’s fair use and free use of our content is something the courts will determine in the coming months and years,” Coffey said.
Copyright protection is reactive, but there are also ways to protect intellectual property proactively. Coffey said it is impactful that some AI developers now allow publishers to opt out of having their content crawled.
Reporting for The New York Times in July, Technology Columnist Kevin Roose explained, “Over the past year, many of the most important web sources used for training AI models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group. The study, when looking at 14,000 web domains included in three commonly used AI training data sets, discovered ‘an emerging crisis in consent,’ as publishers and online platforms have taken steps to prevent their data from being harvested.”
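One common way publishers restrict crawling is a robots.txt file, which names the crawlers a site does not want visiting its pages. The following is a minimal sketch in Python, not any publisher's actual configuration. It assumes a hypothetical site, news.example.com, and a hypothetical robots.txt that blocks GPTBot, the user-agent name OpenAI has published for its crawler. Compliance is voluntary on the crawler's side, so this shows what a publisher requests, not what is enforced.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site.
# It asks GPTBot to stay out entirely and leaves other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network request
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL, used only for illustration.
ARTICLE_URL = "https://news.example.com/2024/08/breaking-story"

gptbot_ok = crawler_allowed(ROBOTS_TXT, "GPTBot", ARTICLE_URL)
other_ok = crawler_allowed(ROBOTS_TXT, "SomeSearchBot", ARTICLE_URL)
print(gptbot_ok, other_ok)  # prints: False True
```

A publisher could run a check like this against its own robots.txt to confirm that the file says what it intends before relying on it.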
In May 2024, OpenAI disclosed that it was developing Media Manager, “a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning and training.”
However, this doesn’t retroactively affect content already collected, and it doesn’t necessarily protect it from retrieval-augmented generation (RAG), which allows developers to scrape content in real time via search engines.
“This means today’s publication, the breaking news story,” Coffey explained. “The reason why that’s a problem is that the LLMs can then produce verbatim copies of real-time content. They would be competing with us for our breaking news and what we’re covering in real time. They become a full-fledged competitor, and that’s highly problematic, which is why we wrote to the [Department of Justice].”
Coffey said she fully supports news media publishers that have filed suits, including The New York Times, which she says has “a very strong case.” However, she doesn’t think all developers are “bad actors.” She hopes for a more mutually beneficial relationship between the Alliance’s members and AI developers. To help build that bridge, the Alliance is working on the framework for licensing agreements and compliance.
“I think licensing content, partnerships and collaboration with these AI companies is the best path forward because prolonged, protracted litigation can be avoided for both industries. At the end of the day, we need to continue to exist,” she said.
Though the terms of the deals, including how they’re structured and what they’re worth, aren’t disclosed, a number of publishers have reached agreements with developers. News Corp entered into a multi-year agreement with OpenAI, which enables the developer to tap content from The Wall Street Journal, New York Post and other Murdoch-empire titles. OpenAI also inked contracts with Axel Springer, the Financial Times, The Atlantic, The Associated Press, Dotdash Meredith and Vox Media.
Building a case
Ian B. Crosby, partner, Susman Godfrey. (Photo credit: Nick Hanyok Imaging)
Ian Crosby is a partner at Susman Godfrey, a law firm representing both platforms and publishers in copyright matters, and is lead counsel in The New York Times’ suit against Microsoft and OpenAI, filed in December 2023. He spoke with E&P about copyright protections for publishers. “In order to get statutory damages, you have to have registered your works with the Copyright Office before or soon after the works were infringed,” Crosby explained. “Until recently, literally this week, it was very difficult to register online-only works. The New York Times can register its works because it has a daily print edition, so mechanisms that have been in place for a long time make it easy, or relatively easy, for a regular periodical publisher to register their daily print editions. And those registrations cover the digital versions of those works.”
Crosby also explained how damages are calculated in copyright cases. “Copyright law recognizes that it can be challenging to calculate what your actual damages are for copyright infringement,” he said. “This, along with deterrence, is one reason copyright law provides statutory damages. [They] are significant on a per-work basis. There’s a minimum of $750 per work infringed, and they can go as high as $30,000. Or, in the case of infringement deemed to be willful, it can go as high as $150,000 per work.”
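To show how quickly those per-work figures add up, here is a rough sketch in Python that multiplies the amounts Crosby cited by a number of registered works. The function name and the 1,000-work figure are hypothetical and are not drawn from any lawsuit. Actual awards are set by a court within these bounds, so the result is only a possible range, not a prediction.

```python
# Per-work statutory damage figures as cited by Crosby above.
STATUTORY_MIN = 750       # minimum per work infringed
STATUTORY_MAX = 30_000    # ordinary maximum per work
WILLFUL_MAX = 150_000     # maximum per work if infringement is deemed willful


def statutory_damages_range(works_infringed: int, willful: bool = False) -> tuple[int, int]:
    """Return the (low, high) bounds, in dollars, for a given number of works.

    Hypothetical helper for illustration: it only multiplies the per-work
    bounds and does not model how a court would actually set an award.
    """
    if works_infringed < 0:
        raise ValueError("works_infringed must be non-negative")
    ceiling = WILLFUL_MAX if willful else STATUTORY_MAX
    return works_infringed * STATUTORY_MIN, works_infringed * ceiling


# Hypothetical example: 1,000 registered articles found to be infringed.
print(statutory_damages_range(1_000))                # (750000, 30000000)
print(statutory_damages_range(1_000, willful=True))  # (750000, 150000000)
```

Even at the statutory minimum, a hypothetical 1,000 registered works would put $750,000 at stake, which helps explain the emphasis on registration.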
In July, publishers saw how OpenAI, in particular, plans to defend its actions. In its case with The New York Times, OpenAI filed a pleading in federal district court in New York asking that The Times prove that the content in contention is, in fact, “original” and created wholly by the publisher. E&P invited comment from OpenAI but received no response.
‘Pillaging’ publishers’ content
Alden Global Capital-owned MediaNews Group and Tribune Publishing filed suit against OpenAI in April 2024.
Frank Pine, executive editor, MediaNews Group and Tribune Publishing
“Companies like OpenAI have brazenly misappropriated copyrighted content, including millions of our stories, to build their products,” Frank Pine told E&P via email. Pine is the executive editor of MediaNews Group and Tribune Publishing. “It’s like they went into a bookstore and took all the books without paying for them, claiming they must be free because they were just sitting there on the shelf for anyone to browse. Next, they deploy AI in such a way as to undermine and ultimately replace our business, pillaging publisher sites in real time to provide plagiarized summaries to their subscribers. Even worse, the summaries sometimes contain ‘hallucinations,’ falsely attributing misinformation to otherwise credible publications.”
Pine asserted that MediaNews Group/Tribune Publishing has substantial evidence to support its case.
“In our lawsuit, we provide evidence that ChatGPT was trained on our news content and that it provides information based on that material, sometimes replicating our stories verbatim,” Pine explained. “We also provide examples of how ChatGPT has attributed stories to our publications that we never actually published.”
“We believe our legal case is strong and that there is a clear wrong here that must be corrected,” he added. “We are, therefore, confident we will prevail. Considering that OpenAI’s CEO has publicly stated that OpenAI could not have made ChatGPT without copyrighted content and that he admitted in a Congressional hearing that content owners deserve to control their content and should benefit from its use, it would appear they agree.”
Pine sees OpenAI’s efforts to license content from a growing list of publishers as an acknowledgment of ownership and value.
“The fact that OpenAI is making these deals is a clear indication they recognize that the content they use to build and power their products has real material value and they should pay for it. We agree,” Pine concluded.
For nonprofit and independent publishers, legal remedies may be the only path
On June 27, 2024, the Center for Investigative Reporting — the nonprofit parent of Mother Jones and RevealNews.org — also filed suit against OpenAI and Microsoft, alleging the defendants “copied, used, abridged, and displayed CIR’s valuable content without CIR’s permission or authorization, and without any compensation to CIR.”
Monika Bauerlein, CEO, Center for Investigative Reporting/Mother Jones
CIR’s CEO Monika Bauerlein knew Generative AI would be problematic.
“When ChatGPT 3 landed in 2022, it was blindingly obvious that this would be a challenge in two ways. One, you could assume from the outset that these companies have used every bit of text they can get ahold of on the internet to train these models. And, since none of us have been asked for permission to do that, it was a safe bet that they just went ahead and did it. … What do we do about that when our work — that we have spent a lot of blood, sweat and treasure to create — is being used as a free resource for these incredibly lucrative companies?
“And the second piece was, what does this mean for the relationship between creator and audience? It suddenly becomes mediated by this tool that spits out a summary or full-on extracts from the work you’ve done and doesn’t lead a user back to the work that it’s excerpted or extracted from — doesn’t lead the user back to the author or originator, and breaks that connection,” she explained. “CIR is a nonprofit newsroom, and we entirely rely on our audiences’ support. Support from individuals is two-thirds of our budget, and people give that support because they find our work useful and valuable. They connect with it, and if that connection is broken, that’s it.”
Bauerlein said CIR would’ve welcomed the opportunity to consider a content licensing agreement, but it has not received a proposal for its “deep content archive” that dates back nearly 50 years.
“I am a little concerned that, to some extent, publishers are making similar mistakes that they’ve made in relationships with these other platforms. From the dawn of the internet, publishers have essentially provided content that people are interested in, which is time- and labor-intensive to produce, to tech platforms for free without asserting our power and rights as the originators of this content. And then when, after the fact, the tech platforms come around and offer some modest handouts, it’s usually too late. We wanted to make this statement, in part, because we can’t have that happen again,” Bauerlein said. “We can’t have that cycle again, where the tech platforms use our content and five or 10 years later, we wake up to what has happened.”
A solution without a problem
Not all news publishers feel Generative AI developers present an existential threat.
Joey Young, owner and publisher, Kansas Publishing Ventures
Joey Young is the owner and publisher of Kansas Publishing Ventures (KPV). “They [LLMs] are just the newest way for tech to glom onto others’ work and attempt to profit from it,” he told E&P.
Young believes the lawsuits filed by larger publishers have merit and wishes them luck, but he’s pragmatic about the outcome. Should those publishers prevail, he said, “the likelihood of any of that trickling down to community publishers and the people trying to eke out a living is slim.”
“It’s difficult to fret over AI when you know there is nothing you can do about it, and any wasted energy on it will just diminish the vital work we are doing in our communities,” he said.
Young also tempered the frenzy about Generative AI, suggesting it may not be as profound an innovation as many proclaim.
“There is no trillion-dollar problem for them to solve, so I am not sure [about] the amount of money being pumped into these things. It’s far more likely these companies collapse in on themselves after the hype wears off than they become applicable on a large scale enough to be profitable,” Young predicted. “By the time these companies could be profitable, they wouldn’t have made it to our tiny neck of south-central Kansas to make an offer anyway.”
Gretchen A. Peck is a contributing editor to Editor & Publisher. She’s reported for E&P since 2010 and welcomes comments at gretchenapeck@gmail.com.