A central ethical question that has surfaced amid the ongoing popularization and commercialization of generative artificial intelligence revolves around copyright infringement and what, exactly, constitutes “fair use.”

As lawsuits alleging copyright infringement in both the input and output of generative AI models continue to stack up, the position of the bulk of the tech sector, perhaps represented best by OpenAI, boils down to a simple opinion: that it is fair for tech companies to train their commercially available, lucrative models — without permission, credit or compensation — on publicly available content. 

At the same time, OpenAI has been exploring content licensing deals with publishers, including Axel Springer, the parent company of Business Insider. And according to a new report from 404 Media, OpenAI is on the verge of closing a deal with a new customer: Automattic, the company behind Tumblr and WordPress.com. 

It remains unclear exactly what kind of content the licensing deal will cover, when it might close, or what it is worth.

404 Media reviewed internal documents showing that an initial data dump, which compiled Tumblr’s content between 2014 and 2023, included material that should have been excluded, such as private posts on public blogs, posts on deleted blogs, and explicit posts.

Automattic did not clarify whether this compilation of data was sent to OpenAI. 

The company did not respond to questions regarding the type of content included in the deal. TheStreet additionally asked whether self-hosted sites on WordPress (separate from WordPress.com) would be included in the data sale. 

Automattic did not respond. 

The company instead pointed TheStreet to a public statement that says that it currently blocks AI platform crawlers and will further allow users to opt out of sharing their content.

“We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control. Our partnerships will respect all opt-out settings,” the company said. “We also plan to take that a step further and regularly update any partners about people who newly opt out and ask that their content be removed from past sources and future training.”

Automattic will also reportedly be selling user data to Midjourney, an AI image generation company. 

Neither Midjourney nor OpenAI responded to a request for comment. 

Social media and artificial intelligence

Automattic is hardly the first platform to enter into a licensing deal with an AI company. 

The week before, Reddit went public with a deal it had signed with Google — worth around $60 million annually — to license its user content to, among other things, train Google’s AI models. 

Meta (META) has admitted that it used public posts on its platforms to train parts of its own AI models.

artists on tumblr, please go to your blog settings and check this pic.twitter.com/05VUxQRMQB

— adrienne 🔜 ✨GDC! (@insertdisc5) February 27, 2024

X’s privacy policy makes clear that it “may use” user content to “help train our machine learning or artificial intelligence models.” 

Though Tumblr’s privacy policy states that it does not share user information unless that information has been anonymized, or permission has been granted by the user, the policy does say that Tumblr may share information with “entities we do business with.”

The policy makes no mention of artificial intelligence models or training. 

WordPress.com’s privacy policy likewise makes mention of sharing some data with third parties, but makes no mention of artificial intelligence models or training. WordPress.com’s terms of service state: “We don’t own your content, and you retain all ownership rights you have in the content you post to your website.”

“For years I’ve journaled in a private WordPress blog, writing down my most traumatic memories. I can’t believe our most intimate words and images are now going to be sold off & maybe reproduced by a prompt somewhere, shared for the world to see,” one X user wrote. “This has to be criminal.”

“Opt-out” being the standard for AI is total BS. It’s a way for AI companies to justify wrongdoing. Opt-out is itself an admission of wrongdoing, because if they felt they were legally and morally in the right, why would they bother extending the option in the first place? 🤔

— Reid Southen (@Rahll) February 28, 2024

Again, Automattic has not clarified what type of content will be included in the sale. 

The inclusion of AI training as a term of service is still a new phenomenon. No mention of utilizing user content to train AI models, or selling user content to AI companies, existed in Twitter’s privacy policies through May of 2023.

The potential deal, according to Jason Kint, CEO of Digital Content Next, “must indicate at least some lack of confidence” in the tech sector’s oft-repeated fair use claim. 

The viability of licensing

As licensing deals continue to crop up, Clément Delangue, the CEO of open-source machine learning platform Hugging Face, said that such deals represent a risk of power concentration. 

“It might not be the users, artists, or content creators who will benefit from this but big companies and Hollywood studios who will trade their rights and not redistribute,” he said.

Others, including Ed Newton-Rex, the CEO of the nonprofit Fairly Trained, see licensing as the only way forward. 

Newton-Rex wrote in a post on X that it seems likely that generative AI will replace the demand for a big swath of creative work, “training on people’s work to do so.” 

“Without licensing (or another similar solution) it will do this without compensating the creators / rights holders of that work. The creative industries will be decimated, via exploitation of their work,” he said. “Without licensing, there is a concentration of power, but a different one: a concentration of power in the AI industry, with that power taken away from creators and rights holders alike.”

AI ethicist and researcher Nell Watson told TheStreet that as such content licensing practices continue, she expects social media users and consumers to “have very little influence” over how their content, now used to train AI models as well as to tailor targeted advertising, will be used.

“Platforms will claim to speak in defense of their users, but offer nothing to them directly in compensation,” she said. 

“The remarkable nature of the user-generated content business is that platforms don’t have to pay for the creation of the content,” Kint told TheStreet. “The user is doing the work for free. And then all their property is able to be monetized, including their data to run targeted ads.”

“It seems like the only company that’s getting a really good deal on that is the platform.”

Contact Ian with AI stories via email, [email protected], or Signal 732-804-1223.
