Automattic Faces Scrutiny Over AI Access Policy

This article is a joint effort by James Giroux & Jyolsna.

After unconfirmed reports of Google entering into a content licensing agreement with Reddit for training its AI, 404 Media claimed yesterday that Automattic is set to sell Tumblr and WordPress.com users’ content to Midjourney and OpenAI. If true, this could mirror an extended partnership that Shutterstock entered into with OpenAI last year.

Claims of 404 Media 

404 Media claims to have insider information about the deal, backed up with documentation, confirming that Automattic is in the advanced stages of negotiation with these AI companies. To support its claims, 404 Media quoted Tumblr Product Manager Cyle Gage, who reported on an internal message board on the status of the initial data collection process and how it included content that should not have been collected.

While 404 Media has provided quotes from an internal source, it has not provided specific proof, such as screenshots of conversations or access to source materials, that would help others validate its claims. 404 Media also refers to user content as “users’ data,” which can easily be misconstrued as personally identifiable information (PII) or credit card information, whereas the content discussed in the article is already publicly available.

Response From Automattic 

Within a few hours of 404 Media’s article going up, Automattic released a statement describing its position on content distribution and the rights of all users on WordPress.com and Tumblr to opt out of their public content being included in data shared with AI partners.

Automattic argues that because AI regulation and legislation do not yet exist, it is taking these steps to proactively give users additional ways to control how and where their content is made available. It is creating a pathway for AI partners to get streamlined access to the content users are open to sharing, while also taking steps to remove access to content that users no longer want shared. In other words, the content in question is already available to AI companies because it is publicly crawlable; content deals only make it more accessible and manageable.

Automattic published “Protecting User Choice” emphasizing the following points:

  • We currently block, by default, major AI platform crawlers—including ones from the biggest tech companies—and update our lists as new ones launch.
  • We have a setting to discourage search engines from indexing a site on WordPress.com and Tumblr. This signals to search engines not to crawl that content or include it in search results.
  • We have added similar settings to WordPress.com and Tumblr to discourage crawling by AI companies. If you already discourage search engine indexing, this is automatically enabled.
  • We will share only public content that’s hosted on WordPress.com and Tumblr from sites that haven’t opted out.
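Automattic's statement does not say how crawlers are blocked or discouraged. One common mechanism for this kind of signal is a robots.txt file, so the following is only a minimal sketch under that assumption. The sample rules are hypothetical and are not the rules WordPress.com or Tumblr actually serve; GPTBot and CCBot are used as examples of AI-related crawler user agents, and `is_crawl_allowed` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; the real rules
# served by WordPress.com or Tumblr are not given in the article.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        allowed = is_crawl_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.com/post")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is advisory: it only works if the crawler chooses to honor it, which is consistent with the statement's own wording that these settings "discourage" crawling.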

The article continues hinting at a deal in the future: “We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control. Our partnerships will respect all opt-out settings. We also plan to take that a step further and regularly update any partners about people who newly opt out and ask that their content be removed from past sources and future training.” 

Automattic also released a new tool that “lets you opt out of sharing content from your public blogs with third parties, including AI platforms that use such content for training models. We will engage with AI companies that we can have productive relationships with, and are working to give you an easy way to control access to your content…We already discourage AI crawlers from gathering content from WordPress.com and will continue to do so, save for those with which we partner… We are committed to making sure our partners respect those decisions.”

WordPress.org Users Aren’t Affected

Josepha Haden Chomphosy, Executive Director of WordPress, shared this with the community in the Slack channel: “I can confirm that the WordPress project is not involved in selling user data or content for AI training purposes. This has been our consistent stance across the long history of WordPress, even as recently as when I was sharing thoughts for the future of our project heading into 2023.”

Later, Jetpack tweeted that “data from Jetpack connected sites is not included. This only applies to WordPress.com hosted sites.”

Automattic has been struggling to make Tumblr profitable since acquiring it in 2019. Last year, Matt Mullenweg revealed that Tumblr is losing $30M each year.

We have reached out to Chenda Ngak (Head of Communications at Automattic) and will update this article once we get her quote.

(WordPress, or WordPress.org, is an open-source CMS, while WordPress.com is a hosted platform owned by Automattic, a company founded by Matt Mullenweg. The two are not the same.)

10 responses to “Automattic Faces Scrutiny Over AI Access Policy”

  1. Two points:

    1) “Opt-out” is still a user-hostile design choice. If they wanted to make sure people were ACTUALLY consenting to this data sharing, they’d make it opt-in.

    2) While .org and .com are legally distinct entities, it’s also true that Matt Mullenweg is still on the board at the WordPress Foundation and as such has some influence over its decisions. How much influence, I’m not sure. But I am quite leery of his decision making process after the last few days.

  2. Well, I think this goes to show that we can’t expect writers hired by Matt and Automattic to report on Automattic in an unbiased manner.

    Selling people’s content unless they opt-out is obviously a dark pattern. And a blogging platform like WordPress.com or tumblr often attracts high investment content that people would want to protect. Automattic has also been making claims that if people opt-out late, they will somehow retroactively get people’s content out of training sets just by asking nicely.

    I’m glad to hear the data Jetpack takes is not currently included. But considering Jetpack’s invasive data practices (taking data supposedly needed for features that are not even enabled, including all of the content in a site), I think whether Automattic will be selling content it takes from WordPress.org is still a concern.

    I don’t expect WP Tavern to just dunk on Automattic, but this defensive coverage is still infuriating.

  3. As a former active WordPress developer & contributor to code & community I’m saddened by the recent developments at WordPress.com

    Automattic’s tone-deaf decision to offer their customers’ contents to “selected” AI partners, unless they opt-out (a user hostile ‘solution’), comes across as a classic techbro money grab scheme.

    It goes against everything the open-source project should stand for and has stood for. This questionable business decision by Automattic, and thus Matt Mullenweg, has a considerable negative impact on the open-source project.

    It should have consequences for Matt Mullenweg’s continued involvement in the decision-making process of the WordPress opensource project.

  4. Can we get clarity on what the WordPress Firehose does within the context of data sharing for AI and other companies? It’s enabled by default if you install JetPack.

    But let’s face it, the siren call of getting a bit more money for freely provided content, without thinking too much about ethics, has always been there with Automattic, ever since the spam sites on wordpress.com many years ago, then the practice of putting ads on wordpress.com blogs but hiding them from the blog owners, resulting in suicide awareness blogs carrying scientology ads that the blog owner had zero idea about.

    I appreciate the open source product and free nature of it, but I’m kind of enjoying just paying for things with Laravel. It feels like a purer deal.

    • This is like saying if you don’t want to be mugged, never step outside, rather than doing literally anything to stop muggers. People wouldn’t have put so much content online if copyright law didn’t protect it. Now the walls are coming up. Is that what we want? Surely privacy legislation would be better than an internet with nothing but generated content farms.

  5. Opt-in by default is the wrong decision because it takes advantage of users who do not realize that it is happening.

    Opt-out by default is also the wrong decision, because most people would never opt in or out and thus is unworkable for Automattic.

    What Automattic should do is force users to make an explicit choice to opt in or out the next time they access their admin console, defaulting to out until they do. They could also email everyone asking them to sign in so they could explicitly indicate their choice.

    They should also show the user exactly how to change that setting in the future just the way they show users how to use WordPress when on-boarding.

    Frankly, while I am particularly allergic to false binaries, I cannot think of any other approach that would be correct. #jmtcw #fwiw
