What WordPress.org Does with the Data it Collects from Users’ Sites


Since I started covering WordPress in 2009, I’ve noticed that certain topics come back around in cycles. One of these is the contention in the WordPress community over what data is sent to, stored on, and shared by WordPress.org. In a post published on Torquemag.io, Josh Pollock, founder of CalderaWP, argues that WordPress is a community-driven project and that, as such, data collected by WordPress.org should be shared with the community.

If installing and updating themes via the WordPress dashboard wasn’t so easy, WordPress wouldn’t be what it is today. I understand and appreciate this.

Here’s the part that doesn’t sit well with me: WordPress.org is collecting data on all of its users (as it should), but this information isn’t available in aggregate form to the community.

Pollock says that as an entrepreneur, the information would help him make informed business decisions.

Data is Stored for Two Days

I spoke to Samuel ‘Otto’ Wood, who helps maintain WordPress.org, and discovered that some of the assumptions people have are not true.

“The data collection systems on w.org have been inconsistent at best, and re-written several times,” Wood said.

“But the general idea that there is some kind of treasure trove of information we’re storing is misguided, at best. The data is collected, aggregated for the things we display, then tossed. We don’t store it for any serious length of time. Just the results of the data like the counts.”

Gathering, sorting, and displaying the large amount of data associated with WordPress is a CPU-intensive job. The most recent example of WordPress.org sharing aggregate data is the active install count for plugins and themes. Displaying that count became possible only after significant performance improvements by WordPress lead developer Dion Hulse. Without the improvements, the data collection would have overloaded CPUs and MySQL databases.

“Gathering that data is frickin’ difficult to start with,” Wood said. “For the longest time, we didn’t even have the actual system resources to pull off the ‘Active Installs’ count. We didn’t display that count because we couldn’t do it. The idea that we’re hiding things is ludicrous.”

Raw data is stored for two days and is then overwritten. “Basically, there’s too much data to store,” Wood said. “All of the data that w.org gathers is used to display the stats on w.org itself. Nothing special is hidden.”
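To make the collect, aggregate, discard cycle concrete, here is a minimal Python sketch. It is not WordPress.org's code: the class `UpdateCheckStats`, its method names, and the site identifiers are hypothetical. It assumes, following Wood's description in the comments on this post, that the Active Installs figure is a count of the sites that checked for an update of a plugin on the previous day, and that raw records are dropped after about two days.

```python
from collections import defaultdict
from datetime import date, timedelta

RETENTION_DAYS = 2  # assumption: matches the "two days" quoted in the article


class UpdateCheckStats:
    """Hypothetical sketch: raw update checks are short-lived, counts persist."""

    def __init__(self):
        # Raw data: day -> plugin slug -> set of site identifiers seen that day.
        self.raw = defaultdict(lambda: defaultdict(set))
        # Aggregated result: plugin slug -> most recent daily install count.
        self.active_installs = {}

    def record_check(self, day, site_id, slug):
        """Store one raw update check; the set ignores repeat checks by a site."""
        self.raw[day][slug].add(site_id)

    def nightly_job(self, today):
        """Publish yesterday's counts, then drop raw data past retention."""
        yesterday = today - timedelta(days=1)
        for slug, sites in self.raw.get(yesterday, {}).items():
            self.active_installs[slug] = len(sites)
        cutoff = today - timedelta(days=RETENTION_DAYS)
        for day in [d for d in self.raw if d <= cutoff]:
            del self.raw[day]


stats = UpdateCheckStats()
day = date(2015, 6, 1)
stats.record_check(day, "site-a", "akismet")
stats.record_check(day, "site-a", "akismet")  # second check from the same site
stats.record_check(day, "site-b", "akismet")
stats.nightly_job(day + timedelta(days=1))
print(stats.active_installs["akismet"])  # 2
```

Only `active_installs` outlives the retention window in this sketch, which is why a request for a dump of the raw data has nothing to draw on once the nightly job has run a couple of times.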

Data Accuracy is Hard

If developers are going to make business decisions using public data, the data has to be accurate. Accuracy is a complex problem, but the team has slowly made progress over the years as legacy systems on w.org are phased out.

“A lot of the w.org systems are poorly made,” Wood said. “They’re old, have been modified dozens of times over the years, and badly in need of updating. For a long time, the data we gathered could not be processed fast enough so we simply threw over half of it away.

“Mostly, we phase out old useless systems and replace them with something better and newer which gives us things to display. Active Install counts was an entirely new system that replaced an older one which didn’t give any useful information.”

Wood confirms what I’ve believed to be true for a long time. WordPress.org is not storing data for an extended period of time and the information that is collected is likely on public display somewhere on the site. What types of data would you like to see on WordPress.org?


51 responses to “What WordPress.org Does with the Data it Collects from Users’ Sites”

  1. The explanation makes complete sense. Just counting the stats for millions of sites is hard; at the requests-per-second level it would be crazy resource intensive. Unless there are huge “business interests”, collecting data extensively wouldn’t be cost effective at all.

    That being said, certain stats for plugins/themes really could be helpful to the authors, for example version counts for plugins/themes and for PHP/WordPress.

  2. I see it plainly this way:

    There are 76 million websites powered by WordPress. That number is so high because of the efforts of all of us spending years of our lives and millions of dollars of investment. We built that number together.

    That we, as a community, are not leveraging anonymized data about these sites seems like an enormous waste and disservice to our users… and, well, the entire globe. We are talking about 25% of the web.

    And to think that nobody really cares enough about the data to build anything to crunch it? We should be better than that.

      • Plugin and theme developers (of course) care. But maybe the folks that actually have access to the data don’t.

        It’s clear from Otto’s comment below that there just isn’t anyone on the .org team who has the resources to pay attention to big data. It’s not a problem of political will, but of resource allocation.

        The data dumps from Wikimedia that Paul mentions would be a fantastic start. If we can get some data released a community will quickly form around making something useful from that.

        We would see the most fantastic bloom of projects since the wporg api itself.

        The internet is the most important creation in the history of humanity. It is the first truly global connecting of cultures and people. We’ve built a quarter of it with WordPress.

        Knowing the technical intricacies, dependencies and limitations of our collective work should be considered essential. Google has data on their slice. Facebook does as well. They and their peers proactively leverage, plan, and develop around this information. Why don’t we?

  3. It would be nice if we could be rid of conspiracy theories and presumptions that good people are up to something. Thanks Jeff and Otto for the information. More communication helps.

    You ask what types of information we’d like to see. PHP and WordPress version usage would be interesting. Maybe a quarterly report … or any report that would allow us to see trends over time.

    • It is interesting. And since it seems reasonably accurate, I’ll try to make it public soon. In the meantime, here is my first attempt at showing something.

      Cloudup vpgahovqd5x

      The thing is, data gathering is hard. It takes time and resources. Yes, we care about certain things. Knowing that hosts are updating php is important to me. So is updating the plugin directory. And the forums. And internationalization. Those Germans need a forum in their own language to work right too. There’s a lot of things happening. Sorry, but there are priorities, and stats don’t quite get there straight up. Unless we need them right now. Then we go and get them, when we need them. I’m more interested in internationalization stats, honestly. Those seem promising. Wish I had them. I don’t, yet.

      • No apologies necessary. The efforts put forth are tremendous. I find it fascinating that you guys provide what you do with the resources at hand. I think what Josh would like to see is a greater effort toward making the aggregates available. Does .org run on WordPress? ;)

      • Very interesting. The same break-down for MySQL versions would also be interesting. We still see developers who are trying to develop their sites on modern PHP/MySQL versions, and then upload them to live hosting on obsolete ones (the issue with the MySQL utf8mb4 charset change in WP 4.2 is significant here). When we say “don’t do that, WP core doesn’t support that; and launching a new site on obsolete/EOLed tech is a terrible idea”, sometimes they come back and insist that doing such things is common or good practice. To be able to point to the stats can really help them.

  4. There is enough room on the w.org site to publish details about what is sent from every WP site, how often, and for what purpose, and to publish the data itself.

    If there is no good use for these data, please stop collecting them from millions of websites, or at least from my sites.

    The idea that we’re hiding things is ludicrous

    Nope, it’s logical when there isn’t enough detail or information about a subject.

    … some of the assumptions people have are not true

    Obviously, since no information is published. So assumptions, conspiracy theories, and other theories are all that will pop up.

  5. “Data is stored for two days” on .org? Where does it go? How can it be accessed with ease? Who owns WordPress.org? WordPress.com employees are HALF the community staff? Lots of conspiracy still left up in the air. Maybe a follow up article would be good here, because this has certainly raised some intriguing points.

    This comment stream is noisy only about a new data display; nothing addresses aggregate access.

    • A data dump of the raw requests sent for updates would be out of the question, for starters, because privacy. It would need to be fully anonymized, and that would be difficult. However, the big kicker is this: *we don’t have it*. We simply don’t keep it for very long.

      We don’t have any data to dump. Seriously, we don’t collect that type of raw information for any length of time. We collect and gather the data that we display, then toss it. Literally. I don’t know how many more ways I can say this. You’re asking for things that do not actually exist.

        • Define “collected”. We get all the data we display from the update checks made by WordPress for plugins and themes and core. For example, the “Active Installs” count is simply a count of how many sites checked for an update of that plugin/theme yesterday. Download counts are simply like a naive counter of the downloads of the ZIP files. With some filtering to prevent people from spamming them up by repeatedly requesting the same files over and over, of course. We have Google Analytics on every page of the site, but honestly, that sort of thing is over my head. I never could figure out Google Analytics, personally.

          If you want to see what is sent, you can look at the core code. As for what is retained, it’s all posted up on w.org somewhere, once we’re confident that the results from the counts are reasonably accurate. We don’t retain any raw data for more than a couple days, for debugging. Like those download requests. We store the request information for the day, then update the count once a day, then delete the information. Simple as that, really.

          What data, exactly, are you concerned about? I find it difficult to imagine what you think we’re doing, really, that would be in any way concerning. Dion went to a lot of trouble to make the Active Install counts available, for example. Why is that a bad thing? Why all the fuss? I don’t get it, really.

        • It’s not a bad thing, Otto, and I’m not making a fuss. I just want to know what my website sends off into space.
          I don’t care who uses these data, how, or for how long; that is outside my control and responsibility.

          No drama. I’m just missing the information I asked for, which should be published somewhere that doesn’t require reading the code.

        • Well, you are right.
          But he is the one who has all the data
          (along with some others, like Matt, etc.).

          I understand your point, Jeff,
          but I really wish that Matt and the others responsible (maybe) wouldn’t hide behind WP as an organisation or as an open source CMS,
          because I feel WP as a CMS/org is led by a few core team members and the org/company behind it, and these people make the decisions about where WP goes.
          (Let’s admit that some have a bigger voice than others.)

          It’s not uncommon in open source, e.g.
          Chrome and Android development is directed by Google
          (both are open source).

  6. Would love to hear what w.org does with the number of registered users on a site, which is submitted with every update check. And if it’s not used, why is this number part of an update request?

    Plugin data are submitted with their descriptions included, regardless of whether a plugin is self-written or not. But an update check could simply transfer a hash of all these data for a plugin: easier to compare, and it would be a huge reduction in data submitted and therefore almost a blessing for w.org’s old systems. But this won’t happen, Otto. The number of users collected with each update call will remain in core.

    Tell us why. And why users may not have a clue about all this. Why isn’t there a disclaimer that says: these data about your installation will be submitted to w.org twice a day. If you don’t like it, don’t use WordPress. Or even better: every piece of software has a checkbox and says: we collect data about your software usage. If not desired, uncheck. Why isn’t such a thing in WordPress?

    I have to laugh. Because after all (and there is more), presenting oneself as insulted by conspiracy theories is not really a reason for trust. Maybe Otto is not the only one with hands on w.org.

    • User and blog count are included in the check because there is potentially a future case where the updates may need to be targeted.

      Blog count was originally there for WordPress MU, where the thinking was that multisites with large numbers of blogs would need special handling code if there was an extremely big database change. This is because each blog is stored in its own set of tables, so the “upgrade” process needs to run independently on each of them. If you have a couple dozen or so, no big deal. If you have several thousand, then you probably want to do a different way of sending the SQL needed to alter all those tables.

      User count, same principle. If your users table is excessively large or perhaps even split across various databases with HyperDB, then we probably don’t want to send you an update that does a big ALTER to it, potentially taking down the site for an extended period of time.

      So far, those numbers have not been needed for database upgrades. That doesn’t mean they won’t be. There was a change a while back that changed the size of the user_pass field, but testing showed that it was not a game breaker for the users table even on very large tables, so it wasn’t needed. Other alterations, such as moving data to/from usermeta and such, might need special code for larger sites.

      The data is not stored or used for anything else, if that’s what you’re asking. The API currently ignores it. It’s a just-in-case measure, because you can’t predict the future and upgrades are complicated by a huge number of factors.

      • User and blog count are included in the check because there is potentially a future case where the updates may need to be targeted.

        @Otto: w.org runs regular updates 3 times a year. Opportunity enough to extend data collection when needed. Isn’t this the same argumentation as if you say: install all plugins available on w.org since one day you could need them? – Weird.

        And what WP really lacks is the user permission to take their data away. You ask: who in the hell is concerned about blog and user count?! Well, what about business owners who offer a site inside a network for money? Or a membership site? Maybe they have competition. These data for owners like that are sort of business secrets. Definitely not meant to be sent out several times a day. So without permission it’s like sniffing. That’s why it’s really important to ask site owners for permission.

        • You ask: who in the hell is concerned about blog and user count?!

          I asked no such thing. But I *know* what we do with it, and that is *nothing*. It is not stored. It is not aggregated. We don’t save it, because it is not useful *to us*.

          You want to doubt my word? That’s fine. If that is the case, don’t ask me the question to begin with.

        • Ok. Otto cannot or does not want to answer at least one of my questions. But he said something important.

          It is not useful *to us*.

          Nice! To whom are collected data useful then? IF they are collected they CAN be used, e.g. due to FISA. But why, again, are “not useful” data collected at all?

          To not answer this question is to ignore user privacy. No permission checkbox, no removal of fetched data that are neither stored nor used. Ah, and, Otto, it’s really not about YOU and whether I believe YOU. It’s about user privacy and what it means to Automattic and what that means to users. Matt seemed to be happy with his last keynote: about making WP sites more secure by applying letsencrypt.org certs to all wordpress.com sites.

          However, this kind of happiness for me overlooks some black holes in WP itself, if we think about user privacy.

        • Otto, I think that James is pointing the finger in the wrong direction, but basically he is right, as no one ever communicates the privacy challenges that come from operating WordPress.

          I 100% trust your words about what you do with the data, but you are unlikely to do the same thing for the rest of your life, so just trusting one person is not good enough. In addition, wordpress.org is a golden hacking target. You know that you are deleting files every two days, but you don’t know what a hacker who copies them (hopefully that doesn’t actually happen) will do with them.

          Anything that might impact privacy should be opt-in, not opt-out. WordPress core can use whatever scare tactics they want to make people opt in, but it should be the decision of the site admin. Yes, there are plugins that do that, but if the admin isn’t really aware of the implications of automatic updates, he is not likely to use them.

    • Oh, as for the plugin and theme checks sending headers, hey, I want to change that too. My idea was to give every plugin and theme in our system a unique ID (I was thinking UUIDs, to be precise). This UUID could be in the header of the plugin, and it would uniquely identify that plugin. That would eliminate all confusion and solve quite a lot of pain for me. I didn’t write the current update check system. If I had, it would be different. :)

  7. Wonderful discussion, guys. I would like to thank everyone for asking such questions, thank you very much, Otto, for taking the time to answer, and thanks to you, Jeff, as well, for bringing this up.

    On the idea of UUIDs for themes and plugins, I think it would be a nice initiative, but it will be quite hard to implement since a lot of plugins in the repo won’t adopt it for a long time. This brings us to the idea of hashes proposed by James, which should be easier to implement, or to the idea of a tag for “private” plugins/themes that do not need, or should not be, checked for updates; this one too could be implemented more or less easily.
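
As an editorial illustration of the two ideas raised in the comments above (a per-plugin UUID in the header, and a hash in place of full header data), here is a minimal Python sketch. Nothing in it reflects the real update-check format: the `UUID` header, the payload layout, and both function names are hypothetical, and the dictionary keys are only stand-ins for plugin header fields.

```python
import hashlib
import json
import uuid


def header_fingerprint(headers):
    """Return a SHA-256 hex digest of a plugin's header fields.

    Keys are sorted so the same headers always hash to the same value.
    """
    canonical = json.dumps(headers, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_payload(plugins):
    """Build a compact update-check payload (hypothetical format).

    A plugin that carries a "UUID" header is identified by it alone; any
    other plugin is represented by a hash instead of its full headers.
    """
    payload = []
    for headers in plugins:
        entry = {"version": headers.get("Version", "")}
        if "UUID" in headers:
            entry["uuid"] = headers["UUID"]
        else:
            entry["hash"] = header_fingerprint(headers)
        payload.append(entry)
    return payload


plugins = [
    {"Name": "Example Plugin", "Version": "1.2", "UUID": str(uuid.uuid4())},
    {"Name": "In-House Tool", "Version": "0.1", "Description": "Private."},
]
payload = build_payload(plugins)
```

In this sketch the description of the in-house plugin never leaves the site; only a digest does. A hash can only be matched against header data the server already knows, so a self-written plugin would simply go unrecognized, which is arguably the desired outcome for private code.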
