This week Audit WP launched a new SEO consultancy, headed up by SEO strategist Jacob King. The firm’s first order of business was to publish a post exposing what King perceives to be important SEO and privacy concerns for customers of managed WordPress host WP Engine.
Audit WP uncovered more than 1.5 million WP Engine subdomains indexed in Google. Through scraping, King acquired a massive list of indexed subdomains hosted on WP Engine, in addition to 2,000+ customer emails, which he did not publicize.
I spoke with WP Engine founder Jason Cohen and asked if they consider this to be a privacy concern for their customers. Cohen said that it doesn’t constitute a privacy issue, given that the indexed domains are already public. “Those emails were literally already published on the Internet,” he said. “That is, the reason those were available to be scraped, is that they were already public, and scrape-able, by any person or robot.”
The issue with WP Engine’s staging sites being indexed was corrected some time ago; however, developers will sometimes create other versions of their sites on subdomains without fully understanding that these sites are discoverable via a Google search. In the case of sites hosted with WP Engine, they are very easy to find, given that they all end in *.wpengine.com. When not properly hidden from search engines, sites in progress are exposed.
As demonstrated by one particularly high profile example, the staging site for Harvard Law Review was public. If you’re curious about what the next iteration might look like, King posted a screenshot in his post. Developers of that site have since made it private, but unfortunately, while a fresh WordPress install sat exposed in the meantime, someone got in and created an explicit website, ostensibly to prove a point. I captured a screenshot before it was taken down.
This illustrates the very real danger of leaving your subdomains public while they are still works in progress. Ultimately, hiding these sites from Google is the responsibility of the developer, but Cohen said that WP Engine will be adopting some of the suggestions, as he mentioned in his comment on the post:
However, your suggestion that it’s better to 301 that domain is still *also* very valid. Also, not all search engines are aware of this scenario, and thus one of the take-aways we have from your article is that we should auto-force robots.txt for the XYZ.wpengine.com domains just as we do for the staging domains
The Issue of Duplicate Content
You don’t have to be very technical to know that search engines regard duplicate content as a cardinal sin and will swiftly penalize you for it. Although WP Engine forces a “deny robots everything” in the robots.txt file on staging, not all search engines will respect this.
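A “deny robots everything” file is only two lines. As a minimal sketch of the kind of robots.txt WP Engine serves on its staging domains (the exact file WP Engine uses may differ), it looks like this:

```
User-agent: *
Disallow: /
```

The asterisk addresses all crawlers, and disallowing `/` covers the entire site. As the article notes, though, this is a request rather than an enforcement mechanism: compliant crawlers like Googlebot honor it, but nothing technically prevents a misbehaving bot from crawling the site anyway.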
However, Jason Cohen is hanging his hat on personal assurances he received from Matt Cutts regarding the practice:
Google maintains a set of root domains that they know are companies that do exactly what we and many other hosting companies do. Included in that list are WordPress.com, SquareSpace, and us. When they detect “duplicate content” on subdomains from that list, they know that’s not actually duplicate content. You can see it in Google Search, but it’s not counted against you.
We have had a dialog directly with Matt Cutts on this point, so this is not conjecture, but fact.
SEO is oftentimes a moving target and there are varying and contradictory opinions about many of the fundamentals, including issues such as duplicate content.
King stated that he doesn’t trust any information from Matt Cutts to be “fact.” He added that Audit WP doesn’t think it’s wise to let Google determine which site is the duplicate, which is why the post offers step-by-step instructions to help WP Engine customers set up the proper subdomain redirects to prevent indexing.
A Long-Standing Issue with Subdomains
The post on Audit WP sent WP Engine into damage control mode, and the company launched a massive Twitter campaign of replies disputing the main points of the article. So far, it has not tweeted any acknowledgement of the suggestions it will be implementing.
I spoke with Audit WP’s founder Jacob King to find out his motivation for publicizing his findings on the 1.5 million indexed subdomains.
“Well, we were a bit floored upon the initial discovery,” King said. “As someone with web scraping experience, the ease of access to WP Engine user names, emails, and other information troubled me greatly.”
He also felt that it was important to go public with the information given his previous interactions with WP Engine support:
I was hosting my personal blog there so I have a good amount of experience with the system. One big issue was my actual monthly human traffic being massively different from the traffic stats WP Engine was recording. A very large difference, analytics showing ~25k monthly visits, yet WP engine was showing well over 100k. I brought it up on Facebook and one day on Twitter as well, I was told it’s from Bot traffic and search engine spiders.
When King commented that he had blocked all common bots and asked for further explanation of how his bot traffic could be more than five times his human visitors, he received no reply.
“We never discussed anything specific to the indexation issue,” King said. “I went with my gut which told me they wouldn’t give me the time of day if we didn’t make it public.”
However, Audit WP is not the first to publicize problems with WP Engine’s subdomains. Although WordPress SEO expert Joost de Valk chimed in on the post to condemn King’s public handling of the situation, he tweeted last April regarding what he perceived to be an SEO mistake by WP Engine:
WP Engine co-founder Ben Metcalfe responded in a blog post at that time, clarifying that clients can redirect traffic arriving at the WP Engine subdomain to the primary domain via the .htaccess file or the client portal. He also clarified why this is not done by default: “We don’t do this by default because it would then prevent us accessing the site via the sub-domain during a support call/etc, should the DNS on the primary domain fail.”
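For customers who go the .htaccess route, a redirect of this kind takes only a few mod_rewrite lines. The following is a hedged sketch, with placeholder domain names, of what such a rule might look like; the rules WP Engine’s client portal generates may differ:

```apache
# Send any request arriving on the example.wpengine.com subdomain
# to the same path on the primary domain, as a permanent (301) redirect.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.wpengine\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```

The `RewriteCond` restricts the rule to requests whose Host header is the WP Engine subdomain, so traffic arriving on the primary domain is untouched. This is also why, per Metcalfe’s explanation, the redirect can’t be forced on by default: it would cut off subdomain access during support calls if the primary domain’s DNS failed.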
Joost de Valk replied:
If it were truly that common I wouldn’t have tweeted it. It’s an issue, it’s something you’re aware of because there’s a setting for it, but it’s something you could & should prevent from happening altogether. That’s what the “managed” in managed hosting stands for in my eyes.
WP Site Care founder Ryan Sullivan reiterated that this issue has been a customer concern for quite some time:
I’ve brought this issue up to WP Engine on several different occasions through several different channels. In fact, it was Rob who pointed the issue out to me early last year, and I’ve never once been given a clear answer on plans to solve it. And with the number of sites we host on WP Engine (I give them a lot of money every month), this is a legitimate concern that should have been addressed a long time ago.
The post on Audit WP brought attention to an issue that was originally discussed in April of 2013 without a satisfactory response from WP Engine. Metcalfe’s post addressing the issue concluded by stating that it’s a common practice for all web hosts that don’t use VirtualHosts.
Forthcoming Changes at WP Engine
When I spoke with Jason Cohen, he confirmed that the post has spurred them to make some changes at WP Engine.
“First, we’re making a change to /robots.txt when served from our canonical domain (i.e. the ABC.wpengine.com style domains) so that spiders won’t index or crawl those domains.” Although it is WP Engine’s official position that there is no SEO penalty for what they are already doing, the company has decided that it’s a good idea to make this change anyway.
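Customers who want to verify that a deny-all robots.txt actually blocks crawlers from a given URL can check it with Python’s standard-library robots.txt parser. This sketch feeds the parser a deny-all file directly rather than fetching it over the network; the file contents and URLs are illustrative, not WP Engine’s actual files:

```python
from urllib.robotparser import RobotFileParser

# A deny-all robots.txt, like the kind served on staging domains.
# (Illustrative content; the exact file a host serves may differ.)
DENY_ALL = """\
User-agent: *
Disallow: /
"""

def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text blocks `agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)

print(is_blocked(DENY_ALL, "https://example.wpengine.com/"))  # → True
```

To check a live site instead, you could read the robots.txt body from `https://<subdomain>/robots.txt` and pass it to the same function. Keep in mind this only confirms what compliant crawlers are asked to do, not what every bot will actually do.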
“Second, the point was made that our customers ought to 301-redirect their ABC.wpengine.com domains to their proper domains,” Cohen said. “We agree that’s a best-practice. While that’s trivial to do in our User Portal, we do NOT do a good job TELLING customers that this is a good idea.”
WP Engine is also considering a more push-button approach to suggest that developers make their staging sites private via a plugin of some kind. They are in the process of creating a “Best Practices” document which they will include in their public knowledgebase. “We are going to proactively link to it inside our User Portal for all customers to see,” Cohen said. The article will include suggestions with screenshots so that customers will be better-informed.
Cohen said that if they make any sweeping product changes, they will email customers. However, they do not have plans to proactively email the owners of the 1.5 million subdomains that have been indexed by Google, given that some of those are intentionally public.
If your site is among the list found by Audit WP and your subdomain isn’t meant to be public, you may want to double-check your robots.txt file and set up the proper 301 redirects. Ultimately, no matter how many conveniences a managed host provides its customers, the responsibility for any subdomains falls to the site owner.
My biggest concern is how many of those subdomains with fresh, ready to install copies of WordPress exist. I could potentially scrape based on those subdomains that are indexed, locate what I’m guessing would be hundreds to thousands of that same scenario, and go ballistic.
Something certainly needs to be changed here. A staging area is good, but exposing all of those is a serious potential security risk.