How To Archive A Site You Don’t Have Access To

There are several options when it comes to backing up a WordPress site. Depending on the type of access you have, retrieving the database or an XML backup is easy. But what if you don’t have access to the database or the backend? Consider the following scenario presented on the WordPress subreddit: A relative who used WordPress recently passed away and you have no way to access the backend of their site. Their site is filled with memorable posts you’d like to archive.

One option is to use WGET. WGET is a free, open source software package for retrieving files over HTTP, HTTPS and FTP, the most widely used Internet protocols. I used version 1.10.2 of this WGET package put together by HHerold, which worked successfully on my Windows 7 64-bit desktop. Once it’s installed, open the command prompt and navigate to the folder where WGET.exe lives.
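
For example, assuming wget.exe landed in C:\Program Files\GnuWin32\bin (a placeholder path; yours will depend on where the installer put it), the commands would look something like this:

cd "C:\Program Files\GnuWin32\bin"
wget --version

Running wget --version first is a quick way to confirm the install before you kick off a download.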

WGET In Action

In the example below, I used four different parameters. GNU.org has an excellent guide available that explains what each parameter does. Alternatively, you can run the wget --help command to see a list of available options.

  • --html-extension – Saves the retrieved files with an .html extension.
  • --convert-links – After the download is complete, converts the links in each document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, and hyperlinks to non-HTML content.
  • -m – Turns on the options suitable for mirroring.
  • -w 20 – Waits 20 seconds between each file retrieved so the download doesn’t hammer the server with traffic.

wget --html-extension --convert-links -m -w 20 http://example.com

Using this command, each post and page is saved as an HTML file. The site is mirrored and its links are converted so I can browse them locally. The last parameter places a 20-second pause between each file retrieved to help prevent overloading the web server.
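
For reference, newer versions of WGET rename --html-extension to --adjust-extension, and -m is simply shorthand for a handful of recursion options, so a roughly equivalent long-form command on a current build would be:

wget --recursive --timestamping --level=inf --no-remove-listing --adjust-extension --convert-links --wait=20 http://example.com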

Keep in mind that this is saving the output of a post or page into an HTML file. This method should not be used as the primary means of backing up a website.

Results Of The WGET Command

The Command Line Proves Superior To The GUI

A popular alternative to WGET is WinHTTrack. Unfortunately, I couldn’t figure out how to get it to retrieve more than the site’s index.html, and I found it confusing and hard to use. I spent a few hours trying several different programs to archive the output of websites, and most were either hard to use or lacked an intuitive interface. While I normally don’t use the command line to accomplish a task, it was superior in this case.

Going back to our scenario, it’s entirely possible to archive a site you don’t have access to thanks to WGET and similar tools.

What tools or software do you recommend for archiving the output of a website?

18 responses to “How To Archive A Site You Don’t Have Access To”

  1. wget will not pull information from a WordPress site fully secured by iThemes Security. Attempts to pull information from sites with iThemes or old installs of Better WP Security return 403 Forbidden. Our sites using Wordfence do not seem protected from wget, but I have not looked into why this might be. This is a great solution for getting files and has come in handy countless times for our group, but if the target site in question uses iThemes Security fully, this is a no-go.

    • Good point about the WordPress security plugins and how some may prevent wget from working properly. Also, it’s possible the robots.txt file is set up to disallow wget as well. Even though there is a way to bypass the robots.txt file, it should always be respected.
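
      For what it’s worth, the bypass usually amounts to adding -e robots=off to the command, something like this:

      wget -e robots=off --html-extension --convert-links -m -w 20 http://example.com

      Again, only do that on a site you have permission to archive.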

    • I believe iThemes security does not like the wget user agent. A way to get around this is to add --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36" (or some other user-agent) to the end of the command.
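
      For example, combined with the flags from the article, the full command would look something like this (any reasonably current browser user-agent string should work):

      wget --html-extension --convert-links -m -w 20 --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36" http://example.com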

  2. This is fascinating. Does that mean that I can now find any WordPress site I wish and archive it, regardless of whether it is mine or belongs to someone who would allow it? Does that mean I can effectively ‘scrape’ any WordPress site and create my own mirror of it elsewhere?

    Or, to put it another way: can I steal someone else’s content?

  3. Sorry you had trouble with HTTrack. I use the Linux version occasionally to put a static copy of our accreditation web documents on a thumb drive when we go through accreditation reviews. Yes, you read that correctly; make a copy, put it on a thumb drive, and mail it to the person who reviews it for accuracy. But httrack works well once you fiddle with the settings for a couple of days. (You can suck down large swaths of the internet once you get it working.) :/
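
    For anyone who prefers the command line, a minimal httrack invocation (the URL and output path here are placeholders) looks something like this:

    httrack "http://example.com" -O /path/to/archive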

      • It may have had something to do with the theme we are using or perhaps the link depth settings. Using the default settings, all I got was an index.html page which was nothing more than what you see on the front page of the site. WGET, on the other hand, was able to crawl each post/page and create an .html file for each of them. I tried to look for support or see if someone else had the same issues, and I couldn’t find anything.

    • I just didn’t have the patience to learn and fiddle with the settings. Also, the interface was like something out of DOS days with the grey boxes and tiny text. When researching for this post, HTTrack was definitely one of the most recommended tools to perform this task. I wonder if a blackhat SEO scraper tool would have been easier to use?

        • Ok, so maybe DOS days is too far back. Maybe Windows 95 Prompts. Just ugly grey boxes with tiny black font. That video has a rad soundtrack. Can you do me a favor? Try using the default settings in HTTrack and download WPTavern.com and see if all you get is the index.html file.

          • I never had a problem using HTTrack. Using default settings I just took a small snapshot (around 3MB) of wptavern.com without any hiccups. I said small, as I manually cancelled the job so as not to abuse the wptavern server. It successfully grabbed the front page and a few of the front page articles (along with their assets) before I stopped it.
            It actually has sensible defaults for getting a working local mirror of a website. Most of the options only become useful once you realize you don’t need everything and start running out of space.

            Perhaps you were confused by the fact that httrack creates a single html file as the project’s base file, which, once you open it, provides you with a list of the target URLs you specified when starting the project; once you follow one of those links, you get your local copy. The actual html files are saved under a subdirectory named after the server’s domain name.

            That said, wget and httrack are two very similar yet totally different tools for me. Both get the job done; however, I’d use wget in scripts as it’s more generally available, while I’d use httrack for bigger, manually triggered jobs where I’d want to fine-tune a lot of options.

  4. HTTrack is great and in a lot of cases better than wget! For example, using HTTrack you can create a “library” of archived sites and save the options used to create those archived sites. Then periodically “refresh” the archive, collecting any new content.
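
    If you script it, the command-line version has a shortcut for that refresh as well; assuming the archive was created with httrack, re-running it from inside the project folder with the --update switch pulls in new content:

    httrack --update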

    The home page for HTTrack builds links to all of the archives and is customizable using CSS. I use this to maintain a local copy of our company websites (and others) like our own Internet go-back machine.

    I like wget and use it often, but I think you missed the boat on HTTrack.

    Good article and it is a topic that more people will find useful in the near future – I think.

    Cheers,
    Charles

  5. I am not a command line kind of guy. Has anyone tried Site Orbiter (Mac) or Xenu (Win)?

    I use Site Orbiter all the time to gather information about client sites before I start a content strategy engagement.

    When doing a site conversion to WP I have sucked down an entire site (pages, images, css, etc.), saved the pages to html and then used an HTML to WP plugin to pull the pages into the WP install. That has saved me and my clients tons of time. There is still a lot of formatting to do but it’s better than retyping everything.

    I haven’t used Xenu much, but I have tried it. I don’t remember if it sucks pages down, but it does provide metadata information about a site.

    On the Mac I have also used Site Sucker.

  6. I’ve played around with the Windows / GUI version of HTTrack quite a bit and it works like magic if you’re patient enough to try several combinations of settings. One key trick is to “Get HTML first” and “Ignore robots.txt” (use responsibly), and then structure based on filenames, so you get /html, /jpg, /pdf folders, and just need to go to the /html folder to get to specific pages easily. Pretty sure I can pull WPTavern and show you how, Jeff. Let me know if you’re interested, and I can PM you my findings on the settings that work. And no, I won’t do anything with the mirror/copy. :-)
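
    For reference, the command-line version exposes roughly the same switches. As a rough sketch (the URL is a placeholder, and the per-extension folder layout is controlled by httrack’s -N build options, so check httrack --help for the exact variant):

    httrack "http://example.com" -O ./archive -p7 -s0

    Here -p7 fetches the HTML files first and -s0 tells httrack to ignore robots.txt, so the same caveat about using it responsibly applies.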
