How To Archive A Site You Don’t Have Access To

There are several options when it comes to backing up a WordPress site. Depending on the type of access you have, retrieving the database or an XML backup is easy. But what if you don’t have access to the database or the backend? Consider the following scenario presented on the WordPress subreddit: A relative who used WordPress recently passed away and you have no way to access the backend of their site. Their site is filled with memorable posts you’d like to archive.

One option is to use WGET. WGET is a free, open source software package for retrieving files over HTTP, HTTPS and FTP, the most widely used Internet protocols. I used version 1.10.2 of this WGET package put together by HHerold, which worked on my Windows 7 64-bit desktop. Once it is installed, open the command prompt and navigate to the folder where WGET.exe is installed.

WGET In Action

In the example below, I used four different parameters. GNU.org has an excellent guide that explains what each parameter does. Alternatively, you can use the wget --help command to see a list of options.

HTML Extension (--html-extension) – This appends the .html extension to retrieved HTML pages whose URLs do not already end in it.
Convert Links (--convert-links) – After the download is complete, this converts the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
-m – This turns on the options suitable for mirroring, such as recursion and time-stamping.
-w 20 – This option waits 20 seconds between retrievals so it doesn’t hammer the server with traffic.

wget --html-extension --convert-links -m -w 20 http://example.com

Using this command, each post and page will be saved as an HTML file. The site will be mirrored and links will be converted so I can browse them locally. The last parameter places a 20-second pause between retrievals to help prevent overloading the web server.
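
If you plan to run the command more than once, it can help to wrap it in a small shell function. The following is only a sketch, assuming a POSIX shell (for example on Linux, macOS, or a Unix-like shell on Windows). The function name mirror_site and the DRY_RUN variable are hypothetical names made up for this example and are not part of WGET. The long options --mirror and --wait are the spelled-out forms of -m and -w. On WGET 1.12 and later, --html-extension still works but is considered deprecated in favor of --adjust-extension.

```shell
#!/bin/sh
# Hypothetical helper for this sketch; mirror_site is not a wget feature.
# Usage: mirror_site URL [WAIT_SECONDS]
# Set DRY_RUN=1 to print the command instead of running it.
mirror_site() {
    url="$1"
    wait_seconds="${2:-20}"   # pause between retrievals, defaults to 20 as in -w 20

    # Rebuild the positional parameters as the full wget command line.
    # --mirror is the long form of -m, --wait is the long form of -w.
    set -- wget --html-extension --convert-links --mirror --wait="$wait_seconds" "$url"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"             # show what would be executed
    else
        "$@"                  # actually run wget
    fi
}

# Example (dry run, nothing is downloaded):
#   DRY_RUN=1 mirror_site http://example.com
```

Running the dry run first makes it easy to confirm the options before anything is requested from the server. Drop DRY_RUN=1 to start the actual mirror.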

Keep in mind that this saves only the rendered output of each post or page as an HTML file, not the database or the WordPress files behind it. This method should not be used as the primary means of backing up a website.

Results Of The WGET Command

The Command Line Proves Superior To The GUI

A popular alternative to WGET is WinHTTrack. Unfortunately, I couldn’t figure out how to get it to provide me with more than just the index.html of the site. I found WinHTTrack confusing and hard to use. I spent a few hours trying several different programs to archive the output of websites. Most were hard to use or didn’t provide an easy-to-use interface. While I normally don’t use the command line to accomplish a task, it was superior in this case.

Going back to our scenario, it’s entirely possible to archive a site you don’t have access to thanks to WGET and similar tools.

What tools or software do you recommend for archiving the output of a website?

Category: Blogging

Tags: archive, backup, output, wget

18 responses to “How To Archive A Site You Don’t Have Access To”

Scott Bradford says:

April 29, 2014 at 6:39 PM

wget will not pull information from a WordPress site fully secured by iThemes Security. Attempts to pull information from sites with iThemes or old installs of Better WP Security return 403 forbidden. Our sites using Wordfence do not seem protected from wget, but I have not looked into why this might be. This is a great solution to getting files and has come in handy countless times for our group, but if the target site in question uses iThemes Security fully, this is a no go.

- Jeff Chandler says:
  
  April 29, 2014 at 7:02 PM
  
  Good point about the WordPress security plugins and how some may prevent wget from working properly. Also, it’s possible the robots.txt file is set up to disallow wget as well. Even though there is a way to bypass the robots.txt file, it should always be respected.
  
- Andrew Sun says:
  
  April 29, 2014 at 7:51 PM
  
  I believe iThemes security does not like the wget user agent. A way to get around this is to add --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36" (or some other user-agent) to the end of the command.
  
Viktoria Michaelis says:

April 30, 2014 at 1:29 AM

This is fascinating. Does that mean that I can now find any WordPress site I wish and archive it, regardless of whether it is mine or belongs to someone who would allow it? Does that mean I can effectively ‘scrape’ any WordPress site and create my own mirror of it elsewhere?

Or, to put it another way: can I steal someone else’s content?

Dan says:

April 30, 2014 at 6:13 AM

You can bypass robots.txt and no-follow links by using wget -e robots=off; the user agent argument by itself is not sufficient to bypass robots.txt.

- Jeff Chandler says:
  
  April 30, 2014 at 12:38 PM
  
  Thanks for the tip, although people should rarely be doing things that ignore the robots.txt file.
  
Jess Planck says:

April 30, 2014 at 11:09 AM

Sorry you had trouble with HTTrack. I use the Linux version occasionally to put a static copy of our accreditation web documents on a thumb drive when we go through accreditation reviews. Yes, you read that correctly: make a copy, put it on a thumb drive, and mail it to the person who reviews it for accuracy. But HTTrack works well once you fiddle with the settings for a couple of days. (You can suck down large swaths of the internet once you get it working.) :/

- Tomas M. says:
  
  April 30, 2014 at 11:42 AM
  
  I successfully used HTTrack on several occasions to copy sites that amounted to several GBs. The program was very easy to use with default settings, not sure what happened to the author :-/
  
  Was using it on Win 8.
  
  - Jeff Chandler says:
    
    April 30, 2014 at 12:20 PM
    
    It may have had something to do with the theme we are using or perhaps the link depth settings. Using the default settings, all I got was an index.html page which was nothing more than what you see on the front page of the site. WGET on the other hand was able to browse each post/page and create an .html file of them. I tried to look for support or see if someone else had the same issues and I couldn’t find anything.
    
- Jeff Chandler says:
  
  April 30, 2014 at 12:23 PM
  
  I just didn’t have the patience to learn and fiddle with the settings. Also, the interface was like something out of DOS days with the grey boxes and tiny text. When researching for this post, HTTrack was definitely one of the most recommended tools to perform this task. I wonder if a blackhat SEO scraper tool would have been easier to use?
  
  - Tomas M. says:
    
    April 30, 2014 at 12:28 PM
    
    DOS days? Even in this case graphic interface > command line :)
    
    Have you checked this out: https://www.youtube.com/watch?v=P-CuWZ_g0-s
    
    - Jeff Chandler says:
      
      April 30, 2014 at 12:49 PM
      
      Ok, so maybe DOS days is too far back. Maybe Windows 95 Prompts. Just ugly grey boxes with tiny black font. That video has a rad soundtrack. Can you do me a favor? Try using the default settings in HTTrack and download WPTavern.com and see if all you get is the index.html file.
      
      - anastis says:
        
        May 2, 2014 at 8:09 AM
        
        I never had a problem using HTTrack. Using default settings I just took a small snapshot (around 3MB) of wptavern.com without any hiccups. I say small because I manually cancelled the job so as not to abuse the wptavern server. It successfully grabbed the front page and a few of the front page articles (along with their assets) before I stopped it.
        It actually has sensible defaults for getting a working local mirror of a website. Most of the options are useful when you realize you don’t need everything and you run out of space.
        
        Perhaps you were confused by the fact that HTTrack creates a single HTML file as the project’s base file. Once you open it, it gives you a list of the target URLs you specified when you started the project, and once you follow one of those links, you get your local copy. The actual HTML files are saved under a subdirectory named after the server’s domain name.
        
        That said, wget and HTTrack are two very similar yet totally different tools for me. Both get the job done; however, I’d use wget in scripts as it’s more generally available, while I’d use HTTrack for bigger, manually triggered jobs where I’d want to fine-tune a lot of options.
        
jchaven says:

April 30, 2014 at 2:28 PM

HTTrack is great and in a lot of cases better than wget! For example, using HTTrack you can create a “library” of archived sites and save the options used to create those archived sites. Then you can periodically “refresh” the archive, collecting any new content.

The home page for HTTrack builds links to all of the archives and is customizable using CSS. I use this to maintain a local copy of our company websites (and others) like our own Internet go-back machine.

I like wget and use it often, but I think you missed the boat on HTTrack.

Good article and it is a topic that more people will find useful in the near future – I think.

Cheers,
Charles

Mithun John Jacob says:

May 5, 2014 at 2:36 AM

Jeff, did you try http://www.metaproducts.com/mp/offline_explorer.htm

It’s awesome.

Todd O’Neill says:

May 5, 2014 at 7:07 PM

I am not a command line kind of guy. Has anyone tried Site Orbiter (Mac) or Xenu (Win)?

I use Site Orbiter all the time to gather information about client sites before I start a content strategy engagement.

When doing a site conversion to WP I have sucked down an entire site (pages, images, css, etc.), saved the pages to html and then used an HTML to WP plugin to pull the pages into the WP install. That has saved me and my clients tons of time. There is still a lot of formatting to do but it’s better than retyping everything.

I haven’t used Xenu but have tried it. I don’t remember if it sucks pages down but it does provide metadata information about a site.

On the Mac I have also used Site Sucker.

Bowo says:

August 15, 2014 at 11:47 PM

I’ve played around with the Windows / GUI version of HTTrack quite a bit and it works magic if you’re patient enough to try several combinations of settings. One key trick is to “Get HTML first” and “Ignore robots.txt” (use responsibly), and then structure based on filenames, so you get /html, /jpg, /pdf folders, and just need to go to the /html folder to get to specific pages easily. Pretty sure I can pull WPTavern and show you how, Jeff. Let me know if you’re interested, and I can PM you my findings on the settings that work. And no, I won’t do anything with the mirror/copy. :-)

Dan Mitroi says:

November 14, 2014 at 3:56 PM

Hi Jeff,
I am Dan, founder at Darcy Ripper, would love your feedback on our tool. Darcy Ripper is a Free Website Downloader.
We’ve got many features built in.
Thx, Dan
