17 Comments


  1. wget will not pull information from a WordPress site fully secured by iThemes Security. Attempts to pull information from sites with iThemes or old installs of Better WP Security return a 403 Forbidden error. Our sites using Wordfence do not seem protected from wget, but I have not looked into why that might be. This is a great solution for grabbing files and has come in handy countless times for our group, but if the target site uses iThemes Security fully, it’s a no-go.

    1. Good point about the WordPress security plugins and how some may prevent wget from working properly. It’s also possible the robots.txt file is set up to disallow wget. Even though there is a way to bypass the robots.txt file, it should always be respected.

    2. Andrew Sun

      I believe iThemes Security does not like the wget user agent. A way to get around this is to add --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36" (or some other user agent) to the end of the command.
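
      For context, a full mirroring command with a spoofed user agent might look roughly like this (example.com is just a placeholder for a site you have permission to archive):

        # Mirror the site while identifying as a desktop Chrome browser rather than wget.
        wget --mirror --convert-links --adjust-extension --page-requisites \
             --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36" \
             https://example.com/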

  2. This is fascinating. Does that mean that I can now find any WordPress site I wish and archive it, regardless of whether it is mine or whether its owner would allow it? Does that mean I can effectively ‘scrape’ any WordPress site and create my own mirror of it elsewhere?

    Or, to put it another way: can I steal someone else’s content?

  3. You can bypass robots.txt and nofollow links by using wget -e robots=off; the user-agent argument by itself is not sufficient to bypass robots.txt.
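
    If you do have a legitimate reason to ignore it (archiving your own site, for example), the flag slots into the same kind of mirroring command; example.com is again just a placeholder:

      # Mirror your own site, telling wget not to honor robots.txt,
      # and wait a second between requests so the server isn't hammered.
      wget -e robots=off --wait=1 --mirror --convert-links --page-requisites https://example.com/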

    1. Thanks for the tip, although people should rarely be doing things that ignore the robots.txt file.

  4. Sorry you had trouble with HTTrack. I use the Linux version occasionally to put a static copy of our accreditation web documents on a thumb drive when we go through accreditation reviews. Yes, you read that correctly; make a copy, put it on a thumb drive, and mail it to the person who reviews it for accuracy. But HTTrack works well once you fiddle with the settings for a couple of days. (You can suck down large swaths of the internet once you get it working.) :/

    1. I successfully used HTTrack on several occasions to copy sites that amounted to several GBs. The program was very easy to use with default settings; I’m not sure what went wrong for the author. :-/

      I was using it on Windows 8.

      1. It may have had something to do with the theme we are using or perhaps the link depth settings. Using the default settings, all I got was an index.html page, which was nothing more than what you see on the front page of the site. wget, on the other hand, was able to crawl each post/page and create an .html file for each of them. I tried to look for support or see if someone else had the same issue, but I couldn’t find anything.

    2. I just didn’t have the patience to learn and fiddle with the settings. Also, the interface was like something out of the DOS days, with the grey boxes and tiny text. When I was researching this post, HTTrack was definitely one of the most recommended tools for the task. I wonder if a blackhat SEO scraper tool would have been easier to use?

        1. OK, so maybe the DOS days is too far back. Maybe Windows 95 prompts. Just ugly grey boxes with a tiny black font. That video has a rad soundtrack. Can you do me a favor? Try using the default settings in HTTrack to download WPTavern.com and see if all you get is the index.html file.

          1. I never had a problem using HTTrack. Using default settings, I just took a small snapshot (around 3MB) of wptavern.com without any hiccups. I say small because I manually cancelled the job so as not to abuse the wptavern server. It successfully grabbed the front page and a few of the front-page articles (along with their assets) before I stopped it.
            It actually has sensible defaults for getting a working local mirror of a website. Most of the options become useful once you realize you don’t need everything and start running out of space.

            Perhaps you were confused by the fact that HTTrack creates a single HTML file as the project’s base file. Once you open it, it gives you a list of the target URLs you specified when starting the project, and once you follow one of those links, you get your local copy. The actual HTML files are saved under a subdirectory named after the server’s domain name.

            That said, wget and HTTrack are two very similar yet totally different tools for me. Both get the job done, but I’d use wget in scripts because it’s more generally available, while I’d use HTTrack for bigger, manually triggered jobs where I’d want to fine-tune a lot of options.
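
            For reference, the command-line form of that kind of manually triggered job might look roughly like this (the output directory and filter are placeholders to adjust for your own project):

              # Mirror example.com into ./example-mirror, restricting the crawl to that domain.
              httrack "https://example.com/" -O ./example-mirror "+*.example.com/*"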

  5. jchaven

    HTTrack is great and in a lot of cases better than wget! For example, using HTTrack you can create a “library” of archived sites and save the options used to create those archives. Then you can periodically “refresh” the archive, collecting any new content.

    The home page for HTTrack builds links to all of the archives and is customizable using CSS. I use this to maintain a local copy of our company websites (and others) like our own Internet go-back machine.
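
    If I remember correctly, the command-line version can do a similar refresh: re-run HTTrack from inside an existing project folder so it reuses its cache and only pulls new or changed content (the path below is just an illustration):

      # Update an existing mirror in place, using the project's cache.
      cd ~/archives/company-site
      httrack --update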

    I like wget and use it often, but I think you missed the boat on HTTrack.

    Good article and it is a topic that more people will find useful in the near future – I think.

    Cheers,
    Charles

  6. I am not a command line kind of guy. Has anyone tried Site Orbiter (Mac) or Xenu (Win)?

    I use Site Orbiter all the time to gather information about client sites before I start a content strategy engagement.

    When doing a site conversion to WP I have sucked down an entire site (pages, images, css, etc.), saved the pages to html and then used an HTML to WP plugin to pull the pages into the WP install. That has saved me and my clients tons of time. There is still a lot of formatting to do but it’s better than retyping everything.

    I haven’t really used Xenu, but I have tried it. I don’t remember if it sucks pages down, but it does provide metadata information about a site.

    On the Mac I have also used Site Sucker.

  7. I’ve played around with the Windows/GUI version of HTTrack quite a bit and it works magic if you’re patient enough to try several combinations of settings. One key trick is to “Get HTML first” and “Ignore robots.txt” (use responsibly), and then structure based on filenames, so you get /html, /jpg, /pdf folders and just need to go to the /html folder to get to specific pages easily. I’m pretty sure I can pull WPTavern and show you how, Jeff. Let me know if you’re interested, and I can PM you my findings on the settings that work. And no, I won’t do anything with the mirror/copy. :-)
