Exploring The Idea Of An Internet Archive Specifically For WordPress Content

It seems like each time a WordPress podcast disappears, one or more take its place. A few weeks ago, the WP Bacon podcast announced the end of their show to concentrate on other projects. However, a recent search in iTunes for WordPress podcasts shows there is an almost endless amount of content to listen to.

Variety of WordPress Podcasts To Listen To On iTunes

Although websites can be archived and preserved by the Internet Archive web crawler, podcasts don’t have that luxury since they are audio files. It’s disappointing knowing that some WordPress podcasts will be lost to the ether, never to be heard from again. It’s an even harder pill to swallow if the podcast has 50-100 episodes. It would be great if there were a resource on WordPress.org that acted as a digital archive of WordPress history for text, video, and audio: an enhanced version of the Internet Archive, but specifically for WordPress.

Results For WordPress.org In The Wayback Machine

Make Sure Your Site Is Not Blocking The Internet Archive Web Crawler

The Internet Archive uses web crawlers or spiders to automatically scan and download websites. You can manually trigger the spiders to crawl your site by searching for it using the Wayback Machine. If the site is already indexed, you’ll see a list of results. If not, the Internet Archive will attempt to crawl the site and display the results within six months.

It generally takes 6 months or more (up to 24 months) for pages to appear in the Wayback Machine after they are collected, because of delays in transferring material to long-term storage and indexing, or the requirements of our collection partners.
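
To check from a script whether a site already has a snapshot, one option is the Wayback Machine availability endpoint at https://archive.org/wayback/available. The sketch below is a minimal example, not an official client: the helper names are made up for illustration, the response shape is assumed from the Internet Archive’s public description of that endpoint, and the sample payload is invented so the parsing can be tried without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site_url):
    """Return the availability query URL for the given site."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": site_url})


def closest_snapshot(payload):
    """Return the URL of the closest archived snapshot, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_site(site_url):
    """Query the endpoint over the network and return the snapshot URL, if any."""
    with urlopen(build_availability_url(site_url), timeout=10) as response:
        return closest_snapshot(json.load(response))


# Invented sample payload in the assumed response shape (not real data).
SAMPLE_PAYLOAD = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20140101000000/http://wordpress.org/",
            "timestamp": "20140101000000",
            "status": "200",
        }
    }
}

print(build_availability_url("wordpress.org"))
print(closest_snapshot(SAMPLE_PAYLOAD))
```

Calling check_site("wordpress.org") performs the real lookup. A result of None would suggest the site has no snapshot yet.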

A robots.txt file at the top-level of a domain is enough to block the Internet Archive from crawling the site, so please don’t use it. The Archive Team explains the history of robots.txt and why it’s dangerous to preserving the web.
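
If you are not sure whether your own robots.txt shuts the archive out, you can test the rules locally. The following is a rough sketch using Python’s standard urllib.robotparser module. It assumes the Internet Archive’s crawler identifies itself as ia_archiver, which is an assumption on my part rather than something stated above, and both sets of rules are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the user agent name the archive crawler matches in robots.txt.
ARCHIVE_USER_AGENT = "ia_archiver"


def is_crawler_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules that block the archive crawler from the whole site.
BLOCKING_RULES = """\
User-agent: ia_archiver
Disallow: /
"""

# Example rules that only keep crawlers out of the admin area.
OPEN_RULES = """\
User-agent: *
Disallow: /wp-admin/
"""

episode_url = "https://example.com/podcast/episode-1/"
print(is_crawler_allowed(BLOCKING_RULES, ARCHIVE_USER_AGENT, episode_url))
print(is_crawler_allowed(OPEN_RULES, ARCHIVE_USER_AGENT, episode_url))
```

To test a live site, paste the contents of its robots.txt into the first argument and use one of your own post URLs.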

Robots
photo credit: gruntzookicc

How To Upload Audio To The Internet Archive

In order to upload audio to the Internet Archive, you’ll need to register for an account to obtain a virtual library card. Once you’ve registered and activated your account, browse to https://archive.org/upload/. This is the submission form you’ll use to upload audio to the Internet Archive. Select the audio file or drag it onto the screen to begin the process.

With the audio file selected, you’ll need to fill in additional details such as the description, subject tags, date the work was created, etc. Please be as detailed and descriptive as possible. This is where publishing decent show notes helps as you can just copy and paste the relevant material into the submission form.

One thing you’ll want to pay particular attention to is the license. If the work is not considered to be in the public domain, CC0 is the least restrictive license. While you can choose to be more restrictive, I recommend choosing the least restrictive license possible to remove doubt about how the content can be reused. As an example, I uploaded episode 154 of WordPress Weekly.
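
For anyone with dozens of episodes to submit, the same details can be supplied from a script instead of the web form. The sketch below assumes the third-party internetarchive Python package is installed and configured with your account credentials. The helper names, the item identifier, and the placeholder description and date are all hypothetical, and the metadata keys are my best understanding of the fields the form collects.

```python
# CC0 public domain dedication, the least restrictive option discussed above.
CC0_LICENSE_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def build_episode_metadata(title, description, subjects, date, creator):
    """Collect the details the upload form asks for into one dictionary."""
    return {
        "mediatype": "audio",
        "title": title,
        "description": description,  # show notes can be pasted in here
        "subject": list(subjects),   # subject tags
        "date": date,                # date the work was created, YYYY-MM-DD
        "creator": creator,
        "licenseurl": CC0_LICENSE_URL,
    }


def upload_episode(identifier, audio_path, metadata):
    """Upload one audio file; needs the internetarchive package and credentials."""
    # Imported lazily so the metadata helper works without the package.
    from internetarchive import upload
    return upload(identifier, files=[audio_path], metadata=metadata)


metadata = build_episode_metadata(
    title="WordPress Weekly Episode 154",
    description="Placeholder show notes for the episode.",
    subjects=["WordPress", "podcast"],
    date="2014-01-01",  # placeholder, not the real air date
    creator="WP Tavern",
)
print(metadata["licenseurl"])
```

An upload would then look like upload_episode("example-podcast-episode-154", "episode-154.mp3", metadata), where both the identifier and the file name are made up.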

The Internet Archive Audio Upload Form

Once the upload process is complete, the Internet Archive creates a page dedicated to the piece of audio content. From this page, visitors can read information and listen to the uploaded audio file. I also searched the audio section of the Internet Archive for WordPress Weekly and was able to locate Episode 154 of the show.

Internet Archive Search Results For WordPress Weekly Audio

If you’ve produced 25 or more episodes of a WordPress podcast and have decided to call it quits, could you please consider uploading the shows to the Internet Archive? I realize it’s manual labor and takes time, but at least your hard work of preparing for each show and the information discussed will not go to waste!

Uploading Video To The Internet Archive

Although the Internet Archive has a section devoted to video content, you’re required to have the source files for upload. These are not only larger but also require more time and labor to obtain. I doubt YouTube.com is going anywhere anytime soon, but if you want your WordPress-centric videos to be archived, this is where you’d upload them.

Why Archiving WordPress Information Is Important To Me

I think of WP Tavern as a site with a continuous mission of documenting what’s happening within the WordPress ecosystem. Our job is never completed and I value the archived content as if it were gold. When I read posts from the archive, I’m reminded of how many projects have come and gone over the past few years. It doesn’t matter if it’s text, audio, or video: each piece of content about WordPress, whether it’s published on WP Tavern or not, is important, especially when looking at the big picture.

My hope is that websites that write about WordPress on a routine basis do their best to archive content, even if they decide to shut down. For example, if WPCandy disappears from the web, a large gaping hole of WordPress history will go with it. During the height of WPCandy’s success, I spent time away from WP Tavern. The Tavern doesn’t have any relevant content from that time period. When piecing together stories to make sense of decisions and trends, historical content is important. Once those holes are created, it’s nearly impossible to fill them.

A lot has happened since the birth of WordPress over 10 years ago. Much of WordPress’ earlier history is documented fairly well but the events and milestones between the beginning and the present are spread throughout many sites in text, video, and audio. As someone who writes about WordPress for a living, it’s important that as much WordPress history as possible is archived. It sucks to view an article about WordPress with a bunch of potentially relevant information to a recent topic of discussion only to discover a 404 error.

How important is it to you that there is a proper archive of historical content related to WordPress and it being available to the public? Is the Internet Archive good enough or would you like to see something catered specifically to WordPress?

20 Comments


  1. This is an excellent point and an issue that is pretty pervasive, both for hosted and self-hosted blogs.

    I don’t know what the answer is. I think hoping that people will back up their audio / video is unlikely. But archive.org is pretty incredible, and I also like that a project BDFL like Matt has acquired some WordPress properties for the purpose of keeping them online as well.

    I know Siobhan McKeown has struggled with some of this issue quite a bit while she’s been working on her book. She’d be a really good person to ask about it.



    1. Which reminds me, gotta get her on the show. At least the WP Daily archives were saved through TorqueMag but you’re right, manual archiving for audio and video just doesn’t cut it. And yeah, Weblogtoolscollection.com is a property Matt acquired with at least one of the primary purposes being to archive the site.



  2. I think it is an extremely important thing to do. There’s a lot of useful information and history in various WordPress sites, and letting them disappear permanently would be very sad.

    Speaking of old content, it would be nice if this page reappeared, even if just as a read-only archive … https://wptavern.com/forum



    1. The irony is not lost on me, knowing that page is missing from the web as I wrote this post. I can always count on you to bring this stuff up :). I have the content archived, just not anywhere that is available to the public. I want to bring it back somehow, either as part of the site again or at the very least, a public archive in read-only mode.



      1. I’ve done a similar thing myself. I have an old forum which I accidentally let drop offline (I let the domain expire), but I still have the old database stashed away for the rainy day when I can be bothered turfing it back online for historical purposes :)



  3. Do you know what happened with WPCandy? There have been no new posts in a year. A few months ago I asked one of their editors on Twitter and she told me they would come back soon.



    1. The site disappeared. The owner never responded to questions about it, then some of the posts reappeared (but not all), then a couple of new posts were made, then it just died.



      1. Quite interesting, but not a unique case in the online world.



      2. …really hope it does come back though, I for one used to really enjoy reading WPCandy…. Ryan? Listening?



  4. Another option is for podcasters to post transcripts for their shows. Those transcripts would be indexed and archived. Same thing is true for video. Add the transcript and the content will be archived.



    1. I’m going to look into this more. When I think of transcripts for podcast, I always equate that with having to spend money. I’ll head over to your site shortly as I know you’ve written about this topic a few times before.



      1. Check out fanscribed.com – they have a good thing going already. The transcription work is crowdsourced.



  5. Yeah, I’d love to have an archive of all the podcasts I’ve been on… from the WordPress Podcast with Charles Stricklin, the TechCanuck Podcast with James Cogan, PerfCast with Jeff, and of course WP Weekly… I think the ones with Charles would be the hardest to find/get at this point.



    1. Well, all 29 episodes of Perfcast are available for you to download and archive yourself here, http://www.talkshoe.com/talkshoe/web/talkCast.jsp?masterId=24073&cmd=tc

      I don’t know which episode you started to host the show with me, I bet I wrote about it. But archived episodes can be downloaded here. http://www.talkshoe.com/talkshoe/web/talkCast.jsp?masterId=34224&cmd=tc

      I also have most of them on an external hard drive. All newer episodes are hosted on the same site as the Tavern.

      I brought up the issue of the WordPress community podcast being archived to Joost De Valk a long time ago and he said he would keep them available online. All of the newest episodes are on WebmasterRadio.FM http://www2.webmasterradio.fm/wordpress-community-podcast/ but I don’t know exactly where all of the very old episodes of the show are. They may still be on the wp-community domain but that just shows a login form now.

      For the archive’s sake, Joost De Valk’s original Press This podcast episodes can be found here. http://www2.webmasterradio.fm/press-this/



  6. Archive.org is fantastic for text – I’ve used it extensively in my research. Unfortunately podcasts don’t archive so well :( I’ve been trying to get this podcast: https://web.archive.org/web/20091210030233/http://bitwiremedia.com/wordcast/wordcast-special-edition-live-may-12th-at-6pm-eastern/ but having no luck. I’ve been in touch with one of the publishers of the podcast and he doesn’t have it any more.

    Here’s a good one that I did find via archive.org though: https://web.archive.org/web/20080427183149/http://www.revolutionizeyourblog.com/askthanks.php



    1. Well that sucks, I guess Dave Moyer is hard to get in touch with these days or maybe he didn’t archive them. Perhaps Lorelle VanFossen has a copy?



  7. I started podcasting just two months ago, using the Internet Archive to store and host my 100% Royalty-Free “Eclectic Music” Podcast, Amateur Zen. I was informed by some web how-tos (I’ll try to find and credit those sources) that one way to podcast for zero cost (as in beer) is to use the ‘Internet Archive – Feedburner – WordPress’ triumvirate. I’ve been wholly satisfied with this method, and my handful of listeners have too. Although the learning curve is steep-ish, submitting audio content to the archive is extremely easy and reliable. I’ve come across plenty of podcasts hosted there in their comprehensive 100+ “episode” glory (Note: Creating playlists of episodical ‘casts in sequence MAY entice enforcement of payment of fees to ‘Pro Audio’).

    The issue tackled in this article truly doesn’t compute with me. As long as podcasters are independent creators, they can’t expect a free automated platform to pick up where their own laziness or lack of spare time leave off. WordPress is certainly not that platform. The various podcast plug-ins don’t bother to help producers get their content submitted to archive.org either. What do they do? I’m getting snarky so..’nuff for now.

    If a podcaster intends for his/her works to be archived, that is their own responsibility – and hopefully a responsibility shared with one’s eager-to-contribute audience. If podcasters who are unable to do this make up a significant portion of the WP user-base then I’d work on a solution prioritizing the creation of a simple automation script by WP. A script automating the upload of podcasts containing a given tag (‘archive’, perhaps?) to archive.org under an auto-generated account there. Perhaps we put a limit of X gigabytes on this process, after which a podcaster must go and actually interact with archive.org manually to prove his/her interest/sentience. I’m no developer, so I’ll stop there.

    Suffice it to say, this is a solution in search of a problem.



  8. Great post and I love anything that raises awareness about the free resources provided by archive.org. Some of the information is inaccurate though: Wayback Machine crawls do include MP3s or any other file types, provided they are served up by normal HTTP or HTTPS download links and not hidden behind flash players or MediaFire-type download sites. As with any Wayback Machine content it can be hit and miss as to whether any particular link gets archived, although they have been improving over time. I highly recommend just using the archive.org file area to host the MP3s in the first place – free bandwidth!

    Submitting your site to be archived is another very useful feature, however it will only archive the specific URL you give it and does not actually trigger a crawl of the entire site. The six months blurb you quote was in reference to the previous situation where URLs archived by Wayback would not go live until months later when the indexing and such had been completed. This is no longer the case and archived URLs are now available within seconds.

    If you have a full site that needs crawling you can bring it to the attention of the Archive Team (http://archiveteam.org/) and we have tools that can make that happen and import the results into Wayback.

