How to Scrape an Entire Wordpress Blog

Jul 7, 2008

When I find a new blog that I really like or that has some good information, I usually want to read the whole thing, and in the past I've been frustrated by how annoying that is to do online. I just want all the posts in one document, free of ads, that I can read from start to finish.

I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.

Strategy

Since the majority of blogs I read are written in WordPress, it's pretty easy to take advantage of the GET string that can be used to access each post. The GET string looks like http://www.devtrench.com/?p=2. Normally, post IDs start at 1 and auto-increment from there, so all we have to do is figure out the ID of the latest post and write some code that runs a scraper in a loop that many times. Sometimes the latest post ID is hidden somewhere in the HTML of the post, and other times you just have to guess at it for a while until you figure it out.
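
As a rough sketch of that strategy, the snippet below pulls the highest post ID out of a chunk of HTML and builds the list of ?p=N addresses to request. It assumes the theme exposes IDs as "?p=123" links or id="post-123" attributes, which is common but not guaranteed. The sample HTML and the helper names highest_post_id and post_urls are made up for illustration.

```perl
use strict;
use warnings;

# Return the largest post ID mentioned in a chunk of HTML, or undef if none.
# Looks for "?p=123" links and "post-123" ids; both are assumptions
# about how a given theme marks up its posts.
sub highest_post_id {
    my ($html) = @_;
    my $max;
    while ($html =~ /(?:\?p=|\bpost-)(\d+)/g) {
        $max = $1 if !defined $max || $1 > $max;
    }
    return $max;
}

# Build the list of ?p=N URLs to request, from 1 up to the highest ID.
sub post_urls {
    my ($base, $highest) = @_;
    return map { "$base/?p=$_" } 1 .. $highest;
}

# Hypothetical front-page fragment, for illustration only.
my $sample = '<div id="post-12"><a href="http://www.devtrench.com/?p=12">New</a></div>'
           . '<div id="post-9"><a href="http://www.devtrench.com/?p=9">Old</a></div>';

my $highest = highest_post_id($sample);
my @urls    = post_urls('http://www.devtrench.com', $highest);

print "Highest ID found: $highest\n";
print "First URL: $urls[0]\n";
print "Last URL:  $urls[-1]\n";
```

IDs of deleted or unpublished posts will still be in the list, so the fetching code has to cope with 404 responses.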

Code

Here's the Perl code that I'm currently using to do this. It's meant to be run from the command line, and I output the data to a file using '> output.txt' after the command. Also, I ripped off most of this code from PLEAC-Perl, which is an online Perl cookbook. Cookbooks can really accelerate learning a language when you want to do specific tasks like this.

Hopefully this code is commented enough for you to understand what is going on.

PERL:
#!/usr/bin/perl
# http://www.devtrench.com - How to Scrape a Wordpress Blog
use strict;
use warnings;

# These packages are required, so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern
my $raw_url       = shift; # should be the domain with http:// and no trailing slash
my $post_num      = shift; # this is the highest id number that you can find in the blog
my $start_pattern = shift; # a start pattern to match. we don't want the entire page so find some unique text before the blog post. it's best to put the pattern in single quotes
my $end_pattern   = shift; # some unique text that signifies the end of a blog post
die "Usage: perl fetch_blog.pl url highest_post_id start_pattern end_pattern\n"
  unless defined $end_pattern;
my $url = URI::Heuristic::uf_urlstr($raw_url);

$| = 1; # turn off output buffering so each post is written as soon as it is fetched

# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there

# now loop through all posts
for my $i (1 .. $post_num) {
  # here we request each post from the site using the get string common to wordpress blogs
  my $req = HTTP::Request->new(GET => "$url/?p=$i");
  $req->referer("http://www.devtrench.com"); # perplex the log analysers
  my $response = $ua->request($req);

  if ($response->is_error) {
     # uh oh there was an error, probably a 404 which is caused by deleted or non published posts
     printf "<p>ERROR POST #%d: %s</p><hr>", $i, $response->status_line;
  } else {
     my $content = $response->content;
     # this is the regular expression that will match the start and end pattern;
     # \Q...\E makes the patterns match literally, and /s lets . match newlines.
     # if the patterns are not found, the whole page is kept
     if ($content =~ /\Q$start_pattern\E(.*)\Q$end_pattern\E/s) {
       $content = $1;
     }
     # this line prints out what the regex found as html, but it could easily be modified to print as plain text as well
     print "<div class='fetcher_content'><p><b>Post #$i</b></p>$content</div><hr>";
  }
}
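
To try out a pair of start and end patterns without hitting a live site, the matching step can be run on its own against a saved page. The sketch below is only an illustration: the sample page, its class names, and the helper extract_between are made up. It also uses the non-greedy (.*?), which stops at the first occurrence of the end pattern, whereas the script's (.*) runs to the last occurrence. The two give the same result when the end pattern really is unique on the page.

```perl
use strict;
use warnings;

# Return the text between the first $start and the next $end, or undef
# if either marker is missing. \Q...\E quotes the markers so characters
# such as < . " and ? are matched literally rather than as regex syntax.
sub extract_between {
    my ($html, $start, $end) = @_;
    return $html =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

# Hypothetical page; the class names are invented for illustration.
my $page = <<'HTML';
<div class="sidebar">ads</div>
<div class="entry">
<p>First paragraph.</p>
</div><!-- end entry -->
<div class="footer">more ads</div>
HTML

my $post = extract_between($page, '<div class="entry">', '</div><!-- end entry -->');
print defined $post ? $post : "no match\n";
```

Passing the patterns in single quotes, as the script's usage comment suggests, keeps the shell from interpreting characters like < and " before Perl sees them.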

This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that learning a language or framework is much more interesting and rewarding if I create small tasks like this for myself. If you want to do this in PHP, which is what I usually program in, you can follow the same logic using curl as your user agent.

Hint 1: If you use this code as is it will print out HTML. Surround the HTML that it generates with basic <html> and <body> tags and create your own styles in the <head>. Then you can view it in a web browser and print to PDF.
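
One way to do the wrapping from Hint 1 is with a second small script. This is a minimal sketch: the function name wrap_html, the page title, and the CSS rules are placeholders rather than anything the scraper requires. Given a file name, it wraps the contents of that file. With no arguments, it wraps a short built-in sample so you can see the result.

```perl
use strict;
use warnings;

# Wrap an HTML fragment in a minimal standalone document.
# The title and CSS rules are placeholders; adjust to taste.
sub wrap_html {
    my ($title, $fragment) = @_;
    return <<"HTML";
<html>
<head>
<title>$title</title>
<style>
body { max-width: 40em; margin: 2em auto; font-family: Georgia, serif; }
.fetcher_content img { max-width: 100%; }
</style>
</head>
<body>
$fragment
</body>
</html>
HTML
}

# Sample fragment in the same shape the scraper prints (illustrative only).
my $sample = "<div class='fetcher_content'><p><b>Post #1</b></p>Hello</div><hr>";

# Use the file named on the command line if there is one, else the sample.
my $fragment = @ARGV ? do { local $/; <> } : $sample;

print wrap_html('Scraped blog', $fragment);
```

Saved under a name such as wrap.pl (a hypothetical name), it could be run as 'perl wrap.pl output.txt > blog.html' and the result opened in a browser.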

Hint 2: The blog I was trying to scrape did not allow me to hotlink images from my HTML file. As a workaround, I saved the page as HTML from Firefox and then saved that page as a PDF.


6 Comments »


Comment by Scott (spammer) Fish
on 11 Jul 2008 at 12:35 pm

This is a really cool way to scrape. Thanks for the code!

 

Comment by Link Building Bible
on 19 Jul 2008 at 10:08 am

Yah, this is a great idea….. i have that same problem… i find a new blog i wanna read about, and then I hate trying to navigate through it all…

 

Comment by Mike Wanner
on 19 Apr 2009 at 5:48 pm

I have a quick question that may be related: I am looking for a program that can pull all the COMMENTS from posts. For example, on a site that is similar to mine, I want to grab all the people who are posting and put them into a format or list. Many people post and include their full information and URL/URI. This way I can build a list quickly. Any ideas?


Comment by James Ehly
on 19 Apr 2009 at 8:15 pm

My Perl skills are still pretty green :) In PHP you'd use preg_match_all with a pattern that matches a single comment. That would find all the comments and store them in an array for further processing. You'll need to find a Perl guru or try a Google search to find something that might work with this script.

If anyone else who knows Perl would like to make a suggestion for this, that would be great!

 
 

Comment by Arunabh Das
on 03 Aug 2010 at 1:20 pm

I would recommend doing this in PHP as a scheduled cron task so that it can dump the posts to your DB and allow you to mash up content from other blogs.

 

Comment by Joe Kenedy
on 18 Aug 2010 at 8:46 am

Has anyone actually written this in PHP? I would be interested in the code if you have it to share…

Thanks

-j