How to Scrape an Entire WordPress Blog

Jul 7, 2008

When I find a new blog that I really like or think has some good information I usually want to read the whole thing, and in the past, I've been frustrated by how annoying it is to do it online. I just want all posts in one document, free of ads, that I can read from start to finish.

I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.

Strategy

Since the majority of blogs I read are written in WordPress, it's pretty easy to take advantage of the GET string that can be used to access each post. The GET string looks like http://www.devtrench.com/?p=2. Normally, post IDs start at 1 and auto-increment from there, so all we have to do is figure out the ID of the latest post and write some code that runs a scraper in a loop that many times. Sometimes the latest post ID is hidden somewhere in the HTML of the post, and other times you just have to guess at it for a while until you figure it out.
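
To make the idea concrete, here is a minimal sketch of the URL list that such a loop would walk through. The helper name post_urls is made up for this illustration, and the highest ID of 3 is just a placeholder; nothing is fetched here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the list of ?p=ID URLs for a WordPress blog.
# $base is the blog address; $max_id is the highest post ID
# you found (or guessed). post_urls is a hypothetical helper name.
sub post_urls {
    my ($base, $max_id) = @_;
    $base =~ s{/+\z}{};    # tolerate a trailing slash
    return map { "$base/?p=$_" } 1 .. $max_id;
}

# Print the URLs for the first three post IDs (3 is a placeholder).
print "$_\n" for post_urls('http://www.devtrench.com', 3);
```

Deleted or unpublished posts leave gaps in the ID sequence, so some of these URLs will come back as errors when requested.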

Code

Here's the Perl code that I'm currently using to do this. It's meant to be run from the command line, and I output the data to a file using '> output.txt' after the command. Also, I ripped off most of this code from PLEAC-Perl, which is an online Perl cookbook. Cookbooks can really accelerate learning a language when you want to do specific tasks like this.

Hopefully this code is commented enough for you to understand what is going on.

PERL:

#!/usr/bin/perl
# http://www.devtrench.com - How to Scrape a WordPress Blog
use strict;
use warnings;

# These packages are required, so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern
my $raw_url       = shift; # should be the domain with http:// and no trailing slash
my $post_num      = shift; # the highest id number that you can find in the blog
# A start pattern to match. We don't want the entire page, so find some unique
# text before the blog post. It's best to put the pattern in single quotes.
my $start_pattern = shift;
my $end_pattern   = shift; # some unique text that signifies the end of a blog post
die "Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern\n"
    unless defined $end_pattern;
my $url = URI::Heuristic::uf_urlstr($raw_url);

$| = 1; # unbuffer STDOUT so each post is written out as soon as it is fetched

# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there

# now loop through all posts
for my $i (1 .. $post_num) {
    # here we request each post from the site using the get string common to wordpress blogs
    my $req = HTTP::Request->new(GET => $url . '/?p=' . $i);
    $req->referer("http://www.devtrench.com"); # perplex the log analysers
    my $response = $ua->request($req);

    if ($response->is_error()) {
        # uh oh there was an error, probably a 404 which is caused by deleted or non published posts
        printf "<p>ERROR POST #%d: %s</p><hr>", $i, $response->status_line;
    } else {
        my $content = $response->content();
        # this is the regular expression that will match the start and end pattern;
        # \Q...\E treats both patterns as literal text and /s lets . match newlines.
        # (.*) is greedy, so it runs up to the last occurrence of the end pattern.
        if ($content =~ /\Q$start_pattern\E(.*)\Q$end_pattern\E/s) {
            $content = $1;
        }
        # this line prints out what the regex found as html, but it could easily be modified to print as plain text as well
        print "<div class='fetcher_content'><p><b>Post #$i</b></p>" . $content . "</div><hr>";
    }
}
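
The start/end pattern step is the part most worth trying offline before pointing the script at a real blog. Below is a small sketch that runs the same kind of match against a string instead of a fetched page. The helper name extract_between and the sample markers are invented for this example, and it uses a non-greedy (.*?) so the match stops at the first end marker, which is a choice made for this sketch.

```perl
use strict;
use warnings;

# Return the text between the first $start marker and the next $end marker.
# Returns undef (in scalar context) when the markers are not found.
# extract_between is a hypothetical helper name.
sub extract_between {
    my ($html, $start, $end) = @_;
    # \Q...\E quotes regex metacharacters inside the markers;
    # /s lets . match newlines so multi-line posts are captured.
    if ($html =~ /\Q$start\E(.*?)\Q$end\E/s) {
        return $1;
    }
    return;
}

# Sample page; the markup and markers are made up for illustration.
my $page = qq{<div id="header">nav</div>\n<div class="entry">\nHello,\nworld.\n</div><!-- end entry -->\n<div id="footer"></div>};

my $post = extract_between($page, '<div class="entry">', '</div><!-- end entry -->');
print defined $post ? $post : "no match\n";
```

Picking markers that appear exactly once per page matters more than the regex itself: if the end marker shows up earlier than expected, the post is cut short.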

This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that creating small tasks like this for myself makes learning a language or framework much more interesting and rewarding. Now if you want to do this in PHP, which is what I usually program in, you can follow the same logic using curl as your user agent.

Hint 1: If you use this code as is, it will print out HTML. Surround the HTML that it generates with basic <html> and <body> tags and create your own styles in the <head>. Then you can view it in a web browser and print to PDF.
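
One possible way to do that wrapping in Perl, rather than by hand, is sketched below. The helper name wrap_html, the page title, and the CSS rules are all placeholders, and the sample fragment stands in for the scraper's real output.

```perl
use strict;
use warnings;

# Wrap scraped fragments in a minimal HTML page so a browser can render
# and print them. wrap_html is a hypothetical helper; the styles are
# placeholders to replace with your own.
sub wrap_html {
    my ($title, $body) = @_;
    return <<"END_HTML";
<html>
<head>
<title>$title</title>
<style>
body { font-family: Georgia, serif; max-width: 40em; margin: 2em auto; }
.fetcher_content img { max-width: 100%; }
</style>
</head>
<body>
$body
</body>
</html>
END_HTML
}

# A stand-in for the scraper's output.
my $fragments = "<div class='fetcher_content'><p><b>Post #1</b></p>Hello</div><hr>";
print wrap_html('Scraped blog', $fragments);
```

In practice the fragment string would be read from the file the scraper wrote (output.txt in the command above) instead of being hard-coded.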

Hint 2: The blog I was trying to scrape did not allow me to hotlink images from my HTML file. Saving the page as HTML from Firefox was a workaround for that, and I then saved the resulting page as a PDF.


4 Comments »


Comment by Scott (spammer) Fish
on 11 Jul 2008 at 12:35 pm #

This is a really cool way to scrape. Thanks for the code!

 

Comment by Link Building Bible
on 19 Jul 2008 at 10:08 am #

Yah, this is a great idea….. i have that same problem… i find a new blog i wanna read about, and then I hate trying to navigate through it all…

 

Comment by Mike Wanner
on 19 Apr 2009 at 5:48 pm #

I have a quick question that may be related – I am looking for a program that can pull all the COMMENTS from posts. For example, on a site that is similar to mine I want to grab all the people who are posting and put it into a format or list. Many people post and put their full information and URL/URI. This way I can build a list quickly. Any ideas?


Comment by James Ehly
on 19 Apr 2009 at 8:15 pm #

My PERL skills are still pretty green :) In PHP you’d use preg_match_all on a pattern that would find a single comment. That would find all the comments and store them in an array for further processing. You’ll need to find a PERL guru or try a Google search to find something that might work with this script.

If anyone else that knows PERL would like to make a suggestion for this, that would be great!
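
For what it's worth, a rough Perl counterpart to the preg_match_all idea is a global match in list context, which returns every captured group at once. The sketch below assumes each comment sits in an <li class="comment"> element, which is a guess at the markup (real themes differ), and find_comments is a made-up helper name.

```perl
use strict;
use warnings;

# Collect every comment block from a page into an array, similar in
# spirit to PHP's preg_match_all. The <li class="comment"> markup is an
# assumption; adjust the pattern to the theme being scraped.
sub find_comments {
    my ($html) = @_;
    # With /g in list context, the match returns all captured groups;
    # /s lets a comment span several lines.
    my @comments = $html =~ m{<li class="comment">(.*?)</li>}gs;
    return @comments;
}

# Sample markup, invented for illustration.
my $html = <<'END';
<ol>
<li class="comment">First comment</li>
<li class="comment">Second
comment</li>
</ol>
END

my @found = find_comments($html);
printf "%d: %s\n", $_ + 1, $found[$_] for 0 .. $#found;
```

A regex like this breaks on nested <li> elements, so an HTML parser is the sturdier choice for anything beyond a quick experiment.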