How to Scrape an Entire WordPress Blog
When I find a new blog that I really like, or that I think has some good information, I usually want to read the whole thing, and in the past I've been frustrated by how annoying that is to do online. I just want all the posts in one document, free of ads, that I can read from start to finish.
I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.
Strategy
Since the majority of blogs I read are built on WordPress, it's pretty easy to take advantage of the GET string that can be used to access each post. The GET string looks like http://www.devtrench.com/?p=2. Normally, post IDs start at 1 and auto-increment from there, so all we have to do is figure out the ID of the latest post and write some code that runs a scraper once for each ID up to that number. Sometimes the latest post ID is hidden somewhere in the HTML of the post, and other times you just have to guess for a while until you figure it out.
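To make the strategy concrete, here is a minimal sketch in core Perl of the two pieces described above: building the ?p=ID address for a post, and looking for post IDs in a page's HTML. The helper names post_url and find_post_ids are made up for this example, and the second one assumes the theme emits markers such as id="post-123" or a "postid-123" body class, which many WordPress themes do but not all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the ?p=ID address for one post (hypothetical helper name).
sub post_url {
    my ($base, $id) = @_;
    $base =~ s{/+$}{};    # drop trailing slashes so we don't produce "//?p="
    return "$base/?p=$id";
}

# Collect the numeric post IDs that appear in a page's HTML.
# Assumption: the theme writes markers like id="post-123" or class="postid-123".
sub find_post_ids {
    my ($html) = @_;
    my %seen;
    $seen{$1}++ while $html =~ /\bpost(?:id)?-(\d+)\b/g;
    return sort { $a <=> $b } keys %seen;
}

# A stand-in for the HTML of the newest post on the blog's front page.
my $sample = '<body class="single postid-57"><div id="post-57" class="post-57 post">...</div></body>';
my @ids = find_post_ids($sample);
print "highest id seen: $ids[-1]\n";

# The addresses the scraper would request for the first three posts.
print post_url('http://www.devtrench.com/', $_), "\n" for 1 .. 3;
```

If the pattern finds nothing on a given blog, guessing a generous upper bound still works, since IDs that don't exist just come back as errors and get skipped.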
Code
Here's the Perl code that I'm currently using to do this. It's meant to be run from the command line, and I output the data to a file using '> output.txt' after the command. Also, I ripped off most of this code from PLEAC-Perl, which is an online Perl cookbook. Cookbooks can really accelerate learning a language when you want to do specific tasks like this.
Hopefully this code is commented enough for you to understand what is going on.
#!/usr/bin/perl -w
# http://www.devtrench.com - How to Scrape a WordPress Blog

use strict;

# These packages are required, so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern
my $raw_url       = shift; # the address of the blog
my $post_num      = shift; # the id of the latest post; we loop from 1 up to this number
my $start_pattern = shift; # a start pattern to match. we don't want the entire page so find some unique text before the blog post. it's best to put the pattern in single quotes
my $end_pattern   = shift; # an end pattern to match: some unique text right after the blog post
die "Usage: perl fetch_blog.pl url highest_post_id start_pattern end_pattern\n"
    unless defined $end_pattern;

my $url = URI::Heuristic::uf_urlstr($raw_url);

$| = 1; # flush output after every print

# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there

# now loop through all posts
for (my $i = 1; $i <= $post_num; $i++) {
    # here we request each post from the site using the get string common to wordpress blogs
    my $req = HTTP::Request->new(GET => $url . '/?p=' . $i);
    $req->referer("http://www.devtrench.com"); # perplex the log analysers
    my $response = $ua->request($req);

    if ($response->is_error()) {
        # uh oh there was an error, probably a 404 which is caused by deleted or non published posts
    } else {
        my $content = $response->content();
        # this is the regular expression that will match the start and end pattern
        if ($content =~ /$start_pattern(.*?)$end_pattern/s) {
            # this line prints out what the regex found as html, but it could easily be modified to print as plain text as well
            print $1, "\n";
        }
    }
}
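The comments in the script talk about matching a start and end pattern and about printing plain text instead of HTML. Here is a rough sketch of both ideas that runs without touching the network, so you can try out patterns on a saved page first. The helper names extract_between and html_to_text are made up for this example, and the tag stripping is a crude regex approach that assumes simple post markup; a real HTML parser would handle more edge cases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text between a start and end pattern, or undef if there is no match.
sub extract_between {
    my ($html, $start_pattern, $end_pattern) = @_;
    # /s lets "." cross newlines; the non-greedy (.*?) stops at the first end pattern
    return $html =~ /$start_pattern(.*?)$end_pattern/s ? $1 : undef;
}

# Crude HTML-to-text conversion using regexes only.
sub html_to_text {
    my ($html) = @_;
    $html =~ s{<(script|style)\b.*?</\1>}{}gis;    # drop script/style blocks entirely
    $html =~ s{<br\s*/?>}{\n}gi;                    # line breaks become newlines
    $html =~ s{</(?:p|div|h[1-6]|li)>}{\n\n}gi;     # block endings become blank lines
    $html =~ s{<[^>]+>}{}g;                         # remove all remaining tags
    my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');
    $html =~ s/&(amp|lt|gt|quot|nbsp);/$entity{$1}/g;
    $html =~ s/[ \t]+/ /g;                          # collapse runs of spaces
    $html =~ s/\n{3,}/\n\n/g;                       # at most one blank line in a row
    $html =~ s/^\s+|\s+$//g;                        # trim both ends
    return $html;
}

# A stand-in page; the patterns below are examples, not what any real theme uses.
my $page = qq{<div id="header">menu</div>\n<div class="entry">\n<p>Hello &amp; <b>welcome</b>.</p>\n</div><!-- .entry -->\n<div id="footer">ads</div>};
my $post = extract_between($page, '<div class="entry">', '</div><!-- \.entry -->');
print html_to_text($post), "\n" if defined $post;    # prints: Hello & welcome.
```

The patterns are treated as regular expressions, which is why the dot in the end pattern is escaped and why single quotes are the safest way to pass them on the command line.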
This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that when I create small tasks like this for myself, learning a language or framework is much more interesting and rewarding. If you want to do this in PHP, which is what I usually program in, you can follow the same logic using cURL as your user agent.
Hint 1: If you use this code as is, it will print out HTML. Surround the HTML that it generates with basic <html> and <body> tags and create your own styles in the <head>. Then you can view it in a web browser and print to PDF.
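If you'd rather not wrap the output by hand, a small helper can do it. This is only a sketch: the name wrap_html and the styles are placeholders I made up, so swap in whatever you like.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap scraped post HTML in a minimal page with some reading-friendly styles.
# wrap_html is a hypothetical helper; the CSS values are just an example.
sub wrap_html {
    my ($title, $body) = @_;
    return <<"HTML";
<html>
<head>
<title>$title</title>
<style>
body { max-width: 40em; margin: 2em auto; font-family: Georgia, serif; line-height: 1.5; }
img { max-width: 100%; }
</style>
</head>
<body>
$body
</body>
</html>
HTML
}

# Stand-in for the scraper's output; in practice this would be the contents of output.txt.
my $posts = '<h2>First post</h2><p>Hello.</p>';
print wrap_html('Blog archive', $posts);
```

Redirect the result to a file ending in .html and open it in a browser to check the layout before printing to PDF.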
Hint 2: The blog I was trying to scrape did not allow me to hotlink images from my HTML file. Saving the page as HTML from Firefox was a workaround for that, and I then saved the resulting page as a PDF.

Comment by Scott (spammer) Fish
on 11 Jul 2008 at 12:35 pm
This is a really cool way to scrape. Thanks for the code!
Comment by Link Building Bible
on 19 Jul 2008 at 10:08 am
Yah, this is a great idea….. i have that same problem… i find a new blog i wanna read about, and then I hate trying to navigate through it all…