Archive for July, 2008

July 30th 2008

Extending a WordPress Site: Core Hijack

It’s pretty easy to hijack some core WordPress code and use it to extend the functionality of your WordPress site while keeping a cohesive look and feel. Basically, this technique creates a WordPress ‘shell’ that you can code inside of, using all of WordPress’s functions and themes but leaving out the blog. I use this technique to add functionality to my WordPress sites that doesn’t fit within the WordPress framework, such as zip code locators and other custom database applications. You can even use this technique to build custom includes based on WordPress functions. Let’s get to it…

Assumptions

First, I need to tell you how my WordPress blogs are set up. I usually install my blogs in the root directory, and this tutorial assumes you have done the same. This means all of our coding will happen inside the WordPress install. Of course, your setup may be different from this, so just keep that in mind.

I’m also assuming that you’ve been good and have upgraded to WordPress 2.6. The WP developers have changed how the core code is included in WP 2.6 and this tutorial reflects that.

Set Up

To begin, create a directory in your WordPress install for testing (this directory should sit at the same level as the wp-content folder). You can call it anything you like - I’m calling mine ‘tool’. Next, create a blank index.php file in that folder.

Go on to learn how to include the WordPress core code…

July 21st 2008

Top Affiliate Challenge is Over

I was finally able to watch episodes 5-14 of TAC over a 3-day stretch last week. Overall I was pretty impressed with the show, even though my initial impressions left me wanting more. The acting got better, they actually showed how some things were done, and some of the contestants shared pretty good details about how they got started and how they make money. Just the fact that Thor was able to pull it off while running his business and everything else he does is pretty amazing to me.

Big surprise that Jonathan van Clute won, but he deserved it since he really knows his stuff. I was glad to see Carl come in 2nd; he’s a smart guy. I think Ian Jani got lucky in third place and it could have been anyone else.

As far as the Gurus go, it was very unfortunate for Shoemoney that his baby had such problems after she was born. Not exactly the best way to enter this world, but I’m glad that she’s doing better now and hopefully Shoe and his wife can get into the groove of being parents of 2 kids soon (if there is such a groove :) ). John Chow got voted off and I’m not really sure I agreed with that. I think he gives the most honest view of the show in his shocking behind-the-scenes take on TAC. So since Shoe and John were out, Ken McArthur had the show all to himself to promote the JV Alert conference and his book…good for him.

It sounds like there could be a TAC 2 in the works already so I’m excited to see that if it happens.

July 7th 2008

How to Scrape an Entire WordPress Blog

When I find a new blog that I really like or think has some good information, I usually want to read the whole thing, and in the past I've been frustrated by how annoying that is to do online. I just want all posts in one document, free of ads, that I can read from start to finish.

I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.

Strategy

Since the majority of blogs I read are written in WordPress, it's pretty easy to take advantage of the GET string that can be used to access each post. The GET string looks like http://www.devtrench.com/?p=2. Normally, post IDs start at 1 and auto increment from there, so all we have to do is figure out the ID of the latest post and write some code that runs a scraper in a loop that many times. Sometimes the latest post ID will be hidden in the HTML of the post somewhere, and other times you just have to guess at it for a while until you figure it out.
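
If the theme tags each post with an attribute like id="post-57" (many WordPress themes do, but that is an assumption about the blog you are scraping), a few lines of Perl can pull the highest ID out of a saved copy of the front page. This is only a minimal sketch: highest_post_id is a hypothetical helper name, and the sample markup stands in for real HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the highest N found in id="post-N" attributes, or undef if none.
# Assumption: the theme marks each post with id="post-N"; not every theme does.
sub highest_post_id {
    my ($html) = @_;
    my $max;
    while ($html =~ /id=["']post-(\d+)["']/g) {
        $max = $1 if !defined $max || $1 > $max;
    }
    return $max;
}

# Sample markup standing in for a blog's front page.
my $sample = '<div class="post" id="post-41">...</div>'
           . '<div class="post" id="post-57">...</div>';

my $latest = highest_post_id($sample);
print defined $latest ? "Highest post id: $latest\n" : "No post ids found\n";
```

If nothing matches, the helper returns undef and you are back to guessing the ID by hand.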

Code

Here's the Perl code that I'm currently using to do this. It's meant to be run from the command line, and I output the data to a file using '> output.txt' after the command. Also, I ripped off most of this code from PLEAC-Perl, which is an online Perl cookbook. Cookbooks can really accelerate learning a language when you want to do specific tasks like this.

Hopefully this code is commented enough for you to understand what is going on.

PERL:
#!/usr/bin/perl
# http://www.devtrench.com - How to Scrape a WordPress Blog
use strict;
use warnings;

# These packages are required, so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Bail out early with a usage message if an argument is missing
die "Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern\n"
    unless @ARGV == 4;

my $raw_url       = shift; # should be the domain with http:// and no trailing slash
my $post_num      = shift; # the highest id number that you can find in the blog
my $start_pattern = shift; # we don't want the entire page, so find some unique text before the blog post. it's best to put the pattern in single quotes
my $end_pattern   = shift; # some unique text that signifies the end of a blog post
my $url = URI::Heuristic::uf_urlstr($raw_url);

$| = 1; # turn off output buffering so each post prints as soon as it is fetched

# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there

# now loop through all posts
for my $i (1 .. $post_num) {
    # here we request each post from the site using the get string common to wordpress blogs
    my $req = HTTP::Request->new(GET => $url . '/?p=' . $i);
    $req->referer("http://www.devtrench.com"); # perplex the log analysers
    my $response = $ua->request($req);

    if ($response->is_error()) {
        # uh oh there was an error, probably a 404 which is caused by deleted or non published posts
        printf "<p>ERROR POST #%d: %s</p><hr>", $i, $response->status_line;
    } else {
        my $content = $response->content();
        # this regular expression captures everything between the start and end pattern.
        # \Q...\E treats the patterns as literal text, and .*? stops at the first end pattern.
        # if nothing matches, the whole page is kept
        if ($content =~ /\Q$start_pattern\E(.*?)\Q$end_pattern\E/s) {
            $content = $1;
        }
        # this line prints out what the regex found as html, but it could easily be modified to print as plain text as well
        print "<div class='fetcher_content'><p><b>Post #$i</b></p>" . $content . "</div><hr>";
    }
}

This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that learning a language or framework is much more interesting and rewarding when I create small tasks like this for myself. Now if you want to do this in PHP, which is what I usually program in, you can follow the same logic using cURL as your user agent.

Hint 1: If you use this code as is, it will print out HTML. Surround the HTML that it generates with basic <html> and <body> tags and create your own styles in the <head>. Then you can view it in a web browser and print to PDF.
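
The wrapping step from Hint 1 can also be scripted. Here's a rough sketch in Perl, assuming the scraper output was saved to a file such as output.txt; wrap_html, the default title, and the CSS rules are all placeholders invented for illustration, so adjust them to taste.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap scraped post markup in a minimal HTML page.
# The default title and the CSS rules are placeholders, not part of the scraper.
sub wrap_html {
    my ($body, $title) = @_;
    $title = 'Scraped blog' unless defined $title;
    return <<"END_HTML";
<html>
<head>
<title>$title</title>
<style>
body { font-family: Georgia, serif; max-width: 40em; margin: 2em auto; }
.fetcher_content { margin-bottom: 2em; }
</style>
</head>
<body>
$body
</body>
</html>
END_HTML
}

# If a file name is given on the command line, wrap its contents and print.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;            # slurp the whole file at once
    my $body = <$fh>;
    close $fh;
    print wrap_html($body);
}
```

Saved under a name like wrap.pl (hypothetical), it could be run as 'perl wrap.pl output.txt > blog.html', and blog.html then opened in a browser and printed to PDF.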

Hint 2: The blog I was trying to scrape did not allow me to hotlink images from my HTML file, but saving the page as HTML from Firefox was a workaround for that, and then I saved that created page as a PDF.
