Archive for September, 2007

Screen Scraping: How to Screen Scrape a Website with PHP and cURL

Screen scraping has been around on the internet since people could code on it, and there are dozens of resources out there to figure out how to do it (google ‘php screen scrape’ to see what I mean). I want to touch on some things that I’ve figured out while scraping some screens. I assume you have php running, and know your way around Windows.

  • Do it on your local computer. If you are scraping a lot of data, you need an environment that doesn’t have script time limits. The server that I use has a max execution time of 30 seconds, which just doesn’t work if you are scraping a lot of data off of slow pages. The best thing to do is to run your script from the command line, where there is no limit on how long a script can take to execute. This way, you’re not hogging server resources if you are on a shared host, or your own server’s resources if you are on a dedicated host. Obviously, if you’re screen scraping data to serve ‘on-the-fly’, then this scenario won’t work, but it’s awesome for collecting data. Make sure you can run php from the command line by opening a command prompt window and typing ‘php -v’. You should get the version of php you are running. If you get an error message, then you’ll need to add the folder that contains your php executable to your PATH environment variable.
  • Do it once. If you are writing a script that loops through all of the pages on a site, or a lot of pages - make sure your script works right before you execute it. If the host sees what you are doing and doesn’t like it, then they could just block you. So it’s best to make sure your script runs correctly by doing a small test run. Then when that works, unleash your script on the entire site. In that same vein, don’t screen scrape a site all the time. You’re just going to piss off the admin if they figure it out.
  • Do it smart. Make sure the site doesn’t offer an API for doing what you want before you scrape their site. Often, the API can get you the information quicker and in a better format than the screen scrape can.
  • Use the cURL library. I really don’t know any other way to scrape a page than to use cURL. It works so well that I have never had to try anything else. Since you are going to be using php from the command line, you’re also going to want to use curl from the command line (it’s easier than using the PHP functions, and external libraries may not be loaded there anyway). Get curl from http://curl.haxx.se/download.html and download the non-SSL version. Add the folder that contains curl.exe to your PATH environment variable, and make sure you can run curl from the command line.
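
The ‘do it locally’ and ‘do it once’ tips can be sketched in code. This is only a rough sketch under assumptions: the helper names isCommandLine, testBatch and scrapePolitely, the five-page sample size, and the two-second pause are all made up for illustration and are not part of any library.

```php
<?php
// Hypothetical helper: true when the script runs from the command line,
// where PHP's max_execution_time defaults to 0 (no limit).
function isCommandLine(): bool {
    return PHP_SAPI === 'cli';
}

// Hypothetical helper: keep only the first $limit URLs for a small test run.
function testBatch(array $urls, int $limit = 5): array {
    return array_slice($urls, 0, $limit);
}

// Hypothetical helper: fetch each URL in turn, pausing between requests so
// the remote host is not hammered. $fetch is any callable that takes a URL
// and returns the page body.
function scrapePolitely(array $urls, callable $fetch, int $delaySeconds = 2): array {
    $pages = [];
    $first = true;
    foreach ($urls as $url) {
        if (!$first && $delaySeconds > 0) {
            sleep($delaySeconds); // wait before every request except the first
        }
        $first = false;
        $pages[$url] = $fetch($url);
    }
    return $pages;
}
```

The idea would be to run scrapePolitely(testBatch($urls), ...) first, look at the results, and only then pass the full list.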

Those are all of my tips. Here is some screen scrape code that I use.

To call curl just write a function like this. This is so much easier than using the php commands, but you probably don’t want to use a shell_exec command on a web server where someone can put in their own input. That might be bad. I only use this code when I run it locally.

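function GetCurlPage ($pageSpec) {
    // Quote the URL so characters like & are not interpreted by the shell.
    // Note: on Windows, escapeshellarg() replaces % with a space.
    return shell_exec('curl ' . escapeshellarg($pageSpec));
}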
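
For comparison, here is roughly what the same fetch could look like with PHP’s curl extension, which avoids the shell entirely and so sidesteps the shell_exec worry on a web server. This is a sketch, not the code used for this post: the function name GetCurlPageExt and the 30-second default timeout are assumptions, and it only works if the curl extension is loaded.

```php
<?php
// Hypothetical alternative to GetCurlPage(): same job, but through PHP's
// curl extension, so no shell is involved. Returns null on any failure.
function GetCurlPageExt(string $url, int $timeoutSeconds = 30): ?string {
    if (!function_exists('curl_init')) {
        return null; // the curl extension is not loaded
    }
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    $body = curl_exec($ch);
    // The handle is released automatically when $ch goes out of scope.
    return $body === false ? null : $body;
}
```

The trade-off is a few more lines in exchange for not passing a URL through the command interpreter.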

This is the code that calls the curl function. We start by turning on the output buffer, which speeds up code that prints a lot of output. This particular code grabs the title of a page and prints it:

ob_start();
$url = 'http://www.example.com';
$page = GetCurlPage($url);
preg_match("~<title>(.*?)</title>~is",$page,$m);
print $m[1];
ob_end_flush();

To run your script from the command line and generate output to a file you simply call it like this:

php my_script_name.php > output.txt

Any output captured by the output buffer will be written to the file you redirect the output to.

This is a very simple example that doesn’t even check to see if the title exists on that page before it prints, but hopefully you can use your imagination to expand this into something that might grab all of the titles on an entire site. A common thing that I do is use a program like Xenu Link Sleuth to build my list of links I want to scrape, and then use a loop to go through and scrape every link on the list (in other words, use Xenu for your spider and your code to process the results). This was how I built the Shoemoney Blog Archive list. The challenge and fun with screen scraping is figuring out how you can use the data that is out there to your advantage.
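
Here is one way the title check and the loop over a link list might look. It is a sketch under assumptions: the helper names extractTitle, readUrlList and collectTitles are invented for illustration, and the link list is assumed to be a plain text file with one URL per line (the file name links.txt in the usage note is made up, and a real export may need cleaning up first).

```php
<?php
// Hypothetical helper: return the text of the first <title> element,
// or null when the page has none.
function extractTitle(string $html): ?string {
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// Hypothetical helper: read a link list (one URL per line) from a text file.
// Blank lines are skipped; an unreadable file yields an empty list.
function readUrlList(string $path): array {
    if (!is_readable($path)) {
        return [];
    }
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    // Trim whitespace and drop lines that end up empty.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// Hypothetical helper: fetch every URL with $fetch and map it to its title
// (null when the fetch failed or the page has no title).
function collectTitles(array $urls, callable $fetch): array {
    $titles = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        $titles[$url] = is_string($html) ? extractTitle($html) : null;
    }
    return $titles;
}
```

With the GetCurlPage function from earlier, usage could be as simple as collectTitles(readUrlList('links.txt'), 'GetCurlPage'), followed by a loop that prints each URL and title.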

Category: Programming | 4,311 views | Posted: September 17th 2007 07:00 am

Links from the Trenches #6

How to Make Money On A Site With A Small Budget
Shoemoney had an awesome post this week on how to advertise on a website with a low budget. A reader of the blog posed this question to him in his QA post last week, and he answered the question by delving into how to use Google Analytics to track your goal (new rss subscribers), and then how to use those stats to decide where to advertise. While his math seemed a little fuzzy to me (he even points it out), the overall concept is solid.

APIs
I’ve been looking into APIs this week. This was sparked by the interview that John Chow had on The Lab with Leo. In that interview John said he maintains a list of 1,000 news sites that he emails when he’s got a new story on the Tech Zone, and that this was the #1 way he got the Tech Zone started. I emailed John to ask how he emails them, but no response as of yet. I’m just wondering how he does it without it looking like spam, and I doubt that he emails them all one by one. If anyone has any hints, I’d love to hear them.

Technorati API
I made a page scraper that got me the top 1000 blogs about web development from Technorati, because when looking at the API it didn’t seem to me that the ‘search’ query would do a blog search by tag. I’m going to look into this more because it might do it. You are limited to 500 queries a day, and you can’t use the results for commercial use.

Alexa API
You have to pay for this one, and it is through Amazon, which owns Alexa, so I guess I’ll be scraping that too. But you might be able to do a one-off search if you already have a list (I’m sure all the Firefox add-ons that generate Alexa numbers aren’t paying anything). I need to look into this more.

Category: General | 688 views | Posted: September 14th 2007 09:48 am

Work Some Database Magic with DBDesigner 4

I wish I had found DBDesigner 4 a long time ago, but as it works out I just found it last week. Have you ever come across a program where you instantly realize it is going to be useful, and it feels like it was written with your ideas in mind? This is what I experienced after installing DBDesigner 4.

I’m a visual person, so it really tires out the left side of my brain when I have to create a database through code or phpMyAdmin. That’s just the way it is for me, and I’ve always had a really hard time going back to the databases I’ve created (especially if it’s been months or years) and remembering how they are structured. So I’ve always thought that a visual tool would be awesome, where I can make the database with a visual model and then export the code from that. I’ve tried various UML programs in past years to do this, but I didn’t like any of the ones I tried (mostly because you had to purchase the full product to get key features).

DBDesigner 4 is built for 1 thing - creating database structures (ok, it does query building too, and probably more stuff I’m unaware of, but you get the point). I think that is a huge benefit over the UML programs I used before because there’s not a bunch of junk in the program that I’m never going to use. Even better, it is designed out of the box for MySQL databases.

One huge caveat is that it doesn’t work with MySQL 5, or even 4.1 for that matter. You see, DBDesigner 4 is on its way out, and it will be replaced by the much anticipated and much delayed MySQL Workbench project. So for now, I can’t even use all of the reverse engineering functions, database connection functions, etc. because my local database is MySQL 5. Read more about these issues on adaruby.com in a post he made in December 2006 (man, I’m really behind on this one). I tried all manner of things to get a current database into DBDesigner 4, from trying to find a way to import an SQL script to using an SQLite DB, but I couldn’t get anything to work. Hopefully MySQL Workbench is better at this, but I’ll have to download that to find out.

Note: This is not a sponsored post, I just think this is a cool piece of software. If you would like a sponsored review, please check out my advertise page.

Category: Programming | 606 views | Posted: September 13th 2007 09:00 am
