TABLE OF CONTENTS

Departments
EDITORIAL RANTS
NEW STUFF
REVIEW: dotProject, By Peter James
TIPS & TRICKS, By John W. Holmes
BITS & PIECES, By Peter James
exit(0);, By Marco Tabini

Features
A Developer’s Introduction to Web Automation and Scraping using Scout, By Peter James
On MySQL and PHP, By the php|a Editorial Staff
Making Your Code More Readable, By Stuart Herbert
Introducing GeekLog, By Andrew Gray
Client-Server TCP/IP Connections with PHP, By Marco Tabini
Coding for PHP5 in PHP4, By Peter Moulding
FLASH Remoting with AMFPHP, By Seth Wilson
July 2003 · PHP Architect · www.phparch.com
EDITORIAL RANTS
Editorial
Welcome to another edition of php|architect ;-)

In spite of a lot of impatience on the part of the entire staff here at php|architect, we decided to wait a couple of extra days before launching the magazine, so we could also launch all of the other new stuff this month. A new website, an online store, an interview with Zak over at MySQL, who lends a hint of clarity to the PHP/MySQL ‘debacle’, and, of course, another information-packed issue of php|architect magazine!

But the madness doesn’t stop there! Soon we’ll announce the winners of the php|architect project grant, if there isn’t already an announcement on the site by the time you read this (hint: go to the site!).

As if that wasn’t enough, I’m also proud to announce that coming soon will be the print edition of php|architect. It’s a topic that was hotly debated internally, and then planned heavily, so we really hope you find some value in that. While you won’t (yet) be able to pick it up at newsstands, you will be able to buy a print subscription and have it delivered to your doorstep (making the possibly broad assumption that you have a doorstep). You’ll see a subscription form in this issue, and also online. If your subscription hasn’t run out yet and you want to hold off – fine – give the subscription card to a friend. Better yet – get your friend a subscription so they’ll stop stealing your printouts!

Though there was a lot of internally generated hubbub this month, giving us lots to talk about, I should also mention a couple of other bright spots for php|architect. The php|architect family continues to grow. Not only do we have a stable and growing family of collaborators, columnists, authors and editors, but our own Marco Tabini (you know, the publisher?) has just recently announced that he is shortly to become a father to a new baby boy! Congratulations and good luck, Marco and Emanuela.

I also want to give a very sincere and public (thereby even more sincere?) thanks to my senior editor at large, Peter James. As everyone probably knows by now, I’m a system administrator in addition to my role here at php|architect. At times, this takes up more of my resources than I care to admit. This month was an especially difficult one for me, and Peter stood like a rock in the face of much added responsibility, and took the initiative to get things done that I couldn’t possibly have kept up with. Thanks Pete!

And to you, the reader – I thank you too. Without your ongoing support, constant feedback and, of course, outright flames, php|architect wouldn’t be anywhere near the kind of publication that would justify a print edition. Keep that feedback coming! Soon we plan to add a ‘Reader Feedback and Flames’ section to the magazine, so don’t go clamming up on us now!

In the coming weeks/months, there will (believe it or not) be even more announcements with even bigger news making their way to the front page of the php|architect website. I suggest you syndicate the news, or make php|architect a regular stop for PHP-related news. Our news editor, Eddie Peloke, has done an outstanding job at finding PHP news wherever it may hide, so where else would you go for PHP news?

OK – I’m out of breath and my hands hurt. Hope you enjoy this month’s issue.

...Until we meet again...

Editorial Team: Arbi Arzoumani, Brian Jones, Eddie Peloke, Peter James, Marco Tabini
Graphics & Layout: Arbi Arzoumani, William Martin
Managing Editor: Emanuela Corso
Authors: Andrew Gray, Stuart Herbert, John W. Holmes, Peter Moulding, Peter James, Marco Tabini, Seth Wilson

php|architect (ISSN 1705-1142) is published twelve times a year by Marco Tabini & Associates, Inc., P.O. Box 3342, Markham, ON L3R 6G6, Canada. Although all possible care has been placed in assuring the accuracy of the contents of this magazine, including all associated source code, listings and figures, the publisher assumes no responsibilities with regards of use of the information contained herein or in all associated material.

NEW STUFF
What’s New!

PHP 5
The PHP Team has announced the release of the first beta version of PHP5. Although this release is not yet ready for prime time, it’s worth trying out to test the new features that will be bundled with PHP5, such as the new object management functionality, which dramatically improves PHP’s OOP capabilities. On a less fortunate note, because of a licensing policy change in the MySQL software, the new version of PHP does not include a bundled MySQL library. This does not mean that you will not be able to use MySQL with PHP any longer—just that you will need to compile MySQL support into it explicitly. For more information, check out the PHP Group homepage.
PhpED 3.2
NuSphere phpED is an IDE (Integrated Development Environment) that integrates a comprehensive set of editing, debugging and deployment tools for the PHP scripting language that can speed development time by up to 75 percent. New phpED v3.2 advantages:

a. New interface design makes your work much more pleasant and comfortable.
b. Customizable shortcuts speed up your work significantly.
c. Improved CVS support enables you to easily regain old versions of a source file to track bugs while working on the same project in a team of developers.
d. New code highlighting features greatly facilitate your working with code.
e. Enhanced project deployment. Once you have tuned publishing according to your needs you can upload your projects with a single click!
f. New search and replace scope. Find and replace works in multiple files now!
g. Handy NuSOAP classes.
h. Latest version of PHP 4.3.2 is included.
i. New help system supports chm help format and allows you to get the right reference at the right moment.

And when you’re done, you can take advantage of multiple platform deployment options including Windows, Linux, and UNIX. For more information, visit NuSphere.com or get a two week trial HERE (http://ww1.nusphere.com/web/registration_page.php).

Back-End 0.5.4
Back-End.org announced the newest release of their CMS system. Back-End 0.5.4 is a multi-lingual PHP/MySQL CMS which features in-line editing and a common hierarchy for articles, links & galleries. Back-End allows content developers to quickly add/modify content on a web site. It features a TTW WYSIWYG editor and options for straight HTML or wiki editing. It also includes sub-sites for your locals and committees. Auto-generate your sitemap and navigation (including drop-down navigation). Have complete control of the look and feel of your site through phplib’s templates.

Zend Performance Suite 3.5
Zend.com is excited to announce the latest release of Zend Performance Suite 3.5, the leading solution for optimizing your application performance. ZPS 3.5 is a complete package for understanding and optimizing your site performance, delivering the information you need to find costly performance slowdowns and providing the end-to-end technology to fix them. ZPS 3.5 is designed to enhance site management capabilities while simplifying installation and setup time. New Features:

• Support of virtual host configurations - optimize and manage multiple domains.
• Support of Apache 2.0 and Zeus web servers.
• Improved testing capabilities - optimizing your performance improvement efforts with real-time reporting.
• Intelligent installation that adapts to the environment - whenever you change your PHP version, ZPS will make sure the corresponding ZPS module will load, relieving you of the hassle.

Databases
MySQL, Postgres and SQLite have all recently announced new releases. MySQL released 3.23.57, Postgres announced 7.3.3, and SQLite announced 2.8.4.
FEATURES
A Developer’s Introduction to Web Automation and Scraping using Scout
By Peter James

Scout is a web automation tool inspired by Perl’s WWW::Mechanize. Scout allows the developer to automatically navigate very complex web sites, and pull out virtually anything of interest along the way. I originally wrote Scout as a means to pull headlines from web sites that don’t provide RSS news feeds, but its capabilities go much further.
Scout is built on top of a more generic HTML parser, and relies on a number of pre-existing tools and technologies, including HTML Tidy, XPath, and cURL. This article is meant to introduce Scout by dissection, showing how and why different tools were used. This dissection is important for a couple of reasons. First, it serves as a great case study in using the right tools for the job. Without any one of these tools, writing Scout would have been at least an order of magnitude more difficult. Second, the dissection provides insights into Scout’s interface, and why it does things the way it does, which can be of great benefit when using Scout.

Once we know how Scout works, we can play with some juicy examples. First, we’ll see how to create a simple spider, which will go out and crawl the web for us. Next, we’ll see how we can navigate through a more complex site, including a login form, and retrieve a status message. Lastly, I’ll show how to use Scout to easily scrape headlines from your favorite web site, and turn them into a nice, readable RSS feed. Sound like fun? Let’s roll.

Requirements
To get the most out of Scout, and this article, a basic knowledge of XML and related technologies would be helpful. Although there are non-XPath choices in Scout’s interface, its internals work on top of XPath, a language for querying XML documents. I don’t expect you to know XPath, but it might help if you have heard of it.

In order to run Scout, there are a handful of system requirements. First, you need to make sure your PHP installation includes the cURL and XML extensions. Second, you need to ensure that you have the latest version of Tidy (http://tidy.sourceforge.net) on your machine, and accessible to Scout. Third, you need to download the PHP.XPath class from http://www.carrubbers.org/scripts/php/xpath/. The XPath class should be installed somewhere in your PHP include path. Once you have those three things, you’re set. So, without further delay, let’s get at it!

REQUIREMENTS
PHP Version: 4.2+, XML extension, cURL extension, PHP.XPath class
Additional Software: HTML Tidy (recent release)

HTML Parsing
Screen scraping involves being able to pull out elements, groups of elements, and text from HTML documents. This implies that you need a way to “read” the HTML documents. By “read”, I actually mean “parse and interpret”.
HTML parsing is a bit of an art. Because HTML is not a strict markup, it lends itself to much ambiguity and uncertainty. The existing HTML parsing tools that I found out there seemed to be limited in usability, at least for my purposes. They may make good guesses at parsing the HTML, but they don’t offer a nice interface for getting at the parse tree.

Enter XML. There are a plethora of tools out there for parsing and searching XML, including XPath. But wait, you say, HTML isn’t exactly XML. I knew that, but I wasn’t going to let it stand in my way. I decided to find a way to force HTML documents to become XHTML documents, since XHTML documents are well-formed XML documents. After much searching, a little crying, and then some more searching, I found out that the newer versions of Tidy will forcibly output XHTML.

Tidy
Tidy, at its core, is an HTML code beautifier. That is, it takes really messy HTML code and turns it into really pretty HTML code. Tidy is available for a number of platforms, including Solaris, MacOS X, Linux, and Windows.

Let’s look at an example of what Tidy can do. Listing 1 shows a rag-tag HTML file. The opening and closing <html> and <body> tags are completely missing, elements are opened and not closed, or closed in the wrong order, and element attributes are not quoted properly.

Listing 1
    This is some text

By running the following command at the Unix command-line:

    petej@www $ tidy -i -q listing1.txt

we get the output shown in Listing 2. The ‘-i’ tells Tidy to format with indentation if it can (which it doesn’t in this case – see the Tidy documentation to learn why). The ‘-q’ tells Tidy to be quiet. Tidy is a very loud and obnoxious tool. It has a lot to say, and offers lots of hints on how you can better format your HTML in the first place. As you can see in Listing 2, even ‘-q’ didn’t shut Tidy up completely. The warning lines above our formatted document let us know what was wrong with our document, and what Tidy did about it. Warnings just whine, while errors here will actually prevent output.

Listing 2
    petej@www $ tidy -q -i listing1.txt
    line 1 column 1 - Warning: inserting missing 'title' element
    line 1 column 8 - Warning: <table> lacks "summary" attribute
    line 4 column 20 - Warning: replacing unexpected b by
    line 4 column 18 - Warning: replacing unexpected i by
    line 1 column 8 - Warning: isn't allowed in elements
    line 1 column 8 - Warning: missing before
    <meta content="HTML Tidy for FreeBSD (vers 1st March 2003), see www.w3.org" name="generator">
    This is some text

The HTML output in Listing 2 is much bigger than our input document. Tidy added the missing opening and closing tags, fixed our out-of-order tags, quoted the unquoted attribute values, and closed our unclosed tag for us. Our messy HTML input has become a well-formed HTML output. We can force this into well-formed XML by calling Tidy like so:

    petej@www $ tidy -q -i -asxml --force-output yes tidy_test.html

Here we have the ‘-q’ and ‘-i’ again, but we have two new options. ‘-asxml’ lets Tidy know that we want our output in XHTML. ‘--force-output yes’ is a configuration option that tells Tidy to output something, regardless of how bad the input is. Normally, if Tidy runs into an error condition in the input document, it won’t output anything except warnings and errors.

NOTE: You might wonder about the validity of this method, considering that if we are forcing output, we can’t guarantee that what we input is being exactly reflected in what we output. It’s true. The documents might be slightly different, but this is a price we must pay if we are to use XML parsing and searching tools. I think it is an appropriate trade-off in most circumstances, and normally shouldn’t cause an issue. By querying the document properly, you can minimize dependence on the structure, effectively minimizing the effects of forced output.
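To make the forced-XHTML step concrete, here is a minimal sketch of how a PHP script might assemble and run the Tidy command shown above. It assumes a `tidy` binary on the PATH and a Unix shell; the function names `build_tidy_command()` and `html_to_xhtml()` are hypothetical, and the code is written for a current PHP release rather than the PHP 4 of the article.

```php
<?php
// Sketch only: build the Tidy command line from the article and run it.
// Assumes `tidy` is installed and on the PATH; both function names are
// made up for this illustration.

function build_tidy_command(string $htmlFile, string $xhtmlFile): string
{
    // -q: quiet, -i: indent, -asxml: emit XHTML,
    // --force-output yes: write output even when Tidy reports errors.
    // Tidy's messages go to stderr, which is discarded here.
    return sprintf(
        'tidy -q -i -asxml --force-output yes %s > %s 2> /dev/null',
        escapeshellarg($htmlFile),
        escapeshellarg($xhtmlFile)
    );
}

function html_to_xhtml(string $htmlFile, string $xhtmlFile): bool
{
    exec(build_tidy_command($htmlFile, $xhtmlFile), $unused, $status);
    // With forced output the exit status is not a reliable success signal,
    // so simply check that a non-empty file was produced.
    return is_file($xhtmlFile) && filesize($xhtmlFile) > 0;
}

// Show the command that would be run for the article's test file.
echo build_tidy_command('tidy_test.html', 'tidy_test.xhtml'), "\n";
```

Quoting both file names with escapeshellarg() keeps odd characters in paths from being interpreted by the shell, which matters once the input files come from URLs fetched off the web.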
Now that we can be confident that Tidy will always output well-formed XML in the form of XHTML, we can move on to how we’ll parse and search the document.

XPath
There are two standard ways of parsing XML documents. The event-driven SAX parsers parse the document from top to bottom, firing events (or callbacks) on each object (element, comment, processing instruction, etc) along the way. This is a very linear way to parse XML, but doesn’t help me search the document at all. The DOM, or document object model, parses XML into a big tree structure of objects in memory. Obviously, this can be a resource hog, but it definitely offers some advantages for Scout, not the least of which is XPath (see sidebar).
XPath is really the answer for searching the document. XPath’s expressions are very powerful and will allow us to access virtually anything and everything in our HTML (XHTML) page.

PHP comes with a built-in DOM extension (DOMXML). In my opinion, though, it’s a friggin’ mess. Looking at the function reference on php.net is kinda like looking at my desk, except I wasn’t the one that made it that way. It’s very confusing, and is also still very much a moving target. Note that I’m not necessarily knocking the DOMXML maintainers. I’m sure there are good reasons afoot.

To gain some stability in my implementation, I decided to use the PHP.XPath class, available from http://sourceforge.net/projects/phpxpath. This class is fairly well maintained, and has a pleasant API. It does lack some standard XPath functionality, but it is also surprisingly complete for an independent effort.

Sidebar: What’s XPath?
XPath is considered by some to be for XML documents what SQL is for relational databases. Used extensively in XSLT stylesheets, XPath is a query language used for addressing information in XML documents. XPath expressions are very powerful, and can conditionally find just about anything in an XML document. Consider the following XML document, which we’ll use for illustration:

    <a>
      <b>
        <c id="c1">data</c>
      </b>
      <d>
        <c id="c2">more data</c>
        <c id="c3">even more data</c>
      </d>
    </a>

XPath expressions return four types of values: boolean, string, number, node-set. While the first three should be self-explanatory, the node-set may not be. A node-set is a collection of nodes (elements, element attributes, comments, etc) in the document that match the given expression. In our example, a node-set returned from the ‘//c’ (see below for explanation) expression would contain the three <c> nodes. Usually, node-sets are not useful by themselves, and must be used to either set the context for the next operation, or be coerced to one of the other three types.

XPath syntax comes in two forms, verbose and abbreviated. In the interests of brevity, we’ll discuss the abbreviated form.

XPath syntax purposely looks similar to Unix file paths. Just like a filesystem, an XML document is a hierarchical structure. The root of a document contains everything in the document, and is referred to by a forward slash (‘/’). As we move down through the nodes in the document, we use additional forward slashes. ‘/a/b/c’, for example, would give us a node-set containing one <c> element. Another holdover from Unix is to use ‘..’ and ‘.’ for the parent node and the current node, respectively.

We can get a node-set containing all nodes on a level by using an asterisk. Therefore, ‘/a/d/*’ would give us all element nodes under <d>, returning two <c> elements. By using double forward slashes, we can select all of the nodes descending from the given node, as well as the given node itself. This means that ‘/a//c’ would return three <c> nodes.

Element attributes can be found by using the ‘@’ symbol. Specifying ‘/a//@*’, for instance, would return any and all attributes found under the <a> element, giving us the three ‘id’ attributes from the <c> elements.

There are a number of functions that you can use with XPath as well. For example, if we use the expression “/a//c[contains(@id, ‘3’)]”, we get the <c> element with an ‘id’ attribute that contains ‘3’. The [..] brackets are called a predicate. They filter the results of the query.

It’s important to note that XPath normally offers numerous alternatives for retrieving the same node-set. For instance, ‘//c’, ‘/a//c’, and ‘/a/b/c | /a/d/c’ are all valid ways of querying for the <c> elements.

This was a very quick introduction to XPath, and will hopefully be enough to get you through this article. For more information, check out the excellent online tutorials at w3schools: http://www.w3schools.com/xpath and zvon.org: http://www.zvon.org/xxl/XPathTutorial as well as the definitive reference at w3.org: http://www.w3.org/TR/xpath.

Let’s take a look at a quick example of using PHP.XPath. Listing 3 shows our XML document, Listing 4 shows our PHP code, and Listing 5 shows the output. This is a very simple example, but shows the use of the three methods that we really care about here: match(), getData(), and getAttributes().

Looking at Listing 4, the first thing we do is include the PHP.XPath class. We then set our input file, and the XPath expression to evaluate. Next, we instantiate the XPath object with the input file, and execute the XPath query. Because our expression returns a node-set (which, in this case, is really just an array of fully expanded XPath expressions) we loop over the results.

Listing 3
    <a>
      <b>
        <c id="c1">data</c>
      </b>
      <d>
        <c id="c2">more data</c>
        <c id="c3">even more data</c>
      </d>
    </a>
Listing 5
    petej@www $ php -f listing4.txt
    Node: /a[1]/b[1]/c[1]
    Data: data
    Attr: id=c1
    Node: /a[1]/d[1]/c[1]
    Data: more data
    Attr: id=c2
    Node: /a[1]/d[1]/c[2]
    Data: even more data
    Attr: id=c3
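For readers without the PHP.XPath class at hand, the same query loop can be sketched with the DOM extension bundled with current PHP releases. This is a stand-in, not the PHP.XPath API: DOMDocument and DOMXPath replace match(), getData() and getAttributes(), and the XML below is reconstructed from the node paths and attributes printed in Listing 5. The node paths that getNodePath() prints may be formatted slightly differently from Listing 5.

```php
<?php
// Sketch: DOMDocument/DOMXPath as a substitute for the PHP.XPath class.
// The document is rebuilt from the paths and id values in Listing 5.

$xml = <<<XML
<a>
  <b>
    <c id="c1">data</c>
  </b>
  <d>
    <c id="c2">more data</c>
    <c id="c3">even more data</c>
  </d>
</a>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Node-set result: loop over the matches and print path, text, attributes.
foreach ($xpath->query('//c') as $node) {
    echo 'Node: ', $node->getNodePath(), "\n";
    echo 'Data: ', $node->textContent, "\n";
    foreach ($node->attributes as $attr) {
        echo 'Attr: ', $attr->name, '=', $attr->value, "\n";
    }
}

// The sidebar's expressions, with the number of nodes each one selects.
$expressions = ['/a/b/c', '/a/d/*', '/a//c', '/a//@*', "/a//c[contains(@id, '3')]"];
foreach ($expressions as $expr) {
    printf("%-28s %d node(s)\n", $expr, $xpath->query($expr)->length);
}

// Scalar result: evaluate() hands back a number, string or boolean directly,
// so there is nothing to loop over.
echo 'count(//c) = ', $xpath->evaluate('count(//c)'), "\n";
```

The split between query() for node-sets and evaluate() for scalars mirrors the caveat in the article: an expression that returns a number, string or boolean has to be handled differently from one that returns nodes.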
For each node, we display the node, its text (returned by getData()), and get any attributes. getAttributes() returns an array of attributes, which we loop over, and display. Note that if you change the XPath expression to return a scalar (such as boolean, string, or number), this example won’t work. In that case, you wouldn’t have to loop over $r1, you’d simply output $r1 to see the result.

That about wraps up all we need to know about XPath and Tidy, and we can now take a look at the HTML_Parser class.

Dissecting the Parser
The HTML_Parser class (HTML_Parser.php, included in this article’s code distribution) consists of only a handful of properties and methods. Although it is a relatively simple class, it comes with very powerful capabilities. These capabilities come courtesy of XPath. Let’s run down the method list.

The HTML_Parser class constructor is currently empty, but is included for completeness.

set_cleanup() is used to set a flag controlling the cleanup of temporary files created by the parser. This is especially useful for debugging.

reset_parser() is a method called when parsing a new HTML file. It cleans up the object, in preparation.

parse() is what users will call to parse a new file. It accepts the name of the HTML file to parse. First it resets the object, then it runs _process() to Tidy the file, and finally it instantiates a new XPath object on the XHTML file output by Tidy. After running this method, the HTML file is ready for querying.

As mentioned above, _process() takes the HTML file, converts it to XHTML with Tidy, and saves it out to a new temporary file. If cleanup is allowed, we register a shutdown function to delete our temporary file when the object is cleaned up. This will prevent old temporary files from cluttering up your system. Notice the ‘--numeric-entities’ option in the call to Tidy.
We have to set this option because otherwise the XML parser will halt when it runs into HTML entities, such as ‘&nbsp;’. These entities are undefined in XML, so the XML document is no longer well-formed. ‘--numeric-entities’ changes these entities to their numeric equivalents, which any XML parser understands.

_delete_xhtml_file(), as mentioned above, is a callback function that is used like a class destructor. It deletes the temporary file created by the parser.

evaluate() is an interface to the XPath object. It accepts an XPath expression and passes it on to the XPath object for execution. If the expression returns a node-set, get_node_set_data() is called to get any attributes and text associated with the nodes in the node-set.

resolve_relative_url() is one of those functions that people are always looking for, and is hard to get right. There are definitely circumstances where it will fail, but it works well for the most part. Although it isn’t used anywhere in this class, it is present so that links extracted with evaluate() can be resolved. It accepts a relative URL to resolve, and a base URL to resolve against.

Well, that wraps up the parser. Let’s feed Listing 1 into our HTML_Parser, and see if we can select the <i> element. Listing 6 shows our new PHP script, and the output is shown below.

    petej@www $ php -f listing6.txt
    Node: /html[1]/body[1]/table[1]/tr[1]/td[1]/p[1]/b[1]/i[1]
    Name: i
    Data: some
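Since relative URL resolution is described above as hard to get right, a small sketch may help show what is involved. This is not the article’s resolve_relative_url(); it is a simplified, hypothetical `resolve_url()` loosely following the reference-resolution steps of RFC 3986, and it assumes the base is a well-formed absolute http or https URL.

```php
<?php
// Sketch of relative URL resolution. Hypothetical helper, not Scout's code.
// Assumes $base is an absolute URL with a scheme and host.

function resolve_url(string $relative, string $base): string
{
    // Already absolute (starts with a scheme such as "http:"): keep as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    if (strpos($relative, '//') === 0) {          // protocol-relative: //host/path
        return $b['scheme'] . ':' . $relative;
    }
    if ($relative === '' || $relative[0] === '#') { // same document
        return $root . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $relative;
    }
    if ($relative[0] === '?') {                   // new query on the same path
        return $root . $basePath . $relative;
    }

    // Root-relative paths replace the base path; anything else replaces
    // whatever follows the last "/" of the base path.
    $path = $relative[0] === '/'
        ? $relative
        : substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;

    // Keep any query string or fragment out of the dot-segment cleanup.
    $cut = strcspn($path, '?#');
    $suffix = (string) substr($path, $cut);
    $segments = explode('/', ltrim(substr($path, 0, $cut), '/'));

    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($out);                      // go up one level
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';                              // keep the trailing slash
    }
    return $root . '/' . implode('/', $out) . $suffix;
}

echo resolve_url('../img/x.png', 'http://example.com/dir/sub/a.html'), "\n";
```

Even this short version has to special-case five kinds of reference before it touches the path, which gives a feel for why the article warns that such a function will fail in some circumstances.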
Let’s look at Listing 6. The first thing we do is include the HTML_Parser class. Next we set the input file, and the XPath expression to use (this will get all <i> elements in the document). Now we can instantiate the object, parse the file, and evaluate our XPath expression. Finally, we take our resultant node-set, and dump it out. From our output we can see that we found the <i> tag we were looking for.

Technically, we could find anything in our document with the Parser, but it would be really nice to have a more convenient interface to work with. That’s where the HTML_Scraper class comes in. Let’s take a look.

Dissecting the Scraper
The scraper (HTML_Scraper.php, included in this article’s code distribution) is a very simple wrapper for the parser. Its only purpose is to make it easier to find things in the document by sheltering you from having

Listing 6
Because our document only contains one form, we could have achieved the same result by calling get_forms() with an empty argument list. Listing 8 shows our form’s structure. Notice how get_forms() pulled back the form’s one <input> element as well.

Now we know what HTML_Scraper does. Let’s move along, and take a closer look at Scout.

cURL
Scout’s automation engine is implemented using cURL. cURL is one of those extensions that you never knew you needed until you found out what it could do. It’s a monkey wrench of sorts, and I find it indispensable.

cURL is a general purpose Internet agent. It supports a plethora of protocols, including the ones we care about: HTTP and HTTPS. Also supported are cookies, HTTP authentication, and multipart form transfers (needed for HTTP file uploads). cURL appears in command-line and library form, and enjoys a nice warm home inside the PHP wrapper.

Scout uses cURL to simulate a web browser. All of the web requests and navigation performed by Scout are actually done through cURL.

As an example of cURL’s ability, consider the following. You can do an SSL request over cURL by specifying nothing more than a URL, or you can go all out. For example, you could specify what user agent string you want cURL to use (to simulate a particular browser), give it a username and password for the website’s HTTP authentication mechanism, specify some POST variables to send with the request, tell it to use the cookies you gathered last time you were on this site, and let it know to dump the result back to a particular file. It’s really up to you.

Now that we know what cURL is, let’s break down Scout.
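The “go all out” request described above maps onto a handful of curl_setopt() calls. The following is a minimal sketch, assuming the cURL extension is loaded; the URL, credentials, form field and cookie file path are placeholders, `make_request_handle()` is a hypothetical helper rather than anything from Scout, and the request itself is not executed so the example runs without network access.

```php
<?php
// Sketch: configuring a cURL handle with the options the article lists.
// All values are placeholders and the function name is made up.

function make_request_handle(string $url, array $postFields, string $cookieFile, $outputStream)
{
    $ch = curl_init($url);
    // Pretend to be a particular browser.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    // Credentials for HTTP authentication, as "user:password".
    curl_setopt($ch, CURLOPT_USERPWD, 'username:password');
    // POST variables to send with the request.
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    // Send cookies gathered on an earlier visit, and save new ones afterwards.
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    // Follow "Location:" redirects.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Write the response body to this stream instead of standard output.
    curl_setopt($ch, CURLOPT_FILE, $outputStream);
    return $ch;
}

$out = tmpfile(); // a real temporary file; CURLOPT_FILE needs a file-backed stream
$ch = make_request_handle(
    'https://www.example.com/login.php',
    ['user' => 'petej'],
    sys_get_temp_dir() . '/scout_cookies.txt',
    $out
);
// curl_exec($ch) would perform the request at this point.
echo $ch === false ? "could not create handle\n" : "handle configured\n";
```

Keeping the configuration in one function makes it easy to see which browser behaviours are being simulated, and to switch individual ones off while debugging.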
Listing 8
    petej@www $ php -f listing7.txt 2> /dev/null
    Array
    (
        [0] => Array
            (
                [node] => /html[1]/body[1]/form[1]
                [name] => form
                [text] =>
                [attributes] => Array
                    (
                        [action] => foo.php
                        [method] => post
                    )
                [fields] => Array
                    (
                        [0] => Array
                            (
                                [node] => /html[1]/body[1]/form[1]/input[1]
                                [name] => input
                                [text] =>
                                [attributes] => Array
                                    (
                                        [type] => text
                                        [name] => mytext
                                        [value] =>
                                    )
                            )
                    )
            )
    )

Scout-ing the Web
For all that it’s capable of, Scout (Scout.php, included in this article’s code distribution) isn’t too complicated. Scout’s core is the request() method. This method performs a web request and returns the result to a file. This file’s name is then passed into the parse() method of HTML_Parser, which – no surprises here – parses the file, preparing it for querying.

Scout’s $parameters property contains the variables that are to be passed along with the request. These are either form variables, or querystring (?foo=bar) variables. The parameters will be sent with the request either as GET or POST variables, depending on the method specified.

There are three methods for dealing with parameters. set_parameter() sets a single parameter. set_all_parameters() replaces the current parameter array with a new one. unset_parameter() removes a single parameter.

If we jump back to our request() method, we can see that it takes two arguments: a URL and a method. The URL is the target of the request. The method is just the form method to use for
the request (GET or POST). This method really only applies when there are parameters defined. We don’t require these arguments, because they may very well have been set by something else automatically, but we’ll discuss this in a moment.

The next step in request() is to initialize the cURL session. After obtaining a handle to the session, we tell cURL to return all response headers to us, as well as to return the results of the request. We set the user agent (which we can set using set_user_agent()), and tell cURL to follow any “Location: ...” headers.

Following location headers is important because quite often URLs are specified without a script name or trailing slash, like so: http://www.shaman.ca. This will typically generate a location header, usually to something like http://www.shaman.ca/. If we don’t follow location headers, our automation script will probably never reach its intended destination.

request()’s next step is to set up the cookie jar. This is a file that is writable by the web server, and will store the cookies gathered along the way. It can be quite interesting to view this cookie file after some random spidering!

Next, request() sets up the request variables, specifying them as either POST or GET. POST variables are stored in a cURL configuration option, whereas GET variables are passed in normally on the URL.

Now we can specify the actual target of our request, and set up a file to capture the results in. Like HTML_Parser, we set up a destructor method (_delete_html_file()) to clean up this result file for us. Finally, we execute the cURL request and parse the result, returning the HTTP status code from our request.

That method does a lot, but there’s a lot more it could do. cURL’s configuration options are numerous and varied. We’ve just scratched the surface of this beast!

“About the only thing Scout can’t do, is give you ideas”

If all we ever did was call request() and use HTML_Scraper’s interface, we could do a lot. Even better, though, would be to use forms and links automatically. Lucky for us, there’s select_form() and select_link().

select_form() and select_link() are methods for automatically finding a form or a link, and populating all of the relevant details in Scout. These two methods are what give Scout its “gusto”. They make it very easy to navigate around a website.

If you call select_form() and pass in a condition type and condition (same format as in HTML_Scraper), select_form() will try to find the specified form in the document. If it is found, select_form() will populate the URL (resolved, if necessary) from the form’s action, the method from the form’s method, load the $parameters array with names and values from the form’s fields, and store the submit buttons for later use.

Successfully calling select_link() with a condition type and condition will populate the URL (resolved, if necessary) and parameters. The method will automatically be set to GET, and no submit buttons are stored.

get_url() returns the URL that will be used in the next request, or the URL that was used in the last request, depending on when it’s called.

The last method to cover here is the submit() method. submit() is just a wrapper around request(), used to mimic the clicking of a form submission button. It takes one argument, which is the label of the submit button. This button’s name and label are attached to the parameters array, and request() is called normally.

And we’re done! Now you know how Scout does its dirty work, and you have a few options: go play with it, extend it, or ignore it. My vote goes to one of the first two!
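The difference between GET and POST parameter handling, and the way submit() adds the clicked button to the parameters, can be sketched in a few lines. This is an illustration only: `prepare_request()`, the `go` button name and the example URL are assumptions, not Scout’s actual code.

```php
<?php
// Sketch: attaching request parameters for GET versus POST.
// prepare_request() is a hypothetical helper, not part of Scout.

function prepare_request(string $url, string $method, array $parameters): array
{
    $query = http_build_query($parameters);

    if (strtoupper($method) === 'POST') {
        // POST: the URL is untouched and the variables travel in the body
        // (with cURL, this string would be given to CURLOPT_POSTFIELDS).
        return ['url' => $url, 'body' => $query];
    }

    // GET: append the variables to the URL, respecting an existing query string.
    if ($query !== '') {
        $url .= (strpos($url, '?') === false ? '?' : '&') . $query;
    }
    return ['url' => $url, 'body' => null];
}

// Parameters as they might look after a form has been selected.
$parameters = ['mytext' => 'hello world'];

// Mimic submit(): attach the clicked button's name and label, then send.
$parameters['go'] = 'Search'; // hypothetical submit button name => label

print_r(prepare_request('http://www.example.com/foo.php', 'get', $parameters));
print_r(prepare_request('http://www.example.com/foo.php', 'post', $parameters));
```

Including the submit button matters because many server-side scripts check which button was pressed before they process a form at all.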
Examples
Let’s look at those examples I promised. First up, a spider.

Crawl the Web
Web spiders start somewhere, and then navigate all links (within some rule-set) from that location. Search engine spiders would index content as they go. Bargain-hunting spiders would only travel to the e-commerce and catalog sites that they’re interested in. Mailing list spiders travel around everywhere, harvesting any email address they can find.

For brevity’s sake, the design of our spider will be very simple. Its only rule will be to find all of the links on a page, and follow them breadth-first.

Take a look at Listing 9. First, we include the Scout class, and set our starting URL. We make our first request, and initialize the visited array with our starting URL. The visited array will prevent us from traveling to the same link over and over.

Next, we get all of the links in the document, using HTML_Scraper’s get_links() method, and resolve the relative ones using the resolve_all_links() function. resolve_all_links() simply steps through the links array, resolving each one with HTML_Parser’s resolve_relative_url() method.

Now we enter the loop. As long as there are links left to visit, we keep looping. Once in the loop, we check if we’ve visited this link before. If not, we add it to the visited array, spit out a couple of messages, and repeat our earlier (pre-loop) tasks. “Lather, rinse, repeat if necessary.”

Have a look at some sample output in Listing 10. You can see how quickly the number of links to parse grows.
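The breadth-first loop described above can be sketched independently of Scout. In this illustration the `$fetchLinks` callback stands in for “request the page and return its resolved links”; it reads from a small in-memory array of made-up URLs so the example runs without network access. Unlike the spider described in the article, this sketch marks a URL as visited when it is queued rather than when it is requested, so no URL ever sits in the queue twice.

```php
<?php
// Sketch of a breadth-first crawl. crawl() and the example.com link graph
// are hypothetical; a real spider would fetch and parse each page.

function crawl(string $startUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $visited = [$startUrl => true];
    $queue = [$startUrl];   // first in, first out gives breadth-first order
    $order = [];

    while ($queue && count($order) < $maxPages) {
        $url = array_shift($queue);
        $order[] = $url;
        foreach ($fetchLinks($url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $order;
}

// A tiny link graph standing in for real pages.
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/c', 'http://example.com/'],
    'http://example.com/b' => ['http://example.com/c'],
    'http://example.com/c' => [],
];

$order = crawl('http://example.com/', function (string $url) use ($site): array {
    return $site[$url] ?? [];
});
print_r($order);
```

The $maxPages limit is there because, as the sample output in the article suggests, the number of queued links grows very quickly on the real web.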
Guided Navigation
So Scout can crawl the web? “Big deal,” you say. “I could easily write a script that would do that in half as many lines, and be at least twice as fast.” You’re right. Strangely, it seems that Scout doesn’t like being left alone to work half as much as he likes being told what to do. This is because Scout is designed to be good at searching out data from a structured document using a standard syntax. Scout is not designed for speed.

If you’ve ever sent or received a FedEx package, you probably know that you can track the package as it goes through the various checkpoints. This will let you know where your package is at any given time. (Note that it will not explain why your package, which was