Tidying up your HTML with PHP
June 7, 2004
Legal notice
This article is licensed under the Creative Commons License.
Abstract
This talk introduces the new Tidy extension included in the upcoming PHP 5 release and shows how it can be used to work with and generate well-formed HTML quickly and effectively. Specifically, this session will focus on:
* How to use Tidy to diagnose existing HTML for errors
* Using Tidy to clean and repair HTML documents
* An overview of the most useful Tidy options
* Using the Tidy OO interface to navigate the HTML doc tree
* Examples of how to navigate HTML effectively using Tidy
With the introduction of the Tidy extension, users will no longer need to rely on messy regular expressions to mine data such as URLs, e-mail addresses, or entire tables from HTML documents. Furthermore, thanks to the diagnostic facilities provided by Tidy, HTML documents can be diagnosed and even corrected on the fly to ensure complete HTML or XHTML compliance before being sent to the end user. This talk assumes users are familiar with basic PHP object-oriented and procedural constructs.
1 Introduction
Tidy is a new extension for PHP 5 which allows you to parse, validate, manipulate and repair markup documents from within your PHP 5 scripts. It is based on the tidy command line utility released by the W3C, and the extension comes bundled standard with PHP 5 beginning with PHP 5.0 Beta 3. This article will explore the Tidy extension, and its use within your PHP 5 applications. Although this article relies on Tidy 2.0 (which is available only in PHP 5), a very stable 1.0 version of Tidy is also available for PHP 4.3.x and above. It can be found in the PHP PECL Repository at http://pecl.php.net/.
1.1 Installation
Although the Tidy extension comes bundled by default in PHP 5.0, it must be enabled in order to be used and requires that the libTidy library be installed on your system. The latest version of libTidy can be found on the official Tidy utility web site at http://tidy.sourceforge.net/.
[user@localhost]$ tar -zxvf tidy_src.tgz
[user@localhost]$ cd tidy
[user@localhost]$ /bin/sh build/gnuauto/setup.sh
[user@localhost]$ ./configure
[user@localhost]$ make
[user@localhost]$ make install
Once libTidy has been downloaded and installed, use the --with-tidy configuration option to configure Tidy support into PHP.
[user@localhost]$ cd php
[user@localhost]$ ./configure --with-tidy=/path/to/libtidy
As is the case with most extensions in PHP, Windows users are provided with everything they need to use Tidy in their applications by default. To test that Tidy support is enabled, check the output of the phpinfo() function or execute the CLI version of PHP with the -m parameter and look for the tidy module.
1.2 An introduction to Syntax
Tidy, as is the case with many of the new PHP 5 specific extensions, supports a dual procedural / object oriented syntax. This syntax is designed for maximum API flexibility for the developer and works in the following fashion. For now, don’t concern yourself with the functions/methods being called themselves. They will all be discussed later. Consider the following procedural use of the tidy extension:
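A minimal sketch of the procedural style, assuming the tidy extension is loaded. An inline HTML string (an assumption, chosen so the example runs offline) stands in for a remote URL, and the variable names are illustrative only:

```php
<?php
// Assumption: a small, deliberately broken inline document instead of a URL.
$html = '<p>Hello, <b>World!</p>';

// Parse the markup; the return value is the document handle.
$tidy = tidy_parse_string($html);

// Procedural style: the handle is passed as the first argument.
tidy_clean_repair($tidy);
$procedural_output = tidy_get_output($tidy);

echo $procedural_output;
```

Every call in this style takes the handle explicitly, which is the part that changes in the object-oriented form.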
As one might expect, the $tidy value returned from the call to the tidy_parse_file() function is a handle representing the parsed URL http://www.coggeshall.org/. However, in PHP 5, this handle is more than a simple resource. Rather, it is a complete object which may either be passed to other functions in the tidy extension or used to call those functions directly as methods. Thus, the following code is also acceptable:
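A sketch of the same idea using a method call on the handle, again assuming the tidy extension is loaded and using an inline string (an assumption) in place of a URL:

```php
<?php
$html = '<p>Hello, <b>World!</p>';

$tidy = tidy_parse_string($html);

// Method call on the handle; equivalent to tidy_clean_repair($tidy).
$tidy->cleanRepair();

// The same object can still be handed to a procedural function.
$oo_output = tidy_get_output($tidy);
echo $oo_output;
```

Because the handle is an object, the two styles can be mixed freely on the same document, as this sketch does.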
Note that, unlike the first example, the second uses the cleanRepair() method of the object returned from the tidy_parse_file() function. Because this example uses the cleanRepair() method, there is no need to specify the handle which should be used, and thus the first parameter (the handle to manipulate) is omitted.
Because the "resources" returned from the tidy extension are really PHP 5 objects, the extension can take advantage of many of the powerful object-oriented features available in PHP 5. One of these advantages is the way objects may be cast to other types transparently. For Tidy, this means that the $tidy object returned from the tidy_parse_file() function may also be treated as a string whose contents are equivalent to the value property, as shown below:
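A short sketch of the string conversion, assuming the tidy extension is loaded; the input string and variable names are illustrative:

```php
<?php
$tidy = tidy_parse_string('<p>Hello, World!</p>');
$tidy->cleanRepair();

// Explicit cast: the object converts to the current document text.
$as_string = (string) $tidy;

// Implicit cast: echo converts the object the same way.
echo $tidy;
```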
Finally, if you would like to extend the object returned from Tidy, the tidy class is available to be instantiated:
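One way this could look is sketched below. The subclass ReportingTidy and its issueCount() method are hypothetical names invented for illustration; only the tidy base class, parseString() and tidy_get_error_buffer() come from the extension:

```php
<?php
// Hypothetical subclass; not part of the tidy extension itself.
class ReportingTidy extends tidy
{
    // Count the lines Tidy has written to its error buffer so far.
    public function issueCount(): int
    {
        $buffer = trim((string) tidy_get_error_buffer($this));
        return $buffer === '' ? 0 : count(explode("\n", $buffer));
    }
}

$doc = new ReportingTidy();
$doc->parseString('<p>Hello, <b>World!</p>');
echo $doc->issueCount(), " issue(s) reported\n";
```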
Although mixing them is not recommended, the procedural and object-oriented syntaxes of Tidy in PHP 5 are completely interchangeable. In all examples in this article I will stick to a single type of syntax to avoid confusion. In general, procedural and object-oriented syntaxes may be converted between each other in the following fashion:
• Remove or add the tidy prefix when moving between a function name and a method name.
• Remove or add the underscores between words (i.e. tidy_clean_repair() becomes $tidy->cleanRepair()).
• When calling with the object syntax, the first parameter of every function (the handle to a valid tidy document) is omitted.
1.3 Using Basic Tidy
1.3.1 Parsing Documents
With syntax out of the way, let's take a look at the basic usage of the Tidy extension. All real functionality within the tidy extension begins with the tidy_parse_file() function or an equivalent function or method. The syntax for the tidy_parse_file() function is as follows:
tidy_parse_file($file [, $options [, $encoding [, $use_inc_path]]]);
$file is a valid PHP filename, either in the local file system or a remote URL / stream resource. The second parameter, $options, is a very important parameter representing the configuration options which will be applied to this document. For now, we'll ignore this parameter, as a large portion of this article is devoted to it. The third optional parameter, $encoding, is a string representing the encoding of the file being parsed. The fourth and final parameter, $use_inc_path, is a boolean value indicating whether PHP should attempt to find the requested file in the include path (if not found otherwise). If you would like to parse a document which already exists within a PHP variable, Tidy also provides the tidy_parse_string() function:
tidy_parse_string($data [, $options [, $encoding]]);
For this function, both $options and $encoding are identical to their equivalent parameters in the tidy_parse_file() function. The only difference is the first parameter, $data, which accepts a string to parse rather than a filename or stream resource.
Regardless of where the data is taken from (a string or a file), when a document is parsed any syntax errors (missing quotes, end tags, etc.) are automatically corrected by Tidy's intelligent parser. Once the document has been parsed, both the tidy_parse_file() and tidy_parse_string() functions return a tidy document handle representing the document to other Tidy functions.
1.3.2 Cleaning and Repairing
Although Tidy corrects any syntax errors it finds when it parses a document, other errors that are not syntax related (such as omitting a required tag in an HTML document) are not corrected. To correct these errors the document must be cleaned and repaired using the tidy_clean_repair() function. The syntax for this function is as follows:
tidy_clean_repair($tidy);
Where $tidy is the document handle returned from a call to either tidy_parse_file() or tidy_parse_string(). Exactly how Tidy cleans and repairs the document depends very heavily on the configuration options assigned to this document.
1.3.3 Retrieving Output
As shown in earlier examples, the output of documents manipulated by Tidy can be retrieved in a number of ways. Procedurally, the tidy_get_output() function is used to retrieve the current state of the document in memory:
tidy_get_output($tidy);
From an object oriented perspective, the document handle can be treated as a string directly (which is equal to the output of the tidy_get_output() function), or the value property may be accessed as well:
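A sketch comparing the ways of reading the document back, assuming the tidy extension is loaded; the input and variable names are illustrative:

```php
<?php
$tidy = tidy_parse_string('<p>Hello, World!</p>');
$tidy->cleanRepair();

$from_function = tidy_get_output($tidy); // procedural accessor
$from_property = $tidy->value;           // object property
$from_cast     = (string) $tidy;         // string conversion

// Report whether the accessors agree on the document text.
var_dump($from_function === $from_property);
var_dump($from_function === $from_cast);
```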
1.3.4 Dealing with errors
Not only is Tidy very good at intelligently parsing and repairing markup documents, but it also provides tools to identify specifically the problems it found in the original document. These errors are logged as soon as the document is parsed, and can be retrieved by calling the tidy_get_error_buffer() function:
tidy_get_error_buffer($tidy);
Where $tidy is the document handle to retrieve the error buffer for. When executed this function will return a string representing all of the errors encountered thus far complete with a line number / column listing of the offending line in the original document as shown:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 11 - Warning: replacing unexpected i by </i>
line 1 column 27 - Warning: replacing unexpected u by </u>
line 1 column 49 - Warning: discarding unexpected
line 1 column 1 - Warning: inserting missing 'title' element
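A sketch of reading the error buffer and splitting it into structured entries, assuming the tidy extension is loaded and that entries follow the "line N column M - Level: message" layout shown above. The $issues array and its keys are illustrative names, not part of the extension:

```php
<?php
$tidy = tidy_parse_string('<p>Hello, <i>World!</p>');

// Procedural form; $tidy->errorBuffer holds the same text in the object syntax.
$buffer = (string) tidy_get_error_buffer($tidy);

// Split each "line N column M - Level: message" entry into its parts.
$issues = [];
foreach (explode("\n", $buffer) as $entry) {
    if (preg_match('/^line (\d+) column (\d+) - (\w+): (.*)$/', trim($entry), $m)) {
        $issues[] = [
            'line'    => (int) $m[1],
            'column'  => (int) $m[2],
            'level'   => $m[3],
            'message' => $m[4],
        ];
    }
}

print_r($issues);
```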
In an object syntax this information can also be retrieved through the errorBuffer property.
1.3.5 Shorthand Tidy
Since many common operations in Tidy are based on three of the functions above (tidy_parse_file(), tidy_clean_repair(), and tidy_get_output()), the Tidy extension provides two shorthand functions which combine them into a single call. These functions are tidy_repair_file() and tidy_repair_string(), for files and strings respectively. The syntax for the tidy_repair_file() function is as follows:
tidy_repair_file($filename [, $options [, $encoding [, $use_inc_path]]]);
Where each parameter is identical to that found in the tidy_parse_file() function. Likewise, the syntax for the tidy_repair_string() function is as follows:
tidy_repair_string($data [, $options [, $encoding]]);
Which coincides with the prototype for the tidy_parse_string() function. When these functions are used, the given document is parsed and repaired based on the encoding and configuration specified, and a string containing the final result is returned.
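A sketch of the shorthand form, assuming the tidy extension is loaded; the input string is illustrative:

```php
<?php
// Parse, clean/repair and fetch the output in a single call.
$repaired = tidy_repair_string('<p>Hello, <b>World!</p>');

echo $repaired;
```

Note that the shorthand returns a plain string rather than a document handle, so it suits cases where only the repaired markup is needed.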
1.4 Tidy Configuration Options
Most of the power of the Tidy extension lies in the more than 80 individual configuration options which may be set. These options control everything from the way Tidy treats the input as it is parsed, to how it treats things such as PHP code, and even the format in which the final document is rendered. In many cases working with Tidy is a matter of setting up the appropriate configuration options with very little change in code.
Tidy applies a default configuration to every document. This default configuration can be altered (as you will see) through a few different means. To begin, options may be set at run time through the $options parameter of the tidy_parse_file() or tidy_parse_string() functions introduced earlier. This parameter is either a string (naming a Tidy configuration file) or an associative array of configuration option / value pairs. For example, below is an example of configuring Tidy to output a given HTML document in XHTML 1.0 format:
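A sketch of passing options as an associative array, assuming the tidy extension is loaded. The output-xhtml option is a standard Tidy option; the input string is illustrative:

```php
<?php
// Ask Tidy to emit XHTML instead of HTML.
$config = ['output-xhtml' => true];

$tidy = tidy_parse_string('<p>Hello<br>World</p>', $config);
tidy_clean_repair($tidy);

$xhtml = tidy_get_output($tidy);
echo $xhtml;
```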
As an alternative to specifying configuration options at runtime, they may also be set in a configuration file which is then loaded by specifying the full path and filename for the $options parameter as shown below:
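A sketch of loading options from a file, assuming the tidy extension is loaded. The configuration file here is hypothetical and is written to a temporary location only so that the example is self-contained:

```php
<?php
// Hypothetical configuration file, created in the temp directory for this sketch.
$config_path = tempnam(sys_get_temp_dir(), 'tidy');
file_put_contents($config_path, "output-xhtml: yes\nshow-body-only: yes\n");

// A string in the $options position is treated as a configuration file path.
$tidy = tidy_parse_string('<p>Hello<br>World</p>', $config_path);
tidy_clean_repair($tidy);

$body_only = tidy_get_output($tidy);
echo $body_only;

unlink($config_path);
```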
The format of the tidy configuration file is fairly straight-forward. An example of a valid tidy configuration file is found below:
indent-spaces: 4
indent: auto
tidy-mark: no
show-body-only: yes
new-blocklevel-tags: mytag, anothertag
Unlike markup documents, configuration files may only reside on the local file system. Configuration files have many uses, one of which is to define a number of Tidy "profiles" for different types of operations. For instance, you could create a profile which completely strips HTML documents of whitespace (to save on bandwidth) and another which beautifies HTML for editing.
1.4.1 Default Configuration
Another use for configuration files is to override the default configuration for any Tidy document handle created. To accomplish this, create a configuration file with your desired defaults and then set the tidy.default_config configuration directive to point to this file. When set, the configuration named by this directive is used any time a new Tidy document handle is created.
1.5 Using Tidy's Parser
Beyond just validation and repair of documents, the Tidy extension also provides a robust mapping of the internal document tree to objects within PHP. Using this object oriented interface, you are able to access specific blocks of a given markup document quickly and easily. To understand how these features work, first you must understand how Tidy represents documents internally. Consider the following simple HTML document:
Example Basic HTML Document
Hello, World! This is italic text.
Internally, this document would be represented as follows within Tidy:
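One way to see such a tree is to print it from PHP. The sketch below assumes the tidy extension is loaded; the markup in $html is an assumed reconstruction of the sample document (its exact tags are a guess), and dump_tree() is a hypothetical helper written for this illustration:

```php
<?php
// Assumed reconstruction of the sample document; the exact markup is a guess.
$html = '<html><head><title>Example Basic HTML Document</title></head>'
      . '<body><p>Hello, World! <i>This is italic text.</i></p></body></html>';

// Hypothetical helper: render each node on its own line, indented by depth.
function dump_tree(tidyNode $node, int $depth = 0): string
{
    if ($node->isText()) {
        $label = '"' . trim($node->value) . '"';
    } else {
        // Nodes without a tag name (such as the root) get a placeholder label.
        $label = $node->name ?: '(no name)';
    }

    $out = str_repeat('  ', $depth) . $label . "\n";
    foreach ($node->child ?? [] as $child) { // child is null for leaf nodes
        $out .= dump_tree($child, $depth + 1);
    }
    return $out;
}

$tidy = tidy_parse_string($html);
$tree = dump_tree($tidy->root());
echo $tree;
```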
From within PHP, Tidy allows access to this document tree through four methods available from the tidy document handle. These four methods (root(), head(), html() and body()) correspond to key points within any valid HTML document and return an instance of a class not yet introduced: the tidyNode class. This class represents a single node within the document tree and provides the following properties and methods:
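A sketch that inspects a few tidyNode members on the body node, assuming the tidy extension is loaded. It touches only a subset (name, type, id, line, column, child and hasChildren()), and the sample markup is illustrative:

```php
<?php
$html = '<html><head><title>Demo</title></head>'
      . '<body><p>First</p><p>Second</p></body></html>';

$tidy = tidy_parse_string($html);
$body = $tidy->body(); // tidyNode for the body element

echo 'name: ', $body->name, "\n";
echo 'type is a start tag: ', var_export($body->type === TIDY_NODETYPE_START, true), "\n";
echo 'id is TIDY_TAG_BODY: ', var_export($body->id === TIDY_TAG_BODY, true), "\n";
echo 'position: line ', $body->line, ', column ', $body->column, "\n";
echo 'has children: ', var_export($body->hasChildren(), true), "\n";

// child holds the direct children as an array of tidyNode objects.
$child_names = [];
foreach ($body->child ?? [] as $child) {
    $child_names[] = $child->name;
}
echo 'children: ', implode(', ', $child_names), "\n";
```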
Through the use of the tidyNode class, it is possible to navigate to any part of the document tree. For instance, to retrieve the background color of a given HTML document (the BGCOLOR attribute of the <body> tag) the following could be used:
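A sketch of that lookup, assuming the tidy extension is loaded and that attribute names are exposed in lower case in the node's attribute array. The sample markup and color value are illustrative:

```php
<?php
$html = '<html><head><title>Demo</title></head>'
      . '<body bgcolor="#ffcc00"><p>Hi</p></body></html>';

$tidy = tidy_parse_string($html);
$body = $tidy->body();

// attribute is an associative array of name => value (null when there are none).
$bgcolor = $body->attribute['bgcolor'] ?? null;

echo $bgcolor ?? '(no bgcolor set)', "\n";
```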
Note that an instance of the tidyNode class can also be treated as a string, the contents of which is the same as the value property (the contents of this node and all of its child nodes). All nodes which represent HTML tags are assigned an id property. This property represents a numeric identifier for the HTML tag, and corresponds to a predefined constant within PHP for that tag. This property makes searching for a particular type of tag faster and easier than ever before. The format of the constants used to represent each tag is as follows:
TIDY_TAG_<tagname>
Where <tagname> is the HTML tag name. For an <a> tag, the constant would therefore be TIDY_TAG_A. To demonstrate the use of the Tidy parser, consider the following function dump_urls(). This function searches through the given node and returns all of the URLs found within anchor (<a>) tags:
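A sketch of what such a function could look like, assuming the tidy extension is loaded. This is one possible implementation rather than the author's original listing, and the sample markup is illustrative:

```php
<?php
// Collect the href of every <a> tag at or beneath $node, depth first.
function dump_urls(tidyNode $node): array
{
    $urls = [];

    // Non-tag nodes (text, comments) may have no id, hence the null fallback.
    if (($node->id ?? null) === TIDY_TAG_A && isset($node->attribute['href'])) {
        $urls[] = $node->attribute['href'];
    }

    // Recurse into the children, if any.
    foreach ($node->child ?? [] as $child) {
        $urls = array_merge($urls, dump_urls($child));
    }

    return $urls;
}

$html = '<p><a href="http://www.php.net/">PHP</a> and '
      . '<a href="http://tidy.sourceforge.net/">Tidy</a></p>';

$urls = dump_urls(tidy_parse_string($html)->body());
print_r($urls);
```

Comparing against the tag constant avoids string comparisons on the tag name at every node.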
This function, which uses recursion to traverse the entire document tree, is much easier to write than older regex-style methods of markup data mining, with a minimal impact on performance. Similar methods can be used to extract any data from an HTML document, such as all of the tables, the URLs of images, and more.