PHP Web Scraping
Pattern recognition using regular expressions (Simple)

Often while scraping, we may not know the specific position of the required data on the page, or the page may not be formatted in such a way that we can easily access and parse that data using a parser. The data may, however, conform to a specific pattern, such as an e-mail address, for example [email protected]. In this case, we may use regular expressions (regex or regexp) to match the content on a page against a specific pattern that we define.
How to do it...

The following are the steps to be performed for displaying the results of the scrape:

1. Enter the following code into a new PHP project:
2. Save the project as 4-pattern-recognition-regex.php.
3. Execute the script.
4. The results of the scrape will be displayed on the screen as follows:

Array
(
    [0] => [email protected]
    [1] => [email protected]
    [2] => [email protected]
    [3] => [email protected]
    [4] => [email protected]
    [5] => [email protected]
    [6] => [email protected]
)
How it works...

1. After entering the curlGet() function, we execute it, passing the URL http://www.packtpub.com/contact, which requests the web page and stores the returned results in the $packtContactPage variable as follows:

$packtContactPage = curlGet('http://www.packtpub.com/contact'); // Calling function curlGet()
2. Next, we define our regular expression to match the pattern of an e-mail address and store it in the $emailRegex variable as follows:

$emailRegex = '/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.\-_]+)(\.[A-Za-z]{2,5})/'; // Regex pattern to match email addresses
3. We then execute PHP's preg_match_all() function, passing in parameters which contain firstly our regex pattern to match, $emailRegex, then the page which we would like to match against, $packtContactPage, and lastly the $scrapedEmails array, which will be used to store any matches found, as shown in the following code:

preg_match_all($emailRegex, $packtContactPage, $scrapedEmails); // Matching regex patterns and assigning results to array
4. If we were to look at the array returned by preg_match_all() at this point, we would see in the first array key, $scrapedEmails[0], all of the e-mail addresses returned, containing many duplicates. We need to eliminate these by running the array through PHP's array_unique() function, and then re-index the array keys by placing the results into a new array, $emailAddresses, using the array_values() function as follows:

$emailAddresses = array_values(array_unique($scrapedEmails[0])); // Extracting unique entries in $scrapedEmails into $emailAddresses array
5. Finally, we can print the array of scraped e-mail addresses, $emailAddresses, to the screen using the following code:

print_r($emailAddresses);
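Pulling the steps above together, the whole flow can be sketched against a small inline sample page rather than the live Packt contact page (the sample HTML and the example.com addresses here are hypothetical stand-ins for the scraped content):

```php
<?php
// A minimal end-to-end sketch of this recipe. The live call to
// curlGet('http://www.packtpub.com/contact') is replaced with an
// invented sample page so the flow can be seen in isolation.
$packtContactPage = '<html><body>
    <p>Support: support@example.com</p>
    <p>Sales: sales@example.com</p>
    <p>Sales again: sales@example.com</p>
</body></html>';

// Regex pattern to match email addresses (the recipe pattern)
$emailRegex = '/([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.\-_]+)(\.[A-Za-z]{2,5})/';

// Matching regex patterns and assigning results to array
preg_match_all($emailRegex, $packtContactPage, $scrapedEmails);

// Extracting unique entries and re-indexing the array keys
$emailAddresses = array_values(array_unique($scrapedEmails[0]));

print_r($emailAddresses);
```

Running this prints only the two distinct addresses: the duplicate sales@example.com match is dropped by array_unique().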
There's more...

Using the code in this recipe, it is possible to replace the regular expression used to match e-mail addresses with one that matches any pattern of text on a page. Covering the structure of regular expressions effectively is not practical in such a short space. If you would like to learn more, a great online resource can be found at http://www.regular-expressions.info. To test any regular expressions that you may create, an online resource that often comes in handy can be found at http://www.regexpal.com.
Useful regex patterns

As said previously, covering regular expressions effectively is not possible in such a short space. So, the following table lists some common expressions that are likely to come in handy with web scraping:

Item: Telephone Number
Example: (123) 456 7890
Regex: /\(?\d{3}\)?[-\s.]?\d{3}[-\s.]?\d{4}/

Item: Postal Code
Example: 12345-6789
Regex: /([0-9]{5}([- ]?[0-9]{4})?)/

Item: E-mail Address
Example: Usern4me@hostname.com
Regex: /([A-Za-z0-9\.\-\_\!\#\$\%\&\'\*\+\/\=\?\^\`\{\|\}]+)\@([A-Za-z0-9.\-_]+)(\.[A-Za-z]{2,5})/

Item: URL
Example: https://sub.domain.tld
Regex: /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/

Item: IP Address
Example: 255.255.255.255
Regex: /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/

Item: Price
Example: $999.99
Regex: /(\$[0-9]+(\.[0-9]{2})?)/
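As a quick sanity check, some of the patterns from the table can be run against the example values shown alongside them (the surrounding sample strings here are invented for illustration):

```php
<?php
// Sanity-checking a few of the table's patterns against their
// example values, embedded in made-up surrounding text.
$patterns = array(
    'phone'  => '/\(?\d{3}\)?[-\s.]?\d{3}[-\s.]?\d{4}/',
    'postal' => '/([0-9]{5}([- ]?[0-9]{4})?)/',
    'price'  => '/(\$[0-9]+(\.[0-9]{2})?)/',
);

$samples = array(
    'phone'  => 'Call us on (123) 456 7890 today',
    'postal' => 'Ship to 12345-6789 please',
    'price'  => 'Now only $999.99 while stocks last',
);

foreach ($patterns as $name => $regex) {
    preg_match($regex, $samples[$name], $match); // First match only
    echo $name . ': ' . $match[0] . "\n";
}
```

Each pattern extracts exactly the example value from its sample string: (123) 456 7890, 12345-6789, and $999.99.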
Potential issues with using regular expressions

Using regular expressions to parse HTML documents can lead to problems, most notably that even a perfectly-crafted regex assumes you are working with fairly well-formatted HTML. This is rarely the case, and although most modern web browsers and parsers are capable of dealing with this, regular expressions often are not. Used in isolation against a known target, as in this recipe, they are nevertheless useful. In conclusion, regular expressions should not be used as a general method for parsing unknown HTML documents, nor widely employed in your scrapers.
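To illustrate the point, a DOM parser recovers from markup that would easily trip up a naive regex. A minimal sketch, using PHP's built-in DOMDocument and an invented fragment of badly-formed HTML:

```php
<?php
// DOMDocument copes with badly-formed HTML (an unclosed <p> and a
// stray </div>) where a hand-rolled regex over the raw markup can
// easily misfire. The HTML fragment is invented for illustration.
$badHtml = '<html><body><p>Unclosed paragraph <a href="/one">One</a></div><a href="/two">Two</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // Suppress warnings about the malformed markup
$dom->loadHTML($badHtml);
libxml_clear_errors();

// The parser still recovers both anchors
foreach ($dom->getElementsByTagName('a') as $anchor) {
    echo $anchor->getAttribute('href') . "\n";
}
```

Both href values are recovered despite the broken markup, which is why a parser is the safer general-purpose tool.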
Verifying scraped data (Simple)

When we scrape data from a web page, the entire process is automated, meaning that we have no way to manually verify that the data being returned is in fact what we are expecting, until it has already been scraped and saved. Incorrect data, such as an incorrectly formatted e-mail address or an invalid URL, can present problems later on when we attempt to use it. It is therefore important to build functionality into our scrapers that verifies scraped data, to ensure we won't run into any problems later on. In this recipe, we will introduce the use of PHP filters to validate e-mail addresses and URLs.
How to do it...

1. Enter the following code into a new PHP project:
2. Save the project as 6-verifying-data.php.
3. Execute the script.
4. The results of the scrape will be displayed on the screen as follows:

Array
(
    [0] => http://www.packtpub.com
    [1] => https://careers.packtpub.com
    [2] => http://packtlib.packtpub.com
    [3] => http://www.packtpub.com/books/all?utm_source=websitebanner&utm_medium=bundle_offer&utm_campaign=all_ebooks%0A
    [4] => http://www.packtpub.com/article/buy-two-ebooks-for-price-of-one-bundle-offer?utm_source=websitebanner&utm_medium=bundle_offer&utm_campaign=all_ebooks%0A
    [5] => http://www.packtpub.com/lite-editions?utm_source=lite_edition_campaign&utm_medium=campaign_page&utm_campaign=lite_edition
    [6] => http://www.packtpub.com/article/packt-classics
    [7] => https://www.packtpub.com/content/frequently-asked-questions
    [8] => mailto:[email protected]
    [9] => mailto:[email protected]
    [10] => mailto:[email protected]
    [11] => mailto:[email protected]
    [12] => mailto:[email protected]
    [13] => mailto:[email protected]
    [14] => mailto:[email protected]
    [15] => mailto:[email protected]
    [16] => mailto:[email protected]
    [17] => mailto:[email protected]
    [18] => mailto:[email protected]
    [19] => https://www.entrust.net/customer/profile.cfm?domain=www.packtpub.com
    [20] => http://www.packtpub.com/title-idea-submission-form?utm_source=title_idea_submission&utm_medium=homepage_block&utm_campaign=new_title_ideas
    [21] => http://bit.ly/PY5og6
    [22] => http://bit.ly/PY5og6
    [23] => http://www.packtpub.com/git-version-control-for-everyone/book?utm_source=homepageblock&utm_medium=Git%3A%2BVersion%2BControl%2Bfor%2BEveryone&utm_campaign=WebMerchandising
    [24] => http://PacktLib.PacktPub.com
    [25] => http://www.packtpub.com/article/submit-book-cover-images
)
Array
(
    [0] => [email protected]
    [1] => [email protected]
    [2] => [email protected]
    [3] => [email protected]
    [4] => [email protected]
    [5] => [email protected]
    [6] => [email protected]
    [7] => [email protected]
    [8] => [email protected]
    [9] => [email protected]
)
How it works...

1. First, we have included our previously coded functions curlGet() and returnXPathObject().
2. Next, we execute the curlGet() function, passing our target page, and get the XPath object of that page using the returnXPathObject() function as follows:

$packtPage = curlGet('http://www.packtpub.com/contact'); // Calling function curlGet() and storing returned results in $packtPage variable
$packtPageXpath = returnXPathObject($packtPage); // Getting XPath object

3. We then query for all link anchors on the page. This will include the anchor text of all links on the page, only some of which are in fact e-mail addresses, as given in the following code:

$scrapedEmails = $packtPageXpath->query('//a'); // Querying for all link anchors

We then add each of these results to the array $mixedEmails:

// If results exist
if ($scrapedEmails->length > 0) {
    // For each result
    for ($i = 0; $i < $scrapedEmails->length; $i++) {
        $mixedEmails[] = $scrapedEmails->item($i)->nodeValue; // Add result to $mixedEmails array
    }
}

For each of the items in this array we then use the filter_var() function with the FILTER_VALIDATE_EMAIL filter to check whether it is an e-mail address and, if so, add it to the array $validEmails:

// For each result in $mixedEmails array
foreach ($mixedEmails as $key => $email) {
    // If result is a valid email address
    if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
        $validEmails[] = $email; // Add email to $validEmails array
    }
}
4. We then follow the same steps for URLs as we did for e-mails, this time scraping the href attributes of link anchors, and then using the FILTER_VALIDATE_URL filter to extract the links and validate them using the following code:

$scrapedLinks = $packtPageXpath->query('//a/@href'); // Querying for href attribute of all link anchors

// If results exist
if ($scrapedLinks->length > 0) {
    // For each result
    for ($j = 0; $j < $scrapedLinks->length; $j++) {
        $mixedLinks[] = $scrapedLinks->item($j)->nodeValue; // Add result to $mixedLinks array
    }
}

// For each result in $mixedLinks array
foreach ($mixedLinks as $key => $link) {
    // If result is a valid link
    if (filter_var($link, FILTER_VALIDATE_URL)) {
        $validLinks[] = $link; // Add link to $validLinks array
    }
}
5. Finally, we print out both arrays of valid e-mail addresses and valid links using the following code:

print_r($validLinks); // Printing array of validated links
print_r($validEmails); // Printing array of validated emails
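The validation half of this recipe can be seen in isolation by running the same filter_var() checks over a hypothetical mix of scraped values (the arrays below are invented for illustration, in place of the live XPath results):

```php
<?php
// Invented stand-ins for the $mixedEmails and $mixedLinks arrays that
// the XPath queries would normally populate.
$mixedEmails = array('Contact us', 'support@example.com', 'not-an-email', 'sales@example.com');
$mixedLinks  = array('http://www.packtpub.com', 'www.packtpub.com', 'https://careers.packtpub.com');

$validEmails = array();
foreach ($mixedEmails as $email) {
    if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
        $validEmails[] = $email; // Keep only well-formed addresses
    }
}

$validLinks = array();
foreach ($mixedLinks as $link) {
    if (filter_var($link, FILTER_VALIDATE_URL)) {
        $validLinks[] = $link; // Keep only well-formed URLs
    }
}

print_r($validEmails);
print_r($validLinks);
```

Note that www.packtpub.com is rejected: FILTER_VALIDATE_URL requires a scheme, so scheme-less values scraped from href attributes will be filtered out.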
There's more...

In addition to the two filters we have used previously, FILTER_VALIDATE_EMAIL and FILTER_VALIDATE_URL, there are a number of other validate filters which can be used. These can be found online at http://www.php.net/manual/en/filter.filters.validate.php.
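For a taste of those other filters, here is a short sketch of a few of them against sample values; filter_var() returns the filtered value on success and FALSE on failure:

```php
<?php
// A few of the other validate filters, shown against sample values.
var_dump(filter_var('123', FILTER_VALIDATE_INT));        // int(123)
var_dump(filter_var('12.5', FILTER_VALIDATE_INT));       // bool(false)
var_dump(filter_var('12.5', FILTER_VALIDATE_FLOAT));     // float(12.5)
var_dump(filter_var('192.168.0.1', FILTER_VALIDATE_IP)); // string "192.168.0.1"
var_dump(filter_var('999.999.0.1', FILTER_VALIDATE_IP)); // bool(false)
```

Because a failed check returns FALSE rather than throwing an error, these filters slot neatly into the same if (filter_var(...)) pattern used in this recipe.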
Retrieving and extracting content from e-mails (Advanced)

While we could use the techniques covered in the recipes so far to log in to an e-mail account and scrape content via its web interface, it is usually much quicker and more practical to access the content directly through the native e-mail protocols. Many e-mail providers offer IMAP or POP3 access, so we can make direct requests without having to go through a web interface. This recipe will cover how to log in to an e-mail account and download e-mails using IMAP, from which we can extract the data we require.
Getting ready

Finding the IMAP or POP3 settings for your particular e-mail provider will consist of the following:

- Server address, for example imap.gmail.com
- Port number; defaults are 993 for IMAP and 995 for POP3
- Encryption method, for example SSL or TLS
- Username, for example [email protected]
- Password, for example password123

This recipe uses Gmail. Gmail users can use the settings in the code without any change.
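The settings listed above combine into the mailbox string that imap_open() expects. A small sketch of building it from its parts, using the Gmail example values (substitute your own provider's settings):

```php
<?php
// Building the imap_open() mailbox string from the settings above.
// The values are the Gmail examples from the list.
$server     = 'imap.gmail.com';
$port       = 993;      // 993 for IMAP, 995 for POP3
$service    = 'imap';   // or 'pop3'
$encryption = 'ssl';    // or 'tls'

$host = '{' . $server . ':' . $port . '/' . $service . '/' . $encryption . '/novalidate-cert}';
echo $host . "\n"; // {imap.gmail.com:993/imap/ssl/novalidate-cert}
```

The /novalidate-cert flag is only needed when the server uses a self-signed certificate, as discussed in the next section.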
How to do it...

1. Visit http://www.packtpub.com/newsletters/ in a web browser and subscribe to one or more of the available newsletters.
2. Enter the following code into a new PHP project, substituting in your e-mail details as follows:
3. Save the project as 10-retrieving-email.php.
4. Execute the script.
5. The newsletter subscription e-mail will be retrieved and displayed on the screen.
How it works...

1. First, we enter our e-mail details and store them in variables. The $host string consists of the server address (imap.gmail.com), followed by a colon (:), followed by the port number (993), followed by our optional flags: in this case the service (/imap), the encryption method (/ssl), and an instruction to not validate the encryption certificate, as the server uses self-signed certificates (/novalidate-cert).

$host = '{imap.gmail.com:993/imap/ssl/novalidate-cert}'; // Enter our host IMAP settings here
$user = '[email protected]'; // Enter our email address here
$pass = 'password123'; // Enter our password here
2. In the next line we attempt to open a connection to our e-mail server using the imap_open() function, passing our e-mail details. If the connection fails, we kill the script and show the error as follows:

if (!$inbox = imap_open($host, $user, $pass)) {
    die('Cannot connect to email: ' . imap_last_error());
}
3. We then use the imap_search() function to search our inbox and return an array of the e-mails in it:

$emails = imap_search($inbox, 'ALL'); // Retrieving email IDs into $emails array
4. If e-mails are returned by the search, the $emails array will be populated. We test whether it is populated with an if() statement and, if it is, proceed to sort the e-mails using rsort() as follows:

if ($emails) {
    rsort($emails);
5. We then iterate over the $emails array using a foreach() loop as follows:

foreach ($emails as $email) {
6. For each of the e-mails, we use the imap_fetch_overview() function to return an object with the details of the e-mail as follows:

$emailOverview = imap_fetch_overview($inbox, $email, 0);
7. As we only want to return e-mails from our subscription to the PacktPub newsletter, we need to test whether the e-mail we are currently iterating over is from service@packtpub.com. For this we use the strpos() function, accessing the from attribute of the $emailOverview object and checking for the presence of the e-mail address service@packtpub.com, which returns a truthy value if it is found, as follows:

if (strpos($emailOverview[0]->from, 'service@packtpub.com')) {
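One caveat worth noting with this strpos() test: strpos() returns the match position, which is 0 (a falsy value) when the needle sits at the very start of the string, so the check above can silently miss such e-mails. A stricter version compares against FALSE explicitly (the $from string below is an invented example):

```php
<?php
// strpos() returns the match position; position 0 is falsy, so a bare
// truthiness check misses needles at the start of the string.
$from = 'service@packtpub.com <service@packtpub.com>';

if (strpos($from, 'service@packtpub.com')) {
    echo "truthy check: found\n";  // NOT reached: position 0 is falsy
} else {
    echo "truthy check: missed\n";
}

if (strpos($from, 'service@packtpub.com') !== false) {
    echo "strict check: found\n";  // Reached: 0 !== false
}
```

The recipe's check works for typical "Name <address>" headers, where the address is never at position 0, but the !== false form is the safer habit.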
8. If the e-mail is from service@packtpub.com, we wish to access its other attributes and download the body of the e-mail. The attributes of the $emailOverview object are retrieved and echoed to the screen, each followed by a line break, as given in the following code:

echo $emailOverview[0]->from . '<br />';
echo $emailOverview[0]->subject . '<br />';
echo $emailOverview[0]->date . '<br />';
9. We then fetch the body of the e-mail using the imap_fetchbody() function, and echo the body to the screen as follows:

$emailBody = imap_fetchbody($inbox, $email, 1);
echo $emailBody;
10. Finally, we close the IMAP e-mail connection using the following code:

imap_close($inbox);
There's more...

In addition to the attributes used in this recipe (from, subject, and date), the imap_fetch_overview() function returns a number of other attributes which we can use. These are explained as follows:

imap_fetch_overview attribute list

The following is the list of attributes returned by imap_fetch_overview():

- subject: The subject of the e-mail
- from: The sender of the e-mail
- to: The recipient of the e-mail
- date: The date the e-mail was sent
- message_id: The ID of the e-mail
- in_reply_to: The e-mail to which this is a reply
- size: The size of the e-mail in bytes
- uid: The UID of the e-mail in the mailbox
- msgno: The sequential e-mail number in the mailbox
- answered: Whether the e-mail is flagged as answered
- deleted: Whether the e-mail is flagged for deletion
- seen: Whether the e-mail is flagged as being read
- draft: Whether the e-mail is flagged as a draft
Multithreaded scraping using multi-cURL (Intermediate)

While PHP has no built-in support for multithreading, cURL, the library we are using for retrieval of target pages, does provide the functionality to perform multiple requests in parallel. In this recipe we will introduce the use of cURL's multi functions to solve the scraping problem we encountered in the Traversing multiple pages recipe by asynchronously scraping the paginated results, rather than performing the scrapes one by one.
How to do it...

1. Enter the code found in the code bundle for this recipe into a new PHP project.
2. Save the project as 11-multithreaded-scraping.php.
3. Execute the script.
4. An array of data scraped from all of the books' web pages will be displayed on the screen.
How it works...

This recipe follows much the same structure as the Traversing multiple pages recipe, the only difference being that rather than iterating over the array of URLs, we use multi-cURL to execute the requests asynchronously. The function we have created, curlMulti(), is the only new addition, and it works as follows:

1. We declare our new function curlMulti(), which takes an array of URLs as its parameter, using the following code:

function curlMulti($urls) {
2. Similarly to initializing a single cURL handle, as we have done previously, we initialize a multi session as follows:

$mh = curl_multi_init();
3. We use a foreach() loop to iterate over our array of URLs as follows:

foreach ($urls as $id => $d) {
4. We initialize a single cURL session and add it to an array of cURL sessions, $ch, using the URL's array key, $id, as the key of the $ch array element, as follows:

$ch[$id] = curl_init();
5. We then extract the URL into the $url variable as follows:

$url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
6. We proceed to set our required cURL options as follows:

curl_setopt($ch[$id], CURLOPT_URL, $url);
curl_setopt($ch[$id], CURLOPT_RETURNTRANSFER, TRUE);
7. We then add the cURL session to our multi session and close the foreach() loop as given in the following code:

curl_multi_add_handle($mh, $ch[$id]);
}
8. The $running variable is used to determine whether cURL is still running, so we initially set it to NULL as given in the following code:

$running = NULL; // Set $running to NULL
9. We then execute our multi cURL session using curl_multi_exec(), passing in our multi cURL handle and the $running variable. The do-while loop executes this until the $running variable is 0, indicating that all requests have completed, as given in the following code:

do {
    curl_multi_exec($mh, $running);
} while ($running > 0);

10. For each of the cURL sessions we use curl_multi_getcontent() to retrieve the content, and then remove the session from the multi handle using curl_multi_remove_handle() as follows:

foreach ($ch as $id => $content) {
    $results[$id] = curl_multi_getcontent($content); // Add results to $results array
    curl_multi_remove_handle($mh, $content); // Remove cURL session from multi handle
}
11. Finally, we close the multi session and return the $results array from the function for use in the rest of our script as follows:

curl_multi_close($mh);
return $results;
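Assembled, the steps above form the complete curlMulti() function. The sketch below follows the recipe's structure, with one small addition (an early return for an empty URL list) so the function is safe to call with no work to do:

```php
<?php
// The steps above, assembled into the complete curlMulti() function,
// plus an added guard for an empty URL list.
function curlMulti($urls) {
    $results = array();
    if (empty($urls)) {
        return $results; // Nothing to fetch
    }

    $mh = curl_multi_init(); // Initialize a multi cURL session

    $ch = array();
    foreach ($urls as $id => $d) {
        $ch[$id] = curl_init(); // One cURL session per URL
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($ch[$id], CURLOPT_URL, $url);
        curl_setopt($ch[$id], CURLOPT_RETURNTRANSFER, TRUE);
        curl_multi_add_handle($mh, $ch[$id]); // Attach to the multi session
    }

    // Run all requests in parallel until none are still active
    $running = NULL;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    // Collect the content of each completed request
    foreach ($ch as $id => $handle) {
        $results[$id] = curl_multi_getcontent($handle);
        curl_multi_remove_handle($mh, $handle); // Detach from the multi session
    }

    curl_multi_close($mh);
    return $results;
}

print_r(curlMulti(array())); // An empty input yields an empty array
```

In practice you would pass an array of target URLs, for example curlMulti(array('http://www.packtpub.com/page-1', 'http://www.packtpub.com/page-2')), and receive the page bodies back keyed by the same array keys.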
There's more...

While we refer to cURL as being multithreaded in this context, it is not truly multithreaded. The requests are performed in parallel using non-blocking API calls. In any batch of requests, the entire batch will take as long as the longest single request before the data is returned.