As the volume of web content grows, developers need ways to extract useful information from large numbers of webpages quickly and accurately. PHP is a widely used development language, and combined with the phpSpider crawler framework it lets us capture and process web data with relatively little code.
phpSpider is a PHP-based crawler framework that can be installed via Composer. Open your command line and run the following command:
composer require owner888/phpspider
After installation, create a file named spider.php. In it, first include the Composer autoload file, then create the crawler object and set the initial crawling URL:
<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;

// Create the crawler object; the starting URL is passed in the configuration
$spider = new phpspider(array(
    'name'      => 'example',
    'domains'   => array('www.example.com'),
    'scan_urls' => array('http://www.example.com'),
));

// Define the callback that holds the extraction rules
$spider->on_extract_page = function ($page, $data) {
    // Write your data extraction logic here
    return $data;
};

// Start the crawler
$spider->start();
Within the callback function, you can use regular expressions, XPath, or CSS selectors to locate elements. The following example demonstrates how to get the webpage title and content:
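$spider->on_extract_page = function ($page, $data) {
    // Assumes $page['raw'] is the downloaded HTML as a string; adjust if your version differs
    $html = $page['raw'];
    // Take the title from the <title> element with a regular expression
    $title = preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m) ? trim($m[1]) : '';
    $data['title'] = $title;
    // Strip all tags to keep only the text content
    $data['content'] = trim(strip_tags($html));
    return $data;
};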
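Where a regular expression becomes brittle, XPath is usually easier to maintain. The sketch below relies only on PHP's bundled DOM extension rather than on any phpSpider helper. The function name `extract_by_xpath` and the sample markup are hypothetical, and it assumes the page HTML is available as a string.

```php
<?php
// Hypothetical helper: returns the trimmed text of every node matching an XPath query.
function extract_by_xpath(string $html, string $query): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return []; // the query itself was malformed
    }

    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Sample markup used only for illustration
$html = '<html><head><title>Example Domain</title></head>'
      . '<body><h1>Hello</h1><p class="intro">First</p><p>Second</p></body></html>';

print_r(extract_by_xpath($html, '//title'));
print_r(extract_by_xpath($html, '//p[@class="intro"]'));
```

Inside the extraction callback, the same helper could be called with the page's HTML string in place of the sample markup, for example to fill `$data['title']` from `//title`.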
The extracted data can be saved into files, databases, or other storage media. The code below appends the data to a text file:
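$spider->on_extract_page = function ($page, $data) {
    // Assumes $page['raw'] is the downloaded HTML as a string; adjust if your version differs
    $html = $page['raw'];
    $data['title'] = preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m) ? trim($m[1]) : '';
    $data['content'] = trim(strip_tags($html));
    // Save results to a file, one exported record per write
    file_put_contents('extracted_data.txt', var_export($data, true) . PHP_EOL, FILE_APPEND);
    return $data;
};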
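The output of `var_export` is easy to read but awkward to parse back later. As one alternative, the sketch below appends each record as a single line of JSON and reads the file back. The helper names `append_record` and `read_records` are hypothetical, and only built-in PHP functions are used.

```php
<?php
// Hypothetical helper: appends one record per line as JSON (the "JSON Lines" layout).
function append_record(string $path, array $record): bool
{
    $line = json_encode($record, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    if ($line === false) {
        return false; // e.g. invalid UTF-8 in the scraped text
    }
    // LOCK_EX avoids interleaved writes if several processes append at once
    return file_put_contents($path, $line . PHP_EOL, FILE_APPEND | LOCK_EX) !== false;
}

// Hypothetical helper: reads the file back into an array of records.
function read_records(string $path): array
{
    $records = [];
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $records[] = json_decode($line, true);
    }
    return $records;
}

// Demonstration with a temporary file and made-up records
$path = tempnam(sys_get_temp_dir(), 'spider_');
append_record($path, ['title' => 'Example Domain', 'content' => 'First page']);
append_record($path, ['title' => 'Second', 'content' => "Line one\nLine two"]);
print_r(read_records($path));
unlink($path);
```

Because `json_encode` escapes newlines inside strings, each record stays on one line even when the scraped content spans several lines, which keeps the file simple to process line by line.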
After completing the code, run the following command to start the crawler:
php spider.php
The crawler will begin crawling from the initial URL, extract information according to your rules, and save the results.
By combining PHP with the phpSpider framework, you can build a working crawler in a short script: set a starting URL, define an extraction callback, and save the results. This article covers only basic usage; phpSpider also supports more advanced configurations and features for other data scraping needs.