In crawler projects, HTTP requests are among the most common operations. Establishing a new connection for every request when scraping large amounts of data hurts efficiency, especially when the target website limits request frequency. To optimize this, it helps to keep the connection open across requests instead of reconnecting each time. In PHP, a helper such as curl_upkeep() can keep the connection alive and avoid frequently establishing new connections.
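In this article, curl_upkeep() is a custom function that sends HTTP requests via the cURL library while keeping the connection open. This avoids establishing a new connection for each request and improves performance, especially when the crawler sends requests continuously. Note that PHP 8.2 and later ship a built-in curl_upkeep() (available when PHP is built against libcurl 7.62.0 or newer) that performs connection upkeep checks on an existing handle, so on those versions a custom function of the same name has to live in its own namespace or use a different name.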
When each HTTP request in PHP creates its own cURL handle and closes it afterwards, the underlying connection is closed as well. To avoid re-establishing the connection with every request, you need to reuse the handle and set the relevant options to keep the connection alive.
Here is an example that demonstrates how to use curl_upkeep() to keep the connection open.
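<?php
// Namespacing the helper avoids a "Cannot redeclare" fatal error on PHP 8.2+,
// where the cURL extension provides its own built-in curl_upkeep().
namespace Crawler;

function curl_upkeep($url)
{
    // Reuse one handle across calls: libcurl caches idle connections per handle,
    // so later requests to the same host can skip the handshake.
    static $ch = null;
    if ($ch === null) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return response content instead of directly outputting it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Allow automatic redirects
        curl_setopt($ch, CURLOPT_HTTPHEADER, array(
            "Connection: keep-alive",  // Ask the server to keep the connection alive
            "Keep-Alive: timeout=300", // Advisory idle timeout in seconds; the server may ignore it
        ));
        curl_setopt($ch, CURLOPT_TCP_KEEPALIVE, 1); // Send TCP keep-alive probes on idle connections
        // Upper bound for one whole request; it does not control how long the connection stays open
        curl_setopt($ch, CURLOPT_TIMEOUT, 300); // Set request timeout to 5 minutes
    }

    curl_setopt($ch, CURLOPT_URL, $url); // Target URL

    // Execute cURL request and get the result
    $response = curl_exec($ch);
    if (curl_errno($ch)) {
        echo 'Curl error: ' . curl_error($ch);
    }

    // No curl_close() here: closing the handle would also close its connections
    return $response;
}

// Use curl_upkeep() for the request
$url = "https://m66.net/api/data"; // Replace with target URL
$response = curl_upkeep($url);
echo $response;
?>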
curl_init(): Initializes a cURL session.
curl_setopt(): Sets cURL options. CURLOPT_URL specifies the target URL, CURLOPT_RETURNTRANSFER ensures the response content is returned instead of directly outputting it, and CURLOPT_FOLLOWLOCATION allows automatic redirection.
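Keep connection alive: The Connection: keep-alive and Keep-Alive headers ask the server to keep the connection open. HTTP/1.1 connections are already persistent by default; what matters on the client side is that the same cURL handle serves the following requests, because libcurl reuses connections per handle.
Timeout settings: CURLOPT_TIMEOUT sets the maximum time in seconds that a single request may take. It does not determine how long an idle connection stays open; that is decided by the server.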
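One way to check whether a connection is really being reused is to send several requests over a single handle and ask cURL how many new connections each transfer needed. The following is a minimal sketch under a few assumptions: fetch_with_reuse() is a hypothetical helper name, and the timeout values are placeholders rather than recommendations.

```php
<?php
declare(strict_types=1);

/**
 * Fetch several URLs over one cURL handle and record, for each request,
 * how many new connections libcurl had to open.
 * fetch_with_reuse() is a hypothetical helper, not part of the cURL extension.
 *
 * @param string[] $urls
 * @return array<int, array{url: string, ok: bool, new_connections: int}>
 */
function fetch_with_reuse(array $urls): array
{
    $results = [];
    if ($urls === []) {
        return $results; // Nothing to fetch, so no handle is created
    }

    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // Return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // Follow redirects
        CURLOPT_CONNECTTIMEOUT => 10,   // Placeholder: seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,   // Placeholder: seconds allowed per request
    ]);

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $body = curl_exec($ch);
        $results[] = [
            'url' => $url,
            'ok'  => $body !== false,
            // 0 means the transfer ran over an already open connection
            'new_connections' => (int) curl_getinfo($ch, CURLINFO_NUM_CONNECTS),
        ];
    }

    // The handle and its cached connections are released when $ch
    // goes out of scope at the end of this function.
    return $results;
}
```

For two URLs on the same host, one would typically expect new_connections to be 1 for the first entry and 0 for the second, provided the server keeps the connection open and no redirect leads to another host.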
In crawler projects, repeatedly opening and closing connections creates unnecessary overhead and reduces efficiency. By maintaining the connection, we can:
Improve performance: Reusing existing connections instead of establishing new ones reduces latency.
Reduce the risk of bans: Some target websites limit how often the same IP may connect. Keeping the connection alive reduces the number of new connections, which makes it less likely that connection-based anti-crawling mechanisms are triggered. Limits on the number of requests still apply.
Save resources: Establishing connections requires time and resources, while keeping the connection alive helps conserve both server and client computational resources.
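To get a feel for the latency argument, the saving can be estimated with simple arithmetic: every request after the first skips the connection handshake. The sketch below is only a rough upper bound. The function name and all input values are illustrative assumptions, not measurements; the figure of 2 round trips assumes one for the TCP handshake plus one for a full TLS 1.3 handshake.

```php
<?php
declare(strict_types=1);

/**
 * Rough upper bound on the handshake time saved by reusing one connection.
 * Hypothetical helper; all parameters are illustrative assumptions.
 */
function estimated_handshake_savings_ms(int $requests, float $rttMs, int $roundTripsPerHandshake): float
{
    if ($requests < 1) {
        return 0.0;
    }
    // The first request still pays for the handshake; the remaining ones do not.
    return ($requests - 1) * $roundTripsPerHandshake * $rttMs;
}

// 100 HTTPS requests, 50 ms round trip, 2 round trips per new connection
echo estimated_handshake_savings_ms(100, 50.0, 2), " ms\n"; // prints "9900 ms"
```

Real savings depend on the network, the TLS version, and whether the server actually keeps the connection open.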
Timeout settings: It is important to set an appropriate timeout. If the timeout is too long, a stalled request can block the crawler for that entire time; if it is too short, slow but valid responses are cut off. Choose a value based on actual needs.
Connection limits: Some servers restrict persistent connections (such as a maximum number of connections). In such cases, you may need to adjust the relevant configuration, for example by lowering the number of parallel connections the crawler opens.
Server support: Not all servers support long-lived connections. A server that does not will typically answer with Connection: close, so when using keep-alive, make sure the target server supports this feature.
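Whether a server is willing to keep the connection open can be read from its response headers. The sketch below shows one possible check, assuming the raw header block of the final response is available (for instance by enabling CURLOPT_HEADER and cutting the response at CURLINFO_HEADER_SIZE). server_keeps_connection() is a hypothetical helper name, and the rules it applies are the standard HTTP defaults: persistent unless "close" is announced from HTTP/1.1 onwards, explicit opt-in for HTTP/1.0.

```php
<?php
declare(strict_types=1);

/**
 * Decide from a raw response header block whether the server intends
 * to keep the connection open.
 * Hypothetical helper, not part of the cURL extension.
 */
function server_keeps_connection(string $rawHeaders): bool
{
    $lines = preg_split('/\r\n|\n/', trim($rawHeaders));
    $statusLine = array_shift($lines);

    // Extract the protocol version from a status line such as "HTTP/1.1 200 OK".
    if (!preg_match('#^HTTP/(\d+(?:\.\d+)?)#i', (string) $statusLine, $m)) {
        return false; // Not a recognizable HTTP response
    }
    $version = (float) $m[1];

    // Collect the tokens of every Connection header, lowercased.
    $tokens = [];
    foreach ($lines as $line) {
        if (stripos($line, 'connection:') === 0) {
            $value = substr($line, strlen('connection:'));
            foreach (explode(',', $value) as $token) {
                $tokens[] = strtolower(trim($token));
            }
        }
    }

    if (in_array('close', $tokens, true)) {
        return false; // Server announced that it will close the connection
    }
    if ($version >= 1.1) {
        return true;  // Persistent by default from HTTP/1.1 onwards
    }
    return in_array('keep-alive', $tokens, true); // HTTP/1.0 needs an explicit opt-in
}
```

If the check returns false for a target site, reusing the handle is still harmless: cURL simply opens a new connection for the next request.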