Laravel – Basic Web Crawler using GuzzleHttp

by Tradetu · March 19, 2017

A Web Crawler is a program that navigates the Web and finds new or updated pages for indexing. The Crawler starts from a set of seed websites or popular URLs, which form the initial frontier (the queue of URLs waiting to be visited), and follows hyperlinks in depth and breadth to discover further pages.

A Web Crawler must be kind and robust. Kindness, usually called politeness, means that the Crawler respects the rules set in robots.txt and avoids visiting a website too often. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other desirable attributes are the ability to run distributed across multiple machines, extensibility, continuous operation and the ability to prioritize pages by quality.
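The kindness rules can be sketched in a few lines of plain PHP. This is a deliberately simplified illustration, not a complete robots.txt parser: it only honors `Disallow` prefixes, merges the `*` group with any group naming our agent, and ignores `Allow`, wildcards and `Crawl-delay`. The function names `isPathAllowed` and `secondsToWait` are hypothetical helpers made up for this example.

```php
<?php

/**
 * Simplified robots.txt check: collect the Disallow prefixes of every group
 * that targets "*" or our user agent, then test the path by prefix match.
 */
function isPathAllowed(string $robotsTxt, string $path, string $userAgent = '*'): bool
{
    $applies = false;       // does the current group target our agent?
    $inAgentLines = false;  // was the previous rule line a User-agent line?
    $disallowed = [];

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $matches = ($value === '*' || strcasecmp($value, $userAgent) === 0);
            // Consecutive User-agent lines belong to the same group.
            $applies = $inAgentLines ? ($applies || $matches) : $matches;
            $inAgentLines = true;
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== '') {
                $disallowed[] = $value;
            }
        }
    }

    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

/**
 * Seconds to wait before the next request to $host, given a map of
 * host => timestamp of the last visit and a minimum delay between visits.
 */
function secondsToWait(array $lastVisit, string $host, float $now, float $minDelay): float
{
    if (!isset($lastVisit[$host])) {
        return 0.0;
    }
    return max(0.0, $lastVisit[$host] + $minDelay - $now);
}
```

A crawler would call `isPathAllowed()` with the site's robots.txt body before fetching a path, and sleep for `secondsToWait()` seconds before contacting the same host again.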

1. Steps to create web crawler

The basic steps to write a Web Crawler are:

1. Pick a URL from the frontier
2. Fetch the HTML code
3. Parse the HTML to extract links to other URLs
4. Check if you have already crawled the URL and/or if you have seen the same content before
    1. If not, add it to the index
5. For each extracted URL
    1. Confirm that it agrees to be crawled (robots.txt, crawling frequency)
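The steps above can be sketched as a small loop. This is a minimal illustration under a few assumptions: fetching is injected as a callable so the loop can be tried without network access, the "index" is just an array of URL to extracted links, only absolute http(s) links are followed, and accepted links are appended to the frontier. The names `crawl`, `extractLinks`, `$fetch` and `$allowed` are hypothetical.

```php
<?php

/** Extract absolute http(s) links from an HTML string (relative links are skipped for brevity). */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

/**
 * @param string[] $seeds   initial frontier
 * @param callable $fetch   function (string $url): ?string, returns HTML or null on failure
 * @param callable $allowed function (string $url): bool, e.g. a robots.txt check
 * @return array map of crawled URL => links found on that page
 */
function crawl(array $seeds, callable $fetch, callable $allowed, int $maxPages = 50): array
{
    $frontier = $seeds;                      // FIFO queue of URLs still to visit
    $seenUrls = array_fill_keys($seeds, true);
    $seenContent = [];                       // hashes of page bodies already indexed
    $index = [];

    while ($frontier && count($index) < $maxPages) {
        $url = array_shift($frontier);       // pick a URL from the frontier
        $html = $fetch($url);                // fetch the HTML code
        if ($html === null) {
            continue;
        }
        $hash = md5($html);
        if (isset($seenContent[$hash])) {
            continue;                        // same content already seen under another URL
        }
        $seenContent[$hash] = true;

        $links = extractLinks($html);        // parse the HTML to extract links
        $index[$url] = $links;               // add the page to the index

        foreach ($links as $link) {
            // Queue each new URL only if it agrees to be crawled.
            if (!isset($seenUrls[$link]) && $allowed($link)) {
                $seenUrls[$link] = true;
                $frontier[] = $link;
            }
        }
    }
    return $index;
}
```

In a real application `$fetch` could wrap a GuzzleHttp client call and `$allowed` could consult robots.txt; here both are left to the caller so the loop itself stays easy to follow.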

Truth be told, developing and maintaining a single Web Crawler that covers every page on the internet is difficult, if not impossible, considering that there are over 1 billion websites online right now. If you are reading this article, chances are you are not looking for a guide to create a Web Crawler but a Web Scraper. Why is the article called ‘Basic Web Crawler’ then? Well, because it’s catchy. Really! Few people know the difference between crawlers and scrapers, so we all tend to use the word “crawling” for everything, even for offline data scraping. Also, to build a Web Scraper you need a crawl agent too.

2. Skeleton of a Crawler

GuzzleHttp is a PHP HTTP client that makes it easy to send HTTP requests and trivial to integrate with web services.

Simple interface for building query strings, POST requests, streaming large uploads, streaming large downloads, using HTTP cookies, uploading JSON data, etc…
Can send both synchronous and asynchronous requests using the same interface.
Uses PSR-7 interfaces for requests, responses, and streams. This allows you to utilize other PSR-7 compatible libraries with Guzzle.
Abstracts away the underlying HTTP transport, allowing you to write environment and transport agnostic code; i.e., no hard dependency on cURL, PHP streams, sockets, or non-blocking event loops.
Middleware system allows you to augment and compose client behavior.

$client = new GuzzleHttp\Client();
$res = $client->request('GET', 'https://api.github.com/user', [
    'auth' => ['user', 'pass']
]);
echo $res->getStatusCode();
// "200"
// getHeader() returns an array of values; getHeaderLine() returns them as one string
echo $res->getHeaderLine('content-type');
// 'application/json; charset=utf8'
echo $res->getBody();
// '{"type":"User"...'

So let’s start with the basic code for a Web Crawler.

<?php

namespace App\Http\Controllers\Crawler;

use Illuminate\Http\Request;
use App\Http\Controllers\BaseController;
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

/**
 * WebCrawlerController
 * @author Crawler
 */
class WebCrawlerController extends BaseController {
    
    /**
     * Search Snapdeal products by the given URL
     * @param Request $request
     */
    public function searchProductsByUrl(Request $request) {
        // Take the page URL from the request if present, otherwise use the fixed one.
        // The "url" parameter name is an assumption made for this example.
        $pageUrl = $request->input('url', 'http://www.tradetu.com');
        if(!$pageUrl) {
            return response()->json($this->getActionStatus("ERROR", "Product page url not found."));
        }
        $errors = array();
        $response = array();

        $result = $this->makeWebRequest($pageUrl, $errors);
        if(empty($errors)) {
            $response['content'] = $this->parseQuickProductsJson($result, '');
            $response['Status'] = 'SUCCESS';
            $response['Message'] = 'Products downloaded successfully';
        } else {
            $response['Errors'] = $errors;
            $response['Status'] = 'ERROR';
            $response['Message'] = "Error in fetching products from the url. Errors: " . implode('|', $errors);
        }
        return response()->json($response);
    }
    
    /**
     * Parse the product listing HTML into an array of products
     * @param string $result HTML of the listing page
     * @param string $category
     * @return array
     */
    private function parseQuickProductsJson($result, $category) {
        $arr = array();
        try {
            $crawler = new Crawler($result);
            $filter = $crawler->filter('div.product-tuple-listing');

            foreach ($filter as $i => $domElement) {
                $_crawler = new Crawler($domElement);

                $arr[$i] = array(
                    'productName' => $_crawler->filter('p.product-title')->text(),
                    'productUrl' => $_crawler->filter('a.noUdLine')->attr('href'),
                    'imageUrl' => $_crawler->filter('input.compareImg')->attr('value'),
                    'offerPrice' => $_crawler->filter('span.product-price')->text(),
                    'inStock' => $_crawler->filter('span.badge-soldout')->count() === 0,
                    // Collect the text of each size element rather than returning the Crawler object
                    'size' => $_crawler->filter('span.attr-value')->each(function (Crawler $node) {
                        return $node->text();
                    })
                );
            }
        } catch (\Exception $ex) {
            // text() and attr() throw when a selector matches nothing;
            // fall through and return the products parsed so far
        }
        return $arr;
    }
    
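    /**
     * Make a GET request to the given URL and return the response body
     * @param string $url
     * @param array $errors collects error messages, passed by reference
     * @return string|null
     */
    private function makeWebRequest($url, &$errors) {
        $client = new Client();
        try {
            // http_errors => false (Guzzle 6+) stops Guzzle from throwing on 4xx/5xx
            // responses, so the status code check below can handle them
            $response = $client->get($url, array('http_errors' => false));
        } catch (\GuzzleHttp\Exception\TransferException $ex) {
            // Connection failures, timeouts and similar transport errors
            array_push($errors, $ex->getMessage());
            return null;
        }

        if($response->getStatusCode() == 200) {
            return (string)$response->getBody();
        } else {
            array_push($errors, $response->getReasonPhrase());
            return null;
        }
    }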
}

As mentioned before, a Web Crawler follows links in breadth and depth. If we picture the links of a website as a tree, the root node, or level zero, is the link we start with; level one holds all the links found on the root page, level two holds the links found on the level-one pages, and so on.
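To make the tree picture concrete, here is a small breadth-first sketch that groups URLs by level. It works on an in-memory link map rather than on live pages, and the function name `levelsFrom` and the `$linksOf` structure are assumptions made for this illustration only.

```php
<?php

/**
 * Breadth-first walk over a link graph, grouping URLs by level.
 *
 * @param string $root     the starting URL (level zero)
 * @param array  $linksOf  map of URL => list of URLs linked from that page
 * @param int    $maxDepth deepest level to expand
 * @return array list of levels, each a list of URLs first reached at that level
 */
function levelsFrom(string $root, array $linksOf, int $maxDepth): array
{
    $levels = [[$root]];
    $seen = [$root => true];

    for ($depth = 0; $depth < $maxDepth; $depth++) {
        $next = [];
        foreach ($levels[$depth] as $url) {
            foreach ($linksOf[$url] ?? [] as $link) {
                // A URL belongs to the first level on which it is reached.
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $next[] = $link;
                }
            }
        }
        if (!$next) {
            break; // nothing new was found on this level
        }
        $levels[] = $next;
    }
    return $levels;
}

$graph = [
    '/'  => ['/a', '/b'],
    '/a' => ['/', '/c'],
    '/b' => ['/c', '/d'],
    '/c' => ['/e'],
];
print_r(levelsFrom('/', $graph, 2));
```

Limiting `$maxDepth` is also a simple guard against spider traps, because the crawl stops after a fixed number of levels however many links a site generates.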

In the code snippet above, the page URL can either be fixed in the controller or passed in dynamically, for example from a form on an HTML page.
