Laravel – Basic Web Crawler using GuzzleHttp

A Web Crawler is a program that navigates the Web and finds new or updated pages for indexing. The crawler starts with seed websites or a wide range of popular URLs (also known as the frontier) and searches them, in breadth and depth, for hyperlinks to extract.

A Web Crawler must be kind and robust. Kindness means that the crawler respects the rules set by robots.txt and avoids visiting a website too often. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other desirable attributes for a Web Crawler are the ability to distribute work across multiple machines, extensibility, continuous operation, and the ability to prioritize pages based on quality.
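
Kindness can start with something as simple as consulting robots.txt before crawling a host. Below is a minimal and deliberately naive sketch, assuming Guzzle is installed via Composer; the isPathAllowed() helper is hypothetical and ignores per-user-agent sections and Crawl-delay directives:

```php
<?php
// Naive robots.txt check (illustrative only): fetch the file with Guzzle
// and see whether a path is disallowed by any "Disallow:" rule.
require 'vendor/autoload.php';

use GuzzleHttp\Client;

function isPathAllowed(string $baseUrl, string $path): bool
{
    $client = new Client(['base_uri' => $baseUrl]);

    try {
        $robots = (string) $client->get('/robots.txt')->getBody();
    } catch (\Exception $e) {
        // No robots.txt (or it is unreachable): assume crawling is allowed.
        return true;
    }

    foreach (explode("\n", $robots) as $line) {
        if (stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, strlen('Disallow:')));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return false;
            }
        }
    }

    return true;
}

var_dump(isPathAllowed('https://example.com', '/private/page'));
```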

1. Steps to create a web crawler

The basic steps to write a Web Crawler are listed below; a sketch of the corresponding loop follows the list:

  1. Pick a URL from the frontier
  2. Fetch the HTML code
  3. Parse the HTML to extract links to other URLs
  4. Check whether you have already crawled this URL and/or seen the same content before
    1. If not, add it to the index
  5. For each extracted URL
    1. Confirm that the site allows it to be crawled (robots.txt, crawling frequency), then add it to the frontier
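
In PHP-flavoured pseudocode, the loop might look like the sketch below; fetchHtml(), extractLinks() and isAllowed() are hypothetical helpers standing in for the pieces built later in the article:

```php
<?php
// Illustrative crawl loop; fetchHtml(), extractLinks() and isAllowed()
// are hypothetical helpers, not real library functions.
$frontier = ['https://example.com'];    // seed URL(s)
$visited  = [];
$index    = [];

while (!empty($frontier)) {
    $url   = array_shift($frontier);      // 1. pick a URL from the frontier
    $html  = fetchHtml($url);             // 2. fetch the HTML code
    $links = extractLinks($html, $url);   // 3. parse the HTML and extract links

    if (!isset($visited[$url])) {         // 4. skip URLs already crawled
        $index[$url]   = $html;           // 4.1 add new content to the index
        $visited[$url] = true;
    }

    foreach ($links as $link) {           // 5. for each extracted URL
        if (!isset($visited[$link]) && isAllowed($link)) { // 5.1 robots.txt / frequency
            $frontier[] = $link;          // add it to the frontier
        }
    }
}
```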

Truth be told, developing and maintaining one Web Crawler across all pages on the internet is… difficult, if not impossible, considering that there are over one billion websites online right now. If you are reading this article, chances are you are not looking for a guide to create a Web Crawler but a Web Scraper. Why is the article called ‘Basic Web Crawler’ then? Well… because it’s catchy. Really! Few people know the difference between crawlers and scrapers, so we all tend to use the word “crawling” for everything, even for offline data scraping. And also because, to build a Web Scraper, you need a crawl agent too.

2. Skeleton of a Crawler

GuzzleHttp is a PHP HTTP client that makes it easy to send HTTP requests and trivial to integrate with web services.

  • Simple interface for building query strings, POST requests, streaming large uploads, streaming large downloads, using HTTP cookies, uploading JSON data, etc…
  • Can send both synchronous and asynchronous requests using the same interface.
  • Uses PSR-7 interfaces for requests, responses, and streams. This allows you to utilize other PSR-7 compatible libraries with Guzzle.
  • Abstracts away the underlying HTTP transport, allowing you to write environment and transport agnostic code; i.e., no hard dependency on cURL, PHP streams, sockets, or non-blocking event loops.
  • Middleware system allows you to augment and compose client behavior.
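
As a quick illustration of the client itself (assuming Guzzle has been installed with composer require guzzlehttp/guzzle), a single synchronous GET request looks like this:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Create a reusable client and fetch a page synchronously.
$client = new Client([
    'timeout' => 10,                          // give up after 10 seconds
    'headers' => ['User-Agent' => 'MyCrawler/1.0'],
]);

$response = $client->request('GET', 'https://example.com');

echo $response->getStatusCode();              // e.g. 200
echo $response->getBody()->getContents();     // the raw HTML
```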

So let’s start with the basic code for a Web Crawler.

As we mentioned before, a Web Crawler searches in breadth and depth for links. If we imagine the links on a website as a tree-like structure, the root node (level zero) would be the link we start with, the next level would be all the links found on level zero, and so on.
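
Below is a minimal sketch of such a crawler, assuming Guzzle for the HTTP requests and PHP’s built-in DOMDocument for link extraction; the Crawler class, its depth limit and the seed URL are illustrative choices, not part of any library:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

class Crawler
{
    private Client $client;
    private array $seen = [];

    public function __construct()
    {
        $this->client = new Client(['timeout' => 10]);
    }

    // Crawl $url and recurse into its links, up to $maxDepth levels deep.
    public function crawl(string $url, int $depth = 0, int $maxDepth = 2): void
    {
        if ($depth > $maxDepth || isset($this->seen[$url])) {
            return;
        }
        $this->seen[$url] = true;

        try {
            $html = (string) $this->client->get($url)->getBody();
        } catch (\Exception $e) {
            return; // skip pages that fail to load
        }

        // Extract every <a href="..."> link from the fetched page.
        $dom = new \DOMDocument();
        @$dom->loadHTML($html); // suppress warnings from malformed HTML

        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            if (strpos($href, 'http') === 0) { // follow absolute links only
                $this->crawl($href, $depth + 1, $maxDepth);
            }
        }
    }
}

(new Crawler())->crawl('https://example.com');
```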

In the above snippet, the seed URL can be hard-coded or passed in dynamically.
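
In a Laravel application, for example, the seed could come straight from the request. The route below is a hypothetical wiring of the Crawler sketched above; the /crawl endpoint, the url query parameter and the App\Services\Crawler namespace are all assumptions for illustration:

```php
<?php
// routes/web.php (hypothetical wiring of the Crawler sketched above)

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::get('/crawl', function (Request $request) {
    $seed = $request->query('url', 'https://example.com'); // dynamic or default seed

    $crawler = new \App\Services\Crawler();                // illustrative namespace
    $crawler->crawl($seed);

    return response()->json(['crawled' => $seed]);
});
```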

References

  1. GuzzleHttp
  2. Laravel
  3. Tradetu
  4. D’Tashion Fashion Shopping App
