There are many articles on web scraping with PHP out there, but hopefully you'll find something useful here along the way.
We’ll be using the spatie/crawler package, which provides great features for writing crawlers without going absolutely crazy!
Please keep in mind that there is no general “the best way” — each approach has its use-case depending on what you need, how you like to do things, and what you want to achieve.
Note: Before you scrape a website, do read their Terms of Service to make sure they are OK with being scraped.
We’ll build our own crawler project to fetch the content of any website.
Set Up & Installation
Install package via Composer:
composer require guzzlehttp/psr7:^1.8.3
composer require spatie/crawler
Notice that guzzlehttp/psr7 is installed explicitly because the spatie package uses Guzzle promises under the hood to crawl multiple URLs concurrently.
Before we start, we need to create a class that extends the
\Spatie\Crawler\CrawlObservers\CrawlObserver abstract class. There we can hook into the crawling steps and handle HTTP responses in our observer.
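A minimal observer might look like the sketch below. The class name CustomCrawlerObserver and the echo-based logging are illustrative choices; the method signatures follow the abstract class from spatie/crawler v6-era releases, so check them against the version you installed.

```php
<?php

namespace App\Crawler;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class CustomCrawlerObserver extends CrawlObserver
{
    // Called before the crawler requests a URL.
    public function willCrawl(UriInterface $url): void
    {
        echo "About to crawl: {$url}\n";
    }

    // Called when a URL has been crawled successfully.
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ): void {
        // The response body holds the page HTML; parse or store it here.
        $html = (string) $response->getBody();
        echo "Crawled: {$url} (" . strlen($html) . " bytes)\n";
    }

    // Called when a request fails.
    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null
    ): void {
        echo "Failed: {$url} ({$requestException->getMessage()})\n";
    }

    // Called once the whole crawl has finished.
    public function finishedCrawling(): void
    {
        echo "Finished crawling.\n";
    }
}
```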
Then we can prepare the crawler itself using the syntax below and start it:
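The setup can be sketched roughly like this. The https://example.com URL is a placeholder, and the exact option values (timeout, crawl count) are illustrative assumptions rather than requirements of the package:

```php
<?php

use App\Crawler\CustomCrawlerObserver;
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

$websiteUrl = 'https://example.com'; // placeholder: the site you want to crawl

Crawler::create([
        // Guzzle client options passed through to the underlying HTTP client.
        RequestOptions::ALLOW_REDIRECTS => true,
        RequestOptions::TIMEOUT => 30,
    ])
    // Attach the observer class created earlier.
    ->setCrawlObserver(new CustomCrawlerObserver())
    // Only follow links that stay on the same host.
    ->setCrawlProfile(new CrawlInternalUrls($websiteUrl))
    // Stop after this many URLs (illustrative limit).
    ->setMaximumCrawlCount(100)
    // Only parse HTML responses.
    ->setParseableMimeTypes(['text/html'])
    // Pause 100 ms between requests to go easy on the server.
    ->setDelayBetweenRequests(100)
    ->startCrawling($websiteUrl);
```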
Note that we attach the
CustomCrawlerObserver class created earlier via setCrawlObserver(), and we pass the website to crawl to startCrawling().
We also define that we only want to follow internal links by passing a CrawlInternalUrls profile to the
setCrawlProfile() function. Check out the other profile options in the package documentation.
Do you see setDelayBetweenRequests(100)? It makes the crawler pause for 100 milliseconds between requests.
Quick and easy! spatie/crawler provides options to set a maximum crawl depth and response size, add a delay between requests, limit which links are crawled and which content types are parsed, and more, all of which simplifies web scraping. While this was an introductory article, you can build on this knowledge to create complex web scrapers that crawl thousands of pages. Make sure to check out the package documentation for more.
Here’s the full source code with what we’ve done so far.