
Scraping infojobs.net with Goutte

Goutte is a screen scraping and web crawling library for PHP, built on top of DomCrawler and Guzzle and created by Fabien Potencier.

Building crawlers with this library is straightforward: you just need to extract data with CSS selectors.

Let's see an example that extracts PHP Symfony jobs in Barcelona from Spain's main job portal, Infojobs.

You just need to create an instance of the Goutte Client and make the request, which returns a crawler (http://symfony.com/doc/current/components/dom_crawler.html).
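As a minimal sketch of that step, the snippet below wraps the request in a small helper. The function name `fetchPage` is made up for illustration, and it assumes Goutte was installed with Composer.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes: composer require fabpot/goutte

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: fetch a page and return its crawler.
 */
function fetchPage(string $url): Crawler
{
    $client = new Client();

    // request() performs the HTTP call and returns a DomCrawler Crawler
    // wrapping the parsed HTML of the response.
    return $client->request('GET', $url);
}
```

Calling `fetchPage()` with the URL of a search results page should give back a crawler that is ready to be filtered with CSS selectors.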

Next we need to find the results. With Chrome's inspector or Firebug we can find them easily.

Infojobs mixes some ads, marked with the .list-logos CSS class, in with the results, but it's easy to skip them with the CSS :not selector.
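The filtering idea can be tried offline against a small stand-in document. In this sketch only the `.list-logos` class comes from the real page; the `#job-list` id and the list markup are invented for illustration, so the selector will likely need adjusting to whatever the inspector shows.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes symfony/dom-crawler and symfony/css-selector

use Symfony\Component\DomCrawler\Crawler;

// Stand-in markup: the #job-list id and the <li> structure are made up.
$html = <<<'HTML'
<ul id="job-list">
  <li><a href="/offer/1">First offer</a></li>
  <li class="list-logos">Sponsored logos</li>
  <li><a href="/offer/2">Second offer</a></li>
</ul>
HTML;

$crawler = new Crawler($html);

// :not(.list-logos) keeps every direct <li> child except the ad rows.
$results = $crawler->filter('#job-list > li:not(.list-logos)');

echo count($results), PHP_EOL;
```

With this sample markup the ad row is dropped and only the two offer rows remain in `$results`.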

After that we just need to extract the data of each job. We can use the crawler's each method for this.
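A rough sketch of the extraction step follows. The function name `extractJobs`, the `title` and `url` keys, and the sample markup are assumptions; the real result rows probably carry more fields and different tags.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes symfony/dom-crawler and symfony/css-selector

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: map every result node to a plain array.
 */
function extractJobs(Crawler $results): array
{
    // each() runs the closure once per matched node and returns
    // an array with the closure's return values.
    return $results->each(function (Crawler $node) {
        // Assumes every row contains a link; text() throws if none is found.
        $link = $node->filter('a')->first();

        return [
            'title' => trim($link->text()),
            'url'   => $link->attr('href'),
        ];
    });
}

// Made-up markup standing in for the real result list.
$html = '<ul><li><a href="/offer/1">First offer</a></li>'
      . '<li><a href="/offer/2">Second offer</a></li></ul>';

$jobs = extractJobs((new Crawler($html))->filter('li'));

print_r($jobs);
```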

Finally we put it all together with the extra fields we need for this example.
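One possible way to combine the pieces is sketched below. The names `parseJobs` and `scrapeJobs`, the `#job-list` selector, and the idea of passing the extra fields as an `$extra` array are all assumptions made for this sketch, not the original code.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes: composer require fabpot/goutte

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical: turn a results page into an array of jobs,
 * adding the $extra fields to every row.
 */
function parseJobs(Crawler $page, array $extra = []): array
{
    return $page
        // Made-up container id; .list-logos marks the ad rows to skip.
        ->filter('#job-list > li:not(.list-logos)')
        ->each(function (Crawler $node) use ($extra) {
            $link = $node->filter('a')->first();

            // Scraped values win over extra fields with the same key.
            return array_merge($extra, [
                'title' => trim($link->text()),
                'url'   => $link->attr('href'),
            ]);
        });
}

/**
 * Hypothetical: fetch the page at $url and parse it.
 */
function scrapeJobs(Client $client, string $url, array $extra = []): array
{
    return parseJobs($client->request('GET', $url), $extra);
}
```

Keeping the parsing separate from the HTTP request means `parseJobs()` can be checked against a saved HTML file without hitting the site.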

After that we just need to call it with the following parameters:
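As an illustration of how such parameters might be turned into a request URL, here is a small sketch. The base URL path and the `keyword` and `province` parameter names are placeholders, not the real Infojobs query format; check the search form for the actual names.

```php
<?php
/**
 * Hypothetical helper: append search parameters to a base URL.
 */
function buildSearchUrl(string $baseUrl, array $params): string
{
    // http_build_query() URL-encodes the keys and values.
    return $baseUrl . '?' . http_build_query($params);
}

// Placeholder base URL and parameter names.
$url = buildSearchUrl('https://www.infojobs.net/search', [
    'keyword'  => 'php symfony',
    'province' => 'barcelona',
]);

echo $url, PHP_EOL;
```

The resulting `$url` is what would be handed to the Goutte client's `request('GET', $url)` call.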

Here we have our data:

This is a really simple example to test this awesome tool.

Comments, doubts and questions are welcome.
