Introduction to web scraping with Puppeteer

Benjamin Ajewole
3 min read · Mar 5, 2020


Web scraping simply means extracting data from websites. It can be done manually, or it can be automated using a bot or web crawler.

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.
Most of the things you can do manually in a browser can be done with Puppeteer. Examples include generating screenshots and PDFs of pages, automating form submission, UI testing, scraping web pages, etc.

Headless vs Non-Headless mode

Headless mode means running a browser without showing its UI. Puppeteer runs headless by default. To run in non-headless mode, i.e. to show the browser UI while Puppeteer is running, set headless to false:

{headless: false}

We are going to use Puppeteer to do the following:

  1. Go to google.com
  2. Search for a keyword
  3. Open the first search result
  4. Take a full-page screenshot of the page

Let’s get started…

First off, since Puppeteer is a Node library, you need Node to be installed on your PC.

Install Puppeteer

npm install puppeteer

Import Puppeteer

const puppeteer = require('puppeteer');

Create a new instance of Chromium and launch it:

// run in non-headless mode
const browser = await puppeteer.launch({
  headless: false,
  // slow down Puppeteer operations by 100 ms
  slowMo: 100,
  // open DevTools
  devtools: true
});

Create a new page

const page = await browser.newPage();

Set the viewport of the page

await page.setViewport({ width: 1199, height: 900 });

Go to www.google.com

const link = 'https://www.google.com';
await page.goto(link);

Click inside the search input field, type the keyword we want to search for, and press Enter on the keyboard.

You can install the Puppeteer recorder Chrome extension to get the HTML selectors easily instead of finding them manually.

// wait for input field selector to render
await page.waitForSelector('div form div:nth-child(2) input');
await page.click('div form div:nth-child(2) input');

// type JavaScript in the search box
await page.keyboard.type('JavaScript');

// press Enter on your keyboard
await page.keyboard.press('Enter');

// wait for the search results to load
await page.waitFor(3000);

Get the URL of the first search result

// wait for the first search result to render
await page.waitForSelector('#main > div #center_col #search > div > div > div');

// get href from the selector
const getHref = (page, selector) =>
  page.evaluate(
    selector => document.querySelector(selector).getAttribute('href'),
    selector
  );

const url = await getHref(
  page,
  '#main > div #center_col #search > div > div > div a'
);

Go to the URL of the first search result and wait till the initial HTML document has been completely loaded and parsed.

await page.goto(url, { waitUntil: 'domcontentloaded' });

Take a full-page screenshot and save it to the current directory

await page.screenshot({
  fullPage: true,
  path: 'new_image.png'
});

Console log the URL and the screenshot location

const screenshotPath = process.cwd() + '/new_image.png';
console.log('URL of the page:', url);
console.log('Location of the screenshot:', screenshotPath);

Close the page and browser

await page.close();
await browser.close();

Here’s the complete code

To learn more about Puppeteer, check the official documentation: https://github.com/puppeteer/puppeteer

Happy Scraping!
