Introduction to web scraping with Puppeteer
Web scraping simply means extracting data from websites. It can be done manually, or it can be automated with a bot or web crawler.
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.
Most of the things you can do manually in your browser can also be done with Puppeteer. Examples include generating screenshots and PDFs of pages, automating form submission, UI testing, and scraping web pages.
Headless vs Non-Headless mode
Headless mode means running a browser without showing its UI. Puppeteer runs headless by default. To run in non-headless mode, i.e. to show the browser UI while Puppeteer runs, set headless to false:
{headless: false}
We are going to use Puppeteer to do the following:
- Go to google.com
- Search for a keyword
- Open the first search result
- Take a full-page screenshot of the page
Let’s get started…
First off, since Puppeteer is a Node library, you need Node.js installed on your machine.
Install puppeteer
npm install puppeteer
Import puppeteer
const puppeteer = require('puppeteer');
Create a new instance of Chromium and launch it:
// run in non-headless mode
const browser = await puppeteer.launch({
  headless: false,
  // slow down Puppeteer operations
  slowMo: 100,
  // open dev tools
  devtools: true
});
Create a new page
const page = await browser.newPage();
Set the viewport of the page
await page.setViewport({ width: 1199, height: 900 });
Go to www.google.com
const link = 'https://www.google.com';
await page.goto(link);
Click inside the search input field, type the keyword we want to search for, and press Enter.
You can install the Puppeteer Recorder Chrome extension to easily get the HTML selectors instead of finding them manually.
// wait for the input field selector to render
await page.waitForSelector('div form div:nth-child(2) input');
await page.click('div form div:nth-child(2) input');
// type JavaScript in the search box
await page.keyboard.type('JavaScript');
// press Enter on your keyboard
await page.keyboard.press('Enter');
// wait 3 seconds for the results to load
// (page.waitFor is deprecated in newer Puppeteer versions, so use a plain timeout)
await new Promise(resolve => setTimeout(resolve, 3000));
Get the URL of the first search result
await page.waitForSelector('#main > div #center_col #search > div > div > div');
// get href from the selector
const getHref = (page, selector) =>
  page.evaluate(
    selector => document.querySelector(selector).getAttribute('href'),
    selector
  );
const url = await getHref(
  page,
  '#main > div #center_col #search > div > div > div a'
);
Go to the URL of the first search result and wait till the initial HTML document has been completely loaded and parsed.
await page.goto(url, { waitUntil: 'domcontentloaded' });
Take a full-page screenshot and save it to the current directory.
await page.screenshot({
  fullPage: true,
  path: 'new_image.png'
});
Console log the URL and the screenshot location
const screenshotPath = process.cwd() + '/new_image.png';
console.log('URL of the page:', url);
console.log('Location of the screenshot:', screenshotPath);
Close the page and browser
await page.close();
await browser.close();
Here’s the complete code
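The snippets above can be assembled into a single runnable script. The await calls need to live inside an async function, so the steps are wrapped in an async IIFE. Note that the Google selectors come from this walkthrough and may break whenever Google changes its markup.

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // launch Chromium in non-headless mode
  const browser = await puppeteer.launch({
    headless: false,
    slowMo: 100,
    devtools: true
  });

  const page = await browser.newPage();
  await page.setViewport({ width: 1199, height: 900 });

  // go to google.com
  await page.goto('https://www.google.com');

  // search for "JavaScript"
  await page.waitForSelector('div form div:nth-child(2) input');
  await page.click('div form div:nth-child(2) input');
  await page.keyboard.type('JavaScript');
  await page.keyboard.press('Enter');

  // wait 3 seconds for the results to load
  await new Promise(resolve => setTimeout(resolve, 3000));

  // get the URL of the first search result
  await page.waitForSelector('#main > div #center_col #search > div > div > div');
  const getHref = (page, selector) =>
    page.evaluate(
      selector => document.querySelector(selector).getAttribute('href'),
      selector
    );
  const url = await getHref(
    page,
    '#main > div #center_col #search > div > div > div a'
  );

  // open the first result and take a full-page screenshot
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await page.screenshot({ fullPage: true, path: 'new_image.png' });

  console.log('URL of the page:', url);
  console.log('Location of the screenshot:', process.cwd() + '/new_image.png');

  await page.close();
  await browser.close();
})();
```

Run it with `node <filename>.js` after installing Puppeteer; a Chromium window should open, perform the search, and leave new_image.png in the current directory.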
To learn more about Puppeteer, check the official documentation: https://github.com/puppeteer/puppeteer
Happy Scraping!