Updated Jul 13th, 2023

A high-level API to control headless Chrome over the DevTools Protocol.

Node library for automating, testing and scraping web pages.

Playing with a headless browser to automate tasks like scraping pages or clicking buttons really shows off the power of programming, and how fun it can be.

An Introduction from LearnWebCode

Brad Schiff’s 35-minute Puppeteer YouTube video is here, the practice URL he uses in the video is here, and the finished code example is here.

Table of Contents

0:00 Intro
1:20 Installing Puppeteer
4:29 Taking a Screenshot
7:09 Scraping Text From HTML
15:34 Saving Images to Hard Drive
21:45 Clicking a Button
25:16 Filling Out a Form
30:51 Scheduling a Task to Repeat

Detailed Notes

Installing Puppeteer

npm init -y
npm install puppeteer

Note: Puppeteer was at version 20.8.2 at the time of this post

Taking a Screenshot

You can customize the viewport size, but if you want to capture an entire long page in a single screenshot:

await page.screenshot({path: "amazing.png", fullPage: true})

It is important to use await browser.close() to end the process.
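For reference, a complete script might look like this. This is a sketch rather than Brad's exact code: the URL and output filename are placeholders, and it assumes Puppeteer has already been installed.

```javascript
// A minimal end-to-end sketch (not the video's exact code).
// Assumes puppeteer has been installed with `npm install puppeteer`;
// the URL and output filename below are placeholders.
async function takeScreenshot(url, path) {
  const puppeteer = require("puppeteer")
  const browser = await puppeteer.launch({ headless: "new" })
  const page = await browser.newPage()
  await page.goto(url)
  await page.screenshot({ path, fullPage: true })
  await browser.close() // ends the headless Chrome process
}

// Usage: takeScreenshot("https://example.com", "amazing.png")
```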

Side Note: This was a warning in the console when first running the script:

Puppeteer old Headless deprecation warning:
In the near future headless: true will default to the new Headless mode
for Chrome instead of the old Headless implementation. For more
information, please see https://developer.chrome.com/articles/new-headless/.
Consider opting in early by passing headless: "new" to puppeteer.launch()
If you encounter any bugs, please report them to https://github.com/puppeteer/puppeteer/issues/new/choose.

Scraping Text From HTML

The goal: extract the names of the three pets into a text file on the hard drive.

const fs = require("fs/promises")

await fs.writeFile("names.txt", names.join("\r\n"))

Note: “\r” is carriage return and “\n” is a newline
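To see what the join produces, here is a tiny illustration (the pet names below are just example values):

```javascript
// "\r\n" (carriage return + newline) puts each name on its own line,
// using Windows-style line endings. The names here are example values.
const names = ["Purrsloud", "Barksalot", "Meowsalot"]
const fileContents = names.join("\r\n")
console.log(fileContents)
// Purrsloud
// Barksalot
// Meowsalot
```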

Right-click and choose Inspect on the part of the page you’re interested in to take a look at the HTML structure.

Another trick: in the inspector, right-click the element and use Copy > Copy selector.

const names = await page.evaluate(() => {
  return Array.from(document.querySelectorAll(".info strong")).map(x => x.textContent)
})

await fs.writeFile("names.txt", names.join("\r\n"))

Note: console.log inside the evaluate function logs to the headless browser and not your console.
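If you do want to see those logs, Puppeteer pages emit console events that you can forward to your own terminal. A small sketch (it assumes an existing Puppeteer page object):

```javascript
// Forward the headless browser's console messages to the Node terminal.
// `page` is assumed to be an existing Puppeteer page object.
function forwardBrowserLogs(page) {
  page.on("console", msg => console.log("browser:", msg.text()))
}
```

Call forwardBrowserLogs(page) before page.evaluate and the browser-side console.log output will show up in your terminal.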

Saving Images to Hard Drive

Instead of using Array.from inside page.evaluate, we can use page.$$eval, which hands the callback an actual array rather than a NodeList.

const photos = await page.$$eval("img", imgs => {
  return imgs.map(x => x.src)
})

Use “for…of” over “forEach()” so we can use the await syntax:

for (const photo of photos) {
  const imagepage = await page.goto(photo)
  await fs.writeFile(photo.split("/").pop(), await imagepage.buffer())
}
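The filename logic in that writeFile call can be seen in isolation (the URL below is a made-up example):

```javascript
// split("/") breaks the URL into pieces; pop() takes the last piece,
// i.e. the filename. The URL here is a made-up example.
const photo = "https://example.com/images/cat-photo.jpg"
const filename = photo.split("/").pop()
console.log(filename) // cat-photo.jpg
```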

Clicking a Button

await page.click("#clickme")

Note: the single-dollar $eval (versus $$eval) selects one matching element, so we don’t have to spell out document.querySelector ourselves.

await page.click("#clickme")
const clickedData = await page.$eval("#data", el => el.textContent)

Filling Out a Form

The goal: get access to text on the page we land on after submitting the form, by waiting for the request and the navigation to the new page to complete.

await page.type("#ourfield", "blue")
await Promise.all([page.click("#ourform button"), page.waitForNavigation()])
const info = await page.$eval("#message", el => el.textContent)

console.log(info) // to test we got what we want

Note the use of “Promise.all”, which Brad mentions was a workaround for an error he was getting before: the click and the wait for navigation need to be started together rather than one after the other.

Also Note: in the code above he uses “page.click” to simulate submitting the form because the example URL is just a static page and not connected to a real server. Typically we would use JS to submit the form.

Scheduling a Task to Repeat

A few options:

setInterval(start, 5000) // run start() every 5 seconds

For finer-grained schedules (e.g. only during the 3rd hour of every Monday), use node-cron:

npm install node-cron

and then

const cron = require("node-cron")
cron.schedule("*/5 * * * * *", start) // every 5 seconds

See crontab.guru for more options.
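node-cron patterns have six fields (an extra seconds field to the left of standard cron's five). Breaking down the pattern above, plus a Monday-at-3-AM pattern of my own as an illustration:

```javascript
// node-cron field order: second minute hour day-of-month month day-of-week
const pattern = "*/5 * * * * *"
const [second, minute, hour, dayOfMonth, month, dayOfWeek] = pattern.split(" ")
console.log(second) // prints */5 -> fire whenever the seconds value is divisible by 5

// An illustrative pattern (my own example, not from the video):
// 3:00:00 AM every Monday. In cron, day-of-week 1 = Monday.
const mondayAt3am = "0 0 3 * * 1"
```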

Note: this requires the Node script to be up and running forever, which is not practical (Node could run into an error and not know to restart itself, your computer needs to stay on, etc.). So you could schedule it at the OS level instead. But Windows doesn’t have cron, and Mac requires a ton of permissions to get it to work. Linux (which is what a server is typically running) is the best fit for this.