A high-level API to control headless Chrome over the DevTools Protocol.
Node library for automating, testing and scraping web pages.
Playing with a headless browser to accomplish tasks like scraping pages or clicking buttons through automation really shows off the power of programming, and how fun it can be.
Brad Schiff’s Puppeteer 35-min YouTube video is here, the practice URL he uses in the video is here, and the finished code example is here.
0:00 Intro
1:20 Installing Puppeteer
4:29 Taking a Screenshot
7:09 Scraping Text From HTML
15:34 Saving Images to Hard Drive
21:45 Clicking a Button
25:16 Filling Out a Form
30:51 Scheduling a Task to Repeat
Installing Puppeteer
npm init -y
npm install puppeteer
Note: this was version 20.8.2 as of the time of this post
Taking a Screenshot
You can customize the size of the window, but if you want to capture a really long page in one shot:
await page.screenshot({path: "amazing.png", fullPage: true})
It is important to use await browser.close()
to end the process.
Side Note: This was a warning in the console when first running the script:
Puppeteer old Headless deprecation warning:
In the near future headless: true
will default to the new Headless mode
for Chrome instead of the old Headless implementation. For more
information, please see https://developer.chrome.com/articles/new-headless/.
Consider opting in early by passing headless: "new"
to puppeteer.launch()
If you encounter any bugs, please report them to https://github.com/puppeteer/puppeteer/issues/new/choose.
Scraping Text From HTML
extract the name from the three pets into a text file on the hard drive
const fs = require("fs/promises")
await fs.writeFile("names.txt", names.join("\r\n"))
Note: “\r” is carriage return and “\n” is a newline
Right-click and Inspect the element you want so you can take a look at the HTML structure.
Another trick is to right-click the element in DevTools and choose Copy > Copy selector.
const names = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".info strong")).map(x => x.textContent)
})
await fs.writeFile("names.txt", names.join("\r\n"))
Note: console.log inside the evaluate function logs to the headless browser and not your console.
Saving Images to Hard Drive
Instead of using Array.from we can use page.$$eval, which hands us an actual array rather than a NodeList.
const photos = await page.$$eval("img", imgs => {
return imgs.map(x => x.src)
})
Use “for…of” instead of “forEach()” so we can use the await syntax inside the loop:
for (const photo of photos) {
const imagepage = await page.goto(photo)
await fs.writeFile(photo.split("/").pop(), await imagepage.buffer())
}
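The photo.split("/").pop() call just grabs the file name off the end of the image URL so we have something to name the saved file. As a tiny standalone helper:

```javascript
// Pull the file name off the end of an image URL,
// e.g. for naming the file saved to disk
const fileNameFromUrl = url => url.split("/").pop()

console.log(fileNameFromUrl("https://example.com/images/puppy-1.jpg")) // → "puppy-1.jpg"
```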
Clicking a Button
Note: the single $eval (versus $$eval) means we don’t have to spell out document.querySelector when selecting a single instance of an element.
await page.click("#clickme")
const clickedData = await page.$eval("#data", el => el.textContent)
console.log(clickedData)
Filling Out a Form
Get access to text on the next page by waiting for the request and the navigation to finish.
await page.type("#ourfield", "blue")
await Promise.all([page.click("#ourform button"), page.waitForNavigation()])
const info = await page.$eval("#message", el => el.textContent)
console.log(info) // to test we got what we want
Note the use of “Promise.all”, which Brad mentions was a workaround for an error he was getting earlier.
Also note: in the code above he uses “page.click” to simulate submitting the form because the example URL is just a static page, not connected to a real server. Typically we would use JS to submit the form.
Scheduling a Task to Repeat
A few options:
setInterval(start, 5000)
Only on the 3rd hour of every Monday
npm install node-cron
and then
const cron = require("node-cron")
cron.schedule("*/5 * * * * *", start)
See crontab.guru for more options here
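For the “3rd hour of every Monday” example, the standard five-field expression would be the following (a sketch; the commented-out schedule call assumes node-cron is installed and a start function exists):

```javascript
// Cron fields (5-field form): minute hour day-of-month month day-of-week
// Minute 0, hour 3, any date, any month, day-of-week 1 (Monday):
const mondayAtThree = "0 3 * * 1"

// With node-cron installed:
//   const cron = require("node-cron")
//   cron.schedule(mondayAtThree, start)
console.log(mondayAtThree)
```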
Note: this requires the node script to be up and running forever, which is not practical (node could run into an error and not know to restart itself, your computer needs to stay on, etc.). So you could schedule the script at the OS level instead. But Windows doesn’t have crontab, and Mac requires a ton of permissions to get it to work. Linux (which is what a server is typically running) is the best fit for this.