
Web scraping is the technique of extracting information from websites with code. It has many uses: collecting data (especially when no API is provided), comparing pricing data across e-commerce platforms, and so on.
(A quick note: Screen scraping can violate the terms of service of many sites. This post is for educational purposes only. If you are considering building a screen-scraping application, make sure to check the terms of service of the site before running it.)
In this tutorial, we will be using JavaScript (Node.js) and the headless browser module, Puppeteer, to automatically extract episode data and download links from a podcast’s page on Podbean.com.
Step 0: Setup
This tutorial assumes you have a fair knowledge of HTML, the DOM, and JavaScript (Node.js).
- Install Node.js and npm, if you haven’t already.
- Install Puppeteer with npm install puppeteer --save
Step 1: Accessing a podcast’s page
Copy and paste the following code into a JS file. We’ll call it scrape.js.
const puppeteer = require("puppeteer"); // import the puppeteer module

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL =
  "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

async function scrapeEpisodeLinks() {
  // headless: false so we can watch the browser as it works
  let browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage(); // open a new page
  await page.goto(PODCAST_URL); // access the podcast's page
}

scrapeEpisodeLinks();
Run this program in the command line: node scrape.js
On running, a Chrome browser will open and the podcast page will load.
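The Step 1 script never closes the browser, which is handy for watching it work but usually leaves the Node process running until the window is closed by hand. Below is a minimal sketch of one way to guarantee cleanup. The name `withBrowser` is a hypothetical helper, not part of Puppeteer, and it takes the puppeteer module as an argument so the same code can also be tried with a stand-in object.

```javascript
// Hypothetical helper: launch a browser, run `task` with a fresh page,
// and always close the browser, even if `task` throws.
// `puppeteerLike` is expected to be the puppeteer module (or a stand-in
// object exposing the same launch() method).
async function withBrowser(puppeteerLike, task) {
  const browser = await puppeteerLike.launch({ headless: false });
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    await browser.close(); // runs on success and on error
  }
}

// Usage with the real module (assumes PODCAST_URL from the script above):
// withBrowser(require("puppeteer"), page => page.goto(PODCAST_URL));
```

The try/finally block is the whole point here: whatever happens inside `task`, the browser is closed before the error (if any) propagates to the caller.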
Step 2: Developing our algorithm using Developer Tools for inspection
Open the podcast’s page in your browser, and then open “Developer Tools.”
You’ll notice that:
- All episodes are in a table element with class ‘items’
- Each episode has its title in an anchor element with class ‘listen-now’
- Each episode has its release date in a span element with class ‘datetime’
- Each episode has a link that points to a page where you can download that episode. The link is in the href attribute of the element with class ‘download’
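These observations map directly onto selector lookups. As a rough sketch, the per-episode logic could look like the function below. The helper name `extractEpisode` is made up for illustration, and the null checks and whitespace trimming are additions of this sketch rather than part of the tutorial's own code. It relies only on `querySelector`, `textContent`, and `getAttribute`, so it works on any DOM element.

```javascript
// Hypothetical helper: pull the three fields out of one episode row.
// `panel` is any DOM element (or an object with a compatible querySelector).
function extractEpisode(panel) {
  const titleEl = panel.querySelector(".listen-now"); // episode title
  const dateEl = panel.querySelector(".datetime"); // release date
  const linkEl = panel.querySelector(".download"); // link to the download page

  return {
    // Fall back to null instead of throwing when an element is missing
    title: titleEl ? titleEl.textContent.trim() : null,
    datetime: dateEl ? dateEl.textContent.trim() : null,
    episode_download_page: linkEl ? linkEl.getAttribute("href") : null,
  };
}
```

Because page.evaluate runs its callback inside the browser, a helper like this would have to be defined inside that callback rather than at the top of the Node script.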
Step 3: Extracting episode information
Now, let’s update the scrapeEpisodeLinks function with the algorithm:
const puppeteer = require("puppeteer"); // import the puppeteer module

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL =
  "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

async function scrapeEpisodeLinks() {
  // headless: false so we can watch the browser as it works
  let browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage(); // open a new page
  await page.goto(PODCAST_URL); // access the podcast's page

  let episodes_details = await page.evaluate(() => {
    // Extract each episode's basic details
    let table = document.querySelector(".items");
    let episode_panels = Array.from(table.children);

    // Loop through each episode and get their details
    let episodes_info = episode_panels.map(episode_panel => {
      let title = episode_panel.querySelector(".listen-now").textContent;
      let datetime = episode_panel.querySelector(".datetime").textContent;
      let episode_download_page = episode_panel
        .querySelector(".download")
        .getAttribute("href");
      return { title, datetime, episode_download_page };
    });
    return episodes_info;
  });

  console.log(episodes_details);

  // Close the browser when everything is done
  await browser.close();
}

scrapeEpisodeLinks();
Run this code, and your terminal should print an array of objects, one per episode, each with title, datetime, and episode_download_page properties.
Step 4: Extracting the download link
You’ll notice that the URL we extracted in Step 3 is not the actual link to the podcast’s audio file, but the link to that episode’s download page. As such, we need to visit each episode’s download page and extract the download link from it.
The audio URL for the episode can be found in an anchor element with class ‘download-btn’.
We will create a new function to handle that:
const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL =
  "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

// New function to go to episode download page and get the download link for the audio
async function getDownloadLink(page, url) {
  await page.goto(url);
  let download_link = await page.evaluate(() => {
    let download_btn = document.querySelector(".download-btn");
    return download_btn.getAttribute("href");
  });
  return download_link;
}

async function scrapeEpisodeLinks() { ... }
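getDownloadLink assumes every download page contains an element with class ‘download-btn’. If a page does not, document.querySelector returns null and the getAttribute call throws inside the page. A defensive variant, sketched here under the hypothetical name `getDownloadLinkOrNull`, returns null instead. It uses only page.goto and page.evaluate, the same Puppeteer calls as the function above.

```javascript
// Hypothetical variant of getDownloadLink that tolerates a missing button.
// `page` is a Puppeteer page (anything with goto() and evaluate() works).
async function getDownloadLinkOrNull(page, url) {
  await page.goto(url);
  return page.evaluate(() => {
    const download_btn = document.querySelector(".download-btn");
    // Return null rather than throwing when the button is absent
    return download_btn ? download_btn.getAttribute("href") : null;
  });
}
```

A caller can then skip or log episodes whose link came back as null instead of letting one bad page abort the whole run.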
Step 5: Tying it all together
Now that we have created a function to extract the actual download link, we can use it in our main code:
const puppeteer = require("puppeteer"); // import the puppeteer module

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL =
  "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

async function getDownloadLink(page, url) {
  await page.goto(url);
  let download_link = await page.evaluate(() => {
    let download_btn = document.querySelector(".download-btn");
    return download_btn.getAttribute("href");
  });
  return download_link;
}

async function scrapeEpisodeLinks() {
  // headless: false so we can debug
  let browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage(); // open a new page
  await page.goto(PODCAST_URL);

  let episodes_details = await page.evaluate(() => {
    // Extract each episode's basic details
    let table = document.querySelector(".items");
    let episode_panels = Array.from(table.children);

    // Loop through each episode and get their details
    let episodes_info = episode_panels.map(episode_panel => {
      let title = episode_panel.querySelector(".listen-now").textContent;
      let datetime = episode_panel.querySelector(".datetime").textContent;
      let episode_download_page = episode_panel
        .querySelector(".download")
        .getAttribute("href");
      return { title, datetime, episode_download_page };
    });
    return episodes_info;
  });

  // Loop through all episodes and get actual download link for each episode
  let episodes = [];
  for (let episode of episodes_details) {
    // Since download page is not a full url, we need to prepend it with the base url for podbean
    let download_link = await getDownloadLink(
      page,
      BASE_URL + episode["episode_download_page"]
    );
    episode["download_link"] = download_link;
    episodes.push(episode);
  }

  console.log(episodes);
  await browser.close();
}

scrapeEpisodeLinks();
Running this code prints our final result: the list of episodes, each now including its download_link.
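The Step 5 code builds each download-page address by string concatenation (BASE_URL plus the extracted path). That works when the extracted href is a relative path without a leading slash, but it can produce a doubled slash or a broken address if the site returns something else. One alternative, shown as a sketch with the hypothetical helper name `toAbsoluteUrl`, is Node's built-in URL class, which resolves a link against a base the way a browser would. The example paths are invented for illustration and are not taken from Podbean.

```javascript
const BASE_URL = "https://www.podbean.com/";

// Hypothetical helper: resolve an extracted href against the site's base URL.
// Links that are already absolute keep their own host.
function toAbsoluteUrl(href, base = BASE_URL) {
  return new URL(href, base).href;
}

// The example paths below are made up for illustration.
console.log(toAbsoluteUrl("site/EpisodeDownload/abc"));
// → https://www.podbean.com/site/EpisodeDownload/abc
console.log(toAbsoluteUrl("/site/EpisodeDownload/abc"));
// → https://www.podbean.com/site/EpisodeDownload/abc
console.log(toAbsoluteUrl("https://example.com/file.mp3"));
// → https://example.com/file.mp3
```

In the loop, the call to getDownloadLink could then receive toAbsoluteUrl(episode["episode_download_page"]) in place of the concatenated string.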
Note: The information presented in this blog post/tutorial is for educational and informational purposes only.
The post Web Scraping with Node.js and Puppeteer appeared first on Sweetcode.io.