Web Scraping with Node.js and Puppeteer

Web scraping is the technique of extracting information from websites with code. It has many uses: collecting data (especially when no API is provided), comparing prices across e-commerce platforms, and so on.

(A quick note: Screen scraping can violate the terms of service of many sites. This post is for educational purposes only. If you are considering building a screen-scraping application, make sure to check the terms of service of the site before running it.)

In this tutorial, we will be using JavaScript (Node.js) and the headless browser module, Puppeteer, to automatically extract episode data and download links from a podcast’s page on Podbean.com.

Step 0: Setup

This tutorial assumes a working knowledge of HTML, the DOM, and JavaScript (Node.js).

Install Node.js and npm, if you haven’t already.
Install Puppeteer with npm install puppeteer --save

Step 1: Accessing a podcast’s page

Copy and paste the following code into a JS file, which we’ll call scrape.js.

const puppeteer = require("puppeteer"); // import the puppeteer module 

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL = "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

async function scrapeEpisodeLinks() {
 let browser = await puppeteer.launch({ headless: false }); //headless:false so we can watch the browser as it works 
 let page = await browser.newPage(); //open a new page
 await page.goto(PODCAST_URL); //access the podcasts page
}

scrapeEpisodeLinks()

Run this program in the command line: node scrape.js
On running, a Chrome browser will open and the podcast page will load.
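
Note that the script above never closes the browser, so the Node process keeps running after the page has loaded, and an error in page.goto would leave a stray Chrome window behind. One possible guard is a small try/finally wrapper. This is only a sketch: withBrowser is a hypothetical helper name, not part of Puppeteer, and it takes the launch function as a parameter so it can be tried out without a real browser.

```javascript
// Hypothetical helper (not part of Puppeteer): runs `task` with a browser
// and always closes that browser afterwards, even if `task` throws.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Assumed usage with Puppeteer (requires `npm install puppeteer`):
// const puppeteer = require("puppeteer");
// withBrowser(() => puppeteer.launch({ headless: false }), async browser => {
//   const page = await browser.newPage();
//   await page.goto(PODCAST_URL);
// });
```

With a wrapper like this, the browser is closed whether the scraping code finishes normally or fails partway through.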

Step 2: Developing our algorithm using Developer Tools for inspection

Open the podcast’s page in your browser, and then open “Developer Tools.”

You’ll notice that:

All episodes are in a table element with class ‘items’

Each episode has its title in an anchor element with class ‘listen-now’

Each episode has its release date in a span element with class ‘datetime’

Each episode has a link that points to a page where you can download that episode. The link is stored in the href attribute of the element with class ‘download’
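
The observations above map onto a handful of CSS selectors. As a rough sketch, they can be collected in one object and read through a helper that returns null instead of throwing when an element is missing (for example, an episode without a download link). The names SELECTORS and extractEpisode are made up for this illustration, and the function only assumes that each panel exposes querySelector, as DOM elements do. Because page.evaluate runs its callback inside the browser, both would have to be defined within that callback when used with Puppeteer.

```javascript
// Selectors taken from the inspection above; the object name is illustrative.
const SELECTORS = {
  list: ".items",
  title: ".listen-now",
  date: ".datetime",
  downloadPage: ".download",
};

// Hypothetical helper: reads one episode panel, returning null for any
// field whose element is not found instead of throwing a TypeError.
function extractEpisode(panel, selectors = SELECTORS) {
  const titleEl = panel.querySelector(selectors.title);
  const dateEl = panel.querySelector(selectors.date);
  const linkEl = panel.querySelector(selectors.downloadPage);
  return {
    title: titleEl ? titleEl.textContent.trim() : null,
    datetime: dateEl ? dateEl.textContent.trim() : null,
    episode_download_page: linkEl ? linkEl.getAttribute("href") : null,
  };
}
```

Trimming the text is a design choice here: textContent often carries the whitespace and line breaks of the surrounding markup.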

Step 3: Extracting episode information

Now, let’s update the scrapeEpisodeLinks function with the algorithm:
const puppeteer = require("puppeteer"); // import the puppeteer module

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL = "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

async function scrapeEpisodeLinks() {
 let browser = await puppeteer.launch({ headless: false }); //headless:false so we can watch the browser as it works 
 let page = await browser.newPage(); //open a new page
 await page.goto(PODCAST_URL); //access the podcasts page
 
 let episodes_details = await page.evaluate(() => {
   //Extract each episode's basic details
   let table = document.querySelector(".items");
   let episode_panels = Array.from(table.children); 
   
   // Loop through each episode and get their details 
   let episodes_info = episode_panels.map(episode_panel => {
     let title = episode_panel.querySelector(".listen-now").textContent;
     let datetime = episode_panel.querySelector(".datetime").textContent;
     let episode_download_page = episode_panel
       .querySelector(".download")
       .getAttribute("href");
     return { title, datetime, episode_download_page };
   });
   return episodes_info;
 });
 
 console.log(episodes_details)
 // Close the browser when everything is done 
 await browser.close() 
}

scrapeEpisodeLinks()

Run this code, and you should see an array of episode objects printed in your terminal, each with title, datetime, and episode_download_page fields.

Step 4: Extracting the download link

You’ll notice that the URL we extracted in Step 3 is not the actual link to the audio file of the podcast, but a link to that episode’s download page. We therefore need to visit each episode’s download page and extract the audio link from it.

The audio URL for the episode can be found in an anchor element with class ‘download-btn’

We will create a new function to handle that:

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL = "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

// New function to go to episode download page and get the download link for the audio 
async function getDownloadLink(page, url) {
 await page.goto(url);
 let download_link = await page.evaluate(() => {
   let download_btn = document.querySelector(".download-btn");
   return download_btn.getAttribute("href");
 });
 return download_link;
}

async function scrapeEpisodeLinks(){
  ...
}
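
Before getDownloadLink can visit a download page, the href collected in Step 3 has to be an absolute URL. Plain string concatenation can produce a doubled slash when the base ends with "/" and the path starts with one. One alternative, sketched below, is Node’s built-in URL class, which resolves a relative path against a base. The function name toAbsoluteUrl and the example path are hypothetical.

```javascript
const BASE_URL = "https://www.podbean.com/";

// Hypothetical helper: resolve an href (relative or absolute) against the base.
function toAbsoluteUrl(href, base = BASE_URL) {
  return new URL(href, base).href;
}

// Made-up example path, for illustration only.
console.log(toAbsoluteUrl("/site/EpisodeDownload/EXAMPLE"));
// prints "https://www.podbean.com/site/EpisodeDownload/EXAMPLE"
```

An href that is already absolute passes through unchanged, so the same helper works for both kinds of link.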

Step 5: Tying it all together

Now that we have created a function to extract the actual download link, we can use it in our main code:

const puppeteer = require("puppeteer"); // import the puppeteer module

const BASE_URL = "https://www.podbean.com/";
const PODCAST_URL = "https://www.podbean.com/podcast-detail/nth28-2ef41/99%25-Invisible-Podcast";

async function getDownloadLink(page, url) {
 await page.goto(url);
 let download_link = await page.evaluate(() => {
   let download_btn = document.querySelector(".download-btn");
   return download_btn.getAttribute("href");
 });
 return download_link;
}

async function scrapeEpisodeLinks() {
 let browser = await puppeteer.launch({ headless: false }); //headless:false so we can debug
 let page = await browser.newPage(); //open a new page
 await page.goto(PODCAST_URL);
 let episodes_details = await page.evaluate(() => {
   //Extract each episode's basic details
   let table = document.querySelector(".items");
   let episode_panels = Array.from(table.children);

   // Loop through each episode and get their details
   let episodes_info = episode_panels.map(episode_panel => {
     let title = episode_panel.querySelector(".listen-now").textContent;
     let datetime = episode_panel.querySelector(".datetime").textContent;
     let episode_download_page = episode_panel
       .querySelector(".download")
       .getAttribute("href");
     return { title, datetime, episode_download_page };
   });
   return episodes_info;
 });

 // Loop through all episodes and get actual download link for each episode
 let episodes = [];
 for (let episode of episodes_details) {
   let download_link = await getDownloadLink(
     page,
     BASE_URL + episode["episode_download_page"] // Since download page is not a full url, we need to prepend it with the base url for podbean
   );
   episode["download_link"] = download_link;
   episodes.push(episode);
 }
 console.log(episodes)
 await browser.close()
}

scrapeEpisodeLinks()

Running this code prints our final result: the list of episodes, each now carrying its download_link.
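
One fragile spot in the loop is that a single failing download page (a timeout, or a page without a .download-btn element) aborts the whole run. A possible variation, sketched here, records the error for that episode and moves on. collectDownloadLinks is a hypothetical name, and the link lookup is passed in as a function; with Puppeteer it could be something like url => getDownloadLink(page, BASE_URL + url). The lookup receives the episode_download_page value unchanged, so it is responsible for making the URL absolute.

```javascript
// Hypothetical helper: visits episodes one at a time and keeps going when
// a lookup fails, storing the error message instead of a link.
async function collectDownloadLinks(episodes, getLink) {
  const results = [];
  for (const episode of episodes) {
    try {
      const download_link = await getLink(episode.episode_download_page);
      results.push({ ...episode, download_link });
    } catch (err) {
      results.push({ ...episode, download_link: null, error: err.message });
    }
  }
  return results;
}
```

The episodes are still processed sequentially, as in the tutorial’s loop, because a single page object is reused for every visit.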

Note: The information presented in this blog post/tutorial is for educational and informational purposes only.

The post Web Scraping with Node.js and Puppeteer appeared first on Sweetcode.io.
