Scraping data with Puppeteer from React website

Time:01-06

I am trying to extract data from https://invictusdao.fi/#/dashboard, but I'm stuck.

There are no helpful class-names in the HTML. Sample here:

<div >
    <div id="dashboard-view">
        <div  style="
        transform: none;
        transition: transform 225ms cubic-bezier(0.4, 0, 0.2, 1) 0ms;
      ">
            <div >
                <div >
                    <a 
                        target="_blank" style="cursor: default">
                        <div >
                            <h5  tooltip="">
                                $IN Price
                            </h5>
                            <h4 >$289.50</h4>
                        </div>
                    </a>
                </div>

I tried using page.evaluate to get the titles and values of the elements on the page.

This is my code:

const puppeteer = require("puppeteer");

(async () => {
  try {
    const browser = await puppeteer.launch({ headless: false });

    const page = await browser.newPage();
    await page.goto("https://invictusdao.fi/#/dashboard");

    await page.waitForSelector(".data-grid");

    // extracting information from code
    let cards = await page.evaluate(() => {
      let cardsElement = document.body.querySelectorAll(".stat-tile-content");
      cards = Object.values(cardsElement).map((x) => {
        return {
          title: x.querySelector(".MuiTypography-root.light-tooltip.MuiTypography-h5").textContent ?? null,
          value: x.querySelector(".MuiTypography-root.MuiTypography-h4").textContent ?? null,
        };
      });
      return cards;
    });

    // logging results
    const inPrice = cards[0].value;
    const apy = cards[1].value;
    const mCap = cards[2].value;

    const supply = cards[3].value;
    const tvl = cards[4].value;
    const treasury = cards[5].value;
    const inStaked = cards[6].value;
    const rfv = cards[7].value;
    const backedPrice = cards[8].value;
    const runway = cards[9].value;
    const currentIndex = cards[10].value;

    console.log("$IN price", "$" + inPrice);
    console.log("APY", apy);
    console.log("Market Cap", mCap);

    console.log("Supply", supply);
    console.log("TVL", tvl);
    console.log("Treasury", treasury);
    console.log("IN Staked", inStaked);
    console.log("Risk Free Value", rfv);
    console.log("Backed Price", backedPrice);
    console.log("Runway", runway);
    console.log("Current Index", currentIndex);
    await browser.close();

    process.exit(0);
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
})();

That gives me the titles but not the values (I get empty strings).

What am I doing wrong here?

CodePudding user response:

At a glance, your selectors seem fine. The problem appears to be that the elements are rendered but without data, so you're scraping the empty text contents without waiting for them to be filled in asynchronously.

I tried using a waitForFunction that polls on whether the text contents you want are empty. When they're not empty, then go ahead and scrape:

const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.goto("https://invictusdao.fi/#/dashboard");
  await page.waitForFunction(`
    document.querySelector(".stat-tile-content h4")
     ?.textContent.trim()
  `);
  const data = await page.$$eval(
    ".stat-tile-content",
    els => els.map(el => ({
      title: el.querySelector("h5").textContent.trim(),
      value: el.querySelector("h4").textContent.trim(),
    }))
  );
  console.log(data);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

Output:

[
  { title: '$IN Price', value: '$276.73' },
  { title: 'APY', value: '30,718.206%' },
  { title: 'Market Cap', value: '$127,479,790' },
  { title: 'Supply', value: '460,658' },
  { title: 'TVD', value: '$98,727,599' },
  { title: 'Treasury', value: '$31,801,741' },
  { title: 'IN Staked', value: '77.45%' },
  { title: 'Risk Free Value', value: '$31,801,741' },
  { title: 'Backed Price', value: '$69.04' },
  { title: 'Runway', value: '269 Days' },
  { title: 'Current Index', value: '2.2127' }
]

If you want the data as an object keyed by the title, you could reduce instead of map:

    // ...
    els => els.reduce((a, el) => {
      a[el.querySelector("h5").textContent.trim()] = 
        el.querySelector("h4").textContent.trim();
      return a;
    }, {})

Output:

{
  '$IN Price': '$276.73',
  APY: '30,821.15%',
  'Market Cap': '$127,482,974',
  Supply: '460,670',
  TVD: '$98,670,031',
  Treasury: '$31,801,741',
  'IN Staked': '77.40%',
  'Risk Free Value': '$31,801,741',
  'Backed Price': '$69.03',
  Runway: '269 Days',
  'Current Index': '2.2128'
}
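If you'd rather keep the map version in the browser and reshape in Node afterwards, Object.fromEntries does the same job. This is only a sketch: toKeyedObject is a name I made up for illustration, and the sample records are copied from the first output above.

```javascript
// Hypothetical helper (not part of Puppeteer): turn an array of
// {title, value} records into a plain object keyed by title.
const toKeyedObject = records =>
  Object.fromEntries(records.map(({title, value}) => [title, value]));

// Two sample records taken from the map() output shown earlier.
const sample = [
  {title: "$IN Price", value: "$276.73"},
  {title: "Runway", value: "269 Days"},
];

console.log(toKeyedObject(sample));
```

If two cards ever shared a title, the later one would overwrite the earlier one, same as with the reduce version.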

Note that there's some weird behavior where the site changes the value of "Risk Free Value" a moment after the data loads. Initially, it shows the same value as the "Treasury" card.

One approach to account for this is waiting a second, but it's generally better to use waitForFunction to avoid a race condition and keep things fast.

One predicate could check that all the elements' text contents are unique. This would never pass if two values are legitimately equal, so if that fits your use case best, you could catch the timeout after a short time and scrape whatever's there as normal:

  // ...
  await page.waitForFunction(() => {
    const sel = ".stat-tile-content h4";
    const text = [...document.querySelectorAll(sel)]
      .map(e => e.textContent.trim())
    ;
    return text.length && new Set(text).size === text.length;
  });
  // ...

This code would replace the original waitForFunction.
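The "catch the timeout and scrape whatever's there" fallback could look roughly like the following. Treat it as a sketch under assumptions: scrapeCards is a hypothetical helper name, the 5-second default is arbitrary, and I'm assuming the rejection from waitForFunction has its name set to "TimeoutError", which is how Puppeteer's timeout errors have behaved for me.

```javascript
// Hypothetical helper: wait until every card value is non-duplicated,
// but fall back to scraping the current contents if that never happens.
// `page` is a Puppeteer Page; `timeout` is in milliseconds (assumed default).
const scrapeCards = async (page, timeout = 5000) => {
  const valueSel = ".stat-tile-content h4";
  try {
    await page.waitForFunction(
      sel => {
        const text = [...document.querySelectorAll(sel)]
          .map(e => e.textContent.trim());
        return text.length && new Set(text).size === text.length;
      },
      {timeout},
      valueSel // passed into the browser context as `sel`
    );
  } catch (err) {
    // Only swallow the timeout; anything else is a real failure.
    if (err.name !== "TimeoutError") throw err;
  }

  // Scrape whatever is on the page at this point.
  return page.$$eval(".stat-tile-content", els =>
    els.map(el => ({
      title: el.querySelector("h5").textContent.trim(),
      value: el.querySelector("h4").textContent.trim(),
    }))
  );
};
```

You'd call it as const data = await scrapeCards(page); in place of the separate waitForFunction and $$eval calls.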
