There is a website with a directory of 1,296 items on 54 index pages of 24 items each. Each item is represented by a tile that provides some information and is hyperlinked to a page with more details. I've studied the HTML of the index pages (not available by viewing source, but available in DevTools), and it contains all the data displayed in the tiles, but it does not contain the URLs that the tiles are linked to.
The structure of the HTML is strange (at least to me): most of the code on every index page appears to contain all or most of the data shown on the 24 detail pages linked from index page 1. I don't understand why those data would be repeated on all 54 index pages when they're only used to build the detail pages linked from index page 1. I also wonder why the system makes the data for the first 24 detail pages readily available in the HTML (if one knows enough to use DevTools) but appears to keep the data for the other detail pages hidden somewhere. Most of all, I wonder whether there's a way to find the data for the detail pages linked from index pages 2 through 54 without reading the HTML of 1,272 separate pages (1,272 = 24 detail pages * 53 index pages after page 1). Is there a way to do that?
And here's a simpler question:
The data for those first 24 detail pages include the slugs from which the URLs for those pages are constructed, so I can get the URLs for the detail pages linked from index page 1. Most of the slugs are made by simply taking the item name and converting spaces to hyphens, but that algorithm is not reliable if the name is at all complicated. So for the index pages after the first, is there a way to find out the URLs that the tiles link to?
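To illustrate why guessing the slug from the name is fragile, here is a minimal sketch of the naive rule. The function name `naive_slug` and both item names are hypothetical (the names are guessed back from slugs, not taken from the site), and the lower-casing step is an assumption based on the slugs all being lower-case.

```python
def naive_slug(name):
    # Assumed rule: lower-case the name and turn each run of
    # whitespace into a single hyphen. Punctuation is left untouched.
    return '-'.join(name.lower().split())


# Hypothetical item names, used only for illustration.
print(naive_slug('Fairway America'))            # fairway-america
print(naive_slug('Zapolski Real Estate, LLC'))  # zapolski-real-estate,-llc
```

The first result matches the real slug, but the second keeps the comma, so it would not match a slug such as zapolski-real-estate-llc. Any punctuation, ampersand, or accented character in a name raises the same problem.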
CodePudding user response:
Yes. Pull it from the API:
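import requests
import pandas as pd

# Endpoint that returns the directory data as JSON
url = 'https://api.verivest.com/sponsors/find'

# Ask for a page size larger than the directory to get every item in one call
payload = {
    'page[number]': '1',
    'page[size]': '9999',
    'sort': '-capital_managed,name',
    'returns': 'compact'}

jsonData = requests.get(url, params=payload).json()
data = jsonData['data']

# Flatten the nested records; the slug ends up in the 'attributes.slug' column
df = pd.json_normalize(data)

# Build each detail-page URL from its slug
df['links'] = 'https://verivest.com/s/' + df['attributes.slug']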
print(df['links'])
Output:
0 https://verivest.com/s/fairway-america
1 https://verivest.com/s/trion-properties
2 https://verivest.com/s/procida-funding-advisors
3 https://verivest.com/s/legacy-group-capital
4 https://verivest.com/s/tricap-residential-group
...
1291 https://verivest.com/s/zapolski-real-estate-llc
1292 https://verivest.com/s/zaragon-inc
1293 https://verivest.com/s/zeus-equity-group
1294 https://verivest.com/s/zne-capital
1295 https://verivest.com/s/zucker-investment-group
Name: links, Length: 1296, dtype: object
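If the server ever stops honoring a page size as large as 9999, the same endpoint could be read page by page instead. The following is only a sketch under assumptions: the helper names `fetch_page`, `fetch_all`, and `slug_urls` are made up for illustration, a page shorter than the requested size is assumed to mean the end of the data, and the standard library is used in place of requests so the block has no third-party dependencies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = 'https://api.verivest.com/sponsors/find'  # endpoint from the answer above


def fetch_page(number, size):
    """Fetch one page and return the list stored under the 'data' key."""
    params = {
        'page[number]': number,
        'page[size]': size,
        'sort': '-capital_managed,name',
        'returns': 'compact',
    }
    with urlopen(f'{API_URL}?{urlencode(params)}') as resp:
        return json.load(resp)['data']


def fetch_all(get_page=fetch_page, size=24, max_pages=1000):
    """Collect items page by page until a short or empty page comes back."""
    items = []
    for number in range(1, max_pages + 1):
        batch = get_page(number, size)
        items.extend(batch)
        if len(batch) < size:  # assumed end-of-data signal
            break
    return items


def slug_urls(items):
    # Assumes each item looks like {'attributes': {'slug': ...}}, as implied
    # by the 'attributes.slug' column that json_normalize produces above.
    return ['https://verivest.com/s/' + item['attributes']['slug'] for item in items]
```

Calling `slug_urls(fetch_all())` would then give the same list of links as the DataFrame column, at the cost of one request per index page rather than a single request.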
