Web scraping data to a CSV file in Python, and the code to scrape a link

1 - When I check the CSV file, I only find data from the last link (Tugende), but when I print the data I get everything I want. How can I get all the data into the CSV file?

2 - For the 'source' variable, how can I get only the article link from it and add it to the CSV file?

import requests
from bs4 import BeautifulSoup as bs
import csv

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education','Crafty-Workshop','Planet42','Paylend','Tugende']
for startup in startups:
    u = url.format(startup)
    html_text = requests.get(u).text
    soup = bs(html_text, 'lxml')
    
    list1 = soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark')
    source1 =soup.find_all('div',class_='col-md-2 mt-3 mt-lg-0')
    file = open('funding.csv', 'w',newline='')
    writer = csv.writer(file)
    mama = (['Name', 'Type', 'date','amount','investors'])
    writer.writerow(mama)



    for L in list1:      
        name1 = L.find('span', class_="line-height-1").text
        amount1 = L.find('div', class_='p-0').text.replace('Amount','').strip()
        date1 = L.find('span', class_="pt-0").text
        funding_type1 = L.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round','')
        investor1 = L.find('div',class_='col-md-3 mt-3 mt-lg-0').text.replace('investors','')
        source =L.find('div',class_="col-md-2 mt-3 mt-lg-0")
        
        print(name1, funding_type1, date1,amount1, investor1)

        writer.writerow([name1, funding_type1, date1,amount1, investor1])
    file.close()

CodePudding user response：

The reason you only get data for the final startup is the way you open your output file:

    file = open('funding.csv', 'w',newline='')

Mode 'w' opens the file for writing and truncates it, so every pass through the loop discards whatever the previous pass wrote. That is fine the first time through the loop, but not afterwards: only the last startup's rows survive.

If you really want to open the file inside the loop, you'll need to use 'a' ("open for writing, appending if the file already exists"). You would also need to write the header row only once, before the loop, or it will be repeated for every startup.
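
The difference between the two modes can be reproduced without any scraping. The following is a minimal sketch, not the asker's script: the helper names write_rows and read_rows, the file names, and the sample rows are made up for illustration, and the files live in a temporary directory.

```python
import csv
import os
import tempfile


def write_rows(path, mode, rows):
    """Reopen `path` once per row, the way the loop in the question does."""
    for row in rows:
        with open(path, mode, newline="") as f:
            csv.writer(f).writerow(row)


def read_rows(path):
    """Read the whole CSV file back as a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Illustrative rows only; the real script writes scraped values.
rows = [["OBM-Education"], ["Planet42"], ["Tugende"]]

with tempfile.TemporaryDirectory() as tmp:
    truncated = os.path.join(tmp, "w_mode.csv")
    appended = os.path.join(tmp, "a_mode.csv")

    write_rows(truncated, "w", rows)  # each open() wipes the previous row
    write_rows(appended, "a", rows)   # each open() adds to the end of the file

    w_result = read_rows(truncated)
    a_result = read_rows(appended)

print(w_result)  # [['Tugende']]
print(a_result)  # [['OBM-Education'], ['Planet42'], ['Tugende']]
```

Note that 'a' also keeps whatever an earlier run of the script left in the file, which is one more reason to prefer opening the file once in 'w' mode.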

That's not efficient, however. I suggest opening the file for writing prior to starting your for loop, and creating the writer object then too:

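with open('funding.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Type', 'date', 'amount', 'investors'])  # header, written once
    for startup in startups:
        ...  # do loop operations, calling writer.writerow(...) for each row

It is the file that gets closed, not the writer (csv.writer objects have no close() method). The with block closes the file automatically once the loop ends.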

CodePudding user response：

Printing the result of element.find() and saving it are not the same thing.
element.find() actually returns a bs4.element.Tag, not a str.
You don't notice this when printing, because Python applies str() to whatever print() receives.
Convert the value explicitly before saving it (str(tag) for the markup, tag.text for just the text), otherwise you can get unwanted results.
Example:

from bs4 import BeautifulSoup

element = BeautifulSoup('<div></div>', 'html.parser')
print(type(element.find()))       # <class 'bs4.element.Tag'>
print(type(str(element.find())))  # <class 'str'>

CodePudding user response：

1: You should use a context manager to handle the CSV file when you write to it. I've fixed your code below: first I write the headers in "w" mode (so the file is created fresh each time you run the code), then I append the data in "a" mode as I scrape each page.

2: You need to find the 'a' tag that holds the source link, then get its href attribute like this: find('a')['href']. See below.

import requests
from bs4 import BeautifulSoup as bs
import csv

#write header
with open('funding.csv','w',newline='') as file:
    writer = csv.writer(file)
    mama = (['Name', 'Type', 'date','amount','investors','source'])
    writer.writerow(mama)

url = "https://digestafrica.com/companies/{}"
startups = ['OBM-Education','Crafty-Workshop','Planet42','Paylend','Tugende']

for startup in startups:

    html_text = requests.get(url.format(startup))
    soup = bs(html_text.text,'lxml')

    for list1 in soup.find_all('div', class_='d-flex flex-wrap content mt-24 border p-2 border-dark'):
        name1 = list1.find('span', class_="line-height-1").text
        amount1 = list1.find('div', class_='p-0').text.replace('Amount','').strip()
        date1 = list1.find('span', class_="pt-0").text
        funding_type1 = list1.find('div', class_="col-md-2 mt-2 mt-lg-0").text.replace('Funding Round','')
        investor1 = list1.find('div',class_='col-md-3 mt-3 mt-lg-0').text.replace('investors','')
        source = list1.find('div',class_="col-md-2 mt-3 mt-lg-0").find('a')['href']

        print(name1, funding_type1, date1,amount1, investor1, source)

        with open('funding.csv','a',newline='') as file:
            writer = csv.writer(file)
            writer.writerow([name1, funding_type1, date1,amount1, investor1, source])
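
The href lookup from point 2 can be tried offline on a small HTML snippet. This is only a sketch: the markup is invented to imitate one "source" cell (the real page may be structured differently), and source_link is a hypothetical helper that returns an empty string when a cell has no link, instead of raising an error the way find('a')['href'] would.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two "source" cells; the real page may differ.
html = """
<div class="col-md-2 mt-3 mt-lg-0">
  Source <a href="https://example.com/article">Read more</a>
</div>
<div class="col-md-2 mt-3 mt-lg-0">Source not available</div>
"""

soup = BeautifulSoup(html, "html.parser")


def source_link(cell):
    """Return the href of the first <a> inside `cell`, or '' if there is none."""
    anchor = cell.find("a", href=True)
    return anchor["href"] if anchor else ""


cells = soup.find_all("div", class_="col-md-2 mt-3 mt-lg-0")
links = [source_link(cell) for cell in cells]
print(links)  # ['https://example.com/article', '']
```

With a guard like this, a funding row without an article link produces an empty CSV field rather than stopping the whole scrape.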