open .txt file and save output in csv file

Time:01-18

I want to open a txt file (which contains multiple links) and scrape the title of each page using BeautifulSoup. My txt file contains links like this:

https://www.lipsum.com/7845284869/
https://www.lipsum.com/56677788/
https://www.lipsum.com/01127111236/

My code:

import requests as rq
from bs4 import BeautifulSoup as bs

with open('output1.csv', 'w', newline='') as f:
    url = open('urls.txt', 'r', encoding='utf8')
    request = rq.get(str(url))
    soup = bs(request.text, 'html.parser')
    title = soup.findAll('title')
    pdtitle = {}
    for pdtitle in title:
        pdtitle.append(pdtitle.text)
f.write(f'{pdtitle}')

I want to open every link in the txt file and scrape the title from each page. The main problem is that opening the txt file into the url variable is not working. How can I read the URLs from the file and save the titles to a CSV file?

CodePudding user response:

Your code isn't working because url refers to the whole file rather than a single URL. You need to request the URLs one by one:

import requests as rq
from bs4 import BeautifulSoup as bs

# Read the file line by line; strip the trailing newline and skip blank lines
with open(r'urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

with open('output1.csv', 'w', newline='') as f:
    for url in urls:
        request = rq.get(url)
        soup = bs(request.text, 'html.parser')
        # Collect the text of every <title> tag on the page
        pdtitle = [t.text for t in soup.find_all('title')]
        # Write one line per URL, inside the loop
        f.write(' '.join(pdtitle) + '\n')
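
To see why the rq.get(str(url)) call in the question cannot work, it helps to look at what str() returns for a file object. The sketch below is only an illustration: the temporary file stands in for urls.txt, and the variable names are made up. Converting the file object to a string gives a description of the object, while reading from it gives the actual lines.

```python
import os
import tempfile

# Create a small stand-in for urls.txt (contents borrowed from the question)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf8') as tmp:
    tmp.write('https://www.lipsum.com/7845284869/\nhttps://www.lipsum.com/56677788/\n')
    path = tmp.name

with open(path, 'r', encoding='utf8') as f:
    as_string = str(f)             # description of the file object, not its text
    lines = f.read().splitlines()  # the actual contents, one URL per item

os.remove(path)

print(as_string)
print(lines)
```

Passing as_string to requests would mean requesting the object description as if it were a URL, which is why each line has to be requested separately.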

CodePudding user response:

Your URLs may not be working because each one is read with a trailing newline character: \n. You need to strip each line before putting it in a list.
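
A quick way to check this is to print the repr() of a line before and after stripping. In the sketch below, io.StringIO stands in for the opened txt file (an assumption for illustration, so it runs without any file on disk):

```python
import io

# io.StringIO stands in for the opened urls.txt (assumption for illustration)
fake_file = io.StringIO(
    'https://www.lipsum.com/7845284869/\n'
    'https://www.lipsum.com/56677788/\n'
)

raw = fake_file.readlines()              # each item still ends with '\n'
clean = [line.strip() for line in raw]   # strip() removes surrounding whitespace

print(repr(raw[0]))
print(repr(clean[0]))
```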

Also, you are using .find_all('title'), and this will return a list, which is probably not what you are looking for. You probably just want the first title and that's it. In that case, .find('title') would be better. I have provided some possible corrections below.

from bs4 import BeautifulSoup
import requests

filepath = '...'
with open(filepath) as f:
    urls = [i.strip() for i in f.readlines()]

titles = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    title = soup.find('title') # Note: will find the FIRST title only
    titles.append(title.text if title else '') # Grabs the TEXT of the title only, removes HTML; empty string if the page has no <title>

# Make sure to prepend with desired location, e.g. 'C:/user/name/urls.csv'
with open('urls.csv', 'w') as new_csv:
    for title in titles:
        new_csv.write(title + '\n') # The '\n' ensures a new row is written
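
Titles often contain commas or quotes, which a plain write() does not escape. One possible alternative is the standard library's csv module, sketched below. The function name write_titles and the sample rows are hypothetical, and an in-memory buffer stands in for the output file so the example runs without network access.

```python
import csv
import io

def write_titles(rows, fileobj):
    """Write (url, title) pairs as CSV rows, preceded by a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(['url', 'title'])
    writer.writerows(rows)

# Made-up sample data; in practice these pairs would come from the scraping loop
rows = [
    ('https://www.lipsum.com/7845284869/', 'Lorem Ipsum, all the facts'),
    ('https://www.lipsum.com/56677788/', 'A "quoted" title'),
]

buffer = io.StringIO()
write_titles(rows, buffer)
output = buffer.getvalue()
print(output)
```

With a real file, open it with newline='' as in the question, for example open('output1.csv', 'w', newline=''), so that the csv module controls the line endings.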