I have created a Google Alert that generates an RSS feed, which looks like this: https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178
How do I use Scrapy to extract the title, href, published date and content from each entry in the feed?
I have tried:
import scrapy

class GalertCovidSpider(scrapy.Spider):
    name = 'galert-covid'
    allowed_domains = ['https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178']
    start_urls = ['https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178/']

    def start_requests(self):
        urls = [
            'https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for post in response.xpath('//feed/entry'):
            yield {
                'title' : post.xpath('title//text()').extract_first(),
                'link': post.xpath('link//text()').extract_first(),
            }
But when I run it with scrapy crawl --nolog --output -:json galert-covid, it produces no output and no error.
After scraping the information, how do I store it in a DataFrame or a CSV file?
CodePudding user response:
I'm sure Scrapy can do this, but you don't have to use it. This should get the job done:
import requests
from bs4 import BeautifulSoup
import pandas as pd

name = 'galert-covid'
url = 'https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178'

resp = requests.get(url)
soup = BeautifulSoup(resp.text,'html.parser')

output = []
for entry in soup.find_all('entry'):
    item = {
        'title' : entry.find('title',{'type':'html'}).text,
        'pubdate' : entry.find('published').text,
        'content' : entry.find('content').text,
        'link' : entry.find('link')['href']
    }
    output.append(item)

df = pd.DataFrame(output)
df.to_csv('google_alert.csv',index=False)
print('Saved to google_alert.csv')
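If pandas is not available, the same rows can be written with the standard library. The following is only a sketch: the `rows` list is made-up sample data shaped like the `output` list above, and `rows_to_csv` is a hypothetical helper name, not part of any library. It writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

# Hypothetical sample rows, shaped like the `output` list built above.
rows = [
    {
        "title": "First story",
        "pubdate": "2021-01-01T00:00:00Z",
        "content": "Summary, with a comma",
        "link": "https://example.com/first",
    },
]

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return buffer.getvalue()

csv_text = rows_to_csv(rows, ["title", "pubdate", "content", "link"])
print(csv_text)
```

To write a file instead, open it with `open('google_alert.csv', 'w', newline='')` and pass that file object to `csv.DictWriter` in place of the buffer.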
CodePudding user response:
import scrapy

class GalertCovidSpider(scrapy.Spider):
    name = 'galert-covid'
    allowed_domains = ['www.google.co.in']
    start_urls = ['https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178/']
    custom_settings = {
        'FEEDS': {
            'galert-covid': {'format': 'csv'}
        }
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url)

    def parse(self, response):
        response.selector.remove_namespaces()
        for post in response.xpath('//feed/entry'):
            yield {
                'title': post.xpath('.//title//text()').get(),
                'link': post.xpath('.//link/@href').get(),
                'published date': post.xpath('.//published/text()').get(),
                'content': post.xpath('.//content/text()').get(),
            }
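Read the Scrapy documentation on removing namespaces and on feed exports.
Your spider most likely returned nothing because the feed declares a default XML namespace, so //feed/entry matches no elements until the namespaces are removed. Also, the link is stored in the href attribute of the link element, not in its text.
allowed_domains should contain only the domain, not a full URL. I removed the explicit callback inside start_requests, since parse is the default (not necessary, I just prefer it that way), and inside the yield I added a leading dot to make the XPaths relative to each entry.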
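To see why the namespace matters, here is a small standard-library sketch. It uses `xml.etree.ElementTree` rather than Scrapy, so it shows the same effect with a different tool. `SAMPLE_FEED` is a made-up, trimmed-down Atom document, assumed to resemble the real feed, and `parse_entries` is a hypothetical helper. An unqualified lookup finds nothing, while a namespace-aware lookup finds the entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down Atom feed; the real feed is assumed to look similar.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample alert</title>
  <entry>
    <title type="html">First story</title>
    <link href="https://example.com/first"/>
    <published>2021-01-01T00:00:00Z</published>
    <content type="html">Summary of the first story</content>
  </entry>
</feed>"""

# Prefix-to-URI mapping used for namespace-qualified lookups.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_entries(xml_text):
    """Return one dict per <entry>, using namespace-qualified paths."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entries.append({
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            # The URL lives in the href attribute, not in the element text.
            "link": entry.find("atom:link", ATOM_NS).get("href"),
            "published": entry.findtext("atom:published", namespaces=ATOM_NS),
            "content": entry.findtext("atom:content", namespaces=ATOM_NS),
        })
    return entries

root = ET.fromstring(SAMPLE_FEED)
unqualified = root.findall("entry")   # ignores the namespace, so no match
entries = parse_entries(SAMPLE_FEED)
print(len(unqualified), len(entries))
```

Calling remove_namespaces() in the spider above takes the other route: it strips the namespace so that plain paths such as //feed/entry match.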
