How to parse data from google alerts using scrapy in python?

Time:01-20

I have created a Google Alert that generates an RSS feed, which looks like this: https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178

Now how do I use scrapy to extract the title, href, published date and content from each entry in the feed?

I have tried:

import scrapy


class GalertCovidSpider(scrapy.Spider):
    name = 'galert-covid'
    allowed_domains = ['https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178']
    start_urls = ['https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178/']

    def start_requests(self):
        urls = [
            'https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for post in response.xpath('//feed/entry'):
            yield {
                'title' : post.xpath('title//text()').extract_first(),
                'link': post.xpath('link//text()').extract_first(),
            }

But when I run it using scrapy crawl --nolog --output -:json galert-covid, it produces no output and no error.

After scraping the information, how do I store it in a DataFrame or a CSV file?

CodePudding user response:

I'm sure Scrapy can do this, but you don't have to use it. This should get the job done:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178'
resp = requests.get(url)
resp.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
# html.parser is lenient enough to walk the feed's XML and needs no extra install
soup = BeautifulSoup(resp.text, 'html.parser')

output = []
# each <entry> element in the feed is one alert result
for entry in soup.find_all('entry'):
    item = {
        'title': entry.find('title', {'type': 'html'}).text,
        'pubdate': entry.find('published').text,
        'content': entry.find('content').text,
        'link': entry.find('link')['href'],
    }
    output.append(item)

# one row per entry, written without the index column
df = pd.DataFrame(output)
df.to_csv('google_alert.csv', index=False)
print('Saved to google_alert.csv')
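
To check the extraction logic without hitting the network, the same four fields can be pulled from a hand-written feed. This is only a sketch: the sample XML below is invented (a real Google Alerts feed carries more fields), the helper names parse_entries and rows_to_csv are hypothetical, and the standard library's xml.etree.ElementTree and csv modules are swapped in for BeautifulSoup and pandas.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample data, shaped like an Atom feed with two entries.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample alert</title>
  <entry>
    <title type="html">First headline</title>
    <link href="https://example.com/first"/>
    <published>2021-01-20T10:00:00Z</published>
    <content type="html">First summary</content>
  </entry>
  <entry>
    <title type="html">Second headline</title>
    <link href="https://example.com/second"/>
    <published>2021-01-20T11:30:00Z</published>
    <content type="html">Second summary</content>
  </entry>
</feed>"""

# Atom elements live in this namespace, so lookups must be qualified.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}
FIELDS = ["title", "pubdate", "content", "link"]


def parse_entries(xml_text):
    """Return one dict per <entry> with title, pubdate, content and link."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", ATOM):
        rows.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "pubdate": entry.findtext("atom:published", default="", namespaces=ATOM),
            "content": entry.findtext("atom:content", default="", namespaces=ATOM),
            # the URL is stored in the href attribute, not in the element text
            "link": entry.find("atom:link", ATOM).get("href"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling rows_to_csv(parse_entries(SAMPLE_FEED)) returns the CSV as a string; writing that string to a file gives the same kind of output as df.to_csv above.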

CodePudding user response:

import scrapy


class GalertCovidSpider(scrapy.Spider):
    name = 'galert-covid'
    allowed_domains = ['www.google.co.in']
    start_urls = ['https://www.google.co.in/alerts/feeds/17901041985790143983/2214023096042963178/']

    custom_settings = {
        'FEEDS': {
            # the key is the output file path, the value holds the export options
            'galert-covid': {'format': 'csv'}
        }
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url)

    def parse(self, response):
        # strip the feed's XML namespace so plain names like //feed/entry match
        response.selector.remove_namespaces()
        for post in response.xpath('//feed/entry'):
            yield {
                'title': post.xpath('.//title//text()').get(),
                'link': post.xpath('.//link/@href').get(),
                'published date': post.xpath('.//published/text()').get(),
                'content': post.xpath('.//content/text()').get(),
            }

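The feed is Atom XML with a default namespace, which is why //feed/entry matched nothing in your spider; calling remove_namespaces() first fixes that. Read the Scrapy documentation on removing namespaces and on feed exports.

The allowed_domains list should contain only the domain, not a full URL. I removed the explicit callback inside start_requests, since parse is the default (not necessary, I just prefer it like that). Inside the yield I added a leading dot so that the XPath expressions are relative to each entry, and I read the link from the href attribute, because the link element has no text.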
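
The namespace problem can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, on an invented one-entry feed, to show why an unqualified name finds nothing and why the link has to come from the href attribute. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Invented one-entry feed; note the default namespace on <feed>.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="html">Sample headline</title>
    <link href="https://example.com/story"/>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

# Unqualified name: every element is in the Atom namespace, so this is empty.
plain_matches = root.findall("entry")

# Namespace-qualified name: this finds the entry.
ns = {"atom": "http://www.w3.org/2005/Atom"}
qualified_matches = root.findall("atom:entry", ns)

# The <link> element is empty; the URL is carried by its href attribute.
link = qualified_matches[0].find("atom:link", ns)
link_text = link.text
link_href = link.get("href")

print(len(plain_matches), len(qualified_matches), link_text, link_href)
```

In the spider, remove_namespaces() takes the other route: it drops the namespace from the parsed document, so the unqualified names start matching.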
