Show NER Spacy Data in dataframe

Time:01-29

I am scraping the text of an HTML page and using spaCy's named-entity recognition (NER) to identify information such as assets under management, addresses, and company founding dates. Once the information is extracted, I would like to place it in a dataframe.

I am working with the following script:

from bs4 import BeautifulSoup
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
import pandas as pd
import spacy
from spacy import displacy
import en_core_web_sm
import requests
import re

NER = spacy.load("en_core_web_sm")

url = "https://www.baincapital.com/"


driver = webdriver.Chrome("C:/Program Files/chromedriver.exe")
driver.get(url)  
sleep(randint(5,15))
soup = BeautifulSoup(driver.page_source, 'html.parser')
body = soup.body.text
body
body = body.replace('\n', ' ')
body = body.replace('\t', ' ')
body = body.replace('\r', ' ')
body = body.replace('\xa0', ' ')
text3 = NER(body)
displacy.render(text3, style="ent", jupyter=True)

The output is shown as:

Spacy Extraction

And I would like to place it in the following rudimentary table:

Entity     Identified
Money      $155 Billion
Date       1984
Org        Bain Capital
Org        Bain Capital Investor Portal Please
Cardinal   four
Cardinal   24
GPE        US

Essentially, I want to take the highlighted entities and place them in a dataframe alongside their entity labels.

Answer:

Once you have the body as plain text, you can parse it into a spaCy document, collect the label and text of every entity in doc.ents, and build a pandas dataframe from that list:

# ... your code here ...
body = soup.body.text

# now, this is the modification:
body = ' '.join(body.split())
doc = NER(body)
entities = [(e.label_, e.text) for e in doc.ents]
df = pd.DataFrame(entities, columns=['Entity', 'Identified'])
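
To see the dataframe step in isolation, here is a minimal sketch that skips the scraping and the spaCy model. The `sample_entities` list is a hypothetical stand-in for `[(e.label_, e.text) for e in doc.ents]`, with values borrowed from the table in the question rather than from a real model run. The de-duplication and filtering steps are optional extras, not part of the answer above.

```python
import pandas as pd

# Hypothetical stand-in for [(e.label_, e.text) for e in doc.ents];
# values are copied from the question's table, not produced by spaCy here.
sample_entities = [
    ("MONEY", "$155 Billion"),
    ("DATE", "1984"),
    ("ORG", "Bain Capital"),
    ("ORG", "Bain Capital"),
    ("GPE", "US"),
]

# Same construction as in the answer: one row per (label, text) pair.
df = pd.DataFrame(sample_entities, columns=["Entity", "Identified"])

# Optional: the same entity often appears several times on a page,
# so drop exact duplicate rows and renumber the index.
unique_df = df.drop_duplicates().reset_index(drop=True)

# Optional: keep only the entity labels of interest.
wanted = unique_df[unique_df["Entity"].isin(["MONEY", "DATE", "GPE"])]

print(unique_df)
```

With real data, the only change would be to replace `sample_entities` with the list built from `doc.ents`.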

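Note that the body = ' '.join(body.split()) line normalizes all whitespace more simply than the chained replace() calls in your script. Called with no arguments, split() breaks the string on any run of whitespace, including newlines, tabs, and non-breaking spaces, so joining the pieces with a single space also collapses repeated spaces and strips leading and trailing whitespace.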
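
The difference between the two whitespace approaches can be checked with a small standalone sketch. The `raw` string below is a made-up example, not text from the scraped page.

```python
# Made-up input containing a non-breaking space, newlines, a tab and repeated spaces.
raw = "Bain\xa0Capital\n\n\tfounded \r\n in   1984 "

# Approach from the question: each replace() swaps one character for one space,
# so a run of whitespace characters becomes a run of spaces.
replaced = raw
for ch in ("\n", "\t", "\r", "\xa0"):
    replaced = replaced.replace(ch, " ")

# Approach from the answer: split() with no arguments splits on any run of
# whitespace and ignores leading/trailing whitespace.
normalized = " ".join(raw.split())

print(repr(replaced))
print(repr(normalized))  # 'Bain Capital founded in 1984'
```

The replace-based version still contains runs of spaces and a trailing space, while the split/join version leaves exactly one space between words.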
