I scraped a table from SEC filings and extracted a specific row as a pandas Series.
The tables are not formatted consistently, which makes scraping hard because unwanted information gets extracted as well.
Take for example the following series I scraped from a table:
import pandas as pd
series = {'A': "3,360,003|", 'B': "(17) |", 'C': "16.8|"}
series = pd.Series(data=series, index=['A', 'B', 'C'])
The only entry that is relevant to me is the one that contains commas. Is there a way to remove all entries of the series that don't contain commas?
There may be cases where there is more than one entry with commas, e.g.
series = {'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"}
series = pd.Series(data=series, index=['A', 'B', 'C'])
In this case, the first entry that contains commas should be kept and all others should be removed.
Any help is much appreciated.
CodePudding user response:
Use .str.contains() to build a boolean mask and index the series with it. This keeps every entry that contains a comma:
s = series[series.str.contains(',', na=False)]
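To also cover the case where only the first matching entry should be kept, the mask can be combined with .head(1). The following is a minimal sketch using the second example series from the question; the names mask, all_matches and first_match are only illustrative, and regex=False is an optional addition that makes the comma a literal substring match.

```python
import pandas as pd

series = pd.Series({'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"})

# True where the string contains a comma; na=False counts missing values as non-matches.
mask = series.str.contains(',', regex=False, na=False)

all_matches = series[mask]          # every entry with a comma (A and B)
first_match = series[mask].head(1)  # only the first such entry, still a Series (A)
```

Unlike .iloc[0], .head(1) returns a Series and simply yields an empty Series when nothing matches.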
CodePudding user response:
If you really want to work with Series methods, the approach would be:
series[series.str.contains(',')].iloc[0]
However, this checks every element just to keep one.
A much more efficient approach is to use filter and next, which stops at the first match (depending on the exact data, there might be edge cases where this isn't faster). It is more than 100 times faster on the provided example.
next(filter(lambda x: ',' in x, series))
Output: '3,360,003|'
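Note that next raises StopIteration when no element matches, and ',' in x fails on non-string values such as None. A hedged variant that handles both cases could look like the sketch below; first_with_comma is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

def first_with_comma(values, default=None):
    """Return the first string in `values` that contains a comma, else `default`."""
    # The generator stops at the first match; non-string values are skipped.
    return next((v for v in values if isinstance(v, str) and ',' in v), default)

series = pd.Series({'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"})
result = first_with_comma(series)

# No entry with a comma: the default is returned instead of raising StopIteration.
missing = first_with_comma(pd.Series(['16.8|', None]))
```

Here result is '3,360,003|' and missing is None.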
