Pandas series: Only keep the first entry that contains a given character (comma)

Time:01-16

I scraped a table of SEC filings and extracted a specific row as a pandas Series.

The tables are not very standardized in their formatting, which makes scraping quite hard because unwanted information is extracted as well.

Take for example the following series I scraped from a table:

import pandas as pd

series = {'A': "3,360,003|", 'B': "(17) |", 'C': "16.8|"}
series = pd.Series(data=series, index=['A', 'B', 'C'])

The only information relevant to me is the entry that contains commas. Is there a way to remove all other entries of the series that don't contain commas?

There may be cases where there is more than one entry with commas, e.g.

series = {'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"}
series = pd.Series(data=series, index=['A', 'B', 'C'])

In this case, the first entry that contains commas should be kept, while all others should be removed.

Help is much appreciated

CodePudding user response:

Use .str.contains() as a boolean indexer:

s = series[series.str.contains(',', na=False)]
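The line above keeps every entry that contains a comma. As a minimal sketch of narrowing that down to only the first match (using the second example series from the question; the names mask, all_matches and first_match are illustrative, and regex=False is an assumption that a literal comma is all that needs matching):

```python
import pandas as pd

series = pd.Series({'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"})

# Boolean mask: True where the entry contains a literal comma.
# na=False treats missing values as "no match" instead of propagating NaN.
mask = series.str.contains(',', regex=False, na=False)

# Every entry with a comma, in original order.
all_matches = series[mask]

# Only the first such entry, still a Series so the label is preserved.
first_match = all_matches.head(1)

print(first_match)
```

head(1) returns an empty Series when nothing matches, so this variant does not raise if no entry contains a comma.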

CodePudding user response:

If you really want to work with Series methods, the approach would be:

series[series.str.contains(',')].iloc[0]

However, this requires checking all elements, just to keep one.

A much more efficient approach (depending on the exact data, there may be edge cases where this isn't true) is to use filter and next to get the first element. This is more than 100 times faster on the provided example.

next(filter(lambda x: ',' in x, series))

Output: '3,360,003|'
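Note that next raises StopIteration when no entry contains a comma, and the `in` test fails on non-string values such as None. A hedged sketch of a more defensive variant follows; the helper name first_with_comma and its default parameter are hypothetical, not part of pandas:

```python
import pandas as pd

series = pd.Series({'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"})

def first_with_comma(values, default=None):
    """Return the first string in `values` that contains a comma, else `default`."""
    # The generator stops at the first match, so later entries are never checked.
    # isinstance() skips non-string values (None, NaN) instead of raising TypeError.
    return next((v for v in values if isinstance(v, str) and ',' in v), default)

first = first_with_comma(series)

# No entry with a comma: the default is returned instead of raising StopIteration.
missing = first_with_comma(pd.Series(["16.8|", None]))

print(first)
print(missing)
```

Passing a second argument to next is what supplies the fallback value; which default makes sense (None, an empty string, and so on) depends on how the rest of the scraping code handles a missing value.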
