why does re.findall only match the first and last five rows for a larger dataset?-CodePudding

I have an excel file with over 3700 entries in a single column. The entries contain a full name and number with only whitespace to separate the two. To extract the number I use the following:

import re
import pandas as pd

dataframe = pd.read_excel('example.xlsx')
index_number = re.findall(r'\d ', str(dataframe['entry']))

The output gives only 10 numbers - the first and last 5, but if the same code is run on an excel sheet with only 50 entries, the output includes all 50 numbers.

Any ideas what is going wrong?

CodePudding user response：

The problem with your code is related to way of how pandas convert index to string. If it's of moderate length, then full data in printed in column. When it's longer, only first and last 5 items are printed with ellipsis in place of others.

To extract numbers, you can apply extracting function to each cell independently (it's much more efficient than coercing whole column to string). Also note that we can compile regex in advance to speed things up even more. With help of astype you'll have integers instead of their string representation.

import re
import pandas as pd

dataframe = pd.read_excel('example.xlsx')

regex = re.compile(r'(\d )')
dataframe['number'] = (dataframe['entry'].apply(lambda el: regex.search(el).group(1))
                                         .astype(int))