Is there a proper time to clean data when importing into a Pandas dataframe?

I am scraping the CIA World Factbook for country data as a learning exercise. I scrape the data, clean it up during import, and later convert it to a Pandas dataframe.
I have two choices - clean the data as it is being read in, as I am doing now, or just read everything into the dataframe and clean it up after the fact. Here are two examples of what I am doing now:

raw data
info = "$2,000 note: data are in 2017 dollars (2020 est.)"
int(info[:info.find(' ')].replace(',', '').replace('$', ''))
result: 2000

raw data
info = "36.08 births/1,000 population (2021 est.)"
float(info[:info.find(' ')].replace(',', ''))
result: 36.08

I suspect that cleaning in the dataframe after downloading would be a better solution but the only way I can think to do that is using Regular Expressions - which at the moment I am not too well versed in. Would that be the "correct" way to do it, or does it even matter? If cleaning up the dataframe is the solution, what might these look like?

Thanks

CodePudding user response:

A few questions matter here, depending on your use case:

Do you want it to be highly reproducible or extensible?
Should it be highly performant?
Is readability more important than performance/extensibility?

I've found that in the vast majority of cases, performance doesn't matter that much. As long as you're not processing enormous amounts of data or working on low-performing infrastructure, it should run sufficiently fast. Again, this depends on your use case.

What I find far more annoying and time-consuming is overly complex functions whose workings you no longer understand later on, or deeply nested functionality. Those can take an enormous amount of time to fix once your data format changes or you need to alter some small part of the code.

I would therefore agree that the ideal workflow is to first download and store the raw data for reproducibility. Then write a single processing function that makes it DataFrame-ready. Whenever your raw data changes, you only have to rewrite this one function and assert that the processed data comes out in the same format as before. Moreover, if you ever decide that you don't want to use pandas anymore (because you want to use regular numpy arrays, for example), it is easier to remove pandas from your code than when it has been woven into your workflow from the very beginning.
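As a rough sketch of what such a processing function could look like, assuming the stored raw data is a dict of country names mapping to field/raw-string pairs (that layout, the names parse_leading_number and process, and the country "Examplia" are made up for illustration):

```python
import re

# First number in the string, allowing thousands separators and a decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_leading_number(raw):
    """Return the first number found in `raw` as a float, or None if there is none."""
    match = _NUMBER.search(raw)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def process(raw_records):
    """Turn {country: {field: raw string}} into a list of plain-Python rows."""
    rows = []
    for country, fields in raw_records.items():
        row = {"country": country}
        for field, text in fields.items():
            row[field] = parse_leading_number(text)
        rows.append(row)
    return rows


# Hypothetical raw data, using the two strings from the question.
raw_records = {
    "Examplia": {
        "gdp_per_capita": "$2,000 note: data are in 2017 dollars (2020 est.)",
        "birth_rate": "36.08 births/1,000 population (2021 est.)",
    }
}
rows = process(raw_records)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), but nothing in the function itself depends on pandas, so only process has to change if the raw format does.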

This would be my motivation to do the processing before reading into a DataFrame.
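
Since you asked what cleaning inside the dataframe might look like: one possible version, assuming the raw strings are already loaded into columns (the column names here are invented), uses the pandas string methods with a single regular expression:

```python
import pandas as pd

# Hypothetical frame holding the uncleaned strings from the question.
df = pd.DataFrame({
    "gdp_per_capita": ["$2,000 note: data are in 2017 dollars (2020 est.)"],
    "birth_rate": ["36.08 births/1,000 population (2021 est.)"],
})

# One capture group: the first number, with optional commas and decimals.
pattern = r"(-?\d[\d,]*(?:\.\d+)?)"

cleaned = df.apply(
    lambda col: pd.to_numeric(
        col.str.extract(pattern, expand=False)    # keep only the first number
           .str.replace(",", "", regex=False)     # drop thousands separators
    )
)
```

Either approach gives the same numbers for these two strings, so the choice comes down to the maintainability points above rather than correctness.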