I have some folders at the location log_files_path. Each of these folders contains CSVs with different names. My aim is to read all the CSVs from every folder under log_files_path and collate them into a single DataFrame. I wrote the following code:
all_files = pd.DataFrame()
for region in listdir(log_files_path):
    region_log_filepath = join(log_files_path, region)
    # files stores the file paths for this region
    files = [join(region_log_filepath, file) for file in listdir(region_log_filepath) if isfile(join(region_log_filepath, file))]
    # append the data from every file to a single DataFrame, all_files
    for file in files:
        all_files = all_files.append(pd.read_csv(file, encoding='utf-8')).reset_index(drop=True)
return all_files
This gives me an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 61033: invalid start byte
On opening the CSVs, found out that some columns have values like : 
and ƒÂ‚‚ÃÂÂÂ.
I want to ignore such characters altogether. How can I do it?
CodePudding user response:
You can pass encoding_errors='ignore', but I would advise trying a different encoding first.
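Byte 0xa0 is a non-breaking space in Latin-1 and cp1252, so the files may simply not be UTF-8. A minimal sketch of trying a few candidate encodings in order follows; the helper name read_csv_with_fallback, the candidate list, and the sample bytes are assumptions made for illustration, not something taken from the question.

```python
import io

import pandas as pd

# Hypothetical sample data: a CSV containing a non-breaking space,
# stored as cp1252 (so the raw bytes contain 0xa0).
raw = "name,city\nAnna,New\u00a0York\n".encode("cp1252")


def read_csv_with_fallback(source_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    for enc in encodings:
        try:
            # A fresh buffer is created for every attempt.
            return pd.read_csv(io.BytesIO(source_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


df, used = read_csv_with_fallback(raw)
print(used)
print(df)
```

Note that latin-1 can decode any byte sequence, so it only makes sense as the last candidate: it never fails, even when it is the wrong guess.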
CodePudding user response:
As of pandas 1.3, pandas.read_csv() accepts an encoding_errors argument. The possible values are the standard Python codec error handlers:
strict: Raise UnicodeError (or a subclass); this is the default.
ignore: Ignore the malformed data and continue without further notice.
replace: Replace with a suitable replacement marker; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in codecs on decoding, and ‘?’ on encoding.
xmlcharrefreplace: Replace with the appropriate XML character reference (only for encoding).
backslashreplace: Replace with backslashed escape sequences.
namereplace: Replace with \N{...} escape sequences (only for encoding).
surrogateescape: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.
surrogatepass: Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
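To see what these handlers do to the byte from the question, here is a small sketch using plain bytes.decode(), which takes the same handler names through its errors argument. The sample bytes are made up for illustration.

```python
# Hypothetical sample: the byte 0xa0 is not valid on its own in UTF-8.
raw = b"caf\xa0e"

try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")             # bad byte dropped
replaced = raw.decode("utf-8", errors="replace")           # bad byte becomes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")   # bad byte becomes the text \xa0
surrogate = raw.decode("utf-8", errors="surrogateescape")  # bad byte becomes U+DCA0

# surrogateescape is reversible: encoding with the same handler restores the byte.
roundtrip = surrogate.encode("utf-8", errors="surrogateescape")

print(strict_failed, repr(ignored), repr(replaced), repr(escaped), roundtrip == raw)
```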
For your situation, you probably need one of these:
pd.read_csv(file, encoding='utf-8', encoding_errors='replace')
# or
pd.read_csv(file, encoding='utf-8', encoding_errors='ignore')
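Putting this together with the loop from the question, the collation could look roughly like the sketch below. It assumes pandas 1.3 or later. The function name collate_csvs and the temporary folders and files are hypothetical and exist only to make the example self-contained. It collects the frames in a list and calls pd.concat once, because DataFrame.append was removed in pandas 2.0.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collate_csvs(log_files_path, encoding_errors="replace"):
    """Read every file in every sub-folder and stack them into one DataFrame."""
    frames = []
    for region in sorted(Path(log_files_path).iterdir()):
        if not region.is_dir():
            continue
        for file in sorted(region.iterdir()):
            if file.is_file():
                frames.append(
                    pd.read_csv(file, encoding="utf-8", encoding_errors=encoding_errors)
                )
    # One concat at the end avoids re-copying the data on every iteration.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Hypothetical demo data: two region folders, one file containing a stray 0xa0 byte.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("east", b"id,msg\n1,ok\n"), ("west", b"id,msg\n2,bad\xa0byte\n")]
    for region_name, payload in samples:
        folder = Path(tmp) / region_name
        folder.mkdir()
        (folder / "log.csv").write_bytes(payload)

    combined = collate_csvs(tmp)                          # bad byte becomes U+FFFD
    cleaned = collate_csvs(tmp, encoding_errors="ignore")  # bad byte is dropped

print(combined)
print(cleaned)
```

With 'ignore' the undecodable bytes vanish silently, so 'replace' may be the safer choice while you are still checking how many values are affected.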
