I have a working code, but I think my logic isn't on the right path (although it works). I just need some help with optimizing it. Essentially, to see if what I did was an acceptable way of doing what I am doing or if there's a better way. I am rooting for the latter, because I know what I did isn't the "right" way.
I have a pd column of strings with "year" in it and I am trying to extract it from it. The problem is that a few entries do not have a year listed. So something like this:
| Index | string_values |
|---|---|
| 0 | String A (1995) |
| 1 | String B (1995) |
| 2 | String C (1995) |
| 3 | String D has no year |
| 4 | String E has (something in braces) AND also the year (2003) |
re.search('\d{4}', df['string_values '][0]).group(0) works, but in a for loop, it throws this error (I guess when it hit the non-4-digit string): AttributeError: 'NoneType' object has no attribute 'group'. I think this because len(_temp) gives 15036 and it has the years listed. Just that it's throwing this error.
Here's the for loop:
_temp = []
for i in df['string_values']:
year = re.search("\d{4}", i)
if year.group():
_temp.append(year.group())
else:
_temp.append(None)
Then I also tried the Try-Except way to do it, and that works - len(<var>) gives 62423, which is also the total row in the df. And here's the code:
_without_year = []
_with_year = []
for i in df['string_values']:
year = re.search("\d{4}", i)
try:
if year.group():
# _with_year.append(year.group())
pass
except:
_without_year.append(i)
I just need to know if what I did is acceptable. It works, like I said. _without_year does display all the entries without the year.
The thing with the Try-Except block is that I am passing on the if condition catching the ith error.
CodePudding user response:
You can use extract to extract the year values directly:
df['string_values'].str.extract(r'(?<=\()(\d{4})(?=\))', expand=False)
Output:
0 1995
1 1995
2 1995
3 NaN
4 2003
Name: string_values, dtype: object
Note I've used forward and backward lookarounds to assert that the year occurs inside parentheses; if you don't want that but just to match a 4-digit string replace them with \b (word break) e.g.
df['string_values'].str.extract(r'\b(\d{4})\b', expand=False)
To convert the output to a list, you can use tolist:
df['string_values'].str.extract('(?<=\()(\d{4})(?=\))', expand=False).tolist()
Output:
['1995', '1995', '1995', nan, '2003']
To find the string values that don't contain a year, you can use contains to find matches and invert that to use as an index:
df[~df['string_values'].str.contains(r'(?<=\()\d{4}(?=\))')]
Output:
Index string_values
3 3 String D has no year
