Parsing chemical formulas using regex

I'm trying to do the following: given a single-column pandas.DataFrame (of chemical formulas) like

    formula
0   Hg0.7Cd0.3Te
1   CuBr
2   Lu
...

I would like to return a pandas.Series like

0         [(Hg, 0.7), (Cd, 0.3), (Te, 1)]
1                [(Cu, 1), (Br, 1)]
2                [(Lu, 1)]
...

So this is the desired output.

I've already tried something with a regex:

counts = pd.Series(formulae.values.flatten()).str.findall(r"([a-z]+)([0-9]+)", re.I)

but unfortunately my output is the following:

0         [(Hg, 0), (Cd, 0)]
1                         []
2                         []
3       [(Cu, 3), (SbSe, 4)]

so in some cases it does not recognize the individual elements in the chemical formula.

CodePudding user response：

There are a few things to be improved:

The number pattern does not allow floating point numbers yet. Here, you can use ([0-9]+(?:[.][0-9]+)?) instead.
The number might not be present at all, so that needs to be indicated by a trailing ?.
The elements all start with an uppercase letter, followed by zero or more (zero or one?) lower case letters. So the element name pattern would be [A-Z][a-z]*. That's important to distinguish different elements with no number in between, e.g. 'CuBr' (so ignore-case wouldn't work here).
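
As a rough illustration of the points above, the sketch below runs the pattern from the question and a pattern with the three fixes side by side, using only the standard re module. The names original, improved and results are made up for this example, and 'Cu3SbSe4' is only a guess at the asker's row 3, inferred from the output shown in the question.

```python
import re

# Pattern from the question: case-insensitive letters followed by an integer.
original = re.compile(r"([a-z]+)([0-9]+)", re.I)

# One capital letter, then lowercase letters, then an optional
# integer or decimal amount.
improved = re.compile(r"([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?")

# 'Cu3SbSe4' is an assumed input (not shown in the question).
samples = ["Hg0.7Cd0.3Te", "CuBr", "Cu3SbSe4"]

# Map each formula to (matches of original, matches of improved).
results = {f: (original.findall(f), improved.findall(f)) for f in samples}

for formula, (before, after) in results.items():
    print(formula)
    print("  original:", before)
    print("  improved:", after)
```

With ignore-case, 'SbSe' is swallowed as one name and 'CuBr' yields nothing because no digit follows; the case-sensitive pattern splits at each capital letter instead.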

Putting it all together:

from pprint import pprint
import re

formulae = ['Hg0.7Cd0.3Te', 'CuBr', 'Lu']

pattern = re.compile(r'([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?')

pprint([pattern.findall(f) for f in formulae])

This prints the following:

[[('Hg', '0.7'), ('Cd', '0.3'), ('Te', '')],
 [('Cu', ''), ('Br', '')],
 [('Lu', '')]]

As you can see, missing numbers are denoted by empty strings which you need to postprocess manually. For example:

result = [pattern.findall(f) for f in formulae]
result = [[(e, float(n or 1)) for e, n in f] for f in result]
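
To get closer to the Series asked for in the question, the matching and the postprocessing could be wrapped in one function. This is a minimal sketch, not a definitive implementation: PATTERN, parse_formula and parsed are names chosen for this example, and a missing amount is assumed to mean 1.

```python
import re

# Element symbol followed by an optional integer or decimal amount.
PATTERN = re.compile(r"([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?")


def parse_formula(formula):
    """Return [(element, amount), ...]; a missing amount counts as 1."""
    return [
        (element, float(amount or 1))
        for element, amount in PATTERN.findall(formula)
    ]


# Parse the three sample formulas from the question.
parsed = [parse_formula(f) for f in ["Hg0.7Cd0.3Te", "CuBr", "Lu"]]
print(parsed)
```

Assuming the DataFrame from the question is called formulae and its column is named formula, something like formulae['formula'].apply(parse_formula) should then return a Series of such lists.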

CodePudding user response：

I would use several replace calls to introduce separators, split on those separators, explode the result, and then filter out the empty rows. Code below:

import pandas as pd

df1 = pd.DataFrame({'formula': ['Hg0.7Cd0.3Te', 'CuBr', 'Lu']})
repl2 = lambda g: f'{g.group(1)}<'  # append '<' to the matched text
repl3 = lambda g: f'{g.group(1)}>'  # append '>' to the matched text
df1 = (df1.assign(formula1=df1['formula']
                  # Add '>' after a word character that follows a capital letter
                  .str.replace(r'((?<=[A-Z])\w)', repl3, regex=True)
                  # Add '<' after a digit that is followed by a capital letter
                  .str.replace(r'(\d(?=[A-Z]))', repl2, regex=True))
       # '>' before a 0 becomes ','; any remaining '>' becomes ',1 '
       .replace(regex={r'\>(?=0)': ',', r'\>': ',1 '}))

# Split on '<' or whitespace, then explode into one row per piece
df1 = df1.assign(formula1=df1['formula1'].str.split(r'\<|\s')).explode('formula1')

# Keep only the rows that actually contain an element
new = df1[df1['formula1'].str.contains(r'\w')]

This gives:

        formula formula1
0  Hg0.7Cd0.3Te   Hg,0.7
0  Hg0.7Cd0.3Te   Cd,0.3
0  Hg0.7Cd0.3Te     Te,1
1          CuBr     Cu,1
1          CuBr     Br,1
2            Lu     Lu,1
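
To see what each replacement above does to a single formula, the same steps can be traced on plain strings with re.sub. This is only a sketch that mirrors the answer's patterns; split_with_separators is a hypothetical helper name, not part of pandas. Like the pandas version, it relies on every symbol having two letters and every explicit amount starting with 0, which holds for the three sample formulas.

```python
import re


def split_with_separators(formula):
    """Mirror the replace/split steps on one formula string."""
    # Step 1: append '>' to a word character that directly follows a capital.
    s = re.sub(r'((?<=[A-Z])\w)', lambda g: g.group(1) + '>', formula)
    # Step 2: append '<' to a digit that is directly followed by a capital.
    s = re.sub(r'(\d(?=[A-Z]))', lambda g: g.group(1) + '<', s)
    # Step 3: '>' before a 0 means an amount follows, so it becomes ','.
    s = re.sub(r'>(?=0)', ',', s)
    # Any remaining '>' had no amount after it, so it becomes ',1 '.
    s = s.replace('>', ',1 ')
    # Step 4: split on '<' or whitespace and drop the empty pieces.
    return [piece for piece in re.split(r'<|\s', s) if piece]


for formula in ['Hg0.7Cd0.3Te', 'CuBr', 'Lu']:
    print(formula, split_with_separators(formula))
```

Each returned piece has the form 'symbol,amount', so piece.split(',') would recover the two parts if tuples are needed.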