Apply function to each DataFrame row, without returning a Series

Time:01-21

I need to "apply" a function to a DataFrame row by row, taking two particular cells of the current row as input to perform an operation. The function is the following:

values = []  # global list that the function fills as a side effect

def function(x, y):
    z = 2*x*y
    values.append(z)
    return z

The catch is that I don't actually need the result of applying the function. I only need the input values to perform some operations and fill the global list called values. Suppose the pd.DataFrame is the following:

| col1 | col2 | col3 |
| 2    | 3    | 5    |
| 10   | 12   | 14   |
| ...  | ...  | ...  |

I would usually apply the function like this:

df.apply(lambda x: function(x['col2'], x['col3']), axis=1)

The problem with apply is that the last line of code creates a pd.Series. I would then hold in memory not only the global list values, which I need for other purposes, but also this Series, which I don't need at all. (The list is just an example; it stands in for some other data structure that could be built from the function.)

How can I apply the function without occupying additional memory?

CodePudding user response:

This operation can be vectorized directly over whole columns, so you can avoid .apply() entirely, which will be tremendously faster. See the canonical answer for "How to iterate over rows in a DataFrame in Pandas?"

You won't be able to avoid using memory for the results, because they need to go somewhere, but you could drop columns you no longer need before or after performing the calculation.

Just keeping the results in a DataFrame column (a Series) rather than a list of native ints will save memory. You may also find that explicitly setting or reducing the dtypes of your DataFrame is a big saving if they're not already in their most efficient types, for example going from int64 to uint16 or even uint8 (which will still hold the example values).
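As a rough illustration of the dtype point, here is a minimal sketch that uses the question's example values. The names `before`, `small`, `after` and `result` are made up for this example. Note one assumption worth stating: the inputs fit in uint8, but the product 2*12*14 = 336 does not, so the sketch widens one operand to uint16 before multiplying.

```python
import pandas as pd

# Example frame from the question, stored explicitly as int64.
df = pd.DataFrame(
    {"col1": [2, 10], "col2": [3, 12], "col3": [5, 14]},
    dtype="int64",
)

# Bytes used by the column data alone (index excluded).
before = df.memory_usage(index=False).sum()

# Every example value fits in an unsigned 8-bit integer.
small = df.astype("uint8")
after = small.memory_usage(index=False).sum()

# Widen one operand first: the product 336 would wrap around in uint8.
result = 2 * small["col2"].astype("uint16") * small["col3"]

print(before, after)
print(result.tolist())
```

The saving is 8 bytes versus 1 byte per value here; on a two-row frame that is negligible, but it scales linearly with the number of rows.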

>>> df = pd.DataFrame({"col1": [2,10], "col2": [3,12], "col3": [5,4]})
>>> df
   col1  col2  col3
0     2     3     5
1    10    12     4
>>> df["2xy"] = 2 * df["col2"] * df["col3"]
>>> df
   col1  col2  col3  2xy
0     2     3     5   30
1    10    12     4   96
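If the goal is still a plain Python list named values, as in the question, the vectorized result can be converted once instead of being stored as a column. This is only a sketch under that assumption; it reuses the example frame above, and the intermediate Series can be garbage-collected after the conversion because nothing keeps a reference to it.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 10], "col2": [3, 12], "col3": [5, 4]})

values = []

# Compute 2*x*y for all rows at once, then copy the results into the list.
# The temporary Series is not bound to any name, so it is freed afterwards.
values.extend((2 * df["col2"] * df["col3"]).tolist())

print(values)
```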

CodePudding user response:

This seems too simple so I may be missing something, but couldn't you do this in... a loop?

values = []

def function(x, y):
    z = 2*x*y
    return z

for i, row in df.iterrows():
    values.append(function(row['col2'], row['col3']))

This would solve the literal problem you raised: no second object is created alongside values to store the results.
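One caveat with the loop above: iterrows() builds a temporary Series for every row. A possible variation, sketched here with the question's example values, is to iterate over just the two needed columns with zip, which hands the function plain scalars. Whether this matters in practice depends on the size of the frame.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 10], "col2": [3, 12], "col3": [5, 14]})

values = []

def function(x, y):
    return 2 * x * y

# Walk the two columns in lockstep; no per-row Series is constructed.
for x, y in zip(df["col2"], df["col3"]):
    values.append(function(x, y))

print(values)
```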
