QST: is the new behavior of df.apply(my_func, axis=1) in v1.1.0 intended?

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

import pandas as pd
def test_func(row):
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

Run on pandas 1.1.0, the code above returns:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

While in pandas 1.0.5 it returns:

   a   b    c  d
0  1   i   1i  2
1  2   j   2j  3
2  3   k   3k  4

Using Python 3.8.3 and IPython 7.16.1.

The Question:

What is the right way of getting the v1.0.5 behavior in v1.1.0?

I did see this release note, but I honestly can't figure out whether this is an intended or unintended side effect of it: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.1.0.html#apply-and-applymap-on-dataframe-evaluates-first-row-column-only-once

thanks

1 possible answer(s) on “QST: is the new behavior of df.apply(my_func, axis=1) in v1.1.0 intended?”

  1. As a general rule, you should not mutate a container while iterating over it. Here, apply hands each row to test_func, so copy the row before writing to it:

    def test_func(row):
        row = row.copy()
        row['c'] = str(row['a']) + str(row['b'])
        row['d'] = row['a'] + 1
        return row
    

    gives

       a  b   c  d
    0  1  i  1i  2
    1  2  j  2j  3
    2  3  k  3k  4
    

    Of course, the vectorized version of this will be much faster:

    %%timeit
    
    df['c'] = df['a'].astype(str) + df['b']
    df['d'] = df['a'] + 1
    

    gives 564 µs ± 5.97 µs per loop, whereas your version takes 5.34 ms ± 16.9 µs per loop.
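
For what it's worth, here is a minimal sketch of two ways to avoid writing into the row (or into df) at all, assuming the same example frame as in the question. The names new_columns, via_apply and via_assign are made up for illustration and do not come from the original post. The first variant has the applied function return a fresh Series holding only the new values; the second uses DataFrame.assign, which returns a new frame and leaves df as it was.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})


def new_columns(row):
    # Hypothetical helper: build a fresh Series instead of
    # assigning into the row object that apply passes in.
    return pd.Series({'c': str(row['a']) + str(row['b']),
                      'd': row['a'] + 1})


# Row-wise variant: compute only the new columns, then join them on the index.
via_apply = df.join(df.apply(new_columns, axis=1))

# Vectorized variant: assign returns a new DataFrame, so df keeps
# only its original columns 'a' and 'b'.
via_assign = df.assign(c=df['a'].astype(str) + df['b'],
                       d=df['a'] + 1)

print(via_apply)
print(via_assign)
```

Neither variant depends on whether apply reuses the row object between calls, because nothing is written back into it. The assign form is also the vectorized one, so it should be the faster of the two.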