weird NaN in mean() of float16 series

I have a shuffled series with a bunch of sine values in float16, like this:

   tdata.time_sin
   110405276   -0.183105
   175560878   -0.301270
   ...
   130331292   -0.158813
   6782127     -0.282471
   Name: time_sin, Length: 18490389, dtype: float16

There are no NaN values; every entry is the sine of something:

tdata.time_sin[np.isnan(tdata.time_sin) == True].count()
0

But for some reason, mean() chokes somewhere in the middle like it’s overflowing:


tdata.time_sin.mean()
nan

tdata.time_sin[:328720].mean()
0.0

tdata.time_sin[:328721].mean()
nan

tdata.time_sin[328719:328722]
117467643   -0.639648
85318746     0.956055
10829780     0.112000
Name: time_sin, dtype: float16

And it works fine when converted to float32:

foo = tdata.time_sin.astype(np.float32)
foo.mean()

0.20143597

Is this weird or am I missing something about float16?
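
For reference, the float16 limits can be checked directly in NumPy. This is only a sketch of the arithmetic, not pandas code: it shows that float16 tops out at 65504, that anything larger (including an element count such as 328720) becomes inf, and that dividing by inf gives 0.0 while inf divided by inf gives nan. Whether mean() in pandas 0.22 really keeps its sum and count in float16 is an assumption here, although it would match the 0.0 and nan results above.

```python
import numpy as np

# Largest finite float16 value; anything that rounds above it becomes inf.
f16_max = np.finfo(np.float16).max
print(float(f16_max))  # 65504.0

# Casting out-of-range numbers to float16 gives inf.
with np.errstate(over="ignore"):
    too_big = np.float32(70000.0).astype(np.float16)
    count_f16 = np.float32(328720).astype(np.float16)  # the slice length above
print(too_big, count_f16)  # inf inf

# A finite value divided by inf is 0.0, and inf divided by inf is nan.
with np.errstate(invalid="ignore"):
    finite_over_inf = np.float16(100.0) / np.float16(np.inf)
    inf_over_inf = np.float16(np.inf) / np.float16(np.inf)
print(finite_over_inf, inf_over_inf)  # 0.0 nan
```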

This behavior persists after pickling, reloading, and sorting by index, although the NaN now shows up much earlier:

import pickle
with open('timesin.pkl', 'rb') as f:
    zzz = pickle.load(f)
bb = zzz.sort_index()

bb[:74351].mean()
-0.0

bb[:74352].mean()
nan

bb[74350:74355]
749371   -0.898438
749393   -0.898438
749432   -0.898438
749447   -0.898438
749479   -0.898438
Name: time_sin, dtype: float16

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-119-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Author: Fantashit

1 thought on “weird NaN in mean() of float16 series”

  1. You have an overflow: the sum of the values no longer fits in float16. Take the mean over a ratio, (df[col] / n).mean() * n, where n is large enough.

    To find out how large n needs to be, compute the sum of the column after casting it to float32 and compare it with the largest finite float16 value (65504).
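
    A rough sketch of that check in plain NumPy, using a synthetic array as a stand-in for the real column (the names values, total, and n_min are made up for illustration). It also shows an alternative that is not part of the suggestion above: keeping the data in float16 but asking NumPy for a wider accumulator.

```python
import numpy as np

# Synthetic stand-in for the column: float16 values with a mean near 0.2.
rng = np.random.default_rng(0)
values = rng.uniform(-0.6, 1.0, size=500_000).astype(np.float16)

# Sum with a float32 accumulator and compare against the float16 limit.
total = float(values.sum(dtype=np.float32))
f16_max = float(np.finfo(np.float16).max)
overflows = abs(total) > f16_max
print(overflows)

# Smallest integer n for which total / n would fit in float16 again.
n_min = int(np.ceil(abs(total) / f16_max))
print(n_min)

# Alternative: widen only the arithmetic, not the stored data.
safe_mean = values.mean(dtype=np.float64)
print(safe_mean)
```

    With a pandas Series, the equivalent of the last step is the astype(np.float32) cast already shown in the question.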
