ValueError in SimpleImputer for string input with strategy='most_frequent'/'constant'

Describe the bug

  • Passing a list of strings to SimpleImputer raises a ValueError with strategy='most_frequent' (and also with 'constant').

Steps/Code to Reproduce

import numpy as np
from sklearn.impute import SimpleImputer


X = [['a', 'b', 'c'], ['d', 'e', np.nan]]

imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
transformed_mf = imp_mf.fit_transform(X)
print(transformed_mf)

Expected Results

[['a' 'b' 'c']
 ['d' 'e' 'c']]

Actual Results

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-6cee172813bf> in <module>()
      6 
      7 imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
----> 8 transformed_mf = imp_mf.fit_transform(X)
      9 print(transformed_mf)

2 frames
/content/scikit-learn/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    258                              "categorical data represented either as an array "
    259                              "with integer dtype or an array of string values "
--> 260                              "with an object dtype.".format(X.dtype))
    261 
    262         return X

ValueError: SimpleImputer does not support data with dtype <U3. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype..
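
The dtype named in the error message suggests where the problem starts. The following is a minimal sketch using NumPy only, on the assumption that input validation converts the nested list with NumPy's default rules (the traceback does not show that step). With default conversion the list becomes a fixed-width unicode array and np.nan is turned into the string 'nan'. An explicit object dtype keeps it as a float NaN.

```python
import numpy as np

X = [['a', 'b', 'c'], ['d', 'e', np.nan]]

# Default conversion: NumPy picks a unicode string dtype (kind 'U'),
# and the missing value is stringified along with everything else.
as_str = np.asarray(X)
print(as_str.dtype.kind)      # U
print(as_str[1, 2] == 'nan')  # True: the NaN is now ordinary text

# Explicit object dtype: the elements stay as Python objects,
# so np.nan is still a float and can be detected as missing.
as_obj = np.asarray(X, dtype=object)
print(as_obj.dtype.kind)                # O
print(isinstance(as_obj[1, 2], float))  # True
```

The exact width of the string dtype (for example <U3) depends on the NumPy version, so the sketch checks only the dtype kind.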

Versions

System:
    python: 3.6.9 (default, Apr 18 2020, 01:56:04)  [GCC 8.4.0]
executable: /usr/bin/python3
   machine: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 47.1.1
      sklearn: 0.23.1
        numpy: 1.18.4
        scipy: 1.4.1
       Cython: 0.29.19
       pandas: 1.0.4
   matplotlib: 3.2.1
       joblib: 0.15.1
threadpoolctl: 2.1.0

Built with OpenMP: True

Author: Fantashit

1 thought on “ValueError in SimpleImputer for string input with strategy='most_frequent'/'constant'”

  1. I can reproduce the error. Since the documentation says that the fit and transform methods accept array-like input (a list or an ndarray), I would assume that this is a bug.
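
A possible workaround, following the hint in the error message itself, is to convert the list to an array with object dtype before calling the imputer. This is a sketch of a workaround and not a fix for the reported behavior. The variable name X_obj is introduced here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = [['a', 'b', 'c'], ['d', 'e', np.nan]]

# Convert explicitly to object dtype so the strings stay Python objects
# and np.nan remains a float that the imputer can recognize as missing.
X_obj = np.array(X, dtype=object)

imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
transformed_mf = imp_mf.fit_transform(X_obj)
print(transformed_mf)
```

With this input the missing entry in the last column should be filled with 'c', the only observed value in that column, which matches the expected results in the report.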
