BUG: read_excel with openpyxl results in empty data frame

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Old behavior with xlrd 1.2.0 (last version with XLSX support)

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='xlrd')
<stdin>:1: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
>>> df.shape
(843, 11)

File is read perfectly and dataframe is ok, with all 11 columns.

Problem description

When using openpyxl as engine for read_excel:

>>> import pandas as pd
>>> df=pd.read_excel('infile.xlsx', engine='openpyxl')
>>> print(df)
Empty DataFrame
Columns: [Column1, Column2, Column3]
Index: []
>>> df.shape
(0, 3)

Only the 3 first headers are read, and it stops.

However, if the file is opened directly with openpyxl, it works fine, I’m using it to open and save the file with a temporary name to make openpyxl work as engine on Pandas:

>>> wb = load_workbook('infile.xlsx')
>>> wb.save(filename = 'temp.xlsx')
>>> pd.read_excel('temp.xlsx', engine='openpyxl')
>>> df.shape
(843, 11)

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.128-microsoft-standard
Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.3.1
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : None

2 thoughts on “BUG: read_excel with openpyxl results in empty data frame

  1. Turning read-only mode off has other implications though, in particular higher memory consumption.

    According to openpyxl’s docs, the recommended solution is to call sheet.reset_dimensions(). I. e. at the beginning of OpenpyxlReader.get_sheet_data():

        if sheet.calculate_dimension() == "A1:A1":
                sheet.reset_dimensions()

    I monkeypatched this for our application and it seems to work fine.

  2. All benchmarks below are on an excel file created via pd.DataFrame(np.random.rand(10000, 10)).to_excel("test.xlsx"), looped for 20 iterations. The code for each iteration is pd.read_excel('test.xlsx', engine='openpyxl'), where I am modifying the pandas internals to make sure there aren’t any gotchas.

    Using read_only=False takes 24.532 seconds as opposed to 19.897 with read_only=True. Calling reset_dimension is just the line:

    self._max_row = self._max_column = None

    However, by resetting the dimension alone openpyxl will no longer pad the rows when reading, and this results in the behavior that is the cause of #38956. To get correct results, one must also call calculate_dimension which unfortunately iterates in python space. It then seems to me that the approach taken in #39486, along with calling reset_dimension is the best one. This takes 20.183 seconds.