#73480 · Rayne · asked 2 years ago
Opening arrow files with vaex is slower and uses more memory than expected
I have multiple `.arrow` files, each about 1 GB (the total file size is larger than my RAM). I tried to open all of them using `vaex.open_many()` to read them into a single dataframe, and saw that the memory usage increased by several GB and that it took longer than I expected.
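A minimal sketch of what I'm doing, with hypothetical file names standing in for the real ones:

```python
import vaex

# Hypothetical file names; in reality there are many ~1 GB .arrow files.
files = ["file1.arrow", "file2.arrow", "file3.arrow"]

# Open all files and concatenate them into a single dataframe.
df = vaex.open_many(files)
print(len(df))  # row count of the combined dataframe
```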
These arrow files were generated by first making Elasticsearch queries and storing the results in a pandas dataframe (`df_pd`). Then I did a `fillna()` and set the datatype of each column (I had gotten error messages when converting to arrow whenever a column had `NaN` values or mixed datatypes). I then converted the `df_pd` dataframes to arrow files using vaex:
```python
# Convert the pandas dataframe to a vaex dataframe, then export to Arrow.
vaex_df = vaex.from_pandas(df=df_pd)
vaex_df.export("file1.arrow")
```
This was repeated for different ES query time periods. It was only after I had created the arrow files that I tried to open them with vaex.
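For context, here is a sketch of the cleaning step described above, assuming hypothetical column names, fill values, and dtypes:

```python
import pandas as pd
import vaex

# df_pd would come from an Elasticsearch query; these columns are made up.
df_pd = pd.DataFrame({"count": [1.0, None, 3.0], "label": ["a", None, "c"]})

# Fill NaNs and pin one dtype per column, since NaN values and mixed
# datatypes caused errors during the arrow conversion.
df_pd = df_pd.fillna({"count": 0, "label": ""})
df_pd = df_pd.astype({"count": "int64", "label": "str"})

vaex_df = vaex.from_pandas(df=df_pd)
vaex_df.export("file1.arrow")
```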
I tried just opening one file using the code below.
```python
%%time
df = vaex.open("file1.arrow")
```
What I noticed is that it takes about 4-5 seconds to open the file, and that the free memory (as indicated by the `free` column returned by the command `free -h`) kept decreasing until it was ~1 GB lower.

I thought that when opening arrow files, vaex would use memory mapping and thus wouldn't actually use up this much memory, and that opening would also be faster. Is my understanding correct, or am I doing something wrong?
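To put numbers on this from inside Python, here is how I'd measure it, assuming `psutil` is available (watching `free -h` is what I actually did):

```python
import time

import psutil
import vaex

proc = psutil.Process()

def rss_mb() -> float:
    # Resident set size of the current process, in MB.
    return proc.memory_info().rss / 1024**2

before = rss_mb()
t0 = time.perf_counter()
df = vaex.open("file1.arrow")
elapsed = time.perf_counter() - t0

print(f"open took {elapsed:.2f}s, RSS grew by {rss_mb() - before:.0f} MB")
```

Note that RSS is not the same thing as the `free` column of `free -h`, which also drops when the kernel fills the page cache, but it gives a comparable per-process view.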
Tags: python, pandas, pyarrow, vaex