Asked by Rayne

Opening arrow files with vaex is slower and uses more memory than expected

I have multiple .arrow files, each about 1 GB (the total file size is larger than my RAM). I tried to open all of them with vaex.open_many() to read them into a single dataframe, and saw the memory usage increase by several GB; opening also took longer than I expected.
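
For reference, the read side looks roughly like this (the file names here are placeholders, not my real paths):

import vaex

# open_many takes a list of paths and returns one concatenated dataframe;
# the real list of files adds up to more than my RAM.
df = vaex.open_many(["file1.arrow", "file2.arrow", "file3.arrow"])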

These arrow files were generated by first making Elasticsearch queries and storing the results in a pandas dataframe (df_pd). I then called fillna() and set the datatype of each column (I had gotten errors converting to arrow when a column contained NaN values or mixed datatypes), and converted each df_pd dataframe to an arrow file using vaex:

import vaex

vaex_df = vaex.from_pandas(df=df_pd)  # convert the pandas dataframe to a vaex dataframe
vaex_df.export("file1.arrow")         # write it out as an .arrow file
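
For context, the cleanup step before the conversion looked roughly like this (the column names, fill values, and dtypes are placeholders, not my real schema):

import pandas as pd

# Placeholder stand-in for the dataframe built from the Elasticsearch results.
df_pd = pd.DataFrame({"field_a": ["x", None], "field_b": [1.0, float("nan")]})

# Replace NaNs and force a single dtype per column, since columns with
# NaNs or mixed types failed the arrow conversion for me.
df_pd = df_pd.fillna({"field_a": "", "field_b": 0})
df_pd = df_pd.astype({"field_a": str, "field_b": "int64"})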

This was repeated for different ES query time periods. It was only after I had created all the arrow files that I tried to open them with vaex.

I tried just opening one file using the code below.

%%time
import vaex

df = vaex.open("file1.arrow")

What I noticed is that it takes about 4-5 seconds to open the file, and that the free memory (as reported in the free column of free -h) kept decreasing until it was about 1 GB lower.
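
To rule out a measurement artifact (the free column shrinking can also just mean the page cache is filling up), I also checked the process's own resident memory with psutil; this is a rough sketch, and psutil is not part of my original setup:

import os
import psutil
import vaex

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss
df = vaex.open("file1.arrow")
rss_after = proc.memory_info().rss
# RSS only counts mapped pages that were actually touched, so this is
# just a rough signal of how much data was really pulled into memory.
print(f"RSS grew by {(rss_after - rss_before) / 1e9:.2f} GB")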

I thought that when opening the arrow files, vaex would use memory-mapping and thus wouldn't actually use up so much memory, and that opening would also be faster. Is my understanding correct, or am I doing something wrong?
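
My mental model, which may be wrong, is that an uncompressed Arrow IPC file can be opened essentially zero-copy through a memory map, e.g. reading the same file directly with pyarrow:

import pyarrow as pa

# Open the IPC file through a memory map; for an uncompressed file,
# read_all() should mostly reference the mapped pages instead of copying.
with pa.memory_map("file1.arrow", "r") as source:
    table = pa.ipc.open_file(source).read_all()
    print(table.num_rows)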

Tags: python, pandas, pyarrow, vaex
