2 years ago
#70777
Mark
TFRecords in a custom DataGenerator
I am training an LSTM model in which samples have different numbers of time steps. To optimise performance I do not want to use masking; instead, I want to write a generator that automatically groups each batch so that all of its samples have the same number of steps. My idea is the following:
- For each possible sequence length (ranging from 1 to 365), write a separate TFRecord file containing only the samples of that length
- In each generator loop, randomly choose a sequence length and take a batch of data from the corresponding TFRecord dataset. One option is to keep reading batches from that dataset until it is exhausted - preferable if it is costly to open and close a TFRecord dataset multiple times
- Otherwise, if it is cheap to open and close a TFRecord dataset and to read from the middle of it, I could randomly choose the sequence length for every single batch (which sounds more robust)
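To make the idea concrete, here is a rough sketch of what I have in mind (the file names, feature key, feature count, and batch size are all made up for illustration; the real data would of course come from my own pipeline):

```python
import os
import random
import tempfile

import numpy as np
import tensorflow as tf

# Toy data: sequences of varying length, 3 features per time step.
rng = np.random.default_rng(0)
sequences = [rng.standard_normal((length, 3)).astype(np.float32)
             for length in [5, 5, 5, 7, 7, 9]]

out_dir = tempfile.mkdtemp()

# 1. Write one TFRecord file per sequence length.
writers = {}
for seq in sequences:
    length = seq.shape[0]
    if length not in writers:
        path = os.path.join(out_dir, f"len_{length}.tfrecord")
        writers[length] = tf.io.TFRecordWriter(path)
    example = tf.train.Example(features=tf.train.Features(feature={
        "sequence": tf.train.Feature(
            float_list=tf.train.FloatList(value=seq.ravel())),
    }))
    writers[length].write(example.SerializeToString())
for w in writers.values():
    w.close()

# 2. One tf.data.Dataset per length; batching needs no padding because
#    every record in a given file has exactly the same shape.
def make_dataset(length, batch_size=2):
    def parse(record):
        parsed = tf.io.parse_single_example(record, {
            "sequence": tf.io.FixedLenFeature([length * 3], tf.float32)})
        return tf.reshape(parsed["sequence"], (length, 3))
    path = os.path.join(out_dir, f"len_{length}.tfrecord")
    return (tf.data.TFRecordDataset(path)
            .map(parse)
            .batch(batch_size))

# 3. Generator step: randomly pick a length, then pull one batch
#    from that length's (repeated) dataset.
iterators = {length: iter(make_dataset(length).repeat())
             for length in writers}
length = random.choice(list(iterators))
batch = next(iterators[length])
print(batch.shape)  # (batch_size_or_less, length, 3)
```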
I was able to implement this logic with .csv files (i.e. one csv per fixed sequence length) following this example: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly. However, I am wondering whether I could gain more performance by doing the same with TFRecords. I couldn't find any resources that show how to use them with this degree of flexibility.
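One direction I have been looking at, in case it helps frame the question: if each per-length dataset is batched *before* mixing, then something like tf.data.Dataset.sample_from_datasets might do the random per-batch length selection for me. A sketch, with in-memory zero tensors standing in for the real per-length TFRecord datasets (the bucket sizes and feature count are made up):

```python
import numpy as np
import tensorflow as tf

# Stand-in for a per-length TFRecord dataset: n_batches batches of
# shape (batch_size, length, features), already batched.
def fake_bucket(length, n_batches=4, batch_size=2, features=3):
    data = np.zeros((n_batches * batch_size, length, features), np.float32)
    return tf.data.Dataset.from_tensor_slices(data).batch(batch_size)

buckets = {length: fake_bucket(length) for length in (5, 7, 9)}

# sample_from_datasets picks one of the input datasets at random for
# every emitted element; because the inputs were batched before mixing,
# every batch that comes out has a single, fixed sequence length.
mixed = tf.data.Dataset.sample_from_datasets(
    list(buckets.values()), weights=[1 / 3] * 3, seed=42)

for batch in mixed.take(3):
    print(batch.shape)  # (2, L, 3) where L is one of 5, 7, 9
```

I am not sure whether this approach keeps all the underlying TFRecord files open at once, which is part of what my cost question above is about.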
Can anybody point me in the right direction here?
Thanks!
python
tensorflow
tensorflow-datasets
tfrecord
tf.data.dataset
0 Answers