#77524 · TeresaStat · asked 2 years ago
Efficient methods for determining the largest set of complete data in a large dataset
I have a large-ish dataset (say 10 million rows by 1,500 columns). Each row represents an individual and each column represents a question. I would like to find the largest set of non-missing data, i.e., n rows with k columns of complete data, subject to some criterion (n > N). Currently, I am doing something that feels a bit arbitrary: I start by ranking the columns by completeness and take the column (C1) with the largest number of completions (non-missing rows) as my starting point. I then filter out rows with missing data for C1, re-rank the remaining columns by completeness, choose the top column (C2) with the highest number of completions, and continue down this path until I reach a set size I am comfortable with (i.e., I stop before n drops below N).
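For concreteness, here is a rough pandas sketch of the greedy procedure I just described (the function name, signature, and exact stopping rule are only my own illustration, not an established method):

```python
import pandas as pd

def greedy_complete_subset(df: pd.DataFrame, N: int):
    """Greedy heuristic: repeatedly add the most-complete column among
    the rows still fully observed on every chosen column, stopping
    before the surviving row count would drop below N."""
    rows = df.index            # rows complete on all chosen columns so far
    chosen = []                # columns chosen so far, in selection order
    remaining = list(df.columns)
    while remaining:
        sub = df.loc[rows, remaining]
        counts = sub.notna().sum()         # non-missing count per candidate column
        best = counts.idxmax()             # column with the most complete rows
        surviving = sub.index[sub[best].notna()]
        if len(surviving) < N:             # adding `best` would push n below N
            break
        chosen.append(best)
        rows = surviving
        remaining.remove(best)
    return rows, chosen

# Hypothetical usage, with `df` as the 10M x 1,500 frame:
# rows, cols = greedy_complete_subset(df, N=1_000_000)
# complete_block = df.loc[rows, cols]      # n-by-k fully observed submatrix
```

(Each iteration rescans the surviving submatrix, so at this scale it might be cheaper to precompute a boolean `df.notna()` mask once and work on that in NumPy.)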
I would be very interested to know whether there are established methods for this, and/or any thoughts on efficient ways to do it. Thank you!
subset
missing-data