#77524 · TeresaStat · asked 2 years ago
Efficient methods for determining the largest set of complete data in a large dataset
I have a large-ish dataset (say 10 million rows by 1,500 columns). Each row represents an individual and each column represents a question. I would like to find the largest set of non-missing data, i.e., n rows with k columns of complete data, subject to some criterion (n > N). Currently, I am doing something that feels a bit arbitrary: I start by ranking the columns by completeness and take the column (C1) with the largest number of completions (non-missing rows) as my starting point. I then filter out rows with missing data for C1, re-rank the remaining columns by completeness, choose the top column (C2) with the highest number of completions, and continue down this path until I reach a set size I am comfortable with (i.e., I stop before n drops below N).
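For concreteness, here is a rough pandas sketch of the greedy procedure I just described (the function name, signature, and exact stopping rule are only my own illustration, not an established method):

```python
import pandas as pd

def greedy_complete_subset(df: pd.DataFrame, N: int):
    """Greedy heuristic: repeatedly add the most-complete column among
    the rows still fully observed on every chosen column, stopping
    before the surviving row count would drop below N."""
    rows = df.index            # rows complete on all chosen columns so far
    chosen = []                # columns chosen so far, in selection order
    remaining = list(df.columns)
    while remaining:
        sub = df.loc[rows, remaining]
        counts = sub.notna().sum()         # non-missing count per candidate column
        best = counts.idxmax()             # column with the most complete rows
        surviving = sub.index[sub[best].notna()]
        if len(surviving) < N:             # adding `best` would push n below N
            break
        chosen.append(best)
        rows = surviving
        remaining.remove(best)
    return rows, chosen

# Hypothetical usage, with `df` as the 10M x 1,500 frame:
# rows, cols = greedy_complete_subset(df, N=1_000_000)
# complete_block = df.loc[rows, cols]      # n-by-k fully observed submatrix
```

(Each iteration rescans the surviving submatrix, so at this scale it might be cheaper to precompute a boolean `df.notna()` mask once and work on that in NumPy.)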
I would be very interested to know whether there are established methods for this, and/or any thoughts on efficient ways to do it. Thank you!
subset
missing-data