1 year ago

#74150

test-img

Charlotte R

Split of data into train and test for spatial cross-validation

I want to create train and test datasets that avoid overfitting by keeping a spatial distance (in my example let's say 200 m) between both datasets. So study sites in train will never be closer than 200 m from study sites in test.

For this I made a variation of this script that uses a for loop :

# Selects points randomly in the dataset and puts them in the new dataset A or B
# To do this, checks whether the selected point is close to points already put in A or B
# and puts it in the dataset where it is closer to
# Some points are close to A and to B and will not be put in A nor in B
# The process is repeated "reps" times and then the biggest dataset (A+B) is chosen

# foo = dataset with columns "x" and "y" for coordinates
# buffer = distance threshold between points (use the same unit as the coordinates)
# reps = number of repetitions of the process to choose biggest dataset among all trials

buffer.f <- function(foo, buffer, reps){
  
  print(paste0 ("Sampling random points to create train and test datasets:"))
                
  # Make list of suitable tables
  suitable = list()
  for (k in 1:reps){
    
    print(paste0 (k, " out of ", reps, " iterations"))
    
    # Prepare output tables
    A <- data.frame(matrix(ncol = ncol(foo), nrow = 0))
    B <- data.frame(matrix(ncol = ncol(foo), nrow = 0))
    colnames(A) = colnames(foo)
    colnames(B) = colnames(foo)
    outdata = data.frame()
    
    # Set the rows to sample from
    for(i in 1:nrow(foo)){
      if(i>1){
        rowsleft <- (1:nrow(foo))[-c(as.numeric(rownames(A)), as.numeric(rownames(B)))]
      } else {
        rowsleft <- 1:nrow(foo)
      }
      
      # Randomly select point
      outpoint <- as.numeric(sample(as.character(rowsleft),1))
      outcoord <- foo[outpoint,c("x","y")]
      
      if(nrow(A) >0){
        if(nrow(B)>0){
          # If point is close to A points but far from B points, then goes to A
          if(TRUE %in%  (sqrt((A$x-outcoord$x)^2 + (A$y-outcoord$y)^2)<buffer) &
             !TRUE %in%  (sqrt((B$x-outcoord$x)^2 + (B$y-outcoord$y)^2)<buffer)){
            A = rbind(A,foo[outpoint,])
            # If point is close to B points but far from A points, then goes to B
          }else if(TRUE %in%  (sqrt((B$x-outcoord$x)^2 + (B$y-outcoord$y)^2)<buffer) &
                   !TRUE %in%  (sqrt((A$x-outcoord$x)^2 + (A$y-outcoord$y)^2)<buffer)){
            B = rbind(B,foo[outpoint,])
            # If point is far to A and B points ...
          }else if (!TRUE %in%  (sqrt((A$x-outcoord$x)^2 + (A$y-outcoord$y)^2)<buffer) &
                    !TRUE %in%  (sqrt((B$x-outcoord$x)^2 + (B$y-outcoord$y)^2)<buffer)){
            if(i %in% seq(1, nrow(foo), 3)){ # ... goes to A once every three cases to have approximately 33% A /66% B
              A = rbind(A,foo[outpoint,])
            }else{
              B = rbind(B,foo[outpoint,])
            }
          }
        }else{
          B = rbind(B,foo[outpoint,])
        }
      }else{
        A = rbind(A,foo[outpoint,])
      }
      
    }
    
    # Bind A and B
    A$type ="test"
    B$type ="train"
    outdata = rbind(A,B)
    
    # Populate the suitable points list
    suitable[[k]] <- outdata
  }
  
  # Go through the iterations and pick a list with the most data
  best <- suitable[[which.max(lapply(suitable, nrow))]]
  best
}

It works as expected. However, this function is way too long to run for the amount of data that I have and I need a quicker alternative. I found this solutions that looks perfect, but it is in Python. I am looking for a solution in R.

EDIT : here is a reproducible example:

xy93 <- data.frame(x = c(752683.4, 752669.2, 752888.8, 752858.3, 752537.5, 752626.2, 753146.4, 752708.8, 752594.5, 753214.5, 752445.9, 756242.8, 755447.7, 755182.7, 798997.7), y = c(6275019, 6275022, 6275884, 6274960, 6274792, 6275930, 6275850, 6275040, 6275606, 6276051, 6274641, 6272849, 6273417, 6274078, 6325464))

TEST = buffer.f(xy93, 200, 1)

r

function

split

spatial

cross-validation

0 Answers

Your Answer

Accepted video resources