2 years ago
#59268
math_stats051
How do you use sdf_checkpoint to break spark table lineage in sparklyr?
I'm attempting to manipulate a Spark DataFrame via sparklyr with a dplyr mutate call that constructs a large number of variables, and each time this fails with a Java error about generated code growing beyond the 64 KB method limit.
The mutate call itself is coded correctly: it runs properly when executed separately to construct different dataframes for small subsets of the variables, but not when all of the variables are included in a single mutate call (or in several mutate statements piped together in the same chain to construct a single dataset).
I suspect that this is an issue with needing to 'break the computation graph' using sparklyr::sdf_checkpoint(eager = TRUE), but I am unclear precisely how to do this, as the documentation for this function is sparse.
Is the appropriate sequence of commands the following?
df <- df %>% compute('df') %>% sdf_checkpoint(eager = TRUE)
Also, when I set
sparklyr::spark_set_checkpoint_dir()
to an HDFS location, I can see that files appear there after running the command. However, when I then reference df after the check-pointing, the mutate call still fails with the same error. When I reference df, am I actually referencing the version whose lineage has been broken, or do I need to do something to 'load' the checkpoint back in? If so, could somebody provide me with the steps to load this dataset back into memory without the lineage?
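For a concrete picture, here is the full sequence I believe I'm meant to run. The master, paths, table name, and column names (some_col, new_var) are placeholders, and I'm not certain whether the compute() step is actually required before checkpointing:

```r
library(sparklyr)
library(dplyr)

# Connect to Spark (placeholder master)
sc <- spark_connect(master = "yarn")

# Checkpoints must be written to a reliable store, e.g. HDFS
spark_set_checkpoint_dir(sc, "hdfs:///tmp/spark-checkpoints")

# Placeholder input data
df <- spark_read_parquet(sc, "df", "hdfs:///data/input.parquet")

# Build the first batch of variables, then checkpoint eagerly so the
# lineage is truncated immediately; keep working with the RETURNED
# tbl, not the pre-checkpoint reference
df <- df %>%
  mutate(new_var = some_col * 2) %>%
  sdf_checkpoint(eager = TRUE)

# Subsequent mutates should now start from the checkpointed plan
df <- df %>%
  mutate(another_var = new_var + 1)
```

My assumption is that reassigning df to the result of sdf_checkpoint() is what makes later operations use the truncated lineage; if the original reference is kept instead, the full plan presumably still applies.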
r
apache-spark
apache-spark-sql
sparklyr
0 Answers