ak97 · #65921 · asked 2 years ago
Using a Spark DataFrame with a UDF to access a graph DB via Gremlin: cannot pickle '_queue.SimpleQueue'
I'm currently having an issue trying to create a UDF that accesses a graph DB for each record in my DataFrame. Example `testdf`:

| id | name |
|----|------|
| 1  | tom  |
| 2  | ben  |
| .. | etc. |
I have written a function that takes an id and looks in the Neptune graph to see whether that vertex is connected to an `engineer` vertex. It looks something like this:

```python
def getEngineer(id):
    return (g.V(f"{id}")
             .repeat(__.out('knows').simplePath())
             .until(__.hasLabel('engineer'))
             .dedup()
             .elementMap('id')
             .toList())
```
I have wrapped this function in a UDF and am trying to apply it with `withColumn`:

```python
getEngineerUDF = udf(lambda z: getEngineer(z))

finDf = testdf.withColumn('EngInNeptune', getEngineerUDF(F.col('id')))
```
When I run the above command I receive this error:
```
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_queue.SimpleQueue' object
```
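For context, the error can be reproduced outside Spark entirely. `queue.SimpleQueue` is implemented in C (as `_queue.SimpleQueue`) and defines no pickle support, so cloudpickle fails the moment an object holding one (presumably somewhere inside gremlin_python's connection/transport objects captured by the UDF's closure) is serialized:

```python
import pickle
import queue

# queue.SimpleQueue is a C type (_queue.SimpleQueue) with no pickle
# support, so any attempt to serialize it raises TypeError.
q = queue.SimpleQueue()

try:
    pickle.dumps(q)
except TypeError as e:
    print(e)  # cannot pickle '_queue.SimpleQueue' object
```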
Any help would be appreciated (I'm still pretty new, so sorry if I've missed something). Is it possible to implement something like this? My assumption is that the Gremlin connection doesn't work inside a UDF because of how Spark serializes and distributes it to the executors (concurrently?).
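To illustrate the assumption above: if the traversal source `g` (and its underlying connection) is captured in the UDF's closure, Spark must pickle it to ship it to executors. A sketch of the pattern that avoids this, building the connection inside the function instead; `NEPTUNE_WSS` is a placeholder endpoint, and this is untested against a real cluster:

```python
# Placeholder; substitute your cluster's websocket endpoint.
NEPTUNE_WSS = "wss://your-neptune-endpoint:8182/gremlin"

def getEngineer(id):
    # Import and connect *inside* the function: Spark then only has to
    # serialize the function's code, never a live (unpicklable) connection.
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __

    conn = DriverRemoteConnection(NEPTUNE_WSS, 'g')
    try:
        g = traversal().withRemote(conn)
        return (g.V(f"{id}")
                 .repeat(__.out('knows').simplePath())
                 .until(__.hasLabel('engineer'))
                 .dedup()
                 .elementMap('id')
                 .toList())
    finally:
        conn.close()
```

Opening a connection per row is expensive, so in practice something like `mapPartitions` (one connection per partition) would presumably be preferable, but the serialization point is the same.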
apache-spark
pyspark
databricks
gremlin
amazon-neptune