
Using a Spark DataFrame with a UDF to access a graph DB via Gremlin: cannot pickle '_queue.SimpleQueue'

I'm currently having an issue trying to create a UDF to access a graph DB for each record in my DataFrame. Example testdf:

    id  name
    1   tom
    2   ben
    ..  etc.

I have written a function that takes an id and looks into the Neptune graph to see whether that id is connected to an 'engineer' vertex. It looks something like this:

    def getEngineer(id):
        return (g.V(f"{id}")
                 .repeat(__.out('knows').simplePath())
                 .until(__.hasLabel('engineer'))
                 .dedup().elementMap('id').toList())

    getEngineerUDF = udf(lambda z: getEngineer(z))

I have wrapped this function in a UDF and am trying to use it with withColumn:

    finDf = testdf.withColumn('EngInNeptune', getEngineerUDF(F.col('id')))

When I run the above command I receive this error:

    Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
        return cloudpickle.dumps(obj, pickle_protocol)
      File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
        cp.dump(obj)
      File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
        return Pickler.dump(self, obj)
    TypeError: cannot pickle '_queue.SimpleQueue' object
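Digging a bit, I can reproduce the pickling failure with nothing but the standard library, which makes me think the real problem is that my UDF's closure captures the traversal source `g` (whose driver connection presumably holds a `SimpleQueue` somewhere internally, given the error):

```python
import pickle
import queue

# queue.SimpleQueue is a C type with no pickle support, so anything that
# drags one along when Spark serializes the UDF fails the same way:
try:
    pickle.dumps(queue.SimpleQueue())
except TypeError as exc:
    print(exc)  # cannot pickle '_queue.SimpleQueue' object
```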

I would appreciate any help. (I'm still pretty new to this, so sorry if I've missed something.)

Is it possible to implement something like this? My assumption is that the Gremlin connection doesn't like being put into a UDF because of how Spark serializes the function and distributes it across executors.
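One workaround I'm considering (just a sketch, untested against a real Neptune cluster; `NEPTUNE_URL` and `find_engineers` are my own placeholder names) is to skip the UDF and use mapPartitions, opening the Gremlin connection inside the task so the unpicklable client is never captured:

```python
NEPTUNE_URL = "wss://your-neptune-endpoint:8182/gremlin"  # placeholder

def find_engineers(ids):
    """Run the 'engineer' lookup for every id in one partition.

    All Gremlin imports and the connection live inside the function, so
    Spark pickles only this plain top-level function by reference and
    never touches the driver's internal queue.
    """
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __

    conn = DriverRemoteConnection(NEPTUNE_URL, "g")
    g = traversal().withRemote(conn)
    try:
        return [
            (id_,
             g.V(f"{id_}")
              .repeat(__.out("knows").simplePath())
              .until(__.hasLabel("engineer"))
              .dedup().elementMap("id").toList())
            for id_ in ids
        ]
    finally:
        conn.close()

# Driver-side usage (with testdf from above):
#   engineers = testdf.select("id").rdd.mapPartitions(
#       lambda rows: find_engineers([r["id"] for r in rows])).collect()
```

Since the function itself carries no Gremlin state, it should serialize fine; whether one connection per partition is acceptable load on Neptune I don't know.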

Tags: apache-spark, pyspark, databricks, gremlin, amazon-neptune
