2 years ago

#24977


Ashutosh Sharma

Unable to write spark dataframe to Greenplum table using spark-submit

I am reading a Spark DataFrame and trying to ingest the data into a Greenplum database table, but I get the error below.

It runs fine when I run the same code from spark-shell or a standalone Zeppelin server.
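For reference, the write itself follows the standard connector pattern; all connection values below are placeholders, not my real config:

```scala
// Minimal sketch of the write path via the Greenplum-Spark connector.
// Every option value here is a placeholder.
val gscOptions = Map(
  "url"      -> "jdbc:postgresql://gp-master:5432/test_db",
  "user"     -> "test_user",
  "password" -> "********",
  "dbschema" -> "public",
  "dbtable"  -> "gp_test_table"
)

df.write
  .format("greenplum")   // short name registered by the connector jar
  .options(gscOptions)
  .mode("append")
  .save()
```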

I've also granted the Greenplum user access to create external tables:

ALTER USER test_user CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist')

Error logs in Spark (YARN):

22/01/07 09:01:07 WARN TaskSetManager: Lost task 32.1 in stage 16.0 (TID 117, 72.52.104.195, executor 5): org.postgresql.util.PSQLException: ERROR: http response code 500 from gpfdist (gpfdist://{IP_Address}:{port}/spark_3a746671c6a0a7c2_3d9d854163f8f07a_5_117): HTTP/1.1 500 Server Error  (seg6 slice1 {IP_Address}:6003 pid=1877)
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2310)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2023)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:217)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:421)
    at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:318)
    at org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:294)
    at com.zaxxer.hikari.pool.ProxyStatement.executeUpdate(ProxyStatement.java:120)
    at com.zaxxer.hikari.pool.HikariProxyStatement.executeUpdate(HikariProxyStatement.java)
    at io.pivotal.greenplum.spark.SqlExecutor$$anonfun$executeUpdate$2$$anonfun$apply$8.apply(SqlExecutor.scala:45)
    at io.pivotal.greenplum.spark.SqlExecutor$$anonfun$executeUpdate$2$$anonfun$apply$8.apply(SqlExecutor.scala:43)
    at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
    at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:88)
    at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
    at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)

I also checked the Greenplum error logs, and this is all I could find in pg_logs:

2022-01-07 08:33:26.446184 UTC,"test_user","test_db",p1740,th430266496,"{IP_ADDRESS}","24814",2022-01-07 08:33:24 UTC,0,con15013,cmd3,seg-1,,dx346951,,sx1,"ERROR","08006","http response code 500 from gpfdist (gpfdist://IP_ADDRESS:PORT/spark_3809cb98f16140df_3d9d854163f8f07a_2_83): HTTP/1.1 500 Server Error  (seg3 slice1 IP_ADDRESS:6003 pid=11466)",,,,,,"INSERT INTO ""public"".""gp_test_table""

EDIT: YARN DEBUG log

22/01/07 21:58:01 DEBUG HttpParser: HEADER_IN_NAME --> HEADER_VALUE
22/01/07 21:58:01 ERROR GpfdistHandler: Failed to handle GET request for /spark_42af0f6da7ddb6ea_3d9d854163f8f07a_1_60 : java.lang.NullPointerException
22/01/07 21:58:01 DEBUG HttpParser: HEADER --> HEADER_IN_NAME
22/01/07 21:58:01 ERROR GpfdistHandler: Failed to handle GET request for /spark_42af0f6da7ddb6ea_3d9d854163f8f07a_1_72 : java.lang.NullPointerException
22/01/07 21:58:01 DEBUG HttpParser: HEADER_IN_VALUE --> HEADER
22/01/07 21:58:01 DEBUG HttpParser: HEADER_IN_NAME --> HEADER_VALUE
22/01/07 21:58:01 DEBUG Server: handled=true async=false committed=false on HttpChannelOverHttp@6aa2035{r=1,c=false,a=DISPATCHED,uri=//IP_ADDRESS:PORT/spark_42af0f6da7ddb6ea_3d9d854163f8f07a_1_72}
22/01/07 21:58:01 DEBUG HttpParser: HEADER_IN_VALUE --> HEADER
22/01/07 21:58:01 DEBUG HttpChannelState: HttpChannelState@47972a0e{s=DISPATCHED a=NOT_ASYNC i=true r=NONE/false w=false} unhandle DISPATCHED
22/01/07 21:58:01 DEBUG HttpParser: HEADER_IN_NAME --> HEADER_VALUE
22/01/07 21:58:01 DEBUG HttpChannel: HttpChannelOverHttp@6aa2035{r=1,c=false,a=COMPLETING,uri=//IP_ADDRESS:PORT/spark_42af0f6da7ddb6ea_3d9d854163f8f07a_1_72} action COMPLETE
22/01/07 21:58:01 DEBUG SelectChannelEndPoint: onSelected 1->0 r=true w=false for SelectChannelEndPoint@7472b79f{/IP_ADDRESS:PORT<->36455,Open,in,out,FI,-,116/30000,HttpConnection@1935f810}{io=1/0,kio=1,kro=1}

22/01/07 21:58:01 DEBUG HttpConnection: org.eclipse.jetty.server.HttpConnection$SendCallback@14b8949[PROCESSING][i=HTTP/1.1{s=200,h=1},cb=org.eclipse.jetty.server.HttpChannel$CommitCallback@67270ed4] generate: FLUSH ([p=0,l=108,c=8192,r=108],[p=0,l=0,c=0,r=0],true)@COMPLETING
22/01/07 21:58:01 DEBUG HttpParser: HEADER_IN_NAME --> HEADER_VALUE
java.lang.NullPointerException
        at org.apache.spark.sql.execution.aggregate.HashAggregateExec.finishAggregate(HashAggregateExec.scala:368)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(generated.java:262)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:269)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler$$anonfun$serveData$1.apply$mcV$sp(GpfdistHandler.scala:119)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler$$anonfun$serveData$1.apply(GpfdistHandler.scala:110)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler$$anonfun$serveData$1.apply(GpfdistHandler.scala:110)
        at scala.util.Try$.apply(Try.scala:192)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler.serveData(GpfdistHandler.scala:110)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler.processPartitionData(GpfdistHandler.scala:96)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler.handleGET(GpfdistHandler.scala:84)
        at io.pivotal.greenplum.spark.externaltable.GpfdistHandler.handle(GpfdistHandler.scala:51)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:539)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)
22/01/07 21:58:01 ERROR GpfdistHandler: Failed to handle GET request for /spark_42af0f6da7ddb6ea_3d9d854163f8f07a_1_61 : java.lang.NullPointerException

A similar issue, "Greenplum-spark connector cannot save greenplum table when query with group", has been reported in the community but hasn't been answered.

Please help me with this issue, and let me know if any further info is needed. Thanks!

Tags: apache-spark, greenplum
