
Troubleshooting an unexpectedly closed SparkContext on Spark 3.1.2

A while back I built a web service on top of Spark and Spring Boot, registering the SparkSession as a Spring bean, as shown below.

@Bean
@ConditionalOnMissingBean(SparkSession.class)
public SparkSession sparkSession(SparkConf conf) {
    // Share one SparkSession across the whole application as a singleton bean
    return SparkSession.builder()
            .enableHiveSupport()
            .config(conf)
            .getOrCreate();
}

However, after upgrading Spark to 3.1.2, the SparkContext would shut down as soon as the service finished starting up; rolling back to 3.0.2 made the problem disappear. The startup log on Spark 3.1.2 looks like this:

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::                (v2.5.2)

...
21/07/01 15:05:10 INFO YarnClientSchedulerBackend: Application application_1619358582322_6529891 has started running.
21/07/01 15:05:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40461.
...
21/07/01 15:05:10 INFO SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
21/07/01 15:05:10 INFO Utils: Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/07/01 15:05:10 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
21/07/01 15:05:11 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
21/07/01 15:05:11 INFO WelcomePageHandlerMapping: Adding welcome page template: index
21/07/01 15:05:12 INFO Http11NioProtocol: Starting ProtocolHandler ["http-nio-9000"]
21/07/01 15:05:12 INFO TomcatWebServer: Tomcat started on port(s): 9000 (http) with context path ''
21/07/01 15:05:12 INFO SpringApplication: Started application in 2939.592 seconds (JVM running for 2942.937)
21/07/01 15:05:12 INFO AbstractConnector: Stopped Spark@23cd4246{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
21/07/01 15:05:12 INFO SparkUI: Stopped Spark web UI at http://xxxxxxxxx:4040
21/07/01 15:05:12 INFO YarnClientSchedulerBackend: Interrupting monitor thread
21/07/01 15:05:12 INFO YarnClientSchedulerBackend: Shutting down all executors
21/07/01 15:05:12 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
21/07/01 15:05:12 INFO YarnClientSchedulerBackend: YARN client scheduler backend Stopped
21/07/01 15:05:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/07/01 15:05:13 INFO MemoryStore: MemoryStore cleared
21/07/01 15:05:13 INFO BlockManager: BlockManager stopped
21/07/01 15:05:13 INFO BlockManagerMaster: BlockManagerMaster stopped
21/07/01 15:05:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/07/01 15:05:13 INFO SparkContext: Successfully stopped SparkContext
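The log above shows the context being torn down immediately after Spring reports a successful start. The ordering can be understood from SparkSubmit's control flow: for a Spring Boot web app, the user's main method returns as soon as startup completes (Tomcat keeps serving on other non-daemon threads), and in 3.1.2 a finally block then stops any active SparkContext. The following is a simplified, self-contained model of that flow (it is not the real Spark API; the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of SparkSubmit.runMain's try/finally in Spark 3.1.2.
// The key point: the finally block runs as soon as the user's main
// returns, even though the web server is still serving requests.
public class SparkSubmitFlowModel {
    public static List<String> runMain(Runnable userMain) {
        List<String> events = new ArrayList<>();
        try {
            events.add("app start");
            userMain.run();                 // SpringApplication.run(...) returns here
            events.add("app start over");
        } finally {
            // Models SparkContext.getActive.foreach(_.stop()) in 3.1.2
            events.add("finally clean spark context");
            events.add("SparkContext stopped");
        }
        return events;
    }
}
```

Running the model with a no-op "main" reproduces the same ordering seen in the log: the app start completes first, and the context cleanup follows unconditionally.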

Going through the 3.1.2 commit history, I found that the problem was introduced by PR c625eb4, which fixes SPARK-34674. That PR adds a call to SparkContext.getActive.foreach(_.stop()) at line 955 of core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala. To verify this guess, I added log statements to the source and rebuilt Spark:

try {
  logInfo("## app start ")
  app.start(childArgs.toArray, sparkConf)
  logInfo("## app start over ")
} catch {
  case t: Throwable =>
    throw findCause(t)
} finally {
  logInfo("## finally clean spark context ")
  if (!isShell(args.primaryResource) && !isSqlShell(args.mainClass) &&
      !isThriftServer(args.mainClass)) {
    try {
      SparkContext.getActive.foreach(_.stop())
    } catch {
      case e: Throwable => logError(s"Failed to close SparkContext: $e")
    }
  }
}

The app startup log with the extra logging is shown below.

## app start

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::                (v2.5.2)

...
21/07/01 15:05:10 INFO YarnClientSchedulerBackend: Application application_1619358582322_6529891 has started running.
21/07/01 15:05:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40461.
...
21/07/01 15:05:10 INFO SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
21/07/01 15:05:10 INFO Utils: Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/07/01 15:05:10 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
21/07/01 15:05:11 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
21/07/01 15:05:11 INFO WelcomePageHandlerMapping: Adding welcome page template: index
21/07/01 15:05:12 INFO Http11NioProtocol: Starting ProtocolHandler ["http-nio-9000"]
21/07/01 15:05:12 INFO TomcatWebServer: Tomcat started on port(s): 9000 (http) with context path ''
21/07/01 15:05:12 INFO SpringApplication: Started application in 2939.592 seconds (JVM running for 2942.937)
## app start over
## finally clean spark context
21/07/01 15:05:12 INFO AbstractConnector: Stopped Spark@23cd4246{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
21/07/01 15:05:12 INFO SparkUI: Stopped Spark web UI at http://xxxxxxxxx:4040
21/07/01 15:05:12 INFO YarnClientSchedulerBackend: Interrupting monitor thread
21/07/01 15:05:12 INFO YarnClientSchedulerBackend: Shutting down all executors
21/07/01 15:05:12 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
21/07/01 15:05:12 INFO YarnClientSchedulerBackend: YARN client scheduler backend Stopped
21/07/01 15:05:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/07/01 15:05:13 INFO MemoryStore: MemoryStore cleared
21/07/01 15:05:13 INFO BlockManager: BlockManager stopped
21/07/01 15:05:13 INFO BlockManagerMaster: BlockManagerMaster stopped
21/07/01 15:05:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/07/01 15:05:13 INFO SparkContext: Successfully stopped SparkContext

After modifying that block of code, rebuilding, and re-testing, the problem was resolved. The startup log is shown below.

## app start

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::                (v2.5.2)

...
21/07/01 16:05:51 INFO YarnClientSchedulerBackend: Application application_1619358582322_6532933 has started running.
21/07/01 16:05:51 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42549.
21/07/01 16:05:51 INFO NettyBlockTransferService: Server created on xxxxxxxxx:42549
21/07/01 16:05:51 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/07/01 16:05:51 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, xxxxxxx, 42549, None)
21/07/01 16:05:51 INFO BlockManagerMasterEndpoint: Registering block manager xxxxxxxxx:42549 with 5.2 GiB RAM, BlockManagerId(driver, xxxxxxx, 42549, None)
21/07/01 16:05:51 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, xxxxxxxx, 42549, None)
21/07/01 16:05:51 INFO BlockManager: external shuffle service port = 7337
21/07/01 16:05:51 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, xxxxxxxx, 42549, None)
21/07/01 16:05:51 INFO ServerInfo: Adding filter to /metrics/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
21/07/01 16:05:51 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@60510791{/metrics/json,null,AVAILABLE,@Spark}
21/07/01 16:05:52 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
21/07/01 16:05:54 INFO Utils: Using initial executors = 1, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/07/01 16:05:54 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
21/07/01 16:05:55 INFO WelcomePageHandlerMapping: Adding welcome page template: index
21/07/01 16:05:55 INFO Http11NioProtocol: Starting ProtocolHandler ["http-nio-9000"]
21/07/01 16:05:55 INFO TomcatWebServer: Tomcat started on port(s): 9000 (http) with context path ''
21/07/01 16:05:55 INFO SpringApplication: Started application in 848.362 seconds (JVM running for 851.343)
## app start over
## finally clean spark context

I later discussed the problem with a Spark expert at my company, whose suggestion was to add a startup parameter controlling whether the context should be stopped after the main function returns. I submitted PR 33154 to the Spark community with that change. My PR was not accepted, though; the community's final fix for this issue is commit fd3e9ce.
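The expert's suggestion amounts to gating the cleanup in SparkSubmit's finally block behind a submit-time switch, the same way 3.1.2 already skips it for shells and the Thrift server. A minimal sketch of that decision logic, with an invented flag name (stopContextAfterMain) purely for illustration:

```java
// Hypothetical sketch of the idea behind PR 33154: extend the existing
// escape hatches (shell, SQL shell, Thrift server) with a user-settable
// flag so long-running apps such as web services can opt out of having
// their SparkContext stopped when main returns.
public class ContextCleanupPolicy {
    public static boolean shouldStopContext(boolean stopContextAfterMain,
                                            boolean isShell,
                                            boolean isThriftServer) {
        // Spark 3.1.2 skips cleanup for shells and the Thrift server;
        // the flag adds the same opt-out for arbitrary applications.
        return stopContextAfterMain && !isShell && !isThriftServer;
    }
}
```

With the flag set to false, a web service like the one above would keep its context alive after startup; shells would continue to be exempt regardless of the flag.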