Two Misconfigurations That Caused Spark OOM Failures on Kubernetes
This article discusses the out-of-memory (OOM) failures that occurred when running Spark on Kubernetes because of two infrastructure misconfigurations. The first is setting `spark.kubernetes.local.dirs.tmpfs=true`, which backs Spark's local directories with tmpfs so that all shuffle spill data is held in node memory. The second is using a hard `podAffinity` rule that forces all executors onto the same node. Together, these settings make shuffle spill consume the memory of a single node instead of disk, leading to repeated OOM failures. Adjusting these settings resolves the issue.
Reason for selection: setting `spark.kubernetes.local.dirs.tmpfs=true` stores all shuffle spill data in node memory, which can lead to OOM failures.
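
As a rough illustration of the two settings described above, the following Python sketch scans a Spark configuration and an executor pod template for both patterns. The function name `find_risky_settings`, the dictionary layout of the inputs, the warning texts, and the `spark-role` label in the usage example are assumptions made for this example and are not taken from the article. The field name `requiredDuringSchedulingIgnoredDuringExecution` is the standard Kubernetes key for a hard affinity rule.

```python
def find_risky_settings(spark_conf, pod_template):
    """Return a list of warnings for the two settings discussed in the text.

    spark_conf:   dict of Spark properties (hypothetical input shape).
    pod_template: dict mirroring an executor pod template (hypothetical input shape).
    """
    warnings = []

    # Spark properties are usually strings; "true" makes local dirs tmpfs-backed.
    tmpfs = str(spark_conf.get("spark.kubernetes.local.dirs.tmpfs", "false"))
    if tmpfs.strip().lower() == "true":
        warnings.append("tmpfs local dirs: shuffle spill is held in memory, not on disk")

    # A non-empty 'required...' list under podAffinity is a hard co-location rule.
    affinity = pod_template.get("spec", {}).get("affinity", {})
    hard_rules = affinity.get("podAffinity", {}).get(
        "requiredDuringSchedulingIgnoredDuringExecution", []
    )
    if hard_rules:
        warnings.append("hard podAffinity: executors may all be forced onto one node")

    return warnings


if __name__ == "__main__":
    # Example inputs reproducing both settings; the label values are made up.
    conf = {"spark.kubernetes.local.dirs.tmpfs": "true"}
    template = {
        "spec": {
            "affinity": {
                "podAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "labelSelector": {"matchLabels": {"spark-role": "executor"}},
                            "topologyKey": "kubernetes.io/hostname",
                        }
                    ]
                }
            }
        }
    }
    for warning in find_risky_settings(conf, template):
        print(warning)
```

A check like this only flags the configuration. Whether either setting actually causes OOM failures depends on the shuffle volume and the memory available on the node.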
