AWS Architecture Blog

Unlock efficient model deployment: Simplified Inference Operator setup on Amazon SageMaker HyperPod


Amazon SageMaker HyperPod offers an end-to-end experience supporting the full lifecycle of AI development—from interactive experimentation and training to inference and post-training workflows. The SageMaker HyperPod Inference Operator is a Kubernetes controller that manages the deployment and lifecycle of models on HyperPod clusters, offering flexible deployment interfaces (kubectl, Python SDK, SageMaker Studio UI, or HyperPod CLI), advanced autoscaling with dynamic resource allocation, and comprehensive observability that tracks critical metrics like time-to-first-token, latency, and GPU utilization.

Deploying inference workloads on Kubernetes-native infrastructure has traditionally required AI teams to navigate a maze of Helm charts, IAM role configurations, dependency management, and manual upgrades — often taking hours before a single model can serve predictions. Today, we’re announcing the Amazon SageMaker HyperPod Inference Operator as a native EKS add-on, enabling one-click installation and managed upgrades directly from the SageMaker console. This eliminates the need for manual Helm charts, complex IAM configuration tweaks, and downtime during upgrades.

In this post, we walk through the new installation experience, demonstrate three deployment methods (console, CLI, and Terraform), and show how features like multi-instance-type deployment and native node affinity give you fine-grained control over inference scheduling.

**Simplified installation experience**

The new installation experience addresses three key customer scenarios with streamlined workflows:

**New HyperPod clusters: Automatic installation**

When you create a new HyperPod cluster through the SageMaker console's Quick Setup or Custom Setup workflow, the Inference Operator and its required dependencies are now installed automatically as EKS add-ons during cluster creation. This eliminates post-deployment configuration, provides one-click upgrades, and ensures your cluster is ready for model deployments immediately upon creation.

**Existing clusters: One-click installation**

For existing HyperPod clusters, customers can install the Inference Operator with a single click through the SageMaker console. The installation automatically:

  • Creates required IAM roles with appropriate trust relationships and permissions
  • Sets up S3 buckets for TLS certificate storage
  • Configures VPC endpoints for secure S3 access
  • Installs dependency add-ons (cert-manager, S3 CSI driver, FSx CSI driver, metrics-server)
  • Deploys the Inference Operator as an EKS add-on

**Managed upgrades and lifecycle**

The EKS add-on integration provides standardized version management with one-click upgrades through the AWS console or CLI. This ensures customers can easily adopt new features and security updates without complex manual procedures.
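Because the operator is a standard EKS add-on, upgrades can be driven with the usual EKS add-on CLI commands. A sketch of that flow is below; the version string is a placeholder, so check `describe-addon-versions` for the versions actually available to your cluster:

```shell
# List the add-on versions available for your cluster's Kubernetes version
aws eks describe-addon-versions \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --kubernetes-version 1.33

# Upgrade the installed add-on to a newer version, preserving any
# custom configuration values you supplied at install time
aws eks update-addon \
  --cluster-name my-hyperpod-cluster \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --addon-version v1.1.0-eksbuild.1 \
  --resolve-conflicts PRESERVE
```

`--resolve-conflicts PRESERVE` keeps self-managed field changes intact during the upgrade; use `OVERWRITE` if you want the add-on defaults to win.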

The following prerequisite resources must be set up before installing the Inference Operator add-on. If you use the SageMaker AI console, these prerequisites are created for you. If you install through the EKS CLI or console instead, you must create them manually and pass them to the add-on through configuration parameters. We discuss these approaches in Installation methods.

**List of prerequisites**

1. EKS add-ons (Mountpoint for Amazon S3 CSI driver add-on, FSx CSI driver add-on, cert-manager add-on, metrics-server add-on)
2. IAM roles (Inference Operator execution role, ALB role, KEDA role, optional JumpStart gated models role)
3. Infrastructure (S3 bucket to manage TLS certificates, OIDC association on the cluster)

For more information, refer to the troubleshooting guide.

**Installation methods**

**Method 1: Install SageMaker HyperPod Inference Add-on through SageMaker UI (Recommended)**

The SageMaker console provides the most streamlined experience with two installation options:

**Quick install**: Automatically creates all required resources with optimized defaults, including IAM roles, S3 buckets, and dependency add-ons. This option is ideal for getting started quickly with minimal configuration decisions.

**Custom install**: Provides flexibility to specify existing resources or customize configurations while maintaining the one-click experience. Customers can choose to reuse existing IAM roles, S3 buckets, or dependency add-ons based on their organizational requirements.

Image 1: Amazon SageMaker HyperPod Inference Operator installation page showing Quick install and Custom install options with component details including AWS Load Balancer Controller, KEDA, and CSI drivers

**Prerequisites**

  • An existing Amazon SageMaker HyperPod cluster with EKS orchestration
  • IAM permissions for EKS cluster administration
  • kubectl configured for cluster access

**Installation steps**

1. **Navigate to the SageMaker console**: Go to **HyperPod Clusters** → **Cluster Management**
2. **Select your cluster**: Choose the cluster where you want to install the Inference Operator

Image 2: HyperPod Dashboard main page

3. **Choose installation type**: Navigate to the **Inference** tab. Select **Quick Install** for automated setup or **Custom Install** for configuration flexibility

Image 3: SageMaker HyperPod Inference tab interface showing disabled cluster role and installation options for managing inference workloads

4. **Configure options**: If choosing **Custom Install**, specify existing resources or customize settings as needed
5. **Install**: Choose **Install** to begin the automated installation process
6. **Verify**: Check the installation status through the console, by running `kubectl get pods -n hyperpod-inference-system`, or by checking the add-on status with `aws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION`

After the add-on is successfully installed, you can deploy models by following the Model deployments documentation or the Deploying your first model section below.

**Method 2: Install SageMaker HyperPod Inference add-on through EKS APIs**

For customers preferring command-line workflows, the Inference Operator can be installed directly using the EKS CLI. Note that all prerequisite resources (IAM roles, S3 buckets, VPC endpoints) and dependency add-ons must be created manually before installing the Inference Operator add-on. For detailed setup instructions, see the installation guide.

aws eks create-addon \
  --cluster-name my-hyperpod-cluster \
  --addon-name amazon-sagemaker-hyperpod-inference \
  --addon-version v1.0.0-eksbuild.1 \
  --configuration-values '{
    "executionRoleArn": "arn:aws:iam::ACCOUNT-ID:role/SageMakerHyperPodInference-inference-role",
    "tlsCertificateS3Bucket": "hyperpod-tls-certificate-bucket",
    "hyperpodClusterArn": "arn:aws:sagemaker:REGION:ACCOUNT-ID:cluster/CLUSTER-ID",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "arn:aws:iam::ACCOUNT-ID:role/alb-controller-role"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "arn:aws:iam::ACCOUNT-ID:role/keda-operator-role"
          }
        }
      }
    }
  }' \
  --region us-west-2

Bash

**Method 3: Install SageMaker HyperPod Inference add-on through Terraform deployment**

Organizations utilizing Terraform for Infrastructure as Code (IaC) can deploy HyperPod clusters using the provided modules in the awesome-distributed-training GitHub repository.

To enable the HyperPod inference operator, set the `create_hyperpod_inference_operator_module` variable to `true` within your `custom.tfvars` file, as shown below:

kubernetes_version    = "1.33"
eks_cluster_name      = "tf-eks-cluster"
hyperpod_cluster_name = "tf-hp-cluster"
resource_name_prefix  = "tf-eks-test"
aws_region            = "us-east-1"

instance_groups = [
    {
        name                      = "accelerated-instance-group-1"
        instance_type             = "ml.g5.8xlarge",
        instance_count            = 2,
        availability_zone_id      = "use1-az2",
        ebs_volume_size_in_gb     = 100,
        threads_per_core          = 1,
        enable_stress_check       = false,
        enable_connectivity_check = false,
        lifecycle_script          = "on_create.sh"
    }
]

create_hyperpod_inference_operator_module = true

HCL

In addition to the HyperPod inference operator add-on, the Terraform modules also support the task governance, training operator, and observability add-ons. Check out the documentation on enabling optional add-ons for more details.

**Dependency management**

The HyperPod inference operator includes several additional dependencies, which are enabled by default but can be toggled off if they already exist on your EKS cluster:

| Dependency | Module/Variable | Toggle to disable |
| --- | --- | --- |
| cert-manager | Installed via the HyperPod module | `enable_cert_manager = false` |
| Amazon FSx for Lustre CSI | Installed via FSx module | `create_fsx_module = false` |
| Mountpoint for Amazon S3 CSI | Bundled with Inference Operator module | `enable_s3_csi_driver = false` |
| AWS Load Balancer Controller | Bundled with Inference Operator EKS add-on | `enable_alb_controller = false` |
| KEDA Operator | Bundled with Inference Operator EKS add-on | `enable_keda = false` |
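If some of these components already exist on your EKS cluster, the corresponding toggles go in the same `custom.tfvars` file. A minimal sketch, assuming cert-manager and KEDA are already installed on the cluster:

```hcl
create_hyperpod_inference_operator_module = true

# Reuse components already present on the cluster
enable_cert_manager   = false # cert-manager already installed
enable_keda           = false # KEDA already installed

# Let the module manage the remaining dependencies
enable_s3_csi_driver  = true
enable_alb_controller = true
```
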

**Key benefits**

**Faster time to value**

Teams can now deploy their first inference endpoint within minutes of cluster creation, compared to the previous multi-hour setup process. This acceleration enables faster experimentation and reduces the barrier to adoption for new teams.

**Reduced complexity**

The new installation experience eliminates the need to manually create and configure multiple AWS resources. Previously, customers needed to create IAM roles, policies, S3 buckets, VPC endpoints, and install multiple Kubernetes operators. Now, a single action handles all these requirements automatically.

**Consistent configuration**

Automated resource creation ensures consistent, secure configurations across environments. The installation process follows AWS best practices for IAM permissions, network security, and resource naming conventions.

**Simplified upgrades**

EKS Add-on integration provides standardized upgrade paths with rollback capabilities. Customers can confidently adopt new features and security updates through the familiar AWS console or CLI interfaces.

**Advanced features integration**

The simplified installation experience seamlessly integrates with advanced HyperPod inference capabilities:

**Managed tiered KV cache**

During installation, customers can optionally enable managed tiered KV cache with intelligent memory allocation based on instance types. This feature can reduce inference latency by up to 40% for long-context workloads while optimizing memory utilization across the cluster.

**Intelligent routing**

The installation automatically configures intelligent routing capabilities with multiple strategies (prefix-aware, KV-aware, round-robin) to maximize cache efficiency and minimize inference latency based on workload characteristics.

**Observability integration**

Built-in integration with HyperPod Observability provides immediate visibility into inference metrics, cache performance, and routing efficiency through Amazon Managed Grafana dashboards.

**Deploying your first model**

Once the add-on is installed, you can deploy models using the `InferenceEndpointConfig` or `JumpStart` custom resources. Here's an example configuration that deploys a DeepSeek distilled model through JumpStart:

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-test-endpoint
spec:
  model:
    modelId: "deepseek-llm-r1-distill-qwen-1-5b"
  sageMakerEndpoint:
    name: deepseek-test-endpoint
  server:
    instanceType: "ml.g5.8xlarge"

YAML

**New features**

**Multi-instance type deployment**

HyperPod Inference supports multi-instance type deployment, improving deployment reliability and resource utilization. You specify a prioritized list of instance types in your deployment configuration, and the system automatically selects from the available alternatives when your preferred instance type lacks capacity. The Kubernetes scheduler evaluates instance types in priority order using node-affinity-based scheduling, placing workloads on the highest-priority instance type that is available. In the example below, when deploying a model from S3, `ml.p4d.24xlarge` has the highest priority and is selected first if capacity is available. If `ml.p4d.24xlarge` is unavailable, the scheduler falls back to `ml.g5.24xlarge`, and finally to `ml.g5.8xlarge` as a last resort.

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: lmcache-test-1
  namespace: default
spec:
  replicas: 13
  modelName: Llama-3.1-8B-Instruct
  instanceTypes: ["ml.p4d.24xlarge","ml.g5.24xlarge","ml.g5.8xlarge"]

YAML

This is implemented using Kubernetes node affinity rules with `requiredDuringSchedulingIgnoredDuringExecution` to restrict scheduling to the specified instance types, and `preferredDuringSchedulingIgnoredDuringExecution` with descending weights to enforce priority ordering.
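To illustrate, the affinity generated for the three-instance-type example above would look roughly like the following. This is a sketch of the resulting pod spec based on the mechanism described, not output copied from a live cluster; it reuses the `node.kubernetes.io/instanceType` label key from the node affinity example in this post:

```yaml
affinity:
  nodeAffinity:
    # Hard constraint: schedule only on the listed instance types
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instanceType
          operator: In
          values: ["ml.p4d.24xlarge", "ml.g5.24xlarge", "ml.g5.8xlarge"]
    # Soft preferences: descending weights enforce the priority order
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instanceType
          operator: In
          values: ["ml.p4d.24xlarge"]
    - weight: 90
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instanceType
          operator: In
          values: ["ml.g5.24xlarge"]
    - weight: 80
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instanceType
          operator: In
          values: ["ml.g5.8xlarge"]
```
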

**Node affinity**

For scenarios requiring more granular scheduling control — such as excluding spot instances, preferring specific availability zones, or targeting nodes with custom labels — HyperPod Inference exposes Kubernetes’ native `nodeAffinity` directly in the `InferenceEndpointConfig` spec. This gives you the full expressiveness of Kubernetes scheduling primitives.

apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: lmcache-test-1
  namespace: default
spec:
  replicas: 15
  modelName: Llama-3.1-8B-Instruct
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node.kubernetes.io/instanceType
          operator: In
          values: ["ml.g5.4xlarge"]
  worker:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "1"

YAML

Clean up

To clean up your environment after completing this walkthrough, follow these steps to remove the deployed models and uninstall the Inference Operator add-on from your HyperPod cluster.

Removing Inference Operator add-on

Through the SageMaker console:

1. Navigate to SageMaker Console → **HyperPod Clusters** → **Cluster Management**
2. Select your cluster and go to the **Inference** tab
3. Choose **Remove** to uninstall the Inference Operator add-on and associated resources

Alternatively, using the AWS CLI:

aws eks delete-addon \
--cluster-name <my-hyperpod-cluster> \
--addon-name amazon-sagemaker-hyperpod-inference \
--region <region>

Bash

Delete the deployed models

# Delete JumpStartModel deployment
kubectl delete jumpstartmodel <model-name> -n <namespace>

# Or for InferenceEndpointConfig deployment
kubectl delete inferenceendpointconfig <endpoint-name> -n <namespace>

Bash

**Migration path for existing users**

An automated migration script, hosted in a public GitHub repository, transitions the HyperPod Inference Operator from a Helm installation to the EKS add-on, with built-in rollback if the add-on installation fails. Backup files are stored in `/tmp/hyperpod-migration-backup-<timestamp>/` for manual rollback if needed.

**Key features**

  • **Auto-Discovery**: Derives configuration from existing Helm deployment (roles, buckets, dependencies)
  • **Safe Migration**: Scales down Helm deployments before add-on installation, validates prerequisites
  • **Dependency Handling**: Migrates S3/FSx CSI drivers, cert-manager, and metrics-server to add-ons
  • **Rollback Support**: Preserves original resources and restores on failure

**IAM Roles created**

1. Execution role (Inference Operator + S3 TLS access)
2. JumpStart gated model role
3. ALB controller role
4. KEDA operator role

**Examples for running the script**

# Follow the step-by-step guided flow
./helm_to_addon.sh --cluster-name <my-cluster> --region us-east-1

# Run without prompts, except to initiate rollback in case of failure
./helm_to_addon.sh --cluster-name <my-cluster> --region us-east-1 --auto-approve

# Skip migrating the FSx, S3, metrics-server, and cert-manager dependencies
# from the Inference Operator Helm chart to their respective add-ons
./helm_to_addon.sh --cluster-name <my-cluster> --region us-east-1 --skip-dependencies-migration

Bash

**Migration flow**

1. Validate the existing Helm installation
2. Auto-derive configuration and create new IAM roles
3. Tag resources (ALBs, ACM certificates, S3 objects) with `CreatedBy: HyperPodInference`
4. Install dependency add-ons (S3, FSx, cert-manager) if the dependent CRDs don't exist
5. Scale down the Helm deployments for ALB, KEDA, and the Inference Operator
6. Install the Inference Operator add-on with the `OVERWRITE` flag
7. Clean up old Helm resources
8. Migrate the Helm-installed dependencies that came from the Inference Operator main chart to add-ons. To skip this step, provide the `--skip-dependencies` flag.

**Benefits**

  • Simplified management through EKS console/APIs
  • Automated updates via EKS add-on mechanisms
  • Native EKS integration
  • Zero downtime migration with rollback safety

**Conclusion**

The streamlined Inference Operator installation experience for Amazon SageMaker HyperPod eliminates infrastructure complexity and accelerates time to value for machine learning teams. With one-click installation, automated resource management, and seamless upgrade capabilities, teams can focus on deploying and optimizing their inference workloads rather than managing underlying infrastructure.

The EKS Add-on integration provides enterprise-grade lifecycle management while maintaining the flexibility to customize configurations for specific organizational requirements. Combined with advanced features like managed tiered KV cache and intelligent routing, this simplified installation experience makes high-performance inference deployment accessible to teams of all sizes.

Get started today by creating a new HyperPod cluster with the Inference Operator pre-installed, or add it to your existing clusters with a single click through the SageMaker console. For detailed add-on installation instructions and configuration options, see the installation guide; for troubleshooting, see the troubleshooting guide.

