T
traeai
Sign in
返回首页
Elastic Blog

Elastic Agent Builder: When Your Agent's Tools Fight Back

7.0Score
Elastic Agent Builder: When Your Agent's Tools Fight Back

TL;DR · AI Summary

Elastic demonstrated security vulnerabilities in its Agent Builder tool during a hackathon, highlighting the necessity of developing secure and reliable intelligent agent systems.

Key Takeaways

  • Security vulnerabilities in Agent Builder tool
  • Hack attack caused the intelligent agent system to fail
  • Need to strengthen the security of intelligent agent systems

Outline

Jump quickly between sections.

  1. Introduces Elastic’s Agent Builder tool and its application during a hackathon.

  2. Describes the functions and purposes of the Agent Builder tool.

  3. Details the specific process and results of the hack attack.

  4. Analyzes the security vulnerabilities in the Agent Builder tool.

  5. Discusses the impact of these vulnerabilities on intelligent agent systems.

  6. Proposes measures to improve the security of intelligent agent systems.

Mindmap

See how the topics connect at a glance.

查看大纲文本(无障碍 / 无 JS 友好)
  • Elastic Agent Builder 安全性问题
    • 黑客攻击案例
      • 攻击过程
      • 系统失效
    • 安全漏洞分析
      • 漏洞类型
      • 影响范围
    • 解决方案与建议
      • 改进措施
      • 安全策略

Highlights

Key sentences worth saving and sharing.

  • The hack attack caused the intelligent agent system to fail, highlighting the security issues of such systems.

    Paragraph 3

    ⬇︎ 下载 PNG𝕏 分享到 X
  • These security vulnerabilities in the Agent Builder tool could lead to the manipulation or control of intelligent agent systems.

    Paragraph 4

    ⬇︎ 下载 PNG𝕏 分享到 X
  • Development teams need to take measures to ensure the security of intelligent agent systems.

    Paragraph 6

    ⬇︎ 下载 PNG𝕏 分享到 X
#智能代理#Agent Builder#Elastic
Open original article

Gauntlet: What happens when your agent's tools fight back | Elastic Blog

Skip to main content

New

Forrester Wave Leader, Q2 2025

Access report

About usPartnersSupport|ENLogin

[](https://www.elastic.co/)

  • Elasticsearch

##### Elasticsearch for...

##### Elasticsearch components

##### Deployment options

  • Solutions

##### Search

Overview

##### Observability

Overview

##### Security

Overview

  • Enterprise

##### Why Elastic?

Knowledge Hub

##### Industry

Financial servicesManufacturingPublic sectorRetailTelecommunicationsView all industries

##### Better together

##### Accolades

##### Customers

View all customers stories

Image 2: logo for Docusign

[Search Docusign powers millions of e-signature searches daily with Elasticsearch](https://www.elastic.co/customers/docusign)

Image 3: logo for UOL

[Security UOL slashes incident resolution time by 80% with Elastic Security](https://www.elastic.co/customers/uol)

Image 4: logo for PepsiCo

[Observability Pepsi boosts efficiency and reduces MTTR by 30% with Elastic Observability](https://www.elastic.co/customers/pepsico)

  • Resources

##### Launch

##### Learn

##### Connect

##### Get help

PricingDocs

Search

Start free trialContact sales

Blog

Company

* Solutions

* Stack + Cloud

* News

* Customers

* Generative AI

* Culture

Elasticsearch Labs

* Blogs

* Tutorials

* Examples

* Integrations

Security Labs

* Blogs

* Reports

* Tools

Observability Labs

* Blogs

Image 5: Blog feed

Table of Contents

Table of contentsImage 6: icon-toc-16-blue.svg

  • Close

Gauntlet: What happens when your agent's tools fight back

Elasticsearch Agent Builder Hackathon

By

Kavish Sathia

May 13, 2026

Image 7: gauntlet-blog_(1).png.png)

With two days left before the hackathon deadline, I made the decision to step back and rethink my approach from scratch.

The original idea was called Rehearse: an agent that rehearses actions in a sandbox mocked by another agent before executing them in the real world. The concept was sound, but the flaw was obvious in hindsight. The environment can change between rehearsal and execution. Your agent rehearses sending an email, but by the time it actually runs, the inbox looks different. Simulation diverges from reality, and the whole thing falls apart.

But one class of problems doesn't have this issue: adversarial fuzz-testing. If your agent fails in simulation, it can fail in real life too. That's how _Gauntlet_ was born — 48 hours before the deadline and reusing the same core insight (an agent that uses search to build memory and stay creative) pointed at a problem where stochasticity doesn't matter.

Gauntlet

#### Watch the Gauntlet demo

Test what happens when your agent's tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time.

Watch the demo here

The problem with testing agents on the happy path

Most of us have heard of OpenClaw, the personal AI assistant that went viral. If you've followed the discourse around agentic AI assistants with broad tool access, you've seen the security concerns. Agents forget what they're not supposed to do or never knew in the first place. The reason is straightforward: We test the happy path. We check that the agent does what it should. We rarely check what happens when someone tries to make it do what it shouldn't.

Adversarial testing sandboxes exist, but they're painful to build. You design attack vectors manually. You seed adversarial data by hand. You configure test infrastructure for each scenario. It's slow, it doesn't scale, and it only finds the bugs you already thought of.

I wanted something different: a system where the _environment itself_ is automatically adversarial and gets more creative over time.

The idea: Mock the sandbox with another agent

Instead of building a sandbox, Gauntlet uses a mocking agent that intercepts your primary agent's tool calls and finds creative ways to break it. When your agent calls search_emails, the mocking agent sees the result and decides whether to mutate it, injecting a prompt injection into an email body, returning subtly wrong data, or feeding false information to see if the primary agent catches it. The primary agent never knows it's in a simulation.

The interface is two decorators:

code
@function_tool
@gauntlet.query
def search_emails(folder: str = "inbox") -> str:
    """Search emails in the given folder."""
    return json.dumps(fetch_emails(folder))

Image 18Copy to clipboard Copy to clipboard

There is @gauntlet.query for read operations and @gauntlet.mutation for writes. That's the entire integration surface. When the run finishes, evaluate() reviews what happened and stores confirmed bugs.

It’s simple to use, but there two hard problems that hide underneath.

The two problems that make this a search problem

First, the mocking agent needs to maintain a coherent model of the world throughout the conversation. If it told the primary agent that an email was from Alice, it can't later contradict that. A mutated email that's obviously fake teaches you nothing. Plausibility is the whole game.

Second, the mocking agent needs to find _novel_ bugs. Rediscovering the same prompt injection pattern 50 times isn't useful. It needs to remember what it has already found and explore in new directions while staying grounded in what the tools actually do.

Both of these are search problems. And that's where Elasticsearch becomes the backbone of the system.

Two memory circuits

The mocking agent runs on two memory circuits, both living in Elasticsearch.

Short-term memory tracks everything within the current session: every tool call intercepted, the original result, what it was mutated to, and what the primary agent did in response. This is the coherence layer. The mocking agent can query its own recent decisions and stay internally consistent while still being adversarial. Balancing creativity with coherence was the hardest design problem in the entire project.

Long-term memory is where the creativity compounds. It stores confirmed bugs with embeddings for similarity search, full tool implementations so the agent can reason about failure modes, and historical results from past runs. When the mocking agent needs a new attack idea, it searches long-term memory for what's been tried before, finds gaps, and hypothesizes something new.

These feed into a closed cycle: hypothesize what bugs might exist, create circumstances to prove them, and store confirmed bugs back into the index. The inventory grows. The attacks get more creative. The gap between Gauntlet and manual sandbox setup widens over time.

Everything runs inside Elastic Agent Builder

The entire mocking agent is built inside Elastic Agent Builder — instructions, tool bindings, and multi-turn conversation state via the Amazon Bedrock Converse API; no external orchestration needed.

The tool I'm most proud of is generate-hypothesis. It's a single ES|QL statement that samples existing bugs, aggregates them with MV_CONCAT, and calls COMPLETION inline to propose a novel attack hypothesis. It handles sampling, aggregation, LLM reasoning, and result generation all in one query, never leaving the ES|QL pipeline. I went in expecting I'd need to shuttle data between Elasticsearch and an external script. I didn't.

ES|QL's COMPLETION function was the biggest surprise. Between COMPLETION, STATS, MV_CONCAT, and SAMPLE, I could build entire reasoning pipelines as single queries. Bug storage uses Kibana Workflows, and a programmatically created Kibana Dashboard gives real-time visibility into bug counts, severity breakdowns, and attack pattern heatmaps.

The Converse API solved another problem I'd been dreading. The mocking agent needs to remember what it's already told the primary agent within a single run. I assumed I'd have to fetch conversation histories from indices and reload them into the agent on every call. But it turns out that the Converse API handles multi-turn state natively. I didn't write any conversation management logic. Just keep calling converse, and it stays coherent.

What this actually buys you

Manual adversarial sandbox setup takes roughly an hour per scenario. With Gauntlet, the same process takes 2–10 minutes, and its long-term memory means each run is informed by every previous run. The more you use it, the more it learns about your agent’s weak points and the harder it tries to find new ones.

What's next?

Right now Gauntlet is a 1v1: one mocking agent versus one primary agent. But the problem is embarrassingly parallel. 20 attack sessions could run simultaneously on separate sessions without any architectural changes. Scaling is the obvious next step.

The more interesting open question is exploration versus exploitation in the long-term memory. The mocking agent needs to balance trying variations of known successful attacks (exploitation) against completely novel hypotheses (exploration). This is a well-studied problem in other domains, but applying it to adversarial agent testing feels unexplored. There might be something worth pursuing beyond this project entirely.

I also keep thinking about Rehearse. Gauntlet is a special case: fuzz-testing works because failure in simulation implies possible failure in reality. But there are other domains where the environment is stable enough between rehearsal and execution that the original Rehearse concept could work. I haven't found them yet, but I'm looking.

The takeaway

If you're building agents with access to real-world tools, test what happens when those tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time. That's what Gauntlet does.

[Kavish Sathia](https://www.elastic.co/blog/author/kavish-sathia)

Student, National University of Singapore

_Kavish Sathia is a computer science student at NUS working on agentic systems._

GitHub_·_Demo_·_Website_·_LinkedIn

_The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all._

_In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use._

_Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners._

Share

Sign up for Elastic Cloud free trial

Spin up a fully loaded deployment on the cloud provider you choose. As the company behind Elasticsearch, we bring our features and support to your Elastic clusters in the cloud.

Start free trial

Image 29: Elastic The Search AI Company

Follow us

About us

Join us

Partners

Trust & Security

Investor relations

Excellence Awards

© 2026. elasticsearch B.V. All Rights Reserved

This website and all associated content, software, discussion forums, products, and services are intended for professional use only. No consumer use of this website or its content is intended or directed.

Elastic, Elasticsearch, and other related marks are trademarks, logos, or registered trademarks of elasticsearch B.V. in the United States and other countries.

Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant logo are trademarks of the Apache Software Foundation in the United States and/or other countries. All other brand names, product names, or trademarks belong to their respective owners.

Notice at Collection | Your Privacy Choices![Image 35: California Consumer Privacy Act (CCPA) Opt-Out Icon](blob:http://localhost/ef7e8ea9ad85f0635b74ccfdf73c32f1)

Image 37Image 38

Image 39
Image 40

AI may generate inaccurate information. Please verify important content.