Elastic Agent Builder: When Your Agent's Tools Fight Back
TL;DR · AI Summary
Elastic demonstrated security vulnerabilities in its Agent Builder tool during a hackathon, highlighting the necessity of developing secure and reliable intelligent agent systems.
Key Takeaways
- Security vulnerabilities in Agent Builder tool
- Hack attack caused the intelligent agent system to fail
- Need to strengthen the security of intelligent agent systems
Outline
Jump quickly between sections.
Introduces Elastic’s Agent Builder tool and its application during a hackathon.
Describes the functions and purposes of the Agent Builder tool.
Details the specific process and results of the hack attack.
Analyzes the security vulnerabilities in the Agent Builder tool.
Discusses the impact of these vulnerabilities on intelligent agent systems.
Proposes measures to improve the security of intelligent agent systems.
Mindmap
See how the topics connect at a glance.
查看大纲文本(无障碍 / 无 JS 友好)
- Elastic Agent Builder 安全性问题
- 黑客攻击案例
- 攻击过程
- 系统失效
- 安全漏洞分析
- 漏洞类型
- 影响范围
- 解决方案与建议
- 改进措施
- 安全策略
Highlights
Key sentences worth saving and sharing.
The hack attack caused the intelligent agent system to fail, highlighting the security issues of such systems.
These security vulnerabilities in the Agent Builder tool could lead to the manipulation or control of intelligent agent systems.
Development teams need to take measures to ensure the security of intelligent agent systems.
Gauntlet: What happens when your agent's tools fight back | Elastic Blog
New
Forrester Wave Leader, Q2 2025
About usPartnersSupport|ENLogin
[](https://www.elastic.co/)
- Elasticsearch
##### Elasticsearch for...
- ###### Context engineering Get the most relevant context to agents so that they deliver accurate and trusted outcomes
- ###### Vector database Efficiently create, store, and search vector embeddings
- ###### Search powered applications The speed, scale, and flexibility to power modern application experience
- ###### Logs Collect, search, explore, and act on large volumes
- ###### Threat protection Detect, investigate, and remediate cyber threats at scale on real-time data
- ###### Workflows Combine scripted automation with AI reasoning natively in Elasticsearch
##### Elasticsearch components
- ###### Elasticsearch A distributed, RESTful search and analytics engine
- ###### Kibana (Discover, Dashboards) Explore, visualize, and build dashboards using data stored in Elasticsearch
- ###### Elastic Agent Builder Build context-aware agents faster that incorporate all your data and deliver best-in-class relevance.
- ###### AutoOps Easy cluster management with performance recommendations, resource utilization, and cost insights
- ###### Piped query language Simplify workflows and accelerate query response for efficient data processing
- ###### Jina AI search models Jina AI is part of Elastic, bringing best-in-class models for embeddings, rerankers, and URL and doc extraction
##### Deployment options
- ###### Elastic Cloud Serverless Zero operational load so that you can build fasterStart free trial
- ###### Elastic Cloud Hosted Deploy and scale on any cloud in minutes with ultimate controlStart free trial
- ###### Self-managed Elasticsearch Run locally, via Kubernetes, or your own orchestrationDownload
- Solutions
##### Search
- ###### Ecommerce search Improve customers' search experience and drive conversion
- ###### Customer support search Help customers find support information quickly and easily
- ###### Search-driven apps Create engaging apps quickly and easily with Elasticsearch
##### Observability
- ###### Log analytics Centralize and analyze logs using Search AI to detect, investigate, and remediate incidents
- ###### Infrastructure monitoring Monitor, visualize, and analyze the health of your on-premises and cloud infrastructure
- ###### Digital experience monitoring Improve users' experience with real user monitoring (RUM), synthetic testing, and uptime monitoring
- ###### App performance monitoring Monitor, visualize, and analyze the performance and availability of your applications
- ###### AIOps Automatically detect, diagnose, and resolve issues faster with GenAl and ML
- ###### LLM observability Monitor and optimize LLM performance, cost, safety, and reliability
##### Security
- ###### Next-gen SIEM Detect, investigate, and respond to evolving threats with Al-driven security analytics
- ###### Workflows for security Automate alert triage, enrichment, and response natively. No separate SOAR required.
- ###### XDR and endpoint security Secure your endpoints, clouds, and containers with AI-driven insights
- ###### AI for security Automate your triage, investigation, and response workflows with Search AI
- Enterprise
##### Why Elastic?
##### Industry
Financial servicesManufacturingPublic sectorRetailTelecommunicationsView all industries
##### Better together
- ###### Cloud providers Deploy with your favorite cloud marketplace: AWS, Azure, or Google Cloud
- ###### Elastic AI Ecosystem Use Elastic with built-in integrations with leading Al technology providers
- ###### Search AI Partner Program Partner with Elastic so we can find the answers, together
##### Accolades
- ###### AV-Comparatives Elastic earns Endpoint Prevention and Response Certification from AV-Comparatives
- ###### Forrester Wave™ Leader A Leader in The Forrester Wave™: Security Analytics Platforms, Q2 2025
- ###### Gartner Magic Quadrant Leader A Leader in 2025 Gartner® Magic Quadrant™ for Observability Platforms
- ###### IDC MarketScape Leader Leader in IDC MarketScape: Worldwide SIEM for Enterprise 2024
##### Customers
[Search Docusign powers millions of e-signature searches daily with Elasticsearch](https://www.elastic.co/customers/docusign)
[Security UOL slashes incident resolution time by 80% with Elastic Security](https://www.elastic.co/customers/uol)
[Observability Pepsi boosts efficiency and reduces MTTR by 30% with Elastic Observability](https://www.elastic.co/customers/pepsico)
- Resources
##### Launch
- ###### Get started Follow along with beginner guides for each solution
- ###### Demo gallery Play in our hands-on sandbox and watch how-to videos
- ###### Downloads Download Elasticsearch now to get started for free
- ###### Integrations Easily connect Elasticsearch to all the systems that matter
##### Learn
- ###### Docs Learn how to use all of Elastic's products and features
- ###### Elasticsearch Labs Learn how to build with the latest features and abilities
- ###### Elastic Security Labs Understand the threat horizon and see the latest research
- ###### Elastic Observability Labs Explore what's next in monitoring and metric trends
- ###### Blog Read all of the latest company news from Elastic's blog
##### Connect
- ###### Community Join our community of developers on Slack, GitHub, and more
- ###### Events Attend your local meetups, workshops, and Elastic{ON}
- ###### Webinars Check out Elastic webinars and learn directly from our experts
- ###### Discuss Share tips, ask questions, and learn from other developers
##### Get help
- ###### Training Learn Elastic for free and expand your skills with our courses
- ###### Support Get expert advice on your Elasticsearch deployments for fast resolution
- ###### Consulting Drive success with custom support and consulting services
Search
Table of Contents
Table of contents
- Close
Gauntlet: What happens when your agent's tools fight back
Elasticsearch Agent Builder Hackathon
By
May 13, 2026
.png)
- )Share on Twitter
- )Share on LinkedIn
- )Share on Facebook
- )Share by Email
- )Print
With two days left before the hackathon deadline, I made the decision to step back and rethink my approach from scratch.
The original idea was called Rehearse: an agent that rehearses actions in a sandbox mocked by another agent before executing them in the real world. The concept was sound, but the flaw was obvious in hindsight. The environment can change between rehearsal and execution. Your agent rehearses sending an email, but by the time it actually runs, the inbox looks different. Simulation diverges from reality, and the whole thing falls apart.
But one class of problems doesn't have this issue: adversarial fuzz-testing. If your agent fails in simulation, it can fail in real life too. That's how _Gauntlet_ was born — 48 hours before the deadline and reusing the same core insight (an agent that uses search to build memory and stay creative) pointed at a problem where stochasticity doesn't matter.
Gauntlet
#### Watch the Gauntlet demo
Test what happens when your agent's tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time.
The problem with testing agents on the happy path
Most of us have heard of OpenClaw, the personal AI assistant that went viral. If you've followed the discourse around agentic AI assistants with broad tool access, you've seen the security concerns. Agents forget what they're not supposed to do or never knew in the first place. The reason is straightforward: We test the happy path. We check that the agent does what it should. We rarely check what happens when someone tries to make it do what it shouldn't.
Adversarial testing sandboxes exist, but they're painful to build. You design attack vectors manually. You seed adversarial data by hand. You configure test infrastructure for each scenario. It's slow, it doesn't scale, and it only finds the bugs you already thought of.
I wanted something different: a system where the _environment itself_ is automatically adversarial and gets more creative over time.
The idea: Mock the sandbox with another agent
Instead of building a sandbox, Gauntlet uses a mocking agent that intercepts your primary agent's tool calls and finds creative ways to break it. When your agent calls search_emails, the mocking agent sees the result and decides whether to mutate it, injecting a prompt injection into an email body, returning subtly wrong data, or feeding false information to see if the primary agent catches it. The primary agent never knows it's in a simulation.
The interface is two decorators:
@function_tool
@gauntlet.query
def search_emails(folder: str = "inbox") -> str:
"""Search emails in the given folder."""
return json.dumps(fetch_emails(folder))Copy to clipboard Copy to clipboard
There is @gauntlet.query for read operations and @gauntlet.mutation for writes. That's the entire integration surface. When the run finishes, evaluate() reviews what happened and stores confirmed bugs.
It’s simple to use, but there two hard problems that hide underneath.
The two problems that make this a search problem
First, the mocking agent needs to maintain a coherent model of the world throughout the conversation. If it told the primary agent that an email was from Alice, it can't later contradict that. A mutated email that's obviously fake teaches you nothing. Plausibility is the whole game.
Second, the mocking agent needs to find _novel_ bugs. Rediscovering the same prompt injection pattern 50 times isn't useful. It needs to remember what it has already found and explore in new directions while staying grounded in what the tools actually do.
Both of these are search problems. And that's where Elasticsearch becomes the backbone of the system.
Two memory circuits
The mocking agent runs on two memory circuits, both living in Elasticsearch.
Short-term memory tracks everything within the current session: every tool call intercepted, the original result, what it was mutated to, and what the primary agent did in response. This is the coherence layer. The mocking agent can query its own recent decisions and stay internally consistent while still being adversarial. Balancing creativity with coherence was the hardest design problem in the entire project.
Long-term memory is where the creativity compounds. It stores confirmed bugs with embeddings for similarity search, full tool implementations so the agent can reason about failure modes, and historical results from past runs. When the mocking agent needs a new attack idea, it searches long-term memory for what's been tried before, finds gaps, and hypothesizes something new.
These feed into a closed cycle: hypothesize what bugs might exist, create circumstances to prove them, and store confirmed bugs back into the index. The inventory grows. The attacks get more creative. The gap between Gauntlet and manual sandbox setup widens over time.
Everything runs inside Elastic Agent Builder
The entire mocking agent is built inside Elastic Agent Builder — instructions, tool bindings, and multi-turn conversation state via the Amazon Bedrock Converse API; no external orchestration needed.
The tool I'm most proud of is generate-hypothesis. It's a single ES|QL statement that samples existing bugs, aggregates them with MV_CONCAT, and calls COMPLETION inline to propose a novel attack hypothesis. It handles sampling, aggregation, LLM reasoning, and result generation all in one query, never leaving the ES|QL pipeline. I went in expecting I'd need to shuttle data between Elasticsearch and an external script. I didn't.
ES|QL's COMPLETION function was the biggest surprise. Between COMPLETION, STATS, MV_CONCAT, and SAMPLE, I could build entire reasoning pipelines as single queries. Bug storage uses Kibana Workflows, and a programmatically created Kibana Dashboard gives real-time visibility into bug counts, severity breakdowns, and attack pattern heatmaps.
The Converse API solved another problem I'd been dreading. The mocking agent needs to remember what it's already told the primary agent within a single run. I assumed I'd have to fetch conversation histories from indices and reload them into the agent on every call. But it turns out that the Converse API handles multi-turn state natively. I didn't write any conversation management logic. Just keep calling converse, and it stays coherent.
What this actually buys you
Manual adversarial sandbox setup takes roughly an hour per scenario. With Gauntlet, the same process takes 2–10 minutes, and its long-term memory means each run is informed by every previous run. The more you use it, the more it learns about your agent’s weak points and the harder it tries to find new ones.
What's next?
Right now Gauntlet is a 1v1: one mocking agent versus one primary agent. But the problem is embarrassingly parallel. 20 attack sessions could run simultaneously on separate sessions without any architectural changes. Scaling is the obvious next step.
The more interesting open question is exploration versus exploitation in the long-term memory. The mocking agent needs to balance trying variations of known successful attacks (exploitation) against completely novel hypotheses (exploration). This is a well-studied problem in other domains, but applying it to adversarial agent testing feels unexplored. There might be something worth pursuing beyond this project entirely.
I also keep thinking about Rehearse. Gauntlet is a special case: fuzz-testing works because failure in simulation implies possible failure in reality. But there are other domains where the environment is stable enough between rehearsal and execution that the original Rehearse concept could work. I haven't found them yet, but I'm looking.
The takeaway
If you're building agents with access to real-world tools, test what happens when those tools fight back. Not just once manually, but continuously, with a system that remembers what it's tried and gets more creative over time. That's what Gauntlet does.
[Kavish Sathia](https://www.elastic.co/blog/author/kavish-sathia)
Student, National University of Singapore
_Kavish Sathia is a computer science student at NUS working on agentic systems._
GitHub_·_Demo_·_Website_·_LinkedIn
_The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all._
_In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use._
_Elastic, Elasticsearch, and associated marks are trademarks, logos or registered trademarks of elasticsearch B.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners._
Share
- )Share on Twitter
- )Share on LinkedIn
- )Share on Facebook
- )Share by Email
- )Print
Sign up for Elastic Cloud free trial
Spin up a fully loaded deployment on the cloud provider you choose. As the company behind Elasticsearch, we bring our features and support to your Elastic clusters in the cloud.
Follow us
- 
- 
- 
- 
- 
- About us About ElasticLeadershipBlogNewsroom
- Join us CareersCareer portalHow we hire
- Partners Find a partnerPartner loginRequest accessBecome a partner
- Trust & Security LegalTrust centerPrivacyTrade ComplianceEthics & Compliance
- Investor relations Investor resourcesGovernanceFinancialsStock
- Excellence Awards Previous winnersElastic{ON} TourBecome a sponsorAll events
About us
Join us
Partners
Trust & Security
Investor relations
Excellence Awards
© 2026. elasticsearch B.V. All Rights Reserved
This website and all associated content, software, discussion forums, products, and services are intended for professional use only. No consumer use of this website or its content is intended or directed.
Elastic, Elasticsearch, and other related marks are trademarks, logos, or registered trademarks of elasticsearch B.V. in the United States and other countries.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant logo are trademarks of the Apache Software Foundation in the United States and/or other countries. All other brand names, product names, or trademarks belong to their respective owners.
Notice at Collection | Your Privacy Choices
