How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway

In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.
While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.
In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.
The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.
Table of Contents
1. Prerequisites
2. Building the Brain: The Model
3. Deploying the Model to AWS
4. How to Run The Project Locally
5. Our Project Architecture
6. Conclusion: The Power of Serverless AI
7. Acknowledgment / References
1. Prerequisites
1. **Fundamental skills:** Basic proficiency in Python and understanding of Machine Learning concepts like classification.
2. **AWS account:** Access to an AWS account with permissions for Lambda, S3, and API Gateway.
3. **Environment:** Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.
4. **AWS CLI:** Configured on your local machine for file uploads.
5. **Hugging Face account (optional):** You can download the pre-trained model directly from my account instead of training it yourself.
2. Building the Brain: The Model

_Photo by Steve A Johnson on Unsplash_
At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.
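The snippets below reference X_train, X_test, Y_train, and Y_test without showing how they were created. Here's a minimal setup sketch; the CSV file and column names are placeholders for whatever spam dataset you use, and the labels follow the ham = 1 / spam = 0 encoding implied by the prediction logic later in the Lambda code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset layout: a CSV with a 'Category' (spam/ham) and a 'Message' column
df = pd.read_csv('mail_data.csv')

# Encode the labels to match the mapping used later: spam -> 0, ham -> 1
df['label'] = df['Category'].map({'spam': 0, 'ham': 1})

# Hold out 20% of the emails for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    df['Message'], df['label'], test_size=0.2, random_state=3
)
```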
1. Vectorization: Turning Text into Math
Machine Learning models can't **read** text. They require numerical input. To solve this, we used the TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
```

Here's the mathematical formula:

wᵢ,ⱼ = tfᵢ,ⱼ × log(N / dfᵢ)

TF-IDF term definitions:
- **wᵢ,ⱼ (Weight):** The final importance score of a specific word in a document.
- **tfᵢ,ⱼ (Term Frequency):** How often a word appears in a single email.
- **N (Total Documents):** The total count of all emails in your dataset.
- **dfᵢ (Document Frequency):** The number of different emails that contain this specific word.
- **log(N/dfᵢ) (IDF):** A penalty that lowers the score of common words like **the** or **is** that appear everywhere.
It cleans the data by removing common words, converts all text to lowercase for consistency, and assigns more importance to rare and meaningful words while giving less importance to frequently used words.
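To see this weighting in action, here's a small self-contained sketch (the toy messages are invented for illustration) showing that a word confined to one message earns a higher IDF score than a word shared across messages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'free' appears in two messages, 'offer' in only one
docs = [
    "win a free prize offer",
    "see you at lunch",
    "free tickets for you",
]
vec = TfidfVectorizer(stop_words='english', lowercase=True)
vec.fit(docs)

for word, idf in zip(vec.get_feature_names_out(), vec.idf_):
    print(f"{word}: idf = {idf:.2f}")
# 'free' (common) prints a lower idf than rare words like 'offer' or 'lunch'
```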
2. Training: The Logistic Regression Engine
We'll use **Logistic Regression** here, a classification algorithm that predicts the probability of an outcome.
In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the **Spam** or **Ham** label.
During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like winner or free correlate highly with spam, while conversational language correlates with legitimate messages.
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)
```

In our case, it calculates the probability that an email belongs to spam or ham.
The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.
σ(z) = 1 / (1 + e⁻ᶻ), where z = β₀ + β₁x₁ + … + βₙxₙ.
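As a quick numeric illustration (the z values here are invented), a strongly negative score maps to a probability near 0 and a strongly positive one to a probability near 1:

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score onto the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(-4.0))  # ~0.018 -> very likely one class
print(sigmoid(0.0))   # 0.5    -> undecided
print(sigmoid(4.0))   # ~0.982 -> very likely the other class
```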
3. Evaluation: Testing the Intelligence
After training, we need to verify if the brain actually works on data it hasn't seen before.
```python
from sklearn.metrics import accuracy_score

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
```

By comparing the model's predictions against the actual labels in our test set, we calculate an Accuracy Score. This gives us confidence that the model is ready for the real world (achieving ~94% accuracy in our tests).
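Accuracy alone can be misleading on imbalanced spam datasets, so as an optional extra check (not part of the original pipeline), you can also print per-class metrics; this assumes the spam = 0 / ham = 1 encoding used elsewhere in the project:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision and recall matter because ham usually outnumbers spam
print(confusion_matrix(Y_test, prediction_on_test_data))
print(classification_report(Y_test, prediction_on_test_data,
                            target_names=['spam', 'ham']))
```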
4. Exporting the Logic (Serialization)
To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).
```python
import joblib

joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')
```

We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.
We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.
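Before moving to the cloud, it's worth confirming that the key fits the lock; here's a minimal local sanity check that reloads both files and classifies a made-up message:

```python
import joblib

# Re-animate both halves of the 'brain' from disk
loaded_model = joblib.load('spam_model.pkl')
loaded_vectorizer = joblib.load('vectorizer.pkl')

# New text must pass through the SAME vectorizer the model was trained with
sample = ["Congratulations! You have won a free iPhone, claim now"]
features = loaded_vectorizer.transform(sample)
print("HAM" if loaded_model.predict(features)[0] == 1 else "SPAM")
```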
The trained Logistic Regression model and TF-IDF vectorizer are openly available for the community on Hugging Face here: Get the model on HuggingFace.
3. Deploying the Model to AWS
Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and incurs nearly no maintenance costs.
1. Model Storage: Amazon S3
First, we'll upload our .pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. This makes the system highly maintainable.
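The upload can be done with the AWS CLI or, equivalently, with a short boto3 sketch like the one below (the bucket name is a placeholder and should match the BUCKET_NAME used by the Lambda function later):

```python
import boto3

s3 = boto3.client('s3')

# Placeholder bucket name: replace with your own
for filename in ('spam_model.pkl', 'vectorizer.pkl'):
    s3.upload_file(filename, 'YOUR-BUCKET-NAME', filename)
```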
2. The Production Backend: AWS Lambda
To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.
The deployment environment is AWS Lambda (Python 3.11). Since Lambda is a lightweight environment, it doesn't include Scikit-Learn or Joblib. To provide these, we'll package them as a ZIP, store it in our S3 bucket, and attach it to the function as a Lambda layer.
**Commands in AWS CLI:**
```bash
# 1. Create a workspace
mkdir ml_layer && cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
  --platform manylinux2014_x86_64 \
  --target=python/lib/python3.11/site-packages \
  --implementation cp \
  --python-version 3.11 \
  --only-binary=:all: \
  scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/
```

We store the Scikit-Learn library as a ZIP in S3 to bypass the AWS Lambda deployment package size limit. This allows the function to dynamically load heavy dependencies only when needed without bloating the core code.
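To turn that S3 ZIP into an actual layer, you can point the console's Create layer screen at the S3 object, or publish it programmatically; a boto3 sketch (the layer name is hypothetical):

```python
import boto3

lambda_client = boto3.client('lambda')

# Publish the S3-hosted ZIP as a Lambda layer (bucket/key/name are placeholders)
response = lambda_client.publish_layer_version(
    LayerName='sklearn-joblib-layer',
    Content={'S3Bucket': 'YOUR-BUCKET-NAME', 'S3Key': 'sklearn_lib.zip'},
    CompatibleRuntimes=['python3.11'],
)
print(response['LayerVersionArn'])  # attach this ARN to your Lambda function
```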
**The Lambda Function:**
```python
import json
import boto3
import os
import sys
from io import BytesIO

# Ensure the custom Lambda layer (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME'
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping the model in RAM)
model = None
vectorizer = None


def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))

            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")


def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()

        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)

        text = body.get('text', '')
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])

        # 2. Predict using the Logistic Regression Model
        prediction = int(model.predict(data_vec)[0])

        # 3. Map the numeric result to a human-readable label
        result_label = "HAM" if prediction == 1 else "SPAM"

        # Response with CORS headers
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*'  # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }
```

Key features of the Lambda function:
1. **Warm start caching:** By defining the model and vectorizer variables outside the lambda_handler, we store them in the container's memory. This significantly reduces cold start latency for subsequent requests.
2. **Dynamic dependency loading:** The **sys.path.append('/opt/python')** line allows us to import heavy libraries from S3/Layers without exceeding the upload limit.
3. **Bimodal input handling:** The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.
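You can verify this bimodal handling locally before deploying; here's a small smoke test, assuming the handler above is saved as lambda_function.py (note that load_model() still reaches out to S3, so valid AWS credentials are required):

```python
import json

# Assumes the handler above is saved as lambda_function.py
from lambda_function import lambda_handler

# Shape 1: direct console test event (the payload is already a dict)
print(lambda_handler({'text': 'WIN a free iPhone now'}, None))

# Shape 2: API Gateway proxy event (the body arrives as a JSON string)
print(lambda_handler({'body': json.dumps({'text': 'Lunch at noon?'})}, None))
```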
3. The API Gateway: The Bridge to the Web

#### Creating the REST API
Next we'll create a REST API with a single POST method. Why POST, you might be wondering? Well, we need to securely send a JSON payload containing the user’s text message to our model.
1. First navigate to the Amazon API Gateway console and select Create API -> REST API.
2. Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.
3. Then, in the left sidebar, click Resources, create a resource, and enter a resource name (e.g. **/predict**, as I used).
4. Next, create a method, select POST, and choose Lambda Function as the integration type.
5. Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).
**The CORS Configuration (The Troubleshooting Hub)**
This is where many developers encounter the dreaded **Connection Error**. Since our API is hosted on AWS while your front-end lives on a separate website, the browser's Same-Origin Policy will block the request by default.
To fix this, we'll enable **CORS:**
1. **Access-Control-Allow-Origin:** Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.
2. **The OPTIONS method:** API Gateway creates an OPTIONS method automatically. This handles the Preflight request where the browser asks, “Are you allowed to receive data from me?” before sending the actual text.
3. **Access-Control-Allow-Headers:** In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the API Gateway doesn't reject it.

Image illustrates the CORS configuration for our project. (Image by author)
#### Deployment Stages
Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure: `https://[api-id].execute-api.[region].amazonaws.com/prod/predict`.
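Before wiring up the front-end, you can sanity-check the live endpoint with a few lines of standard-library Python (the URL below is a placeholder; paste your real Invoke URL):

```python
import json
import urllib.request

# Placeholder: replace with the Invoke URL shown in the API Gateway console
url = 'https://YOUR-API-ID.execute-api.YOUR-REGION.amazonaws.com/prod/predict'

payload = json.dumps({'text': 'You have won a $1000 gift card!'}).encode('utf-8')
req = urllib.request.Request(
    url, data=payload, headers={'Content-Type': 'application/json'}
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
    # e.g. {'status': 'success', 'classification': 'SPAM', ...}
```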
#### Connecting the Frontend (The JavaScript Layer)
With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the **Analyze** button on your site.
```javascript
async function checkSpam() {
  const message = document.getElementById("userInput").value;
  const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

  try {
    const response = await fetch(apiUrl, {
      method: "POST",
      headers: {
        "Content-Type": "application/json"
      },
      body: JSON.stringify({ "text": message })
    });

    const data = await response.json();

    // Display result on the webpage
    const resultElement = document.getElementById("result");
    resultElement.innerText = `Prediction: ${data.classification}`;
    resultElement.style.color = data.classification === "SPAM" ? "red" : "green";
  } catch (error) {
    console.error("Error:", error);
    alert("Could not connect to the Spam Detector API.");
  }
}
```

4. How to Run The Project Locally
You can store the front-end as an HTML file. Once it's ready, you shouldn't just double-click the .html file: opening it directly as a **file** in your browser can trigger security restrictions. Instead, host it using a simple local server.
**Step 1:** Open the terminal or Command Prompt.
**Step 2:** Navigate to your project folder
```bash
cd [PATH_TO_YOUR_FOLDER]
```

**Step 3:** Start a local Python web server.

```bash
python -m http.server 8000
```

**Step 4:** Access the application.
Open your browser and navigate to:
http://localhost:8000/your-file-name.html
**Watch the Demo:**
5. Our Project Architecture

The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)
1. **Client Front-End Interaction:** The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like **WIN free iPhone now** and trigger a request.
2. **The Entry Point: API Gateway:** The request hits the Amazon API Gateway, which acts as the **security guard** and translator.
**(a)** CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.
**(b)** Classification Request (POST) routes the actual message data to your backend logic.
3. **The Engine: AWS Lambda (Python 3.11):** The central "**lightbulb**" represents your Lambda function. This is where the code you wrote lives. It doesn't run 24/7; it only wakes up when a request arrives.
4. **Storage & Retrieval: S3 Bucket:** Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.
**Dependency and Model Download:** The function reaches out to the S3 Bucket to pull in the sklearn_lib.zip (the engine) and the .pkl files (the intelligence).
**Required Dependency and Model:** These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.
5. **The Inference Pipeline:** Inside the Lambda, a three-step mathematical cycle occurs:
**(a) Text Vectorizer:** Translates the words into numbers.
**(b) Logistic Regression:** Calculates the probability of spam based on those numbers.
**(c) Label:** Assigns a final result (Spam or Ham).
6. **The Result Delivery:** The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “**Result: SPAM**” with a visual indicator.
6. Conclusion: The Power of Serverless AI
By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.
This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.
Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.
7. Acknowledgment / References
- Pre-trained spam classification model: rakshath1/mail-spam-detector on Hugging Face
- Scikit-learn Documentation
- AWS Lambda Documentation
- Amazon S3 Documentation
- Amazon API Gateway Documentation