---
title: "Using Bag-of-Words With PyCharm"
source_name: "The JetBrains Blog"
original_url: "https://blog.jetbrains.com/pycharm/2026/04/using-bag-of-words-with-pycharm/"
canonical_url: "https://www.traeai.com/articles/ef53646c-603b-467a-a7e9-3be5fe054bcb"
content_type: "article"
language: "English"
score: 8
tags: ["NLP","PyCharm","Bag-of-Words"]
published_at: "2026-04-29T17:42:41+00:00"
created_at: "2026-04-30T02:38:59.938714+00:00"
---

# Using Bag-of-Words With PyCharm

Canonical URL: https://www.traeai.com/articles/ef53646c-603b-467a-a7e9-3be5fe054bcb
Original source: https://blog.jetbrains.com/pycharm/2026/04/using-bag-of-words-with-pycharm/

## Summary

This article walks through building a bag-of-words text classification project in PyCharm, explaining how the bag-of-words model works and where it fits within NLP.

## Key Takeaways

- The bag-of-words model represents text by converting it into numerical vectors of word counts.
- Although bag-of-words preserves neither grammar nor word order, it remains highly effective for many tasks.
- PyCharm provides specific features that make implementing bag-of-words models faster and more convenient.

## Content


## Using Bag-of-Words With PyCharm

![Image 2: Jodie Burchell](https://blog.jetbrains.com/wp-content/uploads/2022/11/BK7A9876_korr_sRGB_8_1000x1500px_square_resized-200x200.jpg)

April 29, 2026

Have you ever wondered how [machine learning](https://www.jetbrains.com/pycharm/data-science/) models actually work with text? After all, these models require numerical input, but text is, well, text.

Natural language processing (NLP) offers many ways to bridge this gap, from the large language models (LLMs) that are dominating headlines today all the way back to the foundational techniques of the 1950s. Those early methods fall under what we now call the **bag-of-words (BoW) model**, and despite their age, they remain remarkably effective for a wide range of language problems.

In this post, we’ll unpack how the bag-of-words model works, explore the techniques it uses to convert text into numerical representations, and look at where it fits relative to more modern NLP approaches. We’ll also build a text classification project using BoW techniques, and see how PyCharm’s specific features make the whole process faster and easier.

## What is the bag-of-words model?

The bag-of-words model is a text representation technique that converts unstructured text into numerical vectors by tracking which words appear across a corpus (a collection of texts). Rather than preserving grammar or word order, it simply represents each document as a “bag” of its words, recording how often each one appears. The result is a vector of counts that captures what a text is about, even if it discards how that content is expressed.

This apparent limitation turns out to matter less than you might expect. For many tasks, such as text classification and sentiment analysis, the presence of certain words is often a stronger signal than their arrangement, and BoW captures that signal efficiently.

## How does bag-of-words work?

To use the bag-of-words model, we need to convert each text in a corpus into a numerical vector. Let’s walk through how that works, starting with what that vector actually looks like.

Take the following sentence:

> When diving into natural language processing, it is natural for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing negative from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts naturally start to click.

A vector representation of this text using the BoW model might look something like this.

| … | natural | naturally | nausea | near | neared | nearing | necessary | negative | … |
|---|---|---|---|---|---|---|---|---|---|
| … | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | … |

If we think of this vector as a table, each column represents a word in the corpus’s vocabulary, and each cell in the row holds a count of how many times that word occurs in the text, as we can see below:

> When diving into **natural** language processing, it is **natural** for beginners to feel overwhelmed by the complexity of sentiment analysis, which involves distinguishing **negative** from positive text. However, as you practice with libraries like NLTK or spaCy, the concepts **naturally** start to click.

Each column represents a word in the vocabulary; each value records how many times that word appears. Here, “natural” appears twice, while “naturally” and “negative” each appear once.

### Tokenization

Before we can build this vector, we need to split our text into tokens. In BoW modeling, this is typically straightforward: We split on whitespace and separate off punctuation, so “When diving into natural language processing,” becomes seven tokens: `["When", "diving", "into", "natural", "language", "processing", ","]`. This is considerably simpler than the tokenization used in LLMs.

### Vocabulary creation

Applying tokenization across every text in the corpus produces a long list of words. Deduplicating this list gives us our vocabulary, which we can see in the set of columns in the vector above. This process does introduce some noise: “Natural” and “natural”, for instance, would be treated as two separate tokens. We’ll look at preprocessing steps to address this shortly.

### Encoding

With a vocabulary in hand, we create a vector for each text with one element per vocabulary word. Encoding is then the process of filling in those elements by checking each vocabulary word against the text.

The simplest approach is **binary vectorization**: 0 if a word is absent, 1 if present. More common, however, is **count vectorization**, which records the actual number of occurrences, as we saw in the example above. Count vectorization carries more information, since it helps distinguish texts that merely mention a topic from those that focus on it heavily.

One practical consequence of this approach is sparsity. If a corpus contains thousands of unique words, each vector will have thousands of elements, but any individual text will only use a small fraction of them, leaving most values at zero. This signal-to-noise issue is something we’ll return to.
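To make the tokenization, vocabulary creation, and encoding steps concrete, here’s a minimal sketch in plain Python (not from the original post) that tokenizes two tiny documents, builds a vocabulary, and produces count vectors:

```python
from collections import Counter

docs = [
    "natural language processing is natural",
    "processing text is fun",
]

# Tokenization: split each document on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary creation: deduplicate tokens and fix an index for each word
vocab = sorted({token for tokens in tokenized for token in tokens})
word_to_index = {word: i for i, word in enumerate(vocab)}

# Encoding: one count vector per document, one element per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocab])

print(vocab)     # ['fun', 'is', 'language', 'natural', 'processing', 'text']
print(vectors)   # [[0, 1, 1, 2, 1, 0], [1, 1, 0, 0, 1, 1]]
```

Even on this toy corpus, you can see how quickly the vectors fill up with zeros as the vocabulary grows, which is exactly the sparsity issue described above.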

## Advantages of the bag-of-words model

The bag-of-words model has remained a staple in NLP for good reason. Its greatest strength is its simplicity: Because text is represented as a collection of word counts, the approach is easy to understand and straightforward to implement, making it a natural baseline before reaching for more complex architectures.

Beyond simplicity, BoW is computationally efficient. As you saw above, the underlying math is lightweight, which means it scales well to large text collections without demanding significant computing resources. For tasks where the presence of specific words is sufficient to capture meaning, with sentiment analysis and topic categorization being the clearest examples, it remains a highly effective tool.

## Applications of bag-of-words

Like many NLP approaches, the bag-of-words model can be applied to many natural language problems. These potential applications include:

*   **Document classification**, where encoded documents are sorted into predefined categories. A classic example of this is automatically sorting incoming news articles into distinct categories such as sports, politics, or technology, as we’ll see in the project we do in this post.
*   **Sentiment analysis**, where the presence of certain words strongly indicates the overall tone of a text, allows models to easily determine whether a piece of writing expresses a positive, negative, or neutral sentiment. If you’re interested in learning more about BoW and other approaches to sentiment analysis, you can see a [prior blog post](https://blog.jetbrains.com/pycharm/2024/12/introduction-to-sentiment-analysis-in-python/) I wrote on this topic.
*   **Spam detection**, which relies heavily on BoW to identify and filter out unwanted emails or messages by learning to recognize the distinct, high-frequency word patterns characteristic of spam.
*   **Retrieval systems**, where it helps to efficiently find the most relevant documents from an immense corpus based on a user’s search query.
*   **Topic modeling**, which aims to group similar text vectors in order to discover and extract the hidden, latent topics present within a large collection of documents.

As you can see, the number of potential applications is broad, making bag-of-words modeling a popular first approach to natural language problems.

## Why use PyCharm for NLP?

[PyCharm](https://www.jetbrains.com/pycharm/) is particularly well-suited to bag-of-words modeling because it supports the iterative, detail-oriented workflow that text processing requires. As you’ll soon see, building a reliable BoW pipeline involves multiple steps, such as cleaning text, tokenizing, vectorizing, and validating outputs, and PyCharm’s code intelligence makes each of these smoother. Autocompletion, parameter hints, and quick navigation through specialized NLP libraries reduce friction when experimenting with different vectorizer settings, and help you understand how each component behaves.

[Debugging](https://www.jetbrains.com/pycharm/features/debugger.html) and data inspection are equally important here, since small preprocessing mistakes can have an outsized effect on results. PyCharm lets you step through your code and examine intermediate states of things such as token lists and vocabulary at runtime, making it straightforward to verify that your feature extraction is working as intended. This visibility is especially useful when diagnosing issues like unexpected vocabulary sizes or missing terms.

PyCharm also supports exploratory work through its excellent [Jupyter Notebook integration](https://www.jetbrains.com/help/pycharm/jupyter-notebook-support.html) and scientific tooling. BoW modeling often involves trying different preprocessing strategies and observing their effects immediately, so the ability to run code interactively and inspect outputs inline is a genuine advantage. Combined with built-in virtual environment and package management support, this keeps experiments reproducible and well-organized.

As projects grow, PyCharm’s refactoring tools, project navigation, and version control integration help manage the added complexity. BoW models rarely exist in isolation, and they’re often embedded in larger ML pipelines. In such contexts, PyCharm’s features for working with larger applications mean you spend less time managing code and more time improving your models.

### Setting up the project

To see these components in action, let’s build an actual bag-of-words project. We’ll use a classic text classification dataset, the AG News dataset, and build a model to classify news articles into one of four categories: World, Sports, Business, or Science/Technology.

To get started in PyCharm, open the _Projects and Files_ tool window and select _New… > New Project…_. Since this is a data science project, we can use PyCharm’s built-in Jupyter project type, which sets up a sensible default structure for us.

During project configuration, you’ll be asked to choose a Python interpreter. By default, PyCharm uses uv and lets you select from a range of Python versions, though all major dependency management systems are supported: pip, Anaconda, Pipenv, Poetry, and Hatch. Every project is automatically created with an attached virtual environment, so your setup will be ready to go each time you reopen the project.

![Image 3](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-1-selecting-uv-project.png)
With the project configured, we can install our dependencies via the _Python Packages_ tool window. Simply search for a package by name, select it from the list, and install your desired version directly into the virtual environment. You can also see the same information about the package you’d find on PyPI directly within the IDE. For this project, we’ll need pandas and NumPy, along with Hugging Face’s `datasets` library, scikit-learn, PyTorch, and spaCy.

![Image 4](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-2-installing-package.png)
## Implementing bag-of-words with PyCharm

There are many versions of this dataset online. We’ll be using [one of the versions](https://huggingface.co/datasets/sh0416/ag_news) hosted on Hugging Face Hub.

### Loading and preparing the data

We’ll use Hugging Face’s `datasets` package to download this dataset.

```python
from datasets import load_dataset

ag_news_all = load_dataset("sh0416/ag_news")
```

This gives us a Hugging Face `DatasetDict` object. If we look at it, we can see it contains a training dataset with 120,000 news articles, and a test dataset containing 7,600 articles.

```python
ag_news_all
```
```
DatasetDict({
    train: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'title', 'description'],
        num_rows: 7600
    })
})
```

As we’ll be training a model, we also need a validation set. We’ll convert the training and test sets to pandas DataFrames, and use the `train_test_split` method from scikit-learn to create the validation set from the training data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ag_news_train = ag_news_all["train"].to_pandas()
ag_news_test = ag_news_all["test"].to_pandas()

ag_news_train, ag_news_val = train_test_split(
    ag_news_train,
    test_size=0.1,
    random_state=456,
    stratify=ag_news_train['label']
)

print(f"Training set: {len(ag_news_train)} samples")
print(f"Validation set: {len(ag_news_val)} samples")
```

We now have a validation set with 12,000 articles, and a training set with 108,000 articles.

```
Training set: 108000 samples
Validation set: 12000 samples
```

For those of you new to machine learning, you might be wondering why we need all of these different datasets. The reason for this is to make sure we have a good idea that our model will generalize well and perform as expected on unseen data. The training set is the only data the model ever learns from directly. The validation set is used to monitor how the model is performing on unseen data as we make modeling decisions, such as choosing how many epochs to train for, how large to make the hidden layer, or which preprocessing steps to apply (we’ll see all of this later). This means that we look at validation performance repeatedly while building the model, and this increases the risk that our choices gradually become tuned to the quirks of that particular split. This is why we need a third set (the test set), which we keep completely locked away until we’ve finished all modeling decisions and want a single, unbiased estimate of how well our model will perform on new data. Using the test set for anything other than this final evaluation would give us an overly optimistic picture of our model’s real-world performance.

Let’s now inspect our datasets. PyCharm Pro has a lot of built-in features that make working with DataFrames easier, a few of which we’ll see soon. In this DataFrame, we have three columns: the article title, the article description, and the label indicating which of the four news categories the article belongs to. You can open any of the DataFrame cells in the _Value Editor_ to see its full text, or widen the column to prevent truncation, both of which are useful for a quick visual inspection.

![Image 5](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-3-viewing-full-text.png)
At the top of each column, PyCharm displays column statistics, giving you an at-a-glance summary of the data. Switching from _Compact_ to _Detailed_ mode via _Show Column Statistics_ gives you rich summary statistics about each column, and saves you from writing a lot of pandas boilerplate to get it! From these statistics, we can see that our training set is evenly split across the news categories (which is very handy when training a model). We can also see that some headlines and descriptions are not unique, which may introduce noise when classifying these duplicates.

![Image 6](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-4-column-statistics.png)
The first step in preparing the data is basic string cleaning, which normalizes the text and reduces meaningless token variation. For instance, without cleaning, “Natural” and “natural” would be treated as two separate vocabulary entries, as we noted earlier.

We’ll apply four cleaning steps: lowercasing, punctuation removal, number removal, and whitespace normalization. There are different string cleaning steps you can apply depending on the language and use case, but for English-language texts, these tend to be very standard. Let’s go ahead and write a function to do this.

```python
def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    patterns_to_remove = [
        r"[^a-zA-Z\s]",
    ]

    cleaned = dataset.str.lower()

    for pattern in patterns_to_remove:
        cleaned = cleaned.str.replace(pattern, " ", regex=True)

    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

    return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])
```

![Image 7](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-5-raw-and-cleaned-text.png)
This mostly works, but there’s one issue: The regex strips apostrophes entirely, turning contractions like “you’re” into “you re” and possessives like “Canada’s” into “Canada s”. The cleanest fix is a regex that preserves apostrophes in contractions while removing possessive endings, but this is not the most enjoyable thing to write by hand.

This is where PyCharm’s built-in [AI Assistant](https://www.jetbrains.com/pycharm/features/ai/) comes in. Open the chat window via the _AI Chat_ icon on the right-hand side of the IDE and enter the following prompt:

> Can you please alter the `@apply_string_cleaning` function so that it retains apostrophes inside words when they’re used for contractions (e.g., “you’re”), but removes them when they’re used for possessives (e.g., “Canada’s” into “Canada”).

The `@` notation lets you reference specific files or objects in your IDE without copying and pasting code into the prompt, including Jupyter variables like datasets and functions.

![Image 8](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-6-ai-chat.png)
I ran this against Claude Sonnet 4.5, though JetBrains AI supports a wide range of models from OpenAI, Anthropic, Google, and xAI, as well as open models via Ollama, LM Studio, and OpenAI-compatible APIs. Below is the updated function it returned:

```python
def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    cleaned = dataset.str.lower()

    # Remove possessive apostrophes (word's -> word)
    # This pattern matches: letter(s) + 's + word boundary
    cleaned = cleaned.str.replace(r"(\w+)'s\b", r"\1", regex=True)

    # Remove all non-letter characters except apostrophes within words
    cleaned = cleaned.str.replace(r"[^a-zA-Z'\s]", " ", regex=True)

    # Clean up any apostrophes at the start or end of words
    cleaned = cleaned.str.replace(r"\s'|'\s", " ", regex=True)

    # Remove multiple spaces and trim
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

    return cleaned

ag_news_train["title_clean"] = apply_string_cleaning(ag_news_train["title"])
ag_news_train["description_clean"] = apply_string_cleaning(ag_news_train["description"])
```

We can insert this into our Jupyter notebook directly by clicking on _Insert Snippet as Jupyter Cell_ in the AI chat.

![Image 9](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-7-insert-code-as-cell.png)
Once we run this updated function on our raw text, we get the correct result:

| text | text_clean |
|---|---|
| Don’t stand for racism – football chief | don’t stand for racism football chief |
| Canada’s Barrick Gold acquires nine per cent stake in Celtic Resources (Canadian Press) | canada barrick gold acquires nine per cent stake in celtic resources canadian press |

We can see the contraction “don’t” is correctly preserved in the first example, while the possessive ending in “Canada’s” has been removed. We apply this to both the training and validation datasets using the same function, so that the cleaning is consistent across both splits:

ag_news_val["title_clean"] = apply_string_cleaning(ag_news_val["title"])
ag_news_val["description_clean"] = apply_string_cleaning(ag_news_val["description"])
### Creating the bag-of-words model

Now that we have clean text, we need to build our vocabulary and encode it. We’ll use scikit-learn’s `CountVectorizer` for this:

```python
from sklearn.feature_extraction.text import CountVectorizer

countVectorizerNews = CountVectorizer()
countVectorizerNews.fit(ag_news_train["text_clean"])
ag_news_train_cv = countVectorizerNews.transform(ag_news_train["text_clean"]).toarray()
```

The process has two distinct steps. First, `.fit()` scans the training data and builds a vocabulary by identifying every unique word and assigning it a fixed index position (for example, “government” = column 8,901). The result is a mapping of 59,544 unique words, which you can think of as the column headers for our eventual matrix.

Second, `.transform()` uses that vocabulary to convert each headline into a numerical vector, counting how many times each vocabulary word appears and placing that count at the corresponding index position.

The reason these are two separate steps is important: When we later process our validation and test data, we’ll call `.transform()` using the vocabulary learned from the training set. This ensures that all three splits share a consistent feature space. If we re-ran `.fit()` on the test data, we’d get a different vocabulary, and the model’s predictions would be meaningless.

With the vectorizer fitted and our training data transformed, we can start exploring what we’ve actually built. Let’s first take a look at the vocabulary. `CountVectorizer` stores it as a dictionary mapping each word to its index position, accessible via `vocabulary_`:

```python
countVectorizerNews.vocabulary_
```
```
{'fed': 18461,
 'up': 55833,
 'with': 58324,
 'pension': 38929,
 'defaults': 13156,
 'citing': 9475,
 'failure': 18077,
 'of': 36704,
 'two': 54804,
 'big': 5269,
 'airlines': 1139,
 'to': 53531,
 'make': 31397,
 'payments': 38686,
 'their': 52947,
 ...}
```
```python
len(countVectorizerNews.vocabulary_)
```
```
59544
```
This confirms that our vocabulary contains 59,544 unique words. Browsing through it, you can start to guess what kinds of terms appear frequently in the different types of news. Country names feature heavily in the “world” news category, terms like “football” and “cricket” in the “sports” news category, terms like “profit” and “losses” in the “business” news category, and company names like “Google” and “Microsoft” in the “science/technology” category.

Next, let’s inspect the feature matrix itself. `ag_news_train_cv` is a NumPy array with one row per headline and one column per vocabulary word, giving us a matrix of shape (108,000 × 59,544). We can wrap it in a DataFrame to make it easier to inspect in PyCharm’s DataFrame viewer:

```python
pd.DataFrame(ag_news_train_cv, columns=countVectorizerNews.get_feature_names_out())
```

![Image 10](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-8-sparse-matrix.png)
As expected, the matrix is very sparse. Most values are zero, since any individual headline only contains a small fraction of the full vocabulary. In fact, you might have noticed that the number of columns is more than half the number of rows, which is never good for a feature matrix. We’ll explore how to reduce the dimensionality of the feature space in a later section.

Note that we also need to apply this vectorization to the validation dataset before moving on to modeling. Importantly, we are only applying the `.transform()` method to the validation set, as the vectorizer has already been fitted on the training dataset.

```python
ag_news_val_cv = countVectorizerNews.transform(ag_news_val["text_clean"]).toarray()
```

## Visualizing the results

Before we move on to reducing the dimensionality of our feature space, let’s explore the distribution of the words in our corpus. This can help us understand the most common and rarest words, and how we might use this to further process our data and improve the signal-to-noise ratio.

### Word frequency plots

We’ll start by creating a DataFrame that aggregates word counts across all headlines and ranks them by frequency:

```python
import numpy as np

vocab = countVectorizerNews.get_feature_names_out()
counts = np.asarray(ag_news_train_cv.sum(axis=0)).flatten()

pd.DataFrame({
    'vocab': vocab,
    'count': counts,
}).sort_values('count', ascending=False).reset_index(drop=True)
```

First, we retrieve the vocabulary in index order using `get_feature_names_out()`, so each word lines up with its corresponding column in the feature matrix. We then sum the matrix column-wise (that is, across all documents) to get the total number of times each word appears in the training set. Finally, we wrap these two arrays into a DataFrame and sort by count, giving us a ranked list of the most frequent terms.

Once this DataFrame is displayed in PyCharm, we can easily turn it into a visualization without writing a single line of code. By clicking on the _Chart View_ button in the top left-hand corner of the DataFrame, we can explore a range of ways of visualizing our data. Go to _Show Series Settings_ in the top right-hand corner, and adjust the parameters to get the count frequencies of the words: we set the _X axis_ value to “vocab” (and change _group and sort_ to _none_), the _Y axis_ value to “count”, and the chart type to “Bar”.

![Image 11](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-9-chart-view.png)
Hovering over this chart, we can see that it has a very long-tailed distribution, which is very typical of vocabulary frequencies (so typical, in fact, that it is described by Zipf’s law). This means that the majority of our words occur very rarely in the text, and in fact, if we hover over the right-hand side of the chart, it looks like around a third of our vocabulary terms are only used once!

On the other hand, when we hover over the left-hand side of the chart, we can see that this is dominated by very common words, prepositions, and articles, such as “to”, “in”, “the”, and “you”. These words don’t really carry any meaning and pretty much occur in every text, so they’re unlikely to be useful for our classification task.

Let’s have a look at some things we can do to clean up our feature space and help our semantically meaningful words stand out a bit more.

## Advanced bag-of-words techniques

The basic BoW pipeline we’ve built so far is a solid foundation, but there are several techniques that can meaningfully improve its quality. This section walks through the most important ones. We’ll only be using a selection of them in our project, but you can investigate which of these seem appropriate when building your own project.

### Stop word removal

Stop words are extremely common words that appear frequently across all kinds of text but carry little meaningful information. This includes words like “the”, “is”, “and”, “of”, as we saw in the histogram in the previous section. They inflate vocabulary size without adding signal, so removing them is one of the most straightforward ways to improve your BoW representation. NLTK provides a built-in stop word list for English and many other languages.
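As a rough illustration (not part of the original project, which uses spaCy’s stop word list later), NLTK’s list can be applied like this:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-off download of NLTK's stop word lists
stop_words = set(stopwords.words("english"))

tokens = "the cat is on the mat".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'mat']
```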

### Stemming and lemmatization

Another issue you might have noticed in our vocabulary is that semantically equivalent words appear in different inflected forms, so what should be a single token ends up occupying several vocabulary slots. We can resolve this through two techniques: stemming and lemmatization. Stemming reduces words to their root form using simple rule-based truncation (e.g. “running” → “run”), while lemmatization takes a linguistic approach, mapping words to their dictionary base form. Lemmatization is slower but generally produces cleaner results, particularly for irregular word forms.
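Here’s a quick sketch of the difference, using NLTK’s Porter stemmer and spaCy’s lemmatizer (assuming the `en_core_web_sm` model is already installed; this snippet isn’t from the original post):

```python
import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["running", "ran", "studies"]])
# ['run', 'ran', 'studi']  -- rule-based truncation, not always a real word

nlp = spacy.load("en_core_web_sm")
print([token.lemma_ for token in nlp("He was running and ran two studies")])
# e.g. ['he', 'be', 'run', 'and', 'run', 'two', 'study']  -- dictionary base forms
```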

### TF-IDF

Term frequency-inverse document frequency (TF-IDF) is an extension of basic count vectorization that weights each word by how informative it actually is. A word that appears frequently in one document but rarely across the corpus receives a high weight; a word that appears everywhere receives a low one. This neatly addresses one of the core weaknesses of raw count vectors: common but uninformative words can dominate the feature space even after stop-word removal.
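To see the effect on a toy corpus, here’s a small sketch (not from the original post) comparing raw counts with TF-IDF weights; words shared across several documents end up with lower relative weights than the distinctive ones:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "stocks rise on profit news",
    "profit warning hits stocks",
    "team wins final on penalties",
]

counts = CountVectorizer().fit_transform(corpus).toarray()
weights = TfidfVectorizer().fit_transform(corpus).toarray()

# Raw counts treat every occurrence equally; TF-IDF gives terms that appear
# in several documents (such as "on", "stocks", "profit") lower weights
# relative to terms that are distinctive to a single document.
print(counts)
print(weights.round(2))
```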

### N-grams

Standard BoW treats each word independently, which means it misses phrases whose meaning depends on word combinations. A classic example of this is “machine learning”, which has a distinct meaning to “machine” + “learning”. N-grams address this by treating sequences of adjacent words as single tokens, so a bigram model would capture “machine learning” as a feature in its own right. The trade-off is a much larger vocabulary, so in practice, bigrams are most commonly used, with trigrams reserved for cases where capturing longer phrases is important.
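With scikit-learn, enabling bigrams is a one-parameter change. A minimal sketch (not from the original project):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words and adds adjacent word pairs as features
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["machine learning models learn from data"])
print(bigram_vectorizer.get_feature_names_out())
# ['data' 'from' 'from data' 'learn' 'learn from' 'learning' 'learning models'
#  'machine' 'machine learning' 'models' 'models learn']
```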

### Handling out-of-vocabulary words

When you apply your fitted vectorizer to new data, any words not present in the training vocabulary are silently ignored by default. For many tasks, this is acceptable, but if your production data is likely to continue introducing new terms that carry meaningful signal, it’s worth considering alternatives. One common approach is to reserve a special `<UNK>` token to represent unseen words, which at least preserves the information that something unfamiliar appeared, even if its identity is unknown and multiple (perhaps unrelated) words are collapsed onto the same token.
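For instance, scikit-learn’s vectorizers simply drop unseen words at transform time. A quick sketch of that default behaviour (not from the original post):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["the cat sat on the mat"])

# "dog" never appeared in the training text, so it contributes nothing here
print(cv.get_feature_names_out())               # ['cat' 'mat' 'on' 'sat' 'the']
print(cv.transform(["the dog sat"]).toarray())  # [[0 0 0 1 1]]
```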

However, LLMs, with their more flexible approach to tokenization, tend to be a better choice if out-of-vocabulary words will be a major issue for your model once it is in production.

### Dimensionality reduction

Even after stop word removal and other cleaning steps, BoW feature matrices are typically very high-dimensional and sparse. Two widely used techniques can help. Reducing to the top-N most frequent terms is the simplest approach, discarding low-frequency words that are unlikely to generalize well. For a more principled reduction, techniques like principal component analysis (PCA) or latent semantic analysis (LSA) project the feature matrix into a lower-dimensional space, compressing the representation while preserving as much of the meaningful variance as possible.
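As a sketch, truncated SVD (the technique behind LSA) applied to a sparse count matrix looks roughly like this in scikit-learn; the number of components here is a toy value you would tune for a real corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["stocks rise on profit news", "profit warning hits stocks",
          "team wins final on penalties"]

X = CountVectorizer().fit_transform(corpus)      # sparse document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)                 # dense, low-dimensional representation

print(X.shape, "->", X_reduced.shape)            # (3, 11) -> (3, 2)
```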

### Feature selection techniques

Rather than reducing dimensionality arbitrarily, feature selection methods identify and retain only the features most relevant to your specific task. Chi-squared testing measures the statistical dependence between each term and the target label, making it well-suited to classification tasks. Mutual information takes a similar approach, scoring each feature by how much it reduces uncertainty about the class. Both methods can substantially reduce vocabulary size while preserving model performance.
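A minimal sketch of chi-squared feature selection with scikit-learn (the toy corpus, labels, and `k` value here are placeholders, not from the project):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["stocks rise on profit news", "profit warning hits stocks",
          "team wins final on penalties", "striker scores in cup final"]
labels = np.array([0, 0, 1, 1])   # 0 = business, 1 = sports (toy labels)

X = CountVectorizer().fit_transform(corpus)

# Keep only the k terms most associated with the labels by the chi-squared test
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)   # (4, 5)
```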

## Applying bag-of-words to a real-world problem

Let’s now continue the example we started earlier. We’re going to take the work we’ve done on our AG News text classification task and take it to its completion by building a model.

A common way to build a model on encoded text is with a neural network, where each word in the vocabulary is treated as a feature, and the categories we want to predict (in our case, the news category) are the outputs. We’ll start by building a baseline model that applies only string cleaning and encoding to the text.

I had originally written this model in Keras as part of a previous BoW project from a couple of years ago. However, that code was now out of date. In order to update it and adapt it to PyTorch, I asked JetBrains AI to do the following:

> Please update this neural network from Keras to Pytorch, making improvements to make the code as reusable as possible.

This gave us the following successful port of the code:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

class MulticlassClassificationModel(nn.Module):
   def __init__(self, input_size: int, hidden_layer_size: int, num_classes: int = 4):
       super(MulticlassClassificationModel, self).__init__()
       self.fc1 = nn.Linear(input_size, hidden_layer_size)
       self.relu = nn.ReLU()
       self.fc2 = nn.Linear(hidden_layer_size, num_classes)

   def forward(self, x):
       x = self.fc1(x)
       x = self.relu(x)
       x = self.fc2(x)
       return x

def train_text_classification_model(
       train_features: np.ndarray,
       train_labels: np.ndarray,
       validation_features: np.ndarray,
       validation_labels: np.ndarray,
       input_size: int,
       num_epochs: int,
       hidden_layer_size: int,
       num_classes: int = 4,
       batch_size: int = 1920,
       learning_rate: float = 0.001) -> MulticlassClassificationModel:

   # Convert labels to 0-indexed (AG News has labels 1,2,3,4 -> need 0,1,2,3)
   train_labels_indexed = train_labels - 1
   validation_labels_indexed = validation_labels - 1

   # Convert numpy arrays to PyTorch tensors
   X_train = torch.FloatTensor(train_features.copy())
   y_train = torch.LongTensor(train_labels_indexed.copy())
   X_val = torch.FloatTensor(validation_features.copy())
   y_val = torch.LongTensor(validation_labels_indexed.copy())

   # Create datasets and dataloaders
   train_dataset = TensorDataset(X_train, y_train)
   train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

   # Initialize model, loss function, and optimizer
   model = MulticlassClassificationModel(input_size, hidden_layer_size, num_classes)
   criterion = nn.CrossEntropyLoss()
   optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

   # Training loop
   for epoch in range(num_epochs):
       model.train()
       train_loss = 0.0
       correct_train = 0
       total_train = 0

       for batch_features, batch_labels in train_loader:
           # Forward pass
           outputs = model(batch_features)
           loss = criterion(outputs, batch_labels)

           # Backward pass and optimization
           optimizer.zero_grad()
           loss.backward()
           optimizer.step()

           # Calculate training metrics
           train_loss += loss.item()
           _, predicted = torch.max(outputs, 1)
           correct_train += (predicted == batch_labels).sum().item()
           total_train += batch_labels.size(0)

       # Validation
       model.eval()
       with torch.no_grad():
           val_outputs = model(X_val)
           val_loss = criterion(val_outputs, y_val)
           _, val_predicted = torch.max(val_outputs, 1)
           correct_val = (val_predicted == y_val).sum().item()
           total_val = y_val.size(0)

       # Print epoch metrics
       train_acc = correct_train / total_train
       val_acc = correct_val / total_val
       print(f'Epoch [{epoch+1}/{num_epochs}], '
             f'Train Loss: {train_loss/len(train_loader):.4f}, '
             f'Train Acc: {train_acc:.4f}, '
             f'Val Loss: {val_loss:.4f}, '
             f'Val Acc: {val_acc:.4f}')

   return model

def generate_predictions(model: MulticlassClassificationModel,
                         validation_features: np.ndarray,
                         validation_labels: np.ndarray) -> list:
    model.eval()

    # Convert to tensors
    X_val = torch.FloatTensor(validation_features.copy())

    with torch.no_grad():
        outputs = model(X_val)
        _, predicted = torch.max(outputs, 1)

    # Convert back to 1-indexed labels to match original dataset
    predicted_labels = (predicted.numpy() + 1)

    print("Confusion Matrix:")
    print(pd.crosstab(validation_labels, predicted_labels,
                      rownames=['Actual'], colnames=['Predicted']))
    return predicted_labels.tolist()
```

Let’s walk through this code step-by-step to understand how we’re going to train our text classifier.

### The model architecture

`MulticlassClassificationModel` is a simple two-layer feedforward neural network. It takes a BoW vector as input, with each feature being a vocabulary word, and passes it through two sequential transformations to produce a prediction. The first layer (`fc1`) compresses this high-dimensional input down to a smaller intermediate representation, whose size we control via `hidden_layer_size`. A ReLU activation is then applied, which introduces non-linearity, allowing the model to learn patterns that a simple weighted sum couldn’t capture. The second layer (`fc2`) takes this intermediate representation and maps it down to four output values, one per news category, where the category with the highest value becomes the model’s prediction.

### Training and validation

`train_text_classification_model` handles the full training loop. It starts with a small amount of housekeeping: The AG News labels run from 1 to 4, but PyTorch expects 0-indexed classes, so these are shifted down by 1. The features and labels are then converted to PyTorch tensors, and a `DataLoader` is created to feed the training data to the model in batches.

Each epoch, the model processes the training data batch by batch. For each batch, it runs a forward pass to generate predictions, computes the cross-entropy loss against the true labels, and then runs a backward pass to update the model weights via the RMSprop optimizer. At the end of every epoch, the model switches into evaluation mode and runs inference over the full validation set, printing the training and validation loss and accuracy so we can monitor how training is progressing.

### Generating predictions

Once training is complete, `generate_predictions` runs the trained model on a held-out dataset and returns the predicted class for each article. It also prints a confusion matrix, which gives us a breakdown of which categories the model is getting right and where it’s getting confused, which is a much more informative picture than accuracy alone.

### Running the baseline

We can now train the baseline model. We pass in the raw count-vectorized training and validation features, specify an input size equal to the vocabulary size (59,544 columns), train for two epochs, and use a hidden layer of 5,000 nodes.

```python
baseline_model = train_text_classification_model(
    ag_news_train_cv,
    ag_news_train["label"].to_numpy(),
    ag_news_val_cv,
    ag_news_val["label"].to_numpy(),
    ag_news_train_cv.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_cv,
    ag_news_val["label"].to_numpy()
)
```
```
Epoch [1/2], Train Loss: 0.3553, Train Acc: 0.8813, Val Loss: 0.2307, Val Acc: 0.9243
Epoch [2/2], Train Loss: 0.1217, Train Acc: 0.9587, Val Loss: 0.2352, Val Acc: 0.9240

Confusion Matrix:
Predicted     1     2     3     4
Actual
1          2774    65    89    72
2            37  2944     9    10
3           112    20  2694   174
4            97    20   207  2676
```

Even with the very basic data preparation we did, we can see we’ve performed very well on this prediction task, with around 92% accuracy. The confusion matrix shows that the model seems to have the easiest time distinguishing between category two (sports) and the other topics, and the hardest time distinguishing between category three (business) and category four (science/technology). This makes sense, as the words used to describe sports are very distinct and unlikely to be used in other contexts (things like football), whereas there is likely to be overlapping vocabulary between business and technology (especially company names).

As we saw above, there is a lot we can do to improve the signal-to-noise ratio in BoW modeling. Let’s apply four commonly used techniques to our data and see whether this improves our predictions: lemmatization, stop word removal, limiting our vocabulary to the top N terms, and TF-IDF weighting. As you’ll see, all of these can be done relatively simply using inbuilt functions in packages such as spaCy and scikit-learn.

### Lemmatization

As we discussed earlier, lemmatization collapses inflected word forms into a single vocabulary entry by mapping each word to its dictionary base form, which both shrinks the vocabulary and concentrates the signal for each concept into a single feature. We’ll use spaCy for this, which first requires downloading its small English language model:

```python
!python -m spacy download en_core_web_sm
```
```python
import spacy

nlp = spacy.load("en_core_web_sm")
```
Our `lemmatise_text` function passes each text through spaCy’s NLP pipeline using `nlp.pipe()`, which processes them in batches of 1,000 for efficiency. For each document, it extracts the `.lemma_` attribute of every token and joins them back into a single string. One small detail worth noting: we preserve the original DataFrame index when constructing the output Series, so that rows stay correctly aligned when we assign the results back.
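The function body isn’t reproduced at this point in the post, but based on that description, a sketch of `lemmatise_text` would look something like this:

```python
def lemmatise_text(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)

    lemmatised = []
    # nlp.pipe() processes the texts in batches of 1,000 for efficiency
    for doc in nlp.pipe(texts, batch_size=1000):
        lemmatised.append(" ".join(token.lemma_ for token in doc))

    # Preserve the original DataFrame index so rows stay aligned on assignment
    return pd.Series(lemmatised, index=texts.index)
```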

We apply lemmatization before string cleaning, since spaCy needs the original casing and punctuation to correctly identify grammatical structure. For example, “running” and “Running” lemmatize to the same thing, but removing punctuation first can confuse the parser. Once lemmatized, we pass the output through `apply_string_cleaning` as before:

ag_news_train["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["title"]))
ag_news_train["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_train["description"]))

ag_news_val["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["title"]))
ag_news_val["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_val["description"]))

ag_news_train["text_clean"] = ag_news_train["title_clean"] + " " + ag_news_train["description_clean"]

ag_news_val["text_clean"] = ag_news_val["title_clean"] + " " + ag_news_val["description_clean"]
We apply this separately to the title and description columns before concatenating them into a single `text_clean` field. As you can see, we do this for both the training and validation sets using the same function, so that lemmatization is applied consistently across both splits.

![Image 12](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-10-lemmatisation.png)
### Removing stop words

As with lemmatization, we covered the motivation for stop word removal earlier: Words like “the”, “is”, and “of” appear so frequently across all texts that they add noise rather than signal to our feature matrix. Here we’ll actually apply it to our data.

```python
def remove_stopwords(texts: pd.Series) -> pd.Series:
    texts = texts.fillna("").astype(str)

    filtered_texts = []
    for doc in nlp.pipe(texts, batch_size=1000):
        filtered_texts.append(
            " ".join(token.text for token in doc if not token.is_stop)
        )

    return pd.Series(filtered_texts, index=texts.index)
```

Our `remove_stopwords` function again uses `nlp.pipe()` to process texts in batches. For each document, it filters out any token where spaCy’s `is_stop` attribute is True, and joins the remaining tokens back into a string. Conveniently, spaCy handles stop word detection using the same pipeline we already loaded for lemmatization, so no additional setup is needed.

We apply this to the already-cleaned and lemmatized `text_clean` column for both the training and validation sets, so the stop word removal builds directly on our previous preprocessing steps and is applied consistently across both splits.

ag_news_train["text_no_stopwords"] = remove_stopwords(ag_news_train["text_clean"])
ag_news_val["text_no_stopwords"] = remove_stopwords(ag_news_val["text_clean"])
### Top N terms and TF-IDF vectorization

The final two improvements we’ll apply are limiting the vocabulary size and switching from raw count vectorization to TF-IDF weighting. Conveniently, scikit-learn’s `TfidfVectorizer` handles both in a single step.

Recall from earlier that TF-IDF downweights words that appear frequently across many documents while upweighting words that are distinctive to particular documents. This cleans up uninformative words that don’t quite qualify as stop words but add little useful information to our dataset. The `max_features=20000` argument caps the vocabulary at the 20,000 most frequent terms, which discards the long tail of rare words that are unlikely to generalize well and brings our feature matrix down to a much more manageable size. (The choice of 20,000 words is arbitrary. We could have easily used a smaller or larger number, depending on our dataset and use case.)

As with `CountVectorizer`, we fit only on the training data and then use that fixed vocabulary to transform both the training and validation sets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizerNews = TfidfVectorizer(max_features=20000)
TfidfVectorizerNews.fit(ag_news_train["text_no_stopwords"])

ag_news_train_tfidf = TfidfVectorizerNews.transform(ag_news_train["text_no_stopwords"]).toarray()
ag_news_val_tfidf = TfidfVectorizerNews.transform(ag_news_val["text_no_stopwords"]).toarray()
```

We can inspect the resulting vocabulary and feature matrix exactly as we did before:

```python
TfidfVectorizerNews.vocabulary_
```
```
{'fed': np.int64(6243),
 'pension': np.int64(13134),
 'default': np.int64(4469),
 'cite': np.int64(3200),
 'failure': np.int64(6109),
 'big': np.int64(1787),
 'airline': np.int64(401),
 'payment': np.int64(13051),
 'plan': np.int64(13424),
 'government': np.int64(7306),
 'official': np.int64(12453),
 'tuesday': np.int64(18437),
 'congress': np.int64(3691),
 'hard': np.int64(7689),
 'corporation': np.int64(3901),
 ...}
```
```python
pd.DataFrame(ag_news_train_tfidf, columns=TfidfVectorizerNews.get_feature_names_out())
```

![Image 13](https://blog.jetbrains.com/wp-content/uploads/2026/04/screenshot-12-tf-idf-matrix.png)
Compared to our baseline feature matrix of 59,544 columns filled almost entirely with zeros, this is considerably leaner. We now have 20,000 columns of weighted scores that better reflect each word’s actual importance to the document it appears in. It is still relatively sparse, but we can see from both the feature matrix and the vocabulary list that it is much more focused on semantically rich words.

### Fitting the revised model

With our improved features in hand, we can now retrain the model. The call is identical to before, except we pass in the TF-IDF feature matrices instead of the raw count vectors, and the input size is now 20,000 rather than 59,544:

```python
baseline_model = train_text_classification_model(
    ag_news_train_tfidf,
    ag_news_train["label"].to_numpy(),
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy(),
    ag_news_train_tfidf.shape[1],
    2,
    5000
)

predictions = generate_predictions(
    baseline_model,
    ag_news_val_tfidf,
    ag_news_val["label"].to_numpy()
)
```
```
Epoch [1/2], Train Loss: 0.3183, Train Acc: 0.8932, Val Loss: 0.2301, Val Acc: 0.9225
Epoch [2/2], Train Loss: 0.1512, Train Acc: 0.9475, Val Loss: 0.2332, Val Acc: 0.9243

Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual
1          2703    71   121   105
2            20  2955    13    12
3            68    19  2691   222
4            77    17   163  2743
```

The results are actually very encouraging! Our overall validation accuracy is essentially unchanged at around 92%, but we’ve achieved this with a feature matrix that is less than a third of the size. This suggests that the extra vocabulary in the baseline (including the stop words) was contributing to noise rather than signal. Reducing the size of the feature matrix makes our model more stable, less prone to overfitting, and much more manageable to deploy.

Looking at the confusion matrix, the pattern of errors is similar to before: Sports (category two) is the easiest category to classify, with 98.5% accuracy, while Business (category three) and Science/Technology (category four) remain the hardest to separate, with around 7% of articles in each category being misclassified as the other. This is consistent with what we saw in the baseline, so it seems that the preprocessing improvements have tightened things up at the margins, but the fundamental difficulty of the Business/Technology boundary is a property of the data rather than the feature representation.

### Applying our model to the test set

Finally, we need to validate that our model performs as well on the test set as it does on the validation set. Up to this point, we’ve deliberately kept the test set locked away. As mentioned earlier, if we had been making modeling decisions based on test set performance, we’d risk inadvertently overfitting our choices to it, and our final accuracy estimate would be optimistic.

The preprocessing steps must be applied in exactly the same order as for the training and validation data: lemmatization, string cleaning, concatenation of title and description, and stop-word removal. Crucially, we also call `.transform()` rather than `.fit_transform()` on the test text, using the vocabulary learned from the training data:

ag_news_test["title_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["title"]))
ag_news_test["description_clean"] = apply_string_cleaning(lemmatise_text(ag_news_test["description"]))
ag_news_test["text_clean"] = ag_news_test["title_clean"] + " " + ag_news_test["description_clean"]
ag_news_test["text_no_stopwords"] = remove_stopwords(ag_news_test["text_clean"])

ag_news_test_tfidf = TfidfVectorizerNews.transform(ag_news_test["text_no_stopwords"]).toarray()
We can then generate predictions and evaluate accuracy on the test set:

```python
from sklearn.metrics import accuracy_score

test_predictions = generate_predictions(
    baseline_model,
    ag_news_test_tfidf,
    ag_news_test["label"].to_numpy()
)

test_accuracy = accuracy_score(ag_news_test["label"].to_numpy(), test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")
```
```
Test Accuracy: 0.9183

Confusion Matrix - Raw Counts:
Predicted     1     2     3     4
Actual
1          1710    54    78    58
2            13  1870    10     7
3            51    12  1676   161
4            53     9   115  1723
```

The test accuracy of 91.8% is very close to the 92.4% we saw on the validation set, which is a reassuring sign that our model has generalized well rather than overfitting to the validation data. The confusion matrix tells the same story as before: Sports (category two) remains the easiest category to classify, with only 30 misclassified articles out of 1,900, while the Business/Technology boundary continues to be the main source of errors, with around 8% of articles in each category being misclassified as the other. The consistency between validation and test results gives us confidence that these error patterns reflect genuine properties of the data rather than artifacts of any particular split.

## Limitations and alternatives

### Loses word order information

The most fundamental limitation of the bag-of-words model is right there in the name: it treats text as an unordered collection of words, discarding all sequence information. This means “the dog bit the man” and “the man bit the dog” produce identical vectors, even though they describe very different events. For many classification tasks, this doesn’t matter much, but for tasks that require understanding the relationship between words, such as question answering or natural language inference, the loss of word order is a serious handicap.

### Ignores semantics and context

BoW has no notion of word meaning or context. Each word is simply a column in a matrix, entirely independent of every other word. This creates two related problems. First, synonyms are treated as completely distinct features: “cheap” and “inexpensive” contribute nothing to each other’s signal, even though they mean the same thing. Second, words with multiple meanings are treated as a single feature regardless of context: “bank” means the same thing whether it appears in a sentence about rivers or finance. Both of these issues limit how well BoW representations can capture the actual semantics of a text.

### Can result in large, sparse vectors

As we saw in our own example, even a moderately sized corpus of news headlines can produce a vocabulary of nearly 60,000 unique terms. The resulting feature matrix has one column per vocabulary word, but any individual document only uses a tiny fraction of them, leaving the vast majority of values at zero. This sparsity creates two practical problems: The matrices can consume a large amount of memory if stored densely, and the high dimensionality can make it harder for models to find meaningful patterns, a phenomenon sometimes called the curse of dimensionality.

### Alternatives

If BoW’s limitations are a bottleneck for your task, there are several well-established alternatives worth considering.

*   **Word embeddings (Word2Vec and GloVe)** address the semantics problem by representing each word as a dense vector in a continuous space, where similar words are geometrically close to each other. They capture distributional meaning far more richly than BoW, and are a natural next step when synonym handling or word similarity matters. Doc2Vec extends this idea to produce embeddings for entire documents rather than individual words.
*   **Transformer-based models (BERT and GPT)** go further still, generating contextual representations where the same word receives a different vector depending on the surrounding text. This handles polysemy directly and captures complex long-range dependencies between words. The trade-off is substantially higher computational cost and complexity compared to BoW.
*   **Topic models like latent Dirichlet allocation (LDA)** take a different angle entirely. Rather than encoding documents for downstream classification, they are generative models that discover latent thematic structure in a corpus. This is useful when your goal is exploration and interpretation rather than prediction.

For tasks where BoW already performs well, as we saw here with AG News, the added complexity of these approaches may not be worth the cost. BoW remains a strong baseline, and it’s always worth establishing how far it can take you before reaching for heavier machinery.

## Get started with PyCharm today

In this post, we’ve covered a lot of ground: from the fundamentals of the bag-of-words model and how it converts text into numerical vectors, through to building and iteratively improving a real text classification pipeline on the AG News dataset. Along the way, we’ve seen how preprocessing steps like lemmatization, stop word removal, vocabulary capping, and TF-IDF weighting can meaningfully improve the efficiency of your feature representation, and how PyCharm’s DataFrame viewer, column statistics, chart view, and AI Assistant make each of these steps faster and easier to inspect and debug.

If you’d like to try this yourself, [PyCharm Pro](https://www.jetbrains.com/pycharm/download/?section=windows) comes with a 30-day trial. As we saw in this tutorial, its built-in support for Jupyter notebooks, virtual environments, and scientific libraries means you can go from a blank project to a working NLP pipeline with minimal setup friction, leaving you free to focus on the fun parts.

You can find the [full code](https://github.com/t-redactyl/ag-news-bag-of-words-classification) for this project on GitHub. If you’re interested in exploring more NLP topics, check out our recent blogs [here](https://blog.jetbrains.com/pycharm/).

