by Jonathan Widarsa

What K-means Says about Stocks

·

One natural question we should ask is whether stocks can be grouped by how they behave, rather than just by the sector labels someone assigned them decades ago. For example, here’s one of many things sector labels don’t capture: a tech company and a utility might sit in completely different industries, yet move in near-perfect lockstep during a market downturn.

That’s why in this article, we’ll use KK-mean clustering to classify a universe of stocks based on their return and volatility characteristics. Of course, there are likely more robust ways of choosing factors for the task, e.g., including other features such as price-earnings and risk-return ratios. However, we’ll keep things simple here, given the choice of features is beyond the scope of this article.

Anyway, along the way, we’ll cover how the algorithm actually works, how to pick KK, and how to run it.

***

We begin with the idea behind KK-means: choose KK centroids and a hard assignment of each point to one centroid to minimize total squared distance to assigned centers; solve this (approximately) by alternating between nearest-centroid assignment and mean recomputation; use smart initialization and multiple restarts because the problem is non-convex; and report the final centroids, labels, and inertia as the fitted clustering.

Mathematically, suppose we have data x1,,xnpx_1, \dots, x_n \in \mathbb{R}^p, and we want KK clusters. In this sense, the classical KK-means objective is

min{C(i)},{μk}i=1nxiμC(i)2,\min_{\{C(i)\},\{\mu_k\}} \sum_{i=1}^n \|x_i – \mu_{C(i)}\|^2,

where C(i){1,,k}C(i) \in \{1, \dots, k\} is the cluster assignment of xix_i and μkp\mu_k \in \mathbb{R}^p is the centroid of cluster kk. If you’ve got a keen eye, you could probably tell that xiμC(i)\|x_i – \mu_{C(i)}\| is simply the Euclidean distance between xix_i and the centroid of the cluster it’s assigned to. Explicitly, in kk-means, we use the squared version:

xiμC(i)2=j=1p(xijμC(i),j)2,\|x_i – \mu_{C(i)}\|^2 = \sum_{j=1}^p (x_{ij} – \mu_{C(i),j})^2,

which measures how far the point is from its cluster center across pp features.

The main difficulty here is that this isn’t a convex optimization problem in the joint variables {C(i)}\{C(i)\} and {μk}\{\mu_k\} — the assignments are discrete, the centroids are continuous, and the space of possible clusterings is enormous. Therefore, we employ Lloyd’s algorithm, which is an iterative coordinate-descent-style procedure that alternates between:

  • improving assignments while holding centroids fixed, and
  • improving centroids while holding assignments fixed.

Lloyd’s Algorithm

The logic of the algorithm is pretty elegant. In the assignment step, we suppose that the centroids μ1,,μK\mu_1, \dots, \mu_K are known. Then, to minimize

i=1nxiμC(i)2,\sum_{i=1}^n \|x_i – \mu_{C(i)}\|^2,

each point xix_i should clearly be assigned to whichever centroid is closest, because the terms in the sum are separable in ii. Formally, with centroids fixed, the optimal assignment rule is

C(i)=argmink{1,,K}xiμk2.C(i) = \arg \min_{k \in \{1, \dots, K\}} \|x_i – \mu_k\|^2.

Understandably, it’s sometimes called the “E-like” step since it resembles the assignment part of expectation-maximization, but in kk-means it’s a hard deterministic reassignment (i.e., not a probabilistic expectation).

In the update step, we instead fix the cluster assignments. In this case, the objective decomposes by cluster:

k=1KiSkxiμk2,\sum_{k=1}^K \sum_{i \in S_k} \|x_i – \mu_k\|^2,

where SkS_k represents the set of indices in cluster kk. For each kk, the centroid μk\mu_k can be optimized independently by solving

minμkiSkxiμk2.\min_{\mu_k} \sum_{i \in S_k} \|x_i – \mu_k\|^2.

Taking its derivative w.r.t. μk\mu_k and setting it to zero yields

μk=1|Sk|iSkxi,\mu_k = \frac{1}{|S_k|} \sum_{i \in S_k} x_i,

which literally means that the best centroid, conditional on current assignments, is the sample mean of the assigned points.

Now, simply putting these together, we can understand that Lloyd’s algorithm starts from some initial centroids μ1(0),,μK(0)\mu_1^{(0)}, \dots, \mu_K^{(0)} and repeats

C(t)(i)=argmink{1,,K}xiμk(t)2C^{(t)}(i) = \arg \min_{k \in \{1, \dots, K\}} \|x_i – \mu_k^{(t)}\|^2

and then

μk(t+1)=1|Sk(t)|iSk(t)xi,\mu_k^{(t+1)} = \frac{1}{|S_k^{(t)}|} \sum_{i \in S_k^{(t)}} x_i,

where Sk(t)={i:C(t)(i)=k}S_k^{(t)} = \{i: C^{(t)}(i) = k\}. Each of these two steps cannot increase the objective, meaning the within-cluster sum of squares (WCSS)

J(t)=i=1nxiμC(t)(i)(t)2J^{(t)} = \sum_{i=1}^n \|x_i – \mu_{C^{(t)}(i)}^{(t)}\|^2

is non-increasing over iterations. Since there are only so many possible hard partitions of nn points into KK groups, the algorithm eventually converges to a fixed point — a local minimum.

Data Preparation

Now that we’ve gotten the theory out of the way, let’s begin by loading the dataset that we’ll be using. Our two factors will be the average annualized returns and average annualized volatility.

import pandas as pd
import yfinance as yf
from math import sqrt

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url, storage_options={'User-Agent': 'Mozilla/5.0'})
tickers = tables[0]['Symbol'].values.tolist()
tickers = [
    s.replace('\n', '').replace('.', '-').replace(' ', '') for s in tickers
]

prices_df = yf.download(
    tickers,
    start='2025-01-01',
    end='2026-01-01',
    auto_adjust=True,
    progress=False
)["Close"]

prices_df.sort_index(inplace=True)
rets_df = pd.DataFrame({
    'Returns': prices_df.pct_change().mean() * 252,
    'Volatility': prices_df.pct_change().std() * sqrt(252)
})

X = rets_df[['Returns', 'Volatility']].dropna().values

Recall the formula for the loss function (WCSS). Since it aims to minimize a sum of squares, features with larger scale dominate the squared Euclidean distance. To avoid bias, we should therefore scale the data accordingly.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Optimal Choice of K

Choosing an optimal KK is necessary to balance between underfitting (having too few clusters) and overfitting (having too many clusters). To do this, we look at two complementary quantities: the WCSS, which measures how tightly points are grouped around their assigned centroids, and the mean silhouette score, which captures both how cohesive and how well-separated the clusters are.

While WCSS always decreases as KK increases, making it difficult to identify a clear stopping point, the mean silhouette score provides a more balanced measure. We define it as

s=1ni=1nb(i)a(i)max(a(i),b(i)),\bar{s} = \frac{1}{n} \sum_{i=1}^n \frac{b(i) – a(i)}{\max\left(a(i), b(i)\right)},

where a(i)a(i) is the average distance of xix_i to points in its own cluster, and b(i)b(i) is the minimum average distance to points in the nearest neighboring cluster. Intuitively, we can understand this quantity as rewarding clusters that are both internally tight and externally well-separated.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias = []
silhouette_scores = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(
        n_clusters=k, init='k-means++', n_init=10, random_state=42
    )
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

    if k > 1:
        score = silhouette_score(X_scaled, labels)
        silhouette_scores.append(score)
    else:
        silhouette_scores.append(None)
WCSS and Mean Silhouette Score vs K Plot for Optimal K Determination

In our case, s\bar{s} is highest at K=2K=2 and shows a smaller local peak at K=4K=4. Combined with the ambiguous elbow in WCSS, this somewhat suggests that the data may naturally organize into two dominant regimes, with further refinement into four clusters capturing more nuanced differences.

I shall conclude that K=2K=2 offers the clearest high-level segmentation, while K=4K=4 provides a more detailed and informative view of the underlying structure. As such, we’ll proceed with K=4K=4.

K-Means Fitting and Analysis

Now, we can simply rerun KMeans with n_clusters=4 keeping all else equal and plot the corresponding color-labeled scatterplot.

kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

plt.scatter(
    X_scaled[:, 0],
    X_scaled[:, 1],
    c=labels
)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    marker='X'
)
plt.show()
Scatterplot of K-Means Clustering

As expected, most of the points are clustered around each of the four centroids. After fitting KK-means, we define outliers in a very natural way: points that lie unusually far from their assigned cluster center. Concretely, for each point we compute the distance xiμC(i)\|x_i – \mu_{C(i)}\| (not squared distance), and then, within each cluster, flag points that exceed a threshold of the cluster mean plus two standard deviations.

Defining outliers per cluster is important because different clusters can have very different spreads, so an extreme distance in a tight cluster may be completely normal in a more dispersed one. This gives a fair, local notion of what it means to be “far”.

Scatterplot of K-Means Clustering with Labeled Outliers

However, in the context of financial data, outliers need to be interpreted carefully. KK-means labels something as an outlier simply because it doesn’t fit well into its geometry, which isn’t the same as labeling a data point as bad, noisy, or erroneous.

For instance, in the case of EchoStar (SATS), the extreme volatility and returns were driven by event risk (fears of bankruptcy and regulatory pressure around its spectrum assets), followed by major positive developments ($23 billion spectrum sale to AT&T and $17 billion spectrum deal with SpaceX). Obviously, this isn’t about bad data.

In this sense, identifying outliers is less about preprocessing (i.e., treating noise) and more about diagnosis (i.e., interpreting structurally different assets). Regardless, our simple clustering analysis is useful for investors looking to diversify their investment portfolios, as it allows them to identify groups of stocks with different levels of risk and return (w.r.t. our determined factors and subject to model limitations).

This essentially serves as a useful starting point for risk-diversification in portfolio optimization settings.

K-Means Limitations

Our previous example showcases one limitation of KK-means: it groups assets purely based on geometric proximity and summarizes each cluster by a single “average” point. This works well when the data is homogeneous and clusters are roughly spherical, but stocks are often driven by unique narratives which can’t be meaningfully represented by a single centroid (e.g., elongated, skewed, heteroskedastic).

Perhaps most importantly in financial contexts, it struggles with heterogeneous data where small but meaningful groups or rare events carry significant information.

If these limitations matter, there are more suitable alternatives depending on the goal:

  • Gaussian Mixture Models (GMMs) extend K-means by allowing soft cluster membership and flexible covariance structures, making them better at handling overlapping or non-spherical clusters.
  • Density-based methods like DBSCAN can explicitly identify outliers without forcing them into clusters.
  • Hidden Markov Models (HMMs) may be more appropriate if the data is driven by regimes or time dynamics.

Even within the K-means framework, we can mitigate some issues by careful feature engineering, scaling, or clustering in a richer feature space. Anyway, the key idea is to recognize that KK-means is a geometric simplification that’s be useful for capturing broad structure, but not designed to explain the most extreme or structurally unique observations.

***

KK-means is generally a great starting tool for capturing broad structure even in financial data science. It helps us see the overall landscape, and just as importantly, highlights where that simple picture breaks down.

It’s precisely because of its weaknesses that we can see where it breaks down, allowing us to move on to more appropriate models like GMMs, DBSCAN, or HMMs depending on the underlying structure of the data we’re dealing with.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *


More Posts