May 19, 2019

Kmeans Clustering

Kmeans Clustering*
How often do we see sports teams broken out into various groups? Some groupings are natural, driven by conferences, geography and standings. Others are data-driven to a certain extent but subject to the bias of the researcher. Oftentimes they are situation-specific and cater to the agenda of the writer. A truly objective way of grouping teams is missing in sports literature.

Kmeans clustering allows a researcher to put their own bias aside in grouping sports teams and let the data do all (well almost all) of the talking. Before the results and implications of this recent study are presented, it is important to understand the method being used.

What is kmeans clustering?
Kmeans clustering is a machine learning algorithm that groups observations (in this case, college football teams) which have a similar statistical profile based on chosen features. It is very popular in marketing analytics for segmenting a customer base. It has also proven to be very insightful for grouping college football teams. The researcher chooses only the features (and how many of them) and the number of groups; these are the only subjective parts. The algorithm then determines which teams to group together.
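To make the grouping step concrete, here is a minimal sketch of the standard kmeans procedure (Lloyd's algorithm) in plain Python. It is not the code used in the study. The team names, the two features (points scored and points allowed per game) and every number are made up for illustration, and the `kmeans` helper is hypothetical. In practice, features measured on different scales would usually be standardized first.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Group points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen observations as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # No point changed cluster, so the grouping is stable.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Made-up profiles: (points scored per game, points allowed per game).
teams = {
    "Team A": (42.0, 17.0),
    "Team B": (40.0, 20.0),
    "Team C": (21.0, 35.0),
    "Team D": (19.0, 33.0),
    "Team E": (41.0, 18.0),
    "Team F": (20.0, 36.0),
}

labels, centroids = kmeans(list(teams.values()), k=2)
groups = dict(zip(teams, labels))  # team name -> cluster number
```

The only choices made by hand here are the two features and `k=2`; the assignment of teams to groups comes from the data.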

What is good about it?
Researcher/author bias is largely eliminated. Subjective descriptions of playing style can be replaced with a statistical profile that is much harder to dispute. Also, cutoff points for inclusion in various groups can no longer be “cherry picked” to guarantee a team’s inclusion in a particular group. Once the features and number of groups are chosen, it is the most objective way to group teams.

What is bad about it?
Kmeans is only as smart as the data it is fed. It does not know about things like coaching changes or NCAA sanctions. If the researcher does not correct for these anomalies or state them explicitly (we chose to do the latter), then the usefulness of the algorithm is greatly diminished. THIS METHOD DOES NOT REPLACE SUBJECT MATTER EXPERTISE! Experts are needed to design the study and analyze the results. However, it is a great method for overcoming the bias of the researcher.

What do I do with the results?
In the marketing world, after customers are segmented, each segment is treated differently. In the sports world we can do something a little different. Since each team will belong to a cluster (or, in our study, three distinct clusters), we can evaluate the performance of various clusters against each other. In other words, we can see how a given team and other teams with a similar statistical profile perform against another group of teams that are alike.
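As a rough illustration of comparing clusters against each other, the sketch below tallies head-to-head results by cluster. The cluster assignments, the game results and the `win_rate` helper are all hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical cluster assignments (team name -> cluster number).
cluster_of = {
    "Team A": 0, "Team B": 0, "Team E": 0,
    "Team C": 1, "Team D": 1, "Team F": 1,
}

# Made-up results, each recorded as (winner, loser).
games = [
    ("Team A", "Team C"),
    ("Team B", "Team D"),
    ("Team F", "Team E"),
    ("Team A", "Team B"),
]

# Count wins for each (winner's cluster, loser's cluster) pair.
wins = Counter()
for winner, loser in games:
    wins[(cluster_of[winner], cluster_of[loser])] += 1

def win_rate(a, b):
    """Share of games between clusters a and b won by cluster a (None if none played)."""
    played = wins[(a, b)] + wins[(b, a)]
    return wins[(a, b)] / played if played else None

rate_0_vs_1 = win_rate(0, 1)  # how cluster 0 fared against cluster 1
```

Games between two teams in the same cluster are counted as well, but for `a == b` the rate is always 0.5, so only the cross-cluster rates are informative.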

*For more detailed information on methodology, data, assumptions, etc., please contact Paul Chimenti at
