# Introduction to Customer Segmentation

Demand for a product or service drives the creation and expansion of every business. To create demand, we need to find market niches that established companies do not yet serve. To do that, we need to understand the source of the demand, which in our case is the customer base.

Customers are typically not one homogeneous group. They can be divided into groups, or segments; this process is called customer segmentation. It can be based on several criteria:

- Geographic – where people live shapes their taste in products and services. This can include latitude, longitude, climate, and landscape, all of which affect lifestyle and preferences.
- Demographic – age, gender, race, ethnicity, marital status, family life, religion, and income.
- Psychological – lifestyle, interests, and the activities the person participates in.
- Behavior – the patterns a person shows when making purchases. Purchase behavior can be studied through the number of purchases, their intensity, the volume, and the price. Another aspect is loyalty: whether the customer sticks with the same brand or switches brands.
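
As a minimal sketch of the behavioral criteria above, the snippet below aggregates a few purchase records into per-customer features: number of purchases, volume, average unit price, and a simple loyalty score. The `transactions` records, their field layout, and the definition of loyalty as the share of purchases going to the most-bought brand are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical transactions: (customer_id, brand, quantity, unit_price).
transactions = [
    ("alice", "BrandA", 2, 10.0),
    ("alice", "BrandA", 1, 12.0),
    ("alice", "BrandB", 3, 8.0),
    ("bob", "BrandC", 1, 50.0),
    ("bob", "BrandC", 1, 55.0),
]


def behavioral_features(rows):
    """Aggregate per-customer purchase count, volume, price and brand loyalty."""
    by_customer = defaultdict(list)
    for customer, brand, quantity, unit_price in rows:
        by_customer[customer].append((brand, quantity, unit_price))

    features = {}
    for customer, purchases in by_customer.items():
        brands = Counter(brand for brand, _, _ in purchases)
        volume = sum(q for _, q, _ in purchases)
        spend = sum(q * p for _, q, p in purchases)
        features[customer] = {
            "n_purchases": len(purchases),
            "volume": volume,
            # Volume-weighted average price paid per unit.
            "avg_unit_price": spend / volume,
            # Assumed loyalty measure: share of purchases of the top brand.
            "loyalty": brands.most_common(1)[0][1] / len(purchases),
        }
    return features


features = behavioral_features(transactions)
```

Feature vectors like these can then serve as input to the statistical methods discussed in the rest of the article.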

Collecting customer data makes statistical analysis possible. People who belong to the same or similar groups tend to have similar tastes in products and services. This lets us make data-driven predictions of future customer demand and target new customers based on their segment. In this way we can optimize production, increase revenue, and improve the retention of our customer base.

In the rest of the article we go through different statistical methods used for customer segmentation. Some of them are complementary; others apply only in a specific context. The first, and the most widely used, is clustering.

- Clustering, or cluster analysis, is an unsupervised learning technique in which similar data points are grouped into clusters. It is typically applied to multivariate data, that is, data with two or more dimensions. These methods date back to the 1930s, with early work in anthropology and psychology. First, the distance between data points is calculated using a chosen measure. Next, we try to maximize the distance between clusters and minimize the distance within each cluster. There are several variations of the clustering algorithm. There is usually no single best clustering method; it is chosen experimentally unless we know something important about the data. Here is a list of the main types of clustering:
    - Centroid-based clustering – the best-known form is k-means, where each data point is assigned to the nearest cluster center so that the within-cluster distances are minimized. The number of clusters, k, is chosen in advance. There are several modifications of this algorithm.
    - Connectivity-based clustering, also known as hierarchical clustering – the result is represented as a dendrogram. We assume that nearby points are more closely related than distant ones, so smaller clusters are merged step by step into bigger ones.
    - Distribution-based clustering – this is based on probability distributions. Data points that belong to the same cluster are assumed to come from the same probability distribution.
    - Density-based clustering – here clusters can take arbitrary shapes. A data point belongs to a cluster if it lies in a region with at least a required density; otherwise it is treated as noise. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an example of this type of clustering.
    - Grid-based clustering – here the data space is divided into cells; a cell whose density exceeds a threshold is marked as part of a cluster. Examples of such algorithms are STING (Statistical Information Grid) and CLIQUE.
    - Other clustering algorithms – there are several newer developments, but at this time they are still more experimental and not widely used, so we will not delve deeper.
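
To make the centroid-based idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The `customers` data, the choice of features, and the initialisation from randomly sampled data points are assumptions for illustration; in practice a library implementation would normally be used.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(xs) / len(members) for xs in zip(*members)
                )
    return labels, centroids


# Hypothetical customers: (annual spend in thousands, purchases per month).
customers = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (8.0, 9.0), (8.2, 8.8), (7.9, 9.1),
]
labels, centroids = kmeans(customers, k=2)
```

With two well-separated groups like these, the first three customers end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initialisation.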

- Logistic regression – this is one of the best-known methods and part of the family of generalized linear models. It is used as a classification algorithm. There are both bivariate and multivariate versions. The odds of the different outcomes are calculated, and these can be converted to a probability.
- Ensemble methods – here several weaker algorithms are combined to create a stronger one. This is usually done through voting. Some variants put extra emphasis on reducing the errors of the individual models. Here are some of the best-known examples:
    - Random Forest – this is a combination of many decision trees that can be used for both classification and regression. Each tree is built on a random sample of the cases, hence the name. A single decision tree is prone to overfitting; averaging many randomized trees reduces this, so the trees in a forest are usually left unpruned.
    - Boosting algorithms – here weaker learners are combined sequentially to produce a stronger one. There are several variations. AdaBoost is one of the best known; it gives more weight to the misclassified cases. Another algorithm is XGBoost, which again uses an iterative process to reduce the cost function.
    - Stacking – several algorithms are trained separately on the dataset. Finally, a combiner algorithm, most often a logistic regression, is trained on their predictions.
    - Bucket of models – a model selection technique is used to pick the best-performing algorithm.
    - Bayesian methods – here the predictions of multiple models are averaged, weighted by how probable each model is given the data. There are several variations: the Bayes optimal classifier, Bayesian model averaging, and Bayesian model combination.
    - Consensus clustering – unlike the rest of these algorithms, this is an unsupervised ensemble method. Several clustering algorithms are used and their results are aggregated.
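
The sketch below illustrates two ideas from the list above: turning log-odds into a probability, as logistic regression does, and combining several models by voting. It uses soft voting, which averages the predicted probabilities, as one possible voting scheme. The coefficients are hand-picked placeholders rather than fitted values, and the features are hypothetical.

```python
import math


def sigmoid(log_odds):
    """Convert log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def logistic_model(intercept, weights):
    """Return a scorer for fixed coefficients (assumed to be fitted already)."""
    def predict_proba(x):
        log_odds = intercept + sum(w * xi for w, xi in zip(weights, x))
        return sigmoid(log_odds)
    return predict_proba


# Hypothetical coefficients standing in for three separately fitted models.
# Features: (purchases per month, average basket value in tens of units).
models = [
    logistic_model(-3.0, [1.0, 0.0]),
    logistic_model(-4.0, [0.0, 1.0]),
    logistic_model(-5.0, [0.8, 0.6]),
]


def soft_vote(x, threshold=0.5):
    """Average the models' probabilities, then threshold the average."""
    p = sum(m(x) for m in models) / len(models)
    return p, p >= threshold


p_high, label_high = soft_vote((5.0, 6.0))  # frequent, high-value customer
p_low, label_low = soft_vote((1.0, 1.0))    # infrequent, low-value customer
```

Hard voting would instead threshold each model's probability first and take the majority class. Soft voting keeps more information when the individual models produce reasonably calibrated probabilities.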

- Conjoint analysis – this is a survey-based method, originally developed in the field of mathematical psychology. It tests how the consumer's attitude toward different features of a product or service influences decision making. Related techniques include Brand-Price Trade-Off, Simalto (Simultaneous Multi-Attribute Trade-Off), AHP (Analytic Hierarchy Process), evolutionary algorithms, and rule-developing experimentation.
- Factor Analysis – this method is used when we have multidimensional data. Some of the observed variables may be correlated and carry little independent information. The assumption is that hidden, or latent, variables called factors lie behind them. Each observed variable is modeled as a linear combination of the factors plus an error term. The goal is to identify these latent factors.
- PCA (Principal Component Analysis) – this is another option for dimensionality reduction. The new variables are called principal components; they are uncorrelated linear combinations of the original variables, ordered by the amount of variance they explain. It is one of the oldest methods of this kind.
- Mixture models – this is a very broad field of diverse models. The total population may not be a homogeneous unit and may consist of subpopulations. The name comes from mixing statistical distributions, one for each subpopulation. We try to infer the properties of the subpopulations from observations of the overall population.
- Structural equation modeling – here equations are used to model unobserved, or latent, variables from the observed ones. Regression analysis can be used to estimate the parameters. Examples include confirmatory composite analysis, path analysis, latent growth modeling, and others.
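
As an illustration of dimensionality reduction with PCA, the sketch below finds the first principal component of a small dataset and the share of variance it explains. It uses power iteration on the sample covariance matrix, which is one simple way to do this and not necessarily how a library would implement it. The `data` values and their meaning are hypothetical, and the code assumes the data is not constant.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (unit direction, explained variance ratio) of the first PC."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalise, which converges
    # to the eigenvector with the largest eigenvalue. A random start makes it
    # very unlikely that we begin orthogonal to that eigenvector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along the direction v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical customers: (visits per month, spend per month in tens of units).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, ratio = first_principal_component(data)
```

Because spend grows almost in proportion to visits in this toy data, a single component captures nearly all of the variance, so the two original variables could be replaced by one.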

This short list of algorithms is far from exhaustive, and some of the algorithms can be used interchangeably. There is no single best algorithm. We strongly encourage the curious reader to research the topic further.

When our model performs well and is used in production, it needs to be maintained and backtested with the latest data. If there are large discrepancies and the model's performance degrades over time, we need to make major changes or redevelop the model entirely. Here the knowledge of various algorithms comes in handy, so that we can create an even better model than the initial one.
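
A minimal sketch of such a backtest, assuming we keep the observed and predicted segment of each customer for every period: it compares the accuracy of each period with a baseline and flags periods where the drop exceeds a tolerance. The baseline value, the `max_drop` threshold, and the sample batches are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the observed outcomes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def backtest(baseline_accuracy, periods, max_drop=0.15):
    """Flag each period whose accuracy is more than max_drop below baseline."""
    report = []
    for name, y_true, y_pred in periods:
        acc = accuracy(y_true, y_pred)
        degraded = baseline_accuracy - acc > max_drop
        report.append((name, acc, degraded))
    return report


# Hypothetical batches: (label, observed segments, predicted segments).
periods = [
    ("month_1", [0, 1, 1, 0, 1], [0, 1, 1, 0, 1]),
    ("month_2", [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]),
    ("month_3", [1, 1, 0, 0, 1], [0, 0, 0, 1, 1]),
]
report = backtest(baseline_accuracy=0.9, periods=periods)
flagged = [name for name, _, degraded in report if degraded]
```

Here only the third batch is flagged. Accuracy is just one possible metric; the same loop works with any score that suits the model being monitored.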

We must remember that AI and statistics are not magic but science, and to some extent art. A firm grasp of the material helps us navigate the path better and stay ahead of the competition.