Consumer behavior in online stores gives rise to product graphs based on co-purchasing
or co-viewing behavior. These product graphs can be analyzed using
the known methods of graph analysis.
Social networks exhibit strong clustering properties as explained by the principle of
triadic closure. Two individuals with a mutual friend are highly likely to form a friendship
on a social networking website. A friend of a friend is a friend, and this also
creates structural balance.
The clustering coefficient (CC) measures the density of edges in a graph. A CC of
1 indicates that all nodes are connected to each other i.e. the graph is a clique. A CC
of 0 indicates a graph of formation of stars in which there are no triangles, and hence
no indication of clustering of the nodes.
Assortive mixing in a graph causes similar nodes to connect together. It is our conjecture
that products in a particular category tend to mix more as compared to a graph
produced using products of all categories. Thus, we expect the clustering coefficient of
a graph constructed using products of a particular category to be higher. Also, as compared
to the Erd˝os-Renyi random graph, we expect our product graphs to have higher
clustering coefficients as well as a different degree distribution i.e. a power law
The Clustering Coefficient
The clustering coefficient of a graph is described by the following equation:
CC = (3 *Tr)=(Tt)
where Tr is the number of triangles in the graph and Tt is the number of triplets (open
The clustering coefficient is then 1:0 if the graph is a giant clique and 0:0 if the graph
only has star formations i.e. there is no clustering of the nodes.
The clustering coefficient depends directly on the number of edges of the network.
The Erd¨os-Reny´ı Random Graph Model
We compare the results of each category with the  ER random graph model G(n; p)
where n is the number of vertices and p is the probability of an edge between any of
the 2*(n/2) pairs of edges (in an undirected graph created from directed edges).
So, the probability is then:
p = e/(n *(n – 1))
where e is the number of edges in the original graph and n is the vertices in the original
graph. The graph is then constructed by adding randomly chosen edges from the 2*(n/2)
The degree of any node is then binomial which is approximated by the Poisson distribution
when n-> ∞. This is why the ER random graph model is called the Poisson
random graph model.
Results on the Product Graph – Example
Figure 1 shows the clustering coefficients of graphs of some categories. In all but the
baby category, the clustering coefficient is higher than the one in the corresponding
random graph (with the same number of edges).
Figure 2 shows the probability of an edge in the graph of several categories. The category
baby has the highest probability of almost 10% and grocery has the next highest
probability. The probability represents the number of co-viewed or co-bought products
divided by the total number of edges possible in the undirected graph (n (n 1),
where n is the number of vertices).
Figure 1: Clustering Coefficients Across Categories
Predictive Power of the Local Clustering Coefficient
The local clustering coefficient is calculated per node in the graph. It is defined as the
number of triangles formed in the neighborhood of a node divided by the number of
CClocal = (3 *Trlocal)=(Ttlocal)
Figure 2: Probability of an Edge in the Graph Across Categories
where Trlocal is the number of triangles in the immediate neighborhood of a node
and Ttlocal is the number of triplets in the neighborhood (open or closed).
We evaluated the predictive value of the local clustering coefficient by using it for
product recommendations in the example. We keep 20% of the consumers as a held
out test set and evaluate the recommendations using the hit rate @ position 10. We
compare the hit rates with the ones calculated in  which notes that the maximum hit
rate is 31.1%. We found that using only the local clustering coefficient (one dimension
as compared to 250 dimensions), we can achieve a hit rate of 0.7% @ position 10.
Degrees of Actual and Random Graphs
The average degree on the actual graph is about 40 in all categories within a one week
period of views and purchases
the top six products
with the highest degrees on Target.com (for this figure, we ignore all edges that did not
occur at least 10 times i.e. at least 10 guests co-viewed the items). This is significantly
different from the random graph
The degree distribution of the actual
graph is clearly a power law. The degree distribution of the random graph is quite
Average Path Lengths of Actual Graphs Across Categories
We randomly sample 1000 nodes from each category and compute all pairs shortest
paths. The sampling is done to keep all categories comparable and to avoid using infeasible
computational resources. For each pair of nodes in the sample,
we compute the shortest path between them and then compute the mean of the shortest
The probability of an edge (figure 2) and the clustering coefficient (CC) (figure 1) are
the important properties of a graph. These two measures have a correlation of 0.85.
Both of these measures represent the amount of cohesion in the graph between products.
The probability of an edge measures the number of times two products are viewed
or bought together, and the CC also measures the amount of cohesion in the products
of the graph.
Figure 2 shows the probability of an edge in the graph. Baby, toys and kitchen have a
good probability of an edge. Apparel, furniture and home have a higher probability of
an edge as compared to all categories, but there is an opportunity to co-promote other
items of the same category using recommender systems.
We calculate the clustering coefficient (CC) for both the actual graph and the ER random
graph. We expect that the CC for the random graph to be significantly less than for
the actual graph, which is the case for all categories except baby. For baby, the CC for
the actual graph is the highest, which shows that consumers are very active within the
category. But the random graph has a higher CC. This could be because the probability
of an edge for baby is significantly higher for the baby category. But this also means
that there is an opportunity for increasing the cohesion between products for the baby
The difference between the CC for the actual graph and the random graph measures
the amount of information in the actual graph. For electronics, this difference is the
least, which shows that consumers who buy electronics do so cohesively with a much
higher amount of online research activity within the category. Same is the case for
apparel, furniture and home.
Grocery, kitchen and toys seem to have a good amount of clustering, but if we compare
the CC of these categories with the CC of all categories, there is an opportunity to
increase the cohesion (especially because the CC of the random graph for these categories
is also quite high).
The degree distribution of the actual graph and the random graph is shown in figure
The distribution of degrees on the actual graph has a power law distribution, which
might indicate that the graph is a scale-free network with significant amount of preferential
attachment (the tendency of forming larger and larger clusters of products).
The average degree is about the same for the random graph and the actual graph. The
distribution of the random graph is centered on the average degree, dropping symmetrically
on both sides, which is expected as the edges are chosen randomly.
we analyzed the product graph, formed using coviewing
and co-purchasing behavior in the consumer-product bipartite graph. We find
that the clustering coefficient is quite different across product categories and we discuss
some of the findings. We find that the degree distribution and the entropy of the
degree distribution are vastly different between the actual and random graphs. The path
lengths show that in a random sample of 1000 nodes from the actual graph, the average
is less than 5 for all categories implying the small world property that social networks
Future work could be to do some analysis on the mixing of categories. Specifically,
it might be useful to see how the products cluster when the graph is constructed using
pairs of categories. For instance, it might be the case that furniture and home mix really
strongly and the CC of the combined graph might be higher than the graphs constructed
from the individual categories.
Dare to add real AI capabilities to your online data and uplift your results in Lead Generation, Upsell or Retention contact us at any time.