Methods
The proposed methodology applies three clustering techniques to define estimation domains: K-means, Gaussian Mixture Models, and Geostatistical Clustering.
K-means
The K-means clustering method groups similar data points into clusters by iteratively assigning each point to its nearest cluster center and then updating each center to the mean of the points assigned to it. The algorithm stops when the cluster centers no longer change or when a pre-defined number of iterations is reached.
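The assign-then-update loop described above can be sketched in a few lines of plain Python (a minimal illustration, not the implementation used in this study; the toy points and the `kmeans` function name are ours):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members)
                                         for x in zip(*members)))
            else:  # keep the old center if a cluster emptied out
                new_centers.append(centers[c])
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers

# Two well-separated 2-D blobs.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
labels, centers = kmeans(pts, k=2)
```

On well-separated data such as this, the loop converges in a couple of iterations and recovers the two blobs.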
Gaussian Mixture Models (GMM)
Gaussian Mixture Models (GMM) is a probabilistic clustering method that models the data as a mixture of Gaussian distributions, with each cluster represented by one component. The algorithm alternates between estimating the likelihood of each data point belonging to each component and re-estimating the component parameters, refining the model until it reaches convergence.
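The alternation between estimating memberships and re-estimating parameters is the expectation–maximization (EM) loop. A stripped-down 1-D, two-component sketch (our own illustration with toy data, not the fitting routine used in the study) makes the two steps concrete:

```python
import math

def gmm_em_1d(data, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (minimal sketch).
    E-step: responsibility of each component for each point.
    M-step: re-estimate weights, means, variances from those."""
    # Crude initialization: split the sorted data in half.
    srt = sorted(data)
    half = len(srt) // 2
    mu = [sum(srt[:half]) / half, sum(srt[half:]) / (len(srt) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) per component.
        resp = []
        for x in data:
            p = [w[c] * pdf(x, mu[c], var[c]) for c in range(2)]
            s = sum(p)
            resp.append([pc / s for pc in p])
        # M-step: update parameters from the responsibilities.
        for c in range(2):
            n_c = sum(r[c] for r in resp)
            w[c] = n_c / len(data)
            mu[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var[c] = max(sum(r[c] * (x - mu[c]) ** 2
                             for r, x in zip(resp, data)) / n_c, 1e-6)
    # Hard assignment: most responsible component per point.
    labels = [max(range(2), key=lambda c: r[c]) for r in resp]
    return labels, mu

data = [0.0, 0.2, -0.1, 0.1, 10.0, 10.2, 9.9, 10.1]
labels, mu = gmm_em_1d(data)
```

Unlike K-means, the E-step yields soft memberships; the hard labels are only taken at the end.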
Geostatistical Clustering (GC)
The Geostatistical Clustering method calculates the direct and cross-variograms of all features and fits a kernel model to them. These variograms describe how the dissimilarity between pairs of samples changes with the distance separating them. A dissimilarity matrix derived from the kernel model serves as input to a hierarchical clustering algorithm, which groups the samples by similarity and produces a dendrogram that can be cut at any desired level to obtain the final clusters. This method was proposed by Fouedjio (2016) and was implemented for this study.
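The pipeline — variogram model, pairwise dissimilarity matrix, hierarchical clustering — can be illustrated with a heavily simplified toy sketch. This is not Fouedjio's kernel estimator: here we assume a spherical variogram model, form a dissimilarity that mixes the variogram of the separation distance with the squared grade difference, and merge with single linkage; all data and parameter values are invented for illustration:

```python
import math

def spherical(h, sill=1.0, vrange=3.0):
    """Spherical variogram model: rises from 0 to `sill` at `vrange`."""
    if h >= vrange:
        return sill
    r = h / vrange
    return sill * (1.5 * r - 0.5 * r ** 3)

def single_linkage(d, k):
    """Agglomerative clustering with single linkage on a precomputed
    dissimilarity matrix d; merge until k clusters remain."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Toy samples: (x, y, grade); two spatial groups with distinct grades.
samples = [(0, 0, 1.0), (0, 1, 1.1), (1, 0, 0.9),
           (9, 9, 5.0), (9, 10, 5.2), (10, 9, 4.8)]

n = len(samples)
d = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        h = math.dist(samples[i][:2], samples[j][:2])
        dz = samples[i][2] - samples[j][2]
        # Dissimilarity combines spatial separation (through the
        # variogram) with the attribute difference.
        d[i][j] = spherical(h) + 0.5 * dz * dz

clusters = single_linkage(d, k=2)
```

Cutting the merge process at k = 2 is the sketch's equivalent of cutting the dendrogram at the desired level.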
The optimum number of clusters (K)
We employed the elbow method in this study to identify the optimal number of clusters. To achieve this, we plotted the log-likelihood score as a function of the number of clusters and visually inspected the plot for the point where the rate of improvement started to level off. This point represented the optimal number of clusters, since adding more clusters did not significantly improve model performance.
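The visual inspection can also be approximated numerically: the elbow is where the improvement rate changes most sharply, i.e. where the second difference of the score curve is largest. A minimal sketch, using a hypothetical decreasing score curve (e.g. within-cluster sum of squares) in place of this study's actual values:

```python
def elbow_k(scores):
    """Pick the elbow of a decreasing score curve.
    `scores[i]` is the score for k = i + 1 clusters; the elbow is
    the k with the largest second difference (sharpest change in
    the rate of improvement)."""
    curv = {i: scores[i - 1] - 2 * scores[i] + scores[i + 1]
            for i in range(1, len(scores) - 1)}
    return max(curv, key=curv.get) + 1  # +1 because scores[0] is k = 1

# Hypothetical curve: big drop from k=1 to k=2, then diminishing returns.
wss = [100.0, 40.0, 35.0, 33.0, 32.0]
best_k = elbow_k(wss)
```

For a score that increases with k, such as the log-likelihood, the same idea applies with the sign flipped (the elbow is the most negative second difference).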
Quality of the clusters
To evaluate the quality of the resulting clusters, we will use several metrics. First, we will use the within-cluster variance to measure the similarity of elements within each cluster; this metric indicates whether the clusters are well defined and internally homogeneous.
Second, we will use the between-cluster variance to measure the dissimilarity between clusters; this metric indicates whether the clusters are distinct and well separated from each other.
Additionally, we will use the total entropy of each clustering result to evaluate its spatial continuity, or connectivity. The entropy will be calculated at each sample location from the nearby samples within a defined neighborhood: the proportion of each category among the neighbors gives a probability, from which the local entropy is derived. This process will be repeated for all sample locations, and the local entropies will be summed to obtain the total entropy.
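The neighborhood-entropy computation above translates directly into code. A minimal sketch with invented coordinates, labels, and neighborhood radius (a spatially contiguous labeling should score lower than an interleaved one):

```python
import math

def total_entropy(coords, labels, radius):
    """Sum of local Shannon entropies: at each sample, compute the
    category proportions among neighbors within `radius` (including
    the sample itself), take the entropy, and sum over all samples.
    Lower totals indicate more spatially connected clusters."""
    total = 0.0
    for i, ci in enumerate(coords):
        # Labels of all samples inside this sample's neighborhood.
        neigh = [labels[j] for j, cj in enumerate(coords)
                 if math.dist(ci, cj) <= radius]
        for cat in set(neigh):
            p = neigh.count(cat) / len(neigh)
            total -= p * math.log(p)
    return total

coords = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
compact = [0, 0, 0, 1, 1, 1]  # spatially contiguous clusters
mixed = [0, 1, 0, 1, 0, 1]    # interleaved clusters
e_compact = total_entropy(coords, compact, radius=1.5)
e_mixed = total_entropy(coords, mixed, radius=1.5)
```

Each neighborhood in the contiguous labeling is pure (entropy 0), so its total is the minimum possible, while the interleaved labeling is penalized.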