Title: | Hierarchical Modal Clustering |
---|---|
Description: | Performs Modal Clustering (MAC) including Hierarchical Modal Clustering (HMAC) along with their parallel implementation (PHMAC) over several processors. These model-based non-parametric clustering techniques can extract clusters in very high dimensions with arbitrary density shapes. By default clustering is performed over several resolutions and the results are summarised as a hierarchical tree. Associated plot functions are also provided. There is a package vignette that provides many examples. This version adheres to CRAN policy of not spanning more than two child processes by default. |
Authors: | Surajit Ray and Yansong Cheng |
Maintainer: | Surajit Ray <[email protected]> |
License: | GPL-2 |
Version: | 0.7 |
Built: | 2025-01-25 03:03:04 UTC |
Source: | https://github.com/cran/Modalclust |
Choosing the cluster which is closest to a point specified by user. Works only for two dimensional data.
choose.cluster(hmacobj,x=NULL,level=NULL,n.cluster=NULL)
choose.cluster(hmacobj,x=NULL,level=NULL,n.cluster=NULL)
hmacobj |
The output of HMAC analysis. An object of class 'hmac'. |
x |
The user-specified location. Deafult value is NULL in which case user chooses a point using the locator function. |
level |
The specified level |
n.cluster |
The specified number of clusters. Either |
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel implementation of modal clustering.
data(disc2d.hmac) #disc2d.hmac is the output of phmac(disc2d,npart=1) choose.cluster(disc2d.hmac,x=c(0,0),level=3) choose.cluster(disc2d.hmac,x=c(0,0),n.cluster=2) # Users can choose anypoint they want by clicking the point # in the plot after the following command. # choose.cluster(disc2d.hmac,level=3)
data(disc2d.hmac) #disc2d.hmac is the output of phmac(disc2d,npart=1) choose.cluster(disc2d.hmac,x=c(0,0),level=3) choose.cluster(disc2d.hmac,x=c(0,0),n.cluster=2) # Users can choose anypoint they want by clicking the point # in the plot after the following command. # choose.cluster(disc2d.hmac,level=3)
Plot clusters for two dimensional data with contours of the original data
## S3 method for class 'hmac' contour(x, n.cluster=NULL,level=NULL,prob=NULL,smoothplot=FALSE,...)
## S3 method for class 'hmac' contour(x, n.cluster=NULL,level=NULL,prob=NULL,smoothplot=FALSE,...)
x |
The output of HMAC analysis. An object of class 'hmac'. |
level |
The specified level |
n.cluster |
The specified number of clusters. Either |
prob |
The specified level of the contour plot. Default value is NULL, plot all levels of the contour plot. Must be between 0 and 1 |
smoothplot |
Get the smooth scatter plot of the original data set. Default value is FALSE, which does not provide the smooth scatter plot. |
... |
Further arguments passed to or from other methods. |
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel implementation of modal clustering.
soft.hmac
for soft clustering at specified levels.
hard.hmac
for hard clustering at specified levels.
See plot
for plotting the whole dendrogram.
data(disc2d.hmac) # disc2d.hmac is the output of phmac(disc2d,npart=1) contour.hmac(disc2d.hmac,level=3,col=gray(0.7)) # Provide contour line at probability density 0.05. contour(disc2d.hmac,n.cluster=2,prob=0.05) # Plot using smooth scatter plot. contour.hmac(disc2d.hmac,n.cluster=2,smoothplot=TRUE)
data(disc2d.hmac) # disc2d.hmac is the output of phmac(disc2d,npart=1) contour.hmac(disc2d.hmac,level=3,col=gray(0.7)) # Provide contour line at probability density 0.05. contour(disc2d.hmac,n.cluster=2,prob=0.05) # Plot using smooth scatter plot. contour.hmac(disc2d.hmac,n.cluster=2,smoothplot=TRUE)
Two dimensional data in original and log scale and their hierarchical modal clustering. This dataset demonstrates the fact that modal clustering techniques can be used to cluster untransformed data as it does not depend on parametric assumptions. The clustering results, before and after the log transformation both produce nice separation of the three clusters.
data(cta20) data(cta20.hmac) data(logcta20) data(logcta20.hmac)
data(cta20) data(cta20.hmac) data(logcta20) data(logcta20.hmac)
cta20 and logcta20 are two dimensional
matrices. cta20.hmac and logcta20.hmac are objects of class hmac
obtained from applying phmac
on cta20 and logcta20 respectively
The dataset is generated by illumina technology for high throughput genotyping named GOLDEN GATE. The data values are actual measurements made by the machine (intensity), after these are normalized (background subtracted etc). The data set is used for making genotype calls by Illumina. The data around X- and Y-axes represents the two homozygous genotypes (e.g. AA and TT), while the cluster along the 45-degree line represents the heterozygous (e.g. AT) genotype. Due to noisy reads, the data points often lie in-between the axes, and cluster detection is used for making automatic genotype calls.
Surajit Ray and Yansong Cheng
data(logcta20) data(logcta20.hmac) plot(logcta20) plot(logcta20.hmac) plot(logcta20.hmac,level=4)
data(logcta20) data(logcta20.hmac) plot(logcta20) plot(logcta20.hmac) plot(logcta20.hmac,level=4)
Two and three dimensional data and their hierarchical modal clustering with 400 observations where the first two dimensions represent the shape of two discs.
data(disc2d) data(disc2d.hmac) data(disc3d) data(disc3d.hmac)
data(disc2d) data(disc2d.hmac) data(disc3d) data(disc3d.hmac)
disc2d and disc3d are two and three dimensional
matrices. disc2d.hmac and disc3d.hmac are objects of class hmac
obtained from applying phmac
on disc2d and disc3d respectively
Two dimensional data with 400 observations representing the shape of two half discs.
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
data(disc2d) plot(disc2d) data(disc2d.hmac) summary.hmac(disc2d.hmac) hard.hmac(disc2d.hmac,n.clust=2)
data(disc2d) plot(disc2d) data(disc2d.hmac) summary.hmac(disc2d.hmac) hard.hmac(disc2d.hmac,n.clust=2)
Find the mid point of memberships of each cluster. Sub function of plot.hmac
.
findmid(x,memb)
findmid(x,memb)
x |
Input data |
memb |
Membership of each observation |
Surajit Ray and Yansong Cheng
plot
for plotting the dendrogram
Plot clusters with colors obtained from hard density. Plot one dimensional data with density plot. Plot two dimensional data with scatter plot. Pairwise scatter plot will be provided for data with more than two dimensions.
hard.hmac(hmacobj,level=NULL, n.cluster=NULL,plot=TRUE,colors=1:6,...)
hard.hmac(hmacobj,level=NULL, n.cluster=NULL,plot=TRUE,colors=1:6,...)
hmacobj |
The output of HMAC analysis. An object of class 'hmac'. |
level |
The specified level of HMAC output |
n.cluster |
The specified number of clusters. If neither |
plot |
Get the plot of the clusters with different colors. Default value is TRUE, draws a plot on the current graphics device; plot=FALSE indicates do not get the plot and returns the membership of data. |
colors |
Colors used to represent different clusters. |
... |
Further graphical parameters |
Returns the membership of each observation of the specified level if plot=FALSE
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel implementation of modal clustering
soft.hmac
for soft clustering at specified levels.
See plot.hmac
.
data(disc2d.hmac) #disc2d.hmac is the output of phmac(disc2d,npart=1) hard.hmac(disc2d.hmac,level=3) #returns the membership of each observation disc2d.2clus=hard.hmac(hmacobj=disc2d.hmac,n.cluster=2,plot=FALSE) table(disc2d.2clus) #hard.hmac(disc2d.hmac) iris.hmac=phmac(iris[,-5]) # For more than two dimensions it produces the pairs plot hard.hmac(iris.hmac,n.cluster=2)
data(disc2d.hmac) #disc2d.hmac is the output of phmac(disc2d,npart=1) hard.hmac(disc2d.hmac,level=3) #returns the membership of each observation disc2d.2clus=hard.hmac(hmacobj=disc2d.hmac,n.cluster=2,plot=FALSE) table(disc2d.2clus) #hard.hmac(disc2d.hmac) iris.hmac=phmac(iris[,-5]) # For more than two dimensions it produces the pairs plot hard.hmac(iris.hmac,n.cluster=2)
Performs Modal Cluster with specified smoothing paramters. Used as a sub function of phmac
.
hmac(dat,Sigmas,G=NULL,member=NULL)
hmac(dat,Sigmas,G=NULL,member=NULL)
dat |
Matrix of data points |
Sigmas |
Specified smoothing levels |
G |
Specified values of modes. A matrix with number of rows equal to the number of modes and number of columns equal to the dimension of the data. Defualt value is NULL |
member |
Membership of the observations to the modes given in G. Default value is NULL |
data |
Same as the input dat. |
n.cluster |
Number of clusters at each level. |
level |
Levels corresponding to each smoothing parameter. |
Sigmas |
Same as input sigmas. |
mode |
List of modes at each distinct levels. |
membership |
List of memmbership to modes at each distinct levels. |
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel implementation of modal clustering.
these set of functions are based on the concept of pseudo degrees of freedom (Lindsay et al 2008) and are used to calculate the Sigmas that are used for the 'hmac' function
khat.inv(p,len=10) sdofnorm(h,p) khat(dof,p)
khat.inv(p,len=10) sdofnorm(h,p) khat(dof,p)
len |
Number of smoothing parameters. |
h |
Smoothing parameter |
p |
Number of column of data |
dof |
Degrees of freedom |
Surajit Ray
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using HMAC and also for parallel
implementation of modal clustering.
Faster calculation of density of multivariate normal with diagonal covariance matrix
mydmvnorm(x, mean, sigmasq)
mydmvnorm(x, mean, sigmasq)
x |
The input data |
mean |
The vector of mean values |
sigmasq |
The variance of each dimension. Assume the variance are the same for all dimensions. |
Surajit Ray and Yansong Cheng
A one dimensional data and its hierarchical modal clustering with 2 main clusters
data(oned) data(oned.hmac)
data(oned) data(oned.hmac)
oned
is a one dimensional data with 2 main clusters and several subclusters.
oned.hmac
is an object of class 'hmac'
obtained from applying phmac
on disc2d and disc3d respectively
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
data(oned) hist(oned,col="lavender",n=15) data(oned.hmac) plot(oned.hmac) plot(oned.hmac,n.clust=2)
data(oned) hist(oned,col="lavender",n=15) data(oned.hmac) plot(oned.hmac) plot(oned.hmac,n.clust=2)
Performing Modal Clustering
phmac(dat, length = 10, npart = 1, parallel = TRUE, sigmaselect = NULL, G= NULL) modalclust(dat, length = 10, npart = 1, parallel = TRUE, sigmaselect = NULL, G= NULL)
phmac(dat, length = 10, npart = 1, parallel = TRUE, sigmaselect = NULL, G= NULL) modalclust(dat, length = 10, npart = 1, parallel = TRUE, sigmaselect = NULL, G= NULL)
dat |
Matrix of data points |
length |
number of smoothing levels. Default is 10 |
sigmaselect |
Specified Smoothing levels. Default NULL will calculate the Sigma levels using concept of spectral degrees of freedom given in Lindsay et al (2008) |
npart |
Number of random partitions when using parallel computing. If using several processors of a machine one option is to choose the number of partitions equal to the number of processors |
parallel |
If TRUE uses parallel comptation using |
G |
Specified values of modes. A matrix with number or rows equal to the number of modes and number of columns equal to the dimension of the data. Defualt value is NULL |
data |
Same as the input Data |
n.cluster |
Number of clusters at each level. |
level |
Levels corresponding to each smoothing parameter. |
sigmas |
Same as input sigmaselect if provided or dynamically calculated smoothing levels based on Spectral Degrees of Freedom criterion. Uses the function khat.inv |
mode |
List of modes at each distinct levels. |
membership |
List of memmbership to modes at each distinct levels. |
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
soft.hmac
for soft clustering at specified levels.
hard.hmac
for hard clustering at specified levels.
See plot.hmac
.
data(disc2d) ## Not run: disc2d.hmac=phmac(disc2d,npart=1) plot.hmac(disc2d.hmac,level=2) ## For parallel implementation ## Not run: disc2d.hmac.parallel=phmac(disc2d,npart=2,parallel=TRUE) soft.hmac(disc2d.hmac,level=2) soft.hmac(disc2d.hmac,n.cluster=3) hard.hmac(disc2d.hmac,n.cluster=3)
data(disc2d) ## Not run: disc2d.hmac=phmac(disc2d,npart=1) plot.hmac(disc2d.hmac,level=2) ## For parallel implementation ## Not run: disc2d.hmac.parallel=phmac(disc2d,npart=2,parallel=TRUE) soft.hmac(disc2d.hmac,level=2) soft.hmac(disc2d.hmac,n.cluster=3) hard.hmac(disc2d.hmac,n.cluster=3)
Plots the dendrogram of the entire heierarchical tree for a 'hmac' object starting from any specified smoothing level.
## S3 method for class 'hmac' plot(x,mycol=1:6,level=1,n.cluster=NULL,userclus=NULL,sep=.1,...)
## S3 method for class 'hmac' plot(x,mycol=1:6,level=1,n.cluster=NULL,userclus=NULL,sep=.1,...)
x |
The output of HMAC analysis. An object of class 'hmac'. |
mycol |
Colors used to represent different clusters. |
level |
The specified level that dendrogram starts. Default value is 1. |
n.cluster |
The specified number of clusters. If neither |
userclus |
If user provides membership, the tree colors the node according to this membership and the tree can be used for validation. |
sep |
It provides the distance between the lowest layer of nodes of the clusters. |
... |
further arguments passed to or from other methods. |
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel implementation of modal clustering.
hard.hmac
for hard clustering at specified levels.
soft.hmac
for soft clustering at specified levels.
data(disc2d.hmac) # disc2d.hmac is the output of phmac(disc2d,npart=1) plot(disc2d.hmac) set.seed(20) mix4=data.frame(rbind(rmvnorm(20,rep(0,4)), rmvnorm(20,rep(2,4)), rmvnorm(20,rep(10,4)),rmvnorm(20,rep(13,4)))) mix4.hmac=phmac(mix4,npart=1) plot(mix4.hmac,col=1:6) # Verifying with user provided groups plot(mix4.hmac,userclus=rep(c(1,2,3,4),each=20),col=1:6)
data(disc2d.hmac) # disc2d.hmac is the output of phmac(disc2d,npart=1) plot(disc2d.hmac) set.seed(20) mix4=data.frame(rbind(rmvnorm(20,rep(0,4)), rmvnorm(20,rep(2,4)), rmvnorm(20,rep(10,4)),rmvnorm(20,rep(13,4)))) mix4.hmac=phmac(mix4,npart=1) plot(mix4.hmac,col=1:6) # Verifying with user provided groups plot(mix4.hmac,userclus=rep(c(1,2,3,4),each=20),col=1:6)
Plot clusters for two dimensional data with colors representing the posterior probability of belonging to clusters. Additionally boundary points between the clusters, with specified thresholds are also
soft.hmac(hmacobj,n.cluster=NULL,level=NULL,boundlevel=0.4,plot=TRUE)
soft.hmac(hmacobj,n.cluster=NULL,level=NULL,boundlevel=0.4,plot=TRUE)
hmacobj |
The output of HMAC analysis. An object of class 'hmac'. |
level |
The specified level of HMAC output |
n.cluster |
The specified number of clusters. If neither |
boundlevel |
Posterior probability threshold. Points having posterior probability below |
plot |
Get the two dimensional plot of the clusters with different colors. Default value is TRUE, which returns the two dimensional plot on the current graphics device; plot=FALSE returns the posterior probability of each observation. |
Returns the list that contains the posterior probability of each observation and boundary points at specified level if plot=FALSE
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel implementation of modal clustering
hard.hmac
for hard clustering at specified levels.
data(logcta20.hmac) #logcta20.hmac is the output of phmac(logcta20,npart=1) soft.hmac(logcta20.hmac,n.cluster=3) #return the posterior probability of each observation and boundary points. postprob=soft.hmac(hmacobj=logcta20.hmac,n.cluster=3,plot=FALSE)
data(logcta20.hmac) #logcta20.hmac is the output of phmac(logcta20,npart=1) soft.hmac(logcta20.hmac,n.cluster=3) #return the posterior probability of each observation and boundary points. postprob=soft.hmac(hmacobj=logcta20.hmac,n.cluster=3,plot=FALSE)
Gives the summary of output of a 'hmac' object.
## S3 method for class 'hmac' summary(object,...)
## S3 method for class 'hmac' summary(object,...)
object |
The output of HMAC analysis. An object of class 'hmac'. |
... |
further arguments passed to or from other methods. |
Surajit Ray and Yansong Cheng
Li. J, Ray. S, Lindsay. B. G, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research , 8(8):1687-1723, 2007.
Lindsay, B.G., Markatou M., Ray, S., Yang, K., Chen, S.C. "Quadratic distances on probabilities: the foundations," The Annals of Statistics Vol. 36, No. 2, page 983–1006, 2008.
phmac
for front end of using modal clustering and also for parallel
implementation of modal clustering.
data(disc2d.hmac) summary(disc2d.hmac)
data(disc2d.hmac) summary(disc2d.hmac)