Title: | Clustering of Micro Panel Data |
---|---|
Description: | Two-step feature-based clustering method designed for micro panel (longitudinal) data with the artificial panel data generator. See Sobisek, Stachova, Fojtik (2018) <arXiv:1807.05926>. |
Authors: | Jan Fojtik [aut, cre], Anna Grishko [aut], Lukas Sobisek [aut, cph, rev] |
Maintainer: | Jan Fojtik <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.8.1 |
Built: | 2025-02-14 04:12:02 UTC |
Source: | https://github.com/cran/CluMP |
This function clusters micro-panel (longitudinal) data, i.e. trajectories, into a pre-defined number of clusters using the feature-based clustering algorithm for micro-panel data called CluMP (see the reference below). Currently, only univariate clustering analysis is available.
CluMP(formula, group, data, cl_numb = NA, base_val = FALSE, method = "ward.D")
formula |
A two-sided |
group |
A grouping factor variable (vector), i.e. single identifier for each individual (trajectory). |
data |
A data frame containing the variables named in the |
cl_numb |
An integer, positive number (scalar) specifying the number of clusters. The |
base_val |
Indicates whether to include the value at the zero time point as an additional clustering variable. Default is FALSE, in which case the standard number (7) of clustering parameters is used. |
method |
The method used in hierarchical clustering, same as in |
Clusters micro-panel data. The output is a list of 5 components containing the clustering results.
Sobisek, L., Stachova, M., Fojtik, J. (2018) Novel Feature-Based Clustering of Micro-Panel Data (CluMP). Working paper version online: arXiv:1807.05926
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10)
CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3, base_val = FALSE, method = "ward.D")
CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3, base_val = TRUE, method = "ward.D")
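The two-step idea behind feature-based clustering can be illustrated with base R alone. The sketch below is not the CluMP implementation: the three trajectory features (`mean_rate`, `share_up`, `mean_level`) and the toy data are assumptions chosen for illustration, since the 7 clustering parameters used by the package are not listed here. Step 1 reduces each trajectory to a feature vector; step 2 applies hierarchical clustering with the "ward.D" method to those vectors.

```r
# Illustrative two-step feature-based clustering (NOT the CluMP implementation).
set.seed(1)

# Toy long-format panel: 20 individuals, 6 visits each, two slope groups.
n_id <- 20
n_visit <- 6
panel <- data.frame(
  ID   = rep(seq_len(n_id), each = n_visit),
  Time = rep(seq_len(n_visit) - 1, times = n_id)
)
true_slope <- rep(c(-2, 2), each = n_id / 2)   # first 10 decline, last 10 grow
panel$Y <- 10 + true_slope[panel$ID] * panel$Time + rnorm(nrow(panel), sd = 0.5)

# Step 1: reduce each trajectory to a few features (hypothetical choice).
features <- do.call(rbind, lapply(split(panel, panel$ID), function(d) {
  d <- d[order(d$Time), ]
  rate <- diff(d$Y) / diff(d$Time)             # change per unit of time
  c(mean_rate  = mean(rate),                   # average rate of change
    share_up   = mean(rate > 0),               # share of increasing steps
    mean_level = mean(d$Y))                    # average level of Y
}))

# Step 2: hierarchical clustering of the standardised features.
tree <- hclust(dist(scale(features)), method = "ward.D")
cluster <- cutree(tree, k = 2)

# Cross-tabulate the recovered clusters against the simulated slope groups.
table(cluster, true_slope)
```

Working on features rather than raw observations means trajectories of different lengths or with irregular visit times can still be compared, because every individual is described by a vector of the same length.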
The function CluMP_profiles provides a description (profile) for each cluster. The description is in the form of a summary list containing descriptive statistics of a cluster variable, time variable, cluster parameters and other variables (covariates), both continuous and categorical.
CluMP_profiles(CluMPoutput, cont_vars = NULL, cat_vars = NULL, show_NA = FALSE)
CluMPoutput |
An object (output) from the |
cont_vars |
An optional single character or a character vector of continuous variables' names (from the original dataset). |
cat_vars |
An optional single character or a character vector of categorical variables' names (from the original dataset). |
show_NA |
Logical scalar. Should descriptive statistics be calculated and shown for the NA cluster, if it exists? Default is FALSE. The NA cluster gathers individuals that are unsuitable for longitudinal clustering (trajectories with fewer than 3 non-missing observations). |
Returns a list with a summary of the cluster variable (Y), both baseline and changes; the time variable and a summary of the number of observations (visits); a summary of the clustering parameters; and, optionally, a summary of continuous variables (baseline and changes) and of categorical variables (baseline and end).
set.seed(123)
dataMale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataMale$Gender <- "M"
dataFemale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataFemale$ID <- dataFemale$ID + 50
dataFemale$Gender <- "F"
data <- rbind(dataMale, dataFemale)
CluMPoutput <- CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3)
CluMP_profiles(CluMPoutput, cat_vars = "Gender")
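To make the notions of "baseline" and "changes" concrete, the following base R sketch computes them per individual and then averages them per cluster. It does not reproduce the structure of the list returned by CluMP_profiles; the toy data, the column names and the choice of the mean as the only statistic are assumptions made for illustration.

```r
# Toy long-format data with a known cluster label per individual
# (assumed here; in practice the labels come from a clustering step).
set.seed(2)
toy <- data.frame(
  ID      = rep(1:6, each = 4),
  Time    = rep(0:3, times = 6),
  cluster = rep(c(1, 1, 1, 2, 2, 2), each = 4)
)
toy$Y <- ifelse(toy$cluster == 1, 5 + toy$Time, 5 - toy$Time) +
  rnorm(nrow(toy), sd = 0.1)

# One row per individual: baseline (first observation), change (last - first)
# and the number of observations (visits).
per_id <- do.call(rbind, lapply(split(toy, toy$ID), function(d) {
  d <- d[order(d$Time), ]
  data.frame(ID = d$ID[1], cluster = d$cluster[1],
             baseline = d$Y[1],
             change   = d$Y[nrow(d)] - d$Y[1],
             n_obs    = nrow(d))
}))

# Per-cluster means of the individual-level summaries.
profile <- aggregate(cbind(baseline, change, n_obs) ~ cluster,
                     data = per_id, FUN = mean)
profile
```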
This graphical function visualises cluster profiles (mean representatives of each cluster). Three types of plots are available: non-parametric representatives (mean within-cluster trajectories) with or without all individual (original) trajectories, and non-parametric mean trajectories with error bars. The non-parametric representatives use the LOESS method for small or medium data, or the GAM method for complex data of large size; both methods are applied from ggplot2.
CluMP_view( CluMPoutput, type = "all", nb_intervals = NULL, return_table = FALSE, title = NULL, x_title = NULL, y_title = NULL, plot_NA = FALSE )
CluMPoutput |
An object (output) from the |
type |
String. Indicates which type of graph is required. Possible values for this argument are: "all" (plots all data with non-parametric mean trajectories), "cont" (only non-parametric mean trajectories) or "breaks" (mean trajectories with error bars). |
nb_intervals |
A positive integer (scalar) specifying the number of regular timepoints into which the follow-up period should be split. This argument works only with graph type = "breaks"; for other graph types it is ignored. The number of error bars is equal to the number of timepoints specified by this argument. |
return_table |
Logical scalar indicating if the summary table of plotted values in the graph of type = "breaks" should be returned. Default is FALSE. |
title |
String. Optional title for a plot. If undefined, no title will be used. |
x_title |
String. An optional title for x axis. If undefined, the variable name after ~ in |
y_title |
String. An optional title for y axis. If undefined, the variable name before ~ in |
plot_NA |
Plot the NA cluster, if it exists. Default is FALSE. The NA cluster gathers individuals that are unsuitable for longitudinal clustering (fewer than 3 observations). |
Returns a graph for types "all" and "cont", or a list with the graph and a table of mean trajectories (if requested) for type = "breaks".
set.seed(123)
dataMale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataMale$Gender <- "M"
dataFemale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataFemale$ID <- dataFemale$ID + 50
dataFemale$Gender <- "F"
data <- rbind(dataMale, dataFemale)
CluMPoutput <- CluMP(formula = Y ~ Time, group = "ID", data = data, cl_numb = 3)
title <- "Plotting clusters' representatives with error bars"
CluMP_view(CluMPoutput, type = "all", return_table = TRUE)
CluMP_view(CluMPoutput, type = "cont")
CluMP_view(CluMPoutput, type = "breaks", nb_intervals = 5, return_table = TRUE, title = title)
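The kind of table that underlies a plot of mean trajectories with error bars can be sketched in base R. This is only an illustration under stated assumptions: the follow-up period is split into equal-width intervals with `cut()`, the error bars are taken to be standard errors, and the toy data and the local variable `nb_intervals` merely mirror the argument name. The actual computation inside CluMP_view may differ.

```r
set.seed(3)
# Toy data: 40 individuals with 8 irregularly timed visits, two known clusters.
toy <- data.frame(
  ID      = rep(1:40, each = 8),
  cluster = rep(c("A", "B"), each = 160)
)
toy$Time <- runif(nrow(toy), min = 0, max = 10)
toy$Y <- ifelse(toy$cluster == "A", 1, -1) * toy$Time + rnorm(nrow(toy))

# Split the follow-up period into regular intervals (hypothetical local name).
nb_intervals <- 5
breaks <- seq(min(toy$Time), max(toy$Time), length.out = nb_intervals + 1)
toy$interval <- cut(toy$Time, breaks = breaks, include.lowest = TRUE)

# Mean and standard error of Y per cluster and interval.
se <- function(x) sd(x) / sqrt(length(x))
breaks_table <- aggregate(Y ~ cluster + interval, data = toy,
                          FUN = function(y) c(mean = mean(y), se = se(y)))
breaks_table <- do.call(data.frame, breaks_table)  # flatten the matrix column
breaks_table
```

Each row of `breaks_table` corresponds to one point with one error bar, so every cluster contributes as many error bars as there are intervals.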
This function creates artificial linear or non-linear micro-panel (longitudinal) data from a generating process defined by a certain function (linear, quadratic, cubic, exponential) and a set of parameters (fixed and random (intercept, slope) effects of time).
GeneratePanel( n, Param, NbVisit, VisitFreq = NULL, TimeVar = NULL, RegModel = NULL, ClusterProb = NULL, Rho = NULL, units = NULL )
n |
An integer specifying the number of individuals (trajectories) being observed. |
Param |
Object of |
NbVisit |
A positive integer numeric input defining expected number of visits. Option is Fixed or Random.
Number of visits given by the argument |
VisitFreq |
String that defines the frequency of visits for each individual. Option is Random or Fixed. If set to Fixed or not defined, each individual has the same number of visits given by |
TimeVar |
A positive integer representing the variability, in days, of the timing of a repeated measurement (timepoint) around the regular, fixed occurrence (visit) given by the argument units. For example, if this argument is set to 5, a random integer from the interval -5 to 5 is drawn and added to the time variable. TimeVar must be lower than the regular frequency of repeated measurements given by the argument units. |
RegModel |
String specifying the mathematical function for generating trajectory for each of n individuals. Options are linear, quadratic, cubic or exponential. If set to linear or not defined, then each trajectory has a linear trend. If set to quadratic, then each trajectory has a quadratic development in time. If set to cubic then each trajectory has cubic development. If set to exponential, then each trajectory has exponential development. |
ClusterProb |
Numeric scalar (for 2 clusters) or a vector of numbers (for >2 clusters) defining the probability of each cluster. If not defined, then each cluster has the same occurrence probability. |
Rho |
A numeric scalar specifying the autocorrelation parameter, with values in the range 0 to 1. If set to 0 or not defined, there is no autocorrelation between the within-individual repeated observations. |
units |
String defining the units of time series. Options are day, week, month or year. |
Generates artificial panel data.
set.seed(123)
# Simple linear model where each individual has 10 observations.
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10)
# Exponential model where each individual has 10 observations.
data <- GeneratePanel(100, ParamExpon, NbVisit = 10, VisitFreq = "Fixed", RegModel = "exponential")
PanelPlot(data)
# Cubic model where each individual has a random number of observations on a daily basis.
# The average number of observations is given by the parameter NbVisit.
data <- GeneratePanel(n = 100, Param = ParamCubic, NbVisit = 100, RegModel = "cubic", units = "day")
PanelPlot(data)
# Quadratic model where each individual has a random number of observations.
# Each object is observed weekly with a variability of 2 days.
data <- GeneratePanel(5, ParamQuadrat, NbVisit = 50, RegModel = "quadratic", units = "week", TimeVar = 2)
PanelPlot(data)
# Generate panel data with a linear trend, with 75% of objects in the first cluster and 25% in the second.
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10, ClusterProb = c(0.75, 0.25))
PanelPlot(data, colour = "Cluster")
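The generating process for the linear case (fixed intercept and slope plus correlated random intercept and slope, with cluster membership drawn according to given probabilities) can be sketched in base R. This is not the code of GeneratePanel: the column names of `param`, the helper `simulate_one`, the numeric values and the reading of the residual variability as a standard deviation are all assumptions made for illustration.

```r
set.seed(4)

# Hypothetical parameter table, one row per cluster (column names are illustrative).
param <- data.frame(
  b0 = c(100, 100), b1 = c(-0.5, -2),        # fixed intercept and slope
  var_u0 = c(4, 4), var_u1 = c(0.05, 0.05),  # variances of random effects U0, U1
  cor_u = c(0.2, 0.2),                       # correlation between U0 and U1
  sigma = c(1, 1)                            # residual standard deviation (assumed)
)

n <- 200                       # number of individuals
nb_visit <- 10                 # fixed number of visits per individual
cluster_prob <- c(0.75, 0.25)  # probability of each cluster

# Assign each individual to a cluster with the given probabilities.
cl <- sample(seq_len(nrow(param)), size = n, replace = TRUE, prob = cluster_prob)

# Simulate one linear trajectory for individual `id` in cluster `k`.
simulate_one <- function(id, k) {
  p <- param[k, ]
  cov_u <- p$cor_u * sqrt(p$var_u0 * p$var_u1)
  Sigma <- matrix(c(p$var_u0, cov_u, cov_u, p$var_u1), nrow = 2)
  u <- drop(t(chol(Sigma)) %*% rnorm(2))     # correlated random intercept and slope
  time <- seq_len(nb_visit) - 1
  data.frame(ID = id, Time = time, Cluster = k,
             Y = (p$b0 + u[1]) + (p$b1 + u[2]) * time +
               rnorm(nb_visit, sd = p$sigma))
}

panel <- do.call(rbind, Map(simulate_one, seq_len(n), cl))
head(panel)
```

Multiplying a standard normal vector by the transposed Cholesky factor of the covariance matrix is a standard way to draw correlated random effects without extra packages.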
This function finds the optimal number of clusters based on evaluation criteria (indices) available from the NbClust package.
OptiNum( formula, group, data, index = c("silhouette", "ch", "db"), max_clust = 10, base_val = FALSE )
formula |
A two-sided |
group |
A grouping factor variable (vector), i.e. single identifier for each individual (trajectory). |
data |
A data frame containing the variables named in |
index |
String vector of indices to be computed. Default is c("silhouette", "ch", "db"). See NbClust package for available indices and their description. |
max_clust |
An integer, positive number (scalar) defining the maximum number of clusters to check. Default value of this argument is 10 or maximum number of individuals. |
base_val |
Indicates whether to include the value at the zero time point as an additional clustering variable. Default is FALSE, in which case the standard number (7) of clustering parameters is used. |
Determines the optimal number of clusters and returns a graphical output (the red dot in each plot indicates the number of clusters recommended by that index) and a table of indices.
Malika Charrad, Nadia Ghazzali, Veronique Boiteau, Azam Niknafs (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, 61(6), 1-36. URL http://www.jstatsoft.org/v61/i06/.
set.seed(123)
data <- GeneratePanel(n = 100, Param = ParamLinear, NbVisit = 10)
OptiNum(data = data, formula = Y ~ Time, group = "ID")
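How an index can point to a recommended number of clusters is shown below for the Calinski-Harabasz ("ch") index, computed by hand in base R rather than through NbClust. The toy feature matrix `x`, the helper `ch_index` and the local variable `max_clust` are assumptions for illustration; this is a sketch of the principle, not of what OptiNum does internally.

```r
set.seed(5)
# Toy feature matrix with three well-separated groups
# (a stand-in for the clustering parameters of each individual).
centres <- rbind(c(0, 0), c(6, 0), c(0, 6))
x <- centres[rep(1:3, each = 30), ] + matrix(rnorm(180, sd = 0.5), ncol = 2)

tree <- hclust(dist(x), method = "ward.D")

# Calinski-Harabasz index: (between SS / (k - 1)) / (within SS / (n - k)).
ch_index <- function(x, labels) {
  n <- nrow(x)
  k <- length(unique(labels))
  total_ss <- sum(scale(x, scale = FALSE)^2)
  within_ss <- sum(sapply(split(as.data.frame(x), labels), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)
  }))
  ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))
}

# Evaluate the index for 2 up to max_clust clusters; higher values are better.
max_clust <- 6
ks <- 2:max_clust
ch <- sapply(ks, function(k) ch_index(x, cutree(tree, k = k)))
best_k <- ks[which.max(ch)]
data.frame(k = ks, ch = round(ch, 1))
best_k
```

Different indices can disagree, which is presumably why several are computed by default and reported side by side.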
This function plots micro-panel (longitudinal) data from a stored data.frame or randomly generated panel data from the GeneratePanel function.
PanelPlot( data, formula = Y ~ Time, group = "ID", colour = NA, mean_traj_all = FALSE, mean_traj_group = FALSE, show_legend = TRUE, title = NULL, x_title = NULL, y_title = NULL )
data |
A data frame containing the variables named in |
formula |
A two-sided |
group |
A grouping factor variable (vector), i.e. single identifier for each individual (trajectory). |
colour |
Character, which is a variable's name in data. The trajectories are distinguished by colour according to this variable. |
mean_traj_all |
Logical scalar. It indicates whether to show mean overall trajectory. Default is FALSE. |
mean_traj_group |
Logical scalar. It indicates whether to show mean trajectory by group. Default is FALSE. |
show_legend |
Logical scalar. It indicates whether to show cluster legend. Default is TRUE. |
title |
String. An optional title for the plot. Otherwise no title will be used. |
x_title |
String. An optional title for the x axis. Otherwise the variable name after ~ in |
y_title |
String. An optional title for the y axis. Otherwise the variable name before ~ in |
Returns a plot created with the ggplot2 package.
set.seed(123)
dataMale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataMale$Gender <- "M"
dataFemale <- GeneratePanel(n = 50, Param = ParamLinear, NbVisit = 10)
dataFemale$ID <- dataFemale$ID + 50
dataFemale$Gender <- "F"
data <- rbind(dataMale, dataFemale)
PanelPlot(data = data, formula = Y ~ Time, group = "ID", colour = "Gender")
PanelPlot(data = data, formula = Y ~ Time, group = "ID", colour = "Gender", mean_traj_all = TRUE)
PanelPlot(data = data, formula = Y ~ Time, group = "ID", colour = "Gender", mean_traj_group = TRUE)
Default parameters to generate micro-panel (longitudinal) data with cubic trend. The parameters may differ per each cluster. The parameters of each cluster are in rows. Number of rows denotes the number of clusters. Fixed effects are taken from Allen et al. (2005), and the source for random effects is Uher et al. (2017).
ParamCubic
It is advised to keep the parameters in a data.frame. The structure of the parameters is as follows:
fixed parameter of intercept
fixed parameter of slope
fixed parameter defining the quadraticity
fixed parameter defining the cubicity
variance of random factor U0 given to fixed parameter b0
variance of random factor U1 given to fixed parameter b1
correlation between random factors U0 and U1
the variability of the residuals
Allen, JS, Bruss, J, Brown, CK, Damasio, H. Normal neuroanatomical variation due to age: the major lobes and a parcellation of the temporal region. Neurobiol Aging. 2005 Oct;26(9):1245-60; discussion 1279-82.
Uher T, Vaneckova M, Krasensky J, Sobisek L, Tyblova M, Volna J, Seidl Z, Bergsland N, Dwyer MG, Zivadinov R, De Stefano N, Sormani MP, Havrdova EK, Horakova D. Pathological cut-offs of global and regional brain volume loss in multiple sclerosis. Mult Scler. 2017 Nov 1:1352458517742739. doi: 10.1177/1352458517742739.
Default parameters to generate micro-panel (longitudinal) data with exponential trend. The parameters may differ per each cluster. The parameters of each cluster are in rows. Number of rows denotes the number of clusters. Fixed effects are taken from Jones et al. (2013).
ParamExpon
It is advised to keep the parameters in a data.frame. The structure of the parameters is as follows:
fixed parameter of intercept
fixed parameter of slope
fixed parameter defining the decay
variance of random factor U0 given to fixed parameter b0
variance of random factor U1 given to fixed parameter b1
correlation between random factors U0 and U1
the variability of the residuals
Jones BC, Nair G, Shea CD, Crainiceanu CM, Cortese IC, Reich DS. Quantification of multiple-sclerosis-related brain atrophy in two heterogeneous MRI datasets using mixed-effects modeling. Neuroimage Clin. 2013 Aug 13;3:171-9. doi: 10.1016/j.nicl.2013.08.001.
Default parameters to generate micro-panel (longitudinal) data with linear trend. The parameters may differ per each cluster. The parameters of each cluster are in rows. Number of rows denotes the number of clusters. Fixed and random effects are taken from Uher et al. (2017).
ParamLinear
It is advised to keep the parameters in a data.frame. The structure of the parameters is as follows:
fixed parameter of intercept
fixed parameter of slope
variance of random factor U0 given to fixed parameter b0
variance of random factor U1 given to fixed parameter b1
correlation between random factors U0 and U1
the variability of the residuals
Uher T, Vaneckova M, Krasensky J, Sobisek L, Tyblova M, Volna J, Seidl Z, Bergsland N, Dwyer MG, Zivadinov R, De Stefano N, Sormani MP, Havrdova EK, Horakova D. Pathological cut-offs of global and regional brain volume loss in multiple sclerosis. Mult Scler. 2017 Nov 1:1352458517742739. doi: 10.1177/1352458517742739.
Parameters to generate panel data with quadratic trend. The parameters may differ per each cluster. The parameters of each cluster are in rows. Number of rows denotes the number of clusters. Fixed effects are taken from Allen et al. (2005), and the source for random effects is Uher et al. (2017).
ParamQuadrat
It is advised to keep the parameters in a data.frame. The structure of the parameters is as follows:
fixed parameter of intercept
fixed parameter of slope
fixed parameter defining the quadraticity
variance of random factor U0 given to fixed parameter b0
variance of random factor U1 given to fixed parameter b1
correlation between random factors U0 and U1
the variability of the residuals
Allen, JS, Bruss, J, Brown, CK, Damasio, H. Normal neuroanatomical variation due to age: the major lobes and a parcellation of the temporal region. Neurobiol Aging. 2005 Oct;26(9):1245-60; discussion 1279-82.
Uher T, Vaneckova M, Krasensky J, Sobisek L, Tyblova M, Volna J, Seidl Z, Bergsland N, Dwyer MG, Zivadinov R, De Stefano N, Sormani MP, Havrdova EK, Horakova D. Pathological cut-offs of global and regional brain volume loss in multiple sclerosis. Mult Scler. 2017 Nov 1:1352458517742739. doi: 10.1177/1352458517742739.