Figure 1: Profile and standard error for the four hypothesised groups.

In our proposal, we will conduct a cluster analysis at baseline using the rich, multivariate data available for a very large sample (N=9000). To keep the power analysis simple, we assume a single summary variable for each explanatory level. Although we will likely use data reduction techniques, the real data will be richer than this and thus offer more opportunities for separation.

We computed a power analysis based on model-based clustering (Fraley and Raftery, 1998) for 9000 cases and 5 variables, using the Bayesian information criterion (BIC) to select the number of clusters. We hypothesise, based on the literature, four distinct subgroups: Hyper, Hypo, Normative and Mixed. We assume these groups are of equal size in our sample. Figure 1 (top) visualises the hypothesised profile for each subgroup, showing the subgroup mean (line) and standard deviation (ribbon). The power outcomes are subgroup estimation accuracy (how often we recover four subgroups, given that the ground truth contains four) and individual accuracy (how often individuals are assigned to the correct cluster). We simulated 100 datasets of N=9000.
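The simulation logic described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the proposal: it uses scikit-learn's GaussianMixture in place of mclust, and the subgroup means, the common standard deviation, the candidate range of 1 to 6 clusters and the small default sample sizes are all assumptions chosen for the example. Note that scikit-learn reports BIC so that lower is better, whereas mclust uses the opposite sign convention.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical subgroup means on 5 standardised variables.
# Illustrative values only, NOT the profiles shown in Figure 1.
MEANS = {
    "Hyper":     [ 2.0,  2.0,  2.0,  2.0,  2.0],
    "Hypo":      [-2.0, -2.0, -2.0, -2.0, -2.0],
    "Normative": [ 0.0,  0.0,  0.0,  0.0,  0.0],
    "Mixed":     [ 2.0, -2.0,  2.0, -2.0,  2.0],
}


def simulate(n_per_group, rng, sd=1.0):
    """Draw equal-sized groups from spherical Gaussians centred on MEANS."""
    X = np.vstack([
        rng.normal(m, sd, size=(n_per_group, len(m))) for m in MEANS.values()
    ])
    y = np.repeat(np.arange(len(MEANS)), n_per_group)  # ground-truth labels
    return X, y


def select_k(X, k_range=range(1, 7), seed=0):
    """Fit one mixture per candidate k and return the k with the lowest BIC."""
    bics = []
    for k in k_range:
        gmm = GaussianMixture(
            n_components=k, covariance_type="full", n_init=3, random_state=seed
        ).fit(X)
        bics.append(gmm.bic(X))
    return list(k_range)[int(np.argmin(bics))]


def subgroup_recovery_rate(n_sims=5, n_per_group=250, seed=0):
    """Proportion of simulated datasets for which BIC selects the true k."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        X, _ = simulate(n_per_group, rng)
        hits += int(select_k(X) == len(MEANS))
    return hits / n_sims
```

Calling `subgroup_recovery_rate(n_sims=100, n_per_group=2250)` would mirror the design described above (100 datasets of N=9000), at a correspondingly higher computational cost.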

Figure 2: The recovery of the four hypothesised clusters using Gaussian mixture modelling with mclust (https://mclust-org.github.io/mclust/).

Figure 2 shows a two-dimensional projection of one such analysis, with cluster assignment shown as colours and assignments with greater uncertainty shown as small dots. Across 100 simulations, the four-cluster solution was correctly recovered in 100% of cases. Moreover, 99.13% of individuals were assigned to the correct cluster, despite notable overlap within individual variable domains. Although real data may be distributed differently, our approach is feasible, flexible, theory-driven and sufficiently powered to achieve the goals of our proposal.
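Individual accuracy needs one extra step, because the labels a mixture model returns are arbitrary: the fitted cluster "0" need not be the simulated group "0". One common way to handle this, sketched below, is to match fitted clusters to true groups so that total agreement is maximised, then count the proportion of individuals on the matched cells. The function name `matched_accuracy` is hypothetical, and this matching approach is an assumption for illustration rather than the procedure used to obtain the figure reported above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def matched_accuracy(true_labels, pred_labels):
    """Proportion correctly assigned after optimally matching cluster labels."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)

    # Re-index both label sets to 0..K-1 so they can index a contingency table.
    true_ids, true_idx = np.unique(true_labels, return_inverse=True)
    pred_ids, pred_idx = np.unique(pred_labels, return_inverse=True)

    # table[i, j] = number of individuals in true group i assigned to cluster j.
    table = np.zeros((true_ids.size, pred_ids.size), dtype=int)
    np.add.at(table, (true_idx, pred_idx), 1)

    # One-to-one matching of groups to clusters that maximises agreement.
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / true_labels.size
```

For example, `matched_accuracy([0, 0, 1, 1], [1, 1, 0, 1])` gives 0.75: after matching, three of the four individuals fall in the cluster paired with their true group.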

Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578-588.