# Learning Notes-Structural Discovery

Unlike predictive analysis, structural discovery is to find the patterns of data.

## K-means

• prcedures

• decide the number of clusters
• find initial starting points - centroids (usually random)
• label data points that close to a certain centroid as one cluster，e.g.,Cluster0 (use voronoi diagram)
• refit the centroids in each cluster (e.g., Centroid A) in a certain cluster (in every coordinate, calculate the average value of all data points in that cluster） - the process called convergence
• repeat until the centroids remain the same
• key concepts
• cluster size
• centroids
• distortion
• BiC/AIC
• notes

1. how to determine the final clusters: with different initial centroids, the final clusters will differ.
• ramdonly try several initial starting points and get several restarts
• use distortion(mean square deviation，Sum of Squares) to determine which restart works best.
2. how to determine the number of clusters: the distortion usually gets smaller along the increase of cluster size
• cross-validation doesn’t work
• BIC and AIC works ( Bayesian Information Criterion, Akaike Information Criterion) - compare how much fit would be spuriously expected from a randomized K centroids(do not move centroids) and how much fit we actually had
• choose the cluster size with best value of AIC or BiC
3. most importantly, scientific question matters!!!!

1. Gaussian Mixture Models - Expectation Maximization Algorithm
• a centroid and a radius ( threshold)
• Used during model calculation
• Assess with distortion, AIC/Bic, and likelihood
• difference from K-means
• clusters can overlap
• explicity treating points as outliers
• key concepts : cluster size, centroid, radius, distortion, BiC/AIC, likelihood
2. Spetral Clustering
3. Hierachical Clustering -> Hierachical Agglommerative Clustering
• each data point starts as a cluster
• two clusters are combined if the fit is better
• continue until no more cluters can be combined

## factor analysis

• types
• experimental: get groups in bottom-up fashion, more educational data ming
• confirmatory: test the goodness of existing structure, more psychometric
• procedures
• algorithms: PAF(principal axis factoring), PCA(principal components analysis,more common)
• first factor tries to find a combinition of variable-weightings that gets the best fit to the data
• the second tries to fit the remaining unexplained variance….
• computer a factor score ( each factor can generate a linear euqation)
• generate one-factor-per-variable(scale) models by iteratively
• assigning each item to factors
• refitting factors
• key concepts
• goodness:
• rsquare( what propotion of the variance in the variables is explained by the factoring)
• cross-validated rsquare
• internal reliability of scales ( cronbach’s α)
• notes
1. Unlike clustering that groups data points together, factor analysis finds how group data features/variables together.(two problems can be trasformative, but not the same)
2. the procedures deal with quantative variables, there is also a variant for categorical and binary data, Latent Class Factor Analysis (LCFA –Magidson & Vermunt, 2001; Vermunt & Magidson,2004), as well as a variant for mixed data types, Exponential Family Principal Component Analysis (EPCA – Collins et al., 2001)
3. context matters!!! make reasonable variable selection.

## questions

• what is cross-validation?
• what is model calcuation? ( Expectation Maximization Algorithm)
• what is non-linear dimension-reduced space? what is dimensionality reduction? (spetral clustering)
• what is a support vector machine? ( spetral clustering)
• how to understand the mathematical mechanisms of factor analysis?

Baker, R.S. (2015) Big Data and Education. 2nd Edition. New York, NY: Teachers College, Columbia University.

Written on January 20, 2017