Learning Notes: Structural Discovery

Unlike predictive analysis, structural discovery aims to find the patterns in the data.

K-means

procedures
- decide the number of clusters
- pick initial starting points, the centroids (usually at random)
- assign each data point to its nearest centroid; the points sharing a centroid form one cluster, e.g., Cluster 0 (the cluster boundaries form a Voronoi diagram)
- refit the centroid of each cluster (e.g., Centroid A): in every coordinate, take the average value of all data points in that cluster
- repeat until the centroids remain the same; reaching this point is called convergence
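
The steps above can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the function name `kmeans`, the toy points, and the starting centroids are made up for the example, and the starting centroids are fixed rather than random so the run is repeatable.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on a list of coordinate tuples, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Refit step: each centroid becomes the coordinate-wise mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)
        if new_centroids == centroids:  # centroids stopped moving: converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other, and each final centroid is the average of its three members.
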
key concepts
- number of clusters (k)
- centroids
- distortion
- BIC/AIC
notes
1. how to determine the final clusters: with different initial centroids, the final clusters will differ.
  - randomly try several sets of initial starting points, giving several restarts
  - use distortion (mean squared deviation, sum of squares) to determine which restart works best; lower distortion is better
2. how to determine the number of clusters: the distortion usually gets smaller as the number of clusters increases
  - cross-validation doesn't work
  - BIC and AIC work (Bayesian Information Criterion, Akaike Information Criterion): they compare how much fit would be spuriously expected from K randomized centroids (centroids not moved) with how much fit we actually had
  - choose the number of clusters with the best value of AIC or BIC
3. most importantly, the scientific question matters!
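
A rough sketch of notes 1 and 2, under stated assumptions: each restart starts from k randomly chosen data points (one common choice, not the only one), and distortion is taken as the mean squared distance to the assigned centroid. The helper names `kmeans`, `distortion`, and `best_of_restarts`, the seed, and the toy data are all hypothetical; the compact `kmeans` is included only so the block runs on its own.

```python
import math
import random

def kmeans(points, centroids, max_iter=100):
    """Compact k-means helper: assign to nearest centroid, refit, repeat."""
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        new = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else old)
        if new == centroids:
            break
        centroids = new
    return centroids, labels

def distortion(points, centroids, labels):
    """Mean squared distance from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)) / len(points)

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means from several random starts and keep the lowest-distortion result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        start = rng.sample(points, k)  # assumption: start from k random data points
        centroids, labels = kmeans(points, start)
        d = distortion(points, centroids, labels)
        if best is None or d < best[0]:
            best = (d, centroids, labels)
    return best

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
# Best distortion found for each candidate number of clusters.
distortions = {k: best_of_restarts(points, k)[0] for k in (1, 2, 3)}
print(distortions)
```

Distortion alone only picks the best restart for a fixed number of clusters. Because it tends to keep shrinking as clusters are added, comparing it across different numbers of clusters needs a penalized criterion such as BIC or AIC.
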

advanced clustering algorithms

Gaussian Mixture Models - Expectation Maximization Algorithm
- each cluster has a centroid and a radius (threshold)
- Expectation Maximization is used during model calculation (fitting)
- assess with distortion, AIC/BIC, and likelihood
- difference from K-means
  - clusters can overlap
  - points can be explicitly treated as outliers
- key concepts: number of clusters, centroid, radius, distortion, BIC/AIC, likelihood
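
To make the overlap idea concrete, here is a small one-dimensional sketch of fitting a Gaussian mixture with EM and scoring it with AIC and BIC. Everything specific is an assumption for illustration: the function names, the toy data, the starting means, unit starting variances, equal starting weights, and the parameter count of 3k - 1. Here the mean plays the role of the centroid and the variance the role of the radius. AIC and BIC use the standard definitions (AIC = 2p - 2 ln L, BIC = p ln n - 2 ln L), where lower is better.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, means, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, log-likelihood."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k    # assumption: unit starting variance
    weights = [1.0 / k] * k  # assumption: equal starting weights
    for _ in range(n_iter):
        # E-step: soft membership of every point in every component (clusters may overlap).
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its responsibility-weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid a collapsed component
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )
    return weights, means, variances, loglik

def aic_bic(loglik, n_params, n_points):
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_points) - 2 * loglik
    return aic, bic

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
results = {}
for k, start in [(1, [3.0]), (2, [0.0, 6.0])]:
    weights, means, variances, loglik = fit_gmm_1d(xs, start)
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    aic, bic = aic_bic(loglik, n_params, len(xs))
    results[k] = {"means": means, "loglik": loglik, "aic": aic, "bic": bic}
    print(k, results[k])
```

On this toy data the two-component model gets the lower AIC and BIC, which matches the two visible groups.
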
Spectral Clustering
Hierarchical Clustering -> Hierarchical Agglomerative Clustering
- each data point starts as a cluster
- two clusters are combined if doing so improves the fit
- continue until no more clusters can be combined
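
The agglomerative steps can be sketched as follows. Two loud assumptions: the distance between clusters is single linkage (closest pair of points), and a hypothetical `max_distance` threshold stands in for the "fit is better" test, since these notes do not say which fit criterion is used. The function names and toy points are made up for the example.

```python
import math

def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points across them."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, max_distance):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (single_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:  # assumption: stop once the closest pair is too far apart
            break
        clusters[i] = clusters[i] + clusters[j]  # combine the two clusters
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerate(points, max_distance=2.0)
print(clusters)
```

With this threshold the three points near the origin merge into one cluster and the two far points into another; the remaining gap is too large, so merging stops there.
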

factor analysis

types
- exploratory: derive groups in a bottom-up fashion; more common in educational data mining
- confirmatory: test the goodness of fit of an existing structure; more common in psychometrics
procedures
- algorithms: PAF (principal axis factoring), PCA (principal components analysis, more common)
  - the first factor tries to find a combination of variable weightings that gets the best fit to the data
  - the second tries to fit the remaining unexplained variance, and so on
  - factors are made orthogonal
- compute a factor score (each factor yields a linear equation)
- find the variables that load strongly on each factor (e.g., F1) and get the loadings; there are many criteria for this
- generate one-factor-per-variable (scale) models by iteratively
  - assigning each item to a factor
  - dropping the one item that loads most poorly on its factor, if it has no strong loading (best case: every variable loads strongly)
  - refitting the factors
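
As a rough illustration of the first extraction step with PCA, the sketch below finds the first component of a small, made-up item table using power iteration on the correlation matrix. The data, the function names, and the use of power iteration are assumptions for the example; real analyses would normally rely on a statistics package. Loadings are computed as weight times the square root of the eigenvalue, which for standardized variables is the correlation between each variable and the component.

```python
import math

def column_stats(rows):
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1)) for c, m in zip(cols, means)]
    return means, sds

def first_component(rows, n_iter=200):
    """First principal component of standardized data, via power iteration."""
    means, sds = column_stats(rows)
    z = [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]
    n, p = len(rows), len(means)
    # Correlation matrix of the variables.
    corr = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)] for a in range(p)]
    w = [1.0] * p
    for _ in range(n_iter):
        w = [sum(corr[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Variance captured by this component (Rayleigh quotient with unit-length w).
    eigenvalue = sum(w[a] * sum(corr[a][b] * w[b] for b in range(p)) for a in range(p))
    loadings = [v * math.sqrt(eigenvalue) for v in w]
    # Factor score: a linear equation in the standardized variables.
    scores = [sum(wi * zi for wi, zi in zip(w, row)) for row in z]
    explained = eigenvalue / p  # share of total variance (trace of corr is p)
    return w, loadings, explained, scores

# Hypothetical responses: six respondents, three items.
rows = [(1, 2, 5), (2, 3, 3), (3, 3, 4), (4, 5, 5), (5, 5, 3), (6, 7, 4)]
weights, loadings, explained, scores = first_component(rows)
print(loadings)
print(explained)
```

In this made-up table the first two items load strongly on the first component while the third does not, so the third would be the candidate to drop or to assign to a later factor.
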
key concepts
- goodness:
  - R-squared (what proportion of the variance in the variables is explained by the factoring)
  - cross-validated R-squared
- internal reliability of scales (Cronbach's α)
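
Cronbach's α can be computed directly from item scores with the standard formula α = k/(k-1) × (1 - sum of item variances / variance of total scores). The sketch below uses made-up scores and a hypothetical function name.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of respondent scores per item, all the same length."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's scale total
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Hypothetical two-item scale answered by six respondents.
items = [
    [1, 2, 3, 4, 5, 6],
    [2, 3, 3, 5, 5, 7],
]
alpha = cronbach_alpha(items)
print(alpha)
```

The two items here rise and fall together, so α comes out close to 1; items that did not move together would pull it down.
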
notes
1. Unlike clustering, which groups data points together, factor analysis finds how to group data features/variables together. (one problem can be transformed into the other, but they are not the same)
2. the procedures deal with quantitative variables; there is also a variant for categorical and binary data, Latent Class Factor Analysis (LCFA; Magidson & Vermunt, 2001; Vermunt & Magidson, 2004), as well as a variant for mixed data types, Exponential Family Principal Component Analysis (EPCA; Collins et al., 2001)
3. context matters! make a reasonable selection of variables.

questions

what is cross-validation?
what is model calculation? (Expectation Maximization Algorithm)
what is a non-linear dimension-reduced space? what is dimensionality reduction? (spectral clustering)
what is a support vector machine? (spectral clustering)
how to understand the mathematical mechanisms of factor analysis?

Baker, R.S. (2015) Big Data and Education. 2nd Edition. New York, NY: Teachers College, Columbia University.

Written on January 20, 2017