Learning Notes-Structural Discovery
Unlike predictive analysis, structural discovery is to find the patterns of data.
K-means
-
prcedures
- decide the number of clusters
- find initial starting points - centroids (usually random)
- label data points that close to a certain centroid as one cluster,e.g.,Cluster0 (use voronoi diagram)
- refit the centroids in each cluster (e.g., Centroid A) in a certain cluster (in every coordinate, calculate the average value of all data points in that cluster) - the process called convergence
- repeat until the centroids remain the same
- key concepts
- cluster size
- centroids
- distortion
- BiC/AIC
-
notes
- how to determine the final clusters: with different initial centroids, the final clusters will differ.
- ramdonly try several initial starting points and get several restarts
- use distortion(mean square deviation,Sum of Squares) to determine which restart works best.
- how to determine the number of clusters: the distortion usually gets smaller along the increase of cluster size
- cross-validation doesn’t work
- BIC and AIC works ( Bayesian Information Criterion, Akaike Information Criterion) - compare how much fit would be spuriously expected from a randomized K centroids(do not move centroids) and how much fit we actually had
- choose the cluster size with best value of AIC or BiC
- most importantly, scientific question matters!!!!
- how to determine the final clusters: with different initial centroids, the final clusters will differ.
advanced clustering algorithms
- Gaussian Mixture Models - Expectation Maximization Algorithm
- a centroid and a radius ( threshold)
- Used during model calculation
- Assess with distortion, AIC/Bic, and likelihood
- difference from K-means
- clusters can overlap
- explicity treating points as outliers
- key concepts : cluster size, centroid, radius, distortion, BiC/AIC, likelihood
- Spetral Clustering
- Hierachical Clustering -> Hierachical Agglommerative Clustering
- each data point starts as a cluster
- two clusters are combined if the fit is better
- continue until no more cluters can be combined
factor analysis
- types
- experimental: get groups in bottom-up fashion, more educational data ming
- confirmatory: test the goodness of existing structure, more psychometric
- procedures
- algorithms: PAF(principal axis factoring), PCA(principal components analysis,more common)
- first factor tries to find a combinition of variable-weightings that gets the best fit to the data
- the second tries to fit the remaining unexplained variance….
- factors are made ortogonal.
- computer a factor score ( each factor can generate a linear euqation)
- find variables strongly load on each factors(e.g, F1) and get the loading - many criteria
- generate one-factor-per-variable(scale) models by iteratively
- assigning each item to factors
- dropping the one item that loads most poorly in one factor, if it has no strong loading (if every variable is strong loading, best!)
- refitting factors
- algorithms: PAF(principal axis factoring), PCA(principal components analysis,more common)
- key concepts
- goodness:
- rsquare( what propotion of the variance in the variables is explained by the factoring)
- cross-validated rsquare
- internal reliability of scales ( cronbach’s α)
- goodness:
- notes
- Unlike clustering that groups data points together, factor analysis finds how group data features/variables together.(two problems can be trasformative, but not the same)
- the procedures deal with quantative variables, there is also a variant for categorical and binary data, Latent Class Factor Analysis (LCFA –Magidson & Vermunt, 2001; Vermunt & Magidson,2004), as well as a variant for mixed data types, Exponential Family Principal Component Analysis (EPCA – Collins et al., 2001)
- context matters!!! make reasonable variable selection.
questions
- what is cross-validation?
- what is model calcuation? ( Expectation Maximization Algorithm)
- what is non-linear dimension-reduced space? what is dimensionality reduction? (spetral clustering)
- what is a support vector machine? ( spetral clustering)
- how to understand the mathematical mechanisms of factor analysis?
Baker, R.S. (2015) Big Data and Education. 2nd Edition. New York, NY: Teachers College, Columbia University.
Written on January 20, 2017