Learning Notes: Structural Discovery
Unlike predictive analysis, structural discovery aims to find patterns in the data.
K-means

procedures
 decide the number of clusters
 pick initial starting points as centroids (usually at random)
 label each data point with its closest centroid to form clusters, e.g., Cluster0 (the resulting regions form a Voronoi diagram)
 refit the centroid of each cluster (e.g., Centroid A): in every coordinate, take the average value of all data points in that cluster
 repeat until the centroids stop moving; this end state is called convergence
 key concepts
 cluster size
 centroids
 distortion
 BIC/AIC
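The procedure above can be sketched in plain Python. This is only an illustration, not a reference implementation: the function names (`assign`, `refit`, `kmeans`), the toy points, and the rule that an empty cluster keeps its old centroid are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def refit(points, labels, centroids):
    """Move each centroid to the per-coordinate mean of its cluster."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:  # assumption: an empty cluster keeps its old centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans(points, k, seed=0, max_iter=100):
    """Run k-means from random starting centroids until they stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random data points as starting centroids
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = refit(points, labels, centroids)
        if new_centroids == centroids:  # convergence: nothing moved
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Changing `seed` changes the starting centroids, which is why the notes below talk about restarts.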

notes
 how to determine the final clusters: with different initial centroids, the final clusters will differ.
 randomly try several initial starting points to get several restarts
 use distortion (mean squared deviation, sum of squares) to determine which restart works best.
 how to determine the number of clusters: distortion usually gets smaller as the number of clusters (cluster size) increases, so distortion alone cannot pick it
 cross-validation doesn’t work: more centroids cover the space more thoroughly, so distortion tends to keep improving
 BIC and AIC work (Bayesian Information Criterion, Akaike Information Criterion): they compare how much fit would be spuriously expected from K random centroids (without moving the centroids) with how much fit we actually had
 choose the cluster size with the best value of AIC or BIC
 most importantly, the scientific question matters!!!!
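A small sketch of the restart idea from these notes: run k-means from several random starts, keep the run with the lowest distortion, and then watch how the best distortion changes with the number of clusters. The helper names, the toy data, and the choice of 20 restarts are assumptions for illustration; distortion is computed here as the mean squared distance to the nearest centroid.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iter=100):
    """Compact k-means; returns the final centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def distortion(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


def best_of_restarts(points, k, restarts=20, seed=0):
    """Run several random restarts and keep the lowest-distortion result."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda centroids: distortion(points, centroids))


# Hypothetical toy data with three visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.1, 4.9),
          (4.8, 5.0), (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
for k in range(1, 5):
    centroids = best_of_restarts(points, k)
    print(k, round(distortion(points, centroids), 4))
```

Because the printed distortion tends to keep shrinking as k grows, it helps compare restarts at a fixed k but does not by itself say which k to choose.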
advanced clustering algorithms
 Gaussian Mixture Models (Expectation-Maximization algorithm)
 a centroid and a radius (threshold)
 used during model calculation
 assess with distortion, AIC/BIC, and likelihood
 difference from K-means
 clusters can overlap
 points can be explicitly treated as outliers
 key concepts: cluster size, centroid, radius, distortion, BIC/AIC, likelihood
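To make the centroid, radius, and likelihood ideas concrete, here is a minimal one-dimensional sketch of a Gaussian mixture fitted with Expectation-Maximization. It rests on several assumptions that are not in the notes: the "radius" is modelled as each cluster's standard deviation, the data and starting guesses are invented, and a small floor on the standard deviation is added to keep a cluster from collapsing.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def em_gmm_1d(data, means, sds, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given guesses."""
    means, sds, weights = list(means), list(sds), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: soft membership of every point in every cluster,
        # which is how clusters are allowed to overlap.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: refit weight, centroid, and radius from the soft memberships.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor is an assumption of this sketch
    return means, sds, weights


def log_likelihood(data, means, sds, weights):
    """Log of the probability of the data under the fitted mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for m, s, w in zip(means, sds, weights)))
        for x in data
    )


# Hypothetical data: two groups, around 1 and around 5.
data = [1.0, 1.3, 0.8, 1.1, 0.9, 5.0, 5.2, 4.7, 5.1, 4.9]
means, sds, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, sds, weights, log_likelihood(data, means, sds, weights))
```

Unlike the hard labels in K-means, each point here carries a membership weight for every cluster, and the log-likelihood gives one more way to assess the fit.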
 Spectral Clustering
 Hierarchical Clustering > Hierarchical Agglomerative Clustering
 each data point starts as a cluster
 two clusters are combined if combining them gives a better fit
 continue until no more clusters can be combined
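The agglomerative steps can be sketched as follows. The notes do not say how "better fit" is measured, so this example swaps in a simple stand-in: merge the two clusters whose centroids are closest, and stop once the closest pair is farther apart than a hypothetical `max_distance` threshold.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Per-coordinate mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def agglomerate(points, max_distance):
    """Merge the two closest clusters until none are within max_distance."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest together.
        (i, j), gap = min(
            (((i, j), dist(centroid(clusters[i]), centroid(clusters[j])))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if gap > max_distance:  # no remaining pair is worth combining
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
for cluster in agglomerate(points, max_distance=2.0):
    print(cluster)
```

Note that the number of clusters is not chosen up front here; it falls out of the stopping rule.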
factor analysis
 types
 exploratory: get groups in a bottom-up fashion, more common in educational data mining
 confirmatory: test the goodness of an existing structure, more common in psychometrics
 procedures
 algorithms: PAF (principal axis factoring), PCA (principal components analysis, more common)
 the first factor tries to find a combination of variable-weightings that gets the best fit to the data
 the second tries to fit the remaining unexplained variance, and so on
 factors are made orthogonal.
 compute a factor score (each factor generates a linear equation)
 find the variables that load strongly on each factor (e.g., F1) and get the loadings; many criteria exist for what counts as strong
 generate one-factor-per-variable (scale) models by iteratively
 assigning each item to a factor
 dropping the one item that loads most poorly on its factor, if it has no strong loading (best case: every variable loads strongly)
 refitting factors
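As a rough illustration of "first factor fits best, second fits what remains, factors are orthogonal", here is a plain-Python sketch of PCA only (not PAF). It uses power iteration with deflation, which is just one convenient way to get the components and is an assumption of this example; the function names and the three-variable toy data are hypothetical as well.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix and the mean-centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered


def leading_component(cov, iterations=500):
    """Power iteration: the variable weighting that captures the most variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, eigenvalue


def principal_components(rows, n_components):
    """Return (weights, share of variance explained) pairs and factor scores."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    total_variance = sum(cov[a][a] for a in range(d))
    components = []
    for _ in range(n_components):
        v, eigenvalue = leading_component(cov)
        components.append((v, eigenvalue / total_variance))
        # Deflate: remove the variance already explained, so the next
        # component fits only what remains and comes out orthogonal.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    # Factor score: a linear equation of the (centered) variables per factor.
    scores = [[sum(x * w for x, w in zip(row, v)) for v, _ in components]
              for row in centered]
    return components, scores


# Hypothetical data: the first two variables move together, the third does not.
rows = [(2.0, 2.1, 5.0), (3.0, 2.9, 1.0), (4.0, 4.2, 4.0),
        (5.0, 4.8, 2.0), (6.0, 6.1, 5.5), (7.0, 7.0, 1.5)]
components, scores = principal_components(rows, n_components=2)
for weights, share in components:
    print([round(w, 3) for w in weights], round(share, 3))
```

The weights play the role of loadings in this sketch: variables with large absolute weights on a component are the ones that would be assigned to that factor.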
 key concepts
 goodness:
 r-square (what proportion of the variance in the variables is explained by the factoring)
 cross-validated r-square
 internal reliability of scales (Cronbach’s α)
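Internal reliability can be checked with a few lines of Python. This sketch uses the standard formula for Cronbach's α, α = k/(k-1) × (1 - sum of item variances / variance of total scores); the function name and the item scores are made up for the example.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, all over the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical scale: three items answered by five respondents.
items = [
    [2, 3, 4, 4, 5],
    [3, 3, 4, 5, 5],
    [2, 4, 4, 5, 6],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```

Values closer to 1 mean the items on a scale move together; when items are dropped and factors refit, α can be recomputed for each candidate scale.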
 notes
 Unlike clustering, which groups data points together, factor analysis finds how to group data features/variables together. (the two problems can be transformed into each other, but they are not the same)
 the procedures deal with quantitative variables; there is also a variant for categorical and binary data, Latent Class Factor Analysis (LCFA; Magidson & Vermunt, 2001; Vermunt & Magidson, 2004), as well as a variant for mixed data types, Exponential Family Principal Component Analysis (EPCA; Collins et al., 2001)
 context matters!!! make reasonable variable selections.
questions
 what is cross-validation?
 what is model calculation? (Expectation-Maximization algorithm)
 what is a non-linear dimension-reduced space? what is dimensionality reduction? (spectral clustering)
 what is a support vector machine? (spectral clustering)
 how to understand the mathematical mechanisms of factor analysis?
Baker, R.S. (2015) Big Data and Education. 2nd Edition. New York, NY: Teachers College, Columbia University.
Written on January 20, 2017