Lecture Applied data science: Clustering

Pages: 21 | File type: PDF | Size: 531.11 KB

Document information:

The lecture "Applied data science: Clustering" covers: Exemplary technique - K-means clustering; Exemplary technique - Hierarchical clustering; Practical issues in clustering; Case study.
Content extracted from the document:
Clustering

Overview
1. Introduction
2. Application
3. EDA
4. Learning Process
5. Bias-Variance Tradeoff
6. Regression (review)
7. Classification
8. Validation
9. Regularisation
10. Clustering
11. Evaluation
12. Deployment
13. Ethics

Lecture outline
- Exemplary technique - K-means clustering
- Exemplary technique - Hierarchical clustering
- Practical issues in clustering
- Case study

Unsupervised learning and clustering
- Tends to be more subjective
- Often part of exploratory data analysis
- No universally accepted mechanism to validate the results
- Clustering - partition a data set into distinct, non-overlapping groups

Exemplary technique - K-means clustering
- Assign each observation to exactly one of K clusters (K must be predefined)
- A good clustering is one for which the within-cluster variation is as small as possible
- There are almost K^n ways to partition n observations into K clusters, thus the approximating algorithm…
- The steps of the algorithm are repeated until the membership of the K clusters is stable
- The algorithm only gives a local optimum
- Run the algorithm multiple times and select the best solution, i.e. the one with the smallest total within-cluster variation over all clusters.

Exemplary technique - Agglomerative hierarchical clustering
The dendrogram
- 'Hierarchical' means that clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained by cutting at any greater height => not a suitable approach for all data sets.
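
Returning to the K-means points above (iterate until the clusters are stable, expect only a local optimum, keep the best of several runs), a minimal sketch may help. The slide containing the algorithm itself did not survive extraction, so this assumes the standard assign-then-update iteration; it is not necessarily the lecture's exact procedure. Within-cluster variation is measured here as the sum of squared Euclidean distances to each cluster centroid, which is also an assumption. The function names, the toy points, and the parameters `n_starts` and `seed` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start; returns (labels, variation)."""
    # Assumed initialisation: pick k distinct observations as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # cluster membership is stable, so stop
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = centroid(members)
    variation = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, variation


def kmeans(points, k, n_starts=20, seed=0):
    """Repeat K-means from several random starts and keep the best run."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])


# Toy data (made up for illustration): three well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
labels, variation = kmeans(points, k=3, n_starts=50)
print(labels, variation)
```

A single run can stall in a poor local optimum, for example when two starting centroids fall in the same group, which is why the best of several restarts is kept.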
Choice of dissimilarities
- Euclidean distance
- Manhattan distance
- Jaccard distance
- Cosine distance
- Correlation-based distance

Choice of dissimilarity
- The Euclidean distance - similar items have a shorter distance between them
- The correlation-based distance - similar items are more strongly correlated

Practical issues in clustering
- Standardising features before clustering
- Hierarchical clustering - dissimilarity measures, types of linkage, number of clusters
- K-means clustering - the number of clusters K
- Do the clusters represent true (natural) subgroups in the data?
- Clustering methods are not robust to perturbations of the data
- Clustering results are only a starting point for forming hypotheses about the data
- Understanding clustering results
  - Use the name (or characteristic attributes) of the elements in each cluster
  - Use an exemplar member of each cluster
- Clusters may be used as labels for subsequent predictive analytics

Case study - Clustering financial centres
- 15 cities - Ho Chi Minh City, Manila, Jakarta, Kuala Lumpur, Bangkok, Mumbai, Hong Kong, Singapore, Beijing, Shanghai, Shenzhen, Seoul, Busan, Taipei, Tokyo
- 55 instrument factors - in Business Environment (20), Financial Sector Development (9), Human Capital (7), Infrastructure (8), Reputation (11)
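
To make the choice of dissimilarity and the standardisation issue discussed earlier more concrete, here is a small sketch on made-up numbers (not the case-study data). It assumes the correlation-based distance is defined as 1 minus the Pearson correlation, which is one common convention and may differ from the lecture's, and it standardises each feature to mean 0 and population standard deviation 1. All names and values are hypothetical.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """1 - Pearson correlation (assumed convention): 0 = same shape, 2 = opposite."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)


def standardise(rows):
    """Scale every feature (column) to mean 0 and population std. dev. 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, sds)) for row in rows
    ]


# Three made-up profiles: b has the same shape as a at ten times the level,
# c is close to a in level but has the opposite shape.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [4.0, 3.0, 2.0, 1.0]

print("Euclidean   a-b:", euclidean(a, b), " a-c:", euclidean(a, c))
print("Correlation a-b:", correlation_distance(a, b),
      " a-c:", correlation_distance(a, c))

# Made-up observations whose second feature is on a much larger scale.
raw = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
scaled = standardise(raw)
print("Raw    0-1:", euclidean(raw[0], raw[1]), " 0-2:", euclidean(raw[0], raw[2]))
print("Scaled 0-1:", euclidean(scaled[0], scaled[1]),
      " 0-2:", euclidean(scaled[0], scaled[2]))
```

Under Euclidean distance `a` is closer to `c`, while under the correlation-based distance `a` is closer to `b`, so the two measures can group the same items differently. Likewise, before standardising, the large-scale feature dominates the Euclidean distances; after standardising, both features contribute equally.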