Lecture Applied data science: Clustering
Pages: 21
File type: pdf
File size: 531.11 KB
Document information:
The lecture "Applied data science: Clustering" covers: Exemplary technique - K-means clustering; Exemplary technique - Hierarchical clustering; Practical issues in clustering; Case study.
Content extracted from the document:
Overview
1. Introduction
2. Application
3. EDA
4. Learning Process
5. Bias-Variance Tradeoff
6. Regression (review)
7. Classification
8. Validation
9. Regularisation
10. Clustering
11. Evaluation
12. Deployment
13. Ethics

Lecture outline
- Exemplary technique - K-means clustering
- Exemplary technique - Hierarchical clustering
- Practical issues in clustering
- Case study

Unsupervised learning and clustering
- Tends to be more subjective than supervised learning
- Often part of exploratory data analysis
- No universally accepted mechanism to validate the results
- Clustering - partition a data set into distinct, non-overlapping groups

Exemplary technique - K-means clustering
- Assign each observation to exactly one of K clusters (K must be predefined)
- A good clustering is one for which the within-cluster variation is as small as possible
- There are almost K^n ways to partition n observations into K clusters, thus the approximating algorithm…
- The above algorithm is repeated until the elements in the K clusters are stable
- The algorithm only gives a local optimum
- Run the algorithm multiple times and select the best solution, i.e. the one with the smallest total within-cluster variation across all clusters

Exemplary technique - Agglomerative hierarchical clustering
The dendrogram
- 'Hierarchical' means that clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained at any greater height => not a suitable approach for all data sets
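The nesting property of the dendrogram described above can be checked on a small example. The sketch below is a minimal, unoptimised agglomerative clustering in plain Python, not the lecture's own code. The function names (`agglomerate`, `cut`), the five toy points, and the use of Euclidean distance with complete linkage are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points, linkage):
    """Dissimilarity between two clusters (lists of point indices)."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "complete":
        return max(pairwise)
    if linkage == "single":
        return min(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage="complete"):
    """Merge the two closest clusters until only one remains.

    Returns the merges in order as (height, cluster_a, cluster_b).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((h, sorted(clusters[a]), sorted(clusters[b])))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

def cut(n_points, merges, height):
    """Clusters left after applying every merge whose height is <= `height`."""
    clusters = [[i] for i in range(n_points)]
    for h, a, b in merges:
        if h > height:
            break  # merge heights never decrease for these three linkages
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Hypothetical toy data: two tight pairs plus one point near the second pair.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 5.0)]
merges = agglomerate(points, linkage="complete")
for h, a, b in merges:
    print(f"height {h:.2f}: merge {a} + {b}")

fine = cut(len(points), merges, height=2.0)
coarse = cut(len(points), merges, height=5.0)
print(fine)    # [[0, 1], [2, 3], [4]]
print(coarse)  # [[0, 1], [2, 3, 4]]
```

Every cluster from the lower cut is contained in exactly one cluster from the higher cut. That is the constraint the slide warns about: if the natural groups in a data set are not nested in this way, a hierarchical method cannot represent them.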
Choice of dissimilarities
- Euclidean distance
- Manhattan distance
- Jaccard distance
- Cosine distance
- Correlation-based distance

Choice of dissimilarity
- Euclidean distance - similar items have a shorter distance between them
- Correlation-based distance - similar items are more strongly correlated

Practical issues in clustering
- Standardising features before clustering
- Hierarchical clustering - dissimilarity measure, type of linkage, number of clusters
- K-means clustering - the number of clusters K
- Do the clusters represent true (natural) subgroups in the data?
- Clustering methods are not robust to perturbations of the data
- Clustering results are only a starting point for forming hypotheses about the data
- Understanding clustering results
  - Use the names (or characteristic attributes) of the elements in each cluster
  - Use an exemplar member of each cluster
- Clusters may be used as labels for subsequent predictive analytics

Case study - Clustering financial centres
- 15 cities - Ho Chi Minh City, Manila, Jakarta, Kuala Lumpur, Bangkok, Mumbai, Hong Kong, Singapore, Beijing, Shanghai, Shenzhen, Seoul, Busan, Taipei, Tokyo
- 55 instrument factors - in Business Environment (20), Financial Sector Development (9), Human Capital (7), Infrastructure (8), Reputation (11)
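Three of the practical points for K-means can be sketched together: standardising features, restarting because a single run only reaches a local optimum, and choosing K. The code below uses the standard assign-then-recompute iteration (often called Lloyd's algorithm), on the assumption that this is the algorithm the slides refer to. Within-cluster variation is measured here as the sum of squared Euclidean distances to each cluster centroid, which is also an assumption, since the extracted text does not define it. The data are made-up toy values, not the financial-centre data from the case study, and all function names are hypothetical.

```python
import random
import statistics

def standardise(rows):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start; returns (variation, labels)."""
    centroids = rng.sample(points, k)  # k distinct observations as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # cluster memberships are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    variation = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return variation, labels

def kmeans(points, k, n_restarts=20, seed=0):
    """Best of several random restarts (smallest within-cluster variation)."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[0])

# Hypothetical toy data: the second feature is on a much larger scale.
rows = [
    (1.0, 100.0), (1.2, 120.0), (0.8, 90.0),
    (5.0, 110.0), (5.2, 95.0), (4.9, 105.0),
    (3.0, 900.0), (3.1, 950.0), (2.9, 880.0),
]
z = standardise(rows)
for k in range(1, 5):
    variation, labels = kmeans(z, k)
    print(k, round(variation, 3), labels)
```

Without the `standardise` step the second feature would dominate every distance simply because of its units. The printed variation also tends to keep falling as K grows, so it cannot be used on its own to pick K; one common heuristic is to look for the K after which the decrease levels off.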