Kaggle Competition Lab Reflections
Midterm Review
Data Mining & Analytics
- Wednesday, October 19th in-class
- The Midterm will be on bCourses (like a quiz)
- The link to the Midterm will be posted before class
– 1. All exams must be turned in no later than 5:00pm
– 2. You will have no more than 2 hours to complete the exam
– 3. To receive the full 2 hours, you must start between 2pm and 3pm
– 4. If starting at 2pm, for example, your exam will be due at 4pm
– 5. If starting at 3:30pm, for example, your exam will be due at 5pm
– 6. There are 19 questions, worth 53 points total (8 points extra credit + 15%)
– 7. The exam is open book/note
– 8. No communication is allowed during the test
Midterm
- Data transformation (pre-processing)
- Clustering (k-means)
- Classification (decision trees, neural nets)
- Model evaluation (cross-validation, error metrics)
- Combining classifiers (ensembles)
Topics
Subtopics
● Feature engineering (pandas)
● Representing data to fit the prediction task
● Normalization
e.g., Z-score: $z = (x - \mu) / \sigma$, where $\mu$ is the feature's mean and $\sigma$ its standard deviation
Data transformation (pre-processing)
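As a rough illustration of the Z-score normalization mentioned above, here is a minimal sketch using only the Python standard library. The `z_score` helper and the `study_hours` values are hypothetical, and the use of the population standard deviation is an assumption; the slides do not say which variant is intended.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature column, e.g. weekly study hours.
study_hours = [2.0, 4.0, 6.0, 8.0]
scaled = z_score(study_hours)
print(scaled)
```

In pandas the same idea is usually written as `(s - s.mean()) / s.std()`; note that `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the values differ slightly from the sketch above.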
Example Exam Question
Subtopics
● Types of clustering methods
● Measures of cluster goodness (SSE, silhouette score)
● The k-means algorithm
● Ways of choosing K (elbow method)
Clustering (k-means)
Sum of Squared Errors
Clustering: Elbow Method
Clustering: SSE, Silhouette Score
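Within-cluster Variance / Sum of Squared Errors: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2$, where $c_i$ is the centroid of cluster $C_i$
Silhouette Score (take the avg. of s(o) over every data point)
For each data point o in Ci:
a(o) = average distance from o to the other points in Ci
b(o) = smallest average distance from o to the points of any other cluster Cj (j ≠ i)
s(o) = (b(o) - a(o)) / max(a(o), b(o))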
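The sum of squared errors can be sketched in a few lines: for each cluster, add up the squared Euclidean distance from every point to that cluster's centroid. The toy points and the `centroid` and `sse` helpers below are hypothetical and only meant to show the computation, not a full k-means implementation.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def sse(clusters):
    """clusters: dict mapping cluster id -> list of points (tuples)."""
    total = 0.0
    for points in clusters.values():
        c = centroid(points)
        # Squared Euclidean distance of each point to its own centroid.
        total += sum(sum((x - cx) ** 2 for x, cx in zip(p, c)) for p in points)
    return total

# Hypothetical toy data: two tight groups far apart.
two_clusters = {1: [(1.0, 1.0), (3.0, 1.0)], 2: [(8.0, 8.0), (8.0, 10.0)]}
one_cluster = {1: two_clusters[1] + two_clusters[2]}

sse_k2 = sse(two_clusters)
sse_k1 = sse(one_cluster)
print(sse_k1, sse_k2)  # 104.0 4.0
```

SSE never increases as K grows, which is why the elbow method looks for the K after which the drop flattens out rather than for the minimum SSE.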
Example Exam Question
Given a clustering assignment on four 2-dimensional points:
Pt Feature 1 Feature 2 Cluster
A 2 2 1
B 0 2 1
C -4 -1 2
D -3 -2 2
(a) Calculate the silhouette coefficient for point B.
(b) If this assignment is obtained right after an iteration of K-means clustering (which may or may not have terminated), do you think the assignment will change in later iterations? Why or why not?
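One way to check an answer to this question is the sketch below. It assumes Euclidean distance and the standard silhouette definition s(o) = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to another cluster. The helper names are hypothetical, and this is not an official solution.

```python
from math import dist

# Points and cluster labels copied from the question's table.
points = {"A": (2, 2), "B": (0, 2), "C": (-4, -1), "D": (-3, -2)}
cluster_of = {"A": 1, "B": 1, "C": 2, "D": 2}

def members(cluster, exclude=None):
    return [n for n in points if cluster_of[n] == cluster and n != exclude]

def mean_distance(name, others):
    return sum(dist(points[name], points[n]) for n in others) / len(others)

def silhouette(name):
    own = cluster_of[name]
    peers = members(own, exclude=name)
    if not peers:
        return 0.0  # common convention for a singleton cluster
    a = mean_distance(name, peers)
    b = min(mean_distance(name, members(c))
            for c in set(cluster_of.values()) if c != own)
    return (b - a) / max(a, b)

def centroid(names):
    return tuple(sum(points[n][d] for n in names) / len(names) for d in range(2))

s_b = silhouette("B")

# Part (b): recompute centroids, then reassign each point to its nearest one.
centroids = {c: centroid(members(c)) for c in set(cluster_of.values())}
reassigned = {n: min(centroids, key=lambda c: dist(points[n], centroids[c]))
              for n in points}
unchanged = reassigned == cluster_of
print(round(s_b, 3), unchanged)  # 0.6 True
```

For B, a = 2 (distance to A) and b = 5 (both C and D are at distance 5), giving 0.6. Every point is already closest to its own cluster's centroid, so under these assumptions another k-means iteration would leave the assignment as it is.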
Subtopics
● Characterizing purity (Gini/Info)
● Splitting based on features to improve purity (trees)
● Improving the generalizability of training trees (pruning)
Classification (decision trees)
Example Exam Question
Pt Feature Label
A 7 0
B 10 1
C 4 0
D 10 0
E 16 1
F 9 1
(a) Calculate the Gini index of the dataset.
(b) If we split on Feature at 8, what is the overall Gini index after splitting?
(c) If we split on Feature at 13, what is the overall Gini index after splitting?
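The Gini calculations in this question can be checked with a short sketch. It assumes the usual definition, Gini = 1 - sum of squared class proportions, and takes the "overall" Gini after a split to be the size-weighted average of the two children. Sending rows with feature < threshold to the left child is an assumption (no row sits exactly on 8 or 13, so the choice does not matter here), and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_after_split(rows, threshold):
    """Size-weighted Gini of the two children of a split at `threshold`."""
    left = [label for feature, label in rows if feature < threshold]
    right = [label for feature, label in rows if feature >= threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# (feature, label) pairs for points A-F, copied from the question's table.
rows = [(7, 0), (10, 1), (4, 0), (10, 0), (16, 1), (9, 1)]

gini_all = gini([label for _, label in rows])
gini_split_8 = gini_after_split(rows, 8)
gini_split_13 = gini_after_split(rows, 13)
print(gini_all, round(gini_split_8, 3), round(gini_split_13, 3))  # 0.5 0.25 0.4
```

The split at 8 isolates two pure label-0 rows (A and C) on the left, which is why it lowers the impurity more than the split at 13.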
Subtopics
● Feed forward neural networks
Input layer, hidden layer, output layer, weights, bias
● Backpropagation (conceptual)
● Activation functions
Logistic, relu, softmax
● Additional details
Epoch, batch size, stopping criteria
Classification (neural networks)
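To make the pieces listed above concrete, here is a minimal sketch of a forward pass through one hidden layer, using the three activation functions named on the slide. The weights, biases, and input are made-up numbers, and the `dense` helper is hypothetical. Training (backpropagation, epochs, batches) is not shown.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation=lambda z: z):
    """One fully connected layer: activation(w . x + b) for each unit."""
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hypothetical network: 2 inputs -> 2 hidden units (ReLU) -> 2 outputs (softmax).
x = [1.0, 2.0]
hidden = dense(x, weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, -1.0],
               activation=relu)
logits = dense(hidden, weights=[[1.0, 0.5], [-1.0, -0.5]], biases=[0.0, 0.0])
probs = softmax(logits)
print(hidden, logits, probs)
```

The first hidden unit receives a negative pre-activation and is zeroed by ReLU; softmax then turns the two output logits into class probabilities that sum to 1.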
Example Exam Question
Subtopics
● Metrics (confusion matrix based & continuous)
● Training, validation, and testing sets
● Cross-validation
● Model selection
Evaluation of models
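Cross-validation splits the data into k folds and lets each fold serve once as the validation set while the rest is used for training. The sketch below only generates the index splits; the `k_fold_indices` name is hypothetical, and it assumes the rows have already been shuffled if order matters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop

folds = list(k_fold_indices(n_samples=6, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

A model would be fit on each `train_idx` and scored on the matching `val_idx`; averaging the k scores gives the cross-validated estimate used for model selection, with a separate test set held back for the final check.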
Error Metrics
predicted actual
0 0
0 1
1 0
1 1
1 1
Examples from lecture
predicted actual
0.25 0
0.45 1
0.66 0
0.71 1
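The two small tables above can be turned into numbers with a short sketch. Which metrics the lecture actually computed on them is not stated here, so accuracy, precision, and recall (for the 0/1 predictions) and RMSE (for the continuous predictions) are an illustrative choice.

```python
import math

# (predicted, actual) pairs copied from the first table.
binary = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
# (predicted probability, actual) pairs copied from the second table.
continuous = [(0.25, 0), (0.45, 1), (0.66, 0), (0.71, 1)]

# Confusion-matrix counts, treating label 1 as the positive class.
tp = sum(1 for p, a in binary if p == 1 and a == 1)
tn = sum(1 for p, a in binary if p == 0 and a == 0)
fp = sum(1 for p, a in binary if p == 1 and a == 0)
fn = sum(1 for p, a in binary if p == 0 and a == 1)

accuracy = (tp + tn) / len(binary)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Root mean squared error between predicted probabilities and labels.
rmse = math.sqrt(sum((p - a) ** 2 for p, a in continuous) / len(continuous))

print(tp, tn, fp, fn)  # 2 1 1 1
print(accuracy, round(precision, 3), round(recall, 3), round(rmse, 3))
```

On the first table this gives 2 true positives and 1 each of true negatives, false positives, and false negatives, so accuracy is 0.6 and precision and recall are both 2/3.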
Example Exam Question
For each of the tasks below, explain which algorithm(s) and error metric(s) you would use and why.
(a) Predict GPA given students’ department, credits taken, study hours, etc.
(b) Predict if a Twitter user is liberal or conservative.
(c) Predict lung cancer from chest X-rays.
Subtopics
● Simple combiners
● Bagging (e.g., random forests)
● Boosting (Adaboost)
● Blending/Stacking
Combining classifiers (ensembles)
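Two of the building blocks above can be sketched without any library: a simple combiner (majority vote over several models' predictions) and the bootstrap resampling that bagging uses to give each model a different training set. The prediction lists and helper names are hypothetical; boosting and stacking are not shown.

```python
import random
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model -> list of combined labels."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement, as bagging does per model."""
    return [rng.choice(rows) for _ in rows]

# Hypothetical 0/1 predictions from three models on four examples.
model_preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
combined = majority_vote(model_preds)
print(combined)  # [1, 0, 1, 1]

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = bootstrap_sample(list(range(10)), rng)
print(sample)
```

Using an odd number of models avoids ties in a binary vote. In a random forest, each tree would be trained on its own bootstrap sample before the votes are combined.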
Example Exam Question
Determine True or False for each of the following statements:
(a) Ensemble methods are never the cause of overfitting.
(b) Hyperparameters of random forests include (but are not limited to) number of trees, percentage of rows sampled, and max depth.
(c) Blending uses cross-validation while stacking uses a holdout validation.
Break-out groups (self-select after break)
- Clustering & Preprocessing
- Prediction/Classification Models
- Cross-validation & Metrics
- Ensembling
Prof. Pardos will stay in the main room for general Q & A
https://classes.berkeley.edu/content/2021-fall-data-144-001-lec-001