Machine Learning · R Programming · Harvard

Data Science Portfolio
by Sarah Silva

A rigorous end-to-end machine learning study completed at Harvard, applying 10+ algorithms to real-world datasets — from cancer diagnosis to movie recommendations.

12+ Datasets
10+ Algorithms
97.4% Best Accuracy
🎓 Harvard University · Data Science Certificate
01 — Overview

Key Metrics

Performance highlights across all models and datasets

97.4% Best Model (KNN)
96.5% Best Ensemble
88.3% Titanic RF
98.8% SVD Variance
λ=135 Optimal Reg.
🎗️ BRCA — Cancer Classification
Breast cancer biopsy · 569 samples · 30 features
4 Models
Logistic Regression: 93.9%
LOESS / GAM: 93.0%
Random Forest: 94.8%
KNN (k=9) ★: 97.4%
Ensemble: 96.5%
Key insight: KNN at k=9 outperformed all models including the ensemble, achieving 97.4% accuracy after proper feature scaling with sweep(). The most important predictor was area_worst.
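A minimal sketch of that scaling-plus-KNN step, assuming dslabs::brca, matrixStats and caret; the seed, split and k grid are illustrative:

```r
library(dslabs)       # brca: 569 biopsies x 30 features
library(matrixStats)  # colSds()
library(caret)        # createDataPartition(), train()

data(brca)
# Z-score every feature column with sweep(): center, then scale
x <- sweep(brca$x, 2, colMeans(brca$x), FUN = "-")
x <- sweep(x, 2, colSds(brca$x), FUN = "/")

set.seed(1)
test_idx <- createDataPartition(brca$y, p = 0.2, list = FALSE)

# Tune k by resampling on the training split; k = 9 won in this study
fit <- train(x[-test_idx, ], brca$y[-test_idx], method = "knn",
             tuneGrid = data.frame(k = seq(3, 21, 2)))
mean(predict(fit, x[test_idx, ]) == brca$y[test_idx])  # test accuracy
```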
🚢 Titanic Survival Prediction
891 passengers · Sex, class, age, fare
6 Models
Sex only: 78.8%
Class only: 68.2%
GLM (4 predictors): 82.1%
Decision Tree: 84.9%
KNN (k=15): 73.2%
Random Forest ★: 88.3%
Key insight: Random Forest (mtry=2, ntree=100) was the best model with 88.3%. Females in 1st and 2nd class had the highest survival rates (>90%). The decision tree revealed Sex → Age → Fare → Class as the key decision splits.
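A sketch of the winning forest (mtry=2, ntree=100), assuming the titanic package's titanic_train; the median-age imputation is a deliberate simplification:

```r
library(titanic)        # titanic_train: 891 passengers
library(randomForest)

df <- titanic_train
df$Survived <- factor(df$Survived)                     # 0 = died, 1 = survived
df$Sex      <- factor(df$Sex)
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)  # fill missing ages

set.seed(42)
fit <- randomForest(Survived ~ Sex + Pclass + Age + Fare,
                    data = df, mtry = 2, ntree = 100)
fit              # out-of-bag error and confusion matrix
importance(fit)  # which predictors matter most
```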
🔢 MNIST 27 — Ensemble Comparison
Digit recognition (2s vs 7s) · 7 models
Ensemble: 82.5%
GLM: 77.5%
LDA: 77.5%
Naive Bayes: 81.5%
KNN ★: 83.5%
LOESS: 83.5%
QDA: 81.5%
Random Forest: 82.0%
Key insight: KNN and LOESS tied at 83.5%, beating the 7-model ensemble (82.5%). When only models with CV accuracy ≥ 80% were included (gamLoess, QDA), the filtered ensemble maintained 82.5% — confirming that weaker models don't always hurt.
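A sketch of the 7-model majority vote, assuming caret and dslabs::mnist_27; each caret method pulls in its own backing package (e.g. gam for "gamLoess"):

```r
library(caret)
library(dslabs)
data(mnist_27)

models <- c("glm", "lda", "naive_bayes", "knn", "gamLoess", "qda", "rf")
fits <- lapply(models, function(m)
  train(y ~ ., method = m, data = mnist_27$train))

# Matrix of test-set predictions: one column per model, entries "2" or "7"
preds <- sapply(fits, function(f) as.character(predict(f, mnist_27$test)))

# Majority vote: predict "7" whenever more than half of the models do
y_hat <- ifelse(rowMeans(preds == "7") > 0.5, "7", "2")
mean(y_hat == mnist_27$test$y)  # ensemble accuracy (82.5% in this study)
```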
📐 KNN Accuracy vs. k — Heights Dataset
Sex prediction from height · F1 score optimization
[Plot: F1 score vs. k, k = 1 to 101 · peak at k = 40 ★]
Key insight: The optimal k=40 maximized F1 score at 0.619 for sex prediction from height. Lower k values overfit (high variance), while higher k values underfit. The curve peaks around k=40 before gradually declining.
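A sketch of the k sweep, assuming dslabs::heights and caret's knn3()/F_meas(); the split proportion and step size are illustrative:

```r
library(caret)
library(dslabs)
data(heights)

set.seed(1)
test_idx  <- createDataPartition(heights$sex, p = 0.5, list = FALSE)
train_set <- heights[-test_idx, ]
test_set  <- heights[test_idx, ]

ks <- seq(1, 101, 2)
f1 <- sapply(ks, function(k) {
  fit   <- knn3(sex ~ height, data = train_set, k = k)
  y_hat <- predict(fit, test_set, type = "class")
  F_meas(data = y_hat, reference = test_set$sex)
})
ks[which.max(f1)]  # ~40 in this study
max(f1)            # ~0.619
```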
🏫 Schools Regularization — RMSE vs λ
1000 simulated schools · Finding optimal penalty
[Plot: RMSE vs. λ, λ = 10 to 250 · minimum at λ = 135 ★]
Key insight: λ=135 minimizes RMSE when estimating school quality. Without regularization, small schools dominate top rankings due to high variance — not because they are actually better. Regularization "shrinks" noisy estimates toward the overall mean.
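A sketch of the penalty sweep; `scores` (a list of per-school score vectors) and `schools$quality` are assumed to exist from the simulation:

```r
overall <- mean(sapply(scores, mean))  # grand mean across all schools

lambdas <- seq(10, 250)
rmse <- sapply(lambdas, function(lambda) {
  # Shrink each school's deviation toward the grand mean; schools with
  # few students (small length(s)) are shrunk the hardest
  est <- sapply(scores, function(s)
    overall + sum(s - overall) / (length(s) + lambda))
  sqrt(mean((est - schools$quality)^2))
})
lambdas[which.min(rmse)]  # 135 in this study
```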
🧬 BRCA PCA — Variance Explained
30 principal components · Cancer features
PC1: 44.3%
PC2: 19.0%
PC3: 9.4%
PC4: 6.6%
PC5: 5.5%
PC6: 4.0%
PC7 (→90% cumulative): 3.1%
Key insight: PC1 alone explains 44.3% of variance and clearly separates benign from malignant tumors — the only PC with non-overlapping IQRs between groups. Just 7 components are needed to explain 90% of total variance.
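A sketch of the variance-explained calculation, reusing the scaled 569 × 30 matrix `x` from the BRCA sketch above:

```r
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

round(var_explained[1:3], 3)             # 0.443, 0.190, 0.094
which(cumsum(var_explained) >= 0.90)[1]  # 7 components reach 90%

# Per-group spread of PC1 scores (B vs M): the non-overlapping IQRs
boxplot(pca$x[, 1] ~ brca$y, xlab = "Tumor type", ylab = "PC1 score")
```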
02 — Visualizations

Plots from R

Real outputs generated during the analysis, each with a short explanation

Titanic Decision Tree
Classification tree for survival prediction — Sex, Age, SibSp, Fare and Pclass splits
rpart · cp=0.02
MovieLens — Genre Effect
Average rating ± SE by genre combination. Comedy has the lowest average (3.27 stars)
movielens · error bars
MovieLens — Time Effect
Weekly average rating over time (1995–2017) with LOESS smoother — mild downward trend
movielens · LOESS
Rating Frequency vs. Average
More frequently rated films have higher average ratings — popularity bias confirmed
movielens · scatter
Ratings by Release Year
Boxplot of ratings count per film by release year — 1995 has the highest median
movielens · boxplot
SVD — V Matrix Heatmap
The right singular vectors reveal structure in the 24-subject grade matrix (Math, Science, Arts)
SVD · my_image()
Correlation Matrix — Student Grades
High within-subject correlations, moderate between Math/Science, weaker with Arts
SVD · cor(y)
Student Grades Heatmap
100 students × 24 subjects. Top students (red) cluster at top, with 3 subject groupings visible
SVD · my_image(y)
BRCA — PCA Boxplots (B vs M)
PC1 is the only component with non-overlapping IQRs between Benign and Malignant tumors
BRCA · PCA · boxplot
BRCA — PC1 vs PC2 Scatter
Malignant (teal) clearly separated from Benign (pink) along PC1 axis — strong signal
BRCA · PCA · scatter
03 — Key Findings

What the Data Revealed

The most important insights from applying ML theory to real datasets

97.4%
KNN Dominates BRCA
After proper Z-score scaling with sweep(), KNN at k=9 outperformed all individual models and even the 4-model ensemble on cancer diagnosis.
λ=135
Regularization Fixes Rankings
The "small school effect" — small schools dominating top rankings due to noise — is corrected entirely by regularization. Optimal λ=135 minimizes RMSE.
98.8%
3 SVD Components Suffice
Just 3 singular vectors capture 98.8% of variance in 100×24 grade data. Component 1 = ability, 2 = Arts vs STEM, 3 = Math vs Science.
7.4×
Bayes Theorem in Practice
A positive test (85% sensitivity, 90% specificity) for a disease with 2% prevalence increases the probability of disease by 7.4×, from 2% to 14.8% (worked through in the sketch after these findings).
75%
Data Leakage Demonstrated
Selecting 108 "significant" predictors from random data before cross-validation inflated accuracy to 75% — a textbook case of data leakage invalidating results.
PC1
One Component Separates Cancer
PC1 alone (44.3% of variance) separates malignant from benign tumors with no IQR overlap. The other nine of the first ten components all show overlapping distributions.
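The Bayes figure above can be checked with a few lines of arithmetic, using the values stated in the finding:

```r
# The 7.4x Bayes figure, reproduced with plain arithmetic
sens <- 0.85; spec <- 0.90; prev <- 0.02
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # P(test positive) = 0.115
post  <- sens * prev / p_pos                    # P(disease | positive test)
post         # 0.148 -> 14.8%
post / prev  # 7.4x the 2% prior
```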
04 — Algorithms

Models Implemented

All methods applied, tuned, and evaluated with proper cross-validation

📈
Logistic Regression
BRCA: 93.9%
Linear boundary classification. Used across Titanic, BRCA and MNIST datasets.
🌲
Random Forest
Titanic: 88.3%
Bagged decision trees with random feature subsets. Best model for Titanic. Key BRCA variable: area_worst.
🔵
K-Nearest Neighbors
BRCA: 97.4%
Distance-based. Best overall model. Requires scaling — very sensitive to feature normalization.
〰️
LOESS / GAM
BRCA: 93.0%
Local polynomial regression. Non-parametric smoother for both regression and classification tasks.
🌳
Decision Trees
Titanic: 84.9%
Interpretable splits via rpart. Tuned with complexity parameter cp.
📊
LDA / QDA
MNIST: 77.5%
Linear/quadratic discriminant analysis. Fast parametric classifiers assuming Gaussian distributions.
🧮
PCA / SVD
98.8% variance
Dimensionality reduction. 3 components suffice for student grades, PC1 separates cancer types.
🗳️
Ensemble (Vote)
BRCA: 96.5%
Majority vote across 4–7 models. Reduces variance but can be outperformed by a single strong model.