Lecture 1: Linear Regression: A Complete Explanation

1. What Is Machine Learning?

In ordinary programming, you write the rules yourself:

if experience > 5: salary = 70000

if experience > 10: salary = 90000

In Machine Learning you do not write the rules. You supply data and let the computer find the rules.

Example: given experience and salary data for 1000 employees, the computer learns on its own a pattern such as "salary rises by about ₹5000 for each year of experience."

2. Supervised Learning

"Supervised" means we provide the correct answers (labels) in advance.

It is like a teacher giving a student questions together with their answers for practice. Train on data for 100 houses (area, rooms, location) plus their prices, and the model can predict the price of a new house.

3. Regression vs Classification

Regression = the output is a (continuous) number. "What is the price of this house?" → ₹45,00,000
Classification = the output is a category. "Is this email spam or not?" → Spam / Not Spam

Linear Regression is a regression method. Its output is always a number.

4. The Big Idea: Drawing the Best Line

A very simple example:

Experience (years): 1 2 3 4 5

Salary (lakhs): 3 5 6 8 10

Plot this on a graph and the points fall almost on a straight line. The job of Linear Regression is to find the best line, the one that lies closest to these points.

Once you have that line, you can make a new prediction such as "what is the salary for 6 years of experience?": put x=6 into the line and read off the y value.

5. The Math: y = mx + b

The formula for that line:

y = mx + b

y = the value to predict (salary)
x = the input feature (experience)
m = slope (how steeply the line rises): "how much salary goes up per year"
b = intercept (where the line starts at x=0): "the starting salary with no experience"

Example: salary = 2 × experience + 1

Experience 0 → Salary = 1 lakh (starting salary)
Experience 3 → Salary = 7 lakhs
Experience 5 → Salary = 11 lakhs
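
A minimal sketch of how the best line could be computed for the five-point experience/salary example. It uses the standard closed-form least-squares formulas for one feature; the function name fit_line is an illustrative choice, not something taken from the notebook.

```python
# Closed-form least-squares fit of y = m*x + b for the small example dataset.
experience = [1, 2, 3, 4, 5]   # years
salary = [3, 5, 6, 8, 10]      # lakhs

def fit_line(xs, ys):
    """Return (m, b) that minimize the squared error for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx            # slope
    b = mean_y - m * mean_x    # intercept
    return m, b

m, b = fit_line(experience, salary)
prediction_6_years = m * 6 + b
print(f"m = {m:.2f}, b = {b:.2f}, salary at 6 years = {prediction_6_years:.2f} lakhs")
```

For this data the fitted line comes out close to salary = 1.7 × experience + 1.3, so 6 years of experience predicts about 11.5 lakhs.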

6. Multiple Features

There need not be just one x; there can be many features:

salary = (m1 × experience) + (m2 × education_years) + (m3 × city_tier) + b

House price example:

price = (500 × area_sqft) + (200000 × bedrooms) + (100000 × floor) + 1000000

The computer's job: find the right values of m1, m2, m3 and b.

7. How Do We Find the Best Line? — Loss Function (MSE)

What does "best line" mean? The distance (error) between the line and all the points should be as small as possible.

Error = Actual salary - Predicted salary

But some errors are positive (+5) and some are negative (-3), so they would cancel out when added up. That is why we square them:

MSE = average of (actual - predicted)²

Example:

Actual: [3, 5, 6, 8, 10]

Predicted: [2.5, 4.5, 6.5, 8.5, 10.5]

Errors: [0.5, 0.5, -0.5, -0.5, -0.5]

Squared: [0.25, 0.25, 0.25, 0.25, 0.25]

MSE = 0.25 (average)

A lower MSE means the line fits better!
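
The MSE arithmetic can be checked with a few lines of plain Python. This is only a sketch; the helper name mean_squared_error is chosen here for illustration (scikit-learn has a function of the same name, but it is not used below).

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences (actual - predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(e ** 2 for e in errors) / len(errors)

actual = [3, 5, 6, 8, 10]
predicted = [2.5, 4.5, 6.5, 8.5, 10.5]

raw_errors = [a - p for a, p in zip(actual, predicted)]
mean_raw_error = sum(raw_errors) / len(raw_errors)   # positives and negatives partly cancel
mse = mean_squared_error(actual, predicted)

print(raw_errors)      # [0.5, 0.5, -0.5, -0.5, -0.5]
print(mean_raw_error)  # -0.1
print(mse)             # 0.25
```

Every prediction is off by 0.5, yet the plain average of the errors is only -0.1. Squaring removes that cancellation.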

8. Gradient Descent: Finding the Minimum

How do we adjust m and b to reduce the MSE?

Analogy: you are standing on top of a hill and have to walk down with your eyes closed. What do you do? You feel for the downhill slope and step in that direction. That is Gradient Descent!

Step 1: Start from random values of m and b

Step 2: Compute the MSE

Step 3: Find the slope (gradient) of the MSE with respect to m and b

Step 4: Adjust m and b a little in the downhill direction (opposite to the gradient)

Step 5: Repeat until the MSE stops decreasing

Learning rate = the size of each step. If it is too large, the update overshoots the minimum. If it is too small, progress is very slow.
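
The five steps can be sketched as a short batch gradient descent loop. This is an illustrative from-scratch version, not the notebook's code. It starts from m = 0, b = 0 rather than random values so the run is reproducible, and the learning rate 0.05 and 5000 steps are assumptions picked for this tiny dataset.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimize the MSE of y = m*x + b with plain (full-batch) gradient descent."""
    m, b = 0.0, 0.0                      # Step 1: starting values (fixed here, not random)
    n = len(xs)
    for _ in range(steps):               # Step 5: repeat
        # Step 2: prediction errors that make up the MSE
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Step 3: gradient of the MSE with respect to m and b
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step 4: move a small step against the gradient
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

experience = [1, 2, 3, 4, 5]
salary = [3, 5, 6, 8, 10]
gd_m, gd_b = gradient_descent(experience, salary)
print(f"m = {gd_m:.3f}, b = {gd_b:.3f}")
```

On this data the loop settles near m = 1.7 and b = 1.3, the same line the closed-form least-squares solution gives. A much larger learning rate (for example 0.1) makes the updates diverge on these unscaled inputs.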

9. Train/Test Split

If you use all the data to train the model, you cannot tell whether it will also work well on new data.

So:

80% of the data → Training (for the model to learn from)
20% of the data → Testing (to test the model; the model has never seen this data)

If it does well on the test data, the model generalizes.
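
A minimal sketch of an 80/20 split in plain Python. The notebook would normally use scikit-learn's train_test_split; the function below is a hypothetical stand-in written only to show the idea, and the seed value is arbitrary.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut off the first test_fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))          # stand-in for 100 data rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters: if the rows are sorted (say, by price), an unshuffled split would test only on the most expensive houses.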

10. Evaluation Metrics

How well is the model doing?

MSE = average squared error (seen above)
RMSE = √MSE (in the original units, e.g. "off by about ₹5000 on average")
MAE = average |error| (absolute value instead of squaring)
R² = a score that usually lies between 0 and 1 (it can go negative on test data when the model does worse than always predicting the mean). "What fraction of the variance in the data does the model explain?"
R² = 0.85 → the model explains 85% of the variance
R² = 0.95 → a very good fit
R² = 0.30 → a poor fit
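
As a sketch, all four metrics can be computed by hand for the actual/predicted values used in the MSE example. The function name regression_metrics is an illustrative choice.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R² for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

metrics = regression_metrics([3, 5, 6, 8, 10], [2.5, 4.5, 6.5, 8.5, 10.5])
print(metrics)
```

Here MSE is 0.25, RMSE and MAE are both 0.5 lakhs, and R² is about 0.957.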

11. Overfitting vs Underfitting

Underfitting (too simple): a straight line cannot capture curved data. "The model has not learned enough."

Overfitting (too complex): the model has become so complex that it matches every training point exactly. The training score is 99%, but the test score is 60%. "The model has memorized the data and does not know how to apply it to new data."

Just right (good fit): works well on both training and test data.

Analogy: exam preparation.

Underfitting = studied only one chapter
Overfitting = memorized the question bank, but cannot answer a question in a new format
Good fit = understood the concepts and can answer in any format

12. Regularization: Ridge & Lasso

To prevent overfitting, we keep the model's weights (the m values) from growing too large.

Ridge (L2): adds the squares of the weights as a penalty. It keeps all the features but shrinks their weights.
Lasso (L1): adds the absolute values of the weights as a penalty. It can drive the weights of some features to exactly 0 (feature selection!).

Example: there are 20 features, but only 5 really matter. Lasso can set the weights of the other 15 to 0, in effect saying "these are not needed."
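
The different behaviour of the two penalties can be seen even with a single feature. The sketch below solves two small objectives by hand on centered data; the objectives are written in the comments and are assumptions of this example (scikit-learn's Ridge and Lasso scale their loss terms differently, so the alpha values are not directly comparable).

```python
def centered_sums(xs, ys):
    """Return (S_xy, S_xx) computed on mean-centered data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy, s_xx

def ridge_slope(xs, ys, alpha):
    # Minimizes sum((y_c - m*x_c)^2) + alpha * m^2  ->  m = S_xy / (S_xx + alpha)
    s_xy, s_xx = centered_sums(xs, ys)
    return s_xy / (s_xx + alpha)

def lasso_slope(xs, ys, alpha):
    # Minimizes 0.5 * sum((y_c - m*x_c)^2) + alpha * |m|  ->  soft-thresholding of S_xy
    s_xy, s_xx = centered_sums(xs, ys)
    shrunk = max(abs(s_xy) - alpha, 0.0)
    return (shrunk if s_xy >= 0 else -shrunk) / s_xx

experience = [1, 2, 3, 4, 5]
salary = [3, 5, 6, 8, 10]
for alpha in [0, 5, 10, 20]:
    print(alpha, round(ridge_slope(experience, salary, alpha), 3),
          round(lasso_slope(experience, salary, alpha), 3))
```

With these objectives the ridge slope shrinks smoothly (1.7 at alpha = 0, 0.85 at alpha = 10) but never reaches zero, while the lasso slope becomes exactly 0 once alpha is large enough (alpha = 20 here).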

13. Cross-Validation (K-Fold)

With a single train/test split, the score may look good only for the particular 20% we happened to pick as test data, and be worse for a different split.

K-Fold (K=5):

Round 1: [Test] [Train] [Train] [Train] [Train] → Score 1

Round 2: [Train] [Test] [Train] [Train] [Train] → Score 2

Round 3: [Train] [Train] [Test] [Train] [Train] → Score 3

Round 4: [Train] [Train] [Train] [Test] [Train] → Score 4

Round 5: [Train] [Train] [Train] [Train] [Test] → Score 5

Final score = Average of all 5 scores

Every data point lands in the test fold exactly once. This gives a more reliable estimate of the model's real performance.
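
A small sketch of how the K-Fold rounds can be generated as index lists. It does not shuffle, which keeps the output easy to read; the function name k_fold_indices is illustrative, and scikit-learn's KFold offers the same idea with optional shuffling.

```python
def k_fold_indices(n_samples, k=5):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, k=5)
for round_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Round {round_no}: test={test_idx} train={train_idx}")
```

In each round the model would be trained on train_idx and scored on test_idx, and the five scores averaged.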

14. Polynomial Regression

What if a straight line is not enough? What if the data is curved?

Linear: y = mx + b (straight line)

Polynomial: y = ax² + bx + c (curve)

Experience vs salary may be a curve: it rises quickly at first, then flattens. Polynomial regression can fit that curve.

But the higher the degree, the greater the overfitting risk!

15. What We Do in the Notebook (Hands-on)

In the notebook, using the California Housing dataset:

Data load → fetch_california_housing()
EDA → look at distributions, correlations, scatter plots
Preprocessing → scale the features with StandardScaler
Model build → LinearRegression().fit(X_train, y_train)
Evaluate → look at MSE, RMSE, R² score
Assumptions check → are the residuals normal and homoscedastic?
Gradient Descent → train with SGDRegressor, setting the learning rate manually
Regularization → compare Ridge and Lasso
Feature Selection → VIF and RFE techniques
Cross Validation → get a stable score with K-Fold
GridSearchCV → search for the best hyperparameters automatically

Summary: Linear Regression = finding the line (or curve) that best fits the data. Minimize the MSE loss with Gradient Descent. Evaluate with a train/test split. Control overfitting with regularization.

---------

Notebook Concepts:

1. EDA (Exploratory Data Analysis): Why Look at Heatmaps and Distributions?

You need to understand the data before building a model. Modelling without EDA is like cooking with your eyes closed.

Why look at a distribution (histogram)?

To see how each feature is spread out.

Example: look at the "area" feature in a house price dataset:

Houses: 500sqft, 600sqft, 700sqft, 800sqft, ..., 5000sqft, 12000sqft

The histogram shows that most houses are between 500 and 2000 sqft, but 12000 sqft is an outlier! It can distort the model. Without looking at the distribution, you would not have noticed it.

What to look for:

Are there outliers? (very high or very low values)
Is it skewed? (a long tail on one side)
Is it roughly normal? (bell shape. Linear Regression does not strictly require normally distributed features, but heavily skewed ones often benefit from a transformation)

Why look at a heatmap (correlation matrix)?

To find out which features are strongly related to the target.

Example, for predicting house price:

Correlation with price:

area_sqft: +0.85 ← strong positive (more area = higher price)

bedrooms: +0.65 ← moderate

distance_city: -0.70 ← strong negative (farther away = lower price)

wall_color: +0.02 ← almost zero (irrelevant feature!)

From the heatmap you can see at once:

area_sqft is very important: it should definitely be in the model
wall_color is useless: it can be removed
area_sqft and bedrooms are correlated +0.90 with each other → keeping both causes a multicollinearity problem (we will look at this with VIF)

2. StandardScaler: Why Scale the Features?

Problem:

Feature 1: area_sqft → values: 500, 1000, 1500, 2000, 3000

Feature 2: bedrooms → values: 1, 2, 3, 4, 5

Area ranges over 500-3000; bedrooms over 1-5. Because area's numbers are much larger, scale-sensitive methods (gradient descent, regularized models, distance-based algorithms) end up giving area more weight, even though bedrooms may be equally important!

Solution: StandardScaler

It brings every feature onto the same scale:

scaled_value = (value - mean) / standard_deviation

After scaling:

area_sqft: [-1.28, -0.70, -0.12, 0.46, 1.63] ← mean=0, std=1

bedrooms: [-1.41, -0.71, 0.00, 0.71, 1.41] ← mean=0, std=1

Now both are in the same range (roughly -2 to +2). The model can compare them fairly.
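
A sketch that applies the scaling formula to the two example features in plain Python. It uses the population standard deviation (dividing by n), which is what scikit-learn's StandardScaler does; the function name standardize is illustrative.

```python
import math

def standardize(values):
    """Return (value - mean) / std for every value, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

area_sqft = [500, 1000, 1500, 2000, 3000]
bedrooms = [1, 2, 3, 4, 5]

scaled_area = standardize(area_sqft)
scaled_bedrooms = standardize(bedrooms)
print([round(v, 2) for v in scaled_area])      # [-1.28, -0.7, -0.12, 0.46, 1.63]
print([round(v, 2) for v in scaled_bedrooms])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

In a real pipeline the mean and standard deviation should be computed on the training set only and then reused to transform the test set.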

Real-life analogy: comparing batsmen in cricket. Virat has 3000 runs in 50 matches, Dhoni has 5000 runs in 100 matches. On raw runs Dhoni looks "better." But on average (runs/matches), Virat: 60, Dhoni: 50. Scaling is what makes the comparison fair.

When is scaling a MUST?

When using Gradient Descent (SGDRegressor, Neural Networks): without scaling, training may converge very slowly or not at all
Ridge/Lasso regularization: the penalty depends on the size of the feature values
KNN, SVM: distance-based algorithms

When is scaling optional?

Decision Trees, Random Forest: these are split-based, not distance-based

3. Residual Analysis: How to Check the Model Assumptions

What is a residual?

Residual = Actual value - Predicted value

Example:

House 1: Actual price = 50 lakhs, Predicted = 48 lakhs → Residual = +2

House 2: Actual price = 30 lakhs, Predicted = 35 lakhs → Residual = -5

Why look at residuals?

Linear Regression rests on some assumptions. Plotting the residuals shows whether those assumptions hold.

Assumption 1: Residuals should be random (no pattern)

[Figure: two residual scatter plots. Left: GOOD (random scatter). Right: BAD (pattern, a curve).]

A pattern means the model is missing some relationship (maybe a polynomial term is needed).

Assumption 2: Residuals should follow a normal distribution

A histogram of the residuals should be bell-shaped. If it is skewed, the model is consistently wrong in certain ranges.

Assumption 3: Homoscedasticity (equal spread)

[Figure: two residual scatter plots. Left: GOOD (equal spread). Right: BAD (fan shape, heteroscedasticity).]

Fan shape = the error grows as the predicted value grows. This is a problem: the model is accurate for low prices and inaccurate for high prices.

What you should do:

plt.scatter(predicted, residuals): should show a random scatter with no pattern
plt.hist(residuals): should be bell-shaped
If there is a pattern or fan shape → try a feature transformation (log, sqrt) or a polynomial model
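
Alongside the plots, a rough numeric check can hint at a fan shape. The sketch below uses made-up prices (purely hypothetical data) and compares the average residual size for low and high predictions. It is a crude stand-in for looking at the scatter plot, not a formal statistical test.

```python
predicted = [10, 20, 30, 40, 50, 60, 70, 80]   # hypothetical predicted prices (lakhs)
actual = [11, 19, 31, 38, 54, 55, 77, 71]      # hypothetical actual prices (lakhs)

# Residual = actual - predicted
residuals = [a - p for a, p in zip(actual, predicted)]

def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

# Sort by predicted value, then compare residual spread in the lower and upper halves.
pairs = sorted(zip(predicted, residuals))
half = len(pairs) // 2
low_spread = mean_abs([r for _, r in pairs[:half]])
high_spread = mean_abs([r for _, r in pairs[half:]])
mean_residual = sum(residuals) / len(residuals)

print(residuals)                 # [1, -1, 1, -2, 4, -5, 7, -9]
print(mean_residual)             # -0.5
print(low_spread, high_spread)   # 1.25 6.25
```

The residuals average close to zero, but their size is five times larger for the higher predictions, which is the numeric signature of the fan shape described above.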

4. VIF (Variance Inflation Factor): What Is Multicollinearity?

Multicollinearity = two (or more) features being very closely related to each other.

Example:

Feature 1: area_sqft (area of the house)

Feature 2: num_rooms (number of rooms)

These two are highly correlated: a bigger house has more rooms. Put both in the model and it gets confused: "is it area or rooms that influences the price?" → the weights become unstable.

VIF Score:

VIF = 1 → No multicollinearity (perfect)

VIF = 1-5 → Moderate (acceptable)

VIF = 5-10 → High (warning)

VIF > 10 → Very high (remove this feature!)

How is it calculated?

Each feature is predicted from all the other features. If it can be predicted well, that feature is redundant (the other features already carry that information). Formally, VIF = 1 / (1 - R²), where R² comes from that regression.

In the notebook:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# We remove features whose VIF is > 10

Real-life analogy: putting both "Age: 25" and "Year of birth: 1999" on a resume. It is the same information twice; one is enough.
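
To make the VIF idea concrete, here is a from-scratch sketch for the special case of exactly two features. In that case the R² from regressing one feature on the other equals the squared Pearson correlation, so VIF = 1 / (1 - r²). The feature values and the wall_color_code column are hypothetical; with more than two features a full regression (or statsmodels' variance_inflation_factor) is needed instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def vif_two_features(xs, ys):
    """VIF of either feature when the model has only these two features."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

area_sqft = [500, 1000, 1500, 2000, 3000]
num_rooms = [1, 2, 3, 4, 5]          # hypothetical: closely tracks area
wall_color_code = [2, 1, 3, 1, 2]    # hypothetical: unrelated to area

vif_area_rooms = vif_two_features(area_sqft, num_rooms)
vif_area_color = vif_two_features(area_sqft, wall_color_code)
print(round(vif_area_rooms, 2), round(vif_area_color, 3))
```

The area/rooms pair gives a VIF of about 37, far above the threshold of 10, while the area/wall-colour pair stays very close to 1.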

5. RFE (Recursive Feature Elimination): How to Select Features?

There are 20 features. Are all of them important? No! Some features add noise and confuse the model.

How RFE works:

Step 1: Train a model on all 20 features

Step 2: Remove the feature with the lowest importance

Step 3: Train again on the remaining 19 features

Step 4: Remove the least important feature again

...

After 15 removals: only 5 features remain. These are the selected features!

Example: 10 features in house price prediction:

Start: area, rooms, floor, age, garden, parking, wall_color, gate_type, owner_age, pet_friendly

Round 1: Remove wall_color (least important)

Round 2: Remove gate_type

Round 3: Remove owner_age

...

Final: area, rooms, floor, parking, garden ← best 5!

In the notebook:

from sklearn.feature_selection import RFE

from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)

rfe.fit(X_train, y_train)

print(rfe.support_) # [True, True, False, True, False, True, True, ...]

print(rfe.ranking_) # [1, 1, 3, 1, 5, 1, 1, ...] — 1 = selected

Why is it important?

Fewer features = simpler model = less overfitting
Faster training
Easier to interpret ("price depends mainly on area, rooms, location")

6. SGDRegressor: How Gradient Descent Works in Code

Regular LinearRegression() solves the least-squares problem directly (a closed-form solution in the spirit of the Normal Equation: a formula gives the best answer in one go). Perfect for small datasets.

SGDRegressor uses Stochastic Gradient Descent. It learns step by step.

Difference:

LinearRegression = finding the answer with a calculator (instant, exact)

SGDRegressor = slowly closing in on the right answer by trial and error

How SGD works (step by step):

Data: experience → salary

[1, 2, 3, 4, 5] → [3, 5, 7, 9, 11]

Start: m=0, b=0 (initial guess)

Prediction for x=1: y = 0×1 + 0 = 0 (actual: 3, error = 3!)

Step 1: Update m and b slightly (learning_rate = 0.01)

m = 0 - 0.01 × (-6) = 0.06 (gradient of the squared error w.r.t. m = -2 × error × x = -6)

b = 0 - 0.01 × (-6) = 0.06 (gradient of the squared error w.r.t. b = -2 × error = -6)

Step 2: Pick RANDOM sample (say x=3)

Prediction: y = 0.06×3 + 0.06 = 0.24 (actual: 7, still far)

Update m, b again...

... 1000 steps later...

Final: m ≈ 2.0, b ≈ 1.0

Prediction for x=3: y = 2×3 + 1 = 7 ✓

"Stochastic" means that each step takes one random sample and updates from it (instead of using the full data). Each step is therefore cheaper, but the path to convergence zig-zags.

In the notebook:

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)

sgd.fit(X_train_scaled, y_train) # MUST use scaled data!
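
The hand-worked SGD steps above can be reproduced with a from-scratch loop. This is a sketch of single-sample SGD on the squared error, not what SGDRegressor does internally in every detail. The first step is forced to use x=1 to mirror the walk-through, and the step count and seed are arbitrary choices.

```python
import random

def sgd_fit(xs, ys, learning_rate=0.01, steps=20000, seed=42):
    """Fit y = m*x + b by updating on one sample per step."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    first_update = None
    for step in range(steps):
        # First step uses the sample x=1 (index 0); later steps pick a random sample.
        i = 0 if step == 0 else rng.randrange(len(xs))
        error = ys[i] - (m * xs[i] + b)      # actual - predicted
        grad_m = -2 * error * xs[i]          # d(error^2)/dm
        grad_b = -2 * error                  # d(error^2)/db
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
        if step == 0:
            first_update = (m, b)
    return m, b, first_update

experience = [1, 2, 3, 4, 5]
salary = [3, 5, 7, 9, 11]
sgd_m, sgd_b, first_update = sgd_fit(experience, salary)
print(first_update)                              # approximately (0.06, 0.06)
print(round(sgd_m, 3), round(sgd_b, 3))
```

Because this toy data lies exactly on salary = 2 × experience + 1, the loop ends up very close to m = 2 and b = 1. On noisy data a constant learning rate would keep the estimates jittering around the best values.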

Why is SGD important?

When the data is very large (1 million+ rows), solving the Normal Equation becomes slow and memory-hungry
Neural networks are all trained with variants of SGD
Online learning: the model can be updated continuously as new data arrives

7. GridSearchCV: Automated Hyperparameter Tuning

What is a hyperparameter?

A setting of the model that you must choose yourself; the model does not learn it on its own.

In Ridge regression: alpha = ? (regularization strength)

alpha = 0.01 → almost no regularization

alpha = 1.0 → moderate regularization

alpha = 100 → very strong regularization

Which is best? You have to try them and see!

What GridSearchCV does:

It tries every combination and finds the best one:

"The values I want to try:"

alpha: [0.01, 0.1, 1, 10, 100]

GridSearchCV process:

alpha=0.01 → Cross-validation score: 0.78

alpha=0.1 → Cross-validation score: 0.82

alpha=1 → Cross-validation score: 0.85 ← BEST!

alpha=10 → Cross-validation score: 0.83

alpha=100 → Cross-validation score: 0.70

Result: Best alpha = 1

The "CV" part: for each alpha it runs K-Fold Cross Validation. Instead of trusting a single train/test split, it looks at the average score over 5 splits.

In the notebook:

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Ridge

param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

grid_search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')

grid_search.fit(X_train, y_train)

print(grid_search.best_params_) # {'alpha': 1}

print(grid_search.best_score_) # 0.85

Real-life analogy: perfecting a biryani recipe. Salt: [1tsp, 1.5tsp, 2tsp], Chili: [2, 3, 4], Time: [30min, 45min, 60min]. Try every combination → taste test with the family → pick the best combination. GridSearchCV = automated taste testing!

8. PolynomialFeatures: How to Fit a Curve in Code

Problem:

Experience: 1 2 3 4 5 6 7 8 9 10

Salary: 20 25 32 42 55 62 66 68 69 70

When plotted, a straight line does not fit: salary rises fast at first, then slows down (diminishing returns). We need a curve!

What PolynomialFeatures does:

It creates new features from the original feature x:

Degree 2: x → [x, x²]

Degree 3: x → [x, x², x³]

Example (experience = 3):

Original: [3]

Degree 2: [3, 9] ← x=3, x²=9

Degree 3: [3, 9, 27] ← x=3, x²=9, x³=27

Now apply Linear Regression to these new features:

y = m1×x + m2×x² + b

Technically this is still "linear" regression (linear in the weights m1 and m2), but because it uses polynomial powers of x, it fits a curve!
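
The feature expansion itself is simple enough to sketch by hand. The helper below is a hypothetical stand-in for what PolynomialFeatures does with a single input value. Like scikit-learn's default (include_bias=True), it can prepend the constant 1 column.

```python
def polynomial_features(x, degree, include_bias=True):
    """Expand a single number x into [1, x, x^2, ..., x^degree] (1 is optional)."""
    start = 0 if include_bias else 1
    return [x ** power for power in range(start, degree + 1)]

print(polynomial_features(3, 2))                      # [1, 3, 9]
print(polynomial_features(3, 3, include_bias=False))  # [3, 9, 27]
```

A linear model fitted on these columns learns one weight per power of x, which is how a "linear" model ends up drawing a curve.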

In the notebook:

from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LinearRegression

# Step 1: Create polynomial features (degree 2)

poly = PolynomialFeatures(degree=2)

X_poly = poly.fit_transform(X)

# Original X = [[1], [2], [3]]

# X_poly = [[1, 1, 1], [1, 2, 4], [1, 3, 9]]

# (bias, x, x²)

# Step 2: Fit linear regression on polynomial features

model = LinearRegression()

model.fit(X_poly, y)  # fit on the polynomial features created in Step 1

Degree selection — overfitting risk:

Degree 1: Straight line → underfitting (too simple)

Degree 2: Gentle curve → usually good

Degree 3: More flexible → might be good

Degree 10: Crazy wiggles → OVERFITTING! (passes through every training point)

Real-life analogy: Drawing through dots.

Degree 1 = a straight line drawn with a ruler
Degree 2 = a smooth curve
Degree 10 = wriggles to touch every dot exactly, but will be completely wrong for new dots

How do you pick the right degree? GridSearchCV!

param_grid = {'polynomialfeatures__degree': [1, 2, 3, 4, 5]}

# Cross-validation finds the best degree. The key name assumes a pipeline built with make_pipeline(PolynomialFeatures(), LinearRegression()).

Quick Reference Summary

Concept	In one line	When to use?
EDA	Understand the data before modelling	Always first
StandardScaler	Put all features on the same scale	Gradient descent, Ridge, Lasso
Residual Analysis	Check that the model assumptions hold	After building the model
VIF	Find duplicate (redundant) features	During feature selection
RFE	Auto-select the best features	When there are many features
SGDRegressor	Step-by-step learning	Big data, online learning
GridSearchCV	Auto-find the best settings	For hyperparameter tuning
PolynomialFeatures	Fit a curve	When a straight line is not enough