Decision Tree Models | Part 2

Wichada Chaiprasertsud

4 min readFeb 5, 2025

Basic of tree, Random Forest, Gradient Boosting Tree & XGBoost, etc.

Previous:

Decision Tree Models | Part 1

Basic of tree, Random Forest, Gradient Boosting Tree & XGBoost, etc.

wasrek.medium.com

ในตอนที่แล้วเราได้พูดถึง Decision Tree ปกติไป คราวนี้เราจะมาพูดถึง Decision Tree ที่ถูกพัฒนาเพื่อแก้ไขปัญหาบางอย่าง

Instability Problem

https://slideplayer.com/slide/7029806/ Zhangxi Lin ISQS Texas Tech University Note

Decision Tree มีความ unstable สูงมาก ๆ การเปลี่ยนแปลงแค่นิดเดียว หรือเพียงแค่ข้อมูลเดียวเหมือนดังรูป อาจทำให้เราได้ Tree ที่มี structure เปลี่ยนไปโดยสิ้นเชิง

เราจึงจะแก้ไขปัญหานี้โดยการใช้ technique — ensemble model หรือ model ที่เป็น combination ของ model หลาย ๆ อัน ซึ่งก็มีหลายวิธี เช่น

Bagging ซึ่งเป็นการ train model หลาย ๆ ตัวแยกกันแบบ parallel แล้วเอาผลมา combine
Boosting ซึ่งเป็นการ train model หลาย ๆ ตัวต่อกัน แบบ sequentially แล้วให้ model ตัวหลังมาแก้ error ของ model ตัวก่อนหน้า

กลายเป็น model ที่ถูกพัฒนาต่อมา ไม่ว่าจะเป็น

Bagging (Bootstrap Aggregation)

ปัจจุบันไม่ค่อยนิยมแล้ว เพราะคนไปใช้ Random Forest ซึ่งเป็นเวอร์ชันที่ถูกพัฒนาจาก Bagging

https://www.analyticsvidhya.com/blog/2023/01/ensemble-learning-methods-bagging-boosting-and-stacking/

Take Majority Vote(Average) of Prediction

Random Forest Algorithm (RF)

enhanced Bagging, Bagging + random feature selection for each split. Improve robustness and reduce overfitting

machine learning — What is the difference between bagging and random forest if only one explanatory variable is used? — Cross Validated

Many trees → Forest!

Forest takes bootstrap samples of the rows in training data (sampling with replacement; after selected we put it back to the pool, so it can be selected again) then at each step, a set of variables (columns) is sampled.

โดยค่าของ max features หรือ max samples รวมถึง จำนวนต้นไม้ใน RF เราสามารถกำหนดได้

Pros: คำนวณง่าย คำนวณได้แบบ parallel เป็น strong baseline

Cons: ด้วยความที่ model complex ขึ้น เราอาจจะตีความผลจากต้นไม้ได้ยาก เพราะผลมาจากต้นไม้หลายต้นเฉลี่ยกัน

Boosting (AdaBoost — Adaptive Boosting)

ปัจจุบันไม่ค่อยนิยมแล้ว เพราะคนไปใช้ Gradient Boosting ซึ่งเป็นเวอร์ชันที่ถูกพัฒนาจาก Boosting

https://www.almabetter.com/bytes/tutorials/data-science/adaboost-algorithm

https://medium.com/towards-data-science/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f

Focuses on re-weighting misclassified samples.
Each weak learner is trained sequentially, with more weight given to incorrectly classified points.
The final model is a weighted sum of weak learners. Models that perform better get higher weight.

สังเกตได้ว่ากระบวนการแบบนี้จะทำให้ Outlier มีผลกับ model มาก ๆ เพราะจะได้ weight ที่เพิ่มขึ้น

Gradient Boosting Tree

ใช้หลักการ Boosting เหมือนกัน แต่เป้าหมายของต้นไม้ตัวถัดไปคือ การทำนาย Error ไม่ใช่ Target!

prediction result = summation of all trees (not average as RF any more)

Initial Prediction
ค่า Initial Prediction มักจะเป็นค่าจากโมเดลง่าย ๆ เช่น:

ค่าเฉลี่ย (Mean) ของ yy สำหรับปัญหาการพยากรณ์แบบ Regression
ค่า Log-Odds ของ Class Prior สำหรับ Classification
ตัวอย่างเช่น ในปัญหาการทำนายราคา ถ้าค่าเฉลี่ยของราคาบ้านคือ 3 ล้าน บาท ค่าเริ่มต้นของโมเดลก็จะเป็น 3 ล้านบาททุกตัวอย่าง

Prediction

Residuals/Error เป็นค่าที่โมเดลถัดไปจะเรียนรู้
เช่นค่า +0.95 หรือ -0.44 ในรูป
เมื่อโมเดลแรกทำนายออกมา ค่า Residual (ri = yi — f0(xi)) จะกลายเป็น target ใหม่ ที่โมเดลถัดไปต้องเรียนรู้
ตัวอย่าง:
ถ้าเราทำนายราคาบ้านได้ 2.5 ล้าน แต่ความจริงคือ 3 ล้าน, Residual คือ +0.5 ล้าน
โมเดลต้นไม้ต้นต่อไปจะพยายามพยากรณ์ +0.5 ล้าน (ค่าความผิดพลาดที่เหลืออยู่)

การใช้ Gradient Descent Learning Rate (η) เพื่อป้องกัน Overfitting

การเพิ่มผลรวมของต้นไม้แต่ละต้นไม่ได้เกิดขึ้นแบบ 100% แต่ถูกปรับด้วยค่าที่เรียกว่า Learning Rate (η) ของ GD
ค่า η ควบคุมว่าโมเดลต้นไม้ใหม่จะมีอิทธิพลต่อผลลัพธ์สุดท้ายแค่ไหน
ค่า η ต่ำ (เช่น 0.1) → โมเดลเรียนช้าลง แต่ลด Overfitting
ค่า η สูง (เช่น 1.0) → โมเดลอัปเดตเร็วขึ้น แต่เสี่ยง Overfitting

Model สุดท้ายเป็นการรวมค่าทั้งหมด (Ensemble Model)

ผลลัพธ์สุดท้ายไม่ได้มาจากต้นไม้ต้นเดียว แต่เป็น การรวมกันของต้นไม้หลายต้น
สมการของโมเดลสุดท้ายคือ

ซึ่งหมายความว่าแต่ละต้นไม้ช่วยกันลดข้อผิดพลาดทีละน้อย ทำให้ผลลัพธ์สุดท้ายแม่นยำขึ้น

เมื่อเราได้ Error แล้ว สังเกตได้ว่า Error แต่ละค่าจะมีค่า weight ที่คูณอยู่ ซึ่งนั่นคือการที่เราจะหา weight ที่ดีที่สุดโดยการใช้ GD Gradient Descent

“Optimize weight of leaf nodes by gradient descent Original Tree”

ทำให้บน model สามารถปรับ learning rate สำหรับ GD ได้ด้วยนั่นเอง

eXtreme Gradient Boosting Tree (XGBoost)

Model ที่ให้ผลใกล้เคียงกับ XGBoost แต่เร็วกว่ามาก

เป็นการปรับปรุง GBM โดยเพิ่ม

Regularization (มี L1 (Lasso) และ L2 (Ridge) ทำให้ลด Overfitting ได้ดีขึ้น)
Parallelization มีการคำนวณแบบ Parallel และ Optimized Memory Usage
Pruning ใช้วิธี “Loss-based Pruning”
จัดการ Missing Values ได้อัตโนมัติ และอื่น ๆ

รวมถึงหนึ่งใน key สำคัญคือการใช้ Approximate Tree Learning

Approximate Tree Learning

ปัญหาของ Gradient Boosting ปกติคือ การแบ่งโหนด (Splitting) ของต้นไม้ต้อง Sort ค่า Feature ทุกครั้ง → ทำให้ใช้เวลานานมาก ส่งผลให้

การหาจุดแบ่ง (Best Split) เป็นกระบวนการที่มีค่าใช้จ่ายสูง (Computational Cost สูง)

XGBoost แก้ปัญหานี้ด้วย Approximate Splitting

แทนที่จะ Sort ทุกค่า ในแต่ละรอบ XGBoost ใช้วิธี Approximate Algorithm ที่เป็น Greedy function
โดยการ แบ่ง Feature เป็น Quantile Sketch (Bucket-based) → ลดการคำนวณ
ทำให้การตัดสินใจเลือก Split Point เร็วขึ้น อาจมีการเลือก Split Point ที่ไม่แม่นยำที่สุดในบางกรณี แต่โดยรวมแล้วยังได้ผลลัพธ์ที่ดีและเร็วขึ้นมาก

*XGBoost package ไม่อยู่ใน sklearn