Python - Machine Learning: Random Forest for Predicting Time to Harvest
Summary of the complete steps:
- Load the harvest data from a CSV file.
- Separate the features from the target.
- Train a Random Forest Regressor model.
- Evaluate model performance (R², RMSE).
- Visualize:
  - Feature importance
  - Predicted vs. actual values
  - Error distribution
  - Learning curve
- Save the model and the prediction results.
Step 1
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import mean_squared_error, r2_score
import joblib
pandas, numpy → data processing.
matplotlib.pyplot, seaborn → plotting.
RandomForestRegressor → the machine learning algorithm (the prediction model).
train_test_split → splits the data into training and test sets.
learning_curve → computes training and validation scores at increasing training set sizes.
mean_squared_error, r2_score → metrics for evaluating model performance.
joblib → saves and loads the model (serialization).
Step 2
Load and Prepare the Data
data = pd.read_csv('data_panen_500.csv')
X = data.drop('jumlah_hari_panen', axis=1)
y = data['jumlah_hari_panen']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pd.read_csv() reads the CSV file containing the harvest data.
X = the input features (for example: field area, rainfall, fertilizer, etc.).
y = the target/output to predict → jumlah_hari_panen (number of days until harvest).
The data is split as follows (random_state=42 makes the split reproducible):
- 80% for training
- 20% for testing
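The split described above can be tried without the original CSV. The following is a minimal sketch, assuming a synthetic stand-in table: the feature names (luas_lahan, curah_hujan, pupuk) and the formula that generates the target are hypothetical and only illustrate the shape of the data. It also adds two quick sanity checks (missing values, non-numeric columns) that are not part of the original steps.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data_panen_500.csv.
# Feature names and the target formula are hypothetical.
rng = np.random.default_rng(42)
n_rows = 500
data = pd.DataFrame({
    "luas_lahan": rng.uniform(0.5, 5.0, n_rows),
    "curah_hujan": rng.uniform(50, 300, n_rows),
    "pupuk": rng.uniform(10, 100, n_rows),
})
data["jumlah_hari_panen"] = (
    120
    - 0.05 * data["curah_hujan"]
    - 0.1 * data["pupuk"]
    + rng.normal(0, 2, n_rows)
)

# Sanity checks before splitting: count missing values per column
# and list any columns that are not numeric.
missing_per_column = data.isna().sum()
non_numeric = data.select_dtypes(exclude="number").columns.tolist()

# Same split as in the tutorial: 80% training, 20% testing.
X = data.drop("jumlah_hari_panen", axis=1)
y = data["jumlah_hari_panen"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Missing values:", int(missing_per_column.sum()))
print("Non-numeric columns:", non_numeric)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
```

If the real file contains text columns or missing values, they would need to be encoded or filled before training, since RandomForestRegressor expects numeric input.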
Step 3
Training the Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, 'model_panen_rf_analisis.pkl')
RandomForestRegressor → an ensemble model built from decision trees.
n_estimators=100 → the number of trees in the forest (more trees give more stable results, at the cost of longer training).
model.fit() → trains the model on the training data.
joblib.dump() → saves the model so it can be reused later without retraining.
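To illustrate the "reuse without retraining" point, here is a minimal sketch of a save/load round trip with joblib.load. It assumes synthetic data and a temporary file name (model_demo.pkl) chosen only for this example. It is not the tutorial's actual model file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to have something to train on.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 100 + 20 * X_demo[:, 0] + rng.normal(0, 1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Save the fitted model, then load it back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_demo.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
new_samples = rng.uniform(0, 1, size=(5, 3))
original_pred = model.predict(new_samples)
reloaded_pred = loaded_model.predict(new_samples)
print("Predictions match:", np.allclose(original_pred, reloaded_pred))
```

Two practical notes: a saved model is generally meant to be loaded with the same scikit-learn version that produced it, and because the file is a pickle, it should only be loaded from a trusted source.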
Step 4
Basic Evaluation
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
y_pred → the model's predictions.
MSE (Mean Squared Error) → the mean of the squared errors.
RMSE → the square root of the MSE; it measures how far the predictions are from the actual values, in the same unit as the target (days).
R² Score → how well the model explains the variance in the data (the closer to 1, the better).
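The definitions above can be checked by hand. The sketch below computes MSE, RMSE, and R² with plain NumPy on four made-up values (the numbers are illustrative, not taken from the dataset), using R² = 1 - SS_res / SS_tot.

```python
import numpy as np

# Made-up actual and predicted values (days until harvest).
y_true = np.array([90.0, 100.0, 110.0, 120.0])
y_hat = np.array([92.0, 98.0, 113.0, 119.0])

errors = y_true - y_hat

# MSE: mean of the squared errors.
mse = np.mean(errors ** 2)

# RMSE: square root of the MSE, in the same unit as the target.
rmse = np.sqrt(mse)

# R²: 1 minus the ratio of residual variation to total variation.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²:   {r2:.4f}")
```

These are the same quantities that mean_squared_error and r2_score return, so the manual version is mainly useful for understanding what the numbers mean.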
Step 5: Feature Importance Visualization
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
features = X.columns
plt.figure(figsize=(10,6))
sns.barplot(x=importances[indices], y=features[indices], palette='viridis')
plt.title('Feature Importance Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
This shows which features have the most influence on the predictions:
- model.feature_importances_ → the importance value of each feature.
- The features are sorted from most to least important.
- The visualization is a bar chart (seaborn.barplot).
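The same ranking can be inspected as a table instead of a chart. This is a minimal sketch on synthetic data, where the column names (strong_feature, weak_feature, noise_feature) and the target formula are hypothetical and chosen so that one feature clearly dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on one feature,
# weakly on a second, and not at all on the third.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "strong_feature": rng.uniform(0, 1, 300),
    "weak_feature": rng.uniform(0, 1, 300),
    "noise_feature": rng.uniform(0, 1, 300),
})
y_demo = (
    50 * X_demo["strong_feature"]
    + 5 * X_demo["weak_feature"]
    + rng.normal(0, 1, 300)
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Pair each importance with its column name and sort descending.
importance_table = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_table)
print("Sum of importances:", importance_table.sum())
```

The importances are normalized to sum to 1, so each value can be read as a share of the total. They are impurity-based, which means they describe how the trees used each feature during training rather than a causal effect on harvest time.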
Step 6: Predicted vs. Actual Plot
plt.figure(figsize=(6,6))
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Nilai Aktual")
plt.ylabel("Prediksi Model")
plt.title("Prediksi vs Aktual (Random Forest)")
plt.tight_layout()
plt.show()
This plot compares the predictions with the actual values.
The red dashed line ('r--') marks the ideal case: prediction = actual.
The closer the points lie to the red line, the more accurate the model.
Step 7: Error Distribution (Residuals)
residuals = y_test - y_pred
plt.figure(figsize=(8,5))
sns.histplot(residuals, bins=20, kde=True, color='skyblue')
plt.title("Distribusi Error (Residuals)")
plt.xlabel("Error (y_test - y_pred)")
plt.tight_layout()
plt.show()
Residuals = the difference between the actual and predicted values.
The residual histogram helps detect:
- Whether the errors are centered on zero and roughly symmetric (a sign of a good model).
- Whether there is bias (errors skewed in one direction).
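The two checks above can also be expressed as numbers alongside the histogram. The sketch below uses four made-up values (not from the dataset). The variable names mean_residual and share_overpredicted are introduced here only for illustration.

```python
import numpy as np

# Made-up actual and predicted values (days until harvest).
y_true = np.array([90.0, 100.0, 110.0, 120.0])
y_hat = np.array([92.0, 98.0, 113.0, 119.0])

# Same sign convention as the tutorial: actual minus predicted.
residuals = y_true - y_hat

# Mean residual: close to 0 means no systematic bias.
# Negative means the model tends to predict too many days.
mean_residual = residuals.mean()

# Spread of the errors around their mean (sample standard deviation).
std_residual = residuals.std(ddof=1)

# Fraction of samples where the model over-predicted (residual < 0).
share_overpredicted = np.mean(residuals < 0)

print(f"Mean residual:       {mean_residual:.3f}")
print(f"Std of residuals:    {std_residual:.3f}")
print(f"Share over-predicted: {share_overpredicted:.2f}")
```

A mean residual far from zero, or a share far from 0.5, would suggest the kind of one-sided bias the histogram is meant to reveal.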
Step 8: Learning Curve
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, scoring='r2', n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10)
)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
plt.figure(figsize=(8,5))
plt.plot(train_sizes, train_mean, 'o-', label="Training score")
plt.plot(train_sizes, test_mean, 'o-', label="Validation score")
plt.xlabel("Training Set Size")
plt.ylabel("R² Score")
plt.title("Learning Curve - Random Forest Regressor")
plt.legend()
plt.tight_layout()
plt.show()
The learning curve is used to see:
- Whether the model is underfitting or overfitting.
- Whether the training and validation scores converge as the training set grows, as they should.
If both curves converge at a high score → the model is good enough. A large gap that persists suggests overfitting, while two low curves suggest underfitting.
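The gap between the two curves can be read off numerically as well. This is a minimal sketch on synthetic data (the data and its generating formula are assumptions). It differs from the tutorial's call in two ways that are choices of this example: it passes a shuffled KFold instead of cv=5, which can matter when the rows of a file are stored in some sorted order, and it uses fewer trees and fewer sizes to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, learning_curve

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = (
    100
    + 30 * X_demo[:, 0]
    - 10 * X_demo[:, 1]
    + rng.normal(0, 2, 300)
)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# Shuffled 5-fold cross-validation (an assumption of this sketch).
cv = KFold(n_splits=5, shuffle=True, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    model, X_demo, y_demo, cv=cv, scoring="r2",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# Average over the folds, then compute the train/validation gap.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

for size, tr, va, g in zip(train_sizes, train_mean, val_mean, gap):
    print(f"n={int(size):4d}  train={tr:.3f}  val={va:.3f}  gap={g:.3f}")
```

A gap that shrinks as n grows matches the "curves converge" case described above. A gap that stays wide points toward overfitting.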
Step 9: Evaluation Results Table
eval_df = pd.DataFrame({
    'Metric': ['R²', 'RMSE'],
    'Value': [r2, rmse]
})
print("\n=== Ringkasan Evaluasi Model ===")
print(eval_df)
This builds a simple table of the evaluation results so they are easy to read.
Step 10: Save the Prediction Results
hasil = pd.DataFrame({'Aktual': y_test, 'Prediksi': y_pred, 'Error': residuals})
hasil.to_csv('hasil_prediksi_random_forest.csv', index=False)
print("\nFile hasil_prediksi_random_forest.csv berhasil disimpan.")
This combines the actual values, the predictions, and the errors into one table,
then saves it to a CSV file for further analysis (for example in Excel).
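As one example of the "further analysis" mentioned above, the saved file can be read back with pandas to find the worst predictions. The sketch below assumes a tiny in-memory CSV with the same three columns (Aktual, Prediksi, Error) and made-up rows. The AbsError column is added here for illustration and is not part of the saved file.

```python
import io

import pandas as pd

# Made-up rows in the same layout as hasil_prediksi_random_forest.csv.
csv_text = """Aktual,Prediksi,Error
90,92,-2
100,98,2
110,113,-3
120,119,1
"""

hasil = pd.read_csv(io.StringIO(csv_text))

# Absolute error makes over- and under-predictions comparable.
hasil["AbsError"] = hasil["Error"].abs()

# The two rows with the largest absolute error.
worst = hasil.nlargest(2, "AbsError")

print(worst)
print("Mean absolute error:", hasil["AbsError"].mean())
```

With the real file, io.StringIO(csv_text) would be replaced by the file name passed to to_csv.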
FULL CODE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import mean_squared_error, r2_score
import joblib
# ------------------------------------------------------
# 1. Load data
# ------------------------------------------------------
data = pd.read_csv('data_panen_500.csv')
# Separate features and label
X = data.drop('jumlah_hari_panen', axis=1)
y = data['jumlah_hari_panen']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ------------------------------------------------------
# 2. Training Model
# ------------------------------------------------------
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Save the model
joblib.dump(model, 'model_panen_rf_analisis.pkl')
# ------------------------------------------------------
# 3. Basic Evaluation
# ------------------------------------------------------
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
# ------------------------------------------------------
# 4. Feature Importance Visualization
# ------------------------------------------------------
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
features = X.columns
plt.figure(figsize=(10,6))
sns.barplot(x=importances[indices], y=features[indices], palette='viridis')
plt.title('Feature Importance Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
# ------------------------------------------------------
# 5. Predicted vs. Actual Plot
# ------------------------------------------------------
plt.figure(figsize=(6,6))
sns.scatterplot(x=y_test, y=y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Nilai Aktual")
plt.ylabel("Prediksi Model")
plt.title("Prediksi vs Aktual (Random Forest)")
plt.tight_layout()
plt.show()
# ------------------------------------------------------
# 6. Error Distribution (Residuals)
# ------------------------------------------------------
residuals = y_test - y_pred
plt.figure(figsize=(8,5))
sns.histplot(residuals, bins=20, kde=True, color='skyblue')
plt.title("Distribusi Error (Residuals)")
plt.xlabel("Error (y_test - y_pred)")
plt.tight_layout()
plt.show()
# ------------------------------------------------------
# 7. Learning Curve
# ------------------------------------------------------
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, scoring='r2', n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10)
)
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
plt.figure(figsize=(8,5))
plt.plot(train_sizes, train_mean, 'o-', label="Training score")
plt.plot(train_sizes, test_mean, 'o-', label="Validation score")
plt.xlabel("Training Set Size")
plt.ylabel("R² Score")
plt.title("Learning Curve - Random Forest Regressor")
plt.legend()
plt.tight_layout()
plt.show()
# ------------------------------------------------------
# 8. Evaluation Results Table
# ------------------------------------------------------
eval_df = pd.DataFrame({
    'Metric': ['R²', 'RMSE'],
    'Value': [r2, rmse]
})
print("\n=== Ringkasan Evaluasi Model ===")
print(eval_df)
# ------------------------------------------------------
# (Optional) Save the prediction results report
# ------------------------------------------------------
hasil = pd.DataFrame({'Aktual': y_test, 'Prediksi': y_pred, 'Error': residuals})
hasil.to_csv('hasil_prediksi_random_forest.csv', index=False)
print("\nFile hasil_prediksi_random_forest.csv berhasil disimpan.")