After months of optimizing ML models to fit in 50MB on budget phones, I wanted a weekend project with zero constraints. Dota 2 match prediction: given the 10 hero picks in a match, predict which team wins. Simple problem, fun dataset, and a good excuse to play more Dota.
The Dataset and Features
I pulled 50,000 ranked matches from the OpenDota API. Each match has 5 heroes per team, drawn from a pool of 113 heroes, plus a binary outcome. The simplest feature representation: one binary pick vector for Radiant and one for Dire, each sized 114 so a hero's ID (they start at 1) can index the vector directly.
```python
import numpy as np
import requests

def fetch_matches(n=1000):
    matches = []
    last_id = None
    for _ in range(n // 100):  # publicMatches returns ~100 matches per call
        # Paginate backwards with less_than_match_id so calls don't overlap
        params = {'less_than_match_id': last_id} if last_id else {}
        resp = requests.get('https://api.opendota.com/api/publicMatches', params=params)
        batch = resp.json()
        matches.extend(batch)
        last_id = min(m['match_id'] for m in batch)
    return matches

def match_to_features(match, num_heroes=114):
    # 114 slots per team so hero IDs (1..113) can index directly
    features = np.zeros(num_heroes * 2)
    for player in match['players']:
        hero_id = player['hero_id']
        if player['player_slot'] < 128:  # Radiant
            features[hero_id] = 1
        else:  # Dire
            features[num_heroes + hero_id] = 1
    return features, 1 if match['radiant_win'] else 0
```
Logistic Regression Gets You to 65%
No neural networks needed. Logistic regression on hero pick features alone hit 58% accuracy. Adding hero-pair interaction features (did a given pair of heroes appear together in the match? — this captures both same-team synergies and cross-team matchups) pushed it to 65%.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Convert each match once, then split features and labels
features_labels = [match_to_features(m) for m in matches]
X = np.array([f for f, _ in features_labels])
y = np.array([label for _, label in features_labels])

# Basic hero picks
model = LogisticRegression(C=0.1, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Hero picks only: {scores.mean():.3f}")  # ~0.58

# Add pairwise pick interactions (note: this generates ALL hero pairs,
# cross-team matchups as well as same-team synergies)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly.fit_transform(X)
scores2 = cross_val_score(model, X_interact, y, cv=5, scoring='accuracy')
print(f"With interactions: {scores2.mean():.3f}")  # ~0.65
```
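One practical note: with 228 binary input features, the degree-2 interaction expansion produces roughly 26,000 columns, and a dense 50,000-row matrix of those gets heavy. A sketch (assuming scipy is installed, and with a small random stand-in for the real pick matrix) of running the same expansion on a sparse matrix, which `PolynomialFeatures` supports for interaction-only degree 2:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for the real pick matrix: 100 matches x 228 binary features
rng = np.random.default_rng(0)
X_dense = (rng.random((100, 228)) < 0.02).astype(np.float64)

# CSR input keeps both the input and the expanded output sparse
X_sparse = sparse.csr_matrix(X_dense)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly.fit_transform(X_sparse)

print(X_interact.shape)  # (100, 26106): 228 singles + 228*227/2 pairs
```

Since the pick vectors have exactly 10 nonzero entries each, the sparse representation stays tiny even after expansion.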
65% might sound low, but Dota has enormous variance. Pro analysts estimate that draft advantage accounts for about 10-15% of win probability. The rest is player skill, execution, and itemization. A model that only sees hero picks is fundamentally capped.
Feature Engineering > Model Complexity
I tried a random forest, gradient boosting, and a small neural network. None beat logistic regression by more than 1-2%. The gains came from better features, not better models.
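The comparison loop was nothing fancy; a sketch of what it looked like, with random stand-in data (in the real project `X` and `y` came from `match_to_features`, and the exact hyperparameters here are my guesses, not a record of what I ran):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in data, shaped like the real pick matrix
rng = np.random.default_rng(0)
X = (rng.random((150, 228)) < 0.02).astype(float)
y = rng.integers(0, 2, size=150)

models = {
    'logistic regression': LogisticRegression(C=0.1, max_iter=1000),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'gradient boosting': GradientBoostingClassifier(random_state=0),
    'small neural net': MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=3, scoring='accuracy').mean()
           for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```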
Adding average hero win rate as a feature helped. Adding a “team synergy” score based on historical win rates of hero pairs helped more. The model didn’t need to learn that Anti-Mage is good from raw match data if I told it directly.
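The synergy score needs a table of historical win rates for same-team hero pairs. A sketch of how such a table might be tallied from the same match dumps (my actual bookkeeping may have differed; the demo matches are hand-built, hypothetical data):

```python
from collections import defaultdict

def build_pair_winrates(matches):
    """Win rate of every same-team hero pair across the match list."""
    wins, games = defaultdict(int), defaultdict(int)
    for m in matches:
        radiant, dire = [], []
        for p in m['players']:
            (radiant if p['player_slot'] < 128 else dire).append(p['hero_id'])
        for team, won in ((radiant, m['radiant_win']), (dire, not m['radiant_win'])):
            for i in range(len(team)):
                for j in range(i + 1, len(team)):
                    pair = tuple(sorted((team[i], team[j])))
                    games[pair] += 1
                    wins[pair] += int(won)
    return {pair: wins[pair] / games[pair] for pair in games}

# Tiny hand-built example: heroes 1+2 win one match together, lose another
demo = [
    {'players': [{'hero_id': 1, 'player_slot': 0}, {'hero_id': 2, 'player_slot': 1},
                 {'hero_id': 3, 'player_slot': 128}, {'hero_id': 4, 'player_slot': 129}],
     'radiant_win': True},
    {'players': [{'hero_id': 1, 'player_slot': 0}, {'hero_id': 2, 'player_slot': 1},
                 {'hero_id': 3, 'player_slot': 128}, {'hero_id': 4, 'player_slot': 129}],
     'radiant_win': False},
]
pair_wr = build_pair_winrates(demo)
print(pair_wr[(1, 2)])  # 0.5
```

In practice rare pairs need smoothing toward 0.5, or a handful of lucky games dominates the table.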
```python
def add_winrate_features(X, hero_winrates, num_heroes=114):
    # Average win rate of each team's five picks, appended as two new columns
    radiant_avg_wr = np.array([
        np.mean([hero_winrates[h] for h in range(num_heroes) if row[h] == 1])
        for row in X
    ])
    dire_avg_wr = np.array([
        np.mean([hero_winrates[h] for h in range(num_heroes) if row[num_heroes + h] == 1])
        for row in X
    ])
    return np.column_stack([X, radiant_avg_wr, dire_avg_wr])
```
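That function assumes a `hero_winrates` lookup indexed by hero ID. A sketch of computing it from the same match dumps (the function name and fallback-to-0.5 choice are mine, illustrated on a tiny hand-built match):

```python
import numpy as np

def compute_hero_winrates(matches, num_heroes=114):
    """Per-hero win rate across the match list; 0.5 for heroes never seen."""
    wins = np.zeros(num_heroes)
    games = np.zeros(num_heroes)
    for m in matches:
        for p in m['players']:
            on_radiant = p['player_slot'] < 128
            games[p['hero_id']] += 1
            wins[p['hero_id']] += int(m['radiant_win'] == on_radiant)
    # Avoid division by zero for hero IDs that never appear
    return np.where(games > 0, wins / np.maximum(games, 1), 0.5)

# Tiny hand-built example (hypothetical data)
demo = [{'players': [{'hero_id': 1, 'player_slot': 0},
                     {'hero_id': 2, 'player_slot': 128}],
         'radiant_win': True}]
wr = compute_hero_winrates(demo)
print(wr[1], wr[2])  # 1.0 0.0
```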
What I Learned
This was a good reminder that feature engineering matters more than model complexity. At work we spent weeks squeezing 1% of accuracy out of model architectures. Here, 30 minutes of feature engineering bought 7 percentage points over the baseline.
The other lesson: know your ceiling. No model will predict Dota matches at 90% accuracy from draft alone. Understanding the theoretical limit saves you from chasing diminishing returns. Same principle applies to production ML. If your training data has 5% label noise, don’t expect 99% accuracy.
Fun project. Took a Saturday afternoon. Would recommend as a first ML project for anyone who plays Dota.