By
Guilherme Klink (@guiklink)
Sherif Mostafa (@sherifm)
Note: This webpage provides a quick overview of this project and how to use it. For a detailed description of its development, please download the full project report PDF.
Project Overview
This project uses a machine learning approach to forecast tennis match results (men's and women's) based on previous match statistics. Given two players and one of the four major tennis tournaments (Australian Open, French Open (Roland Garros), Wimbledon, and the US Open), the provided application (user manual attached in the appendix) returns the player who is more likely to win the match. In contrast to other sports such as baseball, publicly available statistical data for tennis is scarce, which suggests that little data analytics has been applied to the sport. Most tennis forecasts, such as betting odds, are derived purely from player rankings and win/loss counts. The novelty of this project's approach is that it does not depend on the official ATP (Association of Tennis Professionals) or WTA (Women’s Tennis Association) player rankings, or on how many matches a player has won or lost, but rather on how they played. We look instead at more objective performance data (e.g. net points, unforced errors, serving percentages, serving direction, points won on the second serve).
A fundamental component of this project is determining which features should be passed on to the learning algorithm used to train the forecaster. The initial data, scraped from various online sources, provided up to 1000 features. The most relevant features were extracted using correlation techniques. A decision tree learning algorithm was chosen for this project because of its interpretability. The features used included both numerical and categorical (nominal) types.
Most of the data was crowd-charted, meaning that it contained many inconsistencies and required intensive preprocessing. Once the formatting was consistent, data from multiple sources was cross-checked to reduce the number of errors. Next, the correlation of each attribute with the class label (the match outcome) was determined. The attributes that exceeded a certain correlation threshold were selected and then re-examined using the authors' expertise in tennis to avoid post hoc fallacies (also known as post hoc ergo propter hoc). The learned hypothesis was tested on the major tournament under way during the development of this project (the 2015 French Open), and the forecaster’s success was measured on that tournament.
When it was tested on the 2015 Roland Garros tournament, the forecaster demonstrated a prediction accuracy of 89%.
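The correlation-threshold selection step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the use of Pearson correlation, the threshold of 0.5, the feature names, and the toy values are all assumptions made for the example (the report PDF describes the real procedure).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, outcome, threshold):
    """Keep features whose absolute correlation with the outcome
    reaches the threshold (assumed selection rule)."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, outcome)) >= threshold
    ]

# Made-up toy data: one row per match, outcome 1 = win, 0 = loss.
columns = {
    "first_serve_pct": [70, 65, 55, 50, 72, 48],
    "unforced_errors": [12, 15, 30, 28, 10, 33],
    "court_number": [3, 1, 2, 3, 1, 2],
}
outcome = [1, 1, 0, 0, 1, 0]

print(select_features(columns, outcome, threshold=0.5))
```

A feature that passes such a threshold would still need the manual review mentioned above, since a high correlation alone does not show that the attribute actually influences the result.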
Once again, if you are interested in more detail on the machine learning approach, please download the full project report PDF.
How to Run the App
Running this app requires a working installation of Python as well as scikit-learn. If you don't have sqlite yet (it usually comes with the default Python installation), make sure to install it too.
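One quick way to check these prerequisites is a short script like the one below. It is a sketch that assumes scikit-learn is installed under its usual import name `sklearn` and that sqlite is accessed through Python's `sqlite3` module; the helper name `missing_modules` is made up for this example.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed import names for the prerequisites listed above.
required = ["sklearn", "sqlite3"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```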
Running the prediction app from source is very simple. First, clone our repository if you have not done that yet:
$ cd the_directory_to_clone
$ git clone https://github.com/guiklink/Tennis_Predictor_App.git
$ cd Tennis_Predictor_App
$ git checkout master
Alternatively, if you don't have git, you can download the repository by clicking Download ZIP at the bottom left of this page. This might take a few minutes, since the app needs our database and trees in order to forecast.
Now go into the Python folder and launch the GUI file tennis_predict_GUI.py:
$ cd Python
$ python tennis_predict_GUI.py
The App Interface
- Enter the name of player 1*.
- Enter the name of player 2*.
- Predict Button: Click to predict the winner of the match.
- Select the tournament for the prediction.
- Prediction status label: This field lets you know if there is a problem with the prediction.
- Returns the prediction.
- Use this box to write the path in which your output will be saved**.
- Export Player Data button: Export the data that was used for each player to create the test instance as a CSV.
- Export Tree to PDF button: Export a graph representation of the learned tree that was used for the prediction as a .pdf file.
- Status of Exporting: This field notifies you about the export progress.
** The last part of the path will be your file name. Do not append the file extension (e.g. .pdf or .csv); the app handles this internally. A tree example can be downloaded here. An example of a player's data can be viewed here.
Tree Graphical Representation
One of the app's features is exporting a graphical representation of the tree. To do so, write a path in the path text box and press the Export Tree to PDF button (see The App Interface section).
Tree sample.
Understanding a Leaf Node
- Gini is the Gini impurity of the node, a measure of how mixed its classes are (similar in role to entropy)
- How many samples of the training data reached this leaf
- What percentage of the samples were uniquely classified
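To make the Gini value concrete: Gini impurity is 1 minus the sum of the squared class proportions in a node, so a leaf containing a single class scores 0 and an evenly mixed two-class leaf scores 0.5. The sketch below computes it for two made-up leaves; the labels `player_1` and `player_2` are placeholders, not the class names used by the app.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Hypothetical leaves: one pure, one evenly split between two classes.
pure_leaf = ["player_1"] * 10
mixed_leaf = ["player_1"] * 5 + ["player_2"] * 5

print(gini_impurity(pure_leaf))   # 0.0
print(gini_impurity(mixed_leaf))  # 0.5
```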
Understanding a Parent Node
- X[18] is the attribute on which this node splits, and <=19.0000 is the threshold value. A list of all attribute indexes is provided here.
- Gini is the impurity of the node; “entropy” refers to the alternative information-gain criterion
- How many samples of the training data reached this node
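As a small illustration of how a split such as X[18] <= 19.0000 is read: in trees exported by scikit-learn, a sample that satisfies the condition follows the left (True) branch, and any other sample follows the right branch. The function name and the sample values below are made up for the example; only the index 18 and the threshold 19.0 come from the node described above.

```python
def goes_left(sample, feature_index=18, threshold=19.0):
    """Return True if the sample satisfies `X[feature_index] <= threshold`,
    i.e. it follows the left (True) branch of the node."""
    return sample[feature_index] <= threshold

# Hypothetical test instances with 19 attributes each (indexes 0 to 18).
sample_a = [0.0] * 18 + [12.0]   # attribute 18 is 12.0
sample_b = [0.0] * 18 + [25.0]   # attribute 18 is 25.0

print(goes_left(sample_a))  # True: 12.0 <= 19.0, follows the left branch
print(goes_left(sample_b))  # False: 25.0 > 19.0, follows the right branch
```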
Raw Data
The repository for this website contains only the compiled data and the final versions of the Tennis Predictor App. The whole development sequence, as well as all the raw data, can be found in this repository.