[CDAF] Exercise 1¶

Name¶

Name: XXXXXXXXXXXXXXXX

References¶

Introduction¶

In this activity, we will review the concepts learned in the classroom about randomness and forecasting, working on the Soccer Prediction Challenge[1] dataset[2].

Question 1¶

Load the 'TrainingSet_2023_02_08' dataset.
Create a histogram for the number of goals scored per game by the home team, the away team, total goals, and the goal difference per match.
If there are instances with clearly incorrect values, highlight them and remove them before generating the histograms.
Calculate the minimum, maximum, and average of each of the 4 histograms requested above.
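As a minimal sketch of the cleaning and summary steps, the snippet below works on a small invented table rather than the real file. The column names `HT`, `AT`, `HS` and `AS` (home team, away team, home score, away score) are assumptions about the dataset's layout, and treating negative scores as "clearly incorrect" is only one possible rule.

```python
import pandas as pd

# Invented rows for illustration; HS/AS (home/away goals) are assumed column names.
games = pd.DataFrame({
    "HT": ["A", "B", "C", "A", "B"],
    "AT": ["B", "C", "A", "C", "A"],
    "HS": [2, 0, 1, -1, 3],
    "AS": [1, 0, 4, 2, 3],
})

# Assumption: negative scores are clearly incorrect, so those rows are dropped.
valid = games[(games["HS"] >= 0) & (games["AS"] >= 0)].copy()

# Derived quantities requested in the question.
valid["total"] = valid["HS"] + valid["AS"]
valid["diff"] = valid["HS"] - valid["AS"]

# Minimum, maximum and mean of each of the four quantities.
summary = valid[["HS", "AS", "total", "diff"]].agg(["min", "max", "mean"])
print(summary)

# Frequency of each value, i.e. the bar heights of a one-goal-wide histogram.
home_counts = valid["HS"].value_counts().sort_index()
print(home_counts)
```

With matplotlib installed, `valid["HS"].plot.hist()` would draw the corresponding chart for one column.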

In [ ]:

Question 2¶

Choose a season that has already ended, from any of the leagues present in the dataset.
Produce the same histograms as in Question 1, but now for the chosen season only.
What are the differences between the histograms of Question 1 and Question 2? What might this indicate about the offensive quality of the chosen league compared with the dataset as a whole?
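One way to make the comparison concrete is to put the average goals per game of the chosen season next to those of the whole table. The sketch below uses invented rows; the column names `Sea` and `Lge` (season and league), the season label `"21-22"` and the league codes are all hypothetical.

```python
import pandas as pd

# Invented rows; Sea (season), Lge (league), HS and AS are assumed column names.
games = pd.DataFrame({
    "Sea": ["21-22", "21-22", "21-22", "20-21", "21-22"],
    "Lge": ["ENG1", "ENG1", "BRA1", "ENG1", "ENG1"],
    "HS": [2, 1, 0, 3, 3],
    "AS": [1, 1, 0, 2, 0],
})
games["total"] = games["HS"] + games["AS"]

# Keep only the chosen league and season (hypothetical labels).
chosen = games[(games["Sea"] == "21-22") & (games["Lge"] == "ENG1")]

# Average goals per game: whole table versus the chosen season.
comparison = pd.DataFrame({
    "all games": games[["HS", "AS", "total"]].mean(),
    "chosen season": chosen[["HS", "AS", "total"]].mean(),
})
print(comparison)
```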

In [ ]:

Question 3¶

From the data of the selected championship, create a dataframe that corresponds to the league table at the end of the season containing the names of the teams, number of points, games, wins, draws, losses, goals scored, goals against, and goal difference. Order the league table by points, wins, goal difference, and goals scored.
Build the same table using only the first half of the season's games.
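The table construction can be sketched as follows, again on invented results. The column names are assumptions, as are the scoring rule (3 points per win, 1 per draw) and the reading of "first half" as the first half of the rows in chronological order.

```python
import pandas as pd

# Invented results; HT/AT (home/away team) and HS/AS (goals) are assumed column names.
games = pd.DataFrame({
    "HT": ["A", "B", "C", "B", "C", "A"],
    "AT": ["B", "C", "A", "A", "B", "C"],
    "HS": [2, 1, 0, 1, 2, 3],
    "AS": [0, 1, 1, 1, 2, 1],
})

def league_table(matches: pd.DataFrame) -> pd.DataFrame:
    """Build a standings table (3 points per win, 1 per draw) from match results."""
    # One row per team per match, seen from that team's side.
    home = matches.rename(columns={"HT": "team", "HS": "gf", "AS": "ga"})[["team", "gf", "ga"]]
    away = matches.rename(columns={"AT": "team", "AS": "gf", "HS": "ga"})[["team", "gf", "ga"]]
    rows = pd.concat([home, away], ignore_index=True)

    rows["games"] = 1
    rows["wins"] = (rows["gf"] > rows["ga"]).astype(int)
    rows["draws"] = (rows["gf"] == rows["ga"]).astype(int)
    rows["losses"] = (rows["gf"] < rows["ga"]).astype(int)

    table = rows.groupby("team")[["games", "wins", "draws", "losses", "gf", "ga"]].sum()
    table["gd"] = table["gf"] - table["ga"]
    table["points"] = 3 * table["wins"] + table["draws"]

    # Tie-break order from the question: points, wins, goal difference, goals scored.
    table = table.sort_values(["points", "wins", "gd", "gf"], ascending=False)
    return table.reset_index()

full = league_table(games)
# First half of the games, assuming rows are already in chronological order.
half = league_table(games.iloc[: len(games) // 2])
print(full)
print(half)
```

Reshaping to one row per team per match keeps the aggregation to a single `groupby`, so the same function serves both the full and the half-season table.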

In [ ]:

Question 4¶

Using the games from the chosen league, fit a Poisson regression model to forecast match results, as seen in the classroom slides and in Soccermatics[3].
Print the summary of the model fit.
Simulate the match between the 1st- and 4th-placed teams, with the 1st-placed team playing at home. First, present the expected number of goals for each team. Then, present a histogram with the probabilities of the different scorelines.
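Once the model gives an expected number of goals for each side, the scoreline probabilities follow from the Poisson distribution. The sketch below assumes the two teams' goals are independent Poisson variables and uses hypothetical expected-goal values (1.8 and 1.1) in place of real model predictions; the function names are illustrative only.

```python
import math
import numpy as np

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def score_matrix(home_xg: float, away_xg: float, max_goals: int = 10) -> np.ndarray:
    """matrix[i, j] = P(home scores i, away scores j), assuming independent Poisson goals."""
    home = np.array([poisson_pmf(k, home_xg) for k in range(max_goals + 1)])
    away = np.array([poisson_pmf(k, away_xg) for k in range(max_goals + 1)])
    return np.outer(home, away)

# Hypothetical expected goals; in the exercise these would come from the fitted model.
probs = score_matrix(home_xg=1.8, away_xg=1.1)

home_win = np.tril(probs, -1).sum()  # entries where home goals > away goals
draw = np.trace(probs)               # entries where both teams score the same
away_win = np.triu(probs, 1).sum()   # entries where away goals > home goals
print(f"home win {home_win:.3f}, draw {draw:.3f}, away win {away_win:.3f}")

# Most likely exact scoreline.
i, j = np.unravel_index(probs.argmax(), probs.shape)
print(f"most likely score: {i}-{j}")
```

Truncating at `max_goals` drops a small amount of probability mass, so the matrix sums to slightly less than 1.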

In [ ]:

Question 5¶

Use the trained model to simulate the expected scores of all the season's games.
Construct a league table based on the expected results. Treat any game whose expected goal difference is below 0.5 in absolute value as a draw.
Compare the real table with the simulated one. Where are the main differences between them, and where are they similar? What might this indicate about which aspects of team quality the model underestimates or overestimates?
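The draw rule can be sketched as below. The expected-goal values are hypothetical stand-ins for the model's predictions, the column names are illustrative, and the threshold is read as applying to the absolute expected goal difference.

```python
import numpy as np
import pandas as pd

# Hypothetical expected goals per match, standing in for the model's predictions.
expected = pd.DataFrame({
    "HT": ["A", "B", "C"],
    "AT": ["B", "C", "A"],
    "home_xg": [2.1, 1.2, 0.9],
    "away_xg": [0.8, 1.0, 1.6],
})

diff = expected["home_xg"] - expected["away_xg"]

# Assumption: a draw means the absolute expected goal difference is below 0.5.
expected["home_pts"] = np.select([diff.abs() < 0.5, diff > 0], [1, 3], default=0)
expected["away_pts"] = np.select([diff.abs() < 0.5, diff < 0], [1, 3], default=0)

# Sum the points each team collects at home and away.
points = (
    expected.groupby("HT")["home_pts"].sum()
    .add(expected.groupby("AT")["away_pts"].sum(), fill_value=0)
    .sort_values(ascending=False)
    .astype(int)
)
print(points)
```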

In [ ]: