Data Analysis on FIFA international world cup data

Mohamed Fawas
5 min read · May 17, 2022

I am a big fan of Real Madrid. I have watched many of their matches, as well as those of other teams. I started watching football during the 2010 World Cup. After watching the 2010 quarter-final between Germany and Argentina, I became a huge fan of the German national football team.

During the 4th semester of my master's, I was looking to upskill by doing some data analysis projects, so I explored different datasets and project ideas. While scrolling through datasets on Kaggle, I found one that contains data on the matches played in the FIFA World Cup from 1930 to 2014. I thought it would be really interesting to do a data analysis project on this data, and that doing so would improve my knowledge of both data analysis and football.

So, I downloaded the dataset from Kaggle and uploaded it to a new repository on my GitHub. Then I started coding the project in Google Colab.

I followed the plan of action below to complete this data analysis project.

  1. Import the libraries
  2. Load the dataset
  3. Data Cleaning
  4. Exploratory Data Analysis

Anyone can reproduce this work with the help of this Python notebook.

Import the libraries

The first thing we have to do is import the Python libraries that we need for this data analysis project. You can do that with the code below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly as py
import cufflinks as cf

Load the dataset

Then I loaded the dataset from my own GitHub repository. I did this step with the `read_csv` function of the pandas library.

Then display the first 5 rows of each dataset to get a rough idea of the data.
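As a minimal sketch of this loading step, the snippet below reads a CSV into a DataFrame named `matches`. The file location is not given in this post, so an in-memory stand-in is used here; in the notebook, the argument to `read_csv` would be the path or raw URL of the file in the repository. The sample rows and the `Stadium` column name are assumptions for illustration only.

```python
import io

import pandas as pd

# Stand-in for the real CSV file (assumption: the actual file lives in a
# GitHub repository whose URL is not shown here).
# The rows are illustrative, and "Stadium" is an assumed column name.
sample_csv = io.StringIO(
    "Year,Home Team Name,Stadium\n"
    "1930,France,Pocitos\n"
    "1930,USA,Parque Central\n"
    ",,\n"  # an empty row, like the ones removed later during cleaning
)

# read_csv accepts a local path, a URL, or a file-like object
matches = pd.read_csv(sample_csv)

print(matches.shape)
print(matches.head())
```

The empty row is kept as a row of missing values, which is why the cleaning step later drops rows where `Year` is missing.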

matches.head()

Data Cleaning

The world cup matches dataset has a lot of empty rows at the end. We need to remove those rows from our dataset.

matches.dropna(subset=['Year'], inplace=True)
# Display the data after removing empty rows
matches.tail()

In the dataset, the names of some countries established after 1930 are prefixed with the tag rn">. We need to remove that tag from the country names.

names = matches[matches['Home Team Name'].str.contains('rn">')]['Home Team Name'].value_counts()
names

Create a list of the countries that are labeled incorrectly.

wrong = list(names.index)

Correct those incorrectly labeled country names.

correct = [name.split('>')[1] for name in wrong]
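To see what the split does, here is a small worked example. The two values in `wrong` are illustrative entries in the tagged form described above, not the full list from the dataset.

```python
# Illustrative tagged names (assumption: the real list comes from names.index)
wrong = ['rn">Bosnia and Herzegovina', 'rn">Serbia and Montenegro']

# Splitting on '>' gives ['rn"', '<country name>'], so index 1 is the clean name
correct = [name.split('>')[1] for name in wrong]

print(correct)
```

This works because the tag contains exactly one `>` and the country names themselves contain none.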

In the stadium data, we can see that some stadiums were renamed later, so we need the old and new names to match. We will replace each old stadium name with its present name, which makes the analysis easier. The same lists also map the team name 'Germany FR' to 'Germany'.

old_name = ['Germany FR', 'Maracan� - Est�dio Jornalista M�rio Filho', 'Estadio do Maracana']
new_name = ['Germany', 'Maracan Stadium', 'Maracan Stadium']

Now we can combine the two sets of corrections.

wrong = wrong + old_name
correct = correct + new_name

Display the names before and after data cleaning

print("Before data cleaning")
print(wrong, "\n")
print("After data cleaning")
print(correct)

Now we need to apply these label changes across the whole dataset.
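The code for this step is not shown in this post, so the following is only a sketch of one way to do it: pair each wrong label with its correction in a dictionary and pass it to `DataFrame.replace`. The small `matches` frame, its values, and the `Stadium` column name are assumptions for illustration.

```python
import pandas as pd

# Small illustrative frame (assumption: the real one is the loaded dataset)
matches = pd.DataFrame({
    "Home Team Name": ['rn">Serbia and Montenegro', "Germany FR", "Brazil"],
    "Stadium": ["Olympiastadion", "Estadio do Maracana", "Estadio do Maracana"],
})

# Shortened versions of the wrong/correct lists built in the cleaning steps
wrong = ['rn">Serbia and Montenegro', "Germany FR", "Estadio do Maracana"]
correct = ["Serbia and Montenegro", "Germany", "Maracan Stadium"]

# Map each wrong label to its correction and replace exact matches
# in every column of the frame
mapping = dict(zip(wrong, correct))
matches = matches.replace(mapping)

print(matches)
```

Because the replacement matches whole cell values, a label such as "Germany FR" is changed wherever it appears, while other values are left untouched. The same call can be repeated for any other DataFrame that uses these labels.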

Exploratory data analysis

Here are the results I got from plotting different aspects of the FIFA World Cup dataset.

  1. Which team has won the most FIFA World Cup titles?

Brazil has won the most World Cups.

  2. Which team has finished as runner-up the most times?

Germany has finished as runner-up the most times.

  3. Which team has finished in third place the most times?

Germany has finished in 3rd place the most times.

  4. Create a plot of the countries that finished in the top 3 positions across all World Cups up to 2014.

  5. Create a plot of the number of goals scored by each of the top 20 goal-scoring countries.

  6. Create a plot of the number of teams that qualified in each year.

  7. Create a plot of the total number of goals scored by all teams in each World Cup.

  8. Create a plot of the number of matches played in each year.

  9. Create a plot of the 10 matches with the highest attendance.

  10. Create a plot of the top 10 stadiums by average attendance.
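The questions above mostly come down to counting or aggregating a column before plotting it. As a hedged sketch (not the notebook's actual code), the snippet below counts titles per team and ranks stadiums by average attendance. The column names `Winner`, `Stadium`, and `Attendance` are assumptions, the winners table covers only 1994 to 2014, and the stadium names and attendance figures are invented for illustration.

```python
import pandas as pd

# Subset of tournament winners, 1994-2014 ("Winner" is an assumed column name)
worldcups = pd.DataFrame({
    "Year": [1994, 1998, 2002, 2006, 2010, 2014],
    "Winner": ["Brazil", "France", "Brazil", "Italy", "Spain", "Germany"],
})

# Number of titles per team, sorted from most to fewest
titles = worldcups["Winner"].value_counts()
print(titles)

# Invented match rows ("Stadium" and "Attendance" are assumed column names)
matches = pd.DataFrame({
    "Stadium": ["Stadium A", "Stadium A", "Stadium B"],
    "Attendance": [100, 200, 50],
})

# Average attendance per stadium, keeping the 10 largest
top_stadiums = matches.groupby("Stadium")["Attendance"].mean().nlargest(10)
print(top_stadiums)
```

Each result is a pandas Series, so it can be turned into a bar chart with `titles.plot(kind="bar")` or `top_stadiums.plot(kind="bar")`.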
