GDP Analysis using Countries of the World Dataset with Python

5 min read · Oct 28, 2021

Data science is rapidly becoming one of the most widely used disciplines, and it is helping governments maintain transparency, operate far more efficiently, and boost the economy and productivity. Using statistics, data preparation, predictive modeling, and machine learning, it addresses problems in individual sectors and in the economy as a whole.

Gross Domestic Product (GDP) is a measure of the health of an economy and represents the total value of goods and services produced in that economy. The GDP growth rate is a major indicator of a country’s economic performance. GDP per capita is a widely used indicator that economists combine with GDP to assess a country’s wealth relative to the size of its population. The formula for GDP per capita is:

GDP per capita = Gross Domestic Product (GDP) / Population
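As a quick worked example, the formula can be written as a small function. The function name and the figures below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def gdp_per_capita(total_gdp: float, population: int) -> float:
    """Return GDP per capita: total GDP divided by population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_gdp / population


# Hypothetical country: $500 billion total GDP and 10 million inhabitants.
print(gdp_per_capita(500e9, 10_000_000))  # 50000.0
```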

About the Dataset

The primary goal of this project is to explore the “Countries of the World” dataset and identify the factors that influence a country’s GDP per capita. The dataset provides information about each country’s population, region, area, infant mortality, GDP, and more.

Let’s start by importing the required Python libraries.

Load and observe the data
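A minimal sketch of this step, assuming the data is read with pandas. To keep the example self-contained, a tiny inline CSV with invented values stands in for the real file, and the `decimal=","` option is an assumption for a file that writes decimals with a comma; drop it if your copy uses a decimal point.

```python
import io

import pandas as pd

# Toy stand-in for the dataset file; countries and values are invented.
csv_text = '''Country,Region,Population,GDP ($ per capita),Literacy (%)
Country A,NORTH,1000000,30000,"99,0"
Country B,NORTH,2500000,,"97,5"
Country C,SOUTH,800000,4000,
'''

# decimal="," (an assumption) parses "99,0" as the float 99.0.
data = pd.read_csv(io.StringIO(csv_text), decimal=",")

print(data.head())          # first rows
data.info()                 # column dtypes and non-null counts
print(data.isnull().sum())  # number of missing values per column
```

With the real file, replace `io.StringIO(csv_text)` with the path to the CSV.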

Data Preparation

We fill the missing values with the median of the region a country belongs to, since countries that are geographically close are often similar in many respects. For example, let’s check the region medians of ‘GDP ($ per capita)’, ‘Literacy (%)’ and ‘Agriculture’.
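The region-median fill can be sketched with a pandas `groupby` followed by `transform`. The small frame below uses invented values and two made-up regions; only the three column names come from the text above.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (invented values).
data = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Region": ["NORTH", "NORTH", "NORTH", "SOUTH", "SOUTH", "SOUTH"],
    "GDP ($ per capita)": [30000.0, np.nan, 20000.0, 4000.0, 6000.0, np.nan],
    "Literacy (%)": [99.0, 97.0, np.nan, 60.0, np.nan, 80.0],
    "Agriculture": [0.02, 0.04, np.nan, 0.30, 0.20, np.nan],
})
cols = ["GDP ($ per capita)", "Literacy (%)", "Agriculture"]

# Inspect the median of each column per region.
print(data.groupby("Region")[cols].median())

# transform("median") returns one row per country holding its region's median,
# so fillna can align it with the original frame and fill only the gaps.
region_median = data.groupby("Region")[cols].transform("median")
data[cols] = data[cols].fillna(region_median)

print(data)
```

The median is less sensitive than the mean to a single very rich or very poor country in a region, which is one reason to prefer it here.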

Data Exploration

Top Countries with highest GDP per capita

The bar graph shows the top 20 countries with the highest GDP per capita. Luxembourg is well ahead, while the next 19 countries are close to one another. Germany, in 20th place, has about 2.5 times the world-average GDP per capita.
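The ranking behind such a bar graph can be sketched with `nlargest`. The values are invented, and the "world average" is assumed here to be the unweighted mean over countries, which may differ from how the original figure was computed.

```python
import pandas as pd

# Toy data (invented values).
data = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "GDP ($ per capita)": [50000, 30000, 10000, 6000, 4000],
})

# Use nlargest(20, ...) on the full dataset; 3 is enough for the toy frame.
top = data.nlargest(3, "GDP ($ per capita)")

# Assumption: world average = simple mean over countries (not population-weighted).
world_avg = data["GDP ($ per capita)"].mean()
top = top.assign(ratio_to_world_avg=top["GDP ($ per capita)"] / world_avg)

print(top)
```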

Correlation between Variables

The heatmap depicts the correlation between all numerical columns.
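The matrix behind such a heatmap can be computed with pandas alone. This is a sketch on invented values; ‘Birthrate’ is assumed as an example column name. The resulting matrix is the kind of input a plotting function such as `seaborn.heatmap` would take.

```python
import pandas as pd

# Toy data (invented values); "Region" is non-numeric and is excluded below.
data = pd.DataFrame({
    "Region": ["NORTH", "NORTH", "SOUTH", "SOUTH", "SOUTH"],
    "GDP ($ per capita)": [30000, 25000, 4000, 6000, 2000],
    "Literacy (%)": [99.0, 98.0, 60.0, 75.0, 50.0],
    "Birthrate": [10.0, 11.0, 35.0, 28.0, 40.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = data.select_dtypes(include="number").corr()

print(corr.round(2))
```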

Top Factors affecting GDP per capita

We selected the six columns most strongly correlated with GDP per capita and made scatter plots. We can see that many countries have a low GDP per capita and only a few have a high one.
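One way to pick those columns is to rank the others by the absolute value of their correlation with the target. The sketch below uses invented values, and taking the absolute value is an assumption about how "most correlated" was defined.

```python
import pandas as pd

# Toy data (invented values).
data = pd.DataFrame({
    "GDP ($ per capita)": [30000, 25000, 4000, 6000, 2000, 15000],
    "Literacy (%)": [99.0, 98.0, 60.0, 75.0, 50.0, 90.0],
    "Birthrate": [10.0, 11.0, 35.0, 28.0, 40.0, 18.0],
    "Population": [5e6, 8e7, 2e7, 1e6, 3e7, 4e6],
})
target = "GDP ($ per capita)"

# Correlation of every column with the target, strongest first (sign ignored).
strength = data.corr()[target].drop(target).abs().sort_values(ascending=False)

# Use head(6) on the full dataset; 2 is enough for the toy frame.
top_cols = strength.head(2).index.tolist()

print(strength)
print(top_cols)
```

Ignoring the sign keeps strongly negative relationships (for example, a high birth rate going with a low GDP per capita) in the selection.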

Modeling

First, label encode the categorical features ‘Region’ and ‘Climate’. I simply used all the features provided in the dataset.
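A minimal sketch of label encoding, assuming scikit-learn's `LabelEncoder` is used. The region names and climate codes are invented, and the `_label` column suffix is a hypothetical naming choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (invented values).
data = pd.DataFrame({
    "Region": ["NORTH", "SOUTH", "NORTH", "EAST"],
    "Climate": [1, 2, 3, 2],
})

# Fit one encoder per column; classes are sorted, then mapped to 0, 1, 2, ...
for col in ["Region", "Climate"]:
    encoder = LabelEncoder()
    data[col + "_label"] = encoder.fit_transform(data[col])

print(data)
```

The integer codes imply an ordering that the categories do not really have. Tree-based models tend to cope with this, while a linear model may be better served by one-hot encoding.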

Training and Testing the data

First, let’s try a linear regression model. As metrics, I checked both the root mean squared error (RMSE) and the mean squared log error (MSLE).

Since the target is not linearly related to many of the features, it is also worth trying nonlinear models, such as a random forest.
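The sketch below shows the train/test split, a linear regression fit, and both metrics using scikit-learn. It runs on synthetic data, so the split ratio, the seed, and the resulting numbers are illustrative assumptions rather than the article's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the GDP per capita target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 1000 + 50 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 100, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
# MSLE is not defined for negative values, and a linear model can predict
# below zero, so predictions are clipped at 0 before computing it.
msle = mean_squared_log_error(y_test, np.clip(pred, 0, None))

print("RMSE:", rmse)
print("MSLE:", msle)
```

RMSE is expressed in the same units as the target, while MSLE penalizes relative errors, which makes it less dominated by the richest countries.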
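To illustrate why a nonlinear model can help, the sketch below fits both a linear regression and a random forest to a synthetic target that grows exponentially with one feature. The data, the hyperparameters, and the seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends nonlinearly on the first feature only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 1000 * np.exp(0.4 * X[:, 0]) + rng.normal(0, 200, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

lin_rmse = np.sqrt(mean_squared_error(y_test, linear.predict(X_test)))
rf_rmse = np.sqrt(mean_squared_error(y_test, forest.predict(X_test)))

print("Linear regression RMSE:", lin_rmse)
print("Random forest RMSE:", rf_rmse)
```

On this synthetic target the forest fits the curve much more closely than the straight line; whether the same holds on the real dataset has to be checked on its own test split.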

Data Visualization

To visualize the results, we made a scatter plot of predicted against actual GDP. The model gives reasonable predictions, as the points cluster around the line y = x.
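A plot of this kind can be sketched with matplotlib as follows. The actual and predicted values are random placeholders, not model output, and the axis limits are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values standing in for the test targets and model predictions.
rng = np.random.default_rng(0)
actual = rng.uniform(1000, 40000, size=50)
predicted = actual + rng.normal(0, 3000, size=50)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.7)

# Reference line y = x: a perfect prediction would fall exactly on it.
low, high = 0, 45000
ax.plot([low, high], [low, high], color="red", label="y = x")

ax.set_xlabel("Actual GDP ($ per capita)")
ax.set_ylabel("Predicted GDP ($ per capita)")
ax.legend()
plt.show()
```

Points above the line are over-predictions and points below it are under-predictions.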

Total GDP

The top 10 countries by total GDP together account for about two-thirds of global GDP.

We compared how these ten countries rank by total GDP and by GDP per capita. The countries with a high total GDP are quite different from those with a high GDP per capita. China and India rank far higher when it comes to total GDP. The only country in the top 10 (in fact, the top 2) on both measures is the United States.
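Total GDP can be derived by multiplying GDP per capita by population, which is assumed here to be how the totals were obtained. The sketch below computes it on invented values, along with the share held by the top countries; the ‘Total GDP ($)’ column name is hypothetical.

```python
import pandas as pd

# Toy data (invented values).
data = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Population": [1_000_000, 50_000_000, 5_000_000, 200_000_000],
    "GDP ($ per capita)": [60000, 30000, 20000, 2000],
})

# Total GDP = GDP per capita x population.
data["Total GDP ($)"] = data["GDP ($ per capita)"] * data["Population"]

# Use nlargest(10, ...) on the full dataset; 2 is enough for the toy frame.
top = data.nlargest(2, "Total GDP ($)")
share = top["Total GDP ($)"].sum() / data["Total GDP ($)"].sum()

print(top[["Country", "Total GDP ($)"]])
print(f"Share of total: {share:.1%}")
```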
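The two rankings can be placed side by side with `rank`. Again the values and the new column names are invented for illustration.

```python
import pandas as pd

# Toy data (invented values).
data = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Population": [1_000_000, 50_000_000, 5_000_000, 200_000_000],
    "GDP ($ per capita)": [60000, 30000, 20000, 2000],
})
data["Total GDP ($)"] = data["GDP ($ per capita)"] * data["Population"]

# Rank 1 is the highest value on each measure.
data["Rank (total)"] = data["Total GDP ($)"].rank(ascending=False).astype(int)
data["Rank (per capita)"] = data["GDP ($ per capita)"].rank(ascending=False).astype(int)

print(data.sort_values("Rank (total)")[["Country", "Rank (total)", "Rank (per capita)"]])
```

In the toy frame, country D is second by total GDP but last by GDP per capita, the same pattern the text describes for populous countries.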

Factors affecting Total GDP

We also checked the correlation between total GDP and the other columns. The top two factors are population and area, followed by many of the factors that were also most strongly correlated with GDP per capita.