Big Picture
Research Question
I am hoping to understand how health factors and government spending affect life expectancy around the world. Specifically, I hope to answer whether expenditure on health and levels of alcohol consumption predict age of life expectancy in countries around the world.
Description of data
The data set includes a variety of immunization, mortality, economic and social factors from 2000 - 2015 in 193 countries. The dataset was downloaded from Kaggle here and is a combined from several datasets from the World Health Organization and the United Nations website.
Load data
library(tidyverse)
library(janitor)
library(skimr)
data_raw <- read_csv("https://raw.githubusercontent.com/simplyjin/ModernDive/master/data/life_data.csv")
#Clean variable names
data_raw <- data_raw %>%
clean_names()
Explore Data
glimpse(data_raw)
skim(data_raw)
There are 10 NA’s for life expectancy, 194 for alcohol, and 0 for expenditure. The NA’s will require further investigation when I make a final data frame for analysis.
Variables
The identification variable is defined by the combination of the country
and year
variables.
The outcome variable is called life_expectancy
per row.
The numerical variable is called percentage_expenditure
which is the expenditure on health as a percentage of GDP per capita per country per year.
I feel that this data set does not provide an appropriate categorical variable for this analysis so I will be creating a variable alcohol_group
based on the numerical variable alcohol
. alcohol
measures the per capita consumption in liters of pure alcohol. I will divide alcohol
into three equally sized levels of: Low
, Medium
, and High
.
Observational units
Each row in this data set will represent a country from each year between 2000 and 2015. The data set has 2938 rows representing 193 unique countries.
Preview Data
I will create a new data frame including only the variables that will be required in the analysis as well as filtering out rows based on the investigation into NA values. Next,I will create the alcohol group levels. Finally I will sample the resulting data frame.
data_raw %>%
select(country, year, life_expectancy, alcohol, percentage_expenditure) %>%
filter(is.na(life_expectancy))
data_raw %>%
select(country, year, life_expectancy, alcohol, percentage_expenditure) %>%
filter(is.na(alcohol))
The data set has missing values for life_expectancy
in 2013 for 10 countries. They seem to be small island nations. Thus I believe that these can be filtered out in my analysis.
The data set also has missing values for alcohol
in 2015 for it appears every country. Furthermore, for 2015 many countries have a percentage_expenditure
of 0. Thus the final analysis I will only use data from the years 2000 - 2014. Montenegro and South Sudan also have missing data so these two will also be removed.
#countries to remove based on NA life_expectancy
data_countries <- data_raw %>%
select(country, year, life_expectancy, alcohol, percentage_expenditure) %>%
filter(is.na(life_expectancy))
# double check that 10 countries should be removed
# data_raw %>%
# filter(country %in% data_countries$country)
data_analysis <- data_raw %>%
select(country, year, life_expectancy, alcohol, percentage_expenditure) %>%
filter(!country %in% data_countries$country,
!country %in% c("Montenegro", "South Sudan"),
year != 2015) %>%
mutate(
alcohol_size = cut_number(alcohol, n = 3),
size = recode_factor(alcohol_size, "[0.01,1.67]" = "Low", "(1.67,6.4]" = "Medium", "(6.4,17.9]" = "High")
)
Lets sample 5 rows on the resulting data frame that will be used in the final analysis
set.seed(274) #for reproducibility
data_analysis %>%
sample_n(5)
## # A tibble: 5 x 7
## country year life_expectancy alcohol percentage_expen~ alcohol_size size
## <chr> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Botswana 2008 57.5 6.56 477. (6.4,17.9] High
## 2 Timor-Leste 2009 66.6 0.09 36.2 [0.01,1.67] Low
## 3 Saint Vinc~ 2005 71.4 6.04 0 (1.67,6.4] Medi~
## 4 Turkey 2004 72 1.35 1.13 [0.01,1.67] Low
## 5 Saint Lucia 2006 73.5 13.4 0 (6.4,17.9] High