Big Picture

Research Question

I hope to understand how health factors and government spending affect life expectancy around the world. Specifically, I want to answer whether expenditure on health and the level of alcohol consumption predict life expectancy across countries.

Description of data

The data set includes a variety of immunization, mortality, economic, and social factors for 193 countries from 2000 to 2015. It was downloaded from Kaggle and combines several data sets from the World Health Organization and the United Nations websites.

Load data

library(tidyverse)
library(janitor)
library(skimr)

data_raw <- read_csv("https://raw.githubusercontent.com/simplyjin/ModernDive/master/data/life_data.csv")

#Clean variable names
data_raw <- data_raw %>% 
  clean_names()
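As a small illustration of what clean_names() does, the sketch below applies it to a toy data frame. The column names and values here are hypothetical stand-ins written in the style of the raw file, not the actual headers.

```r
library(janitor)

# Hypothetical column names with spaces and mixed case (assumed for illustration)
toy <- data.frame(
  `Life expectancy` = c(65.0, 59.9),
  `percentage expenditure` = c(71.3, 73.5),
  check.names = FALSE
)

# clean_names() converts the names to snake_case
toy_clean <- clean_names(toy)
names(toy_clean)
```

Snake-case names can then be used in select() and filter() without backticks.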

Explore Data

glimpse(data_raw)
skim(data_raw)

There are 10 missing values (NAs) for life_expectancy, 194 for alcohol, and none for percentage_expenditure. I will investigate these missing values further when I build the final data frame for analysis.
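The per-variable NA counts reported by skim() can also be computed directly. This is a minimal sketch on a made-up tibble (the values are hypothetical), using the same three column names as the analysis.

```r
library(dplyr)

# Toy data with deliberately missing values (hypothetical numbers)
toy <- tibble(
  life_expectancy = c(65.0, NA, 71.4),
  alcohol = c(NA, NA, 6.04),
  percentage_expenditure = c(71.3, 0, 12.5)
)

# Count the NAs in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```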

Variables

The identification variable is defined by the combination of the country and year variables.
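One way to check that country and year together identify each row is to count the combinations and look for any that occur more than once. The sketch below uses a toy tibble with one deliberate duplicate; the rows are hypothetical.

```r
library(dplyr)

# Toy data: Turkey 2008 appears twice on purpose (hypothetical rows)
toy <- tibble(
  country = c("Botswana", "Botswana", "Turkey", "Turkey"),
  year = c(2008, 2009, 2008, 2008)
)

# Any country-year pair with n > 1 breaks the one-row-per-pair assumption
duplicates <- toy %>%
  count(country, year) %>%
  filter(n > 1)

duplicates
```

An empty result from the same check on the real data would support treating country and year as the identifier.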

The outcome variable is life_expectancy, recorded once per row.

The numerical explanatory variable is percentage_expenditure, which is the expenditure on health as a percentage of GDP per capita for each country and year.

I feel that this data set does not provide an appropriate categorical variable for this analysis, so I will create one for alcohol group (named size in the code below) based on the numerical variable alcohol, which measures per capita consumption in liters of pure alcohol. I will divide alcohol into three levels containing roughly equal numbers of observations: Low, Medium, and High.
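To illustrate the grouping idea, the sketch below splits a made-up vector of alcohol values into three groups with cut_number() from ggplot2. Passing labels directly is one possible approach, shown here as an assumption rather than the method used later in this document; cut_number() forwards the labels argument to base cut().

```r
library(ggplot2)

# Hypothetical per capita consumption values, in liters of pure alcohol
alcohol <- c(0.1, 0.5, 1, 2, 3, 5, 7, 9, 12)

# Three groups with (approximately) equal numbers of observations
alcohol_group <- cut_number(alcohol, n = 3, labels = c("Low", "Medium", "High"))

table(alcohol_group)
```

Because the breaks are quantiles of the data, the cut points shift whenever the rows included in the analysis change.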

Observational units

Each row in this data set represents one country in one year between 2000 and 2015. The data set has 2938 rows representing 193 unique countries.
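Since 193 countries over 16 years would give 193 × 16 = 3088 rows, the 2938 rows imply that some countries do not appear in every year. A minimal sketch of how such countries could be found, on a toy tibble with placeholder country names:

```r
library(dplyr)

# Toy data: "Country B" is observed in only one of the three years (hypothetical)
toy <- tibble(
  country = c("Country A", "Country A", "Country A", "Country B"),
  year = c(2000, 2001, 2002, 2001)
)

n_years_total <- n_distinct(toy$year)

# Countries with fewer rows than there are years in the data
incomplete <- toy %>%
  count(country, name = "n_years") %>%
  filter(n_years < n_years_total)

incomplete
```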

Preview Data

I will create a new data frame that includes only the variables required for the analysis and filters out rows based on the investigation into NA values. Next, I will create the alcohol group levels. Finally, I will sample the resulting data frame.

data_raw %>% 
  select(country, year, life_expectancy, alcohol, percentage_expenditure) %>% 
  filter(is.na(life_expectancy))

data_raw %>% 
  select(country, year, life_expectancy, alcohol, percentage_expenditure) %>% 
  filter(is.na(alcohol))

The data set has missing values for life_expectancy for 10 countries in 2013. They seem to be small island nations, so I believe they can be filtered out of my analysis.

The data set also has missing values for alcohol in 2015 for what appears to be every country. Furthermore, many countries have a percentage_expenditure of 0 in 2015. Thus, in the final analysis I will use only data from the years 2000 to 2014. Montenegro and South Sudan also have missing data, so these two countries will be removed as well.
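A per-year summary is one way to confirm that the missing alcohol values and zero expenditures cluster in a single year. The sketch below runs on a toy tibble with hypothetical values.

```r
library(dplyr)

# Toy data: the 2015 rows mimic the pattern described above (hypothetical values)
toy <- tibble(
  country = c("Country A", "Country B", "Country A", "Country B"),
  year = c(2014, 2014, 2015, 2015),
  alcohol = c(6.5, 0.1, NA, NA),
  percentage_expenditure = c(477, 36.2, 0, 0)
)

# Count missing alcohol values and zero expenditures within each year
by_year <- toy %>%
  group_by(year) %>%
  summarise(
    n_missing_alcohol = sum(is.na(alcohol)),
    n_zero_expenditure = sum(percentage_expenditure == 0),
    .groups = "drop"
  )

by_year
```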

#countries to remove based on NA life_expectancy
data_countries <- data_raw %>% 
  select(country, year, life_expectancy, alcohol, percentage_expenditure) %>% 
  filter(is.na(life_expectancy))

# double check that 10 countries should be removed
# data_raw %>% 
#   filter(country %in% data_countries$country)

data_analysis <- data_raw %>% 
  select(country, year, life_expectancy, alcohol, percentage_expenditure) %>% 
  filter(!country %in% data_countries$country,
         !country %in% c("Montenegro", "South Sudan"),
         year != 2015) %>% 
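  mutate(
    # cut_number() makes three groups with roughly equal numbers of rows
    alcohol_size = cut_number(alcohol, n = 3),
    # the interval labels below must match the levels produced by cut_number()
    size = recode_factor(alcohol_size, "[0.01,1.67]" = "Low", "(1.67,6.4]" = "Medium", "(6.4,17.9]" = "High")
  )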

Let's sample 5 rows from the resulting data frame that will be used in the final analysis.

set.seed(274) #for reproducibility
data_analysis %>% 
  sample_n(5)
## # A tibble: 5 x 7
##   country      year life_expectancy alcohol percentage_expen~ alcohol_size size 
##   <chr>       <dbl>           <dbl>   <dbl>             <dbl> <fct>        <fct>
## 1 Botswana     2008            57.5    6.56            477.   (6.4,17.9]   High 
## 2 Timor-Leste  2009            66.6    0.09             36.2  [0.01,1.67]  Low  
## 3 Saint Vinc~  2005            71.4    6.04              0    (1.67,6.4]   Medi~
## 4 Turkey       2004            72      1.35              1.13 [0.01,1.67]  Low  
## 5 Saint Lucia  2006            73.5   13.4               0    (6.4,17.9]   High
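In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). A minimal sketch of the equivalent call, on a toy tibble standing in for data_analysis (the rows are hypothetical):

```r
library(dplyr)

# Toy stand-in for the analysis data frame (hypothetical values)
toy <- tibble(
  country = paste("Country", LETTERS[1:10]),
  life_expectancy = seq(55, 82, by = 3)
)

set.seed(274) # for reproducibility
sampled <- toy %>%
  slice_sample(n = 5)

sampled
```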