As last month’s coup d’état in Myanmar reminds us, national coups are relatively frequent, if jarring, political events. In fact, there have been nearly 1000 documented coup events across the world in the last 80 or so years. The outcome of a coup can determine the stability, safety, and political rights of an entire nation, and sometimes reverberates far beyond it. This motivates my research question: can we predict a coup attempt’s outcome from attributes of the coup itself and of the nation in which it occurs?
To answer this question, I look primarily to the Coup D’état Project (CDP) dataset compiled by the Cline Center at the University of Illinois (see citation below). This dataset consists of 943 coup events worldwide from 1945 to 2019 — 426 realized coups, 336 attempted coups, and 181 coup conspiracies — across 136 unique countries. The outcome of a coup, the event_type variable in the CDP dataset, is defined essentially by how far the coup gets before being foiled, if it is foiled at all: in the planning stages (labelled a conspiracy), during the attempt (an attempt), or not at all (a realized coup).
In addition to this outcome variable, the CDP dataset includes many variables to be used for prediction: country, date and 17 dummy variables of attributes concerning the coup group and the treatment/involvement of the head of state during the event. These dummy variables include whether there was foreign involvement, whether the coup was popular, whether the head of state fled, etc.
To supplement this data, I have joined in two other datasets. The first is the National Material Capabilities (NMC) dataset, which contains various measures of a country’s power and capabilities in a given year. These measures, chosen by political scientists as inputs to the Composite Index of National Capability, are national iron and steel production, military expenditure, military personnel, energy consumption, total population, and urban population.
Likewise, I have joined in data from Gapminder. These data, joined on year and country, include life expectancy, GDP per capita, and a convenient classification of countries into continents. Together with the NMC dataset, I believe this provides a fairly comprehensive view of the situation within a country in the year that a coup event occurs.
Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, and Jonathan Bonaguro. 2020. Cline Center Coup D’état Project Dataset. Cline Center for Advanced Social Research. V.2.0.0. November 16. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V2
Enterline, Andrew, Michael Greig, Resat Bayer, Diane Dutka, et al. 2017. National Material Capabilities Dataset. The Correlates of War Project. V.5.0.0. February 1. University of North Texas.
Bryan, Jennifer. gapminder: Data from Gapminder. R package. https://github.com/jennybc/gapminder, http://www.gapminder.org/data/, doi: 10.5281/zenodo.594018
As discussed above, the final predictive dataset is compiled and wrangled from multiple sources.
Loading data:
# loading packages
library(tidyverse)
library(tidymodels)
library(janitor)
library(skimr)
library(vip)
library(patchwork)
# source for supplementary data
library(gapminder)
# loading primary data
coup <- read_csv('data/unprocessed/Coup_Data_v2.0.0.csv') %>%
clean_names()
# loading supplementary data and joining
mat_cap <- read_csv('data/unprocessed/NMC_5_0.csv', na = c("-9")) %>%
clean_names() %>%
# getting rid of not-relevant columns
select(-version, -stateabb, -cinc)
Compiling data from NMC:
# joining in NMC data ----
# NMC data is only available through 2012, so if a coup occurred after 2012, this code imputes 2012 country data
coup_not_na <- coup %>%
left_join(mat_cap, by = c("cowcode" = "ccode", "year" = "year")) %>%
drop_na(irst)
coup_to_impute <- coup %>%
left_join(mat_cap, by = c("cowcode" = "ccode", "year" = "year")) %>%
filter(is.na(irst))
coup_imputed <- coup_to_impute %>%
select(-irst, -milex, -milper, -pec, -tpop, -upop) %>%
left_join((mat_cap %>% filter(year == 2012)), by = c("cowcode" = "ccode")) %>%
select(-year.y) %>%
rename(year = year.x)
# confirming that no rows were lost in join
#dim(coup)
#dim(coup_not_na)
#dim(coup_imputed)
coup <- bind_rows(coup_not_na, coup_imputed)
Compiling data from gapminder:
# joining in gapminder data ----
# gapminder data is only available every 5 years. The code below matches gapminder data with any coup in its 5-year period
gapminder <- gapminder %>%
mutate(year_1 = year + 1,
year_2 = year + 2,
year_3 = year + 3,
year_4 = year + 4)
coup_2_7 <- coup %>%
left_join(gapminder, by = c("country" = "country", "year" = "year")) %>%
drop_na(lifeExp) %>%
select(-year_1, -year_2, -year_3, -year_4)
coup_3_8 <- coup %>%
left_join(gapminder, by = c("country" = "country", "year" = "year_1")) %>%
drop_na(lifeExp) %>%
select(-year.y, -year_2, -year_3, -year_4)
coup_4_9 <- coup %>%
left_join(gapminder, by = c("country" = "country", "year" = "year_2")) %>%
drop_na(lifeExp) %>%
select(-year_1, -year.y, -year_3, -year_4)
coup_5_0 <- coup %>%
left_join(gapminder, by = c("country" = "country", "year" = "year_3")) %>%
drop_na(lifeExp) %>%
select(-year_1, -year_2, -year.y, -year_4)
coup_6_1 <- coup %>%
left_join(gapminder, by = c("country" = "country", "year" = "year_4")) %>%
drop_na(lifeExp) %>%
select(-year_1, -year_2, -year_3, -year.y)
coup_sup <- bind_rows(coup_2_7, coup_3_8, coup_4_9, coup_5_0, coup_6_1) %>%
select(coup_id, continent, lifeExp, gdpPercap)
coup <- coup %>%
left_join(coup_sup, by = c('coup_id' = 'coup_id'))
# ensuring no rows were lost
#dim(coup)
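For reference, the five separate joins above could be collapsed into a single join by first mapping each coup year onto the gapminder year that opens its five-year window. Below is a minimal sketch of that alternative, not the code actually run here; it assumes the gapminder panel covers 1952–2007 in five-year steps (as in the gapminder package), uses the unmodified gapminder::gapminder data, and introduces gap_year purely as a helper column for the sketch.
# Sketch: map each coup year to the gapminder year at the start of its 5-year
# window, then join once. Coup years outside the covered window simply get no
# match, mirroring the drop_na(lifeExp) behaviour of the joins above.
coup_sup_alt <- coup %>%
  mutate(gap_year = year - ((year - 1952) %% 5)) %>%
  filter(between(gap_year, 1952, 2007)) %>%
  left_join(gapminder::gapminder, by = c("country" = "country", "gap_year" = "year")) %>%
  drop_na(lifeExp) %>%
  select(coup_id, continent, lifeExp, gdpPercap)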
Recoding and wrangling data:
# converting some column data types
coup <- coup %>%
mutate(month = as.numeric(month)) %>%
mutate(day = as.numeric(day)) %>%
mutate(event_type = as.factor(event_type)) %>%
mutate(country = as.factor(country))
# These variables should be normalized by a country's size (total population)
coup <- coup %>%
mutate(
irst = irst/tpop,
milex = milex/tpop,
milper = milper/tpop,
pec = pec/tpop,
upop = upop/tpop
)
# The injured dummy only has a handful of non-zero values. I think it is more useful to lump it with killed.
coup <- coup %>%
mutate(inj_or_killed = injured + killed) %>%
select(-injured, -killed)
# some countries were not coded to continents by Gapminder. Doing this coding manually below.
coup <- coup %>%
mutate(continent=replace(continent, (country %in% c('Fiji')), 'Oceania')) %>%
mutate(continent=replace(continent, (country %in% c('Laos', 'China', 'Kyrgyz Republic',
'Lebanon', 'Maldives', 'Myanmar',
'Pakistan', 'Papua New Guinea', 'Philippines',
'Qatar', 'Republic of Korea', 'Republic of Vietnam',
'Syria', 'Tajikistan', 'Thailand',
'Turkmenistan', 'United Arab Emirates', 'Yemen Arab Republic',
'Yemen PDR', 'East Timor', 'Yemen',
'Kyrgyzstan', 'Bangladesh')), 'Asia')) %>%
mutate(continent=replace(continent, (country %in% c('Georgia',
'Albania', 'Azerbaijan', 'Cyprus',
'Czechoslovakia', 'German Democratic Republic', 'Greece',
'Hungary', 'Italy', 'Luxembourg',
'Portugal', 'Rumania', 'USSR',
'Ukraine', 'Armenia', 'Montenegro',
'Turkey', 'Russia')), 'Europe')) %>%
mutate(continent=replace(continent, (country %in% c('Argentina', 'Bolivia', 'Brazil', 'Chile', 'Costa Rica', 'Cuba',
'Dominica', 'Dominican Republic', 'Ecuador', 'Ecudaor',
'El Salvador', 'Grenada', 'Guatemala', 'Guyana',
'Haiti', 'Nicaragua', 'Panama',
'Paraguay', 'Peru', 'Surinam',
'United States of America', 'Venezuela')), 'Americas')) %>%
mutate(continent=replace(continent, (country %in% c('Congo', 'Democratic Republic of the Congo', 'Egypt', 'Seychelles',
'Burkina Faso', 'Burundi', 'Guinea-Bissau',
'Lesotho', 'Mali', 'Sudan',
'Algeria', 'Gabon', 'Gambia',
'Nigeria', 'South Sudan', 'Tunisia',
'Zimbabwe', 'Libya', 'Ivory Coast',
'Eritrea', 'Chad', 'Niger')), 'Africa'))
Because it is compiled from multiple sources, the dataset lacks a consistent naming convention. Below, the column names from the joined-in data are standardized.
coup <- coup %>%
rename(military_exp = milex,
military_personnel = milper,
iron_steel_prod = irst,
energy_consum = pec,
pop = tpop,
urban_pop = upop,
life_exp = lifeExp,
gdp_per_cap = gdpPercap)
Now that the dataset is compiled and standardized, we will overview it before fitting models.
coup %>% skim()
Data summary | |
---|---|
Name | Piped data |
Number of rows | 943 |
Number of columns | 37 |
Column type frequency: | |
character | 1 |
factor | 3 |
numeric | 33 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
coup_id | 0 | 1 | 9 | 12 | 0 | 943 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
country | 0 | 1 | FALSE | 136 | Bol: 37, Sud: 28, Arg: 24, Hai: 24 |
event_type | 0 | 1 | FALSE | 3 | cou: 426, att: 336, con: 181 |
continent | 0 | 1 | FALSE | 5 | Afr: 383, Ame: 267, Asi: 214, Eur: 74 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
cowcode | 0 | 1.00 | 436.99 | 245.55 | 2.0 | 155.00 | 461.00 | 630.00 | 950.00 | ▇▂▇▆▃ |
year | 0 | 1.00 | 1977.37 | 17.77 | 1945.0 | 1963.00 | 1976.00 | 1990.00 | 2019.00 | ▅▇▇▆▂ |
month | 0 | 1.00 | 6.59 | 3.51 | 1.0 | 4.00 | 7.00 | 10.00 | 12.00 | ▇▅▅▅▇ |
day | 0 | 1.00 | 15.95 | 8.79 | 0.0 | 8.00 | 16.00 | 23.00 | 31.00 | ▆▇▇▇▇ |
unrealized | 0 | 1.00 | 0.55 | 0.50 | 0.0 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
realized | 0 | 1.00 | 0.45 | 0.50 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
conspiracy | 0 | 1.00 | 0.19 | 0.39 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
attempt | 0 | 1.00 | 0.36 | 0.48 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
military | 0 | 1.00 | 0.59 | 0.49 | 0.0 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
dissident | 0 | 1.00 | 0.29 | 0.45 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
rebel | 0 | 1.00 | 0.06 | 0.24 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
palace | 0 | 1.00 | 0.13 | 0.33 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
foreign | 0 | 1.00 | 0.07 | 0.26 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
auto | 0 | 1.00 | 0.03 | 0.18 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
resign | 0 | 1.00 | 0.09 | 0.28 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
popular | 0 | 1.00 | 0.10 | 0.30 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
counter | 0 | 1.00 | 0.02 | 0.14 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
other | 0 | 1.00 | 0.00 | 0.06 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
noharm | 0 | 1.00 | 0.94 | 0.23 | 0.0 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
harrest | 0 | 1.00 | 0.04 | 0.19 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
jailed | 0 | 1.00 | 0.05 | 0.22 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
tried | 0 | 1.00 | 0.01 | 0.10 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
fled | 0 | 1.00 | 0.08 | 0.27 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
exile | 0 | 1.00 | 0.05 | 0.22 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
military_exp | 66 | 0.93 | 38.67 | 150.00 | 0.0 | 3.81 | 9.62 | 23.85 | 2377.78 | ▇▁▁▁▁ |
military_personnel | 23 | 0.98 | 0.01 | 0.01 | 0.0 | 0.00 | 0.00 | 0.01 | 0.05 | ▇▁▁▁▁ |
iron_steel_prod | 5 | 0.99 | 0.03 | 0.18 | 0.0 | 0.00 | 0.00 | 0.00 | 4.53 | ▇▁▁▁▁ |
energy_consum | 12 | 0.99 | 1.11 | 3.31 | 0.0 | 0.09 | 0.35 | 0.69 | 45.38 | ▇▁▁▁▁ |
pop | 5 | 0.99 | 19187.49 | 55603.40 | 62.0 | 2871.75 | 6633.00 | 16630.00 | 998877.00 | ▇▁▁▁▁ |
urban_pop | 5 | 0.99 | 0.17 | 0.13 | 0.0 | 0.07 | 0.14 | 0.24 | 0.78 | ▇▅▂▁▁ |
life_exp | 235 | 0.75 | 52.20 | 9.72 | 23.6 | 44.87 | 50.44 | 58.48 | 77.44 | ▁▆▇▅▂ |
gdp_per_cap | 235 | 0.75 | 3120.00 | 3458.07 | 347.0 | 952.39 | 1718.97 | 3788.01 | 29796.05 | ▇▁▁▁▁ |
inj_or_killed | 0 | 1.00 | 0.05 | 0.22 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
There are some missing values in the joined-in data; these will be imputed within the recipe. We will also filter out the dummy variables whose means are so close to zero that they have near-zero variance.
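Before writing that recipe, it can help to see which dummies are nearly all zeros. Here is a small sketch of that check (using the dummy columns listed in the skim above; an illustration only, not part of the modeling pipeline):
# Sketch: the mean of each dummy is the share of events where it equals 1.
# Dummies with a very small share (e.g. other, tried, counter) are the ones a
# near-zero-variance filter would be expected to remove.
coup %>%
  summarise(across(c(military, dissident, rebel, palace, foreign, auto, resign,
                     popular, counter, other, noharm, harrest, jailed, tried,
                     fled, exile, inj_or_killed), mean)) %>%
  pivot_longer(everything(), names_to = "dummy", values_to = "share_of_events") %>%
  arrange(share_of_events)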
ggplot(coup, aes(factor(event_type, levels = c('conspiracy', 'attempted', 'coup')), fill = event_type)) +
geom_bar() +
labs(x = '',
y = 'Count',
title = 'Event Type') +
scale_fill_discrete(name = 'Event Type')
From the above chart, we see that there is not a severe imbalance in the outcome variable. While there are more realized coups and fewer conspiracies, there are plenty of data points for each class. However, it will be important to stratify the train/test split on event type.
ggplot(coup, aes(factor(event_type, levels = c('conspiracy', 'attempted', 'coup')), fill = event_type)) +
geom_bar() +
facet_wrap(~continent) +
labs(x = 'Event Type',
y = 'Count',
title = 'Coup Events by Continent') +
scale_fill_discrete(name = 'Event Type')
In addition to differences in frequency, it is clear that the distribution of coup outcomes varies by continent. In the Americas, few coups fail in the planning stages, but there are nearly as many failed attempts as realized coups. In Europe, more coups fail in the conspiracy stage than during the attempt, which differs from the pattern seen on other continents. It may therefore be useful to include continent as a predictor.
p1 <- ggplot(coup, aes(factor(military, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Military', y = 'Proportion') +
theme(legend.position = 'none')
p2 <- ggplot(coup, aes(factor(dissident, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Dissident') +
theme(legend.position = 'none', axis.text.y=element_blank(), axis.title.y=element_blank())
p3 <- ggplot(coup, aes(factor(rebel, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Rebel') +
theme(legend.position = 'none', axis.text.y=element_blank(), axis.title.y=element_blank())
p4 <- ggplot(coup, aes(factor(palace, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Palace') +
theme(legend.position = 'none', axis.text.y=element_blank(), axis.title.y=element_blank())
p5 <- ggplot(coup, aes(factor(foreign, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Foreign', y = 'Proportion') +
theme(legend.position = 'none')
p6 <- ggplot(coup, aes(factor(resign, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Resign') +
theme(legend.position = 'none', axis.text.y=element_blank(), axis.title.y=element_blank())
p7 <- ggplot(coup, aes(factor(popular, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Popular') +
theme(legend.position = 'none', axis.text.y=element_blank(), axis.title.y=element_blank())
p8 <- ggplot(coup, aes(factor(counter, labels = c('False', 'True')), fill = event_type)) +
geom_bar(position = 'fill') +
labs(x = 'Counter') +
scale_fill_discrete(name = "Event Type") +
theme(axis.text.y=element_blank(), axis.title.y=element_blank())
p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 +
plot_layout(ncol = 4)
As the above chart shows, some of the dummy columns describing the attributes of a coup appear to be related to event type. I expect the predictors dissident, resign, and popular to be of particular value. Other predictors, like foreign, may not have a meaningful relationship with event type, at least without interactions.
p9 <- ggplot(coup, aes(life_exp)) +
geom_density()
p10 <- ggplot(coup, aes(gdp_per_cap)) +
geom_density()
p11 <- ggplot(coup, aes(urban_pop)) +
geom_density()
p12 <- ggplot(coup, aes(pop)) +
geom_density()
p13 <- ggplot(coup, aes(energy_consum)) +
geom_density()
p14 <- ggplot(coup, aes(iron_steel_prod)) +
geom_density()
p9 + p10 + p11 + p12 + p13 + p14 +
plot_layout(ncol = 2)
While life_exp is fairly symmetrical, all of the other continuous numeric columns are moderately to severely right-skewed. This will be dealt with in the recipe.
ggplot(coup, aes(year, fill = factor(event_type, levels = c('conspiracy', 'attempted', 'coup')))) +
geom_histogram(binwidth = 1) +
labs(x = 'Year') +
scale_fill_discrete(name = "Event Type") +
facet_wrap(~factor(event_type, levels = c('conspiracy', 'attempted', 'coup')), ncol = 1)
The span from 1960 to about 1980 seems to have had the highest frequency of coups, followed by a period in the 1990s. From the chart above, it looks like there may be a relationship between year and coup outcome; for instance, there is a period in the early 1980s with more conspiracies than either attempts or realized coups.
ggplot(coup, aes(factor(month), fill = factor(event_type, levels = c('conspiracy', 'attempted', 'coup')))) +
geom_bar() +
labs(x = 'Month') +
scale_fill_discrete(name = "Event Type") +
facet_wrap(~factor(event_type, levels = c('conspiracy', 'attempted', 'coup')), ncol = 1)
From the chart above, we see that there may also be a relationship between month and coup outcome. November stands out as a month with a higher frequency of successful coups.
ggplot(coup, aes(factor(day), fill = factor(event_type, levels = c('conspiracy', 'attempted', 'coup')))) +
geom_bar() +
labs(x = 'Day') +
scale_fill_discrete(name = "Event Type") +
facet_wrap(~factor(event_type, levels = c('conspiracy', 'attempted', 'coup')), ncol = 1)
Coups seem to be randomly distributed by day, so I do not think this variable should be included. Additionally, the day on which a coup conspiracy is discovered does not seem like it would be a meaningful measure.
The first steps in this modeling process are setting the random seed, splitting the data, verifying the split, and cross-validation.
# setting seed
set.seed(3012)
# splitting data
coup_split <- coup %>%
initial_split(prop = 0.8, strata = event_type)
coup_train <- training(coup_split)
coup_testing <- testing(coup_split)
# verifying split
#dim(coup)
#dim(coup_train)
#dim(coup_testing)
# cross-validation
coup_fold <- vfold_cv(coup_train, v = 10, repeats = 5, strata = event_type)
The next step is to write the recipe.
The first step in this recipe is to remove some columns. I will remove the following:

- coup_id: a non-predictive column that assigns a unique id to each event
- cowcode: a non-predictive column containing the COW unique id for each country, essentially a duplicate of the country name
- country: as a factor, country has far too many levels to be useful; I believe continent acts as a simplified version of this variable
- day: does not seem to be meaningful, and could be misleading, as discussed above
- unrealized, realized, conspiracy, attempt: merely a dummified version of the outcome variable, event_type
Next, I employ step_nzv() to remove any columns that have near-zero variance; these columns are highly sparse and imbalanced.
Next, I impute all missing values with the median using step_medianimpute(). I chose the median rather than the mean or a linear imputation because some of the data is skewed.
After this, step_dummy() one-hot encodes the continent factor variable.
Then, I use step_BoxCox() to apply a Box-Cox transformation to the skewed variables noted in the EDA above.
Lastly, I use step_normalize() to scale and center all predictors.
# making recipe
coup_recipe <- recipe(event_type ~ ., data = coup_train) %>%
step_rm(coup_id, cowcode, country, day, unrealized, realized, conspiracy, attempt) %>%
step_nzv(all_predictors()) %>%
step_medianimpute(all_numeric()) %>%
step_dummy(continent, one_hot = TRUE) %>%
step_BoxCox(military_exp, military_personnel, iron_steel_prod, energy_consum, pop, urban_pop) %>%
step_normalize(all_predictors())
# Viewing recipe output
coup_recipe %>%
prep(coup_train) %>%
bake(new_data = NULL)
## # A tibble: 755 x 28
## year month military dissident rebel palace foreign resign popular noharm
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.63 -0.728 -1.22 1.56 -0.249 -0.384 -0.280 -0.302 -0.342 0.249
## 2 1.92 -1.30 0.820 -0.641 -0.249 -0.384 -0.280 3.31 2.92 0.249
## 3 1.92 0.125 0.820 1.56 -0.249 -0.384 -0.280 -0.302 -0.342 0.249
## 4 1.75 1.26 0.820 -0.641 -0.249 -0.384 -0.280 -0.302 -0.342 0.249
## 5 1.86 -0.728 0.820 -0.641 -0.249 -0.384 -0.280 -0.302 -0.342 0.249
## 6 1.98 -0.728 0.820 -0.641 -0.249 -0.384 -0.280 -0.302 -0.342 0.249
## 7 1.80 -0.159 0.820 -0.641 -0.249 2.60 -0.280 -0.302 -0.342 0.249
## 8 1.80 -1.01 0.820 -0.641 -0.249 -0.384 -0.280 3.31 2.92 0.249
## 9 1.86 1.26 0.820 -0.641 -0.249 -0.384 -0.280 -0.302 -0.342 0.249
## 10 1.98 -1.30 -1.22 -0.641 -0.249 -0.384 -0.280 3.31 2.92 0.249
## # … with 745 more rows, and 18 more variables: jailed <dbl>, fled <dbl>,
## # exile <dbl>, military_exp <dbl>, military_personnel <dbl>,
## # iron_steel_prod <dbl>, energy_consum <dbl>, pop <dbl>, urban_pop <dbl>,
## # life_exp <dbl>, gdp_per_cap <dbl>, inj_or_killed <dbl>, event_type <fct>,
## # continent_Africa <dbl>, continent_Americas <dbl>, continent_Asia <dbl>,
## # continent_Europe <dbl>, continent_Oceania <dbl>
These models were run in their respective R scripts. Their saved results are loaded below, and the code from each script is reproduced as well.
The models I have chosen to run are:
Random Forest, tuning mtry and min_n
# load required objects ----
load('model_fitting_rdas/coup_setup.rda')
# define model
rf_model <- rand_forest(
mode = "classification",
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
# check tuning params
# parameters(rf_model)
# set-up tuning grid ----
rf_params <- parameters(rf_model) %>%
update(mtry = mtry(range = c(2, 6)))
# define grid
rf_grid <- grid_regular(rf_params, levels = 5)
# Random forest workflow ----
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(coup_recipe)
# Tuning/fitting ----
rf_tune <- rf_workflow %>%
tune_grid(resamples = coup_fold, grid = rf_grid)
# Write out results & workflow
save(rf_tune, rf_workflow, file = "model_fitting_rdas/rf_tune.rda")
Boosted tree, tuning mtry, min_n, and learn_rate
# load required objects ----
load('model_fitting_rdas/coup_setup.rda')
# define model
bt_model <- boost_tree(
mode = "classification",
mtry = tune(),
min_n = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost", importance = "impurity")
# check tuning params
# parameters(bt_model)
# set-up tuning grid ----
bt_params <- parameters(bt_model) %>%
update(mtry = mtry(range = c(2, 6)))
# define grid
bt_grid <- grid_regular(bt_params, levels = 5)
# Boosted tree workflow ----
bt_workflow <- workflow() %>%
add_model(bt_model) %>%
add_recipe(coup_recipe)
# Tuning/fitting ----
bt_tune <- bt_workflow %>%
tune_grid(resamples = coup_fold, grid = bt_grid)
# Write out results & workflow
save(bt_tune, bt_workflow, file = "model_fitting_rdas/bt_tune.rda")
K Nearest Neighbors, tuning neighbors
# load required objects ----
load('model_fitting_rdas/coup_setup.rda')
# define model
knn_model <- nearest_neighbor(
mode = "classification",
neighbors = tune()
) %>%
set_engine("kknn")
# check tuning params
# parameters(knn_model)
# set-up tuning grid ----
knn_params <- parameters(knn_model)
# define grid
knn_grid <- grid_regular(knn_params, levels = 5)
# K-nearest neighbors workflow ----
knn_workflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(coup_recipe)
# Tuning/fitting ----
knn_tune <- knn_workflow %>%
tune_grid(resamples = coup_fold, grid = knn_grid)
# Write out results & workflow
save(knn_tune, knn_workflow, file = "model_fitting_rdas/knn_tune.rda")
Elastic Net, tuning penalty and mixture
# load required objects ----
load('model_fitting_rdas/coup_setup.rda')
# define model
en_model <- multinom_reg(
mode = "classification",
penalty = tune(),
mixture = tune()
) %>%
set_engine("glmnet")
# check tuning params
# parameters(en_model)
# set-up tuning grid ----
en_params <- parameters(en_model)
# define grid
en_grid <- grid_regular(en_params, levels = 5)
# Elastic net workflow ----
en_workflow <- workflow() %>%
add_model(en_model) %>%
add_recipe(coup_recipe)
# Tuning/fitting ----
en_tune <- en_workflow %>%
tune_grid(resamples = coup_fold, grid = en_grid)
# Write out results & workflow
save(en_tune, en_workflow, file = "model_fitting_rdas/en_tune.rda")
I chose these four model types because they are recommended choices for multi-class classification problems.
# loading models
load("model_fitting_rdas/rf_tune.rda")
load("model_fitting_rdas/bt_tune.rda")
load("model_fitting_rdas/knn_tune.rda")
load("model_fitting_rdas/en_tune.rda")
Since this is a multi-class classification problem, it is appropriate to use ROC AUC (computed with the Hand-Till multi-class generalization, as yardstick reports below) as the performance measure.
In the charts below, we see how ROC AUC varies by tuning parameters for each model.
Random Forest
autoplot(rf_tune, metric = "roc_auc")
Boosted Tree
autoplot(bt_tune, metric = "roc_auc")
KNN
autoplot(knn_tune, metric = "roc_auc")
Elastic Net
autoplot(en_tune, metric = "roc_auc")
While the best-performing parameters for each model type performed similarly, with ROC AUCs in the range of about 0.76-0.82, the random forest performed best by a small margin. Thus, we select the best random forest model as the model to use on the test set.
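One way to make that comparison explicit is to pull the best cross-validated ROC AUC from each tuning object and line them up. The following is a sketch of that comparison (output not shown):
# Sketch: best cross-validated ROC AUC for each model type, side by side.
# bind_rows() fills tuning-parameter columns that differ across models with NA.
bind_rows(
  show_best(rf_tune, metric = "roc_auc", n = 1) %>% mutate(model = "random forest"),
  show_best(bt_tune, metric = "roc_auc", n = 1) %>% mutate(model = "boosted tree"),
  show_best(knn_tune, metric = "roc_auc", n = 1) %>% mutate(model = "knn"),
  show_best(en_tune, metric = "roc_auc", n = 1) %>% mutate(model = "elastic net")
) %>%
  select(model, mean, std_err)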
# finding best model
select_best(rf_tune, metric = "roc_auc")
## # A tibble: 1 x 3
## mtry min_n .config
## <int> <int> <chr>
## 1 6 11 Preprocessor1_Model10
This model is a random forest with mtry = 6 and min_n = 11.
Now that we have selected a model, we finalize our workflow with this model. We fit it to the training data.
rf_workflow_tuned <- rf_workflow %>%
finalize_workflow(select_best(rf_tune, metric = "roc_auc"))
rf_results <- fit(rf_workflow_tuned, coup_train)
Next, we evaluate the model on the testing data, continuing with ROC AUC as our metric.
predict(rf_results, new_data = coup_testing, type = 'prob') %>%
bind_cols(coup_testing %>% select(event_type))%>%
roc_auc(.pred_attempted, .pred_conspiracy, .pred_coup, truth = event_type)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 0.807
On the test set, the model receives a ROC AUC score of about 0.81.
With this metric, an area under the ROC curve of 1 represents a perfect model, while a score of 0.5 is no better than random guessing. A score between 0.80 and 0.90 is generally considered ‘good’. Since this model’s AUC score is about 0.81, I consider its performance satisfactory.
Visualizing the per-class ROC curves from which the ROC AUC score is derived gives further insight into the model’s abilities.
predict(rf_results, new_data = coup_testing, type = 'prob') %>%
bind_cols(coup_testing %>% select(event_type))%>%
roc_curve(.pred_attempted, .pred_conspiracy, .pred_coup, truth = event_type) %>%
autoplot()
We see that the model performs best with the ‘coup’ class, and worst with the ‘attempted’ class. However, the model is still much better than random guessing for each class. The confusion matrix below gives more insight into when the model errs.
predict(rf_results, new_data = coup_testing) %>%
bind_cols(coup_testing %>% select(event_type)) %>%
conf_mat(truth = event_type, estimate = .pred_class)
## Truth
## Prediction attempted conspiracy coup
## attempted 32 14 16
## conspiracy 15 17 1
## coup 20 5 68
The matrix confirms that the model struggles most with unrealized events: true conspiracies were often predicted as ‘attempted’ or ‘coup’, and true attempts were often predicted as ‘conspiracy’ or ‘coup’. The model exhibited the least confusion with realized coups.
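To put numbers on this, per-class recall (the share of each true class predicted correctly) can be computed directly from the hard predictions. A small sketch using the same predictions as the confusion matrix above:
# Sketch: recall for each true class, i.e. the proportion of that class's events
# that the model labels correctly.
predict(rf_results, new_data = coup_testing) %>%
  bind_cols(coup_testing %>% select(event_type)) %>%
  group_by(event_type) %>%
  summarise(recall = mean(as.character(.pred_class) == as.character(event_type)))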
The following plot shows the 20 most important features in the model.
rf_results %>%
pull_workflow_fit() %>%
vip(num_features = 20)
It is interesting that the characteristic dissident, which signals a coup initiated by a small group of discontents (possibly including ex-military leaders, religious leaders, ex-government officials, or others), is such an important predictor. Beyond this, a diverse set of predictors, drawn from both the original dataset and the two supplementary datasets, turns out to be of major importance to the model.
From this model, it appears that both the characteristics of a coup event and the state of the country at the time of the event can be used to predict, with decent accuracy, whether a coup will be foiled during planning (conspiracy), foiled during the attempt (attempt), or realized (coup). It is reassuring to know that there is in fact a relationship among all of these variables.
I believe that with additional data resources, this model could be improved further. Firstly, it is likely that the model could be improved with a larger training set. While I do not wish for there to be more coups around the world, I do hope that the Cline Center continues to update its dataset and make it more robust. Additionally, I think more supplemental data on the political state of each country at the time of the coup would be valuable. For instance, there are political science organizations that score a nation’s democracy or human rights on a standardized scale; I failed to find this information in a free, compiled format, but I would like to include it if it becomes available in the future. Lastly, I think the model might be improved by a more granular geographic predictor. continent did not turn out to be particularly important, and I suspect this is because it is too broad. I would like to test this model with a more regional geographic classification, for example with regions like Southeast Asia or the Caribbean.
I think the random forest turned out to be the best model because of a few advantages. First, random forests handle high-dimensional data well, since each split considers only a random subset of the predictors; other models, like KNN, do not cope with high dimensionality as well. Since this model ended up with more than two dozen predictors, that ability may have contributed to the random forest’s success. Additionally, random forests handle outliers, non-linear relationships, and unbalanced data well, which may also have mattered, as some columns contain outliers and some dummy columns are somewhat imbalanced.
After building this model, I have a few new research questions. For one, I would be interested in whether the occurrence of coup events could be predicted. This is certainly a more difficult question than the one addressed here, but it is interesting and may be possible. I would also be interested in seeing whether coup events could be modeled as a Poisson process, or even as a self-exciting Hawkes process. It would be interesting to investigate this, particularly within regions.
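As a very rough first step toward that last question, one could compare the yearly counts of coup events against what a homogeneous Poisson process would imply, for example by checking whether the variance of the yearly counts is close to their mean. A sketch of that check follows (an illustration only, not an analysis performed here):
# Sketch: under a homogeneous Poisson process, the mean and variance of yearly
# counts should be roughly equal; a dispersion ratio well above 1 suggests
# clustering, which would motivate something like a Hawkes process.
coup %>%
  count(year, name = "events") %>%
  complete(year = 1945:2019, fill = list(events = 0)) %>%
  summarise(mean_events = mean(events),
            var_events = var(events),
            dispersion = var(events) / mean(events))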