Final Project
PLS 202: Introduction to Data Analytics
Welcome to the final project! Think of this as any other assignment. Complete it the same way, and know
that I have the same expectations as I have had on the other assignments. Files must be turned in as a
knitted PDF showing both your code and the output, just as before. While this is the class’ “exam,” feel
free to use slides, readings, or the internet to help with the assignment. You can also ask me questions, but
I will likely be much more cryptic in my answers than I have been throughout the semester.
Download and load the file final_data.RData from D2L. Open it in RStudio (don’t forget to copy and paste
the load function at the top of your script). It is a dataset called data containing the following variables
from the World Bank:
• country: the name of the country
• year: the year
• highschool: percent of population with at least a high school education
• gdp: GDP per capita (in thousands USD)
• gini: Gini coefficient, a measure of income inequality
• literacy: adult literacy rate
• migration: number of immigrants entering the country
• population: total population
#load the data or packages here
Question 1
- There is a variable called gini, which is the Gini coefficient, a measure of income or wealth inequality within
a population. It ranges from 0 to 100, where 0 indicates perfect equality and 100 perfect inequality. I
would like to compare the inequality levels in the 20th and 21st centuries. To do that, first make
a new variable called cent in the data that equals 0 when year is less than 2000, and 1 otherwise.
- What was the mean of gini in the 20th century and in the 21st century?
- I would like to test whether the difference between the averages of migration in the 20th and 21st centuries
is statistically significant. Run a t-test to compare the two means and interpret the result. [Hint: You
need to mention the p-value, the confidence interval, or both from the t-test results to argue about
statistical significance.]
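A minimal sketch of the mechanics in this question, run on a made-up data frame rather than the real data. The object toy and all of its values are invented for illustration; only the column names year, gini, migration, and cent come from the text above. Removing missing values with na.rm = TRUE is an assumption about how gaps should be handled.

```r
# Toy stand-in for `data`; every value here is invented for illustration only.
toy <- data.frame(
  year      = c(1990, 1995, 1998, 2003, 2010, 2018),
  gini      = c(40, 42, NA, 36, 35, 33),
  migration = c(100, 120, 110, 150, 170, 160)
)

# cent is 0 for years before 2000 and 1 otherwise
toy$cent <- ifelse(toy$year < 2000, 0, 1)

# Mean of gini within each value of cent, skipping missing values (assumption)
gini_means <- tapply(toy$gini, toy$cent, mean, na.rm = TRUE)
print(gini_means)

# Two-sample t-test comparing the two groups (Welch test, the default of t.test)
tt <- t.test(migration ~ cent, data = toy)
print(tt$p.value)   # p-value of the difference in means
print(tt$conf.int)  # 95% confidence interval for the difference
```

The p-value and the confidence interval printed at the end are the two quantities the hint refers to.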
Answer:
Question 2
- Now, I would like to look at the yearly trend of inequality in line with economic development. So,
first of all, calculate the correlation coefficient between year and gini from the data. Interpret the
result with a mention of statistical significance. [Hint: Again, you need to mention the p-value, the
confidence interval, or both from the correlation test results to argue about statistical significance.]
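As a rough illustration of a correlation test in base R, here is a sketch on invented numbers. The data frame toy is hypothetical; cor.test() computes a Pearson correlation by default and reports the estimate, the p-value, and a confidence interval.

```r
# Toy stand-in for `data`; values are invented for illustration only.
toy <- data.frame(
  year = c(1990, 1995, 2000, 2005, 2010, 2015),
  gini = c(41, 40, 38, 39, 36, 34)
)

# Pearson correlation test between year and gini
ct <- cor.test(toy$year, toy$gini)

print(ct$estimate)  # correlation coefficient
print(ct$p.value)   # p-value for the null hypothesis of zero correlation
print(ct$conf.int)  # 95% confidence interval for the coefficient
```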
Answer:
- The next question is going to involve ggplot, as I wish to visualize my analysis. So let’s get some data
ready. Create a new object called ggdata that contains the yearly means of the gini and gdp variables.
Then print the first 10 rows of ggdata.
- Let’s get into some sweet graphics. Use ggplot (don’t forget to load the ggplot2 package) to plot the Gini
coefficient over time as a line plot. As always:
• Give the axes appropriate titles, and give the overall graph a title
• Use a different theme (e.g., theme_bw())
• Remove the panel border
• Remove the axis ticks
[Hint: you might see some cuts and interruptions in the plot due to missing data, which is totally fine.]
- Now, plot how the yearly averages of GDP per capita and the Gini coefficient change over time on the same
plot. In plotting, keep the following in mind:
• Plot GDP and gini coefficient on different facets. When using facet_wrap you might need to let the
y-axes vary between the two plots.
• Facets should be organized to go down by a single column (not side by side).
• Use different colors for GDP per capita and gini coefficient.
• Plot both dots and lines for both variables.
• Additionally, make sure everything is appropriately labeled including the titles on the facets.
[Hint: you need to use pivot_longer in tidyr package to plot them using facet_wrap.]
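The yearly-means and faceting workflow described in this question can be sketched as follows, on a made-up data frame. The objects toy, long, and p, the label text, and the use of na.rm = TRUE are assumptions for illustration; ggdata, pivot_longer, facet_wrap, and theme_bw come from the text above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for `data`; values are invented for illustration only.
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(2000, 2001, 2002), times = 2),
  gini    = c(40, 41, NA, 30, 31, 32),
  gdp     = c(10, 11, 12, 20, 21, 22)
)

# Yearly means of gini and gdp (missing values dropped; an assumption)
ggdata <- toy %>%
  group_by(year) %>%
  summarise(gini = mean(gini, na.rm = TRUE),
            gdp  = mean(gdp, na.rm = TRUE))
head(ggdata, 10)

# Reshape to long format: one row per year and measure, as facet_wrap expects
long <- ggdata %>%
  pivot_longer(cols = c(gini, gdp), names_to = "measure", values_to = "value")

# One facet per measure, stacked in a single column, each with its own y-axis
p <- ggplot(long, aes(x = year, y = value, color = measure)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ measure, ncol = 1, scales = "free_y",
             labeller = as_labeller(c(gdp  = "GDP per capita (thousand USD)",
                                      gini = "Gini coefficient"))) +
  labs(title = "Yearly averages over time", x = "Year", y = "Yearly mean",
       color = "Measure") +
  theme_bw() +
  theme(panel.border = element_blank(), axis.ticks = element_blank())
```

Here scales = "free_y" lets the two facets use different y-axis ranges, and ncol = 1 stacks them in one column.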
Question 3
- Okay, in Question 2 we analyzed the yearly trends of the two variables. Now, I would like to see
the direct relationship between economic development and inequality, using a linear model. First, fit
a linear model called gdp_gini using the data and lm(), with the gini coefficient as the dependent
variable and gdp per capita as the independent variable.
- Show the summary of the gdp_gini model. Interpret the results using the exact value of the coefficient
estimate of gdp. [Hint: when gdp per capita increases by 1000 USD, how does the gini coefficient
change? (the unit of gdp per capita in the data is 1000 USD)]
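A short sketch of fitting and reading a simple linear model with lm(), using invented numbers. The data frame toy and the object slope are hypothetical; the model name gdp_gini and the variable roles follow the text above.

```r
# Toy stand-in for `data`; values are invented for illustration only.
toy <- data.frame(
  gdp  = c(1, 5, 10, 20, 40),
  gini = c(45, 43, 40, 36, 30)
)

# gini is the dependent variable, gdp the independent variable
gdp_gini <- lm(gini ~ gdp, data = toy)
summary(gdp_gini)

# The slope is the expected change in gini for a one-unit increase in gdp
slope <- coef(gdp_gini)[["gdp"]]
print(slope)
```

Because gdp is measured in thousands of USD, a one-unit increase in gdp corresponds to 1000 USD.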
Answer:
- Now let’s add the predicted gini values from the model to the original data. Create a new column called
pred in data, and store the predicted values in that column.
- Finally, let’s plot the real observations and the model predictions together.
• The x-axis should be gdp per capita, and the y-axis the gini coefficient
• Plot the real observations as points, and the model predictions as a line
• Change the color of the line to something other than black
• Give appropriate titles and axis labels
• Use theme_minimal()
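One way to sketch the prediction and plotting steps, again on invented data. The data frame toy, the object p, the color "steelblue", and the label text are assumptions for illustration. Passing newdata to predict() is a design choice: it returns one value per row (NA where gdp is missing), so the result fits into a new column even when lm() dropped incomplete rows while fitting.

```r
library(ggplot2)

# Toy stand-in for `data`, with missing values; all numbers are invented.
toy <- data.frame(
  gdp  = c(1, 5, NA, 20, 40),
  gini = c(45, 43, 41, NA, 30)
)

gdp_gini <- lm(gini ~ gdp, data = toy)

# One prediction per row of the data; rows with missing gdp get NA
toy$pred <- predict(gdp_gini, newdata = toy)

# Observations as points, model predictions as a colored line
p <- ggplot(toy, aes(x = gdp)) +
  geom_point(aes(y = gini)) +
  geom_line(aes(y = pred), color = "steelblue") +
  labs(title = "Observed and predicted Gini coefficient",
       x = "GDP per capita (thousand USD)", y = "Gini coefficient") +
  theme_minimal()
```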
- Compare the plot you created above with the plot from Q2.4. What differences do you notice in the
patterns between the two variables (GDP per capita and Gini coefficient)? Do the results in Q3.4
make more or less sense to you compared to Q2.4? Why or why not? [Hint: Consider how the data is
structured in each plot and what each plot is actually showing.]
Answer:
Question 4
In this question, we aim to understand the types of protests and riots that developed in the United States
before and after George Floyd’s death. We will use the provided ACLED dataset, which measures different
types of protests and riots. Our goal is to analyze the spatial distribution and types of protests and riots in
the months surrounding this event. The data is located in a file called acled.rda; you will have to call the
load function again to load this dataset. The ACLED dataset contains the following variables:
• year: The year of the event
• month: The month of the event
• event_type: Categories “Protests” or “Riots”
• sub_event_type: Specific sub-categories such as “Peaceful protest”, “Excessive force against
protesters”
• geo_precision: An indicator of the precision of the latitude and longitude values (1 = precise)
• longitude: Longitude of the event location
• latitude: Latitude of the event location
- Before we can start plotting, we need to clean the data. Perform the following steps using dplyr:
• Filter the dataset to include only observations from 2020.
• Filter the dataset to include only observations from the months of April to September. George Floyd’s
murder occurred towards the end of May, but we include April for context. [Hint: you need to use
correct names of months, not numbers.]
• Filter the dataset to include only cases involving “Peaceful\nprotest”, “Protest w/\nintervention”,
“Violent\ndemonstration”, “Excessive force\nagainst protesters” as sub_event_type. [Do
not change any of those variables here, even though they look a bit awkward! “\n” is included to
specify new lines in writing long texts.]
• Filter the dataset to include only cases with geo_precision = 1, meaning that we have precise
longitude and latitude values.
• Make sure that the filtered data keeps the same dataset name, acled, so that the provided code
in Q4.2 runs.
- Once the data is cleaned, we will proceed to create the desired plot. First of all, we will keep
only the protests and riots in the United States. To do this, you can just activate (remove the #) and run
the code in the following chunk.
#acled <- acled %>% filter(latitude > 22, latitude < 50, longitude > -150, longitude < -66)
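The cleaning steps listed earlier and the bounding-box filter can be sketched together on invented rows. The objects toy_acled, keep_months, keep_types, cleaned, and us_only are hypothetical, and the sketch assumes that month stores full English month names. Note the space in `longitude < -66`: written without it, R reads `<-` as the assignment operator.

```r
library(dplyr)

# Toy stand-in for `acled`; every row is invented for illustration only.
toy_acled <- data.frame(
  year           = c(2020, 2020, 2021, 2020, 2020, 2020),
  month          = c("May", "June", "June", "January", "July", "August"),
  sub_event_type = c("Peaceful\nprotest", "Violent\ndemonstration",
                     "Peaceful\nprotest", "Peaceful\nprotest",
                     "Mob violence", "Protest w/\nintervention"),
  geo_precision  = c(1, 2, 1, 1, 1, 1),
  longitude      = c(-93.3, -118.2, -74.0, -87.6, -95.4, 2.35),
  latitude       = c(45.0, 34.1, 40.7, 41.9, 29.8, 48.9)
)

# Assumed month spelling: full English names
keep_months <- c("April", "May", "June", "July", "August", "September")
# Sub-event types exactly as written in the data, including the line breaks
keep_types <- c("Peaceful\nprotest", "Protest w/\nintervention",
                "Violent\ndemonstration", "Excessive force\nagainst protesters")

# Cleaning: year, months, sub-event types, and precise coordinates
cleaned <- toy_acled %>%
  filter(year == 2020,
         month %in% keep_months,
         sub_event_type %in% keep_types,
         geo_precision == 1)

# Bounding box from the provided chunk, keeping events inside the United States
us_only <- cleaned %>%
  filter(latitude > 22, latitude < 50, longitude > -150, longitude < -66)
```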
Then, let’s bring in the map of the United States. If you can use the ne_states() function from the
rnaturalearth package, follow the instructions in 2-1. If you have trouble installing the
necessary packages, go to 2-2.
2-1. Use the ne_states() function from the rnaturalearth package to get a map of the United States.
Make sure to exclude Alaska and Hawaii.
2-2. Load the us_states.RData file using the code below. Make sure that you have downloaded the file from
D2L. Loading the file creates an object called us_states, which contains a map of the United States.
You need to exclude Alaska and Hawaii from the data.
- Your task is to generate a plot similar to the one shown and described below:
Points showcasing the locations of riots and protests in the United States in the month before and several
months after George Floyd’s murder.
• Points are sized by the num_events variable, which counts the number of events that occurred in a
month and location.
• The color of the points indicates the sub_event_type.
• The plot includes different facets for each month.
• The plot has several thematic changes, such as positioning the legend and removing axis ticks and text.
Reproduce the plot using the cleaned dataset.
[Hint: When you knit your assignment to PDF, it is okay if the plot appears at a different zoom level
compared to the example in the question PDF. Due to the plot size, some details (e.g., legends or labels)
might get cut off or not fully fit on the page, which is also acceptable.]
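A rough sketch of how such a faceted map of sized, colored points could be assembled, using invented events and a plain rectangle in place of a real map. The objects toy_acled, event_counts, outline, toy_map, and p are hypothetical. Computing num_events with count() is an assumption, needed only if that column is not already in the data, and the sf rectangle merely stands in for the us_states or ne_states() map object.

```r
library(dplyr)
library(ggplot2)
library(sf)

# Toy stand-in for the cleaned `acled` data; all rows are invented.
toy_acled <- data.frame(
  month          = c("May", "May", "May", "June", "June"),
  sub_event_type = c("Peaceful\nprotest", "Peaceful\nprotest",
                     "Violent\ndemonstration", "Peaceful\nprotest",
                     "Peaceful\nprotest"),
  longitude      = c(-93.3, -93.3, -118.2, -74.0, -74.0),
  latitude       = c(45.0, 45.0, 34.1, 40.7, 40.7)
)

# Number of events per month, location, and sub-event type;
# months are turned into a factor so facets follow calendar order
event_counts <- toy_acled %>%
  count(month, longitude, latitude, sub_event_type, name = "num_events") %>%
  mutate(month = factor(month, levels = month.name))

# A plain rectangle standing in for the real map of the United States
outline <- st_polygon(list(rbind(c(-125, 24), c(-66, 24), c(-66, 50),
                                 c(-125, 50), c(-125, 24))))
toy_map <- st_sf(name = "toy outline", geometry = st_sfc(outline, crs = 4326))

# Map layer first, then points sized by count and colored by sub-event type
p <- ggplot() +
  geom_sf(data = toy_map, fill = "grey95", color = "grey60") +
  geom_point(data = event_counts,
             aes(x = longitude, y = latitude,
                 size = num_events, color = sub_event_type),
             alpha = 0.6) +
  facet_wrap(~ month) +
  labs(size = "Number of events", color = "Event type") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())
```

Because the map layer has no month column, ggplot2 repeats it in every facet, while the points are split by month.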
Bonus question
The ACLED data has a lot of interesting information and a number of important events have happened
between 2020 and 2022. In the question above, we focused on understanding the spatial distribution of
protests and riots following George Floyd’s murder. Pick some related question and see what you can
learn about its implications from the ACLED dataset. This is completely open-ended so you can examine
particular states or regions instead of the US as a whole as well.