Assignment 1
Introduction
The data are treated as a random sample of DATA2X02 students: 211 responses were collected from a class of 700+ students.
A potential source of bias is non-response bias (Delighted team, 2019). Variables such as favorite social media platform, number of non-spam emails, and entry salary of a data scientist are likely to be subject to this bias. Because they require free-text answers, students are more likely to skip these questions than multiple-choice questions.
Some questions need improvement to generate useful data:
- How tall are you? The unit needs to be the same for every response, for instance centimeters.
- Gender. The spelling needs to be consistent, for example Female and Male.
- Entry salary of a data scientist. The unit also needs to be consistent, for example x AUD per year.
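Until the questions themselves are fixed, inconsistent answers have to be cleaned by hand. The following is a minimal sketch of that cleaning, not the code used for this report. The data frame `raw`, its column names, and the rule that values below 3 are metres are all assumptions made for illustration.

```r
# Hypothetical raw responses; the real survey columns may be named differently.
raw <- data.frame(
  height = c("172", "1.80", "165cm", NA),
  gender = c("F", "male", "Female ", "M"),
  stringsAsFactors = FALSE
)

# Keep only digits and the decimal point, then convert to numeric.
height_num <- as.numeric(gsub("[^0-9.]", "", raw$height))

# Assumption: values below 3 were entered in metres, so convert to centimetres.
height_cm <- ifelse(!is.na(height_num) & height_num < 3,
                    height_num * 100,
                    height_num)

# Map gender spellings onto consistent labels using the first letter;
# any other answer is kept as entered (after trimming whitespace).
trimmed <- trimws(raw$gender)
first_letter <- toupper(substr(trimmed, 1, 1))
gender_clean <- ifelse(first_letter == "F", "Female",
                ifelse(first_letter == "M", "Male", trimmed))

height_cm
gender_clean
```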
Results
The number of COVID tests a student has taken in the past two months does not follow a Poisson distribution.
A dispersion test, which relies on the fact that the Poisson distribution’s mean is equal to its variance (Allan, 2020), was applied to the data. The sample variance (3.66) is much larger than the sample mean (1.01), and the p-value (< 0.05) allows me to reject the null hypothesis that the number of COVID tests is Poisson distributed.
## Dispersion test of count data:
## 211 data points.
## Mean: 1.014218
## Variance: 3.661702
## Probability of being drawn from Poisson distribution: 0
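As a rough cross-check, the summary statistics above can be fed into the classical index-of-dispersion test. This is a sketch under the assumption that a chi-squared approximation is acceptable. It is not necessarily the same procedure that produced the output above, which may have been simulation based.

```r
# Summary statistics copied from the output above.
n <- 211
sample_mean <- 1.014218
sample_var <- 3.661702

# Index-of-dispersion statistic: (n - 1) * variance / mean.
# Under the Poisson null it is approximately chi-squared with n - 1 df.
dispersion_stat <- (n - 1) * sample_var / sample_mean

# Upper-tail p-value: small values indicate overdispersion (variance > mean).
p_value <- pchisq(dispersion_stat, df = n - 1, lower.tail = FALSE)

dispersion_stat
p_value
```

The statistic is several times larger than its degrees of freedom, which is why the p-value is effectively zero.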
Q1. Is the students’ average R coding ability level lower than 6?
Motivation: I find DATA2002 somewhat difficult, so I would like to know whether the class’s average R coding ability is less than 6.

Null Hypothesis: The average R coding ability level is 6.
Alternative Hypothesis: The average R coding ability level is less than 6.
##
## One Sample t-test
##
## data: x
## t = -8.8144, df = 209, p-value = 2.363e-16
## alternative hypothesis: true mean is less than 6
## 95 percent confidence interval:
## -Inf 5.141008
## sample estimates:
## mean of x
## 4.942857
A directional one-sample t-test gives a very small p-value (< 0.05), which is strong evidence against the null hypothesis. The null hypothesis is therefore rejected in favor of the alternative: the class’s average R coding ability is less than 6 (on a scale from 0 to 10).
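The reported p-value and confidence bound can be recomputed from the printed statistic. The sketch below uses only the numbers in the output above; the variable names are illustrative. With the raw vector `x`, the equivalent call would be `t.test(x, mu = 6, alternative = "less")`.

```r
# Values copied from the t-test output above.
t_stat <- -8.8144
df <- 209
sample_mean <- 4.942857
mu0 <- 6

# One-sided p-value for H1: mean < 6 (lower tail of the t distribution).
p_value <- pt(t_stat, df = df)

# Standard error implied by t = (mean - mu0) / se.
se <- (sample_mean - mu0) / t_stat

# Upper bound of the one-sided 95% confidence interval.
upper_bound <- sample_mean + qt(0.95, df = df) * se

p_value
upper_bound
```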
Q2. Is stress level correlated with loneliness level?
Motivation: I think people who feel lonely become stressed more easily, so I would like to look for evidence supporting this view.
The following figure is a smoothed scatter plot of two variables: loneliness level and stress level. In the original dataset, both variables contain some NA values, so I filled them with the column’s median value to keep those responses in the analysis. The plot shows a pattern in which higher loneliness levels are associated with higher stress levels.
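The median fill described above can be sketched as follows. The vector is made-up data for illustration, not the survey responses.

```r
# Hypothetical ratings with missing values; not the survey data.
loneliness <- c(3, NA, 7, 5, NA, 8)

# Replace each NA with the median of the observed values.
loneliness_filled <- ifelse(is.na(loneliness),
                            median(loneliness, na.rm = TRUE),
                            loneliness)

loneliness_filled
```

Median imputation keeps every response but reduces the variability of the filled variable, so dropping incomplete pairs is a reasonable alternative to compare against.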

Null hypothesis: Stress level is not correlated with loneliness level.
Alternative hypothesis: Stress level is positively or negatively correlated with loneliness level.
##
## Pearson's product-moment correlation
##
## data: loneliness and stressed
## t = 4.5979, df = 209, p-value = 7.386e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1751852 0.4209196
## sample estimates:
## cor
## 0.3030822
Using a test for association/correlation between paired samples, the p-value is below 0.05, so we reject the null hypothesis in favor of the alternative. The correlation coefficient is about 0.3 (95% confidence interval 0.18 to 0.42). Therefore, stress level has a positive relationship with loneliness level.
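The test statistic and confidence interval follow from the correlation coefficient and the sample size. The sketch below recomputes them from the printed output, assuming the standard formulas for Pearson's test: a t statistic with n - 2 degrees of freedom and a Fisher z interval.

```r
# Values copied from the correlation test output above.
r <- 0.3030822
df <- 209
n <- df + 2  # Pearson's test uses df = n - 2

# Test statistic: t = r * sqrt(df / (1 - r^2)).
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval via the Fisher z-transformation.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

t_stat
p_value
ci
```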
Q3. Do students who prefer YouTube and students who prefer TikTok have the same level of loneliness?
Motivation: In my opinion, people who love TikTok are lonelier than people who love YouTube, so I would like to look for evidence in the survey data.
Before hypothesis testing, I drew a boxplot of loneliness level by social media platform. Because the platform names were entered inconsistently, I first had to clean the data. For example, ‘ig’, ‘insta’, ‘instgram’, and ‘instagram’ should be combined into ‘instagram’. Similarly, ‘Wechat’, ‘Weixin’, ‘Wetchat’, ‘WeChat’, etc. should be combined into ‘wechat’.
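One way to do this kind of name cleaning is with lower-casing and a few regular expressions. The sketch below is illustrative only: the function `normalise_platform`, the sample answers, and the patterns are assumptions, not the code used for this report.

```r
# Hypothetical raw answers illustrating the spelling variants described above.
raw <- c("ig", "Insta", "instgram", "Instagram", "Wechat", "Weixin",
         "Wetchat", "WeChat", "Tiktok", "YouTube ")

normalise_platform <- function(x) {
  # Lower-case and trim so that case and stray spaces do not matter.
  x <- tolower(trimws(x))
  # Anything starting with "ig" or "inst" is treated as Instagram.
  x[grepl("^(ig|inst)", x)] <- "instagram"
  # "wechat", one-letter misspellings such as "wetchat", and "weixin".
  x[grepl("^(we.?chat|weixin)", x)] <- "wechat"
  x
}

cleaned <- normalise_platform(raw)
table(cleaned)
```

Prefix patterns like these are deliberately loose, so the resulting table should be checked by eye for names that were merged by mistake.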
From the visualization, the loneliness level of the YouTube group appears to be equal to or lower than that of the TikTok group. Next, I subset the sample to these two groups and run a t-test.

Null hypothesis: People whose favorite social media platform is YouTube have the same level of loneliness as people whose favorite is TikTok.
Alternative hypothesis: People whose favorite platform is YouTube have a lower level of loneliness than people whose favorite is TikTok.
##
## Welch Two Sample t-test
##
## data: loneliness by social_media
## t = 0.17694, df = 18.243, p-value = 0.5692
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 2.02953
## sample estimates:
## mean in group tiktok mean in group youtube
## 5.111111 4.923077
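From the result, the p-value is greater than 0.05, so there is insufficient evidence against the null hypothesis and we fail to reject it. Note that the test as run takes the difference as TikTok minus YouTube with the alternative "less than 0", which is the opposite direction to the stated alternative hypothesis. For the stated direction the one-sided p-value is 1 - 0.5692 = 0.4308, which also fails to reject the null. Thus, there is no significant evidence that people whose favorite social media platform is YouTube differ in loneliness from people whose favorite is TikTok.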
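The direction of a one-sided Welch test depends on the order of the groups. The sketch below uses made-up scores in a hypothetical data frame `toy` to show how the `alternative` argument of `t.test` relates to that order. It does not reproduce the survey result.

```r
# Hypothetical loneliness scores; not the survey data.
toy <- data.frame(
  loneliness = c(6, 4, 7, 5, 3, 8, 5, 4, 6, 2, 5, 7),
  social_media = rep(c("tiktok", "youtube"), each = 6)
)

# Groups are ordered alphabetically, so the difference is tiktok - youtube.
# alternative = "greater" tests H1: mean(tiktok) > mean(youtube),
# i.e. the YouTube group is less lonely than the TikTok group.
res_greater <- t.test(loneliness ~ social_media, data = toy,
                      alternative = "greater")

# alternative = "less" tests the opposite direction.
res_less <- t.test(loneliness ~ social_media, data = toy,
                   alternative = "less")

res_greater$p.value
res_less$p.value  # the two one-sided p-values sum to 1
```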
Conclusion
In this project, we analysed survey data from DATA2X02 students and reached the following conclusions:
- The number of COVID tests a student has taken in the past two months does not follow a Poisson distribution.
- The students’ average R coding ability is less than 6 (on a scale from 0 to 10).
- Stress level is positively correlated with loneliness level, with a correlation coefficient of 0.3.
- There is no significant difference in loneliness level between people whose favorite social media platform is YouTube and people whose favorite is TikTok.
References
Yan, H. (2018). Boxplot with individual data points. https://www.r-graph-gallery.com/89-box-and-scatter-plot-with-ggplot2.html
DataFlair (2021). Introduction to Hypothesis Testing in R – Learn every concept from Scratch! https://data-flair.training/blogs/hypothesis-testing-in-r/
Delighted team (2019). The 7 types of sampling and response bias to avoid in customer surveys. https://delighted.com/blog/avoid-7-types-sampling-response-survey-bias
Kelly B. (2015). Intermediate Plotting. https://www.cyclismo.org/tutorial/R/intermediatePlotting.html
Allan C. (2020). How do I know if my data fit a Poisson Distribution using R? https://stackoverflow.com/questions/59809960/how-do-i-know-if-my-data-fit-a-poisson-distribution-using-r