STATISTICS ASSIGNMENT
Name of Student
Name of Institution
Contents
Introduction
Array Table 2007
Arrayed table 2017
Descriptive statistics with grouped data
Mean 2007
Median 2007
Mode 2007
Standard deviation 2007
Mean 2017
Median 2017
Mode 2017
Standard Deviation 2017
Scatter Diagram
Correlation
Regression
Coefficient of Determination
Calculation of estimated values
95% confidence interval for population mean
Conclusion
References
Introduction
Statistics is understood in different ways. It is commonly regarded as the study of the numerical characteristics of data; others see it as chiefly concerned with collecting, interpreting and presenting large amounts of numerical data. In its first sense, the word statistics means numerical facts arranged systematically, and in this sense it is always used as a plural. The major uses of statistical information are to inform the general public about events, to show what has already happened, to justify a claim that has been made and to establish relationships between factors.
The present analysis draws on almost all of these uses. The data were provided in raw form, and some initial analysis was carried out on the raw data to observe their characteristics. The data are then converted to grouped data, from which the measures of central tendency are calculated. One of the major measures of dispersion, the standard deviation, is also calculated. The relationship between the two data sets is examined using correlation and regression analysis.
Array Table 2007
| Classes | F |
|---|---|
| 4.95-8.35 | 5 |
| 8.35-11.75 | 18 |
| 11.75-15.15 | 18 |
| 15.15-18.55 | 6 |
| 18.55-21.95 | 3 |
| 21.95-25.35 | 1 |
Above is the histogram for the number of suicides in 2007. It suggests that the data are not normally distributed but positively skewed. This is also shown by the mean, median and mode: in a positively skewed distribution, the mean is greater than the median and the median is greater than the mode. The rightmost values in the diagram are the outliers.
Arrayed table 2017
| Classes | F |
|---|---|
| 5.95-10.55 | 7 |
| 10.55-15.15 | 16 |
| 15.15-19.75 | 16 |
| 19.75-24.35 | 9 |
| 24.35-28.95 | 3 |
The above graph shows the number of suicides in 51 states of the USA for the year 2017. The distribution is positively skewed, as shown by the longer tail on the right side. This is also reflected in the fact that the mean is greater than the median and the median is greater than the mode.
Descriptive statistics with grouped data
A measure of central tendency is a single value that is taken to represent the center of the data. It indicates the value around which the observations tend to cluster, and so summarizes the data set with one figure. The mean, median and mode are all valid measures of central tendency, each appropriate for data with different characteristics. In the following lines we calculate the mean, median and mode for the two sets of data provided.
For the data pertaining to 2007, we will use the following table to calculate all the descriptive statistics.
| Class boundaries | Frequency (f) | Cumulative frequency | Midpoint (x) | fx | fx² |
|---|---|---|---|---|---|
| 4.95-8.35 | 5 | 5 | 6.65 | 33.25 | 221.1125 |
| 8.35-11.75 | 18 | 23 | 10.05 | 180.9 | 2020.05 |
| 11.75-15.15 | 18 | 41 | 13.45 | 242.1 | 3075.3425 |
| 15.15-18.55 | 6 | 47 | 16.85 | 101.1 | 1703.53 |
| 18.55-21.95 | 3 | 50 | 20.25 | 60.75 | 820.125 |
| 21.95-25.35 | 1 | 51 | 23.65 | 23.65 | 559.3225 |
Mean 2007
Mean = ∑fx/∑f
= 644/51
= 12.62
This is the simple average of the data: the value each state would have if all states had the same number of suicides. It shows the average rate of suicides per state of the country. There are advantages and disadvantages to using this figure as an average. It is very simple to calculate, but it is affected by extreme values. It is also affected by a change of origin and scale: if a number is added to every observation, the same number is added to the mean, and if every observation is multiplied by a number, the mean is multiplied by the same number. Because of its sensitivity to extreme values, the mean may not represent the data well when the distribution is skewed; it is the best measure when the data are symmetrical. The mean does not necessarily coincide with any value in the data. Among all single values, it gives the smallest sum of squared deviations from the actual values.
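The grouped mean can be sketched in a few lines of Python. This is a minimal illustration, not the author's own working: the names `classes_2007` and `grouped_mean` are hypothetical, and the class boundaries and frequencies are copied from the 2007 table. With those frequencies and midpoints, Σfx works out to 641.75, so the sketch returns roughly 12.58; the hand calculation above uses a total of 644.

```python
# Class boundaries and frequencies copied from the 2007 grouped table.
# Each tuple is (lower boundary, upper boundary, frequency).
classes_2007 = [
    (4.95, 8.35, 5),
    (8.35, 11.75, 18),
    (11.75, 15.15, 18),
    (15.15, 18.55, 6),
    (18.55, 21.95, 3),
    (21.95, 25.35, 1),
]


def grouped_mean(classes):
    """Return sum(f * x) / sum(f), where x is the midpoint of each class."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f


mean_2007 = grouped_mean(classes_2007)
print(round(mean_2007, 2))
```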
Median 2007
$$\text{Median} = L + \frac{h}{f}\left(\frac{n}{2} - c\right)$$
$$\frac{n}{2} = \frac{51}{2} = 25.5$$
$$\text{Median} = 11.75 + \frac{3.4}{18}(25.5 - 23) = 11.75 + 0.189 \times 2.5 = 11.75 + 0.4725 \approx 12.22$$
In simple terms, the median is the middle value of the data, with 50% of the values lying on each side of it. The calculation for grouped data assumes that the values are evenly distributed within each class. It starts by dividing the total number of values by 2, which gives 25.5 here. The median class is the first class whose cumulative frequency reaches or exceeds this value. The lower class boundary L, the class width h and the frequency f are taken from this class, and c is the cumulative frequency of the class before it. The median is preferred over other measures of central tendency when the distribution is skewed.
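The interpolation steps for the grouped median can be sketched as follows. The function name `grouped_median` and the list `classes_2007` are illustrative names, with the table values copied from the 2007 data; the sketch assumes the classes are listed in ascending order.

```python
# Class boundaries and frequencies copied from the 2007 grouped table.
classes_2007 = [
    (4.95, 8.35, 5),
    (8.35, 11.75, 18),
    (11.75, 15.15, 18),
    (15.15, 18.55, 6),
    (18.55, 21.95, 3),
    (21.95, 25.35, 1),
]


def grouped_median(classes):
    """Median = L + (h / f) * (n / 2 - c) for the median class."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cumulative = 0  # c: cumulative frequency before the current class
    for lo, hi, f in classes:
        # The median class is the first one whose cumulative frequency
        # reaches n / 2; empty classes are skipped to avoid dividing by zero.
        if f > 0 and cumulative + f >= half:
            return lo + (hi - lo) / f * (half - cumulative)
        cumulative += f
    raise ValueError("the frequency table is empty")


median_2007 = grouped_median(classes_2007)
print(round(median_2007, 2))
```

Carrying full precision through the division gives about 12.22, the same median class and interpolation as the hand calculation.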
Mode 2007
$$\text{Mode} = L + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$$
$$= 8.35 + \frac{18 - 5}{(18 - 5) + 0} \times 3.4$$
$$= 8.35 + 3.4$$
$$= 11.75$$
The mode is defined as the most frequently occurring value in the data. In the formula, fm is the highest frequency of the distribution, which is 18 in this case; f1 is the frequency of the class preceding the modal class and f2 is the frequency of the class following it. The lower class boundary L and the class width h are taken from the class with the highest frequency. Two adjacent classes share the highest frequency of 18 here; the calculation uses the first of them, 8.35-11.75, so f1 = 5 and f2 = 18. The mode is not necessarily representative of the data from which it is calculated.
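A short sketch of the grouped mode formula is given below. The names are illustrative, and two choices are assumptions rather than part of the original working: when several classes tie for the highest frequency the first one is used (which matches the class chosen above), and when the denominator is zero the class midpoint is returned.

```python
# Class boundaries and frequencies copied from the 2007 grouped table.
classes_2007 = [
    (4.95, 8.35, 5),
    (8.35, 11.75, 18),
    (11.75, 15.15, 18),
    (15.15, 18.55, 6),
    (18.55, 21.95, 3),
    (21.95, 25.35, 1),
]


def grouped_mode(classes):
    """Mode = L + (fm - f1) / ((fm - f1) + (fm - f2)) * h."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))  # first class with the highest frequency
    f1 = freqs[i - 1] if i > 0 else 0               # preceding class
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0  # following class
    lo, hi, fm = classes[i]
    denominator = (fm - f1) + (fm - f2)
    if denominator == 0:
        # Assumption: fall back to the midpoint when the formula is undefined.
        return (lo + hi) / 2
    return lo + (fm - f1) / denominator * (hi - lo)


mode_2007 = grouped_mode(classes_2007)
print(round(mode_2007, 2))
```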
Standard deviation 2007
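$$\text{Standard deviation} = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$$
$$= \sqrt{\frac{8399.4825}{51} - (12.62)^2}$$
$$= \sqrt{164.69 - 159.2644}$$
$$= \sqrt{5.425}$$
$$= 2.33$$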
This is a measure of dispersion and does not have as simple an interpretation as the arithmetic mean. It is a very important concept that serves as a basic measure of variability in the data. A small standard deviation, such as the one calculated above, shows that most of the values are very close to the mean of the data. A larger value shows that the observations are scattered and are not closely gathered around the mean.
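The same formula can be sketched in Python directly from the class boundaries and frequencies. The names `classes_2007` and `grouped_std` are illustrative. Because the sketch recomputes every f·x² term from the listed frequencies (for example 18 × 10.05² = 1818.045), its totals differ from those used in the hand calculation, and it returns roughly 3.74 rather than 2.33.

```python
import math

# Class boundaries and frequencies copied from the 2007 grouped table.
classes_2007 = [
    (4.95, 8.35, 5),
    (8.35, 11.75, 18),
    (11.75, 15.15, 18),
    (15.15, 18.55, 6),
    (18.55, 21.95, 3),
    (21.95, 25.35, 1),
]


def grouped_std(classes):
    """sqrt(sum(f * x^2) / sum(f) - (sum(f * x) / sum(f))^2), x = midpoint."""
    n = sum(f for _, _, f in classes)
    pairs = [((lo + hi) / 2, f) for lo, hi, f in classes]
    mean = sum(f * x for x, f in pairs) / n
    mean_of_squares = sum(f * x * x for x, f in pairs) / n
    return math.sqrt(mean_of_squares - mean ** 2)


sd_2007 = grouped_std(classes_2007)
print(round(sd_2007, 2))
```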
For the year 2017, we will be using the following table to calculate the descriptive statistics.
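| Class boundaries | Frequency (f) | Cumulative frequency | Midpoint (x) | fx | fx² |
|---|---|---|---|---|---|
| 5.95-10.55 | 7 | 7 | 8.25 | 57.75 | 476.4375 |
| 10.55-15.15 | 16 | 23 | 12.85 | 205.6 | 2641.96 |
| 15.15-19.75 | 16 | 39 | 17.45 | 279.2 | 4872.04 |
| 19.75-24.35 | 9 | 48 | 22.05 | 198.45 | 4375.8225 |
| 24.35-28.95 | 3 | 51 | 26.65 | 79.95 | 2130.6675 |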
Mean 2017
Mean = ∑fx/∑f
= 841/51
= 16.49
The mean is higher than that of the 2007 suicide data, which means that the average number of suicides increased over the 10-year period. Looking at the data more closely, the increase is larger for some states than for others; Montana is one particular example. The 2017 values are higher than the 2007 values for almost all the states.
Median 2017
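$$\text{Median} = L + \frac{h}{f}\left(\frac{n}{2} - c\right)$$
$$\frac{n}{2} = \frac{51}{2} = 25.5$$
$$\text{Median} = 15.15 + \frac{4.6}{16}(25.5 - 23) \approx 15.87$$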
The median is the central value of the data: 50% of the values lie on each side of it.
Mode 2017
$$\text{Mode} = L + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$$
$$= 10.55 + \frac{16 - 7}{(16 - 7) + 0} \times 4.6$$
$$= 10.55 + \frac{9}{9} \times 4.6$$
$$= 10.55 + 4.6$$
$$= 15.15$$
The mode is the most frequently occurring value in the data. It is suitable when we need to find which values occur most often within the given data.
Standard Deviation 2017
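$$\text{Standard deviation} = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$$
$$= \sqrt{\frac{14496.9275}{51} - (16.49)^2}$$
$$= \sqrt{284.25 - 271.92}$$
$$= \sqrt{12.33}$$
$$= 3.51$$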
This is one of the most important measures to be calculated from the given data. It shows how much variability is present in the data, that is, how far the values are dispersed from the average. The standard deviation in 2017 is higher than in 2007, which means that the variability of the data has increased over the period of 10 years.
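For reference, the four grouped measures for 2017 can be computed together from the class boundaries and frequencies. This is a compact sketch with illustrative names (`classes_2017`, `describe_grouped`); the tie-break for the modal class (first class with the highest frequency) is an assumption that matches the class used above. The frequencies and midpoints in the table give Σfx = 820.95, so the sketch returns a mean of about 16.10 and a standard deviation of about 5.01, whereas the hand calculations use a total of 841; the median and mode agree with the worked values.

```python
import math

# Class boundaries and frequencies copied from the 2017 grouped table.
classes_2017 = [
    (5.95, 10.55, 7),
    (10.55, 15.15, 16),
    (15.15, 19.75, 16),
    (19.75, 24.35, 9),
    (24.35, 28.95, 3),
]


def describe_grouped(classes):
    """Return (mean, median, mode, standard deviation) of a grouped table."""
    freqs = [f for _, _, f in classes]
    mids = [(lo + hi) / 2 for lo, hi, _ in classes]
    n = sum(freqs)

    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    variance = sum(f * x * x for f, x in zip(freqs, mids)) / n - mean ** 2

    # Median: interpolate inside the first class whose cumulative
    # frequency reaches n / 2.
    median = None
    cumulative = 0
    for lo, hi, f in classes:
        if f > 0 and cumulative + f >= n / 2:
            median = lo + (hi - lo) / f * (n / 2 - cumulative)
            break
        cumulative += f

    # Mode: first class with the highest frequency (assumed tie-break).
    i = freqs.index(max(freqs))
    f_prev = freqs[i - 1] if i > 0 else 0
    f_next = freqs[i + 1] if i + 1 < len(freqs) else 0
    lo, hi, fm = classes[i]
    denominator = (fm - f_prev) + (fm - f_next)
    if denominator:
        mode = lo + (fm - f_prev) / denominator * (hi - lo)
    else:
        mode = (lo + hi) / 2  # assumed fallback when the formula is undefined

    return mean, median, mode, math.sqrt(variance)


mean_2017, median_2017, mode_2017, sd_2017 = describe_grouped(classes_2017)
print(round(mean_2017, 2), round(median_2017, 2),
      round(mode_2017, 2), round(sd_2017, 2))
```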
Scatter Diagram
A scatter diagram is the first step in determining whether any relationship exists between the independent and dependent variables. It is made by plotting each pair of independent and dependent values as a point on a graph. The above diagram plots the suicides in 2007 against the suicides in 2017 for 51 states of the USA. If the plotted points are clustered closely together, there is a strong relationship between the variables. The more scattered the points, the weaker the relationship.
Correlation
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}}$$
∑X = 644.8, ∑Y = 841.5, ∑XY = 11431.18, ∑X² = 8774.4, ∑Y² = 15095.45, n = 51
Putting the values in the formula:
$$r = \frac{51 \times 11431.18 - (644.8)(841.5)}{\sqrt{\left(51 \times 8774.4 - (644.8)^2\right)\left(51 \times 15095.45 - (841.5)^2\right)}}$$
$$= \frac{582990.18 - 542599.2}{\sqrt{(447494.4 - 415767.04)(769867.95 - 708122.25)}}$$
$$= \frac{40390.98}{\sqrt{31727.36 \times 61745.7}}$$
$$= \frac{40390.98}{44260.90}$$
$$= 0.9125$$
A correlation coefficient is a measure of the strength of the relationship between two variables. It may take values between -1 and +1. A value close to -1 shows a strong negative relationship between the variables, while a value close to +1 shows a strong positive relationship. This figure is important because it indicates how one variable tends to change when the other changes. In the above calculation the correlation coefficient is approximately +0.91, which shows a strong positive correlation between the numbers of people committing suicide in 2007 and 2017. The limitation of this measure is that it shows only the direction and strength of the relationship. It does not show the exact magnitude of the change in one variable that results from a change in the other.
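The arithmetic above can be checked with a short Python sketch that plugs the reported sums into the same formula. The variable names are illustrative; the sums are the ones given in this section.

```python
import math

# Sums reported in the Correlation section (X = 2007 values, Y = 2017 values).
n = 51
sum_x, sum_y = 644.8, 841.5
sum_xy, sum_x2, sum_y2 = 11431.18, 8774.4, 15095.45

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator
print(round(r, 4))
```

Working from sums is convenient when the raw observations are not at hand; with the raw pairs available, the same value could be obtained from the deviations about the two means.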
Regression
This form of data analysis is applied when a relationship between a dependent and an independent variable has to be modeled. Unlike correlation, which only measures the strength and direction of the relationship between two variables, regression analysis allows researchers to predict the value of one variable from the value of the other. In our analysis, suicides in 2017 are taken as the dependent variable and suicides in 2007 as the independent variable.
Suicides in 2017 = a + b (suicides in 2007)
a = Mean of 2017 values – b (mean of 2007 values)
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2}$$
∑XY = 11431.18, ∑X = 644.8, ∑Y = 841.5, ∑X² = 8774.4, n = 51
Putting values:
$$b = \frac{51 \times 11431.18 - (644.8)(841.5)}{51 \times 8774.4 - (644.8)^2}$$
$$= \frac{582990.18 - 542599.2}{447494.4 - 415767.04}$$
$$= \frac{40390.98}{31727.36}$$
$$b = 1.273$$
a = Mean of 2017 values – b (mean of 2007 values)
$$a = \frac{841.5}{51} - (1.273)\left(\frac{644.8}{51}\right)$$
$$= 16.5 - (1.273)(12.643)$$
$$= 16.5 - 16.095$$
$$= 0.405$$
Suicides in 2017 = 0.405 + 1.273 (suicides in 2007)
The above equation shows the relationship between the suicides committed in 2017 and in 2007, with suicides in 2007 as the independent variable and suicides in 2017 as the dependent variable. The value of b is 1.273, which shows that a one-unit change in the independent variable brings a change of 1.273 units in the dependent variable. The intercept is 0.405, which is the value of the dependent variable when the independent variable is zero.
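The slope and intercept can be reproduced from the reported sums with the sketch below. The variable names are illustrative; the formulas are the ones used in this section.

```python
# Sums reported in the Regression section (X = 2007 values, Y = 2017 values).
n = 51
sum_x, sum_y = 644.8, 841.5
sum_xy, sum_x2 = 11431.18, 8774.4

# Slope: b = (n * sum(XY) - sum(X) * sum(Y)) / (n * sum(X^2) - sum(X)^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = mean(Y) - b * mean(X)
a = sum_y / n - b * (sum_x / n)

print(round(b, 3), round(a, 3))
```

Computing a from the unrounded slope keeps rounding error out of the intercept.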
Coefficient of Determination
The coefficient of determination is represented by R². It is also called the explained variation of the model, as it shows the extent to which the variation in the dependent variable is explained by the independent variable included in the model. It is calculated by squaring the correlation coefficient r, and its value lies between 0 and 1. A value of 0 means that the dependent variable cannot be predicted from the independent variable. A value of 1 means that the independent variable predicts the dependent variable exactly, without any error. For the above data, r = 0.9125 gives a coefficient of determination of 0.8327, which shows that the independent variable explains 83.27% of the variation in the dependent variable.
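As a quick check of the squaring step, the sketch below squares the correlation coefficient of 0.9125 reported in the Correlation section. The result is sensitive to rounding: squaring 0.9125 gives about 0.8327, while rounding r to 0.92 before squaring would give 0.8464.

```python
# Correlation coefficient as reported in the Correlation section.
r = 0.9125

r_squared = r ** 2                     # coefficient of determination
r_squared_rounded_first = 0.92 ** 2    # effect of rounding r before squaring

print(round(r_squared, 4), round(r_squared_rounded_first, 4))
```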
Calculation of estimated values
After the above equation has been developed, the next step is to estimate the values of the dependent variable for specific values of the independent variable. We take 5 values from the 2007 column of the data and put them in the equation.
| Value of 2007 | Equation | Result |
|---|---|---|
| 6 | Suicides in 2017 = 0.405 + 1.273 (suicides in 2007) | 8.043 |
| 7.7 | Suicides in 2017 = 0.405 + 1.273 (suicides in 2007) | 10.2071 |
| 9 | Suicides in 2017 = 0.405 + 1.273 (suicides in 2007) | 11.862 |
| 10.8 | Suicides in 2017 = 0.405 + 1.273 (suicides in 2007) | 14.1534 |
| 12.4 | Suicides in 2017 = 0.405 + 1.273 (suicides in 2007) | 16.1902 |
The above table shows the 5 selected values of the suicides in 2007 and their corresponding values in 2017 calculated through the regression equation. The results show considerable differences between the actual values and the estimated values. The difference between an actual value and its estimated value is the error associated with the model. Such error generally occurs because no model can include every variable that affects the dependent variable; the effect of the omitted variables is captured by the error term.
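The estimates in the table can be reproduced with a small helper. The function name `predict_2017` is hypothetical; the coefficients are the rounded values a = 0.405 and b = 1.273 from the regression equation, and the five inputs are the 2007 values selected above.

```python
def predict_2017(value_2007, a=0.405, b=1.273):
    """Estimate the 2017 value from a 2007 value using the fitted line."""
    return a + b * value_2007


selected_2007 = [6, 7.7, 9, 10.8, 12.4]
estimates = [round(predict_2017(x), 4) for x in selected_2007]

for x, y_hat in zip(selected_2007, estimates):
    print(x, y_hat)
```

Subtracting each estimate from the corresponding observed 2017 value would give the residual, which is the error discussed in the paragraph above.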
95% confidence interval for population mean
When we compute a confidence interval for the population mean, we want a range of values that is likely to contain the population mean, given the sample mean. Three values are needed to find the interval: the sample mean, the z value for the chosen confidence level and the standard error of the mean. The formula used to calculate the confidence interval is as follows:
$$CI = \bar{X} \pm z_{\alpha/2}\,\sigma_{\bar{X}}, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
To apply the above formula, we will use the data from the years 2007 and 2017 one by one.
For 2007, the mean is 12.64 and the standard deviation of the sample is 3.52, so the standard error of the mean is 3.52/√51 = 0.493. Since we are calculating a 95% confidence interval, the value of z is 1.96. When we put these values in the formula, we get
$$CI = 12.64 \pm (1.96)(0.493)$$
$$= 12.64 \pm 0.966$$
$$= (11.67,\ 13.61)$$
We are therefore 95% confident that the population mean for 2007 lies between 11.67 and 13.61.
For the year 2017, the mean is 16.5 and the value of z is the same as in the previous working, 1.96. The standard deviation is 4.92, so the standard error of the mean is 4.92/√51 = 0.689. Putting these values in the formula, we get
$$CI = 16.5 \pm (1.96)(0.689)$$
$$= 16.5 \pm 1.35$$
$$= (15.15,\ 17.85)$$
We are therefore 95% confident that the population mean for 2017 lies between 15.15 and 17.85. We have used the z distribution in the above formulas because the sample sizes are greater than 30 in both cases. The formula also shows that the interval extends 1.96 standard errors on each side of the sample mean.
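A confidence interval for a population mean is built from the standard error of the mean, σ/√n, rather than from the standard deviation itself. The sketch below applies that to the means and standard deviations quoted in this section; the function name `mean_confidence_interval` is hypothetical, and n = 51 and z = 1.96 are taken from the text.

```python
import math


def mean_confidence_interval(mean, sd, n, z=1.96):
    """Return (lower, upper) limits: mean -/+ z * sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin


low_2007, high_2007 = mean_confidence_interval(12.64, 3.52, 51)
low_2017, high_2017 = mean_confidence_interval(16.5, 4.92, 51)

print(round(low_2007, 2), round(high_2007, 2))
print(round(low_2017, 2), round(high_2017, 2))
```

With these inputs the limits are roughly 11.67 to 13.61 for 2007 and 15.15 to 17.85 for 2017. The two intervals do not overlap, which is consistent with the higher mean found for 2017.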
Conclusion
The current analysis has examined the characteristics of the data on suicides in 2007 and 2017 for 51 states of the USA. First, the raw data were inspected for irregularities or outliers. The normality of both data sets was assessed by drawing a histogram for each. Descriptive statistics were then calculated for both data sets. The major finding is that the mean of the 2017 values is much higher than that of the 2007 values; the same applies to the median and the mode. One of the most important measurements for the analysis is the standard deviation, which is higher for 2017 than for 2007. This shows that the variability of the data in 2017 is higher than in 2007. In other words, the values lie further from their respective means at the end of the 10-year period than at its start.
The second part of the analysis tries to find whether there is any relationship between the figures of suicides in 2007 and 2017. The starting point is the scatter diagram, which plots the pairs of values on a graph and shows whether the points are clustered closely or widely scattered. The scatter diagram for the data shows that the points are clustered closely together. Once the scatter diagram has been made, the next step is the calculation of the correlation, which shows the direction and strength of the relationship between the two variables. The result is 0.9125, which shows a strong positive correlation between the two variables considered. The next step is to develop a model that helps us to predict one variable on the basis of the other. For this analysis, the number of suicides in 2017 has been predicted on the basis of the number of suicides in 2007. There are 2 important numbers in this analysis, namely the intercept, denoted by a, and the slope, denoted by b. The regression analysis has shown that a one-unit change in the independent variable brings about a change of 1.273 units in the dependent variable. If the independent variable is 0, the value of the dependent variable is 0.405. The coefficient of determination shows the explanatory power of the model. Squaring r = 0.9125 gives a coefficient of determination of 0.8327, which shows that the independent variable accounts for 83.27% of the variation in the dependent variable.
The 95% confidence interval gives the limits within which we are 95% confident that the population mean lies, based on the calculations from the sample data. The same idea can be stated in terms of the level of significance, which for this study is 5%, calculated as 100 - 95.
References
Sagepub. (2010, March). Retrieved from https://www.sagepub.com/sites/default/files/upm-binaries/35399_Module5.pdf