QUESTION
Write a 2000-Word Statistical Analysis
Topic: Smoking vs. Deaths Due to Cancer
ANSWER
Contents
INTRODUCTION
HYPOTHESIS
DATA COLLECTION
Number of Smokers Data
NUMBER OF DEATHS DUE TO CANCER
DATA ANALYSIS
MEAN
MEDIAN
STANDARD DEVIATION
RANGE
BOX PLOT
HISTOGRAM
CORRELATION
SPEARMAN’s RANK
REGRESSION
CONCLUSION
RECOMMENDATIONS
INTRODUCTION
The following report tries to establish a correlation between the number of people who smoke and the number of people who die from cancer.
Smoking, as we are all aware, becomes a habit once a person gets used to it, and it turns into an addiction because of the nicotine present in tobacco. It has many harmful effects, and cancer is one of them. Cancer is a disease caused by the abnormal growth of cells, and it can occur in any part of the body. Smoking is linked primarily to lung cancer. Although advances in medical technology have made many cancers treatable, a large number of people still die from the disease.
Cancer still kills a large number of people, smoking is one of its principal causes, and the habit is spreading widely among the younger generation. Together, these facts make this an interesting research topic.
HYPOTHESIS
Within the scope of my research, I would like to show that a relationship exists between the number of people who smoke regularly and the number who die from cancer. To test this hypothesis, I have used a secondary research methodology.
DATA COLLECTION
To test my hypothesis, I have applied a secondary research methodology. The secondary data were collected from the following website: www.OCED.org.uk. Two separate sets of data were collected: the number of smokers in each country (per 100,000 people) and the number of deaths due to cancer (per 100,000 people). To obtain a sound and logical result, data for 41 countries were taken over the period 2006-2016. The average of the 10-year data for each country was then used for further analysis. The attached Excel sheet contains all the data used in this research.
The following is a snapshot of the averaged data gathered and used for the research.
Number of Smokers Data
The data below (averages of the 2006-2016 data) show the number of people above 15 years of age who smoke, per 100,000 people, in these 41 countries.
Number of Smokers 15+ (2006-2016) per 100,000 people

LOCATION    VALUE
AUS         14420
AUT         23800
BEL         19783
BRA         12860
CAN         15463
CHE         20450
CHL         27183
COL         16833
CRI         14422
CZE         22133
DEU         21500
DNK         21556
ESP         24950
EST         25994
FIN         18236
FRA         24560
GBR         19812
GRC         34892
HUN         26383
IND         12478
IRL         20444
ISL         14379
ISR         18389
ITA         21924
JPN         21436
KOR         22020
LTU         21167
LUX         17697
LVA         26917
MEX         10206
NLD         20855
NOR         17545
NZL         16529
POL         23550
PRT         18000
RUS         36317
SVK         21367
SVN         18917
SWE         12894
TUR         27444
USA         14439
ZAF         19667
Graphically it could be presented as:
NUMBER OF DEATHS DUE TO CANCER
The data below (averages of the 2006-2016 data) show the number of people above 15 years of age who die from cancer, per 100,000 people, in these 41 countries.

LOCATION    VALUE
AUS         203.04
AUT         208.84
BEL         219.30
BRA         164.25
CAN         217.80
CHE         188.50
CHL         206.45
COL         166.91
CRI         169.72
CZE         251.66
DEU         211.92
DNK         253.77
ESP         203.04
EST         258.75
FIN         185.12
FRA         215.29
GBR         231.88
GRC         205.95
HUN         302.66
IRL         240.37
ISL         209.84
ISR         189.73
ITA         213.95
JPN         192.31
KOR         202.20
LTU         251.32
LUX         211.37
LVA         266.20
MEX         121.35
NLD         241.73
NOR         212.74
NZL         222.25
POL         258.98
PRT         205.41
RUS         227.60
SVK         267.11
SVN         264.65
SWE         194.27
TUR         163.30
USA         201.03
ZAF         198.03
Graphically it could be presented as:
DATA ANALYSIS
The following statistical analyses were carried out on these data and are explained below.
MEAN
The mean gives the average number of smokers per 100,000 people across these countries. All calculations are shown in the attached Excel sheet. For the smokers data the mean is 20472, which means that on average approximately 20.47% of the population are smokers.
Similarly, for the second data set the mean is 215.14, which means that on average approximately 0.215% of the population die from cancer.
MEDIAN
The median is the middle value of the sorted data and is less sensitive to extreme values than the mean, so it can give a truer picture of a typical country. For the smokers data the median is 20447, which means that in a typical country approximately 20.45% of the population are smokers.
Similarly, for the second data set the median is 211.37, which means that in a typical country approximately 0.211% of the population die from cancer.
STANDARD DEVIATION
Standard deviation is one of the most commonly used statistical tools and helps in understanding the dispersion of the data. The standard deviation of the smokers data is 5512, roughly 27% of the mean, which indicates that the values are widely spread. In other words, the number of smokers is considerably higher in some countries than in others.
For the data on deaths due to cancer, the standard deviation is 34.72, about 16% of the mean, which indicates that the values for most countries are relatively close to each other.
RANGE
The range is the difference between the highest and lowest values in the data. The larger the range, the more widely the data are spread. For the first data set the tabulated values run from 10206 (Mexico) to 36317 (Russia), giving a range of 26111, which is a large value.
For the second data set the values run from 121.35 (Mexico) to 302.66 (Hungary), giving a range of 181.31. Relative to the size of the values, the deaths data are therefore more tightly clustered than the smokers data.
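The four summary measures above can also be recomputed outside Excel. The following is a minimal Python sketch, assuming the cancer deaths values are copied by hand from the table in this report. The variable names are illustrative and not part of the original spreadsheet, and the use of the sample standard deviation (dividing by n - 1) is an assumption about the convention used in the Excel sheet.

```python
from statistics import mean, median, stdev

# Cancer deaths per 100,000 people (2006-2016 averages), copied from the table above.
deaths = [
    203.04, 208.84, 219.30, 164.25, 217.80, 188.50, 206.45, 166.91, 169.72,
    251.66, 211.92, 253.77, 203.04, 258.75, 185.12, 215.29, 231.88, 205.95,
    302.66, 240.37, 209.84, 189.73, 213.95, 192.31, 202.20, 251.32, 211.37,
    266.20, 121.35, 241.73, 212.74, 222.25, 258.98, 205.41, 227.60, 267.11,
    264.65, 194.27, 163.30, 201.03, 198.03,
]

avg = mean(deaths)                        # arithmetic mean
mid = median(deaths)                      # middle value of the sorted data
sd = stdev(deaths)                        # sample standard deviation (n - 1)
value_range = max(deaths) - min(deaths)   # highest value minus lowest value

print(f"mean   = {avg:.2f}")
print(f"median = {mid:.2f}")
print(f"st dev = {sd:.2f}")
print(f"range  = {value_range:.2f}")
```

The same four lines can be applied to the smokers values by swapping in that list.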
BOX PLOT
To understand the data better, we also used box plots. A box plot is a way to visualise a data series through five specific numbers: the minimum, the first quartile, the median, the third quartile and the maximum. It also helps to identify any outliers present in the data. Below is the box plot of the smokers data, showing these five numbers and the outliers.
Similarly, a box plot was drawn for the deaths due to cancer data, which is shown below.
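The five numbers behind a box plot can be sketched in Python as follows. This is an illustration only: the "inclusive" quartile method and the common 1.5 x IQR rule for flagging outliers are assumptions, and Excel's chart may use a different convention, so the quartiles and flagged points can differ slightly from the figures.

```python
from statistics import quantiles

# Cancer deaths per 100,000 people (2006-2016 averages), copied from the table above.
deaths = [
    203.04, 208.84, 219.30, 164.25, 217.80, 188.50, 206.45, 166.91, 169.72,
    251.66, 211.92, 253.77, 203.04, 258.75, 185.12, 215.29, 231.88, 205.95,
    302.66, 240.37, 209.84, 189.73, 213.95, 192.31, 202.20, 251.32, 211.37,
    266.20, 121.35, 241.73, 212.74, 222.25, 258.98, 205.41, 227.60, 267.11,
    264.65, 194.27, 163.30, 201.03, 198.03,
]

# Quartiles; method="inclusive" treats the data as the whole population of interest.
q1, q2, q3 = quantiles(deaths, n=4, method="inclusive")
iqr = q3 - q1

# Assumed outlier rule: anything more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in deaths if v < lower_fence or v > upper_fence]

print("min, Q1, median, Q3, max =", min(deaths), q1, q2, q3, max(deaths))
print("values outside the fences:", outliers)
```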
HISTOGRAM
A histogram is another tool that helps us analyse the data. It shows how many countries fall within each range (bin) of values. The histograms for the two data series are given below.
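The counting behind a histogram can be sketched in a few lines of Python. The bin width of 25 deaths per 100,000 is a hypothetical choice made for this illustration; the bins used in the Excel charts are not stated in the report.

```python
from collections import Counter

# Cancer deaths per 100,000 people (2006-2016 averages), copied from the table above.
deaths = [
    203.04, 208.84, 219.30, 164.25, 217.80, 188.50, 206.45, 166.91, 169.72,
    251.66, 211.92, 253.77, 203.04, 258.75, 185.12, 215.29, 231.88, 205.95,
    302.66, 240.37, 209.84, 189.73, 213.95, 192.31, 202.20, 251.32, 211.37,
    266.20, 121.35, 241.73, 212.74, 222.25, 258.98, 205.41, 227.60, 267.11,
    264.65, 194.27, 163.30, 201.03, 198.03,
]

bin_width = 25  # assumed bin width, not taken from the report

# Map every value to the lower edge of its bin and count countries per bin.
counts = Counter(int(v // bin_width) * bin_width for v in deaths)

# Text histogram: one '#' per country in the bin.
for start in sorted(counts):
    print(f"{start:>3} to {start + bin_width:<3} {'#' * counts[start]}")
```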
CORRELATION
In statistics, correlation is a technique that helps us understand the relationship, or dependence, between one variable and another. Here the number of smokers is one variable and the deaths due to cancer is the other. A correlation test gave the following result:

           Smokers     Deaths
Smokers    1           0.413215
Deaths     0.413215    1
This test suggests that there is a positive linear relationship between the two variables, number of smokers and deaths due to cancer. A correlation of 0.413 is moderate in strength: countries with more smokers tend to have more deaths due to cancer, and countries with fewer smokers tend to have fewer. Correlation alone does not show that one causes the other.
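The correlation coefficient can be reproduced from its definition. The sketch below is a hand-written Python version, not the Excel procedure used for the report. It assumes the two tables are paired by country code and that India, which appears only in the smokers table, is left out; the function name `pearson_r` is illustrative.

```python
from math import sqrt

# (smokers per 100,000, cancer deaths per 100,000), copied from the two tables above.
# India is omitted because it has no cancer deaths value (assumed pairing).
data = {
    "AUS": (14420, 203.04), "AUT": (23800, 208.84), "BEL": (19783, 219.30),
    "BRA": (12860, 164.25), "CAN": (15463, 217.80), "CHE": (20450, 188.50),
    "CHL": (27183, 206.45), "COL": (16833, 166.91), "CRI": (14422, 169.72),
    "CZE": (22133, 251.66), "DEU": (21500, 211.92), "DNK": (21556, 253.77),
    "ESP": (24950, 203.04), "EST": (25994, 258.75), "FIN": (18236, 185.12),
    "FRA": (24560, 215.29), "GBR": (19812, 231.88), "GRC": (34892, 205.95),
    "HUN": (26383, 302.66), "IRL": (20444, 240.37), "ISL": (14379, 209.84),
    "ISR": (18389, 189.73), "ITA": (21924, 213.95), "JPN": (21436, 192.31),
    "KOR": (22020, 202.20), "LTU": (21167, 251.32), "LUX": (17697, 211.37),
    "LVA": (26917, 266.20), "MEX": (10206, 121.35), "NLD": (20855, 241.73),
    "NOR": (17545, 212.74), "NZL": (16529, 222.25), "POL": (23550, 258.98),
    "PRT": (18000, 205.41), "RUS": (36317, 227.60), "SVK": (21367, 267.11),
    "SVN": (18917, 264.65), "SWE": (12894, 194.27), "TUR": (27444, 163.30),
    "USA": (14439, 201.03), "ZAF": (19667, 198.03),
}

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mean_x) ** 2 for a in x)                       # variation in x
    syy = sum((b - mean_y) ** 2 for b in y)                       # variation in y
    return sxy / sqrt(sxx * syy)

smokers = [pair[0] for pair in data.values()]
deaths = [pair[1] for pair in data.values()]
r = pearson_r(smokers, deaths)
print(f"Pearson r = {r:.4f}")
```

With the 41 paired values this should come out close to the 0.413 reported in the table.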
SPEARMAN’s RANK
Spearman’s rank correlation is generally used when the data contain extreme values or are unevenly spread, which can distort the association measured between the two variables. It is the correlation of the ranks of the data rather than of the values themselves.
To carry out the analysis, the smokers data and the deaths due to cancer data were each ranked in ascending order using the ‘RANK.AVG’ function in Excel. Spearman’s Rho was then calculated by applying the ‘CORREL’ function to the ranks, which gives:
Spearman’s Rho    0.416202
This result agrees with Pearson’s correlation. It indicates that the data do not contain many outliers and that any outliers present have little effect on the result. This can also be checked graphically in the scatter plot below, where outliers are visible but do not cause much deviation in our results, as is also shown in the Excel sheet.
The data for Russia, Greece and Mexico are outliers and were therefore excluded in a second calculation. When Pearson’s correlation is recalculated without them, the value is 0.432, which is close to the previous result. Hence the outliers do not make much of a difference.
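The two checks described in this section, ranking followed by a correlation of the ranks and a Pearson correlation with the three named countries removed, can be sketched in Python as follows. This is an illustration under assumptions: the helper `average_ranks` is a hypothetical stand-in for Excel's RANK.AVG in ascending order, and the pairing by country code (without India) is assumed.

```python
from math import sqrt

# (smokers per 100,000, cancer deaths per 100,000), copied from the two tables above.
data = {
    "AUS": (14420, 203.04), "AUT": (23800, 208.84), "BEL": (19783, 219.30),
    "BRA": (12860, 164.25), "CAN": (15463, 217.80), "CHE": (20450, 188.50),
    "CHL": (27183, 206.45), "COL": (16833, 166.91), "CRI": (14422, 169.72),
    "CZE": (22133, 251.66), "DEU": (21500, 211.92), "DNK": (21556, 253.77),
    "ESP": (24950, 203.04), "EST": (25994, 258.75), "FIN": (18236, 185.12),
    "FRA": (24560, 215.29), "GBR": (19812, 231.88), "GRC": (34892, 205.95),
    "HUN": (26383, 302.66), "IRL": (20444, 240.37), "ISL": (14379, 209.84),
    "ISR": (18389, 189.73), "ITA": (21924, 213.95), "JPN": (21436, 192.31),
    "KOR": (22020, 202.20), "LTU": (21167, 251.32), "LUX": (17697, 211.37),
    "LVA": (26917, 266.20), "MEX": (10206, 121.35), "NLD": (20855, 241.73),
    "NOR": (17545, 212.74), "NZL": (16529, 222.25), "POL": (23550, 258.98),
    "PRT": (18000, 205.41), "RUS": (36317, 227.60), "SVK": (21367, 267.11),
    "SVN": (18917, 264.65), "SWE": (12894, 194.27), "TUR": (27444, 163.30),
    "USA": (14439, 201.03), "ZAF": (19667, 198.03),
}

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

smokers = [pair[0] for pair in data.values()]
deaths = [pair[1] for pair in data.values()]

# Spearman's rho is the Pearson correlation of the two rank series.
rho = pearson_r(average_ranks(smokers), average_ranks(deaths))

# Pearson correlation again, leaving out the three countries named as outliers.
excluded = {"RUS", "GRC", "MEX"}
kept = [pair for code, pair in data.items() if code not in excluded]
r_trimmed = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(f"Spearman rho               = {rho:.4f}")
print(f"Pearson r without outliers = {r_trimmed:.4f}")
```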
REGRESSION
In statistics, regression analysis is one of the most commonly used techniques for understanding the association between variables. In a regression analysis, one variable is treated as dependent and the other as independent. Here we have taken the number of smokers as the independent variable and the deaths due to cancer as the dependent variable. Under these assumptions, Excel produced the following output.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.413215147
R Square             0.170746757
Adjusted R Square    0.149483854
Standard Error       32.0180677
Observations         41

ANOVA
              df    SS             MS             F              Significance F
Regression    1     8232.279948    8232.279948    8.030265299    0.007248193
Residual      39    39981.10971    1025.156659
Total         40    48213.38966

                Coefficients    Standard Error    t Stat         P-value        Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept       160.6487338     19.86762485       8.085955669    7.27106E-10    120.4626694    200.8347982    120.4626694    200.8347982
X Variable 1    0.002636511     0.000930389       2.833772274    0.007248193    0.000754621    0.0045184      0.000754621    0.0045184
The following line plot was also observed:
From the above result, the R Square value is 0.17, which is small: only about 17% of the variation in cancer deaths between countries is explained by the number of smokers, and the actual data points lie far from the regression line. This indicates a weak relationship between the two variables, even though the slope coefficient is statistically significant (P-value 0.007). This conclusion should be treated with caution, because simple linear regression has its own limitations.
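The slope, intercept and R Square in the regression output can be recomputed with ordinary least squares. The sketch below is a plain Python version written for illustration, not Excel's regression tool. It assumes the same country pairing as the tables in this report (India omitted), and `fit_line` is a hypothetical helper name.

```python
# (smokers per 100,000, cancer deaths per 100,000), copied from the two tables above.
data = {
    "AUS": (14420, 203.04), "AUT": (23800, 208.84), "BEL": (19783, 219.30),
    "BRA": (12860, 164.25), "CAN": (15463, 217.80), "CHE": (20450, 188.50),
    "CHL": (27183, 206.45), "COL": (16833, 166.91), "CRI": (14422, 169.72),
    "CZE": (22133, 251.66), "DEU": (21500, 211.92), "DNK": (21556, 253.77),
    "ESP": (24950, 203.04), "EST": (25994, 258.75), "FIN": (18236, 185.12),
    "FRA": (24560, 215.29), "GBR": (19812, 231.88), "GRC": (34892, 205.95),
    "HUN": (26383, 302.66), "IRL": (20444, 240.37), "ISL": (14379, 209.84),
    "ISR": (18389, 189.73), "ITA": (21924, 213.95), "JPN": (21436, 192.31),
    "KOR": (22020, 202.20), "LTU": (21167, 251.32), "LUX": (17697, 211.37),
    "LVA": (26917, 266.20), "MEX": (10206, 121.35), "NLD": (20855, 241.73),
    "NOR": (17545, 212.74), "NZL": (16529, 222.25), "POL": (23550, 258.98),
    "PRT": (18000, 205.41), "RUS": (36317, 227.60), "SVK": (21367, 267.11),
    "SVN": (18917, 264.65), "SWE": (12894, 194.27), "TUR": (27444, 163.30),
    "USA": (14439, 201.03), "ZAF": (19667, 198.03),
}

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus R squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (unexplained variation / total variation).
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1 - ss_resid / ss_total

smokers = [pair[0] for pair in data.values()]   # independent variable
deaths = [pair[1] for pair in data.values()]    # dependent variable
slope, intercept, r_squared = fit_line(smokers, deaths)

print(f"slope     = {slope:.6f}")
print(f"intercept = {intercept:.2f}")
print(f"R squared = {r_squared:.4f}")
```

Read this way, the fitted slope means that each additional 1,000 smokers per 100,000 people is associated with roughly 2.6 more cancer deaths per 100,000.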
CONCLUSION
From the results of the different statistical tools, it has been very hard to establish that a strong correlation exists between the two data sets. Since the data were collected from a secondary source, there may be data issues as well, so data collection is one of the limitations of this research. The statistical methods employed also have their own limitations.
The data show a lot of variation, which has made it very hard to prove our hypothesis. Although we observed a positive correlation between the two data sets, the R Square value was very low, which forces us to reject our hypothesis. So, we can conclude that although smoking is a bad habit, the number of smokers explains only a small share of the deaths due to cancer in these data.
RECOMMENDATIONS
The above analysis was done with the data available on the website. These data had certain drawbacks and were not directly applicable to our research. Our research question was not restricted to any age group, but the data were limited to a particular age group. In addition, the cancer death data and the smoking data for different countries did not cover the same years (as can be seen in the Excel sheet), so we had to work with each country’s mean over the time frame. All these issues contributed to the poor outcome of our regression. They could have been avoided had we relied on primary research rather than secondary research. Our research would have yielded more meaningful results had the quality of the data, including the secondary data, been better.
More advanced statistical tools such as logarithmic regression and logistic regression would also have yielded a better result had our data set contained more independent variables, such as the number of male and female smokers and other variables related to our core issue.