Regression Analysis

Inference in regression analysis mainly involves finding the relationship between the dependent variable Y and one or more independent variables X. In the regression analysis of total drug sales across the United States, the independent variables (hereafter referred to as X1, X2, ..., X5) are five categories of data measured in all fifty states: income per capita, percentage of uninsured population, percentage of people over 65 years old, prescriptions filled per capita, and unemployment rate. The goal of the analysis is to reveal the connection between these five predictors (explanatory variables) and the dependent variable, total drug sales. In other words, the task is to determine (and comment on) the degree to which each predictor influences total sales, and whether this influence is direct or inverse.

The most important value in deciding whether the predictors are actually reliable is the residual standard deviation S (the square root of the residual variance S²). The greater this value, the more error we expect to make when predicting values of the dependent variable. The ideal model, which occurs very rarely in real life, would therefore have all residuals equal to zero. However, to test separately which of the predictors contributes significantly to the value of the dependent variable, we have to turn to hypothesis testing for the β parameters.

The model for our regression analysis is Y = β0 + β1X1 + β2X2 + ... + β5X5 + e. To examine each independent variable for reliability, we have to establish that its coefficient β is not equal to zero. Estimating this coefficient and testing the hypothesis that it equals zero relies on the summary ANOVA table. Statistical software can easily compute the values included in the ANOVA table. The main values involved are SST, SSE, SSR, MSR, and MSE.
These five stand for the total sum of squares (since this value depends only on Y, it is equal to SSy), the sum of squares for error, the sum of squares for regression, the mean square for regression, and the mean square for error, respectively. The so-called F-value for the entire analysis is the starting point for judging the reliability of the predictors. It is calculated as the ratio of MSR to MSE, which in our particular case equals 3.69. Since our problem is to determine the level of contribution of each independent variable, we apply hypothesis testing with a statistic of the same form, F = MSR/MSE. Each of the five tests compares the value F* computed for a single predictor with the critical value from the F-distribution table. The steps are to formulate the null and alternative hypotheses, compute the test statistic, look up the critical value in the appropriate F-distribution table, compare the two, and finally reject or fail to reject the null hypothesis. If the computed F* for a particular predictor is greater than the critical value, we reject the null hypothesis and infer that this independent variable does contribute to the value of Y, which means it has significant predictive ability.

The other route to determining predictive ability is to apply the t* statistic together with the t-distribution table. By dividing the estimated coefficient of an individual independent variable by its standard error, we can test whether the contribution of this variable is significant or not. Assuming that the significance level of all five hypothesis tests is 0.05, we can determine the critical value t against which the computed values t* will be compared. Once the values of t* are calculated for each predictor separately, the picture becomes clearer. The critical value t, taken from the t-distribution table, is equal to 1.684.
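As an illustration of the ANOVA quantities and the ratio F = MSR/MSE discussed above, the following Python sketch fits a five-predictor model by least squares and builds the table. The data are synthetic (randomly generated), not the state-level figures analyzed in this report, and the helper names solve and anova_table are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def anova_table(rows, y):
    """rows: one list of k predictor values per observation; y: the responses."""
    n, k = len(rows), len(rows[0])
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = k + 1
    # Normal equations (X'X) beta = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # sum of squares for error
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # sum of squares for regression
    msr = ssr / k                                            # mean square for regression
    mse = sse / (n - k - 1)                                  # mean square for error
    return {"SST": sst, "SSR": ssr, "SSE": sse, "MSR": msr, "MSE": mse,
            "F": msr / mse, "df_regression": k, "df_error": n - k - 1}


# ASSUMPTION: synthetic data standing in for 50 states and 5 predictors.
random.seed(0)
n_obs, n_pred = 50, 5
true_beta = [1.5, -0.8, 0.0, 0.0, 0.6]  # hypothetical coefficients
rows = [[random.gauss(0, 1) for _ in range(n_pred)] for _ in range(n_obs)]
y = [2.0 + sum(b * v for b, v in zip(true_beta, r)) + random.gauss(0, 1) for r in rows]

table = anova_table(rows, y)
for name, value in table.items():
    print(f"{name:>14}: {value:.4f}")
```

For a least-squares fit with an intercept, SST equals SSR plus SSE, which is a quick sanity check on any ANOVA table. The resulting F is then compared with the table value for k and n – k – 1 degrees of freedom.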
The critical value of 1.684 is the table entry for 40 degrees of freedom, the closest tabulated number; the actual number of degrees of freedom is n – k – 1 = 44, where n is the sample size and k is the number of independent variables. The predictor that contributes the least to the final value of Y is the percentage of the population older than 65. Since the t* of this predictor, 0.77, is much less than t (1.684), we fail to reject the null hypothesis, and this independent variable shows no significant influence on total drug sales in a state. Another independent variable that does not qualify as a predictor is retail prescriptions filled per capita. Since the t* value for this predictor is also less than t, we fail to reject the null hypothesis. Both the percentage of elderly people in a state's population and prescriptions filled per capita in the same state can be excluded from the list of predictors. If we try to predict total sales in a given state based on the five predictors we have, we would want to omit these two variables, for they would only distort the predicted values of the dependent variable.

As for the other three predictors, namely unemployment rate, percentage of uninsured population, and income per capita, their contributions proved to be significant. According to the statistical data (and logic), the most predictive independent variable is income per capita. Both percentage of uninsured population and unemployment rate also contribute significantly to total drug sales.

Therefore, the statistical inference for this particular regression model is that only three of the five independent variables proved to be significant predictors. Prescriptions filled per capita and the percentage of elderly people can be removed from the analysis, for they skew the predicted values. The most reliable predictor turns out to be income per capita.
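The t-test procedure described above (coefficient divided by its standard error, compared with the table value 1.684) can be sketched in Python as follows. This is a minimal illustration on synthetic data, not a reproduction of the analysis in this report: the generated values, the assumed true coefficients, and the helper names invert and t_statistics are all hypothetical, and comparing the absolute value of t* with the table value is an assumption about how the test is applied.

```python
import math
import random


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_statistics(rows, y):
    """Return (t* for each predictor, degrees of freedom) for an OLS fit with intercept."""
    n, k = len(rows), len(rows[0])
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = k + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    dof = n - k - 1
    mse = sse / dof  # residual variance S^2
    std_err = [math.sqrt(mse * inv[i][i]) for i in range(p)]
    t_all = [b / s for b, s in zip(beta, std_err)]
    return t_all[1:], dof  # drop the intercept's t*


# ASSUMPTION: synthetic data; the coefficients below are invented for illustration.
names = ["income per capita", "percent uninsured", "percent over 65",
         "prescriptions per capita", "unemployment rate"]
true_beta = [1.5, -0.8, 0.0, 0.0, 0.6]
random.seed(1)
rows = [[random.gauss(0, 1) for _ in names] for _ in range(50)]
y = [2.0 + sum(b * v for b, v in zip(true_beta, r)) + random.gauss(0, 1) for r in rows]

T_CRITICAL = 1.684  # table value quoted in the text (entry for 40 degrees of freedom)
t_stats, df = t_statistics(rows, y)
print(f"degrees of freedom: {df}")
for name, t_star in zip(names, t_stats):
    # ASSUMPTION: |t*| is compared with the table value; the sign gives the direction.
    verdict = "reject H0" if abs(t_star) > T_CRITICAL else "fail to reject H0"
    print(f"{name:>26}: t* = {t_star:7.3f} -> {verdict}")
```

The sign of each t* indicates whether the estimated influence is direct or inverse, while its magnitude relative to the table value decides whether the predictor is retained.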