predict()
Description
Calculates linear predictions, various residuals, leverage statistics, distance statistics, and other estimates. It is included as a method in the classes that support it, and is also available as a standalone function, predict().
Classes that include predict
ols
anova
Parameters
Input
predict(mdl_data = {}, estimate = None)
mdl_data : Fitted model object from the ols or anova class
estimate : Desired estimate. Available options are:
"y" or "xb" : Linear prediction
"residuals", "res", or "r" : Residuals
"standardized_residuals", "standardized_r", or "rstand" : Standardized residuals
"studentized_residuals", "student_r", or "rstud" : Studentized (jackknifed) residuals
"leverage" or "lev" : Leverage of the observation (diagonal of the H matrix)
Returns
NumPy array of shape (n, 1), where n is the number of observations
Formulas
Note
Y is the dependent variable array/vector
X is the independent variable design matrix
H is the hat matrix
\(^T\) indicates a transpose, i.e. \(X^T\) is the transpose of X
Linear prediction [1]
The linear prediction is calculated as:
\[\hat{Y} = Xb = X(X^TX)^{-1}X^TY = HY\]
where b is the vector of estimated coefficients.
Residuals [1]
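Residuals are calculated as the observed values minus the linear prediction:
\[e = Y - \hat{Y}\]
where \(\hat{Y}\) is the linear prediction.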
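As a rough illustration of the linear prediction and the residuals, the following NumPy sketch fits ordinary least squares with `numpy.linalg.lstsq`. The data set and variable names are made up for this example and are not taken from researchpy's internals.

```python
import numpy as np

# Hypothetical data: 6 observations, an intercept column plus one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([[2.1], [3.9], [6.2], [7.8], [10.1], [12.2]])  # n x 1
X = np.column_stack([np.ones_like(x), x])                   # n x 2 design matrix

# Least-squares coefficients b, then the linear prediction X b.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
y_hat = X @ b              # linear prediction, shape (n, 1)
residuals = Y - y_hat      # observed minus predicted, shape (n, 1)

print(y_hat.shape, residuals.shape)
```

Both arrays keep the n x 1 layout described under Returns, so they can be compared row by row with the output of predict().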
Standardized residuals [1]
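Standardized residuals are calculated using the following formula:
\[r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}\]
where \(e_i\) is the residual of observation i, MSE is the mean squared error of the model, and \(h_{ii}\) is the leverage of observation i.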
Studentized (jackknifed) residuals [1]
Studentized (jackknifed) residuals are calculated using the following formula:
\[t_i = r_i \sqrt{\frac{n - k - 2}{n - k - 1 - r_i^2}}\]
where:
n = number of observations
k = number of predictors
r = standardized residual
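The relationship between the two kinds of residuals can be sketched in NumPy as follows. This assumes the common definition in which the standardized residual divides each residual by the square root of MSE times one minus its leverage; the simulated data and all names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 2                                    # observations, predictors

# Hypothetical design matrix (intercept + k predictors) and response.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([[1.0], [2.0], [-1.0]]) + rng.normal(size=(n, 1))

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H).reshape(-1, 1)                   # leverage, shape (n, 1)
e = Y - H @ Y                                   # residuals, shape (n, 1)
mse = (e ** 2).sum() / (n - k - 1)              # mean squared error

# Standardized residuals (assumed definition, see lead-in).
r = e / np.sqrt(mse * (1 - h))

# Studentized (jackknifed) residuals computed from r, n, and k.
t = r * np.sqrt((n - k - 2) / (n - k - 1 - r ** 2))
```

Computing t from r this way avoids refitting the model n times with one observation left out each time.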
Leverage [1]
The leverage of an observation is the corresponding diagonal element of the H (hat) matrix. The H matrix is calculated using:
\[H = X(X^TX)^{-1}X^T\]
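A short NumPy sketch of the leverage calculation, using a made-up predictor in which the last observation lies far from the rest. The explicit matrix inverse is used here only to mirror the textbook definition of the hat matrix; it is not meant to describe how researchpy computes it.

```python
import numpy as np

# Hypothetical predictor: the last value is far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])    # intercept + one predictor

# Hat matrix and its diagonal (the leverage of each observation).
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H).reshape(-1, 1)         # shape (n, 1)

print(leverage.round(4))
```

The leverages sum to the number of columns in X (here 2), and the outlying last observation has by far the largest value.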
Examples
First, load the libraries required for this example. The example data set, loaded below using statsmodels.datasets, is the 'systolic' data set available through Stata.
import researchpy as rp
import pandas as pd
# Used to load example data #
import statsmodels.datasets
systolic = statsmodels.datasets.webuse('systolic')
Now let’s get some quick information regarding the data set.
systolic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 0 to 57
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   drug      58 non-null     int16
 1   disease   58 non-null     int16
 2   systolic  58 non-null     int16
The output indicates that there are no missing observations and that each variable is stored as an integer. Now take a look at the descriptive statistics of each variable.
rp.summarize(systolic["systolic"])
| | Name | N | Mean | Median | Variance | SD | SE | 95% Conf. Interval |
|---|---|---|---|---|---|---|---|---|
| 0 | systolic | 58 | 18.8793 | 21 | 163.862 | 12.8009 | 1.6808 | [15.5135, 22.2451] |
rp.summary_cat(systolic[["drug", "disease"]])
| | Variable | Outcome | Count | Percent |
|---|---|---|---|---|
| 0 | drug | 4 | 16 | 27.59 |
| 1 | | 2 | 15 | 25.86 |
| 2 | | 1 | 15 | 25.86 |
| 3 | | 3 | 12 | 20.69 |
| 4 | disease | 3 | 20 | 34.48 |
| 5 | | 2 | 19 | 32.76 |
| 6 | | 1 | 19 | 32.76 |
# Fit a two-way ANOVA with interaction using Type III sums of squares
m = rp.anova("systolic ~ C(drug) + C(disease) + C(drug):C(disease)", data = systolic, sum_of_squares = 3)
desc, table = m.results()
print(desc, table, sep = "\n"*2)
Note: Effect size values for factors are partial.
| | |
|---|---|
| Number of obs = | 58.0000 |
| Root MSE = | 10.5096 |
| R-squared = | 0.4560 |
| Adj R-squared = | 0.3259 |
| Source | Sum of Squares | Degrees of Freedom | Mean Squares | F value | p-value | Eta squared | Omega squared |
|---|---|---|---|---|---|---|---|
| Model | 4,259.3385 | 11 | 387.2126 | 3.5057 | 0.0013 | 0.4560 | 0.3221 |
| drug | 2,997.4719 | 3.0000 | 999.1573 | 9.0460 | 0.0001 | 0.3711 | 0.2939 |
| disease | 415.8730 | 2.0000 | 207.9365 | 1.8826 | 0.1637 | 0.0757 | 0.0295 |
| drug:disease | 707.2663 | 6.0000 | 117.8777 | 1.0672 | 0.3958 | 0.1222 | 0.0069 |
| Residual | 5,080.8167 | 46 | 110.4525 | | | | |
| Total | 9,340.1552 | 57 | 163.8624 | | | | |
m.predict(estimate="r")[:10]
array([[ 12.6667],
[ 14.6667],
[ 6.6667],
[-16.3333],
[-10.3333],
[ -7.3333],
[ 4.75 ],
[ -2.25 ],
[ 4.75 ],
[ -7.25 ]])
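The repeating fractional parts in these residuals are what one would expect from a model containing both main effects and their interaction: when every drug-by-disease cell has observations, such a model fits each cell with its own mean, so each residual is the observation minus its cell mean. The NumPy sketch below illustrates that idea with made-up values (not the systolic data); all names are hypothetical.

```python
import numpy as np

# Hypothetical two-factor data; values are illustrative only.
drug    = np.array([1, 1, 1, 2, 2, 2, 1, 1, 2, 2])
disease = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
y       = np.array([10.0, 12.0, 17.0, 5.0, 9.0, 4.0, 20.0, 25.0, 8.0, 11.0])

# Label each observation with its (drug, disease) cell.
labels = np.array([f"{d}-{s}" for d, s in zip(drug, disease)])
cells, cell_idx = np.unique(labels, return_inverse=True)

# Mean of y within each cell, then residual = observation - cell mean.
cell_means = np.bincount(cell_idx, weights=y) / np.bincount(cell_idx)
residuals = (y - cell_means[cell_idx]).reshape(-1, 1)   # shape (n, 1)

print(residuals.round(4))
```

Within each cell the residuals sum to zero, which is a quick sanity check when comparing against the output of m.predict(estimate="r").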