Title: | Nonparametric Statistical Methods |
---|---|
Description: | Accompanies the book "Nonparametric Statistical Methods Using R, 2nd Edition" by Kloke and McKean (2024, ISBN:9780367651350). Includes methods, datasets, and random number generation useful for the study of robust and/or nonparametric statistics. Emphasizes classical nonparametric methods for a variety of designs --- especially one-sample and two-sample problems. Includes methods for general scores, including estimation and testing for the two-sample location problem as well as Hogg's adaptive method. |
Authors: | John Kloke [aut, cre], Joseph McKean [aut] |
Maintainer: | John Kloke <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.0.0 |
Built: | 2025-03-10 03:18:13 UTC |
Source: | https://github.com/kloke/npsm |
This is a simulated data set used as an example of analysis of covariance. The data frame acov231 contains the data. The responses are in column 1, column 2 contains the levels of factor A, column 3 contains the levels of factor B, and column 4 contains the covariate. All true parameters (effects) are 0 in this generated data set.
data(acov231)
A data frame with 33 observations and 4 variables.
response
numeric. the response.
fA
numeric. factor A with 2 levels.
fB
numeric. factor B with 3 levels.
covariate
numeric. a covariate.
Kloke, J. and McKean J.W. (2014), Nonparametric Statistical Methods using R, Boca Raton, FL: Chapman-Hall.
levs = c(2,3)
data = acov231[,1:3]
xcov = matrix(acov231[,4],ncol=1)
temp = kancova(levs,data,xcov)
Aligned rank test for a group/treatment effect after adjusting for covariates.
aligned.test(x, y, g, scores = Rfit::wscores,...)
x |
n by p design matrix |
y |
n by 1 response vector |
g |
n by 1 vector denoting group/treatment membership. |
scores |
Which scores should be used for the fit and the test. An object of class scores. |
... |
optional arguments. passed to rfit. |
Data are aligned based on the design matrix x using a rank-based fit via rfit.
statistic |
The value of the test statistic. |
p.value |
The p-value based on a chisq(k-1) distribution where k is the number of groups/treatments. |
John Kloke
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
y <- rt(30,2)
x <- runif(30)
g <- rep(1:3,each=10)
aligned.test(x,y,g)
Demographics and position information on 1000 randomly selected baseball players who debuted after 1945.
data("baseball_players1000")
A data frame with 1000 observations on the following 28 variables.
playerID
a character vector
birthYear
a numeric vector
birthMonth
a numeric vector
birthDay
a numeric vector
birthCountry
a character vector
birthState
a character vector
nameFirst
a character vector
nameLast
a character vector
weight
a numeric vector
height
a numeric vector
bats
a character vector
throws
a character vector
debutYear
a numeric vector
G_all
a numeric vector
G_p
a numeric vector
G_c
a numeric vector
G_1b
a numeric vector
G_2b
a numeric vector
G_3b
a numeric vector
G_ss
a numeric vector
G_lf
a numeric vector
G_cf
a numeric vector
G_rf
a numeric vector
G_of
a numeric vector
G_dh
a numeric vector
G_ph
a numeric vector
G_pr
a numeric vector
pitcher
a logical vector
A random subset of baseball players who debuted after 1945 and played in at least 160 games. Includes information on birth (date and location); height (inches) and weight (pounds); whether they bat left (L), right (R), or switch (B); and games played at each position. The variable pitcher is a derived variable indicating whether the majority of games were played as a pitcher (i.e., G_p/G_all > 0.5).
https://github.com/chadwickbureau/baseballdatabank
https://github.com/chadwickbureau/baseballdatabank/blob/master/readme2014.txt
data(baseball_players1000)
hist(baseball_players1000$weight, xlab="Weight (lbs)", probability=TRUE, ylim=c(0,0.02), main="Histogram of Weight for 1000 Baseball Players")
lines(density(baseball_players1000$weight, na.rm=TRUE))
Batting (average, home runs, RBIs) statistics for 2010 full time players. By full time we mean that the batter had at least 450 official at bats during the season.
data(bb2010)
A data frame with 122 observations on the following 3 variables.
ave
batting average
hr
home runs
rbi
runs batted in
baseballguru.com
plot(hr~ave,data=bb2010)
Data table from Table 9.11 of Hollander and Wolfe (1999). The data consists of triglyceride levels on 13 patients. Two factors, each at two levels, were recorded: Sex and Obesity. The concomitant variables are chylomicrons, age, and three lipid variables (very low-density lipoproteins (VLDL), low-density lipoproteins (LDL), and high-density lipoproteins (HDL)).
data(blood.plasma)
A data frame with 13 observations on 8 variables.
Total
Triglyceride level, response
Sex
Sex, 2 levels
Obese
Obesity, 2 levels
Chylo
Chylomicrons, covariate
VLDL
Very low density, lipids, covariate
LDL
Low density, lipids, covariate
HDL
High density, lipids, covariate
Age
Age
Hollander, M. and Wolfe, D.A. (1999), Nonparametric Statistical Methods, New York: Wiley.
data(blood.plasma)
plot(Total~Age,data=blood.plasma)
boxplot(Total~Obese,data=blood.plasma)
Basic summaries of box scores for the 1982 season of the Major League Baseball team Milwaukee (WI) Brewers. The Brewers won the American League championship that year, and Brewers player Robin Yount won the Most Valuable Player (MVP) award.
data("brewers1982")
A data frame with 163 observations on the following 8 variables.
Date
a character vector
Opp
a character vector
R
a numeric vector
RA
a numeric vector
Time
a character vector
Attendance
a numeric vector
home
a logical vector
win
a logical vector
data(brewers1982)
# proportion of wins for a given number of runs scored
pwin <- with(brewers1982,tapply(win,R,mean))
pwin
# graphical display of the above
plot(names(pwin),pwin,xlab='Runs', ylab='Proportion of Wins',main='Brewers 1982')
Survival times (in days) for undergoing standard treatment (S) and a new treatment (N).
data("cancertrt")
A data frame with 17 observations on the following 3 variables.
time
Survival time in days
event
Indicator for event
trt
a factor with levels N and S
Higgins (2004), Introduction to Modern Nonparametric Statistics, Pacific Grove, CA:Brooks/Cole–Thomson Learning
data(cancertrt)
with(cancertrt,gehan.test(time,event,trt))
Centers a matrix.
centerx(x)
x |
a matrix |
Returns a centered matrix, i.e., each column of the matrix is replaced by deviations from its column mean.
The centered matrix.
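
The centering described above can be reproduced in base R. The following is a minimal sketch, not the package's own code; the name xc_sketch is hypothetical and used only for illustration.

```r
# Base-R sketch of column centering (an illustration, not centerx itself).
x <- cbind(seq(1, 5, length.out = 5), seq(10, 20, length.out = 5))

# Subtract each column's mean from every entry in that column.
xc_sketch <- sweep(x, 2, colMeans(x))

# The column means of the centered matrix are zero (up to rounding).
colMeans(xc_sketch)

# scale() with scale = FALSE performs the same centering.
all.equal(xc_sketch, scale(x, center = TRUE, scale = FALSE),
          check.attributes = FALSE)
```

Checking column means (rather than row means) is the natural way to confirm that a matrix has been centered by column.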
John Kloke, Joseph McKean
scale
x <- cbind(seq(1,5,length=5),seq(10,20,length=5))
xc <- centerx(x)
apply(xc,2,mean) # column means of the centered matrix
A regression example with response cloud point of a liquid and predictor the percent of Iodine 8 added to the liquid; see Chapter 3 of Hettmansperger and McKean (2011) or Exercise 4.9.10 of Kloke and McKean (2014)/Exercise 4.7.7 of Kloke and McKean (2024).
data(cloud)
Nineteen observations on two variables.
cloud.point
Cloud point of the liquid
I8
Percent Iodine 8 added
Draper, N.R. and Smith, H. (1966), Applied Regression Analysis, New York: John Wiley and Sons.
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
rfit(cloud.point ~ I8,data=cloud)
Returns a bootstrap confidence interval for any of the correlations available in the base R cor function.
cor.boot.ci(x, y, method = "spearman", conf = 0.95, nbs = 3000)
x |
n by 1 vector |
y |
n by 1 vector |
method |
Which correlation to use. Argument passed to cor. |
conf |
Confidence level. |
nbs |
number of bootstrap samples to base CI on. |
Obtains a percentile bootstrap confidence interval.
The bootstrap samples are obtained via the function boot.
A confidence interval.
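
A percentile bootstrap interval for a correlation can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation of cor.boot.ci: it resamples (x, y) pairs with sample.int instead of calling boot, and the simulated data and the names boot_cor and ci_sketch are hypothetical.

```r
# Percentile bootstrap CI for Spearman's correlation (base-R sketch).
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)   # simulated data, for illustration only

nbs  <- 1000              # number of bootstrap samples
conf <- 0.95              # confidence level

# Resample pairs with replacement and recompute the correlation each time.
boot_cor <- replicate(nbs, {
  idx <- sample.int(n, n, replace = TRUE)
  cor(x[idx], y[idx], method = "spearman")
})

# The percentile interval takes the empirical quantiles of the replicates.
alpha <- 1 - conf
ci_sketch <- quantile(boot_cor, c(alpha / 2, 1 - alpha / 2), names = FALSE)
ci_sketch
```

Resampling whole pairs, rather than x and y separately, preserves the dependence between the two variables that the correlation measures.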
John Kloke, Joseph McKean
cor
library(boot)
with(bb2010,cor.boot.ci(ave,hr))
A regression example with response energy output in watts and the predictor temperature difference in degrees Kelvin; see Devore (2012) and Exercise 4.9.11 of Kloke and McKean (2014)/Exercise 4.7.8 of Kloke and McKean (2024).
data(energy)
Twenty-four observations on two variables.
output
Energy output in watts
temp.diff
Temperature difference in K
Devore, J. (2012), Probability and Statistics for Engineering and the Sciences, 8th ed., Boston: Brooks/Cole.
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
rfit(output ~ temp.diff,data=energy)
The amount of time it took 22 baseball players to round first base for each of three methods of rounding.
data(firstbase)
A data frame with 22 observations on the following 3 variables.
round.out
Time when using round out method.
narrow.angle
Time when using narrow angle method.
wide.angle
Time when using wide angle method.
Rounding methods are illustrated in Figure 7.1 of Hollander and Wolfe (1999).
Hollander, M. and Wolfe, D.A. (1999), Nonparametric Statistical Methods, New York: Wiley.
Returns the Fligner-Killeen test for homogeneous scales for two samples. Estimates of the ratio of scales, based on the logs of folded median-aligned samples, and a corresponding confidence interval are also computed. fk.test computes the value of the statistic based on squared-normal scores, following the optimal (for normal errors) test described in Section 2.10 of Hettmansperger and McKean (2011). Hence, it will differ from the core R routine fligner.test; see the discussion in Section 3.3 of Kloke and McKean (2014)/Section 3.5 of Kloke and McKean (2024).
fk.test(x,y,alternative = c("two.sided", "less", "greater"),conf.level = 0.95)
x |
vector of first sample responses |
y |
vector of second sample responses |
alternative |
alternative indicator for hypotheses |
conf.level |
confidence coefficient for the returned confidence intervals |
Returns the Fligner-Killeen test for the two-sample scale problem.
statistic |
chi-squared test statistic |
p.value |
p-value of the test |
estimate |
vector of estimates of ratio of scales |
conf.int |
table of confidence intervals |
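
To make "logs of folded median-aligned samples" concrete, here is a rough base-R sketch. It is one plausible reading only: each sample is aligned by its median, folded with abs, and logged, and the shift between the two log samples is estimated by the median of pairwise differences and then exponentiated. Whether fk.test uses exactly this estimator is an assumption; the names fold_log and ratio_sketch are hypothetical.

```r
# Sketch: scale-ratio estimate from logs of folded, median-aligned samples.
set.seed(7)
x <- rnorm(18)
y <- rnorm(22) * 3        # second sample has roughly three times the scale

# Align by the sample median, fold (absolute value), and take logs.
# Exact zeros are dropped because log(0) is -Inf.
fold_log <- function(z) {
  a <- abs(z - median(z))
  log(a[a > 0])
}
lx <- fold_log(x)
ly <- fold_log(y)

# On the log scale a ratio of scales becomes a shift in location;
# estimate the shift by the median of all pairwise differences (assumed).
shift <- median(outer(ly, lx, "-"))
ratio_sketch <- exp(shift)
ratio_sketch
```

Working on the log scale turns the two-sample scale problem into a two-sample location problem, which is why location-type estimates apply.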
John Kloke, Joseph McKean
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
fkk.test
x <- rnorm(18)
y <- rnorm(22)*3
fk.test(x,y)
Returns the Fligner-Killeen test for homogeneous scales for k samples. Estimates of the ratios of scales, based on the logs of folded median-aligned samples, and corresponding confidence intervals are also computed. The first level (sample) is the reference. See the discussion in Section 5.7 of Kloke and McKean (2014)/Section 5.8 of Kloke and McKean (2024).
fkk.test(y,ind,conf.level = 0.95)
y |
vector of responses |
ind |
vector of corresponding levels |
conf.level |
confidence coefficient for the returned confidence intervals |
Returns the Fligner-Killeen test for the k-sample scale problem.
statistic |
chi-squared test statistic |
p.value |
p-value of the test |
estimate |
vector of estimates of ratio of scales |
conf.int |
table of confidence intervals |
cwts |
vector of weights based on the estimated differences in scales |
John Kloke, Joseph McKean
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
fk.test
y1 <- rnorm(10)
y2 <- rnorm(12)*3
y3 <- rnorm(15)*5
y <- c(y1,y2,y3)
ind <- rep(1:3,times=c(10,12,15))
fkk.test(y,ind)
Returns the test based on placements for the Behrens-Fisher problem. This test was developed by Fligner and Policello (1981); see, also, Section 2.11 of Hettmansperger and McKean (2011) and Section 4.4 of Hollander and Wolfe (1999). The version computed by fp.test is discussed in Section 3.4 of Kloke and McKean (2014)/Section 3.6 of Kloke and McKean (2024).
fp.test(x,y,delta0=0,alternative = "two.sided")
x |
vector of first sample responses |
y |
vector of second sample responses |
delta0 |
null value tested |
alternative |
alternative indicator for hypotheses |
Returns the Placement Test for the Behrens-Fisher problem.
statistic |
chi-squared test statistic |
p.value |
p-value of the test |
numerator |
numerator of test statistic |
denominator |
denominator of test statistic |
John Kloke, Joseph McKean
Fligner, M. A. and Policello, G. E. (1981), Robust rank procedures for the Behrens-Fisher problem, Journal of the American Statistical Association, 76, 162–168.
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
Hollander, M. and Wolfe, D. A. (1999), Nonparametric Statistical Methods, 2nd Edition, New York: John Wiley and Sons.
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
Generalization of the Wilcoxon rank sum which allows for censored data.
gehan.test(time, event, trt)
time |
Time of event or of censoring |
event |
Indicator variable for whether the event occurred or not (i.e., the time is censored) |
trt |
Variable indicating treatment group. |
statistic |
Value of the test statistic |
p.value |
p-value |
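
The idea behind a Wilcoxon-type statistic for censored data can be sketched with Gehan's pairwise scoring: a pair contributes +1 or -1 only when the ordering of the two survival times is known despite censoring, and 0 otherwise. The sketch below computes only this raw score in base R. It assumes event == 1 marks an observed event, ignores tied times, and is not claimed to match the standardization or p-value computed by gehan.test; the name gehan_score_sketch is hypothetical.

```r
# Raw Gehan score for two groups (base-R sketch; ties in time are ignored).
gehan_score_sketch <- function(time, event, trt) {
  lev <- sort(unique(trt))
  stopifnot(length(lev) == 2)
  a  <- trt == lev[1]
  tA <- time[a];  eA <- event[a]
  tB <- time[!a]; eB <- event[!a]
  nA <- length(tA); nB <- length(tB)

  # A's time is known to exceed B's when tA > tB and B's time is uncensored.
  gt <- outer(tA, tB, ">") & matrix(eB == 1, nA, nB, byrow = TRUE)
  # A's time is known to be smaller when tA < tB and A's time is uncensored.
  lt <- outer(tA, tB, "<") & matrix(eA == 1, nA, nB)

  sum(gt) - sum(lt)
}

# Tiny hand-checkable example: the second subject in group A is censored.
time  <- c(5, 8, 3, 6)
event <- c(1, 0, 1, 1)
trt   <- c("A", "A", "B", "B")
gehan_score_sketch(time, event, trt)
```

With no censoring every pair is comparable, and the score reduces to the difference between the two Mann-Whitney counts.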
John Kloke
Higgins (2004), Introduction to Modern Nonparametric Statistics, Pacific Grove, CA:Brooks/Cole–Thomson Learning
n <- 76
y <- rexp(n)
event <- rbinom(n,1,0.7) # about 30% censored
trt <- sample(c(0,1),n,replace=TRUE)
gehan.test(y,event,trt)
Returns the heterogeneous slopes design matrix used in ANCOVA. It references the first level.
getxact(amat,bmat)
amat |
cell mean design matrix of factor. |
bmat |
matrix of covariates. |
Returns the heterogeneous slopes analysis of covariance matrix.
cmat |
heterogeneous slopes analysis of covariance matrix |
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Returns the heterogeneous slopes design matrix used in ANCOVA. It references the first level. Also, column names are supplied.
getxact2(amat,bmat)
amat |
cell mean design matrix of factor. |
bmat |
matrix of covariates. |
Returns the heterogeneous slopes analysis of covariance matrix.
cmat |
heterogeneous slopes analysis of covariance matrix with columns named |
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Hemorrhage data from Dupont.
data(hemorrhage)
A data frame with 71 observations on the following 3 variables.
genotype
a numeric vector
time
a numeric vector
recur
a numeric vector
Dupont
data(hemorrhage)
str(hemorrhage)
Hodges-Lehmann type estimation and confidence intervals.
hodges_lehmann.ci(x, y, var.equal = FALSE, conf.level = 0.95, ...)
x |
numeric vector. |
y |
numeric vector. |
var.equal |
logical. Assume scales are equal (TRUE) or not (FALSE). |
conf.level |
confidence level to be used for the confidence interval. |
... |
optional arguments. currently unused. |
Currently implements two-sample estimation and confidence intervals based on methods proposed by Hodges and Lehmann.
estimate |
parameter point estimate |
stderr |
estimated standard error of point estimate |
conf.int |
estimated confidence interval |
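
For orientation, the classical two-sample Hodges-Lehmann point estimate of a shift is the median of all pairwise differences between the samples. The base-R sketch below computes only that point estimate. The sign convention (x minus y) is an assumption, it does not reproduce the standard error or interval returned by hodges_lehmann.ci, and the name hl_sketch is hypothetical.

```r
# Hodges-Lehmann two-sample shift estimate (base-R sketch).
hl_sketch <- function(x, y) {
  # All length(x) * length(y) pairwise differences x_i - y_j.
  d <- outer(x, y, "-")
  median(d)
}

# Small example that is easy to verify by hand:
# the differences are 0, 1, 1, 2, 2, 3.
hl_sketch(c(1, 2, 3), c(0, 1))

# Applied to the two samples used in the package example.
zoo <- c(390, 258, 298, 255, 324, 240, 416, 319, 225, 284)
rh  <- c(187, 186, 179, 269, 382, 264, 353, 38, 350, 267,
         229, 383, 254, 302, 195, 43, 337, 390)
hl_sketch(zoo, rh)
```

Because the estimate is a median of differences, a few extreme observations (such as 38 and 43 in rh) have limited influence on it.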
John Kloke, Joseph McKean
Hollander, M. and Wolfe, D.A. (1999), Nonparametric Statistical Methods, New York: Wiley.
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
zoo <- c(390,258,298,255,324,240,416,319,225,284)
rh <- c(187,186,179,269,382,264,353,38,350,267,229,383,254,302,195,43,337,390)
hodges_lehmann.ci(zoo,rh)
These data are described in Example 11.7 of Hollander and Wolfe (1999). Results from a clinical trial in early Hodgkin's disease. Subjects received one of two treatments: radiation of affected node (AN) or total nodal radiation (TN).
data("hodgkins")
A data frame with 49 observations on the following 3 variables.
time
Survival time
relapse
Indicator variable for relapse
trt
treatment: a factor with levels AN and TN
Hollander, M. and Wolfe, D.A. (1999), Nonparametric Statistical Methods, New York: Wiley.
Based on selector statistics (Q1 & Q2), one of four score functions is chosen. A rank test and p-value are then calculated based on it.
hogg.test(x, y, ...)
x |
n by 1 vector |
y |
m by 1 vector |
... |
additional arguments. currently not used |
statistic |
Value of the test statistic. |
p.value |
p-value based on a normal approximation. |
scores |
Which of the score functions was chosen. |
John Kloke, Patrick Kimes
Hogg, R. McKean, J, Craig, A (2013) Introduction to Mathematical Statistics, 7th Ed. Boston: Pearson.
hogg.test(rt(20,1),rt(22,1)+0.2)
Q1 is a measure of skewness and Q2 is a measure of tail heaviness.
Q1(z)
z |
n by 1 vector |
Used as selector statistics in adaptive schemes. Both Q1 and Q2 are ratios. For Q1, the numerator is the upper 5% mean minus the middle 50% mean, while the denominator is the difference between the middle 50% mean and the lower 5% mean. For Q2, the numerator is the upper 5% mean minus the lower 5% mean, while the denominator is the difference between the upper 50% mean and the lower 50% mean. These statistics are not robust.
Returns the calculated ratio as a numeric scalar.
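
The two selector statistics can be sketched in base R as ratios of trimmed-tail means. This is an approximation for illustration only: it takes the Q1 denominator as the middle 50% mean minus the lower 5% mean, as in Hogg, McKean, and Craig (2013), and it rounds the tail sizes to whole observations, whereas the package functions may weight fractional observations. The names tail_means, q1_sketch, and q2_sketch are hypothetical.

```r
# Means of the upper/lower 5%, upper/lower 50%, and middle 50% of a sample.
tail_means <- function(z) {
  z <- sort(z)
  n <- length(z)
  k05 <- ceiling(0.05 * n)   # observations in each 5% tail (rounded up)
  k50 <- ceiling(0.50 * n)   # observations in each half (rounded up)
  k25 <- floor(0.25 * n)     # observations trimmed from each end for the middle
  list(U05 = mean(tail(z, k05)), L05 = mean(head(z, k05)),
       U50 = mean(tail(z, k50)), L50 = mean(head(z, k50)),
       M50 = mean(z[(k25 + 1):(n - k25)]))
}

# Q1: skewness selector.
q1_sketch <- function(z) {
  m <- tail_means(z)
  (m$U05 - m$M50) / (m$M50 - m$L05)
}

# Q2: tail-heaviness selector.
q2_sketch <- function(z) {
  m <- tail_means(z)
  (m$U05 - m$L05) / (m$U50 - m$L50)
}

z <- -50:50       # a symmetric sample, so Q1 equals 1
q1_sketch(z)
q2_sketch(z)
```

For a symmetric sample the upper and lower tails mirror each other, so Q1 is 1; values above 1 suggest right skewness, and larger Q2 values suggest heavier tails.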
John Kloke
Hogg, R. McKean, J, Craig, A (2013) Introduction to Mathematical Statistics, 7th Ed. Boston: Pearson.
A data set presented on page 496 of Huitema (2011). The design is a 2 by 2 with one covariate.
data(huitema496)
A 16 by 4 array with the following 4 columns:
y
number of novel responses.
i
type of reinforcement (2 levels).
j
type of program (2 levels).
x
covariate, a measure of verbal fluency.
Discussion can be found in both references listed below.
Huitema, B.E. (2011), The analysis of covariance and alternatives, 2nd ed., New York: Wiley.
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
huitema496 <- data.frame(huitema496)
fit <- rfit(y~factor(i)+factor(j)+x,data=huitema496)
summary(fit)
Data from a study of the breakdown time of an electrical insulating fluid subjected to seven different levels of voltage stress.
data("insulation")
A data frame with 76 observations on the following 2 variables.
log.stress
log of voltage stress
log.time
log of failure time
Nelson, W. (1982), Applied lifetime data analysis, New York: John Wiley and Sons.
Lawless, J.F. (1982), Statistical models and methods for lifetime data, New York: John Wiley and Sons.
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
myscores <- logGFscores
myscores@param <- c(1,5)
fit <- rfit(log.time ~ log.stress,scores=myscores,data=insulation)
summary(fit)
fit$tauhat
Internal functions not intended for general use. Used in calculation of Hogg's Qs.
lmean(z, p)
z |
n by 1 vector |
p |
scalar |
Returns the calculated value as a numeric scalar.
John Kloke, Joseph McKean
Computes Jonckheere's Test for Ordered Alternatives; see Section 5.6 of Kloke and McKean (2014)/Section 5.7 of Kloke and McKean (2024).
jonckheere(y, groups)
y |
vector of responses |
groups |
vector of associated groups (levels) |
Computes Jonckheere's Test for Ordered Alternatives. The main source was downloaded from the site:
smtp.biostat.wustl.edu/sympa/biostat/arc/s-news/2000-10/msg00126.html
Jonckheere |
test statistic |
ExpJ |
null expectation |
VarJ |
null variance |
p |
p-value |
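
The quantities listed above can be illustrated with the textbook form of Jonckheere's statistic: J counts, over all ordered pairs of groups, how often an observation in the lower group is smaller than one in the higher group, and is compared with its null mean and variance. The sketch below uses the standard no-ties variance formula and a one-sided normal approximation for an increasing trend; whether jonckheere reports the same sidedness is an assumption, and the name jonckheere_sketch is hypothetical.

```r
# Jonckheere's test for ordered alternatives (base-R sketch, no tie correction).
jonckheere_sketch <- function(y, groups) {
  g <- sort(unique(groups))
  J <- 0
  # Sum Mann-Whitney counts over all pairs of groups u < v.
  for (u in seq_len(length(g) - 1)) {
    for (v in (u + 1):length(g)) {
      d <- outer(y[groups == g[u]], y[groups == g[v]], "-")
      J <- J + sum(d < 0) + 0.5 * sum(d == 0)
    }
  }
  n_i <- as.vector(table(groups))
  N <- length(y)
  ExpJ <- (N^2 - sum(n_i^2)) / 4
  VarJ <- (N^2 * (2 * N + 3) - sum(n_i^2 * (2 * n_i + 3))) / 72
  z <- (J - ExpJ) / sqrt(VarJ)
  # One-sided p-value for an increasing trend (assumed direction).
  list(Jonckheere = J, ExpJ = ExpJ, VarJ = VarJ,
       p = pnorm(z, lower.tail = FALSE))
}

# Perfectly ordered toy data: every cross-group pair is concordant.
res <- jonckheere_sketch(1:6, rep(1:3, each = 2))
res
```

In the toy example each of the three group pairs contributes 4 concordant comparisons, so J reaches its maximum for this design.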
John Kloke, Joseph McKean
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
smtp.biostat.wustl.edu/sympa/biostat/arc/s-news/2000-10/msg00126.html
r <- rnorm(30)
gp <- c(rep(1,10),rep(2,10),rep(3,10))
jonckheere(r,gp)
Returns a robust rank-based analysis of covariance for a k-way layout assuming heterogeneous slopes; see Section 5.4 of Kloke and McKean (2014)/Sections 5.6 and 7.3 of Kloke and McKean (2024). Currently only Wilcoxon scores are used.
kancova(levs,data,xcov,print.table=TRUE)
levs |
vector of levels corresponding to the factors A, B, C, etc. |
data |
matrix with response in column 1 and level in column 2 |
xcov |
matrix of covariates |
print.table |
logical indicating a table should be printed |
Returns the analysis of covariance table assuming heterogeneous slopes for a k-way layout.
tab2 |
analysis of covariance |
fint |
rank-based full model (heterogeneous slopes) |
fithomog |
rank-based full model (homogeneous slopes) |
John Kloke, Joseph McKean
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
levels <- c(2,2)
y.group <- huitema496[,c('y','i','j')]
xcov <- huitema496[,'x']
kancova(levels,y.group,xcov)
Routine used in making the display of the ANCOVA table obtained by kancova.
kancovarown(vec)
vec |
vector to be labeled. |
Returns the labels.
nm |
vector of labels |
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Train a k nearest neighbors (knn) classifier via cross validation (cv). The number of folds and the set of numbers of neighbors to consider may be specified.
knn_cv(xy, k.cv = 5, kvec = seq(1, 47, by = 2))
xy |
Data frame with the data matrix x as the first set of columns and the vector y as the last column. |
k.cv |
scalar. number of folds to use. default is 5. |
kvec |
vector. set of neighbors to consider. default is odd integers between 1 and 47 (inclusive). |
kvec |
set of neighbors considered |
error |
vector of misclassification error rates corresponding to kvec |
k.best |
number of neighbors with lowest error rate |
k.cv |
number of folds used |
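
The cross-validation loop behind choosing the number of neighbors can be sketched in base R. This is an illustration under stated assumptions rather than the code of knn_cv: the classifier is a small hand-written Euclidean kNN (the helper knn_predict is hypothetical), the data are simulated, and the error for each k is pooled over all folds.

```r
# Minimal kNN classifier: majority vote among the k nearest training points.
knn_predict <- function(xtrain, ytrain, xtest, k) {
  apply(xtest, 1, function(row) {
    d  <- colSums((t(xtrain) - row)^2)      # squared Euclidean distances
    nn <- order(d)[seq_len(k)]              # indices of the k nearest
    names(which.max(table(ytrain[nn])))     # most frequent class label
  })
}

set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)         # simulated predictors
y <- ifelse(x[, 1] + x[, 2] > 0, "a", "b")  # simulated class labels

k.cv  <- 5
kvec  <- seq(1, 15, by = 2)
folds <- sample(rep(seq_len(k.cv), length.out = n))  # random fold labels

# For each k, predict every held-out fold and pool the misclassifications.
error <- sapply(kvec, function(k) {
  miss <- lapply(seq_len(k.cv), function(f) {
    test <- folds == f
    pred <- knn_predict(x[!test, , drop = FALSE], y[!test],
                        x[test, , drop = FALSE], k)
    pred != y[test]
  })
  mean(unlist(miss))
})

k.best <- kvec[which.min(error)]
data.frame(kvec, error)
k.best
```

Using odd values of k avoids tied votes when there are two classes, which is presumably why the default kvec consists of odd integers.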
John Kloke
Hastie, T., Tibshirani, R., and Friedman, J. (2017), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, New York: Springer.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning with Applications in R, New York: Springer.
Venables, W. N. and Ripley, B. D. (2002) _Modern Applied Statistics with S._ Fourth edition. Springer.
train_set <- sim_class2[sim_class2$train==1,-1]
set.seed(19180511)
fit_cv <- knn_cv(train_set,k.cv=10)
fit_cv
The response variable is the quality of a vintage based on a scale of 1 to 5 over the years 1961 to 2004. The predictor is end of harvest, days between August 31st and the end of harvest for that year, and the factor of interest is whether or not it rained at harvest time.
data(latour)
A data frame with 44 rows and 4 columns.
year
Year of harvest
quality
Rating on a scale of 1-5
end.of.harvest
Days between August 31 and the end of harvest
rain
indicator variable for rain
Sheather, SJ (2009), A Modern Approach to Regression with R, New York: Springer.
data(latour)
plot(quality~end.of.harvest,pch='',data=latour)
points(quality~end.of.harvest,data=latour[latour$rain==0,],pch=3)
points(quality~end.of.harvest,data=latour[latour$rain==1,],pch=4)
Mood's classical nonparametric method for calculating a difference in population medians.
mood.ci(x, y, var.equal = FALSE, conf.level = 0.95, ...)
x |
n x 1 vector |
y |
m x 1 vector |
var.equal |
Logical. Assume scale of the two populations are equal. |
conf.level |
numeric value. confidence level for the confidence interval. |
... |
not currently implemented |
A vector of length 2 containing the lower and upper endpoints of the confidence interval.
John Kloke, Joseph McKean
Hollander, M. and Wolfe, D.A. (1999), Nonparametric Statistical Methods, New York: Wiley.
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
x <- rt(101,9)
y <- rt(108,9)+0.3
mood.ci(x,y)
Returns a test for homogeneous slopes and, assuming homogeneous slopes, a test for differences in level. Currently only Wilcoxon scores are used.
onecova(levs,data,xcov,print.table=TRUE)
levs |
Number of levels of the one-way design |
data |
matrix with response in column 1 and level in column 2 |
xcov |
matrix of covariates |
print.table |
logical indicating a table should be printed |
Returns the analysis of covariance table.
tab |
analysis of covariance |
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
data = latour[,c('quality','rain')]
xcov <- cbind(latour['end.of.harvest'])
onecova(2,data,xcov,print.table=TRUE)
Returns a robust rank-based analysis of covariance for a one-way layout assuming heterogeneous slopes; see Section 5.4 of Kloke and McKean (2014)/Sections 5.6 and 7.3 of Kloke and McKean (2024). Currently only Wilcoxon scores are used.
onecovaheter(levs,data,xcov,print.table=TRUE)
levs |
Number of levels of the one-way design |
data |
matrix with response in column 1 and level in column 2 |
xcov |
matrix of covariates |
print.table |
logical indicating a table should be printed |
Returns the analysis of covariance table assuming heterogeneous slopes.
tab |
analysis of covariance |
fit |
rank-based full model (heterogeneous slopes) |
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
data = latour[,c('quality','rain')]
xcov <- cbind(latour['end.of.harvest'])
onecovaheter(2,data,xcov,print.table=TRUE)
Returns a robust rank-based analysis of covariance for a one-way layout assuming homogeneous slopes; see Section 5.4 of Kloke and McKean (2014)/Sections 5.6 and 7.3 of Kloke and McKean (2024). Currently only Wilcoxon scores are used.
onecovahomog(levs,data,xcov,print.table=TRUE)
levs |
Number of levels of the one-way design |
data |
matrix with response in column 1 and level in column 2 |
xcov |
matrix of covariates |
print.table |
logical indicating a table should be printed |
Returns the analysis of covariance table assuming homogeneous slopes.
tab |
analysis of covariance |
fit |
rank-based full model (homogeneous slopes) |
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
data = latour[,c('quality','rain')]
xcov <- cbind(latour['end.of.harvest'])
onecovahomog(2,data,xcov,print.table=TRUE)
Returns the placements of the first vector in terms of the second vector, as used by the R function fp.test; see Section 2.11 of Hettmansperger and McKean (2011) and Section 4.4 of Hollander and Wolfe (1999). The version computed by fp.test is discussed in Section 3.4 of Kloke and McKean (2014)/Section 3.6 of Kloke and McKean (2024).
place(x,y)
x |
first vector |
y |
second vector of second sample responses |
Returns the Placements for the routine fp.test.
ic |
vector of placements. |
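
As an illustration of what a placement is: the placement of an observation from the first sample is commonly defined as the number of second-sample observations that fall below it. The base-R sketch below uses that strict-inequality definition; how place treats ties is not stated here, so the handling of ties is an assumption, and the name place_sketch is hypothetical.

```r
# Placements of x among y: for each x_i, count the y_j strictly below it.
place_sketch <- function(x, y) {
  sapply(x, function(xi) sum(y < xi))
}

x <- c(1, 5)
y <- c(0, 2, 3)
place_sketch(x, y)   # placements of x among y
place_sketch(y, x)   # placements of y among x
```

The Fligner-Policello statistic is built from both sets of placements, which is why the helper takes the two samples in a fixed order.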
John Kloke, Joseph McKean
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
Hollander, M. and Wolfe, D. A. (1999), Nonparametric Statistical Methods, 2nd Edition, New York: John Wiley and Sons.
Kloke, J. and McKean, J.W. (2014), Nonparametric Statistical Methods Using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric Statistical Methods Using R, Second Edition, Boca Raton, FL: Chapman-Hall.
Abebe et al. (2001) discuss a dataset resulting from a three-way layout for a neurological experiment in which the time required for a mouse to exit a narrow elevated wooden plank is measured. The response is the log of time (in seconds) to exit. Interest lies in assessing the effects of three factors: the Mouse Strain (Tg+, Tg-), the mouse's Gender (female, male), and the mouse's Age (Aged, Middle, Young). The design is a 2 by 2 by 3 factorial design.
data(plank)
A data frame with 64 observations on the following 4 variables.
response
a numeric vector
strain
a factor with levels 1, 2
gender
a factor with levels 1, 2
age
a factor with levels 1, 2, 3
Abebe, A., Crimin, K., McKean, J. W., Vidmar, T. J., and Haas, J. V. (2001), “Rank-Based Procedures for Linear Models: Applications to Pharmaceutical Science Data,” Drug Information Journal,
data(plank)
boxplot(response~strain,data=plank)
raov(response~strain:gender:age,data=plank)
Plots the misclassification error rate versus the number of neighbors, based on a call to knn_cv.
## S3 method for class 'knn_cv'
plot(x, ...)
x |
object of class knn_cv. |
... |
additional arguments. currently not used. |
The list x is assumed to have attributes kvec and error representing the number of neighbors and the corresponding misclassification rate, respectively.
No return value, called for side effects of creating plot.
John Kloke
Hastie, T., Tibshirani, R., and Friedman, J. (2017), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, New York: Springer.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning with Applications in R, New York: Springer.
Venables, W. N. and Ripley, B. D. (2002), Modern Applied Statistics with S, Fourth edition, Springer.
A simulated polynomial (3rd degree) model discussed in Section 4.7.1 of Kloke and McKean (2014)/4.6.1 of Kloke and McKean (2024).
data(poly)
One-hundred observations on two variables.
y
response variable
x
predictor
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
plot(y ~ x,data=poly)
Tests for the degree of a polynomial. This test was suggested by Graybill (1976) and is discussed from a robust point of view in Section 4.7.1 of Kloke and McKean (2014)/Section 4.6.1 of Kloke and McKean (2024).
polydeg(y, x, P, alpha = 0.05)
y |
vector of responses |
x |
Predictor |
P |
Super degree of polynomial which provides a satisfactory fit |
alpha |
Level of the testing |
Returns the degree of the polynomial based on the algorithm.
deg |
The determined degree |
coll |
Matrix of step information |
fitf |
Fit of the polynomial based on the determined degree |
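The step-down idea can be sketched in a few lines. This is only one reading of the procedure, and it uses ordinary least squares (`lm`) in place of the rank-based fit that the package presumably uses: start at the super degree P, test whether the highest-order coefficient is zero, and lower the degree until that test rejects. The function `polydeg_ls_sketch` and its return structure are hypothetical.

```r
# Sketch only: least-squares analogue of a step-down test for polynomial
# degree. Not the package's polydeg(); the stopping rule is an assumption.
polydeg_ls_sketch <- function(y, x, P, alpha = 0.05) {
  for (p in P:1) {
    fit <- lm(y ~ poly(x, p))                    # orthogonal polynomial of degree p
    pval <- summary(fit)$coefficients[p + 1, 4]  # p-value for the degree-p term
    if (pval < alpha) {
      return(list(deg = p, fit = fit))           # highest term is needed: stop
    }
  }
  list(deg = 0, fit = lm(y ~ 1))                 # no term was significant
}

set.seed(1)
x <- seq(-2, 2, length.out = 40)
y <- 1 + x^3 + rnorm(40, sd = 0.1)               # true degree is 3
polydeg_ls_sketch(y, x, P = 5)$deg
```

Because each step is a level-alpha test, the procedure can stop at a degree above the true one with small probability.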
Graybill, F.A. (1976), Theory and application of the linear model, North Scituate, Ma: Duxbury Press.
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
x <- 1:20
xc <- x - mean(x)
y<- .2*xc + xc^3 +rt(20,3)*90
plot(y~x)
polydeg(y,xc,6)
Internal print functions
## S3 method for class 'hogg.test'
print(x, digits = max(5, .Options$digits - 2), ...)
## S3 method for class 'rank.test'
print(x,...)
## S3 method for class 'fkk.test'
print(x,...)
## S3 method for class 'knn_cv'
print(x,...)
## S3 method for class 'npsm.ci'
print(x, estimate=FALSE,stderr=FALSE,digits = max(5, .Options$digits - 2),...)
x |
Object to be printed. |
digits |
Number of digits to present. Passed to print function. |
... |
Additional arguments. |
estimate |
not currently implemented. |
stderr |
not currently implemented. |
No return value, called for side effects
John Kloke, Joseph McKean
Under investigation in this clinical trial was the pharmaceutical agent diethylstilbestrol (DES); subjects were assigned to 1.0 mg DES (treatment = 2) or to placebo (treatment = 1).
data(prostate)
A data frame with 38 observations on the following 8 variables.
patient
a numeric vector
treatment
a numeric vector
time
a numeric vector
status
a numeric vector
age
a numeric vector
shb
a numeric vector
size
a numeric vector
index
a numeric vector
http://www.crcpress.com/product/isbn/9781584883258
Collett, D. (2003), Modeling survival data in medical research, CRC Press.
data(prostate)
boxplot(size~treatment,data=prostate)
A regression example with response yearly upkeep of a home and the predictor value of home; see Bowerman et al. (2005) and Exercise 4.9.8 of Kloke and McKean (2014)/Exercise 7.6.2 of Kloke and McKean (2024).
data(qhic)
Forty observations on two variables.
upkeep
annual upkeep expenditure of home (y)
value
value of the home (x)
Bowerman, B.L., O'Connell, R.T., and Koehler, A.B. (2005), Forecasting, time series, and regression: An applied approach, Australia: Thomson.
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
plot(upkeep~value,data=qhic,xlab='Value (in $1000s)',ylab='Annual upkeep (in $10s)')
Two sample quail data.
data(quail2)
A data frame with 30 observations on the following 2 variables.
treat
indicator variable for treatment
ldl
ldl measurement
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
McKean J.W., Vidmar, T.J., and Sievers, G.L. (1989), A robust two stage multiple comparison procedure with application to a random drug screen, Biometrics, 45, 1281–1297.
data(quail2)
boxplot(ldl~treat,data=quail2)
A generalization of the Wilcoxon rank-sum test in which a score function is applied to the ranks. Any scores from Rfit can be used, as well as user-defined scores. The default is to perform a Wilcoxon analysis.
rank.test(x, y, alternative = "two.sided", scores = Rfit::wscores, conf.int = FALSE, conf.level = 0.95)
x |
m x 1 vector |
y |
n x 1 vector |
alternative |
one of 'two.sided', 'less', or 'greater' |
scores |
an object of class scores |
conf.int |
logical indicating if a confidence interval should be estimated |
conf.level |
desired level of confidence for interval |
The test is based on T = sum_i a(R(y_i)), where R(y_i) is the rank of y_i in the combined sample of size N = m + n and a(i) = φ(i/(N+1)) for the score function φ. The confidence interval, if requested, is based on a call to Rfit.
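The statistic can be illustrated with a short base-R sketch. It assumes Wilcoxon scores of the form sqrt(12)(u - 1/2), no ties, and the usual null mean and variance of a linear rank statistic for the standardization; the function names are hypothetical and this is not the package's `rank.test`.

```r
# Sketch only: a two-sample linear rank statistic with Wilcoxon scores.
# Assumes no ties; standardization uses the permutation null moments.
wilcoxon_phi <- function(u) sqrt(12) * (u - 0.5)

rank_stat_sketch <- function(x, y, phi = wilcoxon_phi) {
  m <- length(x); n <- length(y); N <- m + n
  a <- phi(seq_len(N) / (N + 1))        # scores a(1), ..., a(N)
  r_y <- rank(c(x, y))[(m + 1):N]       # ranks of y in the combined sample
  T_stat <- sum(a[r_y])                 # T = sum_i a(R(y_i))
  mean_T <- n * mean(a)                 # null mean of T
  var_T <- m * n / (N * (N - 1)) * sum((a - mean(a))^2)  # null variance of T
  z <- (T_stat - mean_T) / sqrt(var_T)
  list(Sphi = T_stat, statistic = z, p.value = 2 * pnorm(-abs(z)))
}

set.seed(101)
rank_stat_sketch(rt(20, 1), rt(22, 1) + 0.2)
```

Passing a different `phi` (for example, a sign-score function) changes the test without changing the rest of the computation, which is the point of the general-scores formulation.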
statistic |
Standardized value of the test statistic |
Sphi |
Test statistic |
p.value |
p-value |
conf.int |
confidence interval for shift in location |
estimate |
point estimate for shift in location |
John Kloke, Joseph McKean
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
rank.test(rt(20,1),rt(22,1)+0.2)
Generate a random sample from a contaminated normal distribution.
rcn(n, eps, sigmac)
rcn_5_5(n)
n |
sample size |
eps |
proportion of contamination |
sigmac |
standard deviation of the contaminated component |
With probability (1-eps), deviates are drawn from a standard normal distribution. With probability eps, deviates are drawn from a normal distribution with mean 0 and standard deviation sigmac. rcn_5_5 is a special case where eps=0.05 and sigmac=5.
n x 1 numeric vector containing the random deviates.
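The mixture described above can be sketched in base R as follows. This is an illustration under the stated description, not the package's source; `rcn_sketch` is a hypothetical name, and its use of the random number stream need not match `rcn`.

```r
# Sketch only: contaminated normal deviates. Each observation is drawn from
# N(0, sigmac^2) with probability eps and from N(0, 1) otherwise.
rcn_sketch <- function(n, eps, sigmac) {
  contaminated <- rbinom(n, 1, eps) == 1    # which draws are contaminated
  sd_vec <- ifelse(contaminated, sigmac, 1) # per-observation standard deviation
  rnorm(n, mean = 0, sd = sd_vec)
}

set.seed(101)
z <- rcn_sketch(1000, eps = 0.25, sigmac = 10)
qqnorm(z)  # heavy tails show up as departures from a straight line
```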
John Kloke, Joseph McKean
Hogg, R., McKean, J., and Craig, A. (2013), Introduction to Mathematical Statistics, 7th ed., Boston: Pearson.
qqnorm(rcn(100,.25,10))
set.seed(101); rcn(10,0.05,5)
set.seed(101); rcn_5_5(10)
Generate random data from a contaminated normal distribution where the contamination is a multiplicative factor, as occurs, for example, when data are recorded in incorrect units or with a misplaced decimal point.
rcnx100(n,eps=0.001,x=100,mu=0,sigma=1,...)
rcnx(...)
rcnx_01_100(n)
n |
sample size to be drawn. |
eps |
amount (probability) of contaminated observations |
x |
multiplier for the contaminated observations |
mu |
mean of uncontaminated samples |
sigma |
standard deviation of uncontaminated samples |
... |
optional arguments. |
Samples are drawn from a normal distribution with mean mu and standard deviation sigma. A fraction of the observations (eps) are multiplied by the factor x. rcnx is an alias for rcnx100. rcnx_01_100 is a special case where the observations are drawn from a standard normal distribution (i.e., mu=0 and sigma=1, the defaults in rcnx100) and eps and x are specified as 0.01 and 100, respectively.
Numeric vector of length n is returned.
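A minimal base-R sketch of multiplicative contamination is shown below. It assumes each observation is contaminated independently with probability eps, which is one reading of "a fraction of the observations"; `rcnx_sketch` is a hypothetical name and not the package's implementation.

```r
# Sketch only: normal deviates where a random subset is multiplied by x,
# mimicking unit or decimal-point errors. Independent contamination with
# probability eps is an assumption.
rcnx_sketch <- function(n, eps = 0.001, x = 100, mu = 0, sigma = 1) {
  z <- rnorm(n, mean = mu, sd = sigma)  # uncontaminated draws
  hit <- rbinom(n, 1, eps) == 1         # observations to contaminate
  z[hit] <- x * z[hit]                  # apply the multiplicative error
  z
}

set.seed(101)
qqnorm(rcnx_sketch(10000, eps = 0.005, x = 10))
```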
John Kloke
https://en.wikipedia.org/wiki/Fat-finger_error
set.seed(101); x1 <- rcnx100(10)
set.seed(101); x2 <- rcnx(10)
set.seed(101); x3 <- rcnx_01_100(10)
qqnorm(rcnx(10000,eps=0.005,x=10))
qqnorm(rcnx(1000,eps=0.05,x=1/100))
Random generation for the Laplace (double exponential) distribution with location 0 and scale 1.
rlaplace(n)
n |
scalar. number of random draws. |
A Laplace or double exponential distribution has heavier tails than a normal distribution, so a sample will tend to contain more outliers.
A vector of length n is returned containing the random data.
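One standard way to generate such data, shown here only as a sketch, uses the fact that the difference of two independent standard exponential variables follows a Laplace distribution with location 0 and scale 1. The name `rlaplace_sketch` is hypothetical, and the package's `rlaplace` may use a different construction.

```r
# Sketch only (not the package's rlaplace): the difference of two
# independent Exp(1) variables has a Laplace(0, 1) distribution.
rlaplace_sketch <- function(n) rexp(n) - rexp(n)

set.seed(101)
z <- rlaplace_sketch(1e5)
c(median = median(z), variance = var(z))  # theory: median 0, variance 2
```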
John Kloke, Joseph McKean
Hogg, R.V., McKean, J., and Craig, A.T. (2005), Introduction to Mathematical Statistics, 6th ed.
x <- rlaplace(100)
qqnorm(x)
A simulated regression model with one response and one predictor. It is discussed in Exercise 6.5.6 of Kloke and McKean (2014)/Exercise 8.11.23 of Kloke and McKean (2024).
data(rs)
Fifty observations on two variables.
y
simulated response
x
simulated predictor
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
rfit(y ~ x,data=rs)
A data set discussed in Hollander and Wolfe (1999) and Exercise 5.8.9 of Kloke and McKean (2014)/Exercise 5.9.15 of Kloke and McKean (2024). It contains part of a study on the effects of cloud seeding of cyclones.
data(SCUD)
Twenty-one observations on three variables.
trt
treatment indicator: 1 is seeded and 2 is control
M
predictor M, the geostrophic meridional circulation index
RI
measure of precipitation
Hollander, M. and Wolfe, D.A. (1999), Nonparametric Statistical Methods, New York: Wiley.
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
plot(RI ~ M,data=SCUD)
Counts of viewers for 9 seasons of Seinfeld
data("seinfeld")
A data frame with 180 observations on the following 4 variables.
episodeNumberOverall
a numeric vector
season
a numeric vector
episodeNumberSeason
a numeric vector
viewers
a numeric vector
Wikipedia https://en.wikipedia.org/wiki/List_of_Seinfeld_episodes (date unknown).
data(seinfeld)
# Comparison boxplots of viewers versus season
boxplot(viewers~season,data=seinfeld,ylab='Number of Viewers (in millions)',xlab='Season')
# Normal q-q plots for selected seasons.
oldpar_mfrow <- par()$mfrow
par(mfrow=c(2,2))
seasons2display <- c(4,5,6,9)
for( s in seasons2display) {
  v <- seinfeld[seinfeld$season==s,'viewers']
  qqnorm(v,main=paste("Season",s))
  abline(a=median(v),b=mad(v))
}
par(mfrow=oldpar_mfrow)
# Normal q-q plots for selected seasons
# using centered and scaled residuals.
oldpar_mfrow <- par()$mfrow
par(mfrow=c(2,2))
seasons2display <- c(4,5,6,9)
for( s in seasons2display) {
  v0 <- seinfeld[seinfeld$season==s,'viewers']
  v1 <- (v0 - median(v0))/mad(v0)
  qqnorm(v1,main=paste("Season",s))
  abline(a=0,b=1)
}
par(mfrow=oldpar_mfrow)
Doksum and Sievers (1976) describe an experiment involving the effect of ozone on weight gain of rats. The experimental group consisted of 22 rats which were placed in an ozone environment for seven days, while the control group contained 21 rats which were placed in an ozone-free environment for the same amount of time. The response was the weight gain in a rat over the time period.
data(sievers)
A data frame with 45 observations on the following 2 variables.
group
indicator variable for treatment
weight.gain
response variable of weight gain
Hettmansperger, T.P. and McKean J.W. (2011), Robust Nonparametric Statistical Methods, 2nd ed., New York: Chapman-Hall.
Doksum, K. A. and Sievers, G. L. (1976), Plotting with confidence: Graphical comparisons of two populations, Biometrika, 63, 421-434.
data(sievers)
boxplot(weight.gain~group,data=sievers)
p-value for a one sample sign test based on the binomial distribution.
signtest_pvalue(x, alternative = "two.sided", theta0 = 0, ...)
x |
numeric vector. |
alternative |
type of alternative hypothesis |
theta0 |
null value of the parameter |
... |
optional arguments. currently ignored. |
Returns p-value using the binomial distribution.
A numeric scalar, the p-value, is returned.
John Kloke, Joseph McKean
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
x <- round(rt(19,9) + 2,1)
signtest_pvalue(x,alternative='greater')
S <- sum(x > 0)
M <- sum(x != 0)
1-pbinom(S-1,M,0.5)
x <- round(rt(19,9) + 0,1)
signtest_pvalue(x)
S <- sum(x > 0)
M <- sum(x != 0)
2*min(pbinom(S,M,0.5), 1-pbinom(S-1,M,0.5))
A simulated classification example with two variables and two classes (labels).
data("sim_class2")
A data frame with 1000 observations on the following 4 variables.
train
an indicator for training and test sets
x1
an explanatory variable
x2
an explanatory variable
y
response variable: a factor with levels 0, 1
Random points in the x1,x2 plane were generated. Class labels are based on location relative to two circles in the x1,x2 plane, with some random variation added to the labels.
data(sim_class2)
dim(sim_class2)
train_set <- sim_class2[sim_class2$train==1,]
dim(train_set)
with(train_set,plot(x1,x2,main='Training Set',cex=0.625))
with(train_set,points(x1,x2,main='Training Set',pch=20,col=y,cex=0.625))
An experiment in which the members of two groups of students each played the game Simon twice.
data("simon")
A data frame with 31 observations on the following 3 variables.
game1
score on first trial
game2
score on second trial
class
group variable
Demonstrates the concept of regression toward the mean. Simulated data to represent a realistic realization of the experiment. See Problem 4.9.20 of Kloke and McKean (2014)/Problem 4.7.17 of Kloke and McKean (2024).
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
data(simon)
plot(game2~game1,data=simon)
rfit(game2~game1,data=simon)
Simulated dataset
data("sincos")
A data frame with 197 observations on the following 2 variables.
x
independent variable
y
dependent variable
The data were generated using
x <- seq(1,50,by=.25) ; y <- 5*sin(3*x) + 6*cos(x/4)+rnorm(length(x),0,10)
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
data(sincos)
plot(y~x,sincos)
### code to create Figure 4.9 of Kloke & McKean 2014 ###
my.sincos<-sincos
my.sincos$y3<-my.sincos$y
my.sincos$y3[137] <- 800
plot(y3~x,ylim=c(-50,50),data=my.sincos)
fit4 <- loess(y3 ~ x,data=my.sincos)
# lines(fit4$x,fit4$fitted,lty=2)
with(fit4,lines(x,fitted,lty=2))
fit5 <- loess(y3 ~ x,family="symmetric",data=my.sincos)
with(fit5,lines(x,fitted,lty=1))
legend('bottomleft',legend=c('Local Robust Fit','Local LS Fit'),lty=1:2)
title("loess Fits of Sine-Cosine Data")
A sample of 82 cars with variables speed and miles per gallon collected.
data("speed")
A data frame with 82 observations on the following 2 variables.
mpg
Miles per gallon
sp
a numeric vector
Higgins (2003), Introduction to modern nonparametric statistics.
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
data(speed)
plot(sp~mpg,data=speed)
rfit(sp~mpg+I(mpg^2),data=speed)
A data frame containing measurements of 48 turtles. The first three columns are the Length, Width, and Height measurements of the carapace of the turtle. The fourth column is a categorical variable sex with values of female and male. Data are drawn from Johnson and Wichern (2007).
data(turtle)
48 observations on four variables.
Length
numeric vector.
Width
numeric vector.
Height
numeric vector.
sex
character vector.
Johnson, R.A. and Wichern, D.W. (2007), Applied Multivariate Statistical Analysis, 6th ed., Upper Saddle River, NJ: Pearson.
with(turtle,boxplot(Length~sex))
with(turtle,boxplot(Length~sex,ylab='Length (units)'))
Performs the van Elteren extension of the Wilcoxon rank-sum test for stratified experiments.
vanElteren.test(g, y, b)
g |
n x 1 vector: treatment/group indicator |
y |
n x 1 vector: responses |
b |
n x 1 vector: denotes strata |
statistic |
Value of the test statistic. |
p.value |
p-value based on a normal approximation. |
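For orientation, a textbook form of the van Elteren statistic can be sketched as follows: within each stratum the responses are ranked, the treatment rank sum W_j is weighted by 1/(N_j + 1), and the weighted sums are combined and standardized. This sketch assumes two groups, no ties, and a two-sided normal approximation; the function `van_elteren_sketch` is hypothetical, and the package's weighting or standardization may differ.

```r
# Sketch only: van Elteren-type stratified rank-sum statistic.
# Assumes exactly two groups and no tied responses within a stratum.
van_elteren_sketch <- function(g, y, b) {
  trt <- sort(unique(g))[2]            # second group level plays "treatment"
  num <- 0
  vr <- 0
  for (s in unique(b)) {
    idx <- b == s
    r <- rank(y[idx])                  # ranks within the stratum
    is_trt <- g[idx] == trt
    m <- sum(is_trt); n <- sum(!is_trt); N <- m + n
    W <- sum(r[is_trt])                # treatment rank sum in this stratum
    num <- num + (W - m * (N + 1) / 2) / (N + 1)  # centered, weighted by 1/(N+1)
    vr <- vr + m * n / (12 * (N + 1))             # variance of W / (N+1)
  }
  z <- num / sqrt(vr)
  list(statistic = z, p.value = 2 * pnorm(-abs(z)))
}

g <- rep(c(1, 1, 2, 2), times = 2)     # group labels in two strata
b <- rep(1:2, each = 4)                # stratum labels
y <- c(1, 2, 3, 4, 10, 20, 30, 40)     # group 2 is larger in both strata
van_elteren_sketch(g, y, b)
```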
January weather data for Kalamazoo, MI for the years 1900 to 1995. It is discussed in Example 4.7.4, pages 105-106, of Kloke and McKean (2014)/Example 4.6.4, pages 177-178, of Kloke and McKean (2024).
data(weather)
Ninety-six observations (1900-1995) for twelve weather variables.
avemax
avemax
avemin
avemin
coldestmax
coldestmax
hihest
hihest
lowest
lowest
maxdayprec
maxdayprec
maxdaysnowfall
maxdaysnowfall
meantmp
meantmp
totalprec
totalprec
totalsnow
totalsnow
warmest
warmest
year
year
http://weather-warehouse.com/WeatherHistory/
Kloke, J. and McKean, J.W. (2014), Nonparametric statistical methods using R, Boca Raton, FL: Chapman-Hall.
Kloke, J. and McKean, J.W. (2024), Nonparametric statistical methods using R, Second Edition, Boca Raton, FL: Chapman-Hall.
plot(avemax ~ year,data=weather)
Wilson (score) confidence interval for a population proportion.
wilson.ci(x, n, conf.level = 0.95)
x |
number of events |
n |
number of samples |
conf.level |
confidence level |
Uses the definition in Agresti (2002).
conf.int |
estimated confidence interval |
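The Wilson score interval has a closed form, sketched below in base R. The function `wilson_ci_sketch` is hypothetical and is not the package's `wilson.ci`; it follows the standard formula, which agrees with `prop.test(x, n, correct = FALSE)` from the stats package.

```r
# Sketch only: Wilson (score) interval for a binomial proportion.
# x = number of events, n = number of trials.
wilson_ci_sketch <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # normal critical value
  phat <- x / n                          # sample proportion
  denom <- 1 + z^2 / n
  center <- (phat + z^2 / (2 * n)) / denom
  half <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / denom
  c(lower = center - half, upper = center + half)
}

wilson_ci_sketch(33, 100)
prop.test(33, 100, correct = FALSE)$conf.int  # same interval, for comparison
```

Unlike the Wald interval, the center is pulled toward 1/2, so the interval stays inside [0, 1] even when x is 0 or n.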
John Kloke, Joseph McKean
Agresti (2002), Categorical data analysis, New York: John Wiley & Sons, Inc.
n <- 100
x <- rbinom(1,n,0.33)
wilson.ci(x,n)