An introduction to data analysis using R
George F. Hart, Ph. D.,
Professor emeritus, LSU.
What this eBook covers
This eBook concentrates on the basic statistical procedures used to describe data and to analyze simple problems using the normal distribution. It is written for readers with little or no knowledge of statistics and aims to provide a working knowledge in a short time. The eBook avoids mathematical proofs and uses R to carry out the procedures needed for data-analysis, concentrating on what the statistics are doing, why they are doing it, and what the results indicate about the data.
Table of Contents
This part is a basic course in statistics and graphics that avoids mathematical proofs and formulas. It uses R code and R procedures to outline the fundamental knowledge needed to use statistics.
What is not covered
Getting R to work on your computer
Libraries
Data-frames for this course
Importing a data-frame into R
Data within R
Importing from a vector or list
Importing from a matrix
Importing data from a data-base
Exporting a data-frame out of R
As text output
As a spreadsheet
As a table
As a graphic
Saving objects in R
Saving a function in R
Saving your commands
Basic descriptive statistics
Median
Mode
Mean
Variance
Standard deviation
Coefficient of variation
Maximum - Minimum - Range
The z-score
Mid-range
Percentile
Quartile
Skewness
Kurtosis
Summary numbers used in descriptive statistics
Summary information using Contingency Tables
Error
Type 1 error
Type 2 error
The nature of data
Percentages, counts and other data measures
Scales of measurement
Nominal data
Dispersion
Association
Ordinal data
Distribution Overview
Central Tendency
Dispersion
Position
Association
Interval data
Distribution Overview
Central Tendency
Dispersion
Position
Ratio data
Distribution Overview
Central Tendency
Dispersion
Position
Attribute types
Populations
Spatial variability and sampling a population
Random Sampling
Stratified Sampling
Cluster Sampling
Systematic Sampling
The single sample problem in geological / palaeo-biological studies
The single line problem in geological / palaeo-biological studies
The dual line problem in geological / palaeo-biological studies
The multiple line problem in geological / palaeo-biological studies
Coloring Graphics
Histogram plots
Box plot
Bag plot
Scatter plot
Index plot
Q-Q plot
Residuals plot
Frequency polygon
Pie chart
Pairs plot
Coplot
3-D plot
Transformations
The Box-Cox Normality Plot
Using the regression equation for prediction
Fitting a confidence interval
The variety of distributions
The box-plot eyeball method
Chebyshev's Theorem
Shapiro-Wilk test for normality
The D'Agostino test
The Kolmogorov-Smirnov Goodness-of-fit test
Comparing the sample data-frame with a random distribution
Fitting other distributions
Outliers
Assumptions
Degrees of freedom [df]
Equal population variances
Z-test [used for interval and ratio data-analysis]
T-test [interval, ratio data-analysis]
Independent [unpaired] t-test
Dependent [paired] t-test
Calculating a single probability value
Confidence Interval [CI]
Setting a confidence interval around a t distributed variable
Setting a confidence interval around a normally distributed variable
Calculating the power of the test, assuming standard deviation is known
Correlation coefficients
Equitability
Comparative diversity indices
Similarity
What this part covers. This part is designed for those who analyze samples taken from the natural environment [random samples] rather than samples from designed experiments, i.e. it emphasizes sampling-designed problems, not experimentally designed problems.
ANOVA and regression are complementary methods for data analysis: regression creates a model of reality and ANOVA evaluates the model. Both examine how a dependent [response] variable varies in response to factors [predictor or independent variables].
Linear Regression
Simple regression
Multiple regression
The Box-Cox procedure
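The complementary roles of regression and ANOVA described in the introduction to this part can be illustrated with a minimal sketch. It assumes the built-in `cars` data set (stopping distance against speed), which is chosen here only for illustration and is not necessarily one of the data-frames used in the eBook. `lm()` builds the model and `anova()` evaluates it.

```r
# Simple linear regression on the built-in 'cars' data set
# (an illustrative choice, not a data-frame from this course).
fit <- lm(dist ~ speed, data = cars)

# Regression: the fitted model (intercept and slope).
print(coef(fit))

# ANOVA: partition the variability of the response into the part
# explained by the predictor and the residual part.
anova_tab <- anova(fit)
print(anova_tab)

# With a single predictor, the ANOVA F statistic equals the square
# of the t statistic for the slope, so both evaluate the same model.
f_value <- anova_tab[["F value"]][1]
t_value <- summary(fit)$coefficients["speed", "t value"]
print(all.equal(f_value, t_value^2))
```

The final comparison is one way to see that the two procedures look at the same fit from different angles: the regression table tests the slope, while the ANOVA table tests the variability explained by that slope.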
This part is designed for those who analyze data from designed experiments in which a controlled set of samples is used, and for those who apply experimental design when collecting sample data-frames.
This part is designed for those who analyze data that need similarity or difference measurement, including clustering techniques. The commonly used statistical procedures for classification are of two kinds. When the groups [classes] are known and the problem is to classify unknowns into one or another of the groups, the procedure of choice is discriminant function analysis; characterization of the groups uses the MANOVA procedure. When the groups are unknown but a cloud of data points exists that needs to be separated into groups, the procedure is factor analysis followed by discriminant function analysis.
This part is designed for those who analyze spatial data that need to be mapped in a geographical coordinate system.
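The known-groups case can be sketched as follows. This is a minimal example under stated assumptions: it uses the built-in `iris` data set with `Species` as the known groups (an illustrative choice, not a data-frame from this eBook), `manova()` from base R to characterize the groups, and `lda()` from the MASS package as one common implementation of discriminant function analysis.

```r
library(MASS)  # provides lda(); MASS ships with standard R distributions

# Known groups: the three species in the built-in 'iris' data
# (an illustrative choice, not a data-frame from this course).

# MANOVA: characterize the groups by testing whether their mean
# vectors differ across the four measurements.
man_fit <- manova(
  cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
  data = iris
)
print(summary(man_fit))

# Discriminant function analysis: build the classification rule.
lda_fit <- lda(Species ~ ., data = iris)

# Classify observations; here the training data are reused,
# so the agreement shown is optimistic.
pred <- predict(lda_fit, newdata = iris)
confusion <- table(actual = iris$Species, predicted = pred$class)
print(confusion)
```

In practice the unknowns to be classified would be supplied as a separate data-frame through `newdata`; reclassifying the training data, as done here, only indicates how well the groups separate.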
This part is designed for those who already understand how to produce basic graphics in R and want to understand how to use ggplot2 to produce specialized and individualized graphs.