# Works in Preparation

by George F. Hart

An introduction to data analysis using R

George F. Hart, Ph.D.,

Professor emeritus, LSU.

What does this eBook cover?

This eBook concentrates on the basic statistical procedures used to describe data and to analyze simple problems using the normal distribution. It is for those who have little or no knowledge of statistics and is intended to provide a working knowledge in a short time. The eBook avoids mathematical proofs and uses R to perform the necessary data-analysis procedures, concentrating on what the statistics are doing, why they are doing it, and what the results indicate about the data.

# For my son Vaughan Ian Hart – the one who is really good at numbers!

## Part one: Fundamental statistics using R

This part is a basic course in statistics and graphics that avoids mathematical proofs and formulae. It uses R code and R procedures to outline the fundamental knowledge needed to use statistics.

• Session 1: What is R
• What is not covered

Getting R to work on your computer

Libraries

Data-frames for this course

Importing a data-frame into R

Data within R

Importing from a vector or list

Importing from a matrix

Importing data from a data-base

Exporting a data-frame out of R

As text output

As a table

As a graphic

Saving objects in R

Saving a function in R

Basic descriptive statistics

Median

Mode

Mean

Variance

Standard deviation

Coefficient of variation

Maximum – Minimum – Range

The z-score

Mid-range

Percentile

Quartile

Skewness

Kurtosis

Summary numbers used in descriptive statistics

Summary information using Contingency Tables

• Session 2: Basic concepts for the data analyst
• Error

Type 1 error

Type 2 error

The nature of data

Percentages, counts and other data measures

Scales of measurement

Nominal data

Dispersion

Association

Ordinal data

Distribution Overview

Central Tendency

Dispersion

Position

Association

Interval data

Distribution Overview

Central Tendency

Dispersion

Position

Ratio data

Distribution Overview

Central Tendency

Dispersion

Position

• Session 3: Populations and Samples
• Attribute types

Populations

Spatial variability and sampling a population

Random Sampling

Stratified Sampling

Cluster Sampling

Systematic Sampling

The single sample problem in geological / palaeo-biological studies

The single line problem in geological / palaeo-biological studies

The dual line problem in geological / palaeo-biological studies

The multiple line problem in geological / palaeo-biological studies

• Session 4: Summarizing information using graphics
• Coloring Graphics

Histogram plots

Box plot

Bag plot

Scatter plot

Index plot

Q-Q plot

Residuals plot

Frequency polygon

Pie chart

Pairs plot

Coplot

3-D plot

• Session 5: Transformations and distributions
• Transformations

The Box-Cox Normality Plot

Using the regression equation for prediction

Fitting a confidence interval

The variety of distributions

The box-plot eyeball method

Chebyshev's Theorem

Shapiro-Wilk test for normality

The D'Agostino test

The Kolmogorov-Smirnov Goodness-of-fit test

Comparing the sample data-frame with a random distribution

Fitting other distributions

Outliers

• Session 6: Tests
• Assumptions

Degrees of freedom [df]

Equal population variances

Z-test [used for interval and ratio data analysis]

T-test [used for interval and ratio data analysis]

Independent [unpaired] t-test

Dependent [paired] t-test

Calculating a single probability value

Confidence Interval [CI]

Setting a confidence interval around a t distributed variable

Setting a confidence interval around a normally distributed variable

Calculating the power of the test, assuming standard deviation is known

• Session 7: Measures of association
• Correlation coefficients

Equitability

Comparative diversity indices

Similarity

## Part two: Regression analysis using R

This part is designed for those who analyze samples taken from the natural environment [random samples] rather than samples from designed experiments. That is, it emphasizes sampling-designed problems, which are those associated with samples taken from the natural environment, not experimentally designed problems.

ANOVA and regression are complementary methods for data analysis, in that regression creates a model of reality and ANOVA evaluates the model. Both examine how a dependent [response] variable varies in response to factors [predictor or independent variables].
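The relationship between the two methods can be sketched in a few lines of R. This is only an illustrative example, not taken from the eBook: it assumes the built-in `cars` data set, with `dist` as the response variable and `speed` as the single predictor.

```r
# Sketch only: uses R's built-in 'cars' data set (speed, dist),
# which is an assumption for illustration, not data from this eBook.

# Regression creates the model: stopping distance as a linear function of speed.
fit <- lm(dist ~ speed, data = cars)

# Coefficients of the fitted line (intercept and slope).
print(coef(fit))

# ANOVA evaluates the model: it partitions the variability in 'dist'
# into the part explained by 'speed' and the residual part.
tab <- anova(fit)
print(tab)

# Proportion of the variability explained by the predictor.
# For a simple regression this equals the R-squared of the model.
explained <- tab["speed", "Sum Sq"] / sum(tab[["Sum Sq"]])
print(explained)
```

The same pattern should carry over to multiple regression by adding predictors to the formula, in which case `anova()` reports one row per term plus the residuals.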

• Session 1: Sub-setting a data-frame
• Session 2: Regression Analysis
• Linear Regression

Simple regression

Multiple regression

• Session 3: Transformation of predictor variables
• The Box-Cox procedure

• Session 4: Selection of variables for multivariate analysis
• Session 5: Local regression analysis [loess]
• Session 6: Prediction

## Part three: Analysis of variance using R

This part is designed for those who analyze data from designed experiments in which a controlled set of samples is used, and for those who apply experimental design in the collection of sample data-frames.

• Session 1: One way ANOVA
• Session 2: One way ANCOVA
• Session 3: Two way ANOVA
• Session 4: Interaction effects in two way ANOVA
• Session 5: Three way ANOVA
• Session 6: MANOVA

## Part four: Classification procedures using R

This part is designed for those who analyze data that need similarity or difference measurement, including clustering techniques.
The commonly used statistical procedures for classification are of two kinds. When the groups [classes] are known and the problem is to classify unknowns into one or another of the groups, the procedure of choice is discriminant function analysis; characterization of the groups uses the MANOVA procedure. When the groups are unknown but a cloud of data points exists that needs to be separated into groups, the procedure is factor analysis followed by discriminant function analysis.

• Session 1: Introduction to classification procedures
• Session 2: The Similarity Indices procedures
• Session 3: Clustering procedures
• Session 4: Discriminant function analysis
• Session 5: Factor Analysis and Principal Components procedures
• Session 6: Tree models

## Part five: Spatial analysis using R

This part is designed for those who analyze spatial data that need to be mapped in a geographical coordinate system.

• Session 1: Spatial variables in 2D and 3D
• Session 2: Time versus depth axes within an x, y coordinate system
• Session 3: Two dimensional spatial analysis
• Session 4: Three dimensional spatial analysis
• Session 5: Introduction to computer mapping
• Session 6: GRASS
• Session 7: Using R within GRASS
• Session 8: A worked problem

## Part six: Advanced Graphics using R

This part is designed for those who already understand how to produce basic graphics in R and want to understand how to use ggplot2 to produce specialized and individualized graphs.

• Session 1: Graphic libraries, ggplot2, par() and the main variety of possible graphics
• Session 2: Bar plots
• Session 3: Histogram plots
• Session 4: Bivariate plots
• Session 5: Trivariate plots
• Session 6: Multivariate plots
• Session 7: Cluster plots
• Session 8: Color in graphics
• Session 9: How to set up a presentation