Works in Preparation

by George F. Hart

An introduction to data analysis using R

George F. Hart, Ph.D.,

Professor emeritus, LSU.

What does this eBook cover

This eBook concentrates on understanding the basic statistical procedures used to describe data and to analyze simple problems using the NORMAL distribution. It is for those who have little or no knowledge of statistics and is intended to provide a working knowledge in a short time. The eBook avoids mathematical proofs and uses R to perform the procedures needed in data-analysis, concentrating on what the statistics are doing, why they are doing it, and what the results indicate about the data.

For my son Vaughan Ian Hart – the one who is really good at numbers!

Table of Contents

Part one: Fundamental statistics using R

This part is a basic course in statistics and graphics that avoids mathematical proofs and formulas. It uses R code and R procedures to outline the fundamental knowledge needed to use statistics.

Session 1: What is R

What is not covered

Getting R to work on your computer

Libraries

Data-frames for this course

Importing a data-frame into R

Data within R

Importing from a vector or list

Importing from a matrix

Importing data from a data-base

Exporting a data-frame out of R

As text output

As a spreadsheet

As a table

As a graphic

Saving objects in R

Saving a function in R

Saving your commands

Basic descriptive statistics

Median

Mode

Mean

Variance

Standard deviation

Coefficient of variation

Maximum – Minimum – Range

The z-score

Mid-range

Percentile

Quartile

Skewness

Kurtosis

Summary numbers used in descriptive statistics

Summary information using Contingency Tables

Session 2: Basic concepts for the data analyst

Error

Type 1 error

Type 2 error

The nature of data

Percentages, counts and other data measures

Scales of measurement

Nominal data

Dispersion

Association

Ordinal data

Distribution Overview

Central Tendency

Dispersion

Position

Association

Interval data

Distribution Overview

Central Tendency

Dispersion

Position

Ratio data

Distribution Overview

Central Tendency

Dispersion

Position

Session 3: Populations and Samples

Attribute types

Populations

Spatial variability and sampling a population

Random Sampling

Stratified Sampling

Cluster Sampling

Systematic Sampling

The single sample problem in geological / palaeo-biological studies

The single line problem in geological / palaeo-biological studies

The dual line problem in geological / palaeo-biological studies

The multiple line problem in geological / palaeo-biological studies

Session 4: Summarizing information using graphics

Coloring Graphics

Histogram plots

Box plot

Bag plot

Scatter plot

Index plot

Q-Q plot

Residuals plot

Frequency polygon

Pie chart

Pairs plot

Coplot

3-D plot

Session 5: Transformations and distributions

Transformations

The Box-Cox Normality Plot

Using the regression equation for prediction

Fitting a confidence interval

The variety of distributions

The box-plot eyeball method

Chebyshev's Theorem

Shapiro-Wilk test for normality

The D'Agostino test

The Kolmogorov-Smirnov Goodness-of-fit test

Comparing the sample data-frame with a random distribution

Fitting other distributions

Outliers

Session 6: Tests

Assumptions

Degrees of freedom [df]

Equal population variances

Z-test [used for interval and ratio data analysis]

T-test [interval, ratio data-analysis]

Independent [unpaired] t-test

Dependent [paired] t-test

Calculating a single probability value

Confidence Interval [CI]

Setting a confidence interval around a t distributed variable

Setting a confidence interval around a normally distributed variable

Calculating the power of the test, assuming standard deviation is known

Session 7: Measures of association

Correlation coefficients

Equitability

Comparative diversity indices

Similarity

Part two: Regression analysis using R

What this part covers. This part is designed for those who analyze samples taken from the natural environment [random samples], as opposed to samples from designed experiments. That is, it emphasizes sampling-designed problems, which are those associated with samples taken from the natural environment, rather than experimentally designed problems.

ANOVA and regression are complementary methods for data analysis, in that regression creates a model of reality and ANOVA evaluates the model. Both examine how a dependent [response] variable varies in response to factors [predictor or independent variables].
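As a rough illustration of how the two methods complement each other, the sketch below fits a simple regression and then evaluates it with an ANOVA table. It assumes R's built-in cars data set and the base functions lm() and anova(); the data set and the object names (fit, aov_table, r_squared) are chosen here for illustration and are not taken from the eBook.

```r
# Simple linear regression on R's built-in 'cars' data set
# (an assumption for illustration; not one of the eBook's data-frames).
fit <- lm(dist ~ speed, data = cars)

# Regression: the fitted model of stopping distance as a function of speed.
print(coef(fit))

# ANOVA: partitions the variability of the response into the part explained
# by the predictor and the residual part, with an F-test of the model.
aov_table <- anova(fit)
print(aov_table)

# Proportion of variability explained, computed from the ANOVA sums of
# squares; this matches the R-squared reported by summary(fit).
ss <- aov_table[["Sum Sq"]]
r_squared <- ss[1] / sum(ss)
print(r_squared)
```

The first row of the ANOVA table belongs to the predictor [speed] and the second to the residuals, so the ratio of the first sum of squares to the total gives the share of the response's variability that the regression model accounts for.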

Session 1: Sub-setting a data-frame

Session 2: Regression Analysis

Linear Regression

Simple regression

Multiple regression

Session 3: Transformation of predictor variables

The Box-Cox procedure

Session 4: Selection of variables for multivariate analysis

Session 5: Local regression analysis [loess]

Session 6: Prediction

Part three: Analysis of variance using R

This part is designed for those who analyze data from designed experiments in which a controlled set of samples is used, and for those who apply experimental design in the collection of sample data-frames.

Session 1: One way ANOVA

Session 2: One way ANCOVA

Session 3: Two way ANOVA

Session 4: Interaction effects in two way ANOVA

Session 5: Three way ANOVA

Session 6: MANOVA

Part four: Classification procedures using R

This part is designed for those who analyze data that need similarity or difference measurement, including clustering techniques.

The commonly used statistical procedures for classification are of two kinds. When the groups [classes] are known and the problem is to classify unknowns into one or another of the groups, the procedure of choice is discriminant function analysis. Characterization of the groups uses the MANOVA procedure. When the groups are unknown but there is a cloud of data points that needs to be separated into groups, the procedure is factor analysis followed by discriminant function analysis.

Session 1: Introduction to classification procedures

Session 2: The Similarity Indices procedures

Session 3: Clustering procedures

Session 4: Discriminant function analysis

Session 5: Factor Analysis and Principal Components procedures

Session 6: Tree models

Part five: Spatial analysis using R

This part is designed for those who analyze spatial data that need to be mapped in a geographical coordinate system.

Session 1: Spatial variables in 2D and 3D

Session 2: Time versus depth axes within an x, y coordinate system

Session 3: Two dimensional spatial analysis

Session 4: Three dimensional spatial analysis

Session 5: Introduction to computer mapping

Session 6: GRASS

Session 7: Using R within GRASS

Session 8: A worked problem

Part six: Advanced Graphics using R

This part is designed for those who already understand how to produce basic graphics in R and want to understand how to use ggplot2 to produce specialized and individualized graphs.

Session 1: Graphic libraries, ggplot2, par() and the main variety of possible graphics

Session 2: The Bar plots

Session 3: The Histogram plots

Session 4: Bivariate plots

Session 5: Trivariate plots

Session 6: Multivariate plots

Session 7: Cluster plots

Session 8: Color in graphics

Session 9: How to set up a presentation