Works in Preparation

by George F. Hart

An introduction to data analysis using R

George F. Hart, Ph.D.,

Professor emeritus, LSU.

What does this eBook cover


This eBook concentrates on understanding the basic statistical procedures used to describe data and to analyze simple problems using the normal distribution. It is for those who have little or no knowledge of statistics and is intended to provide a working knowledge in a short time. The eBook avoids mathematical proofs and uses R to perform the procedures needed in data-analysis, concentrating on what the statistics are doing, why they are doing it, and what the results indicate about the data.

For my son Vaughan Ian Hart – the one who is really good at numbers!

Table of Contents

Part One: Fundamental statistics using R

This part is a basic course in statistics and graphics that avoids mathematical proofs and formulae. It uses R code and R procedures to outline the fundamental knowledge needed to use statistics.


  • Session 1: What is R
  • What is not covered

    Getting R to work on your computer

    Libraries

    Data-frames for this course

    Importing a data-frame into R

    Data within R

    Importing from a vector or list

    Importing from a matrix

    Importing data from a data-base

    Exporting a data-frame out of R

    As text output

    As a spreadsheet

    As a table

    As a graphic

    Saving objects in R

    Saving a function in R

    Saving your commands

    Basic descriptive statistics

    Median

    Mode

    Mean

    Variance

    Standard deviation

    Coefficient of variation

    Maximum – Minimum – Range

    The z-score

    Mid-range

    Percentile

    Quartile

    Skewness

    Kurtosis

    Summary numbers used in descriptive statistics

    Summary information using Contingency Tables

  • Session 2: Basic concepts for the data analyst
  • Error

    Type 1 error

    Type 2 error

    The nature of data

    Percentages, counts and other data measures

    Scales of measurement

    Nominal data

    Dispersion

    Association

    Ordinal data

    Distribution Overview

    Central Tendency

    Dispersion

    Position

    Association

    Interval data

    Distribution Overview

    Central Tendency

    Dispersion

    Position

    Ratio data

    Distribution Overview

    Central tendency

    Dispersion

    Position

  • Session 3: Populations and Samples
  • Attribute types

    Populations

    Spatial variability and sampling a population

    Random Sampling

    Stratified Sampling

    Cluster Sampling

    Systematic Sampling

    The single sample problem in geological / palaeo-biological studies

    The single line problem in geological / palaeo-biological studies

    The dual line problem in geological / palaeo-biological studies

    The multiple line problem in geological / palaeo-biological studies

  • Session 4: Summarizing information using graphics
  • Coloring Graphics

    Histogram plots

    Box plot

    Bag plot

    Scatter plot

    Index plot

    Q-Q plot

    Residuals plot

    Frequency polygon

    Pie chart

    Pairs plot

    Coplot

    3-D plot

  • Session 5: Transformations and distributions
  • Transformations

    The Box-Cox Normality Plot

    Using the regression equation for prediction

    Fitting a confidence interval

    The variety of distributions

    The box-plot eye ball method

    Chebyshev's Theorem

    Shapiro-Wilk test for normality

    The D'Agostino test

    The Kolmogorov-Smirnov Goodness-of-fit test

    Comparing the sample data-frame with a random distribution

    Fitting other distributions

    Outliers

  • Session 6: Tests
  • Assumptions

    Degrees of freedom [df]

    Equal population variances

    Z-test [used for interval and ratio data analysis]

    T-test [interval, ratio data-analysis]

    Independent [unpaired] t-test

    Dependent [paired] t-test

    Calculating a single probability value

    Confidence Interval [CI]

    Setting a confidence interval around a t distributed variable

    Setting a confidence interval around a normally distributed variable

    Calculating the power of the test, assuming standard deviation is known

  • Session 7: Measures of association
  • Correlation coefficients

    Equitability

    Comparative diversity indices

    Similarity

  • Part two: Regression analysis using R

    This part is designed for those who analyze samples taken from the natural environment [random samples], as opposed to samples from designed experiments; that is, it emphasizes sampling-designed analysis, not experimentally designed problems.

    ANOVA and regression are complementary methods for data analysis, in that regression creates a model of reality and ANOVA evaluates the model. Both examine how a dependent [response] variable varies in response to factors [predictor or independent variables].
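    As a minimal sketch of this relationship, the following fits a simple regression with lm() and then evaluates it with anova(). The data are simulated purely for illustration; the variable names x, y, fit and tab are assumptions and are not taken from the eBook.

```r
# Simulated data: the predictor x and the noise term are illustrative
# assumptions, not data from the eBook.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)

# Regression creates the model: the response y as a linear function of x.
fit <- lm(y ~ x)
print(coef(fit))

# ANOVA evaluates the model: it partitions the variability of y into the
# part explained by x and the residual part.
tab <- anova(fit)
print(tab)
```

    The ANOVA table has one row for the predictor and one for the residuals; the F statistic compares the variability explained by the model with the variability left unexplained.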


  • Session 1: Sub-setting a data-frame
  • Session 2: Regression Analysis
  • Linear Regression

    Simple regression

    Multiple regression

  • Session 3: Transformation of predictor variables
  • The Box-Cox procedure

  • Session 4: Selection of variables for multivariate analysis
  • Session 5: Local regression analysis [loess]
  • Session 6: Prediction
  • Part three: Analysis of variance using R

    This part is designed for those who analyze data from designed experiments in which a controlled set of samples is used, and for those who apply experimental design in the collection of sample data-frames.

  • Session 1: One way ANOVA
  • Session 2: One way ANCOVA
  • Session 3: Two way ANOVA
  • Session 4: Interaction effects in two way ANOVA
  • Session 5: Three way ANOVA
  • Session 6: MANOVA
  • Part four: Classification procedures using R

    This part is designed for those who analyze data that need similarity or difference measurement, including clustering techniques.
    The commonly used statistical procedures for classification are of two kinds. When the groups [classes] are known and the problem is to classify unknowns into one or another of the groups, the procedure of choice is discriminant function analysis. Characterization of the groups uses the MANOVA procedure. When the groups are unknown but a cloud of data points exists that needs to be separated into groups, the procedure is factor analysis followed by discriminant function analysis.
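    The known-groups case can be sketched as follows, assuming the lda() function from the MASS package (distributed with R) and the built-in iris data-frame as a stand-in example. Neither choice is taken from the eBook, and the three rows treated as "unknowns" are picked only for illustration.

```r
library(MASS)  # provides lda(); MASS is distributed with standard R installations

# Known groups: the three species in the built-in iris data-frame
# (an illustrative stand-in, not the eBook's own data).
fit <- lda(Species ~ ., data = iris)

# Treat three rows as "unknowns": pass only their four measurements
# and let the discriminant functions assign each to a group.
unknowns <- iris[c(1, 51, 101), 1:4]
pred <- predict(fit, newdata = unknowns)

print(pred$class)                 # assigned group for each unknown
print(round(pred$posterior, 3))   # posterior probability of each group
```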

  • Session 1: Introduction to classification procedures
  • Session 2: The Similarity Indices procedures
  • Session 3: Clustering procedures
  • Session 4: Discriminant function analysis
  • Session 5: Factor Analysis and Principal Components procedures
  • Session 6: Tree models
  • Part five: Spatial analysis using R

    This part is designed for those who analyze spatial data that need to be mapped in a geographical coordinate system.

  • Session 1: Spatial variables in 2D and 3D
  • Session 2: Time versus depth axes within an x, y coordinate system
  • Session 3: Two dimensional spatial analysis
  • Session 4: Three dimensional spatial analysis
  • Session 5: Introduction to computer mapping
  • Session 6: GRASS
  • Session 7: Using R within GRASS
  • Session 8: A worked problem
  • Part six: Advanced Graphics using R

    This part is designed for those who already understand how to produce basic graphics in R and want to understand how to use ggplot2 to produce specialized and individualized graphs.

  • Session 1: Graphic libraries, ggplot2, par() and the main varieties of possible graphics
  • Session 2: The Bar plots
  • Session 3: The Histogram plots
  • Session 4: Bivariate plots
  • Session 5: Trivariate plots
  • Session 6: Multivariate plots
  • Session 7: Cluster plots
  • Session 8: Color in graphics
  • Session 9: How to set up a presentation