Project 2

Due: Wednesday, November 23, 12:00pm (noon) on Canvas

Submission: Submit both a pdf containing your report, and a file containing the R code you used in your analysis.

Data

The dataset for this project comes from a long-term research program run by Wake Forest University in the Serengeti Park in Tanzania. Dr. T. Michael Anderson in the Department of Biology runs the project, with assistance from Dr. Staci Hepler and Dr. Rob Erhardt in our Department of Statistical Sciences (along with folks at other universities).

The data are obtained from cameras installed at 19 different sites in the Serengeti. The cameras are triggered by motion and heat, so when an animal passes by they take a photgraph which is then uploaded to snapshotserengeti.org. Volunteers classify the images after some training. In this subset of the data you will analyze three species (Topi, zebra, Thomson’s gazelle) over 8-day periods. The variables count count the number of each animal during the 8 days at the particular site. The variables present simply indicate whether or not the animal was ever present. Also measured are a number of environmental variables.

Variables

The dataset provided for this project contains the following columns:

  • siteID is simply an identifier for each location
  • site.date refers to the site and date of the 8 day period corresponding to the row
  • date refers to the date only of the 8 day period corresponding to the row
  • ndvi stands for , which is a measure of how “green” the location is. Locations have high values of ndvi if they have a lot of vegetation, if the vegetation is of high quality and very green, or some combination of the two.
  • gazelleThomsons.count refers to the count of Thomson’s gazelles that were spotted over the 8-day period.
  • gazelleThomsons.present is a simple indicator function showing if any Thomsons gazelle were present during the 8-day period.
  • topi.count refers to the count of Topi that were spotted over the 8-day period.
  • topi.present is a simple indicator function showing if any Topi were present during the 8-day period.
  • zebra.count refers to the count of Zebra that were spotted over the 8-day period.
  • zebra.present is a simple indicator function showing if any Zebra were present during the 8-day period.
  • fire refers to the presence of a detected wildfire in the 30 days preceeding the time period.
  • amRivDist is the distance from the location to the nearest river. I honestly don’t know the units, it’s just some sort of length.
  • TM100 counts the number of termite mounds within 100 meters of the location. Termite mounds are an indicator of soil quality.
  • LriskDry is a measure of how “risky” the location is to lion predation. It is computed based on the number of previous recorded lion attacks at the location. It has no units.
  • T50 is a count of the number of trees within 50 meters of the location.

Downloading the data

To load the data, use the code below.

serengeti <- read.csv("https://sta712-f22.github.io/projects/serengeti.csv")

Research questions

You are asked by the research team to build models for the count of each species observed at the different sites. In particular, they are interested in the following two questions:

  1. Is there a relationship between the environmental variables and the counts for each species?
  2. Are the same environmental variables important for modeling the counts for each species?

To address these research questions, you will perform exploratory data analysis, fit several models, and analyze the results.

What will you be turning in?

You will submit two documents on Canvas for this project.

  • Your written report (.pdf file):
    • This is the knitted write up that will explain your work, and the answers to the researchers’ questions
    • This report will contain very specific sections, which you must label and include in your final report. More details on the sections needed for this report are included below.
    • In your formal report, there should be no code showing. This includes warnings and other stray code output – hide it all.
    • You will be graded on writing as well as your statistical analysis.
  • A code file (.R or .Rmd), with all the code you used for your analysis
    • The goal is that a person who reads your report, and wants to replicate your results, could look at your code file and completely reproduce the results and figures in the knitted report.

Content

Your report should contain the following labeled sections.

Abstract

This is the first section in your report, but it is actually the last thing you will write. Called an abstract in academia, and an executive summary in industry, this one paragraph summary of your entire paper is the first thing people will read. For example:

Estimating the number of pandas inhabiting a national panda preserve is critical to understanding the health of the panda species as a whole. In this report, we discuss the process of estimating the number of pandas in the Wolong National Nature Reserve. Data were collected from volunteers walking trails in the park, and we applied capture-recapture estimation process to then estimate the total number of pandas in the park. We detail the steps of the process, and compare our results to two other possible methods of estimating pandas. Based on our estimation, there are between 140 and 155 pandas in the surveyed area. We discuss our findings, and limitations of the study, in the following paper.

Section 1: Introduction

Write 2-3 paragraphs in which you:

  • Motivate the research questions
  • Describe where the data came from
  • Provide background on the dataset
  • Finish by summarizing what you will do in the report

Section 2: Data

In this section, you will summarize and explore your data. First, write 1-2 paragraphs which:

  • Describe details of the data. What does a row in the data represent? How many rows and columns? What information do the variables record? Are there any missing data?
  • Describe any data manipulation. Did you remove any missing values? Did you focus only on a subset of the data? Did you create any new variables?

Next, perform exploratory data analysis. Your goal is to create any visualizations necessary to explore the data before fitting a model (e.g. tables, histograms, empirical log mean plots to choose transformations, etc.). Make sure that:

  • You create a plot showing the distribution of your response variable(s)
  • Your figures have axis labels, captions, and are numbered Figure 2.1, Figure 2.2, etc., where the 2. represents the section number, and the number after the decimal represents how many plots so far in this section
  • Any plot that is shown should be discussed in writing in your report, and referenced by number (e.g., “As shown in Figure 2.1…”). What information does this graph give us, and why is that important for us to know when before we start building our model in the next section?

Section 3: Modeling

In this section, you will build one or more models to answer the two research questions. For each model, make sure to include the population form of the model you are using and explain why this model is appropriate for these data.

You will then fit a model and write down the equation of the fitted model using appropriate notation.

Section 3.1: Importance of environmental variables

For the first research question, make sure to include environmental variables in your models. Interpret the relevant coefficients in context of the research question, and perform hypothesis tests that allows you to address the question. When testing hypotheses, make sure to:

  • State the hypotheses in terms of one or more model parameters
  • Calculate a test statistic and p-value
  • Use the p-value to make a conclusion about the research question

Section 3.2: Comparing species

For the second research question, we are interested in comparing models for the different species counts. Choose models for each species count based on your exploratory data analysis above, and through model selection using AIC or BIC. Comment on any relationships you see in your models. Are the same variables selected to model each species?

Section 3.3: Model diagnostics

Use model diagnostics to check the assumptions for your models in sections 3.1 and 3.2. You should use quantile residual plots to check the shape assumption, Cook’s distance to identify potential influential points, and VIFs to check for multicollinearity.

While this section comes after 3.1 and 3.2 in the report, you should address any assumption violations before creating your final models for 3.1 and 3.2.

Section 4: Discussion

In this section, you are providing a conclusion. Write 1–2 paragraphs which summarize what you learned from your analysis, and how it addresses the original research questions. Also discuss any limitations to your analysis, and what you other relationships you might explore in future.

Appearance and style

Writing

The report should be written like an article or research paper: in full sentences and paragraphs, with headings for each section. You should not write your report with question numbers or as a list of bullet points. Scientific articles are generally written in third person, though “we” can also be acceptable (“we can see from Figure 2.1 …”) in some disciplines.

Code

In full reports, the only output that should be visible from code chunks are figures and tables. If a code chunk does not produce a figure or table, you can hide it from the knitted document with include=F:

```{r, include=F}

```

If a code chunk produces a figure or table, only the figure or table should be visible in the knitted document. You can hide the chunk but display the output with echo=F, message=F, warning=F:

```{r, echo=F, message=F, warning=F}

```

Figures

Figures should have labeled axes, and should be clear and easy to read. Figures should also be captioned and numbered; to caption a figure, use fig.cap = "..." in the chunk options. For example (scroll to the right on the code to see it all),

```{r, echo=F, message=F, warning=F, fig.cap="Figure 2.1: Bill depth vs. bill length for penguins near Palmer Station, Antarctica."}
penguins %>%
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm)) +
  geom_point() +
  labs(x = "Bill length (mm)",
       y = "Bill depth (mm)") +
  theme_bw()
```

is displayed as

Figure 1: Bill depth vs. bill length for penguins near Palmer Station, Antarctica.

Figure 1: Bill depth vs. bill length for penguins near Palmer Station, Antarctica.

Captions should provide enough information to understand what is being plotted, but interpretation can be left to the main text. Refer to figures by their number in the text. Make sure that any figures you include are discussed in the text.

Tables

Tables should be nicely formatted, and have a number and caption. This can be done with the kable function in the knitr package.

For example,

```{r, echo=F, message=F}`r''`
table(penguins$island, penguins$species) %>%
  knitr::kable(caption = "Table 3.2: Penguins by island and species")
```

is displayed as

Table 3.2: Penguins by island and species
Adelie Chinstrap Gentoo
Biscoe 44 0 124
Dream 56 68 0
Torgersen 52 0 0

Writing math in R Markdown

If you want to write mathematical notation, we need to tell Markdown, “Hey, we’re going to make a math symbol!” To do that, you use dollar signs. For instance, to make \(\widehat{\beta}_1\), you simply put $\widehathat{\beta}_1$ into the white space (not a chunk) in your Markdown.

Here are some examples of writing math, which you can adapt:

Math Code
\(Y_i \sim Bernoulli(p_i)\) $Y_i \sim Bernoulli(p_i)$
\(\log \left( \dfrac{p_i}{1 - p_i} \right)\) $\log \left( \dfrac{p_i}{1 - p_i} \right)$
\(\widehat{p}_i\) $\widehat{p}_i$

Collaboration

You are welcome to work with other students on this project, but everyone needs to submit their own report.