---
title: "Introduction to testdat"
# author: "Danny Smith"
# date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to testdat}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE)
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(testdat)
library(dplyr)
```

testdat is a package designed to ease data validation, particularly for complex
data processing, inspired by software unit testing. testdat extends the strong
and flexible unit testing framework already provided by testthat with a family
of functions and reporting tools focused on checking data frames.

Features include:

* A fully fledged test framework so you can spend more time specifying tests and
  less time running them

* A set of common methods for simply specifying data validation rules

* Repeatability of data tests (avoid unintentionally breaking your data set!)

* Data-focused reporting of test results

# Getting started

As an extension of testthat, testdat uses the same basic testing framework.
Before using testdat (and reading this documentation), make sure you're familiar
with the introduction to testthat in
[R packages](https://r-pkgs.org/testing-basics.html).

The main addition provided by testdat is a set of expectations designed for
testing data frames and an accompanying mechanism to globally specify a data set
for testing.

# Data expectations

## Overview

In general, our approach to data testing is variable-centric - the majority of
expectations perform a check on one or more variables in a given data frame.

The standard form of a data expectation is:
```r
expect_*(var(s), ..., flt = TRUE, data = get_testdata())
```

testdat uses dplyr at its core, and thus supports tidy evaluation.

### Variables

Most operations act on one or more variables. There are two variants of the
variable argument:

* `vars` requires a set of columns specified as [tidy selections](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html).

```{r}
test_that("multi-variable identifier is unique", {
  expect_unique(c(name, year, month, day, hour), data = storms)
})
```

* `var` requires an unquoted variable name. This only applies to a small number
of expectations.

```{r}
test_that("hour values are valid", {
  expect_base(ts_diameter, year >= 2004)
})
```

### Filter

The `flt` argument takes a logical predicate defined in terms of the variables
in `data` using [data
masking](https://dplyr.tidyverse.org/reference/dplyr_data_masking.html). Only
rows where the condition evaluates to `TRUE` are included in the test.

```{r}
test_that("iris range checks", {
  expect_range(Petal.Width, 0, 1, data = iris)
})

test_that("iris range checks filtered", {
  # Test passes for setosa rows
  expect_range(Petal.Width, 0, 1, flt = Species == "setosa", data = iris)
  # Failures will provide the filter
  expect_range(Petal.Width, 0, 0.5, flt = Species == "setosa", data = iris)
})
```

### Data

The `data` argument takes a data frame to test. To avoid redundant code the data
argument defaults to a global test data set retrieved using `get_testdata()`.
This can be used in two ways:

* Setting the global test data using `set_testdata()`.

```{r}
set_testdata(iris)
identical(get_testdata(), iris)

test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
})
```

* Using the `with_testdata()` wrapper to temporarily set the global test data
for a block of code.

```{r}
set_testdata(mtcars)
identical(get_testdata(), mtcars)

with_testdata(iris, {
  test_that("Versicolor has sepal length greater than 5 - will fail", {
    expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
  })
})

identical(get_testdata(), mtcars)
```

Both approaches are equivalent to:

```{r}
test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5, data = iris)
})
```

By default, `set_testdata()` stores a quosure with a reference to the provided
data frame rather than the data itself, so changes made to the data frame will
be reflected in the test results.

```{r}
tmp_data <- tibble(x = c(1, 0), y = c(1, NA))

set_testdata(tmp_data)
print(get_testdata())
expect_base(y, x == 1)

tmp_data$y <- 1
print(get_testdata())
expect_base(y, x == 1)
```

### `...`

Additional arguments are specific to the expectation. See the help page for the
expectation function for details.

## Categories

Data expectations fall into a few main classes.

Value

:   `` ?`date-expectations` ``
:   `` ?`label-expectations` ``
:   `` ?`pattern-expectations` ``
:   `` ?`proportion-expectations` ``
:   `` ?`text-expectations` ``
:   `` ?`value-expectations` ``
    
    Value expectations test variable values for valid data. Tests include
    explicit value checks, pattern checks and others.

Relationships

:   `` ?`exclusivity-expectations` ``
:   `` ?`uniqueness-expectations` ``

    Relationship expectations test for relationships among variables. Tests
    include uniqueness and exclusivity checks.

Conditional

:   `` ?`conditional-expectations` ``
    
    Conditional expectations check for the co-existence of multiple conditions.

Data frame comparison

:   `` ?`datacomp-expectations` ``
    
    Data frame comparison expectations test for consistency between data frames,
    for example ensuring similar frequencies between similar variables in
    different data frames.

Generic

:   `` ?`generic-expectations` ``
    
    Generic expectations allow for testing of a data frame using an arbitrary
    function. The function provided should take a single vector as its first
    argument and return a logical vector showing whether each element has passed
    or failed. Additional arguments to the checking function can be passed as a
    list using the `args` argument.
    
    testdat includes a set of useful checking functions. See `` ?`chk-generic`
    `` for details. Several of the checking functions have a corresponding
    expectation. These are listed in `` ?`chk-expect` ``.

# Using tests

## Testing inside a script

The easiest way to use data testing is directly inside an R script. Expecations
and test blocks throw an error if they fail, so it is very clear to the user
that something needs to be checked, and the script will fail when sourced.

The example below shows how to adapt a script from exploratory "print and check"
to a testing approach, using a simple data processing exercise.

### Print and check

```{r}
library(dplyr)

x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1,   2000,   "NSW",  1,
  2,   3123,   "VIC",  NA,
  3,   2123,   "NSW",  3,
  4,   12345,  "VIC",  3
)

# check id is unique
x %>% filter(duplicated(id))

# check values
x %>% filter(!pcode %in% 2000:3999)
x %>% count(state)
x %>% count(nsw_only)

# check base for nsw_only variable
x %>% filter(state != "NSW") %>% count(nsw_only)

x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))

x %>% count(market)
```

### testdat

```{r}
library(testdat)
library(dplyr)

x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1,   2000,   "NSW",  1,
  2,   3123,   "VIC",  NA,
  3,   2123,   "NSW",  3,
  4,   12345,  "VIC",  3
)

with_testdata(x, {
  test_that("id is unique", {
    expect_unique(id)
  })
  
  test_that("variable values are correct", {
    expect_values(pcode, 2000:2999, 3000:3999)
    expect_values(state, c("NSW", "VIC"))
    expect_values(nsw_only, 1:3) # by default expect_values allows NAs
  })
  
  test_that("filters applied correctly", {
    expect_base(nsw_only, state == "NSW")
  })
})

x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))

with_testdata(x, {
  test_that("market derived correctly", {
    expect_values(market, 1:2, miss = NULL) # miss = NULL excludes NAs from valid values
  })
})
```

Note that:

* The global test data can be set with `set_testdata()` or `with_testdata()`.

* The `test_that()` wrapper is not strictly necessary, but it provides more
informative error messaging and logically groups a set of expectations.

* Each `with_testdata()` block will fail on the first `test_that()` block that
fails. The "filters applied correctly" test block is not run in the example.

## Using a test suite

Setting up a proper project test suite can be useful for automatically
validating final processed data sets.

Setting up proper file based testing infrastructure is out of the scope of this
vignette. See [R packages](https://r-pkgs.org/testing-basics.html) for a brief
introduction to testing infrastructure.

In testthat, related tests are grouped into files. When using testdat, each file
should have a test data set specified with a call to `set_testdata()`.

```{r}
set_testdata(iris)

test_that("Variable format checks", {
  expect_regex(Species, "^[a-z]+$")
})

test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
})
```