--- title: "Introducing conjecture" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{conjecture} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE ) ``` `conjecture()` is the black swan of the `sift` family. If you encounter the type of eccentric datasets conjecture is designed to tackle, you'll be glad you read this vignette. At its heart, conjecture is a reshaping operation similar to `tidyr::pivot_wider()`. However, the intended application for conjecture is more idiosyncratic than that of pivot_wider. This vignette illustrates the basic aspects of such an application. ## Example 1: Radio Transmissions The `comms` dataset contains a time-series of radio transmissions. ```{r} library(sift) library(dplyr) library(tidyr) comms ``` A few notes: * We are **not interested** in the interaction between the stations. Each station (A, B, C, D) can be regarded as 4 independent time-series. * The subject of each transmission is denoted by `msg_code`. * Each `msg_code` can be repeated multiple times (see below). * There is **no guarantee** that a *sent* transmission will be met with a *response* sharing the same `msg_code`. ```{r} comms %>% filter(station == "C", msg_code == 3060) ``` Suppose we wish to restructure `comms` so that the "natural" pairing of `send` + `receive` transmissions is more apparent. Since there is no explicit information linking these rows together, we "conjecture" that, for a given `send` transmission (anterior), the corresponding `receive` transmission (posterior) is the closest observation measured by `timestamp`. `conjecture()` always takes **4** arguments. 1. Dataset to reshape (`comms`). 2. Column, as a symbol, used to measure distance between observations (`timestamp`). 3. Column, as a symbol, that demarks observations as anterior or posterior (`type`). 4. Scalar quantity signifying anterior observation (`"send"`). ```{r} comms_conjecture <- conjecture(comms, # dataset to reshape. timestamp, # friendly. must be coercible to numeric. type, # any type of atomic vector is fine. "send") # we could flip our logic and supply "receive" instead. comms_conjecture ``` We can partially achieve the same result with pivot_wider. ```{r} comms_pivot <- comms %>% pivot_wider(names_from = type, values_from = timestamp, values_fn = first) %>% filter(receive > send) comms_pivot ``` Notice that pivot_wider produces `r nrow(comms_pivot)` rows compared to `r nrow(comms_conjecture)` in `comms_conjecture`. What pairs are found in `comms_conjecture` that aren't captured in `comms_pivot`? **First**, there a quite a few transmissions that do not elicit a response. conjecture doesn't sweep these under the rug. ```{r} comms_pivot %>% filter(is.na(receive)) comms_conjecture %>% filter(is.na(receive)) ``` **Second**, our call to pivot_wider only returned the "first viable pairs" within each combination of `station` + `msg_code`. On the other hand, `comms_conjecture` contains 3 (4 including missing value) viable pairs for the below combination. ```{r} comms_pivot %>% filter(station == "A", msg_code == 221) comms_conjecture %>% filter(station == "A", msg_code == 221) ``` The inclusion of multiple pairs for a given `station` + `msg_code` combination is the touchstone of conjecture. ### Underlying Logic We'll use a small fragment from `comms` to illustrate how conjecture works. ```{r} comms_small <- comms %>% filter(station == "A", msg_code == 221) comms_small ``` We can readily identify the send/receive pairs from the above observations. But how does conjecture accomplish this programmatically? 1. timestamps (specified by `sort_by = timestamps`) are separated into two vectors (specified by `names_from = type`). ```{r} send <- comms_small %>% filter(type == "send") %>% pull(timestamp) %>% sort() send receive <- comms_small %>% filter(type == "receive") %>% pull(timestamp) %>% sort() receive ``` 2. Iterate through each element in `send`, with a nested loop for each element in `receive`. We can invert this hierarchy by setting `names_first = "receive"` instead. ```{r} output <- integer(length = length(send)) for (i in seq_along(send)) { output[i] <- NA_integer_ for (j in seq_along(receive)) { if (is.na(receive[j])) { next } else if (receive[j] > send[i]) { output[i] <- j break } else { next } } } tibble(send, receive = receive[output]) ``` Conceptually, the above process flow is an accurate depiction of conjecture - though the underlying structure is more robust: * conjecture reconciles the presence of additional columns, similar to conventional reshaping operations. * conjecture relies on C++ to execute the looping structure. ### Duplicate Posterior Values There is an important consequence associated with the above logic. We'll demonstrate by removing all but **one** of the `receive` elements from `comms_small`. ```{r} # from comms small receive <- receive[3] # rerun the algorithm for (i in seq_along(send)) { output[i] <- NA_integer_ for (j in seq_along(receive)) { if (is.na(receive[j])) { next } else if (receive[j] > send[i]) { output[i] <- j break } else { next } } } tibble(send, receive = receive[output]) ``` Why does `1999-02-21 12:29:59` appear 3 times? Recall: "for a given `send` transmission (anterior), the corresponding `receive` transmission (posterior) is the closest observation measured by `timestamp`." The above result is in accordance with this statement. However, at some point in the future, I may add the ability to drop repeat occurrences of posterior timestamps, which would produce the following result instead. ```{r echo = FALSE} tibble(send, receive = receive[c(1, NA, NA, NA)]) ``` ## Example 2: Toll Lane Records The `express` dataset contains toll records for **northbound** and **southbound** vehicles over the course of one business day. ```{r} library(readr) library(mopac) mopac::express ``` Suppose we are interested in vehicles using the express lane both `North` and `South` (i.e. commuting to work). It's up to us to designate an anterior `direction`. If we are only interested in vehicles commuting downtown, we set `names_first = "South"`. ```{r} conjecture(express, time, direction, "South") %>% drop_na() # We can't assume incomplete pairs are commuting to downtown ``` ```{r fig.keep='none'} library(ggplot2) conjecture(express, time, direction, "South") %>% drop_na() %>% mutate(trip_length = difftime(North, South, units = "hours")) %>% ggplot(aes(trip_length)) + geom_histogram() ``` ```{r, echo = FALSE, fig.width=4} library(ggplot2) conjecture(express, time, direction, "South") %>% drop_na() %>% mutate(trip_length = difftime(North, South, units = "hours")) %>% ggplot(aes(trip_length)) + geom_histogram() + scale_y_continuous(expand = c(0, 0)) + theme_minimal() + theme(panel.grid.minor = element_blank(), plot.title.position = "plot") + labs(title = "Trip length distribution", subtitle = "Vehicles commuting downtown", x = "Round trip length [hours]", y = NULL) ```