conjecture() is the black swan of the sift family. If you encounter the
type of eccentric datasets conjecture is designed to tackle, you’ll be
glad you read this vignette.
At its heart, conjecture is a reshaping operation similar to
tidyr::pivot_wider(). However, the intended application for conjecture
is more idiosyncratic than that of pivot_wider. This vignette
illustrates the basic aspects of such an application.
The comms dataset contains a time series of radio transmissions.
library(sift)
library(dplyr)
library(tidyr)
comms
#> # A tibble: 50,000 × 4
#> station timestamp msg_code type
#> <chr> <dttm> <int> <chr>
#> 1 D 1999-01-21 09:37:57 2537 send
#> 2 C 1999-01-24 17:52:07 720 send
#> 3 D 1999-01-25 09:12:31 1332 receive
#> 4 D 1999-01-25 17:46:18 2959 receive
#> 5 B 1999-01-25 19:43:17 512 receive
#> 6 A 1999-01-26 01:08:25 2197 receive
#> 7 B 1999-01-26 01:26:43 986 receive
#> 8 A 1999-01-26 05:13:40 2851 receive
#> 9 B 1999-01-26 10:04:34 2108 receive
#> 10 D 1999-01-26 17:50:58 2531 send
#> # ℹ 49,990 more rows
A few notes:
- [...] msg_code.
- [...] msg_code can be repeated multiple times (see below).
- [...] msg_code.
comms %>%
  filter(station == "C",
         msg_code == 3060)
#> # A tibble: 14 × 4
#> station timestamp msg_code type
#> <chr> <dttm> <int> <chr>
#> 1 C 1999-02-10 07:33:11 3060 send
#> 2 C 1999-02-12 03:43:31 3060 receive
#> 3 C 1999-02-14 23:07:35 3060 send
#> 4 C 1999-02-17 14:31:48 3060 receive
#> 5 C 1999-02-18 17:33:57 3060 receive
#> 6 C 1999-02-19 06:43:25 3060 receive
#> 7 C 1999-02-21 05:24:25 3060 send
#> 8 C 1999-02-21 09:00:18 3060 send
#> 9 C 1999-02-22 01:13:55 3060 send
#> 10 C 1999-02-22 15:38:07 3060 receive
#> 11 C 1999-02-22 20:39:10 3060 send
#> 12 C 1999-02-26 15:41:56 3060 receive
#> 13 C 1999-03-01 05:35:59 3060 receive
#> 14 C 1999-03-01 11:54:49 3060 receive
Suppose we wish to restructure comms so that the “natural” pairing of
send + receive transmissions is more apparent. Since there is no
explicit information linking these rows together, we “conjecture” that,
for a given send transmission (anterior), the corresponding receive
transmission (posterior) is the closest observation measured by
timestamp.
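conjecture() always takes 4 arguments:
1. The dataset to reshape (comms).
2. sort_by: the column that orders the observations (timestamp).
3. names_from: the column whose values become the new column names (type).
4. names_first: the value of names_from to treat as anterior ("send").
comms_conjecture <- conjecture(comms,     # dataset to reshape.
                               timestamp, # <dttm> friendly. must be coercible to numeric.
                               type,      # any type of atomic vector is fine.
                               "send")    # we could flip our logic and supply "receive" instead.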
comms_conjecture
#> # A tibble: 24,958 × 4
#> station msg_code send receive
#> <chr> <int> <dttm> <dttm>
#> 1 D 2537 1999-01-21 09:37:57 1999-02-16 03:56:29
#> 2 C 720 1999-01-24 17:52:07 1999-02-22 18:24:57
#> 3 D 2531 1999-01-26 17:50:58 1999-02-09 13:14:33
#> 4 D 2992 1999-01-27 08:48:56 1999-02-22 18:05:55
#> 5 A 2262 1999-01-27 15:19:56 1999-02-12 01:43:42
#> 6 B 1785 1999-01-27 18:11:04 1999-02-07 09:07:50
#> 7 C 1624 1999-01-27 21:33:09 1999-02-20 11:07:54
#> 8 C 2280 1999-01-28 02:06:18 1999-02-18 17:25:13
#> 9 B 1170 1999-01-28 06:55:33 NA
#> 10 B 2137 1999-01-28 08:30:30 1999-02-26 23:13:21
#> # ℹ 24,948 more rows
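The comment on the sort_by argument says it must be coercible to numeric. As a quick illustration (a sketch with made-up timestamps, not package code), a <dttm> value is stored as seconds since 1970-01-01 UTC, so comparing two timestamps is equivalent to comparing two numbers:

```r
# Made-up timestamps; tz is set explicitly so the result is reproducible.
t0 <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
t1 <- as.POSIXct("1999-01-21 09:37:57", tz = "UTC")

as.numeric(t0)  # seconds elapsed since the epoch
#> [1] 86400

# Ordering the datetimes and ordering their numeric values agree.
same_order <- (t1 > t0) == (as.numeric(t1) > as.numeric(t0))
same_order
#> [1] TRUE
```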
We can partially achieve the same result with pivot_wider.
comms_pivot <- comms %>%
  pivot_wider(names_from = type,
              values_from = timestamp,
              values_fn = first) %>%
  filter(receive > send)
comms_pivot
#> # A tibble: 4,734 × 4
#> station msg_code send receive
#> <chr> <int> <dttm> <dttm>
#> 1 D 2537 1999-01-21 09:37:57 1999-02-16 03:56:29
#> 2 C 720 1999-01-24 17:52:07 1999-02-22 18:24:57
#> 3 D 2531 1999-01-26 17:50:58 1999-02-09 13:14:33
#> 4 D 2992 1999-01-27 08:48:56 1999-02-22 18:05:55
#> 5 A 2262 1999-01-27 15:19:56 1999-02-12 01:43:42
#> 6 B 1785 1999-01-27 18:11:04 1999-02-07 09:07:50
#> 7 C 1624 1999-01-27 21:33:09 1999-02-20 11:07:54
#> 8 C 2280 1999-01-28 02:06:18 1999-02-18 17:25:13
#> 9 B 2137 1999-01-28 08:30:30 1999-02-26 23:13:21
#> 10 A 924 1999-01-29 00:41:03 1999-02-17 10:50:19
#> # ℹ 4,724 more rows
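To see why this pivot_wider call only partially reproduces the result, here is a small sketch on made-up data (the toy tibble and its id column are assumptions for illustration, not part of comms). With values_fn = first, each group collapses to a single row, and the filter step then removes any row whose receive is NA:

```r
library(dplyr)
library(tidyr)

# Made-up data: group "x" has two send/receive pairs,
# group "y" has a single unanswered send.
toy <- tibble(
  id   = c("x", "x", "x", "x", "y"),
  type = c("send", "receive", "send", "receive", "send"),
  time = c(1, 2, 3, 4, 5)
)

toy_wide <- toy %>%
  pivot_wider(names_from = type,
              values_from = time,
              values_fn = first)

nrow(toy_wide)                          # one row per id; the (3, 4) pair is gone
#> [1] 2
nrow(filter(toy_wide, receive > send))  # the NA row for "y" is dropped as well
#> [1] 1
```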
Notice that pivot_wider produces 4,734 rows compared to 24,958 in
comms_conjecture. What pairs are found in comms_conjecture that aren’t
captured in comms_pivot?
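First, there are quite a few transmissions that do not elicit a response. conjecture doesn’t sweep these under the rug. In comms_pivot they are discarded by filter(receive > send), because a comparison against NA is never TRUE.
comms_pivot %>%
  filter(is.na(receive))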
#> # A tibble: 0 × 4
#> # ℹ 4 variables: station <chr>, msg_code <int>, send <dttm>, receive <dttm>
comms_conjecture %>%
  filter(is.na(receive))
#> # A tibble: 10,857 × 4
#> station msg_code send receive
#> <chr> <int> <dttm> <dttm>
#> 1 B 1170 1999-01-28 06:55:33 NA
#> 2 D 2258 1999-01-29 01:32:47 NA
#> 3 D 1519 1999-01-29 08:38:16 NA
#> 4 C 1799 1999-01-29 19:02:48 NA
#> 5 A 132 1999-01-30 00:21:29 NA
#> 6 C 1542 1999-01-30 10:57:17 NA
#> 7 B 791 1999-01-30 11:08:52 NA
#> 8 A 1548 1999-01-30 17:09:38 NA
#> 9 A 100 1999-01-30 21:55:40 NA
#> 10 C 2900 1999-01-31 02:18:50 NA
#> # ℹ 10,847 more rows
Second, our call to pivot_wider only returned the first viable pair
within each combination of station + msg_code. In contrast,
comms_conjecture contains 3 viable pairs (4 rows including the
unanswered send) for the combination below.
comms_pivot %>%
  filter(station == "A",
         msg_code == 221)
#> # A tibble: 1 × 4
#> station msg_code send receive
#> <chr> <int> <dttm> <dttm>
#> 1 A 221 1999-02-05 22:52:22 1999-02-11 03:37:52
comms_conjecture %>%
  filter(station == "A",
         msg_code == 221)
#> # A tibble: 4 × 4
#> station msg_code send receive
#> <chr> <int> <dttm> <dttm>
#> 1 A 221 1999-02-05 22:52:22 1999-02-11 03:37:52
#> 2 A 221 1999-02-11 21:38:03 1999-02-18 16:37:46
#> 3 A 221 1999-02-19 07:43:27 1999-02-21 18:29:59
#> 4 A 221 1999-03-01 13:26:50 NA
The inclusion of multiple pairs for a given station + msg_code
combination is the hallmark of conjecture.
We’ll use a small fragment from comms to illustrate how conjecture works.
comms_small <- comms %>%
  filter(station == "A",
         msg_code == 221)
comms_small
#> # A tibble: 7 × 4
#> station timestamp msg_code type
#> <chr> <dttm> <int> <chr>
#> 1 A 1999-02-05 22:52:22 221 send
#> 2 A 1999-02-11 03:37:52 221 receive
#> 3 A 1999-02-11 21:38:03 221 send
#> 4 A 1999-02-18 16:37:46 221 receive
#> 5 A 1999-02-19 07:43:27 221 send
#> 6 A 1999-02-21 18:29:59 221 receive
#> 7 A 1999-03-01 13:26:50 221 send
We can readily identify the send/receive pairs from the above observations. But how does conjecture accomplish this programmatically?
1. The values to compare (sort_by = timestamp) are separated into two vectors (specified by names_from = type).
send <- comms_small %>% filter(type == "send") %>% pull(timestamp) %>% sort()
send
#> [1] "1999-02-05 22:52:22" "1999-02-11 21:38:03" "1999-02-19 07:43:27"
#> [4] "1999-03-01 13:26:50"
receive <- comms_small %>% filter(type == "receive") %>% pull(timestamp) %>% sort()
receive
#> [1] "1999-02-11 03:37:52" "1999-02-18 16:37:46" "1999-02-21 18:29:59"
2. We loop over each element in send, with a nested loop for each element in receive. We can invert this hierarchy by setting names_first = "receive" instead.
output <- integer(length = length(send))
for (i in seq_along(send)) {
  output[i] <- NA_integer_            # default: no match found
  for (j in seq_along(receive)) {
    if (is.na(receive[j])) {
      next
    } else if (receive[j] > send[i]) {
      output[i] <- j                  # first receive after this send
      break
    } else {
      next
    }
  }
}
tibble(send, receive = receive[output])
#> # A tibble: 4 × 2
#> send receive
#> <dttm> <dttm>
#> 1 1999-02-05 22:52:22 1999-02-11 03:37:52
#> 2 1999-02-11 21:38:03 1999-02-18 16:37:46
#> 3 1999-02-19 07:43:27 1999-02-21 18:29:59
#> 4 1999-03-01 13:26:50 NA
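The nested loop above can also be written without explicit loops. The following is a minimal sketch, assuming the posterior values contain no missing values; first_later() is a hypothetical helper written for this vignette, not a function exported by sift and not a description of how sift is implemented. It relies on base R's findInterval(), which counts how many sorted values are less than or equal to each input.

```r
# Hypothetical helper (not part of sift): for each anterior value, return the
# first posterior value that is strictly later, or NA if there is none.
first_later <- function(anterior, posterior) {
  posterior <- sort(posterior)
  # Number of posterior values <= each anterior value ...
  n_not_later <- findInterval(as.numeric(anterior), as.numeric(posterior))
  # ... so the next index points at the first strictly later value.
  idx <- n_not_later + 1L
  idx[idx > length(posterior)] <- NA_integer_
  posterior[idx]
}

# Same timestamps as comms_small; the time zone is an assumption.
send <- as.POSIXct(c("1999-02-05 22:52:22", "1999-02-11 21:38:03",
                     "1999-02-19 07:43:27", "1999-03-01 13:26:50"), tz = "UTC")
receive <- as.POSIXct(c("1999-02-11 03:37:52", "1999-02-18 16:37:46",
                        "1999-02-21 18:29:59"), tz = "UTC")

paired <- first_later(send, receive)
is.na(paired)  # only the last send has no later receive
#> [1] FALSE FALSE FALSE  TRUE
```

The first three elements of paired match the three receive timestamps in order, which agrees with the loop's result.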
Conceptually, the above process is an accurate depiction of conjecture, though the underlying implementation is more robust.
There is an important consequence associated with the above logic.
We’ll demonstrate by removing all but one of the receive elements from
comms_small.
# keep only the last receive timestamp from comms_small
receive <- receive[3]
# rerun the algorithm
for (i in seq_along(send)) {
  output[i] <- NA_integer_
  for (j in seq_along(receive)) {
    if (is.na(receive[j])) {
      next
    } else if (receive[j] > send[i]) {
      output[i] <- j
      break
    } else {
      next
    }
  }
}
tibble(send, receive = receive[output])
#> # A tibble: 4 × 2
#> send receive
#> <dttm> <dttm>
#> 1 1999-02-05 22:52:22 1999-02-21 18:29:59
#> 2 1999-02-11 21:38:03 1999-02-21 18:29:59
#> 3 1999-02-19 07:43:27 1999-02-21 18:29:59
#> 4 1999-03-01 13:26:50 NA
Why does 1999-02-21 18:29:59 appear 3 times? Recall:
“for a given send transmission (anterior), the corresponding receive
transmission (posterior) is the closest observation measured by
timestamp.”
The above result is in accordance with this statement. However, at some point in the future, I may add the ability to drop repeat occurrences of posterior timestamps, which would produce the following result instead.
#> # A tibble: 4 × 2
#> send receive
#> <dttm> <dttm>
#> 1 1999-02-05 22:52:22 1999-02-21 18:29:59
#> 2 1999-02-11 21:38:03 NA
#> 3 1999-02-19 07:43:27 NA
#> 4 1999-03-01 13:26:50 NA
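Until such an option exists, one possible way to get this result by hand is sketched below. This is an assumption about how one might post-process the output, not a feature of sift. It uses base R's duplicated(), which flags every repeat after the first occurrence, and made-up numeric stand-ins for the paired timestamps.

```r
# Made-up numeric stand-ins for the paired receive values above (NA = no match).
receive_paired <- c(21, 21, 21, NA)

# Blank out every repeat after the first occurrence.
receive_paired[duplicated(receive_paired)] <- NA
receive_paired
#> [1] 21 NA NA NA
```

Note that this keeps the match for the earliest send, as in the table above. Keeping the match for the latest send instead would be an equally defensible choice.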
The express dataset contains toll records for northbound and southbound
vehicles over the course of one business day.
library(readr)
library(mopac)
mopac::express
#> # A tibble: 13,032 × 6
#> direction time plate make model color
#> <chr> <dttm> <chr> <chr> <chr> <chr>
#> 1 North 2020-05-20 10:00:33 DZR-4059 Mercedes S-Series Black
#> 2 North 2020-05-20 10:01:13 GRG-4300 Nissan Altima Grey
#> 3 North 2020-05-20 10:03:47 QZS-2886 Mazda 6 White
#> 4 North 2020-05-20 10:04:54 OHK-3972 BMW i1 White
#> 5 North 2020-05-20 10:10:41 EAS-1671 Ford F-250 Silver
#> 6 North 2020-05-20 10:13:53 OKP-7589 Dodge Journey White
#> 7 North 2020-05-20 10:13:55 HNN-1298 Volkswagen Passat Grey
#> 8 North 2020-05-20 10:15:59 EWL-6179 Toyota Venza Grey
#> 9 North 2020-05-20 10:16:01 YVH-4374 GMC Safari White
#> 10 North 2020-05-20 10:18:16 DLU-6055 Nissan Titan Blue
#> # ℹ 13,022 more rows
Suppose we are interested in vehicles using the express lane both North
and South (i.e. commuting to work). It’s up to us to designate an
anterior direction. If we are only interested in vehicles commuting
downtown, we set names_first = "South".
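To make the effect of this choice concrete, here is a toy sketch with made-up toll times for a single vehicle. next_later() is a hypothetical helper that mimics the pairing rule described earlier; it is not part of sift. The vehicle passes South at hour 8 and North at hour 17.

```r
# Hypothetical helper (not part of sift): first posterior value strictly
# after each anterior value, or NA when there is none.
next_later <- function(anterior, posterior) {
  vapply(anterior, function(a) {
    later <- posterior[posterior > a]
    if (length(later) == 0) NA_real_ else min(later)
  }, numeric(1))
}

south <- 8   # made-up hour of the southbound toll record
north <- 17  # made-up hour of the northbound toll record

next_later(south, north)  # anterior = South: the trip is paired
#> [1] 17
next_later(north, south)  # anterior = North: no later South record exists
#> [1] NA
```

With South as the anterior direction the round trip is captured. With North it is not, because the only South record precedes the North record.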
conjecture(express, time, direction, "South") %>%
  drop_na() # We can't assume incomplete pairs are commuting downtown
#> # A tibble: 1,069 × 6
#> plate make model color South North
#> <chr> <chr> <chr> <chr> <dttm> <dttm>
#> 1 QQA-8430 Nissan 370Z Black 2020-05-20 10:00:16 2020-05-20 19:29:06
#> 2 NTL-6850 Ford Crown Vict… Beige 2020-05-20 10:04:39 2020-05-20 23:09:47
#> 3 RBF-4890 Infiniti QX60 Grey 2020-05-20 10:04:55 2020-05-20 21:32:07
#> 4 FDX-4994 Ford Fusion Grey 2020-05-20 10:12:38 2020-05-20 20:40:54
#> 5 DFF-5919 Honda Accord White 2020-05-20 10:14:32 2020-05-20 17:41:54
#> 6 CWW-1823 Porsche Panamera Black 2020-05-20 10:19:57 2020-05-20 21:15:45
#> 7 GRZ-3678 Volkswagen Eos Red 2020-05-20 10:21:23 2020-05-20 22:53:49
#> 8 VMV-7990 Mercedes GE-Class Black 2020-05-20 10:23:14 2020-05-20 20:42:00
#> 9 YDR-9931 Mazda 3 White 2020-05-20 10:25:51 2020-05-20 19:46:11
#> 10 BWC-4843 Alfa Romeo Giulia Red 2020-05-20 10:30:50 2020-05-20 21:34:49
#> # ℹ 1,059 more rows
library(ggplot2)
conjecture(express, time, direction, "South") %>%
  drop_na() %>%
  mutate(trip_length = difftime(North, South, units = "hours")) %>%
  ggplot(aes(trip_length)) +
  geom_histogram()
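As a worked example of the trip_length calculation, the sketch below applies difftime() to the timestamps from the first row of the conjecture output shown earlier. The time zone is an assumption; the difference does not depend on it as long as both timestamps share one.

```r
# Timestamps copied from the first row of the output (plate QQA-8430).
south <- as.POSIXct("2020-05-20 10:00:16", tz = "UTC")
north <- as.POSIXct("2020-05-20 19:29:06", tz = "UTC")

# 9 h 28 min 50 s, expressed in hours.
trip_length <- difftime(north, south, units = "hours")
round(as.numeric(trip_length), 2)
#> [1] 9.48
```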