Title: | Intelligently Peruse Data |
---|---|
Description: | Facilitate extraction of key information from common datasets. |
Authors: | Scott McKenzie [aut, cre], RStudio [cph] (internal functions from dplyr.R) |
Maintainer: | Scott McKenzie <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-02-02 04:11:11 UTC |
Source: | https://github.com/sccmckenzie/sift |
User-friendly interface that combines the power of dplyr::left_join and findInterval.
break_join(x, y, brk = character(), by = NULL, ...)
x |
A data frame. |
y |
Data frame containing desired reference information. |
brk |
Name of column in |
by |
Joining variables, if needed. See mutate-joins. |
... |
additional arguments automatically directed to |
An object of the same type as x.
All x rows will be returned.
All columns from both x and y are returned.
Rows in y are matched with x based on overlapping values of brk (e.g. findInterval(x$brk, y$brk, ...)).
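The interval matching described above can be illustrated with base R alone. The sketch below is not break_join's implementation; the toy data frames x and y and the handling of index 0 are assumptions made for illustration. It only shows how findInterval assigns each x value to the y break at or below it.

```r
# Illustrative only: how findInterval() maps each x value onto y's breaks.
# The data frames below are made up for this sketch.
x <- data.frame(q = c(1, 3.5, 5, 8))
y <- data.frame(q = c(3, 5), r = c("a1", "a2"))

# For each x$q, index of the last y$q that is <= x$q (0 if there is none).
idx <- findInterval(x$q, y$q)

# Assumption: a value below the first break has no matching y row.
idx[idx == 0] <- NA

# Pull the reference column from y by matched index.
x$r <- y$r[idx]
x
```

Passing all.inside = TRUE to findInterval changes how out-of-range values are indexed, which is why the examples note that missing values are filled when that argument is supplied.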
# joining USA + UK leaders with population time-series
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))

# simple dataset
set.seed(1)
a <- data.frame(p = c(rep("A", 10), rep("B", 10)), q = runif(20, 0, 10))
b <- data.frame(p = c("A", "A", "B", "B"), q = c(3, 5, 6, 9), r = c("a1", "a2", "b1", "b2"))

break_join(a, b, brk = "q") # p identified as common variable automatically
break_join(a, b, brk = "q", by = "p") # same result
break_join(a, b, brk = "q", all.inside = TRUE) # note missing values have been filled

# joining toll prices with vehicle time-series
library(mopac)
library(dplyr, warn.conflicts = FALSE)
library(hms)
express %>%
  mutate(time_hms = as_hms(time)) %>%
  break_join(rates, brk = c("time_hms" = "time"))
Dataset intended to demonstrate usage of sift::conjecture.
comms
An object of class tbl_df (inherits from tbl, data.frame) with 50000 rows and 4 columns.
On the surface, conjecture() appears similar to tidyr::pivot_wider(), but uses different logic tailored to a specific type of dataset:
The column corresponding to names_from contains only 2 levels.
There is no determinate combination of elements to fill 2 columns per row.
See vignette("conjecture") for more details.
conjecture(data, sort_by, names_from, names_first)
data |
A data frame to reshape. |
sort_by |
Column name, as symbol. Plays a similar role as |
names_from |
Column name, as symbol. Used to differentiate anterior/posterior observations. Column must only contain 2 levels (missing values not allowed). |
names_first |
level in variable specified by |
conjecture() uses the following routine to match elements:
Values in sort_by are separated into two vectors: anterior and posterior.
Each anterior element is matched with the closest posterior element measured by sort_by.
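The routine above can be sketched in base R. This is not conjecture()'s source: match_closest is a hypothetical helper, and the sketch assumes that "closest" means the nearest posterior value that does not precede the anterior value. See vignette("conjecture") for the package's actual behavior.

```r
# Hypothetical helper, not part of sift.
# Assumption: each anterior value pairs with the smallest posterior value
# that is greater than or equal to it; NA when no such value exists.
match_closest <- function(anterior, posterior) {
  posterior <- sort(posterior)
  # With left.open = TRUE, findInterval() counts posterior values strictly
  # below each anterior value; adding 1 gives the first posterior >= it.
  idx <- findInterval(anterior, posterior, left.open = TRUE) + 1L
  idx[idx > length(posterior)] <- NA_integer_
  data.frame(anterior = anterior, posterior = posterior[idx])
}

res <- match_closest(anterior = c(1, 5, 9), posterior = c(2, 6, 7))
res
```

In this toy input the last anterior value has no later posterior value, so it is left unmatched.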
An object of the same type as data.
# See vignette("conjecture") for more examples
conjecture(comms, timestamp, type, "send")
Automatically cluster 1-dimensional continuous data.
kluster(x, bw = "SJ", fixed = FALSE)
x |
Vector to be clustered. Must contain at least 1 non-missing value. |
bw |
Kernel bandwidth. The default "SJ" should suffice for most applications; however, you can supply a custom numeric value. See ?stats::density for more information. |
fixed |
logical; if |
An integer vector identifying the cluster corresponding to each element in x.
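One common way to cluster 1-dimensional data with kernel density estimation is to cut the data at the local minima of the estimated density. The sketch below illustrates that idea with stats::density; kde_clusters is a hypothetical helper and the cut-at-minima rule is an assumption, not kluster()'s confirmed implementation.

```r
# Hypothetical helper, not kluster()'s source.
# Assumption: clusters are separated at local minima of the density estimate.
kde_clusters <- function(x, bw = 0.5) {
  d <- stats::density(x, bw = bw)
  # Grid points where the density stops falling and starts rising.
  minima <- which(diff(sign(diff(d$y))) > 0) + 1L
  cuts <- d$x[minima]
  # Label each observation by the interval between cut points it falls in.
  findInterval(x, cuts) + 1L
}

x <- c(0.9, 1.0, 1.1, 4.9, 5.0, 5.1)
k <- kde_clusters(x)
k
```

A smaller bandwidth produces more valleys in the density and therefore more clusters, which is why bw is worth adjusting to the application.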
# Below vector clearly has 2 groups.
# kluster will identify these groups using kernel density estimation.
kluster(c(0.1, 0.2, 1))

# kluster shines in cases where manually assigning groups via "eyeballing" is impractical.
# Suppose we obtained vector 'x' without knowing how it was generated.
set.seed(1)
nodes <- runif(10, min = 0, max = 100)
x <- lapply(nodes, function(x) rnorm(10, mean = x, sd = 0.1))
x <- unlist(x)

kluster(x) # kluster reveals the natural grouping
kluster(x, bw = 10) # adjust bandwidth depending on application

# Example with faithful dataset
faithful$k <- kluster(faithful$eruptions)

library(ggplot2)
ggplot(faithful, aes(eruptions)) +
  geom_density() +
  geom_rug(aes(color = factor(k))) +
  theme_minimal() +
  scale_color_discrete(name = "k")
Includes selected headlines and additional metadata for NYT articles throughout 2020. This dataset is not a comprehensive account of all major events from 2020.
nyt2020
A data frame with 1,830 rows and 6 variables:
Article Headline
Brief summary of article
Contributing Writers
Date of Publication
NYT section in which article was published
Article URL
...
Obtained using NYT Developer Portal (Archive API)
Imagine dplyr::filter that includes neighboring observations.
Choose how many observations to include by adjusting inputs sift.col and scope.
sift(.data, sift.col, scope, ...)
.data |
A data frame. |
sift.col |
Column name, as symbol, to serve as "sifting/augmenting" dimension. Must be non-missing and coercible to numeric. |
scope |
Specifies augmentation bandwidth relative to "key" observations. Parameter should share the same scale as If length 1, bandwidth used is +/- If length 2, bandwidth used is (- |
... |
Expressions passed to |
sift() can be understood as a 2-step process:
.data is passed to dplyr::filter, using subsetting expression(s) provided in .... We'll refer to these intermediate results as "key" observations.
For each key observation, sift expands the row selection bidirectionally along dimension specified by sift.col. Any row from the original dataset within scope units of a key observation is captured in the final result.
Essentially, this allows us to "peek" at neighboring rows surrounding the key observations.
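The two steps can be sketched in base R as follows. This is not sift's implementation: sift_sketch and its is_key argument are hypothetical, the window boundaries are assumed to be inclusive, and the .cluster column and group handling are omitted for brevity.

```r
# Hypothetical base-R sketch of the two-step idea; not sift's implementation.
sift_sketch <- function(data, col, scope, is_key) {
  v <- as.numeric(data[[col]])
  # Assumption: a single scope value means a symmetric window.
  if (length(scope) == 1) scope <- c(scope, scope)
  # Step 1: the "key" observations selected by the filtering condition.
  keys <- v[is_key]
  # Step 2: keep any row within [key - scope[1], key + scope[2]] of some key.
  near <- vapply(
    v,
    function(z) any(z >= keys - scope[1] & z <= keys + scope[2]),
    logical(1)
  )
  out <- data[near, , drop = FALSE]
  out$.key <- is_key[near]
  out
}

df <- data.frame(t = c(1, 2, 5, 6, 10), id = c("a", "b", "c", "d", "e"))
res <- sift_sketch(df, "t", scope = 1, is_key = df$id == "c")
res
```

Here the row with id "c" is the key observation, and the row one unit later is pulled in as its neighbor.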
A sifted data frame, with 2 additional columns:
.cluster <int>: Identifies resulting group formed by each key observation and its neighboring rows. When the key observations are close enough together, the clusters will overlap.
.key <lgl>: TRUE indicates key observation.
# See current events from same timeframe as 2020 Utah Monolith discovery.
sift(nyt2020, pub_date, scope = 2, grepl("Monolith", headline))

# or Biden's presidential victory.
sift(nyt2020, pub_date, scope = 2, grepl("Biden is elected", headline))

# We can specify lower & upper scope to see what happened AFTER Trump tested positive.
sift(nyt2020, pub_date, scope = c(0, 2), grepl("Trump Tests Positive", headline))

# sift recognizes dplyr group specification.
library(dplyr)
library(mopac)
express %>%
  group_by(direction) %>%
  sift(time, 30, plate == "EAS-1671") # row augmentation performed within groups.
These datasets are intended to demonstrate usage of sift::break_join.
us_uk_pop us_uk_leaders
See tidyr::who and ggplot2::presidential.