Title: | Nearest Neighbors Matching of Case-Control Data |
---|---|
Description: | Provides nearest-neighbors matching and analysis of case-control data. Cui, Z., Marder, E. P., Click, E. S., Hoekstra, R. M., & Bruce, B. B. (2022) <doi:10.1097/EDE.0000000000001504>. |
Authors: | Beau Bruce [aut, cre], Zhaohui Cui [aut] |
Maintainer: | Beau Bruce <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.0.0 |
Built: | 2025-02-22 06:27:24 UTC |
Source: | https://github.com/cran/nncc |
A toy dataset containing 7-day exposure history of 250 cases and 250 controls
anifood
A data frame with 500 rows and 11 variables:
case |
case status, 1 = case, 0 = control |
exp01 |
whether exposed to exp01, 1 = yes, 0 = no |
exp09 |
whether exposed to exp09, 1 = yes, 0 = no |
exp20 |
whether exposed to exp20, 1 = yes, 0 = no |
exp24 |
whether exposed to exp24, 1 = yes, 0 = no |
exp27 |
whether exposed to exp27, 1 = yes, 0 = no |
exp43 |
whether exposed to exp43, 1 = yes, 0 = no |
exp45 |
whether exposed to exp45, 1 = yes, 0 = no |
exp50 |
whether exposed to exp50, 1 = yes, 0 = no |
exp52 |
whether exposed to exp52, 1 = yes, 0 = no |
exp57 |
whether exposed to exp57, 1 = yes, 0 = no |
Save the results of long-running code to an .rds file if that file does not exist in the cache directory. If the file already exists in the cache directory, it is loaded into memory without evaluating the code.
cacheit(name, code, dir, createdir = FALSE, clearcache = FALSE)
name |
Name of the file to create without extension |
code |
Expression of the code to execute and cache |
dir |
Name of cache directory which should be placed in the working directory |
createdir |
Logical about whether to create the directory if it does not exist |
clearcache |
Logical about whether to recalculate the cached .rds file for this object |
For more information, please refer to the vignette using browseVignettes("nncc").
Output of code: freshly executed if the file does not exist or clearcache is TRUE; otherwise, the result read from the cache file
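The caching behavior described for cacheit can be illustrated with a minimal base R sketch. The helper cache_sketch below is hypothetical and is not the nncc implementation; it only assumes the behavior stated above (compute and save on a miss, load on a hit, recompute when clearcache is TRUE).

```r
# Hypothetical helper, not the nncc implementation: cache an expression's
# result in <dir>/<name>.rds and reuse it on later calls.
cache_sketch <- function(name, code, dir, clearcache = FALSE) {
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  path <- file.path(dir, paste0(name, ".rds"))
  if (file.exists(path) && !clearcache) {
    return(readRDS(path))   # cache hit: `code` is never evaluated
  }
  result <- code            # lazy evaluation: the expression runs here
  saveRDS(result, path)
  result
}

cache_dir <- file.path(tempdir(), "cache_sketch_demo")
unlink(cache_dir, recursive = TRUE)  # start the demo from an empty cache

n_runs <- 0
slow_sum <- function() {
  n_runs <<- n_runs + 1     # count how often the "slow" code really runs
  sum(1:10)
}

first  <- cache_sketch("total", slow_sum(), cache_dir)  # computes and saves
second <- cache_sketch("total", slow_sum(), cache_dir)  # loads from the file
third  <- cache_sketch("total", slow_sum(), cache_dir, clearcache = TRUE)
```

Because R evaluates arguments lazily, the expression passed as code is only run when the cached file is missing or a recalculation is requested.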
Each case and its matched controls form a stratum in the data set. This function calculates the pooled odds ratio (OR) for the data set.
calc_strata_or(dfs, filter = TRUE, filterdata = NULL)
dfs |
A named list of dataframes created by package functions |
filter |
Filter statement to apply |
filterdata |
Extra data to left join to the |
Uses the Mantel-Haenszel (M-H) method unless there is only one stratum, in which case fisher.test is used. For more information, please refer to the vignette using browseVignettes("nncc").
Distance density plots comparing closest to random choices
distance_density_plot(threshold_results)
threshold_results |
See |
The ggplot showing the distances of cases matched to their nearest neighbor vs. a random control
A dataset listing the variables that are excluded from matching for each exposure. This dataset is supplied to the rmvars argument of the function make_knn_strata. The two columns must be named "exp_var" and "rm_vars".
excl_vars
A data frame with two variables:
exp_var |
exposures of interest |
rm_vars |
variables to be excluded from matching for a given exposure |
Ensures that each control retained in a data frame is used only once and removes strata without any case or any control. If a control is matched to multiple cases (i.e., that control exists in multiple strata), priority is given first to the smallest stratum and then to the smallest distance.
finalize_data(dfs, filter = TRUE, filterdata = NULL)
dfs |
A list of data frames generated by
|
filter |
Filter statement to apply |
filterdata |
Extra data to left join to the |
For more information, please refer to the vignette using browseVignettes("nncc").
A list of data frames
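The priority rule described for finalize_data (smallest stratum first, then smallest distance) can be sketched in base R as follows. This is a single-pass illustration on toy data, not the package's code; the column names stratum, control_id, and distance are assumptions made for the example.

```r
# Toy long-format matches: one row per (stratum, control) pair.
# Column names are illustrative, not those used inside nncc.
matches <- data.frame(
  stratum    = c(1, 1, 1, 2, 2, 3),
  control_id = c("a", "b", "c", "a", "d", "b"),
  distance   = c(0.10, 0.20, 0.30, 0.05, 0.15, 0.25)
)

# Number of controls in each stratum, attached to every row.
matches$stratum_size <- ave(matches$distance, matches$stratum, FUN = length)

# Sort so the preferred row for each control comes first:
# smallest stratum, then smallest distance.
ordered <- matches[order(matches$stratum_size, matches$distance), ]

# Keep only the first (preferred) occurrence of each control.
kept <- ordered[!duplicated(ordered$control_id), ]
kept <- kept[order(kept$stratum, kept$distance), ]
```

In this toy example, control "a" stays in stratum 2 and control "b" stays in stratum 3, because those strata are smaller than stratum 1. A fuller version would also drop any stratum left without a case or a control.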
Fix the strata so they all have at least one case and at least one control
fix_df(d)
d |
A stratified dataset |
Calculate population attributable fraction using odds ratio
get_paf(df_or, which_or, exp_var, exp_level, df_matched)
df_or |
A data frame that stores odds ratios for all exposures of interest |
which_or |
The unquoted name of the column that stores the odds
ratio, or its lower or upper confidence limit, in |
exp_var |
An unquoted name of the column that stores the name of
exposures in |
exp_level |
An unquoted name of the column that stores the level of the
exposure variable in |
df_matched |
The list of data frames used to calculate odds ratios |
Use odds ratio, its upper confidence limit, and its lower confidence limit to calculate population attributable fraction, its upper confidence limit, and its lower confidence limit, respectively.
For more information, please refer to the vignette using browseVignettes("nncc").
A data frame.
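To make the mapping from an odds ratio and its confidence limits to a population attributable fraction concrete, here is a minimal sketch. It assumes Miettinen's case-based formula, PAF = p_c * (OR - 1) / OR, where p_c is the proportion of cases that are exposed; the source does not state which formula get_paf uses, and the helper paf_miettinen and the input values are hypothetical.

```r
# Miettinen's formula (an assumption here, not confirmed for get_paf):
# PAF = p_c * (OR - 1) / OR, with p_c the proportion of cases exposed.
paf_miettinen <- function(or, p_cases_exposed) {
  p_cases_exposed * (or - 1) / or
}

# Hypothetical inputs: an OR estimate with its confidence limits,
# and 40% of cases exposed.
or_est <- c(estimate = 2.0, lower = 1.25, upper = 4.0)

# Applying the same formula to the estimate and to each limit gives the
# PAF and its lower and upper limits, respectively.
paf <- paf_miettinen(or_est, p_cases_exposed = 0.4)
print(round(paf, 3))
```

Because the formula is increasing in the OR, the lower OR limit maps to the lower PAF limit and the upper OR limit to the upper PAF limit.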
Finds a distance threshold that defines which controls qualify to be matched with a case.
get_threshold(data, vars, case_var = "case", p_threshold = 0.5, seed = 1600)
data |
The dataset |
vars |
The variables to use for calculating distance |
case_var |
The name of the case identifier variable |
p_threshold |
The probability that the closest matching approach
produces the closer matching relative to the random matching approach.
The greater |
seed |
A random seed. |
This function uses logistic regression to predict, from the distance, whether a control is the closest (unique) match for each case vs. a random selection, and by default returns the 50% threshold (p_threshold = 0.5).
For more information, please refer to the vignette using browseVignettes("nncc").
A list with items:
threshold |
The numeric threshold chosen |
modeldata |
The data used to fit the logistic regression model |
strata |
The strata made by make_knn_strata |
model |
The fitted logistic regression model |
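The idea behind the threshold can be sketched with base R's glm on simulated distances. This is an illustration under stated assumptions, not the code used by get_threshold: the simulated beta distributions, the column names distance and is_closest, and the closed-form inversion of the fitted line are all choices made for the example.

```r
set.seed(1600)

# Simulated distances (assumption for illustration): closest matches tend to
# be nearer than randomly chosen controls.
closest_dist <- rbeta(200, 2, 8)
random_dist  <- rbeta(200, 4, 4)
model_data <- data.frame(
  distance   = c(closest_dist, random_dist),
  is_closest = rep(c(1, 0), each = 200)
)

# Logistic regression: probability that a pair is a closest match,
# predicted from its distance.
fit <- glm(is_closest ~ distance, data = model_data, family = binomial())

# Solve b0 + b1 * d = logit(p) for d: the distance at which the model
# predicts probability p_threshold.
p_threshold <- 0.5
b <- coef(fit)
threshold <- unname((qlogis(p_threshold) - b[1]) / b[2])

# Check: the predicted probability at the threshold equals p_threshold.
predicted <- predict(fit, data.frame(distance = threshold), type = "response")
```

Raising p_threshold in this sketch moves the threshold toward smaller distances, since the fitted slope on distance is negative for data like these.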
Sets a maximum number of controls that are allowed to be matched to a case, ensures that matched case-control pairs have a distance smaller than the predefined threshold, and merges strata sharing the same controls.
make_analysis_set( var, stratified_data, data, maxdist = 0, maxcontrols = 20, silent = FALSE )
var |
Character of current exposure variable in
|
stratified_data |
Stratified dataset, see |
data |
Original case control data |
maxdist |
Reject any controls more than maxdist from their case |
maxcontrols |
Maximum number of controls to keep per stratum |
silent |
Suppress exposure info useful for *apply/loop implementations |
For more information, please refer to the vignette using browseVignettes("nncc").
A list of data frames whose length equals the number of exposures.
This helper function applies make_analysis_set() to each exposure.
make_analysis_sets(stratified_data, expvars, data, threshold)
stratified_data |
List of stratified data sets, see
|
expvars |
Character vector of exposure variable for each set in
|
data |
Original case control data |
threshold |
Maximum distance threshold for cases and controls created by
|
For more information, please refer to the vignette using browseVignettes("nncc").
A list of data frames whose length equals the number of exposures
Select a pre-defined number of controls for each case based on calculated distances between cases and controls.
make_knn_strata( expvar, matchvars, df, rmvars = data.frame(exp_var = character(), rm_vars = character(), stringsAsFactors = FALSE), casevar = "case", ncntls = 250, metric = "gower", silent = FALSE )
expvar |
A character - the name of the exposure variable in |
matchvars |
Character vector - the variables to match on. Note that the function automatically excludes the exposure variable. |
df |
A dataframe that contains the case-control data. |
rmvars |
A data frame that lists variables to be excluded from matching for each exposure. For details, please see the vignette of this package. |
casevar |
A character - what is the name of the variable indicating case status (1 = case, 0 = control) |
ncntls |
An integer to specify number of controls to find for each case (k in knn). |
metric |
A character to specify a metric for measuring distance between
a case and a control. See |
silent |
Suppress exposure info useful for *apply/loop implementations? |
For more information, please refer to the vignette using browseVignettes("nncc").
A list of data frames whose length equals the number of exposures of interest.
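The selection of nearest controls by Gower distance can be sketched in base R for the simple case of 0/1 matching variables. This is not the make_knn_strata implementation. It assumes every matching variable is binary and treated symmetrically, so that the Gower distance reduces to the mean absolute difference; the data, the helper gower_binary, and the column names are hypothetical.

```r
# Toy case-control data with binary matching variables (illustrative names).
df <- data.frame(
  id   = 1:6,
  case = c(1, 1, 0, 0, 0, 0),
  v1   = c(1, 0, 1, 0, 1, 0),
  v2   = c(1, 0, 1, 0, 0, 1),
  v3   = c(0, 1, 0, 1, 0, 0)
)
matchvars <- c("v1", "v2", "v3")
k <- 2  # controls to keep per case (the role ncntls plays)

cases    <- df[df$case == 1, ]
controls <- df[df$case == 0, ]

# For symmetric 0/1 variables, Gower distance is the mean absolute
# difference across the matching variables.
gower_binary <- function(x, y) mean(abs(x - y))

strata <- lapply(seq_len(nrow(cases)), function(i) {
  x <- unlist(cases[i, matchvars])
  d <- unname(apply(controls[, matchvars], 1, function(y) gower_binary(x, y)))
  nearest <- order(d)[seq_len(k)]  # indices of the k closest controls
  data.frame(case_id    = cases$id[i],
             control_id = controls$id[nearest],
             distance   = d[nearest])
})
knn <- do.call(rbind, strata)
print(knn)
```

Note that control 5 is selected for both cases here, which is the kind of shared control that the later steps (merging strata, enforcing unique controls) are described as handling.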
The nncc package implements an approach to match cases with their nearest controls defined by Gower distance. This approach may achieve better confounding control than conventional analytic approaches such as (conditional) logistic regression when you have a relatively large number of exposures of interest. To learn more about nncc, start with the vignettes: browseVignettes("nncc").
Maintainer: Beau B. Bruce [email protected]
Coauthor: Zhaohui Cui
Compare the original strata's distances to the knn version
original_compare_plot(data, casevar, stratavar, threshold_results)
data |
The original data |
casevar |
The variable that defines cases vs. controls |
stratavar |
The variable that defines the strata |
threshold_results |
See |
A list with items:
plot_density |
The ggplot displayed |
prop_distance_gt_threshold |
A table showing the proportion of pairs exceeding the chosen numeric threshold |
Plot the OR results
plot_results(csvfilename, filter = TRUE)
csvfilename |
CSV results file, see |
filter |
How to filter the results |
For more information, please refer to the vignette using browseVignettes("nncc").
Returns csvfilename to allow chaining
This data set deals with urinary tract infection in sexually active college women, along with covariate information on age and contraceptive use. The variables are all binary and coded as 1 (condition is present) or 0 (condition is absent).
sex2
sex2: a data.frame containing 239 observations
case |
urinary tract infection, the study outcome variable |
age |
>= 24 years |
dia |
use of diaphragm |
oc |
use of oral contraceptive |
vic |
use of condom |
vicl |
use of lubricated condom |
vis |
use of spermicide |
<https://www.cytel.com/>
Cytel Inc., (2010) LogXact 9 user manual, Cambridge, MA:Cytel Inc
This data set deals with urinary tract infection in sexually active college women, along with covariate information on age and contraceptive use. The variables are all binary and coded as 1 (condition is present) or 0 (condition is absent): case (urinary tract infection, the study outcome variable), age (>= 24 years), dia (use of diaphragm), oc (use of oral contraceptive), vic (use of condom), vicl (use of lubricated condom), and vis (use of spermicide).
sexagg
sexagg: an aggregated data.frame containing 31 observations with case weights (COUNT).
case |
urinary tract infection, the study outcome variable |
age |
>= 24 years |
dia |
use of diaphragm |
oc |
use of oral contraceptive |
vic |
use of condom |
vicl |
use of lubricated condom |
vis |
use of spermicide |
<https://www.cytel.com/>
Cytel Inc., (2010) LogXact 9 user manual, Cambridge, MA:Cytel Inc
Calculate odds ratios using the Mantel-Haenszel (M-H) method when the matched dataset has more than one stratum, and using Fisher's exact test when the matched dataset has only one stratum.
test_mh(case, exp, strata)
case |
The case statuses |
exp |
The exposure statuses |
strata |
The strata identifiers |
For more information, please refer to the vignette using browseVignettes("nncc").
The list of statistical results
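The rule described for test_mh (M-H for several strata, Fisher's exact test for a single stratum) can be sketched with the base R functions mantelhaen.test and fisher.test. The helper pooled_or_sketch and the toy data are hypothetical; the actual return structure of test_mh is not documented here and may differ.

```r
# Hypothetical helper mirroring the rule described above: Mantel-Haenszel for
# several strata, Fisher's exact test for a single stratum.
pooled_or_sketch <- function(case, exp, strata) {
  case <- factor(case, levels = c(0, 1))
  exp  <- factor(exp,  levels = c(0, 1))
  if (length(unique(strata)) > 1) {
    res <- mantelhaen.test(table(exp, case, strata))  # 2 x 2 x K table
  } else {
    res <- fisher.test(table(exp, case))              # single 2 x 2 table
  }
  c(or    = unname(res$estimate),
    lower = res$conf.int[1],
    upper = res$conf.int[2])
}

# Toy stratum: 3 cases (2 exposed) and 3 controls (1 exposed).
one_stratum <- data.frame(
  case = c(1, 1, 1, 0, 0, 0),
  exp  = c(1, 1, 0, 1, 0, 0)
)
# Two identical strata, labelled "A" and "B".
two_strata <- rbind(cbind(one_stratum, strata = "A"),
                    cbind(one_stratum, strata = "B"))

mh <- pooled_or_sketch(two_strata$case, two_strata$exp, two_strata$strata)
fe <- pooled_or_sketch(one_stratum$case, one_stratum$exp, rep("A", 6))
```

Note that fisher.test reports the conditional maximum likelihood estimate of the odds ratio, which generally differs from the simple cross-product ratio of the 2 x 2 table.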
Show the prediction of the logistic regression model
threshold_model_plot(threshold_results, p_threshold = 0.5)
threshold_results |
See |
p_threshold |
The probability that the closest matching approach
produces the closer matching relative to the random matching approach.
The greater |
The ggplot showing the threshold logistic regression model
Ensures controls are unique to avoid possible pseudoreplication issues
unique_controls(stratifieddata)
stratifieddata |
See |
A tibble after it has been examined and filtered for duplicate controls
Format strata output into CSV
write_strata_or_output(results, varnames, filename)
results |
Output of |
varnames |
Vector of exposure variable names |
filename |
String of the filename to output to |
Returns the filename to allow chaining