This function standardizes and cleans a dataset of qualitative excerpts coded by multiple coders.
It standardizes column names, filters excerpts by a preferred coder hierarchy, converts code columns
to logical (TRUE/FALSE), assigns descriptive variable labels, and optionally exports the cleaned data
to Excel (.xlsx) or Stata (.dta) format. The function also returns a codebook containing
variable names, labels, and data types.
Usage
clean_data(
excerpts,
preferred_coders,
rename_vars = NULL,
relabel_vars = NULL,
output_path = NULL,
output_type = c("none", "xlsx", "dta")
)Arguments
- excerpts
A data frame containing excerpt-level data exported from Dedoose or a similar coding platform.
- preferred_coders
A character vector of coder names in order of preference. The function keeps the highest-preference coder for each unique
media_title.- rename_vars
An optional named list or
dplyr::rename()-style mapping of variables to rename. For example,list(new_name = "old_name").- relabel_vars
An optional named list of new variable labels. For example,
list(old_name = "New label for var1", var2 = "Updated label for var2").- output_path
Optional file path to save the cleaned dataset. If
NULL, the data will not be saved to disk.- output_type
A string specifying the export format. Must be one of:
"none"– no file is written (default)"xlsx"– save as Excel file viaopenxlsx::write.xlsx()"dta"– save as Stata file viahaven::write_dta()
Value
A list with two elements:
dataA cleaned data frame with standardized names, filtered coders, and labelled variables.
codebookA data frame with columns:
variable,label, andtype.
Details
The function performs the following steps:
Standardizes variable names (lowercase, underscores instead of spaces).
Renames
excerpt_copytoexcerptif present.Removes columns ending with
"range"or"weight".Detects code columns matching the pattern
"^code.*applied$"and converts them to logicals.Renames code columns with a
c_prefix and assigns human-readable variable labels.Filters to the preferred coder per
media_title.Applies default labels to key metadata variables (e.g.,
excerpt_creator,media_title).Optionally renames or relabels variables via user-supplied arguments.
Drops columns that are entirely
NA.Generates a codebook summarizing variables, labels, and types.
When exporting to .dta, logicals remain stored as TRUE/FALSE rather than being coerced to 0/1.
Variable labels are preserved in Stata format using the labelled and haven packages.
Examples
if (FALSE) { # \dontrun{
result <- clean_data(
excerpts = excerpts_raw,
preferred_coders = c("CoderA", "CoderB"),
rename_vars = list(new_name = "old_name"),
relabel_vars = list(old_name = "new variable label"),
output_path = "cleaned_excerpts.dta",
output_type = "dta"
)
# Access cleaned data and codebook
head(result$data)
head(result$codebook)
} # }
