
Batch redact transcripts in a folder
Scans a folder for transcript files and runs the appropriate redaction function per file type:
- .docx -> ghost_docx()
- .txt -> ghost_txt()
- .vtt -> ghost_vtt() (can output VTT/DOCX/TXT)
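The per-file-type dispatch described above can be sketched roughly as follows. This is an illustration only, not the package's actual implementation: pick_redactor() is a hypothetical helper, and the case-insensitive extension matching is an assumption.

```r
# Hypothetical helper (not part of the package): maps a transcript path to
# the name of the redaction function that the description assigns to it.
pick_redactor <- function(path) {
  # Assumption: extensions are matched case-insensitively.
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    docx = "ghost_docx",
    txt  = "ghost_txt",
    vtt  = "ghost_vtt",
    NA_character_  # unsupported file type
  )
}

files <- c("a.docx", "b.txt", "c.VTT", "notes.pdf")
vapply(files, pick_redactor, character(1), USE.NAMES = FALSE)
# [1] "ghost_docx" "ghost_txt"  "ghost_vtt"  NA
```

How ghost_batch() itself treats unsupported files is not stated here; the sketch simply returns NA for them.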
Usage
ghost_batch(
input_dir,
interviewers,
interviewees = character(),
redact_other = character(),
redact_interviewer = FALSE,
include_common_names = FALSE,
redacted_token = "[REDACTED]",
add_blank_line_between_turns = TRUE,
output_dir = NULL,
recursive = FALSE,
suffix = "_redacted",
out_format = NULL,
report_redacted = FALSE
)
Arguments
- input_dir
Folder containing transcripts.
- interviewers
Character vector of interviewer names.
- interviewees
Character vector of interviewee/participant names.
- redact_other
Other words/phrases to redact.
- redact_interviewer
If TRUE, also redact interviewer names.
- include_common_names
If TRUE, also redact a default list of common names (e.g., top US baby names, if available via ghosted::common_names_default).
Replacement token used for redactions (names and other phrases).
- add_blank_line_between_turns
For VTT -> DOCX/TXT outputs, insert a blank line between turns.
- output_dir
Folder to write outputs (default: input_dir).
- recursive
If TRUE, include files in subdirectories.
- suffix
Suffix to append to each output filename (default: _redacted).
- out_format
Output format for all inputs; one of "vtt", "docx", or "txt". If NULL (default), each file keeps its original format. If you request "vtt" for .docx/.txt inputs, cues are written without timestamps.
- report_redacted
If TRUE, print phrases found/redacted per file.
Details
Outputs are written to output_dir (defaults to input_dir) using the input
base filename with suffix appended before the extension. A console
progress bar is shown while files are processed.
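As a rough illustration of the naming rule above, the following sketch builds an output path from an input path. redacted_path() is a hypothetical helper, not a function exported by the package, and it assumes the default output folder is the input file's own directory (which equals input_dir when recursive = FALSE). Whether subfolder structure is preserved under a custom output_dir is not specified here.

```r
# Hypothetical helper (not part of the package): appends `suffix` to the
# input base filename, before the extension.
redacted_path <- function(path, suffix = "_redacted", output_dir = NULL,
                          out_format = NULL) {
  # Assumption: fall back to the input file's own directory.
  out_dir <- if (is.null(output_dir)) dirname(path) else output_dir
  # Keep the original extension unless an output format is requested.
  ext <- if (is.null(out_format)) tools::file_ext(path) else out_format
  stem <- tools::file_path_sans_ext(basename(path))
  file.path(out_dir, paste0(stem, suffix, ".", ext))
}

redacted_path("transcripts/interview01.vtt")
# [1] "transcripts/interview01_redacted.vtt"

redacted_path("transcripts/interview01.vtt", suffix = "_DEID",
              output_dir = "out", out_format = "docx")
# [1] "out/interview01_DEID.docx"
```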
Examples
# Redact all transcripts in a folder, keeping the same formats:
# ghost_batch("inst/data/transcripts", interviewers = "Dr. Smith",
# interviewees = "Jane Doe", suffix = "_DEID")
# Redact VTTs to DOCX, others keep their formats:
# ghost_batch("inst/data/transcripts", interviewers = "Dr. Smith",
# interviewees = "Jane Doe", out_format = "docx")