De-identify transcripts with helpers:
- ghost_vtt(): redact a .vtt (Zoom/WebVTT) file and write .vtt/.docx/.txt; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_path, suffix, out_format, report_redacted.
- ghost_docx(): redact a .docx and write .docx/.txt/.vtt; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_path, suffix, out_format, report_redacted.
- ghost_txt(): redact a .txt and write .txt/.docx/.vtt; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_path, suffix, out_format, report_redacted.
- ghost_batch(): run the same logic across a folder of .vtt/.docx/.txt files; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_dir, suffix, out_format, report_redacted.
Common arguments (ghost_vtt/ghost_docx/ghost_txt):
- filepath: input file path (.vtt/.docx/.txt respectively).
- interviewers: character vector of interviewer names (required).
- interviewees: character vector of participant names (optional).
- redact_other: additional words/phrases to redact.
- redact_interviewer: if TRUE, also redact interviewer names in text.
- include_common_names: if TRUE, include ghosted::common_names_default() if available.
- redacted_token: replacement token for redactions (default [REDACTED]).
- add_blank_line_between_turns: for DOCX/TXT outputs, insert a blank line between turns.
- output_path: explicit output file path; if NULL, uses input dir with suffix.
- suffix: appended to base name when auto-generating outputs (default "_redacted").
- out_format: output type. Allowed values per function:
  - ghost_vtt(): "vtt" (default), "docx", "txt"
  - ghost_docx(): "docx" (default), "txt", "vtt"
  - ghost_txt(): "txt" (default), "docx", "vtt"
- report_redacted: if TRUE, prints which phrases were found and redacted.
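
To make the interaction between output_path, suffix, and out_format concrete, here is a minimal base-R sketch of how an auto-generated output path could be built when output_path is NULL. The helper default_output_path() is hypothetical and not part of ghosted; it only illustrates the documented rule (input directory, base name plus suffix, extension taken from out_format), and the package's actual implementation may differ.

```r
# Hypothetical helper, NOT part of ghosted: mirrors the documented rule
# "if output_path is NULL, use the input dir with suffix".
default_output_path <- function(filepath,
                                suffix = "_redacted",
                                out_format = "txt") {
  # Base name of the input file without its extension, e.g. "sample"
  base <- tools::file_path_sans_ext(basename(filepath))
  # Same directory as the input, base name + suffix, new extension
  file.path(dirname(filepath), paste0(base, suffix, ".", out_format))
}

default_output_path("interviews/sample.vtt", out_format = "docx")
# "interviews/sample_redacted.docx"
```

Under this reading, suffix only matters when output_path is left as NULL; an explicit output_path is used as given.
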
Installation
You can install the development version of ghosted like so:
# via remotes
remotes::install_github("abiraahmi/ghosted")
# or with pak
# pak::pak("abiraahmi/ghosted")

Examples
library(ghosted)

# VTT → DOCX (or "vtt"/"txt")
ghost_vtt(
  filepath = "inst/data/sample.vtt",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  out_format = "docx",
  output_path = "output/sample_DEID.docx",
  suffix = "_DEID"
)

# DOCX → DOCX
ghost_docx(
  filepath = "inst/data/transcript.docx",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  output_path = "output/transcript_DEID.docx"
)

# TXT → TXT
ghost_txt(
  filepath = "inst/data/notes.txt",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  output_path = "output/notes_DEID.txt"
)

# Batch a folder (keep file types)
res <- ghost_batch(
  input_dir = "inst/data/transcripts",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  output_dir = "output",
  suffix = "_DEID"
)
print(res)

Notes
- Leading speaker tokens at line start (e.g., Name: or <v Name>) are normalized to Interviewer/Participant, while names elsewhere are redacted.
- For DOCX/TXT → VTT conversions in ghost_batch(out_format = "vtt"), cues are written without timestamps.
- Ensure the officer package is installed for .docx reads/writes.
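
The first note can be illustrated with a small base-R sketch. It is not ghosted's implementation: relabel_speaker() and normalize_and_redact() are hypothetical names, matching is on exact full names only, and the redact_interviewer = FALSE default is an assumption made for the example.

```r
# Hypothetical sketch, NOT ghosted's code: relabel a leading speaker token.
relabel_speaker <- function(line, speaker_names, label) {
  for (nm in speaker_names) {
    plain <- paste0(nm, ":")         # "Name:" prefix (DOCX/TXT style)
    voice <- paste0("<v ", nm, ">")  # "<v Name>" prefix (WebVTT voice tag)
    if (startsWith(line, plain)) {
      return(paste0(label, ":", substring(line, nchar(plain) + 1)))
    }
    if (startsWith(line, voice)) {
      return(paste0("<v ", label, ">", substring(line, nchar(voice) + 1)))
    }
  }
  line
}

# Relabel leading tokens first, then redact the remaining name mentions.
normalize_and_redact <- function(lines, interviewers,
                                 interviewees = character(),
                                 redact_interviewer = FALSE,  # assumed default
                                 redacted_token = "[REDACTED]") {
  to_redact <- interviewees
  if (redact_interviewer) to_redact <- c(to_redact, interviewers)
  vapply(lines, function(line) {
    line <- relabel_speaker(line, interviewers, "Interviewer")
    line <- relabel_speaker(line, interviewees, "Participant")
    for (nm in to_redact) {
      # fixed = TRUE matches the name literally, with no regex escaping needed
      line <- gsub(nm, redacted_token, line, fixed = TRUE)
    }
    line
  }, character(1), USE.NAMES = FALSE)
}

lines <- c(
  "Sansa Stark: Thanks for joining, Arya Stark.",
  "<v Arya Stark>Happy to be here."
)
normalize_and_redact(lines, "Sansa Stark", "Arya Stark")
# "Interviewer: Thanks for joining, [REDACTED]."
# "<v Participant>Happy to be here."
```

Relabeling before redacting keeps the speaker role readable in the output while still removing the name from the body of each turn.
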
