
De-identifying Transcripts with ghosted
The ghosted package provides four high-level helpers to de‑identify transcripts without forcing you to work with intermediate data.frames:
- ghost_vtt(): redact a .vtt (Zoom/WebVTT) file and write .vtt/.docx/.txt; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_path, suffix, out_format, report_redacted.
- ghost_docx(): redact a .docx and write .docx/.txt/.vtt; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_path, suffix, out_format, report_redacted.
- ghost_txt(): redact a .txt and write .txt/.docx/.vtt; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_path, suffix, out_format, report_redacted.
- ghost_batch(): run the same logic across a folder of .vtt/.docx/.txt files; supports redact_other, redact_interviewer, include_common_names, redacted_token, add_blank_line_between_turns, output_dir, suffix, out_format, report_redacted.

Common arguments (ghost_vtt/ghost_docx/ghost_txt):
- filepath: input file path (.vtt/.docx/.txt respectively).
- interviewers: character vector of interviewer names (required).
- interviewees: character vector of participant names (optional).
- redact_other: additional words/phrases to redact.
- redact_interviewer: if TRUE, also redact interviewer names in text.
- include_common_names: if TRUE, include ghosted::common_names_default() if available.
- redacted_token: replacement token for redactions (default [REDACTED]).
- add_blank_line_between_turns: for DOCX/TXT outputs, insert a blank line between turns.
- output_path: explicit output file path; if NULL, uses the input directory with suffix.
- suffix: appended to the base name when auto-generating outputs (default "_redacted").
- out_format: output type. Allowed values per function:
  - ghost_vtt(): "vtt" (default), "docx", "txt"
  - ghost_docx(): "docx" (default), "txt", "vtt"
  - ghost_txt(): "txt" (default), "docx", "vtt"
- report_redacted: if TRUE, prints which phrases were found and redacted.

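The interaction between output_path, suffix, and out_format can be pictured with a small base-R sketch. The helper below, default_output_path(), is hypothetical and is not part of ghosted; it only illustrates the naming rule described above (input directory, base name plus suffix, extension taken from the output format), and the package's real logic may differ in its details.

```r
# Hypothetical helper (NOT a ghosted function): shows how an output path
# could be derived when output_path is NULL.
default_output_path <- function(filepath, suffix = "_redacted", out_format = "txt") {
  # Base name of the input file without its extension, e.g. "notes"
  base <- tools::file_path_sans_ext(basename(filepath))
  # Same directory as the input, with the suffix and the new extension
  file.path(dirname(filepath), paste0(base, suffix, ".", out_format))
}

default_output_path("path/to/notes.txt")
# "path/to/notes_redacted.txt"

default_output_path("path/to/meeting.vtt", suffix = "_DEID", out_format = "docx")
# "path/to/meeting_DEID.docx"
```

Passing an explicit output_path bypasses this kind of derivation entirely, which is useful when the redacted file must land in a different directory from the original.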
One‑file workflows
# VTT → DOCX
ghost_vtt(
  filepath = "path/to/meeting.vtt",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  out_format = "docx", # or "vtt" / "txt"
  output_path = "output/",
  suffix = "_DEID"
)

# DOCX → DOCX
ghost_docx(
  filepath = "path/to/transcript.docx",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  output_path = "output/transcript_DEID.docx"
)

# TXT → TXT
ghost_txt(
  filepath = "path/to/notes.txt",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  output_path = "output/notes_DEID.txt"
)

Speaker labels at the start of a line/paragraph are normalized to Interviewer or Participant (e.g., Name: or <v Name> → Interviewer:), while names elsewhere are replaced with [REDACTED].

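To make the two rules concrete, here is a rough base-R approximation of the behaviour. The function normalize_line() is a hypothetical illustration, not ghosted's implementation: it assumes exact, case-sensitive name matches and handles only the two leading forms mentioned above.

```r
# Illustrative only: approximates the described behaviour with base R.
# normalize_line() is a made-up name and is not exported by ghosted.
normalize_line <- function(line, interviewers, interviewees = character(),
                           token = "[REDACTED]") {
  speakers <- c(interviewers, interviewees)
  roles <- c(rep("Interviewer", length(interviewers)),
             rep("Participant", length(interviewees)))
  label <- ""
  for (i in seq_along(speakers)) {
    # Leading forms: "Name:" and the WebVTT voice tag "<v Name>"
    prefixes <- c(paste0(speakers[i], ":"), paste0("<v ", speakers[i], ">"))
    for (prefix in prefixes) {
      if (label == "" && startsWith(line, prefix)) {
        label <- paste0(roles[i], ":")
        line <- substring(line, nchar(prefix) + 1)
      }
    }
  }
  # Names that remain anywhere in the text are replaced with the token
  for (name in speakers) {
    line <- gsub(name, token, line, fixed = TRUE)
  }
  line <- trimws(line, which = "left")
  if (label == "") line else paste(label, line)
}

normalize_line("Sansa Stark: I spoke with Arya Stark yesterday.",
               interviewers = "Sansa Stark", interviewees = "Arya Stark")
# "Interviewer: I spoke with [REDACTED] yesterday."

normalize_line("<v Arya Stark>Sansa Stark asked me that.",
               interviewers = "Sansa Stark", interviewees = "Arya Stark")
# "Participant: [REDACTED] asked me that."
```

The leading label is handled before redaction so that the speaker's role survives while the name itself does not.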
Batch a folder
res <- ghost_batch(
  input_dir = "path/to/transcripts",
  interviewers = "Sansa Stark",
  interviewees = "Arya Stark",
  output_dir = "output/",
  suffix = "_DEID", # default
  out_format = NULL # keep each file’s type; or "docx"/"txt"/"vtt"
)
print(res)

Notes
- ghost_batch() shows progress with a console progress bar.
- For DOCX/TXT → VTT conversions (out_format = "vtt"), including those run through ghost_batch(), cues are written without timestamps.
- Leading speaker tokens at line start (e.g., Name: or <v Name>) are normalized to Interviewer/Participant, while names elsewhere are redacted.
- Ensure officer is installed for .docx reads/writes.
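One way to confirm the officer dependency before starting a .docx workflow is a quick check with base R's requireNamespace(). This is a generic pattern rather than anything ghosted requires you to run, and the variable name has_officer is chosen here purely for illustration.

```r
# Check whether officer is available without attaching it.
has_officer <- requireNamespace("officer", quietly = TRUE)

if (!has_officer) {
  # Only a reminder is printed; installation is left to the user.
  message('officer is not installed. Run install.packages("officer") ',
          "before reading or writing .docx files.")
}
```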