This repository includes scripts and support files for 2019 IRIS-NCSES record linkage.
This script requires at least one additional package (unidecode). Execute the following to ensure that all requirements are present, then run the main script:
python -m pip install --user -r requirements.txt
python NCSES_clean_names.py
This code cleans and normalizes name fields, month, and year of birth. Key steps:
- Create a nickname lookup from the nickname CSV (NICKNAME_FILENAME).
- Pull the source data input (INPUT_FILENAME).
- Clean and normalize each field.
- Apply the nickname lookup function to assign a first name group from the first given name.
- Output to a ready-to-hash CSV (OUTPUT_FILENAME).
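
The nickname lookup step can be sketched as below. This is a minimal illustration, not the script's actual code: the column names (nickname, name_group), the sample rows, and both function names are hypothetical, and falling back to the name itself when no group is found is an assumption.

```python
import csv
import io

# Hypothetical nickname file contents; the real NICKNAME_FILENAME layout may differ.
NICKNAME_CSV = """nickname,name_group
kit,christopher
chris,christopher
christopher,christopher
"""


def build_nickname_lookup(handle):
    """Map each nickname to its name group."""
    return {row["nickname"]: row["name_group"] for row in csv.DictReader(handle)}


def assign_name_group(first_word, lookup):
    """Return the name group, falling back to the name itself (assumed behavior)."""
    return lookup.get(first_word, first_word)


lookup = build_nickname_lookup(io.StringIO(NICKNAME_CSV))
print(assign_name_group("kit", lookup))     # christopher
print(assign_name_group("emilia", lookup))  # emilia
```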
INPUT_FILENAME and OUTPUT_FILENAME should be customized as needed.
- They can be relative (sourcenames.csv, ./input/rawdata.csv) or absolute (C:/data/raw.csv).
- Use forward slashes (/) in filenames, not backslashes (\). Windows natively handles either.
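
One likely reason for the forward-slash advice (an assumption, since it is not stated here) is that a backslash inside an ordinary Python string literal starts an escape sequence. The short sketch below uses the standard library's PureWindowsPath to show that both separators name the same Windows path, and that an unescaped backslash path can be silently corrupted. The path C:\temp\raw.csv is a hypothetical example.

```python
from pathlib import PureWindowsPath

forward = "C:/data/raw.csv"
backslash_raw = r"C:\data\raw.csv"  # raw string keeps the backslashes literal

# Windows path semantics treat both separators the same way.
same = PureWindowsPath(forward) == PureWindowsPath(backslash_raw)
print(same)  # True

# Without the r prefix, \t and \r become a tab and a carriage return.
risky = "C:\temp\raw.csv"
print(repr(risky))
```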
Other constants are fixed configurations that should not be changed independently.
The INPUT_FIELDS variable specifies the following fields that must be in the source name CSV:
- name_first_middle: concatenation of all given names, first(s) and/or middle(s)
- name_last: last name as provided by source
- mob: month of birth
- yob: year of birth
All other fields in the source CSV (e.g. IDs) will be passed directly to the cleaned CSV.
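
A header check along these lines can confirm that a source CSV is usable before cleaning. It is a sketch only: INPUT_FIELDS is assumed to be a list of column names, and check_header and the person_id column are hypothetical.

```python
import csv
import io

# Assumed to be a list of column names; the script's actual definition may differ.
INPUT_FIELDS = ["name_first_middle", "name_last", "mob", "yob"]


def check_header(fieldnames):
    """Return (missing required fields, extra fields to pass through)."""
    missing = [f for f in INPUT_FIELDS if f not in fieldnames]
    passthrough = [f for f in fieldnames if f not in INPUT_FIELDS]
    return missing, passthrough


# Hypothetical source file with one extra ID column.
sample = io.StringIO(
    "person_id,name_first_middle,name_last,mob,yob\n"
    "1,Kit,Harington,??,1986\n"
)
reader = csv.DictReader(sample)
missing, passthrough = check_header(reader.fieldnames)
print(missing)      # []
print(passthrough)  # ['person_id']
```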
The script writes the following outgoing fields, which the OUTPUT_FIELDS variable helps validate:
- cleaned versions of each input field, with new names for each field
  - given
  - family
  - month
  - year
- complete concatenated given + family
  - complete
- name group assigned from the first word of first name
  - given_nickname
- given name trio that breaks first/middle after the first word
  - given_first_word
  - given_middle_initial
  - given_all_but_first
- given name trio that breaks first/middle before the last word
  - given_all_but_final
  - given_final_initial
  - given_final_word
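
The two given-name trios can be sketched as follows. The function name split_given is hypothetical, and the behavior for single-word names (only given_first_word is filled) is inferred from the example records in this README rather than from the script itself.

```python
def split_given(words):
    """Build both first/middle splits from a list of cleaned given-name words."""
    fields = dict.fromkeys(
        [
            "given_first_word", "given_middle_initial", "given_all_but_first",
            "given_all_but_final", "given_final_initial", "given_final_word",
        ],
        "",
    )
    if not words:
        return fields
    fields["given_first_word"] = words[0]
    if len(words) > 1:
        # Split after the first word.
        fields["given_middle_initial"] = words[1][0]
        fields["given_all_but_first"] = "".join(words[1:])
        # Split before the last word.
        fields["given_all_but_final"] = "".join(words[:-1])
        fields["given_final_initial"] = words[-1][0]
        fields["given_final_word"] = words[-1]
    return fields


emilia = split_given(["emilia", "isobel", "euphemia", "rose"])
kit = split_given(["kit"])
print(emilia["given_all_but_first"])  # isobeleuphemiarose
print(emilia["given_all_but_final"])  # emiliaisobeleuphemia
```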
Input:

| Field | Record 1 | Record 2 |
|---|---|---|
| name_first_middle | Emilia Isobel Euphemia Rose | Kit |
| name_last | Clarke | Harington |
| mob | 10 | ?? |
| yob | 1986 | 1986 |

Output:

| Field | Record 1 | Record 2 |
|---|---|---|
| given | emiliaisobeleuphemiarose | kit |
| family | clarke | harington |
| month | 10 | |
| year | 1986 | 1986 |
| complete | emiliaisobeleuphemiaroseclarke | kitharington |
| given_nickname | emilia | christopher |
| given_first_word | emilia | kit |
| given_middle_initial | i | |
| given_all_but_first | isobeleuphemiarose | |
| given_all_but_final | emiliaisobeleuphemia | |
| given_final_initial | r | |
| given_final_word | rose | |
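
The exact cleaning rules are not spelled out here, but the example suggests lowercasing, removing spaces, and blanking an unusable month. The sketch below reproduces that behavior under those assumptions. It uses the standard library's unicodedata as a stand-in for unidecode, and clean_name and clean_month are hypothetical names, not the script's functions.

```python
import re
import unicodedata


def clean_name(raw):
    """Lowercase, strip accents, and keep only a-z within each word."""
    ascii_text = (
        unicodedata.normalize("NFKD", raw)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = [re.sub(r"[^a-z]", "", w) for w in ascii_text.lower().split()]
    return [w for w in words if w]


def clean_month(raw):
    """Return the month as digits if it is 1-12, else an empty string (assumed rule)."""
    raw = raw.strip()
    if re.fullmatch(r"[0-9]{1,2}", raw) and 1 <= int(raw) <= 12:
        return str(int(raw))
    return ""


given = "".join(clean_name("Emilia Isobel Euphemia Rose"))
family = "".join(clean_name("Clarke"))
print(given + family)            # emiliaisobeleuphemiaroseclarke
print(repr(clean_month("??")))   # ''
```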