{ ## 'Global Setup (For any notebook)'#### Configure relative path (Change!) path_global_setup_function ='../../R/setup/global_setup.R'#### Generic global setup code (Do not change!)source(path_global_setup_function) setup =global_setup(path_global_setup_function)invisible(list2env(setup$sourced_objects, envir = .GlobalEnv)) global_context = setup$global_context}{ ## Context## Local Objects path_templating_data_folder =generate_relative_path(global_context$notebook_relative_root,'datasets/_templating_data/') path_seeds_folder =generate_relative_path(global_context$notebook_relative_root,'clean/seeds') path_templating_data_seed =file.path(path_seeds_folder, 'templating_data.xlsx') path_templating_dictionary_seed =file.path(path_seeds_folder, 'templating_dictionary.json') path_templating_df_seed =file.path(path_seeds_folder, 'templating_data.parquet')## Local Context context =c(global_context)}
The SALURBAL system utilized composite keys and dataframe templating to handle the highly heterogeneous relationships between data and metadata. Our Dictionary table is similar in that it is grouped by the composite key combinations:
by_key: in these cases, the key is the only column in the table that is used for compling the associated value
by_key_iso2: in these cases, the key and iso2 columns are used for compiling the associated value
by_key_iso2_year: in these cases, the key, iso2, and year columns are used for compiling the associated value
by_key_iso2_year_dataset: in these cases, the key, iso2, year, and dataset columns are used for compiling the associated value
This notebook contains the logic for ingestion, processing, testing/validation of SALURBAL key-value dictionary tables. The source tables are ingested from the _src folder and are processed into a single intermediate data structure for validation. The validation process includes unit tests for structure and content issues. If the source tables pass the validation tests, they are exported to the clean/seeds folder as a single .xlsx file for use during the ETL process.
Currently we we have 26 data source and 1 utility templating data files. The first thing we do is to compile them all into a single intermediate source data structure for validation by testing for structure and content issues.
Templating engines require a key value pair during the rendering the process. The information we have now is the in the form of a dataframe that includes both the root keyword and also some positional information (iso2, year, dataset). To make this composite information available to the templating enginer we will operationalize a composite key string that captures this relational information so it can be used to access the correct value during the rendering process into dataframes. A few notes:
order of composite key must be standardized
Lets just standardize it to: sort(c('dataset_id', 'iso2', 'strata_id', 'geo', 'year')).
Now we have our upstream source tables ingested, lets do some testing to validate the structure as well as integrity of the source dictionary.
2.1 Unit Tests
Unit tests include:
valid_value_cols : checks that all tables have the three expect cols (public, source, ack .. _value)
valid_key: all keys must not have under dashes only number, character and dahses
valid_src_utf8: checks everything is valid utf-8 encoding
valid_no_seperator: makes sure seperators (;) as everything should be tidy
valid_tidy_iso2_year: checks src__by_iso2_year is tidy
valid_pk: checks that pk matches the table names
unique_composite_keys
df_invalid = df_templating_data %>%add_count(composite_key) %>%filter(n>1)if (nrow(df_invalid) >0 ) cli_abort("Some composite keys are not unique. Please check!")
valid_key
df_invalid = df_templating_data %>%filter(str_detect(key, "_"))if (nrow(df_invalid) >0 ) cli_abort('Some keys contain invalid chracter of underdashes. Please make sure all keys do not have underdashes.')
valid_value_cols
valid_value_cols = templating_data %>%imap_lgl(~{ table_tmp = .x index <- .y valid_tmp =all(required_cols%in%names(table_tmp))if (!valid_tmp) cli_alert_warning("{index} Missing expect value columns")return(valid_tmp) }) %>%all()if (!valid_value_cols) cli_abort('Some src sheets are missing expect columns.')
The table below shows the Templating keys avaialable and the associated relationships these keys are available at.
Below we provide a preview of each of tables within the SALURBAL Templating Data resource. You can use the ‘Search’ box in each of these tables for preliminary preview of the templating data available to other processes within the SALURBAL system.