Administrative Layer Schema (Seed file)

Author

Ran Li

Published

August 29, 2024

SALURBAL Notebook - Global Setup
## Manual configure relative path (Configure for each notebook)
path_global_setup_function = '../../../../R/setup/global_setup.R'

## Generic global setup code (Do not change across notebooks)
source(path_global_setup_function)
setup = global_setup(path_global_setup_function)
invisible(list2env(setup$sourced_objects, envir = .GlobalEnv))
global_context = setup$global_context

One of the key seeds for the project is a centralized codebook for the columns found in tables that fall within the administrative layer. This used to be an .xlsx file, but we moved it to a Notion database for better documentation and visibility. We update the codebase seed file from this database in a batch upload flow:

  1. Schema changes are documented in the Notion SALURBAL Administrative Layer Fields (Database)
  2. We export the whole SALURBAL Administrative Layer Fields (Database) as .csv and save it as a dated file in /cache
  3. We process the most recent cached file and write a cleaned seed file to the project /clean folder

Import

Find most recent cache file

df_files = tibble(file = list.files("cache") )  %>% 
  mutate(date = str_extract(file, "\\d{1,2}-\\d{1,2}-\\d{2}")) %>%
  mutate(date = as.Date(date, format = "%m-%d-%y"))  %>% 
  select(date, file) %>% 
  arrange(desc(date))

These are the available Admin layer codebooks.

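file_to_process = df_files %>% 
  ## na.rm = TRUE so that a cache file without a date in its name cannot turn max() into NA
  filter(date == max(date, na.rm = TRUE)) %>%
  pull(file) %>% 
  paste0("cache/", .)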
file_to_process
[1] "cache/9-10-24 SALURBAL Administrative Layer Fields (Codebook) 228da34c473f44ac881b1e2f0efbad2d_all.csv"
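
As a rough illustration of how the date prefix drives the file selection, the sketch below repeats the same extract-and-parse logic with base R equivalents of `str_extract()` and the `%m-%d-%y` format. Only the first file name comes from the output above; the other two are made up for the example.

```r
# Illustrative file names: the first mirrors the export shown above,
# the other two are hypothetical and exist only for this sketch.
files <- c(
  "9-10-24 SALURBAL Administrative Layer Fields (Codebook) 228da34c473f44ac881b1e2f0efbad2d_all.csv",
  "8-29-24 example export.csv",
  "README.md"
)

# Pull the first m-d-yy token out of each name (NA when there is none).
pattern <- "\\d{1,2}-\\d{1,2}-\\d{2}"
hits <- regexpr(pattern, files, perl = TRUE)
date_txt <- rep(NA_character_, length(files))
date_txt[hits > 0] <- regmatches(files, hits)

# Parse as month-day-two-digit-year.
dates <- as.Date(date_txt, format = "%m-%d-%y")

# Pick the newest dated file; which.max() skips the NA from the undated name.
latest <- files[which.max(as.numeric(dates))]
latest
```

Because the year is only two digits, this naming scheme relies on R's `%y` convention for choosing the century, which is worth keeping in mind if older exports are ever added to the cache.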

Let's use the most recent one.

dfa = read_csv(file_to_process)
glimpse(dfa)
Rows: 47
Columns: 7
$ Name        <chr> "variable_origin", "public", "longitudinal", "file_codeboo…
$ Desc        <chr> "A list of SALURBAL var_name used to operationalize this v…
$ df_var_name <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No"…
$ Type        <chr> "metadata", "metadata", "metadata", "metadata", "metadata"…
$ Tags        <chr> "Standard metadata field", "Standard metadata field", "Sta…
$ Note        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Status      <chr> "Done", "Done", "Done", "Done", "Done", "Done", "Done", "D…

Each row in this table is an accepted column in the administrative layer. We keep only the name, description, type, and tags columns and clean the column names.

df_admin_layer_codebook = dfa %>% 
  select(Name, Desc, Type, Tags) %>% 
  janitor::clean_names()
df_admin_layer_codebook
| name | desc | type | tags |
|---|---|---|---|
| variable_origin | A list of SALURBAL var_name used to operationalize this variable. | metadata | Standard metadata field |
| public | Categorical indicator for whether accessibility status for this data point (e.g. public, private … various shades of grey). | metadata | Standard metadata field |
| longitudinal | Is this variable qualified for longitudinal analysis or visualizations? | metadata | Standard metadata field |
| file_codebook | File name of the codebook file used for this dataset | metadata | Standard metadata field |
| acknowledgements | Any acklowdgements for this variable | metadata | Standard metadata field |
| limitations | Place to describe any limitaitons for this variable. | metadata | Standard metadata field |
| dataset_notes | Any additional information can be added here. For example some dataset or file specific notes could be added here. | metadata | Standard metadata field |
| source | Data source | metadata | Standard metadata field |
| strata_description | This should describe in detail what strata are available for this variable. Please include details about each strata if applicable. | metadata | Standard metadata field |
| coding | This is an optional internal field that describes in details the measurement | metadata | Standard metadata field |
| units | This is the short label to be appended to the data value. It will be used for annotating text or visualizations with a unit label (e.g. cases per 100k). | metadata | Standard metadata field |
| value_type | What type of data is the value. | metadata | Standard metadata field |
| var_def | Details definition of what the variable is about. If categorical include coding here. | metadata | Standard metadata field |
| var_label | Short human readable variable label. | metadata | Standard metadata field |
| subdomain | Lower level of variable categorization. A list of these can be found at https://data.lacurbanhealth.org/data/about. | metadata | Standard metadata field |
| domain | Highest level of variable categorization. A list of these can be found at https://data.lacurbanhealth.org/data/about. | metadata | Standard metadata field |
| strata_2_value | Recoding of the raw strata_1_raw into a human-readable value (e.g. “City specific”) | strata.csv | NA |
| strata_2_raw | The raw value of the second population strata for the specific data point. (e.g. “City”) | strata.csv | NA |
| strata_2_name | The name of the the second population strata (e.g. Standard population). | strata.csv | NA |
| strata_1_value | Recoding of the raw strata_1_raw into a human-readable value (e.g., “Male” or “Female”) | strata.csv | NA |
| strata_1_raw | The raw value of the first population strata for the specific data point. (e.g. “1” or “0”) | strata.csv | NA |
| strata_1_name | The name of the the first population strata (e.g. Sex). | strata.csv | NA |
| strata_id | Unique identifier for strata within a variable. | composite key | NA |
| var_name_raw | Original variable name in the source or pre-renovated SALURBAL dataset. | data | NA |
| var_name | SALURBAL system harmonized variable name. | composite key | NA |
| dataset_id | dataset id e.g. APSL1AD. This should just be the name of the folder on the UHC server (minus the date appendix) | composite key | NA |
| value | Value for this particular data point. | data | Standard |
| source_terms_of_use_URL | The terms of use of the origin data source. | metadata | NA |
| source_URL | URL Associated with the data source | metadata | NA |
| version | version of the data point | composite key | NA |
| value_lci | This is the Lower Confidence Interval (LCI) for the associated variable. The LCI representing the lower limit of the confidence range for the primary ‘value’. This value indicates the lowest possible accurate value within the confidence range, considering statistical uncertainty. | data | NA |
| value_uci | This is the Upper Confidence Interval (UCI) for the associated variable.The UCI indicates the upper limit of the confidence range for the primary ‘value’. This estimate provides an understanding of the statistical uncertainty surrounding the primary value, suggesting the highest possible accurate value. | data | NA |
| time_resolution_type | Represents the scale of time for this data point. | intermediate | NA |
| estimate_type | This is a internal utility variable that determines if a data point is an estimate, iteration, actual value … etc. Need better docemntaiton here with OBT EDA. | intermediate | NA |
| has_confidence_interval | Utility indicator that is true for values with CI and false when only point data is available. | intermediate | NA |
| value_iteration | This is an attribute associated with sampled data (think model predictions or simiulations that have many iterations). | data | NA |
| month | Month as numeric string (1-12) | composite key | NA |
| day | Day as string | composite key | NA |
| observation_id | Unique SALURBAL Observation identifier | composite key | NA |
| observation_type | Category of SALURBAL Observation Type | composite key | NA |
| dataset_instance | concatenation of {dataset_id}_{version}; usefule intermediate variable that currently is a required composite key (this requirement may be relax). | composite key | NA |
| iso2 | lower case ISO2 country code. | composite key | NA |
| year | Year of the particular observation. This could be a single year or a range; we have an intermediate variable time_resolution_type (https://www.notion.so/time_resolution_type-5ed14361dc4144a1b974eb24ea964558?pvs=21) which is a categorization of the type of value year is. | composite key | NA |
| file_data | File where data came from. Note that this is only found in the data and not codebooks, so lets keep it as a data attribute that every data point will have. | data | Standard |
| geo | What geographic level the observation is at (L1, L2 … etc). Note this is a required variable for area level data but should be empty for all other observation types. | composite key | NA |
| tags | An optional metadata field that can used as a flexible tag that exists outside the normal domain/subdomain structure. This came from tags raw ‘Domains’ in the health survey records dataset. It should just be a semicolan seperate string for now. | metadata | NA |
| data_point_uuid | This is the primary key we have to track each data point. | data | Standard |
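
Since each row names an accepted column, one way this codebook could be used downstream is to flag columns that are not in the schema. The sketch below is an assumption about such a check, not part of the project's pipeline: `schema` stands in for the seed with only a handful of names, and `candidate` is a hypothetical table with made-up values.

```r
# Stand-in for the seed file: a few of the accepted names, for illustration only.
schema <- data.frame(name = c("dataset_id", "var_name", "iso2", "year", "value"))

# Hypothetical administrative-layer table; extra_col is deliberately not in the schema.
candidate <- data.frame(
  dataset_id = "APSL1AD",
  var_name = "example_var",
  iso2 = "br",
  year = "2020",
  value = 1.5,
  extra_col = TRUE
)

# Columns present in the table but not listed in the schema.
unexpected <- setdiff(names(candidate), schema$name)
unexpected
```

An empty result would mean every column in the table is an accepted one.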

Validation

Let's do some data validation.

## Pass the bare column: is_uniq("name") tests a single string, which is always unique
df_admin_layer_codebook_validated = df_admin_layer_codebook %>% 
  verify(is_uniq(name)) 
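
To make the intent of the uniqueness check concrete, here is a minimal base R sketch of the same idea on a toy codebook. The data frame and its duplicated entry are invented for the example; the notebook itself relies on `verify()` as shown above.

```r
# Toy codebook with a deliberately duplicated name (illustrative data only).
toy_codebook <- data.frame(
  name = c("var_name", "dataset_id", "var_name"),
  type = c("composite key", "composite key", "composite key")
)

# Names that occur more than once; character(0) would mean the column is unique.
dup_names <- unique(toy_codebook$name[duplicated(toy_codebook$name)])
dup_names

# Fail loudly when duplicates exist; try() keeps the sketch running past the error.
check <- try(stopifnot(length(dup_names) == 0), silent = TRUE)
inherits(check, "try-error")
```

Listing the offending names before failing makes it easier to find the duplicated row in the Notion database.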

Export

Looks good. Let's export to the clean folder.

## Write the validated schema as both .csv and .json seed files
df_admin_layer_codebook_validated %>% write_csv("../../seed-admin-layer-schema.csv")
df_admin_layer_codebook_validated %>% 
  jsonlite::write_json(
  "../../seed-admin-layer-schema.json", pretty = TRUE)