Administrative Layer Schema (Seed file)

Author

Ran Li

Published

August 29, 2024

SALURBAL Notebook - Global Setup

## Manually configure the relative path (configure for each notebook)
path_global_setup_function = '../../../../R/setup/global_setup.R'

## Generic global setup code (Do not change across notebooks)
source(path_global_setup_function)
setup = global_setup(path_global_setup_function)
invisible(list2env(setup$sourced_objects, envir = .GlobalEnv))
global_context = setup$global_context

One of the key seeds for the project is a centralized codebook for the columns found in tables that fall within the administrative layer. This used to be an .xlsx file, but we moved it to a Notion database for better documentation and visibility. We update the codebase seed file from this database in a batch upload flow:

1. Schema changes are documented in the Notion SALURBAL Administrative Layer Fields (Database).
2. We export the whole SALURBAL Administrative Layer Fields (Database) as a .csv and save it as a dated file in /cache.
3. We process the most recent cached file and write a cleaned seed file to the project /clean folder.

Import

Find most recent cache file

## List cached exports and parse the m-d-yy date at the start of each file name
df_files = tibble(file = list.files("cache")) %>% 
  mutate(date = str_extract(file, "\\d{1,2}-\\d{1,2}-\\d{2}")) %>%
  mutate(date = as.Date(date, format = "%m-%d-%y")) %>% 
  select(date, file) %>% 
  arrange(desc(date))

These are the available Admin layer codebooks.

## Keep the newest export; na.rm = TRUE ignores files without a parsable date
file_to_process = df_files %>% 
  filter(date == max(date, na.rm = TRUE)) %>%
  pull(file) %>% 
  paste0("cache/", .)
file_to_process

[1] "cache/9-10-24 SALURBAL Administrative Layer Fields (Codebook) 228da34c473f44ac881b1e2f0efbad2d_all.csv"
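The date parsing and file selection above can be checked on a few file names without touching the cache folder. The following is a minimal base R sketch, not the notebook's own code: the file names are hypothetical (shortened versions of the export naming pattern), and it assumes every dated export carries an m-d-yy token in its name.

```r
# Hypothetical file names; "notes.txt" stands in for a file with no date token.
files <- c(
  "9-10-24 SALURBAL Administrative Layer Fields.csv",
  "8-29-24 SALURBAL Administrative Layer Fields.csv",
  "notes.txt"
)

# Locate the first m-d-yy token in each name; regexpr() returns -1 when absent.
m <- regexpr("[0-9]{1,2}-[0-9]{1,2}-[0-9]{2}", files)
date_chr <- rep(NA_character_, length(files))
date_chr[m > 0] <- regmatches(files, m)

# Parse as month-day-two-digit-year; names without a token stay NA.
dates <- as.Date(date_chr, format = "%m-%d-%y")

# Pick the newest dated file; which.max() skips NA values.
latest <- files[which.max(as.numeric(dates))]
latest_path <- file.path("cache", latest)
print(latest_path)
```

Files without a recognizable date are given an NA date rather than raising an error, so they can never be selected as the most recent export.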

Let's use the most recent one.

dfa = read_csv(file_to_process)
glimpse(dfa)

Rows: 47
Columns: 7
$ Name        <chr> "variable_origin", "public", "longitudinal", "file_codeboo…
$ Desc        <chr> "A list of SALURBAL var_name used to operationalize this v…
$ df_var_name <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No"…
$ Type        <chr> "metadata", "metadata", "metadata", "metadata", "metadata"…
$ Tags        <chr> "Standard metadata field", "Standard metadata field", "Sta…
$ Note        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Status      <chr> "Done", "Done", "Done", "Done", "Done", "Done", "Done", "D…

Each row in this table is an accepted column in the administrative layer. We keep only the name, description, type, and tags columns.

df_admin_layer_codebook = dfa %>% 
  select(Name, Desc, Type, Tags) %>% 
  janitor::clean_names()
df_admin_layer_codebook

name	desc	type	tags
variable_origin	A list of SALURBAL var_name used to operationalize this variable.	metadata	Standard metadata field
public	Categorical indicator for whether accessibility status for this data point (e.g. public, private … various shades of grey).	metadata	Standard metadata field
longitudinal	Is this variable qualified for longitudinal analysis or visualizations?	metadata	Standard metadata field
file_codebook	File name of the codebook file used for this dataset	metadata	Standard metadata field
acknowledgements	Any acklowdgements for this variable	metadata	Standard metadata field
limitations	Place to describe any limitaitons for this variable.	metadata	Standard metadata field
dataset_notes	Any additional information can be added here. For example some dataset or file specific notes could be added here.	metadata	Standard metadata field
source	Data source	metadata	Standard metadata field
strata_description	This should describe in detail what strata are available for this variable. Please include details about each strata if applicable.	metadata	Standard metadata field
coding	This is an optional internal field that describes in details the measurement	metadata	Standard metadata field
units	This is the short label to be appended to the data value. It will be used for annotating text or visualizations with a unit label (e.g. cases per 100k).	metadata	Standard metadata field
value_type	What type of data is the value.	metadata	Standard metadata field
var_def	Details definition of what the variable is about. If categorical include coding here.	metadata	Standard metadata field
var_label	Short human readable variable label.	metadata	Standard metadata field
subdomain	Lower level of variable categorization. A list of these can be found at https://data.lacurbanhealth.org/data/about.	metadata	Standard metadata field
domain	Highest level of variable categorization. A list of these can be found at https://data.lacurbanhealth.org/data/about.	metadata	Standard metadata field
strata_2_value	Recoding of the raw strata_1_raw into a human-readable value (e.g. “City specific”)	strata.csv	NA
strata_2_raw	The raw value of the second population strata for the specific data point. (e.g. “City” )	strata.csv	NA
strata_2_name	The name of the the second population strata (e.g. Standard population).	strata.csv	NA
strata_1_value	Recoding of the raw strata_1_raw into a human-readable value (e.g., “Male” or “Female”)	strata.csv	NA
strata_1_raw	The raw value of the first population strata for the specific data point. (e.g. “1” or “0”)	strata.csv	NA
strata_1_name	The name of the the first population strata (e.g. Sex).	strata.csv	NA
strata_id	Unique identifier for strata within a variable.	composite key	NA
var_name_raw	Original variable name in the source or pre-renovated SALURBAL dataset.	data	NA
var_name	SALURBAL system harmonized variable name.	composite key	NA
dataset_id	dataset id e.g. APSL1AD. This should just be the name of the folder on the UHC server (minus the date appendix)	composite key	NA
value	Value for this particular data point.	data	Standard
source_terms_of_use_URL	The terms of use of the origin data source.	metadata	NA
source_URL	URL Associated with the data source	metadata	NA
version	version of the data point	composite key	NA
value_lci	This is the Lower Confidence Interval (LCI) for the associated variable. The LCI representing the lower limit of the confidence range for the primary ‘value’. This value indicates the lowest possible accurate value within the confidence range, considering statistical uncertainty.	data	NA
value_uci	This is the Upper Confidence Interval (UCI) for the associated variable.The UCI indicates the upper limit of the confidence range for the primary ‘value’. This estimate provides an understanding of the statistical uncertainty surrounding the primary value, suggesting the highest possible accurate value.	data	NA
time_resolution_type	Represents the scale of time for this data point.	intermediate	NA
estimate_type	This is a internal utility variable that determines if a data point is an estimate, iteration, actual value … etc. Need better docemntaiton here with OBT EDA.	intermediate	NA
has_confidence_interval	Utility indicator that is true for values with CI and false when only point data is available.	intermediate	NA
value_iteration	This is an attribute associated with sampled data (think model predictions or simiulations that have many iterations).	data	NA
month	Month as numeric string (1-12)	composite key	NA
day	Day as string	composite key	NA
observation_id	Unique SALURBAL Observation identifier	composite key	NA
observation_type	Category of SALURBAL Observation Type	composite key	NA
dataset_instance	concatenation of {dataset_id}_{version}; usefule intermediate variable that currently is a required composite key (this requirement may be relax).	composite key	NA
iso2	lower case ISO2 country code.	composite key	NA
year	Year of the particular observation. This could be a single year or a range; we have an intermediate variable time_resolution_type (https://www.notion.so/time_resolution_type-5ed14361dc4144a1b974eb24ea964558?pvs=21) which is a categorization of the type of value year is.	composite key	NA
file_data	File where data came from. Note that this is only found in the data and not codebooks, so lets keep it as a data attribute that every data point will have.	data	Standard
geo	What geographic level the observation is at (L1, L2 … etc). Note this is a required variable for area level data but should be empty for all other observation types.	composite key	NA
tags	An optional metadata field that can used as a flexible tag that exists outside the normal domain/subdomain structure. This came from tags raw ‘Domains’ in the health survey records dataset. It should just be a semicolan seperate string for now.	metadata	NA
data_point_uuid	This is the primary key we have to track each data point.	data	Standard

Validation

Let's run some data validation.

## Every field name must appear exactly once in the codebook
df_admin_layer_codebook_validated = df_admin_layer_codebook %>% 
  verify(is_uniq(name))
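The uniqueness rule checked here can also be written in base R, which makes it easy to see which names fail. This is an illustrative sketch only: `check_unique_names` and the miniature codebook are hypothetical and are not part of the project's setup code.

```r
# Hypothetical miniature codebook with one duplicated field name.
codebook <- data.frame(
  name = c("var_name", "dataset_id", "var_name"),
  type = c("composite key", "composite key", "metadata"),
  stringsAsFactors = FALSE
)

# Stop with a readable message if any name occurs more than once;
# otherwise return the data frame invisibly so it can be used in a pipeline.
check_unique_names <- function(df) {
  dups <- unique(df$name[duplicated(df$name)])
  if (length(dups) > 0) {
    stop("Duplicated field names: ", paste(dups, collapse = ", "))
  }
  invisible(df)
}

# Capture the error message instead of halting the script.
result <- tryCatch(
  check_unique_names(codebook),
  error = function(e) conditionMessage(e)
)
print(result)
```

Returning the input unchanged on success mirrors how a validation step sits in the middle of a pipeline: the data passes through untouched unless the check fails.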

Export

Looks good; let's export to the clean folder.

## Write the validated schema as both CSV and JSON seed files
df_admin_layer_codebook_validated %>% write_csv("../../seed-admin-layer-schema.csv")
df_admin_layer_codebook_validated %>% 
  jsonlite::write_json("../../seed-admin-layer-schema.json", pretty = TRUE)