Skip to content

Use R to generate a database containing the UK Biobank data schemas from http://biobank.ctsu.ox.ac.uk/crystal/schema.cgi

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

bjcairns/ukbschemas

Repository files navigation

ukbschemas

Lifecycle: experimental Build Status

This R package can be used to create and/or load a database containing the UK Biobank Data Showcase schemas, which are data dictionaries describing the structure of the UK Biobank main dataset.

Installation

You can install the current version of ukbschemas from GitHub with:

# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")

library(ukbschemas)

Examples

The package supports two workflows.

Save-Load workflow (recommended)

The recommended approach is to use ukbschemas_db() to download the schema tables and save them to an SQLite database, then use load_db() to load the tables from the database and store them as tibbles in a named list:

db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)

By default, the database is named ukb-schemas-YYYY-MM-DD.sqlite (where YYYY-MM-DD is the current date) and placed in the current working directory. (path = tempdir() in the above example puts it in the current temporary directory instead.) At the most recent compilation of the database (03 August 2019), the size of the .sqlite database file produced by ukbschemas_db() was approximately 10.1MB.

Note that without further arguments, ukbschemas_db() tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively the raw data can be loaded with the as_is argument:

db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)

The overwrite option allows the database file to be overwritten (if TRUE), or prevents this (FALSE), or if not specified and the session is interactive (interactive() == TRUE) then the user is prompted to decide.

Note: If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of load_db(), which (currently) should load any SQLite database, regardless of contents.

Load-Save workflow

The second approach is to download the schemas and store them in memory in a list, and save them to a database only as requried.

This is not recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:

sch <- ukbschemas()
db <- save_db(sch, path = tempdir())

Why R?

This package was originally written in bash (a Unix shell scripting language). However, R is more accessible and all dependencies are loaded when you install the package; there is no need to install any secondary software (not even SQLite).

Notes

Design notes

  • All the encoding value tables (esimpint, esimpstring, esimpreal, esimpdate, ehierint, ehierstring) have been harmonised and combined into a single table encvalues. The value column in encvalues has type TEXT, but a type column has been added in case the value is not clear from context. The original type-specific tables have been deleted.
  • To avoid redunancy, category parent-child relationships have been moved to table categories, as column parent_id, from table catbrowse (which has been deleted).
  • Reference to the category to which a field belongs is in the main_category column in the fields schema, but has been renamed to category_id for consistency with the categories schema.
  • Details of several of the field properties (value_type, stability, item_type, strata and sexed) are available elsewhere on the Data Showcase. These have been added manually to tables valuetypes, stability, itemtypes, strata and sexed, and appropriate ID references have been renamed with the _id suffix in tables fields and encodings.
  • There are several columns in the tables which are not well-documented (e.g. base_type in fields, availability in encodings and categories, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).

Known code issues

  • The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database.
  • Because readr::read_csv() reads whole numbers as type double, not integer (allowing 64-bit integers without loss of information), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications.
  • Any other issues.