| Title: | An Abstracted System for Easily Working with Databases with Large Datasets |
|---|---|
| Description: | Provides object-oriented database management tools for working with large datasets across multiple database systems. Features include robust connection management for 'PostgreSQL' databases, advanced table operations with bulk data loading and upsert functionality, comprehensive data validation through customizable field type and content validators, efficient index management, and cross-database compatibility. Designed for high-performance data operations in surveillance systems and large-scale data processing workflows. |
| Authors: | Richard Aubrey White [aut, cre] (ORCID: <https://orcid.org/0000-0002-6747-1726>), August Sørli Mathisen [aut] |
| Maintainer: | Richard Aubrey White <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2026.5.13 |
| Built: | 2026-05-13 07:02:01 UTC |
| Source: | https://github.com/niphr/csdb |
Returns the currently registered authentication hook function.
csdb_get_auth_hook()csdb_get_auth_hook()
The current auth hook function, or NULL if none is set.
Register a function to be called when a database connection fails. This is useful for refreshing Kerberos tickets or other authentication credentials before retrying the connection.
csdb_set_auth_hook(hook)csdb_set_auth_hook(hook)
hook |
A function with no arguments that performs authentication, or NULL to clear the hook. |
Invisibly returns the previous hook (if any).
## Not run: # Set an auth hook to refresh Kerberos credentials csdb_set_auth_hook(function() { system2("/bin/authenticate.sh", stdout = NULL) }) # Clear the hook csdb_set_auth_hook(NULL) ## End(Not run)## Not run: # Set an auth hook to refresh Kerberos credentials csdb_set_auth_hook(function() { system2("/bin/authenticate.sh", stdout = NULL) }) # Clear the hook csdb_set_auth_hook(NULL) ## End(Not run)
A robust database connection manager that handles connections to various database systems including Microsoft SQL Server and PostgreSQL. This class provides connection management, authentication, and automatic reconnection capabilities.
The DBConnection_v9 class encapsulates database connection logic and provides a consistent interface for connecting to different database systems. It supports both trusted connections and user/password authentication, handles connection failures gracefully, and provides automatic reconnection functionality.
Key features:
Support for multiple database systems (SQL Server, PostgreSQL)
Automatic connection management with retry logic
Secure credential handling
Connection status monitoring
Graceful error handling and recovery
configConfiguration details of the database.
connectionDatabase connection.
autoconnectionDatabase connection that automatically connects if possible.
new()
Create a new DBConnection_v9 object.
DBConnection_v9$new( driver = NULL, server = NULL, port = NULL, db = NULL, schema = NULL, user = NULL, password = NULL, trusted_connection = NULL, sslmode = NULL, role_create_table = NULL )
driverDriver
serverServer
portPort
dbDB
schemaSchema (e.g. "dbo")
userUser
passwordPassword
trusted_connectionNULL or "yes"
sslmodeNULL or "require"
role_create_tableNULL or the role to take when creating tables.
A new 'DBConnection_v9' object.
is_connected()
Is the DB schema connected?
DBConnection_v9$is_connected()
TRUE/FALSE
print()
Class-specific print function.
DBConnection_v9$print(...)
...Not used.
connect()
Connect to the database
DBConnection_v9$connect(attempts = 2)
attemptsNumber of attempts to be made to try to connect
disconnect()
Disconnect from the database
DBConnection_v9$disconnect()
clone()
The objects of this class are cloneable with this method.
DBConnection_v9$clone(deep = FALSE)
deepWhether to make a deep clone.
## Not run: # Create a SQL Server connection db_config <- DBConnection_v9$new( driver = "ODBC Driver 17 for SQL Server", server = "localhost", port = 1433, db = "mydb", user = "myuser", password = "mypass" ) # Connect to the database db_config$connect() # Check connection status db_config$is_connected() # Use the connection tables <- DBI::dbListTables(db_config$connection) # Disconnect when done db_config$disconnect() # PostgreSQL example pg_config <- DBConnection_v9$new( driver = "PostgreSQL", server = "localhost", port = 5432, db = "mydb", user = "myuser", password = "mypass" ) pg_config$connect() # ... use connection ... pg_config$disconnect() ## End(Not run)## Not run: # Create a SQL Server connection db_config <- DBConnection_v9$new( driver = "ODBC Driver 17 for SQL Server", server = "localhost", port = 1433, db = "mydb", user = "myuser", password = "mypass" ) # Connect to the database db_config$connect() # Check connection status db_config$is_connected() # Use the connection tables <- DBI::dbListTables(db_config$connection) # Disconnect when done db_config$disconnect() # PostgreSQL example pg_config <- DBConnection_v9$new( driver = "PostgreSQL", server = "localhost", port = 5432, db = "mydb", user = "myuser", password = "mypass" ) pg_config$connect() # ... use connection ... pg_config$disconnect() ## End(Not run)
A comprehensive database table management class that provides high-level operations for data manipulation, schema validation, and table administration. This class combines database connectivity with data validation and efficient bulk operations.
The DBTable_v9 class is a sophisticated database table abstraction that provides:
Core functionality:
Table creation and schema management
Data insertion with bulk loading capabilities
Upsert operations (insert or update)
Index management (creation, deletion)
Data validation through customizable validators
Integration with dplyr for data queries
Advanced features:
Automatic table creation based on field specifications
Schema validation with custom validator functions
Efficient bulk data loading using database-specific methods
Index optimization for query performance
Cross-database compatibility (SQL Server, PostgreSQL)
Data validation: The class supports custom validation functions for both field types and data contents, ensuring data integrity and schema compliance.
dbconnectionDatabase connection.
dbconfigConfiguration details of the database.
table_nameName of the table in the database.
table_name_short_for_mssql_fully_specified_for_postgresFully specified name of the table in the database (e.g. \[db\].\[dbo\].\[table_name\]).
table_name_short_for_mssql_fully_specified_for_postgres_textFully specified name of the table in the database (e.g. \[db\].\[dbo\].\[table_name\]).
table_name_fully_specifiedFully specified name of the table in the database (e.g. \[db\].\[dbo\].\[table_name\]).
table_name_fully_specified_textFully specified name of the table in the database (e.g. \[db\].\[dbo\].\[table_name\]) as a text string.
field_typesThe types of each column in the database table (INTEGER, DOUBLE, TEXT, BOOLEAN, DATE, DATETIME).
field_types_with_lengthThe same as field_types but with (100) added to the end of all TEXT fields.
keysThe combination of variables that uniquely identify each row in the database.
keys_with_lengthThe same as keys but with (100) added to the end of all TEXT fields.
indexesA named list of vectors (generally "ind1", "ind2", etc.) that improves the speed of data retrieval operations on a database table.
validator_field_contentsA function that validates the data before it is inserted into the database.
load_folderA temporary folder that is used to write data to before inserting into the database.
censorsA named list of censors.
new()
Create a new DBTable_v9 object.
DBTable_v9$new( dbconfig, table_name, field_types, keys, indexes = NULL, validator_field_types = validator_field_types_blank, validator_field_contents = validator_field_contents_blank )
dbconfigConfiguration details of the database (driver, server, port, db, schema, user, password, trusted_connection, sslmode, role_create_table).
table_nameName of the table in the database.
field_typesThe types of each column in the database table (INTEGER, DOUBLE, TEXT, BOOLEAN, DATE, DATETIME).
keysThe combination of these variables uniquely identifies each row of data in the table.
indexesA named list of vectors (generally "ind1", "ind2", etc.) that improves the speed of data retrieval operations on a database table.
validator_field_typesA function that validates the field_types before the DB schema is created.
validator_field_contentsA function that validates the data before it is inserted into the database.
A new 'DBTable_v9' object.
print()
Class-specific print function.
DBTable_v9$print(...)
...Not in use.
connect()
Connect from the database
DBTable_v9$connect()
disconnect()
Disconnect from the database
DBTable_v9$disconnect()
table_exists()
Does the table exist
DBTable_v9$table_exists()
create_table()
Create the database table
DBTable_v9$create_table()
remove_table()
Drop the database table
DBTable_v9$remove_table()
insert_data()
Inserts data
DBTable_v9$insert_data( newdata, confirm_insert_via_nrow = FALSE, verbose = TRUE )
newdataThe data to insert.
confirm_insert_via_nrowChecks nrow() before insert and after insert. If nrow() has not increased sufficiently, then attempt an upsert.
verboseBoolean. Inserts data into the database table
upsert_data()
Upserts data into the database table
DBTable_v9$upsert_data( newdata, drop_indexes = names(self$indexes), verbose = TRUE )
newdataThe data to insert.
drop_indexesA vector containing the indexes to be dropped before upserting (can increase performance).
verboseBoolean.
drop_all_rows()
Drops all rows in the database table
DBTable_v9$drop_all_rows()
drop_rows_where()
Drops rows in the database table according to the SQL condition.
DBTable_v9$drop_rows_where(condition)
conditionSQL text condition.
keep_rows_where()
Keeps rows in the database table according to the SQL condition.
DBTable_v9$keep_rows_where(condition)
conditionSQL text condition.
drop_all_rows_and_then_upsert_data()
Drops all rows in the database table and then upserts data.
DBTable_v9$drop_all_rows_and_then_upsert_data( newdata, drop_indexes = names(self$indexes), verbose = TRUE )
newdataThe data to insert.
drop_indexesA vector containing the indexes to be dropped before upserting (can increase performance).
verboseBoolean.
drop_all_rows_and_then_insert_data()
Drops all rows in the database table and then inserts data.
DBTable_v9$drop_all_rows_and_then_insert_data( newdata, confirm_insert_via_nrow = FALSE, verbose = TRUE )
newdataThe data to insert.
confirm_insert_via_nrowChecks nrow() before insert and after insert. If nrow() has not increased sufficiently, then attempt an upsert.
verboseBoolean.
tbl()
Provides access to the database table via dplyr::tbl.
DBTable_v9$tbl()
print_dplyr_select()
Prints a template dplyr::select call that you can easily copy/paste for all your variables.
DBTable_v9$print_dplyr_select()
add_indexes()
Adds indexes to the database table from 'self$indexes'
DBTable_v9$add_indexes()
drop_indexes()
Drops all indees from the database table
DBTable_v9$drop_indexes()
confirm_indexes()
Confirms that the names and number of indexes in the database are the same as in the R code. Does not confirm the contents of the indexes!
DBTable_v9$confirm_indexes()
nrow()
Gets the number of rows in the database table
DBTable_v9$nrow(use_count = FALSE)
use_countIf true, then uses the count command, which is slow but accurate. If false, then uses summary statistics, which is fast but inaccurate.
info()
Gets the information about the database table
DBTable_v9$info()
clone()
The objects of this class are cloneable with this method.
DBTable_v9$clone(deep = FALSE)
deepWhether to make a deep clone.
## Not run: # Create database connection db_config <- list( driver = "ODBC Driver 17 for SQL Server", server = "localhost", db = "mydb", user = "myuser", password = "mypass" ) # Define table schema field_types <- c( "id" = "INTEGER", "name" = "TEXT", "value" = "DOUBLE", "date_created" = "DATE" ) # Create table object my_table <- DBTable_v9$new( dbconfig = db_config, table_name = "my_data_table", field_types = field_types, keys = c("id"), validator_field_types = validator_field_types_blank, validator_field_contents = validator_field_contents_blank ) # Create table in database my_table$create_table() # Insert data sample_data <- data.frame( id = 1:3, name = c("Alice", "Bob", "Charlie"), value = c(10.5, 20.3, 15.7), date_created = as.Date("2023-01-01") ) my_table$insert_data(sample_data) # Query data using dplyr result <- my_table$tbl() |> dplyr::filter(value > 15) |> dplyr::collect() # Add indexes for performance my_table$add_indexes(c("name", "date_created")) # Upsert (insert or update) data new_data <- data.frame( id = 2:4, name = c("Bob_Updated", "Charlie", "David"), value = c(25.0, 15.7, 30.2), date_created = as.Date("2023-01-02") ) my_table$upsert_data(new_data) ## End(Not run)## Not run: # Create database connection db_config <- list( driver = "ODBC Driver 17 for SQL Server", server = "localhost", db = "mydb", user = "myuser", password = "mypass" ) # Define table schema field_types <- c( "id" = "INTEGER", "name" = "TEXT", "value" = "DOUBLE", "date_created" = "DATE" ) # Create table object my_table <- DBTable_v9$new( dbconfig = db_config, table_name = "my_data_table", field_types = field_types, keys = c("id"), validator_field_types = validator_field_types_blank, validator_field_contents = validator_field_contents_blank ) # Create table in database my_table$create_table() # Insert data sample_data <- data.frame( id = 1:3, name = c("Alice", "Bob", "Charlie"), value = c(10.5, 20.3, 15.7), date_created = as.Date("2023-01-01") ) my_table$insert_data(sample_data) # Query data using dplyr result <- my_table$tbl() |> dplyr::filter(value > 15) |> dplyr::collect() # Add indexes for performance my_table$add_indexes(c("name", "date_created")) # Upsert (insert or update) data new_data <- data.frame( id = 2:4, name = c("Bob_Updated", "Charlie", "David"), value = c(25.0, 15.7, 30.2), date_created = as.Date("2023-01-02") ) my_table$upsert_data(new_data) ## End(Not run)
Retrieves comprehensive information about database tables including their names, row counts, and storage size metrics. This function provides database-specific implementations for different database systems.
get_table_names_and_info(connection)get_table_names_and_info(connection)
connection |
A database connection object (e.g., from |
A data.table containing table information with columns:
Character. Name of the table
Numeric. Number of rows in the table
Numeric. Total size of the table in gigabytes
Numeric. Size of data in gigabytes
Numeric. Size of indexes in gigabytes
## Not run: # Microsoft SQL Server example con <- DBI::dbConnect(odbc::odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost", database = "mydb") table_info <- get_table_names_and_info(con) print(table_info) DBI::dbDisconnect(con) # PostgreSQL example con <- DBI::dbConnect(RPostgres::Postgres(), host = "localhost", dbname = "mydb", user = "user") table_info <- get_table_names_and_info(con) print(table_info) DBI::dbDisconnect(con) ## End(Not run)## Not run: # Microsoft SQL Server example con <- DBI::dbConnect(odbc::odbc(), driver = "ODBC Driver 17 for SQL Server", server = "localhost", database = "mydb") table_info <- get_table_names_and_info(con) print(table_info) DBI::dbDisconnect(con) # PostgreSQL example con <- DBI::dbConnect(RPostgres::Postgres(), host = "localhost", dbname = "mydb", user = "user") table_info <- get_table_names_and_info(con) print(table_info) DBI::dbDisconnect(con) ## End(Not run)
This data comes from the Norwegian Surveillance System for Communicable Diseases (MSIS). The date corresponds to when the PCR-test was taken.
nor_covid19_cases_by_time_locationnor_covid19_cases_by_time_location
A csfmt_rts_data_v1 with 11028 rows and 18 variables:
day/isoweek
nation, county
nor
norge, 11 counties
2020
total
Isoyear of event
Isoweek of event
Isoyearweek of event
Season of event
Seasonweek of event
Calyear of event
Calmonth of event
Calyearmonth of event
Date of event
Number of confirmed covid19 cases
Number of confirmed covid19 cases per 100.000 population
The raw number of cases and cases per 100.000 population are recorded.
This data was extracted on 2022-05-04.
A pass-through validator that accepts any data without validation. This is useful as a placeholder when no specific data content validation is needed.
validator_field_contents_blank(data)validator_field_contents_blank(data)
data |
A data.frame or data.table containing the data to validate |
Always returns TRUE
# This validator always returns TRUE regardless of input test_data <- data.frame(id = 1:3, name = c("A", "B", "C"), value = c(10, 20, 30)) validator_field_contents_blank(test_data) # Works with any data structure empty_data <- data.frame() validator_field_contents_blank(empty_data)# This validator always returns TRUE regardless of input test_data <- data.frame(id = 1:3, name = c("A", "B", "C"), value = c(10, 20, 30)) validator_field_contents_blank(test_data) # Works with any data structure empty_data <- data.frame() validator_field_contents_blank(empty_data)
Validates that data contents conform to the csfmt_rts_data_v1 schema specification. This validator checks that granularity_time and granularity_geo fields contain valid values according to the surveillance data format requirements.
validator_field_contents_csfmt_rts_data_v1(data)validator_field_contents_csfmt_rts_data_v1(data)
data |
A data.frame or data.table containing the data to validate |
TRUE if data is valid for csfmt_rts_data_v1, FALSE otherwise (with error attribute)
# Valid data for csfmt_rts_data_v1 valid_data <- data.frame( granularity_time = c("date", "isoyearweek", "total"), granularity_geo = c("nation", "county", "municip"), stringsAsFactors = FALSE ) validator_field_contents_csfmt_rts_data_v1(valid_data) # Invalid data (wrong granularity_geo value) invalid_data <- data.frame( granularity_time = "date", granularity_geo = "invalid_geo", stringsAsFactors = FALSE ) validator_field_contents_csfmt_rts_data_v1(invalid_data)# Valid data for csfmt_rts_data_v1 valid_data <- data.frame( granularity_time = c("date", "isoyearweek", "total"), granularity_geo = c("nation", "county", "municip"), stringsAsFactors = FALSE ) validator_field_contents_csfmt_rts_data_v1(valid_data) # Invalid data (wrong granularity_geo value) invalid_data <- data.frame( granularity_time = "date", granularity_geo = "invalid_geo", stringsAsFactors = FALSE ) validator_field_contents_csfmt_rts_data_v1(invalid_data)
Validates that data contents conform to the csfmt_rts_data_v2 schema specification. This validator checks that granularity_time and granularity_geo fields contain valid values according to the surveillance data format requirements for version 2.
validator_field_contents_csfmt_rts_data_v2(data)validator_field_contents_csfmt_rts_data_v2(data)
data |
A data.frame or data.table containing the data to validate |
TRUE if data is valid for csfmt_rts_data_v2, FALSE otherwise (with error attribute)
# Valid data for csfmt_rts_data_v2 valid_data_v2 <- data.frame( granularity_time = c("date", "isoyearweek", "total"), granularity_geo = c("nation", "county", "municip"), stringsAsFactors = FALSE ) validator_field_contents_csfmt_rts_data_v2(valid_data_v2)# Valid data for csfmt_rts_data_v2 valid_data_v2 <- data.frame( granularity_time = c("date", "isoyearweek", "total"), granularity_geo = c("nation", "county", "municip"), stringsAsFactors = FALSE ) validator_field_contents_csfmt_rts_data_v2(valid_data_v2)
A pass-through validator that accepts any field types without validation. This is useful as a placeholder when no specific field type validation is needed.
validator_field_types_blank(db_field_types)validator_field_types_blank(db_field_types)
db_field_types |
A named character vector of database field types |
Always returns TRUE
# This validator always returns TRUE regardless of input field_types <- c("id" = "INTEGER", "name" = "TEXT", "date" = "DATE") validator_field_types_blank(field_types) # Works with any field types other_types <- c("value" = "DOUBLE", "status" = "BOOLEAN") validator_field_types_blank(other_types)# This validator always returns TRUE regardless of input field_types <- c("id" = "INTEGER", "name" = "TEXT", "date" = "DATE") validator_field_types_blank(field_types) # Works with any field types other_types <- c("value" = "DOUBLE", "status" = "BOOLEAN") validator_field_types_blank(other_types)
Validates that field types conform to the csfmt_rts_data_v1 schema specification. This validator ensures that the first 16 fields match the expected structure for real-time surveillance data format version 1.
validator_field_types_csfmt_rts_data_v1(db_field_types)validator_field_types_csfmt_rts_data_v1(db_field_types)
db_field_types |
A named character vector of database field types |
TRUE if field types are valid for csfmt_rts_data_v1, FALSE otherwise
# Valid field types for csfmt_rts_data_v1 valid_fields <- c( "granularity_time" = "TEXT", "granularity_geo" = "TEXT", "country_iso3" = "TEXT", "location_code" = "TEXT", "border" = "INTEGER", "age" = "TEXT", "sex" = "TEXT", "isoyear" = "INTEGER", "isoweek" = "INTEGER", "isoyearweek" = "TEXT", "season" = "TEXT", "seasonweek" = "DOUBLE", "calyear" = "INTEGER", "calmonth" = "INTEGER", "calyearmonth" = "TEXT", "date" = "DATE", "cases_n" = "INTEGER" ) validator_field_types_csfmt_rts_data_v1(valid_fields) # Invalid field types (wrong structure) invalid_fields <- c("id" = "INTEGER", "name" = "TEXT") validator_field_types_csfmt_rts_data_v1(invalid_fields)# Valid field types for csfmt_rts_data_v1 valid_fields <- c( "granularity_time" = "TEXT", "granularity_geo" = "TEXT", "country_iso3" = "TEXT", "location_code" = "TEXT", "border" = "INTEGER", "age" = "TEXT", "sex" = "TEXT", "isoyear" = "INTEGER", "isoweek" = "INTEGER", "isoyearweek" = "TEXT", "season" = "TEXT", "seasonweek" = "DOUBLE", "calyear" = "INTEGER", "calmonth" = "INTEGER", "calyearmonth" = "TEXT", "date" = "DATE", "cases_n" = "INTEGER" ) validator_field_types_csfmt_rts_data_v1(valid_fields) # Invalid field types (wrong structure) invalid_fields <- c("id" = "INTEGER", "name" = "TEXT") validator_field_types_csfmt_rts_data_v1(invalid_fields)
Validates that field types conform to the csfmt_rts_data_v2 schema specification. This validator ensures that the first 18 fields match the expected structure for real-time surveillance data format version 2.
validator_field_types_csfmt_rts_data_v2(db_field_types)validator_field_types_csfmt_rts_data_v2(db_field_types)
db_field_types |
A named character vector of database field types |
TRUE if field types are valid for csfmt_rts_data_v2, FALSE otherwise
# Valid field types for csfmt_rts_data_v2 (includes additional fields) valid_fields_v2 <- c( "granularity_time" = "TEXT", "granularity_geo" = "TEXT", "country_iso3" = "TEXT", "location_code" = "TEXT", "border" = "INTEGER", "age" = "TEXT", "sex" = "TEXT", "isoyear" = "INTEGER", "isoweek" = "INTEGER", "isoyearweek" = "TEXT", "season" = "TEXT", "seasonweek" = "DOUBLE", "calyear" = "INTEGER", "calmonth" = "INTEGER", "calyearmonth" = "TEXT", "date" = "DATE", "tag_outcome" = "TEXT", "tag_type" = "TEXT", "cases_n" = "INTEGER" ) validator_field_types_csfmt_rts_data_v2(valid_fields_v2)# Valid field types for csfmt_rts_data_v2 (includes additional fields) valid_fields_v2 <- c( "granularity_time" = "TEXT", "granularity_geo" = "TEXT", "country_iso3" = "TEXT", "location_code" = "TEXT", "border" = "INTEGER", "age" = "TEXT", "sex" = "TEXT", "isoyear" = "INTEGER", "isoweek" = "INTEGER", "isoyearweek" = "TEXT", "season" = "TEXT", "seasonweek" = "DOUBLE", "calyear" = "INTEGER", "calmonth" = "INTEGER", "calyearmonth" = "TEXT", "date" = "DATE", "tag_outcome" = "TEXT", "tag_type" = "TEXT", "cases_n" = "INTEGER" ) validator_field_types_csfmt_rts_data_v2(valid_fields_v2)