Overview | What happens to data shared with NACHC?
NACHC is increasingly involved in data-driven projects. This page describes how data is processed and used once it is shared with NACHC.
In 2020, NACHC began constructing an architecture to house data and provide analytic services. Cosmos is NACHC’s enterprise informatics system for the capture, management, and analysis of data. Cosmos is designed to accept data and transform it for use across NACHC teams, projects, and organizations. Cosmos is built around several software tools: Confluence, Databricks, and Microsoft Power BI.
The diagram to the right displays the components of the NACHC data and informatics architecture; raw data enters Cosmos at the bottom of the diagram and ascends through the data lifecycle, which culminates in reporting and access. This diagram reflects both the current state and aspects of future-state functionality that align with the anticipated needs of data partners. Data privacy and security are key components of NACHC's informatics work and the Cosmos environment.
Below, each of the five domains is described, with technical details included in the right column. Cosmos evolves frequently, and this page is updated often to reflect the current state.
Description of Data Sources
NACHC receives many types of limited and de-identified datasets from external organizations; organizations sharing data with NACHC are referred to as data partners. Each project dataset has different components and structure. Project datasets are prepared by data partners with assistance from NACHC project team members. Templates and data dictionaries to facilitate creation of a project-specific dataset are available at /wiki/spaces/COS/pages/914263418 and prioritize the use of standard terminology (e.g., ICD-10 for diagnostic codes and LOINC for lab tests). Data are transmitted to NACHC from data partners via project-specific Confluence web pages that are hosted by NACHC on Amazon Web Services. These pages are secured by Confluence authentication, and access is limited to NACHC project team members. Detailed information on transmitting and receiving data through Confluence is available at /wiki/spaces/COS/pages/914261191. NACHC is committed to accepting data from data partners in any format; custom tools can be created to ingest alternative data formats.
Technical Details about Character-Separated Files (CSV): Currently, all project datasets are received as CSV files, a category that here includes Excel files, comma-separated values files, and pipe-delimited files.
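Because both comma-separated and pipe-delimited files arrive, a loader has to work out which separator a given file uses. The following is a minimal sketch of one way to do that with Python's standard library. The function names and the sample columns are hypothetical and are not part of Cosmos.

```python
import csv
import io


def detect_delimiter(header_line, candidates=(",", "|", "\t", ";")):
    """Return the candidate delimiter that occurs most often in the header."""
    counts = {d: header_line.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no known delimiter found in header line")
    return best


def read_delimited(text):
    """Parse delimited text into a list of dicts, one per data row.

    Hypothetical helper for illustration only.
    """
    lines = text.splitlines()
    if not lines:
        return []
    delimiter = detect_delimiter(lines[0])
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]
```

Counting candidates in the header line is deliberately simple; a production loader would more likely take the delimiter from the project's data dictionary than guess it.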
Technical Details about APIs (XML, JSON, or FHIR): The Office of the National Coordinator has recommended the adoption and use of APIs for efficient exchange of health information (45 CFR Part 170). Application programming interface (API) functionality is a future state for data sharing at NACHC. API benefits include the ability to automate what would otherwise be a manual process so that two machines (one being the data provider and the other the recipient) can talk directly. The three API types listed here are the most commonly used in healthcare. More information on Fast Healthcare Interoperability Resources or FHIR (created by the Health Level Seven International (HL7) health-care standards organization). More information on XML. More information on JSON.
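To illustrate what machine-to-machine exchange could look like, the sketch below parses a JSON payload shaped like a FHIR Patient resource and flattens it into a row. API ingestion is a future state at NACHC, so this is purely illustrative: the sample payload is made up, and the function name and output column names are assumptions. Only the element names `resourceType`, `id`, `gender`, and `birthDate` come from the FHIR Patient resource.

```python
import json

# Made-up sample payload shaped like a FHIR Patient resource.
SAMPLE_PATIENT = (
    '{"resourceType": "Patient", "id": "example", '
    '"gender": "female", "birthDate": "1980-04-12"}'
)


def patient_to_row(payload):
    """Flatten a FHIR-style Patient JSON document into a flat dict.

    Hypothetical helper; the output column names are assumptions.
    """
    resource = json.loads(payload)
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a Patient resource")
    return {
        "patient_id": resource.get("id"),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }
```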
Description of Data Staging
The purpose of staging is to prepare data for the data warehouse so that only clean, normalized data in the desired format is loaded into it. NACHC's staging environment consists of server-based storage and a data lake. Staging begins with data extraction, which includes cataloguing the data received, extracting the appropriate data, discarding unnecessary or irrelevant information, and capturing metadata (structured information that describes the dataset received). Metadata ensures that NACHC can find, use, and preserve data accurately so that NACHC can fulfill its data provenance responsibilities.
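As a rough sketch of the metadata capture step, the function below catalogues a received delimited file by recording its name, project, checksum, column names, and row count. The field names in the returned record are assumptions chosen for illustration and do not reflect the actual Cosmos metadata schema.

```python
import csv
import hashlib
import io
from datetime import datetime, timezone


def capture_metadata(filename, content, project):
    """Build a metadata record describing a received comma-separated file.

    Hypothetical helper; the record layout is an assumption.
    """
    reader = csv.reader(io.StringIO(content))
    header = next(reader, [])           # first row holds the column names
    row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return {
        "filename": filename,
        "project": project,
        # A checksum lets the file be verified later for provenance purposes.
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "columns": header,
        "row_count": row_count,
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
```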
Technical Details about Server-based capture: Server-based capture refers to a NACHC server where data is initially extracted from Confluence and transformed into a desired format. For example, Excel files would be converted to CSV files.
Technical Details about Data Lake: A data lake is a storage solution that holds data in its raw format. NACHC used Databricks to build a data lake (in AWS) to store data until it is loaded into the data warehouse. Compared to a data warehouse, which stores data in a highly structured database, a data lake uses a flat architecture with less structure. From the data lake, a Java program pulls each dataset into the data warehouse. Each data file is tagged with metadata using a standard MySQL script and is associated with a project; then, each data attribute is assigned to a larger data type. The metadata is the map for how that data will be loaded into the data warehouse.
Description of Data Normalization
Data normalization is the organization of data to increase its cohesion, facilitate data cleaning, and produce higher-quality data. Normalization includes discarding unstructured data and converting data into the desired format and structure, for example, converting a text date into a structured date. Normalization also includes renaming variables to be consistent with a common structure.
Normalization includes data cleaning, which is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. NACHC cleans each dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. Normalization may include incorporating ancillary data, such as information about a health center from the UDS dataset. Embedded in data normalization is NACHC's adoption of a common data model, which is a standard and extensible design for how data will be stored to facilitate efficient use.
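The two normalization steps named above, converting a text date into a structured date and renaming variables to a common structure, can be sketched as follows. The accepted date formats, the column mapping, and the function names are all assumptions made for illustration; they are not NACHC's actual rules.

```python
from datetime import datetime

# Assumed input formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")

# Assumed mapping from partner-specific names to common names.
COLUMN_MAP = {"DOB": "birth_date", "PatientID": "patient_id"}


def parse_date(text):
    """Return a date object for the first matching format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_row(row):
    """Rename columns and convert values in *_date columns to date objects."""
    normalized = {}
    for name, value in row.items():
        new_name = COLUMN_MAP.get(name, name)
        if new_name.endswith("_date"):
            normalized[new_name] = parse_date(value)
        else:
            normalized[new_name] = value
    return normalized
```

Returning `None` for an unparseable date, rather than raising, keeps the row available for review during data cleaning; whether that is the right behavior depends on the project.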
NACHC does not follow one prescribed set of data cleaning steps because the process varies from dataset to dataset. NACHC cleans data by creating a custom data cleaning program (an ETL script written in SQL) for each dataset. NACHC works with data partners and subject matter experts to identify data quality issues, confirm that data cleaning is being performed correctly, and document the cleansing issues that were identified and how they were resolved.
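To give a sense of what a dataset-specific cleaning script in SQL might contain, the sketch below runs three cleaning statements against an in-memory SQLite database from Python. It is an illustration only: the table `staging_dx`, its columns, the sample rows, and the specific rules are assumptions, and the actual scripts differ for every dataset.

```python
import sqlite3

# Hypothetical cleaning rules for a hypothetical staging table.
CLEANING_SQL = """
-- Remove incomplete rows (missing identifier or diagnosis code).
DELETE FROM staging_dx
WHERE patient_id IS NULL OR dx_code IS NULL OR TRIM(dx_code) = '';

-- Fix incorrectly formatted codes (stray whitespace, lowercase).
UPDATE staging_dx SET dx_code = UPPER(TRIM(dx_code));

-- Remove duplicates, keeping the first copy of each row.
DELETE FROM staging_dx
WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM staging_dx
    GROUP BY patient_id, dx_code, visit_date
);
"""


def clean(conn):
    """Run the cleaning script and return the surviving rows."""
    conn.executescript(CLEANING_SQL)
    query = (
        "SELECT patient_id, dx_code, visit_date "
        "FROM staging_dx ORDER BY patient_id, dx_code"
    )
    return conn.execute(query).fetchall()


def demo():
    """Load made-up sample rows into an in-memory table and clean them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staging_dx "
        "(patient_id INTEGER, dx_code TEXT, visit_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO staging_dx VALUES (?, ?, ?)",
        [
            (1, " e11.9", "2021-03-07"),  # badly formatted
            (1, "E11.9", "2021-03-07"),   # duplicate once formatted
            (2, "", "2021-03-08"),        # incomplete: empty code
            (None, "I10", "2021-03-09"),  # incomplete: no identifier
            (3, "I10", "2021-03-09"),     # clean
        ],
    )
    return clean(conn)
```

Formatting is fixed before duplicates are removed so that rows differing only in whitespace or case are recognized as the same record.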
Technical Details about the Data Warehouse: The data warehouse is a relational database, meaning it consists of many related tables that are indexed and connected by a series of IDs. The data warehouse can be accessed through SQL, Python, R, or any other Open Database Connectivity (ODBC) capable software tool. NACHC's data warehouse uses a common data model. This model is based on existing healthcare standards-based data models (e.g., FHIR, OMOP). The data model includes attributes for string data (i.e., data that is not coded). The current NACHC Data Model is based on current NACHC projects but will evolve as NACHC's informatics work grows.
Description of Data Management
The purpose of data management is to create curated data assets that are analytics-ready. Data management includes the creation of project-specific data views and data marts. Each project's data flows from the data warehouse into either views or data marts; a project may have one or several, depending on its needs. Curation of data marts and views also facilitates the use of functions and manipulation of data in ways that meet the project's requirements. Unlike the data warehouse, data marts and views:
- enhance user query and response time due to reduction in volume of data
- provide easy access to frequently requested data
- are simpler and less costly to implement than a data warehouse
- are agile and easier to modify
- are partitioned and allow very granular access control privileges
- can be segmented and stored on different hardware/software platforms
Project-specific data assets also allow NACHC to ensure that the project team has access to only the appropriate and relevant project data as an alternative to providing access to the NACHC data warehouse where all project data resides.
Technical Details about Data Views and Data Marts: NACHC uses data views and data marts; both draw data from the data warehouse. The main purpose of a view in SQL is to combine data from multiple sources in a useful way without having to create yet another database table to store that data. The multiple sources can include tables and views from other database servers. Views are stored as permanent query objects in the database. A view is therefore a virtual table: the results it returns look like those of a regular table, but the table exists only as the result of running the query that defines the view. A regular view does not store any data in the database.
A data mart is a repository of data that is designed to serve a particular project or community of knowledge. Data marts enable users to retrieve information for a single project, topic, or data partner, improving user response time. Compared to views, data marts are more permanent because they are an established database (one table or multiple tables). Access to data marts is also easier to control, which helps ensure that individuals have only the access that is approved. A data mart usually draws data from only a few sources compared to a data warehouse.
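The difference between a view and a data mart can be seen in a small sketch using an in-memory SQLite database. The view is a stored query that holds no rows of its own, so it reflects new warehouse rows immediately; the data mart, simplified here to a single table, holds a stored copy that stays as it was until refreshed. All table names, project names, and sample rows are made up for illustration.

```python
import sqlite3


def build_warehouse():
    """Create a toy warehouse plus one view and one mart for project 'alpha'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient (
            patient_id INTEGER PRIMARY KEY,
            project TEXT
        );
        CREATE TABLE diagnosis (
            dx_id INTEGER PRIMARY KEY,
            patient_id INTEGER REFERENCES patient(patient_id),
            dx_code TEXT
        );
        INSERT INTO patient VALUES (1, 'alpha'), (2, 'beta');
        INSERT INTO diagnosis VALUES (10, 1, 'E11.9'), (11, 2, 'I10');

        -- View: a stored query; it holds no rows of its own.
        CREATE VIEW alpha_dx_view AS
            SELECT p.patient_id, d.dx_code
            FROM patient AS p
            JOIN diagnosis AS d ON d.patient_id = p.patient_id
            WHERE p.project = 'alpha';

        -- Data mart (one table here): a stored copy of the project's rows.
        CREATE TABLE alpha_dx_mart AS
            SELECT p.patient_id, d.dx_code
            FROM patient AS p
            JOIN diagnosis AS d ON d.patient_id = p.patient_id
            WHERE p.project = 'alpha';
        """
    )
    return conn


def dx_codes(conn, relation):
    """Return sorted diagnosis codes from a view or table.

    The relation name is interpolated directly, so it must be a trusted,
    hard-coded name as in this sketch.
    """
    rows = conn.execute(f"SELECT dx_code FROM {relation} ORDER BY dx_code")
    return [row[0] for row in rows]
```

After a new diagnosis for patient 1 is inserted into `diagnosis`, `dx_codes(conn, "alpha_dx_view")` includes it while `dx_codes(conn, "alpha_dx_mart")` does not. Both expose only project alpha's rows, which mirrors the access-control point made above.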
Description of Analytics and Access
One of NACHC's informatics goals is to use analytics with data partners to improve patient care and community health. We reach this goal by engaging partners in consumer-driven analysis and implementation science. Analytics are used to understand the composition of the data and to synthesize findings from the data. Analytics primarily includes examining temporal trends, geographic trends, bivariate analysis, frequencies, and regression to assess associations with an outcome.
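Two of the simpler analyses listed above, frequencies and temporal trends, can be sketched with Python's standard library. The record layout (dicts with `dx_code` and an ISO-formatted `visit_date`) and the function names are assumptions for illustration; NACHC's own analyses are conducted in R and Power BI.

```python
from collections import Counter


def frequencies(records, field):
    """Count how often each value of `field` occurs."""
    return Counter(record[field] for record in records)


def monthly_counts(records, date_field="visit_date"):
    """Count records per calendar month, in chronological order.

    Assumes dates are ISO strings (YYYY-MM-DD), so the first seven
    characters identify the month.
    """
    counts = Counter(record[date_field][:7] for record in records)
    return dict(sorted(counts.items()))
```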
How do analytics happen? Analytic and visualization tools are connected to project-specific data assets (views or data marts) to conduct analysis. Analytic environments serve a secondary function of providing access to project-specific datasets for the project team, which will one day include team members external to NACHC. The intent of making project data available back to data partners is to support those organizations in hypothesis generation, analysis, and publication. NACHC also supports partners in cultivating investigators and developing the informatics skills and methodological expertise needed to complete a partner-led study. NACHC commits to meeting partners at their level of skill and expertise. Access is also a priority because it is a tool to help HIT and clinical staff communicate about their work.
Technical Details about Visualization, Analysis, and Sharing: Cosmos is built on an architecture that allows analytic access to information via all common channels, including ODBC/JDBC, REST, and analytical and programming tools such as R, Python, Java, SQL, SAS, Stata, SPSS, and Microsoft Power BI. It also allows the export of specified datasets as either a series of files representing relational information or a single file representing a flat data structure. The architecture also allows for role-based access to raw data. The current system has limited auditing of user access because user access is currently only allowed directly by NACHC-CAD clients. Auditing capabilities will be added as needed. NACHC primarily uses Power BI to visualize data and R to conduct statistical analysis but can support complementary analytic tools should the need arise.
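The two export shapes mentioned above can be contrasted in a short sketch: a relational export writes one file per table, linked by IDs, while a flat export joins the tables into a single file. The table layouts, file names, and function names below are hypothetical and are not the actual Cosmos export format.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def export_relational(patients, visits):
    """One file per table; rows stay linked through patient_id."""
    return {
        "patient.csv": to_csv(patients, ["patient_id", "sex"]),
        "visit.csv": to_csv(visits, ["visit_id", "patient_id", "visit_date"]),
    }


def export_flat(patients, visits):
    """One file; each visit row carries its patient's attributes."""
    by_id = {patient["patient_id"]: patient for patient in patients}
    flat = [{**by_id[visit["patient_id"]], **visit} for visit in visits]
    return to_csv(flat, ["visit_id", "patient_id", "sex", "visit_date"])
```

The flat form is convenient for tools that expect a single rectangular dataset, at the cost of repeating patient attributes on every visit row.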