Throughout the training exercises on this site we will use a small sample data set. If you followed the instructions on the environment setup page to set up your environment, you will find the sample data in the ~/bigdata-bootcamp/data folder in the virtual environment.
There are two data files, one of which is named control.csv. For the purpose of these exercises we define patients who developed heart failure (HF) at some point in time as case patients, and those who did not develop HF as control patients.
Each line of the sample data file consists of a tuple structured as (patient-id, event-id, timestamp, value). Below are a few lines as an example:
patient-id is a patient identifier (id) used to differentiate records from different patients. For example, the portion of data we show above is all about the same patient, who has an id of
event-id encodes all the clinical events that a patient has had. For example, DRUG00440128228 indicates that the patient was taking a drug identified by a National Drug Code of 00440128228. The numbers in DIAG486 are the first 3 digits of an ICD9 code, which in this case is the code for Pneumonia. For this data an event-id of PAYMENT means that the patient made a payment with the corresponding dollar amount.
timestamp indicates the date on which the event on that row happened. Here the timestamp is not formatted as a real date but rather as an offset from an unspecified start point. This is done both to simplify processing and to protect the privacy of the patients' data.
value is the associated value for an event. See the table below for a detailed description of the data in the value field.
| event type | sample event-id | value meaning | example |
|---|---|---|---|
| diagnostic code | DIAG486 | Will always be | |
| drug consumption | DRUG00440128228 | Dosage of the drug | 30 |
| payment | PAYMENT | Amount of payment made on | |
| heartfailure | heartfailure | Indicator of heart failure event | 1.0 |
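
To make the record layout concrete, here is a minimal Python sketch of how one might parse such lines and classify each event by its event-id. It rests on several assumptions that are not confirmed above: that fields are separated by commas (suggested only by the .csv extension), that the timestamp offset is an integer, and that the value is numeric. The names Event, parse_line and event_kind are hypothetical helpers, and the sample records are made up for illustration; they are not taken from the real data files.

```python
from typing import NamedTuple


class Event(NamedTuple):
    patient_id: str
    event_id: str
    timestamp: int   # offset from an unspecified start point (assumed integer)
    value: float     # meaning depends on the event type (see the table)


def parse_line(line: str) -> Event:
    """Split one record into its four fields (assumes comma separation)."""
    patient_id, event_id, timestamp, value = line.strip().split(",")
    return Event(patient_id, event_id, int(timestamp), float(value))


def event_kind(event_id: str) -> str:
    """Classify an event-id using the naming patterns described above."""
    if event_id.startswith("DRUG"):
        return "drug"
    if event_id.startswith("DIAG"):
        return "diagnostic"
    if event_id == "PAYMENT":
        return "payment"
    if event_id == "heartfailure":
        return "heartfailure"
    return "unknown"


# Made-up records for illustration only; ids, offsets and values are placeholders.
sample_lines = [
    "P0001,DIAG486,100,1.0",
    "P0001,DRUG00440128228,105,30",
    "P0001,PAYMENT,105,15.0",
    "P0001,heartfailure,300,1.0",
]

events = [parse_line(line) for line in sample_lines]
kinds = [event_kind(e.event_id) for e in events]

# A patient with at least one heartfailure event would count as a case patient.
has_hf = any(kind == "heartfailure" for kind in kinds)
```

Keeping the event-id as a plain string and deriving the event type from it mirrors how the data is described here; a real pipeline may prefer to validate each field more strictly.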