Summary: | Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017. === This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. === Cataloged from student-submitted PDF version of thesis. === Includes bibliographical references (pages 85-86). === Comparing irregular and event-driven time series data points is beyond the capabilities of most statistical techniques. This limits the potential to run insightful retrospective studies on many cross-sectional time-series datasets. To unlock the value of these datasets, we need techniques to standardize observations with irregular events enough to compare them to each other, and ways to select and sample them so as to produce class balances for each stratum at modeling time that lend themselves to statistically sound analysis. In this study, we have developed two selection techniques and three sampling techniques for a characteristic cross-sectional time-series dataset. We found that using a Fluid-Balance Similarity-Based Dynamic Time Warp selection procedure with nearest neighbor parameter k=1 and using a Gamma distribution for sampling days produced consistently better class balance than all other methods when bootstrapped over 100 independent runs. We have written, documented, and published open-source MATLAB code for each selection and sampling technique, along with our bootstrap test. To evaluate our results, we have developed the Class Imbalance Penalty, a new metric that gives the lowest scores to the selection and sampling runs that produce the most comparable counts of treatment and non-treatment observations for all strata. We validated our methods in the context of a study of diuretics treatment effects in ICU patients with Sepsis, drawn from the MIMIC II database. 
Starting from a group of 3,503 unique ICU stays from 2,341 study patients, with a Diuretics-treatment cohort of 349 unique ICU stays from 332 patients, we tested each selection and sampling technique, observing the trends across our different methods. We observed that sampling day was a stronger predictor of good class balance than selection technique, that the strongest similarity level (k=1) with the shortest history we considered produced the best results, and that using a Gamma distribution for timepoint sampling most closely matched the distribution of actual administration days. Ultimately, we found strong evidence that our study lacked an important covariate, physician-id, to more fully account for seemingly unpredictable assignments to Diuretics-treatment in our dataset. === by Brian Bell Jr. === M. Eng.
|
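Illustrative note: the summary names a Dynamic Time Warp selection procedure with nearest neighbor parameter k=1 but does not spell out the computation. The following is a minimal sketch of generic dynamic time warping with nearest-neighbor selection, written in Python rather than the thesis's MATLAB. It is not the author's published code. The function names, the absolute-difference step cost, and the fluid-balance values are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Uses absolute difference as the per-step cost (an assumption; the
    thesis may use a different local cost or windowing constraint).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def nearest_neighbors(query, candidates, k=1):
    """Return the ids of the k candidate series closest to query under DTW."""
    ranked = sorted(candidates, key=lambda cid: dtw_distance(query, candidates[cid]))
    return ranked[:k]


# Hypothetical daily fluid-balance histories (arbitrary units), not MIMIC II data.
treated_history = [0.5, 1.2, 2.0, 1.1]
untreated = {
    "stay_a": [0.4, 0.6, 1.3, 2.1, 1.0],
    "stay_b": [3.0, 3.5, 2.8],
    "stay_c": [-1.0, -0.5, 0.0, 0.2],
}

# With k=1, each treated stay is paired with its single most similar untreated stay.
match = nearest_neighbors(treated_history, untreated, k=1)
print(match)
```

Because the warping path can stretch or compress either series, histories of different lengths remain comparable, which is the property that makes this family of distances suited to irregular ICU time series.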