This article describes the New Mexico Office of the Medical Investigator's effort to make data that were originally collected for purposes other than research available to researchers, and discusses the challenges involved.
A common problem in research is extracting data in an accurate, standardized, and time-efficient way. This is especially true when useful data were originally collected for a purpose other than research. The New Mexico Office of the Medical Investigator (OMI) controls such a database, called VAST. VAST is the source of a dataset that the OMI is making available for research through the New Mexico Decedent Image Database (NMDID). NMDID provides free access to 15,249 whole-body CT scans and 57 associated metadata variables from VAST. The OMI encountered several challenges in merging these data sources. For example, the same value for sex can be recorded as "M," "m," "male," or "Male." As data complexity increases, data recovery becomes less accurate. To address this, NMDID uses data standards derived from anthropology, medicine, and other fields. For instance, the field "substance usage" was standardized using the International Classification of Diseases. Where no appropriate standard existed, existing standards were modified or new ones were created. VAST also includes many free-text fields, each containing substantial and varied information. The OMI used Canary for natural language processing (NLP) to retrieve these data. This effort yielded drug prescription information on decedents as well as the environmental conditions of the cadaver. Manually reviewing each record to extract prescriptions and conditions would take approximately 5 minutes per case, totaling 1,250 hours for this dataset. NLP enabled completion of this work in 50 hours, roughly a 25-fold reduction. The CTs and associated metadata in NMDID can be used to address research questions in fields such as anatomy, growth and development, pathology, public health, and forensics. (publisher abstract modified)
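
The inconsistent sex entries mentioned above ("M/m/male/Male") illustrate the kind of variation that has to be collapsed into one value before records can be merged. The following is a minimal Python sketch of that idea, not the OMI's actual procedure: the function name `normalize_sex`, the lookup table `SEX_VARIANTS`, the canonical labels, and the female variants are all assumptions made for illustration.

```python
# Hypothetical lookup table: the canonical labels and the accepted variants
# are illustrative assumptions, not NMDID's actual coding scheme.
SEX_VARIANTS = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}


def normalize_sex(raw):
    """Return a canonical label for a raw sex entry, or None if unrecognized."""
    if raw is None:
        return None
    # Trim whitespace and ignore case so "M", "m", " male " all match.
    key = raw.strip().lower()
    return SEX_VARIANTS.get(key)


# The four variants quoted in the abstract all map to the same label.
raw_values = ["M", "m", "male", "Male"]
normalized = [normalize_sex(value) for value in raw_values]
```

Returning `None` for unrecognized entries, rather than guessing, leaves ambiguous records visible for manual review.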