NCJRS Virtual Library

Robust Methods and Statistical Thinking for “Big Data” Science and Surveys

NCJ Number
311675
Journal
Journal of Statistical Theory and Practice, Volume: 20, Dated: 2026
Author(s)
Karen Kafadar
Date Published
February 2026
Abstract

Data science and machine learning algorithms are sometimes viewed as the only tools that are needed to analyze large datasets. Yet concepts from classical statistics remain critical in such settings. Massive data are rarely independent, outlier-free, or homogeneous: clusters, subdomains of observations, multiplicity of tests, and hidden trends are common and require statistical thinking, robust methods, and insightful displays. Sampling methodology, along with survey design and analysis, is essential in our current statistical framework for ensuring valid inferences with quantifiable uncertainties. This paper discusses some datasets where statistical analysis uncovered subtle biases and discrepancies that would have been hidden in these seemingly trustworthy, data-rich sources. Until a new statistical framework is developed to generate valid inferences on non-randomized, highly dependent clustered data, these examples demonstrate that statistical thinking, statistical methods, and informative displays remain critical for ensuring valid analyses and communication of justified conclusions from “Big Data.”

(Publisher abstract provided.)