NCJRS Virtual Library

The Virtual Library houses over 235,000 criminal justice resources, including all known OJP works.

NCJ Number

311675

Journal

Journal of Statistical Theory and Practice, Volume 20 (2026)

Author(s)

Karen Kafadar

Date Published

February 2026

Abstract

Data science and machine learning algorithms are sometimes viewed as the only tools that are needed to analyze large datasets. Yet concepts from classical statistics remain critical in such settings. Massive data are rarely independent, outlier-free, or homogeneous: clusters, subdomains of observations, multiplicity of tests, and hidden trends are common and require statistical thinking, robust methods, and insightful displays. Sampling methodology, along with survey design and analysis, is essential in our current statistical framework for ensuring valid inferences with quantifiable uncertainties. This paper discusses some datasets where statistical analysis uncovered subtle biases and discrepancies that would otherwise have remained hidden in these seemingly trustworthy, data-rich sources. Until a new statistical framework is developed to generate valid inferences on non-randomized, highly dependent clustered data, these examples demonstrate that statistical thinking, statistical methods, and informative displays remain critical for ensuring valid analyses and communication of justified conclusions from “Big Data.”

(Publisher abstract provided.)


Robust Methods and Statistical Thinking for “Big Data” Science and Surveys
