Article · sciencedetective.org · 14 min read

Unveiling Copy-Paste Errors in Scientific Datasets

By Markus Englund

AI Summary

A landmark paper on Parkinson's disease reported that the disease might originate in the gut rather than the brain. The study has been cited over 3,000 times, and its dataset, which has been publicly available for over eight years, contains duplicated data sequences. I detected the errors with software I developed, inspired by previous cases of data fabrication. I scanned 600 datasets and found 18 with significant issues, including three notable cases.

## Case 1: Parkinson's Disease Study

The study involved genetically predisposed mice whose Parkinson's symptoms disappeared when their microbiomes were cleared. However, the motor function measurements contain duplicated sequences, and these errors affect the study's conclusions. The errors could be accidental or deliberate; either way, they significantly weaken the study's validity.
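To make the idea of a "duplicated sequence" concrete, here is a minimal sketch of how repeated runs of values in a single column could be flagged. It is not the software described in this article; the function name `find_repeated_runs`, the `min_len` threshold, and the sample numbers are all hypothetical and chosen only for illustration.

```python
def find_repeated_runs(values, min_len=4):
    """Return (first_start, repeat_start, length) for each run of at least
    min_len consecutive values that reappears later in the same column.
    Illustrative sketch only; not the author's actual tool."""
    first_seen = {}  # window of min_len values -> index where it first occurred
    runs = []
    n = len(values)
    i = 0
    while i + min_len <= n:
        window = tuple(values[i:i + min_len])
        j = first_seen.get(window)
        if j is not None and j + min_len <= i:
            # Extend the match while the values keep agreeing,
            # without letting the two runs overlap.
            length = min_len
            while (i + length < n and j + length < i
                   and values[j + length] == values[i + length]):
                length += 1
            runs.append((j, i, length))
            i += length
        else:
            first_seen.setdefault(window, i)
            i += 1
    return runs


# Made-up measurements: positions 6-10 repeat positions 0-4.
measurements = [1.2, 3.4, 5.6, 7.8, 9.0, 2.1, 1.2, 3.4, 5.6, 7.8, 9.0, 4.4]
repeats = find_repeated_runs(measurements)
print(repeats)  # [(0, 6, 5)]
```

A check like this only suits measurements with enough precision that long exact repeats are unlikely by chance. Columns of constants, small integers, or coarse categories would produce many harmless hits.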

## Case 2: Ostrich-Snake Data Mixup

A study on toxin resistance in animals showed duplicated and near-duplicated data between ostrich and snake samples. The lead author suggested these were due to measurement variations, but the pattern of errors suggests otherwise. Whether the data were altered accidentally or deliberately remains an open question, and the authors plan to replicate the process to verify the results.
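As a rough illustration of what "duplicated and near-duplicated" can mean in practice, the sketch below compares two series position by position and counts exact matches and values that differ by less than a small tolerance. The function name `match_summary`, the tolerance, and the numbers are hypothetical; they are not taken from the study or from the software used to find the problem.

```python
import math


def match_summary(a, b, abs_tol=0.002):
    """Count exact and near matches between two equally ordered series.
    Illustrative sketch only; abs_tol is an arbitrary assumed threshold."""
    exact = 0
    near = 0
    for x, y in zip(a, b):
        if x == y:
            exact += 1
        elif math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol):
            near += 1
    return {"compared": min(len(a), len(b)), "exact": exact, "near": near}


# Made-up readings for two unrelated samples.
sample_a = [0.512, 0.498, 0.733, 0.201, 0.640]
sample_b = [0.512, 0.499, 0.733, 0.350, 0.641]
summary = match_summary(sample_a, sample_b)
print(summary)  # {'compared': 5, 'exact': 2, 'near': 2}
```

Independent measurements of different samples would be expected to produce few such matches. A high share of exact or near matches is therefore a reason to look more closely, not proof of any particular cause.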

## Case 3: Fish Size Scramble

In a study on fish behavior, the fish size measurements contained errors. The authors acknowledged that they had misaligned data files, which led to incorrect conclusions about the effect of body size on behavior. They corrected the dataset, and the revised analysis showed that body size had a minor effect on behavior.

Overall, the software found errors in 3% of the datasets (18 of 600), but the true error rate is likely higher. Because scientific publishing has no dedicated error-checking step, such mistakes often go unnoticed. Dryad, a research data repository, has been supportive in pushing for corrections. With funding from Astral Codex Ten, I plan to continue scanning datasets to uncover more errors.

Key Concepts

Data Integrity

Data integrity refers to the accuracy and consistency of data over its lifecycle. It is crucial for ensuring reliable and valid results in scientific research.

Scientific Misconduct

Scientific misconduct involves the violation of ethical standards in research, including data fabrication, falsification, and plagiarism.

Category

Science

Summarized by Mente
