Product

Verity Engine

Scientific Data Credibility Platform

A purpose-built pipeline for validating, scoring, and curating research data. Designed for the unique challenges of scientific datasets — where a single bad data point can invalidate an entire study.

Literature Data · Simulation Data · Experimental Data
Architecture

Four-Stage Curation Pipeline

Each stage is independently configurable. You can run the full pipeline end-to-end or invoke individual stages via our API.

STAGE 1

Data Ingestion

Multi-format scientific data parser

Accepts raw datasets from heterogeneous sources. Built-in parsers for common scientific formats with automatic schema detection and unit normalization.

  • CSV / JSON / Excel
  • HDF5 / NetCDF
  • CIF / POSCAR / XYZ
  • VASP / Gaussian / LAMMPS logs
  • Custom instrument outputs

Handles missing fields, inconsistent units (eV vs Ry, Å vs Bohr), and mixed-precision floats automatically. Outputs a unified internal representation.
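As a concrete illustration of the unit normalization described above, here is a minimal sketch, assuming a simple record-per-dict representation. This is not the actual Verity parser; the field names (`energy_unit`, `bond_length`) and helper are illustrative. The conversion constants are standard CODATA values.

```python
# Illustrative sketch of Stage 1 unit normalization (eV vs Ry, Angstrom vs Bohr).
# Not the actual Verity parser; record fields are hypothetical.

RY_TO_EV = 13.605693          # 1 Rydberg in electronvolts (CODATA)
BOHR_TO_ANGSTROM = 0.529177   # 1 Bohr radius in Angstrom (CODATA)

def normalize_record(record: dict) -> dict:
    """Convert a raw record to a unified internal unit system (eV, Angstrom)."""
    out = dict(record)
    if out.get("energy_unit") == "Ry":
        out["energy"] *= RY_TO_EV
        out["energy_unit"] = "eV"
    if out.get("length_unit") == "Bohr":
        out["bond_length"] *= BOHR_TO_ANGSTROM
        out["length_unit"] = "Angstrom"
    return out

raw = {"energy": 1.0, "energy_unit": "Ry",
       "bond_length": 2.0, "length_unit": "Bohr"}
clean = normalize_record(raw)
```

The same pattern extends to any unit pair: detect the source unit from metadata, multiply by a fixed conversion factor, and rewrite the unit tag so downstream stages see one consistent system.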

STAGE 2

Credibility Scoring

Physics-aware multi-dimensional assessment

Each record is evaluated across multiple credibility dimensions using domain-specific rules and statistical models.

Physical Consistency

Checks against known physical laws — negative formation energies, unrealistic bond lengths, thermodynamic violations

Statistical Coherence

Detects records that are statistical outliers within their distribution class using robust estimators (MAD, IQR)

Cross-Source Agreement

Compares values against multiple independent sources — does this DFT bandgap agree with experimental measurements?

Provenance Quality

Evaluates source reliability — journal impact, computational convergence parameters, experimental conditions reported

Outputs a composite credibility score (0–1) per record with per-dimension breakdowns. Fully interpretable — no black-box decisions.
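One simple, fully interpretable way to realize such a composite is a weighted mean over the four per-dimension scores, keeping the breakdown alongside the final number. The sketch below assumes this; the weights are illustrative, not Verity's actual calibration.

```python
# Illustrative composite credibility score: weighted mean of per-dimension
# scores in [0, 1]. Dimension names follow the four listed above;
# the weights are hypothetical.

WEIGHTS = {
    "physical_consistency": 0.35,
    "statistical_coherence": 0.25,
    "cross_source_agreement": 0.25,
    "provenance_quality": 0.15,
}

def composite_score(dims: dict) -> dict:
    """Combine per-dimension scores; keep the breakdown for interpretability."""
    score = sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)
    return {"score": round(score, 3), "breakdown": dims}

record = composite_score({
    "physical_consistency": 0.95,
    "statistical_coherence": 0.80,
    "cross_source_agreement": 0.70,
    "provenance_quality": 0.90,
})
```

Because every dimension's contribution is a visible weight times a visible score, any record's final number can be traced back to the specific rule or model that produced each component.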

STAGE 3

Intelligent Filtering

Human-in-the-loop curation workflow

Flagged records are categorized by severity. Researchers set their own thresholds and review borderline cases with full context.

High Confidence | Score > 0.85 | Auto-accepted
Review Needed   | 0.50 – 0.85  | Flagged for expert review
Low Confidence  | Score < 0.50 | Quarantined with reasoning

No data is silently deleted. All filtering decisions include human-readable explanations referencing the specific validation rule that triggered the flag.
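The three-tier routing above reduces to a small decision function. This is a sketch with the default thresholds shown in the table; real deployments set their own cutoffs.

```python
# Illustrative Stage 3 routing with the default thresholds (0.85 / 0.50).
# Both cutoffs are user-configurable in practice.

def route(score: float, accept_at: float = 0.85, review_at: float = 0.50) -> str:
    """Map a composite credibility score to a curation tier."""
    if score > accept_at:
        return "auto_accept"
    if score >= review_at:
        return "expert_review"
    return "quarantine"   # kept with reasoning attached, never silently deleted
```

Note the boundary convention: a score of exactly 0.85 falls into the review band, matching the "0.50 – 0.85" range in the table.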

STAGE 4

Validated Output

ML-ready datasets with provenance

Clean, structured datasets with per-record credibility metadata. Includes full curation lineage for reproducibility and publication compliance.

  • Cleaned dataset in original or standardized format
  • Per-record credibility scores and dimension breakdowns
  • Curation report (PDF) with methodology summary
  • Audit log for reproducibility
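To show what consuming the per-record credibility metadata might look like, here is a minimal sketch. The JSON field names (`id`, `credibility`, `breakdown`) are illustrative, not Verity's actual export schema.

```python
# Illustrative reader for per-record credibility metadata in a JSON export.
# Field names are hypothetical, not the real export schema.
import json

export = json.loads("""
[
  {"id": "r1", "credibility": 0.91,
   "breakdown": {"physical_consistency": 0.95, "provenance_quality": 0.88}},
  {"id": "r2", "credibility": 0.42,
   "breakdown": {"physical_consistency": 0.30, "provenance_quality": 0.60}}
]
""")

# Select only records above the auto-accept threshold for model training.
high_confidence = [r["id"] for r in export if r["credibility"] > 0.85]
```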

Datasets curated through Verity have shown improvements of 1.2x to 2.5x in downstream ML tasks across materials property prediction, reaction yield estimation, and structure-activity modeling.

Benchmarks

Measured Impact on Downstream ML Tasks

We evaluated model performance on five representative tasks before and after Verity curation, using identical model architectures and hyperparameters. Only the training data quality was changed.

Task                         | Domain                        | Metric        | Before | After | Improvement
Formation Energy Prediction  | Materials Science (DFT)       | MAE (eV/atom) | 0.142  | 0.061 | 2.3x
Band Gap Classification      | Semiconductors (Literature)   | F1 Score      | 0.71   | 0.89  | 1.25x
CVD Growth Rate Regression   | Nanomaterials (Experimental)  | R² Score      | 0.58   | 0.91  | 1.57x
Catalyst Activity Prediction | Chemistry (Mixed sources)     | RMSE          | 1.84   | 0.73  | 2.5x
Toxicity Endpoint Modeling   | Biomedical (Literature)       | AUC-ROC       | 0.74   | 0.92  | 1.24x

All benchmarks conducted on held-out test sets. “Before” denotes raw uncurated data; “After” denotes Verity-curated data. Same model architecture and training procedure used in both cases.
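The improvement factors above are oriented by metric type: error metrics (MAE, RMSE) improve when they shrink, while score metrics (F1, R², AUC-ROC) improve when they grow. A small sketch of that computation:

```python
# Improvement factor, oriented by metric type: before/after for error
# metrics (lower is better), after/before for score metrics (higher is better).

ERROR_METRICS = {"MAE", "RMSE"}

def improvement(metric: str, before: float, after: float) -> float:
    """Return the improvement ratio so that > 1.0 always means 'better'."""
    ratio = before / after if metric in ERROR_METRICS else after / before
    return round(ratio, 2)
```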

Developer Experience

Simple API, Deep Configurability

Verity exposes a Python SDK for integration into existing research workflows. Three lines to curate, full control when you need it.

example_curate.py
from sciscale import VerityEngine

# Initialize with domain-specific config
engine = VerityEngine(
    domain="materials_science",
    scoring_mode="strict",
)

# Load your raw dataset
dataset = engine.load("raw_cvd_data.csv",
    format="auto",
    unit_system="SI",
)

# Run full curation pipeline
result = engine.curate(dataset)

# Inspect results
print(f"Total records: {result.total}")
print(f"High confidence: {result.accepted}")
print(f"Flagged for review: {result.flagged}")
print(f"Quarantined: {result.rejected}")

# Export curated data + report
result.export("curated_output/",
    formats=["csv", "json"],
    include_scores=True,
    generate_report=True,
)

Python SDK

pip install sciscale

REST API

Programmatic access

Web Dashboard

Visual curation review

Coverage

Domain-Specific Validation Rules

Verity ships with pre-configured validation rulesets for major scientific domains. Custom rulesets can be defined for niche applications.
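A custom ruleset is naturally expressed as a registry of named checks, each paired with the human-readable explanation that surfaces when a record is flagged. The sketch below shows that pattern; the `Rule` class, rule names, and field names are illustrative, not the sciscale SDK's actual API.

```python
# Illustrative rule-registry pattern for a custom validation ruleset.
# Names here (Rule, RULESET, validate) are hypothetical, not the SDK's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]   # returns True when the record passes
    message: str                    # human-readable explanation on failure

RULESET = [
    Rule("positive_growth_rate",
         lambda r: r.get("growth_rate_nm_min", 0.0) > 0,
         "CVD growth rate must be positive"),
    Rule("plausible_bond_length",
         lambda r: 0.5 <= r.get("bond_length_angstrom", 1.5) <= 4.0,
         "Bond length outside 0.5-4.0 Angstrom is physically implausible"),
]

def validate(record: dict) -> list:
    """Return the failure message for every rule the record violates."""
    return [rule.message for rule in RULESET if not rule.check(record)]

flags = validate({"growth_rate_nm_min": -2.0, "bond_length_angstrom": 1.4})
```

Keeping the explanation next to the predicate is what makes filtering decisions traceable: every flag a reviewer sees points back to exactly one named rule.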

Materials Science

Production

Crystal structures, formation energies, band gaps, elastic constants, CVD process parameters

Chemistry

Production

Reaction yields, activation energies, molecular descriptors, catalyst performance metrics

Computational Physics

Production

DFT outputs, MD trajectories, CFD simulations, Monte Carlo results

Biomedical

Beta

Drug-target interactions, toxicity endpoints, clinical measurement records

Ready to curate your research data?

Verity is currently available through direct engagement. Get in touch to discuss your data curation needs or request a pilot.