Verity Engine
Scientific Data Credibility Platform
A purpose-built pipeline for validating, scoring, and curating research data. Designed for the unique challenges of scientific datasets — where a single bad data point can invalidate an entire study.
Four-Stage Curation Pipeline
Each stage is independently configurable. You can run the full pipeline end-to-end or invoke individual stages via our API.
Data Ingestion
Multi-format scientific data parser
Accepts raw datasets from heterogeneous sources. Built-in parsers for common scientific formats with automatic schema detection and unit normalization.
Handles missing fields, inconsistent units (eV vs Ry, Å vs Bohr), and mixed-precision floats automatically. Outputs a unified internal representation.
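As an illustration of the unit normalization described above, here is a minimal sketch. It is not Verity's implementation: the function name `normalize`, the lookup table, and the choice of eV and Å as canonical units are assumptions made for this example. The conversion factors are the CODATA 2018 values.

```python
# Conversion factors (CODATA 2018).
RYDBERG_TO_EV = 13.605693122994
BOHR_TO_ANGSTROM = 0.529177210903

# Hypothetical canonical convention for this sketch: energies in eV, lengths in Å.
_TO_CANONICAL = {
    "eV": ("eV", 1.0),
    "Ry": ("eV", RYDBERG_TO_EV),
    "Å": ("Å", 1.0),
    "Bohr": ("Å", BOHR_TO_ANGSTROM),
}


def normalize(value, unit):
    """Convert (value, unit) to the canonical unit for that quantity.

    A missing value (None) is passed through unchanged so that gaps in the
    raw data stay visible instead of being silently filled.
    """
    if unit not in _TO_CANONICAL:
        raise ValueError(f"unknown unit: {unit!r}")
    target_unit, factor = _TO_CANONICAL[unit]
    if value is None:
        return None, target_unit
    return float(value) * factor, target_unit
```

Casting to `float` also collapses mixed-precision inputs into one representation; a production parser would likely track the original precision as well.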
Credibility Scoring
Physics-aware multi-dimensional assessment
Each record is evaluated across multiple credibility dimensions using domain-specific rules and statistical models.
Physical Consistency
Checks records against known physical constraints: implausible formation energies, unrealistic bond lengths, thermodynamic violations
Statistical Coherence
Detects records that are statistical outliers within their distribution class using robust estimators (MAD, IQR)
Cross-Source Agreement
Compares values against multiple independent sources — does this DFT bandgap agree with experimental measurements?
Provenance Quality
Evaluates source reliability: journal impact, computational convergence parameters, and whether experimental conditions are reported
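To make the Statistical Coherence dimension concrete, the sketch below flags outliers with a MAD-based modified z-score. The function name `mad_outliers` and the default cutoff of 3.5 are assumptions for illustration (3.5 is a common rule of thumb); the source does not state which cutoffs Verity uses.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. Unlike mean and standard deviation, both
    statistics are barely affected by the outliers being searched for.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; no spread to compare against.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]
```

For example, in a set of bond lengths clustered near 1.1 Å with one entry at 4.8 Å, only the 4.8 Å record would be flagged.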
Outputs a composite credibility score (0–1) per record with per-dimension breakdowns. Fully interpretable — no black-box decisions.
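One plausible way to combine per-dimension scores into a composite is a weighted mean, sketched below. The weights, dimension keys, and function name are hypothetical; the source does not specify how Verity aggregates dimensions.

```python
# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "physical_consistency": 0.4,
    "statistical_coherence": 0.2,
    "cross_source_agreement": 0.2,
    "provenance_quality": 0.2,
}


def composite_score(dimensions, weights=WEIGHTS):
    """Return the weighted mean of per-dimension scores, each in [0, 1]."""
    missing = set(weights) - set(dimensions)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= dimensions[name] <= 1.0:
            raise ValueError(f"{name} score must lie in [0, 1]")
    total_weight = sum(weights.values())
    return sum(weights[k] * dimensions[k] for k in weights) / total_weight
```

Because the result is a plain weighted mean, the per-dimension breakdown fully explains the composite: each dimension's contribution is its weight times its score.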
Intelligent Filtering
Human-in-the-loop curation workflow
Flagged records are categorized by severity. Researchers set their own thresholds and review borderline cases with full context.
No data is silently deleted. All filtering decisions include human-readable explanations referencing the specific validation rule that triggered the flag.
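The threshold-based workflow above might look like the following sketch. The thresholds, the record keys (`score`, `triggered_rule`), and the function name `triage` are assumptions for this example, not Verity's API. Every input record ends up in exactly one bucket with an explanation attached, so nothing is dropped.

```python
def triage(records, accept_at=0.8, quarantine_below=0.4):
    """Sort scored records into accepted / flagged / quarantined buckets.

    Each record is a dict with a "score" in [0, 1] and, optionally, the name
    of the validation rule that fired under "triggered_rule".
    """
    if not 0.0 <= quarantine_below <= accept_at <= 1.0:
        raise ValueError("need 0 <= quarantine_below <= accept_at <= 1")
    buckets = {"accepted": [], "flagged": [], "quarantined": []}
    for rec in records:
        score = rec["score"]
        if score >= accept_at:
            status = "accepted"
        elif score < quarantine_below:
            status = "quarantined"
        else:
            status = "flagged"  # borderline: left for human review
        rule = rec.get("triggered_rule")
        explanation = (
            f"score {score:.2f}; triggered rule: {rule}"
            if rule
            else f"score {score:.2f}; no rule triggered"
        )
        buckets[status].append({**rec, "status": status, "explanation": explanation})
    return buckets
```

Keeping the thresholds as arguments lets each researcher tighten or relax them without touching the scoring itself.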
Validated Output
ML-ready datasets with provenance
Clean, structured datasets with per-record credibility metadata. Includes full curation lineage for reproducibility and publication compliance.
- Cleaned dataset in original or standardized format
- Per-record credibility scores and dimension breakdowns
- Curation report (PDF) with methodology summary
- Audit log for reproducibility
Datasets curated through Verity have shown an accuracy improvement of 2x or more in downstream ML tasks across materials property prediction, reaction yield estimation, and structure-activity modeling.
Measured Impact on Downstream ML Tasks
We evaluated model performance on five representative tasks before and after Verity curation, using identical model architectures and hyperparameters. Only the training data quality was changed.
All benchmarks were conducted on held-out test sets. “Before” denotes raw, uncurated data; “After” denotes Verity-curated data.
Simple API, Deep Configurability
Verity exposes a Python SDK for integration into existing research workflows. Three lines to curate, full control when you need it.
from sciscale import VerityEngine

# Initialize with domain-specific config
engine = VerityEngine(
    domain="materials_science",
    scoring_mode="strict",
)

# Load your raw dataset
dataset = engine.load(
    "raw_cvd_data.csv",
    format="auto",
    unit_system="SI",
)

# Run full curation pipeline
result = engine.curate(dataset)

# Inspect results
print(f"Total records: {result.total}")
print(f"High confidence: {result.accepted}")
print(f"Flagged for review: {result.flagged}")
print(f"Quarantined: {result.rejected}")

# Export curated data + report
result.export(
    "curated_output/",
    formats=["csv", "json"],
    include_scores=True,
    generate_report=True,
)

Python SDK
pip install sciscale
REST API
Programmatic access
Web Dashboard
Visual curation review
Domain-Specific Validation Rules
Verity ships with pre-configured validation rulesets for major scientific domains. Custom rulesets can be defined for niche applications.
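The source does not show Verity's ruleset format, so the following is only a sketch of what a custom ruleset could look like in plain Python. The `Rule` class, the field names (`pressure_pa`, `temperature_k`), and the `violations` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str                       # identifier quoted in explanations
    field: str                      # record key the rule inspects
    check: Callable[[float], bool]  # returns True when the value is acceptable
    message: str                    # human-readable reason shown on failure


# Hypothetical ruleset for CVD process parameters.
CVD_RULES = [
    Rule("positive_pressure", "pressure_pa",
         lambda p: p > 0, "chamber pressure must be positive"),
    Rule("temperature_above_absolute_zero", "temperature_k",
         lambda t: t > 0, "temperature must be above 0 K"),
]


def violations(record, rules):
    """Return one explanation string per rule the record fails."""
    found = []
    for rule in rules:
        value = record.get(rule.field)
        if value is None:
            found.append(f"{rule.name}: field {rule.field!r} is missing")
        elif not rule.check(value):
            found.append(f"{rule.name}: {rule.message} (got {value})")
    return found
```

Carrying the rule name in every message is what allows a filtering decision to cite the specific validation rule that triggered it.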
Materials Science
Crystal structures, formation energies, band gaps, elastic constants, CVD process parameters
Chemistry
Reaction yields, activation energies, molecular descriptors, catalyst performance metrics
Computational Physics
DFT outputs, MD trajectories, CFD simulations, Monte Carlo results
Biomedical
Drug-target interactions, toxicity endpoints, clinical measurement records
Ready to curate your research data?
Verity is currently available through direct engagement. Get in touch to discuss your data curation needs or request a pilot.