Reproducible prompt research starts with managed prompts.
You're running the same prompt across models, tweaking parameters, tracking which wording produces which behavior. Spreadsheets and chat windows don't scale. A version-controlled prompt library with multi-model testing does.
How Promptmark fits
Multi-model testing with 300+ models
Run identical prompts against models from OpenAI, Anthropic, Google, Meta, Mistral, and dozens more. Compare responses side-by-side with token counts, latency, and cost per request. BYOK (bring your own key) means you control billing and access — use your institutional API keys.
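The comparison loop can be sketched in a few lines. This is a minimal illustration, not Promptmark's API: `call_model` is a stub standing in for whichever client you use with your own keys, and the model names are placeholders.

```python
import time

def call_model(model: str, prompt: str) -> str:
    # Stub — replace with a real API call made with your own key (BYOK).
    return f"[{model} response to: {prompt}]"

def compare(models: list[str], prompt: str) -> list[dict]:
    """Run one prompt against several models and record latency per call."""
    rows = []
    for model in models:
        start = time.perf_counter()
        output = call_model(model, prompt)
        rows.append({
            "model": model,
            "latency_s": round(time.perf_counter() - start, 3),
            "output": output,
        })
    return rows

results = compare(["model-a", "model-b"], "Define reproducibility in one sentence.")
```

Token counts and per-request cost slot into the same row structure once the stub is replaced with a real client that reports usage.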
Template variables for experimental parameters
Define typed variables for the dimensions you're testing: {{temperature:number}}, {{persona:select:formal,casual,technical}}, {{example_count:number}}. One prompt template, hundreds of parameter combinations. Schema validation ensures every run uses valid inputs.
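A rough sketch of how typed variables like these behave — the parsing and validation logic here is illustrative, modeled on the `{{name:type}}` and `{{name:select:opt1,opt2}}` syntax above, not Promptmark's internals:

```python
import re

# Matches {{name:number}}, {{name:text}}, {{name:select:a,b,c}}
VAR_PATTERN = re.compile(r"\{\{(\w+):(number|text|select)(?::([^}]+))?\}\}")

def parse_schema(template: str) -> dict:
    """Extract variable names, types, and allowed values from a template."""
    schema = {}
    for name, vtype, options in VAR_PATTERN.findall(template):
        schema[name] = {
            "type": vtype,
            "options": options.split(",") if options else None,
        }
    return schema

def render(template: str, values: dict) -> str:
    """Validate inputs against the schema, then substitute them."""
    for name, spec in parse_schema(template).items():
        value = values[name]  # KeyError means a required variable is missing
        if spec["type"] == "number" and not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number")
        if spec["type"] == "select" and value not in spec["options"]:
            raise ValueError(f"{name} must be one of {spec['options']}")
    return VAR_PATTERN.sub(lambda m: str(values[m.group(1)]), template)

template = ("Answer in a {{persona:select:formal,casual,technical}} tone "
            "with {{example_count:number}} examples.")
print(render(template, {"persona": "formal", "example_count": 3}))
# → Answer in a formal tone with 3 examples.
```

Because the schema is derived from the template itself, sweeping a parameter grid is just a loop over value combinations, each one validated before it runs.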
Version control for prompt iterations
Every edit creates an automatic snapshot. Diff any two versions to see exactly what changed between experimental conditions. Restore previous versions to re-run baselines. Your iteration history is permanent and auditable.
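Under the hood, a version diff is a standard line diff between two snapshots. A minimal sketch with Python's `difflib` — the prompt text and version labels are illustrative:

```python
import difflib

v1 = "Summarize the article in three bullet points.\nBe concise.\n"
v2 = "Summarize the article in five bullet points.\nBe concise and neutral.\n"

# Unified diff between two saved snapshots of the same prompt.
diff_text = "".join(difflib.unified_diff(
    v1.splitlines(keepends=True),
    v2.splitlines(keepends=True),
    fromfile="prompt@v1",
    tofile="prompt@v2",
))
print(diff_text)
```

The output pinpoints exactly which wording changed between conditions, which is what makes the iteration history auditable rather than a pile of near-duplicate files.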
Collections and sharing for reproducibility
Group prompts by experiment, paper, or research question. Tag by technique, model family, or evaluation metric. Publish collections to your public profile so collaborators and reviewers can inspect, remix, and reproduce your exact prompt conditions.
Conversations as experiment logs
Run multi-turn interactions as part of your experimental protocol. Each conversation is saved with the model used, token counts, and the exact prompt version.
Playbooks for automated test batteries
Chain a sequence of prompts into a playbook that runs your full test battery. Run the entire sequence against a new model with one trigger URL call.
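Triggering a playbook is an HTTP call. A hedged sketch with the standard library — the URL shape and JSON payload are assumptions for illustration, not Promptmark's documented API:

```python
import json
import urllib.request

def build_trigger_request(trigger_url: str, model: str) -> urllib.request.Request:
    """Build a POST request that asks a playbook to run against one model."""
    payload = json.dumps({"model": model}).encode()
    return urllib.request.Request(
        trigger_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical trigger URL and model id.
req = build_trigger_request(
    "https://example.com/playbooks/abc123/trigger", "new-model-id"
)
# urllib.request.urlopen(req) would fire the full test battery; omitted here.
```

One call like this replays the entire prompt sequence, so evaluating a newly released model is a single request instead of a manual re-run.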
Per-user database isolation
Your experiment data lives in its own SQLite database, physically separated from every other user. No shared tables, no cross-contamination.
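The isolation model is easy to picture: one SQLite file per user, so a query can never touch another user's rows. A self-contained sketch, with an illustrative schema and temporary paths:

```python
import sqlite3
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())  # stand-in for the server's data directory

def open_user_db(user_id: str) -> sqlite3.Connection:
    """One database file per user — no shared tables to cross-contaminate."""
    conn = sqlite3.connect(root / f"{user_id}.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs (prompt_version TEXT, model TEXT, output TEXT)"
    )
    return conn

alice = open_user_db("alice")
bob = open_user_db("bob")
alice.execute("INSERT INTO runs VALUES ('v3', 'model-a', 'ok')")
# bob's database still has zero rows: the data is physically separate files.
```

Physical separation also simplifies export and deletion — a user's entire experiment history is one file.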
Example workflow
Design the experiment
Create a prompt template with variables for every dimension you want to test. Define the schema: variable types, allowed values, and defaults. This template is your experimental protocol.
Run across models
Execute the prompt with controlled inputs against your target models. Compare responses, token usage, and latency. Save results alongside the prompt version that produced them.
Iterate and version
Modify the prompt based on results. Version control captures the exact diff between iterations. Run the new version against the same models with the same inputs to isolate the effect of your changes.
Organize by experiment
Group prompt versions, templates, and results into collections. Tag by research question, model family, or evaluation criteria. Build a structured library as your research progresses.
Share for reproducibility
Publish your experiment collections to your profile. Collaborators see the exact prompt versions, variable schemas, and model targets. They can remix and re-run with their own API keys — full reproducibility without sharing credentials.
Your experiments deserve version control
Template variables for parameters, version history for iterations, and 300+ models for comparison. Build a research workbench, not another spreadsheet.
Start your first experiment — free