Use case: LLM Research

Reproducible prompt research starts with managed prompts.

You're running the same prompt across models, tweaking parameters, tracking which wording produces which behavior. Spreadsheets and chat windows don't scale. A version-controlled prompt library with multi-model testing does.

How Promptmark fits

Multi-model testing with 300+ models

Run identical prompts against models from OpenAI, Anthropic, Google, Meta, Mistral, and dozens more. Compare responses side-by-side with token counts, latency, and cost per request. BYOK (bring your own key) means you control billing and access — use your institutional API keys.

Template variables for experimental parameters

Define typed variables for the dimensions you're testing: {{temperature:number}}, {{persona:select:formal,casual,technical}}, {{example_count:number}}. One prompt template, hundreds of parameter combinations. Schema validation ensures every run uses valid inputs.
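As an illustration of what typed-variable validation can look like, here is a minimal Python sketch. It is not Promptmark's implementation; the `{{name:type[:options]}}` parsing and the `parse_schema`/`render` helpers are hypothetical stand-ins for the product's schema validation.

```python
import re

# Hypothetical sketch, not Promptmark's actual code: parse {{name:type[:opts]}}
# markers, validate supplied values against them, then substitute.
VAR_PATTERN = re.compile(r"\{\{(\w+):(number|text|select)(?::([^}]+))?\}\}")

def parse_schema(template):
    """Extract {name: (type, options)} from the template's variable markers."""
    schema = {}
    for name, vtype, opts in VAR_PATTERN.findall(template):
        options = opts.split(",") if opts else None
        schema[name] = (vtype, options)
    return schema

def render(template, values):
    """Validate values against the template's schema, then substitute them in."""
    schema = parse_schema(template)
    for name, (vtype, options) in schema.items():
        if name not in values:
            raise ValueError(f"missing variable: {name}")
        value = values[name]
        if vtype == "number" and not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number")
        if vtype == "select" and value not in options:
            raise ValueError(f"{name} must be one of {options}")
    return VAR_PATTERN.sub(lambda m: str(values[m.group(1)]), template)

prompt = render(
    "Answer in a {{persona:select:formal,casual,technical}} tone using "
    "{{example_count:number}} examples.",
    {"persona": "formal", "example_count": 3},
)
```

Every invalid run fails loudly before it reaches a model, which is the property that makes a template usable as an experimental protocol.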

Version control for prompt iterations

Every edit creates an automatic snapshot. Diff any two versions to see exactly what changed between experimental conditions. Restore previous versions to re-run baselines. Your iteration history is permanent and auditable.
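What a version diff surfaces can be sketched with Python's standard `difflib`; the two prompt texts below are illustrative, and the sketch is independent of how Promptmark computes its diffs.

```python
import difflib

# Two iterations of the same prompt (illustrative text only).
v1 = "You are a helpful assistant.\nAnswer concisely.\n"
v2 = "You are a careful research assistant.\nAnswer concisely.\nCite sources.\n"

# A unified diff shows exactly which lines changed between the two conditions.
diff_text = "".join(difflib.unified_diff(
    v1.splitlines(keepends=True),
    v2.splitlines(keepends=True),
    fromfile="prompt@v1",
    tofile="prompt@v2",
))
print(diff_text)
```

Reading the diff before re-running is what lets you attribute a behavior change to a specific wording change rather than to drift.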

Collections and sharing for reproducibility

Group prompts by experiment, paper, or research question. Tag by technique, model family, or evaluation metric. Publish collections to your public profile so collaborators and reviewers can inspect, remix, and reproduce your exact prompt conditions.

Conversations as experiment logs

Run multi-turn interactions as part of your experimental protocol. Each conversation is saved with the model used, token counts, and the exact prompt version.

Playbooks for automated test batteries

Chain a sequence of prompts into a playbook that runs your full test battery. Run the entire sequence against a new model with one trigger URL call.
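A trigger call could look like the following sketch using only the standard library. The URL, path, and `{"model": ...}` payload shape are assumptions for illustration; the real trigger URL and accepted parameters come from your playbook's settings.

```python
import json
import urllib.request

# Hypothetical trigger URL; substitute the one shown for your playbook.
TRIGGER_URL = "https://promptmark.example/api/playbooks/123/trigger"

def build_trigger_request(model: str) -> urllib.request.Request:
    """Build the POST that would kick off the full test battery for one model."""
    payload = json.dumps({"model": model}).encode()
    return urllib.request.Request(
        TRIGGER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_trigger_request("gpt-4o")
# urllib.request.urlopen(req) would fire the run; omitted here so the sketch
# stays self-contained.
```

The point is the shape of the workflow: one HTTP call per new model, and the playbook replays the whole battery.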

Per-user database isolation

Your experiment data lives in its own SQLite database, physically separated from every other user. No shared tables, no cross-contamination.

Example workflow

1. Design the experiment

Create a prompt template with variables for every dimension you want to test. Define the schema: variable types, allowed values, and defaults. This template is your experimental protocol.

2. Run across models

Execute the prompt with controlled inputs against your target models. Compare responses, token usage, and latency. Save results alongside the prompt version that produced them.
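The comparison step can be pictured as a small harness like the sketch below. `call_model` is a stub standing in for whatever provider client actually answers the prompt (Promptmark handles this in-app); the latency and token figures here are illustrative only.

```python
import time

def call_model(model: str, prompt: str) -> str:
    """Stub for a real provider call; returns a placeholder response."""
    return f"[{model} response]"

def compare(models, prompt):
    """Run one prompt against each model, recording latency and a rough
    whitespace-based token estimate for side-by-side comparison."""
    rows = []
    for model in models:
        start = time.perf_counter()
        reply = call_model(model, prompt)
        latency = time.perf_counter() - start
        rows.append({
            "model": model,
            "latency_s": round(latency, 3),
            "approx_tokens": len(reply.split()),
        })
    return rows

results = compare(["gpt-4o", "claude-sonnet"], "Summarize the method section.")
```

Saving each row alongside the prompt version that produced it is what keeps the comparison reproducible later.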

3. Iterate and version

Modify the prompt based on results. Version control captures the exact diff between iterations. Run the new version against the same models with the same inputs to isolate the effect of your changes.

4. Organize by experiment

Group prompt versions, templates, and results into collections. Tag by research question, model family, or evaluation criteria. Build a structured library as your research progresses.

5. Share for reproducibility

Publish your experiment collections to your profile. Collaborators see the exact prompt versions, variable schemas, and model targets. They can remix and re-run with their own API keys — full reproducibility without sharing credentials.

Your experiments deserve version control

Template variables for parameters, version history for iterations, and 300+ models for comparison. Build a research workbench, not another spreadsheet.

Start your first experiment — free