Same prompt, four models
Stop picking your provider on vibes. The Sandbox runs your prompt across the four leading frontier models, scores each output, and shows you exactly where they agree and where they diverge.
Providers
Who's in the box
- Anthropic: Claude Sonnet 4.6
- OpenAI: GPT-5.5
- Google: Gemini 2 Pro
- Meta: Llama 4 (via Replicate)
Flow
Five steps from prompt to harness
1. Write a prompt. Add 3-8 representative test inputs.
2. Hit Run. RUQA sends the same payload to all four providers in parallel (see the fan-out sketch after this list).
3. Each response is scored 0-100 against the five-criteria rubric; latency, token usage, and cost are recorded.
4. A side-by-side diff highlights where the models agree and where they diverge (see the agreement sketch below).
5. Promote the winner, or the consensus, into a harness.
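To make step 02 concrete, here is a minimal TypeScript sketch of the parallel fan-out. The `callProvider` helper and the provider identifiers are illustrative assumptions, not the RUQA API.

```ts
// Sketch of step 02: send one payload to all four providers in parallel.
// Everything here is illustrative; RUQA's real client is not shown.
type ProviderId = "anthropic" | "openai" | "google" | "meta";

interface RunResult {
  provider: ProviderId;
  output: string;
  latencyMs: number;
}

// Hypothetical uniform wrapper; a real version would hit each
// provider's own HTTP API.
async function callProvider(provider: ProviderId, prompt: string): Promise<string> {
  return `[${provider}] placeholder response to: ${prompt}`;
}

async function fanOut(prompt: string): Promise<RunResult[]> {
  const providers: ProviderId[] = ["anthropic", "openai", "google", "meta"];
  // allSettled: one slow or failing provider costs you a row,
  // not the whole run.
  const settled = await Promise.allSettled(
    providers.map(async (provider) => {
      const start = Date.now();
      const output = await callProvider(provider, prompt);
      return { provider, output, latencyMs: Date.now() - start };
    }),
  );
  return settled.flatMap((r) => (r.status === "fulfilled" ? [r.value] : []));
}
```

Using Promise.allSettled means a timeout at one provider degrades to a missing row rather than sinking the whole run.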
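Step 04's diff can be sketched the same way. A crude but honest agreement measure is Jaccard similarity over token sets; RUQA's actual diff is richer, and this function only illustrates how agreement between two outputs might be quantified.

```ts
// Illustrative agreement score between two model outputs:
// size of the shared vocabulary over the size of the combined one.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
}

function agreement(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  const shared = [...sa].filter((t) => sb.has(t)).length;
  const combined = new Set([...sa, ...sb]).size;
  return combined === 0 ? 1 : shared / combined; // 1 = identical token sets
}
```

A score near 1 across all six model pairs suggests consensus; a low pairwise score flags the outputs worth reading closely.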
Eval rubric
Five criteria, transparently weighted
The rubric is published, versioned, and editable per workspace. The defaults work for 80% of teams; customize the weights per category if needed.
- Format: Does the output match the requested structure (JSON, markdown, list)?
- Coherence: Does the reasoning hold together? No contradictions, no leaps.
- Accuracy: Are factual claims supported by the input or by verifiable knowledge?
- Brevity: Is the output as short as it can be without losing what matters?
- Actionability: Could a reasonable teammate take next steps from this output alone?
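To show how five weighted criteria fold into the single 0-100 score from step 03, here is a minimal sketch. The weights below are assumptions for illustration; the real defaults are the published, per-workspace ones.

```ts
// Weighted mean of per-criterion scores. Criterion names match the
// rubric above; these weights are illustrative, not RUQA's defaults.
type Criterion = "format" | "coherence" | "accuracy" | "brevity" | "actionability";

const weights: Record<Criterion, number> = {
  format: 0.2,
  coherence: 0.2,
  accuracy: 0.3,
  brevity: 0.1,
  actionability: 0.2, // weights sum to 1, keeping the composite on 0-100
};

function compositeScore(scores: Record<Criterion, number>): number {
  return (Object.keys(weights) as Criterion[]).reduce(
    (total, c) => total + weights[c] * scores[c],
    0,
  );
}

// Example: compositeScore({ format: 90, coherence: 80, accuracy: 85,
//   brevity: 70, actionability: 75 }) === 81.5
```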
Try the sandbox in the demo
Pre-loaded with a real test case. Run it, see the diff.