Same prompt, four models
Stop picking your provider on vibes. The Sandbox runs your prompt across the four leading frontier models, scores each output, and shows you exactly where they agree and where they diverge.
Providers
Who's in the box
- Anthropic: Claude Sonnet 4.6
- OpenAI: GPT-5.5
- Google: Gemini 2 Pro
- Meta: Llama 4 (via Replicate)
Flow
Five steps from prompt to harness
1. Write a prompt. Add 3-8 representative test inputs.
2. Hit Run. RUQA sends the same payload to all four providers in parallel (see the fan-out sketch after this list).
3. Each response is scored 0-100 against the five-criteria rubric; latency, token usage, and cost are recorded.
4. A side-by-side diff highlights where the models agree and where they diverge (see the agreement sketch below).
5. Promote the winner, or the consensus, into a harness.
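To make step 02 concrete, here is a minimal TypeScript sketch of the parallel fan-out. The `callProvider` helper and the provider identifiers are illustrative assumptions, not the RUQA API.

```ts
// Sketch of step 02: send one payload to all four providers in parallel.
// Everything here is illustrative; RUQA's real client is not shown.
type ProviderId = "anthropic" | "openai" | "google" | "meta";

interface RunResult {
  provider: ProviderId;
  output: string;
  latencyMs: number;
}

// Hypothetical uniform wrapper; a real version would hit each
// provider's own HTTP API.
async function callProvider(provider: ProviderId, prompt: string): Promise<string> {
  return `[${provider}] placeholder response to: ${prompt}`;
}

async function fanOut(prompt: string): Promise<RunResult[]> {
  const providers: ProviderId[] = ["anthropic", "openai", "google", "meta"];
  // allSettled: one slow or failing provider costs you a row,
  // not the whole run.
  const settled = await Promise.allSettled(
    providers.map(async (provider) => {
      const start = Date.now();
      const output = await callProvider(provider, prompt);
      return { provider, output, latencyMs: Date.now() - start };
    }),
  );
  return settled.flatMap((r) => (r.status === "fulfilled" ? [r.value] : []));
}
```

Using Promise.allSettled means a timeout at one provider degrades to a missing row rather than sinking the whole run.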
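Step 04's diff can be sketched the same way. A crude but honest agreement measure is Jaccard similarity over token sets; RUQA's actual diff is richer, and this function only illustrates how agreement between two outputs might be quantified.

```ts
// Illustrative agreement score between two model outputs:
// size of the shared vocabulary over the size of the combined one.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
}

function agreement(a: string, b: string): number {
  const sa = tokenSet(a);
  const sb = tokenSet(b);
  const shared = [...sa].filter((t) => sb.has(t)).length;
  const combined = new Set([...sa, ...sb]).size;
  return combined === 0 ? 1 : shared / combined; // 1 = identical token sets
}
```

A score near 1 across all six model pairs suggests consensus; a low pairwise score flags the outputs worth reading closely.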
Eval rubric
Five criteria, transparently weighted
The rubric is published, versioned, and editable per workspace. The defaults work for 80% of teams; customize the weights per category if needed.
- Format: Does the output match the requested structure (JSON, markdown, list)?
- Coherence: Does the reasoning hold together? No contradictions, no leaps.
- Accuracy: Are factual claims supported by the input or by verifiable knowledge?
- Brevity: Is the output as short as it can be without losing what matters?
- Actionability: Could a reasonable teammate take next steps from this output alone?
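To show how five weighted criteria fold into the single 0-100 score from step 03, here is a minimal sketch. The weights below are assumptions for illustration; the real defaults are the published, per-workspace ones.

```ts
// Weighted mean of per-criterion scores. Criterion names match the
// rubric above; these weights are illustrative, not RUQA's defaults.
type Criterion = "format" | "coherence" | "accuracy" | "brevity" | "actionability";

const weights: Record<Criterion, number> = {
  format: 0.2,
  coherence: 0.2,
  accuracy: 0.3,
  brevity: 0.1,
  actionability: 0.2, // weights sum to 1, keeping the composite on 0-100
};

function compositeScore(scores: Record<Criterion, number>): number {
  return (Object.keys(weights) as Criterion[]).reduce(
    (total, c) => total + weights[c] * scores[c],
    0,
  );
}

// Example: compositeScore({ format: 90, coherence: 80, accuracy: 85,
//   brevity: 70, actionability: 75 }) === 81.5
```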
Try the sandbox in the demo
Pre-loaded with a real test case. Run it, see the diff.