PAE Scoring
Post AI Evaluation — how Synthezer measures the quality of every AI output.
What is PAE?
PAE stands for Post AI Evaluation. It is the self-assessment step that runs at the end of every Synthezer pipeline, after the AI has produced its final output in Stage 5.
Once the AI finishes generating its deliverable, a separate evaluation pass is triggered. A fresh AI call—acting as the Evaluation Agent—reviews the output against the original Master Prompt, all selected chips, and every constraint that was set in Stage 1. The evaluator then scores the output across three dimensions and provides written commentary.
This matters because it creates a built-in accountability loop. Instead of blindly trusting the output, you get a structured, honest assessment from the AI itself. The evaluator is explicitly instructed to be "brutally honest"—if something is incomplete or wrong, it will say so.
Why PAE Exists
- Accountability: Every output gets a quality audit before you accept it.
- Quality signal: Scores tell you at a glance whether the output needs revision.
- Iteration guidance: The evaluator's comments point to exactly what can be improved.
- Historical tracking: Over time, PAE scores across your pipelines reveal patterns in prompt quality and chip effectiveness.
The Three Dimensions
PAE evaluates every output on three distinct metrics. Each is scored independently on a 0–100 scale.
Accuracy
Does the output correctly implement what was requested?
Accuracy measures whether the AI's output is correct. It checks factual correctness, whether code runs as intended, whether instructions were followed precisely, and whether the output matches the specifications laid out in the Master Prompt. A high accuracy score means the output does what it was supposed to do. A low score means there are errors, misinterpretations, or deviations from the original request.
Completeness
Are all requirements addressed? Is anything missing?
Completeness measures whether the AI covered everything that was asked for. It checks the Master Prompt requirements, the chip constraints (AI Mind behaviors, prerequisites, no-go rules), and the research findings from Stage 3. Even if what was delivered is perfectly accurate, a low completeness score means parts of the request were skipped, glossed over, or left unfinished.
Confidence
How confident is the AI in the quality of this output?
Confidence is the AI's own assessment of how reliable its output is. A high confidence score means the AI is certain its work is solid—the problem was well-defined, the research was sufficient, and the implementation is sound. A lower confidence score is a flag that the AI encountered ambiguity, worked outside its depth, or had to make assumptions that may not hold. Think of confidence as an honesty signal: it tells you how much the AI trusts its own work.
Score Range
All three dimensions use a 0–100 integer scale. Here is a general guide to what different ranges indicate:
| Range | Rating | What it Means |
|---|---|---|
| 90 – 100 | Excellent | Output is production-ready. Requirements fully met, no meaningful gaps. Safe to use as-is. |
| 75 – 89 | Good | Solid output with minor issues. May need light editing or a small fix. Generally reliable. |
| 60 – 74 | Adequate | Usable but has gaps. Review the evaluator's comments carefully—some requirements may have been partially addressed or assumptions were made. |
| 40 – 59 | Weak | Significant issues. The output likely needs substantial rework. Consider rerunning with better chips, a clearer prompt, or more research. |
| 0 – 39 | Poor | Major failure. The output does not meet the request. Start a new pipeline with revised inputs. |
These ranges are guidelines, not hard rules. Always read the evaluator's comments alongside the scores—a score of 78 with a comment like "all core features implemented, only edge case handling missing" is very different from a 78 with "several requirements misunderstood."
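The band boundaries from the table above can be expressed as a small helper, which is handy when scripting against exported PAE data. This is an illustrative sketch, not part of Synthezer itself; the function name is hypothetical.

```python
def pae_rating(score: int) -> str:
    """Map a 0-100 PAE score to the rating band from the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"PAE scores are integers in 0-100, got {score}")
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Adequate"
    if score >= 40:
        return "Weak"
    return "Poor"
```

For example, `pae_rating(78)` returns `"Good"`, matching the 75–89 row of the table.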
How PAE Works
PAE runs as a distinct AI call after your output is generated. Here is the step-by-step flow:
Output is Generated
Stage 4 (Gate 04) deploys the implementation. The AI produces a detailed output and a short summary, both saved to Stage 5.
You Trigger PAE
In Stage 5, click the PAE button. This initiates the evaluation.
Evaluation Agent Receives Context
A fresh AI call is made with full context: the original Master Prompt (from Stage 2), all chip constraints from Stage 1 (AI Mind, prerequisites, implementation area, tools, and no-go rules), plus the complete output from Stage 5.
Critical Self-Evaluation
The Evaluation Agent compares the output against every requirement, constraint, and behavioral guideline. It scores accuracy, completeness, and confidence independently.
Scores and Comments Returned
The AI returns a JSON response with three integer scores (0–100) and a detailed comments field. These are saved to the pipeline and displayed in the Stage 5 view.
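The flow above can be sketched as assembling a context bundle for the Evaluation Agent. The field names and compact chip serialization here are illustrative assumptions; Synthezer's internal payload may differ.

```python
def build_pae_context(master_prompt, chips, no_go_rules, summary, output):
    """Assemble the context handed to the Evaluation Agent.

    A hypothetical sketch: field names and the exact chip serialization
    are assumptions, not Synthezer's internal format.
    """
    return {
        "master_prompt": master_prompt,  # from Stage 2
        # Stage 1 chips in a compact "name: description" form
        "chips": ", ".join(f"{c['name']}: {c['description']}" for c in chips),
        "no_go_rules": no_go_rules,      # Stage 1 negative criteria
        "summary": summary,              # Stage 5 short summary
        "output": output,                # Stage 5 detailed output
    }
```

The key point is that the evaluator sees everything the generator saw, plus the generated output, so its scores are grounded in the same requirements.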
PAE Response Format
```json
{
  "accuracy": 85,
  "completeness": 92,
  "confidence": 78,
  "comments": "All core requirements implemented correctly. The sorting algorithm handles edge cases well. Missing: pagination was mentioned in the prompt but not implemented. Confidence is moderate because the error handling relies on assumptions about the input format that were not confirmed during Stage 2."
}
```
Reading Your Scores
The three scores together tell a story. Here is how to read them as a set:
High Accuracy, Low Completeness
What was delivered is correct, but parts of the request were skipped. Check the evaluator's comments to see which requirements were missed. This often happens when the prompt has many sub-tasks and the AI prioritized the main ones.
High Completeness, Low Accuracy
The AI addressed everything but got some things wrong. Look for factual errors, incorrect logic, or misinterpreted requirements. This pattern often appears with ambiguous prompts where the AI guessed at intent.
Low Confidence with High Other Scores
The output looks good on paper, but the AI is not sure it is right. This is a warning flag. Read the comments—the AI may have made educated guesses, worked with incomplete research, or encountered areas outside its expertise. Verify the output carefully before using it.
All Three Scores Aligned
When all three scores are in the same range (all high or all low), the evaluation is straightforward. High across the board means confident, complete, correct work. Low across the board means the pipeline needs revisiting from the prompt stage.
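The reading patterns above can be captured in a small triage helper. The `high`/`low` thresholds here are illustrative defaults, not values defined by Synthezer:

```python
def read_score_pattern(accuracy, completeness, confidence, high=75, low=60):
    """Classify a PAE score triple into the reading patterns described above.

    Threshold values are illustrative assumptions, not Synthezer-defined.
    """
    if accuracy >= high and completeness < low:
        return "correct but incomplete: check which requirements were skipped"
    if completeness >= high and accuracy < low:
        return "complete but inaccurate: look for errors or misread requirements"
    if confidence < low and accuracy >= high and completeness >= high:
        return "looks good but uncertain: verify the output before using it"
    return "aligned: read the comments for detail"
```

Whatever the helper returns, the evaluator's comments remain the authoritative explanation.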
Always Read the Comments
Scores are a summary. The comments field is where the real insight lives. The evaluator explains what was done well, what could be improved, and any concerns or caveats. Two pipelines with the same score of 82 can have very different comment narratives—one might be a polished result with a minor omission, while the other has structural issues that happen to balance out.
Improving Scores
PAE scores are a direct reflection of how well the pipeline was set up. Here are the most effective ways to improve them:
Write a Clear, Specific Prompt
The single biggest factor in PAE scores is the Stage 1 prompt. Vague prompts lead to vague outputs and low scores. Be specific about what you want, how you want it structured, and what success looks like. The AI evaluates against the Master Prompt—if the Master Prompt is ambiguous, the evaluator cannot give high marks.
Choose the Right Chips
Chips guide the AI's behavior through every stage. The evaluator checks whether chip constraints were followed. Selecting relevant AI Mind chips (behavioral approach), appropriate implementation chips, and clear no-go rules gives the AI a tighter framework to work within—and a clearer standard to evaluate against.
Answer Stage 2 Questions Thoroughly
When the AI asks clarifying questions in Stage 2, answer them in detail. One-word answers or "just do your best" responses create ambiguity that carries through the entire pipeline and drags down all three scores.
Invest in Stage 3 Research
Providing strong research context—links, documentation, code samples, domain knowledge—directly improves both accuracy and confidence. The AI performs better when it has real reference material instead of relying on general knowledge. Low confidence scores often trace back to thin Stage 3 input.
Review Stage 4 Before Deploying
Stage 4 shows you the implementation plan and verification checklist before the AI executes. If the plan looks wrong or incomplete, submit feedback to revise it. Catching issues at Stage 4 prevents low accuracy and completeness scores at Stage 5.
Use No-Go Chips
No-go rules are checked during evaluation. If there are things you specifically do not want in the output (certain frameworks, approaches, patterns), set them as no-go chips. This gives the evaluator clear negative criteria to check against, which often improves the accuracy score.
PAE in Practice
Here are example scenarios showing how PAE scoring works in real usage:
Scenario: REST API Design
Well-defined prompt, strong chip selection, detailed research provided
Comments: "All 12 endpoints implemented correctly with proper HTTP methods, status codes, and error handling. Validation middleware matches the schema specification. Authentication flow follows the JWT pattern specified in the research guide. Minor: rate limiting configuration uses defaults rather than the custom values mentioned in the prompt. Confidence is high given the detailed research input and clear requirements."
Why it scored well: The user wrote a specific prompt listing all endpoints, provided API documentation as research, selected relevant implementation chips, and answered Stage 2 questions with concrete details.
Scenario: Marketing Copy for Product Launch
Clear prompt but minimal research, some ambiguity in tone
Comments: "All requested deliverables produced: tagline, three email variants, and landing page copy. However, the tone oscillates between casual and corporate—the AI Mind chip 'Professional Tone' conflicts with the prompt's request for 'fun, approachable language.' Product features are accurately described but some technical claims could not be verified against provided materials. Confidence is moderate due to limited brand guidelines in the research phase."
What to fix: The conflicting tone direction (chip vs. prompt) caused accuracy to drop. The low confidence stems from missing brand research. Resolving the tone contradiction and providing brand voice examples would push all three scores above 85.
Scenario: Database Migration Script
Vague prompt, no research provided, Stage 2 questions dismissed
Comments: "The migration script targets PostgreSQL but the prompt did not specify the database engine—this was assumed. Schema transformations cover only 3 of the approximately 8 tables implied by the prompt. No rollback mechanism included despite this being critical for production migrations. Data validation steps are missing entirely. Confidence is very low because the source schema was never provided, so the entire migration is based on inferred structure."
What went wrong: The prompt said "migrate my database" without specifying the engine, schema, or target structure. Stage 2 questions about the schema were answered with "just figure it out." No research was provided. The AI had to guess at nearly everything, which the evaluator correctly flagged.
Technical Details
For those interested in the implementation:
- PAE is triggered via `POST /api/pipeline/:id/pae` and requires the pipeline to be at Stage 5 with a completed output.
- The Evaluation Agent receives the Master Prompt (from Stage 2), all Stage 1 chip constraints with their full descriptions resolved from the chip library, the short summary, and the detailed output.
- Chips are sent in compact format (name + description, comma-separated) so the evaluator knows exactly what behavioral guidelines were in effect.
- The AI is instructed to respond with JSON only: three integer scores and a comments string.
- Scores are saved to the pipeline database and can be viewed in the Stage 5 panel or exported with your profile data.
- PAE uses the same AI provider (OpenClaw or Local AI) and model as the rest of the pipeline.
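The compact chip format mentioned above might look something like the sketch below. The exact separator and layout are assumptions; only "name + description, comma-separated" is documented.

```python
def chips_to_compact(chips):
    """Serialize chips as comma-separated "name (description)" entries.

    The parenthesized layout is an assumption; the documented format is
    only "name + description, comma-separated".
    """
    return ", ".join(f"{c['name']} ({c['description']})" for c in chips)
```

For example, two chips would serialize to a single line like `Professional Tone (Formal voice), No jQuery (Avoid jQuery in output)`, keeping the evaluator's context compact.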