Most AI vendor pitches sound identical. "Custom GPT integration." "Agentic workflows." "End-to-end intelligent automation." The decks look polished. The terminology is impressive. The proposals arrive within 24 hours.
Three months later, a familiar story: the chatbot misquotes the refund policy, the monthly bill has tripled, and the vendor's response times have slowed to a crawl.
This pattern is common enough to be predictable. In most cases, the founder skipped a structured evaluation because the vendor sounded credible.
The checklist below is designed to close that gap. It assumes no technical background. Each question is paired with what a strong answer looks like, what a weak answer looks like, and why the distinction matters.
1. Request a live demonstration, not a recording
Pre-recorded demos and curated screenshots reveal little about production reliability. Ask the vendor to open the system live, accept inputs you provide on the call, and demonstrate behaviour under unexpected conditions.
A capable team will welcome the request. A team that cannot demonstrate live is either still building or hiding inconsistent results.
2. Ask how the system handles incorrect outputs
No AI system is correct 100% of the time. Any vendor who claims otherwise has either misunderstood the technology or is overselling it.
A strong response will describe confidence scoring, human-in-the-loop review for low-confidence cases, and clear escalation paths. A weak response will quote accuracy percentages without explaining how errors are detected or contained.
3. Clarify ownership of data, prompts, and configurations
If your vendor retains ownership of the prompts, conversation logs, or fine-tuning data, you are effectively leasing the operational core of your product.
Insist that the following are explicitly assigned to you in the contract:
- All prompt templates and system instructions
- Any data you supplied for training or fine-tuning
- All conversation logs and user interaction records
- All custom code developed specifically for your use case
Vendor resistance to these terms is a meaningful signal.
4. Request cost projections at scale
API and infrastructure costs scale non-linearly. A pilot bill is rarely representative of operating costs at production volume.
Ask for a written cost estimate at 10x current usage, with assumptions documented. The estimate should specify which model is used and why. Different models can differ in cost by an order of magnitude for comparable tasks. A vendor unable to produce this estimate has not done the underlying analysis.
5. Verify model selection and failover strategy
Major model providers experience outages. A production system reliant on a single provider with no fallback represents concentrated operational risk.
A well-architected system will specify a primary model, a fallback model on a different provider, and a tested routing mechanism (such as OpenRouter or a custom proxy). Ask when the failover was last tested.
6. Review the system prompt
The system prompt defines the assistant's behaviour, tone, and boundaries. It is reasonable to request a review, particularly if the engagement is bespoke.
A serious vendor will share it. The prompt should be specific to your business, reference your actual policies and product, and reflect the edge cases discussed during scoping. Generic prompts indicate generic delivery.
7. Examine guardrails and off-topic handling
Ask precisely how the system responds when users attempt to redirect the conversation. Examples:
- A user asks about a competitor's pricing
- A user requests an unauthorised discount
- A user attempts prompt injection or jailbreaking
- A user discusses topics outside the system's intended scope
Strong implementations use intent classification, allow-listed topics, and output validators. Weak implementations rely solely on instructions in the prompt, which can be circumvented.
8. Confirm observability and audit capability
Every interaction should be logged, searchable, and exportable. Standard tooling includes Langfuse, LangSmith, and Helicone. Logging is no longer optional for production AI systems.
Ask for a demonstration: search for all conversations in which the assistant mentioned refunds, or all sessions exceeding a given duration. If the vendor cannot perform basic queries, you will be unable to audit your own product post-launch.
9. Measure iteration speed
Behavioural adjustments — a change of tone, a new escalation rule, an updated policy reference — should take minutes to deploy, not days. Slow iteration suggests over-engineering or inefficient internal processes.
Request a live edit during the evaluation call. Time the result. This is one of the more revealing exercises in vendor selection.
10. Establish a handover plan from day one
Vendor relationships end. Companies are acquired, priorities shift, key personnel depart. A handover plan is not a sign of distrust; it is sound governance.
Request the following, updated monthly:
- Architecture diagram and data flow
- Inventory of API keys and credentials
- Deployment and rollback procedures
- Prompt library with version history
- Test cases and evaluation results
A vendor unwilling to maintain this documentation is structuring the engagement around lock-in.
11. Ask for comparable prior work
General experience in a sector is not the same as having shipped a comparable system. Ask for a specific reference: same use case, similar industry, similar scale.
If the work is genuinely novel for the vendor, that is acceptable, provided the pricing and scope reflect it. Consider a fixed-price arrangement with a defined acceptance criterion and a refund clause, rather than time-and-materials on a learning curve.
12. Justify the engagement versus an internal alternative
A reasonable founder will consider whether the work could be handled internally — perhaps by a capable team member with access to ChatGPT Enterprise or Claude for Work.
The vendor's answer should reference specific engineering deliverables that an internal team cannot easily produce: evaluation pipelines, retrieval-augmented generation with proper chunking and reranking, multi-provider failover, observability infrastructure, security review, prompt injection defence, and load testing.
If the answer rests primarily on "experience," the internal alternative is likely the better choice.
How to apply this checklist
Do not present all 12 questions in a single email. The response will be a formatted document of marketing language.
Send three or four in writing before the evaluation call. Questions 1, 3, 4, and 11 are factual and allow the vendor to prepare. Reserve the remaining questions for the call itself, where unprepared responses are more informative than rehearsed ones.
Bring a printed copy to the meeting. Vendors who recognise that you are evaluating them systematically will respond in one of two ways: capable vendors will quote you accurately and engage seriously, while less capable vendors will adjust their scope, raise their price, or withdraw. Either outcome serves you.
Need a second opinion on a vendor proposal?
Noetic Laboratories offers structured AI vendor audits for non-technical founders. One-week scope, fixed fee of $750, deliverable is a written report identifying technical red flags, contractual risks, and renegotiation points.
Book a call and bring the vendor's proposal. The audit either saves the project or saves the investment.
See our services and pricing for the full scope of what we ship.