Prepared for H Company · April 2026  ·  Confidential

RL Training Environments for Computer Use Models

Production-grade, fully deterministic SaaS environments with real-world data, dual GUI/MCP interfaces, and expert-annotated tasks — built for agent training and long-horizon benchmarking.

Why Abaka

Built for Frontier Agent Training

Designed to meet the demands of computer use and MCP-based agent development at scale.

01
Real-World Data, Reproducible State
All data is sourced from real production platforms. Byte-identical seed data across all instances guarantees reproducible evaluation from the first step.
02
Unified GUI, API & MCP
Every environment includes a full frontend GUI, backend API and MCP server built on the same underlying implementation. Any operation available in the GUI works equivalently via API or MCP.
03
Scale and Complexity at Pace
9 environments available now, scaling to 20. We support up to 3,000 tasks per month with 2–4 week delivery. Our tasks average 2–5× more steps and up to 4.7× more MCP calls than Scale AI benchmarks.

Environments

Live and Simulated Environments with Real Data

All environments are sourced from real-world platforms. No synthetic data. All are ready for task annotation now.

Greenhouse (Live): ATS, 276 objects in production dataset
Slack (Live): 3+ years of messages across modalities, ~0.87 GiB source content per instance
Discord (Live): 10 servers, 486 channels
Amazon (Live): 81,120 total objects
Trip (Live): 1,712 hotels · 1 country · 300+ cities
Google Workspace (Live): 1,681 total files

Live environments with real data: 9
Total environments (in progress): 20
Tasks per environment (customizable): 1k+
Tasks per month: 3,000
Delivery timeline: 2–4 wk

Task Complexity

Harder Tasks Than Scale AI — Across Every Metric

Detailed performance metrics comparing Abaka's current task set against Scale AI's library.

Figure: Abaka vs Scale AI, detailed performance metrics comparison. Abaka tasks are 2–5× more complex than Scale AI across avg steps, MCP calls, and unique tools.
Figure: Task distribution (subtasks per task, unique tools, rubrics, MCP calls). Distribution across the current annotation set: Subtasks (med 4), Unique Tools (med 8), Rubrics (med 4), MCP Calls (med 13, max 95).

Model Evaluation

SOTA Model Results

Initial evaluation snapshot across 18–20 instances per model. Not final benchmark figures.

Model | Avg Reward | Best@4 | Error Rate | Avg Steps | Avg Duration
Claude Opus 4.6 | 0.3683 | 0.4313 | 18.0% | 12.1 | 235 s
GPT-5.3-Codex | 0.3603 | 0.4208 | 33.9% | 11.3 | 151 s
Grok | 0.3444 | 0.3908 | 38.3% | 10.0 | 172 s

Grading: DBDiff v1 — SHA-256 hash of a canonicalised row-level DB diff. Single string comparison, zero subjective interpretation. Agents are free to take any valid path; only the final state is evaluated.

Architecture

Built for Deterministic Agent Evaluation

Each environment is a complete, isolated clone of a production SaaS application — not a mock or simplified simulator.

ENV

Full-Stack SaaS Clone

Each instance packages a complete frontend GUI, backend API, MCP server, and dedicated PostgreSQL database into a single Docker image. One-command setup; fully isolated teardown after each task. Supports parallel evaluation at scale.

DATA

Byte-Identical Seed State

Every instance initialises from byte-identical seed data — including primary keys and all timestamps. The same task launched on any instance at any time begins from exactly the same state. This is a prerequisite for deterministic RL evaluation that existing benchmarks rarely achieve.
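
One way a client could spot-check the byte-identical claim is to fingerprint the seed dump of each instance and compare digests. The sketch below is illustrative only: the function names and the idea of a seed dump file on disk are assumptions, not part of Abaka's delivery package.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large seed dumps need little memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def seeds_identical(paths):
    """Return True when every seed dump (hypothetical files) has the same digest."""
    paths = list(paths)
    if not paths:
        raise ValueError("at least one seed dump is required")
    return len({file_sha256(p) for p in paths}) == 1
```

A single differing byte, such as a regenerated timestamp or primary key, changes the digest, so the comparison fails as soon as two instances diverge.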

EVAL

DBDiff v1 Grading

Before and after agent execution, the system snapshots the database. A row-level diff is hashed to SHA-256 — grading is a single string comparison with zero subjective interpretation. Unexpected table changes are flagged automatically, with full audit trails for debugging.
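
The grading idea described above can be sketched in a few lines. This is a minimal illustration, not the DBDiff v1 implementation: the snapshot shape (table name to primary key to row), the JSON canonicalisation, and all function names are assumptions, since the actual canonicalisation rules are not specified here.

```python
import hashlib
import json


def row_level_diff(before, after):
    """List row changes between two snapshots shaped {table: {pk: row}}.

    Tables and primary keys are visited in sorted order, so the result
    does not depend on the order in which the agent performed its writes.
    """
    changes = []
    for table in sorted(set(before) | set(after)):
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        for pk in sorted(set(old_rows) | set(new_rows), key=str):
            old, new = old_rows.get(pk), new_rows.get(pk)
            if old is None:
                changes.append(["INSERT", table, pk, new])
            elif new is None:
                changes.append(["DELETE", table, pk, old])
            elif old != new:
                changes.append(["UPDATE", table, pk, new])
    return changes


def diff_digest(changes):
    """SHA-256 of a canonical JSON encoding of the diff (assumed format)."""
    canonical = json.dumps(changes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def grade(before, after, expected_digest, expected_tables):
    """Compare digests and flag tables that changed outside the expected scope."""
    changes = row_level_diff(before, after)
    unexpected = sorted({c[1] for c in changes} - set(expected_tables))
    return {
        "passed": diff_digest(changes) == expected_digest,
        "unexpected_tables": unexpected,
    }
```

Because only the final state enters the diff, two agents that reach the same rows by different paths produce the same digest, which matches the "any valid path" property stated earlier.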

GUI and MCP/API share the same implementation — 1:1 operation parity across all interfaces
Reversible database snapshots for agent training and evaluation rollback
Three rubric types: outcome-based, process-based, and state-based — each a verifiable true/false binary check
70% of development time dedicated to data quality control — each task is designed and reviewed by domain experts with 5–15 years of industry experience
Human domain expert task design with golden walkthroughs
Double-blind review by domain experts (for example, HR professionals for Greenhouse); end-to-end testing against live environments
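
The 1:1 parity between GUI and MCP/API listed above is easiest to picture as thin adapters over one shared core. The sketch below is a toy model under that assumption: `AtsCore` and the GUI handler are hypothetical, while the `application_create` tool name, its `candidateId`/`jobId` arguments, and the APPLIED status are taken from the Greenhouse demo in this document.

```python
from dataclasses import dataclass, field


@dataclass
class AtsCore:
    """Hypothetical shared implementation that every interface calls into."""
    applications: list = field(default_factory=list)

    def create_application(self, candidate_id, job_id):
        for existing in self.applications:
            if existing["candidateId"] == candidate_id and existing["jobId"] == job_id:
                raise ValueError("candidate already applied to this job")
        record = {"candidateId": candidate_id, "jobId": job_id, "status": "APPLIED"}
        self.applications.append(record)
        return record


def gui_apply_click(core, form):
    """GUI adapter: maps a submitted form onto the shared core call."""
    return core.create_application(form["candidate"], form["job"])


def mcp_application_create(core, arguments):
    """MCP-style adapter: maps tool arguments onto the same core call."""
    return core.create_application(arguments["candidateId"], arguments["jobId"])
```

Since both adapters only translate inputs, validation and side effects live in one place, and the resulting database state is the same whichever interface the agent uses.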

Delivery

Delivery Package

Three ways to integrate Abaka environments into your training pipeline.

Tasks with Rubrics

Every task ships with a full rubric set — binary outcome, process, and state checks — plus a golden human walkthrough and annotated tool trajectory for ground-truth comparison.

API Access

Run agents directly against Abaka-hosted environments via API. Each run generates an evaluation report and a trajectory dashboard.

Source + Docker

Full server code included for complete infrastructure flexibility. Available case by case.

Suited for clients requiring full infrastructure ownership.


Demos

Explore Our Environments

Walk through a live Greenhouse ATS task — compare the expert human walkthrough against GUI and MCP agent trajectories side by side.

Task — Identical Prompt Across All Agents

"We're hiring for a Sr. Software Engineer role to expand our Cloud+AI engineering — it's the one based in Redmond. Pull up the job posting and review what the role needs. Then find me candidates in our system who have solid cloud experience and strong programming backgrounds. Check how well each one matches, apply the best 3 to the role, and write me a short comparison brief, covering who you picked, why, and what made you pass on the others. The login information needed is email = admin@greenhouse.com; password = admin123."

Target Role: Sr. Software Engineer · Department: Cloud + AI · Location: Redmond, WA · Action: Find + Score + Apply Top 3
Golden Human Walkthrough
GUI Agent
GUI Agent Intent Breakdown
# | Action | Intent | Outcome
1 | Navigate to /login and authenticate as admin@greenhouse.com | Establish authenticated session | Logged in, landed on /jobs
2 | Open the existing 3rd applicant Kevin Williams' candidate profile directly | Reverse-engineer the recruiter's selection signal before guessing: Kevin was independently chosen by a human, so his profile is ground truth | Kevin profile: YoE = 6, Skills = Python, Java, Javascript, Sql, Mysql, Azure, Agile, Management, C#, Go, Department = Cloud + AI
3 | Run Calculate Match for Kevin against Sr. SWE Cloud + AI Redmond | Get a baseline score from the system's built-in scorer | 57% match (Skills 46% / Experience 98%); matched skills JavaScript, Python, Java, Azure
4 | Identify decisive scoring dimensions | Translate Kevin's score into a search heuristic | Two dominant signals: (a) Azure tag, (b) populated Years of Experience field; the latter alone drives ~half of total match score
5 | Filter the candidate pool by Azure tag via /candidates?tags=Azure (UI tag-dropdown autocomplete returned "No options", a frontend bug; URL filter bypasses it) | Surface the 45 Azure-tagged candidates the dropdown was hiding | 45 results loaded
6 | Programmatically extract the top 15 rows (name + tags) via JS | Avoid clicking through one by one; pick paper-strong candidates in one pass | Identified 4 prime targets: Jessica Claire 218 (Java + HTML + CSS + Azure), Karen Carter (Python + Java + Azure + Big Data), Emily Miller (Angular + .Net Core + Azure full-stack), Timothy Duncan (Java + C# + Azure)
7 | Open Jessica Claire 218 and run Calculate Match | Test the highest paper-overlap candidate first | 70% match (Skills 46% / Experience 100%, 7 yrs). Matched: Java, HTML, CSS, Azure. New leader
8 | Open Karen Carter and run Calculate Match | Test second paper candidate; she has 10 yrs and Python+Java+Azure | 66% match (Skills 37% / Experience 93%, 10 yrs). Matched: Python, Java, Azure
9 | Open Emily Miller and run Calculate Match | Test whether Angular / .Net Core get translated to system targets React / JavaScript | 60% match (Skills 19% / Experience 98%, 6 yrs). Only Azure matched directly; the system does not generalise Angular → React or .Net Core → JS. Strong experience score still carries the total
10 | Lock the top 3 = Jessica 218 (70), Karen Carter (66), Emily Miller (60); skip Timothy to bound time | Three candidates ≥60% match, none yet applied: clean win | Selection finalized
11 | Click Apply on Emily Miller's profile (Job Match & Apply panel) | Submit application | APPLIED
12 | Navigate to Karen Carter, re-search the role in the panel, click Apply | Submit application | APPLIED
13 | Navigate to Jessica Claire 218, re-search the role, click Apply | Submit application | APPLIED
14 | Open the job's Candidates tab to verify | Confirm all three submissions landed | Candidates count = 7 (the 3 new picks plus 4 pre-existing); all three new picks visible with APPLIED status
15 | Pause before deleting weaker pre-existing applications | Permanent deletion is a destructive action; defer to user | Cleanup left as a recommendation in the brief

Outcome Summary

Three high-match candidates with cloud experience and strong programming backgrounds were sourced, scored, and successfully applied to the Sr. Software Engineer (Cloud + AI, Redmond) role:

  • Jessica Claire · jessica.claire218@gmail.com · 70% match, 7 yrs experience, Java / HTML / CSS / Azure server-side stack
  • Karen Carter · karen.carter@mail.com · 66% match, 10 yrs experience, Python / Java / Azure with Big Data / Spark depth
  • Emily Miller · emily.miller@protonmail.com · 60% match, 6 yrs experience, full-stack .NET / Angular / Azure

All three were validated against the role's built-in Calculate Match scoring (average 65.3%) and now hold APPLIED status in the ATS. The three skill profiles are intentionally complementary: Jessica covers Java + web UI fundamentals, Karen covers Python + Big Data on cloud, Emily covers full-stack .NET + Azure — together giving the hiring team coverage across the JD's Java, Python, front-end, and Azure requirements.

Sourcing methodology hinged on first reverse-engineering an existing applicant's profile to identify the two scoring dimensions that mattered most (Azure tag and populated Years of Experience), then filtering the candidate pool via direct URL parameter to bypass a UI autocomplete bug, and finally running the system's Calculate Match on a handful of paper-strong contenders to commit only to objectively top-ranked picks.

MCP Agent Trajectories
Claude Code with Opus · FAIL
31 tool calls · 540 s · 16/19 rubrics passed
AUTH JOB CANDIDATE MATCH APPLY REASONING
I'll help you with this hiring task. Let me start by authenticating and then finding the Sr. Software Engineer role in Redmond.
1. [AUTH] mcp_greenhouse_auth_login · email=admin@greenhouse.com, password=admin123 → ADMIN token issued
I'm authenticated. Now let me find the Sr. Software Engineer role in Redmond.
2. [JOB] mcp_greenhouse_job_list · search="Sr. Software Engineer Redmond" → No exact match found
Let me search more broadly for the role:
3. [JOB] mcp_greenhouse_job_list · location="Redmond" → Multiple Redmond jobs returned, target not matched
I found several Redmond positions. Let me search more specifically for the Sr. Software Engineer role in Cloud + AI:
4. [JOB] mcp_greenhouse_job_list · department="Cloud + AI", search="Software Engineer" → Results returned, target not matched
5. [JOB] mcp_greenhouse_job_list · location="Redmond", search="Senior Software Engineer" → Results returned, target not matched
6. [JOB] mcp_greenhouse_job_get · id="12043eeb..." → Wrong job retrieved
7. [JOB] mcp_greenhouse_job_list · department="Cloud + AI", location="Redmond" → Results returned
8. [JOB] mcp_greenhouse_job_list · department="Cloud + AI", location="Redmond", search="Sr Software Engineer" → Results returned
9. [JOB] mcp_greenhouse_job_list · department="Cloud + AI", location="Redmond", search="Senior Software" → Target id f93267cf found
10. [JOB] mcp_greenhouse_job_get · id="f93267cf..." → Senior Software Engineer (Cloud+AI, Redmond)
11. [CANDIDATE] mcp_greenhouse_candidate_list · tags="cloud,Azure,AWS,Python,Java,C++,C#,Go", tagsMatchMode="any" → Candidates returned
12. [CANDIDATE] mcp_greenhouse_candidate_list · tags="Software Engineer,Backend,Cloud,Azure,AWS,Kubernetes,Docker", tagsMatchMode="any" → More candidates returned
13. [CANDIDATE] mcp_greenhouse_candidate_get · id="ad46cd88..." (Daniel Walker) → Profile retrieved
14. [CANDIDATE] mcp_greenhouse_candidate_get · id="8ff1c4a0..." (Kevin Williams) → Profile retrieved
15. [CANDIDATE] mcp_greenhouse_candidate_list · tags="distributed systems,microservices,Kubernetes,Docker,REST API", tagsMatchMode="any" → More candidates returned
16. [CANDIDATE] mcp_greenhouse_candidate_get · id="f852492f..." (James Lewis) → Profile retrieved
17. [CANDIDATE] mcp_greenhouse_candidate_get · id="6ee2d18d..." (Gary Clark) → Profile retrieved
18. [ERROR] mcp_greenhouse_candidate_get · id="e8e8e0f5..." (invalid) → MCP error -32603: invalid candidate ID
19. [CANDIDATE] mcp_greenhouse_candidate_list · tags="C++,C#,Java,Scala,backend,systems programming", tagsMatchMode="any" → More candidates returned
20. [MATCH] mcp_greenhouse_candidate_match_job · id="8ff1c4a0..." (Kevin Williams), jobId="f93267cf..." → Score: 68 (Skills 75 / Experience 60)
21. [MATCH] mcp_greenhouse_candidate_match_job · id="ad46cd88..." (Daniel Walker), jobId="f93267cf..." → Score: 25 (Skills 30 / Experience 15)
22. [MATCH] mcp_greenhouse_candidate_match_job · id="f852492f..." (James Lewis), jobId="f93267cf..." → Score: 48 (Skills 45 / Experience 55)
23. [CANDIDATE] mcp_greenhouse_candidate_list · name="software engineer" → More candidates returned
24. [CANDIDATE] mcp_greenhouse_candidate_list · tags="C++,C#,Java,Python,software development,backend,database", tagsMatchMode="any" → More candidates returned
25. [MATCH] mcp_greenhouse_candidate_match_job · id="6ee2d18d..." (Gary Clark), jobId="f93267cf..." → Score: 25 (Skills 30 / Experience 15)
26. [MATCH] mcp_greenhouse_candidate_match_job · id="dd23c3c1..." (Margaret Robinson), jobId="f93267cf..." → Score: 35 (Skills 45 / Experience 25)
I can see the candidate pool doesn't have strong matches for this Senior Software Engineer role in Cloud + AI. Let me search more specifically for candidates with the right technical background.
27. [CANDIDATE] mcp_greenhouse_candidate_list · limit=100 → Full candidate pool fetched
I can see the candidate pool doesn't have strong matches. The job requires deep systems programming expertise. Best available: Kevin Williams (68), James Lewis (48), Daniel Walker (25).
28. [CANDIDATE] mcp_greenhouse_candidate_list · limit=100, page=2 → More candidates fetched
29. [APPLY] mcp_greenhouse_application_create · candidateId="8ff1c4a0..." (Kevin Williams), jobId="f93267cf..." → Applied, score 68
30. [APPLY] mcp_greenhouse_application_create · candidateId="f852492f..." (James Lewis), jobId="f93267cf..." → Applied, score 48
31. [APPLY] mcp_greenhouse_application_create · candidateId="ad46cd88..." (Daniel Walker), jobId="f93267cf..." → Applied, score 25; skipped Robinson (35), a ranking error
FINAL OUTPUT · Hiring manager brief written by the agent
Job Overview
Senior Software Engineer — Azure Data Engineering Team
Redmond, WA  ·  Cloud + AI  ·  $119,800–$234,700
Focus: Microsoft Fabric, database drivers, SDK development, distributed systems
Selected Candidates (Applied)
1. Kevin Williams (Pavithra Shetty) — 68%  ·  Principal Consultant/BI Developer, 6 yrs
Strengths: Python, Java, C#, SQL  ·  Azure  ·  Master's in CS  ·  Agile  ·  Data systems background
Gaps: No C/C++; BI/application-level rather than SDK/driver development
Why selected: Best overall match with relevant languages and Azure experience. Education and years of experience meet requirements.
2. James Lewis — 48%  ·  Director of Data Science, 7+ yrs
Strengths: Python, Java, Scala  ·  AWS, Docker  ·  Strong algorithmic skills  ·  CS degree  ·  Leadership
Gaps: Data Science focus rather than systems engineering; missing C/C++/C# and DB driver experience
Why selected: Solid cloud experience (AWS/Docker). Programming skills transferable to backend development.
3. Daniel Walker — 25%  ·  Senior Director of IT, 3+ yrs
Strengths: Azure cloud platform  ·  SQL  ·  Project management  ·  Scrum methodology
Gaps: Management/ops role rather than hands-on engineering; lacks programming depth in required languages
Why selected: Azure and SQL knowledge provide domain familiarity, though a stretch candidate given management focus.
Candidates Not Selected
Gary Clark (25%) — Junior web developer, PHP/Laravel focus. Lacks seniority and systems-level programming experience.
Margaret Robinson (35%) — IT support specialist with hardware troubleshooting background. Not a software engineer; lacks required programming depth.
Others reviewed — Most candidates lack relevant software engineering backgrounds (teachers, floral designers, etc.) or have insufficient technical depth.
Summary
The candidate pool is limited for this specialised role. Kevin Williams (68%) is the strongest candidate with relevant programming languages and Azure experience, though all three selected candidates have gaps in low-level systems programming (C/C++) that the role emphasises. The position may benefit from additional sourcing focused on SDK development, database driver engineering, or distributed systems experience.
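
The ranking error in this run can be reproduced from the five match scores the agent computed. The sketch below is illustrative: the helper name and the alphabetical tie-break are assumptions, while the scores and the applied candidates are copied from the trajectory above.

```python
def top_k_by_score(scores, k=3):
    """Return the k highest-scoring names, best first.

    Ties are broken alphabetically by name, which is an assumption made
    only to keep the ordering deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Match scores from the Claude Code trajectory.
claude_code_scores = {
    "Kevin Williams": 68,
    "Daniel Walker": 25,
    "James Lewis": 48,
    "Gary Clark": 25,
    "Margaret Robinson": 35,
}
applied = ["Kevin Williams", "James Lewis", "Daniel Walker"]

expected = top_k_by_score(claude_code_scores)
missed = [name for name in expected if name not in applied]
print(expected)  # ['Kevin Williams', 'James Lewis', 'Margaret Robinson']
print(missed)    # ['Margaret Robinson']
```

A check of this shape corresponds to the state rubric that the applied candidates must be the three highest scorers from the agent's own evaluations.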
OpenAI SDK with Opus · FAIL
26 tool calls · 302 s · 17/19 rubrics passed
AUTH JOB CANDIDATE MATCH APPLY REASONING
I'll help you with this hiring task. Let me start by logging in and then finding the Sr. Software Engineer role in Redmond.
1. [AUTH] auth_login · email="admin@greenhouse.com", password="admin123" → ADMIN token issued
2. [JOB] job_list · search="Sr. Software Engineer Redmond" → Results returned
Let me search more broadly for the role:
3. [JOB] job_list · location="Redmond" → Multiple Redmond jobs returned
Let me search specifically for Software Engineer roles in Cloud + AI:
4. [JOB] job_list · department="Cloud + AI", search="Software Engineer" → Target id 0643bd41 found
Found it! Let me get the full details of the Sr. Software Engineer role in Redmond:
5. [JOB] job_get · id="0643bd41..." → Sr. Software Engineer (Cloud+AI, Redmond)
The description is truncated. Let me search for candidates with cloud and programming experience while I have the job ID. I'll search for candidates with relevant skills:
6. [CANDIDATE] candidate_list · tags="cloud", tagsMatchMode="any" → 0 results
7. [CANDIDATE] candidate_list · tags="Azure", tagsMatchMode="any" → Candidates returned
8. [CANDIDATE] candidate_list · tags="AWS", tagsMatchMode="any" → More candidates returned
Let me search for more candidates with programming and software engineering skills:
9. [CANDIDATE] candidate_list · tags="Python,Java,C++", tagsMatchMode="any" → More candidates returned
10. [CANDIDATE] candidate_list · tags="Software Engineer", tagsMatchMode="any" → 0 results
Let me search for candidates with more specific software engineering and cloud-related skills:
11. [CANDIDATE] candidate_list · tags="C++" → Candidates returned
12. [CANDIDATE] candidate_list · tags="C#" → More candidates returned
13. [CANDIDATE] candidate_list · tags="Kubernetes" → More candidates returned
14. [CANDIDATE] candidate_list · tags="Docker" → More candidates returned
Great! I found several strong candidates. Let me now check their match scores against the Sr. Software Engineer role in Redmond. I'll evaluate the top candidates:
15. [MATCH] candidate_match_job · id="711d96d3..." (Steven Hernandez), jobId="0643bd41..." → Score: 72 (Skills 68 / Experience 75)
16. [MATCH] candidate_match_job · id="b08af382..." (Lindsay Beer), jobId="0643bd41..." → Score: 72 (Skills 65 / Experience 80)
17. [MATCH] candidate_match_job · id="91a87c69..." (Jennifer Harris), jobId="0643bd41..." → Score: 62 (Skills 55 / Experience 70)
18. [MATCH] candidate_match_job · id="f852492f..." (James Lewis), jobId="0643bd41..." → Score: 48 (Skills 50 / Experience 45)
19. [MATCH] candidate_match_job · id="a0e953fc..." (Paul Satchell), jobId="0643bd41..." → Score: 78 (Skills 70 / Experience 95)
20. [MATCH] candidate_match_job · id="8ff1c4a0..." (Kevin Williams), jobId="0643bd41..." → Score: 68 (Skills 72 / Experience 65)
Let me check a few more candidates with different skill combinations:
21. [MATCH] candidate_match_job · id="ad46cd88..." (Daniel Walker), jobId="0643bd41..." → Score: 35 (Skills 30 / Experience 40)
22. [MATCH] candidate_match_job · id="91a2c2e2..." (Paul Davis), jobId="0643bd41..." → Score: 78 (Skills 70 / Experience 85)
23. [MATCH] candidate_match_job · id="15cc872d..." (Gary Hernandez), jobId="0643bd41..." → Score: 5 (Skills 15 / Experience 0)
Based on the match scores, I'll apply the top 3 candidates to the role. The best matches are: Paul Satchell (78), Paul Davis (78), Lindsay Beer (72). Let me create applications for them.
24. [APPLY] application_create · candidateId="a0e953fc..." (Paul Satchell), jobId="0643bd41..." → Applied, score 78
25. [APPLY] application_create · candidateId="91a2c2e2..." (Paul Davis), jobId="0643bd41..." → Applied, score 78
26. [APPLY] application_create · candidateId="b08af382..." (Lindsay Beer), jobId="0643bd41..." → Applied, score 72
FINAL OUTPUT · Hiring manager brief written by the agent
Role Overview
Sr. Software Engineer — Cloud + AI
Redmond, WA  ·  Microsoft SCHIE team  ·  $119.8K–$234.7K
Building cloud infrastructure for Azure.
Selected Candidates (Applied)
1. Paul Davis — 78%  ·  Software Engineering Manager, 20+ yrs
Strengths: C++, C#, Java, JavaScript  ·  Distributed systems (SOA)  ·  REST APIs  ·  Scrum  ·  Large-scale microservices
Gap: Limited React/TypeScript; has JavaScript/HTML5 foundation
Why selected: Strongest technical depth in backend/architecture. Extensive distributed systems and microservices experience. Strong mentoring background aligns with team collaboration needs.
2. Paul Satchell — 78%  ·  Senior Python Developer, 15+ yrs  ·  PCPP2 & Cloudera certified
Strengths: Python  ·  Distributed systems  ·  Microservices  ·  Docker  ·  DB optimisation (20%)  ·  Agile/Scrum  ·  Active mentor
Gap: Front-end Django/Bootstrap rather than React/TypeScript
Why selected: Exceptional backend and infrastructure skills. Strong cloud architecture experience. Demonstrated performance impact and active mentoring track record.
3. Lindsay Beer — 72%  ·  Senior Python Developer, 11+ yrs
Strengths: Python 3  ·  Kubernetes  ·  DevOps  ·  Docker  ·  CI/CD  ·  Apache Kafka  ·  NoSQL  ·  Prometheus
Gap: Front-end limited to Angular; lacks React/TypeScript and C# experience
Why selected: Deep cloud-native expertise (Kubernetes, Prometheus). Strong DevOps and data pipeline background directly relevant to Azure infrastructure.
Candidates Not Selected
Steven Hernandez (72%) — PL/SQL Developer; primarily database-focused, limited full-stack development experience.
Kevin Williams (68%) — Principal Consultant; consulting/advisory rather than hands-on engineering depth required.
Jennifer Harris (62%) — VP of IT; executive leadership role mismatch for IC4 position requiring hands-on coding.
James Lewis (48%) — Director of Data Science; expertise is data science/ML rather than systems engineering.
Daniel Walker (35%) — IT Director; focused on infrastructure/ERP management, not software development.
Gary Hernandez (5%) — HR/Technical Recruiter; not a software engineer.
Bottom Line
All three selected candidates bring 10+ years of senior engineering experience with strong cloud and distributed systems backgrounds. The main trade-off is front-end expertise — the role emphasises React/TypeScript, but these candidates excel in backend/infrastructure which is critical for Azure's cloud platform work. Their mentoring experience and architectural skills make them strong senior hires who can grow into the front-end requirements.

Abaka environments support any model or SDK — trajectories shown here use Claude and OpenAI as representative examples.

Agent Evaluation Results — 19-Rubric Scorecard
Rubric Type | Rubric Description | GUI Agent | MCP Claude Code | MCP OpenAI
Outcome: 10 pts
Outcome | Match scores cited in brief are supported by trajectory evidence
Outcome | Candidate profile claims in brief are evidentially grounded
Outcome | Every applied candidate discussed individually in brief
Outcome | Each applied candidate includes match score and rationale
Outcome | Candidates not selected are named with reasoning
Outcome | Job requirements referenced when justifying candidate selections
Outcome | Candidate ranking is consistent with computed match scores
Outcome | Trade-offs and imperfect fits are acknowledged
Outcome | Brief is clearly structured and easy to follow
Outcome | Brief includes concrete next steps for the hiring team
Process: 5 pts
Process | Job posting reviewed before any candidate or profile work
Process | Candidate search uses cloud technology filters (e.g. Azure)
Process | Candidate search uses programming language filters
Process | At least 3 distinct candidate filter or query combinations used
Process | Match scores computed for all candidates before any application is submitted
State: 4 pts
State | Applied candidates are the 3 highest scorers from the agent's own evaluations
State | Exactly 3 applications created on the target job
State | No unintended writes beyond applications, job matches, and login
State | Applications confirmed in ATS after submitting
DB Diff: 5 binary checks
DB Diff | Exactly 3 application INSERT operations in DB | N/A
DB Diff | All application writes target a single job | N/A
DB Diff | Applied candidate IDs match agent's 3 highest match scores | N/A
DB Diff | Only expected operation types and tables appear in DB changes | N/A
DB Diff | No unexpected table modifications beyond expected scope | N/A
Total Score & Verdict | GUI Agent: 16/19, FAIL | MCP Claude Code: 16/19 · DB Diff 4/5, FAIL | MCP OpenAI: 17/19 · DB Diff 5/5, FAIL
★ Must-pass rubric: a single failure on any must-pass item results in FAIL regardless of total score.
GUI Agent DB diff not available: no artifact was captured for that run.
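
The must-pass rule explains why a 17/19 run can still be a FAIL. The sketch below shows one way such a verdict could be computed; the `RubricResult` type, the `min_passed` threshold, and the choice of which rubric is must-pass are hypothetical, since the scorecard's pass threshold and starred items are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricResult:
    kind: str          # "outcome", "process" or "state"
    description: str
    passed: bool       # every rubric is a binary true/false check
    must_pass: bool = False


def score(results):
    """Return (rubrics passed, rubrics total)."""
    return sum(r.passed for r in results), len(results)


def verdict(results, min_passed):
    """FAIL on any failed must-pass rubric, otherwise compare to a threshold.

    min_passed is a hypothetical parameter; the document does not state
    the passing score used when all must-pass rubrics succeed.
    """
    if any(r.must_pass and not r.passed for r in results):
        return "FAIL"
    earned, _ = score(results)
    return "PASS" if earned >= min_passed else "FAIL"
```

With this rule, a run that passes 17 of 19 rubrics but misses a single must-pass item is reported as FAIL, regardless of how low the threshold is set.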
Pilot Proposal

Two Paths to Get Our Partnership Started

Both options give your team full access to Abaka's quality control pipeline before committing to a full-scale engagement.

Option A

Small Paid Pilot — Up to 20 Tasks

A focused pilot across a selected environment. Lets your team evaluate task quality, grading accuracy, and environment fidelity directly before making a larger purchase decision.

Option B (Recommended)

One Complete Environment

Full delivery of one paid environment with 100 tasks. Enables evaluation of Abaka's full annotation and infrastructure stack at representative scale, including Docker image and API access.