Methodology · Q2 · 2026
How the AAI Index is computed
The Agentic Accounting Index ranks tools using a transparent composite score synthesized from public data, plus a synthetic Elo rating for head-to-head positioning. Every input is public, every weight is published, and every claim in our editorial research notes is cited.
This is explicitly not a controlled experimental benchmark. For that, see DualEntry's 101-task evaluation or FinanceArena's FinanceQA. The AAI Index is a weighted synthesis of the public signals a buyer can actually observe.
The composite formula
Each tool is scored on seven dimensions, all on a 0–100 scale. The final score is a weighted average:
| Dimension | Weight | What we measure |
|---|---|---|
| Autonomy | 20% | Documented level of human-in-the-loop across the product's core workflow |
| Accuracy | 20% | Cited third-party benchmarks, review-site sentiment on accuracy, public case-study results |
| Coverage | 15% | Breadth of feature support across our 15-category taxonomy |
| Local Fit | 15% | Country-specific compliance — chart of accounts, tax filing, e-invoicing, local banks |
| Integrations | 10% | Documented integration count and quality tiers |
| Traction | 10% | Funding, public customer count, hiring signals, press coverage |
| Pricing | 10% | Cost efficiency — value for feature coverage, not absolute price |
Formula: Index = 0.2·Autonomy + 0.2·Accuracy + 0.15·Coverage + 0.15·LocalFit + 0.1·Integrations + 0.1·Traction + 0.1·Pricing
When viewing a country-scoped leaderboard, LocalFit is the country-specific sub-score (so Fortnox and Visma benefit from Swedish Local Fit on /se but not on the global ranking).
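In code, the composite is just a weighted sum. The sketch below is a minimal illustration, assuming the dimension scores are already on the 0–100 scale; the dictionary keys and example values are illustrative, not the production schema.

```python
# Minimal sketch of the AAI composite. Dimension scores are assumed to be
# 0-100 already; keys and example values are illustrative.

WEIGHTS = {
    "autonomy": 0.20,
    "accuracy": 0.20,
    "coverage": 0.15,
    "local_fit": 0.15,
    "integrations": 0.10,
    "traction": 0.10,
    "pricing": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of the seven 0-100 dimension scores.

    On a country-scoped leaderboard, scores["local_fit"] is the
    country-specific sub-score; on the global ranking it is the
    global Local Fit score.
    """
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())

# Example: a hypothetical tool viewed on the global leaderboard.
example = {
    "autonomy": 72, "accuracy": 65, "coverage": 80, "local_fit": 40,
    "integrations": 55, "traction": 60, "pricing": 70,
}
print(round(composite_score(example), 1))  # 63.9
```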
Sources for each signal
- Autonomy — product documentation, marketing claims compared against substantiated case studies, interviews published by the vendor, investor-deck snippets from public announcements.
- Accuracy — DualEntry's 101-task benchmark, FinanceQA performance tiers, G2 / Capterra sentiment analysis on the specific accuracy dimension, published customer case studies with measurable outcomes.
- Coverage — scored against the 15-category feature taxonomy, built by reading the vendor's product pages and cross-checking against customer reviews.
- Local Fit — by country: chart-of-accounts support (BAS / GAAP / IFRS), e-invoicing / Peppol support, local banking integrations, tax-authority e-filing, local language UI, documented compliance certifications.
- Integrations — the vendor's public integration marketplace, cross-referenced with independent listings (AppExchange, etc.).
- Traction — Crunchbase / Pitchbook funding, public customer counts, LinkedIn headcount trajectory, press coverage volume, partner-program announcements.
- Pricing — the vendor's public pricing page (where available), customer-reported pricing on G2 / Capterra, and third-party reviews that publish actual-paid pricing.
Elo — synthetic head-to-head
Alongside the composite, every tool has an Elo rating. This is computed from synthetic pairwise matchups: for each pair of tools, we generate N matches with outcomes weighted by each tool's signal strength and injected with deterministic noise (seed-locked so rankings are reproducible). The Elo lets readers see "who beats whom" rather than just an ordered list.
Elo is useful because the composite score can have arbitrarily close ties — Elo resolves those by asking a simple pairwise question. It is also how leaderboards like Chatbot Arena surface relative strength when absolute scores are noisy.
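For readers who want the mechanics, here is a minimal sketch of a seed-locked synthetic Elo in Python. The K-factor, matches per pair, and noise scale are illustrative assumptions, not the published parameters; only the ideas of signal-weighted outcomes, deterministic seeding, and standard Elo updates come from the description above.

```python
import itertools
import random

K = 32           # illustrative K-factor, not the published value
N_MATCHES = 200  # illustrative matches per pair
NOISE = 8.0      # illustrative noise scale on the signal gap

def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def synthetic_elo(signals: dict[str, float], seed: int = 2026) -> dict[str, float]:
    """Seed-locked synthetic Elo: pairwise matches whose outcomes are
    weighted by each tool's signal strength plus deterministic noise."""
    rng = random.Random(seed)                 # seed-locked => reproducible rankings
    ratings = {tool: 1500.0 for tool in signals}
    for a, b in itertools.combinations(sorted(signals), 2):
        for _ in range(N_MATCHES):
            # Outcome probability is driven by the (noisy) signal gap.
            gap = signals[a] - signals[b] + rng.gauss(0, NOISE)
            outcome = 1.0 if gap > 0 else 0.0  # 1 = tool a wins this match
            e_a = expected(ratings[a], ratings[b])
            ratings[a] += K * (outcome - e_a)
            ratings[b] += K * ((1.0 - outcome) - (1.0 - e_a))
    return ratings
```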
Known limitations
- No controlled experiments. We are not running DualEntry-style evaluations on these tools at this MVP stage. The "Accuracy" dimension leans on public benchmarks and sentiment — it is not a primary measurement.
- Availability ≠ local fit. A tool that "supports" Sweden may not map cleanly to BAS or integrate with Skatteverket. We discount for this in the Local Fit sub-score.
- Synthetic Elo. The Elo rating is not from real user comparisons (no such database exists at the needed scale for this category). It is seeded from public signals and explicit editorial judgment. Do not read it as crowd-wisdom.
- Quarter lag. Rankings reflect Q2 · 2026 public data. Between issues, fast-moving players may have shipped meaningful changes not yet reflected in the index.
Why this framing
The alternative — running controlled benchmarks on every tool — requires API access many vendors do not offer, would take months per issue, and would miss the non-product dimensions (integrations, local fit, pricing) that dominate buying decisions in practice. The composite model sacrifices experimental purity for practical usefulness while remaining transparent about what it is.
We publish this methodology page so the score is reproducible from inputs. If any number on this site doesn't trace back to a public source, that's a bug — please file it.
The Models Index — a separate methodology
The Models Index is a sibling benchmark that ranks foundation LLMs on raw accounting tasks, independent of the commercial tools built on top of them. It uses a different formula and a different set of signals.
Composite formula
| Dimension | Weight | Source |
|---|---|---|
| Accounting tasks (8 sub-scores, mean) | 70% | DualEntry 101-task benchmark where published; synthesized otherwise |
| Cost efficiency | 15% | Public vendor pricing per 1M tokens, blended 60% input / 40% output |
| Context window | 10% | Vendor-published maximum context, normalized to 1M baseline |
| Speed | 5% | Output tokens per second, largely from Artificial Analysis |
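A compact sketch of how these four components combine, again in Python. Only the weights, the 60/40 cost blend, and the 1M-token context baseline come from the table above; how blended cost and speed are mapped onto 0–100 is an illustrative assumption.

```python
# Sketch of the Models Index composite. Normalization of cost and speed to
# 0-100 is assumed to happen upstream; only the weights, the 60/40 cost
# blend, and the 1M-token context baseline are taken from the table.

def blended_cost(input_per_m: float, output_per_m: float) -> float:
    """Blended $ per 1M tokens: 60% input price, 40% output price."""
    return 0.6 * input_per_m + 0.4 * output_per_m

def models_index(task_scores: list[float], cost_score: float,
                 context_tokens: int, speed_score: float) -> float:
    """task_scores: the 8 accounting sub-scores (0-100).
    cost_score / speed_score: already normalized to 0-100.
    context_tokens: vendor-published maximum context window."""
    accounting = sum(task_scores) / len(task_scores)       # mean of the 8 sub-scores
    context = min(context_tokens / 1_000_000, 1.0) * 100   # 1M-token baseline, capped
    return 0.70 * accounting + 0.15 * cost_score + 0.10 * context + 0.05 * speed_score
```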
The 8 accounting sub-scores
The task categories are borrowed directly from DualEntry's 101-task evaluation so our synthesized scores are anchored to a real, cited public eval. They are: Transaction Classification, Journal Entry Creation, Accounts Payable, Accounts Receivable, Bank Reconciliation, Financial Reporting, Month-End Close, and Accounting Knowledge.
For models DualEntry has published (Claude Opus 4.7, GPT-5.4, GPT-5.4-Nano, GPT-5.4-Mini, Z.ai GLM-5, MiniMax M2.7, Gemini 3.1 Pro, Claude Sonnet 4.6, Claude Haiku 4.5), we anchor the overall to the measured number and distribute sub-scores consistent with DualEntry's published category detail (where available) or with the structural pattern the benchmark has shown (structured tasks score highest, month-end close and financial reporting lowest).
For models DualEntry has not yet scored (DeepSeek V4, Grok 4.1 Fast at time of publication), the score is synthesized — derived from adjacent benchmarks (HumanEval, SWE-bench Verified, Artificial Analysis Intelligence Index) plus editorial judgment. These rows are explicitly tagged as Synth in the leaderboard.
Why this matters alongside Tools
A tool's Accuracy dimension on the Tools index reflects end-to-end accuracy, which is a function of prompt engineering, RAG architecture, fine-tuning, and the underlying model. The Models Index isolates the raw-capability layer so readers can see what the best-available model can do, which is the upper bound for any tool built on it.