How the scanner works
AST traversal, lockfile inspection, and the detector pipeline that turns code into findings.
Last updated May 6, 2026
The scanner is the part of Attestly you'll never see, but it's where everything else starts. This page walks through each phase of a scan, in order.
Phase 1 — fetch
We download the repository tarball at the requested commit SHA via the GitHub REST API. The archive is streamed to a temporary directory inside an isolated worker container. Maximum repo size: 500 MB compressed.
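The fetch step can be sketched as follows. This is illustrative only: the endpoint is GitHub's documented tarball archive route, but the function names (`tarballUrl`, `withinSizeLimit`) are hypothetical and not part of Attestly's actual code.

```typescript
// Illustrative sketch of Phase 1: the GitHub REST endpoint for a commit
// tarball, plus the 500 MB compressed-size guard described above.
// Names here (tarballUrl, withinSizeLimit) are hypothetical.
const MAX_COMPRESSED_BYTES = 500 * 1024 * 1024;

function tarballUrl(owner: string, repo: string, sha: string): string {
  // GET /repos/{owner}/{repo}/tarball/{ref} — GitHub REST API
  return `https://api.github.com/repos/${owner}/${repo}/tarball/${sha}`;
}

function withinSizeLimit(compressedBytes: number): boolean {
  return compressedBytes <= MAX_COMPRESSED_BYTES;
}
```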
Phase 2 — fingerprint
Before we touch any code, we compute three fingerprints:
- Lockfile hash — pnpm-lock.yaml, package-lock.json, yarn.lock, requirements.txt, go.sum, Cargo.lock, Gemfile.lock.
- Manifest hash — package.json, pyproject.toml, go.mod, Cargo.toml, Gemfile.
- Source hash — a Merkle tree of all *.{ts,tsx,js,jsx,py,go,rs,rb} files.
If all three are unchanged from the previous scan, we skip the AST pass entirely and reuse the prior findings. This is what makes per-PR drift checks fast.
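The skip check amounts to comparing three content hashes against the previous scan's. A minimal sketch, assuming SHA-256 (the actual hash function and the `Fingerprints`/`canReuseFindings` names are not specified in this doc):

```typescript
import { createHash } from "node:crypto";

// Sketch of the Phase 2 skip check: three content hashes compared
// against the previous scan's. Field and function names are illustrative.
interface Fingerprints {
  lockfile: string;
  manifest: string;
  source: string;
}

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

function canReuseFindings(prev: Fingerprints, next: Fingerprints): boolean {
  // All three must match to skip the AST pass and reuse prior findings.
  return (
    prev.lockfile === next.lockfile &&
    prev.manifest === next.manifest &&
    prev.source === next.source
  );
}
```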
Phase 3 — lockfile sweep
For each lockfile we extract (name, version, license, sha) tuples. This
gives us:
- The full transitive dependency graph.
- Every OSS license in the build.
- The exact version of every AI SDK or subprocessor library.
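For one concrete lockfile format, the sweep looks roughly like this. The `packages` map and `integrity` field follow npm's package-lock.json v3 format; the `extractEntries` function itself is an illustrative sketch (license extraction, which varies by format, is omitted).

```typescript
// Sketch of Phase 3 for npm's package-lock.json v3: walk the "packages"
// map and emit (name, version, sha) tuples. extractEntries is illustrative.
interface LockEntry {
  name: string;
  version: string;
  integrity?: string; // subresource-integrity hash, e.g. "sha512-..."
}

function extractEntries(lock: {
  packages: Record<string, { version?: string; integrity?: string }>;
}): LockEntry[] {
  const entries: LockEntry[] = [];
  for (const [path, pkg] of Object.entries(lock.packages)) {
    if (path === "" || !pkg.version) continue; // "" is the root project itself
    // Keys look like "node_modules/openai" or "node_modules/@azure/openai".
    const name = path.replace(/^.*node_modules\//, "");
    entries.push({ name, version: pkg.version, integrity: pkg.integrity });
  }
  return entries;
}
```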
Phase 4 — AST traversal
For TypeScript and JavaScript, we use ts-morph to build a project graph and walk each file. The walker emits import edges (import { OpenAI } from "openai") and call sites (openai.chat.completions.create({ ... })). For Python, we use the standard-library ast module; for Go and Rust, we parse import blocks textually.
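The textual parsing mentioned for Go and Rust can be as simple as scanning for import blocks. A minimal sketch for Go source (real parsing handles more edge cases, and `goImports` is a hypothetical name):

```typescript
// Minimal sketch of textual import-block parsing for Go, as described
// above. Handles both single-line imports and parenthesized blocks.
function goImports(source: string): string[] {
  const imports: string[] = [];
  // Parenthesized block: import ( "a" \n "b" )
  const block = source.match(/import\s*\(([\s\S]*?)\)/);
  if (block) {
    for (const line of block[1].split("\n")) {
      const m = line.match(/"([^"]+)"/);
      if (m) imports.push(m[1]);
    }
  }
  // Single-line form: import "fmt"
  const single = source.match(/import\s+"([^"]+)"/);
  if (single) imports.push(single[1]);
  return imports;
}
```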
Phase 5 — detector matching
A detector is a small declarative rule:
{
key: "openai",
packages: ["openai", "@azure/openai"],
imports: [/^openai/, /@azure\/openai/],
legalEntity: "OpenAI, L.L.C.",
purpose: "AI inference",
isAi: true,
dataCategoriesByDefault: ["communications", "behavioral"],
}
Around 200 such detectors ship in the box. You can add your own — see Custom detectors.
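A detector like the one above is matched against the package list from the lockfile sweep and the import edges from the AST pass. The `Detector` shape below mirrors the example rule (with the informational fields trimmed), but `matchDetector` itself is an illustrative sketch, not Attestly's actual matcher:

```typescript
// Sketch of Phase 5: match one detector against the outputs of the
// lockfile sweep (package names) and the AST pass (import paths).
interface Detector {
  key: string;
  packages: string[];
  imports: RegExp[];
  isAi: boolean;
}

function matchDetector(
  detector: Detector,
  installedPackages: string[],
  importPaths: string[],
): boolean {
  // A hit on either signal is enough to produce a finding.
  const packageHit = detector.packages.some((p) => installedPackages.includes(p));
  const importHit = importPaths.some((path) =>
    detector.imports.some((re) => re.test(path)),
  );
  return packageHit || importHit;
}
```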
Phase 6 — output
The scan produces a findings table populated with one row per match. Each
row carries enough metadata for downstream generators to build accurate
documents without ever re-reading your source.
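The actual column set of the findings table is not documented here, so the row shape below is an assumption: every field name is illustrative, chosen to reflect the metadata the earlier phases produce.

```typescript
// Hypothetical shape of one findings row — field names are assumptions
// based on the phases described above, not Attestly's real schema.
interface Finding {
  detectorKey: string; // e.g. "openai"
  package: string;     // matched package name from the lockfile sweep
  version: string;     // exact version from the lockfile sweep
  license: string;     // OSS license from the lockfile sweep
  callSites: string[]; // file:line locations from the AST pass
  isAi: boolean;       // copied from the matching detector
}

const example: Finding = {
  detectorKey: "openai",
  package: "openai",
  version: "4.0.0",
  license: "Apache-2.0",
  callSites: ["src/chat.ts:42"],
  isAi: true,
};
```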