How the scanner works
AST traversal, lockfile inspection, and the detector pipeline that turns code into findings.
Last updated May 6, 2026
The scanner is the part of Attestly you'll never see, but it's where everything else starts. This page walks through each phase of a scan, in order.
Phase 1 — fetch
We download the repository tarball at the requested commit SHA via the GitHub REST API. The archive is streamed to a temporary directory inside an isolated worker container. Maximum repo size: 500 MB compressed.
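The fetch step can be sketched as follows. This is illustrative only: the endpoint is GitHub's documented tarball archive route, but the function names (`tarballUrl`, `withinSizeLimit`) are hypothetical and not part of Attestly's actual code.

```typescript
// Illustrative sketch of Phase 1: the GitHub REST endpoint for a commit
// tarball, plus the 500 MB compressed-size guard described above.
// Names here (tarballUrl, withinSizeLimit) are hypothetical.
const MAX_COMPRESSED_BYTES = 500 * 1024 * 1024;

function tarballUrl(owner: string, repo: string, sha: string): string {
  // GET /repos/{owner}/{repo}/tarball/{ref} — GitHub REST API
  return `https://api.github.com/repos/${owner}/${repo}/tarball/${sha}`;
}

function withinSizeLimit(compressedBytes: number): boolean {
  return compressedBytes <= MAX_COMPRESSED_BYTES;
}
```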
Phase 2 — fingerprint
Before we touch any code, we compute three fingerprints:
- Lockfile hash — pnpm-lock.yaml, package-lock.json, yarn.lock, requirements.txt, go.sum, Cargo.lock, Gemfile.lock.
- Manifest hash — package.json, pyproject.toml, go.mod, Cargo.toml, Gemfile.
- Source hash — a Merkle tree of all *.{ts,tsx,js,jsx,py,go,rs,rb} files.
If all three are unchanged from the previous scan, we skip the AST pass entirely and reuse the prior findings. This is what makes per-PR drift checks fast.
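The skip check amounts to comparing three content hashes against the previous scan's. A minimal sketch, assuming SHA-256 (the actual hash function and the `Fingerprints`/`canReuseFindings` names are not specified in this doc):

```typescript
import { createHash } from "node:crypto";

// Sketch of the Phase 2 skip check: three content hashes compared
// against the previous scan's. Field and function names are illustrative.
interface Fingerprints {
  lockfile: string;
  manifest: string;
  source: string;
}

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

function canReuseFindings(prev: Fingerprints, next: Fingerprints): boolean {
  // All three must match to skip the AST pass and reuse prior findings.
  return (
    prev.lockfile === next.lockfile &&
    prev.manifest === next.manifest &&
    prev.source === next.source
  );
}
```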
Phase 3 — lockfile sweep
For each lockfile we extract (name, version, license, sha) tuples. This
gives us:
- The full transitive dependency graph.
- Every OSS license in the build.
- The exact version of every AI SDK or subprocessor library.
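For one concrete lockfile format, the sweep looks roughly like this. The `packages` map and `integrity` field follow npm's package-lock.json v3 format; the `extractEntries` function itself is an illustrative sketch (license extraction, which varies by format, is omitted).

```typescript
// Sketch of Phase 3 for npm's package-lock.json v3: walk the "packages"
// map and emit (name, version, sha) tuples. extractEntries is illustrative.
interface LockEntry {
  name: string;
  version: string;
  integrity?: string; // subresource-integrity hash, e.g. "sha512-..."
}

function extractEntries(lock: {
  packages: Record<string, { version?: string; integrity?: string }>;
}): LockEntry[] {
  const entries: LockEntry[] = [];
  for (const [path, pkg] of Object.entries(lock.packages)) {
    if (path === "" || !pkg.version) continue; // "" is the root project itself
    // Keys look like "node_modules/openai" or "node_modules/@azure/openai".
    const name = path.replace(/^.*node_modules\//, "");
    entries.push({ name, version: pkg.version, integrity: pkg.integrity });
  }
  return entries;
}
```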
Phase 4 — AST traversal
For TypeScript and JavaScript, we use ts-morph to build a project graph and walk each file. The walker emits import edges (import { OpenAI } from "openai") and call sites (openai.chat.completions.create({ ... })). For Python, we use the standard-library ast module; for Go and Rust, we parse import blocks textually.
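The textual parsing mentioned for Go and Rust can be as simple as scanning for import blocks. A minimal sketch for Go source (real parsing handles more edge cases, and `goImports` is a hypothetical name):

```typescript
// Minimal sketch of textual import-block parsing for Go, as described
// above. Handles both single-line imports and parenthesized blocks.
function goImports(source: string): string[] {
  const imports: string[] = [];
  // Parenthesized block: import ( "a" \n "b" )
  const block = source.match(/import\s*\(([\s\S]*?)\)/);
  if (block) {
    for (const line of block[1].split("\n")) {
      const m = line.match(/"([^"]+)"/);
      if (m) imports.push(m[1]);
    }
  }
  // Single-line form: import "fmt"
  const single = source.match(/import\s+"([^"]+)"/);
  if (single) imports.push(single[1]);
  return imports;
}
```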
Phase 5 — detector matching
A detector is a small declarative rule:
{
key: "openai",
packages: ["openai", "@azure/openai"],
imports: [/^openai/, /@azure\/openai/],
legalEntity: "OpenAI, L.L.C.",
purpose: "AI inference",
isAi: true,
dataCategoriesByDefault: ["communications", "behavioral"],
}
Around 200 such detectors ship in the box. You can add your own — see Custom detectors.
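A detector like the one above is matched against the package list from the lockfile sweep and the import edges from the AST pass. The `Detector` shape below mirrors the example rule (with the informational fields trimmed), but `matchDetector` itself is an illustrative sketch, not Attestly's actual matcher:

```typescript
// Sketch of Phase 5: match one detector against the outputs of the
// lockfile sweep (package names) and the AST pass (import paths).
interface Detector {
  key: string;
  packages: string[];
  imports: RegExp[];
  isAi: boolean;
}

function matchDetector(
  detector: Detector,
  installedPackages: string[],
  importPaths: string[],
): boolean {
  // A hit on either signal is enough to produce a finding.
  const packageHit = detector.packages.some((p) => installedPackages.includes(p));
  const importHit = importPaths.some((path) =>
    detector.imports.some((re) => re.test(path)),
  );
  return packageHit || importHit;
}
```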
Phase 6 — output
The scan produces a findings table populated with one row per match. Each
row carries enough metadata for downstream generators to build accurate
documents without ever re-reading your source.
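The actual column set of the findings table is not documented here, so the row shape below is an assumption: every field name is illustrative, chosen to reflect the metadata the earlier phases produce.

```typescript
// Hypothetical shape of one findings row — field names are assumptions
// based on the phases described above, not Attestly's real schema.
interface Finding {
  detectorKey: string; // e.g. "openai"
  package: string;     // matched package name from the lockfile sweep
  version: string;     // exact version from the lockfile sweep
  license: string;     // OSS license from the lockfile sweep
  callSites: string[]; // file:line locations from the AST pass
  isAi: boolean;       // copied from the matching detector
}

const example: Finding = {
  detectorKey: "openai",
  package: "openai",
  version: "4.0.0",
  license: "Apache-2.0",
  callSites: ["src/chat.ts:42"],
  isAi: true,
};
```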