| 1 | +--- |
| 2 | +agent: agent |
| 3 | +--- |
| 4 | + |
| 5 | +# Data Extension Development Workflow |
| 6 | + |
| 7 | +Use this workflow to create CodeQL data extensions (Models-as-Data) for third-party libraries and frameworks. Data extensions let you customize taint tracking without writing QL code — you author YAML files that declare which functions are sources, sinks, summaries, barriers, or barrier guards. |
| 8 | + |
| 9 | +For format reference, read the MCP resource: `codeql://learning/data-extensions` |
| 10 | +For language-specific guidance: `codeql://languages/{{language}}/library-modeling` |
| 11 | + |
| 12 | +## Workflow Checklist |
| 13 | + |
| 14 | +### Phase 1: Identify the Target |
| 15 | + |
| 16 | +- [ ] **Confirm the target library and language** |
| 17 | + - Library name and version: {{libraryName}} |
| 18 | + - Target language: {{language}} |
| 19 | + - Determine the model format: |
| 20 | + - **MaD tuple format** (9–10 column tuples): C/C++ (`codeql/cpp-all`), C# (`codeql/csharp-all`), Go (`codeql/go-all`), Java/Kotlin (`codeql/java-all`) |
| 21 | + - **API Graph format** (3–5 column tuples): JavaScript/TypeScript (`codeql/javascript-all`), Python (`codeql/python-all`), Ruby (`codeql/ruby-all`) |
| 22 | + - Using the wrong format will cause the extension to silently fail to load. |
| 23 | + |
| 24 | +- [ ] **Locate a CodeQL database** |
| 25 | + - Tool: #list_codeql_databases |
| 26 | + - Or create one: #codeql_database_create |
| 27 | + - The database must contain code that exercises the target library |
| 28 | + |
| 29 | +- [ ] **Explore the library's API surface** |
| 30 | + - Tool: #read_database_source — browse source files to identify relevant API calls |
| 31 | + - Tool: #codeql_query_run with `queryName="PrintAST"` — visualize how library calls are represented |
| 32 | + - Skim the library's public API docs, type stubs, or source code |
| 33 | + |
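To make the two model formats named in Phase 1 concrete, the sketch below shows one `sinkModel` row in each. It assumes Java and JavaScript as the example languages. The `java.sql.Statement.execute` and `execa` rows follow the shape of examples in the public CodeQL documentation, but treat them as illustrative and confirm the exact column layout in `codeql://languages/{{language}}/library-modeling` before relying on them.

```yaml
extensions:
  # MaD tuple format (Java), 9 columns for sinkModel:
  # package, type, subtypes, name, signature, ext, input, kind, provenance
  - addsTo:
      pack: codeql/java-all
      extensible: sinkModel
    data:
      - ["java.sql", "Statement", True, "execute", "(String)", "", "Argument[0]", "sql-injection", "manual"]

  # API Graph format (JavaScript), 3 columns for sinkModel:
  # type, path, kind
  - addsTo:
      pack: codeql/javascript-all
      extensible: sinkModel
    data:
      - ["execa", "Member[shell].Argument[0]", "command-injection"]
```

In practice each row would live in a pack that targets only its own language; the two are combined here purely for comparison.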
| 34 | +### Phase 2: Classify the API Surface |
| 35 | + |
| 36 | +For each public function or method in the library, classify it: |
| 37 | + |
| 38 | +1. **Does it return data from outside the program** (network, file, env, stdin)? → `sourceModel` with `kind` matching the threat model (usually `"remote"`) |
| 39 | +2. **Does it consume data in a security-sensitive operation** (SQL, exec, path, redirect, eval, deserialize)? → `sinkModel` with `kind` matching the vulnerability class (e.g. `"sql-injection"`, `"command-injection"`) |
| 40 | +3. **Does it pass data through opaque library code** (encode, decode, wrap, copy, iterate)? → `summaryModel` with `kind: "taint"` (derived) or `kind: "value"` (identity) |
| 41 | +4. **Does it sanitize data so its output is safe for a specific sink kind?** → `barrierModel` with `kind` matching the sink kind it neutralizes |
| 42 | +5. **Does it return a boolean indicating whether data is safe?** → `barrierGuardModel` with the appropriate `acceptingValue` (`"true"` or `"false"`) and matching `kind` |
| 43 | +6. **Is the type a subclass of something already modeled?** → `typeModel` (API Graph languages) or set the `subtypes` column to `True` (MaD tuple languages) |
| 44 | +7. **Did the auto-generated model assign a wrong summary?** → `neutralModel` to suppress it |
| 45 | + |
| 46 | +A complete chain of **source → (summary\*) → sink** is required for end-to-end findings; missing a single hop will cause false negatives. |
| 47 | + |
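As a rough illustration of a complete source → summary → sink chain, the sketch below models a hypothetical Java library. The `com.example.*` packages, the `Request`, `Codec`, and `Client` types, and their methods are all invented for this example; only the column order (9 columns for `sourceModel` and `sinkModel`, 10 for `summaryModel`) is meant to reflect the MaD tuple format.

```yaml
extensions:
  # Source: Request.getParam(String) returns attacker-controlled data (hypothetical API).
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      - ["com.example.http", "Request", True, "getParam", "(String)", "", "ReturnValue", "remote", "manual"]

  # Summary: Codec.decode(String) propagates taint from its argument to its result (hypothetical API).
  - addsTo:
      pack: codeql/java-all
      extensible: summaryModel
    data:
      - ["com.example.util", "Codec", True, "decode", "(String)", "", "Argument[0]", "ReturnValue", "taint", "manual"]

  # Sink: Client.rawQuery(String) executes its argument as SQL (hypothetical API).
  - addsTo:
      pack: codeql/java-all
      extensible: sinkModel
    data:
      - ["com.example.db", "Client", True, "rawQuery", "(String)", "", "Argument[0]", "sql-injection", "manual"]
```

If the `summaryModel` row were left out, taint would likely stop at the call to `decode`, and a SQL injection query would miss the flow into `rawQuery`.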
| 48 | +### Phase 3: Choose the Deployment Scope |
| 49 | + |
| 50 | +Choose between two paths: |
| 51 | + |
| 52 | +- **Single-repo shortcut** — drop `.model.yml` files under `.github/codeql/extensions/<pack-name>/` in the consuming repo. **No `codeql-pack.yml` is required**; Code Scanning auto-loads extensions from this directory. Use when the models only need to apply to one repo. |
| 53 | +- **Reusable model pack** — create a pack directory with a `codeql-pack.yml` declaring `extensionTargets` and `dataExtensions`. Use when models will be consumed by multiple repos or by org-wide Default Setup. |
| 54 | + |
| 55 | +### Phase 4: Author the `.model.yml` File(s) |
| 56 | + |
| 57 | +- [ ] **Create the model file** |
| 58 | + - Use naming convention `<library>-<module>.model.yml` (lowercase, hyphen-separated) |
| 59 | + - Split per logical module rather than putting an entire ecosystem in one file |
| 60 | + - Read `codeql://languages/{{language}}/library-modeling` for the exact column layout and examples |
| 61 | + |
| 62 | +- [ ] **Write the YAML with correct extensible predicates** |
| 63 | + |
| 64 | + ```yaml |
| 65 | + extensions: |
| 66 | + - addsTo: |
| 67 | + pack: codeql/{{language}}-all |
| 68 | + extensible: sinkModel |
| 69 | + data: |
| 70 | + # Add tuples here — column count must exactly match the predicate schema |
| 71 | + - [...] |
| 72 | + ``` |
| 73 | + |
| 74 | + - Every row must have the **exact column count** for its extensible predicate — an invalid row will fail silently or cause errors |
| 75 | + - Set the `provenance` column to `"manual"` (MaD format) for hand-written rows |
| 76 | + - Ensure `kind` values match across the chain (e.g. a `"sql-injection"` barrier must guard a `"sql-injection"` sink) |
| 77 | + |
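A single `.model.yml` file can carry several `addsTo` entries, one per extensible predicate. The sketch below assumes an API Graph language (JavaScript) and a hypothetical `@example/db` package. The `typeModel` row mirrors the shape used in the public CodeQL documentation, while the `rawQuery` member is invented for illustration.

```yaml
extensions:
  # typeModel columns: type1, type2, path
  # Values returned by getConnection() on "@example/db" are treated as "mysql.Connection",
  # so models that already exist for that type also apply to them.
  - addsTo:
      pack: codeql/javascript-all
      extensible: typeModel
    data:
      - ["mysql.Connection", "@example/db", "Member[getConnection].ReturnValue"]

  # sinkModel columns: type, path, kind
  # rawQuery is a hypothetical helper whose first argument is executed as SQL.
  - addsTo:
      pack: codeql/javascript-all
      extensible: sinkModel
    data:
      - ["@example/db", "Member[rawQuery].Argument[0]", "sql-injection"]
```

Keeping the type relationship and the sink in the same file makes it easier to check that every row has the column count its predicate expects.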
| 78 | +### Phase 5: Configure `codeql-pack.yml` (Model-Pack Path Only) |
| 79 | + |
| 80 | +Skip this phase if you chose the `.github/codeql/extensions/` shortcut in Phase 3. |
| 81 | + |
| 82 | +For a reusable pack, create or update `codeql-pack.yml`: |
| 83 | + |
| 84 | +```yaml |
| 85 | +name: <org>/<language>-<pack-name> |
| 86 | +version: 0.0.1 |
| 87 | +library: true |
| 88 | +extensionTargets: |
| 89 | + codeql/<language>-all: '*' |
| 90 | +dataExtensions: |
| 91 | + - models/**/*.yml |
| 92 | +``` |
| 93 | + |
| 94 | +- `library: true` — model packs are always libraries, never queries |
| 95 | +- `extensionTargets` — names the upstream pack the extensions extend |
| 96 | +- `dataExtensions` — a glob that must match every `.model.yml` you author (here, any `.yml` file under `models/`) |
| 97 | + |
| 98 | +- [ ] **Install pack dependencies** |
| 99 | + - Tool: #codeql_pack_install — resolve dependencies for the model pack |
| 100 | + |
| 101 | +### Phase 6: Test with `codeql query run` |
| 102 | + |
| 103 | +Validate the model against a real database: |
| 104 | + |
| 105 | +- [ ] **Run a relevant security query with the extension applied** |
| 106 | + - Tool: #codeql_query_run |
| 107 | + - Pass the model pack directory via the `additionalPacks` parameter |
| 108 | + - Pick a query whose sink kind matches what you modeled (e.g. a `sql-injection` query when adding SQL sinks) |
| 109 | + - Decode results: #codeql_bqrs_decode or #codeql_bqrs_interpret |
| 110 | + |
| 111 | +- [ ] **Verify expected findings appear** |
| 112 | + - New sources/sinks should produce findings that were absent without the extension |
| 113 | + - Barriers/barrier guards should suppress findings that were previously reported |
| 114 | + |
| 115 | +### Phase 7: Run Unit Tests with `codeql test run` |
| 116 | + |
| 117 | +- [ ] **Create a test case for the extension** |
| 118 | + - Write a small test file that exercises the new source/sink/summary chain end-to-end |
| 119 | + - Include both positive cases (vulnerable code detected) and negative cases (safe code not flagged) |
| 120 | + |
| 121 | +- [ ] **Run the tests** |
| 122 | + - Tool: #codeql_test_run |
| 123 | + - Pass the model pack directory via the `additionalPacks` parameter |
| 124 | + - Note: `codeql test run` does **not** accept `--model-packs`; extensions must be wired via `codeql-pack.yml` or `--additional-packs` |
| 125 | + |
| 126 | +- [ ] **Accept correct results** |
| 127 | + - Tool: #codeql_test_accept — accept the `.actual` output as the `.expected` baseline once you confirm it is correct |
| 128 | + |
| 129 | +### Phase 8: Decide Next Steps |
| 130 | + |
| 131 | +- If the `.model.yml` lives under `.github/codeql/extensions/` of the consuming repo, you are **done** — Code Scanning will load it on the next analysis. |
| 132 | +- If you authored a reusable model pack and want it to apply across an organization, publish it to GHCR with `codeql pack publish` and configure it under org Code security → Global settings → CodeQL analysis → Model packs. |
| 133 | + |
| 134 | +## Validation Checklist |
| 135 | + |
| 136 | +- [ ] Correct tuple format for the language (API Graph vs MaD) |
| 137 | +- [ ] Every row has the exact column count for its extensible predicate |
| 138 | +- [ ] Sink/barrier `kind` values match across the chain |
| 139 | +- [ ] At least one end-to-end test exercises the new model and produces expected findings |
| 140 | +- [ ] `codeql-pack.yml` `dataExtensions` glob actually matches the new files |
| 141 | +- [ ] No regressions in pre-existing tests under the same pack |
| 142 | + |
| 143 | +## Related Resources |
| 144 | + |
| 145 | +- `codeql://learning/data-extensions` — Common data extensions overview (both model formats) |
| 146 | +- `codeql://languages/{{language}}/library-modeling` — Language-specific library modeling guide |
| 147 | +- `codeql://templates/security` — Security query templates |
| 148 | +- `codeql://learning/test-driven-development` — TDD workflow for CodeQL queries |