
Add skill and evals for dynamic mode usage #6271

Open

rostan-t wants to merge 5 commits into NVIDIA:main from rostan-t:dynamic-mode-skill

Conversation

Collaborator

rostan-t commented Mar 20, 2026

Category:

Other (e.g. Documentation, Tests, Configuration)

Description:

Since dynamic mode is fairly new, AI agents are not very good at writing code that uses it. For instance, according to Anthropic, Claude Sonnet 4.6's knowledge cutoff is August 2026. Even when presented with a few examples, agents miss some dynamic-mode-specific patterns and are not very helpful for writing code with it.

This PR adds a Claude Code skill containing guidelines on how to use dynamic mode. It was generated with the /skill-creator, which also generates evals for the skill. Here are the results of running the evals with Claude Code using Sonnet 4.6:

| # | Eval Task | With Skill | Without Skill |
|---|-----------|------------|---------------|
| 1 | Image classification pipeline | 10/10 | 1/10 |
| 2 | Batch column extraction | 4/4 | 1/4 |
| 3 | Pipeline-to-dynamic conversion | 7/7 | 0/7 |
| 4 | Debugging intermittent corruption | 6/6 | 1/6 |
| 5 | Audio mel spectrogram | 6/6 | 1/6 |
| 6 | Object detection pipeline | 7/7 | 0/7 |
|   | Total (40 assertions) | 40/40 (100%) | 4/40 (10%) |

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR adds a Claude Code skill (SKILL.md) and a set of evals (evals.json + pipeline_to_convert.py) to help AI agents write correct DALI dynamic mode code. The skill documents the imperative nvidia.dali.experimental.dynamic API — covering core data types (Tensor, Batch), stateful readers, device handling, the async execution model, RNG, and a pipeline-mode migration table. The evals validate 6 representative tasks (image classification, batch slicing, pipeline conversion, debugging, audio mel spectrogram, object detection) and the PR reports a 10× improvement in assertion pass-rate when the skill is active.

Key points from the review:

  • The skill content is accurate and well-structured; previously raised issues (batch-size immutability, eval ID gap, missing imports in the eval file, max_batch_size constructor guidance) have all been resolved in this version.
  • The Pipeline Mode Migration table does not include a row for device_id, which appears in pipeline_to_convert.py (device_id=0). An agent converting that file has no guidance on the dynamic mode equivalent and may silently drop the parameter or invent an incorrect API in multi-GPU scenarios.
  • The skill's description field triggers on the phrase 'DALI preprocessing', which is broad enough to activate the skill in plain pipeline-mode contexts, potentially injecting dynamic-mode advice where it was not requested.

Confidence Score: 4/5

  • Safe to merge; contains only documentation/skill files with no runtime code changes.
  • The PR is documentation-only (a Claude Code skill + evals). All previously raised substantive issues have been addressed. The two remaining observations are minor: a missing device_id migration row in the SKILL.md table and an overly broad skill trigger phrase. Neither blocks correctness for the primary use cases covered by the evals.
  • .claude/skills/using-dali-dynamic-mode/SKILL.md — Pipeline Mode Migration table missing device_id entry; skill description trigger phrase could be narrowed.

Important Files Changed

| Filename | Overview |
|----------|----------|
| `.claude/skills/using-dali-dynamic-mode/SKILL.md` | Comprehensive skill document for DALI dynamic mode. Well-structured with core data types, readers, device handling, execution model, RNG, and migration table. Minor gaps: device_id migration not covered; skill trigger phrase "DALI preprocessing" is broad enough to activate in pipeline-mode contexts. |
| `.claude/skills/using-dali-dynamic-mode-workspace/evals/evals.json` | Six well-constructed evals covering the key dynamic mode patterns (image classification, batch slicing, pipeline conversion, debugging, audio mel spectrogram, object detection). IDs are now sequential (1–6). Assertions are specific and actionable. No issues found. |
| `.claude/skills/using-dali-dynamic-mode-workspace/evals/files/pipeline_to_convert.py` | Reference pipeline-mode script used as input for eval 3. Contains all necessary imports and a complete, syntactically valid pipeline with `@pipeline_def`, `fn.readers.file`, `device="mixed"` decode, random augmentation, and `pipe.run()` — exactly the patterns the eval tests for correct conversion. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["import nvidia.dali.experimental.dynamic as ndd"] --> B["ndd.set_num_threads(N)"]
    B --> C["reader = ndd.readers.File(...)\n(created once, stateful)"]
    C --> D["Epoch loop\nfor epoch in range(num_epochs)"]
    D --> E["reader.next_epoch(batch_size=N)\nyields tuples of Batch"]
    E --> F["Batch loop\nfor jpegs, labels in ..."]
    F --> G["ndd.decoders.image(jpegs, device='gpu')\nNOT device='mixed'"]
    G --> H["ndd operators\nndd.resize, ndd.crop_mirror_normalize, etc."]
    H --> I{"Variable-length\nbatch?"}
    I -- "Yes (audio/bbox)" --> J["batch.torch(pad=True)"]
    I -- "No (uniform shapes)" --> K["batch.torch()"]
    J --> L["train_step(...)"]
    K --> L
    L --> F
    F --> D

    subgraph Debugging
        M["with ndd.EvalMode.sync_full:"]
        M --> N["errors surface at exact call site"]
    end

    subgraph RNG
        O["ndd.random.set_seed(42)\nor ndd.random.RNG(seed=42)"]
        O --> P["ndd.random.uniform(batch_size=N, ...)"]
    end
```

Last reviewed commit: "Clarify PyTorch conv..."

Comment on lines +23 to +59
```python
t.torch(copy=False)  # zero-copy PyTorch tensor (default)
t[1:3]               # slicing supported
np.asarray(t)        # NumPy via __array__ (CPU only)
```

Supports `__dlpack__`, `__cuda_array_interface__`, `__array__`, arithmetic operators.

### Batch -- collection of samples (variable shapes OK)

```python
b = ndd.batch([arr1, arr2]) # copy
b = ndd.as_batch(data) # wrap, no copy if possible
```

**Batch has no `__getitem__`** -- `batch[i]` raises `TypeError` because indexing is ambiguous (sample selection vs. per-sample slicing). Use the explicit APIs instead:

| Intent | Method | Returns |
|--------|--------|---------|
| Get sample i | `batch.select(i)` | `Tensor` |
| Get subset of samples | `batch.select(slice_or_list)` | `Batch` |
| Slice within each sample | `batch.slice[...]` | `Batch` (same batch_size) |

`.select()` picks **which samples**. `.slice` indexes **inside each sample**.

```python
xy = ndd.random.uniform(batch_size=16, range=[0, 1], shape=2)
crop_x = xy.slice[0] # Batch of 16 scalars, first element from each sample
crop_y = xy.slice[1] # Batch of 16 scalars, second element from each sample
sample_0 = xy.select(0) # Tensor, the entire first sample [x, y]
```

**PyTorch conversion:**
- `batch.torch()` -- works for uniform shapes; raises for ragged batches
- `batch.torch(pad=True)` -- zero-pads ragged batches to max shape (use for variable-length audio, detection boxes, etc.)
- `batch.torch(copy=None)` is the default (avoids copy if possible)
- Batch has **no `__dlpack__`** -- use `ndd.as_tensor(batch)` first for DLPack consumers. `ndd.as_tensor` supports `pad` as well.
- `Tensor.torch(copy=False)` is default (no copy)
Contributor

P2 Inconsistent copy parameter defaults for Tensor.torch() vs Batch.torch()

The skill documents two different defaults for the same copy parameter, which could lead AI agents to generate subtly incorrect code:

  • Line 23: t.torch(copy=False) # zero-copy PyTorch tensor (default) — implies copy=False is the Tensor default
  • Line 57: batch.torch(copy=None) is the default (avoids copy if possible) — implies copy=None is the Batch default
  • Line 59: Tensor.torch(copy=False) is default (no copy) — restates the Tensor claim

In standard Python/NumPy/PyTorch semantics, copy=False typically means "never copy, raise an error if a copy is required", while copy=None means "copy only if necessary". If Tensor.torch() genuinely defaults to copy=False, that's a stricter contract than Batch.torch()'s copy=None — an agent following this skill might generate Tensor conversion code that raises unexpectedly when a copy is unavoidable, while Batch code silently succeeds.

If both actually default to the same behaviour, one of these needs correcting. Please verify the actual DALI API defaults and align the documentation so agents receive consistent guidance.

Collaborator Author

@greptileai This is the actual DALI behavior documented in both _tensor.py and _batch.py. Batch.torch() accepts None for the copy parameter because the copy cannot be avoided if the batch is not stored as a dense tensor. This doesn't make sense in Tensor.torch() because it's always possible to avoid copying (it takes no other arguments like dtype that could force a copy).

The skill is correct there and helps agents catch this non-trivial difference.

Contributor

Got it — thanks for the clarification! You're right that this is a meaningful API difference, not an inconsistency:

  • Tensor.torch(copy=False) as the default makes sense because a single dense tensor can always be zero-copied
  • Batch.torch(copy=None) as the default is necessary because ragged batches physically require copying to convert to a dense PyTorch tensor

The skill correctly documents this distinction, and agents need to know that Batch.torch() may copy even when not explicitly requested (for ragged data), while Tensor.torch() guarantees zero-copy by default.

The documentation as written will help agents avoid incorrect assumptions about copy behavior when working with ragged batches. No changes needed here.
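The two contracts can be modeled in plain Python. This is a toy sketch, not the DALI API: `ToyTensor` and `ToyBatch` are hypothetical stand-ins that only illustrate why `copy=False` is a safe default for a dense tensor but not for a possibly-ragged batch.

```python
# Toy model of the two `copy` contracts discussed above -- NOT the DALI API.
# ToyTensor.torch(copy=False): zero-copy is always possible, so never copies.
# ToyBatch.torch(copy=None): copies only when the data is ragged.

class ToyTensor:
    def __init__(self, data):
        self.data = data  # a flat list standing in for dense memory

    def torch(self, copy=False):
        # copy=False: zero-copy guarantee; dense storage can always be aliased
        return list(self.data) if copy else self.data


class ToyBatch:
    def __init__(self, samples):
        self.samples = samples  # list of per-sample lists, possibly ragged

    def torch(self, copy=None):
        ragged = len({len(s) for s in self.samples}) > 1
        if copy is False and ragged:
            # a strict "never copy" request cannot be honored for ragged data
            raise ValueError("ragged batch requires a copy")
        if ragged:
            # copy=None: silently copies because densification is unavoidable
            return [list(s) for s in self.samples]
        return self.samples  # uniform shapes: alias, no copy


t = ToyTensor([1, 2, 3])
assert t.torch() is t.data           # zero-copy by default

b = ToyBatch([[1, 2], [3, 4]])
assert b.torch() is b.samples        # uniform: no copy needed

rb = ToyBatch([[1], [2, 3]])
assert rb.torch() is not rb.samples  # ragged: copy=None copies when needed
```

The point of the sketch is the asymmetry: a strict `copy=False` default would make the ragged case raise, which is exactly why the batch-level default has to be the looser `copy=None`.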

| `.cpu()` before GPU model | `.torch()` directly | Avoids wasteful D2H + H2D round-trip |
| Recreate reader each epoch | `reader.next_epoch()` | Readers are stateful -- create once, reuse |
| `ndd.readers.file(...)` | `ndd.readers.File(...)` | Reader classes are PascalCase |
| `break` from `next_epoch()` loop | Exhaust iterator or create new reader | Iterator must be fully consumed before next `next_epoch()` |
Contributor

P2 Misleading "Exhaust iterator" advice after a break

The "Right" column for breaking out of next_epoch() says "Exhaust iterator or create new reader". However, after a break the iterator is already abandoned — you cannot exhaust it without explicitly draining it in a separate loop, which is non-obvious.

An AI agent reading this might attempt to do something like list(iterator) after breaking, which is wasteful. The more actionable guidance is to either avoid breaking in the first place (restructure the loop condition) or, if you must exit early, create a new reader. Consider updating the wording:

Suggested change
| `break` from `next_epoch()` loop | Exhaust iterator or create new reader | Iterator must be fully consumed before next `next_epoch()` |
| `break` from `next_epoch()` loop | Avoid breaking (use a flag/condition instead), or create a new reader | Iterator must be fully consumed before next `next_epoch()` |

Collaborator Author

@greptileai We actually do need to fully consume the iterator before going to the next epoch. Breaking can be legitimate in some cases and the two approaches that work are clearly documented already. Your suggestion ("Using a flag/condition instead") wouldn't help.

Contributor

You're right, I apologize for the confusion. The documentation clearly states the two valid approaches when you need to exit early:

  1. Manually drain the remaining iterator items before calling next_epoch() again
  2. Create a new reader instance

Both are legitimate solutions for cases where breaking is necessary (early stopping, errors, etc.). The table is clear as written.
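The constraint can be demonstrated with a plain generator. This is a stdlib sketch: `FakeReader` is a hypothetical stand-in, not the ndd API, and only mimics the "previous epoch must be fully consumed" rule.

```python
# Sketch of the drain-before-next_epoch constraint -- FakeReader is a
# stand-in for a stateful reader like ndd.readers.File, NOT the real API.

class FakeReader:
    def __init__(self, n_batches):
        self.n_batches = n_batches
        self._epoch = None
        self._epoch_done = True

    def next_epoch(self, batch_size):
        if self._epoch is not None and not self._epoch_done:
            raise RuntimeError("previous epoch iterator not fully consumed")
        self._epoch_done = False

        def gen():
            for i in range(self.n_batches):
                yield [i] * batch_size
            self._epoch_done = True  # flag set only on full exhaustion

        self._epoch = gen()
        return self._epoch


reader = FakeReader(n_batches=3)
it = reader.next_epoch(batch_size=2)
next(it)             # consume one batch, then "break" out early

# Option 1: drain the abandoned iterator before starting the next epoch
for _ in it:
    pass
reader.next_epoch(batch_size=2)  # OK now

# Option 2 (not shown): discard the reader and create a new one
```

Calling `next_epoch()` without the drain loop would raise, which is the failure mode the table row warns about.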

rostan-t force-pushed the dynamic-mode-skill branch from e525b96 to eebd995 on March 20, 2026 at 17:17
Comment on lines +180 to +193
## Pipeline Mode Migration

| Pipeline Mode | Dynamic Mode |
|--------------|--------------|
| `@pipeline_def` / `pipe.build()` / `pipe.run()` | Direct function calls in a loop |
| `fn.readers.file(...)` | `ndd.readers.File(...)` (PascalCase, stateful) |
| `fn.decoders.image(jpegs, device="mixed")` | `ndd.decoders.image(jpegs, device="gpu")` |
| `fn.op_name(...)` | `ndd.op_name(...)` |
| Pipeline-level `batch_size=64` | `reader.next_epoch(batch_size=64)` + random ops `batch_size=64` |
| Pipeline-level `seed=42` | `ndd.random.set_seed(42)` or `ndd.random.RNG(seed=42)` |
| Pipeline-level `num_threads=4` | `ndd.set_num_threads(4)` at startup |
| `output.at(i)` | `batch.select(i)` |
| `output.as_cpu()` | `batch.cpu()` |
| `pipe.run()` returns tuple of `TensorList` | `reader.next_epoch(batch_size=N)` yields tuples of `Batch` |
Contributor

P2 device_id migration not documented

The Pipeline Mode Migration table is missing a row for the device_id parameter. The pipeline_to_convert.py used in eval 3 passes device_id=0 to the pipeline constructor:

```python
pipe = training_pipeline(
    image_dir="/data/images",
    batch_size=64,
    num_threads=4,
    device_id=0,   # <- no equivalent shown in migration table
    seed=42,
)
```

An agent converting this code using the migration table will encounter device_id=0 with no guidance. In multi-GPU scenarios it might silently drop the parameter or invent an incorrect API like ndd.set_device_id(0). Consider adding a row clarifying the dynamic mode equivalent (e.g., CUDA_VISIBLE_DEVICES, torch.cuda.set_device, or that dynamic mode picks up the default device automatically).
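For illustration only: one conventional way to pin a GPU outside any constructor is the `CUDA_VISIBLE_DEVICES` environment variable, set before any CUDA context is created. This is a general CUDA convention, not documented ndd behavior, so the migration table would still need to confirm the intended mechanism.

```python
import os

# Restrict the process to physical GPU 0 before any CUDA library
# (torch, DALI, ...) initializes -- the library then sees it as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

Whichever mechanism dynamic mode actually expects, documenting it in the table would prevent agents from inventing an API for `device_id`.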

Comment on lines +1 to +4
---
name: using-dali-dynamic-mode
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."
---
Contributor

P2 Skill description over-triggers on "DALI preprocessing"

The description includes the trigger phrase 'DALI preprocessing', which is a broad term that applies equally to pipeline-mode workflows. A user working on an existing pipeline-mode codebase who asks "how do I add DALI preprocessing to my model?" would have this skill injected, potentially nudging them toward dynamic mode when they never asked for it.

Consider narrowing the trigger to something more specific, e.g., 'DALI dynamic preprocessing' or 'DALI imperative API', so the skill only activates for genuinely dynamic-mode contexts.

Suggested change
---
name: using-dali-dynamic-mode
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."
---
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI dynamic preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."

```json
{"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.resize, not fn.resize or ndd.fn.resize)"},
{"name": "torch-handoff", "text": "Uses .torch() for PyTorch conversion"},
{"name": "no-unnecessary-evaluate", "text": "No unnecessary .evaluate() calls"},
{"name": "set-num-threads", "text": "Calls ndd.set_num_threads() at startup"}
```
Contributor

Do we really want that?

```python
    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
    std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
)
train_step(images.torch(), labels.torch())
```
Contributor

Is there a way to test this code against going stale?
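One low-tech option is a CI check that extracts the skill's fenced Python examples and syntax-checks them, so the build fails when an example stops parsing. A minimal sketch under assumptions: the helper names are hypothetical, it assumes examples are fenced as `python` blocks, and it only catches syntactic staleness, since semantic drift would require importing DALI and executing the snippets.

```python
# Sketch: extract fenced Python blocks from a markdown skill file and
# syntax-check them with compile(). Only a parse check, not execution.
import re

FENCE = "`" * 3  # build the backtick run so this example nests cleanly

def python_blocks(markdown_text):
    pattern = FENCE + r"python\n(.*?)" + FENCE
    return re.findall(pattern, markdown_text, flags=re.DOTALL)

def check_blocks(markdown_text):
    for i, block in enumerate(python_blocks(markdown_text)):
        compile(block, f"<skill-block-{i}>", "exec")  # SyntaxError if stale

doc = f"Intro.\n{FENCE}python\nx = 1 + 1\n{FENCE}\n"
check_blocks(doc)  # passes: the block still parses
assert python_blocks(doc) == ["x = 1 + 1\n"]
```

Wiring `check_blocks` over `SKILL.md` into the existing Python test suite would give at least a coarse staleness signal.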


4 participants