
Add skill and evals for dynamic mode usage #6271

Open

rostan-t wants to merge 5 commits into NVIDIA:main from rostan-t:dynamic-mode-skill

Conversation

Collaborator

rostan-t commented Mar 20, 2026

Category:

Other (e.g. Documentation, Tests, Configuration)

Description:

Since dynamic mode is fairly new, AI agents are not very good at writing code that uses it. For instance, according to Anthropic, Claude Sonnet 4.6's knowledge cutoff is August 2026. Even when presented with a few examples, agents miss some dynamic-mode-specific patterns and are not very helpful for writing code with it.

This PR adds a Claude Code skill containing guidelines on how to use dynamic mode. It was generated with the /skill-creator, which also generates evals for the skill. Here are the results of running the evals with Claude Code using Sonnet 4.6:

| # | Eval Task | With Skill | Without Skill |
|---|-----------|------------|---------------|
| 1 | Image classification pipeline | 10/10 | 1/10 |
| 2 | Batch column extraction | 4/4 | 1/4 |
| 3 | Pipeline-to-dynamic conversion | 7/7 | 0/7 |
| 4 | Debugging intermittent corruption | 6/6 | 1/6 |
| 5 | Audio mel spectrogram | 6/6 | 1/6 |
| 6 | Object detection pipeline | 7/7 | 0/7 |
|   | Total (40 assertions) | 40/40 (100%) | 4/40 (10%) |

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR adds a Claude Code skill (SKILL.md) and a set of evals (evals.json + pipeline_to_convert.py) to help AI agents write correct DALI dynamic mode code. The skill documents the imperative nvidia.dali.experimental.dynamic API — covering core data types (Tensor, Batch), stateful readers, device handling, the async execution model, RNG, and a pipeline-mode migration table. The evals validate 6 representative tasks (image classification, batch slicing, pipeline conversion, debugging, audio mel spectrogram, object detection) and the PR reports a 10× improvement in assertion pass-rate when the skill is active.

Key points from the review:

  • The skill content is accurate and well-structured; previously raised issues (batch-size immutability, eval ID gap, missing imports in the eval file, max_batch_size constructor guidance) have all been resolved in this version.
  • The Pipeline Mode Migration table does not include a row for device_id, which appears in pipeline_to_convert.py (device_id=0). An agent converting that file has no guidance on the dynamic mode equivalent and may silently drop the parameter or invent an incorrect API in multi-GPU scenarios.
  • The skill's description field triggers on the phrase 'DALI preprocessing', which is broad enough to activate the skill in plain pipeline-mode contexts, potentially injecting dynamic-mode advice where it was not requested.

Confidence Score: 4/5

  • Safe to merge; contains only documentation/skill files with no runtime code changes.
  • The PR is documentation-only (a Claude Code skill + evals). All previously raised substantive issues have been addressed. The two remaining observations are minor: a missing device_id migration row in the SKILL.md table and an overly broad skill trigger phrase. Neither blocks correctness for the primary use cases covered by the evals.
  • .claude/skills/using-dali-dynamic-mode/SKILL.md — Pipeline Mode Migration table missing device_id entry; skill description trigger phrase could be narrowed.

Important Files Changed

| Filename | Overview |
|----------|----------|
| `.claude/skills/using-dali-dynamic-mode/SKILL.md` | Comprehensive skill document for DALI dynamic mode. Well-structured with core data types, readers, device handling, execution model, RNG, and migration table. Minor gaps: device_id migration not covered; skill trigger phrase "DALI preprocessing" is broad enough to activate in pipeline-mode contexts. |
| `.claude/skills/using-dali-dynamic-mode-workspace/evals/evals.json` | Six well-constructed evals covering the key dynamic mode patterns (image classification, batch slicing, pipeline conversion, debugging, audio mel spectrogram, object detection). IDs are now sequential (1–6). Assertions are specific and actionable. No issues found. |
| `.claude/skills/using-dali-dynamic-mode-workspace/evals/files/pipeline_to_convert.py` | Reference pipeline-mode script used as input for eval 3. Contains all necessary imports and a complete, syntactically valid pipeline with `@pipeline_def`, `fn.readers.file`, `device="mixed"` decode, random augmentation, and `pipe.run()` — exactly the patterns the eval tests for correct conversion. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["import nvidia.dali.experimental.dynamic as ndd"] --> B["ndd.set_num_threads(N)"]
    B --> C["reader = ndd.readers.File(...)\n(created once, stateful)"]
    C --> D["Epoch loop\nfor epoch in range(num_epochs)"]
    D --> E["reader.next_epoch(batch_size=N)\nyields tuples of Batch"]
    E --> F["Batch loop\nfor jpegs, labels in ..."]
    F --> G["ndd.decoders.image(jpegs, device='gpu')\nNOT device='mixed'"]
    G --> H["ndd operators\nndd.resize, ndd.crop_mirror_normalize, etc."]
    H --> I{"Variable-length\nbatch?"}
    I -- "Yes (audio/bbox)" --> J["batch.torch(pad=True)"]
    I -- "No (uniform shapes)" --> K["batch.torch()"]
    J --> L["train_step(...)"]
    K --> L
    L --> F
    F --> D

    subgraph Debugging
        M["with ndd.EvalMode.sync_full:"]
        M --> N["errors surface at exact call site"]
    end

    subgraph RNG
        O["ndd.random.set_seed(42)\nor ndd.random.RNG(seed=42)"]
        O --> P["ndd.random.uniform(batch_size=N, ...)"]
    end
```

Last reviewed commit: "Clarify PyTorch conv..."

Comment on lines +23 to +59
```python
t.torch(copy=False)  # zero-copy PyTorch tensor (default)
t[1:3]               # slicing supported
np.asarray(t)        # NumPy via __array__ (CPU only)
```

Supports `__dlpack__`, `__cuda_array_interface__`, `__array__`, arithmetic operators.

### Batch -- collection of samples (variable shapes OK)

```python
b = ndd.batch([arr1, arr2]) # copy
b = ndd.as_batch(data) # wrap, no copy if possible
```

**Batch has no `__getitem__`** -- `batch[i]` raises `TypeError` because indexing is ambiguous (sample selection vs. per-sample slicing). Use the explicit APIs instead:

| Intent | Method | Returns |
|--------|--------|---------|
| Get sample i | `batch.select(i)` | `Tensor` |
| Get subset of samples | `batch.select(slice_or_list)` | `Batch` |
| Slice within each sample | `batch.slice[...]` | `Batch` (same batch_size) |

`.select()` picks **which samples**. `.slice` indexes **inside each sample**.

```python
xy = ndd.random.uniform(batch_size=16, range=[0, 1], shape=2)
crop_x = xy.slice[0] # Batch of 16 scalars, first element from each sample
crop_y = xy.slice[1] # Batch of 16 scalars, second element from each sample
sample_0 = xy.select(0) # Tensor, the entire first sample [x, y]
```

**PyTorch conversion:**
- `batch.torch()` -- works for uniform shapes; raises for ragged batches
- `batch.torch(pad=True)` -- zero-pads ragged batches to max shape (use for variable-length audio, detection boxes, etc.)
- `batch.torch(copy=None)` is the default (avoids copy if possible)
- Batch has **no `__dlpack__`** -- use `ndd.as_tensor(batch)` first for DLPack consumers. `ndd.as_tensor` supports `pad` as well.
- `Tensor.torch(copy=False)` is default (no copy)
Contributor

P2 Inconsistent copy parameter defaults for Tensor.torch() vs Batch.torch()

The skill documents two different defaults for the same copy parameter, which could lead AI agents to generate subtly incorrect code:

  • Line 23: t.torch(copy=False) # zero-copy PyTorch tensor (default) — implies copy=False is the Tensor default
  • Line 57: batch.torch(copy=None) is the default (avoids copy if possible) — implies copy=None is the Batch default
  • Line 59: Tensor.torch(copy=False) is default (no copy) — restates the Tensor claim

In standard Python/NumPy/PyTorch semantics, copy=False typically means "never copy, raise an error if a copy is required", while copy=None means "copy only if necessary". If Tensor.torch() genuinely defaults to copy=False, that's a stricter contract than Batch.torch()'s copy=None — an agent following this skill might generate Tensor conversion code that raises unexpectedly when a copy is unavoidable, while Batch code silently succeeds.

If both actually default to the same behaviour, one of these needs correcting. Please verify the actual DALI API defaults and align the documentation so agents receive consistent guidance.

Collaborator Author

@greptileai This is the actual DALI behavior documented in both _tensor.py and _batch.py. Batch.torch() accepts None for the copy parameter because the copy cannot be avoided if the batch is not stored as a dense tensor. This doesn't make sense in Tensor.torch() because it's always possible to avoid copying (it takes no other arguments like dtype that could force a copy).

The skill is correct there and helps agents catch this non-trivial difference.

Contributor

Got it — thanks for the clarification! You're right that this is a meaningful API difference, not an inconsistency:

  • Tensor.torch(copy=False) as the default makes sense because a single dense tensor can always be zero-copied
  • Batch.torch(copy=None) as the default is necessary because ragged batches physically require copying to convert to a dense PyTorch tensor

The skill correctly documents this distinction, and agents need to know that Batch.torch() may copy even when not explicitly requested (for ragged data), while Tensor.torch() guarantees zero-copy by default.

The documentation as written will help agents avoid incorrect assumptions about copy behavior when working with ragged batches. No changes needed here.
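The two contracts can be modeled in plain Python. This is a toy sketch, not the DALI API: `ToyTensor` and `ToyBatch` are hypothetical stand-ins that only illustrate why `copy=False` is a safe default for a dense tensor but not for a possibly-ragged batch.

```python
# Toy model of the two `copy` contracts discussed above -- NOT the DALI API.
# ToyTensor.torch(copy=False): zero-copy is always possible, so never copies.
# ToyBatch.torch(copy=None): copies only when the data is ragged.

class ToyTensor:
    def __init__(self, data):
        self.data = data  # a flat list standing in for dense memory

    def torch(self, copy=False):
        # copy=False: zero-copy guarantee; dense storage can always be aliased
        return list(self.data) if copy else self.data


class ToyBatch:
    def __init__(self, samples):
        self.samples = samples  # list of per-sample lists, possibly ragged

    def torch(self, copy=None):
        ragged = len({len(s) for s in self.samples}) > 1
        if copy is False and ragged:
            # a strict "never copy" request cannot be honored for ragged data
            raise ValueError("ragged batch requires a copy")
        if ragged:
            # copy=None: silently copies because densification is unavoidable
            return [list(s) for s in self.samples]
        return self.samples  # uniform shapes: alias, no copy


t = ToyTensor([1, 2, 3])
assert t.torch() is t.data           # zero-copy by default

b = ToyBatch([[1, 2], [3, 4]])
assert b.torch() is b.samples        # uniform: no copy needed

rb = ToyBatch([[1], [2, 3]])
assert rb.torch() is not rb.samples  # ragged: copy=None copies when needed
```

The point of the sketch is the asymmetry: a strict `copy=False` default would make the ragged case raise, which is exactly why the batch-level default has to be the looser `copy=None`.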

| `.cpu()` before GPU model | `.torch()` directly | Avoids wasteful D2H + H2D round-trip |
| Recreate reader each epoch | `reader.next_epoch()` | Readers are stateful -- create once, reuse |
| `ndd.readers.file(...)` | `ndd.readers.File(...)` | Reader classes are PascalCase |
| `break` from `next_epoch()` loop | Exhaust iterator or create new reader | Iterator must be fully consumed before next `next_epoch()` |
Contributor

P2 Misleading "Exhaust iterator" advice after a break

The "Right" column for breaking out of next_epoch() says "Exhaust iterator or create new reader". However, after a break the iterator is already abandoned — you cannot exhaust it without explicitly draining it in a separate loop, which is non-obvious.

An AI agent reading this might attempt to do something like list(iterator) after breaking, which is wasteful. The more actionable guidance is to either avoid breaking in the first place (restructure the loop condition) or, if you must exit early, create a new reader. Consider updating the wording:

Suggested change
| `break` from `next_epoch()` loop | Exhaust iterator or create new reader | Iterator must be fully consumed before next `next_epoch()` |
| `break` from `next_epoch()` loop | Avoid breaking (use a flag/condition instead), or create a new reader | Iterator must be fully consumed before next `next_epoch()` |

Collaborator Author

@greptileai We actually do need to fully consume the iterator before going to the next epoch. Breaking can be legitimate in some cases and the two approaches that work are clearly documented already. Your suggestion ("Using a flag/condition instead") wouldn't help.

Contributor

You're right, I apologize for the confusion. The documentation clearly states the two valid approaches when you need to exit early:

  1. Manually drain the remaining iterator items before calling next_epoch() again
  2. Create a new reader instance

Both are legitimate solutions for cases where breaking is necessary (early stopping, errors, etc.). The table is clear as written.
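The constraint can be demonstrated with a plain generator. This is a stdlib sketch: `FakeReader` is a hypothetical stand-in, not the ndd API, and only mimics the "previous epoch must be fully consumed" rule.

```python
# Sketch of the drain-before-next_epoch constraint -- FakeReader is a
# stand-in for a stateful reader like ndd.readers.File, NOT the real API.

class FakeReader:
    def __init__(self, n_batches):
        self.n_batches = n_batches
        self._epoch = None
        self._epoch_done = True

    def next_epoch(self, batch_size):
        if self._epoch is not None and not self._epoch_done:
            raise RuntimeError("previous epoch iterator not fully consumed")
        self._epoch_done = False

        def gen():
            for i in range(self.n_batches):
                yield [i] * batch_size
            self._epoch_done = True  # flag set only on full exhaustion

        self._epoch = gen()
        return self._epoch


reader = FakeReader(n_batches=3)
it = reader.next_epoch(batch_size=2)
next(it)             # consume one batch, then "break" out early

# Option 1: drain the abandoned iterator before starting the next epoch
for _ in it:
    pass
reader.next_epoch(batch_size=2)  # OK now

# Option 2 (not shown): discard the reader and create a new one
```

Calling `next_epoch()` without the drain loop would raise, which is the failure mode the table row warns about.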

rostan-t force-pushed the dynamic-mode-skill branch from e525b96 to eebd995 on March 20, 2026 at 17:17
Comment on lines +180 to +193
## Pipeline Mode Migration

| Pipeline Mode | Dynamic Mode |
|--------------|--------------|
| `@pipeline_def` / `pipe.build()` / `pipe.run()` | Direct function calls in a loop |
| `fn.readers.file(...)` | `ndd.readers.File(...)` (PascalCase, stateful) |
| `fn.decoders.image(jpegs, device="mixed")` | `ndd.decoders.image(jpegs, device="gpu")` |
| `fn.op_name(...)` | `ndd.op_name(...)` |
| Pipeline-level `batch_size=64` | `reader.next_epoch(batch_size=64)` + random ops `batch_size=64` |
| Pipeline-level `seed=42` | `ndd.random.set_seed(42)` or `ndd.random.RNG(seed=42)` |
| Pipeline-level `num_threads=4` | `ndd.set_num_threads(4)` at startup |
| `output.at(i)` | `batch.select(i)` |
| `output.as_cpu()` | `batch.cpu()` |
| `pipe.run()` returns tuple of `TensorList` | `reader.next_epoch(batch_size=N)` yields tuples of `Batch` |
Contributor

P2 device_id migration not documented

The Pipeline Mode Migration table is missing a row for the device_id parameter. The pipeline_to_convert.py used in eval 3 passes device_id=0 to the pipeline constructor:

```python
pipe = training_pipeline(
    image_dir="/data/images",
    batch_size=64,
    num_threads=4,
    device_id=0,   # <- no equivalent shown in migration table
    seed=42,
)
```

An agent converting this code using the migration table will encounter device_id=0 with no guidance. In multi-GPU scenarios it might silently drop the parameter or invent an incorrect API like ndd.set_device_id(0). Consider adding a row clarifying the dynamic mode equivalent (e.g., CUDA_VISIBLE_DEVICES, torch.cuda.set_device, or that dynamic mode picks up the default device automatically).
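For illustration only: one conventional way to pin a GPU outside any constructor is the `CUDA_VISIBLE_DEVICES` environment variable, set before any CUDA context is created. This is a general CUDA convention, not documented ndd behavior, so the migration table would still need to confirm the intended mechanism.

```python
import os

# Restrict the process to physical GPU 0 before any CUDA library
# (torch, DALI, ...) initializes -- the library then sees it as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

Whichever mechanism dynamic mode actually expects, documenting it in the table would prevent agents from inventing an API for `device_id`.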

Comment on lines +1 to +4
---
name: using-dali-dynamic-mode
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."
---
Contributor

P2 Skill description over-triggers on "DALI preprocessing"

The description includes the trigger phrase 'DALI preprocessing', which is a broad term that applies equally to pipeline-mode workflows. A user working on an existing pipeline-mode codebase who asks "how do I add DALI preprocessing to my model?" would have this skill injected, potentially nudging them toward dynamic mode when they never asked for it.

Consider narrowing the trigger to something more specific, e.g., 'DALI dynamic preprocessing' or 'DALI imperative API', so the skill only activates for genuinely dynamic-mode contexts.

Suggested change
---
name: using-dali-dynamic-mode
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."
---
description: "Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', 'DALI dynamic preprocessing', or wants to load/augment data with DALI outside of a pipeline definition."

```json
{"name": "no-pipeline-mode", "text": "No pipeline-mode constructs (no @pipeline_def, pipe.build(), pipe.run()) and operators called directly on ndd (e.g. ndd.resize, not fn.resize or ndd.fn.resize)"},
{"name": "torch-handoff", "text": "Uses .torch() for PyTorch conversion"},
{"name": "no-unnecessary-evaluate", "text": "No unnecessary .evaluate() calls"},
{"name": "set-num-threads", "text": "Calls ndd.set_num_threads() at startup"}
```
Contributor

Do we really want that?

```python
    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
    std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
)
train_step(images.torch(), labels.torch())
```
Contributor

Is there a way to test this code against going stale?
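One low-tech option is a CI check that extracts the skill's fenced Python examples and syntax-checks them, so the build fails when an example stops parsing. A minimal sketch under assumptions: the helper names are hypothetical, it assumes examples are fenced as `python` blocks, and it only catches syntactic staleness, since semantic drift would require importing DALI and executing the snippets.

```python
# Sketch: extract fenced Python blocks from a markdown skill file and
# syntax-check them with compile(). Only a parse check, not execution.
import re

FENCE = "`" * 3  # build the backtick run so this example nests cleanly

def python_blocks(markdown_text):
    pattern = FENCE + r"python\n(.*?)" + FENCE
    return re.findall(pattern, markdown_text, flags=re.DOTALL)

def check_blocks(markdown_text):
    for i, block in enumerate(python_blocks(markdown_text)):
        compile(block, f"<skill-block-{i}>", "exec")  # SyntaxError if stale

doc = f"Intro.\n{FENCE}python\nx = 1 + 1\n{FENCE}\n"
check_blocks(doc)  # passes: the block still parses
assert python_blocks(doc) == ["x = 1 + 1\n"]
```

Wiring `check_blocks` over `SKILL.md` into the existing Python test suite would give at least a coarse staleness signal.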


4 participants