Inputs: where your files live & how they're matched
The tool is deliberately simple about input: you give it files and/or directories. There is no "environment" concept — a file is a file.
The mental model (three ideas)
- Every file is a source. Each XML you point at becomes one source, labelled by its file path.
- A unit is the recipe's unit element (for Control-M, SMART_FOLDER). A single file may contain many units.
- Comparison happens per unit, across every source that contains it, but only for units present in 2 or more sources. A unit that appears in just one file is left alone (nothing to compare it against).
flowchart TB
A[a.xml] --> U1[(FOLDER_X)] & U2[(FOLDER_Y)]
B[b.xml] --> U1 & U3[(FOLDER_Z)]
C[c.xml] --> U1
U1 --- X{{FOLDER_X in 3 files → compared N-way}}
The layouts you can point at
1. Two (or more) files
xmldiffreport old.xml new.xml -o report.md
xmldiffreport v1.xml v2.xml v3.xml -o report.md # as many as you like
2. A directory (scanned recursively)
3. A mix of files and directories
4. From a config (the usage harness)
When you'd rather keep the paths and output settings in a file, use the
usage harness: a config.toml with an inputs list (files and/or
dirs).
# usage/config.toml
recipe = "controlm"
report_dir = "reports"
inputs = ["/data/ctm/uat", "/data/ctm/bench", "/data/ctm/prod"]
How discovery works (the exact rules)
- A file argument is taken as-is.
- A directory argument is scanned recursively for *.xml; each match is a source. Pass several directories and they all contribute.
- Every source is labelled by its file path, which becomes the column header in the report. (If it matters which file is production, name it accordingly.)
- A file may hold many units; the engine indexes them all.
- Only units present in ≥ 2 sources are diffed. Identical content across sources produces no rows (it's not a difference).
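The first two rules above can be sketched with `pathlib`. This is an illustration of the stated behaviour, not the tool's own code; `discover_sources` is a hypothetical name, and sorting the matches is an added assumption to keep the order stable.

```python
from pathlib import Path

def discover_sources(args: list[str]) -> list[Path]:
    """Expand file/directory arguments into a flat list of source files."""
    sources: list[Path] = []
    for arg in args:
        path = Path(arg)
        if path.is_dir():
            # Directory: scan recursively for *.xml (sorted for a stable order).
            sources.extend(sorted(path.rglob("*.xml")))
        else:
            # File: taken as-is, whatever its name or extension.
            sources.append(path)
    return sources
```

For example, `discover_sources(["/data/ctm/uat", "old.xml"])` would return every `*.xml` under `/data/ctm/uat` followed by `old.xml`; each returned path is then one source, labelled by that path.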
Worked example (end to end)
The synthetic dataset shipped in examples/controlm/ is just a tree of XML files:
examples/controlm/
├── test/ patch-d.xml (GLX_NIGHTLY_START, GLX_DISK_CHECK)
├── uat/ patch-b.xml (GLX_INGEST_DAILY, GLX_SUMMARY_DAILY, GLX_LEDGER_DAILY)
│ patch-e.xml (GLX_RISK_SCAN)
├── bench/ patch-a.xml (GLX_INGEST_DAILY, GLX_SUMMARY_DAILY, GLX_PRICING_DAILY, GLX_LEDGER_DAILY)
│ patch-x.xml (GLX_RISK_SCAN)
└── prod/ hotfix-c.xml (GLX_INGEST_DAILY, GLX_PRICING_DAILY)
→ 5 unit(s) with differences across 6 file(s). The report's summary:
| Unit | In how many files | Why |
|---|---|---|
| GLX_INGEST_DAILY | 3 | present in bench/patch-a, uat/patch-b, prod/hotfix-c and differs |
| GLX_SUMMARY_DAILY | 2 | in uat/patch-b and bench/patch-a, differ at folder level |
| GLX_LEDGER_DAILY | 2 | in uat/patch-b and bench/patch-a, folder + a shared job |
| GLX_PRICING_DAILY | 2 | in bench/patch-a and prod/hotfix-c, a job differs |
| GLX_RISK_SCAN | 2 | in uat/patch-e and bench/patch-x, one extra INCOND |
GLX_NIGHTLY_START and GLX_DISK_CHECK exist in only one file → not reported.
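The "In how many files" column can be re-derived from the tree listing alone. The sketch below hard-codes the file-to-unit mapping from that listing (it does not parse any XML) and only counts presence; whether a shared unit actually differs is a separate step. `sources_per_unit` is an illustrative helper name.

```python
from collections import defaultdict

# File -> unit names, copied from the examples/controlm/ listing above.
FILES = {
    "test/patch-d.xml": ["GLX_NIGHTLY_START", "GLX_DISK_CHECK"],
    "uat/patch-b.xml": ["GLX_INGEST_DAILY", "GLX_SUMMARY_DAILY", "GLX_LEDGER_DAILY"],
    "uat/patch-e.xml": ["GLX_RISK_SCAN"],
    "bench/patch-a.xml": ["GLX_INGEST_DAILY", "GLX_SUMMARY_DAILY",
                          "GLX_PRICING_DAILY", "GLX_LEDGER_DAILY"],
    "bench/patch-x.xml": ["GLX_RISK_SCAN"],
    "prod/hotfix-c.xml": ["GLX_INGEST_DAILY", "GLX_PRICING_DAILY"],
}

def sources_per_unit(files):
    """Invert file -> units into unit -> list of files containing it."""
    index = defaultdict(list)
    for path, units in files.items():
        for unit in units:
            index[unit].append(path)
    return dict(index)

index = sources_per_unit(FILES)
shared = {unit: paths for unit, paths in index.items() if len(paths) >= 2}
single = sorted(unit for unit, paths in index.items() if len(paths) == 1)

for unit, paths in shared.items():
    print(f"{unit}: {len(paths)} files")
print("not compared:", single)
```

This yields five shared units (one in 3 files, four in 2 files) and the two single-file units, matching the summary.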
Gotchas (read this if a result surprises you)
- “My unit doesn't show up.” It's present in only one file. You need the same unit in ≥ 2 files to get a comparison.
- “Two identical copies, nothing reported.” Correct — identical content (ignoring volatile attributes) is not a difference.
- Large files: each file is parsed into memory; fine up to tens of MB. See Performance & scale below.
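To make "identical, ignoring volatile attributes" concrete, here is a minimal sketch using `xml.etree.ElementTree`. The attribute names in `VOLATILE` are placeholders chosen for illustration (the page does not list which attributes the tool treats as volatile), and `canonical` is a hypothetical helper, not the engine's comparison routine.

```python
import xml.etree.ElementTree as ET

# Hypothetical volatile attribute names, for illustration only.
VOLATILE = {"LAST_UPLOAD", "VERSION_SERIAL"}

def canonical(elem, volatile=VOLATILE):
    """Return a nested tuple of the element with volatile attributes dropped."""
    attrs = tuple(sorted((k, v) for k, v in elem.attrib.items() if k not in volatile))
    text = (elem.text or "").strip()
    # Child order is kept, so reordered children count as a difference here.
    children = tuple(canonical(child, volatile) for child in elem)
    return (elem.tag, attrs, text, children)

a = ET.fromstring('<SMART_FOLDER FOLDER_NAME="FOLDER_X" LAST_UPLOAD="2024-01-01">'
                  '<JOB JOBNAME="J1"/></SMART_FOLDER>')
b = ET.fromstring('<SMART_FOLDER FOLDER_NAME="FOLDER_X" LAST_UPLOAD="2024-06-30">'
                  '<JOB JOBNAME="J1"/></SMART_FOLDER>')
c = ET.fromstring('<SMART_FOLDER FOLDER_NAME="FOLDER_X"><JOB JOBNAME="J2"/></SMART_FOLDER>')

print(canonical(a) == canonical(b))  # True: only a volatile attribute differs
print(canonical(a) == canonical(c))  # False: a job attribute differs
```

Under this model, copies `a` and `b` would produce no report rows, while `a` and `c` would.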
Performance & scale
The cost model is simple: parse every file → index units by (tag, key) →
deep-compare only the units present in ≥ 2 sources. Time is roughly linear in
total input; the report size tracks the changes, not the input.
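The three steps of the cost model can be sketched on in-memory XML strings. This is an illustration under assumptions: `FOLDER_NAME` as the key attribute and the `DEFTABLE` wrapper are assumed for the example, and the standard library's `ET.canonicalize` stands in for the engine's own deep compare.

```python
import xml.etree.ElementTree as ET

UNIT_TAG = "SMART_FOLDER"   # unit element named above for Control-M
KEY_ATTR = "FOLDER_NAME"    # assumed key attribute, for illustration

SOURCES = {
    "a.xml": '<DEFTABLE><SMART_FOLDER FOLDER_NAME="FOLDER_X" RUN_AS="batch"/>'
             '<SMART_FOLDER FOLDER_NAME="FOLDER_Y"/></DEFTABLE>',
    "b.xml": '<DEFTABLE><SMART_FOLDER FOLDER_NAME="FOLDER_X" RUN_AS="ops"/>'
             '<SMART_FOLDER FOLDER_NAME="FOLDER_Z"/></DEFTABLE>',
    "c.xml": '<DEFTABLE><SMART_FOLDER RUN_AS="batch" FOLDER_NAME="FOLDER_X"/></DEFTABLE>',
}

def index_units(sources):
    """Map (tag, key) -> {source label: unit element} over all sources."""
    index = {}
    for label, xml_text in sources.items():
        root = ET.fromstring(xml_text)              # step 1: parse every file
        for unit in root.iter(UNIT_TAG):            # step 2: index by (tag, key)
            key = (unit.tag, unit.get(KEY_ATTR))
            index.setdefault(key, {})[label] = unit
    return index

def differing_units(index):
    """Return keys of units found in >= 2 sources whose content differs."""
    changed = []
    for key, by_source in index.items():
        if len(by_source) < 2:
            continue                                # single-source unit: skipped
        forms = {ET.canonicalize(ET.tostring(u, encoding="unicode"), strip_text=True)
                 for u in by_source.values()}       # step 3: deep-compare
        if len(forms) > 1:
            changed.append(key)
    return changed

print(differing_units(index_units(SOURCES)))
```

Only FOLDER_X is compared, since FOLDER_Y and FOLDER_Z each live in a single source; it is reported because `b.xml` carries a different `RUN_AS` value. Attribute order alone (as in `c.xml`) does not count as a difference after canonicalization.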
Measured on synthetic data (Apple silicon, Python 3.14):
| Input | Folders | Jobs | Time | Peak RSS |
|---|---|---|---|---|
| 17 files, little overlap | 438 | ~1.3k | 0.05 s | 26 MB |
| 2 × 2.8 MB | 16 000 | 80 000 | 0.35 s | 75 MB |
| 2 × 7.3 MB | 40 000 | 200 000 | 0.83 s | 153 MB |
Rules of thumb:
- Time scales linearly: a pair of ~7 MB files diffs in under a second (0.83 s in the table above).
- Memory is the ceiling: roughly 10× the total XML bytes, because every parsed tree is held at once to find overlaps. It sums across all files, not just the largest. Comfortable to tens of MB; not designed for gigabytes.
- N-way width: a unit found in K files renders a K-column table — only the files that contain that unit. Very wide tables read better as HTML.
- Many files, little overlap: all files are parsed, but only the units that appear in ≥ 2 files are reported — the rest are ignored cheaply. 17 files where only 3 unit names overlap → a 3-row report.
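The memory rule of thumb is simple arithmetic, sketched below against the two larger measurements from the table. The function name and the fixed factor of 10 are illustrative; the estimate ignores any fixed interpreter baseline, which is presumably why it runs low for the smaller input.

```python
def estimate_peak_memory_mb(file_sizes_mb, factor=10.0):
    """Estimate peak memory as factor x the summed XML size of all files."""
    return factor * sum(file_sizes_mb)

# (input file sizes in MB, measured peak RSS in MB) from the table above.
for sizes, measured in [([2.8, 2.8], 75), ([7.3, 7.3], 153)]:
    estimate = estimate_peak_memory_mb(sizes)
    print(f"{sizes}: estimated ~{estimate:.0f} MB, measured {measured} MB")
```

The sum runs over every input file, so ten 5 MB files cost about as much as one 50 MB file.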