ML-Dash

Background Buffering

ML-Dash writes are non-blocking. Logs, metrics, tracks, and files are queued and flushed from a background daemon thread so the hot path stays fast.

Flush Triggers

Buffered data flushes when any of these occur:

  1. Time-based: every flush_interval seconds (default 5.0).
  2. Size-based: when a queue reaches its batch size (default 100 items per queue).
  3. Manual: experiment.flush() blocks until queues drain.
  4. Context exit: leaving the Experiment(...).run context waits for a full drain (no timeout).
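
The four triggers above can be pictured with a small standard-library sketch. This is not ML-Dash's implementation: the class name BufferedQueue, the sink callable, and the two-lock layout are all hypothetical, chosen only to show how a daemon thread can combine a timer, a size threshold, a blocking manual flush, and a drain on context exit.

```python
import threading


class BufferedQueue:
    """Hypothetical buffer that flushes on size, on a timer, on demand, or on exit."""

    def __init__(self, sink, batch_size=100, flush_interval=5.0):
        self._sink = sink                    # callable that receives one list per batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._items = []
        self._items_lock = threading.Lock()  # guards self._items
        self._send_lock = threading.Lock()   # keeps batches in order across threads
        self._wake = threading.Event()       # set when a batch is full
        self._stop = threading.Event()
        # Daemon thread: it never keeps the interpreter alive by itself.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def put(self, item):
        # Hot path: append and return; the worker thread does the sending.
        with self._items_lock:
            self._items.append(item)
            if len(self._items) >= self._batch_size:
                self._wake.set()             # trigger 2: size-based

    def flush(self):
        # Trigger 3: manual. Blocks the caller until pending items are sent.
        with self._send_lock:
            with self._items_lock:
                batch, self._items = self._items, []
            if batch:
                self._sink(batch)

    def _run(self):
        while not self._stop.is_set():
            # Trigger 1: time-based. wait() returns early if a batch fills up.
            self._wake.wait(self._flush_interval)
            self._wake.clear()
            self.flush()

    def close(self):
        self._stop.set()
        self._wake.set()                     # wake the worker so it can exit
        self._worker.join()
        self.flush()                         # final drain

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()                         # trigger 4: context exit
```

In this sketch put() never calls the sink itself, which is what keeps the caller's path cheap. Only flush() and close() block the calling thread.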

Forcing a Flush

Call flush() before any action that depends on uploads being durable (checkpoint markers, downstream readers, etc.):

python
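import torch
from ml_dash import Experiment

# `model` and `loss` come from your training loop.
with Experiment("my-project/exp").run as experiment:
    experiment.metrics("train").log(loss=loss)
    experiment.flush()  # block until queued data has been uploaded
    torch.save(model, "checkpoint.pt")  # runs only after the flush returns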

Configuration

Environment Variables

bash
export ML_DASH_BUFFER_ENABLED=true       # default: true
export ML_DASH_FLUSH_INTERVAL=5.0        # seconds
export ML_DASH_LOG_BATCH_SIZE=100
export ML_DASH_METRIC_BATCH_SIZE=100
export ML_DASH_TRACK_BATCH_SIZE=100
export ML_DASH_FILE_UPLOAD_WORKERS=4

Programmatic

python
from ml_dash import Experiment
from ml_dash.buffer import BufferConfig

config = BufferConfig(
    flush_interval=10.0,
    log_batch_size=200,
    metric_batch_size=500,
    file_upload_workers=8,
)

with Experiment("my-project/exp", buffer_config=config).run as exp:
    exp.log("custom buffer config")

Pass BufferConfig(buffer_enabled=False) to make every write synchronous. This is useful only for debugging.
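
To show how the ML_DASH_* environment variables could resolve into one settings object, here is a minimal sketch. BufferSettings, parse_bool, and settings_from_env are hypothetical names and not part of ml_dash. The boolean parsing rule and the default of 4 upload workers are assumptions; the other defaults come from the values documented above.

```python
import os
from dataclasses import dataclass


@dataclass
class BufferSettings:
    """Hypothetical stand-in for BufferConfig, using the values shown above."""
    buffer_enabled: bool = True
    flush_interval: float = 5.0
    log_batch_size: int = 100
    metric_batch_size: int = 100
    track_batch_size: int = 100
    file_upload_workers: int = 4  # assumed default, taken from the example value


def parse_bool(raw):
    # Assumed rule: common truthy spellings, case-insensitive.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def settings_from_env(env=None):
    """Build settings from ML_DASH_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = BufferSettings()
    return BufferSettings(
        buffer_enabled=parse_bool(env.get("ML_DASH_BUFFER_ENABLED", str(d.buffer_enabled))),
        flush_interval=float(env.get("ML_DASH_FLUSH_INTERVAL", d.flush_interval)),
        log_batch_size=int(env.get("ML_DASH_LOG_BATCH_SIZE", d.log_batch_size)),
        metric_batch_size=int(env.get("ML_DASH_METRIC_BATCH_SIZE", d.metric_batch_size)),
        track_batch_size=int(env.get("ML_DASH_TRACK_BATCH_SIZE", d.track_batch_size)),
        file_upload_workers=int(env.get("ML_DASH_FILE_UPLOAD_WORKERS", d.file_upload_workers)),
    )
```

Passing a plain dict as env, instead of reading os.environ, makes the resolution easy to check in a unit test.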

Context Exit

On __exit__, the buffer manager drains all queues before returning. Expect a short pause and console output like:

[ML-Dash] Flushing buffered data...
[ML-Dash]   - 1000 log(s), 100 metric(s), 50 track(s), 10 file(s)
[ML-Dash]   Uploading 10 file(s)...
[ML-Dash] All data flushed successfully

Upload failures are logged as warnings rather than raised, so a flaky network won't crash training. Authentication errors usually mean you need to re-run ml-dash login.
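
The warn-instead-of-raise behavior can be sketched with the standard logging module. The logger name, the flush_batches helper, and the upload callable are hypothetical; ML-Dash's real error handling may differ in detail.

```python
import logging

logger = logging.getLogger("buffer_sketch")  # hypothetical logger name


def flush_batches(batches, upload):
    """Try every batch; log failures as warnings instead of raising."""
    failed = 0
    for batch in batches:
        try:
            upload(batch)
        except Exception as exc:  # deliberately broad: training must not crash
            failed += 1
            logger.warning("upload of %d item(s) failed: %s", len(batch), exc)
    return failed
```

A consequence of this design is that failures are easy to miss, so it is worth checking the console output after a run whose network was unreliable.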

File Uploads

When you call save_image, save_json, etc., the content is written to a temp file and queued. One of the file_upload_workers threads uploads it, then removes the temp file. Cleanup runs even if the upload is delayed.
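
A minimal sketch of this write, queue, upload, clean-up sequence, using only the standard library. The helper names save_json_buffered and upload_then_cleanup, and the upload callable, are hypothetical. In this sketch the temp file is removed in a finally clause once the upload returns or raises, however long it took.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def upload_then_cleanup(path, upload):
    try:
        upload(path)
    finally:
        os.remove(path)  # runs whether the upload succeeded or raised


def save_json_buffered(pool, data, upload):
    # Write to a temp file on the caller's thread (local I/O only)...
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as fh:
        json.dump(data, fh)
    # ...then hand the slow upload to one of the pool's worker threads.
    return pool.submit(upload_then_cleanup, path, upload)


def make_pool(file_upload_workers=4):
    # One worker thread per concurrent upload.
    return ThreadPoolExecutor(max_workers=file_upload_workers)
```

The caller gets a Future back immediately and can ignore it, or call result() on it when it needs to know the upload has finished.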

Thread Safety

Queues are thread-safe; logging from multiple worker threads against the same Experiment is supported.
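
As an illustration of what thread-safe queueing means in practice, the sketch below has several threads write to one shared queue without any locking of their own. It uses Python's standard queue.Queue; whether ML-Dash uses that class internally is an assumption, and the worker function and record layout are hypothetical.

```python
import queue
import threading

log_queue = queue.Queue()  # queue.Queue handles its own locking


def worker(worker_id, n_steps):
    for step in range(n_steps):
        # No explicit lock needed around put().
        log_queue.put({"worker": worker_id, "step": step})


threads = [threading.Thread(target=worker, args=(i, 250)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every record from every thread is present once all writers have finished.
total = log_queue.qsize()
```

Items from different threads may interleave in any order, so anything that depends on ordering should carry its own step or timestamp field.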

For end-to-end examples, see Complete Examples.