[Feature]: Support distributing service replicas across multiple fleets #3929

### Problem

When running a service with multiple replicas, it's common to want a **guaranteed baseline** (reserved/on-demand instances) with **spot overflow** for burst capacity. The natural approach is to create two fleets (one on-demand, one spot) and reference both in the service's `fleets` parameter.

However, dstack currently selects a **single fleet for all replicas** of a service. This makes it impossible to distribute replicas across fleets with different purchasing options.

### Observed behavior

**Setup:**
- Fleet A: `nodes: 1`, `spot_policy: on-demand` (with capacity reservation)
- Fleet B: `nodes: 0..1`, `spot_policy: spot`
- Service: `replicas: 2`, `fleets: [fleet-a, fleet-b]`, `spot_policy: auto`

**Result:** `failed_to_start_due_to_no_capacity` — "Failed to use specified fleets"

**When Fleet B is changed to `nodes: 0..2`:**
Both replicas are provisioned on Fleet B (spot); Fleet A (reserved) is ignored entirely.

### Root cause (code analysis)

1. **Single fleet per run** — `_should_wait_for_run_fleet_assignment` in `jobs_submitted.py` ensures only `replica_num=0` selects the fleet. Once `run_model.fleet_id` is set, all subsequent replicas are constrained to the same fleet.

2. **Capacity check excludes undersized fleets** — `_run_can_fit_into_fleet` in `plan.py` requires:
   ```python
   fleet_available_capacity = fleet.nodes.max - occupied_instances
   if fleet_available_capacity < nodes_required_num:  # total min replicas
       return False  # fleet excluded entirely
   ```
   A fleet is excluded unless its remaining capacity (`nodes.max` minus occupied instances) can accommodate ALL required replicas on its own.

3. **Price-based sort with no fleet priority** — `find_optimal_fleet_with_offers` sorts by:
   ```python
   sort_key=(
       not candidate.has_pool_capacity,      # idle instances first
       candidate.min_instance_offer_price,   # cheapest price
       min_backend_offer_price,
   )
   ```
   Spot is typically cheaper, so it wins over on-demand/reserved whenever both fleets are candidates.
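
To make the interaction of points 2 and 3 concrete, here is a minimal, self-contained simulation of the two setups from "Observed behavior". It is a simplified sketch and not dstack's actual code: the `Fleet` dataclass, the `pick_fleet` helper, and the prices are hypothetical, and the sort key drops the third element (`min_backend_offer_price`) for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fleet:
    name: str
    nodes_max: int
    occupied: int = 0
    price: float = 0.0  # hypothetical cheapest hourly offer price
    has_pool_capacity: bool = False


def run_can_fit_into_fleet(fleet: Fleet, nodes_required_num: int) -> bool:
    # Same shape as the check quoted above: the fleet must fit ALL replicas.
    fleet_available_capacity = fleet.nodes_max - fleet.occupied
    return fleet_available_capacity >= nodes_required_num


def pick_fleet(fleets: List[Fleet], nodes_required_num: int) -> Optional[Fleet]:
    # Step 1: drop every fleet that cannot hold the whole run by itself.
    candidates = [f for f in fleets if run_can_fit_into_fleet(f, nodes_required_num)]
    if not candidates:
        return None  # corresponds to "failed_to_start_due_to_no_capacity"
    # Step 2: idle instances first, then cheapest price; no fleet priority.
    return min(candidates, key=lambda f: (not f.has_pool_capacity, f.price))


fleet_a = Fleet("fleet-a", nodes_max=1, price=3.0)  # on-demand / reserved

# Setup 1: Fleet B is `nodes: 0..1`, the service needs 2 replicas.
setup_1 = pick_fleet([fleet_a, Fleet("fleet-b", nodes_max=1, price=1.0)], 2)

# Setup 2: Fleet B is `nodes: 0..2`.
setup_2 = pick_fleet([fleet_a, Fleet("fleet-b", nodes_max=2, price=1.0)], 2)

print(setup_1)
print(setup_2.name if setup_2 else None)
```

In setup 1 neither fleet can hold both replicas alone, so no fleet is selected even though the combined capacity is sufficient. In setup 2 Fleet A is still filtered out by the capacity check, so Fleet B is the only candidate and receives both replicas.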

### Desired behavior

Allow replicas of a single service to be distributed across multiple fleets. Example use case:

```yaml
type: service
name: my-service
replicas: 3..10

fleets:
  - name: gpu-reserved
    min_replicas: 1  # always keep at least 1 replica here
  - name: gpu-spot
    # overflow capacity

spot_policy: auto
```
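
If the `fleets` list accepted both the existing plain-name form and the mapping form proposed above, the configuration could be normalized along these lines. This is only a sketch of one possible shape: `FleetRef`, `normalize_fleets`, and the `max_replicas` field are hypothetical names and do not exist in dstack today.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class FleetRef:
    name: str
    min_replicas: int = 0               # hypothetical per-fleet floor
    max_replicas: Optional[int] = None  # hypothetical per-fleet cap (None = no cap)


def normalize_fleets(entries: List[Any]) -> List[FleetRef]:
    """Accept `fleets: [fleet-a]` as well as `fleets: [{name: ..., min_replicas: ...}]`."""
    refs = []
    for entry in entries:
        if isinstance(entry, str):
            # Existing form: a bare fleet name, no per-fleet constraints.
            refs.append(FleetRef(name=entry))
        elif isinstance(entry, dict):
            refs.append(
                FleetRef(
                    name=entry["name"],
                    min_replicas=int(entry.get("min_replicas", 0)),
                    max_replicas=entry.get("max_replicas"),
                )
            )
        else:
            raise TypeError(f"unsupported fleet entry: {entry!r}")
    return refs


refs = normalize_fleets([{"name": "gpu-reserved", "min_replicas": 1}, "gpu-spot"])
print(refs)
```

Keeping the bare-string form valid would mean existing configurations such as `fleets: [fleet-a, fleet-b]` continue to parse unchanged.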

### Possible solutions

1. **Per-fleet replica pinning** — allow specifying min/max replicas per fleet in the `fleets` list
2. **Fleet priority/weight** — ordered preference with fallback (try fleet A first, overflow to fleet B)
3. **Remove the single-fleet constraint** — allow `run_model.fleet_id` to be per-job instead of per-run, and change `_run_can_fit_into_fleet` to check aggregate capacity across all candidate fleets
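
Options 1 to 3 can be combined. The sketch below shows one way the placement could work once fleet assignment is per-job: check aggregate capacity, satisfy per-fleet minimums first, then fill fleets in list order. `FleetSlot` and `distribute_replicas` are hypothetical names, and `capacity` stands in for `nodes.max` minus occupied instances.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FleetSlot:
    name: str
    capacity: int          # free nodes: nodes.max - occupied instances
    min_replicas: int = 0  # hypothetical per-fleet floor


def distribute_replicas(replicas: int, fleets: List[FleetSlot]) -> Dict[str, int]:
    # Aggregate check: capacity is summed across all candidate fleets.
    if sum(f.capacity for f in fleets) < replicas:
        raise ValueError("not enough aggregate capacity")

    placement: Dict[str, int] = {}
    remaining = replicas

    # Pass 1: honor per-fleet minimums (e.g. the reserved baseline).
    for f in fleets:
        if f.min_replicas > f.capacity:
            raise ValueError(f"{f.name}: min_replicas exceeds capacity")
        take = min(f.min_replicas, remaining)
        placement[f.name] = take
        remaining -= take

    # Pass 2: fill leftover replicas in list order (priority with fallback).
    for f in fleets:
        if remaining == 0:
            break
        take = min(f.capacity - placement[f.name], remaining)
        placement[f.name] += take
        remaining -= take

    return placement


# The setup from "Observed behavior": 2 replicas, one node free in each fleet.
observed = distribute_replicas(
    2, [FleetSlot("fleet-a", capacity=1, min_replicas=1), FleetSlot("fleet-b", capacity=1)]
)
print(observed)
```

With this placement the original setup (Fleet A `nodes: 1`, Fleet B `nodes: 0..1`, `replicas: 2`) would put one replica on each fleet instead of failing.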

### Workaround

Currently the only workaround is running two separate services (one per fleet) behind an external load balancer, which adds operational complexity and breaks the single-endpoint abstraction.

### Would you like to help us implement this feature by sending a PR?

Yes
