Allow long-running jobs to be processed as they come by antonije · Pull Request #457 · runpod/runpod-python

antonije · 2025-09-12T12:39:44Z

These changes fix the following issues:

long-running jobs were blocked until the first one completed
jobs had two states at once (both "QUEUE" and "IN_PROCESS") before actually being accepted
the jobs_queue was being recreated with a maxsize set up front

NOTE: I have added a longer sleep in the run_jobs method to suit my needs, but feel free to change it.

…progress

Previously, jobs were marked as "in progress" in `get_jobs()` immediately after being fetched from the server, before they were actually scheduled by `run_jobs()`. This caused jobs to appear both in the queue and in progress simultaneously, which inflated `current_occupancy()` and blocked new jobs from starting until earlier ones had finished.

The fix moves the `job_progress.add(job)` call into `run_jobs()`, right after the job is dequeued and scheduled as a task. This ensures that:

- Jobs are only marked "in progress" once they actually start running.
- Queue and progress metrics no longer overlap.
- Concurrency limits are enforced correctly (e.g. with concurrency=4, two jobs now run in parallel instead of sequentially).

As a result, jobs are dispatched immediately when capacity is available, and status logging reflects the real state of the worker.
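
The double counting described in this commit message can be illustrated with a small sketch. It is not the worker's code: `OccupancySketch`, `fetch_old`, `fetch_new` and `dequeue_new` are hypothetical names, and the sketch assumes that occupancy is computed as queued jobs plus in-progress jobs.

```python
import asyncio


class OccupancySketch:
    """Hypothetical stand-in for the worker's bookkeeping, not the real JobScaler."""

    def __init__(self):
        self.jobs_queue = asyncio.Queue()
        self.in_progress = set()

    def current_occupancy(self):
        # Assumption: occupancy = queued jobs + jobs marked in progress.
        return self.jobs_queue.qsize() + len(self.in_progress)


async def fetch_old(state, job_id):
    # Old behavior: the job is marked in progress as soon as it is fetched.
    await state.jobs_queue.put(job_id)
    state.in_progress.add(job_id)


async def fetch_new(state, job_id):
    # New behavior: fetching only enqueues the job.
    await state.jobs_queue.put(job_id)


async def dequeue_new(state):
    # New behavior: the job is marked in progress when it is dequeued to run.
    job_id = await state.jobs_queue.get()
    state.in_progress.add(job_id)
    return job_id


async def demo():
    old, new = OccupancySketch(), OccupancySketch()
    for job_id in ("job-1", "job-2"):
        await fetch_old(old, job_id)
        await fetch_new(new, job_id)
    # Two fetched jobs: the old scheme counts each of them twice.
    before = (old.current_occupancy(), new.current_occupancy())
    await dequeue_new(new)
    # One job queued, one in progress: still counted once each.
    after = new.current_occupancy()
    return before, after


result = asyncio.run(demo())
print(result)
```

With two fetched jobs, the old scheme reports an occupancy of 4 and the new one reports 2, which is why a concurrency limit could look exhausted before any job had started.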

Copilot

Pull request overview

This PR updates the serverless worker’s JobScaler loop to better support long-running jobs by decoupling job acquisition from job execution, avoiding queue backpressure, and aligning job state tracking with when work actually begins.

Changes:

Replace the bounded asyncio.Queue(maxsize=concurrency) with an unbounded queue to prevent acquisition from blocking on long-running jobs.
Adjust scaling logic to update current_concurrency without recreating the queue.
Rework run_jobs to manage a live set of tasks and process jobs as capacity opens up; move job_progress.add() to dequeue-time to avoid “queued + in-process” double-state.
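
The backpressure point in the first change can be seen with plain `asyncio.Queue` objects. This is a generic sketch, not code from the PR: `try_put` is a hypothetical helper, and a `maxsize` of 1 is used only to keep the example short.

```python
import asyncio


async def try_put(queue, item, timeout=0.05):
    """Return True if put() completed within the timeout, False if it blocked."""
    try:
        await asyncio.wait_for(queue.put(item), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def demo():
    bounded = asyncio.Queue(maxsize=1)
    unbounded = asyncio.Queue()  # default maxsize=0 means no size limit
    results = {"bounded": [], "unbounded": []}
    for i in range(3):
        # Nothing consumes from either queue, as if every slot held a long-running job.
        results["bounded"].append(await try_put(bounded, f"job-{i}"))
        results["unbounded"].append(await try_put(unbounded, f"job-{i}"))
    return results


results = asyncio.run(demo())
print(results)
```

On the bounded queue, only the first `put()` completes and the later ones block until something is consumed. On the unbounded queue, every `put()` returns immediately, so fetching jobs never stalls behind a long-running one.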

        while self.current_occupancy() > 0:
            # not safe to scale when jobs are in flight
            await asyncio.sleep(1)
            continue

-        self.jobs_queue = asyncio.Queue(maxsize=self.current_concurrency)
-        log.debug(
-            f"JobScaler.set_scale | New concurrency set to: {self.current_concurrency}"
-        )
+        self.current_concurrency = new_concurrency
+        log.debug(f"JobScaler.set_scale | New concurrency set to: {self.current_concurrency}")


        Runs the block in an infinite loop while the worker is alive or jobs queue is not empty.
        """
-        tasks = []  # Store the tasks for concurrent job processing
+        tasks: set[asyncio.Task] = set()  # Store the tasks for concurrent job processing


                done, pending = await asyncio.wait(
-                    tasks, return_when=asyncio.FIRST_COMPLETED
+                    tasks,
+                    timeout=0.1,
+                    return_when=asyncio.FIRST_COMPLETED,
                )
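
A minimal sketch of the loop shape these hunks point toward, assuming a pre-filled queue and a hypothetical `handle` coroutine in place of the real job handler. With `timeout` set, `asyncio.wait` returns an empty `done` set when nothing has finished yet rather than raising, so the loop gets a regular chance to start queued jobs.

```python
import asyncio


async def handle(job_id, seconds, finished):
    # Hypothetical job handler: sleeps to simulate work, then records completion.
    await asyncio.sleep(seconds)
    finished.append(job_id)


async def run_jobs(jobs, concurrency):
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    queue = asyncio.Queue()
    for job in jobs:
        queue.put_nowait(job)

    tasks = set()   # live tasks only; finished ones are pruned each tick
    finished = []
    while tasks or not queue.empty():
        # Start queued jobs while there is free capacity.
        while len(tasks) < concurrency and not queue.empty():
            job_id, seconds = queue.get_nowait()
            tasks.add(asyncio.create_task(handle(job_id, seconds, finished)))

        # Returns on the first completion or after 0.1 s, whichever comes first.
        done, tasks = await asyncio.wait(
            tasks,
            timeout=0.1,
            return_when=asyncio.FIRST_COMPLETED,
        )
        for task in done:
            task.result()  # surface any exception raised by the handler
    return finished


finished = asyncio.run(
    run_jobs([("long", 0.3), ("short-1", 0.05), ("short-2", 0.05)], concurrency=2)
)
print(finished)
```

Here the second short job starts as soon as the first one frees a slot, while the long job is still running. Rebinding `tasks` to the pending set returned by `asyncio.wait` is one way to prune finished tasks without `add_done_callback`; whether the PR does it exactly this way is not shown in the hunks above.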


antonije added 8 commits September 7, 2025 15:08

- Stop recreating the job_queue and don't set the size right away.

7625d91

- Stop recreating the job_queue and don't set the size right away.

763c843

- No more blocking and waiting until a job completes, but instead pru…

8d33f2e

…ne finished tasks each tick.

- wait a bit when there are no jobs

5520f0b

- wait a bit when there are jobs

b4f844d

- make tasks be a set of asyncio.Task and auto-discarded when done

68041ce

- add a timeout of 100 milliseconds that doesn't block the code and accepts new jobs

- stop using add_done_callback so we can avoid mutation bugs

0c3c35c

Yhlong00 requested a review from deanq September 12, 2025 20:06

jhcipar mentioned this pull request Dec 22, 2025

Fix: poll for jobs while tasks are running #473

Merged

Merge branch 'main' into main

7197966

deanq requested a review from Copilot June 3, 2026 08:13

Copilot started reviewing on behalf of deanq June 3, 2026 08:14

Copilot AI reviewed Jun 3, 2026

antonije wants to merge 9 commits into runpod:main from antonije:main
