Currently it takes 10-15s for a run to go from submitted to running on an idle instance with a pre-pulled image. This is unnecessarily slow, since all we need is a few RTTs to the instance to submit the run.
I prototyped a few optimizations that prove run startup time can be ~1s (even with a large 200ms RTT). These include:
- Reusing the SSH connection to the instance (the biggest gain).
- Removing the completely artificial pipeline delays between job and run state transitions.
- Skipping fetching backend offers if instance offers suffice.
- Dropping RSA key generation for sshd inside the container.
- Dropping some unnecessary runner API calls (repeated healthchecks).
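
To illustrate the first item, here is a minimal sketch of connection reuse. It is not the prototype's code: `ConnectionCache`, its `connect` callback, and the host string are hypothetical names made up for this example, and the callback stands in for whatever actually opens the SSH connection. The idea is only that the handshake cost is paid once per instance rather than on every call.

```python
from typing import Callable, Dict, Generic, TypeVar

C = TypeVar("C")


class ConnectionCache(Generic[C]):
    """Hypothetical per-instance cache: open a connection once, then reuse it."""

    def __init__(self, connect: Callable[[str], C]) -> None:
        # `connect` is an assumed callback that opens a new connection to a host.
        self._connect = connect
        self._connections: Dict[str, C] = {}

    def get(self, host: str) -> C:
        """Return the cached connection for `host`, opening one only if missing."""
        conn = self._connections.get(host)
        if conn is None:
            conn = self._connect(host)
            self._connections[host] = conn
        return conn

    def drop(self, host: str) -> None:
        """Forget a connection, e.g. after it fails, so the next get() reconnects."""
        self._connections.pop(host, None)
```

A caller would build one cache per server process and call `cache.get(host)` before each runner request. A real version would also need to detect dead connections and close them on `drop`, which this sketch leaves out.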