But from the example shown in the data sheet, if we want to say sample 100 points, it seems like the kernel has to execute a for loop for this, meaning no RTIO events can be submitted during this period, effectively blocking the kernel for the sampling duration.
That isn't quite true, depending on exactly what you mean. Reading the new manual page on RTIO may clarify things. It is true that no RTIO events can be submitted while input RTIO events are being read, but the emphasis is on "submitted" and "read" respectively: it's entirely possible to schedule RTIO events into the future so that they are output in a time window where the CPU will be busy processing input events. So if your sequence of RTIO output events is fixed in advance, you can probably just schedule first and then sample, with the RTIO output events firing while the samples are being taken.
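To make the "schedule first, then sample" ordering concrete, here is a toy model in plain Python. It is not ARTIQ code: `schedule_outputs`, `read_samples`, and all the timing numbers are made up for illustration. It only mimics the idea that output events carry timestamps and fire on their own while the CPU is busy reading inputs.

```python
# Toy model only: plain Python, not the ARTIQ API.
# All names and numbers here are hypothetical.

def schedule_outputs(start, period, count):
    """Return the timestamps of output events queued ahead of time."""
    return [start + i * period for i in range(count)]


def read_samples(start, cost_per_read, count):
    """Return the intervals during which the CPU is blocked, one per read."""
    return [(start + i * cost_per_read, start + (i + 1) * cost_per_read)
            for i in range(count)]


# Step 1: queue 10 output events, 1000 time units apart, before sampling starts.
outputs = schedule_outputs(start=0, period=1000, count=10)

# Step 2: the CPU then does nothing but input: 100 reads of 100 units each.
busy = read_samples(start=0, cost_per_read=100, count=100)
busy_start, busy_end = busy[0][0], busy[-1][1]

# Outputs whose timestamps fall inside the busy window still fire,
# because they were already queued before the CPU became busy.
fired_while_busy = [t for t in outputs if busy_start <= t < busy_end]
print(len(fired_while_busy), "of", len(outputs), "outputs fire during sampling")
```

The point of the sketch is only the ordering: everything in step 1 is submitted before step 2 begins, so nothing needs to be submitted while the reads are in progress.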
On the other hand, if you want to be scheduling RTIO events while sampling is running, then it might be worth taking a look inside the Sampler driver. As far as I understand it, the scheduling of the output events that trigger the sample read is distinct from the input event self.bus_adc.read().
Again, depending on what you are trying to do, it might be possible to accomplish what you want by e.g. intermittently scheduling these output triggers, allowing incoming samples to collect in the input buffer, and reading them out at the end of the experiment. Or it might be possible to accomplish what you want just by manipulating now_mu as above; note for example that it's entirely possible to move the time cursor backwards (if that will leave you with enough slack for your purposes), and in fact this is the underlying mechanism for with parallel.
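The cursor-rewinding idea can be sketched with a toy model. This is plain Python, not ARTIQ code: `ToyCursor`, `emit`, and the event labels are hypothetical, and real hardware additionally requires that the rewound timestamps still leave enough slack.

```python
# Toy model only: a bare time cursor and an event list, not the ARTIQ API.

class ToyCursor:
    def __init__(self):
        self.now = 0       # plays the role of the time cursor
        self.events = []   # (timestamp, label) pairs, in submission order

    def emit(self, label, duration):
        """Record an event at the cursor, then advance the cursor."""
        self.events.append((self.now, label))
        self.now += duration


tl = ToyCursor()
start = tl.now

# Branch A: three sample triggers, 1000 units apart.
for i in range(3):
    tl.emit(f"trigger{i}", 1000)
end_a = tl.now

# Rewind the cursor and lay a second branch over the same time window.
tl.now = start
tl.emit("ttl_on", 1500)
tl.emit("ttl_off", 500)
end_b = tl.now

# Carry on from whichever branch ended later.
tl.now = max(end_a, end_b)

# Sorted by timestamp, the two branches interleave.
timeline = sorted(tl.events)
print(timeline)
```

Both branches were submitted one after the other, but their timestamps overlap, which is all that "parallel" means in this picture.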
Ultimately it is true that the CPU cannot simultaneously be processing output events (scheduling them for the future) and processing input events (retrieving them from the past), just because the CPU can only physically be doing one thing at a time. But if there is any CPU idle time available somewhere, and if your input and output buffers are large enough, you can probably do what you're trying to do just with clever scheduling, though not exactly 'constantly'.
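The buffer-size caveat can be illustrated with a bounded FIFO. Again this is a toy in plain Python and not the ARTIQ API: `ToyInputFifo` and the depth of 64 are assumptions chosen for the example, not the real input buffer depth.

```python
from collections import deque

# Toy model only: a bounded FIFO standing in for an input buffer.
# The class name and the depth are hypothetical.

class ToyInputFifo:
    def __init__(self, depth):
        self.depth = depth
        self.items = deque()
        self.overflowed = False

    def push(self, sample):
        """Called when a trigger fires; drops the sample if the FIFO is full."""
        if len(self.items) >= self.depth:
            self.overflowed = True
        else:
            self.items.append(sample)

    def read(self):
        """Called by the CPU when it gets around to collecting a sample."""
        return self.items.popleft()


fifo = ToyInputFifo(depth=64)

# Phase 1: 50 triggers fire while the CPU is busy with something else.
for i in range(50):
    fifo.push((i * 1000, i))

# Phase 2: the CPU drains the buffer once it has idle time.
collected = [fifo.read() for _ in range(len(fifo.items))]
print(len(collected), "samples collected, overflow:", fifo.overflowed)

# With more triggers than the buffer can hold, samples are lost.
small = ToyInputFifo(depth=64)
for i in range(100):
    small.push((i * 1000, i))
print("kept", len(small.items), "of 100, overflow:", small.overflowed)
```

Deferring the reads works only as long as the number of outstanding samples stays below the buffer depth; beyond that, the CPU has to drain part of the buffer in between.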
From this discussion from 6 years ago, it seems like this would be possible if the inputs supported DMA.
Unless I'm misunderstanding what you want to do, that discussion is not all that relevant, as it's primarily about reaching the hardware maximum sample rate of Sampler, which is bottlenecked by the software processing time of collecting an input event. If you are looking for a very high sample rate, then yes, your core device will be busy simply processing input as fast as it can, but this would also be true for DMA; the maximum achievable rate would just be higher. Edit: On second thought that's not true, since input DMA would in principle bypass the CPU entirely. But in any case AFAIK it remains an entirely theoretical feature.
Also, another alternative, if you would like Sampler to be running at a high sample rate with the corresponding processing overhead, is probably a second core device and satellite kernels.