What happens if I start too many background jobs?


Solution 1

Could all 700 instances possibly run concurrently?

That depends on what you mean by concurrently. If we're being picky, then no, they can't, unless you have 700 threads of execution on your system that you can utilize (so probably not). Realistically though, yes, they probably can, provided you have enough RAM and/or swap space on the system. UNIX and its various children are remarkably good at managing huge levels of concurrency; that's part of why they're so popular for large-scale HPC usage.

How far could I get until my server reaches its limit?

This is impossible to answer concretely without a lot more info. Essentially, you need enough memory to cover:

  • The entire run-time memory requirements of one job, times 700.
  • The memory requirements of bash to manage that many jobs (bash is not horrible about this, but the job control isn't exactly memory efficient).
  • Any other memory requirements on the system.

Assuming you can meet those requirements (and with only 50 GB of RAM, that is not a given), you still have to deal with other issues:

  • How much CPU time is going to be wasted by bash on job control? Probably not much, but with hundreds of jobs, it could be significant.
  • How much network bandwidth is this going to need? Just opening all those connections may swamp your network for a couple of minutes depending on your bandwidth and latency.
  • Many other things I probably haven't thought of.

When that limit is reached, will it just wait to begin the next iteration of foo or will the box crash?

It depends on which limit is hit. If it's memory, something on the system will die (more specifically, get killed by the kernel in an attempt to free up memory), or the system itself may crash (it's not unusual to configure systems to intentionally crash when running out of memory). If it's CPU time, things will just keep going without issue; it will simply be impossible to do much else on the system. If it's the network, though, you might crash other systems or services.


What you really need here is not to run all the jobs at the same time. Instead, split them into batches, and run all the jobs within a batch at the same time, let them finish, then start the next batch. GNU Parallel (https://www.gnu.org/software/parallel/) can be used for this, but it's less than ideal at that scale in a production environment (if you go with it, don't get too aggressive, like I said, you might swamp the network and affect systems you otherwise would not be touching). I would really recommend looking into a proper network orchestration tool like Ansible (https://www.ansible.com/), as that will not only solve your concurrency issues (Ansible does batching like I mentioned above automatically), but also give you a lot of other useful features to work with (like idempotent execution of tasks, nice status reports, and native integration with a very large number of other tools).
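
For illustration, here is a minimal pure-bash sketch of that batching idea. It assumes, as in the question, that foo acts on one node from ~/sagLogs/; the batch size of 50 and passing the node name as an argument are my assumptions, not the asker's actual setup:

    # Launch jobs in batches of 50: start a batch, wait for all of it
    # to finish, then move on to the next batch.
    batch=50
    i=0
    for node in ~/sagLogs/*; do
        foo "$(basename "$node")" &        # assumed: foo takes the node name
        (( ++i % batch == 0 )) && wait     # block until the whole batch is done
    done
    wait                                   # catch the final partial batch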

Solution 2

It's hard to say specifically how many instances could be run as background jobs in the manner you describe. But a normal server can certainly maintain 700 concurrent connections as long as you do it correctly. Webservers do this all the time.

May I suggest that you use GNU parallel (https://www.gnu.org/software/parallel/) or something similar to accomplish this? It would give you a number of advantages over the background-job approach:

  • You can easily change the number of concurrent sessions.
  • And it will wait until sessions complete before it starts new ones.
  • It is easier to abort.

Have a look here for a quick start: https://www.gnu.org/software/parallel/parallel_tutorial.html#A-single-input-source
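
As a concrete starting point, something along these lines should work (a sketch; foo and ~/sagLogs/ are from the question, -j 50 is an arbitrary concurrency cap to tune, and {/} strips the directory from each path):

    # Run foo on every file in ~/sagLogs/, at most 50 at a time; a new
    # job starts as soon as one finishes, and Ctrl-C aborts cleanly.
    parallel -j 50 foo {/} ::: ~/sagLogs/*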

Solution 3

Using & for parallel processing is fine when running a few jobs, and when you monitor progress. But if you are running in a corporate production environment you need something that gives you better control.

ls ~/sagLogs/ | parallel --delay 0.5 --memfree 1G -j0 --joblog my.log --retries 10 foo {}

This will run foo for each file in ~/sagLogs. It starts a job every 0.5 seconds and will run as many jobs in parallel as possible as long as 1 GB of RAM is free, but it will respect the limits on your system (e.g. the number of open files and processes). Typically this means you will be running 250 jobs in parallel if you have not adjusted the number of open files allowed. If you adjust the number of open files, you should have no problem running 32000 in parallel - as long as you have enough memory.
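
If you do want to go past that default, checking and raising the open-file limit for the launching shell looks roughly like this (a sketch; raising the hard limit needs root or an /etc/security/limits.conf change):

    ulimit -n          # show the current soft limit on open files
    ulimit -n 65536    # raise it for this shell, up to the hard limit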

If a job fails (i.e. returns with an error code) it will be retried 10 times.

my.log will tell you whether a job succeeded (possibly after retries) or not.
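
For example, to list the jobs that still ended in failure (a sketch; in the --joblog format the exit value is the seventh column):

    # Print joblog entries with a non-zero exit value, skipping the header.
    awk 'NR > 1 && $7 != 0' my.log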

Solution 4

What happens if I start too many background jobs?

The system will become slow and unresponsive; in the worst case it is so unresponsive that it would be best to just push the power button and do a hard reboot. That would be running something as root, where it had the privilege to get away with doing that. If your bash script is running under regular user privileges, then the first things that come to mind are /etc/security/limits.conf and /etc/systemd/system.conf, and all the variables therein which, ideally speaking, prevent user(s) from overloading the system.

  • CPU = Xeon E5649, that is a 12-core CPU; so you have 12 cores for 12 processes to run concurrently, each utilizing one of twelve cores at 100%. If you kick off 24 processes, each runs at 50% utilization on each of twelve cores; 700 processes get about 1.7% each. But it's a computer: as long as everything completes properly in an acceptable amount of time, that equals success; being efficient is not always relevant.

    1. Could all 700 instances possibly run concurrently? Certainly, 700 is not a large number; my /etc/security/limits.conf maxproc default is 4,135,275, for example.

    2. How far could I get until my server reaches its limit? Much farther than 700 I'm sure.

    3. Limits... what will happen if the script is kicked off under a user account (and generally as root too; limits.conf pretty much applies to everyone) is that the script will just exit after having tried to do foo & 700 times; you would expect to then see 700 foo processes, each with a different PID, but you might only see 456 (random number choice), with the other 244 never started because they were blocked by some security or systemd limit (a few quick checks are sketched after this list).
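
A few quick, read-only ways to see the limits that could block those last 244 processes (a sketch; paths and values vary by distro):

    ulimit -u                        # soft limit on user processes
    ulimit -n                        # soft limit on open files
    cat /proc/sys/kernel/pid_max     # system-wide PID ceiling
    grep -v '^#' /etc/security/limits.conf   # configured PAM limits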

Million $ question: how many should you run concurrently?

Being involved with networking, and since you said each job will do a telnet connection, my educated guess is that you will run into network limits and overhead before you hit CPU and RAM limits. But I don't know what you are doing specifically. What will likely happen is that you can kick off all 700 at once, but things will automatically block until previous processes and network connections finish and close, based on various system limits; or something like the first 500 will kick off and the remaining 200 won't, because system or kernel limits prevent it. However many run at once, there will be some sweet spot where things get done as fast as possible, minimizing overhead and increasing efficiency. With 12 cores (or 24 if you have 2 CPUs), start with 12 (or 24) at once and then increase that concurrent batch number by 12 or 24 until you see no further run-time improvement.

Hint: google max telnet connections and see how this applies to your system(s). Also don't forget about firewalls. Also do a quick calculation of the memory needed per process x 700; make sure it is less than the available RAM (about 50 GB in your case), otherwise the system will start using swap and basically become unresponsive. So kick off 12, 24, N processes at a time, monitor free RAM, and then increase N with some knowledge of what's happening.
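
One way to run that experiment (a sketch; foo and ~/sagLogs/ are from the question, the fixed set of 48 jobs and the 12-step concurrency values are my choices, and re-running the same nodes is only harmless if foo is idempotent):

    # Probe for the sweet spot: run the same 48 jobs at increasing
    # concurrency and watch where the wall-clock time stops improving.
    for n in 12 24 36 48; do
        echo "=== concurrency $n ==="
        time (ls ~/sagLogs/ | head -n 48 | parallel -j "$n" foo {})
    done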

By default, RHEL limits the number of telnet connections from a single host to 10 simultaneous sessions. This is a security feature; the limit is set by the "per_source" value in /etc/xinetd.conf, which you can change.
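
To find where that cap is set on your box (a sketch; the entry may live in a service file under /etc/xinetd.d/ rather than in /etc/xinetd.conf itself):

    grep -r per_source /etc/xinetd.conf /etc/xinetd.d/ 2>/dev/null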




Comments

  • KuboMD
    KuboMD almost 2 years

    I need to do some work on 700 network devices using an expect script. I can get it done sequentially, but so far the runtime is around 24 hours. This is mostly due to the time it takes to establish a connection and the delay in the output from these devices (old ones). I'm able to establish two connections and have them run in parallel just fine, but how far can I push that?

    I don't imagine I could do all 700 of them at once, surely there's some limit to the no. of telnet connections my VM can manage.

    If I did try to start 700 of them in some sort of loop like this:

    for node in `ls ~/sagLogs/`; do  
        foo &  
    done
    

    With

    • CPU 12 CPUs x Intel(R) Xeon(R) CPU E5649 @ 2.53GHz

    • Memory 47.94 GB

    My question is:

    1. Could all 700 instances possibly run concurrently?
    2. How far could I get until my server reaches its limit?
    3. When that limit is reached, will it just wait to begin the next iteration of foo or will the box crash?

    I'm running in a corporate production environment unfortunately, so I can't exactly just try and see what happens.

    • Stephen Kitt
      Stephen Kitt about 5 years
      I’m guessing each job uses very little CPU and RAM, is that right?
    • KuboMD
      KuboMD about 5 years
      Honestly I have a hard time telling. htop isn't very helpful - when I'm running one instance the CPU reads: CPU: 86.9% sys: 13.1% low: 0.0% and RAM reads Mem:3.86G used:178M buffers:2.28G cache:608M. Any guess?
    • Adam
      Adam about 5 years
      I've had good luck with parallel, using around 50 concurrent jobs. It's a great medium between parallelism of 1 and 700. The other nice thing is that it's batchless. A single stalled connection will only stall itself, not any of the others. The main downside is error management. None of these shell-based approaches will gracefully handle errors. You'll have to manually check for success yourself, and do your own retries.
    • ChuckCottrill
      ChuckCottrill about 5 years
      Your task queue may be 700 today, but can the size expand? Watch for swap space to grow - that is an indication you have reached the memory limit. And CPU % is not a good measure (for Linux/Unix); it is better to consider load average (run queue length).
    • michaelb958--GoFundMonica
      michaelb958--GoFundMonica about 5 years
      The most recent way I broke production at my still-kinda-new job was by accidentally running a million-plus short-lived background jobs at once. They involved JVMs (wait, wait, put the pitchforks down), so the consequences were 'limited' to hundreds of thousands of error report files saying that threads couldn't be started.
    • Peter Cordes
      Peter Cordes about 5 years
      Google OOM killer for what happens if you run out of swap space.
    • l0b0
      l0b0 about 5 years
    • KuboMD
      KuboMD about 5 years
      @l0b0 neat! Luckily I'm the only one working on this system, so as long as I don't name anything with a newline it'll be alright.
    • l0b0
      l0b0 about 5 years
      @KuboMD And as long as nobody else ever wants to use your code.
    • Peter Cordes
      Peter Cordes about 5 years
      But still, why would you want to write it that way when you could have used for node in ~/sagLogs/*? (You can use basename "$node" or "${node##*/}" if you need the bare filename without the full path.)
    • KuboMD
      KuboMD about 5 years
      @PeterCordes No reason in particular. I didn't know I could use for node in ~/sagLogs/* - that seems much simpler.
  • KuboMD
    KuboMD about 5 years
    Interesting! I'll take a look at this. Do you know if attempting this kind of operation (without the help of Parallel) would risk crashing the hypervisor?
  • ChuckCottrill
    ChuckCottrill about 5 years
    There are ways to run a limited number of background tasks (using bash, perl, python, et al), monitor for task completion, and run more tasks as prior tasks complete. A simple approach would be to collect batches of tasks represented by files in subdirectories, and process a batch at a time. There are other ways...
  • hobbs
    hobbs about 5 years
    @KuboMD if you can crash the hypervisor with something so mundane, it's a bug in the hypervisor :)
  • Biswapriyo
    Biswapriyo about 5 years
    Does this also include unix-like systems? And what is "GUN parallel"?
  • Baldrickk
    Baldrickk about 5 years
    @Biswapriyo I think it's a typo - he meant GNU gnu.org/software/parallel
  • KuboMD
    KuboMD about 5 years
    This looks very promising, thank you.
  • Austin Hemmelgarn
    Austin Hemmelgarn about 5 years
    @Biswapriyo Baldrickk is correct, it is indeed a typo, which I've now corrected. And yes, this is intended to cover most UNIX-like systems.
  • KuboMD
    KuboMD about 5 years
    Ran a simple test doing cat ~/sagLogs/* >> ~/woah | parallel and holy moly that was fast. 1,054,552 lines in the blink of an eye.
  • Ole Tange
    Ole Tange about 5 years
    The command you gave has dual redirection, so I do not think it does what you intend it to do. GNU Parallel has an overhead of 10 ms per job, so 1M jobs should take on the order of 3 hours.
  • KuboMD
    KuboMD about 5 years
    I see. If I wanted to use parallel properly to get the cat of every file in ~/sagLogs/ and move it into ~/woah, what would that command look like?
  • Ole Tange
    Ole Tange about 5 years
    cat ~/sagLogs/* > ~/woah (no | parallel).
  • KuboMD
    KuboMD about 5 years
    Is that to say parallel would be invoked automatically, or that it isn't applicable at all in that situation?
  • Ole Tange
    Ole Tange about 5 years
    It is not applicable at all if all you want to do is simply to concatenate the files.
  • pipe
    pipe about 5 years
  • KuboMD
    KuboMD about 5 years
    I see. What's an example of a relatively simple job I could test with parallel that would be less intensive than my expect script?
  • forest
    forest about 5 years
    You can prevent the system from going down by setting sensible rlimits.
  • Austin Hemmelgarn
    Austin Hemmelgarn about 5 years
    @forest Yes, you could use rlimits to prevent the system from crashing, but getting them right in a case like this is not easy (you kind of need to know what the resource requirements for the tasks are beforehand) and doesn't protect the rest of the network from any impact these jobs may cause (which is arguably a potentially much bigger issue than crashing the local system).
  • KuboMD
    KuboMD about 5 years
    @AustinHemmelgarn Yes. I realize now that crashing the box would be inconvenient, but causing widespread network problems would light up the NOC and probably cost me my job.
  • Peter Cordes
    Peter Cordes about 5 years
    GNU Parallel can work similarly to make -j20, where instead of waiting for the last job in a "batch", it keeps n jobs in flight. So one slow job every now and then won't lower throughput unless it's one of the last few to be started. Of course, GNU make is usually already installed, and with the right Makefile can work as a parallel job manager.
  • Peter Cordes
    Peter Cordes about 5 years
    @KuboMD a trivial CPU busy loop like awk 'BEGIN{for(i=rand()*10000000; i<100000000;i++){}}' would work for playing around with. Or try it on a task like sleep 10 to see it keep n jobs in flight without using much CPU time. e.g. time parallel sleep ::: {100..1} to run sleeps from 100 down to 1 second.
  • KuboMD
    KuboMD about 5 years
    @OleTange I'm trying to visualize how the 'jobs' will be grouped at the moment. Right now I have groups of between 1 and 38 nodes per 'site'. I've piped their filenames into parallel with a max of 11 jobs at once (just for echoing). I want to show my supervisor how they'll be grouped. I can separate them by site easily, but is there a way I can get parallel to insert some sort of index or marker when a new batch is started? For those sites with >11 nodes that'll require 2+ batches.
  • Ole Tange
    Ole Tange about 5 years
    @KuboMD I have no idea what you mean by "grouped". Please read doi.org/10.5281/zenodo.1146014 (at least chapter 1+2). Then re-state your question.