What happens if I start too many background jobs?
Solution 1
Could all 700 instances possibly run concurrently?
That depends on what you mean by concurrently. If we're being picky, then no, they can't, unless you have 700 threads of execution on your system you can utilize (so probably not). Realistically though, yes, they probably can, provided you have enough RAM and/or swap space on the system. UNIX and its various children are remarkably good at managing huge levels of concurrency; that's part of why they're so popular for large-scale HPC usage.
How far could I get until my server reaches its limit?
This is impossible to answer concretely without a lot more info. Essentially, you need enough memory to cover:
- The entire run-time memory requirements of one job, times 700.
- The memory requirements of bash to manage that many jobs (bash is not horrible about this, but the job control isn't exactly memory efficient).
- Any other memory requirements on the system.
Assuming you meet those requirements (which, with only 50GB of RAM, is not a given), you still have to deal with other issues:
- How much CPU time is going to be wasted by bash on job control? Probably not much, but with hundreds of jobs, it could be significant.
- How much network bandwidth is this going to need? Just opening all those connections may swamp your network for a couple of minutes depending on your bandwidth and latency.
- Many other things I probably haven't thought of.
When that limit is reached, will it just wait to begin the next iteration of foo or will the box crash?
It depends on what limit is hit. If it's memory, something will die on the system (more specifically, get killed by the kernel in an attempt to free up memory) or the system itself may crash (it's not unusual to configure systems to intentionally crash when running out of memory). If it's CPU time, it will just keep going without issue, it'll just be impossible to do much else on the system. If it's the network though, you might crash other systems or services.
What you really need here is not to run all the jobs at the same time. Instead, split them into batches, and run all the jobs within a batch at the same time, let them finish, then start the next batch. GNU Parallel (https://www.gnu.org/software/parallel/) can be used for this, but it's less than ideal at that scale in a production environment (if you go with it, don't get too aggressive, like I said, you might swamp the network and affect systems you otherwise would not be touching). I would really recommend looking into a proper network orchestration tool like Ansible (https://www.ansible.com/), as that will not only solve your concurrency issues (Ansible does batching like I mentioned above automatically), but also give you a lot of other useful features to work with (like idempotent execution of tasks, nice status reports, and native integration with a very large number of other tools).
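The batching idea above can be sketched in plain bash with no extra tools. This is a minimal sketch only: foo is a placeholder for the real per-device job, and the batch size of 25 is an arbitrary starting point.

```shell
#!/usr/bin/env bash
# Minimal sketch of batching: run at most BATCH jobs at once, wait for
# the whole batch to finish, then start the next batch.
foo() { sleep 0.2; }              # stand-in workload; replace with the real command

BATCH=25
i=0
for node in node-{001..100}; do   # stand-in for the files in ~/sagLogs/
    foo "$node" &
    i=$((i + 1))
    if [ "$i" -ge "$BATCH" ]; then
        wait                      # block until every job in this batch exits
        i=0
    fi
done
wait                              # reap the final, possibly partial batch
```

The drawback of fixed batches is that one slow job holds up the start of the next batch, which is one more reason to prefer a real orchestration tool for production use.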
Solution 2
It's hard to say specifically how many instances could be run as background jobs in the manner you describe. But a normal server can certainly maintain 700 concurrent connections as long as you do it correctly. Webservers do this all the time.
May I suggest that you use GNU parallel (https://www.gnu.org/software/parallel/) or something similar to accomplish this? It would give you a number of advantages over the background-job approach:
- You can easily change the number of concurrent sessions.
- It will wait until sessions complete before starting new ones.
- It is easier to abort.
Have a look here for a quick start: https://www.gnu.org/software/parallel/parallel_tutorial.html#A-single-input-source
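As a concrete starting point, the invocation could look like the sketch below. Here echo stands in for the hypothetical foo script from the question, and the ::: arguments stand in for the files under ~/sagLogs/; 50 is an arbitrary concurrency cap.

```shell
# Keep at most 50 jobs in flight; parallel starts a new job each time
# a slot frees up, rather than waiting for a whole batch to finish.
parallel -j 50 echo processing {} ::: node-a node-b node-c
```

In the real case you would replace echo with your script and the literal arguments with ~/sagLogs/*.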
Solution 3
Using & for parallel processing is fine when running a few jobs and monitoring progress yourself. But if you are running in a corporate production environment, you need something that gives you better control.
ls ~/sagLogs/ | parallel --delay 0.5 --memfree 1G -j0 --joblog my.log --retries 10 foo {}
This will run foo for each file in ~/sagLogs/. It starts a job every 0.5 seconds and will run as many jobs in parallel as possible as long as 1 GB of RAM is free, but it will respect the limits on your system (e.g. the number of open files and processes). Typically this means you will be running 250 jobs in parallel if you have not adjusted the allowed number of open files. If you do adjust that, you should have no problem running 32000 in parallel - as long as you have enough memory.
If a job fails (i.e. returns with an error code) it will be retried 10 times.
my.log will tell you whether each job succeeded (possibly after retries) or not.
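The joblog can then be checked mechanically. The sketch below assumes the joblog column layout of current GNU Parallel (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command) and uses a toy sample.log in place of a real my.log:

```shell
# The --joblog file is tab-separated with a header line; column 7
# (Exitval) holds each job's exit status, the last column the command.
# A two-line toy log stands in for a real my.log here:
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' >  sample.log
printf '1\t:\t1\t0.1\t0\t0\t0\t0\tfoo a\n2\t:\t1\t0.1\t0\t0\t1\t0\tfoo b\n'         >> sample.log

# Print the commands whose jobs still failed after all retries:
awk -F'\t' 'NR > 1 && $7 != 0 { print $NF }' sample.log   # prints: foo b
```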
Solution 4
What happens if I start too many background jobs?
The system will become slow and unresponsive; worst case, it becomes so unresponsive that it would be best to just push the power button and do a hard reboot. That would be the result of running something as root with the privilege to get away with it. If your bash script is running under regular user privileges, then the first things that come to mind are /etc/security/limits.conf and /etc/systemd/system.conf, and all the variables therein, to (ideally speaking) prevent users from overloading the system.
CPU: your Xeon E5649 system presents 12 CPUs, so you have 12 cores for 12 processes to run concurrently, each utilizing one of twelve cores at 100%. If you kick off 24 processes, each would run at 50% utilization; 700 processes would get about 1.7% each. But it's a computer: as long as everything completes properly in an acceptable amount of time, that counts as success; being efficient is not always relevant.
Could all 700 instances possibly run concurrently? Certainly; 700 is not a large number. My /etc/security/limits.conf maxproc default is 4,135,275, for example.
How far could I get until my server reaches its limit? Much farther than 700, I'm sure.
Limits... what will happen if the script is kicked off under a user account (and generally under root as well; limits.conf pretty much applies to everyone) is that the script will just exit after having tried to do foo & 700 times; you would expect to then see 700 foo processes, each with a different PID, but you might only see 456 (random number choice) because the other 244 never started - they got blocked by some security or systemd limit.
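Whether jobs would get blocked that way can be checked up front: ulimit reports the current shell's ceilings. The values vary per system, so no specific numbers are assumed here.

```shell
# Per-user ceilings that can silently stop some of the 700 jobs:
ulimit -u    # max user processes (the maxproc / nproc limit)
ulimit -n    # max open file descriptors
```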
Million-dollar question: how many should you run concurrently?
Since the network is involved and you said each job will open a telnet connection, an educated guess is that you will run into network limits and overhead before you hit CPU and RAM limits. But I don't know what you are doing specifically. What will likely happen is that you can kick off all 700 at once, but things will automatically block until previous processes and network connections finish and close, based on various system limits; or something like the first 500 will kick off and the remaining 200 won't, because system or kernel limits prevent it. But however many run at once, there will be some sweet spot for getting things done as fast as possible, minimizing overhead and maximizing efficiency. With 12 cores (or 24 if you have 2 CPUs), start with 12 (or 24) at once, then increase that concurrent batch number by 12 or 24 until you stop seeing run-time improvement.
Hint: google max telnet connections and see how this applies to your system(s). Also don't forget about firewalls. Also do a quick calculation of memory needed per process times 700; make sure it is less than the available RAM (about 50 GB in your case), otherwise the system will start using swap and basically become unresponsive. So kick off 12, 24, N processes at a time, monitor free RAM, and then increase N with some knowledge of what's happening.
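That ramp-up approach can be sketched as below; foo is a placeholder workload, and the batch sizes are only the suggested starting points.

```shell
#!/usr/bin/env bash
# Run N placeholder jobs at once, time each batch, and only move to a
# larger N if the run time is still improving.
foo() { sleep 0.2; }    # stand-in for the real expect script

for N in 12 24 48; do
    start=$SECONDS
    for i in $(seq "$N"); do
        foo &
    done
    wait                # let the whole batch finish before measuring
    echo "N=$N finished in $((SECONDS - start))s"
done
```

In a real run you would also watch free RAM between batches (e.g. with free -m) before raising N further.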
By default, RHEL limits the number of telnet connections from a single host to 10 simultaneous sessions as a security feature; to raise it, change the “per_source” value in /etc/xinetd.conf.
KuboMD
Updated on September 18, 2022

Comments
-
KuboMD almost 2 years
I need to do some work on 700 network devices using an expect script. I can get it done sequentially, but so far the runtime is around 24 hours. This is mostly due to the time it takes to establish a connection and the delay in the output from these devices (old ones). I'm able to establish two connections and have them run in parallel just fine, but how far can I push that?
I don't imagine I could do all 700 of them at once; surely there's some limit to the number of telnet connections my VM can manage.
If I did try to start 700 of them in some sort of loop like this:
for node in `ls ~/sagLogs/`; do foo & done
With
CPU 12 CPUs x Intel(R) Xeon(R) CPU E5649 @ 2.53GHz
Memory 47.94 GB
My question is:
- Could all 700 instances possibly run concurrently?
- How far could I get until my server reaches its limit?
- When that limit is reached, will it just wait to begin the next iteration of foo, or will the box crash?
I'm running in a corporate production environment unfortunately, so I can't exactly just try and see what happens.
-
Stephen Kitt about 5 years: I’m guessing each job uses very little CPU and RAM, is that right?
-
KuboMD about 5 years: Honestly I have a hard time telling. Htop isn't very helpful; when I'm running one instance the CPU reads:
CPU: 86.9% sys: 13.1% low: 0.0%
and RAM reads Mem:3.86G used:178M buffers:2.28G cache:608M
Any guess?
-
Adam about 5 years: I've had good luck with parallel, using around 50 concurrent jobs. It's a great medium between parallelism of 1 and 700. The other nice thing is that it's batchless: a single stalled connection will only stall itself, not any of the others. The main downside is error management. None of these shell-based approaches will gracefully handle errors; you'll have to check for success yourself and do your own retries.
-
ChuckCottrill about 5 years: Your task queue may be 700 today, but can the size expand? Watch for swap space to grow - that is an indication you have reached the memory limit. And CPU % is not a good measure (for Linux/Unix); it is better to consider load average (run queue length).
-
michaelb958--GoFundMonica about 5 years: The most recent way I broke production at my still-kinda-new job was by accidentally running a million-plus short-lived background jobs at once. They involved JVMs (wait, wait, put the pitchforks down), so the consequences were 'limited' to hundreds of thousands of error report files saying that threads couldn't be started.
-
Peter Cordes about 5 years: Google OOM killer for what happens if you run out of swap space.
-
l0b0 about 5 years: Nitpick: Don't parse ls output.
-
KuboMD about 5 years: @l0b0 neat! Luckily I'm the only one working on this system, so as long as I don't name anything with a newline it'll be alright.
-
l0b0 about 5 years: @KuboMD And as long as nobody else ever wants to use your code.
-
Peter Cordes about 5 years: But still, why would you want to write it that way when you could have used for node in ~/sagLogs/*? (You can use basename "$node" or "${node##*/}" if you need the bare filename without the full path.)
-
KuboMD about 5 years: @PeterCordes No reason in particular. I didn't know I could use for node in ~/sagLogs/* - that seems much simpler.
-
KuboMD about 5 years: Interesting! I'll take a look at this. Do you know if attempting this kind of operation (without the help of Parallel) would risk crashing the hypervisor?
-
ChuckCottrill about 5 years: There are ways to run a limited number of background tasks (using bash, perl, python, et al), monitor for task completion, and run more tasks as prior tasks complete. A simple approach would be to collect batches of tasks represented by files in subdirectories, and process a batch at a time. There are other ways...
-
hobbs about 5 years: @KuboMD if you can crash the hypervisor with something so mundane, it's a bug in the hypervisor :)
-
Biswapriyo about 5 years: Does this also include unix-like systems? And what is "GUN parallel"?
-
Baldrickk about 5 years: @Biswapriyo I think it's a typo - he meant GNU gnu.org/software/parallel
-
KuboMD about 5 years: This looks very promising, thank you.
-
Austin Hemmelgarn about 5 years: @Biswapriyo Baldrickk is correct, it is indeed a typo, which I've now corrected. And yes, this is intended to cover most UNIX-like systems.
-
KuboMD about 5 years: Ran a simple test doing cat ~/sagLogs/* >> ~/woah | parallel and holy moly that was fast. 1,054,552 lines in the blink of an eye.
-
Ole Tange about 5 years: The command you gave has dual redirection, so I do not think it does what you intend it to do. GNU Parallel has an overhead of 10 ms per job, so 1M jobs should take on the order of 3 hours.
-
KuboMD about 5 years: I see. If I wanted to use parallel properly to get the cat of every file in ~/sagLogs/ and move it into ~/woah, what would that command look like?
-
Ole Tange about 5 years: cat ~/sagLogs/* > ~/woah (no | parallel).
-
KuboMD about 5 years: Is that to say parallel would be invoked automatically, or that it isn't applicable at all in that situation?
-
Ole Tange about 5 years: It is not applicable at all if all you want to do is simply concatenate the files.
-
pipe about 5 years: @Baldrickk geekz.co.uk/lovesraymond/archive/gun-linux
-
KuboMD about 5 years: I see. What's an example of a relatively simple job I could test with parallel that would be less intensive than my expect script?
-
forest about 5 years: You can prevent the system from going down by setting sensible rlimits.
-
Austin Hemmelgarn about 5 years: @forest Yes, you could use rlimits to prevent the system from crashing, but getting them right in a case like this is not easy (you kind of need to know what the resource requirements for the tasks are beforehand) and doesn't protect the rest of the network from any impact these jobs may cause (which is arguably a potentially much bigger issue than crashing the local system).
-
KuboMD about 5 years: @AustinHemmelgarn Yes. I realize now that crashing the box would be inconvenient, but causing widespread network problems would light up the NOC and probably cost me my job.
-
Peter Cordes about 5 years: GNU Parallel can work similarly to make -j20, where instead of waiting for the last job in a "batch", it keeps n jobs in flight. So one slow job every now and then won't lower throughput unless it's one of the last few to be started. Of course, GNU make is usually already installed, and with the right Makefile can work as a parallel job manager.
-
Peter Cordes about 5 years: @KuboMD a trivial CPU busy loop like awk 'BEGIN{for(i=rand()*10000000; i<100000000;i++){}}' would work for playing around with. Or try it on a task like sleep 10 to see it keep n jobs in flight without using much CPU time, e.g. time parallel sleep ::: {100..1} to run sleeps from 100 down to 1 second.
-
KuboMD about 5 years: @OleTange I'm trying to visualize how the 'jobs' will be grouped at the moment. Right now I have groups of between 1 and 38 nodes per 'site'. I've piped their filenames into parallel with a max of 11 jobs at once (just for echoing). I want to show my supervisor how they'll be grouped. I can separate them by site easily, but is there a way I can get parallel to insert some sort of index or marker when a new batch is started? For those sites with >11 nodes that'll require 2+ batches.
-
Ole Tange about 5 years: @KuboMD I have no idea what you mean by "grouped". Please read doi.org/10.5281/zenodo.1146014 (at least chapters 1+2). Then re-state your question.