Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?


Solution 1

Yes you can. There are a number of ways. One of the easiest is to use Redis as a backend (as easy as calling sudo apt-get install redis-server on an Ubuntu machine; rumor has it that you can run a Redis backend on a Windows machine too).

By using the doRedis package, you can very easily enqueue jobs on a task queue in Redis, and then use one, two, ... idle workers to query the queue. Best of all, you can easily mix operating systems, so yes, your co-workers' Windows machines qualify. Moreover, you can use one, two, three, ... clients as you see fit and scale up or down as needed. The queue does not know or care; it simply supplies jobs.

Best of all, the vignette in the doRedis package has a working example that mixes Linux and Windows clients to make a bootstrapping example run faster.
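
For concreteness, here is a minimal sketch of that pattern, assuming a reachable redis-server; the queue name "jobs", the host name "redis-host", and the toy loop body are placeholders rather than anything from the vignette:

    ## On the master R session (the machine driving the analysis):
    library(doRedis)
    library(foreach)

    registerDoRedis(queue = "jobs", host = "redis-host")  # placeholder queue/host

    results <- foreach(i = 1:100, .combine = c) %dopar% {
      # stand-in for one expensive iteration (e.g. one yield-curve fit)
      sqrt(i)
    }

    removeQueue("jobs")  # tidy up the task queue once the run is finished

    ## On each co-worker's PC (Windows or Linux), start some workers that
    ## pull tasks from the same queue. Starting fewer workers than cores
    ## leaves the machine responsive for email and browsing:
    # library(doRedis)
    # startLocalWorkers(n = 4, queue = "jobs", host = "redis-host")

Because each worker machine decides how many worker processes it starts, a co-worker's PC can contribute, say, four of its eight cores and stay usable for everyday tasks.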

Solution 2

Perhaps not the answer you were looking for, but this is one of those situations where an alternative is so much better that it's hard to ignore.

The cost of AWS clusters is ridiculously low (my emphasis) for exactly these types of computing problems. You pay only for what you use. I can guarantee you that you will save money (at the very least in opportunity cost) by not spending the time trying to convert 12 Windows machines into a cluster. For your purposes, you could probably even do this for free. (IIRC, they still offer free computing time on clusters.)

References:

Some of these instances are so powerful that you probably wouldn't even need to figure out how to set up your work on a cluster (given your current description). As you can see from the references, costs are ridiculously low, ranging from $1 to $4 per hour of compute time.

Solution 3

What about OpenCL?

This would require rewriting the C code, but would allow potentially large speedups. The GPU has immense computing power.
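
As a rough illustration of what GPU offloading from R can look like, here is a sketch modelled on the documented example of the CRAN OpenCL package. The kernel name vscale and the helper gpu_scale are made up for this sketch, and the oclPlatforms/oclDevices/oclSimpleKernel/oclRun calls follow the interface of an older release of that package (the API has changed between versions), so treat this purely as an illustration rather than a drop-in recipe:

    ## Sketch: run a trivial OpenCL kernel (vector scaling) on the GPU from R.
    library(OpenCL)

    p <- oclPlatforms()        # available OpenCL platforms
    d <- oclDevices(p[[1]])    # devices (GPU/CPU) on the first platform

    ## By the package's convention, the first kernel argument is the output
    ## buffer and the second is the element count; the remaining arguments
    ## are supplied through oclRun().
    code <- "
    __kernel void vscale(__global numeric* output, const unsigned int count,
                         __global numeric* input, const numeric a) {
      int i = get_global_id(0);
      if (i < count) output[i] = a * input[i];
    }"

    k <- oclSimpleKernel(d[[1]], "vscale", code, "single")
    gpu_scale <- function(x, a) oclRun(k, length(x), x, a)
    gpu_scale(runif(1e6), 2.5)   # scales a million numbers on the device

The real work would be porting the hot loop of the termstrc C code into such a kernel, which is a much bigger job than this toy example suggests.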

Author by

Thomas Browne

Transport, factorization, visualization of high dimensional real time streaming data across disciplines, with focus on finance and cryptocurrency APIs. Ex emerging markets bond trader, PM, strategist, with comprehensive at-the-coalface knowledge of all cash, option, swap, and FX markets, and now crypto! Also: full-stack data engineer from Linux through Postgres, Kafka, messaging protocols, API expert, comprehensive Python / R / Numpy, visualization libraries, Elixir, soon...Rust GPU programming. Get in touch!

Updated on June 18, 2020

Comments

  • Thomas Browne almost 4 years

    I am running the termstrc yield curve analysis package in R across 10 years of daily bond price data for 5 different countries. This is highly compute-intensive: it takes 3200 seconds per country with a standard lapply, and if I use foreach and %dopar% (with doSNOW) on my 2009 i7 Mac, using all 4 cores (8 with hyperthreading), I get this down to 850 seconds. I need to re-run this analysis every time I add a country (to compute inter-country spreads), and I have 19 countries to go, with many more credit yield curves to come in the future. The time taken is starting to look like a major issue. By the way, the termstrc analysis function in question is called from R but is written in C.

    Now, we're a small company of 12 people (read: limited budget), all equipped with 8 GB RAM, i7 PCs, at least half of which are used for mundane word processing / email / browsing tasks, that is, using at most 5% of their performance. They are all networked using gigabit (but not 10-gigabit) Ethernet.

    Could I cluster some of these underused PCs using MPI and run my R analysis across them? Would the network be affected? Each iteration of the yield curve analysis function takes about 1.2 seconds, so I'm assuming that if the granularity of the parallelism is to pass a whole function iteration to each cluster node, 1.2 seconds should be quite large compared with the gigabit Ethernet latency?

    Can this be done? How? And what would the impact be on my co-workers? Can they continue to read their email while I'm taxing their machines?

    I note that Open MPI seems not to support Windows anymore, while MPICH seems to. Which would you use, if any?

    Perhaps run an Ubuntu virtual machine on each PC?

  • Thomas Browne about 11 years
    Wow - hadn't even thought about the cloud. Okay - I'll give this a shot. At the kind of price points that you're talking about it would indeed be interesting.
  • Thomas Browne about 11 years
    This looks very interesting. Indeed, I googled around on Redis and found that it's probably going to solve another problem that I have, namely sharing large amounts of time-series data amongst many computers (please tell me if I'm misguided here). On the original question: will I be able, using doRedis, to ensure that the R instance on the other PCs does not hog all their CPU resources? Can I, for example, limit it to 4 of their 8 cores? I ask because if I give doSNOW all 8 cores on my Mac or PC, nothing else runs acceptably anymore, despite the multitasking OS.
  • Dirk Eddelbuettel about 11 years
    Yes, each client should be able to control its own limits.
  • Thomas Browne about 11 years
    Having thought about this: because a large part of my work involves parameterizing the function and re-running it, it is quite possible to do 5 hours of work a day on this even on a big cloud-based parallel installation. Let's say $2.50 per hour = $12.50 per day, 20 days per month: we're talking $250 per month. I wouldn't describe that as "ridiculously" low, though I guess if I'm getting tons of computing power for it, it will indeed be cost-effective.
  • Thomas Browne almost 11 years
    I would dearly love to use OpenCL. I am back to taking 2 hours per country for optimization, using five 4-core computers clustered with doRedis. Don't get me wrong, doRedis is great, as it would otherwise take over 9 hours, but it seems to me that massive teraflops of computing horsepower are being left idle. I think I would need the uniroot function to use OpenCL. What are the ways of using OpenCL from R without being an in-depth C programmer, anyway?
  • Demi over 10 years
    I don't know, sorry. I have never used OpenCL - just heard about it. What you could do is look for which parts of the algorithm are the biggest computing hogs (by profiling), and see if there are GPU-accelerated libraries available for any of them.
  • Thomas Browne about 10 years
    I will add that I have happily been using doRedis since you answered the question (so for about a year now), and it works very well indeed (though sometimes I have to manually shut down the R sessions that it creates on the co-worker machines once the jobs are over).