Tuesday, 23 November 2010

DropBox distributed computing

Nothing beats a free lunch. One tool that is becoming more and more important to me is DropBox. It is a great tool for storing your files and sharing them across many desktops and project members.

Anything else? Yes: why not use it as a distributed computer? After all, we all know that if you can leave GUI stuff and infrastructure out of it, many things become really simple.

What I did:

Create one folder, DistributedComputer.
Share it with everyone you trust to run a computing node for you.

In it, place everything needed to pick a job, start an executable that does the processing, and then write the output to a directory.

I used one large text file to contain all the parts of the problem and two directories for tracking: Started and Done. A simple program picks a random problem (it is important that the pick is random) and starts another program that runs the calculation. When the calculation starts, a file is created in the Started folder with the problem id as its name. When the calculation is finished, a file is created in the Done folder with the problem id as its name and the output as its content.
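The bookkeeping described above can be sketched in a few lines of Python. This is only a rough illustration, not the actual programs I used: the function names are made up, and it assumes the text file holds one problem per line with the line number serving as the problem id.

```python
from pathlib import Path


def load_problems(problem_file: Path) -> dict[str, str]:
    """Read the problem file.

    Assumption: one problem per line, and the 1-based line number is the id.
    Blank lines are skipped.
    """
    lines = problem_file.read_text(encoding="utf-8").splitlines()
    return {str(i): line for i, line in enumerate(lines, start=1) if line.strip()}


def mark_started(root: Path, problem_id: str) -> None:
    """Create an empty file named after the problem id in Started."""
    started = root / "Started"
    started.mkdir(parents=True, exist_ok=True)
    (started / problem_id).touch()


def mark_done(root: Path, problem_id: str, output: str) -> None:
    """Create a file in Done named after the problem id, holding the output."""
    done = root / "Done"
    done.mkdir(parents=True, exist_ok=True)
    (done / problem_id).write_text(output, encoding="utf-8")
```

Here root would be the local path of the shared DistributedComputer folder. DropBox takes care of getting the marker files to the other nodes.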

Super simple and it works!

Some notes:
Picking a random problem is important. If a node loses its network connection, it can keep working on randomly selected problems with only a small risk of duplicating work. If the problems were picked in sequence, all nodes without a network connection would duplicate each other's work.
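To put a rough number on that, here is a small sketch under simplified assumptions: two nodes are offline at the same time, and each solves the same number of distinct problems picked uniformly at random from the same pool. The function names are only for illustration. With sequential picking, both nodes would walk through the same problems, so every one of them would be duplicated.

```python
import random


def expected_random_overlap(total: int, per_node: int) -> float:
    """Expected number of problems that two offline nodes both solve when each
    picks per_node distinct problems uniformly at random out of total."""
    return per_node * per_node / total


def simulated_overlap(total: int, per_node: int, seed: int) -> int:
    """Count the duplicated problems in one simulated run of two offline nodes."""
    rng = random.Random(seed)
    node_a = set(rng.sample(range(total), per_node))
    node_b = set(rng.sample(range(total), per_node))
    return len(node_a & node_b)
```

For example, with 1000 open problems and 50 solved per node while offline, expected_random_overlap(1000, 50) gives 2.5 duplicated problems on average, compared with all 50 for sequential picking.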

The program that picks the problem and starts the program that solves it should of course verify that the problem isn't already started (known by checking the Started directory) or done. If every remaining problem has already been started, pick one that at least isn't done yet.
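That selection rule could look roughly like this in Python. It is a sketch under the assumption that the marker files in Started and Done are named exactly after the problem ids; ids_in and pick_problem are hypothetical names, not part of any tool.

```python
import random
from pathlib import Path


def ids_in(directory: Path) -> set[str]:
    """Return the file names in a tracking directory (empty if it is missing)."""
    if not directory.is_dir():
        return set()
    return {p.name for p in directory.iterdir() if p.is_file()}


def pick_problem(all_ids, root: Path, rng=random):
    """Pick a random problem id, preferring problems nobody has started.

    Falls back to started-but-not-done problems, and returns None when
    everything is done.
    """
    started = ids_in(root / "Started")
    done = ids_in(root / "Done")
    not_done = [pid for pid in all_ids if pid not in done]
    fresh = [pid for pid in not_done if pid not in started]
    if fresh:
        return rng.choice(fresh)
    if not_done:
        return rng.choice(not_done)
    return None
```

The fallback to started-but-unfinished problems means a job abandoned by a node that went away for good still gets solved eventually.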

The real beauty of it:
  • Coding some simple command-line tools that work with files is really simple. It is also much less work to make them platform independent if needed.
  • DropBox will handle tracking for you. You can see which computer created each file and when a problem was started and finished.
  • You can easily continue refining and optimizing the program that solves the problem. Each time a new problem is picked, the executable is reloaded, so all you need to do is put a new executable in the shared folder.
  • Zero code for communication and synchronization, only two directories to check for status.
  • All members can easily follow the project and contribute improvements.
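The executable reload mentioned in the list comes for free if the picker launches the solver as a new process for every problem. A minimal sketch, assuming the solver takes the problem text as its last command-line argument and prints its result to standard output (both are assumptions for this example, and run_solver is a made-up name):

```python
import subprocess


def run_solver(command: list[str], problem: str) -> str:
    """Start the solver as a fresh process and return what it prints.

    Because a new process is started for every problem, whatever executable
    is in the shared folder at that moment is the one that runs.
    """
    completed = subprocess.run(
        command + [problem],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the solver exits non-zero
    )
    return completed.stdout
```

The command list would point at the solver inside the shared folder, so dropping in a new build takes effect on the next problem without restarting the node.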

Issues:
  • Size limits: 2 GB isn't huge, but it should cover a decent set of problems. If the results are big, keep two files in the Done folder, one to mark the problem as done and one with the actual results. Then you can gather the results and process them as they appear, and then delete them.
  • Since the files are shared, there is potential for users to grab the complete results.
  • There is also a risk that users wreck your files and data. That is not very nice in an internet project with thousands of users, but quite OK if you are doing this with some friends.
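The two-file workaround for the size limit could be sketched like this. The ".result" suffix and both function names are made up for the example. The marker file keeps the bare problem id as its name, so the done check still works after the bulky result has been moved away.

```python
from pathlib import Path

RESULT_SUFFIX = ".result"  # hypothetical naming convention for the big files


def finish_with_large_result(done_dir: Path, problem_id: str, output: str) -> None:
    """Write the result file first, then the small marker that says 'done'."""
    done_dir.mkdir(parents=True, exist_ok=True)
    (done_dir / (problem_id + RESULT_SUFFIX)).write_text(output, encoding="utf-8")
    (done_dir / problem_id).touch()


def collect_results(done_dir: Path, archive_dir: Path) -> list[str]:
    """Copy result files to a local archive and delete them from the shared
    folder. The marker files stay behind."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    collected = []
    for result_file in sorted(done_dir.glob("*" + RESULT_SUFFIX)):
        problem_id = result_file.name[: -len(RESULT_SUFFIX)]
        text = result_file.read_text(encoding="utf-8")
        (archive_dir / problem_id).write_text(text, encoding="utf-8")
        result_file.unlink()  # frees space in the shared folder
        collected.append(problem_id)
    return collected
```

The archive directory should live outside the shared folder, otherwise the results still count against the quota.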