From ccruz@argento.bu.edu Fri Nov 16 15:58:22 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id PAA04172 for cps; Fri, 16 Nov 2001 15:58:16 -0500 (EST)
Date: Fri, 16 Nov 2001 15:58:16 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111162058.PAA04172@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users - 16 cpu's to test
Status: OR

Hi,

I have configured 8 machines to run MOSIX. Although it is not running
smoothly yet, it "kind of" works. If anyone is interested in testing, please
do so by submitting a couple (or so) of jobs so that I can see if things
break down.

Remember to log in to clio using 'rsh' and submit to the background with the
'nice' command. You can then see all of the jobs with 'top'.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Thu Nov 15 11:43:54 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id LAA24044 for cps; Thu, 15 Nov 2001 11:43:49 -0500 (EST)
Date: Thu, 15 Nov 2001 11:43:49 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111151643.LAA24044@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users - clio up
Status: OR

I rebooted clio, and I hope not to have to do so again in the near future,
but who knows...

I would like to get feedback from the people who are submitting jobs on the
MOSIX cluster, so I'll probably go from person to person in the future.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Thu Nov 15 09:39:47 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id JAA96126 for cps; Thu, 15 Nov 2001 09:39:40 -0500 (EST)
Date: Thu, 15 Nov 2001 09:39:40 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111151439.JAA96126@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users (ATHLONS)
Status: OR

I need to reboot clio sometime today. I am still troubleshooting the network
problem that causes it to hang, but I think I am getting closer to solving
it.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Tue Nov 13 11:23:18 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id LAA74497 for cps; Tue, 13 Nov 2001 11:23:13 -0500 (EST)
Date: Tue, 13 Nov 2001 11:23:13 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111131623.LAA74497@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users (ATHLONS)
Status: OR

'clio' had problems over the weekend and I had to reboot it this morning.
This means that all jobs running on the MOSIX cluster died. Sorry for the
reboot, but I really appreciate that you keep testing the new machines. Since
the setup is not stable yet, please expect more (or fewer) problems in the
future.
I am currently trying to fix the problem with clio hanging, and I am setting
up other machines as well.

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Wed Nov 7 15:52:31 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id PAA28277 for cps; Wed, 7 Nov 2001 15:52:25 -0500 (EST)
Date: Wed, 7 Nov 2001 15:52:25 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111072052.PAA28277@argento.bu.edu>
To: cps@argento.bu.edu
Subject: FREE CPU hours - 1.2Ghz Athlons
Status: O

Hi,

As you may (or may not) know, we have new machines in the center. I have
already hooked up (configured and hidden) 4 of them on the CPS network. They
are dual Athlon (1.2Ghz) machines with 512Mb of RAM. They run an upgraded
RedHat 7.2, with XFree86 4.1, KDE 2.2.1 and KOffice 1.1 (plus a bunch of
other things that are too many to enumerate). Of course, there are a bunch of
programs that I have not had time to install yet (e.g. xmgr). I also
installed something called MOSIX, which I'll explain below.

--note: please do NOT sit down and log in at the console yet--

I did a preliminary 'whetstone' benchmark and got a 67% speed improvement
over the yanko's, but this number should only be taken as a "best case"
scenario, since your own code might have very different performance.

The MOSIX software is a load-balancing scheme at the level of the kernel. It
automatically tries to keep all machines in a cluster loaded at the same
level. This means that if there are 4 machines and 3 jobs on one of them,
MOSIX will send two of those jobs to two different machines (or CPUs), thus
leveling the load. If, as time progresses, the load increases or decreases,
MOSIX dynamically moves jobs around (it takes about a couple of seconds) to
keep the performance good. The beauty is that MOSIX does this without the
user having to care about how; you just submit jobs in the usual way.

So the setup is the following: the four Athlons form a MOSIX cluster and
share jobs among themselves. Since they are dual-CPU machines, they form an
8-CPU cluster -- 8 jobs in that cluster will run at full speed.

I would like people to test this mini-cluster and let me know how much you
like (or dislike) it. The procedure is the following if you want to
participate (MAX 1 job per user, please):

1. rsh into clio (NOT ssh), e.g.
      machine% rsh clio
2. submit your job, e.g.
      clio% nice a.out &
3. go for lunch (or dinner, whichever is appropriate).

You can check your job on clio by running 'top'. You will see that if there
are 8 or fewer jobs running, each should take about 99% of a CPU. If there
are more, then CPU time is shared according to MOSIX's algorithms. (A
complete example session appears further down, just before the sign-off.)

NOTE: not all programs will migrate around the cluster. Migration is decided
based on memory usage, I/O rate, etc. Also, you do not need to know the names
of the other machines in the MOSIX cluster...

I will be adding more machines to the Athlon cluster, up to 10, which will
mean a 20-CPU cluster. If things work out, I can also start configuring the
yanko's to form their own MOSIX cluster for better performance.
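To put the whole procedure together in one place, a typical session would
look something like the lines below. 'a.out' is just a placeholder for
whatever your own executable is called; everything else is exactly the
commands described above:

      machine% rsh clio
      clio% nice a.out &
      clio% top

('top' on clio shows all of the jobs running in the cluster.)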
Let me know of any problems/questions,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Tue Nov 20 12:13:01 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id MAA38950 for cps; Tue, 20 Nov 2001 12:12:56 -0500 (EST)
Date: Tue, 20 Nov 2001 12:12:56 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111201712.MAA38950@argento.bu.edu>
To: cps@argento.bu.edu
Subject: clio - up
Status: OR

Clio is now up and running, and the problems communicating with meta have
disappeared. They are both talking to each other at 100Mbps. You may resubmit
your jobs now.

Just one thing: if your job uses more than 100Mb of RAM, please see me before
submitting it, so that the program does not compete with other ones for
memory before migrating.

You can visualize the load on the whole MOSIX cluster by logging into clio
and typing 'mon'.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Wed Nov 21 12:00:36 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id MAA45163 for cps; Wed, 21 Nov 2001 12:00:29 -0500 (EST)
Date: Wed, 21 Nov 2001 12:00:29 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111211700.MAA45163@argento.bu.edu>
To: cps@argento.bu.edu
Subject: clio - reboot - fix for I/O
Status: OR

Hi,

Although the majority of programs seem to be running OK on the MOSIX cluster,
the big ones (a lot of RAM, > 100Mb) that also do a lot of I/O seem to be
getting almost 0% CPU. I am working on a fix that I will try this afternoon,
which means a reboot of clio.

If this works, large programs will be able to run on a remote node and write
directly to meta, bypassing the communication with clio. This should improve
the I/O and CPU yield. I'll write with more details later.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Wed Nov 21 15:15:20 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id PAA52077 for cps; Wed, 21 Nov 2001 15:15:15 -0500 (EST)
Date: Wed, 21 Nov 2001 15:15:15 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111212015.PAA52077@argento.bu.edu>
To: cps@argento.bu.edu
Subject: clio - up
Status: OR

Hi,

The MOSIX cluster is up again. There is a new feature that I am testing: all
the machines in the cluster see each other mirrored in a directory called
/mfs. This directory permits machines to run guest jobs and lets those jobs
write directly to meta without having to go back to clio (the node they were
initially submitted from).

So for small programs that write once in a while, nothing changes: submit
with 'nice' on clio, and MOSIX will take care of migrating them to the
optimum node.

For larger programs (RAM usage > 100Mb), you need a couple of things. First,
you should submit your job directly to a particular node other than clio
(this is done by writing the desired node number, 1-7 or 10, to a file called
/proc/self/migrate). The second thing is to append '/mfs/here' to the path of
the file that you are writing. For example, if you are writing to a file in
/project/meta/AD/ccruz/myfile.txt, then in your executable define the file as
/mfs/here/project/meta/AD/ccruz/myfile.txt -- this ensures that every node
sees the file locally through the /mfs filesystem...
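To make the two steps concrete, here is a minimal C sketch of what I mean.
Treat it as a guide rather than tested code: the node number (3) is only an
example, the file path is the same sample path as above, and I am assuming
that writing the node number as plain text to /proc/self/migrate is enough to
request the migration.

      #include <stdio.h>

      int main(void)
      {
          /* Step 1 (assumption: a plain-text node number requests migration):
             ask MOSIX to move this process to node 3 (valid nodes: 1-7, 10). */
          FILE *mig = fopen("/proc/self/migrate", "w");
          if (mig != NULL) {
              fprintf(mig, "3\n");
              fclose(mig);
          }

          /* Step 2: open the output file through /mfs/here, so the node we
             migrated to writes to meta directly instead of going back to clio. */
          FILE *out = fopen("/mfs/here/project/meta/AD/ccruz/myfile.txt", "w");
          if (out == NULL) {
              perror("fopen");
              return 1;
          }

          fprintf(out, "results go here\n");
          fclose(out);
          return 0;
      }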
If the above sounds too complicated or does not work, please come by and I'll
be happy to explain in more detail.

Thanks,

- Luis

p.s.:
(i) to see all jobs and where they are running, use the modified top - mtop -
on clio.
(ii) to see a dynamic display of the load of the cluster, type 'mon'.
(iii) for yet more graphics about the load and memory usage, type 'mosixview'.

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@hypate.bu.edu Mon Nov 26 14:05:22 2001
Received: from buphy.bu.edu (BUPHY.BU.EDU [128.197.41.42]) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) with ESMTP id OAA30999; Mon, 26 Nov 2001 14:05:22 -0500 (EST)
Received: from relay3.bu.edu (relay3.bu.edu [128.197.27.246]) by buphy.bu.edu ((8.9.3.buoit.v1.0)/8.9.3/(BU-S-10/28/1999-v1.0pre2)) with ESMTP id OAA23362855; Mon, 26 Nov 2001 14:05:21 -0500 (EST)
Received: from argento.bu.edu (ARGENTO.BU.EDU [128.197.42.78]) by relay3.bu.edu ((8.9.3.buoit.v1.0)/8.8.5/(BU-RELAY-11/18/99-b2)) with ESMTP id OAA08081; Mon, 26 Nov 2001 14:05:04 -0500 (EST)
Received: from hypate.bu.edu (IDENT:root@hypate.bu.edu [128.197.42.67]) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) with ESMTP id OAA30874; Mon, 26 Nov 2001 14:05:03 -0500 (EST)
Received: (from ccruz@localhost) by hypate.bu.edu (8.9.3/8.9.3) id OAA30537; Mon, 26 Nov 2001 14:05:03 -0500
Date: Mon, 26 Nov 2001 14:05:03 -0500
From: Luis Cruz-Cruz
Message-Id: <200111261905.OAA30537@hypate.bu.edu>
To: hes@bu.edu, trunfio@bu.edu
Subject: Mosix - is it worth it?
Cc: ccruz@hypate.bu.edu
Status: OR

After three weeks of configuring, fixing, tuning, and stressing over the new
machines, I think that Mosix is finally running at some acceptable level
(~85%), or at least at a level where the option of removing it because it
``stinks'' is no longer on the table. It is showing some promise -- the goal
behind my losing nights configuring this thing is basically to optimize the
use of the machines and to make them very easy to administer.

Now I (or anyone) can go to only one machine (clio) and see all processes on
all 10 machines, how much memory they are using, how long they have been
running, and who is running how many. From clio, users can also submit jobs
that will go automatically to any of the other machines. To me this is an
advantage because people will police themselves on job restrictions and keep
at least some kind of sense of how these machines are used. In addition, I
can reboot any of the machines at any time without losing background jobs
(except for the head machine, of course). For all practical purposes, they
are a "cluster" much in the same sense as the bigger one that we want to buy
later. Of course, once people start logging in and using interactive
sessions, the performance will decrease, and that is another reason for the
bigger dedicated cluster.

Anyhow, I am starting to collect people's reactions to the cluster and the
setup, and to tune accordingly. If this really works and people are happy,
then I can do the same for the older linuxes, so that anyone can take
advantage of the 40 or so CPU's that we have in a transparent fashion.
- Luis

From ccruz@argento.bu.edu Tue Dec 4 14:42:00 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id OAA16072 for cps; Tue, 4 Dec 2001 14:41:55 -0500 (EST)
Date: Tue, 4 Dec 2001 14:41:55 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200112041941.OAA16072@argento.bu.edu>
To: cps@argento.bu.edu
Subject: Access to new machines in 101
Status: OR

Hi,

If anybody is interested, you can log in to any of the four new Athlon
machines in 101. Please note that since they are using a new version of KDE,
you should hit the "ignore" button on the pop-up that appears shortly after
you enter your login name and password. Otherwise, you might have problems
when logging back in to one of the older linuxes.

I have not had time to install every conceivable program that currently
exists on the other linuxes, but if you need to run e.g. xmgr, you can always
log in remotely to one of the older machines and run it there. I'll install
things as time permits.

DO NOT run background jobs on those machines. They are part of the Mosix
cluster. Mosix users should run from clio.

Any problems/questions, you know where to find me...

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz