From ccruz@argento.bu.edu Fri Nov 16 15:58:22 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id PAA04172 for cps; Fri, 16 Nov 2001 15:58:16 -0500 (EST)
Date: Fri, 16 Nov 2001 15:58:16 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111162058.PAA04172@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users - 16 cpu's to test
Status: OR

Hi,

I have configured 8 machines to run MOSIX. Although it is not running
smoothly yet, it "kind of" works. If anyone is interested in testing, please
do so by submitting a couple (or so) of jobs so that I can see if things
break down.

Remember to log in to clio using 'rsh' and submit to the background with the
'nice' command. You can then see all of the jobs with 'top'.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Thu Nov 15 11:43:54 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id LAA24044 for cps; Thu, 15 Nov 2001 11:43:49 -0500 (EST)
Date: Thu, 15 Nov 2001 11:43:49 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111151643.LAA24044@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users - clio up
Status: OR

I rebooted clio, and I hope not to have to do so again in the near future,
but who knows...

I would like to get feedback from the people who are submitting jobs on the
MOSIX cluster, so I'll probably go from person to person in the future.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Thu Nov 15 09:39:47 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id JAA96126 for cps; Thu, 15 Nov 2001 09:39:40 -0500 (EST)
Date: Thu, 15 Nov 2001 09:39:40 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111151439.JAA96126@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users (ATHLONS)
Status: OR

I need to reboot clio sometime today. I am still troubleshooting the network
problem that causes it to hang, but I think I am getting closer to solving
it.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Tue Nov 13 11:23:18 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id LAA74497 for cps; Tue, 13 Nov 2001 11:23:13 -0500 (EST)
Date: Tue, 13 Nov 2001 11:23:13 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111131623.LAA74497@argento.bu.edu>
To: cps@argento.bu.edu
Subject: to MOSIX users (ATHLONS)
Status: OR

'clio' had problems over the weekend and I had to reboot it this morning.
This means that all jobs running on the MOSIX cluster died. Sorry for the
reboot, but I really appreciate that you keep testing the new machines. Since
the setup is not stable yet, please expect more (or fewer) problems in the
future.
I am currently trying to fix the problem with clio hanging, and I am setting
up other machines as well.

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Wed Nov 7 15:52:31 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id PAA28277 for cps; Wed, 7 Nov 2001 15:52:25 -0500 (EST)
Date: Wed, 7 Nov 2001 15:52:25 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111072052.PAA28277@argento.bu.edu>
To: cps@argento.bu.edu
Subject: FREE CPU hours - 1.2Ghz Athlons
Status: O

Hi,

As you may (or may not) know, we have new machines in the center. I have
already hooked up (configured and hidden) 4 of them on the CPS network. They
are dual Athlon (1.2Ghz) machines with 512Mb of RAM. They run an upgraded
RedHat 7.2, with XFree86 4.1, KDE 2.2.1 and KOffice 1.1 (plus a bunch of
other things that are too many to enumerate). Of course, there are a bunch of
programs that I have not had time to install yet (e.g. xmgr). I also
installed something called MOSIX, which I'll explain below.

--note: please do NOT sit down and log in at the console yet--

I did a preliminary 'whetstone' benchmark and got a 67% speed improvement
over the yanko's, but this number should only be taken as a "best case"
scenario, since your own code might have very different performance.

The MOSIX software is a load-balancing scheme at the level of the kernel. It
automatically tries to keep all machines in a cluster loaded at the same
level. This means that if there are 4 machines and 3 jobs on one of them,
MOSIX will send two of those jobs to two different machines (or CPUs), thus
leveling the load. If, as time progresses, the load increases or decreases,
MOSIX dynamically moves jobs around (it takes about a couple of seconds) to
keep the performance good. The beauty is that MOSIX does this without the
user having to care about how; you just submit jobs in the usual way.

So the setup is the following: the four Athlons form a MOSIX cluster and
share jobs among themselves. Since they are dual-CPU machines, they form an
8-CPU cluster -- 8 jobs in that cluster will run at full speed.

I would like people to test this mini-cluster and let me know how much you
like (or dislike) it. The procedure is the following if you want to
participate (MAX 1 job per user, please):

1. rsh into clio (NOT ssh), e.g.
      machine% rsh clio
2. submit your job, e.g.
      clio% nice a.out &
3. go for lunch (or dinner, whichever is appropriate).

You can check your job on clio by running 'top'. You will see that if there
are 8 or fewer jobs running, each should take about 99% of a CPU. If there
are more, then CPU time is shared according to MOSIX's algorithms. (A
complete example session appears further down, just before the sign-off.)

NOTE: not all programs will migrate around the cluster. Migration is decided
based on memory usage, I/O rate, etc. Also, you do not need to know the names
of the other machines in the MOSIX cluster...

I will be adding more machines to the Athlon cluster, up to 10, which will
mean a 20-CPU cluster. If things work out, I can also start configuring the
yanko's to form their own MOSIX cluster for better performance.
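To put the whole procedure together in one place, a typical session would
look something like the lines below. 'a.out' is just a placeholder for
whatever your own executable is called; everything else is exactly the
commands described above:

      machine% rsh clio
      clio% nice a.out &
      clio% top

('top' on clio shows all of the jobs running in the cluster.)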
Let me know of any problems/questions,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Tue Nov 20 12:13:01 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id MAA38950 for cps; Tue, 20 Nov 2001 12:12:56 -0500 (EST)
Date: Tue, 20 Nov 2001 12:12:56 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111201712.MAA38950@argento.bu.edu>
To: cps@argento.bu.edu
Subject: clio - up
Status: OR

Clio is now up and running, and the problems communicating with meta have
disappeared. They are both talking to each other at 100Mbps. You may resubmit
your jobs now.

Just one thing: if your job uses more than 100Mb of RAM, please see me before
submitting it, so that the program does not compete with other ones for
memory before migrating.

You can visualize the load on the whole MOSIX cluster by logging into clio
and typing 'mon'.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Wed Nov 21 12:00:36 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id MAA45163 for cps; Wed, 21 Nov 2001 12:00:29 -0500 (EST)
Date: Wed, 21 Nov 2001 12:00:29 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111211700.MAA45163@argento.bu.edu>
To: cps@argento.bu.edu
Subject: clio - reboot - fix for I/O
Status: OR

Hi,

Although the majority of programs seem to be running OK on the MOSIX cluster,
the big ones (a lot of RAM, > 100Mb) that also do a lot of I/O seem to be
getting almost 0% CPU. I am working on a fix that I will try this afternoon,
which means a reboot of clio.

If this works, large programs will be able to run on a remote node and write
directly to meta, bypassing the communication with clio. This should improve
the I/O and CPU yield. I'll write with more details later.

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@argento.bu.edu Wed Nov 21 15:15:20 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id PAA52077 for cps; Wed, 21 Nov 2001 15:15:15 -0500 (EST)
Date: Wed, 21 Nov 2001 15:15:15 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200111212015.PAA52077@argento.bu.edu>
To: cps@argento.bu.edu
Subject: clio - up
Status: OR

Hi,

The MOSIX cluster is up again. There is a new feature that I am testing: all
the machines in the cluster see each other mirrored in a directory called
/mfs. This directory permits machines to run guest jobs and lets those jobs
write directly to meta without having to go back to clio (the node they were
initially submitted from).

So for small programs that write once in a while, nothing changes: submit
with 'nice' on clio, and MOSIX will take care of migrating them to the
optimum node.

For larger programs (RAM usage > 100Mb), you need a couple of things. First,
you should submit your job directly to a particular node other than clio
(this is done by writing the desired node number, 1-7 or 10, to a file called
/proc/self/migrate). The second thing is to append '/mfs/here' to the path of
the file that you are writing. For example, if you are writing to a file in
/project/meta/AD/ccruz/myfile.txt, then in your executable define the file as
/mfs/here/project/meta/AD/ccruz/myfile.txt -- this ensures that every node
sees the file locally through the /mfs filesystem...
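To make the two steps concrete, here is a minimal C sketch of what I mean.
Treat it as a guide rather than tested code: the node number (3) is only an
example, the file path is the same sample path as above, and I am assuming
that writing the node number as plain text to /proc/self/migrate is enough to
request the migration.

      #include <stdio.h>

      int main(void)
      {
          /* Step 1 (assumption: a plain-text node number requests migration):
             ask MOSIX to move this process to node 3 (valid nodes: 1-7, 10). */
          FILE *mig = fopen("/proc/self/migrate", "w");
          if (mig != NULL) {
              fprintf(mig, "3\n");
              fclose(mig);
          }

          /* Step 2: open the output file through /mfs/here, so the node we
             migrated to writes to meta directly instead of going back to clio. */
          FILE *out = fopen("/mfs/here/project/meta/AD/ccruz/myfile.txt", "w");
          if (out == NULL) {
              perror("fopen");
              return 1;
          }

          fprintf(out, "results go here\n");
          fclose(out);
          return 0;
      }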
If the above sounds too complicated or does not work, please come by and I'll
be happy to explain in more detail.

Thanks,

- Luis

p.s.:
(i) to see all jobs and where they are running, use the modified top - mtop -
on clio.
(ii) to see a dynamic display of the load of the cluster, type 'mon'.
(iii) for yet more graphics about the load and memory usage, type 'mosixview'.

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz

From ccruz@hypate.bu.edu Mon Nov 26 14:05:22 2001
Received: from buphy.bu.edu (BUPHY.BU.EDU [128.197.41.42]) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) with ESMTP id OAA30999; Mon, 26 Nov 2001 14:05:22 -0500 (EST)
Received: from relay3.bu.edu (relay3.bu.edu [128.197.27.246]) by buphy.bu.edu ((8.9.3.buoit.v1.0)/8.9.3/(BU-S-10/28/1999-v1.0pre2)) with ESMTP id OAA23362855; Mon, 26 Nov 2001 14:05:21 -0500 (EST)
Received: from argento.bu.edu (ARGENTO.BU.EDU [128.197.42.78]) by relay3.bu.edu ((8.9.3.buoit.v1.0)/8.8.5/(BU-RELAY-11/18/99-b2)) with ESMTP id OAA08081; Mon, 26 Nov 2001 14:05:04 -0500 (EST)
Received: from hypate.bu.edu (IDENT:root@hypate.bu.edu [128.197.42.67]) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) with ESMTP id OAA30874; Mon, 26 Nov 2001 14:05:03 -0500 (EST)
Received: (from ccruz@localhost) by hypate.bu.edu (8.9.3/8.9.3) id OAA30537; Mon, 26 Nov 2001 14:05:03 -0500
Date: Mon, 26 Nov 2001 14:05:03 -0500
From: Luis Cruz-Cruz
Message-Id: <200111261905.OAA30537@hypate.bu.edu>
To: hes@bu.edu, trunfio@bu.edu
Subject: Mosix - is it worth it?
Cc: ccruz@hypate.bu.edu
Status: OR

After three weeks of configuring, fixing, tuning, and stressing over the new
machines, I think that Mosix is finally running at some acceptable level
(~85%), or at least at a level where the option of removing it because it
``stinks'' is no longer on the table. It is showing some promise -- the goal
behind my losing nights configuring this thing is basically to optimize the
use of the machines and to make them very easy to administer.

Now I (or anyone) can go to only one machine (clio) and see all processes on
all 10 machines, how much memory they are using, how long they have been
running, and who is running how many. From clio, users can also submit jobs
that will go automatically to any of the other machines. To me this is an
advantage because people will police themselves on job restrictions and keep
at least some kind of sense of how these machines are used. In addition, I
can reboot any of the machines at any time without losing background jobs
(except for the head machine, of course). For all practical purposes, they
are a "cluster" much in the same sense as the bigger one that we want to buy
later. Of course, once people start logging in and using interactive
sessions, the performance will decrease, and that is another reason for the
bigger dedicated cluster.

Anyhow, I am starting to collect people's reactions to the cluster and the
setup, and to tune accordingly. If this really works and people are happy,
then I can do the same for the older linuxes, so that anyone can take
advantage of the 40 or so CPU's that we have in a transparent fashion.
- Luis

From ccruz@argento.bu.edu Tue Dec 4 14:42:00 2001
Received: (from ccruz@localhost) by argento.bu.edu (SGI-8.9.3/8.9.3/(BU-S-10/28/1999-v1.0pre2)) id OAA16072 for cps; Tue, 4 Dec 2001 14:41:55 -0500 (EST)
Date: Tue, 4 Dec 2001 14:41:55 -0500 (EST)
From: Luis Cruz-Cruz
Message-Id: <200112041941.OAA16072@argento.bu.edu>
To: cps@argento.bu.edu
Subject: Access to new machines in 101
Status: OR

Hi,

If anybody is interested, you can log in to any of the four new Athlon
machines in 101. Please note that since they are using a new version of KDE,
you should hit the "ignore" button on the pop-up that appears shortly after
you enter your login name and password. Otherwise, you might have problems
when logging back in to one of the older linuxes.

I have not had time to install every conceivable program that currently
exists on the other linuxes, but if you need to run e.g. xmgr, you can always
log in remotely to one of the older machines and run it there. I'll install
things as time permits.

DO NOT run background jobs on those machines. They are part of the Mosix
cluster. Mosix users should run from clio.

Any problems/questions, you know where to find me...

Thanks,

- Luis

-------------------------------------------
Luis Cruz
Center for Polymer Studies
Boston University
ccruz@bu.edu
http://polymer.bu.edu/~ccruz