[LinuxFailSafe] Failsafe Licence LGPL / GPL confusion

Discussion:

Kashif Shaikh

2003-11-28 17:20:49 UTC

Hello everyone,

I know this list is hardly used by anyone, but I was wondering if
someone from SGI or SuSE can answer my question:

Most of the header files in FailSafe/cluster_services/include/ are LGPL,
which permits proprietary add-ons to use the underlying CHOAS
libraries (cms/gcs/srm). However, some of the header files are
confusing. For example, FailSafe/cluster_services/include/ci_config.h is
LGPL, but the header files it includes (e.g. cdb.h) are GPL-only.

Won't this contaminate proprietary add-ons? In this case ci_config.h
should be labeled as GPL, not LGPL. Or perhaps the cdb library
should be marked as LGPL (only the copyright owners, i.e. SGI, can do that).
I'm not arguing about whether a library should be GPL or LGPL; I'm just saying the
license application should be made clearer, so people like me can
evaluate FailSafe vs. heartbeat (its API is consistently LGPL). I can't
make the headers clearer myself, because it depends on what the copyright
owners want to LGPL. So the ball's in your court; I just need objective
information.
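
A mismatch like this can be spotted mechanically. The following is only a rough sketch. It assumes each header carries the standard GNU notice wording, with the words "General Public" on one line, and that included files sit in the same directory. license_of and check_dir are made-up helper names, not part of FailSafe.

```sh
#!/bin/sh
# Sketch only: report LGPL headers that directly include GPL-only headers.
# Assumptions: licence notices use the standard GNU wording, and included
# files live in the same directory as the header that includes them.

# Print LGPL, GPL or unknown for the file given as $1.
license_of() {
    if grep -Eqi '(Lesser|Library) General Public' "$1"; then
        echo LGPL
    elif grep -qi 'General Public' "$1"; then
        echo GPL
    else
        echo unknown
    fi
}

# Scan every *.h file in directory $1 and print one line per mismatch.
check_dir() {
    dir=$1
    for hdr in "$dir"/*.h; do
        [ -f "$hdr" ] || continue
        [ "$(license_of "$hdr")" = LGPL ] || continue
        # Extract file names from lines such as: #include "cdb.h" or <cdb.h>
        sed -n 's/^[[:space:]]*#[[:space:]]*include[[:space:]]*[<"]\([^>"]*\)[>"].*/\1/p' "$hdr" |
        while read -r inc; do
            [ -f "$dir/$inc" ] || continue
            if [ "$(license_of "$dir/$inc")" = GPL ]; then
                echo "$(basename "$hdr") (LGPL) includes $inc (GPL)"
            fi
        done
    done
}

# Run against a directory when one is given on the command line.
if [ $# -gt 0 ]; then
    check_dir "$1"
fi
```

Run as, say, `sh check.sh FailSafe/cluster_services/include` (check.sh being whatever the script is saved as). It only follows direct includes, so a GPL header pulled in two levels down would be missed.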

Kashif

Lars Marowsky-Bree

2003-11-28 19:44:16 UTC

On 2003-11-28T12:20:49,

Post by Kashif Shaikh
I'm not arguing about whether a library should be GPL or LGPL; I'm just saying the
license application should be made clearer, so people like me can
evaluate FailSafe vs. heartbeat (its API is consistently LGPL). I can't
make the headers clearer myself, because it depends on what the copyright
owners want to LGPL. So the ball's in your court; I just need objective
information.

SGI owns most of the files and would have to reevaluate the copyright
decisions.

However, if you are comparing FailSafe to heartbeat, I can tell you that
as of today, it's comparing a spaceship to a car ;-) FailSafe is much
further advanced; it's not a reasonable comparison. The two benefits
heartbeat has over FailSafe are less complexity and tighter security.

Sincerely,
Lars Marowsky-Brée <lmb at suse.de>

--
High Availability & Clustering \ ever tried. ever failed. no matter.
SUSE Labs | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ -- Samuel Beckett

Kashif Shaikh

2003-11-28 20:43:00 UTC

Post by Lars Marowsky-Bree
On 2003-11-28T12:20:49,

SGI owns most of the files and would have to reevaluate the copyright
decisions.

Maybe someone from SGI can comment here? Otherwise someone(me?) could just
strip out GPL symbols from the LGPL header files.

Post by Lars Marowsky-Bree
However, if you are comparing FailSafe to heartbeat, I can tell you that
as of today, it's comparing a spaceship to a car ;-) FailSafe is much
further advanced; it's not a reasonable comparison. The two benefits
heartbeat has over FailSafe are less complexity and tighter security.

Yes, FailSafe tries to be a space-ship, but ends up being the
kitchen-sink and too over-engineered. The nicest thing FailSafe has is
the cms/gcs layer, and IMO the rest is crap; the cdb should be placed
over the cms/gcs layer, but that's pointless since cdb uses RPC for
communication and it would be difficult to make it use gcs. FSD/SRM is
hell -- both try to keep the same states with similar logic, which
results in many bugs due to 'assumptions' and whatnot.

The scary part is that Heartbeat's proposed CRM/CIB/SRM/CRS looks very
similar to FailSafe's CHOAS layer (no offense, I know you wrote the crm
docs)...and that's when complexity will shoot through the roof
again...only good for the 'enterprise' market. The reason heartbeat
gets adopted in a 'heartbeat' is that it's easy to configure, with a
down-to-earth configuration and only a single daemon to manage.

Kashif

Dominique Chabord

2003-11-29 10:07:52 UTC

Hello,
Sorry for being somewhat off-topic. We might move this discussion to the
linux-ha mailing list instead.

----- Original Message -----
From: "Kashif Shaikh"

Post by Kashif Shaikh
Yes, FailSafe tries to be a space-ship, but ends up being the
kitchen-sink and too over-engineered.
...The reason heartbeat
gets adopted in a 'heartbeat' is that it's easy to configure, with a
down-to-earth configuration and only a single daemon to manage.

I am interested in what you are doing to position the different open-source
solutions. I've never read anything on this.
As you mention the trend towards complexity, I would be interested in getting
your opinion on WDX too (www.wdx.shaman-x.org). The design choices for WDX were
simplicity and robustness. WDX's capabilities are frozen now (118 kB binary).
There is no plan to make it more complex in the future. I don't know if it
matters, but WDX is fully GPL since it has no published API.
I'd like to understand whether it is correct to position WDX functionality at the
low end and heartbeat capabilities at the high end, given that the future of
FailSafe is uncertain. If so, what would be the benefits of the added complexity
of the future Heartbeat, in your opinion?

Regards
Dominique

Lars Marowsky-Bree

2003-11-29 14:57:51 UTC

On 2003-11-28T15:43:00,
Kashif Shaikh <kshaikh at consensys.com> said:

Hi Kashif, I'd suggest moving the heartbeat-development-related thread
to the linux-ha-dev list. As that's the bulk of this mail except for the
SGI license question, I'm cc'ing linux-ha-dev directly.

Post by Kashif Shaikh

Post by Lars Marowsky-Bree
SGI owns most of the files and would have to reevaluate the copyright
decisions.

Maybe someone from SGI can comment here? Otherwise someone(me?) could just
strip out GPL symbols from the LGPL header files.

I'm not sure whether that would be helpful for anyone, actually.

What is your intention?

Post by Kashif Shaikh

Yes, FailSafe tries to be a space-ship, but ends up being the
kitchen-sink and too over-engineered.

You are preaching to the choir ;-) Yes, FailSafe is much too complex in
my opinion too. If I thought differently, I'd try to revive it
instead of taking the good ideas from it and reimplementing them (and
hopefully not too many of the problems). I was only involved in porting
it, not in writing it. And as of today, I still don't grasp all the
interactions.

The design papers I have read were all very good. Alas, during
implementation some of the nice and clean separation into modules has
been compromised. And it's way too difficult to debug because of the
thread model and the complexity of the subcomponents. (And the lack of
good documentation on them.)

Post by Kashif Shaikh
The nicest thing FailSafe has is the cms/gcs layer, and IMO the rest
is crap;

I would disagree with that. The CMS layer is nice, but the algorithm
used is pure horror - it has N! complexity, and will easily saturate a
100 Mbit/s network and use substantial bandwidth on gigE, and that's
ignoring the latency issues it has. The problem is that it is mapping an
algorithm designed for a ring topology to a broadcast-based network,
which does imply a certain overhead.

The other parts do have their good sides. The CDB is a very, very nice
concept (albeit the implementation sucks). The GUI is probably the
nicest GUI I've ever seen for a cluster, even though it is implemented
in Java.

Post by Kashif Shaikh
the cdb should be placed over the cms/gcs layer, but that's pointless
since cdb uses RPC for communication and it would be difficult to make
it use gcs.

Yes. The CDB is a sucky implementation. The original papers called for
the cdb to run on top of cms/gcs, but I guess then they found out that
they needed info from the CDB to bootstrap the startup of cms/gcs, and
thus the layers blurred and complexity exploded.

The security model of the autodiscovery of cluster nodes is also
dubious, to say the least; the networks need to be completely secure,
because FailSafe is more than prone to attacks.

Post by Kashif Shaikh
FSD/SRM is hell -- both try to keep the same states with similar
logic, which results in many bugs due to 'assumptions' and whatnot.

Yup.

SRM in itself makes sense: keeping track of local resources, starting and
stopping them, etc. is a clearly separate piece from the rest. But
again, they were married too closely, I think.

Post by Kashif Shaikh
The scary part is that Heartbeat's proposed CRM/CIB/SRM/CRS looks very
similar to FailSafe's CHOAS layer (no offense, I know you wrote the crm
docs)...and that's when complexity will shoot through the roof
again...only good for the 'enterprise' market. The reason heartbeat
gets adopted in a 'heartbeat' is that it's easy to configure, with a
down-to-earth configuration and only a single daemon to manage.

Well, the division into such components is just natural. Most more
powerful solutions have roughly the same structure; just as most every
cluster software has a membership layer at a low level, and some GUI on
top.

Rest assured that I do want to keep complexity as low as possible, for several
reasons. One of them is that I got lost in FailSafe and don't want
that to happen again. This will be addressed by several measures: one is
better documentation of the software itself (FailSafe has
excellent docs, but only from the user perspective), and the other is, of
course, "KISS" - I have no other choice anyway, as we have an aggressive
schedule and too little budget, as always, and there will be little time
for getting too academic ;-)

However, moving the local resource bookkeeping out of heartbeat's core
is actually lowering complexity. Right now the resource bookkeeping in
heartbeat is a mess already. Adding any little feature is getting
dangerous. Alan has planned on this anyway, and that's a good thing to
attack. You can't tell me that you call the mess of combining
heartbeat + mon for resource monitoring a down-to-earth configuration!
That belongs 3ft _under_ the earth! ;-)

The "CRM" is also already there in heartbeat as it stands; however, it's
spread out over hb_resource.c, various other places and even shell
scripts. Moving that into a clear framework will, I believe, also help
make it more manageable. And note that the "CRM" is already going to the
most simple version: A master node is elected and resources are
coordinated from there.

The "Policy Engine" is meant to separate out the complexity of these
decisions. It will be clearly separated to facilitate testing so it can
be tried without an actual cluster. This will allow us to pinpoint
errors more quickly. It doesn't need to be all that fancy to start
with.
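
To illustrate why separating the decision logic helps testing, here is a deliberately tiny sketch. The placement rule (first preferred node that is alive) and the name place_resource are assumptions made up for illustration; the actual Policy Engine rules are not spelled out in this mail. Because the function looks only at its arguments, it can be exercised on a single machine with no cluster running.

```sh
#!/bin/sh
# Sketch only: a placement decision as a pure function of its inputs.
# $1 = preferred nodes for a resource, in priority order (space separated)
# $2 = nodes currently alive (space separated)
# Prints the chosen node, or returns 1 if no preferred node is alive.
place_resource() {
    for want in $1; do
        for alive in $2; do
            if [ "$want" = "$alive" ]; then
                echo "$want"
                return 0
            fi
        done
    done
    return 1
}

# Example: node1 is down, so the resource falls through to node2.
place_resource "node1 node2" "node2 node3"
```

A test suite for such a function is just a list of input pairs and expected answers, which is the kind of cluster-free testing described above.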

Now, the CIB _is_ adding complexity. I am not disagreeing. Let me try to
convince you that it is necessary complexity. If you look at the
postings to the mailing lists, a substantial number of bugs are due to
configuration mismatches between the two nodes, and users being unsure
about how to change the configuration online. This alone needs
addressing, but I'm very afraid of how often users would get it wrong
with even 3 or 4 nodes. And there are reasonably sane setups where 3 or
4 nodes make quite a bit of sense.

The CRM now could just always use the 'local' configuration on the
elected master node; that already would prevent such configuration
mismatches. All the CIB does is add a small step to that: Figuring out
the most recent consistent configuration version, retrieving it from the
node, and replicating it. If it fails, hey, the worst that can happen
is that the cluster will revert to a slightly older configuration.
Little harm done, and _much_ better than operating on different
configuration versions at once.
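
The "small step" can be sketched roughly as follows. This is only an illustration under simplifying assumptions: each node is modelled as a local directory containing cib.conf and a version file holding an integer, and the check for a *consistent* copy is left out. The file layout and the names cib_version and sync_cib are hypothetical; none of them come from the actual design.

```sh
#!/bin/sh
# Sketch only: pick the newest configuration copy and replicate it.
# Each "node" is modelled as a directory with two files (hypothetical layout):
#   cib.conf  the configuration itself
#   version   an integer that grows with every change

# Print the version stored for node directory $1, or 0 if it has none.
cib_version() {
    if [ -r "$1/version" ]; then
        cat "$1/version"
    else
        echo 0
    fi
}

# sync_cib NODE_DIR...  copies the newest copy to every other node
# and prints the version the cluster ends up with.
sync_cib() {
    best=
    best_ver=-1
    for node in "$@"; do
        ver=$(cib_version "$node")
        if [ "$ver" -gt "$best_ver" ]; then
            best=$node
            best_ver=$ver
        fi
    done
    for node in "$@"; do
        if [ "$node" != "$best" ]; then
            cp "$best/cib.conf" "$best/version" "$node/"
        fi
    done
    echo "$best_ver"
}
```

In a real cluster the copy step would go over the messaging layer rather than cp, and a node that is unreachable would simply keep its older version until it rejoins.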

That's all the CIB really is meant to be: Making sure the users screw up
less often. It's really reasonably straightforward. And it _will_ run on
top of the messaging/membership layers, so that can of worms from
FailSafe will stay closed.

The crm.txt gets a little bit fancy by calling the CIB a distributed
database with weak transactional semantics; I think I had some very good
chocolate when I wrote that ;-)

Another design goal I absolutely intend to stick with is 'security'.
heartbeat's messaging & membership layers are pretty secure, and
building on top of these will give that to the 'bigger' solution too.
We obviously also will reuse the rather good and helpful IPC layer.

All in all I agree that the heartbeat-Next Generation will likely be
more complex than heartbeat right now; definitely internally. But by
dividing it into manageable pieces with separate test suites,
documentation etc, I hope to actually have it more manageable and more
easily customized than heartbeat is right now.

And of course, I will kick the implementation and design discussions out
into the public onto the linux-ha-dev lists, so that even in the future,
one will be able to figure out why something was done that way. That was
a pain with FailSafe, not knowing why it was done as it was. Typically
there have been very good reasons, but I didn't know, and couldn't find
out! This discussion for one is already an important part: Figuring out
whether the approach is worth it.

The new design allows for more features; proper support for replicated
resources for example, resource monitoring etc. We'll see.

Of course, the very important part is to keep the complexity for the
user down. I believe this will be possible, even though some paradigms
will change naturally, but I believe that they actually change for the
better. Again, documentation plays a key role here.

(I _have_ noticed that FailSafe gets this wrong, don't worry. It's easy
to configure on the surface, but it has lots of hidden dependencies (the
most obvious one being how anal-retentive it is about /etc/hosts).)

I hope this addresses some of your concerns.

Next week I will have to defend^Wexplain my design to the project team
for implementing the CRM suite. If you see any big holes in it which
need addressing, or any other concerns, please do voice them. It is
highly appreciated!

Sincerely,
Lars Marowsky-Brée <lmb at suse.de>

Kashif Shaikh

2003-12-01 16:44:48 UTC

Post by Lars Marowsky-Bree
On 2003-11-28T15:43:00,
Hi Kashif, I'd suggest moving the heartbeat-development-related thread
to the linux-ha-dev list. As that's the bulk of this mail except for the
SGI license question, I'm cc'ing linux-ha-dev directly.

Post by Kashif Shaikh

Post by Lars Marowsky-Bree
SGI owns most of the files and would have to reevaluate the copyright
decisions.

Maybe someone from SGI can comment here? Otherwise someone(me?) could just
strip out GPL symbols from the LGPL header files.

I'm not sure whether that would be helpful for anyone, actually.
What is your intention?

My intention was to make the cms/gcs layer independent of all other
subsystems, since the cms/gcs API is LGPL. But these APIs are mixed with
CDB, which has a GPL API. That way, any 'third-party' process-group
application could interact cleanly with the cms/gcs API without having to
touch any CDB GPL symbols/data structures. So it's only helpful to
people like me who don't really need the CDB.

Post by Lars Marowsky-Bree
I would disagree with that. The CMS layer is nice, but the algorithm
used is pure horror - it has N! complexity, and will easily saturate a
100 Mbit/s network and use substantial bandwidth on gigE, and that's
ignoring the latency issues it has. The problem is that it is mapping an
algorithm designed for a ring topology to a broadcast-based network,
which does imply a certain overhead.

CMS could be modified to use hardware Ethernet multicast.

Post by Lars Marowsky-Bree
The other parts do have their good sides. The CDB is a very, very nice
concept (albeit the implementation sucks).

...

Post by Lars Marowsky-Bree

Post by Kashif Shaikh
the cdb should be placed over the cms/gcs layer, but that's pointless
since cdb uses RPC for communication and it would be difficult to make
it use gcs.

The CDB API is simple, but you need a custom GUI front-end to read/write
key/vals, or you have to descend the tree using the cryptic cdbutils CLI
(which is, I think, one of the reasons SGI wrote dozens of CLI programs to
manipulate the CDB).

Perhaps for the future CDB/CIB, there could be a graphical 'CIB Editor'
similar to the Windows Registry editor or GNOME's gconf. And the CDB/CIB
manipulation could possibly be SQL-based, so that any front-end can
interact with the db using normal SQL commands. Though you would need a
replicated SQL cluster with Postgres plus a replication component.

Post by Lars Marowsky-Bree
Now, the CIB _is_ adding complexity. I am not disagreeing. Let me try to
convince you that it is necessary complexity. If you look at the
postings to the mailing lists, a substantial number of bugs are due to
configuration mismatches between the two nodes, and users being unsure
about how to change the configuration online. This alone needs
addressing, but I'm very afraid of how often users would get it wrong
with even 3 or 4 nodes. And there are reasonably sane setups where 3 or
4 nodes make quite a bit of sense.

Yes, I agree the CIB is needed -- but what I found is that once configuration is
done (especially for HA), the information is set in stone after the cluster
goes into production. That means 99% of the time you are going to be doing
reads from the CIB and 1% of the time writing (i.e. adding a new resource,
changing an IP address, etc.).

So maybe all that is needed is a simple file syncer using dnotify/FAM, or
special commands like this:

#!/bin/sh

# Take a write lock on each config file (lockCIB and unlockCIB are the
# proposed special commands).
/usr/sbin/lockCIB write /etc/node.conf
/usr/sbin/lockCIB write /etc/resources.conf

# Change the files: delete every line matching the pattern, then save.
ed - /etc/node.conf <<!
g/CLUSTER_IP/d
w
!

ed - /etc/resources.conf <<!
g/PARAM/d
w
!

# Release the locks.
/usr/sbin/unlockCIB /etc/node.conf
/usr/sbin/unlockCIB /etc/resources.conf

This is just an idea, and it's not perfect (it has race conditions -- a
global lock would be needed, but you get the idea). It would allow us to
write rich shell scripts, and any program (Perl, C, C++) could access the DB
cleanly without going through a special API.
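
One way to get a single global lock, sketched for one machine only: mkdir either creates a directory or fails, atomically, so it can serve as a mutex around the whole edit instead of one lock per file. LOCKDIR and with_cib_lock are hypothetical names, and this does nothing to coordinate between cluster nodes; that part would still need the messaging layer or something similar.

```sh
#!/bin/sh
# Sketch only: one global lock around all config edits on a single host.
# LOCKDIR is a hypothetical location; override it in the environment.
LOCKDIR=${LOCKDIR:-/tmp/cib.lock}

# with_cib_lock COMMAND [ARGS...]
# Waits for the lock, runs the command, releases the lock when the
# command returns, and passes on the command's exit status.
with_cib_lock() {
    # mkdir is atomic: only one caller can create the directory.
    until mkdir "$LOCKDIR" 2>/dev/null; do
        sleep 1
    done
    status=0
    "$@" || status=$?
    rmdir "$LOCKDIR"
    return "$status"
}
```

The ed commands from the script above could then live in their own file and be run as `with_cib_lock ./change-config.sh` (change-config.sh being a made-up name). If the process is killed while holding the lock, the directory stays behind and has to be removed by hand.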

Kashif