15 Minutes and 150MB of RAM to Compare Unix and Linux
SCO has recently made two accusations: (1) IBM has contributed IBM employees' code to Linux in violation of certain SCO/IBM contracts, and (2) some proprietary Unix code has somehow been illegally contributed to Linux. I'm not qualified to comment on whether or not IBM owns the code IBM wrote--though on behalf of software authors everywhere, I hope IBM does. However, I've written a tool which will allow SCO to find any code shared between Linux and Unix in about 15 minutes. What SCO does with this tool is up to them.
My Motivations
When I was young, my father once told me, "The truth will out." (I think he was quoting someone.) He felt that it was better to face the truth, get all the facts before the public, and do the right thing. You might pay a price for your honesty, but you'd pay a bigger price if you lied, because sooner or later, "The truth will out."
I don't fear the truth. And as a creator, I deeply respect other people's copyrights. I don't want other people misusing my work. Linus Torvalds is a creator, too, and he says he respects the work of others. If there's Unix code in the Linux kernel, I want it removed swiftly, and I want those responsible to be barred from future contribution. Such illegal copying would be a stain upon the honor of many good, creative people.
When I read Egan Orion's excellent article, I decided to implement the idea he described, and to make the tool publicly available to help copyright owners figure out whether their code has been copied.
How to Use It
You'll need a Linux or Unix system with a decent C++ compiler. Download srcdupchk-0.2.tar.gz (that's "source duplication checker" to people who don't speak Unix), decompress it, and type:
$ cd srcdupchk-0.2
$ ./configure
$ make
$ make install
You may need to be root to run the last command. Before continuing, please read the README and COPYING files carefully. srcdupchk comes with no warranties and is provided "AS IS".
Now place the programs to be compared in two different directories, and type:
$ srcdupchk my-program-src linux-2.4.20
After 15 minutes or so, you'll get output which looks like this:
linux-2.4.20/foo/bar.c:20:107
linux-2.4.20/foo/baz.c:52:57
...
This means that lines 20 to 107 of bar.c are similar to code in my-program-src, as are lines 52 to 57 of baz.c. By default, srcdupchk won't print out the corresponding lines in my-program-src, so you don't have to reveal more than necessary about your own program. (If you want to see the other half of the matches, use the --show-both option.)
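If the report is long, a small script can help you decide where to start looking. The following is only a sketch: it assumes each report line has exactly the path:start:end shape shown in the sample above, with inclusive line ranges, and the function names are made up for illustration. Sorting the longest matches first is my own triage heuristic, not something srcdupchk does.

```python
def parse_match(line):
    """Split one 'path:start:end' report line into (path, start, end)."""
    # rsplit from the right, so a colon inside the path would not confuse us.
    path, start, end = line.strip().rsplit(":", 2)
    return path, int(start), int(end)


def summarize(report):
    """Return (matches sorted longest first, total number of matched lines)."""
    matches = [
        parse_match(line)
        for line in report.splitlines()
        if line.count(":") >= 2  # skip blank lines and anything that is not a match
    ]
    # Assumption: ranges are inclusive, so 20:107 covers 88 lines.
    matches.sort(key=lambda m: m[2] - m[1], reverse=True)
    total = sum(end - start + 1 for _, start, end in matches)
    return matches, total


if __name__ == "__main__":
    import sys

    found, total_lines = summarize(sys.stdin.read())
    for path, start, end in found:
        print(f"{end - start + 1:6d} lines  {path}:{start}-{end}")
    print(f"{total_lines:6d} lines in total")
```

Saved under a name of your choosing, it can read the report from standard input, for example by piping the output of srcdupchk into it.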
For details on how srcdupchk actually works, see the README. srcdupchk uses some neat tricks to ignore whitespace, commenting style, brace placement, and other irrelevant details.
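To give a feel for what "ignoring irrelevant details" can mean, here is a minimal sketch in Python. It is not srcdupchk's actual algorithm (the README is the authority on that); it simply strips comments, whitespace, and braces from each line and then looks for runs of identical normalized lines. The function names and the window size of 3 are assumptions chosen for illustration, and the comment stripping is naive: it does not understand string literals.

```python
import re


def normalize(source):
    """Return (line_number, text) pairs for lines that still have content
    after comments, whitespace, and braces are removed."""
    # Blank out /* ... */ comments, keeping their newlines so that
    # line numbers in the result still refer to the original file.
    source = re.sub(
        r"/\*.*?\*/",
        lambda m: "\n" * m.group(0).count("\n"),
        source,
        flags=re.S,
    )
    significant = []
    for number, line in enumerate(source.split("\n"), start=1):
        line = re.sub(r"//.*", "", line)               # drop // comments
        line = re.sub(r"\s+", "", line)                # drop all whitespace
        line = line.replace("{", "").replace("}", "")  # ignore brace placement
        if line:
            significant.append((number, line))
    return significant


def shared_ranges(mine, theirs, window=3):
    """Return (start, end) line ranges in `theirs` where `window` consecutive
    normalized lines also appear consecutively somewhere in `mine`."""
    a = normalize(mine)
    b = normalize(theirs)
    seen = {
        tuple(text for _, text in a[i:i + window])
        for i in range(len(a) - window + 1)
    }
    ranges = []
    for i in range(len(b) - window + 1):
        chunk = b[i:i + window]
        if tuple(text for _, text in chunk) in seen:
            start, end = chunk[0][0], chunk[-1][0]
            if ranges and start <= ranges[-1][1]:
                # Overlaps the previous match: extend it.
                ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
            else:
                ranges.append((start, end))
    return ranges
```

Two copies of the same function that differ only in spacing, comments, and where the opening brace sits will normalize to the same lines, so the sketch reports them as a match. Heavier edits, such as renamed variables, would defeat it.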
A Caveat
srcdupchk will find lots of perfectly legal code sharing. For example, both Linux and Unix are allowed to contain BSD code, or public domain code from various textbooks or the web. Other common sources of duplication are license notices, and the boilerplate code generated by popular tools. Once you've found the duplication, you need to investigate it carefully before you know what it means.
A Personal Request
Please don't publish the results of running srcdupchk on other people's code. It isn't polite, it almost certainly violates any non-disclosure agreements you've signed, and it may get you sued. Your actions would reflect poorly on the reputations of many free software developers, who, in my experience, are painfully scrupulous about their legal and moral responsibilities.
I wrote this tool so that software developers could quickly find improper uses of their code, and report those problems. Please respect my wishes in this matter. If you want to violate your NDAs in the name of journalism, there are people who will help you do that; they are already offering to provide similar tools for just that purpose. The ethics of a law-abiding citizen and the ethics of investigative journalism are sometimes in conflict; how you resolve them is on your own conscience. But whatever you do, please don't do it with my tools.
What I'd Like SCO to Do
SCO has a choice. They could run this tool, keep the results secret, and issue press releases saying there are 5,241 lines of code which appear in both Unix and Linux. Or they could call up Linus Torvalds on the phone and say: "The following files and line numbers in Linux look suspicious to us. Would you please work with the Linux community to figure out who 'contributed' this code, and if it is illegally copied, would you please remove it promptly?"
The latter choice would end the wrongdoing which concerns SCO, and would allow hundreds of dedicated, hardworking people to clear their names. All I'm asking for is 15 minutes, 150MB of RAM, and one phone call to Linus.