15 Minutes and 150MB of RAM to Compare Unix and Linux

Posted by Eric Fri, 20 Jun 2003 00:00:00 GMT

SCO has recently made two accusations: (1) IBM has contributed IBM employees' code to Linux in violation of certain SCO/IBM contracts, and (2) some proprietary Unix code has somehow been illegally contributed to Linux. I'm not qualified to comment on whether or not IBM owns the code IBM wrote--though on behalf of software authors everywhere, I hope IBM does. However, I've written a tool which will allow SCO to find any code shared between Linux and Unix in about 15 minutes. What SCO does with this tool is up to them.

My Motivations

When I was young, my father once told me, "The truth will out." (I think he was quoting someone.) He felt that it was better to face the truth, get all the facts before the public, and do the right thing. You might pay a price for your honesty, but you'd pay a bigger price if you lied, because sooner or later, "The truth will out."

I don't fear the truth. And as a creator, I deeply respect other people's copyrights. I don't want other people misusing my work. Linus Torvalds is a creator, too, and he says he respects the work of others. If there's Unix code in the Linux kernel, I want it removed swiftly, and I want those responsible to be barred from future contribution. Such illegal copying would be a stain upon the honor of many good, creative people.

When I read Egan Orion's excellent article, I decided to implement the idea he described, and to make the tool publicly available to help copyright owners figure out whether their code has been copied.

How to Use It

You'll need a Linux or Unix system with a decent C++ compiler. Download srcdupchk-0.2.tar.gz (that's "source duplication checker" to people who don't speak Unix), decompress it, and type:

$ cd srcdupchk-0.2
$ ./configure
$ make
$ make install

You may need to be root to run the last command. Before continuing, please read the README and COPYING files carefully. srcdupchk comes with no warranties and is provided "AS IS".

Now place the programs to be compared in two different directories, and type:

$ srcdupchk my-program-src linux-2.4.20

After 15 minutes or so, you'll get output which looks like this:

linux-2.4.20/foo/bar.c:20:107
linux-2.4.20/foo/baz.c:52:57
...

This means that lines 20 to 107 of bar.c are similar to code in my-program-src, as are lines 52 to 57 of baz.c. By default, srcdupchk won't print out the corresponding lines in my-program-src, so you don't have to reveal more than necessary about your own program. (If you want to see the other half of the matches, use the --show-both option.)
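
If you want to post-process a long report, the output is easy to pick apart. Here is a minimal Python sketch, assuming every line has exactly the path:first:last shape shown in the sample above and that the range is inclusive. The names Match, parse_match and matched_line_count are made up for this illustration and are not part of srcdupchk.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One reported range (hypothetical helper type, not part of srcdupchk)."""
    path: str
    first_line: int
    last_line: int


def parse_match(line: str) -> Match:
    # Assumes the "path:first:last" shape from the sample output.
    # Splitting from the right keeps any colon inside the path intact.
    path, first, last = line.strip().rsplit(":", 2)
    return Match(path, int(first), int(last))


def matched_line_count(match: Match) -> int:
    # The range is treated as inclusive: "lines 20 to 107" covers both ends.
    return match.last_line - match.first_line + 1


if __name__ == "__main__":
    for raw in ["linux-2.4.20/foo/bar.c:20:107", "linux-2.4.20/foo/baz.c:52:57"]:
        m = parse_match(raw)
        print(m.path, matched_line_count(m))
```

Sorting the parsed matches by matched_line_count is one quick way to look at the longest ranges first.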

For details on how srcdupchk actually works, see the README. srcdupchk uses some neat tricks to ignore whitespace, commenting style, brace placement, and other irrelevant details.
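
To give a feel for the general idea, here is a rough Python sketch of normalize-then-compare duplicate detection. This is not srcdupchk's code and is not taken from its README. It is a deliberately crude illustration that assumes C-style comments, throws away all whitespace and braces, and looks for runs of a few identical normalized lines. The function names and the window size of 3 are assumptions chosen for the example.

```python
import re
from typing import List, Tuple

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def _blank_comment(m):
    # Keep the newlines so later line numbers still refer to the original file.
    return "\n" * m.group(0).count("\n")


def normalize(source: str) -> List[Tuple[int, str]]:
    """Return (original line number, normalized text) for each non-empty line.

    Crude on purpose: comment markers inside string literals are not handled.
    """
    source = _BLOCK_COMMENT.sub(_blank_comment, source)
    source = _LINE_COMMENT.sub("", source)
    result = []
    for number, line in enumerate(source.split("\n"), start=1):
        # Drop all whitespace and braces, so layout and brace placement vanish.
        squeezed = re.sub(r"\s+", "", line).replace("{", "").replace("}", "")
        if squeezed:
            result.append((number, squeezed))
    return result


def shared_ranges(ours: str, theirs: str, window: int = 3) -> List[Tuple[int, int]]:
    """Line ranges in `theirs` whose normalized text also occurs in `ours`."""
    our_lines = [text for _, text in normalize(ours)]
    known = {
        tuple(our_lines[i:i + window])
        for i in range(len(our_lines) - window + 1)
    }
    their_lines = normalize(theirs)
    ranges: List[Tuple[int, int]] = []
    for i in range(len(their_lines) - window + 1):
        chunk = their_lines[i:i + window]
        if tuple(text for _, text in chunk) in known:
            first, last = chunk[0][0], chunk[-1][0]
            if ranges and first <= ranges[-1][1]:
                # Overlapping windows belong to the same matching run.
                ranges[-1] = (ranges[-1][0], max(ranges[-1][1], last))
            else:
                ranges.append((first, last))
    return ranges
```

A real tool needs to be far more careful about strings, preprocessor lines and memory use, but the sketch shows why reformatting a copied function or rewording its comments does not hide it from this kind of comparison.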

A Caveat

srcdupchk will find lots of perfectly legal code sharing. For example, both Linux and Unix are allowed to contain BSD code, or public domain code from various textbooks or the web. Other common sources of duplication are license notices, and the boilerplate code generated by popular tools. Once you've found the duplication, you need to investigate it carefully before you know what it means.

A Personal Request

Please don't publish the results of running srcdupchk on other people's code. It isn't polite, it almost certainly violates any non-disclosure agreements you've sig