SCO has recently made two accusations: (1) IBM has contributed IBM employees' code to Linux in violation of certain SCO/IBM contracts, and (2) some proprietary Unix code has somehow been illegally contributed to Linux. I'm not qualified to comment on whether or not IBM owns the code IBM wrote--though on behalf of software authors everywhere, I hope IBM does. However, I've written a tool which will allow SCO to find any code shared between Linux and Unix in about 15 minutes. What SCO does with this tool is up to them.

My Motivations

When I was young, my father once told me, "The truth will out." (I think he was quoting someone.) He felt that it was better to face the truth, get all the facts before the public, and do the right thing. You might pay a price for your honesty, but you'd pay a bigger price if you lied, because sooner or later, "The truth will out."

I don't fear the truth. And as a creator, I deeply respect other people's copyrights. I don't want other people misusing my work. Linus Torvalds is a creator, too, and he says he respects the work of others. If there's Unix code in the Linux kernel, I want it removed swiftly, and I want those responsible to be barred from future contribution. Such illegal copying would be a stain upon the honor of many good, creative people.

When I read Egan Orion's excellent article, I decided to implement the idea he described, and to make the tool publicly available to help copyright owners figure out whether their code has been copied.

How to Use It

You'll need a Linux or Unix system with a decent C++ compiler. Download srcdupchk-0.2.tar.gz (that's "source duplication checker" to people who don't speak Unix), decompress it, and type:

$ cd srcdupchk-0.2
$ ./configure
$ make
$ make install

You may need to be root to run the last command. Before continuing, please read the README and COPYING files carefully. srcdupchk comes with no warranties and is provided "AS IS".

Now place the programs to be compared in two different directories, and type:

$ srcdupchk my-program-src linux-2.4.20

After 15 minutes or so, you'll get output which looks like this:

linux-2.4.20/foo/bar.c:20:107
linux-2.4.20/foo/baz.c:52:57
...

This means that lines 20 to 107 of bar.c are similar to code in my-program-src, as are lines 52 to 57 of baz.c. By default, srcdupchk won't print out the corresponding lines in my-program-src, so you don't have to reveal more than necessary about your own program. (If you want to see the other half of the matches, use the --show-both option.)
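If you want to tally or sort the results, the output is easy to post-process. Here is a minimal Python sketch, assuming every result line has the form path:first:last exactly as in the sample above, and that both line numbers are inclusive. The names parse_matches and total_lines are hypothetical helpers of my own; they are not part of srcdupchk.

```python
from typing import Iterable, List, Tuple

Match = Tuple[str, int, int]


def parse_matches(lines: Iterable[str]) -> List[Match]:
    """Parse 'path:first:last' lines into (path, first, last) tuples.

    Assumes the format shown in the sample output. Lines that do not
    fit (blank lines, the '...' placeholder) are skipped.
    """
    matches = []
    for line in lines:
        # Split from the right so a colon inside the path is tolerated.
        parts = line.strip().rsplit(":", 2)
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        matches.append((parts[0], int(parts[1]), int(parts[2])))
    return matches


def total_lines(matches: List[Match]) -> int:
    """Count matched lines, treating each range as inclusive (an assumption)."""
    return sum(last - first + 1 for _, first, last in matches)


sample = [
    "linux-2.4.20/foo/bar.c:20:107",
    "linux-2.4.20/foo/baz.c:52:57",
    "...",
]
matches = parse_matches(sample)
print(total_lines(matches))
```

A raw line count like this says nothing about whether the sharing is legitimate; it only tells you how much there is to investigate.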

For details on how srcdupchk actually works, see the README. srcdupchk uses some neat tricks to ignore whitespace, commenting style, brace placement, and other irrelevant details.
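To give a feel for the general idea, here is a rough Python sketch of one way to compare code while ignoring formatting. This is not srcdupchk's actual algorithm (the README is the authority on that); it is only an illustration under simple assumptions: comments are stripped one line at a time, braces and whitespace are discarded, and runs of three consecutive normalized lines are compared. All function names here are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalize(line: str) -> str:
    """Strip comments, braces, and whitespace from a single line.

    Simplification: only handles /* ... */ comments that open and close
    on the same line, and would mangle '//' inside string literals.
    """
    line = re.sub(r"/\*.*?\*/", "", line)
    line = re.sub(r"//.*", "", line)
    line = line.replace("{", "").replace("}", "")
    return re.sub(r"\s+", "", line)


def fingerprints(source: str, window: int = 3) -> Dict[Tuple[str, ...], int]:
    """Map each run of `window` non-empty normalized lines to its first line number."""
    kept = [(n, normalize(text)) for n, text in enumerate(source.splitlines(), 1)]
    kept = [(n, text) for n, text in kept if text]
    result: Dict[Tuple[str, ...], int] = {}
    for i in range(len(kept) - window + 1):
        key = tuple(text for _, text in kept[i:i + window])
        result.setdefault(key, kept[i][0])
    return result


def shared_starts(a: str, b: str, window: int = 3) -> List[int]:
    """Return line numbers in `b` where a run of lines also occurs in `a`."""
    fa = fingerprints(a, window)
    fb = fingerprints(b, window)
    return sorted(line for key, line in fb.items() if key in fa)


mine = "int f(int x)\n{\n    x = x + 1; // bump\n    return x;\n}\n"
theirs = "/* copy */\nint f(int x) {\n  x = x+1;\n  return x;\n}\n"
print(shared_starts(mine, theirs))
```

The two snippets differ in spacing, comments, and brace placement, yet the sketch reports a match starting at line 2 of the second one. A real tool has to be far more careful than this, and far faster, to get through a whole kernel tree in 15 minutes.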

A Caveat

srcdupchk will find lots of perfectly legal code sharing. For example, both Linux and Unix are allowed to contain BSD code, or public domain code from various textbooks or the web. Other common sources of duplication are license notices, and the boilerplate code generated by popular tools. Once you've found the duplication, you need to investigate it carefully before you know what it means.

A Personal Request

Please don't publish the results of running srcdupchk on other people's code. It isn't polite, it almost certainly violates any non-disclosure agreements you've signed, and it may get you sued. Your actions would reflect poorly on the reputations of many free software developers, who, in my experience, are painfully scrupulous about their legal and moral responsibilities.

I wrote this tool so that software developers could quickly find improper uses of their code, and report those problems. Please respect my wishes in this matter. If you want to violate your NDAs in the name of journalism, there are people who will help you do that; they are already offering to provide similar tools for just that purpose. The ethics of a law-abiding citizen and the ethics of investigative journalism are sometimes in conflict; how you resolve them is on your own conscience. But whatever you do, please don't do it with my tools.

What I'd Like SCO to Do

SCO has a choice. They could run this tool, keep the results secret, and issue press releases saying there are 5,241 lines of code which appear in both Unix and Linux. Or they could call up Linus Torvalds on the phone and say: "The following files and line numbers in Linux look suspicious to us. Would you please work with the Linux community to figure out who 'contributed' this code, and if it is illegally copied, would you please remove it promptly?"

The latter choice would end the wrongdoing which concerns SCO, and would allow hundreds of dedicated, hardworking people to clear their names. All I'm asking for is 15 minutes, 150MB of RAM, and one phone call to Linus.