On CICLing verifiability, reproducibility, and open source policy

Alexander Gelbukh

The CICLing verifiability, reproducibility, and open source policy was largely inspired by Ted Pedersen's paper Empiricism is not a matter of faith, as well as by discussions with various attendees of CICLing 2010 in Iași.

See also the FAQ and the information on your conference's page under the Software section.

The problem

Are we doing science or faith?

To err is human. What is more, getting a paper published translates into a better university evaluation and eventually into money. Therefore, in principle, a paper may report results that are erroneous, falsified, or made up. Reviewers spend a lot of effort checking English, style, clarity of explanation, and so on, but do they ever ask themselves whether the reported results are true in the first place? No, for a simple reason: there is no way to know. They have no choice but to trust the author. Is this science?

Many publications, any clarity?

Scientific results must be verifiable and reproducible. Can you imagine a paper on physics that says "we made some apparatus, with some electrodes and lights (here is a photo), put there 5.4321 grams of some blue substance, and it gave 1.2345% more energy than coal"? What have you learnt from such a paper? Now replace apparatus with algorithm and substance with corpus: wouldn't you get a quite decent computational linguistics paper?

Many publications, any knowledge?

In 1637, Pierre de Fermat wrote: "I have discovered a truly marvelous proof <...>. This margin is too narrow to contain it". It took 358 years of intensive research for a proof of the theorem in question to become known to anybody other than himself (if he really had a correct proof at all). Still, many of us find the tempting formula "but the space limitations do not allow us to go into details" such a lifesaver when we are in a hurry to meet the conference deadline. How many years will it take for other people to find out what we had in mind when writing this phrase? Are we then really communicating novel information to the reader, or are we boasting of our own intelligence, advertising our group, getting university promotion points, or, worse, hiding details that the readers may find flawed? Does such science generate knowledge or scientific spam?

Recently a colleague, a well-respected senior researcher, explained to me why he didn't want to provide an implementation of his algorithm with his paper: "if other groups see my implementation, then they will be able to improve their results using my programs, and then our group will have fewer opportunities to publish." That's perfectly normal reasoning in industry, but is he then doing science? Isn't the whole purpose of science to enable other people to improve their results? Do you think his printed paper was written with an honest effort to make it complete, understandable, and reproducible, or was it cynically generated scientific spam that mimics a research report?

Many publications, any programs?

I have graduated 50+ students, who have written a lot of programs... but does our group have access to any of these programs? Each time a student is about to graduate, we are too busy with the thesis to think of the programs, and after that I have no control over the student anymore, so I am left with a thesis describing the program, but no program. Am I the only one? Recently a student of mine needed a parser for a specific language. We wrote one some 10 years ago, but wait: I have seen hundreds of recent papers on advances in such parsers. So I spent two days digging through the Web... a lot of papers, no parsers. Sounds familiar?

Gift of knowledge or advertising?

Can you imagine a mathematical paper saying "a = b, but the proof is commercial property of our company"? Not every intellectual activity is science: science is producing knowledge for all, not for oneself. The fact that some closed-source software is good can be advertised, but it cannot be the object of a scientific publication. If you have a program, make money from it, but don't call it science unless you are willing to show others how to write such a program. Advertising disguised as scientific publication is misleading, frustrating, and dishonest: it gives a double reward to the author, money for the program and money from the university promotion.

One paper that erroneously (or intentionally) reports results that are too high on a task, yet plausible enough to be publishable, can completely block any progress in the field.

A student of mine suggested a new method for some task. She found a long paper published in a very respectable journal, which reported higher results, in fact a bit too high to be realistic, so she suspected an error. She spent months trying to reproduce the algorithm from the paper, and gave up: the description was not clear enough to write a program following it. When she asked the author, a very respectable scientist, he no longer remembered the details, and his student who had implemented the programs had graduated long ago. Even the corpus on which the results were reported had been lost on a broken laptop. Sounds familiar? In the end, she had no way to publish her results, because there was no way to compare them with previously published ones and because her figures were not as high as those reported in that paper.

So the current situation in that field of research is: there is a paper published in a very prestigious journal that reports very optimistic figures; there is no way to know how one can achieve this quality in his or her own system; there is no way to know whether those figures are true; and there is no way to publish new research in the field because the figures that all new research obtains are lower than those reported in that paper. If, for whatever reason, those figures were erroneous, then no advance in this field is ever possible. One paper killed a whole field of science, in exchange for university promotion points to one person.

Only bad guys do bad things, but do we good guys need to be verified?

Another student of mine achieved a very good result with some algorithm. We considered her PhD graduation guaranteed and wrote a long paper describing the results, to be sent to a top journal. When the paper was nearly ready, one day she was examining her code and accidentally saw a subtle error, which she found to be critical for the results. She was honest enough to throw away the paper and notify the thesis jury (would every student of yours do that?); in the end she graduated much later with other results. But imagine if on that sunny day she had had a date with her boyfriend instead of examining the code of her old program! Would we have killed a whole field of science?

Bad examples

Here I will use examples from my own old work, in order not to offend anybody, but do you think examples from your publication record could not have appeared here instead?

In the paper A Very Large Database of Collocations and Semantic Links, I reported the creation of a large lexical resource, which until now has not been made available to the public, even for a fee. What's the point in reporting something that nobody can use or even see?

In the paper Information Retrieval with a Simplified Conceptual Graph-like Representation, I reported an algorithm. I say there: "we used a simplified structure, which is basically syntactic structure minimally adapted to semantics represented in conceptual graphs." What's the point of saying that it was an adapted structure without explaining what specifically was adapted? These details could have been seen from the implementation, but it was not provided with the paper. What is more, I say there that we developed an English grammar that we used for parsing; however, the grammar itself is not available, and it is too large to be given in the paper. Without this exact grammar, our results are not reproducible, as they largely depend on the specific grammar we used. Now it's too late to make the grammar available: the published paper does not make any reference to where the grammar can be found.

In the paper Detecting Inflection Patterns in Natural Language by Minimization of Morphological Model, I reported an algorithm. However, much later, when I was once more examining the code, I found a threshold that was not mentioned in the paper, which I had introduced temporarily and forgotten to remove. When I removed the threshold, the algorithm did not work as expected: the threshold seems to be essential for the algorithm to work. So, the published algorithm is incomplete, not to say incorrect, and is not reproducible; however, the readers have no way of knowing this.

These publications were not made in bad faith; only later did I realize that I should not have published them this way. Worse, these papers were reviewed and accepted at decent conferences, because they do meet the usual standards of our science and the reviewers did not see any problem with them! What is to be changed is our perception of what the usual standards are.

Good examples

Porter's An algorithm for suffix stripping is accompanied by a page describing the algorithm itself, so that no doubts are left about how it works.

Umemura and Church's Substring Statistics is accompanied by a sample of code and data, referenced within the paper as [21].

Our paper An Associative Network of Concepts that Enter to Internet Queries, which reports the creation of a lexical resource, is accompanied by the dataset itself, referenced within the paper as [10]. Note that even though we made a new version of the resource available, the version reported in the paper is preserved and is clearly marked as such.

In all cases the code or data accompanying the paper are something simple, not a huge system with a fancy user interface, but essential for the precise understanding of the algorithm and/or for the possibility of using the information reported in the paper.
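One simple way to mark the exact version of a resource that a paper reports on, in the spirit of the last example, is to publish a checksum of each file together with the paper. What follows is only a minimal sketch in Python, not something the papers above actually did: the function names, the version label, and the file names are hypothetical and chosen for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, version_label):
    """Map each file name to its checksum, tagged with a version label."""
    return {
        "version": version_label,
        "files": {Path(p).name: sha256_of_file(p) for p in paths},
    }


def verify_manifest(manifest, directory):
    """Return the names of files whose current checksum differs from the manifest."""
    changed = []
    for name, expected in manifest["files"].items():
        if sha256_of_file(Path(directory) / name) != expected:
            changed.append(name)
    return changed


# Hypothetical usage: record the files exactly as reported in the paper.
# manifest = build_manifest(["corpus.txt", "lexicon.txt"], "as-published")
# Path("MANIFEST.json").write_text(json.dumps(manifest, indent=2))
```

A reader who downloads the resource later can run verify_manifest against the published manifest. An empty list means the files are byte-for-byte the ones the paper describes, even if a newer version has since been released next to them.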

The solution

Publication of an algorithm or experimental results must be accompanied by software and data.

While a printed conference paper is too short, we have every opportunity to provide megabytes of information together with the paper, all the more so given that most publications nowadays are distributed online. A paper can, and should, have a link to a page where all mentioned resources and implementations of all mentioned algorithms can be found. What is more, I'd put it the other way round: what we should make publicly available (i.e., publish in the literal meaning of the word) are those implementations and resources, i.e., working programs; it's the paper that should accompany those resources and not the other way round. The paper should be an explanation of the program or resource. Over time we will hopefully grow to understand that a paper without a program is nonsense, pretty much as a theorem without a proof is nonsense.

One can object that this will make "papers" many times longer and will greatly increase the effort spent on their preparation and review. Yes. Look at a math paper: 90% of it is the proof of what the other 10% says. And, of course, it took the author 90% of the effort to write this "useless" part of it, which does not add any new ideas but only serves to convince the reader that the ideas communicated in the other 10% of the text are correct. This is what makes mathematics a science. Unfortunately, our field has not yet developed this culture.

Scientific results must be open-source.

It is not enough to provide an executable or a binary resource with a scientific paper. Knowledge is human, so only human-readable resources serve the purpose of science: to advance human knowledge. Of course they must also be machine-readable, for people to be able to reproduce the results.

One can object again that this will create copyright problems. I can only repeat: either do science or do profit; if you are not willing to give your knowledge to the community, then make money from it, but don't make science of it.

I know there are many problems to solve, many objections to discuss, and many bad habits to get rid of. The way is long. Be part of the change!

Comments: A.Gelbukh.
