FAQ
on the CICLing verifiability, reproducibility, and open source policy

This page is under development

Alexander Gelbukh

See more explanations on the CICLing verifiability, reproducibility, and open source policy page and on your conference's webpage, under the Software section.

Contents:

Submitting the software

Data that are not freely available

Data that are freely available


Submitting the software

Q: The code with the datasets is too large to be submitted via EasyChair.

Please place the file somewhere and send the program chair the URL from which it can be downloaded. You can use ftp.CICLing.org/in for this; see below. Please also submit with your paper an attachment (a simple TXT file) saying that you have submitted your code directly to the organizers. This will let the reviewers know that the code exists.

If you use ftp.CICLing.org/in: please upload a single, appropriately named ZIP file. Some users have reported that they could not upload files to this site; in that case, use some other storage or contact the program chair.
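
One possible way to prepare such a single ZIP file is sketched below in Python. The function name make_submission_zip and the example paths are hypothetical, chosen only for illustration; any tool that produces one ZIP archive serves the same purpose.

```python
import zipfile
from pathlib import Path


def make_submission_zip(src_dir, zip_path):
    """Pack every file under src_dir into one ZIP archive; return the member names."""
    src = Path(src_dir)
    zip_target = Path(zip_path).resolve()
    names = []
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            # Skip directories, and skip the archive itself if it lives inside src_dir.
            if path.is_file() and path.resolve() != zip_target:
                arcname = path.relative_to(src).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


# Hypothetical usage: the directory and archive names are placeholders.
# make_submission_zip("my_code_and_data", "paper_code.zip")
```

Storing paths relative to the top directory keeps the archive independent of where it was built on your computer.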

Q: I need more time to submit the code. May I submit it later than the paper?

Please contact the program chair. If, with the program chair's permission, you submit the code later, please at least submit with your paper an attachment (a simple TXT file) saying that you will submit the code by such and such a date and that you may submit it directly to the organizers. This will let the reviewers know that the code is coming.

Data that are not freely available

Q: The test data I'm using do not belong to me and I am not allowed to publish them. This is unfortunate, but I have no choice, since I'm using the same materials as previously published studies in order to be able to compare my results to theirs.

This is precisely the situation that the CICLing policy aims to eliminate. One author publishes results that are not openly and publicly verifiable; the next one has to use the same dataset, and so on; and now you, with your paper, are contributing to the vicious circle. Do break it. Do not work with data that you cannot provide to the general public: such work is useless to humankind, and in fact harmful, because it perpetuates the vicious circle.

I advise you to look for (or create) publicly available data and then publish both results: one on the "standard" dataset (unfortunately, this is what helps your paper get accepted) and the other on your new dataset. The latter is a great contribution to science, because it breaks the vicious circle and sets up an open gold standard for future researchers to work with. (Any better ideas? Please let me know.)

If nothing helps, just clearly indicate in the instructions how the data can be obtained.

Q: May I use commercially available data or programs?

Freely available materials are much preferred, and it is much better if they are included in your submission. This encourages others to work with the same data and compare their results with yours (i.e., your work will be cited!) and ensures preservation of exactly the same versions of the datasets and programs that you used, which allows for long-term verifiability of your results.

However, if the nature of the task really justifies it, you may have to use commercial data or software. In this case, at least anybody who has the money can, in theory, verify your results. Still, if simply repeating your whole research would cost less than buying the data, then what is your contribution? If, in contrast, repeating your research would cost far more than buying the data necessary to verify it, then yes, your contribution is somewhat useful.

Q: I use commercially available data or programs, or ones available only through some non-trivial procedure, such as signing an agreement. Should I provide them to your reviewers?

Yes. The conference has neither the funding to buy materials solely for reviewing nor the time to request materials from third parties, unless they can simply be downloaded at any moment.

Please clearly indicate which data should not be made public: say, put them in a directory named DONT_MAKE_THIS_PUBLIC and assume that the end users will see everything except this directory. Please also send me a message to draw my attention to the fact that a directory should be deleted before your data are made public. However, elsewhere (outside this secret directory) in your installation instructions, please indicate very clearly for the end users where and how the data can be bought and how much they cost.
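
Before submitting, you may want to check what the end users would actually see once that directory is removed. A minimal sketch in Python follows; the function name public_files is hypothetical, and the directory name is the one suggested above.

```python
from pathlib import Path

PRIVATE_DIR = "DONT_MAKE_THIS_PUBLIC"  # directory name suggested in the text above


def public_files(root, private_dir=PRIVATE_DIR):
    """Return relative paths of all files that are not inside the private directory."""
    root = Path(root)
    result = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # rel.parts[:-1] holds only the directory components of the path.
        if path.is_file() and private_dir not in rel.parts[:-1]:
            result.append(rel.as_posix())
    return sorted(result)
```

If the returned list still contains the installation instructions that say where and how the data can be bought, the public part of the package remains usable without the secret directory.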

Q: I use data or programs that the general public cannot obtain, even for a fee or by signing an agreement. Should I provide them to your reviewers?

No. The reviewers should not see anything that ordinary readers cannot see. Instead, they should follow your instructions to obtain the data. If obtaining the data is impossible, or is more trouble than is justified by the complexity of the task or by the specificity of the data that the task requires, then the reviewers will correctly score your software as impossible or too hard to set up to be of any practical use.

Data that are freely available

Q: I use data or programs that can be easily downloaded from some URL. Should I include them in the package?

Yes, preferably. This ensures availability and preservation of exactly the same versions of the datasets and programs that you used, which allows for long-term verifiability of your results. Exceptions are large datasets that are very well maintained by a large and stable institution, are widely used, are likely to be installed on any researcher's computer, and thus are widely available in many copies, such as WordNet or the Penn Treebank. In contrast, data that are available from a small company or a personal website (say, yours) and that may no longer be available in the long run should be included with your submission for long-term preservation.
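
The policy does not ask for this, but one way to let others confirm that they have exactly the same versions of the files is to include a list of checksums in the package. The sketch below, in Python, is only an illustration: the names sha256_of and build_manifest and the choice of SHA-256 are assumptions, not requirements.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to handle large data."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

A reader who later downloads the same dataset from its original URL can rebuild the manifest and compare it with yours to see whether the files have changed.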

I thank the CICLing authors and the general public for providing the questions answered here.

Comments: A.Gelbukh.