Reliability and Meta-reliability of Language Resources

António Branco

University of Lisbon

Given the increasing complexity and expertise involved in the development of language resources, there has been a growing interest in finding mechanisms by which the design and development of language resources may be treated as a first-class citizen in terms of scientific work, so that individual CVs and careers can be fairly credited and rewarded for it. Ongoing initiatives such as the single registry number for language resources, or the recent studies on measures and metrics to ascertain their reliability, are illustrative examples of this trend.

Concomitant with this movement towards greater credibility, and in the opposite direction, worrying signs have been appearing that, even in mature and well-established scientific fields, scientific activities and results may be unreliable to an extent larger than expected or acceptable. That this issue has recently hit the mass media is but an indicator of the volume and relevance of these signs, whose assessment and discussion are unavoidable.

These signs have been related, for instance, to the realization that a considerable proportion of published results are not being replicated by independent researchers; to deliberately falsified paper submissions, with introduced errors and fake authors, which get easily accepted even by respectable journals; or to the statistics and outcomes of surveys of scientists on questionable practices, with scores higher than one might expect or be ready to accept.

A number of causes have been suggested for these signs, including, for instance, increasingly sloppy reviewing; the growing number of so-called “minimal-threshold” journals; publication policies that do not require the sharing of at least the raw data; or the non-disclosure of the software developed and used to obtain the published results.

Underlying these immediate causes, a number of factors have been pointed out, including, for instance, insufficient negative incentives or peer pressure against the above practices; career and promotion pressure too biased towards quantity; widespread lack of interest in negative results as an intrinsic part of scientific progress; widespread disfavoring of replication activities by funding agencies; poor or nonexistent retraction procedures for results that are found to be wrong or flawed after publication; ideological pressure to get economic return from research results; etc.

In this talk I’ll be interested in discussing which of these meta-reliability issues may be recognized as having the conditions to arise in our field as well, which do not apply to it given its specific nature, and which risks may be specific to it. The ultimate goal of this exercise is to contribute to reinforcing the meta-reliability of language resources, and to the credibility of our scientific work around them.