Shallow Processing of Large Corpora SProLaC 2003

Lancaster University (UK), 27 March, 2003

The workshop is sponsored by: OntoText Lab, Sofia

Workshop motivation and aims

Corpora have developed in two main directions:

  • large corpora of at least 100 million tokens, and
  • small corpora of up to 1 million tokens.

The data in the former are annotated only morpho-syntactically, while the data in the latter carry more detailed syntactic and/or semantic information. Needless to say, both types of language corpora are valuable. The question arises, however, whether it is possible to build a really large corpus that is fully processed linguistically. Since this is a hard task that also involves metadata problems (theories, availability of appropriate tools, etc.), we put the stress on shallow parsing of unrestricted data. In our view, the automated creation of such a resource is a task of great importance. It would serve as a basis for linguistic research, consistency checking and validation, large-scale applications in Information Retrieval and Information Extraction, the testing of machine learning algorithms, and much else. The task involves subtasks such as combining diverse shallow processing techniques adequately in a sound and robust processor, and adapting shallow parsing approaches so that they feed smoothly into stages of deeper linguistic analysis.

The workshop aims to be a forum where researchers can present their work in Computational Corpus Linguistics and Language Engineering, and discuss the problems of design, management, linguistic interpretation and exploration of unrestricted data from both perspectives.

We envisage a one-day workshop with 10-12 presentations.

Topics of interest:

  • design principles for shallow-parsed large corpora;
  • text segmentation and preprocessing;
  • definition of the connection between the levels of processing;
  • chunking and partial parsing of large amounts of text;
  • machine learning methods with large coverage;
  • software systems for managing and accessing shallow-parsed large corpora;
  • applications of shallow-parsed large corpora.

There will be a general discussion at the end of the workshop.

Important dates

Deadline for workshop abstract submission: 10th January 2003

Notification of acceptance: 3rd February 2003

Final version of paper for workshop proceedings: 3rd March 2003


Papers should describe existing research connected to the topics of the workshop. Each presentation at the workshop will be 25 minutes long (20 minutes for the presentation and 5 minutes for questions and discussion). Each submission should state: title; author(s); affiliation(s); and the contact author's e-mail address, postal address, telephone and fax numbers. Abstracts (maximum 500 words, plain-text format) should be sent to:

Kiril Simov

The final version of the accepted papers should not be longer than 10 A4 pages. The format of the papers has to follow the guidelines for authors of the conference, which can be found at

The workshop will have its own proceedings.

Programme committee

Michael Barlow, USA
Tomaz Erjavec, Slovenia
Silvia Hansen, Germany
Atanas Kiryakov, Bulgaria
Sandra Kuebler, Germany
Ghassan Mourad, France
Joakim Nivre, Sweden
Kemal Oflazer, Turkey
Karel Oliva, Austria
Petya Osenova, Bulgaria (co-chair)
Vladimir Petkevic, Czech Republic
Adam Przepiórkowski, Poland
Geoffrey Sampson, UK
Kiril Simov, Bulgaria (co-chair)
Marko Tadic, Croatia
Dan Tufis, Romania
Tylman Ule, Germany
Tamas Varadi, Hungary
Nikolaj Vazov, Bulgaria
Andreas Wagner, Germany

Organizing committee

Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, CLPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Tel: (+359 2) 979 2825
Fax: (+359 2) 70 72 73

Petya Osenova
BulTreeBank Project
Linguistic Modelling Laboratory, CLPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Tel: (+359 2) 979 2825
Fax: (+359 2) 70 72 73