BioC-formatted corpora and BioC-compliant tools

BioC is a simple XML format to share text documents and annotations. It allows a large number of different annotations to be represented. We provide simple code to hold this data, read it and write it back to XML files, and perform basic text processing tasks.
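
As an illustration of the format described above, here is a minimal sketch that reads a BioC-style document with Python's standard library only. It does not use any of the BioC libraries listed on this page. The sample document, the function name `read_annotations`, and the `ABBR` annotation type are hypothetical; the element names (`collection`, `document`, `passage`, `annotation`, `location`) follow the BioC DTD as commonly documented, and the sketch assumes that annotation offsets are counted from the start of the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample collection, written by hand for this sketch.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>20131125</date>
  <key>example.key</key>
  <document>
    <id>0000001</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>Heat shock protein (HSP) levels rose.</text>
      <annotation id="SF0">
        <infon key="type">ABBR</infon>
        <location offset="20" length="3"/>
        <text>HSP</text>
      </annotation>
    </passage>
  </document>
</collection>
"""


def read_annotations(xml_text):
    """Return one dict per annotation found in a BioC-style XML string."""
    root = ET.fromstring(xml_text)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            passage_offset = int(passage.findtext("offset"))
            passage_text = passage.findtext("text") or ""
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                # Assumption: offsets are document-level, so subtract the
                # passage offset to index into the passage text.
                start = int(location.get("offset")) - passage_offset
                length = int(location.get("length"))
                results.append({
                    "doc_id": doc_id,
                    "id": annotation.get("id"),
                    "type": annotation.findtext("infon"),
                    "text": annotation.findtext("text"),
                    "span": passage_text[start:start + length],
                })
    return results


for item in read_annotations(SAMPLE):
    print(item)
```

Comparing the annotation's own `text` with the `span` recovered from the passage is a cheap consistency check on the offsets. Writing the data back out can be done with `ET.tostring(root, encoding="unicode")`, although the BioC libraries listed below are the better choice for real use.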

BioC-formatted corpora

  • Abbreviation detection in the biomedical domain

    These collections of PubMed abstracts, manually annotated for abbreviated terms in biomedical text, have been converted to BioC format and re-evaluated by four annotators to improve their consistency and quality.
    • Schwartz and Hearst Corpus

      1000 PubMed abstracts, as presented in the original paper. The BioC-compliant tool, the Schwartz and Hearst algorithm, is included in the BioC-Java package.

    • Ab3P corpus

      1250 PubMed abstracts, as presented in the original paper. The BioC-compliant tool, the Ab3P algorithm, is included in the BioC-C++ package.

    • BIOADI corpus

      1200 PubMed abstracts, as presented in the original paper.

    • MEDSTRACT corpus

      199 PubMed citations, the old version of the corpus presented in the original paper.

  • Other ...

BioC-compliant tools

  • BioC Implementations
    To help developers and improve interoperability between systems, BioC libraries have been implemented in several programming languages:
    • BioC-C++
    • BioC-Java
    • BioC-SWIG for Python and Perl
    • PyBioC

  • BioC NLP Pipeline
  • NCBI Text Mining tools
  • NaCTeM BioC resources
  • NICTA Brat2BioC
  • iSimp

Last updated: 2013-11-25