BioC: A Minimalist Approach to Interoperability for Biomedical Text Processing

BioC is a simple format for sharing text data and annotations. It allows many different kinds of annotations to be represented. We provide simple code to hold this data, read it from XML and write it back to XML, and perform some sample processing.
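To make the idea of holding the data, reading it, and writing it back to XML concrete, here is a minimal sketch in Python. It is not the code provided by the BioC project. The class names (`Document`, `Passage`, `Annotation`) and the functions `to_xml` and `from_xml` are hypothetical, and the element names assume a simplified collection/document/passage/annotation layout that should be checked against the BioC DTD before use.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


# Hypothetical containers; the real BioC data model has more fields.
@dataclass
class Annotation:
    id: str
    kind: str      # stored as <infon key="type"> (assumed convention)
    offset: int    # character offset of the annotated span
    length: int    # length of the annotated span in characters
    text: str


@dataclass
class Passage:
    offset: int
    text: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Document:
    id: str
    passages: List[Passage] = field(default_factory=list)


def to_xml(documents: List[Document], source: str = "example") -> str:
    """Serialize documents to a simplified, BioC-like XML string."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = source
    for doc in documents:
        d = ET.SubElement(collection, "document")
        ET.SubElement(d, "id").text = doc.id
        for p in doc.passages:
            pe = ET.SubElement(d, "passage")
            ET.SubElement(pe, "offset").text = str(p.offset)
            ET.SubElement(pe, "text").text = p.text
            for a in p.annotations:
                ae = ET.SubElement(pe, "annotation", id=a.id)
                ET.SubElement(ae, "infon", key="type").text = a.kind
                ET.SubElement(ae, "location",
                              offset=str(a.offset), length=str(a.length))
                ET.SubElement(ae, "text").text = a.text
    return ET.tostring(collection, encoding="unicode")


def from_xml(xml_string: str) -> List[Document]:
    """Parse the simplified XML produced by to_xml back into objects."""
    root = ET.fromstring(xml_string)
    documents = []
    for d in root.findall("document"):
        doc = Document(id=d.findtext("id"))
        for pe in d.findall("passage"):
            p = Passage(offset=int(pe.findtext("offset")),
                        text=pe.findtext("text"))
            for ae in pe.findall("annotation"):
                loc = ae.find("location")
                p.annotations.append(Annotation(
                    id=ae.get("id"),
                    kind=ae.findtext("infon[@key='type']"),
                    offset=int(loc.get("offset")),
                    length=int(loc.get("length")),
                    text=ae.findtext("text"),
                ))
            doc.passages.append(p)
        documents.append(doc)
    return documents


# Made-up example data: one document, one passage, one annotation.
example = Document("doc1", [
    Passage(0, "BRCA1 is a gene.",
            [Annotation("a1", "gene", 0, 5, "BRCA1")]),
])
round_tripped = from_xml(to_xml([example]))
```

Because the annotation stores an offset and a length instead of embedding markup in the text, many independent annotations can point into the same passage without modifying it.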

Background

Strong research efforts have produced many manually labeled text corpora and many NLP and text mining tools that are essential for searching for and extracting information from text. To encourage combining these efforts into larger and more capable systems, it is highly desirable to have a common interchange format that represents, stores, and exchanges the data in a simple manner.

BioC goals

  • simplicity
  • interoperability
  • broad use and reuse

Learning to use a format, or a software module that processes that format, should require little investment. We are interested in reuse, so we focus on common NLP tasks that are broadly useful for text mining.

Last updated: 2016-07-28