Tools to create and deploy COCOA files from CommonCOW corpora
These tools create COCOA stand-off files from CommonCOW corpora stored in COW-XML format and the corresponding CommonCrawl files. They can also re-create CommonCOW corpora from COCOA and CommonCrawl files. These are reference implementations and are NOT optimized for efficiency. You're invited to write better and more efficient ones in C, move the processing to the S3 cloud, or whatever.
Please don't ask yourself why! It's because copyright legislation is completely messed up in Europe, especially Germany. This is the only way we can distribute our high-quality web corpora created from CommonCrawl data.
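To illustrate the general stand-off idea only, here is a minimal Python sketch. It does NOT reproduce the actual COCOA file format or the code in this repository; the function names (make_standoff, rebuild) and the toy data are hypothetical. The point is that the stand-off file stores only offsets into the original crawl data, so the corpus text can be rebuilt by anyone who has both the offsets and the original data.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) character offsets into the source.
Span = Tuple[int, int]


def make_standoff(source: str, segments: List[str]) -> List[Span]:
    """Find each corpus segment in the source, in order, and keep only offsets."""
    spans: List[Span] = []
    cursor = 0
    for seg in segments:
        start = source.find(seg, cursor)
        if start == -1:
            raise ValueError(f"segment not found in source: {seg!r}")
        end = start + len(seg)
        spans.append((start, end))
        cursor = end  # continue searching after the previous match
    return spans


def rebuild(source: str, spans: List[Span]) -> List[str]:
    """Re-create the corpus segments from the source text and the offsets."""
    return [source[start:end] for start, end in spans]


if __name__ == "__main__":
    # Toy stand-in for a crawled document and the cleaned corpus text.
    crawl_text = (
        "MENU Home | About\n"
        "This is a good sentence. Cookie notice.\n"
        "Another kept sentence."
    )
    corpus = ["This is a good sentence.", "Another kept sentence."]

    standoff = make_standoff(crawl_text, corpus)
    print(standoff)
    assert rebuild(crawl_text, standoff) == corpus
```

In this sketch the list of offsets contains none of the original text, which is the property that makes a stand-off file distributable separately from the crawl data it points into.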
My LREC 2016 paper about COCO/COCOA: http://rolandschaefer.net/?p=994
Our web corpora: http://corporafromtheweb.org/
CommonCrawl: http://commoncrawl.org/