bioSyntax Release TODO

Hey team,
Our little project is really coming together! I know it's a busy time coming up for all of us, but I think pushing back the release deadline by a week and a half will let us finish this strong. Wednesday December 13th is the goal deadline.

VanBug

On Dec 14th there is a VanBug (Vancouver Bioinformatics) meeting with 3-minute lightning talks. We can use this to let bioinformaticians across Vancouver know about bioSyntax and to get the word out. This is a great opportunity to put yourself out there and meet more of the bioinformatics community; I'm hoping one of you will be keen to represent bioSyntax.

Prepare and Give 3 minute lightning talk on bioSyntax (Alyssa)
🎉 Alyssa won 3rd place =D w00t

Manuscript

This past weekend I spent a few hours going over the manuscript and I've evenly divided the remaining work. As per ICMJE, being a scientific author means you must contribute a non-trivial portion to the writing and take responsibility for everything included (so proofread others' work too). This needs to be done earlier than the final deadline because it's going to have to go through some revisions. Please take 3-4 hours over the next week to complete your section; they are mostly short but have to be thoughtfully written. Due: Dec 5th

Manuscript Draft Finished (All)
Compile references in google doc into one reference manager and upload library (?)

Tasks Remaining

Core Syntaxes: We are incredibly close! I can do sam-vim. Anicet is working on PDB-gedit. We need VCF-Gedit and SAM-Gedit.
VCF-sublime (German)
PDB-gedit (Anicet)
SAM-vim (Artem)
VCF-gedit (German)
SAM-gedit (Jeff)
Go back to all Fasta formats (Sublime / Vim / Gedit) add 16-color NT coloring (no context recog.)

Open Bug-fixes

I've made our gedit-theme bioSyntax/syntax/gedit/bioSyntax.xml. We need someone to go through our gedit-syntax files and change the <styles> section so that all colors are map-to="bioSyntax:VARNAME". I've done bed.lang, clustal.lang and faidx.lang as examples
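For whoever picks this up, here is a rough sketch of how the <styles> rewrite could be scripted instead of edited by hand. It assumes each <style> element's id can be reused as the bioSyntax VARNAME, which is a guess; the real names have to match what is defined in bioSyntax.xml. The function name remap_styles and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def remap_styles(lang_xml, theme="bioSyntax"):
    """Point every <style> in a gedit .lang document at the given theme.

    Assumption: the style id doubles as the theme variable name.
    """
    root = ET.fromstring(lang_xml)
    for style in root.iter("style"):
        style.set("map-to", f"{theme}:{style.get('id')}")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, cut-down .lang content (not one of our real files)
example = """<language id="bed" name="bed" version="2.0">
  <styles>
    <style id="chr" name="Chromosome" map-to="def:keyword"/>
    <style id="score" name="Score" map-to="def:number"/>
  </styles>
</language>"""

print(remap_styles(example))
```

Note that ElementTree drops XML comments and the XML declaration when it re-serializes, so diff the output against bed.lang, clustal.lang or faidx.lang before committing anything generated this way.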

Meeting up

Let's take a break this week from meeting as last week went pretty long. If you can all work independently on the thing you're assigned and the manuscript then we can meet next week to plan out the finish and do what we have to do. I think we're all a bit less busy then too.

Fill out Poll: https://dudle.inf.tu-dresden.de/bioSyntax3/

OK, I think those are the major points. Let me know if there's anything more we'd like to discuss. Please sign yourself up for tasks which don't have someone working on them (just edit this comment). Also do 1 thing at a time.

bioSyntax TODO -- Post-Release

We're coming up with a few good ideas for things we should work on but which don't fall into the initial release. Feel free to edit and add things here:

Add Atomic Coloring. JMol / CPK coloring to atoms/elements when they appear in a file-format (PDB).
Set-up Vim / Sublime / Gedit to only use bioSyntax theme when a bioSyntax format is being used; otherwise use the default or preset theme. Sublime Example
Re-write the bed-gedit syntax to use 'Robust Column Selection'
Optimize the Regex Engine for VCF -gedit -sublime -vim(?) to account for catastrophic backtracking. (See vcf-less for a fixed example)
Secondary Color Gradient: In BED/WIG files where there can be a score, have one color scheme (like we have) for the 0-1000 range. Have a second color gradient (orange?) which recognizes 0.0-1.0 (decimal scale). This will then support two widely used data ranges: 0-1 and 0-1000.
Make 'Infographics' for complex file-types (SAM, VCF, GTF) to help users learn and interpret the file formats. Include things such as PHRED numeric scale, FLAG conversion bits, what each field is etc... Use bioSyntax theme colours as a teaching tool here.
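For the infographics item, a small sketch of the two conversions it mentions may help when drafting the figures. It decodes a SAM FLAG into its bits and turns a quality string into PHRED scores. The offset of 33 assumes Sanger / Illumina 1.8+ encoding; the function names are just for illustration.

```python
from __future__ import annotations

# SAM FLAG bits as listed in the SAM specification
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """Return the description of every bit set in a SAM FLAG."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a quality string to numeric PHRED scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


print(decode_flag(99))
print(phred_scores("II5!"))
```

FLAG 99 breaks down into paired, proper pair, mate reverse strand and first in pair; the quality string "II5!" maps to 40, 40, 20 and 0.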

Installer Updates:

The script should output the requirements if it fails; or prompt user to type in which software to install for
For less: inform user and prompt (Y/N) for software updates and adding alias commands
On the website, include uninstall instructions. (i.e. delete these files)

hackseq Day 3 Goals

Priorities for Day 3

Sprint Meet-up - Nov 20th + 24th

As discussed in the meeting, it would be fun to meet up and work together on this.

Monday November 20th -- 1:30 pm-9:00 pm+. We'll meet at the BCCRC (675 W 10th Ave), 13th Floor Meeting Room. You'll have to get a hold of me to let you up. You don't have to come right at 1:30 pm; come whenever you're available and stay as long as you're available : )

Parking is available outside but is paid; free after 5pm (but I have to let you in).

Remote people; we can get on slack and chat during this time and make it a party 👍

Phase 1: Alpha-completion (Due Tuesday 21st)

Goals

Complete all core syntax files (#14)
Define a complete theme set
On the website have "Install Instructions" page up and correct for beta-testers
Have a running less/vim installer.
Create test sublime / gedit packages

Round 2: Friday November 24th -- 5:00 pm-9:00 pm+

Team Meeting Times

Schedule for team meetings on Slack Channel

For the people working remote; make sure to login for these times on Slack.

Day 1

Initial Meeting: 9:30AM PST
Mid-day Meet: 1:00PM PST
End of day Meet: 4:30PM PST

Day 2

Initial Meeting: 9:30AM PST
Mid-day Meet: 1:00PM PST
End of day Meet: 4:30PM PST

Day 3

Initial Meeting: 9:30AM PST
Mid-day Meet: 1:00PM PST
End of day Meet: 4:30PM PST

Porting to less

Two formats, .sam and .vcf, are often very large and cannot be opened quickly in vim or any other text editor without loading to memory (although vim is decent if you have enough memory). This can be sort of solved by using head. The better solution is using less for .sam and .vcf. So can we have syntax highlighting there for these important formats?

We can leverage the source-highlight package to accomplish this. I believe the syntax-language files may be shared with gedit which will save work on that end.

Installing `source-highlight` in less (Ubuntu)

Install source-highlight to your system:

sudo apt-get update
sudo apt-get install source-highlight

Append these lines to your ~/.bashrc and/or ~/.zshrc


## Syntax highlighting in less
## For Ubuntu / Fedora
export LESSOPEN="| /usr/share/source-highlight/src-hilite-lesspipe.sh %s"
export LESS=" -R "

alias less='less -NSi -# 10'
alias more='less'

# Explicit fasta / sam less call for piping
# i.e:   samtools view -h aligned_hits.bam | sam-less
#
alias fa-less='source-highlight -f esc --lang-def=fasta.lang --outlang-def=bioSyntax.outlang --style-file=fasta.style | less'
alias sam-less='source-highlight -f esc --lang-def=sam.lang --outlang-def=bioSyntax.outlang --style-file=sam.style | less'
alias vcf-less='source-highlight -f esc --lang-def=vcf.lang --outlang-def=bioSyntax-vcf.outlang --style-file=vcf.style | less'

Note: On other systems src-hilite-lesspipe.sh may be installed to a different directory (e.g. CentOS: export LESSOPEN="| /usr/bin/src-hilite-lesspipe.sh %s").
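Since the script location varies, the installer could probe for it. A minimal sketch, assuming only the two locations mentioned in this post are worth checking (other distros may use paths not listed here); find_lesspipe and lessopen_line are illustrative names, not existing bioSyntax code.

```python
import os

# Locations mentioned above; extend as other systems are tested
CANDIDATES = [
    "/usr/share/source-highlight/src-hilite-lesspipe.sh",  # Ubuntu / Fedora
    "/usr/bin/src-hilite-lesspipe.sh",                     # CentOS
]


def find_lesspipe(candidates=CANDIDATES, exists=os.path.isfile):
    """Return the first candidate path that exists, or None."""
    for path in candidates:
        if exists(path):
            return path
    return None


def lessopen_line(path):
    """Build the export line to append to ~/.bashrc or ~/.zshrc."""
    return f'export LESSOPEN="| {path} %s"'


found = find_lesspipe()
if found:
    print(lessopen_line(found))
else:
    print("src-hilite-lesspipe.sh not found; is source-highlight installed?")
```

The exists argument is only there so the lookup can be tested without touching the real filesystem.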

Installing `bioSyntax` for less (Ubuntu)

Update the src-hilite-lesspipe.sh script in the source-highlight directory.

# source-highlight directory on your system
SRCDIR='/usr/share/source-highlight'

cd  $bioSyntax_PATH/syntax/less/

sudo cp src-hilite-lesspipe_BIO.sh $SRCDIR/src-hilite-lesspipe.sh

Copy over the *.lang, *.outlang and *.style files to the source-highlight directory.

#!/bin/bash
# quickInstall.sh
# Quick installer for less syntax
# for testing purposes

SRCDIR='/usr/share/source-highlight'

# Copy over src-hilite script
sudo cp src-hilite-lesspipe_BIO.sh $SRCDIR/src-hilite-lesspipe.sh


# Copy over language files
sudo cp fasta.lang $SRCDIR/
sudo cp sam.lang $SRCDIR/
sudo cp vcf.lang $SRCDIR/

# Copy over style files
sudo cp fasta.style $SRCDIR/
sudo cp sam.style $SRCDIR/
sudo cp vcf.style $SRCDIR/

# Copy over outlang files
sudo cp bioSyntax.outlang $SRCDIR/
sudo cp bioSyntax-vcf.outlang $SRCDIR/

Restart your terminal (or source the rc file you edited) for the changes to take effect.

Running bio-aware `less`

Automatic detection of file-extensions when reading entire file *.fa, *.fasta, *.sam
less hgr1.fa
Piping requires explicit use of fa-less, sam-less or vcf-less which can be combined in all the interesting ways you can come up with.
samtools view -h accepted_hits.bam | sam-less

Developing language syntax files (ongoing)

Source-highlight Documentation
Coloring is done by ANSI escape code.
While any color can be used, we'll impose a limit of 256 colors to maximize compatibility
Sadly Source-highlight only has 17 colors defined in its colors.h file. We would have to re-compile it to add more color compatibility. Can do quite a bit with 17 colors; just not amino-acid coloring.

1. Syntax regex are defined in /usr/share/source-highlight/<Language>.lang
1b. and have an associated <Language>.style
2. Are piped into less-readable format by esc.outlang
3. Which is then made pretty by /usr/share/source-highlight/esc.style
4. Automatic file-extension recognition for less is performed in src-hilite-lesspipe.sh, which holds the logic for running source-highlight. At Line 11 insert:

	*.fasta|*.fa|*.mfa)
	source-highlight -f esc --lang-def=fasta.lang --outlang-def=bioSyntax.outlang --style-file=fasta.style -i "$source" ;;
	*.sam)
	source-highlight -f esc --lang-def=sam.lang --outlang-def=bioSyntax.outlang --style-file=sam.style -i "$source" ;;
	*.vcf)
	source-highlight -f esc --lang-def=vcf.lang --outlang-def=bioSyntax-vcf.outlang --style-file=vcf.style -i "$source" ;;

To get less working, we define a single <language>.lang and <language>.style per language, plus an outlang file: bioSyntax.outlang for fasta.lang and sam.lang, and bioSyntax-vcf.outlang for vcf.lang.
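To make the extension-to-file mapping explicit, here is a sketch that mirrors the case patterns above as a lookup table and builds the same source-highlight command line. It is only an illustration of the dispatch logic (the real logic lives in the shell script); RULES and highlight_command are names invented here.

```python
import fnmatch

# (file patterns, lang file, outlang file, style file), copied from the case block
RULES = [
    (("*.fasta", "*.fa", "*.mfa"), "fasta.lang", "bioSyntax.outlang", "fasta.style"),
    (("*.sam",), "sam.lang", "bioSyntax.outlang", "sam.style"),
    (("*.vcf",), "vcf.lang", "bioSyntax-vcf.outlang", "vcf.style"),
]


def highlight_command(path):
    """Return the source-highlight argument list for a file, or None."""
    for patterns, lang, outlang, style in RULES:
        if any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns):
            return [
                "source-highlight", "-f", "esc",
                f"--lang-def={lang}",
                f"--outlang-def={outlang}",
                f"--style-file={style}",
                "-i", path,
            ]
    return None  # not a bioSyntax format; fall through to the default rules


print(highlight_command("hgr1.fa"))
print(highlight_command("notes.txt"))
```

Keeping the mapping in one table makes it easier to see which formats still need their own outlang file.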

Known Bugs

Some Terminals have 8-color support; some have 256-color. If the output in less looks like gibberish then chances are your terminal doesn't support 256 colors. Try tput colors to tell how many colors are supported. Will add 8-color theme in the future.
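The tput check could be automated along these lines. This is a sketch only: the "8-color" theme does not exist yet, and the theme names returned here are placeholders rather than real file names.

```python
import subprocess


def parse_color_count(tput_output):
    """Turn the text printed by `tput colors` into an int (0 if unreadable)."""
    try:
        return int(tput_output.strip())
    except ValueError:
        return 0


def terminal_colors():
    """Ask tput how many colors the current terminal supports."""
    try:
        result = subprocess.run(
            ["tput", "colors"], capture_output=True, text=True, check=False
        )
    except OSError:  # tput is not installed
        return 0
    return parse_color_count(result.stdout)


def theme_for(colors):
    """Pick a (hypothetical) theme variant for the reported color count."""
    return "256-color" if colors >= 256 else "8-color"


count = terminal_colors()
print(f"{count} colors reported, would use the {theme_for(count)} theme")
```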

Porting to Atom

By popular demand, I think it'd be a good idea to port all the syntax files to Atom as well. Atom is Github's open source text-editor that is cross-platform (Mac, Windows, Linux).

Its syntax highlighting system is based on TextMate's language grammar so converting the current Sublime Text files that we have to be compatible with Atom should be fairly straightforward. The only difference is that we'll be creating .cson grammar files (CoffeeScript Object Notation, basically a JSON-like format) and a Less stylesheet for the colour scheme. I would like to work on this and if anyone else would as well, let me know :)

Here's some resources I found that seem helpful:

hs17: Introductions

Welcome buddies,
Looks like we're a team for hackseq 17 to work on bioSyntax. Welcome! Let's start with some brief introductions?

My name is Artem, I'm a grad student at UBC in Genetics. I'm (mainly) a biologist and have merged computational work to further my research over my PhD. My research is split now between studying variation in human ribosomal RNAs and studying the effects of Transposable Elements on transcriptional innovation in cancer. Besides that I'm an avid climber and love talking about crazy / far off biology ideas.

@fransilvion
@Jwong684
@alyeffy
@lazypanda10117
@Ebedthan
@ahmdeen

Fixing GTF Syntax

Hi all,
I am trying to complete the GTF syntax and port it over to gedit. (@Ebedthan, @ababaian). As suggested in issue 14, this is near-complete, so I want to know what I need to fix for the sublime version first, and then how to port it over to gedit. Thank you.

Biosyntax Publication

Hi all,

During the Hackathon, I have been working on drafting up a report for a paper that we could put together for publication. I have drafted up a brief skeleton of what our project is about and added figures to demonstrate our tool's utility.

UPDATED MANUSCRIPT FILE. SEE COMMENT BELOW.

I'm not exactly sure how public this is, so I have set the share settings to "can comment" for now. Let me know if you want to add/modify anything in there.

Cheers,

J

Shared Team Resources

List of resources for bioSyntax [Edit as needed]

bioSyntax Meeting 2

Time / Place

Please complete the dudle poll to select a date for the next meeting.

The next meeting will be 6:30pm on Wednesday November 15th, in Room 416 Irving Barber Library. Note the half hour delay due to room booking. We'll discord in remote people.

Assignments / Items Due

Minutes from Meeting 1

Vim Core Syntaxes - Jasper / Gherman
Gedit Core Syntaxes - Anicet / Jeff
Less syntaxes - Artem
SAM gedit/sublime - Eric
Running Installer - Eric / Alyssa
Port to Atom - Alyssa
Outline for website documentation - Artem // Alyssa

I'll book a room in the library once we have a date; remote ppl we can Discord again.

Agenda (add items)

Wrap-up-a-thon Date
Define Release 1.0 'Finish Criteria' for bioSyntax (All)
-- File Formats
-- Software Ports
-- Themes
Report (Jasper)
-- Authorship. Requirements + Responsibilities for each of us
-- Funding for publication, hackseq... others
Installer (Eric + Alyssa)
Documentation / Website (Alyssa + Artem)
-- Website drafted
-- Manual Installations
-- How To: Make your own syntax
-- How To: Contribute
Syntax Specific Discussion
Assign Tasks

Vision + Plan for hackathon

BioSyntax: Parsing biological file formats for humans with syntax highlighting

A large component of bioinformatics involves reading and writing data in biological file-formats such as fasta, fastq, bed, gtf, vcf, sam, etc... While being easy to parse computationally, these and other biological file-formats are often illegible for scientists to read and write to directly. I’d propose you join the bioSyntax team and together we will develop a suite of syntax highlighting for bio-formats to be used with common text editors such as gedit or vim. This design solution will help researchers interact with their data more efficiently and gain better insight into the biological world. This project requires a strong understanding of regular expressions, an intimate familiarity with use-cases for some biological file specifications and a flair for human-interface design.

The core idea here is: how can we bring scientists closer to the underlying data, so they are able to interact with and interpret it more intuitively?

I'm trying to brainstorm some of the things we'll need to prep for this. Feel free to edit and add to the list. This isn't my project or my team, it's all of ours so chip in :)

Literature Of Interest

Which file formats do we want to develop syntax for? (What do you use?)

SAM / BAM
FASTA / FASTQ
VCF
Wig
Bed
GTF
PDB

Which programs are we going to develop the syntax for? (What do you use?)

gedit (and other simple text editors)
Vim / gVim
Emacs ?

How can we unify the colors, look and feel of all the file formats into one standard so it's universal?

Define a central color scheme which includes biological classes (nucleotides / amino acids / coordinates / strings / number values / ...)

Is there other non-syntax functionality which we would like to develop as well to help understand data?

Include a simple installer which installs auto-detection for file formats and the syntax files to a system

Feature List (To Do)

Features to improve bioSyntax which we are working on

Automated Installer Scripts (Eric -- Nov 12th)

Linux -

Sublime
Gedit / gtksourceview
Vim
less (+ Source Highlight)

Mac

Sublime
Gedit
Vim
less

Windows

Sublime
Gedit

Port to Vim

Core Syntaxes
Auxiliary Syntaxes

Port to Gedit

Core Syntaxes
Auxiliary Syntaxes

Pretty Features

To the main color scheme; add a slightly different blue for Uridine
In the SAM syntax; make distinct colors for SO:unsorted (invalid highlight) and SO:coordinate (nice color) or other SO:
In the VCF Syntax: add REGEX for recognition and parsing of the data fields
Make a block gradient coloring scope where foreground == background and it's all scaled
Research for stream highlighting. less more or something else for big data (Artem)
Write up short 'feature' description for front README (highlight things from the presentation as opposed to listing each format?)
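For the block gradient item, one way to think about the scaling is sketched below. It bins a score into the 24-step grayscale ramp of the xterm 256-color palette (indices 232-255) and prints it with foreground == background. Using the grayscale ramp is purely an assumption for illustration; the actual gradient colors are still to be decided, and the function names are invented here.

```python
def gradient_index(score, max_score=1000.0, steps=24):
    """Map a score in [0, max_score] to a bin in [0, steps - 1]."""
    fraction = min(max(score / max_score, 0.0), 1.0)  # clamp out-of-range scores
    return min(int(fraction * steps), steps - 1)


def block(score, max_score=1000.0):
    """Render a score as a solid block: same ANSI color for text and background."""
    color = 232 + gradient_index(score, max_score)  # xterm grayscale ramp
    return f"\x1b[38;5;{color}m\x1b[48;5;{color}m{score}\x1b[0m"


# 0-1000 style scores, then the same call on a 0-1 decimal scale
for value in (0, 250, 500, 1000):
    print(block(value), gradient_index(value))
print(block(0.5, max_score=1.0), gradient_index(0.5, max_score=1.0))
```

The max_score argument is what lets the same binning serve both a 0-1000 and a 0-1 score range.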

File Format / Syntax Compatibility Matrix

File format and software compatibility matrix for bioSyntax.

	status
X	Syntax Complete
o	In Development
-	Unavailable
*	Bug Fix Needed

Core Syntaxes

File Format	Description	sublime	vim	gedit	less
.fasta	Generic nt/aa sequence	X	X	X	X
.fastq	Fasta + PHRED quality	X	X	X	X
.clustal	Multiple Sequence Alignment	X	X	X	X
.bed	Genomic Ranges	X	X	X	X
.gtf	Genomic Annotation	X	X	X	X
.pdb	Protein Structure	X	X	Anc	X
.vcf	Variant Call Format	X	X	X	X
.sam	NGS Sequence Data	X	X	X	X

Auxiliary Syntaxes

File Format	Description	sublime	vim	gedit	less
.fasta	fasta alternative AA colors
-	Clustal	X	o	X	-
-	Taylor	X	o	X	-
-	Zappo	X	o	X	-
-	Hydrophobicity	X	o	X	-
.fai	Fasta Index (faidx)	X	X	X	X
.flagstat	samtools flag summary	X	-	-	X
.wig	Wiggle data	o	-	X	-
.newick	Tree Format	-	-	-	-
.pdbx	Protein Structure (large)	-	-	-	-
.phylip	Multiple Sequence Alignment	-	-	-	-
.cwl	Common Workflow Language	-	-	-	-

Porting to vim

This is a useful tutorial to get into vim syntax:
http://vim.wikia.com/wiki/Creating_your_own_syntax_files

Enable syntax in ~/.vimrc:

syntax enable

There is essentially a vim folder in your home directory:
Make these subdirectories:

~/.vim/syntax
~/.vim/ftdetect
The files in these subdirectories must match. (i.e. fasta.vim in each of those folders for *.fasta)
In ftdetect/fasta.vim: (detects file formats)

au BufRead,BufNewFile *.fasta set filetype=fasta
au BufRead,BufNewFile *.fa set filetype=fasta

In syntax/fasta.vim: (specifications)

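" Do nothing if a syntax is already loaded for this buffer
if exists("b:current_syntax")
        finish
endif

" Header line, then one match group per nucleotide
syntax match fastaHeader ">.*$"
syntax match ntA "A"
syntax match ntG "G"
syntax match ntC "C"
syntax match ntT "T"

" fastaHeader avoids clashing with Vim's built-in Comment group
hi def link fastaHeader Identifier
highlight ntA ctermfg=Black ctermbg=Green guibg=#272822
highlight ntG ctermfg=Black ctermbg=Yellow guibg=#FF8C00
highlight ntC ctermfg=Black ctermbg=Blue guibg=#2A0AFD
highlight ntT ctermfg=Black ctermbg=Red guibg=#FD0A0A

let b:current_syntax = "fasta"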

Something kept breaking when I followed the online tutorial so I broke it down to this skeleton for now. More to come.

bioSyntax ToDo List Day 1

Example files

upload example files to git repo

Installer

Windows Installer script
Linux Installer script
Mac Installer script

Syntax for File Formats

Color Scheme

Define Color Scheme File
Gradient Coloring
Full IUPAC nucleotides
Amino Acids (various kinds)

Hackseq Day 2 Goals

Goals for Day 2 Hackseq

Porting to Sublime

Initially we're going to be focusing on SublimeText / YAML for all the formats; we'll port it from there.

SublimeText uses YAML syntax highlighting
Readme

Starting an Example File for your bioformat

In Sublime > Tools > Developer > New Syntax

Installing a syntax for

Copy the '.sublime-syntax' file to

Linux: ~/.config/sublime-text-3/Packages/User
Windows: %APPDATA%/Sublime Text 3/Packages/User/
Mac: ~/Library/Application Support/Sublime Text 3/Packages/User/

NOTE: The .sublime-syntax file cannot contain any Tabs; everything is space-indented
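A quick way to check a syntax file for stray tabs before installing it, as a minimal sketch (lines_with_tabs is just an illustrative helper name). Note that it looks for literal tab characters only; a regex written as \t in the file is two ordinary characters and is fine.

```python
from __future__ import annotations


def lines_with_tabs(text: str) -> list[int]:
    """Return the 1-based numbers of lines containing a literal tab."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if "\t" in line
    ]


# Line 3 is indented with a real tab; line 4 only contains the regex text \t
sample = "contexts:\n  main:\n\t- match: '^>.*$'\n    - match: '\\t'\n"
print(lines_with_tabs(sample))
```

To check a real file, pass in its contents, for example pathlib.Path("fasta.sublime-syntax").read_text().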

Defining / Changing the color scheme

Overview

We'll be using 'Monokai.tmTheme' as the base theme for now.

Your bioformat syntax file should use already existing definitions in the theme file for the 'scope'

The default theme file is available here or it's zipped under
'sublime_text_3/Packages/Color Scheme - Default.sublime-package'

Polish SAM syntax highlighting capability

use column based selection logic in SAM syntax highlighting (update sublime)
port over to gedit

bioSyntax Theme

Special Biological Classes and their target 'scope'

In the bioSyntax Theme there are custom defined colors / classes for highlighting. Entities which are 'biological' such as genomic coordinates, nucleotides, software names should look the same across all formats so we are defining one naming scheme (also called Scope) for each of these bio-classes

If you have a different type of data to add; comment below and I'll add it to our list so its color scheme works with everything here

bioSyntax Theme File

As of November 30th, 2017, there is now a single unified theme defined in Hex / ANSI / Cterm for biological classes. All of the sublime syntax variable names and the bioMonokai theme have been updated to refer to the unified theme.
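When adding a new class to the theme, the Cterm value has to be picked to match the Hex one. The sketch below shows one common way to do that: find the nearest entry in the xterm 256-color palette (the 6x6x6 color cube plus the grayscale ramp). This is not necessarily how the existing theme values were chosen, and the helper names are invented here. Indices 0-15 are skipped because their actual colors depend on the terminal's settings.

```python
CUBE_LEVELS = (0, 95, 135, 175, 215, 255)


def hex_to_rgb(hex_color):
    """'#FF8C00' -> (255, 140, 0)"""
    value = hex_color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def xterm_palette():
    """Build indices 16-255 of the standard xterm 256-color palette."""
    palette = {}
    for i in range(216):  # 6x6x6 color cube, indices 16-231
        r, g, b = i // 36, (i // 6) % 6, i % 6
        palette[16 + i] = (CUBE_LEVELS[r], CUBE_LEVELS[g], CUBE_LEVELS[b])
    for i in range(24):  # grayscale ramp, indices 232-255
        level = 8 + 10 * i
        palette[232 + i] = (level, level, level)
    return palette


def nearest_cterm(hex_color):
    """Return the palette index closest to hex_color (squared RGB distance)."""
    rgb = hex_to_rgb(hex_color)
    palette = xterm_palette()
    return min(
        palette,
        key=lambda index: sum((a - b) ** 2 for a, b in zip(palette[index], rgb)),
    )


for color in ("#FF8C00", "#2A0AFD", "#FD0A0A"):
    print(color, nearest_cterm(color))
```

Plain RGB distance is a simple choice; a perceptual color distance might pick slightly different neighbours for some colors.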

Porting core syntax to gedit

Hey team,

First of all I want to focus on porting the core syntaxes to gedit. The work done already (I need your review and critique of it, and also testing on your laptops) is in this folder.

Completed syntax

fasta.lang: defining colors for fasta files;
fasta-zappo.lang: defining zappo colors for fasta files;
fasta-hydrophobicity.lang: defining hydrophobicity colors for fasta files;
fasta-taylor.lang: defining taylor colors for fasta files;
fastq.lang: defining color for fastq files.

Remaining syntax

sam.lang;
bed.lang;
pdb.lang;
vcf.lang.

I plan to work on these files on a daily basis and finish this work as soon as possible.
I'm open to your remarks and await your thoughts on what we have to do for this part of the project.

bioSyntax Meeting 1

Moving forward with bioSyntax we'll meet for ~1 hour to discuss the logistics of the next month, maybe also use the opportunity to hammer out some code while we're together. I've set times for 6pm PST since we have work/school. Please also indicate if you would prefer to meet at the UBC library or in the BC Cancer Research Centre (Broadway + Cambie).

We'll meet on Wednesday Nov 1 at 6pm at the Irving K Barber Library and video conference in remote people.

Agenda: (add things)

Define Release 1.0 'Finish Criteria' for bioSyntax (All)
-- File Formats
-- Software Ports
-- Themes
Report (Jasper)
-- What research do we each need to do
Installer (Eric)
Documentation / Front-end
-- Manual Installations
-- How To: Make your own syntax
-- How To: Contribute
Syntax Specific Discussion
-- Robust column selection (Artem)
-- Nucleotide Color Scheme
Assign Tasks
... hack till you have to go home.

Sublime-Fastq Syntax breaks when comment included

@Jwong684, When you get a moment there's a bug in the fastq-sublime syntax files. If there is text in the comment row (+ ...) then it breaks that row and the subsequent one. Also check out the other fq file from the SRA examples/nt-seq/test2.fq.gz
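For reference while debugging, here is a sketch of reading FASTQ by line position rather than by leading character, which is the behaviour the syntax needs to reproduce. The '+' row may repeat the header text, and a quality row can itself start with '@' or '+', so only the position within the 4-line record is reliable. The sketch assumes unwrapped 4-line records; the sample reads are made up.

```python
def fastq_records(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for start in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {start + 1}")
        yield header, sequence, plus, quality


# Made-up reads: the first has text on the '+' row and a quality row starting with '@'
sample = [
    "@read1 desc",
    "ACGT",
    "+read1 desc",
    "@+I!",
    "@read2",
    "GGCC",
    "+",
    "IIII",
]

for record in fastq_records(sample):
    print(record)
```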

Fasta Syntax

Fasta files are appropriate for viewing:

DNA sequences;
RNA sequences;
and protein sequences.

I have added example files of RNA and protein sequences.

Selecting Arbitrary Nth Column in Syntax

I was working on the mostly trivial case of fasta-index format (faidx) and I think because it was so simple I found a very nice way to select columns by the order in which they appear. The only requirement right now is that it is in a tab-delimited file.

What it does is match the first column until the first tab, scopes it, then pushes to contig.length

In contig.length every non-whitespace character is selected and scoped. Then when it hits the next tab it pops out.

The third column is then selected, scoped and pushed to genomic.offset. The fourth column is selected and then popped at the tab.

etc... This push-pop back and forth with tabs can be repeated for N number of columns which means that .bed, .bedpe, .gtf, .sam, and possibly some of .vcf can now be 'solved' since we know what type of data is supposed to be in the Nth column.

Can anyone think of a reason that this won't work or will break at some edge-case?

If not, we'll need to re-work those syntaxes as I think this is a more robust approach than trying to select each column by the data range which could be there.
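To help hunt for edge cases, here is a small emulation of the idea outside of Sublime: split on tabs and assign a scope by column position. The scope names are the ones used in the faidx syntax in this post; the function name and the example line are made up. It is not what the Sublime engine does internally, just a way to see what "Nth column" selection gives for odd inputs such as empty fields or extra columns.

```python
# Scope per column, in order (columns 1-5 of a .fai line)
FAIDX_SCOPES = [
    "coord.Chr.faidx",
    "coord.Start.faidx",
    "constant.numeric.faidx",
    "comment.line.faidx",
    "comment.line.faidx",
]


def scope_columns(line, scopes=FAIDX_SCOPES):
    """Pair each tab-delimited field with the scope for its column position."""
    fields = line.rstrip("\n").split("\t")
    return [
        (field, scopes[i] if i < len(scopes) else None)
        for i, field in enumerate(fields)
    ]


print(scope_columns("chr1\t248956422\t112\t70\t71"))
print(scope_columns("chr1\t\t112"))  # empty second column keeps its position
```

Two cases that seem worth testing in the real syntax are empty fields (two tabs in a row) and files that use spaces instead of tabs, since the position counting depends entirely on the tab delimiter.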

faidx.sublime-syntax

%YAML 1.2
---
name: faidx
file_extensions: [fa.fai,fasta.fai]
scope: source.faidx

contexts:
  main:
    # COLUMN 1
    - match: '^[\S]*\t'
      scope: coord.Chr.faidx
      push: contig.length

    # COLUMN 3
    - match: '(?<=\t)[\S]*\t'
      scope: constant.numeric.faidx
      push: genomic.offset

    # COLUMN 5
    - match: '[\S]*$'
      scope: comment.line.faidx

  contig.length:
    # COLUMN 2
    - match: '[\S]*'
      scope: coord.Start.faidx
    - match: \t
      pop: true

  genomic.offset:
    # COLUMN 4
    - match: '[\S]*'
      scope: comment.line.faidx
    - match: \t
      pop: true

ababaian / biosyntax-archive

biosyntax-archive's Introduction

ARCHIVED REPOSITORY

SEE: bioSyntax Repository FOR NEW VERSIONS
