biothings / mygene.info

MyGene.info: A BioThings API for gene annotations

Home Page: http://mygene.info

License: Other

Python 92.70% HTML 1.17% CSS 0.86% JavaScript 5.26%

biothings gene gene-annotations api webservice bioinformatics ncats-translator

mygene.info's Introduction

MyGene.info

Description

MyGene.info is a web API for accessing gene annotation information (Gene Annotation Query as a Service). MyGene.info is part of the BioThings API collection, together with MyVariant.info, MyChem.info and more.

For more information see this reference:

Xin J, Mark A, Afrasiabi C, Tsueng G, Juchler M, Gopal N, Stupp GS, Putman TE, Ainscough BJ, Griffith OL, Torkamani A, Whetzel PL, Mungall CJ, Mooney SD, Su AI, Wu C. High-performance web services for querying gene and variant annotation. Genome Biol. 2016 May 6;17(1):91. doi: 10.1186/s13059-016-0953-9. https://www.ncbi.nlm.nih.gov/pubmed/27154141

Set Up MyGene.info Web Server Locally

1. Prerequisites

python (>=3.4)
git

On an Ubuntu/Debian system, you can install all prerequisites with:

sudo apt-get install python-dev python-setuptools git

2. Clone this repo:

git clone https://github.com/biothings/mygene.info.git

3. Set up a Python "virtualenv" (optional, but highly recommended):

sudo easy_install pip
sudo pip install virtualenv

virtualenv ~/opt/devpy

4. Install required python modules:

pip install -r ./requirements_web.txt

5. Make your own "config.py" file

cd src
vim config.py

from config_web import *
from config_hub import *
# And additional customizations

6. Run your dev server

python index.py --debug

python index.py --debug --port=9000


mygene.info's Issues

AttributeError in querymany

After about 40,000 queries through a querymany, this error crops up:

.../Documents/project/venv/lib/python3.6/site-packages/mygene/__init__.py in querymany(self, qterms, scopes, **kwargs)
    569                 out.extend(hits)
    570                 for hit in hits:
--> 571                     if hit.get('notfound', False):
    572                         li_missing.append(hit['query'])
    573                     else:

AttributeError: 'str' object has no attribute 'get'

What's going on?
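The traceback shows that at least one element of `hits` was a string rather than a dict; the cause is not established in this report. Until it is, a caller can guard its own post-processing. The following is a minimal sketch, not the client's actual code: `split_hits` is a hypothetical helper, and the string entry in the sample data is invented to reproduce the failure mode.

```python
def split_hits(hits):
    """Separate well-formed hits from 'notfound' entries and unexpected values."""
    found, missing, malformed = [], [], []
    for hit in hits:
        if not isinstance(hit, dict):
            # e.g. a bare string, which would raise AttributeError on .get()
            malformed.append(hit)
        elif hit.get("notfound", False):
            missing.append(hit["query"])
        else:
            found.append(hit)
    return found, missing, malformed


# Illustrative data only; the string entry is a made-up stand-in.
hits = [
    {"query": "1017", "_id": "1017"},
    {"query": "bogus", "notfound": True},
    "error",
]
found, missing, malformed = split_hits(hits)
print(found, missing, malformed)
```

Logging the `malformed` entries would show what the server actually sent back at the point of failure.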

tuning of default search scoring for wildcard searches

I believe we have scoring in place to prioritize human, mouse, and rat over other species, and that seems to be working well:

https://mygene.info/v3/query?q=BRCA2

But scoring with wildcard searches is far from perfect, eg:

https://mygene.info/v3/query?q=BRCA*&fields=symbol,name,alias

{
  "max_score": 1.55,
  "took": 15,
  "total": 5502,
  "hits": [
    {
      "_id": "106721785",
      "_score": 1.55,
      "name": "BRCA2 promoter\/silencer region",
      "symbol": "LOC106721785"
    },
    {
      "_id": "79184",
      "_score": 1.55,
      "alias": [
        "BRCC36",
        "C6.1A",
        "CXorf53"
      ],
      "name": "BRCA1\/BRCA2-containing complex subunit 3",
      "symbol": "BRCC3"
    },
    {
      "_id": "11200",
      "_score": 1.55,
      "alias": [
        "CDS1",
        "CHK2",
        "HuCds1",
        "LFS2",
        "PP1425",
        "RAD53",
        "hCds1"
      ],
      "name": "checkpoint kinase 2",
      "symbol": "CHEK2"
    },
    {
      "_id": "56647",
      "_score": 1.55,
      "alias": [
        "TOK-1",
        "TOK1"
      ],
      "name": "BRCA2 and CDKN1A interacting protein",
      "symbol": "BCCIP"
    },
    {
      "_id": "5932",
      "_score": 1.55,
      "alias": [
        "COM1",
        "CTIP",
        "JWDS",
        "RIM",
        "SAE2",
        "SCKL2"
      ],
      "name": "RB binding protein 8, endonuclease",
      "symbol": "RBBP8"
    },
    {
      "_id": "672",
      "_score": 1.55,
      "alias": [
        "BRCAI",
        "BRCC1",
        "BROVCA1",
        "FANCS",
        "IRIS",
        "PNCA4",
        "PPP1R53",
        "PSCP",
        "RNF53"
      ],
      "name": "BRCA1, DNA repair associated",
      "symbol": "BRCA1"
    },
    {
      "_id": "111589216",
      "_score": 1.55,
      "name": "BRCA1 intron 2 regulatory region",
      "symbol": "LOC111589216"
    },
    {
      "_id": "1845",
      "_score": 1.55,
      "alias": "VHR",
      "name": "dual specificity phosphatase 3",
      "symbol": "DUSP3"
    },
    {
      "_id": "57697",
      "_score": 1.55,
      "alias": [
        "FAAP250",
        "KIAA1596"
      ],
      "name": "Fanconi anemia complementation group M",
      "symbol": "FANCM"
    },
    {
      "_id": "29086",
      "_score": 1.55,
      "alias": [
        "C19orf62",
        "HSPC142",
        "MERIT40",
        "NBA1"
      ],
      "name": "BRISC and BRCA1 A complex member 1",
      "symbol": "BABAM1"
    }
  ]
}

In my mind, the scoring should preferentially weight matches to symbol (e.g., BRCA1 and BRCA2), then aliases (e.g., LINC01488/BRCAT8 and LINC02224/BRCAT107, both not shown here in the top 10), then name (e.g., BRCC3 and BCCIP), then any other field (e.g., CHEK2, RBBP8).

More info and context: cognoma/frontend#169
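The proposed priority (symbol, then alias, then name, then anything else) can be approximated on the client while server-side scoring is unchanged. This is only a sketch of the idea: `match_tier` and `rerank` are hypothetical helpers, and a prefix comparison stands in for the real wildcard matching.

```python
def match_tier(hit, prefix):
    """Return 0 for a symbol match, 1 for an alias match, 2 for a name match, 3 otherwise."""
    prefix = prefix.upper()
    if hit.get("symbol", "").upper().startswith(prefix):
        return 0
    aliases = hit.get("alias", [])
    if isinstance(aliases, str):  # "alias" may be a single string
        aliases = [aliases]
    if any(alias.upper().startswith(prefix) for alias in aliases):
        return 1
    if any(word.upper().startswith(prefix) for word in hit.get("name", "").split()):
        return 2
    return 3


def rerank(hits, prefix):
    # sorted() is stable, so hits within a tier keep their server order.
    return sorted(hits, key=lambda hit: match_tier(hit, prefix))


# Three hits taken from the response above, in their original order.
hits = [
    {"symbol": "BRCC3", "alias": ["BRCC36", "C6.1A", "CXorf53"],
     "name": "BRCA1/BRCA2-containing complex subunit 3"},
    {"symbol": "CHEK2", "alias": ["CDS1", "CHK2"], "name": "checkpoint kinase 2"},
    {"symbol": "BRCA1", "alias": ["BRCAI", "BRCC1"], "name": "BRCA1, DNA repair associated"},
]
reranked = rerank(hits, "BRCA")
print([hit["symbol"] for hit in reranked])
```

Client-side re-ranking only reorders the page of hits already returned; it cannot surface symbol matches that fall outside the top results.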

Restructure MyGene.info Homolog field

Is this something of interest to restructure the homolog field in MyGene.info?
Current structure:

"homologene": {
    "genes": [
        [
            9606,
            1017
        ],
        [
            10090,
            12566
        ],
        [
            10116,
            362817
        ]
    ]
}

This is less intuitive and not compatible with JSON-LD.

Proposed structure:

"homologene": {
    "genes": [
        {
            "taxid": 9606,
            "geneid": 1017
        },
        {
            "taxid": 10090,
            "geneid": 12566
        },
        {
            "taxid": 10116,
            "geneid": 362817
        }
    ]
}
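The conversion between the two structures is mechanical. A minimal sketch, with `restructure_homologene` as a hypothetical helper name, shows how a client or loader could produce the proposed form from the current one:

```python
def restructure_homologene(homologene):
    """Turn [taxid, geneid] pairs into {"taxid": ..., "geneid": ...} objects."""
    return {
        **homologene,  # keep any other keys unchanged
        "genes": [
            {"taxid": taxid, "geneid": geneid}
            for taxid, geneid in homologene["genes"]
        ],
    }


current = {"genes": [[9606, 1017], [10090, 12566], [10116, 362817]]}
proposed = restructure_homologene(current)
print(proposed)
```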

UMLS CUIs for genes

Is there any way they can be added? Right now, getting some kind of a map from EntrezGeneID to UMLS CUI (or vice versa) is an incredibly convoluted process.

Here is an example of the gene FOXP3. I can get the CUI for a single gene from here, but I have a list of thousands of genes for which I need to translate IDs, making this inefficient.

It looks like data for UMLS CUIs could be available for download here.

Import genomes of Myxococcus xanthus

I'm working on creating a "MyxoBase" for the myxo research community. Since wikigenomes uses WikiData as its main database, I need to import the annotated genes of 3 different strains into WD. But to do that, I need the genes in MyGene.Info to run the MicrobeBot.

The 3 strains of myxo are as follows (taxid):
246197, 1198133, 1198538 (The first three strains).
The taxon tree is listed here: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=34.

POST searching complex fields with uppercase field names

Consider the following biothings_client code snippet:

import biothings_client
client = biothings_client.get_client('gene')
qr = client.querymany(['P24941'],
                      scopes='uniprot.Swiss-Prot',
                      fields='uniprot.Swiss-Prot',
                      as_generator=True,
                      returnall=True)
print(qr)

No hits are returned:
{'out': [{'query': 'P24941', 'notfound': True}], 'dup': [], 'missing': ['P24941']}

However, hits are returned when using the biothings_client.query API call.

There appears to be a problem when scopes contains a complex (dotted) field name with uppercase letters.

we need to set up a better way to handle user inquiries

Currently, we just use "[email protected]" for user inquiries, and we should probably set up a better way (public, and searchable). Options:

google group
biostars.org special channel or tag?
free or commercial knowledgebase

new data source: dgidb

drug-gene interaction database

downloads page: http://www.dgidb.org/downloads

Integrate Ensembl Plant

Based on existing dumper & parsers (https://github.com/biothings/mygene.info/tree/master/src/hub/dataload/sources/ensembl) integrate Ensembl plant data in mygene. This BioMart should be used, on the same principle as the one we use for human, rat, ... https://plants.ensembl.org/biomart/martview

We'll start with Arabidopsis thaliana, and will then continue with other plants.

Return consistent types

Looks like the types of some fields can change between string, list, or non-existent. For example, "ensembl" and "alias".

This gets tedious to work with on the client side. I'll need to add checks for these fields now, which will inevitably end up as a helper library I use to wrap all calls. It'd be a lot easier if the API adhered to a strict schema.
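The client-side check described here usually comes down to one small helper. A minimal sketch, where `as_list` is a hypothetical name and the sample documents reuse alias values that appear in responses elsewhere on this page:

```python
def as_list(value):
    """Normalize a field that may be missing, a scalar, or a list into a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


doc_a = {"symbol": "DUSP3", "alias": "VHR"}               # alias is a string
doc_b = {"symbol": "BCCIP", "alias": ["TOK-1", "TOK1"]}   # alias is a list
doc_c = {"symbol": "LOC106721785"}                        # alias is absent

for doc in (doc_a, doc_b, doc_c):
    print(doc["symbol"], as_list(doc.get("alias")))
```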

Convert Entrez/Ensembl mapping dict to datatransform edge

Using the bt.hub.datatransform module, we need to define a new "dict" edge which will look up data from a dictionary instead of a mongo collection.

load pathway.reactome data from Reactome directly

Currently pathway.reactome data is loaded from ConsensusPathDB. A user reported that the Reactome data might not be up-to-date:

https://mygene.info/v3/query?q=pathway.reactome.id:R-HSA-983712

returns 209 gene hits (that is 206 proteins if counting by "uniprot.Swiss-Prot" field)

While Reactome website reports 184 proteins in this pathway:

https://reactome.org/PathwayBrowser/#/R-HSA-983712&DTAB=MT

We need to investigate the reason for the difference, and decide whether to load Reactome data directly from Reactome.

Proteins with no genes

I think it would be useful for mygene to also store information about proteins with no associated Entrez record. For example:
http://www.uniprot.org/uniprot/A2NXD2
http://www.uniprot.org/uniprot/Q5NV61

generate a data schema and allow returned JSON objects to follow a strict schema

A more "formal" solution to #42 and #7:

- generate a JSON-schema style schema on the gene object
  - there could be two versions of the schema available:
    - one faithfully represents the actual data objects (e.g. some fields should allow both a string and a list of strings)
    - one is a "strict schema", in which every field can have only one type (e.g. any field with even a small chance of being a list will be defined as a list)
  - schemas can be accessible at the /metadata/schema endpoint
- by default, the gene objects are still returned as they are stored
- with an optional "strict_schema=true" parameter, the returned gene objects will follow the "strict schema"
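One way to see what the "faithful" schema would have to allow is to record which types each field actually takes across a set of documents. The sketch below is an illustration only: `observed_types` and `list_fields` are hypothetical helpers, only top-level fields are inspected, and the sample documents are abbreviated.

```python
from collections import defaultdict


def observed_types(docs):
    """Map each top-level field to the set of Python type names seen for it."""
    types = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            types[field].add(type(value).__name__)
    return dict(types)


def list_fields(docs):
    """Fields that are a list in at least one document (lists in a strict schema)."""
    return sorted(
        field for field, names in observed_types(docs).items() if "list" in names
    )


docs = [
    {"symbol": "DUSP3", "alias": "VHR"},
    {"symbol": "BCCIP", "alias": ["TOK-1", "TOK1"]},
    {"symbol": "LOC106721785"},
]
print(observed_types(docs))
print(list_fields(docs))
```

Under the proposal, a field reported by `list_fields` would be declared as a list in the strict schema and as "string or list" in the faithful one.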

Capture HomoloGene version number

HomoloGene data is downloaded from Entrez, but it has its own versioning system separate from Entrez. The version number is in this file in the same dir (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/). Is it possible to add this to the metadata page?

Some ncRNAs don't have ensembl field, but apparently they do have ensembl ids.

I do not know whether this is a desired feature?
https://mygene.info/v3/gene/7012?fields=all has no ensembl field, but it does have an ensembl id: ENSG00000270141
And these 'genes' lack hg38 coordinates as well.

Load OMIM mappings

This file (https://omim.org/static/omim/data/mim2gene.txt) contains mappings between OMIM IDs and Ensembl Gene IDs.

Add Additional Field called "ontology_source" (e.g. BP, CC) for each GO term

link to bioconductor R package from docs

I was expecting to find a link to the mygene bioconductor package from somewhere in mygene.info documentation, but wasn't able to find it.

(also, I don't think the official mygene python package should be listed under "Third-party packages"...)

"No API definition provided." on https://mygene.info/v3/api

Incomplete search results for uncharacterized pig genes

E.g.

mygene.info/v3/query?q=A0A075B7H6
mygene.info/v3/query?q=ENSSSCG00000030825

Don't return results

Wrong gene coordinate for UBE2V2

http://mygene.info/v3/gene/7336

"genomic_pos_hg19": [
    {
      "chr": "20",
      "end": 48732496,
      "start": 48697661,
      "strand": -1
    },
    {
      "chr": "8",
      "end": 48977268,
      "start": 48920960,
      "strand": 1
    }
  ]

The right coordinate should be the second one. Not sure where the first one comes from.

issues running master branch of the project

Hello everyone!

Lately I have been trying to run the project in an Ubuntu 16.04 environment and I am facing the following issues:

In case I run the hub with git+https://github.com/biothings/biothings.api.git@1c3227c397250daaeef387be743d9e60c26d9bdd#egg=biothings that is in the requirements, I get the following error:
ImportError: cannot import name 'ThrottledESJsonDiffSyncer'
and I had to use the master branch of the biothings library
During a manual merge of the following configuration:
{ "_id" : "mygene", "name" : "mygene", "sources" : [ "cpdb" ], "root" : [ "cpdb" ], "doc_type" : "gene" },
I get the following error:
AssertionError: Expecting 2 collection _ids, got: ['mygene_20171102_6qat8yzn']
During a manual index of the following build: mygene_20171102_6qat8yzn,
I get the following error:
NameError: name 'INDEXER_CATEGORY' is not defined
During a scheduled dump of pharmgkb resource, I get the following error:
ERROR:pharmgkb_dump:Error while dumping source: stat: can't specify None for path argument
During a scheduled dump of homologene resource, I get the following errors:
- ERROR:homologene_batch_1:Can't find Entez data directory
- ERROR:homologene_dump:Error while dumping source: EOFError
During a scheduled dump of exac resource, I get the following errors:
- ERROR:exac.broadinstitute_exac_batch_1:Can't find Entrez data folder
- ERROR:exac_dump:Error while dumping source: 550
  fordist_cleaned_exac_nonTCGA_z_pli_rec_null_data.txt: No such file or directory
During a scheduled dump of entrez resource, I get the following error:
ERROR:entrez.entrez_genomic_pos_batch_1:[Errno 2] No such file or directory: '/home/mahdi/mygene.info.tmp/data/entrez/20171028/../ref_microbe_taxids.pyobj'
During a scheduled dump of refseq resource, I get the following error:
ERROR:refseq_dump:Error while dumping source: EOFError
During a scheduled dump of generif resource, I get the following error:
ERROR:generif_dump:Error while dumping source: EOFError
During a scheduled dump of ensembl resource, I get the following errors:
- ERROR:ensembl.ensembl_acc_batch_1:Can't find Entrez data folder
- ERROR:ensembl.ensembl_acc_upload:failed [steps=data,post,master,clean]: Can't find Entrez data
  folder
- ERROR:ensembl.ensembl_gene_batch_1:list index out of range
- ERROR:ensembl.ensembl_gene_upload:failed [steps=data,post,master,clean]: list index out of range
- ERROR:ensembl.ensembl_genomic_pos_batch_1:Can't find Entrez data folder
- ERROR:ensembl.ensembl_genomic_pos_upload:failed [steps=data,post,master,clean]: Can't find
  Entrez data folder
- ERROR:ensembl.ensembl_interpro_batch_1:Can't find Entrez data folder
- ERROR:ensembl.ensembl_interpro_upload:failed [steps=data,post,master,clean]: Can't find Entrez
  data folder
- ERROR:ensembl.ensembl_pfam_batch_1:Can't find Entrez data folder
- ERROR:ensembl.ensembl_pfam_upload:failed [steps=data,post,master,clean]: Can't find Entrez data
  folder
- ERROR:ensembl.ensembl_prosite_batch_1:Can't find Entrez data folder
- ERROR:ensembl.ensembl_prosite_upload:failed [steps=data,post,master,clean]: Can't find Entrez
  data folder
During a scheduled merge, I get the following error:
ERROR:asyncio:Exception in callback Cron.set_result(<_GatheringFu... 'mygene'",)]>) handle: <Handle Cron.set_result(<_GatheringFu... 'mygene'",)]>)>: No such builder for 'mygene'
Finally, could you please provide me with an explanation on how the hub schedules and performs the index operation automatically?

You can find a more detailed trace of these issues in the file below:
Errors of mygene.info.txt

Looking forward to hearing from you soon.

Thanks in advance!

Searching for common gene names using POST requests does not work

Clojure example:

(client/post "http://mygene.info/v3/query" {:form-params {:q "CDK3"}})
{:request-time 1527, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 1}, :streaming? true, :chunked? false, :reason-phrase "OK", :headers {"Access-Control-Max-Age" "60", "Access-Control-Allow-Headers" "Content-Type, Depth, User-Agent, X-File-Size, X-Requested-With, If-Modified-Since, X-File-Name, Cache-Control", "Server" "TornadoServer/4.5.1", "Content-Type" "application/json; charset=UTF-8", "Access-Control-Allow-Origin" "*", "Content-Length" "53", "Connection" "Close", "Access-Control-Allow-Methods" "GET,POST,OPTIONS", "Date" "Thu, 21 Sep 2017 14:41:27 GMT", "Access-Control-Allow-Credentials" "false", "Cache-Control" "max-age=604800, public"}, :orig-content-encoding nil, :status 200, :length 53, 
:body "[\n  {\n    \"query\": \"CDK3\",\n    \"notfound\": true\n  }\n]", :trace-redirects []}

You can also just search for CDK3 in the API and see that you get no results. CDK3 and other common gene names work for the get queries though :)

Thanks for the awesome work btw.

WNT7B

A name "WNT7B" gives this output (below). I don't think WNT4 should be there.

{
"max_score": 427.66583,
"took": 6,
"total": 4,
"hits": [
{
"_id": "7477",
"_score": 427.66583,
"name": "Wnt family member 7B"
},
{
"_id": "22422",
"_score": 341.75598,
"name": "wingless-type MMTV integration site family, member 7B"
},
{
"_id": "315196",
"_score": 303.5048,
"name": "Wnt family member 7B"
},
{
"_id": "54361",
"_score": 0.3820109,
"name": "Wnt family member 4"
}
]
}

Query genes by alternative names

@cgreene suggested we look into mygene.info for Project Cognoma: cognoma/core-service#29 (comment). My first impression is that this is a really awesome service that will help us a lot.

When I tried searching mygene.info/v3/query by gene name, no results were returned.

By name I mean that A1BG has the following Entrez Gene information:

Preferred Names

alpha-1B-glycoprotein

Names

HEL-S-163pA
epididymis secretory sperm binding protein Li 163pA

Is this feature missing because biologists usually search by symbol? It seems like there would be many situations where name search would help you identify a gene you were interested in.

go.BP.pubmed is sometimes a list and sometimes a number

This makes it harder to parse the json in statically typed languages.

e.g. see this response from the query http://mygene.info/v3/query?q=ATM&fields=all

      "go": {
        "BP": [
          {
            "evidence": "IDA",
            "id": "GO:0006468",
            "pubmed": 15916964,
            "term": "protein phosphorylation"
          },
          {
            "evidence": "IMP",
            "id": "GO:0006974",
            "pubmed": [
              15790808,
              17875758,
              24550317
            ],
            "term": "cellular response to DNA damage stimulus"
          },

How are the Entrez and Ensembl mappings determined?

I'm confused on how entrez and ensembl are merged. There are lots of genes with entrez and ensembl mapping that I can't find online. For example:

https://mygene.info/v3/gene/2576 (GAGE4)
has
"entrezgene": "2576" (https://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=2576)
and
"ensembl" "gene": "ENSG00000215269", (https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000215269;r=X:49570434-49577754;t=ENST00000445148)
"ensembl" "gene": "ENSG00000236362" (http://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000236362;r=X:49551278-49568218)

On the Entrez page, I see an Entrez link-out to ENSG00000215269 but not to ENSG00000236362.
So how/why was ENSG00000236362 mapped to 2576?

Please let me know, Thanks!

Consistently order data in lists

There are many (Entrez) genes which match multiple genes in Ensembl. The data for each ensembl field is a list and the order of the values in each field is not guaranteed to be the same. This makes it impossible to determine which values belong to which gene.

For example: this gene.
There are two ensembl genes linked: YBR181C and YPL090C (in that order).
The genomic positions (chromosomes) are (in order): XVI and II.
However: YBR181C is on chromosome II, and YPL090C is on chromosome XVI.
The order is not correct, and is not guaranteed to be consistent. This makes it impossible to match the ensembl ID with the correct gene's genomic position.

SNORD80 Missing data?

The mygene response for https://mygene.info/v3/gene/26774 is missing the genomic_pos field, even though the data exists here: https://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=26774
Is this right? Thanks!

Allow for exact matches only?

Querying something like "BRCA1", I get a lot of seemingly unrelated matches such as "BRAT1".

This is obviously a symptom of the nature of ElasticSearch. In analytical use cases, personally, I think fuzzy matches are dangerous.

Could we add a query parameter to require an exact match? Or maybe it exists and I'm not seeing the docs?
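Two partial workarounds can be sketched here, both under assumptions. The first restricts the search to the symbol field with a fielded query term (`symbol:BRCA1`); whether that behaves as a strict exact match is not confirmed in this thread. The second filters the returned hits on the client. `symbol_query_url` and `exact_symbol_hits` are hypothetical helper names.

```python
from urllib.parse import urlencode

BASE_URL = "https://mygene.info/v3/query"


def symbol_query_url(symbol, species="human"):
    """Build a query URL restricted to the symbol field (assumed fielded syntax)."""
    params = {"q": f"symbol:{symbol}", "species": species, "fields": "symbol,name"}
    return f"{BASE_URL}?{urlencode(params)}"


def exact_symbol_hits(hits, symbol):
    """Keep only hits whose symbol equals the query, ignoring case."""
    return [hit for hit in hits if hit.get("symbol", "").upper() == symbol.upper()]


url = symbol_query_url("BRCA1")
print(url)

# Abbreviated, illustrative hits.
hits = [{"symbol": "BRCA1"}, {"symbol": "BRAT1"}]
print(exact_symbol_hits(hits, "BRCA1"))
```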

get genomic_pos data from NCBI?

Currently, our genomic_pos and genomic_pos_hg19 fields are coming from Ensembl, therefore, only available for those species included in Ensembl release.

We should find a good data source from NCBI for the genomic position data, ideally one that covers more species.

Species field parsing

There's something weird going on with the species field, I think. It looks like it's only considering the first 10 values.

https://mygene.info/v3/query?q=__all__&species=9606 gives 96245 hits

The same query plus 9 more taxids gives 150569 hits
https://mygene.info/v3/query?q=__all__&species=9606,3702,107806,272634,260799,198094,265311,272947,471472,331636

But add another, say, 10090, which we know has genes, and you still get 150569 hits
https://mygene.info/v3/query?q=__all__&species=9606,3702,107806,272634,260799,198094,265311,272947,471472,331636,10090

include "gene_type" from Ensembl

Ensembl provides "gene_type" annotations for their Ensembl Genes, they can be retrieved via BioMart query. This is an example to get the mapping from Ensembl Gene ID to "gene_type":

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
			
	<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
		<Attribute name = "ensembl_gene_id" />
		<Attribute name = "gene_biotype" />
	</Dataset>
</Query>

We can integrate "gene_type" values into the current Ensembl data ingestion process.
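With `formatter = "TSV"` and `header = "0"`, the query above should yield two tab-separated columns per row: the Ensembl gene ID and its biotype. A minimal sketch of turning that output into a lookup table follows; `parse_gene_types` is a hypothetical helper, and the sample rows use placeholder IDs rather than real BioMart output.

```python
import csv
import io


def parse_gene_types(tsv_text):
    """Map Ensembl gene ID -> gene_biotype from headerless two-column TSV text."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2 or not row[0]:
            continue  # skip blank or incomplete rows
        gene_id, biotype = row[0], row[1]
        mapping[gene_id] = biotype
    return mapping


# Placeholder rows for illustration; not actual BioMart results.
sample = "GENE_A\tprotein_coding\nGENE_B\tlncRNA\n\n"
gene_types = parse_gene_types(sample)
print(gene_types)
```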

add ortholog data from panther

data files at ftp://ftp.pantherdb.org/ortholog/current_release

Need to contact Paul Thomas about permission to redistribute (TOU: http://pantherdb.org/tou.jsp)

Entrez genomic position

Since the genes are indexed by the Entrez ID, it would be helpful in certain cases to retrieve the Entrez genomic position (as well as the Ensembl genomic position), especially in cases where the two don't agree.

I can't retrieve KEGG info from zebrafish geneIDs

Hi,
I am using the queryMany function from the mygene R package 1.14.0 to retrieve annotations for zebrafish.

I have tried "refseq" and "entrezgene" as scopes, and it works great for several fields I have tried, except for those related to KEGG pathways ("pathway.kegg", "kegg", "pathway.kegg.name", "pathway.kegg.id").

code example:

Drerio.gene.ids <- queryMany(prot, scopes = "refseq", fields=c("entrezgene", "symbol", "name", "go", "pathway.kegg"), species=7955, returnall=TRUE)

As troubleshooting, I retrieved the KEGG fields but with a human entrezgene list as the scope, and it worked just fine.

Is there something I am missing?

Thanks

Annotate each key with the source of the data

There should be a way to determine where exactly each field within each document comes from. This is necessary to be able to use the data accurately, and its absence can lead to confusion. For instance, genes are indexed by their Entrez IDs, but the genomic positions come from Ensembl. I don't see anything about this in the documentation.

Currently, there is no way to determine the source of a particular field without looking through the source code. Even if the source of each field was described in the documentation, this is not ideal because certain fields may come from multiple different sources depending on the document you are looking at. For that reason, it would be more helpful if each document contained metadata describing where each field came from. This can, of course be optional.

suggest_from symbol query matches fields other than symbol

https://mygene.info/v3/query?q=GABA*&suggest_from=symbol^10&species=human&entrezonly=true intends to partial search by gene symbol for the term GABA, so all results should have an official symbol that starts with GABA.

However, the following payload is returned:

{
  "total": 40,
  "max_score": 1.55,
  "took": 6,
  "hits": [
    {
      "_id": "4942",
      "_score": 1.55,
      "entrezgene": 4942,
      "name": "ornithine aminotransferase",
      "symbol": "OAT",
      "taxid": 9606
    },
    {
      "_id": "2554",
      "_score": 1.55,
      "entrezgene": 2554,
      "name": "gamma-aminobutyric acid type A receptor alpha1 subunit",
      "symbol": "GABRA1",
      "taxid": 9606
    },
    {
      "_id": "9568",
      "_score": 1.55,
      "entrezgene": 9568,
      "name": "gamma-aminobutyric acid type B receptor subunit 2",
      "symbol": "GABBR2",
      "taxid": 9606
    },
    {
      "_id": "11345",
      "_score": 1.55,
      "entrezgene": 11345,
      "name": "GABA type A receptor associated protein like 2",
      "symbol": "GABARAPL2",
      "taxid": 9606
    },
    {
      "_id": "6529",
      "_score": 1.55,
      "entrezgene": 6529,
      "name": "solute carrier family 6 member 1",
      "symbol": "SLC6A1",
      "taxid": 9606
    },
    {
      "_id": "11337",
      "_score": 1.55,
      "entrezgene": 11337,
      "name": "GABA type A receptor-associated protein",
      "symbol": "GABARAP",
      "taxid": 9606
    },
    {
      "_id": "23710",
      "_score": 1.55,
      "entrezgene": 23710,
      "name": "GABA type A receptor associated protein like 1",
      "symbol": "GABARAPL1",
      "taxid": 9606
    },
    {
      "_id": "7915",
      "_score": 1.55,
      "entrezgene": 7915,
      "name": "aldehyde dehydrogenase 5 family member A1",
      "symbol": "ALDH5A1",
      "taxid": 9606
    },
    {
      "_id": "223",
      "_score": 1.55,
      "entrezgene": 223,
      "name": "aldehyde dehydrogenase 9 family member A1",
      "symbol": "ALDH9A1",
      "taxid": 9606
    },
    {
      "_id": "2566",
      "_score": 1.55,
      "entrezgene": 2566,
      "name": "gamma-aminobutyric acid type A receptor gamma2 subunit",
      "symbol": "GABRG2",
      "taxid": 9606
    }
  ]
}

It looks like only three of the hits actually match the symbol field. Is this a bug or am I misunderstanding the effect of the query?

Also is there documentation of suggest_from and all of its options?

Deterministic JSON ordering

Is it possible to make it so the same query always returns the same JSON text?

Currently, sometimes the max_score, total, and took fields appear at the beginning of the JSON and sometimes they appear at the end.

If the implementation is in python, OrderedDicts would probably fix this issue.
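Until the server output is stable, a client that needs byte-identical text can re-serialize each response itself. This is a minimal sketch with `canonical_json` as a hypothetical helper; it also drops `took` by default, on the assumption that this timing field can differ between otherwise identical requests.

```python
import json


def canonical_json(payload, ignore=("took",)):
    """Serialize with sorted keys, dropping top-level fields listed in `ignore`."""
    trimmed = {key: value for key, value in payload.items() if key not in ignore}
    return json.dumps(trimmed, sort_keys=True, separators=(",", ":"))


# The same payload with its keys in two different orders.
first = json.loads('{"max_score": 1.55, "took": 7, "total": 2, "hits": []}')
second = json.loads('{"hits": [], "total": 2, "took": 6, "max_score": 1.55}')

print(canonical_json(first))
print(canonical_json(first) == canonical_json(second))
```

`sort_keys=True` also sorts the keys of nested objects, so the individual hits are normalized as well.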

Restrict query to protein-coding genes

We'd like a way to restrict our query to entrez genes where type_of_gene="protein-coding". Essentially we want to query for human protein-coding entrez genes. The closest we could get is:

https://mygene.info/v3/query?q=TP53&fields=symbol%5E2%2Calias%2Ctype_of_gene&species=human&size=2&facets=type_of_gene&entrezonly=true

which returned:

{
  "facets": {
    "type_of_gene": {
      "terms": [
        {
          "term": "protein-coding",
          "count": 29
        },
        {
          "term": "pseudo",
          "count": 6
        },
        {
          "term": "ncRNA",
          "count": 1
        }
      ],
      "_type": "terms",
      "total": 36,
      "missing": 0,
      "other": 0
    }
  },
  "max_score": 459.45786,
  "took": 23,
  "total": 43,
  "hits": [
    {
      "_id": "7157",
      "_score": 459.45786,
      "alias": [
        "BCC7",
        "LFS1",
        "P53",
        "TRP53"
      ],
      "type_of_gene": "protein-coding"
    },
    {
      "_id": "653550",
      "_score": 23.245737,
      "alias": [
        "TP53TG3",
        "TP53TG3E",
        "TP53TG3F"
      ],
      "type_of_gene": "protein-coding"
    }
  ]
}

So we could filter by type_of_gene after receiving the response. But is there a way to filter at query time?
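The post-response filter mentioned above is short, and a query-time variant can be sketched alongside it. The query-time form assumes that `type_of_gene` can be used as a fielded term combined with `AND`; that is untested here and should be read as an assumption, not a documented feature. Both function names are hypothetical.

```python
from urllib.parse import urlencode


def protein_coding_hits(response):
    """Client-side filter: keep hits whose type_of_gene is "protein-coding"."""
    return [
        hit for hit in response.get("hits", [])
        if hit.get("type_of_gene") == "protein-coding"
    ]


def protein_coding_query_url(term):
    """Query-time filter (assumed syntax): AND the term with a fielded clause."""
    params = {
        "q": f"{term} AND type_of_gene:protein-coding",
        "species": "human",
        "entrezonly": "true",
        "fields": "symbol,alias,type_of_gene",
    }
    return "https://mygene.info/v3/query?" + urlencode(params)


# Abbreviated response; the second hit is invented for illustration.
response = {"hits": [
    {"_id": "7157", "type_of_gene": "protein-coding"},
    {"_id": "example-pseudo", "type_of_gene": "pseudo"},
]}
kept = protein_coding_hits(response)
url = protein_coding_query_url("TP53")
print(kept)
print(url)
```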

Cannot wildcard search for gene ID's

Hello all. Recently I was having trouble doing a wildcard search for gene ID's. That is, when making a query such as: http://mygene.info/v3/query?q=10800454%2A I would get results
{ "max_score": 1.55, "took": 7, "total": 2, "hits": [ { "_id": "9249", "_score": 1.55, "entrezgene": 9249, "name": "dehydrogenase/reductase 3", "symbol": "DHRS3", "taxid": 9606 }, { "_id": "ENSMUSG00000075014", "_score": 1.3, "name": "predicted gene 10800", "symbol": "Gm10800", "taxid": 10090 } ] }

However, the entrez gene ID 108000 corresponds to Cenpf, which can be verified pretty easily. The proposed workaround was to perform a batch query on the _id field, which is not searched by default, as well as on symbol, to allow for a generalized query.

However, it appears that a recent update has made the _id field unsearchable via prefix. The query http://mygene.info/v3/query?q=_id:1687*%20OR%20symbol:1687*&species=mouse now returns
{ "success": false, "error": "Could not execute query due to the following exception(s): ['query_shard_exception Can only use prefix queries on keyword and text fields - not on [_id] which is of type [_id]']" }

The other field for gene ID's, entrezgene is also unsearchable by prefixed query, since it is of type 'long':

{ "success": false, "error": "Could not execute query due to the following exception(s): ['query_shard_exception Can only use prefix queries on keyword and text fields - not on [entrezgene] which is of type [long]']" }

I might suggest changing this field to a string, which would allow wildcard'ed and prefixed queries? Either way, would love to see this issue fixed or if a developer could suggest another workaround. Thanks!

Add pharos as xrefs

Couldn't query on hgnc using POST method

Trying to query based on hgnc id:

This returns results: http://mygene.info/v3/query?q=hgnc:1771

However this doesn't:
import requests
requests.post('http://mygene.info/v3/query', params='q=1771&scopes=hgnc').json()

import EBI's gene2phenotype

https://www.ebi.ac.uk/gene2phenotype/downloads

seems like a reasonably simple data parser to write.

currently 2332 data records with these fields (from https://www.ebi.ac.uk/gene2phenotype/README):

  - gene symbol:                  HGNC gene symbol 
  - gene mim:                     OMIM number for a gene entry
  - disease name:                 Name provided by the curator
  - disease mim:                  OMIM number for a disease entry
  - disease confidence:           One value from the list of possible categories: both DD and IF, confirmed, possible, probable
  - allelic requirement:          comma-separated list of allelic requirement attributes. Possible values are: biallelic, monoallelic (Y),
                                  imprinted, uncertain, monoallelic, hemizygous, x-linked dominant, x-linked over-dominance, mosaic,
                                  mitochondrial, digenic 
  - mutation consequence:         One value from the list of possible consequences: 5_prime or 3_prime UTR mutation, activating,
                                  all missense/in frame, cis-regulatory or promotor mutation, dominant negative, increased gene dosage,
                                  loss of function, part of contiguous gene duplication, part of contiguous genomic interval deletion, uncertain
  - phenotypes:                   semicolon-separated list of HPO (http://www.human-phenotype-ontology.org/) IDs
  - organ specificity list:       semicolon-separated list of organs
  - pmids:                        semicolon-separated list of PMIDs 
  - panel:                        G2P panel: Cancer, Cardiac, DD, Ear, Eye or Skin
  - prev symbols:                 Symbols previously approved by the HGNC for this gene
  - hgnc id:                      HGNC identifier
  - gene disease pair entry date: Entry date for the gene disease pair into the database

problems querying with POST (all ids with "notfound": true)

I'm trying to get the symbols from a bunch of uniprot IDs, but I'm getting "notfound": true on all POST queries (using uniprot or any other ids).

For example, for me this works:

curl 'http://mygene.info/v3/query?q=TP53'
{
  "max_score": 448.4826,
  "took": 120,
  "total": 2973,
  "hits": [
    {
      "_id": "7157",
      "_score": 448.4826,
      "entrezgene": 7157,
      "name": "tumor protein p53",
      "symbol": "TP53",
      "taxid": 9606
    },
   ...

but this doesn't:

curl -XPOST -d 'q=TP53' -H "Content-Type: application/x-www-form-urlencoded" 'http://mygene.info/v3/query'
[
  {
    "query": "TP53",
    "notfound": true
  }
]

What would be the correct way of making these POST queries?
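A likely explanation, stated here as an assumption, is that POST (batch) queries match each term only against the fields named in `scopes`, and that the default scopes cover gene IDs rather than symbols, so a symbol such as TP53 comes back as notfound unless `scopes=symbol` is sent. The sketch below only builds the request; `build_batch_request` is a hypothetical helper and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_batch_request(terms, scopes, fields="symbol,name,taxid"):
    """Build a form-encoded POST request for the /v3/query batch endpoint."""
    body = urlencode({"q": ",".join(terms), "scopes": scopes, "fields": fields})
    return Request(
        "https://mygene.info/v3/query",
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_batch_request(["TP53", "CDK3"], scopes="symbol")
print(req.get_method(), req.data.decode())
# To send it: urllib.request.urlopen(req), then parse the JSON body.
```

For UniProt accessions, the same request with a scope such as `uniprot` in place of `symbol` would be the thing to try; that scope name is likewise an assumption based on field names used elsewhere in these issues.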

Gene query scores hits poorly

https://mygene.info/v3/query?q=A1BG returns

{
  "total": 4,
  "took": 3,
  "max_score": 27.817915,
  "hits": [
    {
      "_id": "503538",
      "_score": 27.817915,
      "entrezgene": 503538,
      "name": "A1BG antisense RNA 1",
      "symbol": "A1BG-AS1",
      "taxid": 9606
    },
    {
      "_id": "117586",
      "_score": 9.105442,
      "entrezgene": 117586,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1bg",
      "taxid": 10090
    },
    {
      "_id": "140656",
      "_score": 5.982859,
      "entrezgene": 140656,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1bg",
      "taxid": 10116
    },
    {
      "_id": "1",
      "_score": 5.10959,
      "entrezgene": 1,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1BG",
      "taxid": 9606
    }
  ]
}

The first result returned is actually the worst result. The last result is the best (full symbol match with no case differences). ~~So what determines the order of genes in hits. I expected they would be ordered by relevance/score. If the ordering doesn't reflect score, then shouldn't a score field be included for each hit?~~

The use case I have in mind is that a user will submit a query and we will display the resulting hits in order of best to worst, like Google does for searches.

What are _id and _score in returned queries?

This isn't well explained in the docs. It may make sense to add some information about what those fields mean. =)

Partial results matching

We are considering using mygene.info to serve as a search backend for genes in the cognoma project front end (more discussion: cognoma/core-service#29 (comment) ). One use case that we have is an autocomplete style query. For this, we'd need partial queries to be supported. Is it possible to enable this with the current API either through the standard querystring or a specific string?

There is a bit more discussion of an ngram field in https://github.com/SuLab/mygene.info/issues/2

Thanks!

Entrez and Ensembl mappings

Looking at 10168, it has ensembl mappings to both ENSG00000186448 and ENSG00000281709. However, looking at ensembl and entrez records about these genes:
https://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=10168
https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000186448
https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000281709

I only see the single 1-to-1 mappings between 10168 and ENSG00000186448. Is this right?

See: SuLab/scheduled-bots#19