Comments (12)
Hi,
The attached script is a very simple Python script that you need to modify slightly so that it matches your output file names. I created a modified version that may work.
The script works with the _genes.out files (compressed or not) and expects files named like this:
sample1_genes.out.gz
sample2_genes.out.gz
sample3_genes.out.gz
It will generate a matrix for the ExonTPM values with these columns:
Gene_Chr_Start Chr Start End ExonLength sample1 sample2 sample3
and another similar for the exon reads:
Gene_Chr_Start Chr Start End ExonLength sample1 sample2 sample3
Please try it and let me know.
from tpmcalculator.
Hi,
Are your chromosome names or gene names purely numeric?
Change line 30 to:
data[column]['Gene_Chr_Start'] = data[column]['Gene_Id'].map(str) + '_' + data[column]["Chr"].map(str) + '_' + data[column]["Start"].map(str)
Let me know if this works.
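A minimal sketch of why the .map(str) change helps, assuming the Chr column ends up holding a mix of strings and integers (for example "X" next to 2). The toy data frame below is made up for illustration and is not real TPMCalculator output:

```python
import pandas as pd

# Hypothetical toy data: Chr mixes a string label with an integer,
# which is what breaks plain string concatenation.
df = pd.DataFrame({
    "Gene_Id": ["geneA", "geneB"],
    "Chr": ["X", 2],
    "Start": [100, 200],
})

# Converting every part to str first makes the concatenation safe
# regardless of how pandas parsed each column.
df["Gene_Chr_Start"] = (
    df["Gene_Id"].map(str) + "_" + df["Chr"].map(str) + "_" + df["Start"].map(str)
)

keys = df["Gene_Chr_Start"].tolist()
print(keys)
```

The same pattern applies unchanged to data[column] in the script.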
from tpmcalculator.
Dear TPMcalculator team,
It's quite urgent, so any hints are much appreciated!
Best, Karen
from tpmcalculator.
Hi,
You can use this simple Python script to create the matrix file.
Run it in the folder that contains all the TPMCalculator results.
It will process the _sorted_genes.out files. If you want to process any other files, just change the suffix in the script.
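As a rough sketch of the suffix matching described above (not the attached script itself), the helper below collects result files by suffix and accepts a .gz variant. The function name find_result_files and its default suffix are assumptions made for this example:

```python
import os


def find_result_files(root, suffix="_sorted_genes.out"):
    """Return sorted paths under root whose names end with suffix or suffix + '.gz'."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix) or name.endswith(suffix + ".gz"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    files = find_result_files(".")
    if not files:
        # An empty list means the suffix does not match any file name,
        # so there is nothing to merge into a matrix.
        print("No matching files found; check the suffix.")
    else:
        print("\n".join(files))
```

Checking that the list is not empty before merging gives a clearer message than a later failure on a missing column.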
from tpmcalculator.
Did you solve the problem with the script I sent?
from tpmcalculator.
Hi,
I've tested the tpmcalculator2matrixes.py.gz script without any success.
Alex
from tpmcalculator.
What are the errors?
What are your input files?
from tpmcalculator.
The input files were BAM files generated by subread-align. TPMCalculator generates results_genes.uni, .out and .ent files for each sample, but no merged file.
When running the tpmcalculator2matrixes.py script, here is what I got:
(tpmcalculator) alexandre@alexandre-Precision-Tower-5810:~/Documents/Spombe_data_tmp$ python tpmcalculator2matrixes.py
ExonTPM
Data columns: 0
Data rows: 0
Traceback (most recent call last):
File "/home/alexandre/miniconda3/envs/tpmcalculator/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Gene_Id'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "tpmcalculator2matrixes.py", line 29, in <module>
data[column]['Gene_Chr_Start'] = data[column]['Gene_Id'] + '_' + data[column]["Chr"] + '_' + data[column]["Start"].map(str)
File "/home/alexandre/miniconda3/envs/tpmcalculator/lib/python3.8/site-packages/pandas/core/frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/alexandre/miniconda3/envs/tpmcalculator/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 'Gene_Id'
from tpmcalculator.
Thanks a lot! It works just fine.
Alex
from tpmcalculator.
It uses the file name as the sample name.
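A small sketch of how a sample name can be derived from a file name by stripping the result suffix. The helper name sample_name is hypothetical, and the script itself may do this differently (the version posted in this thread uses str.replace):

```python
import os


def sample_name(path, suffix="_genes.out"):
    """Strip the directory and the (optionally gzipped) suffix from a result file."""
    name = os.path.basename(path)
    # Try the longer, compressed ending first so ".gz" is not left behind.
    for ending in (suffix + ".gz", suffix):
        if name.endswith(ending):
            return name[: -len(ending)]
    return name


print(sample_name("sample1_genes.out.gz"))
print(sample_name("./results/Sample1.sorted_genes.out"))
```

Whatever remains after stripping becomes the column header in the output matrix, so renaming the input files is the simplest way to control the sample labels.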
from tpmcalculator.
Hi Roberto and Alex,
I have _genes.out files in a folder like below:
Sample1.sorted_genes.out
Sample2.sorted_genes.out
Sample3.sorted_genes.out
And I have this python code:
import os
import pandas

data = {}
columns = ['ExonTPM', 'ExonReads']
output_suffix = "_genes.out"
files = [f for ds, df, files in os.walk('./') for f in files if output_suffix in f]
for column in columns:
    print(column)
    data[column] = pandas.DataFrame()
    for f in files:
        # Get sample name removing the suffix and check if the output is compressed
        if f.endswith('.gz'):
            output_suffix_real = output_suffix + '.gz'
        else:
            output_suffix_real = output_suffix
        s = f.replace(output_suffix_real, '')
        df = pandas.read_csv(f, sep='\t')
        df = df[['Gene_Id', 'Chr', 'Start', 'End', 'ExonLength', column]]
        df = df.rename(index=str, columns={column: s})
        if data[column].empty:
            data[column] = df
        else:
            data[column] = data[column].merge(df, on=['Gene_Id', 'Chr', 'Start', 'End', 'ExonLength'], how='outer')
    print('Data columns: ' + str(len(data[column].columns)))
    print('Data rows: ' + str(len(data[column])))
    # Printing TSV matrices
    data[column]['Gene_Chr_Start'] = data[column]['Gene_Id'] + '_' + data[column]["Chr"] + '_' + data[column]["Start"].map(str)
    data[column] = data[column].drop(['Gene_Id'], axis=1)
    cols = data[column].columns.tolist()
    cols = cols[-1:] + cols[:-1]
    data[column] = data[column][cols]
    data[column].to_csv(column + '.tsv', sep='\t', index=False, na_rep='0')
I ran it with: python tpmcalculator2matrixes.py
This gave the error below:
ExonTPM
sys:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
Data columns: 8
Data rows: 49773
Traceback (most recent call last):
File "tpmcalculator2matrixes.py", line 30, in <module>
data[column]['Gene_Chr_Start'] = data[column]['Gene_Id'] + '_' + data[column]["Chr"] + '_' + data[column]["Start"].map(str)
File "/soft/apps/Python/2.7.11-goolf-1.7.20/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 639, in wrapper
arr = na_op(lvalues, rvalues)
File "/soft/apps/Python/2.7.11-goolf-1.7.20/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 586, in na_op
result[mask] = op(x[mask], _values_from_object(y[mask]))
TypeError: cannot concatenate 'str' and 'int' objects
May I know what could be the issue?
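One possible way to address the DtypeWarning at read time, sketched under the assumption that column 1 in the warning is Chr and that it mixes numeric and non-numeric chromosome names. The inline TSV below is invented sample data, not real output:

```python
import io

import pandas as pd

# Hypothetical three-column excerpt standing in for a _genes.out file.
tsv = "Gene_Id\tChr\tStart\ngeneA\t1\t100\ngeneB\tX\t200\n"

# Forcing Gene_Id and Chr to str means every value is parsed as text,
# so later string concatenation does not hit integer values.
df = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"Gene_Id": str, "Chr": str})

key = df["Gene_Id"] + "_" + df["Chr"] + "_" + df["Start"].map(str)
print(key.tolist())
```

In the script, the same dtype argument would go into the existing pandas.read_csv(f, sep='\t') call.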
from tpmcalculator.
@r78v10a07 Thanks a lot Roberto. It worked.
from tpmcalculator.