Hello, I am randomly generating reads from a BED and fasta file to simulate read c

Too many files open when generating reads about pybedtools HOT 4 CLOSED

daler commented on July 28, 2024

Too many files open when generating reads

from pybedtools.

Comments (4)

daler commented on July 28, 2024

It's possible this has been fixed in the development version. Try installing pybedtools from the github repository (see http://packages.python.org/pybedtools/main.html#installation-for-developers-or-for-running-tests) to see if that helps.

If it still doesn't work, then could you paste the relevant part of the script here so I can debug?

from pybedtools.

olgabot commented on July 28, 2024

Hi Ryan,
The development version from github is the one I'm using. Below is the relevant code. FYI cl.args is the command line arguments. I think the main problem is the multiple calls to bed-creation and sequence-fetching.
Thanks,
Olga

import random as r
import pybedtools
import itertools
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
        all_transcripts = pybedtools.BedTool(cl.args['bed'])
        random_id = r.randint(100, 999)
        fastq1 = open(str(cl.args['n_reads'])+"reads_"+str(random_id)+"-1.fastq", "wb")
        fastq2 = open(str(cl.args['n_reads'])+"reads_"+str(random_id)+"-2.fastq", "wb")
        for i in range(cl.args['n_reads']):
            fragment_size = 0 # fragment size
            txpt = r.choice(all_transcripts) # random transcript
            fragment_size = int(r.gauss(cl.args['fragment_size'], 
                            cl.args['fragment_size_var']))
            if fragment_size < cl.args['min']:
                fragment_size = cl.args['min']
            elif fragment_size > cl.args['max']:
                fragment_size = cl.args['max']

            if fragment_size > txpt.stop-txpt.start:
                fragment_size = txpt.stop-txpt.start
            # print 'fragment size:', fragment_size
            read_start = r.randint(txpt.start, txpt.stop-fragment_size+1)
            read_name = ":" + ":".join([txpt.name, 
                                    str(read_start), 
                                    str(fragment_size),
                                    str(r.random())])
            if txpt.strand == '+':
                interval1 = " ".join([txpt.chrom, 
                                    str(read_start),
                                    str(read_start+cl.args['read1_length']),
                                    txpt.strand])
                interval2 = " ".join([txpt.chrom, 
                                    str(read_start+fragment_size-cl.args['read2_length']), 
                                    str(read_start+fragment_size),
                                    txpt.strand])
            else:
                interval1 = " ".join([txpt.chrom, 
                                    str(read_start+fragment_size-cl.args['read1_length']),
                                    str(read_start+fragment_size),
                                    txpt.strand])
                interval2 = " ".join([txpt.chrom, 
                                    str(read_start),
                                    str(read_start+cl.args['read2_length']),
                                    txpt.strand])
            # print 'making BED'
            bed1 = pybedtools.BedTool(interval1, from_string=True)
            bed2 = pybedtools.BedTool(interval2, from_string=True)

            # print 'Getting sequence'
            seq1 = bed1.sequence(cl.args['fasta']).print_sequence().split()
            seq2 = bed2.sequence(cl.args['fasta']).print_sequence().split()

            # print "making SeqRecord"
            rec1 = SeqRecord(Seq(seq1[1]), 
                id="".join([seq1[0][1:], read_name, "/1"]))
            rec2 = SeqRecord(Seq(seq2[1]), 
                id="".join([seq2[0][1:], read_name, "/2"]))

            # print 'Assigning PHRED quality scores'
            rec1.letter_annotations["phred_quality"] = \
                list(itertools.repeat(40, len(seq1[1])))
            rec2.letter_annotations["phred_quality"] = \
                list(itertools.repeat(40, len(seq2[1])))

            # print 'writing sequences'
            SeqIO.write(rec1, fastq1, "fastq")
            SeqIO.write(rec2, fastq2, "fastq")

            if i > 0 and i % 1000 == 0:
                print 'finished generating', i, 'paired-end reads'

        fastq1.close()
        fastq2.close()

from pybedtools.

daler commented on July 28, 2024

Yep, by making 2 BedTools for every iteration, you'll quickly hit the open file limit of your OS. There's a more efficient strategy that I should make more prominent in the docs: you can create a BedTool out of an iterable of Interval objects. Here's a skeleton of an example that uses a generator of random Intervals:

import pybedtools


def make_intervals(n):
    for i in range(n):
        # Do stuff to get chrom/start/stop/strand for each end
        #
        # Note that start and stop should be ints and that name and score
        # fields must be provided if you want to use strand

        # Read 1
        yield pybedtools.create_interval_from_list(
                [chrom, start1, stop1, 'read_%s/1' % i, '.', strand1])

       # Read 2
        yield pybedtools.create_interval_from_list(
                [chrom, start2, stop2, 'read_%s/2' % i, '.', strand2])

# Create BedTool out of the generator.
bed = pybedtools.BedTool(make_intervals(1000))

# Use name=True to use feature names created in generator
bed_with_seq = bed.sequence(fi='genome.fa', fo='random_reads.fa', name=True)

# Do stuff to create fastq1 and fastq2 filehandles

# Parse the FASTA file, edit SeqRecords, and write out to file
records = SeqIO.parse('random_reads.fa', format='fastq')
for i, record in enumerate(records):
    # Do stuff to records -- editing names, adding qualscores, etc, based on
    # whether it's read 1 or read 2
    if i % 2 == 0:  # even, so first end
        # do stuff
        SeqIO.write(record, fastq1, 'fastq')
    else:  # odd, so second end
        # do stuff
        SeqIO.write(record, fastq2, 'fastq')

fastq1.close()
fastq2.close()

from pybedtools.

olgabot commented on July 28, 2024

Thank you for the speedy and thorough response!

from pybedtools.

Too many files open when generating reads about pybedtools HOT 4 CLOSED

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent