Query SRA Sequence Runs with Python

Retrieving data from SRA is a common task. NCBI has provided a nice tool collection named E-utilities to query and retrieve data from it. The example Python snippet below shows how to query NCBI SRA database using sample identifiers and get a table of linked NCBI BioProject, BioSample, Run, Download location and Size.

import sys, os                                                                                                                                                                                     
import subprocess                                                                                                                                                                                  
import shlex                                                                                                                                                                                       
import pandas as pd 

def get_SRR_from_biosamples(csv: str, batch_size=10, debug=True):                                                                                                                                  
    """Gete SRA run ID from BioSample ID.                                                                                                                                                          
    epost_cmd = 'epost -db biosample -format acc'                                                                                                                                                  
    elink_cmd = 'elink -target sra'                                                                                                                                                                
    efetch_cmd = 'efetch -db sra -format runinfo -mode xml'                                                                                                                                        
    xtract_cmd = """xtract -pattern Row -def "NA" -element BioProject\n                                                                                                                            
     BioSample Run download_path size_MB"""                                                                                                                                                        
    sample_ids = []                                                                                                                                                                                
    results = []                                                                                                                                                                                   
    with open(csv, 'r') as fh:                                                                                                                                                                     
        total_samples = fh.readlines()                                                                                                                                                             
        print(f'Total samples: {total_samples}')                                                                                                                                                   
        for idx, l in enumerate(total_samples):                                                                                                                                                    
            l = l.rstrip()                                                                                                                                                                         
            batch_num = int(idx/batch_size) + 1                                                                                                                                                    
            run_flag = True                                                                                                                                                                        
            if debug:                                                                                                                                                                              
                if batch_num > 1:                                                                                                                                                                  
                    print('Debug mode. Stop execution after 1 batch.')                                                                                                                             
                    run_flag = None                                                                                                                                                                
            if run_flag:                                                                                                                                                                           
                if  ((idx+1)%batch_size == 0) | (idx == len(total_samples) - 1):                                                                                                                   
                    print(f'Processing batch {batch_num}: {sample_ids}')                                                                                                                           
                    batch_results = []                                                                                                                                                             
                    sample_ids = ','.join(sample_ids)                                                                                                                                              
                    epost_cmd += f' -id "{sample_ids}"'                                                                                                                                            
                    epost = subprocess.Popen(shlex.split(epost_cmd),                                                                                                                               
                    elink = subprocess.Popen(shlex.split(elink_cmd),                                                                                                                               
                    efetch = subprocess.Popen(shlex.split(efetch_cmd),                                                                                                                             
                    xtract = subprocess.Popen(shlex.split(xtract_cmd),                                                                                                                             
                    while epost.returncode is None:                                                                                                                                                
                    for l in xtract.stdout.readlines():                                                                                                                                            
                        if not l.startswith('PRJ'):  # "502 Bad Gateway" when server is busy                                                                                                       
                            sys.stderr.write(f'Error processing {sample_ids}: {l}')                                                                                                                
                            if debug:                                                                                                                                                              
                    print(f'\nTotal SRA Runs in batch {batch_num}: {len(batch_results)}.\n')                                                                                                       
                    sample_ids = []                                                                                                                                                                
    print(f'Total runs in collection: {len(results)} with {idx+1} samples.')                                                                                                                       
    data = pd.DataFrame(results, columns=['BioProject', 'BioSample', 'Run', 'Download', 'size_MB'])                                                                                                
    return data                                                                                      

These E-utilities tools are used and need to be accessible from the environment: epost, elink, efetch, xtract. The subprocess module in Python is used to chain together these steps similar to Linux pipes. The samples are queried in batches to prevent too frequent queries to NCBI, which could lead to blocking of your future queries. After receiving the sample run identifiers, one can use the prefetch tool from E-utilities to download the files. And, of course, prefetch can be wrapped and chained together as well.


Some Improvements of the Python Language

Recently I have been reading a lot of Python code and noticed several improvements of the language that are really useful and wished that I had knew earlier since some of the changes were added to the Python language specification back from Python 3.5.

One improvement is the Type Hints (PEP 484, implemented in Python 3.5). It allows developers to add a type after a function parameter and a return type, which allows the code reader or another developer to understand what types are intended for the parameters. There is no checking of the actual parameter types being passed. The following example from the PEP 484 page explains its usage.

def greeting(name: str) -> str:
    return 'Hello ' + name

Another nice improvement is the new string formatting method introduced in Python 3.6 under “Formatted string literals.” It allows one to add Python variable values (or even calls) to a string object. Previously one has to use the “.format()” method to format a string. The example below explains how convenient it is now with the method.

def greeting(name: str) -> str:
    return f'Hello {name}'