Categories
Uncategorized

Query SRA Sequence Runs with Python

Retrieving data from the Sequence Read Archive (SRA) is a common task. NCBI provides a nice tool collection named E-utilities to query and retrieve data from it. The example Python snippet below shows how to query the NCBI SRA database with sample identifiers and get a table of the linked BioProject, BioSample, and Run accessions, plus the download location and size of each run.

import os
import sys
import shlex
import subprocess

import pandas as pd
.....


def get_SRR_from_biosamples(csv: str, batch_size=10, debug=True):                                                                                                                                  
    """Gete SRA run ID from BioSample ID.                                                                                                                                                          
    """                                                                                                                                                                                            
    epost_cmd = 'epost -db biosample -format acc'                                                                                                                                                  
    elink_cmd = 'elink -target sra'                                                                                                                                                                
    efetch_cmd = 'efetch -db sra -format runinfo -mode xml'                                                                                                                                        
    xtract_cmd = """xtract -pattern Row -def "NA" -element BioProject\n                                                                                                                            
     BioSample Run download_path size_MB"""                                                                                                                                                        
                                                                                                                                                                                                   
    sample_ids = []                                                                                                                                                                                
    results = []                                                                                                                                                                                   
                                                                                                                                                                                                   
    with open(csv, 'r') as fh:                                                                                                                                                                     
        total_samples = fh.readlines()                                                                                                                                                             
        print(f'Total samples: {len(total_samples)}')
        for idx, l in enumerate(total_samples):                                                                                                                                                    
            l = l.rstrip()                                                                                                                                                                         
            sample_ids.append(l)                                                                                                                                                                   
            batch_num = int(idx/batch_size) + 1                                                                                                                                                    
            run_flag = True                                                                                                                                                                        
            if debug:                                                                                                                                                                              
                if batch_num > 1:                                                                                                                                                                  
                    print('Debug mode. Stop execution after 1 batch.')                                                                                                                             
                    run_flag = False
                    break                                                                                                                                                                          
            if run_flag:                                                                                                                                                                           
                if (idx + 1) % batch_size == 0 or idx == len(total_samples) - 1:
                    print(f'Processing batch {batch_num}: {sample_ids}')                                                                                                                           
                    batch_results = []                                                                                                                                                             
                                                                                                                                                                                                   
                    id_str = ','.join(sample_ids)
                    # Build the epost command fresh for each batch so IDs do not accumulate
                    batch_cmd = f'{epost_cmd} -id "{id_str}"'
                    epost = subprocess.Popen(shlex.split(batch_cmd),
                                             stdout=subprocess.PIPE,
                                             encoding='utf8')
                    elink = subprocess.Popen(shlex.split(elink_cmd),                                                                                                                               
                                             stdin=epost.stdout,                                                                                                                                   
                                             stdout=subprocess.PIPE,                                                                                                                               
                                             encoding='utf8')                                                                                                                                      
                    efetch = subprocess.Popen(shlex.split(efetch_cmd),                                                                                                                             
                                              stdin=elink.stdout,                                                                                                                                  
                                              stdout=subprocess.PIPE,                                                                                                                              
                                              encoding='utf8')                                                                                                                                     
                    xtract = subprocess.Popen(shlex.split(xtract_cmd),                                                                                                                             
                                              stdin=efetch.stdout,                                                                                                                                 
                                              stdout=subprocess.PIPE,                                                                                                                              
                                              encoding='utf8')                                                                                                                                     
                                                                                                                                                                                                   
                    epost.wait()  # let epost finish posting the IDs before reading results
                                                                                                                                                                                                   
                    for l in xtract.stdout.readlines():                                                                                                                                            
                        if not l.startswith('PRJ'):  # "502 Bad Gateway" when server is busy                                                                                                       
                            sys.stderr.write(f'Error processing {sample_ids}: {l}')                                                                                                                
                        else:                                                                                                                                                                      
                            if debug:                                                                                                                                                              
                                print(l.rstrip())                                                                                                                                                  
                            batch_results.append(l.split())                                                                                                                                        
                    print(f'\nTotal SRA Runs in batch {batch_num}: {len(batch_results)}.\n')                                                                                                       
                    results.extend(batch_results)                                                                                                                                                  
                    sample_ids = []                                                                                                                                                                
    print(f'Total runs in collection: {len(results)} with {idx+1} samples.')                                                                                                                       
                                                                                                                                                                                                   
    data = pd.DataFrame(results, columns=['BioProject', 'BioSample', 'Run', 'Download', 'size_MB'])                                                                                                
    return data                                                                                      
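
A hypothetical usage example (the file name is an arbitrary placeholder; the file should contain one BioSample accession per line):

data = get_SRR_from_biosamples('biosamples.csv', batch_size=10, debug=False)
data.to_csv('sra_runinfo.csv', index=False)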

The following E-utilities need to be accessible from the environment: epost, elink, efetch, and xtract. Python's subprocess module is used to chain the steps together, much like a Unix pipeline. The samples are queried in batches to avoid hitting NCBI too frequently, which could get your future queries blocked. After retrieving the run identifiers, one can use the prefetch tool from the SRA Toolkit to download the run files, and of course prefetch can be wrapped and chained together as well, as sketched below.
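
Here is a minimal sketch of wrapping prefetch with subprocess, assuming prefetch from the SRA Toolkit is on the PATH and accepts the --output-directory option; the function name and output directory are hypothetical:

import shlex
import subprocess

def prefetch_runs(run_ids, out_dir='sra_downloads'):
    """Download each SRA run accession with prefetch (SRA Toolkit)."""
    for run_id in run_ids:
        cmd = f'prefetch --output-directory {out_dir} {run_id}'
        proc = subprocess.run(shlex.split(cmd), capture_output=True, encoding='utf8')
        if proc.returncode != 0:
            print(f'prefetch failed for {run_id}: {proc.stderr}')

# e.g. prefetch_runs(data['Run'].tolist())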

Categories
Uncategorized

PyTorch Geometric and CUDA

PyTorch Geometric (PyG) is a library built on PyTorch for developing graph neural networks in Python. It supports CUDA, but you have to make sure it is installed correctly: the CUDA builds of PyTorch and of the PyG companion packages must match. Below is one error message I got after installing PyG:

from torch_geometric.data import Data
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
...

OSError: /anaconda3/lib/python3.7/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs

It is clear this error is related to the CUDA version, so I checked it:

print(torch.version.cuda, torch.__version__)
10.2 1.9.0

Running $ nvidia-smi reported CUDA version 11.2, so my system had ended up with mixed CUDA versions. To fix the mess and get PyG working, I did the following:

$ pip uninstall torch
$ pip install torch===1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install torch-scatter -f https://data.pyg.org/whl/torch-1.9.1+cu111.html
$ pip install torch-sparse -f https://data.pyg.org/whl/torch-1.9.1+cu111.html
$ pip install torch-geometric
$ apt-get install nvidia-modprobe

Note that there is no wheel built against CUDA 11.2 (cu112), so I used the closest version (cu111). Now PyG works! The nvidia-modprobe utility, which loads the NVIDIA kernel module and creates the device files, fixes "RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero," which I got after having two Python sessions running and both trying to use CUDA.

Update from some other testing regarding these errors:

RuntimeError: Detected that PyTorch and torch_cluster were compiled with different CUDA versions. PyTorch has CUDA version 11.1 and torch_cluster has CUDA version 10.2. Please reinstall the torch_cluster that matches your PyTorch install.

RuntimeError: Detected that PyTorch and torch_spline_conv were compiled with different CUDA versions. PyTorch has CUDA version 11.1 and torch_spline_conv has CUDA version 10.2. Please reinstall the torch_spline_conv that matches your PyTorch install.

The following commands fixed it:

$ pip install --upgrade pip
$ CUDA=cu111
$ TORCH=1.9.1
$ pip install torch-cluster==1.5.9 -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html
$ pip install torch-spline-conv -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html
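
After reinstalling, a quick sanity check (a minimal sketch; the exact version strings depend on the install, and a CUDA-capable GPU is assumed) confirms that the CUDA builds agree and that the compiled extensions load:

import torch
from torch_geometric.data import Data  # importing PyG exercises torch_sparse/torch_scatter

print(torch.__version__, torch.version.cuda)   # expect something like 1.9.1+cu111 11.1
print(torch.cuda.is_available())               # True if the driver and GPU are set up

# A tiny two-node graph moved to the GPU
edge_index = torch.tensor([[0, 1], [1, 0]], dtype=torch.long)
x = torch.tensor([[1.0], [2.0]])
print(Data(x=x, edge_index=edge_index).to('cuda'))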

Categories
Linux

Update R and Bioconductor

It is not straightforward to update R and Bioconductor (a bioinformatics package collection for R), especially if you installed the R package that comes with your Linux distribution. For example, the R package in the Ubuntu repository is still at 3.5, while the latest version on the r-project site is already 4.0.3. Using an older version of R itself for statistics may not be a problem, but many Bioconductor packages have updates and bug fixes that require R v4. To make things worse, updating Bioconductor itself is a painful process. Below is what I did to update R, Bioconductor, and the associated packages. The main lesson learned: do not use the R packages from the Ubuntu repository.

  1. Remove R installed from the Ubuntu repository;
  2. Install R from r-project by adding it as an apt repository;
  3. Update R packages (system-wide);
  4. Update R packages installed in the user directory;
  5. Update BiocManager and the Bioconductor packages.

The shell commands for steps 1 and 2:

sudo apt-get purge r-base* r-recommended r-cran-*
sudo apt autoremove
sudo apt update
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo apt update
sudo apt install r-base r-base-core r-recommended r-base-dev

The R code for steps 3 to 5, run inside an R session:

update.packages(ask = FALSE, checkBuilt = TRUE)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.12")

The R code may need to be run twice: once with sudo (for the system-wide library) and once as a normal user, if you also have packages installed in a user library (that is, not part of the system-wide installation). The last command tells BiocManager (the Bioconductor package manager) to install Bioconductor version 3.12 and update all installed Bioconductor packages.

Categories
Uncategorized

Git with Python

There are times when you want to do source code management programmatically, and that is where GitPython comes in. GitPython wraps the git commands so that you can perform most git operations from within Python (without resorting to shutil or subprocess). The example code snippet below does a "git clone"; if the destination already exists, it falls back to a "git checkout" of the "master" branch, or of the "main" branch if "master" does not exist.

import sys

import git
from git.exc import GitCommandError

try:
    git.Repo.clone_from(remote_repo, local_repo)
except GitCommandError as error:
    if error.status == 128:  # git exit status 128: fatal error, e.g. destination path already exists
        repo = git.Repo(local_repo)
        branch = 'master'
        try:
            repo.git.checkout(branch)
        except GitCommandError:
            branch = 'main'
            repo.git.checkout(branch)
    else:
        msg = ' '.join(error.command)
        msg += error.stderr
        sys.exit(f'Error in running git: {msg}.')
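
For context, the snippet assumes the two variables below are defined beforehand (hypothetical example values):

remote_repo = 'https://github.com/gitpython-developers/GitPython.git'  # any remote repository URL
local_repo = '/tmp/local_clone'                                        # destination path for the clone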

Categories
Uncategorized

Signing Git Commit using GPG

I have been enjoying using Magit in Emacs for all the git-related work and ran into an error when tagging a release. The error message is:

git … tag --annotate --sign -m my_msg
error: gpg failed to sign the data
error: unable to sign the tag

This turned out to be caused by the fact that I had not set up a GPG key for signing. Below is how the problem was fixed, and from now on all my git commits are going to be signed.

$ gpg --gen-key

The command walks through a few prompts, e.g. asking for a name, an e-mail address, and a passphrase, and it is recommended to type or move the mouse randomly during key generation so that gpg has more entropy to draw on. In the end, you will see some output with a line like this:

gpg: key 404NOTMYREALKEYID marked as ultimately trusted

This string “404NOTMYREALKEYID” is the key id. The same key id also shows up in the output of the following command:

$ gpg --list-secret-keys --keyid-format LONG
.....
---------------------------
sec   rsa3072/404NOTMYREALKEYID ......

Finally, just register this key id with git, and the problem is solved. So the problem was not in Magit but in my configuration: Magit uses the "--sign" option when it calls git, which is actually good practice.

$ git config --global commit.gpgsign true
$ git config --global user.signingkey 404NOTMYREALKEYID

Categories
Uncategorized

Some Improvements of the Python Language

Recently I have been reading a lot of Python code and noticed several improvements to the language that are really useful and that I wish I had known about earlier, since some of them were added to the language specification as far back as Python 3.5.

One improvement is type hints (PEP 484, implemented in Python 3.5). They let developers annotate function parameters and return values with types, which helps a reader or another developer understand what types are intended. Note that the hints are not enforced: there is no runtime checking of the actual arguments being passed. The following example from the PEP 484 page shows the usage.

def greeting(name: str) -> str:
    return 'Hello ' + name
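
To illustrate that the hints are not enforced (a hypothetical call, not from the PEP):

greeting('World')  # returns 'Hello World'
greeting(42)       # the str hint is ignored at runtime; this fails only because
                   # 'Hello ' + 42 raises a TypeError inside the function body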

Another nice improvement is the new string formatting introduced in Python 3.6 as "formatted string literals" (f-strings). It lets one embed Python variable values (or even expressions and calls) directly in a string. Previously one had to use the ".format()" method to format a string. The example below shows how convenient the new syntax is.

def greeting(name: str) -> str:
    return f'Hello {name}'
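
For comparison, a minimal sketch of the same function written with the older ".format()" style:

def greeting(name: str) -> str:
    return 'Hello {}'.format(name)
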
Categories
emacs Linux

Configuration for Developing R code with ESS

It has been close to 10 years since I last used R seriously. Back then I used ESS in Emacs for all the editing and interactive sessions. I really liked it, and I wanted to see how ESS has evolved over the years. A quick glance at the ESS manual suggests it has gained a lot of improvements, and the Bioconductor project has evolved as well; for example, even the package used to install Bioconductor packages has changed.

Unfortunately, I had forgotten the tricks and configuration that I used to set up my ESS and Emacs environment. Below are a few things I had to go through again.

$ sudo apt install -y r-cran-devtools
$ sudo apt install libcurl4-gnutls-dev

The two packages above are required by quite a few other R packages. To make sure the user library path is picked up correctly, I had to set the following environment variable:

export R_LIBS="~/.local/R/lib"

The setting above works when starting R in a terminal. To make the local R library (in my home directory, so no root privileges are required) also work with the ESS R session, I had to create a ~/.Rprofile file with the following content:

.libPaths("~/.local/R/lib")

Now my Emacs speaks ESS well. I can install packages to my local library by default, edit R code in one frame, run an interactive R session in another, and use "C-c C-c" to execute a selected region from the R code frame.

Categories
emacs Linux

Setup NCBI E-utilities

The E-utilities toolbox from NCBI (Entrez Direct) is the recommended way to access NCBI and NLM data, including genome data. The following commands install it on Ubuntu:

$ sudo apt update
$ sudo apt install ncbi-entrez-direct

After a successful installation you can test it with the following example, which lists the RefSeq FTP paths of the first 10 Bacillus cereus assemblies. Unfortunately, this pipeline needs to run in a Bash shell; running it in eshell (within Emacs) does not work, although it can be driven from Python instead, as sketched after the output below.

$ esearch -db assembly -query "Bacillus cereus" | efetch -format docsum -mode json | grep "ftppath_refseq" | head -10
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/394/245/GCF_013394245.1_ASM1339424v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/303/115/GCF_013303115.1_ASM1330311v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/284/505/GCF_013284505.1_ASM1328450v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/284/455/GCF_013284455.1_ASM1328445v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/284/445/GCF_013284445.1_ASM1328444v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/284/425/GCF_013284425.1_ASM1328442v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/267/775/GCF_013267775.1_ASM1326777v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/267/475/GCF_013267475.1_ASM1326747v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/267/455/GCF_013267455.1_ASM1326745v1",
            "ftppath_refseq": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/267/275/GCF_013267275.1_ASM1326727v1",

Categories
emacs

View PDF in Emacs

By default, PDF files are opened with DocView in Emacs on Ubuntu. Yet there is a more convenient PDF tool, aptly named pdf-tools, that works well with Emacs. The installation of pdf-tools is straightforward; below is what I did.

sudo apt install libpng-dev zlib1g-dev
sudo apt install libpoppler-glib-dev
sudo apt install libpoppler-private-dev

After that, go to Emacs and install the package with package-install:

M-x package-refresh-contents
M-x package-install RET pdf-tools RET

Finally, add the following lines to the .emacs file. If you display line numbers in Emacs, make sure they are disabled in pdf-view-mode; what I do here is enable line number display only in prog-mode.

(pdf-tools-install)
(add-hook 'prog-mode-hook 'linum-on)

Categories
emacs

Switch Frame Quickly in Emacs

Switching between multiple frames in Emacs is a pain when you have more than two open. I just noticed that the ace-window package offers fast frame switching. All you need to do is put the following configuration in the .emacs file, press the "M-o" key combination, then pick the frame number shown in red, and you will switch to that frame.

(global-set-key (kbd "M-o") 'ace-window)