Batch-downloading PDB structures from UniProt

Last updated on Dec 5, 2022

The GetPDB script is hosted on GitHub: https://github.com/Wang-Lin-boop/GetPDB

Installing the script and the Julia runtime

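# Fetch the script, register an alias to its absolute path, and make it executable
git clone https://github.com/Wang-Lin-boop/GetPDB
cd GetPDB
echo "alias GetPDB=${PWD}/GetPDB" >> ~/.bashrc
chmod +x GetPDB
cd ..
# Download and unpack Julia 1.5.3, then add its bin directory to PATH
wget https://julialang-s3.julialang.org/bin/linux/x64/1.5/julia-1.5.3-linux-x86_64.tar.gz
tar zxvf julia-1.5.3-linux-x86_64.tar.gz
cd julia-1.5.3/bin
echo "export PATH=${PWD}:\$PATH" >> ~/.bashrc
source ~/.bashrc
# Install BioStructures (non-interactive equivalent of typing "]add BioStructures" in the Julia REPL)
julia -e 'using Pkg; Pkg.add("BioStructures")'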

GetPDB usage

Usage: GetPDB [OPTION] <parameter>

Example: GetPDB -i Uniprot_list -w -o Uniprot-PDB -n 10 -p -r

Input parameters:
-i    Your UniProt list file.
-b    Your PDBlib, optional.
-n    The maximum number of CPU threads available for this job; default is 4.
-l    An index for UniProt, such as "pdb_chain_uniprot.csv".
      This file can be downloaded from https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html
      or you can use -w to download its latest version automatically.
-w    Use -w instead of -l unless you know what you're doing.

Output parameters:
-o    Processed PDB files are stored in this path; default is Uniprot-PDB.
-d    A directory for lists of UniProt-PDBID-ChainID info; default is Uniprot-info-list.
-p    Output one representative chain per UniProt PDB entry. For example, for PXXXXX:XXXX_A/B, only XXXX_A is output. Default is false.
-r    Keep only one representative structure per sequence interval. Default is false.
      For example, of P00000:XXXX_A:27-213 and P00000:ZZZZ_A:27-213, only one is saved.
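
To make the -l index and the idea behind -r more concrete, here is a rough sketch that filters a SIFTS-style CSV with awk. It assumes the column layout PDB,CHAIN,SP_PRIMARY,RES_BEG,RES_END,PDB_BEG,PDB_END,SP_BEG,SP_END with `#` comment lines; the function names `list_chains` and `one_per_interval` are hypothetical, and this is not GetPDB's own implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of GetPDB. Assumed column layout:
# PDB,CHAIN,SP_PRIMARY,RES_BEG,RES_END,PDB_BEG,PDB_END,SP_BEG,SP_END

# Print "UNIPROT:PDB_CHAIN:SP_BEG-SP_END" for every chain mapped to one UniProt ID.
list_chains() {
    local uniprot_id="$1" index_csv="$2"
    awk -F',' -v id="$uniprot_id" '
        /^#/ || $1 == "PDB" { next }   # skip comment and header lines
        $3 == id { printf "%s:%s_%s:%s-%s\n", $3, $1, $2, $8, $9 }
    ' "$index_csv"
}

# Read list_chains output on stdin and keep only the first chain seen
# for each (UniProt ID, sequence interval) pair, mimicking the idea of -r.
one_per_interval() {
    awk -F':' '!seen[$1 FS $3]++'
}
```

For example, `list_chains P00000 pdb_chain_uniprot.csv | one_per_interval` would print one line per distinct interval of P00000.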

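Before fetching PDB structures, convert the gene names to protein IDs

Batch-search the gene set on the UniProt search page, download the results as a table, and extract the protein IDs from it.

Store the protein IDs corresponding to the genes in Uniprot_List_entry. Neither the file name nor the protein IDs need any prefix or suffix.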
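
The extraction step can be sketched as follows. This is a minimal example that assumes the downloaded table is tab-separated and that its header names the accession column `Entry`; both depend on the options chosen at download time. The function name `extract_uniprot_ids` and the file name in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique accessions from a UniProt table export.
# Assumes a tab-separated file whose header row names the accession column
# "Entry"; pass a different column name as the second argument if needed.
extract_uniprot_ids() {
    local table="$1" column="${2:-Entry}"
    awk -F'\t' -v col="$column" '
        NR == 1 {                              # locate the column by its header
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        $c != "" && !seen[$c]++ { print $c }   # drop blanks and duplicates
    ' "$table"
}
```

Usage would look like `extract_uniprot_ids uniprot_results.tsv > Uniprot_List_entry`, which writes one ID per line, the form the later commands read.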

Batch-downloading the predicted structures stored in the AlphaFold database for the UniProt entries

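# The v1 suffix is kept from the original; AlphaFold DB has since published newer model versions, so adjust it if the download fails.
while read -r i; do mkdir -p "${i}"; wget -q -O "./${i}/${i}.pdb" "https://alphafold.ebi.ac.uk/files/AF-${i}-F1-model_v1.pdb"; done < Uniprot_List_entry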
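
Because wget runs quietly here and `-O` creates the output file before the transfer starts, a failed request can leave a zero-byte `.pdb` file behind. A small check along these lines may help find such entries; the function name `list_missing_models` is hypothetical, and it assumes the `<ID>/<ID>.pdb` layout produced by the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list IDs whose ./<ID>/<ID>.pdb is missing or empty.
# Run it from the directory in which the download loop was executed.
list_missing_models() {
    local id_list="$1" id
    # "|| [ -n "$id" ]" also handles a last line without a trailing newline
    while read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue                       # skip blank lines
        [ -s "./${id}/${id}.pdb" ] || printf '%s\n' "$id"
    done < "$id_list"
}
```

Calling `list_missing_models Uniprot_List_entry` prints the IDs to retry, for instance with a different model version suffix.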

Batch-downloading the PDB structures or crystal CIF structures that come from the PDB database for the UniProt entries

GetPDB -i Uniprot_List_entry -w -o PDBbyUS -n 10

Chi

Doctor of Bioengineering