SDK_Document_BBrowserX
Installation
[ ]:
!python3 -m pip install -U bioturing_connector
1. Connect to host server
Run this step before any further analysis.
The user token is generated on the host website.
[39]:
import numpy as np
import pandas as pd
from bioturing_connector.typing import Species
from bioturing_connector.typing import ChunkSize
from bioturing_connector.typing import StudyType
from bioturing_connector.typing import StudyUnit
from bioturing_connector.typing import InputMatrixType
from bioturing_connector.bbrowserx_connector import BBrowserXConnector
connector = BBrowserXConnector(
host="https://talk2data.bioturing.com/t2d_index_tool/",
token="98592aac0b284c899ebf5dd0ff2eff90",
ssl=True
)
[20]:
connector.test_connection()
Connecting to host at https://talk2data.bioturing.com/t2d_index_tool/api/v1/test_connection
Connection successful
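The cell above writes the token directly into the notebook. As a minimal sketch of an alternative, the helper below reads the host and token from environment variables. The variable names `BBROWSERX_HOST` and `BBROWSERX_TOKEN` and the helper `load_connection_settings` are hypothetical and are not part of bioturing_connector.

```python
import os


def load_connection_settings(env=None):
    """Collect BBrowserXConnector keyword arguments from environment variables.

    BBROWSERX_HOST and BBROWSERX_TOKEN are hypothetical names chosen for this
    sketch; bioturing_connector does not define them.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("BBROWSERX_HOST", "BBROWSERX_TOKEN") if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    host = env["BBROWSERX_HOST"]
    return {
        "host": host,
        "token": env["BBROWSERX_TOKEN"],
        # Mirror the ssl flag used above: enable it for https hosts.
        "ssl": host.startswith("https://"),
    }
```

With both variables set, the connector could then be created as `BBrowserXConnector(**load_connection_settings())`.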
2. List groups, studies and s3
2.1. Get info of available groups
[2]:
user_groups = connector.get_user_groups()
user_groups
[2]:
[{'group_id': 'all_members', 'group_name': 'All members'},
{'group_id': 'bioturing_public_studies',
'group_name': 'BioTuring Public Studies'},
{'group_id': 'personal', 'group_name': 'Personal workspace'}]
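Later steps need a `group_id` rather than the display name. The sketch below looks up the ID by name in a list shaped like the output above; `find_group_id` is a hypothetical helper, not a connector method.

```python
def find_group_id(user_groups, group_name):
    """Return the group_id whose group_name matches, ignoring case."""
    wanted = group_name.strip().lower()
    for group in user_groups:
        if group["group_name"].strip().lower() == wanted:
            return group["group_id"]
    raise ValueError(f"No group named {group_name!r}")


# Sample data copied from the output above.
sample_groups = [
    {"group_id": "all_members", "group_name": "All members"},
    {"group_id": "bioturing_public_studies", "group_name": "BioTuring Public Studies"},
    {"group_id": "personal", "group_name": "Personal workspace"},
]
print(find_group_id(sample_groups, "personal workspace"))  # personal
```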
2.2. List all available studies in a group
[3]:
# Using group_id from step 2.1
study_list = connector.get_all_studies_info_in_group(
group_id='personal',
species=Species.HUMAN.value,
)
study_list
[3]:
[{'uuid': '80d76fc8136c4dfe807e3aa2beefca76',
'study_title': 'TBD',
'study_hash_id': 'COSMX_HUMAN_CORTEX',
'created_by': 'sonvo@bioturing.com'},
{'uuid': 'a1558f8ed6064095be86a091a4118c4a',
'study_title': 'TBD',
'study_hash_id': 'GSE128223',
'created_by': 'sonvo@bioturing.com'}]
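Sections 4 to 6 expect the study `uuid`, while studies are usually remembered by their `study_hash_id`. A small sketch, assuming the list has the shape shown above, that builds a lookup table from one to the other (`index_studies_by_hash_id` is a hypothetical helper):

```python
def index_studies_by_hash_id(study_list):
    """Map each study_hash_id to its uuid."""
    return {study["study_hash_id"]: study["uuid"] for study in study_list}


# Sample data copied from the output above (only the fields used here).
sample_studies = [
    {"uuid": "80d76fc8136c4dfe807e3aa2beefca76", "study_hash_id": "COSMX_HUMAN_CORTEX"},
    {"uuid": "a1558f8ed6064095be86a091a4118c4a", "study_hash_id": "GSE128223"},
]
uuid_by_hash_id = index_studies_by_hash_id(sample_studies)
print(uuid_by_hash_id["GSE128223"])  # a1558f8ed6064095be86a091a4118c4a
```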
2.3. List all S3 buckets of the current user
[ ]:
connector.get_user_s3()
[{'id': '505e49d2abee405f8a7b4ce2628d5270',
'bucket': 'bioturingdebug',
'prefix': ''},
{'id': 'd938706094354d7eb4726d6c9b07de9c',
'bucket': 'talk2data',
'prefix': ''}]
3. Submit study
NOTE: Get group_id from step “2.1. Get info of available groups”
3.1. Option 1: Submit study from s3
Parameters:
----
group_id: str
ID of the group to submit the data to.
s3_id: str
ID of the S3 bucket. Default: None
If s3_id is not provided, the first S3 bucket configured on the platform is used.
batch_info: List[dict]
File path and batch name information. The path DOES NOT include the bucket path configured on the platform.
Example:
For H5AD format:
[{
'matrix': 's3_path/GSE128223_1.h5ad'
}, {...}]
For RDS format:
[{
'matrix': 's3_path/GSE128223_1.rds'
}, {...}]
For MTX_10X format:
[{
'matrix': 's3_path/data_1/matrix.mtx',
'features': 's3_path/data_1/features.tsv',
'barcodes': 's3_path/data_1/barcodes.tsv',
}, {...}]
For TILE_DB format:
[{
'folder': 's3_path/GSE128223_1'
}, {...}]
study_id: str
Identifier of the study (e.g. GSE128223).
If no value is provided, the default ID is a random UUIDv4 string.
name: str
Name of the study.
authors: List[str]
Authors of the study.
abstract: str
Abstract of the study.
species: str
Species of the study.
Support:
Species.HUMAN.value
Species.MOUSE.value
Species.NON_HUMAN_PRIMATE.value
Species.OTHERS.value
skip_dimred: bool
Skip the BioTuring pipeline if set to True (only applicable when the input is a scanpy/seurat object).
input_matrix_type: str
Is the input matrix already normalized or not?
Support:
InputMatrixType.NORMALIZED.value (will skip BioTuring normalization, h5ad: use adata.X)
InputMatrixType.RAW.value (apply BioTuring normalization, h5ad: use adata.raw.X)
study_type: int
Format of dataset
Support:
StudyType.BBROWSER.value
StudyType.H5_10X.value
StudyType.H5AD.value
StudyType.MTX_10X.value
StudyType.BCS.value
StudyType.RDS.value
StudyType.TSV.value
StudyType.TILE_DB.value
min_counts: int
Minimum number of counts required for a cell to pass filtering.
min_genes: int
Minimum number of genes expressed required for a cell to pass filtering.
max_counts: int
Maximum number of counts allowed for a cell to pass filtering.
max_genes: int
Maximum number of expressed genes allowed for a cell to pass filtering.
mt_percentage: int
Maximum percentage of mitochondrial gene counts allowed for a cell to pass filtering.
Ranging from 0 to 100
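To make the five filtering parameters concrete, here is a rough sketch of how such thresholds could combine for a single cell. It is an illustration only: `cell_passes_filter` is a hypothetical function, the bounds are assumed to be inclusive, and `None` is assumed to disable a threshold. The platform's exact rule is not stated in this document.

```python
def cell_passes_filter(total_counts, n_genes, mt_percent,
                       min_counts=None, min_genes=None,
                       max_counts=None, max_genes=None,
                       mt_percentage=None):
    """Return True if a cell satisfies every threshold that is not None.

    Assumption: all bounds are inclusive.
    """
    checks = [
        min_counts is None or total_counts >= min_counts,
        min_genes is None or n_genes >= min_genes,
        max_counts is None or total_counts <= max_counts,
        max_genes is None or n_genes <= max_genes,
        mt_percentage is None or mt_percent <= mt_percentage,
    ]
    return all(checks)


# A cell with 1500 counts, 800 expressed genes and 4.2% mitochondrial counts.
print(cell_passes_filter(1500, 800, 4.2, min_counts=500, min_genes=200, mt_percentage=10))  # True
```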
3.1.1. 10X Matrix format
[37]:
## The path DOES NOT include the bucket path configured on platform
## Support multiple batches per submission
batch_info = [{
'matrix': 'GSE128223/raw/matrix.mtx',
'features': 'GSE128223/raw/features.tsv',
'barcodes': 'GSE128223/raw/barcodes.tsv',
}]  # add one dict per additional batch
connector.submit_study_from_s3(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.MTX_10X.value
)
[2023-09-26 06:08] Waiting in queue
[2023-09-26 06:08] Downloading GSE128223/raw/barcodes.tsv from s3: 262.1 KB / 539.5 KB
[2023-09-26 06:08] Downloading GSE128223/raw/features.tsv from s3: 262.1 KB / 322.8 KB
[2023-09-26 06:08] Downloading GSE128223/raw/matrix.mtx from s3: 262.1 KB / 927.0 MB
[2023-09-26 06:09] File downloaded
[2023-09-26 06:09] Reading batch: raw
[2023-09-26 06:09] Preprocessing expression matrix: 20923 cells x 35756 genes
[2023-09-26 06:09] Filtered: 20923 cells remain
[2023-09-26 06:09] Start processing study
[2023-09-26 06:09] Normalizing expression matrix
[2023-09-26 06:09] Running PCA
[2023-09-26 06:09] Running kNN
[2023-09-26 06:09] Running venice binarizer
[2023-09-26 06:09] Running t-SNE
[2023-09-26 06:09] Study was successfully submitted
[2023-09-26 06:09] DONE!!!
Study submitted successfully!
[37]:
True
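When several batches follow the same folder layout, the `batch_info` list can be generated instead of typed by hand. A minimal sketch, assuming every prefix is relative to the bucket path configured on the platform and contains `matrix.mtx`, `features.tsv` and `barcodes.tsv`; `build_mtx_batch_info` is a hypothetical helper.

```python
import posixpath


def build_mtx_batch_info(prefixes):
    """Build one MTX_10X batch entry per S3 prefix.

    S3 keys always use forward slashes, so posixpath is used instead of os.path.
    """
    return [
        {
            "matrix": posixpath.join(prefix, "matrix.mtx"),
            "features": posixpath.join(prefix, "features.tsv"),
            "barcodes": posixpath.join(prefix, "barcodes.tsv"),
        }
        for prefix in prefixes
    ]


batch_info_example = build_mtx_batch_info(["GSE128223/raw"])
print(batch_info_example[0]["matrix"])  # GSE128223/raw/matrix.mtx
```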
3.1.2. Scanpy object
[ ]:
## The path DOES NOT include the bucket path configured on platform
## Support multiple batches per submission
batch_info = [{
'matrix': 's3_path/GSE128223_1.h5ad',
}]  # add one dict per additional batch
connector.submit_study_from_s3(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.H5AD.value
)
3.1.3. Seurat object
[ ]:
## The path DOES NOT include the bucket path configured on platform
## Support multiple batches per submission
batch_info = [{
'matrix': 's3_path/GSE128223_1.rds',
}]  # add one dict per additional batch
connector.submit_study_from_s3(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.RDS.value
)
3.1.4. Tile DB format
[ ]:
## The path DOES NOT include the bucket path configured on platform
## Support multiple batches per submission
batch_info = [{
'folder': 's3_path/GSE128223_1',
}]  # add one dict per additional batch
connector.submit_study_from_s3(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.TILE_DB.value
)
3.1.5. Full matrix dataframe
[ ]:
## The path DOES NOT include the bucket path configured on platform
## Support multiple batches per submission
batch_info = [{
'matrix': 's3_path/GSE128223_1.tsv',
}]  # add one dict per additional batch
connector.submit_study_from_s3(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.TSV.value
)
3.2. Option 2: Submit study from local machine
Parameters:
------
group_id: str
ID of the group to submit the data to.
batch_info: List[dict]
File path and batch name information.
Example:
For H5AD format:
[{
'matrix': 'local_path/GSE128223_1.h5ad'
}, {...}]
For RDS format:
[{
'matrix': 'local_path/GSE128223_1.rds'
}, {...}]
For MTX_10X format:
[{
'name': 'data_1',
'matrix': 'local_path/data_1/matrix.mtx',
'features': 'local_path/data_1/features.tsv',
'barcodes': 'local_path/data_1/barcodes.tsv',
}, {...}]
study_id: str
If no value is provided, default id will be a random uuidv4 string
name: str
Name of the study.
authors: List[str]
Authors of the study.
abstract: str
Abstract of the study.
species: str
Species of the study.
Support:
Species.HUMAN.value
Species.MOUSE.value
Species.NON_HUMAN_PRIMATE.value
Species.OTHERS.value
skip_dimred: bool
Skip the BioTuring pipeline if set to True (only applicable when the input is a scanpy/seurat object).
input_matrix_type: str
Is the input matrix already normalized or not?
Support:
InputMatrixType.NORMALIZED.value (will skip BioTuring normalization, h5ad: use adata.X)
InputMatrixType.RAW.value (apply BioTuring normalization, h5ad: use adata.raw.X)
study_type: int
Format of dataset
Support:
StudyType.BBROWSER.value
StudyType.H5_10X.value
StudyType.H5AD.value
StudyType.MTX_10X.value
StudyType.BCS.value
StudyType.RDS.value
StudyType.TSV.value
min_counts: int
Minimum number of counts required for a cell to pass filtering.
min_genes: int
Minimum number of genes expressed required for a cell to pass filtering.
max_counts: int
Maximum number of counts allowed for a cell to pass filtering.
max_genes: int
Maximum number of expressed genes allowed for a cell to pass filtering.
mt_percentage: int
Maximum percentage of mitochondrial gene counts allowed for a cell to pass filtering.
Ranging from 0 to 100
chunk_size: int
Size of each chunk used for uploading. Default: ChunkSize.CHUNK_100_MB.value
Support:
ChunkSize.CHUNK_5_MB.value
ChunkSize.CHUNK_100_MB.value
ChunkSize.CHUNK_500_MB.value
ChunkSize.CHUNK_1_GB.value
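`chunk_size` controls how a local file is split before upload. As a rough sketch of that arithmetic, the helper below computes how many chunks a file of a given size would need. `count_chunks` is a hypothetical helper, and the byte value used here for a 100 MB chunk (100 * 1024 * 1024) is an assumption; the numeric values behind the ChunkSize members are not documented here.

```python
import math


def count_chunks(file_size_bytes, chunk_size_bytes):
    """Return how many chunks are needed to cover file_size_bytes."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk_size_bytes must be positive")
    return math.ceil(file_size_bytes / chunk_size_bytes)


# Assumed chunk size: 100 MiB. File size: 927.0 MB, as in the matrix.mtx example.
assumed_chunk_bytes = 100 * 1024 * 1024
print(count_chunks(927_000_000, assumed_chunk_bytes))  # 9
```

A smaller chunk size means more requests but less data to resend if one of them fails; a larger one does the opposite.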
3.2.1. 10X Matrix format
[38]:
## Support multiple batches per submission
batch_info = [{
'name': 'GSE128223',
'matrix': '/data/dev/example_dataset/GSE128223/raw/matrix.mtx',
'features': '/data/dev/example_dataset/GSE128223/raw/features.tsv',
'barcodes': '/data/dev/example_dataset/GSE128223/raw/barcodes.tsv',
}]  # add one dict per additional batch
connector.submit_study_from_local(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.MTX_10X.value
)
GSE128223matrix.mtx - chunk_0: 100MMB [00:08, 12.2MMB/s]
GSE128223matrix.mtx - chunk_1: 100MMB [00:09, 11.5MMB/s]
GSE128223matrix.mtx - chunk_2: 100MMB [00:08, 12.4MMB/s]
GSE128223matrix.mtx - chunk_3: 100MMB [00:10, 10.3MMB/s]
GSE128223matrix.mtx - chunk_4: 100MMB [00:10, 10.1MMB/s]
GSE128223matrix.mtx - chunk_5: 100MMB [00:11, 9.27MMB/s]
GSE128223matrix.mtx - chunk_6: 100MMB [00:11, 8.90MMB/s]
GSE128223matrix.mtx - chunk_7: 100MMB [00:07, 13.7MMB/s]
GSE128223matrix.mtx - chunk_8: 84% | 84.0M/100M [00:02<00:00, 38.6MMB/s]
GSE128223features.tsv - chunk_0: 0% | 316k/100M [00:00<00:11, 8.95MMB/s]
GSE128223barcodes.csv - chunk_0: 1% | 527k/100M [00:00<00:05, 17.9MMB/s]
[2023-09-26 06:15] Waiting in queue
[2023-09-26 06:15] Reading batch: GSE128223
[2023-09-26 06:15] Preprocessing expression matrix: 20923 cells x 35756 genes
[2023-09-26 06:15] Filtered: 20923 cells remain
[2023-09-26 06:15] Start processing study
[2023-09-26 06:15] Normalizing expression matrix
[2023-09-26 06:15] Running PCA
[2023-09-26 06:15] Running kNN
[2023-09-26 06:15] Running venice binarizer
[2023-09-26 06:15] Running t-SNE
[2023-09-26 06:15] Study was successfully submitted
[2023-09-26 06:15] DONE!!!
Study submitted successfully!
[38]:
True
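A misspelled local path is only discovered once the upload starts. The sketch below checks every path in `batch_info` before submitting; `find_missing_files` is a hypothetical helper, and it assumes that every key except `name` holds a file or folder path.

```python
import os


def find_missing_files(batch_info):
    """Return every path in batch_info that does not exist on disk.

    The 'name' key is a batch label, not a path, so it is skipped.
    """
    missing = []
    for batch in batch_info:
        for key, path in batch.items():
            if key != "name" and not os.path.exists(path):
                missing.append(path)
    return missing
```

For example, `find_missing_files(batch_info)` returning an empty list means every referenced file was found.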
3.2.2. Scanpy object
[ ]:
## Support multiple batches per submission
batch_info = [{
'matrix': 'local_path/GSE128223_1.h5ad',
}]  # add one dict per additional batch
connector.submit_study_from_local(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.H5AD.value
)
3.2.3. Seurat object
[ ]:
## Support multiple batches per submission
batch_info = [{
'matrix': 'local_path/GSE128223_1.rds',
}]  # add one dict per additional batch
connector.submit_study_from_local(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.RDS.value
)
3.2.4. Full matrix dataframe
[ ]:
## Support multiple batches per submission
batch_info = [{
'matrix': 'local_path/GSE128223_1.tsv',
}]  # add one dict per additional batch
connector.submit_study_from_local(
group_id='personal',
batch_info=batch_info,
study_id='GSE128223',
name='This is my first study',
authors=['Huy Nguyen', 'Thao Truong'],
species=Species.HUMAN.value,
input_matrix_type=InputMatrixType.RAW.value,
study_type=StudyType.TSV.value
)
4. Submit metadata
NOTE: Get group_id and study_id (uuid) from step “2. List groups, studies and s3”
4.1. Submit a dataframe directly
This is an example metadata table. The barcodes must be the DataFrame index (here, the first column of the TSV file).
[8]:
meta_df = pd.read_csv('GSE128223_metadata.tsv', sep='\t', index_col=0)
meta_df
[8]:
Barcodes | Cell type |
---|---|
donor1_d1_AAACCTGGTAGAGGAA | TCRV delta 1 gamma-delta T cell |
donor1_d1_AAACGGGCAGACACTT | TCRV delta 1 gamma-delta T cell |
donor1_d1_AAAGCAAAGAGTAATC | TCRV delta 1 gamma-delta T cell |
donor1_d1_AAAGCAATCATGCATG | TCRV delta 1 gamma-delta T cell |
donor1_d1_AAAGCAATCCTCAACC | TCRV delta 1 gamma-delta T cell |
... | ... |
pbmc_8k_TTTGTCATCATGTCCC | naive CD8 T cell |
pbmc_8k_TTTGTCATCCGATATG | naive CD8 T cell |
pbmc_8k_TTTGTCATCGTCTGAA | monocyte |
pbmc_8k_TTTGTCATCTCGAGTA | CD8 T cell |
pbmc_8k_TTTGTCATCTGCTTGC | naive CD8 T cell |
19121 rows × 1 columns
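If the annotations live in a Python dict rather than a file, a TSV with the layout shown above can be produced with the standard library. This is a sketch under the assumption that a single `Cell type` column is wanted; `metadata_to_tsv` is a hypothetical helper.

```python
import csv
import io


def metadata_to_tsv(cell_types):
    """Render a barcode -> cell type mapping as TSV text, Barcodes column first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Barcodes", "Cell type"])
    for barcode, cell_type in cell_types.items():
        writer.writerow([barcode, cell_type])
    return buffer.getvalue()


tsv_text = metadata_to_tsv({
    "donor1_d1_AAACCTGGTAGAGGAA": "TCRV delta 1 gamma-delta T cell",
    "pbmc_8k_TTTGTCATCGTCTGAA": "monocyte",
})
print(tsv_text)
```

Writing `tsv_text` to a file gives something that `pd.read_csv(path, sep='\t', index_col=0)` reads with the barcodes as the index, as in the cell above.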
[12]:
connector.submit_metadata_from_dataframe(
species=Species.HUMAN.value,
study_id='a1558f8ed6064095be86a091a4118c4a',
group_id='personal',
df=meta_df
)
[12]:
'Successful'
4.2. Submit file from local / server
[14]:
connector.submit_metadata_from_local(
species=Species.HUMAN.value,
study_id='a1558f8ed6064095be86a091a4118c4a',
group_id='personal',
file_path='./GSE128223_metadata.tsv'
)
[14]:
'Successful'
4.3. Submit file from s3
[ ]:
connector.submit_metadata_from_s3(
species=Species.HUMAN.value,
study_id='a1558f8ed6064095be86a091a4118c4a',
group_id='personal',
file_path='test_bucket/GSE128223_meta.tsv' #This path DOES NOT include the bucket path configured on platform e.g. s3://bioturing_bucket
)
5. Access study data
NOTE: Get study_id (uuid) from step “2.2. List all available studies in a group”
5.1. Get barcodes
[18]:
barcodes = np.array(connector.get_barcodes(
study_id='a1558f8ed6064095be86a091a4118c4a',
species=Species.HUMAN.value,
))
print(barcodes)
['donor1_d1_AAACCTGGTAGAGGAA' 'donor1_d1_AAACGGGCAGACACTT'
'donor1_d1_AAAGCAAAGAGTAATC' ... 'pbmc_8k_TTTGTCATCGTCTGAA'
'pbmc_8k_TTTGTCATCTCGAGTA' 'pbmc_8k_TTTGTCATCTGCTTGC']
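The barcodes returned here are also useful for checking a metadata table before it is submitted in step 4. A minimal sketch, with `compare_barcodes` as a hypothetical helper, that reports barcodes present on only one side:

```python
def compare_barcodes(study_barcodes, metadata_barcodes):
    """Report barcodes that appear in the study or the metadata, but not both."""
    study = set(study_barcodes)
    meta = set(metadata_barcodes)
    return {
        "missing_from_metadata": sorted(study - meta),
        "unknown_to_study": sorted(meta - study),
    }


report = compare_barcodes(
    ["donor1_d1_AAACCTGGTAGAGGAA", "donor1_d1_AAACGGGCAGACACTT"],
    ["donor1_d1_AAACCTGGTAGAGGAA", "typo_barcode"],
)
print(report)
```

In the notebook, the two arguments would be `barcodes` from the cell above and `meta_df.index` from step 4.1.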
5.2. Get features
[19]:
features = np.array(connector.get_features(
study_id='a1558f8ed6064095be86a091a4118c4a',
species=Species.HUMAN.value,
))
print(features)
['5S_RRNA' '5_8S_RRNA' '7SK' ... 'THRA1/BTR' 'UTAT33' 'ZSCAN5CP']
5.3. Get metadata dataframe
[22]:
metadata = connector.get_metadata(
study_id='a1558f8ed6064095be86a091a4118c4a',
species=Species.HUMAN.value
)
metadata.iloc[:5, :5]
[22]:
| | Barcodes | Cell type | Cell type (1) | Cell type (2) | Cmv status |
---|---|---|---|---|---|
0 | donor1_d1_AAACCTGGTAGAGGAA | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | CMV+ |
1 | donor1_d1_AAACGGGCAGACACTT | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | CMV+ |
2 | donor1_d1_AAAGCAAAGAGTAATC | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | CMV+ |
3 | donor1_d1_AAAGCAATCATGCATG | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | CMV+ |
4 | donor1_d1_AAAGCAATCCTCAACC | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | TCRV delta 1 gamma-delta T cell | CMV+ |
5.4. Get embeddings
5.4.1. List all embeddings
[24]:
embeddings = connector.list_all_custom_embeddings(
study_id='a1558f8ed6064095be86a091a4118c4a',
species=Species.HUMAN.value,
)
embeddings
[24]:
[{'embedding_id': 'bee0c214d7d44dc1882313cc803aece3',
'embedding_name': '_x_pca'},
{'embedding_id': '0c856f67796b4f4b86dbedb812974ff1',
'embedding_name': '_x_tsne'},
{'embedding_id': '5ab6ae13ce344381a81aa7d6afb26616',
'embedding_name': 'PCA (no batch corrected)'},
{'embedding_id': '21f767838c1c4d5095249dcdab9388eb',
'embedding_name': 'tSNE (perplexity=30)'}]
5.4.2. Access an embedding
[25]:
chosen_embedding = connector.retrieve_custom_embedding(
study_id='a1558f8ed6064095be86a091a4118c4a',
species=Species.HUMAN.value,
embedding_id='bee0c214d7d44dc1882313cc803aece3',
)
chosen_embedding
[25]:
array([[-5.3032417 , 7.8890934 , 3.359574 , ..., 0.21355404,
-0.64777076, -1.6085205 ],
[-2.9219244 , 0.11274821, 2.3836405 , ..., 0.06213907,
-0.1660905 , 0.24691239],
[-5.4160094 , 12.229488 , 7.7536416 , ..., -0.5595666 ,
1.1389648 , 0.28183457],
...,
[17.052692 , 8.085365 , -6.64449 , ..., 0.6446202 ,
-0.95552135, -1.0086697 ],
[-2.2584836 , -3.0889986 , 2.9076786 , ..., 1.5332366 ,
-0.38599294, -0.29490623],
[-2.2893648 , -7.0735717 , 1.3277851 , ..., -0.13736992,
-1.7899635 , 0.07911549]], dtype=float32)
5.5. Query genes
Parameters:
----
group_id: str
ID of the group that contains the study.
study_id: str
ID of the study (uuid).
gene_names: List[str], default=[]
If the list is empty, the whole matrix is returned.
unit: str
Support:
StudyUnit.UNIT_LOGNORM.value
StudyUnit.UNIT_RAW.value
[26]:
gene_exp = connector.query_genes(
study_id='a1558f8ed6064095be86a091a4118c4a',
species=Species.HUMAN.value,
gene_names=['CD3D', 'CD8A'],
unit=StudyUnit.UNIT_RAW.value,
)
gene_exp
[26]:
<19121x2 sparse matrix of type '<class 'numpy.float32'>'
with 17584 stored elements in Compressed Sparse Column format>
6. Standardize your metadata
NOTE: Get group_id and study_id (uuid) from step “2. List groups, studies and s3”
6.1. Retrieve ontology tree
Returns
----------
Ontologies tree : Dict[Dict]
In which:
'name': name of the node, which will be used in further steps
[ ]:
connector.get_ontologies_tree(
species=Species.HUMAN.value,
group_id='bioturing_public_studies'
)
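Only the `'name'` key of each node is documented above, so the sketch below avoids assuming anything else about the tree layout: it walks any nesting of dicts and lists and collects every string stored under `'name'`. `collect_node_names` is a hypothetical helper, and the `children` key in the sample tree is made up for illustration.

```python
def collect_node_names(tree):
    """Collect every 'name' value found anywhere in a nested dict/list structure."""
    names = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            name = node.get("name")
            if isinstance(name, str):
                names.add(name)
            # Descend into every value, whatever the child key is called.
            stack.extend(node.values())
        elif isinstance(node, (list, tuple)):
            stack.extend(node)
    return sorted(names)


# Hypothetical tree; the real structure returned by the platform may differ.
sample_tree = {
    "tissue": {"name": "tissue", "children": {"lung": {"name": "lung"}}},
    "cell type": {"name": "cell type", "children": [{"name": "gamma-delta T cell"}]},
}
print(collect_node_names(sample_tree))
# ['cell type', 'gamma-delta T cell', 'lung', 'tissue']
```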
6.2. Assign standardized terms
Parameters
-----
species: str
Species of the study.
Support:
Species.HUMAN.value
Species.MOUSE.value
Species.NON_HUMAN_PRIMATE.value
Species.OTHERS.value
group_id: str
ID of the group to submit the data to.
study_id: str
ID of the study (uuid)
metadata_field: str
Column name in the study's metadata table on the platform (e.g. author's tissue).
metadata_value: str
Metadata value within the metadata field (e.g. normal lung).
root_name: str
Name of the root in the BioTuring ontology tree (e.g. tissue).
leaf_name: str
Name of the leaf in the BioTuring ontology tree (e.g. lung).
[ ]:
# This function is only usable in a group (not 'personal')
connector.assign_standardized_meta(
species=Species.HUMAN.value,
group_id='bioturing_public_studies',
study_id='a1558f8ed6064095be86a091a4118c4a',
metadata_field='Cell type',
metadata_value='TCRV delta 1 gamma-delta T cell',
root_name='cell type',
leaf_name='gamma-delta T cell',
)
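A metadata field usually holds several values, each needing its own call. The sketch below prepares one set of keyword arguments per value from a plain mapping; `build_assignments` is a hypothetical helper, and the keyword names are the ones used in the call above.

```python
def build_assignments(species, group_id, study_id, metadata_field, root_name, value_to_leaf):
    """Build one keyword-argument dict per metadata value to standardize."""
    return [
        {
            "species": species,
            "group_id": group_id,
            "study_id": study_id,
            "metadata_field": metadata_field,
            "metadata_value": value,
            "root_name": root_name,
            "leaf_name": leaf,
        }
        for value, leaf in value_to_leaf.items()
    ]


assignments = build_assignments(
    species="<species value>",  # placeholder; pass Species.HUMAN.value in the notebook
    group_id="bioturing_public_studies",
    study_id="a1558f8ed6064095be86a091a4118c4a",
    metadata_field="Cell type",
    root_name="cell type",
    value_to_leaf={"TCRV delta 1 gamma-delta T cell": "gamma-delta T cell"},
)
print(assignments[0]["leaf_name"])  # gamma-delta T cell
```

Each entry can then be passed on with `connector.assign_standardized_meta(**kwargs)` inside a loop over `assignments`.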