Biopy - a toolkit for multi-domain translation¶
s277910 BIANCO MORGHET FRANCESCO
s269781 FANÌ EROS
s276047 GOLETTO GABRIELE
s276807 TRIVIGNO GABRIELE
This doc is meant to be a user guide on the use of the proposed ‘biopy’ package. Therefore it will focus more on the code structure and purposes, taking for granted that the reader knows about the tasks and the setting of the problems tackled.
For a more in-depth explanation and the motivations of the choices refer to the presentation.
Installation¶
There are several ways to fetch the package and reproduce the reported results. Below we describe each way to install everything properly. We also provide code and snippets to download and preprocess the datasets from their original sources, so that each dataset can be easily and reliably retrieved.
Warning
Given that our project was developed using deep learning techniques, it requires a specific set of libraries to run. We provide the list of dependencies, but beware that installing them can download some big packages (e.g. torch bundled with the CUDA SDK).
Warning
The dependencies listed require a python version >= 3.7
Getting started¶
There are 3 ways to get access to our code:
Install via PyPI repository¶
We deployed our package to the PyPI repository. However, to avoid pinning a fixed list of dependencies in the package itself, it is deployed standalone and the requirements are provided in a separate file in our repository, so they have to be installed separately.
Note that the warning above applies here: running pip install -r dependencies.txt will download some heavy packages.
$ pip install biopy
$ git clone https://github.com/BioPyTeam/biopy
$ cd biopy
$ pip install -r dependencies.txt
And that’s it!
Install via setup file¶
You can also simply clone the repo and install the package with the provided script. It will install biopy in the current python environment.
> git clone https://github.com/BioPyTeam/biopy
> cd biopy
> install.bat
$ git clone https://github.com/BioPyTeam/biopy
$ cd biopy
$ ./install
Download the source code only¶
If you do not want to install any package and just want to download the code, you can do so. In this case, to execute any of the provided snippets you need to have src/biopy in your sys.path, or to execute commands from a sibling directory of src/biopy.
To download the source code, you can run this command:
$ git clone https://github.com/BioPyTeam/biopy
Note
You still need to get the third-party packages. See the provided dependencies.txt file
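When working from the sources only, one way to make them importable is to extend sys.path at the top of your script. The sketch below assumes that the repository was cloned into ./biopy relative to the working directory; the helper name add_to_sys_path is ours and is not part of the package.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Prepend `directory` to sys.path (only once) and return its absolute path."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumption: the repository was cloned into ./biopy,
# so the sources live in ./biopy/src/biopy.
source_dir = add_to_sys_path(Path("biopy") / "src" / "biopy")
```

Calling the helper twice leaves a single entry in sys.path, so it is safe to keep it at the top of several scripts.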
Downloading the datasets¶
As also stated elsewhere, the presented methods have been applied on three distinct datasets, two of which have been already presented and preprocessed in [YBV+21].
A549 Dataset¶
The A549 dataset is a paired multi-omics (ATAC-seq and RNA-seq) single-cell dataset comprising A549 cells of tumoral lung tissue explanted from a 58-year-old Caucasian male. Its notable characteristic is that it is paired: for each cell sequenced in one omic, the sequencing of the same cell in the other omic is also available.
ATAC-seq and RNA-seq data have also been preprocessed in works other than [YBV+21]. To download the preprocessed data from all the different original sources, we suggest running this script:
$ python3 biopy/utils/download_dataset_nature.py --dataset_dir="dataset_a549"
CD4+ Dataset¶
- This dataset contains two very different omics:
Preprocessed RNA single-cell sequencing of naive CD4+ T cells, which have been clustered into two groups: quiescent and poised cells
Grayscale 64x64 chromatin images of poised and quiescent single cells
This is the main dataset presented in [YBV+21] and it has been published by the authors on Dropbox.
$ wget --content-disposition https://www.dropbox.com/sh/hjt57go4dyahgq7/AAAhAE8bHNn5Sq-D0jGkO_gAa?dl=1
$ unzip MultiDomainTranslationNatureComm2020.zip
GDC Dataset¶
We also applied the proposed methods to a preprocessed dataset retrieved from the NCI’s Genomic Data Commons (GDC).
The multi-omics dataset contains three omics (mRNA, miRNA and methylation) obtained from multicell sequencing of breast tissue.
If you want to recreate the dataset from the GDC portal you can simply run from bash the following command which will download the data leveraging the GDC API:
$ ./download_dataset.sh
If you want to run the provided script step by step, for each action (download_omic, …) there are additional options for customizing directories’ locations and other relevant parameters. See details with python3 biopy/utils/download_dataset_gdc.py {action} -h, or run python3 biopy/utils/download_dataset_gdc.py -h to get the list of available actions.
Warning
Even though the final preprocessed and split dataset weighs only a few gigabytes, the overall data that needs to be downloaded is around 100GB, and at least 350GB after decompression. Furthermore, network connections may get terminated during file downloads, so the provided bash script may error out. However, it can be safely run again after every failure until all files have been downloaded. In some cases, additional instructions may be presented to the user on screen.
Repository structure¶
In this section the structure of the repository is explained. The tree-like representation is reported below:
repository_root
|
├── configs
│ ├── A549
│ ├── CD4
│ └── GDC
|
├── dependencies.txt
├── install
├── install.bat
├── LICENSE
├── pyproject.toml
├── README.md
|
├── scripts
│ ├── download_dataset.sh
│ └── run.py
|
├── setup.cfg
|
└── src
    └── biopy
        ├── __init__.py
        ├── datasets
        ├── experiments
        ├── metrics
        ├── models
        ├── statistic
        ├── training
        └── utils
Our implementation is released as a Python package. The source code is in the folder src/biopy. The ‘scripts’ folder instead contains examples and code that use the tools inside the package; their purpose is to reproduce the results that we reported.
The package is made up of sub-packages each containing the tools to tackle different aspects of the proposed implementation.
- ‘datasets’
Everything concerning the handling of the different datasets that were employed in the various tasks. Due to the highly variable formats of GDC datasets with respect to the Nature datasets, there are different classes for each of those, although they do inherit from a custom class that was written to provide the general functionalities of preprocessing and train-val-test splitting. All the classes also inherit from the PyTorch Dataset class in order to fully exploit the framework.
- ‘experiments’
This sub-package is related to the fine-tuning of the different architectures and contains the code to perform a grid search on a given model and set of hyper-parameters.
- ‘metrics’
In tasks such as multi-domain translation it is often hard to evaluate how well models are actually performing. This sub-package provides a number of metrics for that purpose. Some of them focus on the latent space structure (KNN accuracy, Fraction Closer), others on the domain translation (reconstruction error, classifier on the translations).
- ‘models’
Contains the implementations of all the architectures used. The models reported as Baseline are those implemented to reproduce the results of the paper taken as starting point. The others are custom networks, divided between convolutional and non-convolutional, based on whether or not they were designed for image data.
- ‘statistic’
This package is used when training Adversarial AEs, which need a distribution to match their latent spaces against. Its classes provide a way to either specify a distribution as a combination of multi-dimensional Gaussian or Laplace distributions, or provide a model and have a sampler generated from the dataset encoded by that model. This package isolates the sampling logic, making the code easier to read and maintain.
- ‘training’
This package implements a multi-method training framework that makes it easy to switch between training methods, datasets and models simply by changing a couple of parameters. The wrapper class, biopy.training.ThanosTrainer, provides the interface for the external user to specify the overall parameters; internally, specific classes each implement a different training method. The training classes all inherit from biopy.training.Trainer, which provides base mechanisms such as model checkpoints and access to metrics.
- ‘utils’
Contains utility functions and classes of various kinds, such as access to APIs for downloading datasets, data formatting, extraction of useful information from models, and plotting of results.
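To give an idea of what specifying a distribution as a combination of multi-dimensional Gaussians can look like (see ‘statistic’ above), here is a minimal sketch in plain Python. The class name GaussianMixtureSampler, its arguments, and the choice of one diagonal Gaussian per label are assumptions made for illustration; the classes in the package may be organized differently.

```python
import random


class GaussianMixtureSampler:
    """Draws latent vectors from one diagonal Gaussian per label (hypothetical helper)."""

    def __init__(self, means, stds, seed=None):
        # means/stds: dicts mapping each label to a list with one value per latent dimension
        if means.keys() != stds.keys():
            raise ValueError("means and stds must be defined for the same labels")
        self.means = means
        self.stds = stds
        self.rng = random.Random(seed)

    def sample(self, label, n_samples):
        """Return n_samples latent vectors drawn from the component of `label`."""
        mean, std = self.means[label], self.stds[label]
        return [
            [self.rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(n_samples)
        ]


# Two labels, two latent dimensions: the components are centred far apart.
sampler = GaussianMixtureSampler(
    means={0: [-5.0, 0.0], 1: [5.0, 0.0]},
    stds={0: [1.0, 1.0], 1: [1.0, 1.0]},
    seed=0,
)
batch = sampler.sample(label=1, n_samples=4)
```

Keeping the sampler in its own object means that a training loop only has to ask for samples of a given label, without knowing how the target distribution was defined.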
Datasets¶
This section explains how to use the dataset classes. The DatasetMultiOmics class can be used to read the synthetic dataset, while DatasetMultiOmicsGDC and DatasetMultiOmicsGDCTrainTest can be used to read the GDC dataset downloaded with the aforementioned tool. These classes expect files to be in tsv format with .txt extension and samples along the columns.
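Since samples are stored along the columns, a reader has to transpose the table to obtain one vector per sample. The sketch below shows this with the standard csv module; it assumes that the first row holds the sample identifiers and the first column the feature names, which is our guess about the layout, and the function name is hypothetical.

```python
import csv
import io


def read_samples_along_columns(handle):
    """Parse a tab-separated table with one sample per column.

    Returns (feature_names, samples), where samples maps each sample id to
    its list of feature values, i.e. the table in sample-major order.
    """
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    sample_ids = header[1:]  # assumed: the first header cell labels the feature column
    feature_names = [row[0] for row in body]
    samples = {
        sample_id: [float(row[col]) for row in body]
        for col, sample_id in enumerate(sample_ids, start=1)
    }
    return feature_names, samples


# Tiny in-memory example: 2 features (rows) x 3 samples (columns).
example = io.StringIO("gene\ts1\ts2\ts3\nA\t1.0\t2.0\t3.0\nB\t4.0\t5.0\t6.0\n")
features, samples = read_samples_along_columns(example)
```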
Basic synthetic dataset usage¶
Toy dataset made up of synthetic data. In the basic usage scenario, you load all omics and then use one at a time, for instance:
from datasets import DatasetMultiOmics

# Load all the omics at once, using the 'clusters' labels
dataset = DatasetMultiOmics(folder='dataset_5000samples', omics=('mRNA', 'meth',), labels='clusters')
dataset_train, dataset_test = dataset.train_val_test_split(test=.25)

# Standardize the test split with the statistics of the training split
mean, std = dataset_train.standardize(all_omics=True)
dataset_test.standardize(all_omics=True, mean=mean, std=std)

# train() and evaluate() are placeholders for your own training and evaluation code
dataset_train.set_omic('mRNA')
model1 = train(dataset_train)
evaluate(dataset_test)

dataset_train.set_omic('meth')
model2 = train(dataset_train)
evaluate(dataset_test)
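The snippet above standardizes the test split with the mean and standard deviation computed on the training split, so that no information from the test data leaks into preprocessing. The idea can be illustrated on a single feature in plain Python; the two function names below are hypothetical and this is not the package's implementation.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) of a list of values; std falls back to 1.0 for constant data."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)


def apply_standardizer(values, mean, std):
    """Standardize values with statistics computed elsewhere (e.g. on the train split)."""
    return [(value - mean) / std for value in values]


train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 11.0]

mean, std = fit_standardizer(train_values)                # statistics from train only
train_scaled = apply_standardizer(train_values, mean, std)
test_scaled = apply_standardizer(test_values, mean, std)  # train statistics are reused
```

After this step the training values have zero mean and unit standard deviation, while the test values are only expressed on the same scale.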
Basic GDC dataset usage¶
To use the downloaded and processed GDC dataset, the DatasetMultiOmicsGDC and DatasetMultiOmicsGDCUnbalanced classes can be used. These two classes differ from DatasetMultiOmics because the label can be chosen programmatically, since the downloaded label file is richer and contains many columns. For example, one experiment may use the sample_type column as label, while another may derive the label from project_id; multiple columns can also be combined.
- Furthermore:
DatasetMultiOmicsGDC has to be used when all samples have all the omics and all samples share the same label file, as it happens for the synthetic dataset;
DatasetMultiOmicsGDCUnbalanced has to be used when instead not all samples have all omics (for each omic there is a separate label file).
For example, to use the DatasetMultiOmicsGDCUnbalanced class with the downloaded GDC dataset, which is “unbalanced” in the sense above:
from datasets import DatasetMultiOmicsGDCUnbalanced
# The label here is sample type (2 possible values) + project id (2 possible values), so 4 possible values
dataset = DatasetMultiOmicsGDCUnbalanced(folder='dataset', omics=('mRNA', 'miRNA',), labels_columns=('sample_type', 'project_id'))
dataset.standardize(all_omics=True)
dataset.set_omic('mRNA')
model1 = train(dataset)
dataset.set_omic('miRNA')
model2 = train(dataset)
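As the comment in the snippet notes, combining two columns with 2 possible values each gives 4 possible labels. A minimal sketch of how several label columns can be merged into a single class index is shown below; the function name and the column values are placeholders chosen for illustration, not the ones found in the GDC label files.

```python
def combine_label_columns(rows, columns):
    """Map each row to an integer class built from the tuple of the chosen columns.

    rows: list of dicts (one per sample); columns: names of the columns to combine.
    Returns (labels, classes), where classes[i] is the tuple encoded by label i.
    """
    keys = [tuple(row[column] for column in columns) for row in rows]
    classes = sorted(set(keys))
    index = {key: i for i, key in enumerate(classes)}
    return [index[key] for key in keys], classes


# Placeholder values, used only for illustration.
label_rows = [
    {"sample_type": "tumor", "project_id": "P1"},
    {"sample_type": "normal", "project_id": "P1"},
    {"sample_type": "tumor", "project_id": "P2"},
    {"sample_type": "normal", "project_id": "P2"},
    {"sample_type": "tumor", "project_id": "P1"},
]
labels, classes = combine_label_columns(label_rows, ("sample_type", "project_id"))
```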
One shortcoming of the DatasetMultiOmicsGDCUnbalanced class is that the train_val_test_split method is not available. However the download script provides a command to split the downloaded dataset in such a way that the test set is made of samples that have all the three omics. To load it:
from datasets import DatasetMultiOmicsGDC, DatasetMultiOmicsGDCUnbalanced
dataset_train = DatasetMultiOmicsGDCUnbalanced(folder='dataset', omics=('train_mRNA', 'train_miRNA',))
dataset_test = DatasetMultiOmicsGDC(folder='dataset', omics=('test_mRNA', 'test_miRNA',))
mean, std = dataset_train.standardize(all_omics=True)
dataset_test.standardize(all_omics=True, mean=mean, std=std)
The code above is equivalent to the simplified code below which uses the class DatasetMultiOmicsGDCTrainTest:
from datasets import DatasetMultiOmicsGDCTrainTest
dataset = DatasetMultiOmicsGDCTrainTest(folder='dataset', omics=('mRNA', 'miRNA',))
dataset.standardize(all_omics=True)
dataset_train, dataset_test = dataset.train_val_test_split()
Models¶
Here is a summary of the networks that were used across the different tasks.
Baseline¶
biopy.models.Baseline.py contains the models as described by [YBV+21]. The method presented in the cited paper involves a Variational AE with an additional MLP classifier used to discriminate between labels. The file therefore contains several implementations of the MLP head; e.g. the Adversarial_Classifier class implements the MLP with a Gradient Reversal Layer, and there is also another version specific to the A549 dataset.
The VAE to which the MLP head is attached is FC_VAE; its convolutional counterpart for the image data in the CD4 dataset is NucleiImgVAE.
Expression Autoencoders¶
biopy.models.ExprAutoEncoders.py includes the custom models implemented to work with expression data (mRNA, miRNA, meth…) in all the datasets; the base layer for all of them is the FC layer. We implemented 4 different kinds of AEs:
Plain AE
Variational AE
Adversarial AE
Supervised Adversarial AE
For each kind, the Encoder and Decoder are treated as interchangeable modules. In fact, the decoder is the same for all 4 kinds: the Decoder class. As for the encoders, the Encoder class is made up of FC layers and is suited only for the plain AE, while the VEncoder class is used by the other kinds of AE and returns 2 tensors, representing the mean and log variance of the encoded data points. The AAE class implements the Adversarial AE; it is made up of a VEncoder, a Decoder, and an MLP head with a gradient reversal layer that acts as discriminator, so that a distribution can be imposed on the latent space.
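An encoder that returns a mean and a log variance is usually paired with the reparameterization trick, z = mean + exp(0.5 * log_var) * eps with eps drawn from a standard normal. The sketch below illustrates it on plain Python lists instead of tensors; the function name is ours, and whether every model in the package samples exactly this way is an assumption.

```python
import math
import random


def reparameterize(mean, log_var, rng=random):
    """Sample z = mean + exp(0.5 * log_var) * eps with eps ~ N(0, 1), per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


mean = [0.5, -1.0, 2.0]
log_var = [0.0, -2.0, -50.0]  # the last dimension has a nearly zero variance

z = reparameterize(mean, log_var, rng=random.Random(0))
```

Working with the log variance keeps the standard deviation positive without constraining the encoder output, which is why the encoder predicts it instead of the variance itself.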
The supervised version of the AAE is implemented by SupervisedAAE. It concatenates the one-hot encoding of the labels to the latent space before forwarding it to the discriminator; the purpose is to be able to impose a distribution conditioned on the provided label.
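The conditioning step described for SupervisedAAE amounts to appending a one-hot vector to each latent code before it reaches the discriminator. A minimal sketch on plain lists follows; the function names are hypothetical and the real model operates on batched tensors.

```python
def one_hot(label, num_classes):
    """Return the one-hot encoding of an integer label as a list of floats."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def discriminator_input(latent, label, num_classes):
    """Concatenate a latent vector with the one-hot encoding of its label."""
    return list(latent) + one_hot(label, num_classes)


latent = [0.3, -1.2, 0.7]
conditioned = discriminator_input(latent, label=1, num_classes=2)
```

The discriminator input therefore has size latent_dim + num_classes, which has to be taken into account when sizing its first layer.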
Both the AAE and SupervisedAAE classes include the MLP that acts as discriminator. However, to be able to use a single discriminator over multiple AEs (one for each omic), we also implemented a stand-alone discriminator, to be instantiated just once and used to forward data from different AEs. This discriminator comes in 2 versions, ClassDiscriminator and ClassDiscriminatorBig.
Image Autoencoders¶
The files biopy.models.ImageAutoEncoders.py and biopy.models.ImageAAEresnet.py contain the models dedicated to image processing, used for the CD4 dataset. The class naming convention is very similar to the one for Expression Autoencoders.
biopy.models.ImageAutoEncoders.py implements 2 kinds of AE:
Variational AE -> ImgVAE class
Adversarial AE -> AAEImg class
They both use the same decoder (ConvDecoder class) and the same encoder, which provides means and log variances (ConvEncoder class). Additionally, the AAEImg model contains the discriminator with gradient reversal. There is no need for multiple image AEs, since there is only one image domain; however, so that all the models can be used in the same way, a separate version of the discriminator, not embedded in any model, is provided for images as well. This is a simple MLP in the Discriminator class.
The ConvEncoder is a custom convolutional network with max pooling and batch normalization at every layer, followed by a couple of FC layers that encode the feature maps into a flat latent space represented by mean and log variance. In the ConvDecoder, 2D transposed convolutional layers are used to ‘upscale’ from the latent space and reconstruct the input image.
In biopy.models.ImageAAEresnet.py we tried a different approach to the implementation of the convolutional AE. This file contains the implementation of an Adversarial AE in the class AAEImgResnet. In this model both the encoder (ConvEncoder) and the decoder (ConvDecoder) use ‘skip connections’ to help the gradient flow. To be able to use this mechanism in the decoder, normal Conv2d layers are alternated with upscaling.
Warning
Some of the classes in the mentioned files share the same name (e.g. ConvEncoder exists both in biopy.models.ImageAutoEncoders.py and biopy.models.ImageAAEresnet.py). However, these classes are not exported to any outer scope and are only used as building blocks by the full models, which are the ones visible from outside and whose names are properly differentiated.
Training framework¶
This section covers the training framework that we implemented. The approach here is more technical, with the purpose of explaining the structure of the classes, so that their usage is clearer in the next section about replicating results.
The reason why we needed to structure such a framework was to provide an easy way to switch among different training methods and datasets without having to rewrite a pipeline from scratch every time. So the basic idea is to devise an interface that represents a pipeline, and then have many classes implementing it each one for a different purpose; so that then it can be easy for an external wrapper to just call the same methods on different classes.
Below we briefly explain the class diagram, and then the pipeline interface implemented by all of them is reported more in depth.
Warning
Although we call it an ‘interface’, it is not an interface in the strict OOP sense: the trainer classes do not share a formally declared interface, they simply implement the same methods.
Classes diagram¶
From now on we will refer to all the classes that implement the pipeline-representing interface as ‘trainer classes’. The base class from which they all inherit is biopy.training.Trainer.
biopy.training.Trainer¶
This class represents the most general case of training on a single domain, and so in its methods it instantiates a single dataset and a single model. It actually does not implement a training loop, that is left to the sub-classes to specify. Even though it is very general, it implements many methods that can be reused by all the trainer classes, such as for validation, dataset preprocessing, and other general setups.
- There are 2 groups of classes inheriting from Trainer:
All the classes that implement training on a single domain. For example, biopy.training.VAETrainer, which implements training of a single Adversarial AE. These kinds of classes are very easy to implement: they inherit everything from Trainer, and the user only needs to specify the dataset and model classes. What these subclasses do have to implement is the training method, as that is specific to every case.
The biopy.training.JointTrainer class. The next section explains about that.
biopy.training.JointTrainer¶
This class is instead the base for all the methods that train on multiple domains, and so require several dataset and model objects, one per domain or, as they are identified in the code, one per ‘omic’. Due to this fundamental difference, it overrides almost all the methods of its parent Trainer, except for the validation method and a few others.
Unlike Trainer which is just a general template, JointTrainer does implement a training method; which is the joint training on all domains to estimate a latent distribution, followed by a second stage where it imposes the estimated distribution. For more details and the motivations, refer to the presentation.
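As a rough illustration of what estimating a latent distribution can mean, the sketch below fits one diagonal Gaussian per label to a set of encoded points. This is only one possible estimator, written in plain Python with made-up values; the function name is hypothetical and the procedure actually used by JointTrainer is the one described in the presentation.

```python
import statistics
from collections import defaultdict


def estimate_label_distributions(latents, labels):
    """Estimate a diagonal Gaussian (mean and std per dimension) for every label."""
    grouped = defaultdict(list)
    for latent, label in zip(latents, labels):
        grouped[label].append(latent)

    estimates = {}
    for label, vectors in grouped.items():
        dimensions = list(zip(*vectors))  # one tuple of values per latent dimension
        estimates[label] = {
            "mean": [statistics.mean(dim) for dim in dimensions],
            "std": [statistics.pstdev(dim) for dim in dimensions],
        }
    return estimates


# Made-up 2-dimensional latent codes for two labels.
latents = [[0.0, 1.0], [2.0, 3.0], [10.0, 10.0], [12.0, 14.0]]
labels = ["quiescent", "quiescent", "poised", "poised"]
estimates = estimate_label_distributions(latents, labels)
```

The resulting per-label parameters are the kind of object a second stage could then try to impose on each domain separately.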
All the trainers that implement a method involving more than one domain at once inherit from this class.
Other classes¶
Ultimately, the classes that implement our proposed approaches all have to deal with multi-domain setting, and so they inherit from JointTrainer. Examples are src.training.VAETrainerDistribution, src.training.VAETrainer and src.training.JointDoubleDiscr. Thanks to this inheritance scheme, many of these classes only need to specify their training loop, and they get “for free” from the superclass access to many useful structures; like per-omic dataset and model objects, logging and metric support.
Now with this scheme in mind we explain the methods through which these classes are able to implement a full pipeline.
Representing a pipeline¶
This section explains how a pipeline is represented inside all the classes of the training subpackage. Below is the list of methods and their purpose:
__init__(**kwargs)
Receives a dictionary with all the hyperparameters and other available customizations.
generate_dataset_loaders(dataset_class, **kwargs)
It has the job of loading the dataset and instantiating PyTorch DataLoaders. The dataset_class parameter specifies the class to be instantiated, and the other **kwargs are forwarded to the constructor of the dataset. Each class can customize this method to its needs; for example, classes that handle training of all the omics at once can instantiate the datasets for all of them here and store them in a dictionary with omic names as keys. The dataset class is left as a parameter to give more freedom to test different classes during development. Once the code is ‘deployed’, that part can be further automated by fixing the dataset class for a given task.
preprocess_dataset(dataset, parameters)
This method moves all the logic of the preprocessing in a single place. It takes care of parsing the parameters specified by the user, and applying them on the provided dataset object. The actual implementation of the preprocessing is in the dataset classes; and this is just an interface that ensures compatibility among the different classes.
generate_models_optimizers(model_class, optimizer_SGD, **kwargs)
The purpose of this method is to build the model and the other objects required for training (scheduler, optimizer). It receives as parameter the class that implements the model to be used, and instantiates it forwarding the **kwargs to it. optimizer_SGD is a boolean that specifies whether to use the SGD or the Adam optimizer. As for the dataset, each class has the freedom to set up its environment in the preferred way: instantiating a model for each omic, a shared or separate discriminator, and so on.
train_model(**kwargs)
This is of course the core method that each class has to implement. At this point the datasets are loaded and the models are built, so the job of this method is to implement the training loop specific to each method. It specifies the way in which data are forwarded, which losses are used, and so on. The other parameters can specify, for example, which metrics to evaluate or some custom parameters.
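The four pipeline methods listed above can be pictured as the following skeleton. It is written with Python's abc module purely for illustration: the class names PipelineSketch and EchoTrainer are hypothetical, the mapping of optimizer_SGD=True to SGD is our reading of the parameter name, and the real trainer classes do real work in each step.

```python
from abc import ABC, abstractmethod


class PipelineSketch(ABC):
    """Illustrative stand-in for the method set shared by the trainer classes."""

    def __init__(self, **kwargs):
        self.hyperparameters = dict(kwargs)

    @abstractmethod
    def generate_dataset_loaders(self, dataset_class, **kwargs): ...

    @abstractmethod
    def preprocess_dataset(self, dataset, parameters): ...

    @abstractmethod
    def generate_models_optimizers(self, model_class, optimizer_SGD, **kwargs): ...

    @abstractmethod
    def train_model(self, **kwargs): ...


class EchoTrainer(PipelineSketch):
    """Toy subclass: records each pipeline step instead of doing real work."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.steps = []

    def generate_dataset_loaders(self, dataset_class, **kwargs):
        self.dataset = dataset_class(**kwargs)  # kwargs go to the dataset constructor
        self.steps.append("dataset")

    def preprocess_dataset(self, dataset, parameters):
        self.steps.append("preprocess")

    def generate_models_optimizers(self, model_class, optimizer_SGD, **kwargs):
        self.model = model_class(**kwargs)  # kwargs go to the model constructor
        # Assumption: True selects SGD, False selects Adam.
        self.optimizer_name = "SGD" if optimizer_SGD else "Adam"
        self.steps.append("model")

    def train_model(self, **kwargs):
        self.steps.append("train")


# dict stands in for real dataset and model classes.
trainer = EchoTrainer(lr=1e-3)
trainer.generate_dataset_loaders(dict, folder="dataset")
trainer.preprocess_dataset(trainer.dataset, parameters={})
trainer.generate_models_optimizers(dict, optimizer_SGD=False, latent_dim=8)
trainer.train_model()
```

Because every trainer exposes the same four calls, an external wrapper can drive any of them without knowing which training method is behind it.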
Other utility methods: the following methods are of general utility and are implemented only by Trainer and JointTrainer, since the only variability in them comes from whether or not there are multiple domains to consider. All the other classes get these methods “for free” and can use them with no need for overriding.
pack_for_metric(metric_class, split, **kwargs)
Just like the trainer classes, the metrics are also implemented with a common structure. They all receive the datasets on which to test and a model inside a dictionary, so this method acts as an interface between the trainer classes and the metrics. The trainer classes call it for all the metrics they need to evaluate, and it takes care of properly instantiating the metric class. The split parameter selects whether the metrics are evaluated on the train or the test set.
validate_on(net, data_loader, criterion, device)
This method is meant for a simple validation pass on a given model and dataset. It is probably the most standard step across the different training methods, and so no class overrides it.
setup_metrics(metric_classes)
To be called before starting the training. It receives the classes of the metrics to be used during training and sets up the data structure to collect results.
eval_metrics(**kwargs)
Called after every epoch, it takes care of calling the metrics (for each domain in the case of JointTrainer). It gathers in one place all the complexity of instantiating the metrics that the user asked to evaluate, calling them with the proper parameters, and collecting the results.
setup_model_saving(save_models)
Called before starting the training loop, it receives the parameter specified by the user that asks whether or not to save the best models. It sets up the directory in which to do so, and data structures to hold the best results. It saves the best models for each of the metrics that were asked to evaluate.
model_saving(avg_loss, epoch)
Called after every epoch, it receives the epoch number (for logging purposes) and the average loss, which is used as the criterion to find the best model if no metrics were specified. It removes the file of the previous best model and saves a new one, putting the metric, the domain and the epoch number in the file name.
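The bookkeeping described for model_saving (drop the file of the previous best model, write a new one named after metric, domain and epoch) can be sketched as follows. The class name, the file naming pattern and the "lower is better" rule are assumptions for illustration, and a text payload stands in for the real model weights.

```python
import tempfile
from pathlib import Path


class BestCheckpointKeeper:
    """Keeps a single 'best so far' file per (metric, domain) pair."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.best = {}  # (metric, domain) -> (score, path)

    def update(self, metric, domain, score, epoch, payload):
        """Save payload if score improves (lower is better, as for a loss)."""
        key = (metric, domain)
        previous = self.best.get(key)
        if previous is not None and score >= previous[0]:
            return False
        if previous is not None:
            previous[1].unlink()  # remove the file of the previous best model
        path = self.directory / f"{domain}_{metric}_epoch{epoch}.txt"
        path.write_text(payload)
        self.best[key] = (score, path)
        return True


with tempfile.TemporaryDirectory() as tmp:
    keeper = BestCheckpointKeeper(tmp)
    results = [
        keeper.update("loss", "mRNA", score=0.9, epoch=1, payload="weights@1"),
        keeper.update("loss", "mRNA", score=1.2, epoch=2, payload="weights@2"),
        keeper.update("loss", "mRNA", score=0.4, epoch=3, payload="weights@3"),
    ]
    saved_files = sorted(path.name for path in Path(tmp).iterdir())
```

Only the checkpoint of epoch 3 survives in this run, since each improvement replaces the previous best file.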
The trainer wrapper¶
On top of the trainer classes described above, there is a wrapper that provides easy and structured access to all the training subclasses: the ThanosTrainer class in biopy.training.trainer_wrapper.py. It offers a way to specify the parameters for each step of the training pipeline.
Most of the code in the class is boilerplate that makes internal state checks to ensure that its methods are called in the right order to build the pipeline, providing a robust interface to easily switch among training methods and datasets specifying all the needed hyperparameters.
Training methods are represented in the form of “strategies”, which are encoded in dictionaries inside this class. The user sets the name of the strategy, and the class automatically knows how many trainer classes are needed to perform that strategy and how to pipeline them. Each strategy can have one or more agents, where an agent is a trainer class, and each agent can implement one or more stages.
Beyond setting the strategy, the user has to call, for each agent class that composes the strategy, the main methods that represent its pipeline (as described above) in order to specify the required parameters. Specifically, these methods are:
generate_dataset_loaders(dataset_class, **kwargs)
preprocess_dataset(dataset, parameters)
generate_models_optimizers(model_class, optimizer_SGD, **kwargs)
train_model(**kwargs)
When these methods are called, Thanos does not perform any action: the calls only declare the parameters of the pipeline to be followed. When the user wants to start the training, it is possible to do so by calling:
exec()
That will start the pipeline of the strategy that has been set. In many cases the trainer classes require a number of other parameters, specified in a dictionary upon initialization; these can be passed to the Thanos constructor as a list with one item for each agent class of the strategy. This is useful to configure behaviors such as saving models, specifying the log directory, and so on.
Beyond running experiments, one is typically interested in looking at the results. Another useful feature offered by ThanosTrainer is logging through TensorBoard. For every run it logs all the hyperparameters used and the progress of training. By default the latter is composed of the training and validation loss; additionally, any metrics that were specified for evaluation are logged as well.
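The behaviour described above, where pipeline calls are only recorded and everything runs on exec(), is a deferred execution pattern. The sketch below shows the pattern in isolation; the strategy table, the class names and the call helper are hypothetical and far simpler than ThanosTrainer.

```python
class RecordingAgent:
    """Toy agent: each pipeline method just returns a description of the call."""

    def generate_dataset_loaders(self, dataset_class, **kwargs):
        return f"load {dataset_class}"

    def train_model(self, **kwargs):
        return f"train for {kwargs.get('epochs')} epochs"


# Hypothetical strategy table: strategy name -> agent classes, in execution order.
STRATEGIES = {"toy_one_shot": [RecordingAgent]}


class DeferredPipeline:
    """Stores the requested calls per agent and replays them only on exec()."""

    def __init__(self, strategy):
        self.agents = [agent_class() for agent_class in STRATEGIES[strategy]]
        self.pending = [[] for _ in self.agents]

    def call(self, agent_index, method_name, *args, **kwargs):
        # Nothing is executed here: the call is only recorded.
        self.pending[agent_index].append((method_name, args, kwargs))

    def exec(self):
        """Run the recorded calls, agent by agent, and return their results."""
        log = []
        for agent, calls in zip(self.agents, self.pending):
            for method_name, args, kwargs in calls:
                log.append(getattr(agent, method_name)(*args, **kwargs))
        return log


pipeline = DeferredPipeline("toy_one_shot")
pipeline.call(0, "generate_dataset_loaders", "DatasetMultiOmics")
pipeline.call(0, "train_model", epochs=10)
recorded_before_exec = len(pipeline.pending[0])
log = pipeline.exec()
```

Recording first and executing later lets the wrapper check that all the required steps were declared before any expensive work starts.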
If asked to do so, the best performing models for each metric will be saved too.
Replicating results¶
After having presented the technical details of how the different pipelines are implemented, this section provides a more practical approach with snippets on how to run experiments to replicate our results.
Warning
To execute the snippets in this section you need to install our package and its dependencies. See Installation for details on how to do that and download the datasets as well.
Strategies¶
We provide a ready-made script to execute all the proposed implementations. More details on the parameters that you need to specify are given below, but the main ones are:
the dataset
the training method aka the strategy
For a more specific description of how strategies are used inside our framework, see The trainer wrapper.
For a more theoretical perspective on these strategies, see our presentation instead.
Here we report just a brief summary of all the available strategies that you can ask our framework to run:
- one_shot
Most basic implementation. Allows training a standard VAE, AAE, or Supervised AAE on a single domain.
- baseline
Implements the method reported as baseline, with a first stage (meant for image domains) with a discriminator on the labels. The second stage has the discriminator as well and uses an anchor loss when available, on paired datasets.
- baseline_1stage
Meant for datasets that do not include an image domain. Only executes the second stage of the baseline explained above
- joint_estimator
Our first proposed variation. To avoid the need for an anchor loss to match the latent spaces, it estimates in a first stage a distribution conditioned on the label, and then imposes that distribution in a second stage, separately for each domain/omic, using the framework of Adversarial AEs.
- joint_double_discr
Same principle as the previous one, but instead of using AAEs it uses a discriminator to impose a common distribution.
- distribution
Applies principles from domain adaptation to ensure that the different domains get encoded on a common distribution. It is possible to specify a weighted combination of MKMMD, HAFN and SAFN losses. Includes a first stage of pretraining, meant for image domains.
- distribution_1stage
Same as above but without the first stage of pretraining
Run the script¶
A script to easily replicate all the reported results is provided at scripts/run.py. It does not add anything to the proposed framework, and it is provided for convenience of the external user to take care of the argument parsing and config specification.
The high-level parameters, such as the dataset path, the strategy and the logging folder, are passed on the command line. The more specific details, instead, require a config file. Config files are .ini files with different sections, one for each of the aspects to configure. All the config files for the reported experiments are provided under configs, ready for you to try out.
Here is a summary of the command line arguments:
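Sectioned .ini files of this kind can be read with Python's standard configparser module. In the sketch below the section and option names (DATASET, TRAINING, omics, epochs, learning_rate) are invented for illustration; the real ones are defined by the files shipped under configs.

```python
import configparser

# Hypothetical content: the real section and option names are defined by the
# files shipped in the configs folder.
EXAMPLE_INI = """
[DATASET]
omics = mRNA, miRNA

[TRAINING]
epochs = 100
learning_rate = 0.001
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

# Values are stored as strings, so lists and numbers have to be converted.
omics = [name.strip() for name in config["DATASET"]["omics"].split(",")]
epochs = config["TRAINING"].getint("epochs")
learning_rate = config["TRAINING"].getfloat("learning_rate")
```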
usage: run.py [-h] --fold FOLD --log_dir LOG_DIR --strategy STRATEGY --config_path CONFIG_PATH
arguments:
-h, --help show this help message and exit
--fold FOLD dataset path
--log_dir LOG_DIR directory to store the log files
--strategy STRATEGY strategy to perform the training
--config_path CONFIG_PATH path of the configuration file
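For reference, a command line interface with the same four required arguments can be declared with argparse as sketched below. This mirrors the usage message above and is not the actual content of scripts/run.py; build_parser is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser that accepts the four arguments listed in the usage message."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--fold", required=True, help="dataset path")
    parser.add_argument("--log_dir", required=True, help="directory to store the log files")
    parser.add_argument("--strategy", required=True, help="strategy to perform the training")
    parser.add_argument("--config_path", required=True, help="path of the configuration file")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration purposes.
args = build_parser().parse_args([
    "--fold", "/your/path/to/cd4_folder",
    "--log_dir", "/where/you/will/find/logs",
    "--strategy", "baseline",
    "--config_path", "configs/CD4/cd4_baseline.ini",
])
```

Since all four arguments are required, omitting any of them makes the parser print the usage message and exit.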
In our configs folder you will find more files than the ones mentioned in the snippets below. The usage of each of them is clear from its filename, which contains the dataset, the methodology, and any additional information such as additional losses (e.g. kld) and preprocessing techniques (e.g. smote).
Reproducing baseline results¶
Below we report the command to reproduce the results applying the baseline method and models to all the 3 datasets.
- CD4 :
python scripts/run.py --fold /your/path/to/cd4_folder \
    --log_dir /where/you/will/find/logs \
    --strategy baseline \
    --config_path configs/CD4/cd4_baseline.ini
- A549 :
python scripts/run.py --fold /your/path/to/a549_folder \
    --log_dir /where/you/will/find/logs \
    --strategy baseline_1stage \
    --config_path configs/A549/a549_baseline.ini
- GDC :
python scripts/run.py --fold /your/path/to/gdc_folder \
    --log_dir /where/you/will/find/logs \
    --strategy baseline_1stage \
    --config_path configs/GDC/gdc_baseline.ini
Joint Training with Adversarial AEs and double discriminator¶
These snippets run experiments with our proposed variation: 2-stage Joint Training with a double discriminator.
- CD4 :
python scripts/run.py --fold /your/path/to/cd4_folder \
    --log_dir /where/you/will/find/logs \
    --strategy baseline \
    --config_path configs/CD4/cd4_baseline.ini
- A549 :
python scripts/run.py --fold /your/path/to/a549_folder \
    --log_dir /where/you/will/find/logs \
    --strategy baseline_1stage \
    --config_path configs/A549/a549_baseline.ini
- GDC :
python scripts/run.py --fold /your/path/to/gdc_folder \
    --log_dir /where/you/will/find/logs \
    --strategy baseline_1stage \
    --config_path configs/GDC/gdc_baseline.ini
Training with losses from Domain adaptation:¶
These snippets execute experiments with our proposed variations that use techniques from domain adaptation. Use any of the proposed losses on any of the datasets:
- CD4 and MKMMD:
python scripts/run.py --fold /your/path/to/cd4_folder \
    --log_dir /where/you/will/find/logs \
    --strategy distribution \
    --config_path configs/CD4/cd4_mmd_kld_no_smote.ini
- CD4 and HAFN:
python scripts/run.py --fold /your/path/to/cd4_folder \
    --log_dir /where/you/will/find/logs \
    --strategy distribution \
    --config_path configs/CD4/cd4_hafn_no_smote.ini
- A549 and MKMMD:
python scripts/run.py --fold /your/path/to/a549_folder \
    --log_dir /where/you/will/find/logs \
    --strategy distribution_1stage \
    --config_path configs/A549/a549_mmd.ini
- A549 and SAFN:
python scripts/run.py --fold /your/path/to/a549_folder \
    --log_dir /where/you/will/find/logs \
    --strategy distribution_1stage \
    --config_path configs/A549/a549_safn.ini
- GDC and MKMMD:
python scripts/run.py --fold /your/path/to/gdc_folder \
    --log_dir /where/you/will/find/logs \
    --strategy distribution_1stage \
    --config_path configs/GDC/gdc_mmd_no_smote.ini
- GDC and HAFN:
python scripts/run.py --fold /your/path/to/gdc_folder \
    --log_dir /where/you/will/find/logs \
    --strategy distribution_1stage \
    --config_path configs/GDC/gdc_hafn_no_smote.ini
Plot latent spaces¶
It may be useful to take a look inside your model and plot what is happening in its latent space. We provide a utility to do so conveniently. Given a dataset and a model (hopefully trained previously), the function plots the different domains in the dataset, choosing a main color for each label and then assigning a shade of that color to each domain. Ideally, if the domain translation has been successful, you should see that the different shades of each color (i.e. samples from different domains with the same label) overlap in a common distribution.
Below we report examples for all the datasets, calling the main function provided in scripts/visualizer.py
CD4
from visualizer import main
main(dataset_name='cd4', dataset_folder='dataset_nature', model_name='cd4_vae',
     checkpoints="results/cd4/mmd/ROCCNNRF/{omic}VAETrainerDistributionepoch891.pth",
     output_path='cd4.png')
A549
from visualizer import main
main(dataset_name='a549', dataset_folder='dataset_nature_atac-rna', model_name='a549_vae',
     checkpoints="results/a549/baseline/KNNAccuracySklearn/{omic}VAETrainerepoch1494.pth",
     output_path='a549.png')
GDC
from visualizer import main
main(dataset_name='gdc', dataset_folder='dataset_breast', model_name='gdc_vae',
     checkpoints="results/gdc/gdc_mmd/ROCCNNRF/{omic}VAETrainerDistributionepoch103.pth",
     output_path='gdc.png')
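The coloring scheme described at the beginning of this section (one main color per label, one shade of it per domain) can be reproduced with the standard colorsys module. The sketch below is only an illustration: the hue spacing and the lightness range are our choices, the function name is hypothetical, and the visualizer may pick its colors differently.

```python
import colorsys


def label_domain_colors(labels, domains):
    """Assign one hue per label and one lightness level per domain.

    Returns a dict mapping (label, domain) to an (r, g, b) tuple in [0, 1].
    """
    colors = {}
    for i, label in enumerate(labels):
        hue = i / len(labels)  # evenly spaced hues, one per label
        for j, domain in enumerate(domains):
            # lightness goes from 0.35 (dark) to 0.75 (light) across domains
            step = j / max(len(domains) - 1, 1)
            lightness = 0.35 + 0.4 * step
            colors[(label, domain)] = colorsys.hls_to_rgb(hue, lightness, 1.0)
    return colors


# Label and domain names are placeholders inspired by the CD4 dataset.
palette = label_domain_colors(["poised", "quiescent"], ["rna", "image"])
```

The returned tuples can be passed as colors to a plotting library, so that points sharing a label differ only in shade across domains.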
Bibliography¶
- YBV+21
Karren Dai Yang, Anastasiya Belyaeva, Saradha Venkatachalapathy, Karthik Damodaran, Abigail Katcoff, Adityanarayanan Radhakrishnan, G. V. Shivashankar, and Caroline Uhler. Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nature Communications, January 2021. URL: https://doi.org/10.1038/s41467-020-20249-2, doi:10.1038/s41467-020-20249-2.