# alphascreen

1. [Features](#features)
2. [Workflow](#workflow)
3. [Installation](#installation)
4. [Usage](#usage)
    * [Job setup](#jobsetup)
    * [Check runs](#checkruns)
    * [Analysis](#analysis)
5. [License](#license)

## Features<a name="features"></a>

* Gets the fasta sequences from uniprot IDs contained in a table of paired proteins.

* Chops up the sequences so that they are of reasonable size for subsequent pairwise predictions.

* Interprets the PAEs only in the region relating to the protein-protein interaction of interest.

* Generates summaries, including a PDF showing the PAE plot next to snapshots of the models.

## Workflow<a name="workflow"></a>

### Setting up the fasta files

Use this package to generate fasta files for a set of interaction partners so that Alphafold predictions can be run on them. The input to ```--parse``` is a table with two columns containing the uniprot IDs of the interaction partners (headers specified with ```--columnA``` and ```--columnB```). Such tables can be generated by BioGRID, for example.

The sequences are fetched from Uniprot and fragmented before the fasta files are generated and stored in the *fastas* folder. Fragmenting the sequences (```--fragment```) keeps the total sequence length short enough that the jobs don't run out of memory. Each fragment is extended by an overlap (```--overlap```) so that the fragmentation doesn't accidentally cut into an interaction interface. You can dimerize any or all proteins (```--dimerize```, ```--dimerize_all```, ```--dimerize_all_except```) and/or consider only a specified sequence range of a protein of interest (```--consider```).
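The fragmentation idea can be illustrated with a minimal Python sketch. This is not the package's own implementation: the function name `fragment_sequence`, the choice of equal-sized slices, and the 1-based coordinates are assumptions made for illustration. Only the meaning of the two parameters (approximate fragment length, and an extension on either side of each slice) comes from the option descriptions.

```python
import math

def fragment_sequence(seq, fragment=500, overlap=50):
    """Hypothetical helper: split `seq` into roughly equal slices of at most
    `fragment` residues, then extend each slice by `overlap` residues on
    both sides (clipped to the sequence ends).

    Returns a list of (start, end, subsequence) tuples with 1-based,
    inclusive coordinates.
    """
    n = len(seq)
    if n == 0:
        return []
    # Number of slices needed so that no core slice exceeds `fragment`.
    n_slices = max(1, math.ceil(n / fragment))
    size = math.ceil(n / n_slices)
    pieces = []
    for i in range(n_slices):
        core_start = i * size
        core_end = min(n, core_start + size)
        # Extend on both sides so an interface at a cut point is not split.
        start = max(0, core_start - overlap)
        end = min(n, core_end + overlap)
        pieces.append((start + 1, end, seq[start:end]))
    return pieces

# A 1200-residue placeholder sequence gives three overlapping fragments.
for start, end, sub in fragment_sequence("A" * 1200):
    print(start, end, len(sub))
```

With the defaults, a sequence shorter than 500 residues is returned as a single fragment.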

### Running the predictions

The output is a bash script (*runpredictions.bsh*) that allows you to run Alphafold on all the generated fasta files on your machine/cluster. The syntax is set up for the LMB cluster, and will therefore likely not correspond to what you use on your system. You can edit either *runpredictions.bsh* or *jobsetup.py* itself so that the Alphafold submission commands have the right syntax. If you do change this, make sure the results are output into the *results* directory, which is required for the analysis command ```--show_top``` to work. The package has only been tested with Colabfold 1.3.0 and therefore relies on its file naming scheme. Be careful before running the script, since it will submit all jobs at once and rely on your queuing system to handle the submissions.

### Analyzing the results

The default behaviour for analysis (```--show_top```) is to go through one result at a time and choose the model with the best PAE plot, considering only the region corresponding to the protein interaction you are screening for. Once this has been done for all the predictions, the chosen models are ranked relative to each other by comparing their PAE plots in the same protein interaction region.

If you instead want to rank by iptm score, you can pass ```--rankby iptm```. This relies on your Alphafold/Colabfold implementation outputting the iptm and ptm scores to a *scores.txt* file within the individual results directories. The file should have the same number of lines as there are models, each containing the information in this format: *iptm:0.09 ptm:0.62* (these values are just an example).
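To make the expected *scores.txt* layout concrete, here is a minimal Python sketch that reads lines in the *iptm:0.09 ptm:0.62* format. The function name `parse_scores` and the strict error handling are assumptions for illustration, not the package's own parser.

```python
import re

# One line per model, e.g. "iptm:0.09 ptm:0.62".
SCORE_LINE = re.compile(r"iptm:(?P<iptm>[0-9.]+)\s+ptm:(?P<ptm>[0-9.]+)")

def parse_scores(text):
    """Hypothetical helper: return a list of (iptm, ptm) tuples,
    one per non-empty line of a scores.txt file's contents."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SCORE_LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"Unexpected line in scores.txt: {line!r}")
        scores.append((float(match["iptm"]), float(match["ptm"])))
    return scores

# Example values only, as in the text above.
example = "iptm:0.09 ptm:0.62\niptm:0.45 ptm:0.70\n"
print(parse_scores(example))
```

A check like this can be run on one results directory before starting the analysis, to confirm that the number of parsed lines matches the number of models.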

## Installation<a name="installation"></a>

* Set up a fresh conda environment with Python >= 3.8: `conda create -n alphascreen python=3.8`

* Activate the environment: `conda activate alphascreen`.

* Install alphascreen: **`pip install alphascreen`**

* Install pymol dependencies: **`conda install -c schrodinger pymol-bundle`**

## Usage<a name="usage"></a>

### Job setup<a name="jobsetup"></a>

* Generate the fasta files and Alphafold commands for the input uniprot IDs.

```
alphascreen --parse uniprot-id-1/uniprot-id-2 [options]
```

* Generate the fasta files and Alphafold commands for the input table.

```
alphascreen --parse filename [options]
```

**Options**

**```--focus```** *```uniprot-id```*

Uniprot ID to focus on. This means that it will be the first chain in any predictions that contain it.

**```--fragment```** *```length```*

Approximate fragment length. Default is 500.

**```--overlap```** *```length```*

Sequence is extended by this amount on either side of slices. Default is 50.

**```--dimerize```** *```uniprot-id```* *or* *```uniprot-ids.txt```*

Uniprot ID to dimerize. Alternatively, provide a text file (.txt) with a single column list of uniprot IDs to dimerize.

**```--dimerize_all```**

Dimerize all proteins.

**```--dimerize_all_except```** *```uniprot-ids.txt```*

Provide a text file (.txt) with a single column list of uniprot IDs to NOT dimerize. Everything else will be dimerized.

**```--consider```** *```uniprot/start/end```*

Uniprot ID and sequence range to consider. Example: *Q86VS8/1/200* only considers amino acids 1-200 for uniprot ID Q86VS8.
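As an illustration of the *uniprot/start/end* format, the following Python sketch splits such a specification and applies it to a sequence. The helper names `parse_consider` and `apply_consider` are hypothetical, and the sketch assumes the range is 1-based and inclusive, as the example "amino acids 1-200" suggests.

```python
def parse_consider(spec):
    """Hypothetical helper: split 'ID/start/end' into (id, start, end)."""
    parts = spec.split("/")
    if len(parts) != 3:
        raise ValueError(f"Expected uniprot/start/end, got {spec!r}")
    uniprot_id, start, end = parts[0], int(parts[1]), int(parts[2])
    if start < 1 or end < start:
        raise ValueError(f"Invalid range {start}-{end} in {spec!r}")
    return uniprot_id, start, end

def apply_consider(seq, start, end):
    """Return residues start..end (assumed 1-based, inclusive)."""
    return seq[start - 1:end]

uniprot_id, start, end = parse_consider("Q86VS8/1/200")
# Placeholder sequence; the real one would be fetched from Uniprot.
print(uniprot_id, len(apply_consider("A" * 500, start, end)))
```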

**```--alphafold_exec```** *```alphafold-executable```*

Path to the script that runs Alphafold, used when writing the commands. Default is *colabfold2* as per the LMB cluster usage.

**```--columnA```** *```columnA-name```*

Name of column heading for uniprot IDs for the first set of interactors. Default is *SWISS-PROT Accessions Interactor A*, which is just what BioGRID uses.

**```--columnB```** *```columnB-name```*

Name of column heading for uniprot IDs for the second set of interactors. Default is *SWISS-PROT Accessions Interactor B*, which is just what BioGRID uses.
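To check that a table has the expected column headings before passing it to ```--parse```, a short Python sketch such as the following can help. It assumes a comma-separated file and is not how alphascreen itself reads the table; the function name `read_pairs` and the ID *P00001* are placeholders.

```python
import csv
import io

COLUMN_A = "SWISS-PROT Accessions Interactor A"
COLUMN_B = "SWISS-PROT Accessions Interactor B"

def read_pairs(handle, column_a=COLUMN_A, column_b=COLUMN_B):
    """Hypothetical helper: return (id_a, id_b) pairs from a CSV table,
    skipping rows where either column is empty or missing."""
    pairs = []
    for row in csv.DictReader(handle):
        a = (row.get(column_a) or "").strip()
        b = (row.get(column_b) or "").strip()
        if a and b:
            pairs.append((a, b))
    return pairs

# Small in-memory table standing in for a real file.
table = f"{COLUMN_A},{COLUMN_B}\nQ86VS8,P00001\n,P00001\n"
print(read_pairs(io.StringIO(table)))
```

For a real file, open it with `open(filename, newline="")` and pass the handle in place of the `io.StringIO` object.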

### Check runs<a name="checkruns"></a>

```
alphascreen --check
```

Checks how many runs are finished so far and how many remain.

```
alphascreen --write_unfinished
```

Checks how many runs are finished so far and writes out a new bash script with the remaining Alphafold commands.

### Analysis<a name="analysis"></a>

```
alphascreen --show_top threshold [options]
```

Generate summary files for the runs so far. For example, ```alphascreen --show_top 0.8``` will use the interaction-site PAEs to choose the best model in each prediction, then list the chosen models ranked by their interaction-site PAE. Only those with scaled PAEs higher than 0.8 are shown. See the ```--rankby``` option below for more information on the scaled PAE. To output all predictions, pass ```--show_top 0```. A table is output (.xlsx and .csv), which can be used as input for a subsequent run of alphascreen if you need to test dimerization or use a different Alphafold executable on the top hits.

```
alphascreen --write_table [options]
```

Like ```--show_top```, but only outputs the table (.xlsx and .csv). No threshold value is considered; all predictions are ranked and output.

**Options**

**```--rankby```** ```pae``` or ```iptm``` or ```ptm```

Score by which models are ranked (***pae***, ***iptm***, or ***ptm***). The default is *pae*. This is used both for choosing the best model in a prediction and for ranking those chosen models in the summary files. The option ```pae``` will look for the deepest PAE valleys only in the parts of the plot that correspond to interactions between **different** proteins. The PAE is scaled to be between 0 and 1, where higher values are better predictions (```--show_top 0.8``` is a good starting point). The options ```iptm``` and ```ptm``` rely on a *scores.txt* file in each results directory, as explained in the Workflow section (in this case ```--show_top 0.3 --rankby iptm``` is a good starting point).
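The idea of looking only at the parts of the PAE plot between different proteins can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the package's scoring: it assumes exactly two chains, with chain A's residues first in the matrix, and it only extracts the inter-chain values and reports their minimum. The scaling to the 0-1 range is not reproduced here, and the name `interchain_pae` is hypothetical.

```python
def interchain_pae(pae, len_a):
    """Hypothetical helper: collect PAE values where one residue is in
    chain A (index < len_a) and the other is in chain B (index >= len_a).

    `pae` is a square matrix given as a list of lists.
    """
    n = len(pae)
    values = []
    for i in range(n):
        for j in range(n):
            # Keep only the two off-diagonal blocks (A vs B and B vs A).
            if (i < len_a) != (j < len_a):
                values.append(pae[i][j])
    return values

# Toy 4x4 matrix: two residues in chain A, two in chain B (made-up values).
toy_pae = [
    [0, 1, 20, 30],
    [1, 0, 5, 25],
    [22, 6, 0, 2],
    [28, 24, 2, 0],
]
values = interchain_pae(toy_pae, len_a=2)
print(min(values))  # lowest inter-chain PAE in the toy matrix
```

The low values inside each chain (the diagonal blocks) are ignored, so a confidently folded monomer does not by itself produce a good score.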

**```--overwrite```**

Overwrite snapshots that have already been generated. Without this option, existing snapshots are skipped to save time. This is only relevant for ```--show_top```.

## License<a name="license"></a>

This project is licensed under the MIT License - see the [LICENSE.txt](https://github.com/sami-chaaban/alphascreen/blob/main/LICENSE.txt) file for details.