HIV DRP Databases and Analytical Tools

Proviral Sequence Database (PSD)

Despite the success of antiretroviral therapy (ART), HIV-1 persists in reservoirs and viremia rebounds if treatment is interrupted.  To facilitate understanding of the genetic structure and dynamics of the HIV-1 reservoir, we developed a public database, the Proviral Sequence Database (PSD), for the storage and meta-analyses of near full-length (NFL) HIV-1 genomic RNA and proviral sequences that persist in donors on ART or that rebound after ART is interrupted (described in Retrovirology 13: 47, 2016).  This relational database contains information about host characteristics, treatment, and HIV-1 sequences, along with tools for sequence annotation/features.  PSD was developed by bioinformatics analysts Wei Shao and Jigui Shan (Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc.) in consultation with investigators John M. Coffin (Tufts University); Mary F. Kearney and Wei-Shau Hu (HIV DRP); and John W. Mellors (University of Pittsburgh).  PSD can be accessed at the website

Retrovirus Integration Database (RID)

A database on retrovirus integration sites is now available for use by intramural and extramural investigators.  The Retrovirus Integration Database (RID) was developed by bioinformatics analysts Wei Shao and Jigui Shan (Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc.) in consultation with John M. Coffin (Tufts University) and HIV DRP investigators Stephen H. Hughes, Frank Maldarelli, and Mary F. Kearney (described in AIDS Res. Hum. Retroviruses 36: 1, 2020).  RID can be accessed at the website

HIV-DRLink

The HIV-DRLink program was developed to work in conjunction with the Stanford HIV Drug Resistance Database to rapidly report linked and unlinked HIV-1 drug-resistance mutations in large data sets generated by single-genome sequencing methods that eliminate PCR-based recombination and nucleotide mixtures (described in AIDS Res. Hum. Retroviruses 36: 942, 2020).  HIV-DRLink is a necessary tool to further investigate the effect of single versus linked preexisting drug-resistance mutations on the outcome of antiretroviral therapy.  HIV-DRLink was developed by bioinformatics analyst Wei Shao (Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc.) in consultation with John M. Coffin (Tufts University) and HIV DRP investigators Mary F. Kearney and Frank Maldarelli.
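To make the distinction between linked and unlinked mutations concrete, the following Python sketch counts how often each drug-resistance mutation occurs in a sample and how often two mutations occur together on the same single genome.  It is only an illustration and not the HIV-DRLink implementation: the function name summarize_linkage, the input layout (a dictionary mapping a genome ID to its set of mutations), and the example genome IDs are assumptions made here.

```python
from collections import Counter
from itertools import combinations


def summarize_linkage(genome_mutations):
    """Count single and linked mutations across single-genome sequences.

    genome_mutations: dict mapping a genome ID to the set of
    drug-resistance mutations found on that genome (hypothetical layout,
    not the Stanford database or HIV-DRLink output format).
    Returns (single, linked):
      single[m]      -> number of genomes carrying mutation m
      linked[(a, b)] -> number of genomes carrying both a and b
    """
    single = Counter()
    linked = Counter()
    for mutations in genome_mutations.values():
        for mutation in mutations:
            single[mutation] += 1
        # Mutations are linked only when they sit on the same genome.
        for pair in combinations(sorted(mutations), 2):
            linked[pair] += 1
    return single, linked


# Hypothetical sample: both mutations are present in the sample,
# but only genome "g1" carries them together.
sample = {
    "g1": {"K103N", "M184V"},
    "g2": {"K103N"},
    "g3": {"M184V"},
    "g4": set(),
}
single, linked = summarize_linkage(sample)
print(dict(single))
print(dict(linked))
```

In this toy sample, a bulk sequence would report both mutations, whereas the per-genome counts show that only one of four genomes carries them linked.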

Sequence Overrepresentation (SOR) Index

The Sequence Overrepresentation (SOR) index measures whether a cluster of identical sequences within a population is larger than expected by chance given the overall diversity of the population.  Briefly, the SOR index (described in Proc. Natl. Acad. Sci. USA 116: 25891, 2019) evaluates the probability of finding N identical sequence pairs in a set of sequence pairs whose distances follow a Poisson distribution with a mean given by the average p-distance of the supplied sequence set.  The sequence set supplied to the SOR webpage should be prealigned and devoid of any sequences that would produce artificially high genetic distances (e.g., hypermutants and outgroup consensus sequences).  Output of the SOR webpage is a bar graph showing the distribution of pairwise distances within the supplied dataset (if requested) and a table of p-values with their associated rake sizes and IDs.  The SOR index and webpage were developed by postbaccalaureate fellow Michael Bale (HIV DRP) and investigator Brian Luke (Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc.) in consultation with investigators John M. Coffin (Tufts University) and Mary F. Kearney (HIV DRP).  An R script version of the web app and the supporting web app code are available at
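The idea behind the index can be sketched in a few lines of Python.  This is one possible reading of the description above and should not be taken as the published SOR formula or the web app's code: it assumes that the number of differences per sequence pair is Poisson distributed with a mean equal to the average p-distance times the alignment length, and it further assumes (as a simplification) that sequence pairs are independent, so that the count of identical pairs can be scored with a binomial tail.  The function names are hypothetical.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be prealigned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def identical_pair_pvalue(seqs):
    """Return (observed identical pairs, tail probability).

    Assumption: differences per pair ~ Poisson(mean), so a pair is
    identical with probability exp(-mean).  Assumption: pairs are
    treated as independent, giving a binomial upper tail.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    distances = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    total_pairs = len(distances)
    # Expected number of differences per pair under the Poisson model.
    mean_differences = (sum(distances) / total_pairs) * len(seqs[0])
    p_identical = math.exp(-mean_differences)
    observed = sum(d == 0 for d in distances)
    # P(X >= observed) for X ~ Binomial(total_pairs, p_identical).
    p_value = sum(
        math.comb(total_pairs, k)
        * p_identical ** k
        * (1 - p_identical) ** (total_pairs - k)
        for k in range(observed, total_pairs + 1)
    )
    return observed, p_value


observed, p_value = identical_pair_pvalue(["AAAA", "AAAA", "AAAA", "AATT"])
print(observed, p_value)
```

A small tail probability would indicate more identical pairs than the overall diversity of the set predicts under these assumptions.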

HIVIntact

The characterization of the HIV-1 reservoir, which consists of replication-competent integrated proviruses that persist on antiretroviral therapy (ART), is made difficult by the rarity of intact proviruses relative to those that are defective.  While the only conclusive test for the replication competence of HIV-1 proviruses is carried out in cell culture, genetic characterization of genomes by near full-length PCR and sequencing can be used to determine whether particular proviruses have insertions, deletions, or substitutions that render them defective.  Proviruses that are not excluded by having such defects can be classified as genetically intact and, possibly, replication competent.  Identifying and quantifying proviruses that are potentially replication competent is important for the development of strategies toward a functional cure.  However, to date, there are no programs that can be incorporated into deep-sequencing pipelines for the automated characterization and annotation of HIV genomes.  Existing programs that perform this work require manual intervention, cannot be widely installed, and do not have easily adjustable settings.  In collaboration with Gert van Zyl and Imogen Wright (University of Stellenbosch) and John M. Coffin (Tufts University), HIV DRP investigators Mary F. Kearney, Wei-Shau Hu, and Michael Bale and bioinformatics analyst Wei Shao (Advanced Biomedical Computing Center, Leidos Biomedical Research, Inc.) developed HIVIntact as a Python-based software tool that characterizes genomic defects in near full-length HIV-1 sequences, allowing putative intact genomes to be identified in silico (described in Retrovirology 18: 16, 2021).  Unlike other applications that assess the genetic intactness of HIV genomes, this tool can be incorporated into existing sequence-analysis pipelines and applied to large next-generation sequencing datasets.
The HIVIntact pipeline and test data may be downloaded from a public GitHub repository under an open-source MIT license.
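As a rough illustration of the kind of in silico check described above, the following Python sketch flags an open reading frame as defective when it shows a large length change, a frameshift, or a premature stop codon.  It does not use or reproduce the HIVIntact code or its command-line interface; the function names, the 5% length tolerance, and the toy sequences are assumptions chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_defects(orf_seq, ref_len, max_len_deviation=0.05):
    """List defect labels for one ORF nucleotide sequence.

    orf_seq: ORF nucleotides (alignment gaps "-" are stripped).
    ref_len: length of the same ORF in a reference genome.
    max_len_deviation: hypothetical tolerance for length changes.
    """
    seq = orf_seq.upper().replace("-", "")
    defects = []
    if len(seq) < ref_len * (1 - max_len_deviation):
        defects.append("deletion")
    elif len(seq) > ref_len * (1 + max_len_deviation):
        defects.append("insertion")
    if len(seq) % 3 != 0:
        defects.append("frameshift")
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # The final codon is expected to be the natural stop, so skip it.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        defects.append("premature_stop")
    return defects


def is_putatively_intact(orfs):
    """orfs: dict mapping ORF name to (sequence, reference length)."""
    return all(not orf_defects(seq, ref_len) for seq, ref_len in orfs.values())


print(orf_defects("ATGAAATAA", 9))
print(orf_defects("ATGTAAAAATAA", 12))
print(is_putatively_intact({"toy_orf": ("ATGAAATAA", 9)}))
```

A genome that passes such checks is only putatively intact; as noted above, replication competence can be confirmed conclusively only in cell culture.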

NanoHIV

HIV-1 proviral single-genome sequencing by limiting-dilution PCR amplification is important for differentiating sequence-intact from defective proviruses that persist during antiretroviral therapy (ART).  Intact proviruses may rebound if ART is interrupted and are the barrier to an HIV cure.  Oxford Nanopore Technologies (ONT) sequencing offers a promising, cost-effective approach to the sequencing of long amplicons such as near full-length HIV-1 proviruses, but the high diversity of HIV-1 and ONT sequencing errors render analysis of the generated data difficult.  Mary F. Kearney (HIV DRP) collaborated with Imogen Wright and Gert van Zyl (University of Stellenbosch) to develop NanoHIV as a new tool that uses an iterative consensus generation approach to construct accurate, near full-length HIV-1 proviral single-genome sequences from ONT data (described in Cells 10: 2577, 2021).  The NanoHIV pipeline and scripts may be downloaded from a public GitHub repository at
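The general shape of an iterative consensus loop can be sketched in Python as follows.  This is not the NanoHIV pipeline: the function names are hypothetical, the consensus is a simple column-wise majority vote, and the read aligner is passed in as a callable because a real workflow would use a dedicated long-read aligner.  In the demonstration, the placeholder aligner simply returns equal-length reads unchanged, so convergence is trivial; it serves only to show the control flow.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def iterative_consensus(reads, reference, align, max_rounds=10):
    """Refine a consensus by realigning reads to it until it stops changing.

    align: callable (reads, template) -> list of equal-length aligned
    reads; hypothetical stand-in for a real long-read aligner.
    Returns (consensus, number of rounds performed).
    """
    consensus = reference
    for round_number in range(1, max_rounds + 1):
        aligned = align(reads, consensus)
        updated = majority_consensus(aligned)
        if updated == consensus:
            break  # converged: realignment no longer changes the result
        consensus = updated
    return consensus, round_number


# Toy demonstration with a placeholder aligner (assumption: the reads
# are already the same length, so no real alignment is performed).
reads = ["ACGT", "ACGT", "ACGA", "TCGT"]
consensus, rounds = iterative_consensus(reads, "AAAA", lambda r, template: r)
print(consensus, rounds)
```

Starting from a distant reference and realigning to each updated consensus is what lets such a loop cope with highly diverse templates, since each round aligns reads to something closer to the true sequence.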