Bioinformatic Analysis Training
Code locations:
ChIP-seq: /data/khanlab/projects/ChIP_seq/projects/
RNA-seq: /data/khanlab/projects/ChIP_seq/RNA_DATA/RNA_projects/
Section 1: Navigating Biowulf, Pipeline Output, and Unsupervised Correlation
Video 1.1: What is Biowulf? What is the pipeline?
-
ChIPseq Bioinformatic Pipeline Overview (0:00)
-
Locate your data (2:45)
-
Input metadata into ChIP_seq_samples.xlsx (4:20)
-
Example of pipeline output (8:15)
-
Viewing files in IGV (8:50)
-
Load from ENCODE (10:10)
-
Choosing a p-value (12:10 and 21:40)
-
GREAT ontology (14:35)
-
Viewing summary files to determine total number of peaks (20:15)
-
Homer motif analysis (23:10)
-
Identifying mutations from ATACseq data (27:50)
Video 1.2: Summarize peaks
-
Where to find scripts (1:10)
-
Logging on to Biowulf and running scripts (2:15)
-
In PuTTY, enter biowulf2.nih.gov as the host name and click “Open”
-
Answer the username and password prompts
-
-
Navigating in Biowulf (4:40)
-
type: cd /data/khanlab/projects/ChIP_seq/
-
-
Code_Builder (8:40)
-
***The code_builder created during these videos is saved on the Google Drive***
-
-
Summarize peaks across samples (10:40)
-
Find ChIP_seq_samples: khanlab/projects/ChIP_seq/manage_samples
-
-
Using summarized peaks to determine p-value (21:40)
-
Summary of other analyses that can be done with ATACseq data (28:00)
Video 1.3: Run an unsupervised correlation
-
Figures produced by unsupervised correlation of ATACseq data (0:00)
-
What is the folder path for correlations? (0:45)
-
What files are generated? (1:30)
-
Code_Builder for correlation (2:30)
-
Create sample files using RStudio (4:00)
-
Defining different p-values for your samples for all downstream analysis (14:00)
-
Run correlation on created sample files using Biowulf (30:15)
-
***Ultimately does not work; skip to Video 1.4***
-
-
Checking your “queue” on Biowulf (31:40)
-
Output of correlation (38:00)
Video 1.4:
-
What was done in Video 1.3 (0:00)
-
Code_builder explanation (1:10)
-
How to format bed, bam, and sample list files to be Unix compatible (1:30)
-
Checking the slurm (5:50)
-
Viewing peaks from generated beds in IGV (11:00)
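For the Unix-compatibility step (1:30): list files saved on Windows usually end each line with a carriage return plus a newline, and that stray carriage return can make Unix tools misread file names. The sketch below is one common way to detect and strip it, not necessarily the method shown in the video. The file name sample_list.txt is a hypothetical stand-in, and everything runs in a scratch directory.

```shell
# Work in a scratch directory so no real project file is touched.
workdir=$(mktemp -d)

# Hypothetical sample list as saved on Windows: each line ends in CR+LF.
printf 'sample_A\r\nsample_B\r\n' > "$workdir/sample_list.txt"

# Count the lines that still carry a carriage return.
cr=$(printf '\r')
grep -c "$cr" "$workdir/sample_list.txt"

# Strip every carriage return into a new, Unix-style copy.
tr -d '\r' < "$workdir/sample_list.txt" > "$workdir/sample_list.unix.txt"

# Re-check: grep -c prints 0 (and exits non-zero) when nothing matches.
crlf_left=$(grep -c "$cr" "$workdir/sample_list.unix.txt" || true)
echo "lines with a carriage return after conversion: $crlf_left"
```

Where it is installed, the dos2unix utility performs the same conversion in place.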
Section 2: Running BCHNV
Video 2.1:
-
What is BCHNV? (0:00)
-
Gathering necessary (configuration, run, etc.) files (7:00)
-
Code_Builder BCHNV (9:20)
-
Customizing BCHNV run parameters in the shell script and code_builder (11:50)
-
Editing configuration file (22:20)
-
Determining color scheme (31:20)
-
Gathering .bed files (38:00)
-
Edit permissions (49:00)
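For the permissions step (49:00), assuming it refers to standard Unix file permissions, chmod is the usual tool. The sketch below uses a hypothetical script name, run_BCHNV.sh, in a scratch directory; the permissions your own run needs may differ.

```shell
# Work in a scratch directory so no real project file is touched.
workdir=$(mktemp -d)

# Hypothetical run script; the real file name will differ.
script="$workdir/run_BCHNV.sh"
printf '#!/bin/bash\necho "BCHNV run would start here"\n' > "$script"

# Start from a known state: owner read/write, everyone else read-only.
chmod 644 "$script"
ls -l "$script"

# Let the owner execute the script, and let the group read and execute it.
chmod u+x,g+rx "$script"
ls -l "$script"

# Keep just the permission string (first 10 characters of the listing).
perms=$(ls -l "$script" | cut -c1-10)
echo "permissions now: $perms"
```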
Section 3: Gathering, Merging, and Combining Bed Files
Video 3.1: “bed tools”
-
Using ChIP_seq_samples.xlsx to gather beds (0:00)
-
Intro to RStudio (2:30)
-
Gathering .beds with RStudio (6:00)
-
Merge .beds using bedtools merge, how to count, etc (20:00)
-
Running script on Biowulf (29:00)
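For the merge-and-count step (20:00), a minimal sketch of bedtools merge on a made-up peak file. The intervals and file names are hypothetical. The sketch checks that bedtools is on the PATH, since on a module-based cluster it may need to be loaded first.

```shell
# Work in a scratch directory so no real project file is touched.
workdir=$(mktemp -d)

# Hypothetical peaks, deliberately unsorted; two chr1 intervals overlap.
printf 'chr1\t500\t600\nchr1\t100\t200\nchr2\t50\t80\nchr1\t150\t300\n' \
  > "$workdir/peaks.bed"

# bedtools merge expects input sorted by chromosome, then by start position.
sort -k1,1 -k2,2n "$workdir/peaks.bed" > "$workdir/peaks.sorted.bed"

if command -v bedtools >/dev/null 2>&1; then
  # -c 1 -o count appends the number of input intervals behind each merged region.
  bedtools merge -i "$workdir/peaks.sorted.bed" -c 1 -o count \
    > "$workdir/peaks.merged.bed"
  # Number of merged regions = number of lines in the output.
  grep -c '' "$workdir/peaks.merged.bed"
else
  echo "bedtools not found on PATH; load or install it before merging" >&2
fi
```

Sorting first matters: on unsorted input, bedtools merge either stops with an error or fails to combine overlapping intervals that are not adjacent in the file.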
Video 3.2: Combining beds with “cat”
-
How to combine bed files that are not in the ChIP_seq_samples.xlsx Excel spreadsheet
-
Shortcut to combine all files with the same ending in a folder (2:40)
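The same-ending shortcut (2:40) can be sketched with a shell wildcard, assuming it refers to cat with a glob pattern. The file names below are hypothetical. The combined file is written to a separate folder so that a second run does not match the pattern and fold the output back into itself.

```shell
# Work in a scratch directory so no real project file is touched.
workdir=$(mktemp -d)
mkdir "$workdir/beds" "$workdir/combined"

# Hypothetical inputs: two peak files sharing an ending, plus an unrelated file.
printf 'chr1\t100\t200\n' > "$workdir/beds/sample_A_peaks.bed"
printf 'chr2\t50\t80\nchr3\t10\t90\n' > "$workdir/beds/sample_B_peaks.bed"
printf 'not a bed file\n' > "$workdir/beds/notes.txt"

# The wildcard picks up every file ending in _peaks.bed and nothing else.
cat "$workdir"/beds/*_peaks.bed > "$workdir/combined/all_peaks.bed"

# Count the combined intervals.
combined_lines=$(grep -c '' "$workdir/combined/all_peaks.bed")
echo "combined intervals: $combined_lines"
```

Concatenation alone leaves overlapping intervals in place, so a combined file like this is typically sorted and merged afterwards.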
Section 4: Downloading Data from Gene Expression Omnibus (GEO)
Video 4.1: RNA data
-
Downloading RNA data using ChIPstack and the ChIPseq pipeline (0:00)
Video 4.2: ChIPseq or ATACseq data
-
Downloading ChIP or ATAC data using ChIPstack and the ChIPseq pipeline (0:00)
Section 5: Processing and Analyzing RNAseq Data (NOT main Khan Lab Pipeline)
Video 5.1:
-
Overview on pipeline output (0:00)
-
RNAseq Code Builder (3:40)
-
Putting RNA samples in ChIP_seq_samples.xlsx (6:45)
Video 5.2:
-
Review of RNAseq pipeline outputs (0:00)
-
Locating R scripts (1:40)
-
TPM vs. FPKM (4:30)
-
Viewing values in R (7:55)
-
Build Matrix (11:00)
-
Make a gene expression heat map (13:40)
-
Create a comparison scatter plot (16:20)
-
GSEA rank list maker (24:30)
-
Download GSEA (39:40)
-
software.broadinstitute.org/gsea/login.jsp
-
-
Heat map for smaller gene sets (48:00)
-
GSEA Output (57:55)
Video 5.3:
-
Where to find/What is RNA_projects (0:00)
-
Notes:
-
Clear environment in R = “Ctrl” + “Shift” + “F10”
-
View a “value” in R = start typing the name, “up arrow”, “enter”
-
-
-
Create TPM Matrix for RNAseq data (1:45)
-
Inversion grep (14:09)
-
“invert=T”
-
-
Visualize with a heatmap (18:10)
-
Install “pheatmap” (19:08)
-
Heatmap with only genes of interest (23:20)
-
Change subset, meaning select columns from matrix (25:30)
-
-
Confirming expression level with IGV (29:20)
-
VERY important when using Access RNAseq data
-
Look for 5’UTRs and long exons without probes
-
-
Identify housekeeping genes (34:00)
-
K:/projects/ChIP_seq/RNA_DATA/RNA_projects/Genesets/Qlucore_format/HouseKeeping_genes.txt
-
From Cancer Discovery paper… a compilation of a variety of normal tissues
-
-
Define housekeeping genes within your own sample set (39:00)
-
Use a list to compare 2 gene sets (44:00)
-
Transcription Factors gene set - Based on the Broad’s + Berkeley’s additions (49:00)
-
K:/projects/ChIP_seq/RNA_DATA/RNA_projects/Genesets/Qlucore_format/TFs_Epimachines.genelist.txt
-
-
Find specialized TFs for your sample set (52:30)
-
Edit heat map’s scale (1:08:15)
-
Overview of TF identification strategy (1:13:00)
-
Video summary/where to find files (1:19:40)
Video 5.4 (Continuation of Video 5.3):
-
Load in second set of samples (1:00)
-
Identify maximum in a matrix (3:30)
-
List out genes in a defined gene set (9:00)
-
Explanation of how we could theoretically ignore UTRs and compare multiple RNAseq data types (20:00)
-
What you can compare using Access RNAseq data (21:40)
-
Ideas of how to look at your data after you’ve filtered out multiple gene lists (26:00)
-
Using Microsoft Excel to look at your data and run a t-test, log scale, etc. (36:30)
Section 6: ChIPseq qPCR, Pipeline, and Analysis
Video 6.1: Designing ChIP qPCR Primers
-
Rationale on where to design primers (0:20)
-
Visualizing data in IGV (3:00)
-
Choosing locations (8:00)
-
“Designing Primers for ChIP.docx” (10:00)
-
follow links and instructions in file
-
-
Go to www.idtdna.com/pages
-
use Young’s account info, and tell him you have primers to order
-
25 nmole DNA Oligo with Standard Desalting
-
-
Add primers to qPCR_primers_v2.bed using genome.ucsc.edu (18:30)
Video 6.2: How to run ChIP qPCR
-
Where do I find the files and which files are there? (1:00)
-
Setting up the plate in Excel (3:00)
-
Run qPCR following the protocol in “ChIP-qPCR ViiA7 protocol.docx” (12:00)
Video 6.3: Auto-launching ChIPseq Pipeline
-
See “Video 1.1” for a more detailed overview of the pipeline output
-
Input metadata (0:10)
-
preferably before sequencing run
-
-
Importance of file naming (3:00)
-
How to manually launch the pipeline (7:30)
-
SEE YOUNG OR JUN FIRST! (not for a normal situation)
-
Video 6.4: Gathering Homer Motifs and Summarizing Enhancers
-
Where to find files and creating input file (1:00)
-
Viewing all motif data based on all called peaks in Excel (3:50)
-
***Note: mistake made in sorting, corrected at (16:00)***
-
-
Look at homerMotifs.R to see scripts that you can use to manipulate the data (7:00)
-
can upload the created matrix into R and use pheatmap
-
-
Correlating to Coltron Output (13:30)
-
“DEGREE_Table”
-
Use to tease out which TFs are likely more significant
-
what is expressed?
-
what has super enhancers?
-
-
Video 6.5: Compare motif enrichment across various groups of samples
-
R script = gatherBCHN_motifs.R (0:15)
-
Gather files (7:00) and grep
-
Make matrix (11:00)
-
Heatmap data (13:30)
-
Filter for highly enriched motifs (15:00)
-
-
“Wide format”? (11:30)
-
Using generated “allmotifs.wide.text” file with RNAseq data (21:50)
Section 7: Making and Editing Figures
Video 7.1: Figures for Western Blots in Adobe Illustrator
-
“masking” Illustrator (0:00)
-
object > clipping mask > make
-
double click inside the box to see inside mask (9:50)
-
-
“rectangle tool” (3:00)
-
“rotation” of a bitmap (4:00)
-
“align” (9:00)
-
Can make any line straight by using the “Shift” key
-