Mindful Human Capital
  • Case Studies (Offline)
  • Tutoring
  • About

test

---

title: “\”R-omics\“: Using R for Bioinformatics”

subtitle: “R-Ladies MTL October 2023 Meetup”

date: “2023-10-30”

format:

revealjs:

embed-resources: true

editor: visual

---

## R-Ladies MTL October 2023 Meetup

Welcome! Today’s agenda:

6:00 - 6:15: Settling in

6:15 - 6:30: Introductions and call for presenters!

6:30 - 7:30: “R-omics”: Using R for bioinformatics

```

- Includes 20+ min break for exercises

```

7:30 - 8:00: R-help/networking period

## R-Ladies

R Ladies Global seeks to achieve proportionate representation by encouraging, inspiring, and empowering women and gender minorities currently underrepresented in the R community

We focus on the R language/environment (but if you want to talk about python or SAS, you won’t be banned)

All members must follow the [RLadies Code of Conduct](https://github.com/rladies/starter-kit/blob/master/RLadiesRulesAndGuideLinesENES.md#CoCEN)

- Community driven, independent, free of charge

- Safe and friendly, zero harassment!

## Introductions: Kathryn Morrison

![](images/kathryn.jpg){fig-align=“center”}

## Introductions: Monica Cella

![](images/Monica.jpg){fig-align=“center”}

## Introductions: Rhiannon Kamstra

![](images/rhiannon.jpg){fig-align=“center”}

## Following along

Slides available from (link also in Meetup chat):

https://github.com/rladies/meetup-presentations_montreal/20231030-using-r-for-bioinformatics

Scripts for exercises and demo also available.

## What is bioinformatics?

![](images/bioinformatics_venn_diagram.png){fig-align=“center”}

------------------------------------------------------------------------

![](images/2022_Sequencing_cost_per_Human_Genome.jpg){fig-align=“center”}

![](2022_Sequencing_cost_per_Human_Genome.jpg){fig-align=“center”}

## Data sources

- Often complex, high-throughput technology

- “-omics” examine large pools of biological molecules [![](images/Overview-of-different-omics-sciences-such-as-genomics-transcriptomics-and-proteomics.png)](https://www.researchgate.net/figure/Overview-of-different-omics-sciences-such-as-genomics-transcriptomics-and-proteomics_fig1_333003279)\

------------------------------------------------------------------------

![](images/From-genome-to-metabolome-genomics-is-a-relatively-mature-field-but-the-more-complex-of.jpg){fig-align=“center”}

## Tools for biological data

- Bioinformatics is broad and rapidly evolving

- Selection of tools and/or languages depends on application

- Applications with a **G**raphical **U**ser **I**nterface are common (e.g., Clustal for alignment)

![](images/gr1.jpg){fig-align=“center”}

- As are command-line tools

## R in bioinformatics

- R is open-source and vastly extendable via packages

- Data wrangling, exploration, analysis, and visualization all in once place

- Supports reproducible science and knowledge sharing

  • Workflows can be scaled and shared by writing scripts or packages

## Bioconductor

![](images/Bioconductor.png){fig-align=“center”}

- Started in 2001

- 2nd largest **R package repository** after the CRAN

  • \~2.3k packages in v3.18!

- Focused on tools for analyzing data from biological assays

## Installing Bioconductor packages

1. Install Bioconductor’s package manager

``` r

if (!require(“BiocManager”, quietly = TRUE))

install.packages(“BiocManager”)

```

2. Search for available packages

``` r

# List all packages

BiocManager::available()

# List only packages containing “sapiens” in the name

BiocManager::available(pattern = “sapiens”)

```

You can also use [https://www.bioconductor.org/packages/]().

------------------------------------------------------------------------

3. Install Bioconductor packages

``` r

# Install a specific package, e.g. Biostrings

BiocManager::install(“Biostrings”)

# Install all core Bioconductor packages (may take a while!)

BiocManager::install()

```

**Note:** avoid using `install.packages()` to ensure you get the most up-to-date version

## Working with raw sequence data

- DNA, RNA, and amino acid (protein) sequences can be represented as strings

```{r}

#| echo=TRUE

# Letters represent bases G, C, T, A

dna <- “TAGCAG”

```

- Specialized object types can add functionality and improve performance (e.g., `DNAString`, `AAString`)

```{r}

#| echo=TRUE

# DNAString objects are used for DNA sequences

Biostrings::DNAString(dna)

```

## Use objects designed to work with multiple sequences

- Multiple sequences can be stored in a single vector

```{r}

#| echo=TRUE

# DNAStringSet objects contain multiple DNAString values

dna3 <- c(“TAGCAG”, “CCATTG”, “AAG”)

Biostrings::DNAStringSet(dna3)

```

## Base R functions are often still useful

- Check the number of values/sequences

``` r

more_dna <- Biostrings::DNAStringSet(c(“GC”, “TA”))

length(more_dna)

```

- Add/extract elements (sequences)

``` r

# Extract the 2nd sequence

more_dna[2]

# Add a 3rd sequence

more_dna[3] <- “AGGTAG”

```

------------------------------------------------------------------------

+-----------------------+---------------------------------------------------------+

| Biostrings Function | Purpose |

+=======================+=========================================================+

| `translate()` | Translate DNA/RNA to amino acids |

+-----------------------+---------------------------------------------------------+

| `complement()` | Get the (reversed) complementary sequence of base pairs |

| | |

| `reverseComplement()` | |

+-----------------------+---------------------------------------------------------+

| `alphabetFrequency()` | Count the frequency of each letter |

+-----------------------+---------------------------------------------------------+

| `matchPattern()` | Finds matching occurrences of a pattern |

+-----------------------+---------------------------------------------------------+

Example functions for working with sequences

## Exercises

1. Install the Bioconductor package `Biostrings` .

2. Store at least 2 DNA sequences in a DNAStringSet object.

3. Translate your DNA into amino acids.

**Bonus 1:** Try incorporating [ambiguous bases](https://www.ddbj.nig.ac.jp/ddbj/code-e.html) in your DNA. Test different `if.fuzzy.codon` options of `translate()`.

4. Count the number of methionine (M) residues in each sequence.

**Bonus 2:** Summarize the distribution across all of your sequences.

https://www.ddbj.nig.ac.jp/ddbj/code-e.html

## Working with special file formats

- Bioinformatics data often come in specialized file formats

- Common sequence data formats include:

  • FASTA

  • FASTQ - contains quality information

  • BAM

  • SAM - contains alignment to reference sequence

- `samtools`, `seqinr`, `ShortReads` are helpful packages for working with files

## Command-line tools

- Many useful tools are used from the command-line

  • E.g., [BLAST](https://www.ncbi.nlm.nih.gov/books/NBK569856/) to compare a human coronavirus 3C-like protease (partial) sequence against [NCBI’s protein database](https://www.ncbi.nlm.nih.gov/protein/)

``` unix

blastp -query aa_seq.fasta -remote -db nr -out results.txt -outfmt 6 -evalue 1e-30

```

First 10 lines of output:

``` vi

# BLASTP 2.9.0+

# Query: sars_partial

# RID: KZSA0UTY016

# Database: nr

# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score

# 500 hits found

sars_partial 7WHC_A 100.000 70 0 0 1 70 1 70 3.24e-44 151

sars_partial 7UJU_A 100.000 70 0 0 1 70 2 71 3.52e-44 151

sars_partial 7UJ9_A 100.000 70 0 0 1 70 1 70 4.77e-44 151

```

## Build and execute calls from R

- The same call from R

``` r

aa_seq <- “SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQA”

names(aa_seq) <- “sars_partial”

query_file <- “aa_seq.fasta”

out_file <- “results.txt”

# Write sequence as a FASTA file

seqinr::write.fasta(aa_seq, names(aa_seq), file.out = my_file)

# Construct call

blast_call <- paste0(“blastp -query”, query_file, ” -remote -db nr -out “, out_file,” -outfmt 7 -evalue 1e-30”)

# Execute

system(blast_call)

```

**Note:** Command-line tools (e.g., `blastp`) need to be installed and configured before use.

## High-dimensional data

- E.g., gene expression data from microarrays or RNAseq

  • Quantity of expression (e.g., \# of transcripts) x 1,000s of genes (features)

- Presents special challenges for analysis and visualization

  • E.g., batch effects, multiple testing, handling large datasets

- **Spatial biology** platforms map gene expression over tissue coordinates

## Demo: Mapping gene expression in a mouse brain

![10X Visium counts mRNA for each gene in each 50 micron spot on the grid](images/mouse_brain.png){fig-align=“center”}

## Demo: Mapping gene expression in a mouse brain

- See `seurat_demo.R` to follow along

- Based on Seurat vignette/tutorial available at:

  • [https://satijalab.org/seurat/articles/seurat5_spatial_vignette]()

© oday Sophie Strassmann

Cookie Preferences