There has been considerable hype surrounding the potential applications of CRISPR/Cas9 in molecular biology¹. Much of the excitement is certainly warranted; however, there are still quite a few challenges to overcome with this methodology.

One particular challenge is performing gene knock-ins. There are a number of reasons for this, including specificity and efficiency, but a major problem seems to stem from the GC content of the construct. More specifically, sequences highly enriched in G’s and C’s readily form secondary structures, which are notoriously difficult to work with². This seems to be more of an open secret than a well-characterized phenomenon in molecular biology.

In our particular case, we are interested in inserting a GFP construct into various genes of interest so that protein fluctuations can be tracked in live cells. The goal is for this to be as non-invasive as possible, so we are only interested in inserting the GFP construct at either the beginning or the end of a gene’s coding sequence.

Assignment

For this assignment, your goal is to determine the %GC content of the regions flanking the start and stop codons of every coding sequence (CDS) feature in human chromosomes 22 and Y and in the mitochondrial chromosome. For each CDS feature, there are 4 regions that you will need to find: the 750 bp regions immediately upstream and downstream of the start codon and of the stop codon. The accession numbers are:
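As a starting point, the four windows and their %GC could be computed along the following lines. This is a minimal sketch, not a reference solution: the function names `gc_percent` and `flank_windows` are made up for illustration, coordinates are assumed to be 0-based and end-exclusive (as Biopython reports them), the boundary of the CDS is taken as the dividing line between "upstream" and "downstream", ambiguous bases such as N are ignored, and windows are simply clipped at the sequence ends (which ignores the fact that the mitochondrial genome is circular).

```python
def gc_percent(seq):
    """Return the %GC of a DNA string, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # An all-N (or empty) window has no defined GC content.
    return 100.0 * gc / total if total else float("nan")


def flank_windows(start, end, strand, seq_len, flank=750):
    """Return the four (begin, stop) windows around one CDS.

    start/end are forward-strand coordinates of the whole CDS span.
    On the minus strand the start codon sits at the `end` side, so the
    labels are swapped accordingly.
    """
    def clip(a, b):
        # Keep windows inside the sequence (assumption: no wrap-around).
        return max(0, a), min(seq_len, b)

    left_out, left_in = clip(start - flank, start), clip(start, start + flank)
    right_in, right_out = clip(end - flank, end), clip(end, end + flank)
    if strand == -1:
        return {"start_upstream": right_out, "start_downstream": right_in,
                "stop_upstream": left_in, "stop_downstream": left_out}
    return {"start_upstream": left_out, "start_downstream": left_in,
            "stop_upstream": right_in, "stop_downstream": right_out}
```

With a parsed Biopython record, each window `(a, b)` could then be scored as `gc_percent(str(record.seq[a:b]))`. No reverse complement is needed for minus-strand features, because a sequence and its reverse complement contain the same number of G and C bases.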

The first thing to do is use Biopython to fetch and parse the .gb files and count the number of CDS features. Be sure to fetch the fully annotated .gb file by using:

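from Bio import Entrez

Entrez.email = 'you@example.org'  # placeholder; NCBI asks for a contact address
gb = Entrez.efetch(db='nucleotide',
                   id='NC_000022.11',
                   rettype='gbwithparts',
                   retmode='text')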
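One possible way to wrap the fetch, parse and count steps is sketched below. The helper names `fetch_record` and `count_feature_types` are hypothetical, and the email address you pass in is your own; the sketch assumes each accession returns a single GenBank record, which is why `SeqIO.read` is used rather than `SeqIO.parse`.

```python
from collections import Counter


def count_feature_types(record):
    """Tally features by type for any object exposing .features with .type."""
    return Counter(feature.type for feature in record.features)


def fetch_record(accession, email):
    """Download and parse one GenBank record (needs Biopython and network)."""
    # Imported here so count_feature_types can be used without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gbwithparts", retmode="text")
    try:
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()


# Example use (downloads a large file, so it is left commented out):
# record = fetch_record("NC_000022.11", "you@example.org")
# print(count_feature_types(record)["CDS"])
```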

For each chromosome, plot a histogram of the %GC content of each of the 4 regions.
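The plotting step might look something like the sketch below, which assumes matplotlib is installed and that you have already collected one dictionary of region name to %GC per CDS feature. The function names, the 2 x 2 panel layout and the choice of 50 bins are all illustrative assumptions, not requirements of the assignment.

```python
def collect_by_region(per_cds):
    """Turn [{region: gc, ...}, ...] into {region: [gc, ...]}, dropping NaN."""
    grouped = {}
    for row in per_cds:
        for region, gc in row.items():
            if gc == gc:  # NaN is the only value that is not equal to itself
                grouped.setdefault(region, []).append(gc)
    return grouped


def plot_gc_histograms(gc_by_region, title, outfile):
    """Draw one histogram panel per region and save the figure to outfile."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)
    for ax, (region, values) in zip(axes.flat, gc_by_region.items()):
        ax.hist(values, bins=50, range=(0, 100))
        ax.set_title(region)
        ax.set_xlabel("%GC")
        ax.set_ylabel("CDS features")
    fig.suptitle(title)
    fig.tight_layout()
    fig.savefig(outfile)
    plt.close(fig)
```

Fixing the x-axis range to 0-100 keeps the panels directly comparable across regions and chromosomes.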

If you look at the gene names, you will see that the same gene name often corresponds to many different CDS features. So now count the number of unique gene names within the CDS features:
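Counting the distinct names can be done with a set. The sketch below assumes the gene name is stored in the `gene` qualifier of each CDS feature (Biopython exposes qualifiers as a dictionary whose values are lists) and that features lacking this qualifier should simply be skipped; `unique_gene_names` is an illustrative name.

```python
def unique_gene_names(features):
    """Return the set of gene names found on CDS features."""
    names = set()
    for feature in features:
        if feature.type != "CDS":
            continue
        # Qualifier values are lists; a missing qualifier contributes nothing.
        names.update(feature.qualifiers.get("gene", []))
    return names


# Example use with a parsed record:
# print(len(unique_gene_names(record.features)))
```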

You can see that there are quite a few genes with multiple CDS annotations. It turns out this is because many genes undergo alternative splicing. What this means for our purposes is that some of these alternatively spliced genes potentially have different start and stop coordinates. Now determine how many of these CDS features have unique start and stop coordinates:
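One way to read "unique start and stop coordinates" is that each distinct (start, end, strand) combination should be counted once, however many CDS annotations share it. The sketch below follows that interpretation, which is an assumption; it keeps the first feature seen for each combination so the result can be reused for the filtered histograms. `unique_cds_coordinates` is an illustrative name.

```python
def unique_cds_coordinates(features):
    """Map each distinct (start, end, strand) to the first CDS that has it."""
    seen = {}
    for feature in features:
        if feature.type != "CDS":
            continue
        loc = feature.location
        # For spliced (multi-part) locations, start/end span the whole CDS.
        key = (int(loc.start), int(loc.end), loc.strand)
        seen.setdefault(key, feature)
    return seen


# Example use with a parsed record:
# print(len(unique_cds_coordinates(record.features)))
```

Including the strand in the key avoids merging two features that happen to cover the same span on opposite strands.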

Lastly, plot the same %GC histograms for each chromosome as above, this time including only the CDS features with unique start and stop coordinates:

1. Wright AV, Nunez JK, Doudna JA. 2016. Biology and applications of CRISPR systems: Harnessing nature’s toolbox for genome engineering. Cell 164:29-44.
2. Yakovchuk P, Protozanova E, Frank-Kamenetskii MD. 2006. Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Research 34:564-574.

GC Content of Up/Down Stream Regions Flanking Genes
