There has been considerable hype surrounding the potential applications of CRISPR/Cas9 in molecular biology [1]. Much of the excitement is certainly warranted; however, there are still quite a few challenges to overcome with this methodology.
One particular challenge is performing gene knock-ins. There are a number of reasons for this, including specificity and efficiency, but a major problem seems to stem from the GC content of a particular construct. More specifically, sequences highly enriched in G’s and C’s readily form secondary structures, which are notoriously difficult to work with [2]. This seems to be more of an open secret than a well-characterized phenomenon in molecular biology.
In our particular case, we are interested in inserting a GFP construct into various genes of interest so that protein fluctuations can be tracked in live cells. The goal is for this to be as non-invasive as possible. Thus, we are only interested in inserting the GFP constructs at either the beginning or the end of a gene’s coding sequence.
For this assignment, your goal is to determine the %GC content of the regions flanking the start and stop codons of all coding sequence (CDS) features in human chromosomes 22 and Y, and the mitochondrial chromosome. For each CDS feature, there are 4 regions that you will need to find: the 750 bp regions flanking (both upstream and downstream of) the start and stop codons. The accession numbers are:
NC_000022.11
NC_000024.10
NC_012920.1
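The four regions per CDS can be sketched with plain coordinate arithmetic. The block below is a minimal illustration, not a reference solution. The function names, the 0-based half-open coordinate convention, and the choice to count the codon itself as part of the inner window are all assumptions. Windows are clipped at the sequence ends, so wrap-around on the circular mitochondrial genome is ignored. Because %GC is the same on both strands, the windows are reported on the forward strand and no reverse complement is taken.

```python
def gc_percent(seq):
    """Return the %GC of a DNA string, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # None signals a window with no unambiguous bases (or an empty window).
    return 100.0 * gc / total if total else None


def flank_windows(start, end, strand, seq_len, flank=750):
    """Return the four (lo, hi) windows around a CDS on the forward strand.

    start/end are the 0-based half-open bounds of the whole CDS. On the
    minus strand the start codon sits at the high-coordinate end, so the
    upstream/downstream labels are swapped accordingly.
    """
    def clip(lo, hi):
        # Assumption: truncate at the sequence ends instead of wrapping.
        return max(0, lo), min(seq_len, hi)

    low_side = {"outside": clip(start - flank, start),
                "inside": clip(start, start + flank)}
    high_side = {"inside": clip(end - flank, end),
                 "outside": clip(end, end + flank)}
    if strand == -1:
        start_side, stop_side = high_side, low_side
    else:
        start_side, stop_side = low_side, high_side
    return {
        "start_upstream": start_side["outside"],
        "start_downstream": start_side["inside"],
        "stop_upstream": stop_side["inside"],
        "stop_downstream": stop_side["outside"],
    }
```

With a parsed record in hand, each window `(lo, hi)` could be passed as `gc_percent(str(record.seq[lo:hi]))`, where `record` is a hypothetical name for the parsed GenBank record.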
The first thing to do is use Biopython to fetch and parse the .gb files and count the number of CDS features. Be sure to fetch the fully annotated .gb file by using:
gb = Entrez.efetch(db='nucleotide',
                   id='NC_000022.11',
                   rettype='gbwithparts',
                   retmode='text')
Your script should accept the three accessions as arguments like so:
./assign14.py NC_000022.11 NC_000024.10 NC_012920.1
Your output should look like:
NC_000022.11 raw CDS count: 2580
NC_012920.1 raw CDS count: 13
NC_000024.10 raw CDS count: 325
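One possible shape for the fetch-and-count step is sketched below. It assumes Biopython is installed and that the machine has network access. The function names and the placeholder e-mail address are illustrative and not part of the assignment. The counting helper only looks at each feature's `type` attribute, so it can be tried without downloading anything.

```python
def fetch_record(accession, email="you@example.org"):
    """Download one fully annotated GenBank record from NCBI (needs network)."""
    # Imported here so the counting helper below works without Biopython.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # placeholder address; NCBI asks for a real one
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gbwithparts", retmode="text")
    try:
        # Each accession returns a single record, so SeqIO.read is enough.
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()


def count_cds(features):
    """Count the features whose type is 'CDS'."""
    return sum(1 for f in features if f.type == "CDS")


def main(argv):
    """Print the raw CDS count for each accession given on the command line."""
    for accession in argv:
        record = fetch_record(accession)
        print(f"{accession} raw CDS count: {count_cds(record.features)}")
```

Adding `import sys` and a final call to `main(sys.argv[1:])` would turn this into the command-line script described above. The chromosome 22 record is large, so the download may take a while.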
For each chromosome, plot a histogram of the %GC content for each of the 4 regions.
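A four-panel figure per chromosome is one way to lay out these histograms. The sketch below assumes matplotlib is available and that the %GC values have already been collected into a dictionary keyed by region name. The 5% bin width, the 2x2 layout, and all function and parameter names are arbitrary choices for illustration.

```python
def bin_gc(values, width=5):
    """Count %GC values into fixed-width bins covering 0-100."""
    n_bins = 100 // width
    counts = [0] * n_bins
    for v in values:
        if v is None:
            continue  # skip windows that had no unambiguous bases
        idx = min(int(v // width), n_bins - 1)  # 100% goes in the last bin
        counts[idx] += 1
    return counts


def plot_gc_histograms(gc_by_region, title, out_path, width=5):
    """Draw one histogram panel per region and save the figure to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
    left_edges = [i * width for i in range(100 // width)]
    # zip stops after four panels, matching the four regions per CDS.
    for ax, (region, values) in zip(axes.flat, gc_by_region.items()):
        ax.bar(left_edges, bin_gc(values, width), width=width, align="edge")
        ax.set_title(region)
        ax.set_xlabel("%GC")
        ax.set_ylabel("CDS features")
    fig.suptitle(title)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

Calling `plot_gc_histograms(gc_by_region, "NC_000022.11", "NC_000022.11_gc.png")` would write one image for chromosome 22; the output file name here is hypothetical.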
If you look at the gene names, you will see that the same gene name often corresponds to many different CDS features. So now count the number of unique gene names within the CDS features:
NC_000022.11 raw CDS count: 2580
NC_000022.11 unique gene ID count: 531
NC_012920.1 raw CDS count: 13
NC_012920.1 unique gene ID count: 13
NC_000024.10 raw CDS count: 325
NC_000024.10 unique gene ID count: 71
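Counting unique gene names comes down to collecting one qualifier into a set. The sketch below assumes the name lives in the `gene` qualifier of each CDS feature and that features without one can be skipped. Whether the expected counts are based on this qualifier or on another identifier, such as a GeneID cross-reference, is not stated here, so treat that choice as an assumption.

```python
def unique_gene_names(features):
    """Return the set of gene names found on CDS features."""
    names = set()
    for f in features:
        if f.type != "CDS":
            continue
        # Qualifier values are lists in Biopython. CDS features without a
        # /gene qualifier are skipped here, which is an assumption.
        names.update(f.qualifiers.get("gene", []))
    return names
```

The count for the output line is then `len(unique_gene_names(record.features))`, with `record` standing for a parsed GenBank record.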
You can see that there are quite a few genes with multiple CDS annotations. It turns out this is because many genes undergo alternative splicing. What this means for our purposes is that some of these alternatively spliced genes potentially have different start and stop coordinates. Now determine how many of these CDS features have unique start and stop coordinates:
NC_000022.11 raw CDS count: 2580
NC_000022.11 unique gene ID count: 531
NC_000022.11 unique : 1246
NC_012920.1 raw CDS count: 13
NC_012920.1 unique gene ID count: 13
NC_012920.1 unique : 13
NC_000024.10 raw CDS count: 325
NC_000024.10 unique gene ID count: 71
NC_000024.10 unique : 168
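One reading of "unique start and stop coordinates" is to treat two CDS features as duplicates when they share the same outer bounds and strand. The sketch below follows that reading and keeps the first feature seen for each key. Other definitions are possible (for example, also requiring the same gene name), and they may give different counts, so the key used here is an assumption.

```python
def unique_coordinate_features(features):
    """Keep the first CDS seen for each distinct (start, end, strand)."""
    seen = {}
    for f in features:
        if f.type != "CDS":
            continue
        loc = f.location
        # For spliced (compound) locations, start and end are the outer
        # bounds of the whole CDS, which is what the flanks are built from.
        key = (int(loc.start), int(loc.end), loc.strand)
        seen.setdefault(key, f)
    return list(seen.values())
```

The length of the returned list gives the count, and the list itself can be reused as the input for the filtered histograms.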
Lastly, plot a histogram for each chromosome showing the %GC content as above, but include only CDS features with unique start and stop coordinates.
1. Wright AV, Nunez JK, Doudna JA. 2016. Biology and applications of CRISPR systems: Harnessing nature’s toolbox for genome engineering. Cell 164:29-44.
2. Yakovchuk P, Protozanova E, Frank-Kamenetskii MD. 2006. Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Research 34:564-574.