Note: imported from an earlier version of this blog. Sorry if anything is broken.
How do we predict how RNA folds? First of all, we need to define what we mean by folding. RNA is a polymer made up of nucleic acids: adenine, guanine, cytosine, and uracil. There are 3 levels of RNA structure that are traditionally considered:
- primary structure:the sequence of bases, ‘A’, ‘G’, ‘C’, and ‘U’ that form the strand of RNA, ‘AAAAGGGGCCCCUUUU’ for example,
- secondary structure: which bases are hydrogen bonded together, ‘A’ to ‘U’ and ‘G’ to ‘C’, can be specified by a sequence of pairs corresponding to which indices are paired together (1,24), (2,23)…, and
- tertiary structure:how the molecule bends on itself on a larger scale.
Usually, primary structure is already known or given as an input and tertiary structure is hard to specify and compute the energy for. Secondary structure is easy to specify using a sequence of pairs and the energy is roughly dominated by hydrogen bonding between bases which is discrete. Since RNA functions mostly by binding to things using exposed (non-paired) bases, the secondary structure is also the most relevant when discerning function. For these reasons, biophysicists usually focus on predicting the secondary structure of RNA.
From the laws of thermal physics, if RNA is modeled as a system in thermal equilibrium then the probability of any state with free energy is equal to
where is the full set of states. The denominator is called the partition function, usually denoted , it normalizes the probability distribution and with it we can compute the probability of any structure. How many terms are in the partition function? How many states are in ? Depends on what assumptions you make. In full generality, even only allowing Watson-Crick base pairs, there are terms in that sum. However, if you assume that there are no crossing pairs, i.e. arcs that would intersect in the above diagram, or pairs such that , , and , there are only terms in the sum. This is still an exponential quantity, but it turns out it is actually computable in sub-exponential time via dynamic programming. This assumption is called the no-pseudoknot assumption, and it means that if I have an pair, then for all bases in-between them, I can ignore all the bases outside of them when enumerating structures. This begs the question whether it is a physically valid assumption. The answer is NO. There are many RNAs found in nature that have “crossing” pairs such as Group I and Group II introns. However, without this assumption, computing the partition function is intractable, so most RNA folding software packages make it.
While it is tempting to do analysis on a structure with the minimum free energy and highest probability, since RNA is a thermal system with an exponential multiplicity of states even the probability of the most probable state is extremely low and any quantity computed on even the most probable state has negligible contribution to the physical expected value of that quantity. However, I will show here that not only can we compute the partition function, but we can also sample from the Boltzmann distribution. Once we are able to do that we can compute arbitrary expected values on the full Boltzmann distribution via Monte Carlo integration. (In fact, Monte Carlo integration is not even needed if you can express your quantity in terms of the partition function values, which can give you full expected values without statistical error! But that is outside of the scope of this post).
Putting this all together, to predict the folding and general properties of RNA we need 4 things:
- a specification of a state of RNA ,
- an energy function , which takes a state and outputs a physical energy,
- the partition function, the sum of the Boltzmann factors over every possible state , and
- an algorithm for sampling structures from the Boltzmann distribution,
Quick Code Note
All code going forward will be in Haskell. You should be able to extract all the code here into a file and it should compile. To set up our environment, make sure to have this header at the top of the file:
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeFamilies #-}
import Data.Array
import Data.Time.Clock.POSIX
import Data.Colour
import System.Random
import Data.List
import Diagrams.Prelude
import Diagrams.Backend.SVG.CmdLine
States of RNA
Because of the no-pseudoknot assumption, we can actually specify the states using a recursive data structure.
data Structure = Paired Char Char (Structure, Structure) |
Slip Char Structure |
None
deriving Show
The thought behind this definition is that we are specifying the structure base by base. If base is unpaired, then we move on to the next base. If base is paired, then we fork the structure into the bases underneath the pair and the bases after the pair. We carry along the letter of the base for convenience and to compute the energies of any pairing that might happen.
Energy Model
A fully realistic energy model of RNA would be very difficult to compute. There are exposed ions everywhere on the strand which interact with each other and there is tension in the backbone. Many of these interactions are non-linear which breaks the recursion and therefore computability of the problem. In practice, there are several levels of complexity that are implemented which try to account for the non-linearities in a computable fashion (this actually is the only difference between professional RNA secondary structure prediction and an Algorithms homework problem). Here I will implement a very simple energy model: adding the free energies of each hydrogen bond. A hydrogen bond contributes about . Adenine-uracil bonds have 2 hydrogen bonds, guanine-cytosine bonds have 3. Therefore in the exponent of the Boltzmann factor, I will put -4.6 for A-U bonds and -6.9 for G-C bonds. Anything else is represented by infinite energy, meaning that it will never happen.
baseEnergy :: Char -> Char -> Double
baseEnergy a b = case (a, b) of
('A', 'U') -> -4.6
('G', 'C') -> -6.9
('U', 'A') -> -4.6
('C', 'G') -> -6.9
otherwise -> 1.0/0.0 -- non-allowed pairs treated like an infinite hill
energyModel :: Structure -> Double
energyModel (Paired a b (s1,s2)) =
(baseEnergy a b) + -- energy of pair
(energyModel s1) + -- energy of substructure under pair
(energyModel s2) -- energy of substructure after pair
energyModel (Slip _ s) = energyModel s
energyModel None = 0
Partition Function
The sum initially seems daunting. However, from the recursive definition of our energy model, we can easily see that the partition function can be defined recursively as well. To be precise, if we define as the sub-partition-function from to , it must be equal to the sum of the Boltzmann terms of all structures where is paired and all structures where is unpaired:
where . This becomes a standard dynamic programming problem. We can turn Z into a table of values with cells. If cell is on the bottom left, and on the top right, each row of the table only depends on the entries below and to the left. We can compute the full partition function by starting from the bottom left and working out to the top right in time. This is even parallelizable along the diagonals (I’m not going to parallelize it here).
type PFunArray = Array (Int, Int) Double
boltz :: Double -> Double
boltz x = exp (-x)
partitionFunction :: String -> PFunArray
partitionFunction strand = arr
where
len = length strand
arr = array ((0,0), (len - 1, len - 1))
[((i,j), (pfCell i j strand arr)) | i <- [0..len-1], j <- [i..len-1]]
pfCell :: Int -> Int -> String -> PFunArray -> Double
pfCell i j strand arr = if (i==j) then 1 else (sum pairTerms) + slipTerm
where
pairTerm i k = boltz (baseEnergy (strand !! i) (strand !!k)) *
(if k-i > 2 then arr ! (i+1, k-1) else 1) *
(if j-k > 1 then arr ! (k+1, j) else 1)
pairTerms = [pairTerm i k | k <- [(i+1)..j]]
slipTerm = if i == j then 1 else arr ! (i+1, j)
Sampling RNA Structures
To actually predict the structure of RNA molecules in nature, we can work backward through the partition function table to sample structures from the Boltzmann distribution. Consider the length of the strand along base to base ,
- the probability that is unpaired is equal to the sum of the probabilities of every structure along to with unpaired, or ,
- the probability that is paired to base is equal to the sum of the probabilities of every structure along to where and are paired, or ,
- completing the rest of the structure can be achieved by calling the sampling function recursively for every length of the strand left general, to for the unpaired case, to and to for the second case.
I encode this logic in Haskell below. At each level I roll a random number from 0 to and sort through the cases. When I pass cumulative probability greater than the number I’ve rolled I choose that case and recurse.
sampleStructure :: Int -> Int -> PFunArray -> String -> StdGen -> Structure
sampleStructure i j arr strand gen = if (i > j) then None else struct
where
-- setup pair terms
pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
(if k-i > 2 then arr ! (i+1, k-1) else 1) *
(if j-k > 1 then arr ! (k+1, j) else 1)
pairTerms = [pairTerm i k | k <- [(i+1)..j]]
-- structure functions
slipStructure genr = Slip (strand !! i) (sampleStructure (i+1) j arr strand genr)
innerPairStructure k genr = if k - i > 1 then sampleStructure (i+1) (k-1) arr strand genr else None
outerPairStructure k genr = if j - k > 0 then sampleStructure (k+1) (j) arr strand genr else None
pairStructure k genr = Paired (strand !! i) (strand !! k) (innerPairStructure k genr, outerPairStructure k genr)
-- (cumulative probability, structure) pairs
slipCase = (0, slipStructure) -- starting at offset of 0
pairCases = [(sum (take (k-i-1) pairTerms) + arr ! (i+1, j), pairStructure k) | k <- [(i+1)..j]]
allCases = slipCase:pairCases
-- sample a random number according to the current sub
(roll, newgen) = randomR (0, arr ! (i, j)) gen
-- if roll >= cumulative probability switch canidates, else not
checkCase current candidate = if roll >= (fst candidate) then (snd candidate) newgen else current
struct = foldl' checkCase None allCases
Drawing Structures
One last thing to do is to draw these structures, which I do using Brent Yorgey‘s cool diagrams library. Every base is a circle with a letter underneath, color coded by base, with backbone and arcs denoting bonds. This depicts RNA with the bonds fanned out and visible, whereas other depictions of RNA will try to keep the length of each bond pretty constant, which is a more physical picture.
colorMap a = case a of
'G' -> sRGB24read "#107896"
'C' -> sRGB24read "#829356"
'U' -> sRGB24read "#F26D21"
'A' -> sRGB24read "#C02F1D"
otherwise -> white
base :: Char -> Diagram B
base b = circle 1 # fc (colorMap b) <>
text [b] # fontSize (local 2) # (translateY (-2.6))
backbone = hrule 1
isNone :: Structure -> Bool
isNone s = case s of
None -> True
otherwise -> False
structLength :: Structure -> Int
structLength None = 0
structLength (Slip _ s) = 1 + structLength s
structLength (Paired _ _ (s1, s2)) = 2 + structLength s1 + structLength s2
renderStructure :: Structure -> Diagram B
renderStructure (Paired a b (s1, s2)) =
base a <>
translateX 1.5 backbone <>
translateX 3 innerElement <>
translateX (3.0*k') (base b) <>
translateX (3.0*k'+3) outerElement <>
arc xDir (0.5 @@ turn) # scale (1.5*k') # translateX (1.5*k')
where
k = structLength s1 + 1
k' = fromIntegral k
innerElement = if isNone s1 then mempty else renderStructure s1 ||| backbone
outerElement = if isNone s2 then mempty
else translateX (-1.5) backbone <>
renderStructure s2
renderStructure (Slip c s) = if isNone s then base c else base c <>
translateX 1.5 backbone <>
translateX 3 (renderStructure s)
renderStructure None = mempty
Example Output
Let’s try it out:
main = do
let gen = mkStdGen 42
let rna = "GAAAAAAGGGGAAACCAAAGCCCAAUUUGCUUUUAAAAGGCCAA"
let pf = partitionFunction rna
let struct = sampleStructure 0 (length rna - 1) pf rna gen
let diagram = renderStructure struct
mainWith diagram
In shell:
ghc RNAfolding.hs
RNAfolding.o -o RNAt.svg -w 600 -h 300
Cool!
Further Reading
There is a long history of this problem being solved going back to the 1980’s even before John McCaskill first formulated a partition function approach to RNA secondary structure prediction. Below are a couple significant papers in this series, or at least those that are known to me. I’ll also put my undergraduate thesis here, where I wrote extensively on a slightly more complicated version of this problem.
Partition Functions:
McCaskill, J.S., 1990. The equilibrium partition function and base pair
binding probabilities for RNA secondary structure. Biopolymers, 29(6‐7),
pp.1105-1119.
pdf
Stochastic Sampling:
Ding, Y. and Lawrence, C.E., 2003. A statistical sampling algorithm for
RNA secondary structure prediction. Nucleic acids research, 31(24),
pp.7280-7301.
pdf
More complicated energy models:
Mathews, D.H., Sabina, J., Zuker, M. and Turner, D.H., 1999. Expanded
sequence dependence of thermodynamic parameters improves prediction of
RNA secondary structure. Journal of molecular biology, 288(5),
pp.911-940.
pdf
Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19), pp.7287-7292. pdf
Some clear formulations, useful to me:
Dirks, R.M. and Pierce, N.A., 2003. A partition function algorithm for
nucleic acid secondary structure including pseudoknots. Journal of
computational chemistry, 24(13), pp.1664-1677.
pdf
RNAbows:
Aalberts, D.P. and Jannen, W.K., 2013. Visualizing RNA base-pairing
probabilities with RNAbow diagrams. RNA, 19(4), pp.475-478.
pdf
My thesis:
Flynn, M.J. and Aalberts, D.P., 2015. RNA Macrostates and Computational
Tools. Williams College. github
pdf
4 Comments
-
January 10, 2017 at 4:13 am
Awesome article, in all of content, intersectionality, and presentaion. Have a possible slight correction though that made me stop and double-check:
The sum initially seems daunting.
Shouldn’t this be:
The sum initially seems daunting.
Still processing the rest :p
mflynn
January 10, 2017 at 4:24 am
Yes! Thanks.
-
Brent Yorgey
January 10, 2017 at 9:49 pm
Hey, this is really cool. I think I remember you trying to explain some of this to me a couple years ago, though I’m not sure I really got it at the time (it makes a lot of sense now, though).
Also, nice pictures! 😉
mflynn
January 11, 2017 at 4:57 am
Thanks! Glad you thought it was cool. Diagrams is very useful!