Simple RNA folding in 130 lines of Haskell

Note: imported from an earlier version of this blog. Sorry if anything is broken.

RNA secondary structure

How do we predict how RNA folds? First of all, we need to define what we mean by folding. RNA is a polymer made up of nucleic acids: adenine, guanine, cytosine, and uracil. There are 3 levels of RNA structure that are traditionally considered:

primary structure:the sequence of bases, ‘A’, ‘G’, ‘C’, and ‘U’ that form the strand of RNA, ‘AAAAGGGGCCCCUUUU’ for example,
secondary structure: which bases are hydrogen bonded together, ‘A’ to ‘U’ and ‘G’ to ‘C’, can be specified by a sequence of pairs corresponding to which indices are paired together (1,24), (2,23)…, and
tertiary structure:how the molecule bends on itself on a larger scale.

Usually, primary structure is already known or given as an input and tertiary structure is hard to specify and compute the energy for. Secondary structure is easy to specify using a sequence of pairs and the energy is roughly dominated by hydrogen bonding between bases which is discrete. Since RNA functions mostly by binding to things using exposed (non-paired) bases, the secondary structure is also the most relevant when discerning function. For these reasons, biophysicists usually focus on predicting the secondary structure of RNA.

From the laws of thermal physics, if RNA is modeled as a system in thermal equilibrium then the probability of any state $s$ with free energy $E(s)$ is equal to

$P(s)= \frac{e^{-E(s)/kT}}{\sum_{s' \in S} e^{-E(s')/kT}},$

where $S$ is the full set of states. The denominator is called the partition function, usually denoted $Z$ , it normalizes the probability distribution and with it we can compute the probability of any structure. How many terms are in the partition function? How many states are in $S$ ? Depends on what assumptions you make. In full generality, even only allowing Watson-Crick base pairs, there are $O(n!)$ terms in that sum. However, if you assume that there are no crossing pairs, i.e. arcs that would intersect in the above diagram, or pairs $(i,j), (k,l)$ such that $i < k$ , $k < j$ , and $j < l$ , there are only $O(1.8^n)$ terms in the sum. This is still an exponential quantity, but it turns out it is actually computable in sub-exponential time via dynamic programming. This assumption is called the no-pseudoknot assumption, and it means that if I have an $(i,j)$ pair, then for all bases in-between them, I can ignore all the bases outside of them when enumerating structures. This begs the question whether it is a physically valid assumption. The answer is NO. There are many RNAs found in nature that have “crossing” pairs such as Group I and Group II introns. However, without this assumption, computing the partition function is intractable, so most RNA folding software packages make it.

While it is tempting to do analysis on a structure with the minimum free energy and highest probability, since RNA is a thermal system with an exponential multiplicity of states even the probability of the most probable state is extremely low and any quantity computed on even the most probable state has negligible contribution to the physical expected value of that quantity. However, I will show here that not only can we compute the partition function, but we can also sample from the Boltzmann distribution. Once we are able to do that we can compute arbitrary expected values on the full Boltzmann distribution via Monte Carlo integration. (In fact, Monte Carlo integration is not even needed if you can express your quantity in terms of the partition function values, which can give you full expected values without statistical error! But that is outside of the scope of this post).

Putting this all together, to predict the folding and general properties of RNA we need 4 things:

a specification of a state of RNA $s$ ,
an energy function $E(s): S \to \text{Energy}$ , which takes a state and outputs a physical energy,
the partition function, the sum of the Boltzmann factors over every possible state $Z = \sum_{s' \in S} e^{-E(s')/kT}$ , and
an algorithm for sampling structures from the Boltzmann distribution, $P(s) = e^{-E(s)/kT}/Z$

Quick Code Note

All code going forward will be in Haskell. You should be able to extract all the code here into a file and it should compile. To set up our environment, make sure to have this header at the top of the file:

{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE FlexibleContexts          #-}
{-# LANGUAGE TypeFamilies              #-}
import Data.Array
import Data.Time.Clock.POSIX 
import Data.Colour
import System.Random
import Data.List
import Diagrams.Prelude
import Diagrams.Backend.SVG.CmdLine

States of RNA

Because of the no-pseudoknot assumption, we can actually specify the states using a recursive data structure.

data Structure = Paired Char Char (Structure, Structure) |
                 Slip Char Structure |
                 None
               deriving Show

The thought behind this definition is that we are specifying the structure base by base. If base $i$ is unpaired, then we move on to the next base. If base $i$ is paired, then we fork the structure into the bases underneath the pair and the bases after the pair. We carry along the letter of the base for convenience and to compute the energies of any pairing that might happen.

Energy Model

A fully realistic energy model of RNA would be very difficult to compute. There are exposed ions everywhere on the strand which interact with each other and there is tension in the backbone. Many of these interactions are non-linear which breaks the recursion and therefore computability of the problem. In practice, there are several levels of complexity that are implemented which try to account for the non-linearities in a computable fashion (this actually is the only difference between professional RNA secondary structure prediction and an Algorithms homework problem). Here I will implement a very simple energy model: adding the free energies of each hydrogen bond. A hydrogen bond contributes about $-6\ \text{kJ}/\text{mol} = -2.3\ k_BT$ . Adenine-uracil bonds have 2 hydrogen bonds, guanine-cytosine bonds have 3. Therefore in the exponent of the Boltzmann factor, I will put -4.6 for A-U bonds and -6.9 for G-C bonds. Anything else is represented by infinite energy, meaning that it will never happen.

baseEnergy :: Char -> Char -> Double
baseEnergy a b = case (a, b) of
  ('A', 'U') -> -4.6
  ('G', 'C') -> -6.9
  ('U', 'A') -> -4.6
  ('C', 'G') -> -6.9
  otherwise -> 1.0/0.0 -- non-allowed pairs treated like an infinite hill

energyModel :: Structure -> Double
energyModel (Paired a b (s1,s2)) =
  (baseEnergy a b) + -- energy of pair
  (energyModel s1) + -- energy of substructure under pair
  (energyModel s2) -- energy of substructure after pair
energyModel (Slip _ s) = energyModel s
energyModel None = 0

Partition Function

The sum $Z = \sum_{s' \in S} e^{-E(s')/kT}$ initially seems daunting. However, from the recursive definition of our energy model, we can easily see that the partition function can be defined recursively as well. To be precise, if we define $Z[i,j]$ as the sub-partition-function from $i$ to $j$ , it must be equal to the sum of the Boltzmann terms of all structures where $i$ is paired and all structures where $i$ is unpaired:

$Z[i, j] = \overbrace{Z[i+1, j]}^{i \text{ unpaired}} + \overbrace{\sum_{i < k \leq j} B(i, k) Z[i+1, k-1] Z[k+1, j]}^{i \text{ paired}}$

where $B(i, k) = \text{exp}(-\text{BaseEnergy}(i,k))$ . This becomes a standard dynamic programming problem. We can turn Z into a table of values with $O(n^2)$ cells. If cell $(0,0)$ is on the bottom left, and $(n-1,n-1)$ on the top right, each row of the table only depends on the entries below and to the left. We can compute the full partition function by starting from the bottom left and working out to the top right in $O(n^3)$ time. This is even parallelizable along the diagonals (I’m not going to parallelize it here).

type PFunArray = Array (Int, Int) Double

boltz :: Double -> Double
boltz x = exp (-x)

partitionFunction :: String -> PFunArray
partitionFunction strand = arr                           
  where
    len = length strand
    arr = array ((0,0), (len - 1, len - 1))
          [((i,j), (pfCell i j strand arr)) | i <- [0..len-1], j <- [i..len-1]]

pfCell :: Int -> Int -> String -> PFunArray -> Double
pfCell i j strand arr = if (i==j) then 1 else (sum pairTerms) +  slipTerm
  where
    pairTerm i k = boltz (baseEnergy (strand !! i) (strand !!k)) *
                   (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                   (if j-k > 1 then arr ! (k+1, j) else 1)
    pairTerms = [pairTerm i k | k <- [(i+1)..j]]
    slipTerm  = if i == j then 1 else arr ! (i+1, j)

Sampling RNA Structures

To actually predict the structure of RNA molecules in nature, we can work backward through the partition function table to sample structures from the Boltzmann distribution. Consider the length of the strand along base $i$ to base $j$ ,

the probability that $i$ is unpaired is equal to the sum of the probabilities of every structure along $i$ to $j$ with $i$ unpaired, or $\frac{Z[i+1, j]}{Z[i,j]}$ ,
the probability that $i$ is paired to base $k$ is equal to the sum of the probabilities of every structure along $i$ to $j$ where $i$ and $k$ are paired, or $\frac{B(i,k)Z[i+1,k-1]Z[k+1,j]}{Z[i, j]}$ ,
completing the rest of the structure can be achieved by calling the sampling function recursively for every length of the strand left general, $i+1$ to $j$ for the unpaired case, $i+1$ to $k-1$ and $k+1$ to $j$ for the second case.

I encode this logic in Haskell below. At each level I roll a random number from 0 to $Z[i,j]$ and sort through the cases. When I pass cumulative probability greater than the number I’ve rolled I choose that case and recurse.

sampleStructure :: Int -> Int -> PFunArray -> String -> StdGen -> Structure
sampleStructure i j arr strand gen = if (i > j) then None else struct
  where
    -- setup pair terms
    pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                   (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                   (if j-k > 1 then arr ! (k+1, j) else 1)
    pairTerms = [pairTerm i k | k <- [(i+1)..j]]

    -- structure functions
    slipStructure genr = Slip (strand !! i) (sampleStructure (i+1) j arr strand genr)
    innerPairStructure k genr = if k - i > 1 then  sampleStructure (i+1) (k-1) arr strand genr else None
    outerPairStructure k genr = if j - k > 0 then sampleStructure (k+1) (j) arr strand genr else None                                           
    pairStructure k genr = Paired (strand !! i) (strand !! k) (innerPairStructure k genr, outerPairStructure k genr)

    -- (cumulative probability, structure) pairs
    slipCase = (0, slipStructure) -- starting at offset of 0 
    pairCases = [(sum (take (k-i-1) pairTerms) + arr ! (i+1, j), pairStructure k) | k <- [(i+1)..j]]
    allCases = slipCase:pairCases

    -- sample a random number according to the current sub
    (roll, newgen) = randomR (0, arr ! (i, j)) gen
    -- if roll >= cumulative probability switch canidates, else not
    checkCase current candidate = if roll >= (fst candidate) then (snd candidate) newgen else current 
    struct = foldl' checkCase None allCases

Drawing Structures

One last thing to do is to draw these structures, which I do using Brent Yorgey‘s cool diagrams library. Every base is a circle with a letter underneath, color coded by base, with backbone and arcs denoting bonds. This depicts RNA with the bonds fanned out and visible, whereas other depictions of RNA will try to keep the length of each bond pretty constant, which is a more physical picture.

colorMap a = case a of
  'G' -> sRGB24read "#107896"
  'C' -> sRGB24read "#829356"
  'U' -> sRGB24read "#F26D21"
  'A' -> sRGB24read "#C02F1D"
  otherwise -> white

base :: Char -> Diagram B
base b = circle 1 # fc (colorMap b) <>
  text [b] # fontSize (local 2) # (translateY (-2.6))
backbone = hrule 1

isNone :: Structure -> Bool
isNone s = case s of
  None -> True
  otherwise -> False

structLength :: Structure -> Int
structLength None = 0
structLength (Slip _ s) = 1 + structLength s
structLength (Paired _ _ (s1, s2)) = 2 + structLength s1 + structLength s2

renderStructure :: Structure -> Diagram B
renderStructure (Paired a b (s1, s2)) =  
  base a <>
  translateX 1.5 backbone <>
  translateX 3 innerElement <>
  translateX (3.0*k') (base b) <>
  translateX (3.0*k'+3) outerElement <>
  arc xDir (0.5 @@ turn) # scale (1.5*k') # translateX (1.5*k')
  where
    k = structLength s1 + 1
    k' = fromIntegral k
    innerElement =  if isNone s1 then mempty else renderStructure s1 ||| backbone
    outerElement =  if isNone s2 then mempty
      else  translateX (-1.5) backbone <>
            renderStructure s2
renderStructure (Slip c s) = if isNone s then base c else base c <>
  translateX 1.5 backbone <>
  translateX 3 (renderStructure s)
renderStructure None = mempty

Example Output

Let’s try it out:

main = do
  let gen =  mkStdGen 42
  let rna = "GAAAAAAGGGGAAACCAAAGCCCAAUUUGCUUUUAAAAGGCCAA"
  let pf = partitionFunction rna
  let struct = sampleStructure 0 (length rna - 1) pf rna gen
  let diagram = renderStructure struct
  mainWith diagram

In shell:

ghc RNAfolding.hs
RNAfolding.o -o RNAt.svg -w 600 -h 300

Cool!

Further Reading

There is a long history of this problem being solved going back to the 1980’s even before John McCaskill first formulated a partition function approach to RNA secondary structure prediction. Below are a couple significant papers in this series, or at least those that are known to me. I’ll also put my undergraduate thesis here, where I wrote extensively on a slightly more complicated version of this problem.

Partition Functions:
McCaskill, J.S., 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6‐7), pp.1105-1119. pdf

Stochastic Sampling:
Ding, Y. and Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic acids research, 31(24), pp.7280-7301. pdf

More complicated energy models:
Mathews, D.H., Sabina, J., Zuker, M. and Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of molecular biology, 288(5), pp.911-940. pdf

Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19), pp.7287-7292. pdf

Some clear formulations, useful to me:
Dirks, R.M. and Pierce, N.A., 2003. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of computational chemistry, 24(13), pp.1664-1677. pdf

RNAbows:
Aalberts, D.P. and Jannen, W.K., 2013. Visualizing RNA base-pairing probabilities with RNAbow diagrams. RNA, 19(4), pp.475-478. pdf

My thesis:
Flynn, M.J. and Aalberts, D.P., 2015. RNA Macrostates and Computational Tools. Williams College. github pdf

4 Comments

wrmsr

January 10, 2017 at 4:13 am

Awesome article, in all of content, intersectionality, and presentaion. Have a possible slight correction though that made me stop and double-check:

The sum $Z = \sum\_{s’ \in S} e^{-E(s)/kT}$ initially seems daunting.

Shouldn’t this be:

The sum $Z = \sum\_{s’ \in S} e^{-E(s’)/kT}$ initially seems daunting.

Still processing the rest :p
- mflynn
January 10, 2017 at 4:24 am

Yes! Thanks.
Brent Yorgey

January 10, 2017 at 9:49 pm

Hey, this is really cool. I think I remember you trying to explain some of this to me a couple years ago, though I’m not sure I really got it at the time (it makes a lot of sense now, though).

Also, nice pictures! 😉
- mflynn
January 11, 2017 at 4:57 am

Thanks! Glad you thought it was cool. Diagrams is very useful!

Simple RNA folding in 130 lines of Haskell

4 Comments

Published

Category

Tags

Contact