rajiv.sg

Rajiv M Ranganath's homepage

Re-learning Stats and Learning Functional Programming in R

For some time now, I’ve been spending my free cycles trying to re-learn statistics and learn functional programming. In the first part of this post, I’ll briefly touch upon my motivations for doing this. Then I’ll use an example to illustrate how the same programming logic can be decomposed imperatively or functionally, and why the latter approach might be better in many cases. Just to be clear, I’m not advocating functional programming as a silver bullet. It might be, or it might not be, and we don’t know yet.

Why Re-Learn Stats?

This is probably the easier one to answer. Statistics, or more precisely the application of stats, has come a long way since I first took those lessons way back in my school days. We had no “Big Data” storage or elastic “Cloud Computing” infrastructure available to us then, so we had limited ability to collect and store humongous amounts of data, and limited computing resources to do interesting things with it.

Now, we live in a world where our every action is getting stored and data-mined for some kind of value. With machine learning algorithms being applied to predict my next heartbeat, I figured I might as well re-learn how it’s all done in this brave new world.

After going through a number of books, I narrowed it down to two that I think touch upon these topics quite nicely.

  1. Think Stats - Probability and Statistics for Programmers
  2. Think Complexity

When I started working through these two books, instead of doing the examples in Python, I decided to do them in a functional way using “R”.

Why learn Functional Programming?

Well, like with many things in life, each of us develops our own set of reasons to do or not to do something. In my case, I’ve been an object-oriented, imperative guy for a number of years. It has served me quite well, and continues to do so. I have also had the privilege of working for and with Qt, which is probably one of the best C++ application development frameworks.

My fascination with functional programming started to develop when I saw JavaScript go mainstream. While JavaScript is technically a prototype-based language, it offers many features that would make a functional programmer happy and that help create some very powerful and succinct abstractions.

Learning to think functionally takes some time, and the learning curve is steep. It’s not because it’s difficult, but because we need to unlearn certain patterns in order to think effectively in a functional style. I’ll give an example of that in the next section.

Having spent some time at it, I can say that it’s worth the time and effort. Overall, I think it makes one a better programmer, and this is a common experience shared by others too.

Think Stats in R

In the previous section, I mentioned that programming functionally helps us create powerful abstractions. In this section, I’ll go over Python code from Chapter 1 of Think Stats and highlight how to approach the same abstractions functionally.

First, let’s take a look at the Recode method in the Pregnancies class.

survey.py
def Recode(self):
     for rec in self.records:

         # divide mother's age by 100
         try:
             if rec.agepreg != 'NA':
                 rec.agepreg /= 100.0
         except AttributeError:
             pass

         # convert weight at birth from lbs/oz to total ounces
         # note: there are some very low birthweights
         # that are almost certainly errors, but for now I am not
         # filtering
         try:
             if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
                 rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
                 rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
             else:
                 rec.totalwgt_oz = 'NA'
         except AttributeError:
             pass

Looking at the above code, we quickly realize that there are two distinct ideas being expressed.

  1. A conditional which is working on the fields (attributes) of the record based on an intended logic
  2. An iterator which is going over all the records

This is perfectly valid imperative code. But can we do better? Can we keep these two ideas distinct and compose them so that we get the same results with a cleaner abstraction? Let’s try in R.

survey.R
  recode <- function(df) {
    # divide mother's age by 100, leaving NA untouched
    recode_agepreg <- function(x) {
      if (is.na(x)) x else x / 100.0
    }

    # convert one row's weight at birth from lbs/oz to total ounces
    add_totalwgt_oz <- function(row) {
      if (!is.na(row['birthwgt_lb']) &&
          (row['birthwgt_lb'] < 20) &&
          !is.na(row['birthwgt_oz']) &&
          (row['birthwgt_oz'] <= 16)) {
        row['birthwgt_lb'] * 16 + row['birthwgt_oz']
      } else {
        NA
      }
    }

    df$agepreg <- sapply(df$agepreg, recode_agepreg)
    totalwgt_oz <- apply(df[, c("birthwgt_oz", "birthwgt_lb")], 1, add_totalwgt_oz)
    cbind(df, totalwgt_oz)
  }

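In the above code, the conditional logic we want is captured in the functions recode_agepreg and add_totalwgt_oz. With the logic available to us as functions, we can then neatly apply them to our table: sapply calls recode_agepreg on each element of the agepreg column, and apply with a margin of 1 calls add_totalwgt_oz on each row.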

    df$agepreg <- sapply(df$agepreg, recode_agepreg)
    totalwgt_oz <- apply(df[, c("birthwgt_oz", "birthwgt_lb")], 1, add_totalwgt_oz)

We immediately gain the benefit of clearer abstraction and greater readability of code.
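The split between per-record logic and iteration is not tied to R. As a rough sketch of the same idea in Python, the book's language, the snippet below assumes each record is a plain dict with None standing in for 'NA'. The dict layout, the function names and the sample values are all hypothetical and are not taken from Think Stats.

```python
# Hypothetical sketch: records are plain dicts and None stands in for 'NA'.
# Neither the dict layout nor these function names come from Think Stats.

def recode_agepreg(age):
    """Divide mother's age by 100, leaving missing values untouched."""
    return None if age is None else age / 100.0


def total_weight_oz(lb, oz):
    """Convert lbs/oz to total ounces, or None if either part is unusable."""
    if lb is not None and lb < 20 and oz is not None and oz <= 16:
        return lb * 16 + oz
    return None


def recode_record(rec):
    """Return a new record; the input dict is not modified."""
    return {
        **rec,
        "agepreg": recode_agepreg(rec["agepreg"]),
        "totalwgt_oz": total_weight_oz(rec["birthwgt_lb"], rec["birthwgt_oz"]),
    }


def recode(records):
    """Iteration lives here, separate from the per-record logic above."""
    return list(map(recode_record, records))


# Made-up sample data, for illustration only.
sample = [
    {"agepreg": 2550, "birthwgt_lb": 7, "birthwgt_oz": 8},
    {"agepreg": None, "birthwgt_lb": 99, "birthwgt_oz": 2},
]
recoded = recode(sample)
```

Here map plays the role that sapply and apply play in the R version: the loop disappears into one call, and each small function can be read and tested on its own.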

I’ll end this post with another example from the same chapter, which drives the point home even further. Let’s take a look at the PartitionRecords method.

first.py
def PartitionRecords(table):
    """Divides records into two lists: first babies and others.

    Only live births are included

    Args:
        table: pregnancy Table
    """
    firsts = survey.Pregnancies()
    others = survey.Pregnancies()

    for p in table.records:
        # skip non-live births
        if p.outcome != 1:
            continue

        if p.birthord == 1:
            firsts.AddRecord(p)
        else:
            others.AddRecord(p)

    return firsts, others

In this case the conditional is slightly more complicated than in the earlier example. The first conditional determines whether the second conditional makes sense at all. Even from an imperative programming perspective, the use of continue in a loop is often considered harmful, as it breaks the code flow and hence hurts readability.

Doing the same functionally means we can simply take the conditionals and apply them in sequence, without breaking the code flow in any way. This once again leads to a cleaner abstraction and more readable code.

first.R
# keep only live births (outcome == 1) and the three columns we need
live_births_df <- pregnancies[pregnancies$outcome == 1,
                              c("outcome", "birthord", "prglength")]
birthord_first <- length(subset(live_births_df$birthord, live_births_df$birthord == 1))
birthord_other <- length(subset(live_births_df$birthord, live_births_df$birthord != 1))
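For comparison, the same idea of applying the conditions in sequence can be sketched in Python. This is only an illustration: it assumes records are plain dicts with outcome and birthord keys, and the predicate names and sample data are made up rather than taken from the book.

```python
# Hypothetical sketch: dict records and made-up sample values.

def is_live_birth(rec):
    """First condition: only live births (outcome == 1) are kept."""
    return rec["outcome"] == 1


def is_first_baby(rec):
    """Second condition: only meaningful once live births are selected."""
    return rec["birthord"] == 1


def partition_records(records):
    """Filter to live births first, then split on birth order."""
    live = list(filter(is_live_birth, records))
    firsts = [r for r in live if is_first_baby(r)]
    others = [r for r in live if not is_first_baby(r)]
    return firsts, others


sample = [
    {"outcome": 1, "birthord": 1},
    {"outcome": 1, "birthord": 2},
    {"outcome": 4, "birthord": None},
    {"outcome": 1, "birthord": 1},
]
firsts, others = partition_records(sample)
```

Each condition is applied as its own step, so no continue is needed to skip records partway through a loop body.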

I’ve posted the R code used here in the think-stats-R GitHub repo.
