Lexical scope and function closures in R

Introduction

R is different to many “easy to use” statistical software packages – it expects to be given commands at the R command prompt. This can be intimidating for new users, but is at the heart of its power. Most powerful software tools have an underlying scripting language. This is because scriptable tools are typically more flexible, and easier to automate, script, program, etc. In fact, even software packages like Excel or Minitab have a macro programming language behind the scenes available for “power users” to exploit.

Programming from the ground up

It is natural to want to automate (repetitive) tasks on a computer, to automate a “work flow”. This is especially natural for computational tasks, as all software tools are built from programming language components, anyway. In R, you do stuff by executing a sequence of commands. By putting a bunch of commands one after another into a text file, we can source the file, and script R. Scripting is the simplest form of programming – automating a sequence of tasks. Indeed, in Unix (including Linux and MacOS), we can put a bunch of Unix shell commands together in a shell script. In Windows, you can put a bunch of terminal commands together in a batch file.

Next, one can add in simple control structures, to support looping, branching and conditional execution. Looping allows repetition of very similar tasks. Branching and conditional execution allow decisions to be made depending on what has already happened. Most scripting languages support simple control structures – this allows carrying out of tasks which we could do in principle, but perhaps not in practice, due to the laborious and repetitive nature of some work-flows. We can go a long way with this, but…

Although scripting is a simple form of programming, it isn’t “real” programming, or software engineering. Software engineering is about developing flexible, modular, robust, re-usable, generic program components, and using them to build large, complex software systems – modularity is absolutely key here. Functions and procedures are a first step towards introducing modularity, allowing the development of “real” software. Proper support for these tends to distinguish “real” programming languages from scripting languages (though many modern “scripting” languages have at least some limited support, and the distinction between scripting languages and “real” languages is now very blurred).

Functions and procedures

Procedures (or subroutines) are re-usable pieces of code which can be called from other pieces of code when needed. They may be provided with inputs, but do not have to be. They are usually called for their “side-effects”, such as doing plots, changing global variables, or reading/writing data to/from disk.

Functions are also re-usable pieces of code, but are mainly used to obtain a return-value that is computed on the basis of the given inputs. “Pure” functions do not have any side-effects. Functions and procedures may be combined in a hierarchical way to build large, complex algorithms from much simpler modular components. Note that many languages (including R), do not make a distinction between functions and procedures in the syntax of the language, but conceptually the distinction is really quite important.

Variable scope

Almost all programming languages allow the definition of variables which are labels or tags representing or pointing at some value that may be defined and re-defined at run-time. In most modern programming languages, functions can define local variables which can be used in addition to any inputs (formal parameters) of the function – these are very important for the development of modular, re-usable code components. In particular, they help to avoid unanticipated name clashes in the global name-space. If a function refers to a variable which is neither a formal parameter nor a local variable, then a rule is needed to find which (if any) variable with that label is in scope for the function, so that the program can know what value to use.

Dynamic scope

Under dynamic scope, if an “unknown” variable is referred to in a function, the idea is to use the version of the variable that is in scope at the time that the function was called (and apply this rule recursively) – this is the scoping rule used by the S-PLUS implementation of the S language. Dynamic scope was common among early dynamic programming languages – including early implementations of LISP (and is still used in Emacs LISP), as it was quite intuitive and natural to implement using a stack-based approach similar to the stack-based approach to passing variables in and out of subroutines commonly used by machine code and assembly programmers.

Despite being intuitively appealing, at least initially, there are a number of problems with dynamic scope in practice. In particular, we can’t really know by code inspection whether or not a given section of code will run in all situations without actually running the code, as we can’t know whether all variable bindings will resolve correctly. This is an issue even for dynamic languages, but is particularly problematic for strongly typed compiled languages, as it becomes difficult for the compiler to figure out the types of all variables correctly and therefore generate the appropriate byte-code. It is also very difficult for a function to have associated state – to do this, you must somehow get state variables into global name-space where they then become vulnerable to masking and name clashes. See the Wikipedia page on scope for further details.

Lexical scope

Under lexical scoping rules, if an “unknown” variable is referred to in a function, the idea is to use the version that is “in scope” in the enclosing piece of code (and apply this rule recursively) — this is the scoping rule used by R (as R is built on top of a Scheme interpreter, a LISP derivative which emphasises lexical scope). Variable bindings can be all resolved, checked and verified at compile-time – this is safer, and in many other ways better. Most modern languages adopt lexical scoping, including most functional languages, such as LISPs (including LISP-STAT) and derivatives. In fact, I first read about lexical scope, function closures and their use in statistical computing in Luke Tierney’s LISP-STAT book (Tierney, 1990) in the early 1990s. That book was published over 20 years ago, so it just goes to show that there is nothing new about these functional programming approaches. In fact, although Tierney’s book describes a now obsolete system, I would nevertheless recommend reading it if you can find a copy, as I think it is still one of the best books on statistical computing ever written. It really puts the recent glut of horrible R-themed books to shame!

Given that R has been lexically scoped and has supported function closures since day one, it is reasonable to wonder why this programming style is not used more widely in R code. I think it is the difference in scoping rules between S-PLUS and R that has led to a fear of developing any R code which relies on non-local scoping rules. Certainly, in the early days of R, I would use S-PLUS at work and R at home, and I would want my code to work in exactly the same in both places! This is a shame, as lexical scoping is very powerful, and exploited widely in functional programming styles. The use of lexical scope and function closures in R is described quite nicely in Gentleman (2008), along with many other things.

To make sure that the concepts are clear, inspect the following piece of code and figure out what the result of the final function call will be. The answer is given below the code, so try not to peek before reading on…

a=1
b=2
f<-function(x)
{
  a*x + b
}
g<-function(x)
{
  a=2
  b=1
  f(x)
}
g(2)

No, really, try and figure it out before reading on for the answer! Understanding this example is key to understanding the difference between lexical and dynamic scope. Clearly the obvious answers are 4 and 5. If you didn’t get one of those, go back and try again! ;-) So, one of those is the result you get in a dynamically scoped language like S-PLUS, and the other is the result that you get in a lexically scoped language like R. But which is which? Many people when asked what this code does give the answer 5. This is the result for a dynamically scoped language. It is not the answer you get in R. In R, you get the answer 4. This is because f() was defined in the global environment, so it is the global bindings of a and b which count. Although the function g() defines its own local bindings for a and b, these have no impact on the global bindings, and are simply not relevant to the evaluation of f().

Function closures

Some languages (including LISPs and derivatives such as Scheme, Python, and R) have functions as “first class objects”, which means that a function is able to return as its value another function. If the function (fChild) returned by the function (fParent) refers to variables not local to fChild, then scoping rules must apply to the resolution of the variable binding. If the language is lexically scoped, then the binding is determined by the variables in scope within the function fParent. The function fChild therefore has an associated environment, which provides bindings for non-local variable references – this allows maintaining of state. A function together with its environment is referred to as a function closure, and is a very powerful programming tool. Below is some more code to help illustrate what is going on. Again, try to figure out the result of the final function call before reading on for the answer and explanation…

a=1
b=2
f<-function(a,b)
{
  return( function(x) {
    a*x + b
  })
}
g=f(2,1)
g(2)

Here, the function g(), together with its associated environment, is referred to as a function closure. See the Wikipedia page for closure for further details. So, what is the result of calling g(2) in this case? Again, some people get this wrong, and give the answer 4. This isn’t what you get in R – in R you get 5, again due to lexical scope. The point is that the function g() is created inside f(), and so it is the variable bindings in scope within f() at the time g() was created which matter. Since f() has a and b as formal arguments, these mask the global variables of the same name, so it is the 2 and 1 that are passed into f() to create g() which matter in the evaluation of g(). This is why function closures are so powerful. They are not simply functions, they are functions together with an associated environment, and the associated environment allows function closures to have associated state. Here the state corresponds to the values of a and b that were used in the creation of g(), but in principle the state can be essentially any data structure.

Function closures for scientific computing

Function closures have numerous important applications in a variety of problems in scientific computing that involve dealing in some way with the “function environment problem”. There is quite a nice discussion of this issue in Oliveira and Stewart (2006), in the context of several strongly typed compiled languages. Consider, for example, a function that will numerically integrate a univariate function using (say) the trapezium rule. This integration function might expect that you pass in the function to be integrated, together with the limits of integration, and possibly a step size. Most likely this integration function will expect that the function passed in is univariate. However, in practice many functions have additional parameters (eg. the straight line example, above, which was a function of x, but depending on additional parameters a and b). This problem is solved by passing in a univariate function closure that contains the necessary environment to evaluate this univariate function correctly. Similar considerations apply for functions that carry out optimisation, solve ODEs by passing in the RHS, etc.

The smfsb R package

The second edition of my textbook, Stochastic Modelling for Systems Biology, has recently been published (Wilkinson, 2011). The second edition has an associated R package, smfsb, available from CRAN – I gave a tutorial introduction in a previous post. The code makes extensive use of lexical scope and function closures, precisely to solve the function environment problem…

References

  • Oliveira, S, Stewart, D.E. (2006) Writing scientific software, CUP.
  • Gentleman, R. (2008) R Programming for Bioinformatics, Chapman & Hall/CRC Press.
  • Tierney, L. (1990) LISP-STAT, Wiley.
  • Wilkinson (2011), Stochastic Modelling for Systems Biology, second edition, Chapman & Hall/CRC Press.
  • About these ads

    Tags: , , , , , , , , , , , , ,

    9 Responses to “Lexical scope and function closures in R”

    1. Parallel particle filtering and pMCMC using R and multicore « Darren Wilkinson's research blog Says:

      [...] Darren Wilkinson's research blog Statistics, computing, Bayes, stochastic modelling, systems biology and bioinformatics « Lexical scope and function closures in R [...]

    2. Catalogue of my first 25 blog posts « Darren Wilkinson's research blog Says:

      [...] Lexical scope and function closures in R: an introduction to notions of variable scope and closure in the context of [...]

    3. colinfang Says:

      Neither R nor S-PLUS employs dynamic scoping. http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-scope.pdf

    4. eppek Says:

      “… as R is built on top of a Scheme interpreter”, really? How so?

    5. anantakrroy Says:

      Reblogged this on anantakrroy and commented:
      Scope@R

    6. Bob Carpenter Says:

      R is an odd hybrid when it comes to scope because the definition of an R function is dynamic. Consider


      a = 10;
      f = function(x) { if (rbinom(1,1,0.5)) { a = 1000 }; return(a * x); }

      Whether a is local or global is determined for each call at run time. That is, it can’t be determined what value f(3) will have, though it will be either 30 or 3000, unless of course, a has been redefined.

    7. Ansgar Tuemmers Says:

      There seems to be another twist to R scoping that I find confusing:


      a<-1
      ff<-function(x){function(){x}}
      f1<-ff(a)
      f1()
      a<-2
      f1()

      As expected (after reading your post and thinking about it), both calls to f1 return 1. But then let’s remove the first call:

      a<-1
      ff<-function(x){function(){x}}
      f1<-ff(a)
      a<-2
      f1()

      Now the only call to f1() returns 2. So the first call seems to have a curious side effect. I’m assuming it’s a lazy evaluation thing…

    8. Escopo Léxico e Lazy-evaluation no R | Blog Pra falar de coisas Says:

      […] ao escopo. Esse post discute as diferenças conceituais entre escopo dinâmico e escopo léxico (que é o default do R). Eu não vou entrar aqui nas minúcias conceituais das definições, mas […]

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out / Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out / Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out / Change )

    Google+ photo

    You are commenting using your Google+ account. Log Out / Change )

    Connecting to %s


    Follow

    Get every new post delivered to your Inbox.

    Join 160 other followers

    %d bloggers like this: