Some miscellaneous notes about R.

rJava needs jvm.dll to work properly. Since it is not in the path, add it to the PATH env. variable. jvm.dll is generally found in <jdk path>/jre/bin/server **NOTE: ** 64-bit applications need 64-bit jvm and 32-bit apps need 32-bit jvm.

See: http://stackoverflow.com/questions/7019912/using-the-rjava-package-on-win7-64-bit-with-r

Customize Sweave for better output. See: http://www.stat.auckland.ac.nz/~ihaka/downloads/Sweave-customisation.pdf

NOTE: If you're using TeXLive then you might have to add Sweave to the local search path. To see if your tex system can find Sweave.sty issue the following command:

bash$ kpsewhich Sweave.sty # Empty indicates file not found

bash$ ## If it is missing do the following:
bash$ cd ~/Library/texmf/tex/latex
bash$ ln -s /Library/Frameworks/R.framework/Resources/share/texmf Sweave

In short do the following: a. Add the following lines at end of Sweave.sty

\DefineVerbatimEnviornment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnviornment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnviornment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

b. Shorten output lines and remove continuation prompts.

<<echo=false,results=hide>>=
options(width=60, continue="  ")
@

NOTE: echo=false will only suppress R/S code. The output will still be displayed! results=hide needs to be included to ignore the output generated by R/S code chunk. See page 13 of Sweave-manual at http://www.statistik.imu.edu/~leisch/Sweave/Sweave-manual.pdf

c. Avoid comment loss and code reformatting, and set plot size (default is 6 x 6)

\SweaveOpts{keep.source=TRUE, width=6, height=4}

d. Generate LaTeX with R. Especially useful for tables!

\begin{center}
\begin{tabular}{rrrrrrrr}
<<results=tex,echo=false>>=
nr <- 3; nc <- 8
x <- matrix(round(rnorm(nr*nc), 2), ncol=nc)
apply(x, 1,
      function(x)
       cat(paste("$", x, "$", sep="", collapse=" & "), "\\\\\n")
     )
@
\end{tabular}
\end{center}

e. Customising graphics

<<echo=false>>=
options(SweaveHooks=list(fig=function()
           par(mar=c(5.1, 4.1, 1.1, 2.1))))
@

If you use .First function for initial setup of some variables and/or local display, this can cause some problems with compiling some packages.

For e.g., I display a date at the start of an interactive session. This is great when I use the interactive session but this caused compilation issues, at least for RcppArmadillo.

Workarounds: i. Comment .First function. Build the program/package and uncomment the .First function again. ii. Use the --no-environ or --no-init-file (if that is where you've stored your .Rprofile file!)
[[ and $ use partial matching if exact matching fails. So x$aa will match x$aabb if x does not contain a component named "aa" and "aabb" is the only name which has a prefix "aa". For [[, partial matching can be controlled via the exact argument which defaults to NA indicating that partial matching is allowed, but should result in a warning when it occurs. [ always requires an exact match!
```
R> ?`[[`
R> ?'[['   ## also works!
```
To use scale/alpha functions in ggplot2 you also have to load the "scales" library. Always use the below, especially if you are using ggplot2.
```
R> library(scales)
R> library(ggplot2)
```
The NAMESPACE file is used to list the exported functions. This is used to create the environment that gets attached when the particular package is called using require or library. See ?new.env for more information.
If you can compile the packages but install.packages won't install the binary package (i.e. options()$pkgType == "mac.binary") then use type=source with the install.packages command.

For e.g., I was having trouble installing rgdal but when I downloaded the source and compiled it everything worked...so I just issued the command:
```
R> install.packages("rgdal", type="source", configure.args=c("--with-proj-include=/opt/local/include", "--with-proj-lib=/opt/local/lib"))
```
and it just worked!!
Running Rserve from the interpreter creates a daemon which will not quit when you quit the session. To kill the Rserve daemon process use something like below:
```
bash$ kill -15 `ps aux | grep "Rserve" | grep "library" | awk '{print $2}'`
```
Always load the "compiler" package and call enableJIT(3) function. This enables byte compilation of all functions and makes them considerably faster!

Sometimes there are weird problems with evaluation. If you get strange error messages then close R and change to enableJIT(2). See ?`enableJIT`
Use "data.table" instead of data.frame. It handles large amounts of data quite easily and is very well designed. Since it overrides the data.frame class, it can be used instead. However, beware, there are some differences. See vignette('datatable-faq') for questions that are bound to come up!
Learn to use unclass
```
R> glm.D93 <- glm(speed ~ dist, data = cars)
R> model <- unclass(glm.D93)
```
Now, the variable model is a list which can be searched using $ indexing!

Use methods, showMethods, and getS3Method to see what methods are available for class or object!

R> methods(class="data.frame")
R> methods(class="sf")
R> methods(class="data.table") ## ?data.table:::cube

To do a cross join in data.table use the following

R> x <- data.table(id1=letters[1:3], vals1=1:3)
R> y <- data.table(id2=letters[4:7], vals2=4:7)
R> crossjoin <- setkey(x[,c(k=1,.SD)],k)[y[,c(k=1,.SD)],allow.crosjoin=TRUE][,k:=NULL]

Found this on http://stackoverflow.com/question/10600060/how-to-do-cross-join-in-r

If you run into "EOF within quoted string" error/warning.

Disable quoting and set stringsAsFactors as FALSE. I.e. do something like below in R-session:
```
R> read.delim(fname, quote="", stringsAsFactors=FALSE)
```
See https://stackoverflow.com/questions/17414776/read-csv-warning-eof-within-quoted-string-prevents-complete-reading-of-file

Never use *tmp* as a variable name. *tmp* is used for subset assignment like:

R> x[3:5] <- 13:15 # This is equivalent to 3 lines below
R> `*tmp*` <- x
R> x <- "[<-"(`*tmp*`, 3:5, value=13:15)
R> rm(`*tmp*)

Examine the internals of an R object:

R> DT <- data.table(A=5:1,B=letters[5:1])
R> .Internal(inspect(DT))
R> .Internal(inspect(list.files))
R> .Internal(inspect(matrix(nrow=3,ncol=3)))

load/save are very useful. The files created are much smaller than even the zip archive of csv/txt files!

readRDS/saveRDS are much more convenient than load/save. http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/

To input random string in each of the rows of either data.frame/data.table use the following function:

R> f <- function(n, len=10) {
       set.seed(1234L)
       chars <- vector("character", n)
       for (i in 1:n) {
           r <- as.integer(runif(1,1,len))
           start <- sample(length(letters))[1]
           end <- start + r
           idx <- (start:end)%%length(letters)
           chars[i] <- paste0(letters[idx],collapse="")
       }
       chars
   }

R> d <- data.table(a=1:100, b=f(100))
R> d1 <- data.frame(a=1:100, b=f(100))

Or you can try the much faster version below: (2018.01.24)

R> f1 <- function(n, len=10) {
     set.seed(1234L)
     l <- sapply(sample(seq_len(len), n, replace=TRUE),
                 function(x) sample(letters, x))
     unlist(lapply(l, paste0, collapse=""))
   }
R> f1(15)

R> d <- data.table(a=1:100, b=f1(100))  ## Much faster!
R> d1 <- data.frame(a=1:100, b=f1(100))  ## Much faster!

Or you can try the genrandstr function below too! (2019.03.09)

R> d <- data.table(a=1:100, b=replicate(100,genrandstr(10)))
R> d1 <- data.frame(a=1:100, b=replicate(100,genrandstr(10)))

To examine the type of object use the function dput. Also ?dput See also: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
A good workflow for working with R can be breaking down things into different files. These can then be loaded independently! For an example, see https://stackoverflow.com/questions/2258092/what-best-practices-do-you-use-for-programming-in-r

Ordering data in data.table. Consider the question `How can I identify the first and the last observation within a group in R?' asked on http://www.ats.ucla.edu/stat/r/faq/firstlast.htm

R> hsb2 <- data.table(read.csv('http://www.ats.ucla.edu/stat/r/faq/hsb2.csv'))
R> setkey(hsb2, math) # Sort according to math scores
R> hsb2[,tail(.SD,1),by=prog] # highest math marks grouped by prog
R> hsb2[,head(.SD,1),by=prog] # lowest math marks grouped by prog

Check the answers generated by the above snippet with the one listed on the FAQ.

Collections of references for S4: https://stackoverflow.com/questions/4143611/sources-on-s4-objects-methods-and-programming-in-r
Use the package "gdata" instead of "xlsx"!

Be aware of types. R can waste space if you aren't careful. Consider:

R> a <- 1:1000000
R> b <- 1:1000000
R> object.size(a)  # 4000040 bytes
R> object.size(b)  # 4000040 bytes
R> a[[2]] <- a[[2]] + 1
R> b[[2]] <- b[[2]] + 1L
R> object.size(a)  # 8000040 bytes
R> object.size(b)  # 4000040 bytes

And use the L suffix especially for indexing/subsetting.

#define USE_RINTERNALS before #include <Rinternals.h> is needed to gain direct access to internals of SEXPRECS. See last paragraph of section 1.7 of R Internals.
R objects can contain a lot of information. Use class, str, and dput to investigate the type of object. NOTE: Using dput for large objects (e.g., data.table with millions of rows) can cause problems!! Consider the below example:
```
R> n <- 100; x <- rnorm(n); y <- 3*x+0.2*rnorm(n)
R> d <- data.frame(x=x,y=y)
R> model <- lm(y~x,data=d)
R> model$coefficients
R> summary(model)$coefficients # compare this with above!!
R> help(summary.lm) # or print(summary.lm)
```
Why is this? Because model has class lm and summary generic class has a specialized function summary.lm which contains a lot more informaiton. How do we know that it contains more information? Simple, we saw it printed when we examined the model object!!
While using data.table don't use %in% or match when you want to search/subset by some character string matching. While these functions will not raise errors it is unclear whether these functions will, or won't, work! Use chmatch or %chin% instead! See ?chmatch for more information.
You can attach a saved object. Learned this from Martin Maechler's talk (slide 28 found on http://stat.ethz.ch/Teaching/maechler/R/useR_2014/Maechler-2014-pr.pdf).

Actually reading the help ?attach has a section titled Good practice which suggests using on.exit(detach) as the next statement of attach. For e.g.:
```
R> attach('mysavedsession.RData')
R> on.exit(detach('file:mysavedsession.RData'))
```
However, I don't know how useful this will be. Why? Because the file will not be automatically updated. You will still need to save the image manually with whatever objects you need.
Use log1p when you want to calculate $\log(1+x)$ and $|x| \ll 1$. I.e., if $x$ is very small (can happen when multiplying many probabilities) then $\log(1+x)$ will be indistinguishable from $\log(1)$. This is due to floating point inconsistencies. Similarly, expm1(x) to compute exp(x) - 1 when $|x| \ll 1$.

Also, check out cospi, sinpi, and tanpi. ?`cospi`

Use on.exit to ensure execution of an extension at the exit of a function (normally or on error).

R> outcon <- file('tst.hml','wt')
R> on.exit(close(outcon))  # Now this expression will be exectued whenever the function is exited
R> # do some R stuff here
R> ?on.exit
R> system.time  # See this code for neat use of on.exit

Here is another neat use of on.exit:

R> # Neat idea of package.function.lengths
R> #   from https://nicercode.github.io/blog/2013-05-07-how-long-is-a-function/
R> function.length <- function(f) {
     if (is.character(f)) {
       f <- match.call(f)
     }
     length(deparse(f))
   }
R>

R> package.function.lengths <- function(package) {
     package.functions <- function(package) {
       qual_pkg <-
         if (isTRUE(startsWith(package, "package:"))) {
             package
         } else {
             sprintf("package:%s", package)
         }
       bare_pkg <- gsub("^package:","",qual_pkg)
       if (! qual_pkg %in% search()) {
         orig_search <- search()
         require(bare_pkg, character.only=TRUE)
         pkgs_added <- setdiff(search(), orig_search) ## pkgs added by our library call!
         ## FIXME:
         ## for(p in rev(pkgs_added)) {
         ##   ## print(substitute(detach(name=p, character.only=TRUE, unload=TRUE), list(p=p)))
         ##   on.exit(substitute(detach(name=p, character.only=TRUE, unload=TRUE), list(p=p)), add=TRUE)
         ## }
         for(p in pkgs_added) {
           on.exit(detach(name=p, character.only=TRUE),add=TRUE)  ## Terrible hack!
           ## NOTE: No idea how to use substitute with `on.exit`!!
         }
       }
       object.names <- ls(name=qual_pkg)
       objects <- sapply(object.names, get, qual_pkg)
       objects[sapply(objects, is.function)]
     }

     functions <- package.functions(package)
     func_length <- vapply(functions, function.length, 0L)
     data.table(func=sprintf("%s::`%s`", package, names(functions)), length=func_length)
   }
R> package.function.lengths('raster')

http://www.win-vector.com/blog/2009/09/survive-r/
1. Write down everything-keep notes in a separate file.
2. Find some way to search for R answers
3. Learn unclass()
4. Learn how to list class and methods.
5. Learn to stomp out attributes
6. Swallow your pride.
7. Find and rely on "the one-liners."
8. Learn dput()
Fix some parameters for plotting. This was suggested by Andrew Gelman. See http://andrewgelman.com/2010/10/29/could_someone_p/ and comments in it.
```
setHook('plot.new',function(...) par(mar=c(3,3,2,1),mgp=c(2,0.7,0),tck=-0.01)
```
NOTE: This causes problem with plotting in other packages. raster is a good example of this problem.
Break a project into 4 pieces:
1. load.R Load all the required data
2. clean.R Clean all the data
3. func.R Utility functions
4. do.R Do the analysis/report writing here
See https://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing

How to duplicate the following R code using data.table? See http://www.statmethods.net/management/aggregate.html

R> attach(mtcars)
R> aggregate(mtcars,by=list(cyl,vs),FUN=mean,na.rm=TRUE)
R> detach(mtcars)

This can be accomplished with data.table like so:

R> library(data.table)
R> mtcarsdt <- as.data.table(mtcars)
R> mtcarsdt[,sapply(.SD,function(x) list(mean(x))),by=list(cyl,vs)]

OR better yet: see http://stackoverflow.com/questions/11695490/aggregating-multiple-columns-in-data-table

R> mtcarsdt[,lapply(.SD,mean),by=list(cyl,vs)]

Use ls.str() instead of str(). It provides more details. Especially when listing environments.
Some presentations of "data.table" which contain dates use something like 20150702 for July 2nd and that's all right for sorting but it causes much hassle for simple date arithmetic. Instead use "%Y%j" (i.e. year + no. of days from start of year) with strftime to get a numeric value that is easy to sort and the date arithemtic is also possible.
```
R> as.numeric(strftime("2015-07-02","%Y%j")
[1] 2015183
R> as.Date("2015183",format="%Y%j")
[1] "2015-07-02"
```
format, as.character, and strftime convert from '"POSIXlt"' and '"POSIXct"' to character vectors. strptime converts character vectors to clas '"POSIXlt"'. See ?format or ?strftime .

Mnemonic:
- strptime <==> string parse time i.e. character -> date/time class
- strftime <==> string format time i.e. date/time -> character
If you use both the 'reshape2' and 'data.table' packages you'll have to load reshape2 first and then data.table See ?data.table:::melt or ?data.table:::cast. From Description in ?data.table:::melt

From 1.9.6, to melt or cast data.tables, it is not necessary to load 'reshape2' anymore. If you have to, then load 'reshape2' package before loading 'data.table'.
If you want to split a column into multiple columns in data.table using strsplit can be confusing and slow. Instead use data.table::tstrsplit.table. See ?tstrsplit to learn more about how to use it.

Also see http://stackoverflow.com/questions/18154556/r-split-text-string-in-a-data-table-columns

Applying a function row-wise in data.table.

Consider the following example

d <- data.table(x=sample(1:25,50,replace=TRUE),
                y=sample(1:25,50,replace=TRUE),
                z=sample(1:25,50,replace=TRUE))

Now you want a third column that is the minimum of the three columns.

d[,.(I=.I,x,y,z)][,.(x,y,z,minval=min(x,y,z)),by=I] # d[,min(x,y,z),by=.I] won't work!

Some tips from http://www.stat.cmu.edu/~hseltman/RTips.html

Start each function with checks of arguments. It takes a little extra time but will usually repay you (or other users of the function) by pointing out the source of errors. E.g.:
```
myfun = function(dtf,name,p=0.5) {
  if (is.matrix(dtf)) dtf = data.frame(dtf)
  if (!is.data.frame(dtf)) stop("dtf must be a data.frame or matrix")
  if (!is.character(name) || length(name)!=1) stop("name must be a single character string.")
  ...
  return(result)
}
```
Allow for stopping and restarting of function with long loops (e.g., MCMC). A good trick is to setup your function (or even just a loop) as follows:
```
myfun = function() {
  if (file.exists('myresults.dat')) {
    ...load and use old reulsts...
  }
  ...
  for (i in 1:100000) {
    if (file.exists('stop')) {
      write.table(myresults,file='myresults.data')
      stop('Early stop due to detection of stop file')
    }
    ...
  }
  ...
  return(...)
}
```
Then, you can create a file called "stop" at any time (e.g., on Linux using "echo stop>stop" at Linux prompt) and the function will gracefully stop at the start of the next loop iteration. Without too much work, you can probably set up your function to automatically continue wherever you left off. Just remember to delete or rename the "stop" file before running the funciton again.

Assigning to matrix in a for loop is complicated in R! It can be done more simply by using matrix indexing! For e.g., consider how would you create a distance matrix for the following data.frame?

R> dataframe <- data.frame(from=LETTERS[c(1,1,1,2,2,3)],
+                          to=LETTERS[c(2,3,4,3,4,4)],
+                          distance=c(45,50,19,20,18,12))

R> distmatrix <- matrix(0,nrow=4,ncol=4,dimnames=list(LETTERS[1:4],LETTERS[1:4]))
R> for(i in 1:nrow(dataframe)) {
+    from <- as.character(dataframe[i,'from'])
+    to <- as.character(dataframe[i,'to'])
+    distmatrix[from,to] <- d[i,'distance']
+  }
R>

This can be done much more simply like:

R> idx <- cbind(dataframe$from,dataframe$to)
R> distmatrix[idx] <- dataframe$distance

To input a list as a column in the data.frame/data.table use I(). The only stipulation is that the list should have the same length as nrow of the data.frame/data.table
```
R> d <- data.frame(a=1:3,b=1:3,c=I(list(1:8,2:15,letters[1:15]))
```

Avoid using strptime! It is VERY slow! http://stackoverflow.com/questions/12786335/why-is-as-date-slow-on-a-character-vector If you need to use strptime then give a value for tz argument. This will speed up strptime. Check out the following in R:

R> r = seq(as.Date('2000-01-01'),as.Date('2012-01-01'),by='days')
R> d = format(sample(r,3e6,replace=TRUE),'%m/%d/%Y')
R> system.time(as.Date(d,'%m/%d/%Y')) # second fastest!
R> system.time(strptime(d,'%m/%d/%Y'))
R> system.time(strptime(d,'%m/%d/%Y',tz='GMT')) # fastest!

A call is just like a list. It has length, [[, and [ methods, and is recursive! See section 3.3 in http://www.biostat.jhsph.edu/~rpeng/docs/R-debug-tools.pdf for a neat use of this feature to modify code interactively!
loadedNamespaces() returns a character vector of loaded namespaces. Very useful along with search(), getLoadedDLLs(), showMethods(x).
Using PCRE (Perl Compatible Regular Expression) library makes regex searching/substituting faster. This can be enabled by using perl=TRUE in grep, regexpr, gregexpr, sub, gsub, and strsplit commands!

Also, for strsplit if the sep is not a regex, using fixed=TRUE makes strsplit considerably faster!
Remove leading/trailing spaces with base::trimws.
You can use raster::zApply or raster::stackApply to group bunch of layers (say based on month or year) in a rasterstack/rasterbrick. Just be aware that stackApply uses na.rm=TRUE as default whereas most other functions in R use `na.rm=FALSE! This can lead to subtle, hard to detect, bugs.
Plotting factors will raise "Cannot allocate memory" errors. Convert the factors to either character or Date classes to make it work.

To display labels in bars in a barplot you have to use the midpoints which is what is returned by barplot and the values of the bar as x, y coordinates as arguments of the text function. See the example below:

R> m <- mtcars[1:5,]
R> midpts <- barplot(m$mpg, names.arg=rownames(m),las=2) # las=2 for perpendicular labels!
R> text(midpts, m$mpg-0.9, m$mpg) # Subtract appropriate offset from y values to position labels!

To run examples in R try

R> example(seq)
R> example(ls)
R> example(plot)

If you don't want your environment to be clobbered, or modified (which is worse!), then use the local=TRUE function argument of example. Check out the following:

R> envobjs <- ls()
R> example(ls, local=TRUE) ## Try without local=TRUE to see the difference!
R> setdiff(ls(), envobjs) ## This will be empty with local=TRUE! Safer (more correct?) behavior!

Try to save date/time in POSIXct instead of POSIXlt. POSIXct uses almost 1/4th the bytes of POSIXlt.

R> class(Sys.time())
R> object.size(as.POSIXct(Sys.time()))
R> dput(as.POSIXct(Sys.time()))
R> object.size(as.POSIXlt(Sys.time()))
R> dput(as.POSIXlt(Sys.time()))

From the "Debugging R code" subsection in "Writing R Extensions": The first thing to find out is what R was doing at the time of the error, and the most useful tool is traceback()!!
Some solutions for programming pitfalls: Easy solutions:

a. Input only required data
```
R> colClasses <- c("NULL","integer","NULL","numeric")
R> d <- read.table("myfile",colClasses=colClasses)
```
b. Preallocate-and-fill, not copy-and-append (*** MOST IMPORTANT ***)
```
R> res <- numeric(nrow(df))
R> for(i in seq_len(nrow(df)))
+    res[[i]] <- some_calc(df[i,])
```
c. Vectorized calculations, not iteration
```
R> x <- runif(100000); x2 <- x^2
R> m <- matrix(x2,nrow=1000);y <- rowSums(m)
```
d. Avoid unnecessary character creation operations e.g. USE.NAMES=FALSE in sapply, use.names=FALSE in unlist.

Moderate solutions:

a. Use appropriate functions, often from specialized packages

b. Identify appropriate algorithms, e.g. %in% is O(N) whereas naive might be O(N^2^)

c. Use C or Fortran code.
NOTE: Assigning to columns might copy whole data.frame! This is why data.frames are slow. Look at the example mentioned in section 3.3.3 "Tracing copies of an object" in the "Writing R Extensions" manual.

Functions that take variable number of arguments behave strangely when you try to use the functional aspects of R. Consider the following examples:

R> mean(2,3,8) # doesn't work
R> sum(2,3,8) # works?
R> mean(c(2,3,8)) # works now
R> sum(c(2,3,8)) # also works!  Why?
R> # Now some *really strange* behavior
R> l <- list(x=c("hello","world"),y=c("goodbye","world"))
R> lapply(l,file.path) # doesn't work!
R> lapply(l, function(x) do.call("file.path",as.list(x))) # WORKS!

Use the package 'rbenchmark'. It provides relative speed comparisons.

lengths can be used to find out length of eaach element of the list. Also, it's much faster!!

R> l <- list(1:10, 3:5, "vijay", c(32,22))
R> sapply(l, length)
R> lengths(l) # same as sapply(l, length)

If there is problem installing rgl then just download the binary from CRAN and install it manually by running R from the directory where you put the "tgz" like this:
```
bash$ Rscript -e "install.packages('rgl_0.96.0.tgz',repos=NULL)"
```
OR
```
R> Sys.setenv(LDFLAGS="-L/usr/X11/lib") # libGL.1.dylib
R> install.packages(c("rgl"))
```

If there is problem with installing gmp/Rmpfr (especially when using R from Macports) then check if the CFLAGS environment variable is set. After spending 3+ hours (2017.01.01) I realized that I could install it quite easily with the following:

bash$ R CMD config --all  # print all environment variables!
bash$ R CMD config CFLAGS # will give the value of CFLAGS from /opt/local
bash$ R CMD config CPPFLAGS
bash$ sudo R

R> Sys.setenv(CFLAGS="<output from previous line with modifications>")
R> Sys.setenv(CPPFLAGS="<output from CPPFLAGS line with modifications>")
R> install.packages(c('gmp','Rmpfr'))

If nothing works just install the binary version! --- 2017.03.11

Close R and then start a new session and try the following:

R> Sys.setenv(LDFLAGS="-L/opt/local/lib"); Sys.setenv(CFLAGS="-I/opt/local/include"); install.packages(c('gmp','Rmpfr')) ## --- 2017.04.28

R is very flexible, and dynamic language!! You can modify functions in a live session. See the section "Inserting Code with trace" on http://www.biostat.jhsph.edu/~rpeng/docs/R-debug-tools.pdf

Basically, when you are defining an R function you are defining a callable object which when called with allowed arguments will run a set of expressions. The arguments for the function can be obtained by formals function and the body can be obtained using the aptly named body function! Consider the following
```
R> f <- function(x) x * 2
R> formals(f) # $x # only one argument
R> formals(plot) #
R> body(f) # x * 2
R> body(f)[[3]] <- 15 # modifying body of f!!
R> f # function(x) x * 15     ## YOWZA!
```
See the pdf! It reveals the strength of R!!

Use seq_along in for loops instead of the 1:length idiom. For e.g., use

R> for(i in seq_along(x)) x[i] <- x[i]/100

instead of

R> for(i in 1:length(x)) x[i] <- x[i]/100

While seq_along might be slightly faster it is more helpful when x is empty! Consider the outputs in the cases below:

R> x <- c()
R> 1:length(x)  # INCORRECT!!
1 0
R> seq_along(x) # CORRECT!!
integer(0)

See section 8.1.60 in "The R Inferno" for similar advice!

Similarly, use seq_len instead of 1:n idiom in for loop! Compare

R> n <- 0
R> for(i in 1:n) print(i)   ## INCORRECT!
R> for(i in seq(n)) print(i) ## Still INCORRECT!!
R> for(i in seq_len(n)) print(i)  ## CORRECT!

Use edit to modify function definition. See ?edit

R> f <- function(x) x * x
R> f <- edit(f)  # Make changes in your editor!
R> f  ## See the new changes!!

Do not use == for membership checking. It causes subtle bug when the lhs and rhs are of length greater than 1. Always use %in% for membership checking!
```
R> x <- 1:10
R> x == c(2,4,6,8,10)   # Only returns one TRUE!!  INCORRECT!
R> x %in% c(2,4,6,8,10) # Returns five TRUEs! CORRECT!
```
See sections 8.1.6 and 8.1.7 in "The R Inferno" for more examples.

Also, == doesn't do NA handling correctly whereas %in% does. Check out the following segment:
```
R> d <- data.frame(gender=c('m','m',NA,'f','f'),age=c(25,28,35,25,22))
R> d[d$gender=='m', ] ## returns 3 rows!! INCORRECT!!
R> d[d$gender %in% c('m'), ] ## returns only 2 rows!! CORRECT!!
```
NOTE: This is not because of stringsAsFactors too!! It will still work incorrectly even if you used stringsAsFactors=FALSE in the data.frame construction!!

If you're using data.table in a function then be careful in how you name function arguments. This is because data.table uses some clever environment and NSE tricks to make data analysis easier but which breaks how R behaves! Below examples demonstrate the problem:

R> freq <- function(n) 440*(2^(1/12))^n
# INCORRECT BEHAVIOR!
R> generate_sine <- function(code) {
     frequencies <- data.table(n=(-33:14), code=c(letters,LETTERS)[1:48])
     frequencies[code==code, tuneR::sine(freq(n), duration=10000)] ## PROBLEM HERE!
   }
# CORRECT BEHAVIOR!
R> generate_sine <- function(fcode) {
     frequencies <- data.table(n=(-33:14), code=c(letters,LETTERS)[1:48])
     frequencies[code==fcode, tuneR::sine(freq(n), duration=10000)] ## CORRECT!
}

The problem in the first case is that it is looking up code in the environment of frequencies data.table and since we are comparing code with itself it is always TRUE and hence it will always return all the rows!! Basically, this is what is happening in the first instance:

R> d <- data.table(a=c(1,2,3,1,2,3), b=11:15)
R> a <- 3  # I am looking for rows with this value!!
R> d[a==a, ] ## Problem!  It never looks for the `a` defined in the previous line!
R> d[a %in% a, ] ## Won't work either!!

strrep can be used to create repeated strings. Mnemonic: string repeat. This function is just like the rep function.
```
R> strrep('-', 25) # dashed line
```
Use RSQLite.spatialite from https://github.com/pschmied/RSQLite.spatialite . This requires compilation so it might be harder to get it to work on windows.

To profile R code to find out which line numbers and particular function calls are slow try the following in your R session.

R> Rprof("profile1.out", line.profiling=TRUE)
R> source("<file-to-profile>.R")
R> Rprof(NULL) # Done with profiling
R> summaryProf("profile1.out", lines="show")

If R refuses to update packages it's because they're located in the system as well as the user library and it doesn't know quite which to update. The easiest solution is to delete one of the packages (i.e., delete the directory from the folder) and then try to reinstall stuff. I had problems with Rcpp, and xts on OSGEO-live VM and below is what I did to fix it.

R> library(data.table)
R> pkgs <- as.data.table(installed.packages()[,c("Package", "LibPath")])
R> pkgs[,.N,by=Package][N>1, .(Package)] ## See duplicate packages
R> dups <- pkgs[, .N,by=Package][N>1,Package]
R> pkgs[Package %in% dups, ]  ## find duplicate packages
R> ## Alternatively you can do this
R> setkey(pkgs, Package)
R> pkgs[pkgs[, .N, by=Package][N>1,]] ## data.table power!!
R> ## Rcpp and xts were installed in /usr/lib/R and /usr/local/lib/R/site-library !!

Contingency tables are created using xtabs. xtabs uses a formula interace. Most of the values listed by xtabs can be generated by aggregating functions in either data.table and/or dplyr. Consider the example from the help manual.
```
R> DF <- as.data.table(as.data.frame(UCBAdmissions))
R> xtabs(Freq ~ Gender + Admit, DF)
R> DF[, .N, by=.(Gender, Admit)] ## same...just looks weird!
```

As much as I like data.table and it claims that it is a data.frame it doesn't behave like data.frame for a very basic indexing case:

R> mtcars[2] # second column
R> mtcars[c(2,4)] # second and fourth columns
R> mtcarsdt <- as.data.table(mtcars)
R> mtcarsdt[2] # second row!! What??
R> mtcarsdt[c(2,4)] # second and fourth rows!!  Strange!!
R> # The correct way is like this:
R> mtcarsdt[, 2, with=FALSE]        # Notice the empty i index
R> mtcarsdt[, c(2,4), with=FALSE]

To convert a ragged list into a matrix, with NA for missing values, use something like below:
```
R> l <- list(1:2,1:3,1:4,1:5,1:6)
R> m <- do.call(rbind, lapply(l, '[', seq_len(max(lengths(l)))))
```
See http://r.789695.n4.nabble.com/Convert-quot-ragged-quot-list-to-matrix-td895283.html for some very neat ideas!!
```
R> unname(t(do.call(cbind,lapply(l,ts))))  ## Gabor Grothendieck's idea
```

If you wish to plot histogram alongwith density then use probability=TRUE or freq=FALSE with your hist function.

R> with(mtcars, hist(mpg, probability=TRUE, breaks=20))
R> # OR with(mtcars, hist(mpg, freq=FALSE, breaks=20))
R> with(mtcars, lines(density(mpg), col='red', lwd=1.2))

To check packages first build a .tar.gz package and then check the resulting archive.

bash$ cd ~/code/R/data.table
bash$ R CMD build --no-build-vignettes .
bash$ R CMD check data.table_1.10.5.tar.gz # Today's build archive

findInterval is a very handy function to put values into appropriate categories. Consider the example below of how to assign appropriate grades without having to do a lot of work!

R> d <- data.frame(points=c(78, 82, 95, 100, 70, 65))
R> cutoffs <- c(51,61,71,81,91,101) ## Cutoff thresholds for grades!
R> d$grp <- findInterval(d$points, cutoffs)
R> d$grade <- letters[length(cutoffs)-d$grp]

The numbers in d$grp correspond to the following values:

| Group      | Value |
|------------+-------|
| [-Inf, 51) |  0    |
| [51,   61) |  1    |
| [61,   71) |  2    |
| [71,   81) |  3    |
| [81,   91) |  4    |
| [91,  101) |  5    |
| [101, Inf) |  6    |

To verify that this is, in fact, the case try the following:

R> d1 <- data.frame(points=1:101, expected=rep(c(0,1,2,3,4,5,6),c(50,10,10,10,10,10,1)))
R> d1$grp <- findInterval(d1$points, cutoffs)
R> stopifnot(all(d1$grp == d1$expected))

Use find to see where a particular value was defined. For e.g.,

R> find("ls")  ## package:base
R> find("drivers") ## package:MASS
R> find("petrol") ## package:MASS
R> find("[[") ## package:base
R> find("find") ## package:utils
R> ?`find`

Do not use nrow function!! Using nrow in a function which updates a matrix, even a predefined one, might cause massive copies, and painfully slow code, of matrix/data.frame and yield horrible performance. This is due to R's copy-on-write semantics! Compare the functions below:

R> f1 <- function() {
     dimstate = 100; nmcmc = 1e4
     chain = matrix(0,nrow=nmcmc,ncol=dimstate);
     for (i in seq_len(nmcmc)) {
       if (i == nrow(chain)) { ## `nrow` causes slowness!!
       }
       x = rnorm(dimstate,mean=0,sd=1)
       chain[i] = x
     }
   }

R> f2 <- function() {
     dimstate = 100; nmcmc = 1e4
     chain = matrix(0,nrow=nmcmc,ncol=dimstate);
     for (i in seq_len(nmcmc)) {
       if (i == dim(chain)[1L]) { ## same as `nrow` but not so slow!!
       }
       x = rnorm(dimstate,mean=0,sd=1)
       chain[i] = x
     }
   }

R> f3 <- function() {
     dimstate = 100; nmcmc = 1e4
     chain = matrix(0,nrow=nmcmc,ncol=dimstate);
     for (i in seq_len(nmcmc)) {
       if (i == nmcmc) { ## simplest/easiest and most efficient way!  Use this!!!!!
       }
       x = rnorm(dimstate,mean=0,sd=1)
       chain[i] = x
     }
   }

R> library(rbenchmark)
R> benchmark(f1(),f2(), f3(), replications=1L)

See the excellent blog post at https://statisfaction.wordpress.com/2017/12/10/nrow-references-and-copies/

Also, run both f1() and f2() in R started with the --verbose switch to see the number of GCs that happen for f1 and f2.

If you want to verify that all the packages that you have installed can be loaded you will have to set the R_MAX_NUM_DLLS environment variable to some high number (usually slightly more than the number of installed packages is a good idea) will allow you to load all the packages. See the code below:
```
bash$: R_MAX_NUM_DLLS=400 R
```
```
R> pkgs = installed.packages()[,"Package"]
R> lapply(pkgs, library, character.only=TRUE)
R> problems_loading = c("ff")
R> lapply(pkgs[!pkgs %in% problems_loading], library, character.only=TRUE)
```
To use gpu in R just use the "gpuR" package!! It makes using GPUs in R extremely easy!

Use the argument local=TRUE in the example function call to prevent the function from polluting your enviroment with variables defined in the example portions!!

~ $ R --vanilla

R> ls(); example(abbreviate); ls() # New variables!  Can easily overwrite your variables.
R> quit('no')

~ $ R --vanilla

R> ls(); example(abbreviate, local=TRUE); ls() # No new variables!
R> quit('no')

Use the library fastmatch which provides a drop in replacement for the base::match function. Add the following two lines in your .Rprofile.
```
library(fastmatch)
match <- fmatch
```
Use fasttime::fastPOSIXct instead of as.POSIXct! It is considerably faster.

If there are some problems with PROJ5 then make sure that you set PKG_CONFIG_PATH variable and use it with the sudo R CMD commands. I had to do the following:

~$ sudo PKG_CONFIG_PATH=/opt/local/lib/proj5/lib/pkgconfig R CMD INSTALL .
~$ man pkg-config
~$ echo "export PKG_CONFIG=/opt/local/lib/proj5/lib/pkgconfig" >> ~/.zshrc

base::.rmpkg is a useful function to remove the prefix 'package:' from string vectors. Also, it lists how to efficiently use sub/gsub functions!
Sometimes we deliberately wish to override graphics settings in our function and when we do this we should modify the graphics state for our function to revert graphics state settings to their original values. This can be achieved by using the on.exit funciton. Have the following lines whenever you make changes to graphics state.
```
opar <- par(no.readonly=TRUE)
on.exit(par(opar), add=TRUE)  ## ?on.exit
```

A useful random string generating function.

R> genrandstr <- function(stringlen = 5) {
+    paste0(sample(c(letters,LETTERS), stringlen, replace=TRUE), collapse="")
+  }

R> set.seed(1234L)
R> genrandstr() # "fGFGS"
R> genrandstr(8) # "HamIAKCo"
R> replicate(5, genrandstr(2)) # "Wp" "Ro" "nj" "mq" "pi"
R> strs <- replicate(500, genrandstr(8)) # 500 strings of length 8 each!
R> tstrs <- table(strs) ## frequency of strs
R> tstrs[tstrs>1]  ## NO duplicates!!
R> strs <- replicate(50000, genrandstr(8)) # 50000 strings of length 8 each!
R> tstrs <- table(strs) ## frequency of strings
R> tstrs[tstrs>1] ## still NO duplicates!!
R> sapply(sample(5:20,500,replace=TRUE), genrandstr) ## Super useful!!

A very useful function!

R> run_examples_from_package <- function(pkgname, local=TRUE) {
+    pkg <- sprintf("package:%s", pkgname)
+    stopifnot(pkg %in% search())
+    pdfout = sprintf("%s_examples.pdf", pkgname)
+    msgout = sprintf("%s_examples_message.txt", pkgname)
+    pdf(pdfout); on.exit(dev.off(), add=TRUE)
+    msg <- file(msgout, open="wt")
+    sink(msg)
+    sink(msg, type="message"); on.exit(sink(), add=TRUE)
+    op <- options(error=NULL); on.exit(options(op), add=TRUE)
+    op1 <- options("example.ask"=FALSE); on.exit(options(op1), add=TRUE)
+    ## invisible(sapply(ls(grep(pkg, search(), fixed=TRUE)), example, package=pkgname, character.only=TRUE, local=local))
+    invisible(sapply(ls(pkg), example, package=pkgname, character.only=TRUE, local=local))
+  }
R>
R> run_examples_from_package("sf") ## This will generate two files in the directory

Try the following:

R> run_examples_from_package("data.table", local=FALSE) ## local=TRUE givew weird errors!  Non-standard evaluation!!
R> run_examples_from_package("sf")

This creates two files in getwd()! The output of plotting commands are in _examples.pdf and the output and message text are in _examples_message.txt files!

So for our examples the following files will be added in getwd() data.table_examples.pdf, sf_examples.pdf (plotting output) data.table_examples_message.txt, sf_examples_message.txt (console and stderr otuput)

While sample and runif return same results sample can be faster, especially if you're doing simulations and/or bootstrapping. Consider the following:

R> set.seed(1)
R> (tmp1 <- ceiling(runif(1000,0,100)))
R> set.seed(1)
R> (tmp2 <- sample(1:100,1000,replace=TRUE))
R> all.equal(tmp1, tmp2)  ## TRUE
R> library(rbenchmark)
R> n <- 100; m <- 1000
R> benchmark(sample(n,size=n*m,replace=T), ceiling(runif(n*m,0,n)),
+            ceiling(n*runif(n*m)),floor(runif(n*m,1,n+1)))
R>
R> set.seed(1); c1 <- sample(n,size=n*m,replace=T)
R> set.seed(1); c2 <- ceiling(runif(n*m,0,n))
R> set.seed(1); c3 <- ceiling(n*runif(n*m))
R> set.seed(1); c4 <- floor(runif(n*m,1,n+1))
R> all.equal(c1,c2)
R> all.equal(c2,c3)
R> all.equal(c3,c4)

Using lapply/sapply can simplify a lot of geospatial code! Consider this example from the vignette of sp::over on pg 3.

> library(sp)
> x <- c(0.5,0.5,1,0.5)
> y <- c(1.5,0.5,0.5,0.5)
> xy <- cbind(xy)
> dimnames(xy)[[1]] <- letters[1:4]
> pts <- SpatialPoints(xy)
> xpol <- c(0,1,1,0,0)
> ypol <- c(0,0,1,1,0)
> pol = SpatialPolygons(list(
+   Polygons(list(Polygon(cbind(xpol-1.05,ypol))), ID="x1"),
+   Polygons(list(Polygon(cbind(xpol,ypol))), ID="x2"),
+   Polygons(list(Polygon(cbind(xpol,ypol - 1.0))), ID="x3"),
+   Polygons(list(Polygon(cbind(xpol + 1.0, ypol))), ID="x4"),
+   Polygons(list(Polygon(cbind(xpol+.4, ypol+.1))), ID="x5")
+   ))

This can be simplified considerably by storing polygon coordinates in a list and then using lapply to generate the final polygon. Consider the following simplification (at least in my opinion):

R> l <- list(cbind(xpol=xpol-1.05,ypol), cbind(xpol,ypol),
+           cbind(xpol,ypol=ypol-1), cbind(xpol=xpol+1,ypol),
+           cbind(xpol=xpol+.4, ypol=ypol+.1))
R> pol1 <- SpatialPolygons(
+            lapply(seq_along(l),
+                   function(x)
+                     Polygons(list(Polygon(l[[x]])), ID=sprintf("x%s",x))))
>
> all.equal(pol, pol1, check.attributes=FALSE)

Rmpi needs to be installed from source and if you're using openmpi from macports then you need to use configure.args to set variables needed to get Rmpi to compile. I had to use the following:

R> install.packages(c("Rmpi"), dependencies=TRUE,configure.args=c("--with-Rmpi-include=/opt/local/include/openmpi-devel-mp"
         "--with-Rmpi-libpath=/opt/local/lib/openmpi-devel-mp", "--with-Rmpi-type=OPENMPI"))

bash$ port contents openmpi-devel-default # find paths for Rmpi-libpath and Rmpi-include!

R's type hierarchy: NULL < raw < logical < integer < double < complex < character < list < expression

Here are a couple of very useful frequency functions! I got this idea from https://st2.ning.com/topology/rest/1.0/file/get/4077505910?profile=original

## I have it organized how I think it will be beneficial to me...
freqsdt <- function(DT, groupcols, percent=TRUE) {
    stopifnot(is.data.table(DT), is.character(groupcols) & length(groupcols) > 0L, all(groupcols %chin% colnames(DT)))
    res <- DT[, .(frequency=.N), by=groupcols][order(-frequency)][,percentage:=100*frequency/sum(frequency)]
    res
    outcols <- colnames(res)
    if(!isTRUE(percent)) outcols <- setdiff(outcols, "percentage")
    res[, ..outcols]
}

allfreqs <- function(DT, catlim=100L) {
    stopifnot(is.data.table(DT), is.integer(catlim) & catlim > 0L)
    nms <- names(DT)
    nmlen <- length(nms)
    final <- data.table(NULL)
    for(i in seq_len(nmlen)) {
        freqs <- freqsdt(DT, c(nms[i]))
        if(nrow(freqs) <= catlim) {
            final <- rbind(final,
                           data.table(vname=nms[i],value=as.factor(freqs[[1]]),
                                      frequency=freqs[[2]], percent=freqs[[3]]))
        }
    }
    final
}

And you use them like so:

R> d <- data.table(grp=sample(c('a','b','c'),1000,replace=T),grp1=sample(1:2,1000,replace=T),x=runif(1000),y=runif(1000),z=runif(1000))
R> freqsdt(d, c("grp"))
R> freqsdt(d, c("grp1"))
R> freqsdt(d, c("grp", "grp1"))
R> freqsdt(d, c("grp1", "grp"))
R> allfreqs(d) ## Probably should be called first!! Before anything else...

Get all the different S3methods of a function

getAllS3methods <- function(func) {
    ## I was looking at ?sf::`st_cast` and wanted to see all the different st_cast methods for different geometry types.
    ## This is how to go about it.
    ##
    ##  R> library("sf"); ?st_cast
    ##  R> getAllS3methods("st_cast")
    ##  R> str(getAllS3methods("st_cast")
    ##
    ##  This is very helpful for geospatial packages!
    ##
    stopifnot(is.character(func), length(func) == 1L)
    m <- methods(func)
    ms <- gsub("\\*$", "", as.character(m))
    s3methods <- sapply(ms, function(x) c(strsplit(x,"\\."))) ## each item is a two character vector c("function" "class")!

    ##
    ## To see the issue with s3methods do the following:
    ##
    ## table(sapply(s3methods, length))
    ## s3methods[which(sapply(s3methods, length) > 2]

    fixed_s3methods <- lapply(s3methods, function(x) c(x[[1]], paste(x[2:length(x)], collapse="."))) ## methods like print.data.table and print.boostrap.lca

    funcdefs <- lapply(fixed_s3methods, function(x) do.call(getS3method, as.list(x)))
    funcdefs
}

Generate a random filename.

generate_random_filename <- genrandfilename <-
  function(minlen=5L,maxlen=20L,filechars=paste0(c(letters,LETTERS,0:9),collapse=""),extensions=c("pdf","exe","txt","md","xls","doc","xlsx","dat","csv","shp","prj"),allowspaces=FALSE) {
    stopifnot(is.integer(minlen), is.integer(maxlen), minlen > 0L, maxlen > 0L, maxlen >= minlen)
    stopifnot(isTRUE(allowspaces) || isFALSE(allowspaces), is.character(filechars), is.character(extensions))
    require("data.table")
    len <- sample(seq.int(minlen,maxlen), 1)
    filechars <- unlist(strsplit(filechars,""))
    ## browser()
    char_prob_tbl <- data.table(char=filechars,prob=1/length(filechars))
    if(isTRUE(allowspaces)) {
        ## When spaces are allowed, modify the probability in such a way that space is selected twice as often as other chars
        char_prob_tbl <- rbind(char_prob_tbl, list(" ", 2/(length(filechars)+1)))
        char_prob_tbl[char != " ", prob:=(1-char_prob_tbl[char==' ', prob])/ .N]
        ## char_prob_tbl[char != " ", prob:=(1 - .SD[char==' ', prob])/ .SD[char != ' ', .N]]
        stopifnot(char_prob_tbl[, sum(prob)] == 1L)
    }
    ## filename <- paste0(sample(filechars, len, replace=T), collapse="") ## len can be larger than length of filechars
    filename <- paste0(sample(char_prob_tbl$char, len, prob=char_prob_tbl$prob, replace=T), collapse="") ## len can be larger than length of filechars
    ext <- sample(extensions, 1L, replace=TRUE)
    filename <- sprintf("%s.%s", filename, ext)
    filename
  }

And you use them like so:

R> genrandfilename()
R> files <- replicate(15, generate_random_filename())

Generate a list of prefixes.

prefixes <- function(x) {
  ## This is analogous to J's \
  ##
  ## In J try:
  ## ]\'banana'
  ##
  ## In R:
  ## R> prefixes(1:5)
  ## R> prefixes(letters[1:5])
  ##
  ## R> all.equal(cumsum(1:10), unlist(lapply(prefixes(1:10), sum)))
  ## R> all.equal(cumprod(1:10), unlist(lapply(prefixes(1:10), prod)))
  ## R> all.equal(cummax(1:10), unlist(lapply(prefixes(1:10), max)))
  ## R> with(list(x=runif(15)), all.equal(cumsum(x), unlist(lapply(prefixes(x), sum))))
  ## R> with(list(x=runif(15)), all.equal(cumprod(x), unlist(lapply(prefixes(x), prod))))
  ## R> with(list(x=runif(15)), all.equal(cummax(x), unlist(lapply(prefixes(x), max))))
  ##
  stopifnot(is.vector(x))
  mapply(function(a,b) x[seq.int(a,b)], rep(1,length(x)), seq_along(x))
}

This can be used to create some other cumulative functions which are not present in R. Here are a couple of examples:

cumsd <- function(x) {
  stopifnot(is.vector(x), (is.numeric(x) || is.integer(x)))
  unlist(lapply(prefixes(x), sd))
}
cummean <- function(x) {
  stopifnot(is.vector(x), (is.numeric(x) || is.integer(x)))
  unlist(lapply(prefixes(x), mean))
}

I have often felt the need to shuffle the input for my functions/algorithms to ensure that they are not relying on the ordering of data. For instance, when I'm first writing a function which takes a sequence as input, I'll use a simple sequence like seq.int(1L, 10L) as input to get the function working. And, once the function is ready I'll start using it. And, as I continue to use this function/algo I will notice that when the input is not ordered (or simple) I often get results that don't make sense. And, this is usually the result of using simple vectors for testing when designing the function initially. As my understanding improves and I realize the unstated, but definitely present, assumption that I had made while writing the function I will have to try to fix the function with shuffled vectors. And, this function is an attempt to create permutations of an input vector or data.frame which can be used to shuffle these collections which can then be used as argument[s] for function. So, for instance if I had mymean <- function(x) { sum(x)/length(x) } then I would check it like this lapply(permutations(1:15,n=20), mymean) to ensure that I get the same result for all the permutations!

permutations <- function(x, n=6L) {

  ## useful function to generate permutations of vector or data.frame.
  ##
  ## Can be used to generate random ordering so that you can check whether
  ## your algo/functions depend on particular ordering of variables.
  ##
  ## R> permutations(1:10)
  ## R> permutations(mtcars)

  stopifnot((is.vector(x) & length(x) > 1L) || (is.data.frame(x) & nrow(x) > 1L))

  idx <- seq_along(x)
  if(is.data.frame(x)) {
    idx <- seq_len(nrow(x))
  }

  ## indices <- replicate(n, sample(idx, length(idx))) ## Permutations are cols...
  ## indices <- t(replicate(n, sample(idx, length(idx)))) ## permutations are rows...
  ## indices <- lapply(seq_len(n), function(x) sample(idx, length(idx)))
  ##
  indices <- replicate(n, sample(idx, length(idx))) ## Permutations are cols...

  res <- if(is.vector(x)) {
    ## x[indices]
    lapply(seq_len(ncol(indices)), function(i) x[indices[,i]])
  } else {
    ## x[indices,,drop=FALSE] ## drop=FALSE...just in case we get 1 col data.frame!!
    lapply(seq_len(ncol(indices)), function(i) x[indices[, i], , drop=FALSE])
  }
  res
}

When using R interactively I have often found the need to modify some options or par settings for a particular piece of code snippet which needs to be reverted back after the code snippet is done. This can be accomplished by writing a function which uses on.exit. This was explained by Patrick Burns at https://www.burns-stat.com/the-options-mechanism-in-r/

withOptions <- function(optlist, expr) {
    opts <- options(optlist)
    on.exit(options(opts))
    expr <- substitute(expr)
    eval.parent(expr)
}
R> print((1:10)^-1)
R> withOptions(list(digits=3),print((1:10)^-1))

And we can do the same thing for par too! And, in my opinion, it is a whole lot more helpful with par than it is with options.

withPars <- function(parlist, expr) {
    opar <- par(parlist)
    on.exit(par(opar))
    expr <- substitute(expr)
    eval.parent(expr)
}
R> plot(mtcars$mpg, mtcars$disp)
R> withPars(list(mar=c(1,1,1,1)), plot(mtcars$mpg, mtcars$disp))

It is often required to fix column names for GIS related work. The below function tries to fix column names that are likely to work in GIS softwares.

fixcolnames <- function(x, lowercase=FALSE) {

  ## Very useful for GIS related work!!!!

  f <- ifelse(lowercase, tolower, identity)
  res <- gsub("^(_?[0-9]+)?_|_$", "", f(gsub("[^A-Za-z0-9]+", "_", as.character(x))))
  ifelse(res == "", "V1", res)
}
R> # Try this to see the usefulness of this little function!
R> weird_colnames <- c(" with spaces ", "| with strange' punctuation |", "with-hyphens!", "with(parenthesis)",
     "0 with leading numbers", "99 more leading numbers","But numbers 99 inside are okay!",
     "  99 numbers and beginning spaces is tricky!", "Overall, Case Sensitive!",
     "") ## Caught the last one and leading spaces with numbers case by using PBT (hypothesis is excellent!!)
R> d <- data.table(weird_colnames=weird_colnames)
R> d[ , `:=`(fixed_colnames = fixcolnames(weird_colnames))]
R> d

To load SparkR in an R session make sure to download spark and set the SPARK_HOME envvar. Then you can load SparkR library using the following:

library(SparkR,lib.loc=file.path(Sys.getenv("SPARK_HOME"),"R","lib"))
sparkR.session(master="local[*]",sparkConfig=list(spark.driver.memory="2g"))

Use all.equal to check if numbers are equal. R's dynamic nature makes comparison using == very strange. Check out the difference between == and all.equal here:

R> deg2rad <- function(deg) deg * pi / 180
R> rad2deg <- function(rad) rad * 180 / pi
R> d <- data.table(deg=1:15)[,rad:=deg2rad(deg)][,deg1:=rad2deg(rad)]
R> d[, all(deg == deg1)] ## returns FALSE!!
R> d[, .(deg, deg1)] ## printing looks ok....
R> d[deg != deg1, .(deg, deg1, diff=deg-deg1)] ## main problem....
R> d[, all.equal(deg, deg1)] ## returns TRUE!!

A very useful function to display all the objects in the R environment. This is just like RStudio's environment pane...but this works in base R too!

lsos <- lsobjs <- .ls.objects  <- function(pos=1L, pattern, order.by, decreasing=FALSE, head=FALSE, n=5) {

  ##
  ## See http://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session
  ##
  ## Modified to sort correctly based on size! There's a subtle bug in the SO answer

  napply <- function(names, fn) sapply(names, function(x) fn(get(x, pos=pos)))
  names <- ls(pos=pos, pattern=pattern)
  if(length(names) == 0L) return(character(0))

  obj.class <- napply(names, function(x) as.character(class(x))[[1]])
  obj.mode <- napply(names, base::mode)
  obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
  obj.size <- napply(names, function(x) {
                       l <- capture.output(print(object.size(x), units="auto"))
                       l[length(l)]
                  })
  obj.dim <- t(napply(names, function(x) as.numeric(dim(x))[1:2]))
  vec <- is.na(obj.dim)[,1] & (obj.type != "function")
  obj.dim[vec, 1] <- napply(names, length)[vec]

  out <- data.frame(obj.type, obj.size, obj.dim)
  names(out) <- c("Type", "Size", "Rows", "Columns")

  if(any(obj.type %in% c("RasterStack", "RasterBrick", "SpatRaster"))) {
    nlayers <- function(x) ifelse(inherits(x, c("RasterStack", "RasterBrick")),
                                  raster::nlayers(x),
                                  ifelse(inherits(x, c("SpatRaster")), terra::nlyr(x), NA))
    out <- cbind(out, Layers=napply(names, nlayers))
  }

  if(!missing(order.by)) {
    idx <- if (order.by=="Size") {
      sizes <- napply(names, object.size)
      order(sizes, decreasing=decreasing)
    } else {
      order(out[[order.by]], decreasing=decreasing)
    }
    out <- out[idx, ]
  }

  if(head)
    out <- head(out, n)

  gc() ## This function uses a lot of memory! Free it before exiting.
  return(out)
}

R> lsos()

An added advantage of lsos is that it can be particularly useful for examining lists and environments. Check this out in your console:

R> lsos(mtcars)
R> lsos(environment(par))
R> lsos(getS3method("[","data.table"))

DE-9IM representations of the various spatial predicates:

Contains(a,b)    <==>  'T*****FF*'
CoveredBy(a,b)   <==>  'T*F**F***' \/ '*TF**F***' \/ '**FT*F***' \/ '**F*TF***'
Covers(a,b)      <==>  'T*****FF*' \/ '*T****FF*' \/ '***T**FF*' \/ '****T*FF*'
Crosses(a,b)     <==>  'T*T******' \/ 'T*****T**' \/ '0********'
Disjoint(a,b)    <==>  'FF*FF****'
Equals(a, b)     <==>  'T*F**FFF*'
Intersects(a,b)  <==>  'T********' \/ '*T*******' \/ '***T*****' \/ '****T****'
Overlaps(a,b)    <==>  'T*T***T**' \/ '1*T***T**'
Touches(a,b)     <==>  'FT*******' \/ 'F**T*****' \/ 'F***T****'
Within(a,b)      <==>  'T*F**F***'

Contains(a,b)    <==>  Within(b,a)
CoveredBy(a,b)   <==>  Covers(b,a)
Covers(a,b)      <==>  CoveredBy(b,a)
Disjoint(a,b)    <==>  ! Intersects(a,b)
Equals(a,b)      <==>  Within(a,b) /\ Contains(a,b)
Intersects(a,b)  <==>  ! Disjoint(a,b)
Within(a,b)      <==>  Contains(b,a)

See https://en.wikipedia.org/wiki/DE-9IM#Spatial_predicates . These are useful with sf::st_relate function!

Column details:

## Column Details for a Data Frame
## ^^^    ^^^^^^^       ^    ^
colDetails <- function(DF) {
  stopifnot("Needs a data.frame or data.table" = inherits(DF, "data.frame"))
  colnames <- colnames(DF)
  colclasses <- sapply(DF, classes)
  colidx <- seq_along(DF)
  num_nas <- sapply(DF, numna)
  num_uniq <- sapply(DF, num_unique)
  DFM <- copy(DF)
  for(i in seq_len(ncol(DF))) {
   if(any(class(DF[[i]]) %in% c("factor","character","Date","POSIXct","POSIXlt"))) { DFM[[i]] <- as.numeric(rep(NA,nrow(DFM))) }
  }
  colstats <- function(x, na.rm=TRUE, digits=3L) {
   ## idea from McElreath's rethinking::precis function
   stats <- if(is.numeric(x)) {
     c(round(mean(x,na.rm=na.rm),digits=digits), round(sd(x,na.rm=na.rm),digits=digits), round(quantile(x,probs=c(0.055,0.945),na.rm=na.rm),digits=digits))
   } else {
     c(NA, NA, NA, NA)
   }
   names(stats) <- c("mean","sd","5.5%","94.5%")
   stats
  }
  ## stats <- t(apply(DF,2,colstats))
  stats <- do.call(rbind, lapply(DF, colstats))
  histosparks <- sapply(DFM, histospark) ## see below for histospark
  DD <- data.table(ColName=colnames, ColClasses=colclasses, ColIdx=colidx, NumNA=num_nas, PctNA=round(100*num_nas/nrow(DF),3), NumUniq=num_uniq, PctUniq=round(100*num_uniq/nrow(DF),3), row.names=NULL)
  DD <- cbind(DD, stats)
  DD[,Histogram:=histosparks]
  DD[]
}
R> colDetails(mtcars)

And, the helper functions used in the function above are:

numna <- numnas <- numNA <- numNAs <- function(x) sum(is.na(x))
classes <- function(x) paste(class(x), collapse=", ")
histospark <- function(x, width=10L) {
  if(all(is.na(x))){return("")}
  if(is.integer64(x)){ x <- as.numeric(x) }
  sparks <- c("\u2581","\u2582","\u2583","\u2585","\u2587")
  bins <- graphics::hist(x, breaks=width, plot=FALSE)
  factor <- cut(bins$counts / max(gins$counts), breaks=seq(0L,1L,length=length(sparks)+1L),labels=sparks,include.lowest=TRUE)
  paste0(factor,collapse="")
}

Padded numbers are often useful for creating unique IDs in geospatial related tasks. This function simplifies creating padded numbers instead of having to manually calculate them.

paddedNumbers <- function(nums) {
    ## paddedNumbers(c(1,19,101,1010,10283)) returns c("00001","00019","00101","01010","10283")
    stopifnot(all(nums > 0))
    width <- nchar(as.character(max(nums)))
    fmt <- sprintf("%%0%dd",width)
    sprintf(fmt, nums)
}

Generate a random date series.

generate_random_date_range <- function(start_date=as.Date('2000-01-01'), end_date=Sys.Date(), num_days=30L) {
  ## Useful function to generate a date range based on start/end dates...
  ##
  ## R> generate_random_date_range(as.Date('2000-01-01'),as.Date('2022-12-31'), 30L)
  ## R> generate_random_date_range(as.Date('2000-01-01'),as.Date('2022-12-31'), as.integer(sample(365,1L))) ## much more useful...in my opinion
  ## R> generate_random_date_range(as.Date('2000-01-01'),as.Date('2022-12-31'), as.integer(sample(c(365,366),1,prob=c(3/4,1/4)))) ## even more interesting
  stopifnot(class(start_date)==class(Sys.Date()), class(end_date)==class(Sys.Date()), is.integer(num_days), end_date > start_date, end_date - start_date >= num_days)
  with(list(start_dt=sample(seq(start_date, end_date - num_days, by='1 day'),1)),seq(start_dt, start_dt+num_days-1,by='1 day'))
}

If a data.table contains integer64 the dcast.data.table will insert really large values for any cell that ought to receive NA values. From reading the issue https://github.com/Rdatatable/data.table/issues/4561 it appears that this is a rather common occurrence. The comments in that thread describe the solution: use fill=bit64::as.integer64(NA) in the dcast.data.table call.

List all objects in all environments.

getAllObjects <- function() {
  ## This can be used to determine where the object is coming from...
  ##
  ## I stumbled upon this when trying to figure out how to get `colSums` to work with `integer64`. I learned that `colSums` is definedin `package:Matrix` as well as `package:base`.
  ## This is also evident when you do `?colSums` where R will ask you which function documentation you wish to read.
  envs <- search()
  getObjects <- function(env) {
   objs <- ls(name=env)
   searchIdx <- which(env==search())
   data.table(searchidx=searchIdx,env=env,obj=objs,obj_type=sapply(objs,function(s)typeof(get(s,env))))
  }
  rbindlist(lapply(envs,getObjects))
}

Example of applying a function recursively to a list using rapply:

## Slightly ammended list from the example shown in ?rapply
X <- list(list(a = pi, b = list(c = 2L)), d = "a test", e=as.Date(c('2024-01-01','2024-12-31')))

## does not descend into X[[1]][["b"]] to take sqrt of X[[1]][["b"]][["c"]]
rapply(X, sqrt, classes="numeric", how="replace") ## skips descending into X[[1]][["b"]] because class(X[[1]][["b"]]) is a "list"!

rapply(X, function(n){ if(is.numeric(n)){sqrt(n)}else{identity(n)}}, how="replace") ## Use R's functional nature to keep it simple!
## General idea is that if you wish to apply a function recursively do not use the `classes` argument of `rapply`. Instead, make the classes check as a part of the function application!

## NOTE (vijay): If you decide to use `ifelse` instead of `if` ... just be careful!
str(Xi <- rapply(X, function(n) ifelse(is.numeric(n),sqrt(n),identity(n)), how="replace")) ## Incorrect ... check out the element Xi[["e"]]!
str(Xc <- rapply(X, function(n) ifelse(is.numeric(n),sqrt,identity)(n), how="replace"))    ## Correct ... contrast Xc$e with Xi[["e"]]

## TODO (vijay): Try to understand the behavior of `ifelse`!

I learned from https://www.pymc-labs.com/blog-posts/bayesian-inference-at-scale-running-ab-tests-with-millions-of-observations/ that binning can be helpful in scaling up bayesian statistics. And, this will require functions to calculate mean/sd of binned data. The below functions accomplish this:

binned_mean <- function(x) {
  ## R> y <- rnorm(10e6, mean=100L, sd=20L)
  ## R> yhist <- hist(y, breaks=20e3, plot=FALSE)
  ## R> mean(y); mean(sd)
  ## R> binned_mean(yhist); binned_sd(yhist)
  stopifnot(class(x)=="histogram")
  n <- sum(x$counts)
  mu <- with(x, sum(mids*counts)/n)
  mu
}

binned_sd <- function(x) {
  ## R> y <- rnorm(10e6, mean=100L, sd=20L)
  ## R> yhist <- hist(y, breaks=20e3, plot=FALSE)
  ## R> mean(y); mean(sd)
  ## R> binned_mean(yhist); binned_sd(yhist)
  stopifnot(class(x)=="histogram")
  n <- sum(x$counts)
  mu <- with(x,sum(mids*counts)/n)
  binvar <- with(x, (sum(mids^2 * counts) - n*mu^2)/(n-1))
  sqrt(binvar)
}