rJava needs jvm.dll to work properly. Since it is not in the path, add it to the PATH env. variable. jvm.dll is generally found in <jdk path>/jre/bin/server **NOTE: ** 64-bit applications need 64-bit jvm and 32-bit apps need 32-bit jvm.
See: http://stackoverflow.com/questions/7019912/using-the-rjava-package-on-win7-64-bit-with-r
Customize Sweave for better output. See: http://www.stat.auckland.ac.nz/~ihaka/downloads/Sweave-customisation.pdf
NOTE: If you're using TeXLive then you might have to add Sweave to the local search path. To see if your tex system can find Sweave.sty issue the following command:
bash$ kpsewhich Sweave.sty # Empty indicates file not found
bash$ ## If it is missing do the following:
bash$ cd ~/Library/texmf/tex/latex
bash$ ln -s /Library/Frameworks/R.framework/Resources/share/texmf Sweave
In short do the following: a. Add the following lines at end of Sweave.sty
\DefineVerbatimEnviornment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnviornment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnviornment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
b. Shorten output lines and remove continuation prompts.
<<echo=false,results=hide>>=
options(width=60, continue=" ")
@
NOTE: echo=false
will only suppress R/S code. The output will still
be displayed! results=hide
needs to be included to ignore the
output generated by R/S code chunk. See page 13 of Sweave-manual at
http://www.statistik.imu.edu/~leisch/Sweave/Sweave-manual.pdf
c. Avoid comment loss and code reformatting, and set plot size (default is 6 x 6
)
\SweaveOpts{keep.source=TRUE, width=6, height=4}
d. Generate LaTeX with R. Especially useful for tables!
\begin{center}
\begin{tabular}{rrrrrrrr}
<<results=tex,echo=false>>=
nr <- 3; nc <- 8
x <- matrix(round(rnorm(nr*nc), 2), ncol=nc)
apply(x, 1,
function(x)
cat(paste("$", x, "$", sep="", collapse=" & "), "\\\\\n")
)
@
\end{tabular}
\end{center}
e. Customising graphics
<<echo=false>>=
options(SweaveHooks=list(fig=function()
par(mar=c(5.1, 4.1, 1.1, 2.1))))
@
If you use .First function for initial setup of some variables and/or local display, this can cause some problems with compiling some packages.
For e.g., I display a date at the start of an interactive session. This is great when I use the interactive session but this caused compilation issues, at least for RcppArmadillo.
Workarounds:
i. Comment .First
function. Build the program/package and
uncomment the .First
function again.
ii. Use the --no-environ
or --no-init-file
(if that is where
you've stored your .Rprofile
file!)
[[
and $
use partial matching if exact matching fails. So
x$aa
will match x$aabb
if x does not contain a component named "aa"
and "aabb" is the only name which has a prefix "aa". For [[
,
partial matching can be controlled via the exact
argument which
defaults to NA indicating that partial matching is allowed, but
should result in a warning when it occurs. [
always requires an
exact match!
R> ?`[[`
R> ?'[[' ## also works!
To use scale/alpha functions in ggplot2 you also have to load the "scales" library. Always use the below, especially if you are using ggplot2.
R> library(scales)
R> library(ggplot2)
The NAMESPACE file is used to list the exported functions. This is
used to create the environment that gets attached when the
particular package is called using require
or library
. See
?new.env
for more information.
If you can compile the packages but install.packages won't install
the binary package (i.e. options()$pkgType == "mac.binary"
) then
use type=source
with the install.packages
command.
For e.g., I was having trouble installing rgdal but when I downloaded the source and compiled it everything worked...so I just issued the command:
R> install.packages("rgdal", type="source", configure.args=c("--with-proj-include=/opt/local/include", "--with-proj-lib=/opt/local/lib"))
and it just worked!!
Running Rserve from the interpreter creates a daemon which will not quit when you quit the session. To kill the Rserve daemon process use something like below:
bash$ kill -15 `ps aux | grep "Rserve" | grep "library" | awk '{print $2}'`
Always load the "compiler" package and call enableJIT(3) function. This enables byte compilation of all functions and makes them considerably faster!
Sometimes there are weird problems with evaluation. If you get strange
error messages then close R and change to enableJIT(2).
See ?`enableJIT`
Use "data.table" instead of data.frame. It handles large amounts of data quite easily and is very well designed. Since it overrides the data.frame class, it can be used instead. However, beware, there are some differences. See vignette('datatable-faq') for questions that are bound to come up!
Learn to use unclass
R> glm.D93 <- glm(speed ~ dist, data = cars)
R> model <- unclass(glm.D93)
Now, the variable model is a list which can be searched using $
indexing!
Use methods
, showMethods
, and getS3Method
to see what methods
are available for class or object!
R> methods(class="data.frame")
R> methods(class="sf")
R> methods(class="data.table") ## ?data.table:::cube
To do a cross join in data.table use the following
R> x <- data.table(id1=letters[1:3], vals1=1:3)
R> y <- data.table(id2=letters[4:7], vals2=4:7)
R> crossjoin <- setkey(x[,c(k=1,.SD)],k)[y[,c(k=1,.SD)],allow.crosjoin=TRUE][,k:=NULL]
Found this on http://stackoverflow.com/question/10600060/how-to-do-cross-join-in-r
If you run into "EOF within quoted string" error/warning.
Disable quoting and set stringsAsFactors
as FALSE. I.e. do something like
below in R-session:
R> read.delim(fname, quote="", stringsAsFactors=FALSE)
Never use *tmp*
as a variable name. *tmp*
is used for subset assignment
like:
R> x[3:5] <- 13:15 # This is equivalent to 3 lines below
R> `*tmp*` <- x
R> x <- "[<-"(`*tmp*`, 3:5, value=13:15)
R> rm(`*tmp*)
Examine the internals of an R object:
R> DT <- data.table(A=5:1,B=letters[5:1])
R> .Internal(inspect(DT))
R> .Internal(inspect(list.files))
R> .Internal(inspect(matrix(nrow=3,ncol=3)))
load
/save
are very useful. The files created are much smaller than
even the zip archive of csv/txt files!
readRDS
/saveRDS
are much more convenient than load/save.
http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/
To input random string in each of the rows of either data.frame/data.table use the following function:
R> f <- function(n, len=10) {
set.seed(1234L)
chars <- vector("character", n)
for (i in 1:n) {
r <- as.integer(runif(1,1,len))
start <- sample(length(letters))[1]
end <- start + r
idx <- (start:end)%%length(letters)
chars[i] <- paste0(letters[idx],collapse="")
}
chars
}
R> d <- data.table(a=1:100, b=f(100))
R> d1 <- data.frame(a=1:100, b=f(100))
Or you can try the much faster version below: (2018.01.24)
R> f1 <- function(n, len=10) {
set.seed(1234L)
l <- sapply(sample(seq_len(len), n, replace=TRUE),
function(x) sample(letters, x))
unlist(lapply(l, paste0, collapse=""))
}
R> f1(15)
R> d <- data.table(a=1:100, b=f1(100)) ## Much faster!
R> d1 <- data.frame(a=1:100, b=f1(100)) ## Much faster!
Or you can try the genrandstr
function below too! (2019.03.09)
R> d <- data.table(a=1:100, b=replicate(100,genrandstr(10)))
R> d1 <- data.frame(a=1:100, b=replicate(100,genrandstr(10)))
To examine the type of object use the function dput
. Also ?dput
See also: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
A good workflow for working with R can be breaking down things into different files. These can then be loaded independently! For an example, see https://stackoverflow.com/questions/2258092/what-best-practices-do-you-use-for-programming-in-r
Ordering data in data.table. Consider the question `How can I identify the first and the last observation within a group in R?' asked on http://www.ats.ucla.edu/stat/r/faq/firstlast.htm
R> hsb2 <- data.table(read.csv('http://www.ats.ucla.edu/stat/r/faq/hsb2.csv'))
R> setkey(hsb2, math) # Sort according to math scores
R> hsb2[,tail(.SD,1),by=prog] # highest math marks grouped by prog
R> hsb2[,head(.SD,1),by=prog] # lowest math marks grouped by prog
Check the answers generated by the above snippet with the one listed on the FAQ.
Collections of references for S4: https://stackoverflow.com/questions/4143611/sources-on-s4-objects-methods-and-programming-in-r
Use the package "gdata" instead of "xlsx"!
Be aware of types. R can waste space if you aren't careful. Consider:
R> a <- 1:1000000
R> b <- 1:1000000
R> object.size(a) # 4000040 bytes
R> object.size(b) # 4000040 bytes
R> a[[2]] <- a[[2]] + 1
R> b[[2]] <- b[[2]] + 1L
R> object.size(a) # 8000040 bytes
R> object.size(b) # 4000040 bytes
And use the L
suffix especially for indexing/subsetting.
#define USE_RINTERNALS
before #include <Rinternals.h>
is needed to gain
direct access to internals of SEXPRECS. See last paragraph of section 1.7
of R Internals.
R objects can contain a lot of information. Use class
, str
, and
dput
to investigate the type of object. NOTE: Using dput
for large
objects (e.g., data.table with millions of rows) can cause problems!!
Consider the below example:
R> n <- 100; x <- rnorm(n); y <- 3*x+0.2*rnorm(n)
R> d <- data.frame(x=x,y=y)
R> model <- lm(y~x,data=d)
R> model$coefficients
R> summary(model)$coefficients # compare this with above!!
R> help(summary.lm) # or print(summary.lm)
Why is this? Because model has class lm and summary generic class has
a specialized function summary.lm
which contains a lot more informaiton.
How do we know that it contains more information? Simple, we saw it
printed when we examined the model object!!
While using data.table don't use %in%
or match
when you want to
search/subset by some character string matching. While these functions will not
raise errors it is unclear whether these functions will, or won't, work! Use
chmatch
or %chin%
instead! See ?chmatch
for more information.
You can attach a saved object. Learned this from Martin Maechler's talk (slide 28 found on http://stat.ethz.ch/Teaching/maechler/R/useR_2014/Maechler-2014-pr.pdf).
Actually reading the help ?attach
has a section titled Good practice which suggests
using on.exit(detach)
as the next statement of attach. For e.g.:
R> attach('mysavedsession.RData')
R> on.exit(detach('file:mysavedsession.RData'))
However, I don't know how useful this will be. Why? Because the file will not be automatically updated. You will still need to save the image manually with whatever objects you need.
Use log1p
when you want to calculate $\log(1+x)$ and $|x| \ll 1$. I.e., if
$x$ is very small (can happen when multiplying many probabilities) then
$\log(1+x)$ will be indistinguishable from $\log(1)$. This is due to floating
point inconsistencies. Similarly, expm1(x)
to compute exp(x) - 1
when
$|x| \ll 1$.
Also, check out cospi
, sinpi
, and tanpi
. ?`cospi`
Use on.exit
to ensure execution of an extension at the exit of a function
(normally or on error).
R> outcon <- file('tst.hml','wt')
R> on.exit(close(outcon)) # Now this expression will be exectued whenever the function is exited
R> # do some R stuff here
R> ?on.exit
R> system.time # See this code for neat use of on.exit
Here is another neat use of on.exit:
R> # Neat idea of package.function.lengths
R> # from https://nicercode.github.io/blog/2013-05-07-how-long-is-a-function/
R> function.length <- function(f) {
if (is.character(f)) {
f <- match.call(f)
}
length(deparse(f))
}
R>
R> package.function.lengths <- function(package) {
package.functions <- function(package) {
qual_pkg <-
if (isTRUE(startsWith(package, "package:"))) {
package
} else {
sprintf("package:%s", package)
}
bare_pkg <- gsub("^package:","",qual_pkg)
if (! qual_pkg %in% search()) {
orig_search <- search()
require(bare_pkg, character.only=TRUE)
pkgs_added <- setdiff(search(), orig_search) ## pkgs added by our library call!
## FIXME:
## for(p in rev(pkgs_added)) {
## ## print(substitute(detach(name=p, character.only=TRUE, unload=TRUE), list(p=p)))
## on.exit(substitute(detach(name=p, character.only=TRUE, unload=TRUE), list(p=p)), add=TRUE)
## }
for(p in pkgs_added) {
on.exit(detach(name=p, character.only=TRUE),add=TRUE) ## Terrible hack!
## NOTE: No idea how to use substitute with `on.exit`!!
}
}
object.names <- ls(name=qual_pkg)
objects <- sapply(object.names, get, qual_pkg)
objects[sapply(objects, is.function)]
}
functions <- package.functions(package)
func_length <- vapply(functions, function.length, 0L)
data.table(func=sprintf("%s::`%s`", package, names(functions)), length=func_length)
}
R> package.function.lengths('raster')
http://www.win-vector.com/blog/2009/09/survive-r/
unclass()
dput()
Fix some parameters for plotting. This was suggested by Andrew Gelman. See http://andrewgelman.com/2010/10/29/could_someone_p/ and comments in it.
setHook('plot.new',function(...) par(mar=c(3,3,2,1),mgp=c(2,0.7,0),tck=-0.01)
NOTE: This causes problem with plotting in other packages. raster
is a good example of this problem.
Break a project into 4 pieces:
See https://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing
How to duplicate the following R code using data.table? See http://www.statmethods.net/management/aggregate.html
R> attach(mtcars)
R> aggregate(mtcars,by=list(cyl,vs),FUN=mean,na.rm=TRUE)
R> detach(mtcars)
This can be accomplished with data.table like so:
R> library(data.table)
R> mtcarsdt <- as.data.table(mtcars)
R> mtcarsdt[,sapply(.SD,function(x) list(mean(x))),by=list(cyl,vs)]
OR better yet: see http://stackoverflow.com/questions/11695490/aggregating-multiple-columns-in-data-table
R> mtcarsdt[,lapply(.SD,mean),by=list(cyl,vs)]
Use ls.str()
instead of str()
. It provides more details. Especially when
listing environments.
Some presentations of "data.table" which contain dates use something like 20150702 for July 2nd and that's all right for sorting but it causes much hassle for simple date arithmetic. Instead use "%Y%j" (i.e. year + no. of days from start of year) with strftime to get a numeric value that is easy to sort and the date arithemtic is also possible.
R> as.numeric(strftime("2015-07-02","%Y%j")
[1] 2015183
R> as.Date("2015183",format="%Y%j")
[1] "2015-07-02"
format
, as.character
, and strftime
convert from '"POSIXlt"' and
'"POSIXct"' to character vectors. strptime
converts character vectors
to clas '"POSIXlt"'. See ?format
or ?strftime
.
Mnemonic:
If you use both the 'reshape2' and 'data.table' packages you'll have to
load reshape2 first and then data.table See ?data.table:::melt
or
?data.table:::cast
. From Description in ?data.table:::melt
From 1.9.6, to melt or cast data.tables, it is not necessary to load 'reshape2' anymore. If you have to, then load 'reshape2' package before loading 'data.table'.
If you want to split a column into multiple columns in data.table using strsplit
can be confusing and slow. Instead use data.table::tstrsplit
.table.
See ?tstrsplit
to learn more about how to use it.
Also see http://stackoverflow.com/questions/18154556/r-split-text-string-in-a-data-table-columns
Applying a function row-wise in data.table.
Consider the following example
d <- data.table(x=sample(1:25,50,replace=TRUE),
y=sample(1:25,50,replace=TRUE),
z=sample(1:25,50,replace=TRUE))
Now you want a third column that is the minimum of the three columns.
d[,.(I=.I,x,y,z)][,.(x,y,z,minval=min(x,y,z)),by=I] # d[,min(x,y,z),by=.I] won't work!
Some tips from http://www.stat.cmu.edu/~hseltman/RTips.html
Start each function with checks of arguments. It takes a little extra time but will usually repay you (or other users of the function) by pointing out the source of errors. E.g.:
myfun = function(dtf,name,p=0.5) {
if (is.matrix(dtf)) dtf = data.frame(dtf)
if (!is.data.frame(dtf)) stop("dtf must be a data.frame or matrix")
if (!is.character(name) || length(name)!=1) stop("name must be a single character string.")
...
return(result)
}
Allow for stopping and restarting of function with long loops (e.g., MCMC). A good trick is to setup your function (or even just a loop) as follows:
myfun = function() {
if (file.exists('myresults.dat')) {
...load and use old reulsts...
}
...
for (i in 1:100000) {
if (file.exists('stop')) {
write.table(myresults,file='myresults.data')
stop('Early stop due to detection of stop file')
}
...
}
...
return(...)
}
Then, you can create a file called "stop" at any time (e.g., on Linux using "echo stop>stop" at Linux prompt) and the function will gracefully stop at the start of the next loop iteration. Without too much work, you can probably set up your function to automatically continue wherever you left off. Just remember to delete or rename the "stop" file before running the funciton again.
Assigning to matrix in a for loop is complicated in R! It can be done more simply by using matrix indexing! For e.g., consider how would you create a distance matrix for the following data.frame?
R> dataframe <- data.frame(from=LETTERS[c(1,1,1,2,2,3)],
+ to=LETTERS[c(2,3,4,3,4,4)],
+ distance=c(45,50,19,20,18,12))
R> distmatrix <- matrix(0,nrow=4,ncol=4,dimnames=list(LETTERS[1:4],LETTERS[1:4]))
R> for(i in 1:nrow(dataframe)) {
+ from <- as.character(dataframe[i,'from'])
+ to <- as.character(dataframe[i,'to'])
+ distmatrix[from,to] <- d[i,'distance']
+ }
R>
This can be done much more simply like:
R> idx <- cbind(dataframe$from,dataframe$to)
R> distmatrix[idx] <- dataframe$distance
To input a list as a column in the data.frame/data.table use I()
. The
only stipulation is that the list should have the same length as nrow of
the data.frame/data.table
R> d <- data.frame(a=1:3,b=1:3,c=I(list(1:8,2:15,letters[1:15]))
Avoid using strptime! It is VERY slow! http://stackoverflow.com/questions/12786335/why-is-as-date-slow-on-a-character-vector
If you need to use strptime then give a value for tz argument. This will
speed up strptime
. Check out the following in R:
R> r = seq(as.Date('2000-01-01'),as.Date('2012-01-01'),by='days')
R> d = format(sample(r,3e6,replace=TRUE),'%m/%d/%Y')
R> system.time(as.Date(d,'%m/%d/%Y')) # second fastest!
R> system.time(strptime(d,'%m/%d/%Y'))
R> system.time(strptime(d,'%m/%d/%Y',tz='GMT')) # fastest!
A call is just like a list. It has length
, [[
, and [
methods, and
is recursive!
See section 3.3 in http://www.biostat.jhsph.edu/~rpeng/docs/R-debug-tools.pdf
for a neat use of this feature to modify code interactively!
loadedNamespaces()
returns a character vector of loaded namespaces.
Very useful along with search()
, getLoadedDLLs()
, showMethods(x)
.
Using PCRE (Perl Compatible Regular Expression) library makes regex
searching/substituting faster. This can be enabled by using perl=TRUE
in
grep
, regexpr
, gregexpr
, sub
, gsub
, and strsplit
commands!
Also, for strsplit
if the sep is not a regex, using fixed=TRUE
makes
strsplit
considerably faster!
Remove leading/trailing spaces with base::trimws
.
You can use raster::zApply
or raster::stackApply
to group bunch of layers (say based on
month or year) in a rasterstack/rasterbrick. Just be aware that stackApply
uses na.rm=TRUE
as default whereas most other functions in R use `na.rm=FALSE!
This can lead to subtle, hard to detect, bugs.
Plotting factors will raise "Cannot allocate memory" errors. Convert the factors to either character or Date classes to make it work.
To display labels in bars in a barplot you have to use the midpoints which is
what is returned by barplot and the values of the bar as x, y coordinates
as arguments of the text
function. See the example below:
R> m <- mtcars[1:5,]
R> midpts <- barplot(m$mpg, names.arg=rownames(m),las=2) # las=2 for perpendicular labels!
R> text(midpts, m$mpg-0.9, m$mpg) # Subtract appropriate offset from y values to position labels!
To run examples in R try
R> example(seq)
R> example(ls)
R> example(plot)
If you don't want your environment to be clobbered, or modified (which is worse!), then use the local=TRUE
function argument of example
. Check out the following:
R> envobjs <- ls()
R> example(ls, local=TRUE) ## Try without local=TRUE to see the difference!
R> setdiff(ls(), envobjs) ## This will be empty with local=TRUE! Safer (more correct?) behavior!
Try to save date/time in POSIXct instead of POSIXlt. POSIXct uses almost 1/4th the bytes of POSIXlt.
R> class(Sys.time())
R> object.size(as.POSIXct(Sys.time()))
R> dput(as.POSIXct(Sys.time()))
R> object.size(as.POSIXlt(Sys.time()))
R> dput(as.POSIXlt(Sys.time()))
From the "Debugging R code" subsection in "Writing R Extensions": The
first thing to find out is what R was doing at the time of the error, and
the most useful tool is traceback
()!!
Some solutions for programming pitfalls: Easy solutions:
a. Input only required data
R> colClasses <- c("NULL","integer","NULL","numeric")
R> d <- read.table("myfile",colClasses=colClasses)
b. Preallocate-and-fill, not copy-and-append (*** MOST IMPORTANT ***)
R> res <- numeric(nrow(df))
R> for(i in seq_len(nrow(df)))
+ res[[i]] <- some_calc(df[i,])
c. Vectorized calculations, not iteration
R> x <- runif(100000); x2 <- x^2
R> m <- matrix(x2,nrow=1000);y <- rowSums(m)
d. Avoid unnecessary character creation operations e.g.
USE.NAMES=FALSE
in sapply
, use.names=FALSE
in unlist
.
Moderate solutions:
a. Use appropriate functions, often from specialized packages
b. Identify appropriate algorithms, e.g. %in% is O(N) whereas naive might be O(N^2^)
c. Use C or Fortran code.
NOTE: Assigning to columns might copy whole data.frame! This is why data.frames are slow. Look at the example mentioned in section 3.3.3 "Tracing copies of an object" in the "Writing R Extensions" manual.
Functions that take variable number of arguments behave strangely when you try to use the functional aspects of R. Consider the following examples:
R> mean(2,3,8) # doesn't work
R> sum(2,3,8) # works?
R> mean(c(2,3,8)) # works now
R> sum(c(2,3,8)) # also works! Why?
R> # Now some *really strange* behavior
R> l <- list(x=c("hello","world"),y=c("goodbye","world"))
R> lapply(l,file.path) # doesn't work!
R> lapply(l, function(x) do.call("file.path",as.list(x))) # WORKS!
Use the package 'rbenchmark'. It provides relative speed comparisons.
lengths
can be used to find out length of eaach element of the list.
Also, it's much faster!!
R> l <- list(1:10, 3:5, "vijay", c(32,22))
R> sapply(l, length)
R> lengths(l) # same as sapply(l, length)
If there is problem installing rgl then just download the binary from CRAN and install it manually by running R from the directory where you put the "tgz" like this:
bash$ Rscript -e "install.packages('rgl_0.96.0.tgz',repos=NULL)"
OR
R> Sys.setenv(LDFLAGS="-L/usr/X11/lib") # libGL.1.dylib
R> install.packages(c("rgl"))
If there is problem with installing gmp/Rmpfr (especially when using R from Macports) then check if the CFLAGS environment variable is set. After spending 3+ hours (2017.01.01) I realized that I could install it quite easily with the following:
bash$ R CMD config --all # print all environment variables!
bash$ R CMD config CFLAGS # will give the value of CFLAGS from /opt/local
bash$ R CMD config CPPFLAGS
bash$ sudo R
R> Sys.setenv(CFLAGS="<output from previous line with modifications>")
R> Sys.setenv(CPPFLAGS="<output from CPPFLAGS line with modifications>")
R> install.packages(c('gmp','Rmpfr'))
If nothing works just install the binary version! --- 2017.03.11
Close R and then start a new session and try the following:
R> Sys.setenv(LDFLAGS="-L/opt/local/lib"); Sys.setenv(CFLAGS="-I/opt/local/include"); install.packages(c('gmp','Rmpfr')) ## --- 2017.04.28
R is very flexible, and dynamic language!! You can modify functions in a live session. See the section "Inserting Code with trace" on http://www.biostat.jhsph.edu/~rpeng/docs/R-debug-tools.pdf
Basically, when you are defining an R function you are defining a callable
object which when called with allowed arguments will run a set of expressions.
The arguments for the function can be obtained by formals
function and the
body can be obtained using the aptly named body
function! Consider the following
R> f <- function(x) x * 2
R> formals(f) # $x # only one argument
R> formals(plot) #
R> body(f) # x * 2
R> body(f)[[3]] <- 15 # modifying body of f!!
R> f # function(x) x * 15 ## YOWZA!
See the pdf! It reveals the strength of R!!
Use seq_along in for loops instead of the 1:length
idiom. For e.g.,
use
R> for(i in seq_along(x)) x[i] <- x[i]/100
instead of
R> for(i in 1:length(x)) x[i] <- x[i]/100
While seq_along
might be slightly faster it is more helpful when x
is empty!
Consider the outputs in the cases below:
R> x <- c()
R> 1:length(x) # INCORRECT!!
1 0
R> seq_along(x) # CORRECT!!
integer(0)
See section 8.1.60 in "The R Inferno" for similar advice!
Similarly, use seq_len
instead of 1:n
idiom in for loop! Compare
R> n <- 0
R> for(i in 1:n) print(i) ## INCORRECT!
R> for(i in seq(n)) print(i) ## Still INCORRECT!!
R> for(i in seq_len(n)) print(i) ## CORRECT!
Use edit
to modify function definition. See ?edit
R> f <- function(x) x * x
R> f <- edit(f) # Make changes in your editor!
R> f ## See the new changes!!
Do not use ==
for membership checking. It causes subtle bug when
the lhs and rhs are of length greater than 1. Always use %in%
for membership checking!
R> x <- 1:10
R> x == c(2,4,6,8,10) # Only returns one TRUE!! INCORRECT!
R> x %in% c(2,4,6,8,10) # Returns five TRUEs! CORRECT!
See sections 8.1.6 and 8.1.7 in "The R Inferno" for more examples.
Also, ==
doesn't do NA handling correctly whereas %in%
does. Check
out the following segment:
R> d <- data.frame(gender=c('m','m',NA,'f','f'),age=c(25,28,35,25,22))
R> d[d$gender=='m', ] ## returns 3 rows!! INCORRECT!!
R> d[d$gender %in% c('m'), ] ## returns only 2 rows!! CORRECT!!
NOTE: This is not because of stringsAsFactors
too!! It will still
work incorrectly even if you used stringsAsFactors=FALSE
in the
data.frame construction!!
If you're using data.table in a function then be careful in how you name function arguments. This is because data.table uses some clever environment and NSE tricks to make data analysis easier but which breaks how R behaves! Below examples demonstrate the problem:
R> freq <- function(n) 440*(2^(1/12))^n
# INCORRECT BEHAVIOR!
R> generate_sine <- function(code) {
frequencies <- data.table(n=(-33:14), code=c(letters,LETTERS)[1:48])
frequencies[code==code, tuneR::sine(freq(n), duration=10000)] ## PROBLEM HERE!
}
# CORRECT BEHAVIOR!
R> generate_sine <- function(fcode) {
frequencies <- data.table(n=(-33:14), code=c(letters,LETTERS)[1:48])
frequencies[code==fcode, tuneR::sine(freq(n), duration=10000)] ## CORRECT!
}
The problem in the first case is that it is looking up code
in the environment
of frequencies data.table and since we are comparing code
with itself it is
always TRUE
and hence it will always return all the rows!! Basically,
this is what is happening in the first instance:
R> d <- data.table(a=c(1,2,3,1,2,3), b=11:15)
R> a <- 3 # I am looking for rows with this value!!
R> d[a==a, ] ## Problem! It never looks for the `a` defined in the previous line!
R> d[a %in% a, ] ## Won't work either!!
strrep
can be used to create repeated strings. Mnemonic: str
ing rep
eat.
This function is just like the rep
function.
R> strrep('-', 25) # dashed line
Use RSQLite.spatialite
from https://github.com/pschmied/RSQLite.spatialite .
This requires compilation so it might be harder to get it to work on windows.
To profile R code to find out which line numbers and particular function calls are slow try the following in your R session.
R> Rprof("profile1.out", line.profiling=TRUE)
R> source("<file-to-profile>.R")
R> Rprof(NULL) # Done with profiling
R> summaryProf("profile1.out", lines="show")
If R refuses to update packages it's because they're located in the system as well as the user library and it doesn't know quite which to update. The easiest solution is to delete one of the packages (i.e., delete the directory from the folder) and then try to reinstall stuff. I had problems with Rcpp, and xts on OSGEO-live VM and below is what I did to fix it.
R> library(data.table)
R> pkgs <- as.data.table(installed.packages()[,c("Package", "LibPath")])
R> pkgs[,.N,by=Package][N>1, .(Package)] ## See duplicate packages
R> dups <- pkgs[, .N,by=Package][N>1,Package]
R> pkgs[Package %in% dups, ] ## find duplicate packages
R> ## Alternatively you can do this
R> setkey(pkgs, Package)
R> pkgs[pkgs[, .N, by=Package][N>1,]] ## data.table power!!
R> ## Rcpp and xts were installed in /usr/lib/R and /usr/local/lib/R/site-library !!
Contingency tables are created using xtabs. xtabs uses a formula interace. Most of the values listed by xtabs can be generated by aggregating functions in either data.table and/or dplyr. Consider the example from the help manual.
R> DF <- as.data.table(as.data.frame(UCBAdmissions))
R> xtabs(Freq ~ Gender + Admit, DF)
R> DF[, .N, by=.(Gender, Admit)] ## same...just looks weird!
As much as I like data.table and it claims that it is a data.frame it doesn't behave like data.frame for a very basic indexing case:
R> mtcars[2] # second column
R> mtcars[c(2,4)] # second and fourth columns
R> mtcarsdt <- as.data.table(mtcars)
R> mtcarsdt[2] # second row!! What??
R> mtcarsdt[c(2,4)] # second and fourth rows!! Strange!!
R> # The correct way is like this:
R> mtcarsdt[, 2, with=FALSE] # Notice the empty i index
R> mtcarsdt[, c(2,4), with=FALSE]
To convert a ragged list into a matrix, with NA for missing values, use something like below:
R> l <- list(1:2,1:3,1:4,1:5,1:6)
R> m <- do.call(rbind, lapply(l, '[', seq_len(max(lengths(l)))))
See http://r.789695.n4.nabble.com/Convert-quot-ragged-quot-list-to-matrix-td895283.html for some very neat ideas!!
R> unname(t(do.call(cbind,lapply(l,ts)))) ## Gabor Grothendieck's idea
If you wish to plot histogram alongwith density then use probability=TRUE
or
freq=FALSE
with your hist function.
R> with(mtcars, hist(mpg, probability=TRUE, breaks=20))
R> # OR with(mtcars, hist(mpg, freq=FALSE, breaks=20))
R> with(mtcars, lines(density(mpg), col='red', lwd=1.2))
To check packages first build a .tar.gz package and then check the resulting archive.
bash$ cd ~/code/R/data.table
bash$ R CMD build --no-build-vignettes .
bash$ R CMD check data.table_1.10.5.tar.gz # Today's build archive
findInterval
is a very handy function to put values into appropriate categories.
Consider the example below of how to assign appropriate grades without having
to do a lot of work!
R> d <- data.frame(points=c(78, 82, 95, 100, 70, 65))
R> cutoffs <- c(51,61,71,81,91,101) ## Cutoff thresholds for grades!
R> d$grp <- findInterval(d$points, cutoffs)
R> d$grade <- letters[length(cutoffs)-d$grp]
The numbers in d$grp
correspond to the following values:
| Group | Value |
|------------+-------|
| [-Inf, 51) | 0 |
| [51, 61) | 1 |
| [61, 71) | 2 |
| [71, 81) | 3 |
| [81, 91) | 4 |
| [91, 101) | 5 |
| [101, Inf) | 6 |
To verify that this is, in fact, the case try the following:
R> d1 <- data.frame(points=1:101, expected=rep(c(0,1,2,3,4,5,6),c(50,10,10,10,10,10,1)))
R> d1$grp <- findInterval(d1$points, cutoffs)
R> stopifnot(all(d1$grp == d1$expected))
Use find
to see where a particular value was defined. For e.g.,
R> find("ls") ## package:base
R> find("drivers") ## package:MASS
R> find("petrol") ## package:MASS
R> find("[[") ## package:base
R> find("find") ## package:utils
R> ?`find`
Do not use nrow
function!! Using nrow
in a function which updates
a matrix, even a predefined one, might cause massive copies, and painfully
slow code, of matrix/data.frame and yield horrible performance. This is
due to R's copy-on-write semantics! Compare the functions below:
R> f1 <- function() {
dimstate = 100; nmcmc = 1e4
chain = matrix(0,nrow=nmcmc,ncol=dimstate);
for (i in seq_len(nmcmc)) {
if (i == nrow(chain)) { ## `nrow` causes slowness!!
}
x = rnorm(dimstate,mean=0,sd=1)
chain[i] = x
}
}
R> f2 <- function() {
dimstate = 100; nmcmc = 1e4
chain = matrix(0,nrow=nmcmc,ncol=dimstate);
for (i in seq_len(nmcmc)) {
if (i == dim(chain)[1L]) { ## same as `nrow` but not so slow!!
}
x = rnorm(dimstate,mean=0,sd=1)
chain[i] = x
}
}
R> f3 <- function() {
dimstate = 100; nmcmc = 1e4
chain = matrix(0,nrow=nmcmc,ncol=dimstate);
for (i in seq_len(nmcmc)) {
if (i == nmcmc) { ## simplest/easiest and most efficient way! Use this!!!!!
}
x = rnorm(dimstate,mean=0,sd=1)
chain[i] = x
}
}
R> library(rbenchmark)
R> benchmark(f1(),f2(), f3(), replications=1L)
See the excellent blog post at https://statisfaction.wordpress.com/2017/12/10/nrow-references-and-copies/
Also, run both f1
() and f2
() in R started with the --verbose
switch to see
the number of GCs that happen for f1
and f2
.
If you want to verify that all the packages that you have installed can be loaded
you will have to set the R_MAX_NUM_DLLS
environment variable to some high number
(usually slightly more than the number of installed packages is a good idea)
will allow you to load all the packages. See the code below:
bash$: R_MAX_NUM_DLLS=400 R
R> pkgs = installed.packages()[,"Package"]
R> lapply(pkgs, library, character.only=TRUE)
R> problems_loading = c("ff")
R> lapply(pkgs[!pkgs %in% problems_loading], library, character.only=TRUE)
To use gpu in R just use the "gpuR" package!! It makes using GPUs in R extremely easy!
Use the argument local=TRUE
in the example
function call to prevent
the function from polluting your enviroment with variables defined in the
example portions!!
~ $ R --vanilla
R> ls(); example(abbreviate); ls() # New variables! Can easily overwrite your variables.
R> quit('no')
~ $ R --vanilla
R> ls(); example(abbreviate, local=TRUE); ls() # No new variables!
R> quit('no')
Use the library fastmatch which provides a drop in replacement for the
base::match
function. Add the following two lines in your .Rprofile.
library(fastmatch)
match <- fmatch
Use fasttime::fastPOSIXct
instead of as.POSIXct
! It is considerably faster.
If there are some problems with PROJ5 then make sure that you set PKG_CONFIG_PATH variable and use it with the sudo R CMD commands. I had to do the following:
~$ sudo PKG_CONFIG_PATH=/opt/local/lib/proj5/lib/pkgconfig R CMD INSTALL .
~$ man pkg-config
~$ echo "export PKG_CONFIG=/opt/local/lib/proj5/lib/pkgconfig" >> ~/.zshrc
base::.rmpkg
is a useful function to remove the prefix 'package:' from string
vectors. Also, it lists how to efficiently use sub
/gsub
functions!
Sometimes we deliberately wish to override graphics settings in our function
and when we do this we should modify the graphics state for our function
to revert graphics state settings to their original values. This can be
achieved by using the on.exit
funciton. Have the following lines whenever
you make changes to graphics state.
opar <- par(no.readonly=TRUE)
on.exit(par(opar), add=TRUE) ## ?on.exit
A useful random string generating function.
R> genrandstr <- function(stringlen = 5) {
+ paste0(sample(c(letters,LETTERS), stringlen, replace=TRUE), collapse="")
+ }
R> set.seed(1234L)
R> genrandstr() # "fGFGS"
R> genrandstr(8) # "HamIAKCo"
R> replicate(5, genrandstr(2)) # "Wp" "Ro" "nj" "mq" "pi"
R> strs <- replicate(500, genrandstr(8)) # 500 strings of length 8 each!
R> tstrs <- table(strs) ## frequency of strs
R> tstrs[tstrs>1] ## NO duplicates!!
R> strs <- replicate(50000, genrandstr(8)) # 50000 strings of length 8 each!
R> tstrs <- table(strs) ## frequency of strings
R> tstrs[tstrs>1] ## still NO duplicates!!
R> sapply(sample(5:20,500,replace=TRUE), genrandstr) ## Super useful!!
A very useful function!
R> run_examples_from_package <- function(pkgname, local=TRUE) {
+ pkg <- sprintf("package:%s", pkgname)
+ stopifnot(pkg %in% search())
+ pdfout = sprintf("%s_examples.pdf", pkgname)
+ msgout = sprintf("%s_examples_message.txt", pkgname)
+ pdf(pdfout); on.exit(dev.off(), add=TRUE)
+ msg <- file(msgout, open="wt")
+ sink(msg)
+ sink(msg, type="message"); on.exit(sink(), add=TRUE)
+ op <- options(error=NULL); on.exit(options(op), add=TRUE)
+ op1 <- options("example.ask"=FALSE); on.exit(options(op1), add=TRUE)
+ ## invisible(sapply(ls(grep(pkg, search(), fixed=TRUE)), example, package=pkgname, character.only=TRUE, local=local))
+ invisible(sapply(ls(pkg), example, package=pkgname, character.only=TRUE, local=local))
+ }
R>
R> run_examples_from_package("sf") ## This will generate two files in the directory
Try the following:
R> run_examples_from_package("data.table", local=FALSE) ## local=TRUE givew weird errors! Non-standard evaluation!!
R> run_examples_from_package("sf")
This creates two files in getwd()
! The output of plotting commands
are in
So for our examples the following files will be added in getwd()
data.table_examples.pdf, sf_examples.pdf (plotting output)
data.table_examples_message.txt, sf_examples_message.txt (console and stderr otuput)
While sample
and runif
return same results sample
can be faster,
especially if you're doing simulations and/or bootstrapping. Consider the
following:
R> set.seed(1)
R> (tmp1 <- ceiling(runif(1000,0,100)))
R> set.seed(1)
R> (tmp2 <- sample(1:100,1000,replace=TRUE))
R> all.equal(tmp1, tmp2) ## TRUE
R> library(rbenchmark)
R> n <- 100; m <- 1000
R> benchmark(sample(n,size=n*m,replace=T), ceiling(runif(n*m,0,n)),
+ ceiling(n*runif(n*m)),floor(runif(n*m,1,n+1)))
R>
R> set.seed(1); c1 <- sample(n,size=n*m,replace=T)
R> set.seed(1); c2 <- ceiling(runif(n*m,0,n))
R> set.seed(1); c3 <- ceiling(n*runif(n*m))
R> set.seed(1); c4 <- floor(runif(n*m,1,n+1))
R> all.equal(c1,c2)
R> all.equal(c2,c3)
R> all.equal(c3,c4)
Using lapply/sapply can simplify a lot of geospatial code! Consider this
example from the vignette of sp::over
on pg 3.
> library(sp)
> x <- c(0.5,0.5,1,0.5)
> y <- c(1.5,0.5,0.5,0.5)
> xy <- cbind(xy)
> dimnames(xy)[[1]] <- letters[1:4]
> pts <- SpatialPoints(xy)
> xpol <- c(0,1,1,0,0)
> ypol <- c(0,0,1,1,0)
> pol = SpatialPolygons(list(
+ Polygons(list(Polygon(cbind(xpol-1.05,ypol))), ID="x1"),
+ Polygons(list(Polygon(cbind(xpol,ypol))), ID="x2"),
+ Polygons(list(Polygon(cbind(xpol,ypol - 1.0))), ID="x3"),
+ Polygons(list(Polygon(cbind(xpol + 1.0, ypol))), ID="x4"),
+ Polygons(list(Polygon(cbind(xpol+.4, ypol+.1))), ID="x5")
+ ))
This can be simplified considerably by storing polygon coordinates in a list and then using lapply to generate the final polygon. Consider the following simplification (at least in my opinion):
R> l <- list(cbind(xpol=xpol-1.05,ypol), cbind(xpol,ypol),
+ cbind(xpol,ypol=ypol-1), cbind(xpol=xpol+1,ypol),
+ cbind(xpol=xpol+.4, ypol=ypol+.1))
R> pol1 <- SpatialPolygons(
+ lapply(seq_along(l),
+ function(x)
+ Polygons(list(Polygon(l[[x]])), ID=sprintf("x%s",x))))
>
> all.equal(pol, pol1, check.attributes=FALSE)
Rmpi needs to be installed from source and if you're using openmpi from macports then you need to use configure.args to set variables needed to get Rmpi to compile. I had to use the following:
R> install.packages(c("Rmpi"), dependencies=TRUE,configure.args=c("--with-Rmpi-include=/opt/local/include/openmpi-devel-mp"
"--with-Rmpi-libpath=/opt/local/lib/openmpi-devel-mp", "--with-Rmpi-type=OPENMPI"))
bash$ port contents openmpi-devel-default # find paths for Rmpi-libpath and Rmpi-include!
R's type hierarchy: NULL < raw < logical < integer < double < complex < character < list < expression
Here are a couple of very useful frequency functions! I got this idea from https://st2.ning.com/topology/rest/1.0/file/get/4077505910?profile=original
## I have it organized how I think it will be beneficial to me...
freqsdt <- function(DT, groupcols, percent=TRUE) {
stopifnot(is.data.table(DT), is.character(groupcols) & length(groupcols) > 0L, all(groupcols %chin% colnames(DT)))
res <- DT[, .(frequency=.N), by=groupcols][order(-frequency)][,percentage:=100*frequency/sum(frequency)]
res
outcols <- colnames(res)
if(!isTRUE(percent)) outcols <- setdiff(outcols, "percentage")
res[, ..outcols]
}
allfreqs <- function(DT, catlim=100L) {
stopifnot(is.data.table(DT), is.integer(catlim) & catlim > 0L)
nms <- names(DT)
nmlen <- length(nms)
final <- data.table(NULL)
for(i in seq_len(nmlen)) {
freqs <- freqsdt(DT, c(nms[i]))
if(nrow(freqs) <= catlim) {
final <- rbind(final,
data.table(vname=nms[i],value=as.factor(freqs[[1]]),
frequency=freqs[[2]], percent=freqs[[3]]))
}
}
final
}
And you use them like so:
R> d <- data.table(grp=sample(c('a','b','c'),1000,replace=T),grp1=sample(1:2,1000,replace=T),x=runif(1000),y=runif(1000),z=runif(1000))
R> freqsdt(d, c("grp"))
R> freqsdt(d, c("grp1"))
R> freqsdt(d, c("grp", "grp1"))
R> freqsdt(d, c("grp1", "grp"))
R> allfreqs(d) ## Probably should be called first!! Before anything else...
Get all the different S3methods of a function
getAllS3methods <- function(func) {
## I was looking at ?sf::`st_cast` and wanted to see all the different st_cast methods for different geometry types.
## This is how to go about it.
##
## R> library("sf"); ?st_cast
## R> getAllS3methods("st_cast")
## R> str(getAllS3methods("st_cast")
##
## This is very helpful for geospatial packages!
##
stopifnot(is.character(func), length(func) == 1L)
m <- methods(func)
ms <- gsub("\\*$", "", as.character(m))
s3methods <- sapply(ms, function(x) c(strsplit(x,"\\."))) ## each item is a two character vector c("function" "class")!
##
## To see the issue with s3methods do the following:
##
## table(sapply(s3methods, length))
## s3methods[which(sapply(s3methods, length) > 2]
fixed_s3methods <- lapply(s3methods, function(x) c(x[[1]], paste(x[2:length(x)], collapse="."))) ## methods like print.data.table and print.boostrap.lca
funcdefs <- lapply(fixed_s3methods, function(x) do.call(getS3method, as.list(x)))
funcdefs
}
Generate a random filename.
generate_random_filename <- genrandfilename <-
function(minlen=5L,maxlen=20L,filechars=paste0(c(letters,LETTERS,0:9),collapse=""),extensions=c("pdf","exe","txt","md","xls","doc","xlsx","dat","csv","shp","prj"),allowspaces=FALSE) {
stopifnot(is.integer(minlen), is.integer(maxlen), minlen > 0L, maxlen > 0L, maxlen >= minlen)
stopifnot(isTRUE(allowspaces) || isFALSE(allowspaces), is.character(filechars), is.character(extensions))
require("data.table")
len <- sample(seq.int(minlen,maxlen), 1)
filechars <- unlist(strsplit(filechars,""))
## browser()
char_prob_tbl <- data.table(char=filechars,prob=1/length(filechars))
if(isTRUE(allowspaces)) {
## When spaces are allowed, modify the probability in such a way that space is selected twice as often as other chars
char_prob_tbl <- rbind(char_prob_tbl, list(" ", 2/(length(filechars)+1)))
char_prob_tbl[char != " ", prob:=(1-char_prob_tbl[char==' ', prob])/ .N]
## char_prob_tbl[char != " ", prob:=(1 - .SD[char==' ', prob])/ .SD[char != ' ', .N]]
stopifnot(char_prob_tbl[, sum(prob)] == 1L)
}
## filename <- paste0(sample(filechars, len, replace=T), collapse="") ## len can be larger than length of filechars
filename <- paste0(sample(char_prob_tbl$char, len, prob=char_prob_tbl$prob, replace=T), collapse="") ## len can be larger than length of filechars
ext <- sample(extensions, 1L, replace=TRUE)
filename <- sprintf("%s.%s", filename, ext)
filename
}
And you use them like so:
R> genrandfilename()
R> files <- replicate(15, generate_random_filename())
Generate a list of prefixes.
prefixes <- function(x) {
## This is analogous to J's \
##
## In J try:
## ]\'banana'
##
## In R:
## R> prefixes(1:5)
## R> prefixes(letters[1:5])
##
## R> all.equal(cumsum(1:10), unlist(lapply(prefixes(1:10), sum)))
## R> all.equal(cumprod(1:10), unlist(lapply(prefixes(1:10), prod)))
## R> all.equal(cummax(1:10), unlist(lapply(prefixes(1:10), max)))
## R> with(list(x=runif(15)), all.equal(cumsum(x), unlist(lapply(prefixes(x), sum))))
## R> with(list(x=runif(15)), all.equal(cumprod(x), unlist(lapply(prefixes(x), prod))))
## R> with(list(x=runif(15)), all.equal(cummax(x), unlist(lapply(prefixes(x), max))))
##
stopifnot(is.vector(x))
mapply(function(a,b) x[seq.int(a,b)], rep(1,length(x)), seq_along(x))
}
This can be used to create some other cumulative functions which are not present in R. Here are a couple of examples:
cumsd <- function(x) {
stopifnot(is.vector(x), (is.numeric(x) || is.integer(x)))
unlist(lapply(prefixes(x), sd))
}
cummean <- function(x) {
stopifnot(is.vector(x), (is.numeric(x) || is.integer(x)))
unlist(lapply(prefixes(x), mean))
}
I have often felt the need to shuffle the input for my
functions/algorithms to ensure that they are not relying on the ordering
of data. For instance, when I'm first writing a function which takes
a sequence as input, I'll use a simple sequence like seq.int(1L, 10L)
as
input to get the function working. And, once the function is ready I'll
start using it. And, as I continue to use this function/algo I will notice
that when the input is not ordered (or simple) I often get results that
don't make sense. And, this is usually the result of using simple vectors
for testing when designing the function initially. As my understanding
improves and I realize the unstated, but definitely present, assumption
that I had made while writing the function I will have to try to fix the
function with shuffled vectors. And, this function is an attempt to
create permutations of an input vector
or data.frame
which can be used
to shuffle these collections which can then be used as argument[s] for
function. So, for instance if I had
mymean <- function(x) { sum(x)/length(x) }
then I would check it like this
lapply(permutations(1:15,n=20), mymean)
to ensure that I get the same
result for all the permutations!
permutations <- function(x, n=6L) {
## useful function to generate permutations of vector or data.frame.
##
## Can be used to generate random ordering so that you can check whether
## your algo/functions depend on particular ordering of variables.
##
## R> permutations(1:10)
## R> permutations(mtcars)
stopifnot((is.vector(x) & length(x) > 1L) || (is.data.frame(x) & nrow(x) > 1L))
idx <- seq_along(x)
if(is.data.frame(x)) {
idx <- seq_len(nrow(x))
}
## indices <- replicate(n, sample(idx, length(idx))) ## Permutations are cols...
## indices <- t(replicate(n, sample(idx, length(idx)))) ## permutations are rows...
## indices <- lapply(seq_len(n), function(x) sample(idx, length(idx)))
##
indices <- replicate(n, sample(idx, length(idx))) ## Permutations are cols...
res <- if(is.vector(x)) {
## x[indices]
lapply(seq_len(ncol(indices)), function(i) x[indices[,i]])
} else {
## x[indices,,drop=FALSE] ## drop=FALSE...just in case we get 1 col data.frame!!
lapply(seq_len(ncol(indices)), function(i) x[indices[, i], , drop=FALSE])
}
res
}
When using R interactively I have often found the need to modify some
options
or par
settings for a particular piece of code snippet which
needs to be reverted back after the code snippet is done. This can be
accomplished by writing a function which uses on.exit
. This was explained
by Patrick Burns at
https://www.burns-stat.com/the-options-mechanism-in-r/
withOptions <- function(optlist, expr) {
opts <- options(optlist)
on.exit(options(opts))
expr <- substitute(expr)
eval.parent(expr)
}
R> print((1:10)^-1)
R> withOptions(list(digits=3),print((1:10)^-1))
And we can do the same thing for par
too! And, in my opinion, it is
a whole lot more helpful with par
than it is with options
.
withPars <- function(parlist, expr) {
opar <- par(parlist)
on.exit(par(opar))
expr <- substitute(expr)
eval.parent(expr)
}
R> plot(mtcars$mpg, mtcars$disp)
R> withPars(list(mar=c(1,1,1,1)), plot(mtcars$mpg, mtcars$disp))
It is often required to fix column names for GIS related work. The below function tries to fix column names that are likely to work in GIS softwares.
fixcolnames <- function(x, lowercase=FALSE) {
## Very useful for GIS related work!!!!
f <- ifelse(lowercase, tolower, identity)
res <- gsub("^(_?[0-9]+)?_|_$", "", f(gsub("[^A-Za-z0-9]+", "_", as.character(x))))
ifelse(res == "", "V1", res)
}
R> # Try this to see the usefulness of this little function!
R> weird_colnames <- c(" with spaces ", "| with strange' punctuation |", "with-hyphens!", "with(parenthesis)",
"0 with leading numbers", "99 more leading numbers","But numbers 99 inside are okay!",
" 99 numbers and beginning spaces is tricky!", "Overall, Case Sensitive!",
"") ## Caught the last one and leading spaces with numbers case by using PBT (hypothesis is excellent!!)
R> d <- data.table(weird_colnames=weird_colnames)
R> d[ , `:=`(fixed_colnames = fixcolnames(weird_colnames))]
R> d
To load SparkR in an R session make sure to download spark and set the SPARK_HOME envvar.
Then you can load SparkR
library using the following:
library(SparkR,lib.loc=file.path(Sys.getenv("SPARK_HOME"),"R","lib"))
sparkR.session(master="local[*]",sparkConfig=list(spark.driver.memory="2g"))
Use all.equal
to check if numbers are equal. R's dynamic nature makes comparison using ==
very strange.
Check out the difference between ==
and all.equal
here:
R> deg2rad <- function(deg) deg * pi / 180
R> rad2deg <- function(rad) rad * 180 / pi
R> d <- data.table(deg=1:15)[,rad:=deg2rad(deg)][,deg1:=rad2deg(rad)]
R> d[, all(deg == deg1)] ## returns FALSE!!
R> d[, .(deg, deg1)] ## printing looks ok....
R> d[deg != deg1, .(deg, deg1, diff=deg-deg1)] ## main problem....
R> d[, all.equal(deg, deg1)] ## returns TRUE!!
A very useful function to display all the objects in the R environment. This is just like RStudio's environment pane...but this works in base R too!
lsos <- lsobjs <- .ls.objects <- function(pos=1L, pattern, order.by, decreasing=FALSE, head=FALSE, n=5) {
##
## See http://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session
##
## Modified to sort correctly based on size! There's a subtle bug in the SO answer
napply <- function(names, fn) sapply(names, function(x) fn(get(x, pos=pos)))
names <- ls(pos=pos, pattern=pattern)
if(length(names) == 0L) return(character(0))
obj.class <- napply(names, function(x) as.character(class(x))[[1]])
obj.mode <- napply(names, base::mode)
obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
obj.size <- napply(names, function(x) {
l <- capture.output(print(object.size(x), units="auto"))
l[length(l)]
})
obj.dim <- t(napply(names, function(x) as.numeric(dim(x))[1:2]))
vec <- is.na(obj.dim)[,1] & (obj.type != "function")
obj.dim[vec, 1] <- napply(names, length)[vec]
out <- data.frame(obj.type, obj.size, obj.dim)
names(out) <- c("Type", "Size", "Rows", "Columns")
if(any(obj.type %in% c("RasterStack", "RasterBrick", "SpatRaster"))) {
nlayers <- function(x) ifelse(inherits(x, c("RasterStack", "RasterBrick")),
raster::nlayers(x),
ifelse(inherits(x, c("SpatRaster")), terra::nlyr(x), NA))
out <- cbind(out, Layers=napply(names, nlayers))
}
if(!missing(order.by)) {
idx <- if (order.by=="Size") {
sizes <- napply(names, object.size)
order(sizes, decreasing=decreasing)
} else {
order(out[[order.by]], decreasing=decreasing)
}
out <- out[idx, ]
}
if(head)
out <- head(out, n)
gc() ## This function uses a lot of memory! Free it before exiting.
return(out)
}
R> lsos()
An added advantage of lsos
is that it can be particularly useful for
examining lists and environments. Check this out in your console:
R> lsos(mtcars)
R> lsos(environment(par))
R> lsos(getS3method("[","data.table"))
DE-9IM representations of the various spatial predicates:
Contains(a,b) <==> 'T*****FF*'
CoveredBy(a,b) <==> 'T*F**F***' \/ '*TF**F***' \/ '**FT*F***' \/ '**F*TF***'
Covers(a,b) <==> 'T*****FF*' \/ '*T****FF*' \/ '***T**FF*' \/ '****T*FF*'
Crosses(a,b) <==> 'T*T******' \/ 'T*****T**' \/ '0********'
Disjoint(a,b) <==> 'FF*FF****'
Equals(a, b) <==> 'T*F**FFF*'
Intersects(a,b) <==> 'T********' \/ '*T*******' \/ '***T*****' \/ '****T****'
Overlaps(a,b) <==> 'T*T***T**' \/ '1*T***T**'
Touches(a,b) <==> 'FT*******' \/ 'F**T*****' \/ 'F***T****'
Within(a,b) <==> 'T*F**F***'
Contains(a,b) <==> Within(b,a)
CoveredBy(a,b) <==> Covers(b,a)
Covers(a,b) <==> CoveredBy(b,a)
Disjoint(a,b) <==> ! Intersects(a,b)
Equals(a,b) <==> Within(a,b) /\ Contains(a,b)
Intersects(a,b) <==> ! Disjoint(a,b)
Within(a,b) <==> Contains(b,a)
See https://en.wikipedia.org/wiki/DE-9IM#Spatial_predicates . These are useful with sf::st_relate
function!
Column details:
## Column Details for a Data Frame
## ^^^ ^^^^^^^ ^ ^
colDetails <- function(DF) {
stopifnot("Needs a data.frame or data.table" = inherits(DF, "data.frame"))
colnames <- colnames(DF)
colclasses <- sapply(DF, classes)
colidx <- seq_along(DF)
num_nas <- sapply(DF, numna)
num_uniq <- sapply(DF, num_unique)
DFM <- copy(DF)
for(i in seq_len(ncol(DF))) {
if(any(class(DF[[i]]) %in% c("factor","character","Date","POSIXct","POSIXlt"))) { DFM[[i]] <- as.numeric(rep(NA,nrow(DFM))) }
}
colstats <- function(x, na.rm=TRUE, digits=3L) {
## idea from McElreath's rethinking::precis function
stats <- if(is.numeric(x)) {
c(round(mean(x,na.rm=na.rm),digits=digits), round(sd(x,na.rm=na.rm),digits=digits), round(quantile(x,probs=c(0.055,0.945),na.rm=na.rm),digits=digits))
} else {
c(NA, NA, NA, NA)
}
names(stats) <- c("mean","sd","5.5%","94.5%")
stats
}
## stats <- t(apply(DF,2,colstats))
stats <- do.call(rbind, lapply(DF, colstats))
histosparks <- sapply(DFM, histospark) ## see below for histospark
DD <- data.table(ColName=colnames, ColClasses=colclasses, ColIdx=colidx, NumNA=num_nas, PctNA=round(100*num_nas/nrow(DF),3), NumUniq=num_uniq, PctUniq=round(100*num_uniq/nrow(DF),3), row.names=NULL)
DD <- cbind(DD, stats)
DD[,Histogram:=histosparks]
DD[]
}
R> colDetails(mtcars)
And, the helper functions used in the function above are:
numna <- numnas <- numNA <- numNAs <- function(x) sum(is.na(x))
classes <- function(x) paste(class(x), collapse=", ")
histospark <- function(x, width=10L) {
if(all(is.na(x))){return("")}
if(is.integer64(x)){ x <- as.numeric(x) }
sparks <- c("\u2581","\u2582","\u2583","\u2585","\u2587")
bins <- graphics::hist(x, breaks=width, plot=FALSE)
factor <- cut(bins$counts / max(gins$counts), breaks=seq(0L,1L,length=length(sparks)+1L),labels=sparks,include.lowest=TRUE)
paste0(factor,collapse="")
}
Padded numbers are often useful for creating unique IDs in geospatial related tasks. This function simplifies creating padded numbers instead of having to manually calculate them.
paddedNumbers <- function(nums) {
## paddedNumbers(c(1,19,101,1010,10283)) returns c("00001","00019","00101","01010","10283")
stopifnot(all(nums > 0))
width <- nchar(as.character(max(nums)))
fmt <- sprintf("%%0%dd",width)
sprintf(fmt, nums)
}
Generate a random date series.
generate_random_date_range <- function(start_date=as.Date('2000-01-01'), end_date=Sys.Date(), num_days=30L) {
## Useful function to generate a date range based on start/end dates...
##
## R> generate_random_date_range(as.Date('2000-01-01'),as.Date('2022-12-31'), 30L)
## R> generate_random_date_range(as.Date('2000-01-01'),as.Date('2022-12-31'), as.integer(sample(365,1L))) ## much more useful...in my opinion
## R> generate_random_date_range(as.Date('2000-01-01'),as.Date('2022-12-31'), as.integer(sample(c(365,366),1,prob=c(3/4,1/4)))) ## even more interesting
stopifnot(class(start_date)==class(Sys.Date()), class(end_date)==class(Sys.Date()), is.integer(num_days), end_date > start_date, end_date - start_date >= num_days)
with(list(start_dt=sample(seq(start_date, end_date - num_days, by='1 day'),1)),seq(start_dt, start_dt+num_days-1,by='1 day'))
}
If a data.table contains integer64 the dcast.data.table
will insert really large values for any cell that ought to receive NA
values.
From reading the issue https://github.com/Rdatatable/data.table/issues/4561 it appears that this is a rather common occurrence. The comments in that
thread describe the solution: use fill=bit64::as.integer64(NA)
in the dcast.data.table
call.
List all objects in all environments.
getAllObjects <- function() {
## This can be used to determine where the object is coming from...
##
## I stumbled upon this when trying to figure out how to get `colSums` to work with `integer64`. I learned that `colSums` is definedin `package:Matrix` as well as `package:base`.
## This is also evident when you do `?colSums` where R will ask you which function documentation you wish to read.
envs <- search()
getObjects <- function(env) {
objs <- ls(name=env)
searchIdx <- which(env==search())
data.table(searchidx=searchIdx,env=env,obj=objs,obj_type=sapply(objs,function(s)typeof(get(s,env))))
}
rbindlist(lapply(envs,getObjects))
}
Example of applying a function recursively to a list using rapply
:
## Slightly ammended list from the example shown in ?rapply
X <- list(list(a = pi, b = list(c = 2L)), d = "a test", e=as.Date(c('2024-01-01','2024-12-31')))
## does not descend into X[[1]][["b"]] to take sqrt of X[[1]][["b"]][["c"]]
rapply(X, sqrt, classes="numeric", how="replace") ## skips descending into X[[1]][["b"]] because class(X[[1]][["b"]]) is a "list"!
rapply(X, function(n){ if(is.numeric(n)){sqrt(n)}else{identity(n)}}, how="replace") ## Use R's functional nature to keep it simple!
## General idea is that if you wish to apply a function recursively do not use the `classes` argument of `rapply`. Instead, make the classes check as a part of the function application!
## NOTE (vijay): If you decide to use `ifelse` instead of `if` ... just be careful!
str(Xi <- rapply(X, function(n) ifelse(is.numeric(n),sqrt(n),identity(n)), how="replace")) ## Incorrect ... check out the element Xi[["e"]]!
str(Xc <- rapply(X, function(n) ifelse(is.numeric(n),sqrt,identity)(n), how="replace")) ## Correct ... contrast Xc$e with Xi[["e"]]
## TODO (vijay): Try to understand the behavior of `ifelse`!