End of internship

I am now coming to the end of my internship working on Zoön. I will quickly cover what I have done in three months, then give some recent updates.

Work achieved

As GitHub gives us some nice summary statistics, I thought I’d put them here as a way to underline my time working on the project.

Main package

~120 commits containing ~3.5 thousand lines of code. The big blob in the middle is the lead up to the workshop. 56 issues opened, 39 closed.

[Figure: GitHub contribution graph for the zoon repository]

Modules

~100 commits with ~2 thousand lines of code.

[Figure: GitHub contribution graph for the modules repository]

Features

And just to try and summarise my contributions, here is a list of features

Core package
  • Github repo set up
  • Github repo for modules set up
  • Framework for calling packages from the repository
  • Include multiple modules either within one analysis or to split analyses for comparison
  • Crossvalidation and external validation
  • Save current progress of analysis on crash
  • Rerun a workflow (or run from break point on crashed workflow)
  • Take a workflow, change a few modules and rerun from where stuff is changed
  • Automatic module documentation building (not CI), and ability to get module help from R.
Select Modules
  • Collect data from GBIF and other sources
  • Collect environmental data from NCEP and Bioclim
  • Wrapper for biomod giving Random Forest, bioclim, Maxent (untested), GAMs and other models
  • Basic map plotting
  • Validation statistics (AUC, Kappa, sensitivity)
  • Upload whole analysis to figshare

Recent updates

I realise I haven’t actually written a blog since the post soon after the workshop. However, a lot of the work since then has been cleaning up code, writing better documentation and code comments, and writing unit tests (now nearly all functions are tested).

What has been added are functions that make zoon easier to use interactively, for example during development or when trying out different analyses.

Firstly, workflow now saves the progress so far if it crashes.


w <- workflow(SpOcc(species = 'Anopheles plumbeus', 
                    extent = c(-10, 10, 30, 50)),
              Bioclim(extent = c(-20, -10, 0, 10)),
              OneHundredBackground,
              LogisticRegression,
              SameTimePlaceMap)

tmpZoonWorkflow

As our occurrence datapoints are outside the extent of the covariate data, the workflow crashes. However, the object tmpZoonWorkflow now contains our progress so far. As this includes some data downloaded from online repositories, this might be quite a time saver.

I guess this won’t work if R fully hangs (which is entirely possible). And even if it does work, having tmpZoonWorkflow in the namespace is not very useful if your R session has crashed. So perhaps this should save a temporary .RData file after each module instead.
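A minimal sketch of that idea, in base R only. It is not how zoon works; run_with_checkpoints and the toy steps are hypothetical names made up for illustration, and saveRDS is used rather than a full .RData image.

```r
# Hypothetical helper: run named steps in order and write the results so far
# to disk after every successful step.
run_with_checkpoints <- function(steps, checkpoint = tempfile(fileext = ".rds")) {
  results <- list()
  for (name in names(steps)) {
    out <- tryCatch(steps[[name]](results), error = function(e) e)
    if (inherits(out, "error")) {
      message("Step '", name, "' failed: ", conditionMessage(out))
      message("Progress so far is saved in ", checkpoint)
      break
    }
    results[[name]] <- out
    saveRDS(results, checkpoint)  # on disk, so it survives a crashed session
  }
  invisible(checkpoint)
}

# Toy steps standing in for modules: the second one always fails.
steps <- list(
  occurrence = function(done) data.frame(lon = c(0, 1), lat = c(50, 51)),
  covariate  = function(done) stop("extent does not overlap the occurrence data")
)

path <- run_with_checkpoints(steps)
names(readRDS(path))
```

Because the file is written after each step rather than on error, the progress is still there even if R dies without ever raising an error.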

There are also functions to rerun or change a workflow. However, they are still under development (I really wanted to finish these before the end). They only work for simple cases at the moment.


# Make a new function that breaks
breaks <- function() stop('B-b-b-breaaak')

# Uhoh, it broke
b <- workflow(UKAnophelesPlumbeus, UKAir, 
              breaks, LogisticRegression, PrintMap)

# Hooray, a lurid green 'map'!
b <- ChangeWorkflow(tmpZoonWorkflow, process = OneHundredBackground)

Similarly, you might want to rerun a workflow. If a workflow breaks because of an internet connection drop or something similar, this might be easier to use than ChangeWorkflow. Otherwise, if a paper publishes an analysis, with a Zoön workflow uploaded to Figshare or Datadryad, the first thing you will want to do is rerun it.


# Download a workflow. We've got one called 'b' from above so pretend we downloaded that.

r <- RerunWorkflow(b)

Then we can compare our results with those in the paper. We also now have the data and model objects, so we can fully examine their work if we wish.

Thanks

So I think that’s about all for now. Thanks Greg, Nick and Emiel. I’ve really enjoyed it and look forward to following the progress of Zoön (I’m sure I’ll contribute some modules here and there.)

Tim
@timcdlucas

What do ZOӦN workflows look like?

On top of the core functionality we will be developing some features that will help users to learn, utilise and develop ZOӦN. The first is an overview that displays the contents of your workflow.

This is just a sketch which isn’t integrated yet, but it shows what modules you have included, and whether they are in a ‘list’ or a ‘chain’. All being well, your workflow would look something like this…

[Figure: WorkflowViz_tester]

All the modules are blue, which means they were found in the module repository. If they aren’t, the overview will look like this…

[Figure: WorkflowViz_tester2]

The modules not found (misspelling or not submitted) will then appear in red. A list of these modules is given at the bottom of the overview.

As you can see the ‘lists’ look different from ‘chains’ because they are different (see Tim’s post in week 5). In the above workflow there is one list and 3 chains.

Lists create separate arms of the workflow that run in parallel for each item in a list, e.g. running the same workflow for lots of species, or lots of models. Chains, on the other hand, include multiple items in one workflow, e.g. you want to output a map, create a set of predictions and run some diagnostics. You can have as many chains as you like, e.g. to combine some data, run sequential models, or run lots of processes on your data. But you can include only 1 list per workflow.

In the workflow overview you will be alerted to this as the 2nd List will appear darker and look like this…

[Figure: WorkflowViz_tester3]

The next step is to tidy this up and integrate it into the ZOӦN package. Then we may include different options and more information, and work on the aesthetics.

Week 7 (workshop week)

Workshop

Last week we had a workshop on zoon. It was an excellent day and highlighted various very useful avenues for improvement. I think one of the major benefits was clarifying what zoon is useful for, what our ‘competition’ is (and how we significantly differ from other projects) and the major things that need to exist for people to adopt zoon. I will talk about these later.

[Figure: workshop]

Bugs and problems

We (inevitably) found plenty of bugs. Some of these were fixed on the day and some are still to be fixed.

I have moved many of them from the google docs and my notes to github issues. Feel free to add more.

The major source of bugs was reading users’ own data in. Notably this led to a number of downstream errors, e.g. data being read in wrong but not causing an error until the model module. So one major fix needed is for zoon to check after each module that the output is correct. This will aid debugging by isolating the problem much more quickly. It also highlighted a need for zoon to save the ‘progress so far’. When downloading large datasets, for example, it is annoying to have to redownload the dataset because of a typo in a module name later on.
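As a rough illustration of what a check after each module could look like, here is a sketch in base R. The function name check_occurrence and the required column names (longitude, latitude, value) are assumptions for the example, not zoon's actual interface.

```r
# Hypothetical check run straight after an occurrence module, so that badly
# read data fails here rather than later in the model module.
check_occurrence <- function(x, module = "occurrence module") {
  required <- c("longitude", "latitude", "value")  # assumed column names
  if (!is.data.frame(x)) {
    stop(module, " did not return a data frame", call. = FALSE)
  }
  absent <- setdiff(required, names(x))
  if (length(absent) > 0) {
    stop(module, " output is missing column(s): ",
         paste(absent, collapse = ", "), call. = FALSE)
  }
  not_numeric <- required[!vapply(x[required], is.numeric, logical(1))]
  if (length(not_numeric) > 0) {
    stop(module, " output has non-numeric column(s): ",
         paste(not_numeric, collapse = ", "), call. = FALSE)
  }
  invisible(x)
}

good <- data.frame(longitude = c(0.5, 1.2), latitude = c(51, 52), value = c(1, 0))
bad  <- data.frame(longitude = c("0.5", "1.2"), latitude = c(51, 52), value = c(1, 0))

check_occurrence(good)                       # passes silently
res <- try(check_occurrence(bad), silent = TRUE)
cat(res)                                     # the error names the module and the column
```

The point of naming the module in the message is that the error then appears at the step that caused it.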

In general, reading data in needs plenty of work to make it very easy and robust. I could do with a better idea of how people would like to put their data in (csv, tab delimited?, excel, rasters, anything else commonly used?). We also talked about reading in local R objects.

And while zoon will only use longitude, latitude and value (mostly 0/1 or abundance), I imagine people often have spreadsheets with more columns than that. So some attempt at guessing which columns are the correct ones to use would be sensible. Letting users indicate which columns they wish to use is also vital.
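One way this guessing could work is sketched below in base R. The function guess_columns, the name patterns and the output column names are all assumptions for illustration; nothing like this is claimed to exist in zoon.

```r
# Hypothetical helper: pick longitude, latitude and value columns either from
# an explicit user mapping or by matching common column names.
guess_columns <- function(df, columns = NULL) {
  patterns <- c(longitude = "^(lon|long|longitude|x)$",
                latitude  = "^(lat|latitude|y)$",
                value     = "^(value|presence|occurrence|abundance)$")
  if (!is.null(columns)) {
    # The user says which columns to use, e.g. c(longitude = "Lon_dd", ...)
    out <- df[, columns[names(patterns)], drop = FALSE]
  } else {
    found <- vapply(patterns, function(p) {
      hit <- grep(p, names(df), ignore.case = TRUE)
      if (length(hit) == 1) hit else NA_integer_
    }, integer(1))
    if (anyNA(found)) {
      stop("Could not guess column(s): ",
           paste(names(found)[is.na(found)], collapse = ", "),
           "; please name them explicitly", call. = FALSE)
    }
    out <- df[, found, drop = FALSE]
  }
  names(out) <- names(patterns)
  out
}

survey <- data.frame(site = c("a", "b"), Lat = c(51.5, 52.1),
                     Lon = c(-0.1, 0.3), presence = c(1, 0))
guess_columns(survey)
guess_columns(survey, columns = c(longitude = "Lon", latitude = "Lat",
                                  value = "presence"))
```

The guess refuses to proceed when a pattern matches zero or several columns, which is the case where an explicit mapping from the user is needed anyway.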

Other work

One major alteration that was suggested at the workshop, and that will be implemented in the coming weeks, is to change the syntax of the main workflow function from this

w <- workflow(occurMod = 'UKAnopheles',
              covarMod = 'UKAir',
              procMod  = ModuleOptions('Crossvalidate', k=2),
              modelMod = 'LogisticRegression',
              outputMod= 'PrintMap')

to this

w <- workflow(occurrence = UKAnopheles,
              covariate  = UKAir,
              process    = Crossvalidate(k=2),
              model      = LogisticRegression,
              output     = PrintMap)

There are three changes here:
– Changing the argument names
– Removing the requirement for quotation marks
– Passing module options as arguments to a function named after the module (the k=2 here) rather than using ModuleOptions.

The third change is more difficult and has some technical challenges that Nick has mostly figured out.
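To give a flavour of the challenge: Crossvalidate(k=2) must not actually be called when the workflow function receives it, since the module may not have been downloaded yet. One standard base R approach is to capture the unevaluated argument with substitute(). The sketch below is only an illustration of that technique, not Nick's solution; parse_module and toy_workflow are hypothetical names.

```r
# Hypothetical helper: turn an unevaluated argument such as Crossvalidate(k = 2)
# or a bare name such as PrintMap into a module name plus a list of options.
parse_module <- function(expr, env) {
  if (is.name(expr)) {
    list(module = as.character(expr), options = list())
  } else if (is.call(expr)) {
    list(module  = as.character(expr[[1]]),
         options = lapply(as.list(expr)[-1], eval, envir = env))
  } else {
    stop("expected a module name or a call like Module(option = value)")
  }
}

# Toy stand-in for workflow(): substitute() grabs the argument as written,
# so the module function never has to exist when the call is made.
toy_workflow <- function(process) {
  parse_module(substitute(process), parent.frame())
}

str(toy_workflow(Crossvalidate(k = 2)))
str(toy_workflow(OneHundredBackground))
```

Neither Crossvalidate nor OneHundredBackground is defined in this session, yet both calls succeed, which is what allows the quotation marks to be dropped.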

Why is zoon different

We had some really profitable discussions about the target audience of zoon and what would be required for people to start using zoon. The big packages/software that are ostensibly in the same area are Biomod and GUI maxent. However, Zoon is quite different to these, being in large part a wrapper for other packages.

It was noted that for people to start using zoon, it needs to be as easy as possible while providing things that other packages don’t. However, rather than being easy, it should be more usable, i.e. it should suit the task better. The task of zoon is to do good, reproducible distribution modelling (not ‘get a map that vaguely looks feasible’). What a GUI could or should provide for zoon is an ongoing discussion. But instead of being just an easier way to perform a basic (and perhaps ill advised) analysis, perhaps it should be an easy way to explore the functionality of zoon (i.e. the modules repository). And if a GUI is used to run whole workflows, it should always output a code version of the workflow so that it is reproducible, and preferably make the code immediately available to the GUI user as an entry point into the command line use of the package.

In terms of what zoon offers there are two strains. Firstly, as a wrapper for many other packages, we should be able to provide a very wide range of models very easily. Furthermore, we aim to have very comprehensive functionality at every stage of an analysis, again in part by wrapping other packages rather than writing complex methods from scratch.

The second, perhaps more interesting strain is that zoon should enable different types of analyses to other packages. For example, comparing a number of different models or easily combining data from a number of different sources or quickly providing and combining effective and diverse output from your models. These are tasks that are a level above what most packages provide; in other words, most current packages would provide one step in these analyses. Instead of having to write lengthy R scripts, that in their complexity lose their reproducibility, zoon should allow these higher levels to be run quickly, simply and reproducibly.

Finally, I think there are some interesting differences between Zoon, biomod and maxent. The major difference is that Zoon is aiming to be a community. A small group of developers can never keep up with a large, burgeoning and rapidly moving field. However, a community can. I think this makes a huge difference (although it comes with its own set of issues).

Final comments

Finally, it was noted that zoon used to be spelt zoön to clarify the pronunciation. So this is my proposal for Zoön artwork.

[Figure: zoonMotohead]

Rtools – choose your own download adventure.

Windows users may have a little trouble getting Rtools to work … Once you have installed Rtools and loaded the devtools package, you may get this error message when you type in this command find_rtools()

WARNING: Rtools is required to build R packages, but is not currently installed.
Please download and install Rtools 3.1 from http://cran.r-project.org/bin/windows/Rtools

Don’t worry, it’s pretty common and you just need to change a couple of details in the System Path. This is how to do it…

  1. Before you download Rtools, type this into R

Sys.getenv('PATH')

This will get you the Path for the environment variables.

[Figure: Path R code]

  2. Copy this Path into a text file (removing the quote marks “ “) – it should look something like the Path below. If you have admin rights go to step 5.

C:\\Program Files\\Common Files\\Microsoft Shared\\Windows Live; C:\\Program Files (x86)\\Common Files\\Microsoft Shared\\Windows Live; C:\\Windows\\system32; C:\\Windows; C:\\Windows\\System32\\Wbem; C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\; C:\\Program Files\\Intel\\WiFi\\bin\\; C:\\Program Files\\Common Files\\Intel\\WirelessCommon\\; C:\\Program Files (x86)\\Windows Live\\Shared; C:\\Program Files (x86)\\Common Files\\Roxio Shared\\DLLShared\\; C:\\Program Files (x86)\\Common Files\\Roxio Shared\\OEM\\DLLShared\\; C:\\Program Files (x86)\\Common Files\\Roxio Shared\\OEM\\DLLShared\\; C:\\Program Files (x86)\\Common Files\\Roxio Shared\\OEM\\12.0\\DLLShared\\; C:\\Program Files (x86)\\Roxio\\OEM\\AudioCore\\; C:\\Program Files (x86)\\QuickTime\\QTSystem\\; C:\\Program Files (x86)\\WinSCP\\; C:\\Python27; C:\\Python27\\Scripts

  3. Copy the text file onto a memory stick so you can do step 4.
  4. Log in as the administrator (or whoever can) and download Rtools from the admin account.

Rtools doesn’t ask for admin privileges by default. But you won’t be able to change the registry (or something above my computing pay grade) without admin rights.

  5. Then download Rtools…

http://cran.at.r-project.org/bin/windows/Rtools/ BUT before you install you have to check that you can edit the system path. If you can, then copy in the Path you have in the text file, then go through the pages to install Rtools, and go to 7. Otherwise go to 6.

[Figure: Rtools System Path]

  6. If you can’t alter the system path then make sure the top box is checked on the previous page of the wizard.

The option name isn’t visible because of the path name, awkward! But by checking the top box you should get to the option above. Go to 5.

[Figure: R tools check box]

  7. That worked for me 🙂
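If you want to check the result from inside R afterwards, something like the following may help. It is only a sketch using base R: it lists the Path entries one per line and looks for anything mentioning Rtools. That the Rtools folders contain "rtools" in their name, and that make is one of the tools it provides, are assumptions about a typical install.

```r
# Split the Path into its entries (";" on Windows, ":" elsewhere)
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
writeLines(path_entries)

# Entries that mention Rtools; empty if it is not on the Path
grep("rtools", path_entries, ignore.case = TRUE, value = TRUE)

# Sys.which() returns "" when the program cannot be found on the Path
Sys.which("make")
```

Reading the Path one entry per line also makes it much easier to compare with the copy you saved in the text file.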

Helpful links:
https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows
http://stackoverflow.com/questions/20023739/r-getting-rtools-to-install-on-r-version-3-0-2
http://cran.r-project.org/doc/manuals/R-admin.html#The-Windows-toolset

Week 6 (workshop next week!)

Progress

Cross and external validation

I have now implemented cross validation and external validation. For example use

workflow( occurMod = 'UKAnophelesPlumbeus', 
          covarMod = 'UKAir',
          procMod = 'BackgroundAndCrossvalid',
          modelMod = 'LogisticRegression',
          outMod = 'PerformanceMeasures'
        )

This uses the (obligatory) A. plumbeus data that is bundled with the package and some NCEP air data (which is also bundled). The data is split into the default of 5 sets for crossvalidation.

Logistic regression is fitted to each of the training sets (all of the data except one held-out fold). Each fitted model is then used to predict the values of its held-out data.

Logistic regression is then ALSO performed on the full dataset. The model module then passes on the model trained on the full dataset and a dataframe with lat, lon, true value, number indicator for crossvalidation fold, predicted value and the environmental data.

Finally, PerformanceMeasures is a module that calculates AUC, kappa, sensitivity, specificity etc.
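For readers unfamiliar with these statistics, here is a small base R sketch of how they can be computed from observed 0/1 values and predicted probabilities. It is not the code of the PerformanceMeasures module; the function name performance and the 0.5 threshold are assumptions. AUC uses the rank (Mann-Whitney) formula and kappa is Cohen's kappa.

```r
# Hypothetical helper: threshold-free AUC plus threshold-based statistics.
performance <- function(observed, predicted, threshold = 0.5) {
  pos <- observed == 1
  n1 <- sum(pos)
  n0 <- sum(!pos)
  # AUC: probability that a random presence is ranked above a random absence
  auc <- (sum(rank(predicted)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  called <- predicted >= threshold
  tp <- sum(called & pos)
  tn <- sum(!called & !pos)
  fp <- sum(called & !pos)
  fn <- sum(!called & pos)
  n <- length(observed)

  # Cohen's kappa: observed agreement corrected for chance agreement
  p_obs <- (tp + tn) / n
  p_exp <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2

  c(auc = auc,
    kappa = (p_obs - p_exp) / (1 - p_exp),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

m <- performance(observed  = c(1, 1, 1, 0, 0, 0),
                 predicted = c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1))
m
```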

The model code is written so that it will automatically run crossvalidation if the dataframe from the process module indicates that there is more than 1 group of data. The numbers 1, 2, 3… indicate groups for cross validation. 0 indicates external validation data that is never used to train the model.
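The scheme can be sketched with simulated data and glm from base R. This is an illustration under assumptions, not the package code: the column names (value, temp, fold, predicted) are invented, and using the full model to predict the fold-0 external data is my reading of the description above.

```r
set.seed(1)
n <- 200
df <- data.frame(value = rbinom(n, 1, 0.5))
df$temp <- rnorm(n) + df$value              # a covariate related to presence
df$fold <- sample(rep(0:5, length.out = n)) # 0 = external, 1..5 = CV folds
df$predicted <- NA_real_

# Crossvalidation: train on the other folds, predict the held-out fold.
# Fold 0 is excluded from every training set.
for (k in setdiff(sort(unique(df$fold)), 0)) {
  train <- df$fold != k & df$fold != 0
  fit <- glm(value ~ temp, family = binomial, data = df[train, ])
  df$predicted[df$fold == k] <-
    predict(fit, newdata = df[df$fold == k, ], type = "response")
}

# Full model: trained on all non-external data, then used on the external data.
full <- glm(value ~ temp, family = binomial, data = df[df$fold != 0, ])
df$predicted[df$fold == 0] <-
  predict(full, newdata = df[df$fold == 0, ], type = "response")

head(df)
```

Every row ends up with a prediction from a model that never saw it, which is what the validation statistics need.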

Output

We now have a couple of output options.

workflow( occurMod = 'UKAnophelesPlumbeus', 
          covarMod = 'UKAir',
          procMod = 'BackgroundAndCrossvalid',
          modelMod = 'LogisticRegression',
          outMod = list('PerformanceMeasures', 'PrintMap')
        )

[Figure: mapAndPerform]

This workflow prints a map of the probability surface and calculates the performance measures. I haven’t put much thought into printing maps straight to the graphics device when there are multiple maps to be printed (e.g. when two models were run). Currently one map is printed, and then the second replaces the first.

A module ‘SurfaceMap’ saves maps to a specified directory and this DOES work with multiple models.

Documentation

Modules are now documented with roxygen2, as is the main zoon repo. However, when using BuildModule you just specify a couple of arguments with a description of the module and a description of the parameters, and it builds the roxygen2 comments for you. Then we just need to run roxygen2 on the modules repo every now and then.
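To show the kind of thing this involves, here is a base R sketch that turns a few descriptions into roxygen2 comment lines. It is not BuildModule itself, whose arguments are not shown here; build_roxygen and its arguments are hypothetical.

```r
# Hypothetical helper: build roxygen2 comment lines from plain descriptions.
# params is a named character vector: names are parameters, values describe them.
build_roxygen <- function(name, title, description, params = character(0)) {
  param_lines <- if (length(params) > 0) {
    paste0("#' @param ", names(params), " ", params)
  } else {
    character(0)
  }
  c(paste0("#' @title ", title),
    paste0("#' @description ", description),
    param_lines,
    paste0("#' @name ", name))
}

doc <- build_roxygen(
  name = "Crossvalidate",
  title = "Split data for crossvalidation",
  description = "Adds a fold indicator to the occurrence data.",
  params = c(k = "Number of folds")
)
writeLines(doc)
```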

ModuleHelp() is a new function which reads the documentation .Rd file from github and prints it to the console. This is entirely hacked from ‘?’ (i.e. help()) in base R. I will be able to make this read your R options and print to the browser or print to the console depending. But I haven’t done that yet.

Workshop

I now just have a lot to do for the workshop. I’m writing some of the little modules that make zoon much more flexible as I think of them. I also need to think how best to demonstrate Zoon.

Questions and to dos

Linking modules

Unfortunately I haven’t made the functions to link modules together. This would have been really nice to have done for the workshop but it’s a bit trickier than I realised. So there are some glaring workflows that are currently awkward. A common workflow would be to download occurrence data from GBIF, and load in some external validation data, then run a typical analysis to see how well models trained on the GBIF do at predicting the external data. This isn’t currently possible because you can’t daisy chain the ‘SpOcc’ module and the ‘LocalOccurrenceData’ module.

Similarly, I can’t write one module that splits the data for crossvalidation and another module that samples background pseudoabsences. So if (as is common) you want to do both of these processes, you are currently limited to the ‘BackgroundAndCrossvalid’ module, which does both of these on its own.

Structured module repo

I wasted a lot of time splitting the module repo into different folders for each type of module (occurrence, covariate etc.). I then realised that this irreparably broke the module documentation, so I removed it.

Week 5

Progress

Lists

The whole of the week has been spent rewriting the main function (and many others) so that the package can accept lists of modules. This is now done and I think the package works sensibly. Here are some examples.

Get the package from github.

# install devtools
install.packages('devtools')
library('devtools')

# install and load zoon from github
install_github('zoonproject/zoon')
library('zoon')

Then, to run an analysis for two species, we can use the module ‘SpOcc’. This collects data from a number of ecological databases using the ‘spocc’ package by rOpenSci. By default it uses just GBIF, which is fine for our purposes.

flow <- workflow(occurMod = list(
                   ModuleOptions('SpOcc', 
                     species = 'Culex pipiens', extent = c(-20,20,45,65)), 
                   ModuleOptions('SpOcc', 
                     species = 'Anopheles plumbeus', extent = c(-20,20,45,65)) 
                ),
                covarMod = 'UKAir',
                procMod = 'OneHundredBackground',
                modelMod = 'LogisticRegression',
                outMod = 'SameTimePlaceMap')

This will give us two separate analyses, one for each species. The output modules are not particularly well set up for parallel output yet. But we can have a quick look at our output with

par(mfrow=c(1,2))
plot(flow$output[[1]])
points(flow$occurrence[[1]][,1:2])
plot(flow$output[[2]])
points(flow$occurrence[[2]][,1:2])

[Figure: parallelSp]

So any of the module types can be given parallel modules in this fashion. But only one module type can have multiple modules, as it is unclear quite what would be wanted with multiple sets of modules, and if it’s all combinations then this will quickly get fairly ridiculous.

x <- workflow(occurMod = list('UKAnophelesPlumbeus',
                'UKAnophelesPlumbeus'),
       covarMod = list('UKAir', 'UKAir'),
       procMod = 'OneHundredBackground',
       modelMod = 'LogisticRegression',
       outMod = 'SameTimePlaceMap')

Error in workflow(occurMod = list("UKAnophelesPlumbeus", "UKAnophelesPlumbeus"),  :
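A check of this kind is simple to write. The sketch below is a base R illustration, not the check inside workflow(); check_single_list and the wording of its error message are made up for the example.

```r
# Hypothetical check: at most one module type may be given several modules.
check_single_list <- function(...) {
  modules <- list(...)
  n_parallel <- vapply(modules,
                       function(m) if (is.list(m)) length(m) else 1L,
                       integer(1))
  listed <- names(modules)[n_parallel > 1]
  if (length(listed) > 1) {
    stop("Only one module type can be a list of multiple modules, got lists for: ",
         paste(listed, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Fine: only the occurrence modules are a list
check_single_list(occurMod = list("UKAnophelesPlumbeus", "UKAnophelesPlumbeus"),
                  covarMod = "UKAir")

# Not fine: two module types are lists
res <- try(check_single_list(
  occurMod = list("UKAnophelesPlumbeus", "UKAnophelesPlumbeus"),
  covarMod = list("UKAir", "UKAir")), silent = TRUE)
cat(res)
```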

To do

Daisy chain

Following on from the list work above, I now really need to write functions that ‘daisy chain’ modules together rather than running them in parallel. For example, as output we might want AUC and a map, but that shouldn’t stop us running two species. Or we might want to use species data from two sources but combine them and run a number of models, rather than comparing analyses done with the two datasets separately.

Multicore

Something that I won’t be doing immediately, but which might be quite easy and useful to implement, is to make these analyses run on parallel cores. As the code is written with *apply functions, this should be fairly straightforward.
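The reason it should be straightforward is that mclapply from the parallel package (shipped with R) takes the same arguments as lapply. A sketch, with a toy function standing in for one arm of a workflow; run_one and the choice of two cores are assumptions, and forking is not available on Windows, where mc.cores must be 1.

```r
library(parallel)

# Toy stand-in for running one arm of a workflow
run_one <- function(species) paste("analysed", species)

species <- c("Culex pipiens", "Anopheles plumbeus")

# Serial version, as the code is written now
serial <- lapply(species, run_one)

# Multicore version: same call, different function.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
multicore <- mclapply(species, run_one, mc.cores = cores)

identical(unlist(serial), unlist(multicore))
```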

Cross validation

This is probably what I will do next. The main functions need to be rewritten so that cross validation and external validation are an integral part of the workflow.

Documentation

It’s not quite clear how to do the documentation for modules. As the modules are held on repositories, the documentation isn’t loaded automatically with the package. Furthermore, we want it to be easy for contributors to add well documented modules. This will probably be done by making the MakeModule() function ask for information.

As a first attempt I think the documentation will be just written manually and held on the online repo. A new function will call the documentation given the module name.

Week 4

Progress

Progress has been a little slow this week, what with the Bank holiday. Also, most of the work done has been under-the-hood stuff that is time consuming and uninteresting in the sense that it doesn’t add new features.

Environments

I spent quite a while working out how environments work in R and then fixing the package so it uses them more correctly. Before, a number of functions were saved into the global environment. Modules, which are each just a function, were put into the global environment. Furthermore, the BiomodModels module creates a predict method for the biomod model class. This was also saved to the global environment.

All these objects are now saved into the environment of the workflow function call. This keeps the global environment clean and tidy.
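The difference can be seen in a few lines of base R. This is a toy illustration of the idea only; toy_workflow and ToyModule are hypothetical names, and the real package loads modules from the repository rather than defining them inline.

```r
toy_workflow <- function() {
  env <- environment()  # the environment of this particular function call
  # Store the 'module' in the call's environment instead of globalenv()
  assign("ToyModule", function(x) x * 2, envir = env)
  ToyModule(21)         # found by normal lookup inside the call
}

toy_workflow()

# Nothing has leaked into the global environment
exists("ToyModule", envir = globalenv(), inherits = FALSE)
```

Once the call returns, its environment (and the module in it) is discarded unless something explicitly keeps a reference to it.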

Tests

I have finally gotten around to setting up and writing unit tests for the package. These are written with the testthat package. I have now written tests for most of the functions in zoon as well as some whole workflow tests. I am still not very clear on what a good strategy is for testing functions with complex outputs (e.g. workflow()). At the moment I am:
– testing all the internal functions that create the complex output
– testing each bit of the complex output object

In one case, I am testing the function by running it in two ways which should give the same output and then testing the equality of the entire output object. This seems messy to me. But anything less complete feels at risk of missing a bug.

I am also struggling to write tests for the GetModuleList() function as it requires browser authentication.

I will try looking at other packages to continue getting a better idea of how other people test their code.

Workflow with lists

I am still working on the workflow function so that it accepts lists of modules of each type. This is quite involved so it is not finished yet. After discussion we decided that giving a list of covariate modules should create parallel analyses in the same way lists of occurrence or model modules would. I will write a function to combine covariate datasets from different modules although apparently this could be a bit tricky, getting the extents the same etc.

Just thinking about this now, I wonder whether this should be written to accommodate multi-threaded processing from the beginning. But probably not. As it is being written with *apply functions, it should be fairly simple to swap in the multicore versions later.

Discussions and decisions.

Cross validation

We decided that the best way to implement k fold cross validation is not to treat the k folds as k separate analyses and then join them later, but to have an extra column in the dataframe that gets passed from process modules to model modules, with an integer indicating the fold. This column could also usefully indicate data that is external validation data.
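A sketch of what adding such a column might look like in base R. The function add_folds, the column name fold and the use of 0 for external validation data are assumptions for illustration.

```r
# Hypothetical process step: label each row with a crossvalidation fold (1..k),
# or 0 if it is external validation data.
add_folds <- function(df, k = 5, external = rep(FALSE, nrow(df))) {
  df$fold <- 0L
  n_internal <- sum(!external)
  # Shuffle a balanced set of fold labels over the non-external rows
  df$fold[!external] <- sample(rep(seq_len(k), length.out = n_internal))
  df
}

set.seed(42)
occ <- data.frame(longitude = runif(25, -10, 10),
                  latitude  = runif(25, 30, 50),
                  value     = rbinom(25, 1, 0.5))
is_external <- rep(c(FALSE, TRUE), c(20, 5))

occ <- add_folds(occ, k = 5, external = is_external)
table(occ$fold)
```

A model module then only has to look at this one column to know whether to crossvalidate, and on which rows.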

Documentation

For now, documentation will be a simple lookup of online documentation. This should be fairly simple. I can also make the BuildModule() function require some information on the module when it is built.

Future work

As usual we have a lot of stuff we would love to add. Currently the priorities are making the package accept lists of modules, good documentation and cross validation. We have then discussed adding INLA (although apparently this isn’t straight forward), worldclim data, and quite a range of outputs as there is currently almost no output.

Sorry, still no appropriate pictures…