The First Programming Design Pattern in pxWorks

First of all, we need to explain a few things in more detail.

(Re)Introduction

pxWorks is an open source programming platform that enables the following, among other things:

  • Implement data-mining programming logic in a clear fashion by modelling code around the flow of data.
  • Use a mixture of any scripting languages in the same project seamlessly without introducing any intermediary code or any extra packages. (Our examples will be mostly written in R.)
  • Delegate code writing with the assurance that the resulting code will be transparent, easy to follow, and easy to debug.
  • Create prototypes quickly and easily.

Programming Logic (Control Flow) in pxWorks

Any computer program can be represented by a graph. In pxWorks, graph nodes represent operations and graph edges represent the direction of control flow.

To enable programming loops without making logic complicated, the platform uses just two types of connections: unconditional and conditional.

The simplest program is one that uses only unconditional connections. Such connections are represented on the canvas by grey lines. A graph with only unconditional connections represents a simple program in which each node that has inputs waits until all the code blocks connected to those inputs have been processed.

Nodes with conditional inputs make it possible to introduce loops into the control flow. Conditional connections are represented by magenta lines. Nodes that have both unconditional and conditional input connections wait for their turn to execute based on the following rule: either all the unconditionally connected nodes have been (re)calculated, or at least one conditional node has been (re)calculated and generated an input file.

So in the first case, with unconditional links, the code is triggered regardless of whether an input file is generated for the dependent block (hence the execution is unconditional).

In the second case, with conditional links, the code is triggered only on the condition that an input file has been generated by the earlier block.

Even more details on this subject can be found here.

The First Design Pattern: Heartbeat

Before proceeding any further, you might want to get the example file here. (To run the example, you will need to unzip it and open in pxWorks.)

The first and simplest use case might be the periodic retrieval (and processing) of some data using R, Python, Julia, etc., or any mixture of these. We will use R.

To implement this design pattern we need a block that will initiate the control flow, let’s call it ‘init,’ and a heartbeat block, which is simply a script that generates an output file and passes the control flow back to its own input socket. There is no need to generate the file every time, but for simplicity, we will keep regenerating it every time the script is run.
The heartbeat output can be linked to any number of blocks that need to run after the heartbeat block. Without complicating the program with actual data retrieval and processing, we will simply generate random numbers and plot them for demonstration purposes. To see the plot once the script has run, click the “graph” icon in the main menu; pxWorks should open a new window displaying the latest generated plot.
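For illustration, here is a minimal sketch of what the heartbeat block and a dependent plotting block might look like. The file names, the sleep interval, and the stop condition are all illustrative assumptions, not part of pxWorks itself:

    # heartbeat.R -- a sketch of the heartbeat block
    Sys.sleep(5)                                # pause between beats
    keep_running <- TRUE                        # illustrative stop condition
    if (file.exists("heartbeat_out.csv")) file.remove("heartbeat_out.csv")
    if (keep_running) {
        # regenerating the output file passes control to the dependent
        # blocks and back to the heartbeat's own input socket
        writeLines(as.character(Sys.time()), "heartbeat_out.csv")
    }

    # plot_block.R -- a sketch of a dependent block
    x <- rnorm(100)                             # stand-in for retrieved data
    png("latest_plot.png")
    plot(x, type = "l", main = "Data refreshed by the heartbeat")
    dev.off()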

So the heartbeat block will keep running perpetually and will trigger scripts in dependent blocks.

To stop the heartbeat block, the generated file must be deleted and the script must stop regenerating it.

In further posts, we will demonstrate other design patterns we use in our data analysis workflow. This first example already shows how simple it is to introduce programming logic using just two types of connections to model the control flow, rather than multiple types of blocks, as is done in some other platforms.

Things become much simpler: as programming complexity vanishes, one is free to think about the data rather than the program's architecture.

###

How to simplify your code by using data flows

How can one effectively develop and manage code in large complex data analysis projects?

In the past I routinely developed conventions for naming my R scripts, prefixing them to encode the order in which they should run. I used this convention for several years, until I came across a massive data analysis task: I needed to process data generated by a trading algorithm that managed a portfolio of hundreds of stocks. The initial solution was clear: write R scripts that manage other R scripts. So I persisted. However, some tasks had to be run manually, such as launching a sequence of R instances to process data in parallel. Finally, it became clear that script naming conventions and special folder structures were not an optimal solution, as I had an even more complicated challenge ahead: connecting the algorithm to the market. The workflow was no longer hierarchical but had a structure that could only be conveyed by a graph with loops.

Thus the pxWorks platform came into being (www.pxworks.io; a screenshot is below). The platform is open source, and the code is published under the AGPLv3.

Some of the features of this platform are as follows:

  • Running code in any scripting language, or any compiled code, in a code block. For example, one can easily mix R and Python code in the same project.
  • Easy code debugging, owing to the fact that the code in a block can be run in isolation from the rest of the project and has user-defined inputs and outputs that are saved to disk (see the sketch after this list).
  • Ability to implement any programming logic on a graph that determines data flows.
  • Ability to implement conditional loops easily by using conditional connections (sockets). No special blocks are required for this (as is usually the case in some visual programming environments).
  • Ability to modify programming blocks (graph nodes) on the fly by simply editing the underlying text files that define each block and refreshing the block.
  • Extensible code block library. One can easily add a block of code into the library for reuse. Simple folders in the library directory are treated as ‘folders’ in the library menu, so blocks can be easily grouped.
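To illustrate the debugging point above, a block's script can be as simple as the following R sketch (the file names and the transformation are illustrative). Because a block's inputs and outputs are plain files on disk, the script can be run and debugged entirely on its own, outside the graph:

    # a sketch of a self-contained pxWorks block
    if (!file.exists("input.csv")) {
        # create a demo input so the block can also run in isolation
        write.csv(data.frame(x = 1:3, y = 4:6), "input.csv", row.names = FALSE)
    }
    input  <- read.csv("input.csv")
    output <- transform(input, z = x + y)   # the block's actual work
    write.csv(output, "output.csv", row.names = FALSE)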

More details and technical specifications of the platform can be found in the forum on the website of the project.

I am currently looking for collaborators and feedback to help me improve the software and make it even more useful to as many people as possible. Let me know if you need any new features. Participating is easy: just fork the code on GitHub and start extending the code base, or report any issues you encounter.

I am also developing a production-stage algorithmic trading system using this platform, so leave a comment at the forum of the website www.pxworks.io if you are interested in that trading code being open sourced.

R-based game “2048” with a simple API for ML benchmarking

For a long time I have been on the lookout for a ready-to-use R-based API for testing ML algorithms. My ideal “tool” would be an R-based API for the game “go” (“weiqi” in Chinese, “baduk” in Korean), but I have found none so far.

Since writing a rules engine is quite time-consuming, I did not venture to develop one until recently, when I stumbled upon the game “2048.” Many people use this simple game to benchmark their machine learning algorithms, and YouTube has so many machine learning demos using “2048” that I won't even list them here. To my surprise, all the videos I watched seem to use a JavaScript-based front end for testing. A JavaScript connection seemed like an overcomplication to me, as well as an additional bottleneck.

So after spending an hour building my own R-based version of “2048,” I decided to see what other options were available, and I found this nice implementation in C: https://github.com/mevdschee/2048.c by Maurits van der Schee. Porting from C saved me a lot of time. Many thanks to the original author.

I added a simple API for attaching the game code to an ML algorithm. Note that each game may be run in its own R environment, which allows for an easy setup of parallel computing using the standard “foreach” package (a sketch follows the usage example below). Other than that, the whole code is a little over 400 lines, so its usage should be self-explanatory.

The program can also be run in an interactive text mode (using `main_interactive()`). However, I found no way to port text coloring to the RStudio console. Still, since this code is meant to be played by machines rather than humans, I suppose I achieved my goal.

You can download the R version of the code from my GitHub repository: github.com/cloudcell/2048_4ML. Comments and suggestions are always welcome.


Usage

# interactive mode
p.env <- new.env()
main_interactive(p.env)

# using for benchmarking
p.env <- new.env()
main_ML_init(p.env)
main_ML_run(p.env, m = "L")
# use p.env$board to retrieve board state
p.env$board
main_ML_run(p.env, m = "D", show_board = FALSE)
p.env$board
# …
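Below is a sketch of the parallel set-up mentioned above. The cluster size, the number of games and moves, the move policy, and the path to the game code are all illustrative assumptions; the API calls are the ones from the usage example:

    # parallel benchmarking: each game runs in its own R environment
    library(foreach)
    library(doParallel)

    cl <- makeCluster(2)
    registerDoParallel(cl)

    scores <- foreach(i = 1:4, .combine = c) %dopar% {
        source("2048.R")                    # illustrative path to the game code
        p.env <- new.env()                  # one environment per game
        main_ML_init(p.env)
        for (step in 1:100) {
            m <- sample(c("L", "R", "U", "D"), 1)  # placeholder for an ML policy
            main_ML_run(p.env, m = m, show_board = FALSE)
        }
        sum(p.env$board)                    # a crude score; replace as needed
    }

    stopCluster(cl)
    scores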


Introducing Package ‘fuzztest’

This is a tool for code fault analysis. I built it to automate the most boring part of my debugging process.

The package automates test setup and logging, and visualizes function exit states in a way that simplifies the identification of root causes of software defects. Fuzzing is implemented through random generation of input parameters, as shown in the demo below. Finally, even though the goal was not to make this yet another unit testing package, it can be used as one, since it can potentially ensure 100% code coverage with minimal effort.

The package tests all possible combinations of input parameters and produces statistics and visuals. If you have specific requirements, you can simply build a wrapper function that catches the output you are interested in and generates an error if a required condition is not met, then submit that wrapper for testing (a sketch follows below). You can even compare current output values against values recorded in a log during a ‘reference’ test run, effectively creating a comprehensive unit test.
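A minimal sketch of such a wrapper follows. The target function and the checked condition are placeholders; generate.argset(), apply.argset(), and test_summary() are the package calls shown in the demo below:

    my_target <- function(x, y) x / y       # stand-in for the code under test

    my_wrapper <- function(x, y) {
        out <- my_target(x, y)
        # turn an unmet requirement into an error, i.e. a 'FAIL' state
        if (!is.finite(out)) stop("output is not finite")
        out
    }

    r <- list(x = c(-1, 0, 1), y = c(-1, 0, 1))
    generate.argset(arg_register = r, display_progress = TRUE)
    apply.argset(FUN = "my_wrapper")
    test_summary()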

The package can be installed from here: https://github.com/cloudcell/fuzztest/.

Below is a presentation of a built-in demo. You can run it using `demo(fuzzdemo)`.

I will be grateful for your comments and suggestions.

> demo(fuzzdemo)

    r <- list()
    r$x <- c(0)
    r$y <- c(0)
    r$option <- c("a", "b", "c")
    r$suboption <- c("a", "b", "c","d")
    
    generate.argset(arg_register = r, display_progress=TRUE)
    apply.argset(FUN="fuzzdemofunc")
    test_summary()
    plot_tests()
    plot_tests(fail = F)    
    plot_tests(pass = F)
===LOG OMITTED===

Fuzztest: Argument-Option Combination Results
===================================================
   ARG~OPT     Arg Name     PASS    FAIL    FAIL%
---------------------------------------------------
   1 ~    1    x               5       7     58.3
   2 ~    1    y               5       7     58.3
   3 ~    1    option          4       0      0.0
   3 ~    2    option          1       3     75.0
   3 ~    3    option          0       4    100.0
   4 ~    1    suboption       2       1     33.3
   4 ~    2    suboption       1       2     66.7
   4 ~    3    suboption       1       2     66.7
   4 ~    4    suboption       1       2     66.7
===================================================

Fuzztest: Summary
========================================================================
  Arg Name            Failure Rate Contribution, % (Max - Min)          
------------------------------------------------------------------------
          x    0.0  '                                                  '
          y    0.0  '                                                  '
     option  100.0  '**************************************************'
  suboption   33.3  '*****************                                 '
========================================================================

The summary shows that the argument 'option' explains the most
variability in the outcome. So let's concentrate on the arg. 'option.'

The detailed statistics table shows that most failures occur when  
value #3 is selected within the argument 'option'. At the same time, 
a test log (omitted here) shows that the types of errors are mixed. 
For now, however, let's assume that fixing the bugs related to  
control flow is more important. 

The following three graphs will demonstrate how the data above
can be represented visually. Notice that some lines are grouped
when they intersect vertical axes. The groups correspond to specific
options and are ordered from the bottom of the chart to the top: 
i.e. the first grouping of lines at axis 'suboption' (at the bottom)
corresponds to value 'a', the next one up is suboption 'b', and so on.
In case an argument has only one value in the test, the whole
group of lines will be evenly spread from the bottom to the top of 
the chart, as is the case for arguments 'x' and 'y'.

* All test cases:

[plot: all test cases]

One can also selectively display only passing or failing tests
as will be shown next.

* Only 'passing' test cases:
[plot: passing test cases only]

* Only 'failing' test cases:
[plot: failing test cases only]

Let's assume all the control flow related bugs discussed above
are fixed now. To make this assumption "work" during testing,
we will simply choose a combination of options that does not
cause the demo function to produce the 'fail' states shown above.
Such a combination could be {x=0, y=0, option='a', suboption='a'}.

Now we will concentrate on the numeric part of the test.
There are two main testing approaches:
 1. Create an evenly spaced sequence of values for each parameter 
    (x and y) from lowest to highest and let the argument set generator
    combine and test these values. This approach has an advantage 
    for more intuitive visualization as sequences of values
    for testing will be aligned with the vertical axis. For example, 
    if we create a test sequence [-10;+10] for argument 'x', 
    visualized test results will list those from 'Min' to 'Max'. 
    So finding simple linear dependencies that cause errors will
    be easier than when using a random set of values
    (approach 2, below).
 2. Generate random parameters for selected arguments and let the test
    framework test all possible parameter combinations.
  
  
The First Approach: Ordered Test Sequences  
    r <- list()
    r$x <- c(seq(from=-5, to=5, length.out = 11))
    r$y <- c(seq(from=-5, to=5, length.out = 11))
    r$option <- c("a")
    r$suboption <- c("a")
    
    generate.argset(arg_register = r, display_progress=TRUE)
    apply.argset(FUN="fuzzdemofunc")
    test_summary()
    plot_tests()
===LOG OMITTED===

Fuzztest: Argument-Option Combination Results
===================================================
   ARG~OPT     Arg Name     PASS    FAIL    FAIL%
---------------------------------------------------
   1 ~    1    x               9       2     18.2
   1 ~    2    x              10       1     9.09
   1 ~    3    x               9       2     18.2
   1 ~    4    x              10       1     9.09
   1 ~    5    x               9       2     18.2
   1 ~    6    x              10       1     9.09
   1 ~    7    x               9       2     18.2
   1 ~    8    x              10       1     9.09
   1 ~    9    x              10       1     9.09
   1 ~   10    x              10       1     9.09
   1 ~   11    x              10       1     9.09
   2 ~    1    y              11       0      0.0
   2 ~    2    y              10       1     9.09
   2 ~    3    y              10       1     9.09
   2 ~    4    y              10       1     9.09
   2 ~    5    y              10       1     9.09
   2 ~    6    y               9       2     18.2
   2 ~    7    y               9       2     18.2
   2 ~    8    y               9       2     18.2
   2 ~    9    y               9       2     18.2
   2 ~   10    y              10       1     9.09
   2 ~   11    y               9       2     18.2
   3 ~    1    option        106      15     12.4
   4 ~    1    suboption     106      15     12.4
===================================================

Fuzztest: Summary
========================================================================
  Arg Name            Failure Rate Contribution, % (Max - Min)          
------------------------------------------------------------------------
          x   9.09  '*****                                             '
          y   18.2  '*********                                         '
     option    0.0  '                                                  '
  suboption    0.0  '                                                  '
========================================================================

The test summary shows that argument 'y' contributes to 
failure the most.

What about the chart?

[plot: test results for ordered sequences of x and y]

Now one can clearly see two linear relationships between
'x' and 'y'. These correspond to 'numeric bugs' #1NC and #4NC
(see the details in the file 'include_fuzzdemofunc.R').

Let's assume the previously discovered bugs have been fixed.
So we will again choose a different combination of input parameters
for arguments 'option' and 'suboption' for the next test.
  
  
The Second Approach: Random Test Sequences  
  
It makes no sense to test options randomly, as all those
combinations of values will be tested anyway. So the test
will be conducted for numeric arguments only.

This test has 900 cases and might take a couple of minutes,
so you have time to pour yourself a cup of coffee: (_)]...

    set.seed(0)
    r <- list()
    r$x <- runif(15, min=-10, max=10)
    r$y <- runif(15, min=-10, max=10)
    r$option <- c("b")
    r$suboption <- c("a","b","c","d")
    
    generate.argset(arg_register = r, display_progress=TRUE)
    apply.argset(FUN="fuzzdemofunc")
    test_summary()
    plot_tests()
===LOG OMITTED===

Fuzztest: Argument-Option Combination Results
===================================================
   ARG~OPT     Arg Name     PASS    FAIL    FAIL%
---------------------------------------------------
   1 ~    1    x              35      25     41.7
   1 ~    2    x              35      25     41.7
   1 ~    3    x              35      25     41.7
   1 ~    4    x              35      25     41.7
   1 ~    5    x              35      25     41.7
   1 ~    6    x              35      25     41.7
   1 ~    7    x              35      25     41.7
   1 ~    8    x              35      25     41.7
   1 ~    9    x              35      25     41.7
   1 ~   10    x              35      25     41.7
   1 ~   11    x              35      25     41.7
   1 ~   12    x              35      25     41.7
   1 ~   13    x              35      25     41.7
   1 ~   14    x              35      25     41.7
   1 ~   15    x              35      25     41.7
   2 ~    1    y              45      15     25.0
   2 ~    2    y              30      30     50.0
   2 ~    3    y              30      30     50.0
   2 ~    4    y              45      15     25.0
   2 ~    5    y              30      30     50.0
   2 ~    6    y              45      15     25.0
   2 ~    7    y              45      15     25.0
   2 ~    8    y              30      30     50.0
   2 ~    9    y              30      30     50.0
   2 ~   10    y              30      30     50.0
   2 ~   11    y              30      30     50.0
   2 ~   12    y              30      30     50.0
   2 ~   13    y              30      30     50.0
   2 ~   14    y              30      30     50.0
   2 ~   15    y              45      15     25.0
   3 ~    1    option        525     375     41.7
   4 ~    1    suboption     225       0      0.0
   4 ~    2    suboption     195      30     13.3
   4 ~    3    suboption       0     225    100.0
   4 ~    4    suboption     105     120     53.3
===================================================

Fuzztest: Summary
========================================================================
  Arg Name            Failure Rate Contribution, % (Max - Min)          
------------------------------------------------------------------------
          x    0.0  '                                                  '
          y   25.0  '************                                      '
     option    0.0  '                                                  '
  suboption  100.0  '**************************************************'
========================================================================

The test table shows that suboption #3 ('c') is always failing.

Let's see if the visual approach provides a better perspective.

[plot: test results for random sequences of x and y]

This graph has a confusing order of axes at this point.
An axis that has only one option should either be hidden or placed
at an edge of the chart so relations with other parameters could
be visible. To reorder axes, for simplicity, we will quickly create 
a smaller test with a different sequence of arguments, which will 
change the sequence of axes.

    set.seed(0)
    r <- list()
    r$x <- runif(5, min=-10, max=10)
    r$y <- runif(5, min=-10, max=10)
    r$suboption <- c("a","b","c","d")
    r$option <- c("b")
    
    generate.argset(arg_register = r, display_progress=TRUE)
    apply.argset(FUN="fuzzdemofunc")
    test_summary()
    plot_tests()
===LOG OMITTED===

Fuzztest: Argument-Option Combination Results
===================================================
   ARG~OPT     Arg Name     PASS    FAIL    FAIL%
---------------------------------------------------
   1 ~    1    x              12       8     40.0
   1 ~    2    x              12       8     40.0
   1 ~    3    x              12       8     40.0
   1 ~    4    x              12       8     40.0
   1 ~    5    x              12       8     40.0
   2 ~    1    y              10      10     50.0
   2 ~    2    y              15       5     25.0
   2 ~    3    y              15       5     25.0
   2 ~    4    y              10      10     50.0
   2 ~    5    y              10      10     50.0
   3 ~    1    suboption      25       0      0.0
   3 ~    2    suboption      15      10     40.0
   3 ~    3    suboption       0      25    100.0
   3 ~    4    suboption      20       5     20.0
   4 ~    1    option         60      40     40.0
===================================================

Fuzztest: Summary
========================================================================
  Arg Name            Failure Rate Contribution, % (Max - Min)          
------------------------------------------------------------------------
          x    0.0  '                                                  '
          y   25.0  '************                                      '
  suboption  100.0  '**************************************************'
     option    0.0  '                                                  '
========================================================================

[plot: test results with reordered axes, reduced test set]

The textual test summary shows the same pattern as in the previous 
test. Also, a reduced set of test cases produced a more transparent
representation of test results without losing important details.

There are many ways to proceed from here:
* if some 'error' states are valid, one can exclude them from tests 
  using the 'subset' argument of apply.argset().
* if bugs are trivial, one can eliminate them one by one.
* if faults are intractable, one can start with narrowing down the
  range of input parameters and further analyze function behavior.

-------------
 End of Demo 
-------------


The following is the test function used in the demo:

#' Generates errors for several combinations of input parameters to test the
#' existing and emerging functionality of the package
#'
#' Whenever options lead the control flow within a function to a 'demo bug', 
#' the function stops and the test framework records a 'FAIL' result.
#' Upon a successful completion, the function returns a numeric value into the 
#' environment from which the function was called.
#'
#' @param x any numeric scalar value (non-vector)
#' @param y any numeric scalar value (non-vector)
#' @param option any character value from "a", "b", "c"
#' @param suboption any character value from "a", "b", "c", "d"
#' 
#' @author cloudcell
#' 
#' @export
fuzzdemofunc <- function(x, y, option, suboption)
{
    tmp1 <- 0
    switch(option,
           "a"={
               switch(suboption,
                      "a"={                                    },
                      "b"={                                    },
                      "c"={ if(x + y <0) stop("demo bug #1CF (control flow)") },
                      "d"={                                    },
                      { stop("Wrong suboption (valid 'FAIL')") }
               )
               if(abs(x-y+1)<0.01) stop("demo bug #1NC (numeric calc.)")
           },
           "b"={
               x <- 1
               switch(suboption,
                      "a"={ x <- x*1.5                         },
                      "b"={ x <- y                             },
                      "c"={ y <- 1 }, "d"={ if(x>y) stop("demo bug #2CF (control flow)") },
                      { stop("Wrong suboption (valid 'FAIL')") }
               )
               if(abs(x %% 5 - y)<0.01) stop("demo bug #2NC (numeric calc.)")
           },
           "c"={
               switch(suboption,
                      "a"={ stop("demo bug #3CF (control flow)") },
                      "b"={                                    },
                      "c"={                                    },
                      "d"={  rm(tmp1)                          },
                      { stop("Wrong suboption (valid 'FAIL')") }
               )
               if(abs(x %/% 5 - y)<0.01) stop("demo bug #3NC  (numeric calc.)")
           },
           { stop("Wrong option (valid 'FAIL')") }
    )
    
    if(!exists("tmp1")) stop("demo bug #4CF (control flow)")
    
    result <- x - y*2 + 5
    
    if(abs(result)<0.01) stop("demo bug #4NC  (numeric calc.)")
    
    result
}


Testing R Code

Among the various ways to test R code on GitHub / Travis / Codecov, there are four main approaches:

  1. use the RUnit package
  2. use the testthat package
  3. use one’s own custom function (what I’ve been doing so far)
  4. save test output as a reference and compare the output of modified code against it (see the sketch after this list)
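For example, approach #4 can be as simple as the following sketch (the tested function and the file name are placeholders):

    my_function_under_test <- function(x) x^2   # placeholder

    result   <- my_function_under_test(42)
    ref_file <- "reference_output.rds"

    if (!file.exists(ref_file)) {
        saveRDS(result, ref_file)               # first run: record the reference
    } else {
        reference <- readRDS(ref_file)
        stopifnot(identical(result, reference)) # later runs: compare against it
    }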

After reading this post [http://yihui.name/en/2013/09/testing-r-packages/], I realized that saving reference values and comparing the output of modified code against them does not allow TDD (test-driven development): the tests will always “drag behind” the development process.

I am currently using option #3. However, this approach has obvious shortcomings in large projects. Since I use R as well as other languages, my choice naturally falls on RUnit (an xUnit-style framework), as many languages have xUnit implementations, which will make life easier in the long run.

Key points about the testing workflow:

  1. install the package
  2. test the package
  3. testing in development mode is a separate matter and won’t be my primary concern

File locations (using Dirk Eddelbuettel’s GitHub repo as an example):

The “package_root/tests” folder contains only the file “doRUnit.R”, which launches the tests {launcher example}.

The “package_root/inst/unitTests” folder contains a file with the primary test suite builder code {suite_builder example} and the test code files {test files examples}. The “unitTests” folder is moved into the package root upon installation and becomes accessible to the ‘launcher’ R code sitting in the “package_root/tests” folder.
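A minimal sketch of such a launcher, assuming the RUnit layout described above (the package name ‘mypackage’ and the test file pattern are placeholders):

    # tests/doRUnit.R
    library(RUnit)
    library(mypackage)                          # placeholder package name

    # after installation, inst/unitTests becomes unitTests/ in the package root
    test_dir <- system.file("unitTests", package = "mypackage")

    suite  <- defineTestSuite("mypackage unit tests",
                              dirs = test_dir,
                              testFileRegexp = "^test_.*\\.R$")
    result <- runTestSuite(suite)
    printTextProtocol(result)

    # make R CMD check fail if any test failed
    errs <- getErrors(result)
    if (errs$nFail > 0 || errs$nErr > 0) stop("RUnit tests failed")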

References for RUnit:

PS: A good review of testing packages (pros, cons, usage):