We have a new (207 MB) DataModeler release available for download.
The release focuses on three things: two new functions, closing an evolutionary loophole, and addressing a potential Mathematica pathology. The new functions, DataCompletenessMap and DataCompletenessPlot, make it easy to examine the prevalence and distribution of nonnumerics in the data we want to model. As usual, the information is layered with intelligent, easily specified tooltips, so that insightful, high-quality graphics are easy to generate.
The evolutionary loophole that has been closed is in the handling of nonnumerics. Previously, we had a NumericColumnThreshold which restricted model development to those inputs containing at least a specified fraction of numeric entries (and only the numeric predictions were considered in evaluating model quality). However, if we allowed more than the default fraction of missing elements, the model search algorithm could combine multiple inputs whose missing entries fell in different records, and the net result was that models were evaluated on an even smaller fraction of the data records. To address this, we introduced a NumericPredictionRequirement, which requires a model to evaluate to a numeric value for a minimum fraction of the data records; otherwise, the model is rejected even though each input individually satisfies the completeness requirement.
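To make the loophole concrete, here is a minimal sketch in Python rather than DataModeler code. The ten records, the "NA" marker, and the 0.75 values for both thresholds are made up for illustration and are not DataModeler defaults. Each input is 80% numeric on its own, yet a model that uses all three inputs can only be evaluated on 40% of the records.

```python
import math

def is_numeric(value):
    """Treat ints and finite floats as numeric; strings, None, and NaN are not."""
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float)) and math.isfinite(value)

def column_completeness(rows, col):
    """Fraction of records whose entry in column `col` is numeric."""
    return sum(is_numeric(r[col]) for r in rows) / len(rows)

def joint_completeness(rows, cols):
    """Fraction of records in which every column in `cols` is numeric."""
    return sum(all(is_numeric(r[c]) for c in cols) for r in rows) / len(rows)

# Ten hypothetical records with three inputs; "NA" marks a missing entry.
NA = "NA"
rows = [
    (1.0, 2.0, 3.0),
    (1.1, 2.1, 3.1),
    (1.2, 2.2, 3.2),
    (1.3, 2.3, 3.3),
    (NA, 2.4, 3.4),
    (NA, 2.5, 3.5),
    (1.6, NA, 3.6),
    (1.7, NA, 3.7),
    (1.8, 2.8, NA),
    (1.9, 2.9, NA),
]

# Hypothetical thresholds, chosen only for this example.
column_threshold = 0.75
prediction_requirement = 0.75

per_column = [column_completeness(rows, c) for c in range(3)]
joint = joint_completeness(rows, [0, 1, 2])

# Every input passes the per-column check on its own ...
passes_columns = all(f >= column_threshold for f in per_column)
# ... but a model combining all three inputs fails the per-model check.
passes_prediction = joint >= prediction_requirement

print(per_column, joint, passes_columns, passes_prediction)
```

A per-column check alone accepts all three inputs here, which is why a separate check on the fraction of records a model can actually evaluate is needed.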
The pathology is that Mathematica can consume all available memory, virtual memory, and disk space during very difficult model searches, that is, very long modeling runs with transcendental functions. The behavior is stochastic and only affects rare data sets and modeling problems, but it is effectively a memory leak. We now monitor memory consumption and, if a MemoryLimit is reached, truncate the search and return/archive the results at that point, just as we would if a TimeConstraint were encountered.
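The general pattern of stopping a long search when memory growth crosses a limit, and keeping whatever has been found so far, can be sketched as follows. This is a Python illustration under stated assumptions, not how DataModeler or the Mathematica kernel implements it: `run_search`, `leaky_generation`, and the byte limits are hypothetical names and values, and memory is measured with Python's standard `tracemalloc` module.

```python
import tracemalloc

def run_search(max_generations, memory_limit_bytes, work):
    """Call `work(generation)` repeatedly, stopping early if memory grows too much.

    Memory is measured as growth relative to the start of the run.
    Returns (results, reason), where reason is "completed" or "memory_limit".
    """
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    results = []
    reason = "completed"
    try:
        for generation in range(max_generations):
            results.append(work(generation))
            current, _ = tracemalloc.get_traced_memory()
            if current - baseline > memory_limit_bytes:
                # Truncate the run but keep everything gathered so far,
                # in the same way a time limit would end the run.
                reason = "memory_limit"
                break
    finally:
        tracemalloc.stop()
    return results, reason

def leaky_generation(generation):
    """Hypothetical workload: each generation retains about 100 kB."""
    return bytearray(100_000)

results, reason = run_search(
    max_generations=1000,
    memory_limit_bytes=1_000_000,
    work=leaky_generation,
)
print(len(results), reason)
```

The key design point is that hitting the limit is treated as a normal termination condition rather than an error, so partial results are still returned.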
The complete release notes are below:
The highlights of this release are the new DataCompletenessMap and DataCompletenessPlot. These are useful to get the zen of data sets which feature incomplete data. Related to this, we have introduced a new SymbolicRegression option, NumericPredictionRequirement, since the default of evaluating model quality only on complete records in the data meant that models could, in extreme circumstances, be assessed on far fewer data records than expected even though each of the constituent variables passed the NumericColumnThreshold.
Mathematica also appears to have some memory leak issues which can come into play for very long modeling runs. To address this, we implemented a MemoryLimit on SymbolicRegression to avoid runaway consumption of RAM and disk space.
- Introduced two new functions related to data completeness, DataCompletenessMap and DataCompletenessPlot, which provide a visual assessment of the presence of non-numeric elements in a data set.
- Addressed a problem wherein models could be assessed on a smaller-than-expected fraction of the data set even though the individual variables satisfied the completeness threshold specified by the NumericColumnThreshold. To fix this, we introduced a new SymbolicRegression option, NumericPredictionRequirement, which sets the minimum fraction of data records that a model must evaluate numerically for the non-numerics to be automatically excluded from the assessment of ModelQuality. Along with this change, the NumericColumnThreshold default has been set to Automatic, which uses the NumericPredictionRequirement as its threshold.
- Modified SmallPlot so that it stays unevaluated if the input format is not recognized. This allows it to be used as a pure function for options such as ToolTipFunction.
- Implemented a default Mesh -> Automatic setting for CorrelationMatrixPlot which will suppress the mesh if a large number of data columns are being plotted. Also trapped the situation where nominally numeric columns didn't have any numeric overlap.
- Modified SymbolicRegression so that StoreModelSet can be used within the GenerationMonitor, CascadeMonitor, RunMonitor or EvolutionMonitor pure functions.
- Implemented a MemoryLimit option for SymbolicRegression to guard against Mathematica kernel memory leaks during the model search. This places an upper bound on the incremental memory required during each of the IndependentEvolutions. Additionally, a MemoryMonitor option was created which can return the profile of memory consumption over the course of each kernel's model search.