## Modeling Trick: the Signed Pseudo Logarithm

Much of the data that the analyst uses exhibits extraordinary range: for example incomes, company sizes, popularity of books, and any “winner takes all” process (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is that data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a `log()` transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss).

Log-transforming data is essential when analyzing systems that operate in relative terms or are “scale invariant” (such as financial returns). For example, geometric Brownian motion is a stochastic process central to a number of financial models (such as option pricing). Geometric Brownian motion is actually the exponential of a standard linear Brownian motion (where increments are normal or Gaussian). In this case a logarithmic transformation moves from the observed data to a more natural frame of reference (where increments are additive instead of multiplicative).
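As a small illustration of the additive versus multiplicative point, here is a minimal R sketch that simulates a geometric Brownian motion path and checks that the log of each step ratio is an additive Gaussian increment. The drift, volatility, starting value and seed are illustrative assumptions, not values from any real data set.

```r
# Simulate geometric Brownian motion as the exponential of a Gaussian walk.
# All parameter values below are assumptions chosen for illustration.
set.seed(2024)
n_steps <- 1000
mu <- 0.0005     # assumed per-step drift on the log scale
sigma <- 0.02    # assumed per-step volatility on the log scale

# Linear Brownian motion with drift: cumulative sum of Gaussian steps.
log_path <- cumsum(rnorm(n_steps, mean = mu, sd = sigma))

# Geometric Brownian motion is the exponential of that path.
price <- 100 * exp(log_path)

# On the observed scale, successive values are related by ratios.
ratios <- price[-1] / price[-n_steps]

# On the log scale, the same steps are additive increments.
log_increments <- diff(log(price))

# The log of each ratio matches the additive increment (up to rounding).
max_gap <- max(abs(log(ratios) - log_increments))
print(max_gap)
print(c(mean = mean(log_increments), sd = sd(log_increments)))
```

The standard deviation of `log_increments` should land close to the assumed `sigma`, which is the sense in which the log scale is the natural frame of reference here.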

In addition to being the correct transform for log-normal data, the log transform offers useful range compression (it is often not safe to allow variables with extraordinary ranges into linear models) and is *stabilizing* in that it corrects heteroscedasticity in any data set where the outcome measurements are thought to have a relative error (typical of incomes and pricing).

However, a major shortcoming of the log transform is its inability to deal with zero values and negative values.

A signed value we have often been asked to characterize or predict is profit and loss (often called P&L). One natural model for P&L would be as the difference between a revenue and an expense. If the revenue and expense were both normally distributed then their difference would also be normal (and we would not need any stabilizing transform). However, if the revenue and expense were both log-normally distributed (say both proportional to task size or some other log-normal parameter) then the difference is not normal (it retains the propensity for extreme values or heavy tails of the original distributions). And for many financial size measures (company size, contract size and so on) the log-normal distribution is a much more realistic model than the normal distribution. In some situations P&Ls are formed from completely observed revenues and expenses (so we can model everything without sign problems); in other situations the signed P&L comes from an unobserved (or unrecorded) underlying process and we are forced to deal with signed quantities.
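The heavy-tail claim can be checked with a quick simulation. The sketch below assumes revenue and expense are independent log-normals with `meanlog = 0` and `sdlog = 1` (hypothetical parameters), and uses sample excess kurtosis as a rough tail measure. A normal distribution has excess kurtosis 0.

```r
# Difference of two log-normals versus a normal of matched spread.
# Distribution parameters are assumptions chosen for illustration.
set.seed(2024)
n <- 100000
revenue <- rlnorm(n, meanlog = 0, sdlog = 1)
expense <- rlnorm(n, meanlog = 0, sdlog = 1)
pnl <- revenue - expense

# Sample excess kurtosis: near 0 for normal data, large for heavy tails.
excess_kurtosis <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^4) - 3
}

# A normal sample with the same standard deviation, for comparison.
normal_pnl <- rnorm(n, mean = 0, sd = sd(pnl))

print(excess_kurtosis(pnl))
print(excess_kurtosis(normal_pnl))
```

The simulated P&L is signed and roughly symmetric, yet its excess kurtosis is far from 0, so it is not a candidate for a plain `log()` transform and not well described as normal either.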

For signed data we suggest the following transformation (code in R):

```
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
```

`asinh()` is a somewhat ugly function that is the inverse of `sinh()`. `sinh()` is defined as:

sinh(x) = (e^x - e^(-x))/2

The important point is that for `x` such that `|x|` is large, `2*sinh(x)` rapidly approaches `sign(x)*e^(|x|)`. Thus we should expect `asinh(x/2)` to look a lot like `sign(x)*log(|x|)` (which is why we call it a signed pseudo logarithm). For `pseudoLog10()` we take the previous function divided by `log(10)` to ensure that we are in log-10 like units (i.e. `pseudoLog10(100)` is nearly 2, `pseudoLog10(1000)` is nearly 3 and so on). Business audiences tend to have an easier time with log-10 (or dB) units (which can be explained as counting the number of decimal digits) than natural log or log-e units.
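To make the approximation concrete, the standard closed form of the inverse hyperbolic sine can be worked through for positive `x`. This derivation is an added sketch rather than part of the original argument:

```latex
\begin{align*}
\operatorname{asinh}(z) &= \ln\!\left(z + \sqrt{z^2 + 1}\right) \\
\operatorname{asinh}(x/2) &= \ln\!\left(\frac{x + \sqrt{x^2 + 4}}{2}\right)
  = \ln(x) + \ln\!\left(\frac{1 + \sqrt{1 + 4/x^2}}{2}\right) \qquad (x > 0) \\
&\approx \ln(x) + \frac{1}{x^2} \qquad \text{for large } x
\end{align*}
```

The division by 2 inside `asinh()` is what makes the leading term exactly `log(x)` instead of `log(2x)`. For `x = 100` the correction term is about `1e-4` in natural-log units, or roughly `4e-5` after dividing by `log(10)`, which is why `pseudoLog10(100)` is so close to 2.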

So for large positive numbers `pseudoLog10()` pretty much behaves like `log10()` (itself a standard transform). In fact `pseudoLog10()` has the following nice properties:

- `pseudoLog10(x)` is defined for all real `x`.
- `pseudoLog10(0) = 0`.
- `pseudoLog10(-x) = -pseudoLog10(x)`.
- `pseudoLog10(x)` is monotone in `x`.
- For `x` such that `|x|` is large: `pseudoLog10(x)` is very near `sign(x)*log10(|x|)`.
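The listed properties are easy to spot-check numerically. The following is a minimal sketch on a handful of hand-picked test points (the points themselves are arbitrary assumptions); it is a sanity check, not a proof.

```r
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }

# Arbitrary increasing test points spanning negative, zero and positive values.
x <- c(-1e6, -1000, -10, -1, 0, 1, 10, 1000, 1e6)
y <- pseudoLog10(x)

defined_everywhere <- all(is.finite(y))                  # defined for all real x
zero_at_zero <- pseudoLog10(0) == 0                      # pseudoLog10(0) = 0
odd_symmetry <- isTRUE(all.equal(pseudoLog10(-x), -y))   # odd function
monotone <- all(diff(y) > 0)                             # increasing in x

# Gap to log10() for large positive inputs.
big <- c(100, 1000, 1e6)
gap <- pseudoLog10(big) - log10(big)

print(c(defined_everywhere, zero_at_zero, odd_symmetry, monotone))
print(gap)
```

The printed gaps shrink quickly as the inputs grow, which is the "very near `log10()`" behavior described in the last bullet.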

We strongly recommend trying this transformation before feeding heavy-tailed data into a linear or logistic model.

However, we cannot recommend the transformation for presentation. Consider the simple case of plotting the distribution or density of normal data with mean zero and standard deviation 100 (see My Favorite Graphs for a description of a density plot):

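```
library(ggplot2)
# signed pseudo logarithm, in log-10 like units
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
# 1000 draws from a normal with mean 0 and standard deviation 100
d <- data.frame(x=rnorm(n=1000,sd=100))
# density plot of the untransformed data
ggplot(d) + geom_density(aes(x=x))
```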

The density plot shows what we would expect: a near normal distribution (most points towards the center and mass falling off quickly as we move away). However, the plot of the `pseudoLog10()` transformed data is not what we would hope:

```
d$pseudoLog10x <- pseudoLog10(d$x)
ggplot(d) + geom_density(aes(x=pseudoLog10x))
```

The data density (falsely) appears bimodal! This is because the `pseudoLog10()` transform compresses ranges more and more violently as we move away from the origin (and does not compress near the origin). So as we move away from the origin, the product of the real data density times the degree of range compression climbs, achieves a maximum and then falls. This phenomenon (which is just a “change of variables” for densities) gives us the bimodal appearance for unimodal distributions that have significant mass outside of the range [-10,10]. The bimodal appearance is mostly a fact about the transform, not really a feature of the underlying data.
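The change of variables can be written out directly, without sampling. The sketch below assumes normal data with mean zero and standard deviation 100, and computes the density of the transformed variable as the original density times `dx/dy`; the helper names `pseudoLog10Inverse` and `transformedDensity` are introduced here for illustration only.

```r
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }

# Inverse transform: recover x from y = pseudoLog10(x).
pseudoLog10Inverse <- function(y) { 2*sinh(y*log(10)) }

# Density of y when x is normal with mean 0 and the given sd:
# original density at x, times the range compression factor dx/dy.
transformedDensity <- function(y, sd) {
  x <- pseudoLog10Inverse(y)
  compression <- 2*log(10)*cosh(y*log(10))
  dnorm(x, mean = 0, sd = sd) * compression
}

step <- 0.01
y <- seq(-4, 4, by = step)
g <- transformedDensity(y, sd = 100)

# The transformed density should still integrate to about 1.
total_mass <- sum(g) * step

# Locate the peak on the non-negative side and compare with the origin.
y_pos <- y[y >= 0]
g_pos <- g[y >= 0]
peak_y <- y_pos[which.max(g_pos)]
density_at_origin <- transformedDensity(0, sd = 100)
density_at_peak <- max(g_pos)

print(c(total_mass = total_mass, peak_y = peak_y))
print(c(origin = density_at_origin, peak = density_at_peak))
```

Under these assumptions the peak sits near `y = 2` (that is, near `x = 100`, one standard deviation out), with a mirror-image peak on the negative side and a deep trough at the origin. No bimodality was put into the data; it comes entirely from the compression factor.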

We see value in examining the relative sizes and centers of these two modes for asymmetric distributions (such as the profit and loss statement for a set of accounts that are mostly losing money). The position and relative sizes of the modes give us an initial hint of what to look for (they help with questions like: “are total losses driven by many accounts or by few accounts” and so on). We cannot, however, recommend the `pseudoLog10()` transform for presentation. The most striking feature of the graph is almost always the bimodal appearance of the data; and the bimodal appearance is almost always an artifact of the transform (not a real feature of the data). You cannot in good conscience push a presentation where the most prominent and exciting observation is not in fact in the data.

We do still recommend trying the `pseudoLog10()` transform when building a linear or logistic model with wide-range data. The transformation usefully compresses range, which allows the modeled coefficients to be a function of most of the data and not a function of a few extreme values. Models that depend on most of their data (or on central estimates from their data) tend to be safer, achieve higher statistical significance and cross-validate more reliably. Models that are dominated by a few extreme values tend to be unsafe, not achieve statistical significance and not cross-validate reliably. The bimodal artifact can work in the favor of modeling as it tends to compress a transformed variable into “typical positive example” and “typical negative example” while still allowing magnitudes to enter the model in some form.
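One way to see "a function of a few extreme values" is through regression leverage. The sketch below is built on assumptions throughout: the predictor is a simulated difference of log-normals, and the outcome `y` is a made-up relationship used only so that `lm()` has something to fit. It compares the largest hat value of a linear fit on the raw variable with that of a fit on the transformed variable.

```r
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }

# Simulated signed, heavy-tailed predictor (assumed parameters).
set.seed(2024)
n <- 1000
d <- data.frame(x = rlnorm(n, sdlog = 2) - rlnorm(n, sdlog = 2))
d$px <- pseudoLog10(d$x)

# Hypothetical outcome, invented for this illustration only.
d$y <- d$px + rnorm(n)

raw_fit <- lm(y ~ x, data = d)
transformed_fit <- lm(y ~ px, data = d)

# Leverage (hat values): how strongly single rows can pull the fit.
max_leverage_raw <- max(hatvalues(raw_fit))
max_leverage_transformed <- max(hatvalues(transformed_fit))

print(c(raw = max_leverage_raw, transformed = max_leverage_transformed))
```

Leverage depends only on the predictor values, so the comparison does not hinge on the invented outcome. A much larger maximum hat value for the raw fit means a single row can dominate the estimated slope, which is the failure mode the transform is meant to reduce.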

Used with care the `pseudoLog10()` or arcsinh (`asinh()` in R) transform can be an important data preparation step for signed data with large range. Many financial summaries (such as P&L) meet these conditions and often profit from the transform.

Nice article on a principled way to get a very similar transform (with good citations): Rick Wicklin “A log transformation of positive and negative values” http://blogs.sas.com/content/iml/2014/07/14/log-transformation-of-pos-neg/ . A lot of these techniques are part of the useful “folk theorems” that a lot of data scientists know, but finding a good writeup can be a problem. In our book (“Practical Data Science with R”, page 75) author Nina Zumel decided it can be better to map the entire interval [-1,1] to zero (giving up smoothness) to avoid inflicting excess math on audiences. All of these methods have the uneven stretching effect, which is needed away from the origin (a necessary stabilizing/centering or range compression step), but leaves some issues near the origin (unimodal data often appears bimodal when density plotted).
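For comparison, the "map [-1,1] to zero" idea can be sketched as follows. This is a reconstruction from the one-sentence description above; the name `signedLog10` and the exact form are assumptions and may not match the book's code.

```r
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }

# Assumed form of the simpler variant: zero on [-1,1], signed log10 outside.
signedLog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}

x <- c(-100, -0.5, 0, 0.5, 100)
comparison <- data.frame(
  x = x,
  signedLog10 = signedLog10(x),
  pseudoLog10 = pseudoLog10(x)
)
print(comparison)
```

The two versions agree closely for large `|x|` and differ only near the origin, where `signedLog10()` is flat and `pseudoLog10()` stays smooth and strictly increasing.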