Posted on Categories data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Statistics To English Translation, TutorialsTags , , , , , , 2 Comments on Working with Sessionized Data 1: Evaluating Hazard Models

Working with Sessionized Data 1: Evaluating Hazard Models

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this form is a matter of performing the equivalent of a number of SQL joins (for example, Lecture 23 (“The Shape of Data”) from our paid video course Introduction to Data Science discusses this).

Doorlog

One notable exception is log data. Log data is a very thin data form where different facts about different individuals are written across many different rows. Converting log data into a ready for analysis form is called sessionizing. We are going to share a short series of articles showing important aspects of sessionizing and modeling log data. Each article will touch on one aspect of the problem in a simplified and idealized setting. In this article we will discuss the importance of dealing with time and of picking a business appropriate goal when evaluating predictive models.

For this article we are going to assume that we have sessionized our data by picking a concrete near-term goal (predicting cancellation of account or “exit” within the next 7 days) and that we have already selected variables for analysis (a number of time-lagged windows of recent log events of various types). We will use a simple model without variable selection as our first example. We will use these results to show how you examine and evaluate these types of models. In later articles we will discuss how you sessionize, how you choose examples, variable selection, and other key topics.

Continue reading Working with Sessionized Data 1: Evaluating Hazard Models

Posted on Categories Mathematics, Statistics, TutorialsTags , , , , , , 1 Comment on Estimating rates from a single occurrence of a rare event

Estimating rates from a single occurrence of a rare event

Elon Musk’s writing about a Tesla battery fire reminded me of some of the math related to trying to estimate the rate of a rare event from a single occurrence of the event (plus many non-event occurrences). In this article we work through some of the ideas. Continue reading Estimating rates from a single occurrence of a rare event