Commit e57e3938 authored by actuaryzhang, committed by Sean Owen

[SPARK-18715][ML] Fix AIC calculations in Binomial GLM

The AIC calculation in Binomial GLM is off when the response or weight is non-integer: the result differs from that in R. This issue arises when one models rates, i.e., the number of successes normalized by the number of trials, and uses the number of trials as weights. In this case, the effective likelihood is weight * label ~ binomial(weight, mu), where weight is the number of trials, weight * label is the number of successes, and mu is the success rate.
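
Written out, the effective likelihood above gives the following log-likelihood. This is a sketch in notation that is not part of the original description: $w_i$ is the weight (number of trials), $y_i$ the label (success rate), $\mu_i$ the fitted mean, and $k$ the number of estimated parameters in the standard AIC definition.

```latex
\ell(\mu) = \sum_i \left[ \log\binom{w_i}{w_i y_i}
          + w_i y_i \log \mu_i
          + w_i (1 - y_i) \log(1 - \mu_i) \right],
\qquad
\mathrm{AIC} = -2\,\ell(\hat{\mu}) + 2k
```

The code snippets in this description compute only the $-2\,\ell$ term, with $w_i$ and $w_i y_i$ rounded to the nearest integer.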

srowen sethah yanboliang HyukjinKwon zhengruifeng

## What changes were proposed in this pull request?
I suggest changing the current AIC calculation for the Binomial family from
```
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
}.sum()
```
to the following, which generalizes to real-valued responses and weights:
```
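-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  // Treat the rounded weight as the number of trials.
  val wt = math.round(weight).toInt
  if (wt == 0) {
    // An observation with zero trials contributes nothing to the likelihood.
    0.0
  } else {
    // y * weight is the number of successes out of wt trials.
    dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
}.sum()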
```
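
To make the per-observation term concrete, here is a minimal, dependency-free sketch of the proposed calculation. It is not the patch itself: `BinomialAicSketch`, `logChoose`, `binomialLogPmf`, and `aicTerm` are hypothetical names, and the binomial log-probability is computed by hand rather than through `dist.Binomial`. It assumes `0 <= y <= 1` and `0 < mu < 1`.

```scala
object BinomialAicSketch {
  // log of the binomial coefficient C(n, k), as a sum of logs (fine for small n).
  def logChoose(n: Int, k: Int): Double =
    (1 to k).map(i => math.log((n - k + i).toDouble / i)).sum

  // Log-probability of k successes in n trials with success rate mu.
  def binomialLogPmf(n: Int, k: Int, mu: Double): Double = {
    val successTerm = if (k == 0) 0.0 else k * math.log(mu)
    val failureTerm = if (k == n) 0.0 else (n - k) * math.log(1.0 - mu)
    logChoose(n, k) + successTerm + failureTerm
  }

  // -2 * log-likelihood over (label, fitted mean, weight) triples,
  // mirroring the rounding used in the proposed change.
  def aicTerm(predictions: Seq[(Double, Double, Double)]): Double =
    -2.0 * predictions.map { case (y, mu, weight) =>
      val wt = math.round(weight).toInt
      if (wt == 0) 0.0
      else binomialLogPmf(wt, math.round(y * weight).toInt, mu)
    }.sum

  def main(args: Array[String]): Unit = {
    // One success in two trials at mu = 0.5, then one success in one trial at mu = 0.8.
    val rows = Seq((0.5, 0.5, 2.0), (1.0, 0.8, 1.0))
    println(aicTerm(rows))
  }
}
```

With a label of 0.5 and a weight of 2, the first row is scored as one success in two trials, whereas the current formula would round the label to a single Bernoulli outcome and scale its log-probability by the weight.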
## How was this patch tested?
I will add a unit test if the community decides to include the proposed change. For now, the following modifies an existing weighted Binomial GLM test to illustrate the issue: the second label is changed from 0 to 0.5.

```
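// Assumes a spark-shell session, where `spark` is in scope for toDF.
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import spark.implicits._

// Weighted binomial data; the second label is 0.5 instead of 0.
val datasetWithWeight = Seq(
  (1.0, 1.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 2.0),
  (1.0, 3.0, 2.0, 1.0),
  (0.0, 4.0, 3.0, 3.0)
).toDF("y", "w", "x1", "x2")

// Assemble x1 and x2 into a features vector and y into the label column.
val formula = (new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label"))
val output = (formula.fit(datasetWithWeight)
  .transform(datasetWithWeight)
  .select("features", "label", "w"))

// Binomial GLM with w as the weight column, no intercept, no regularization.
val glr = (new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setWeightCol("w")
  .setFitIntercept(false)
  .setRegParam(0))

val model = glr.fit(output)
model.summary.aic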
```
The AIC from Spark is 17.3227, and the AIC from R is 15.66454.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16149 from actuaryzhang/aic.
parent 43298d15