    [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API · c8abddc5
    Davies Liu authored
    ```
    pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
        :: Experimental ::
    
        If `observed` is a Vector, conduct Pearson's chi-squared goodness
        of fit test of the observed data against the expected distribution,
        or against the uniform distribution (by default), with each category
        having an expected frequency of `1 / len(observed)`.
        (Note: `observed` cannot contain negative values)
    
        If `observed` is a matrix, conduct Pearson's independence test on the
        input contingency matrix, which cannot contain negative entries or
        columns or rows that sum up to 0.
    
        If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
        test for every feature against the label across the input RDD.
        For each feature, the (feature, label) pairs are converted into a
        contingency matrix for which the chi-squared statistic is computed.
        All label and feature values must be categorical.
    
        :param observed: it could be a vector containing the observed categorical
                         counts/relative frequencies, or the contingency matrix
                         (containing either counts or relative frequencies),
                         or an RDD of LabeledPoint containing the labeled dataset
                         with categorical features. Real-valued features will be
                         treated as categorical for each distinct value.
        :param expected: Vector containing the expected categorical counts/relative
                         frequencies. `expected` is rescaled if the `expected` sum
                         differs from the `observed` sum.
        :return: ChiSquaredTest object containing the test statistic, degrees
                 of freedom, p-value, the method used, and the null hypothesis.
    ```
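
For readers who want to check the arithmetic behind the docstring, here is a minimal pure-Python sketch of the two statistics it describes. It is not the code added by this commit and does not call Spark; the function names `chi_sq_goodness_of_fit` and `chi_sq_independence` are hypothetical, and the sketch returns only the statistic and degrees of freedom (no p-value). It assumes plain lists as input and follows the stated rules: a uniform default for `expected`, rescaling `expected` to the `observed` sum, and rejecting negative entries or zero row/column sums.

```python
def chi_sq_goodness_of_fit(observed, expected=None):
    """Pearson goodness-of-fit statistic and degrees of freedom (illustrative only)."""
    n = len(observed)
    if n == 0:
        raise ValueError("observed must not be empty")
    if any(o < 0 for o in observed):
        raise ValueError("observed cannot contain negative values")
    if expected is None:
        # Default: uniform distribution, each category has frequency 1 / len(observed).
        expected = [1.0 / n] * n
    if len(expected) != n:
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected values must be positive in this sketch")
    # Rescale expected so that its sum matches the observed sum.
    scale = float(sum(observed)) / sum(expected)
    scaled = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, scaled))
    return statistic, n - 1


def chi_sq_independence(matrix):
    """Pearson independence statistic for a contingency matrix given as a list of rows."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    if any(v < 0 for row in matrix for v in row):
        raise ValueError("contingency matrix cannot contain negative entries")
    if any(s == 0 for s in row_sums + col_sums):
        raise ValueError("rows and columns must not sum to 0")
    total = float(sum(row_sums))
    statistic = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            # Expected cell count under independence of rows and columns.
            exp = row_sums[i] * col_sums[j] / total
            statistic += (value - exp) ** 2 / exp
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, dof
```

Because `expected` is rescaled, passing `[1, 1, 1]` or `[20, 20, 20]` for three categories gives the same statistic as omitting it.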
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #3091 from davies/his and squashes the following commits:
    
    145d16c [Davies Liu] address comments
    0ab0764 [Davies Liu] fix float
    5097d54 [Davies Liu] add Hypothesis test Python API