    [SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF · ec9df6a7
    RJ Nowling authored
    This PR for [SPARK-3614](https://issues.apache.org/jira/browse/SPARK-3614) adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
    
    Filtering is controlled by a minimumOccurence parameter (default 0; renamed minDocFreq in commit 40fd70c).  When a term's document frequency (DF) is less than minimumOccurence, its IDF is set to 0, just like when the DF is 0.  The term's TF-IDF is therefore 0, as if the term were not present in the documents.
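
The rule above can be sketched in plain Scala, without Spark. This is only an illustration, not the code from this PR: the object and function names are hypothetical, the parameter uses the later name minDocFreq, and the smoothed formula log((m + 1) / (df + 1)) is an assumption about how IDF is computed.

```scala
object MinDocFreqSketch {
  // Hypothetical helper: one IDF value per term.
  // Terms whose document frequency is below minDocFreq get an IDF of 0.
  def idfWithMinDocFreq(
      docFreqs: Array[Long],
      numDocs: Long,
      minDocFreq: Int): Array[Double] =
    docFreqs.map { df =>
      if (df >= minDocFreq) {
        // Assumed smoothed IDF formula; the source does not state it.
        math.log((numDocs + 1.0) / (df + 1.0))
      } else {
        0.0
      }
    }

  def main(args: Array[String]): Unit = {
    // Three terms appearing in 6, 1 and 3 of 10 documents.
    val idf = idfWithMinDocFreq(Array(6L, 1L, 3L), numDocs = 10L, minDocFreq = 2)

    // Term frequencies for a single document.
    val tf = Array(3.0, 5.0, 1.0)

    // TF-IDF is the element-wise product, so the second term
    // (DF = 1 < 2) ends up with a TF-IDF of 0 despite its TF of 5.
    val tfidf = tf.zip(idf).map { case (t, i) => t * i }
    println(tfidf.mkString(", "))
  }
}
```

With minDocFreq = 0 the condition always holds for non-negative document frequencies, so no term is zeroed by the filter.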
    
    This PR makes the following changes:
    * Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
    * Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatible with the original IDF API.
    * Set the IDFs to 0 for terms whose DFs are less than minimumOccurence.
    * Add tests to the Scala IDFSuite and Java JavaTfIdfSuite test suites.
    * Update the MLlib Feature Extraction programming guide to describe the new feature.
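
The backwards-compatible constructor mentioned in the list can be sketched as follows. `ThresholdedIdf` is a hypothetical stand-in class, not Spark's IDF, and it only shows the constructor pattern, assuming the final parameter name minDocFreq.

```scala
// Hypothetical stand-in for a class that gains a new constructor parameter.
// `val` exposes the parameter as an immutable public member.
class ThresholdedIdf(val minDocFreq: Int) {
  // Parameter-less auxiliary constructor: existing `new ThresholdedIdf()`
  // call sites keep compiling and get the old behaviour (no filtering).
  def this() = this(0)
}

object ThresholdedIdfDemo {
  def main(args: Array[String]): Unit = {
    val legacy = new ThresholdedIdf()     // minDocFreq defaults to 0
    val filtered = new ThresholdedIdf(2)  // ignore terms seen in fewer than 2 documents
    println(s"${legacy.minDocFreq}, ${filtered.minDocFreq}")
  }
}
```

One plausible reason to prefer an explicit auxiliary constructor over a Scala default argument is that Java callers cannot use Scala default parameter values, and this PR also adds Java tests.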
    
    Author: RJ Nowling <rnowling@gmail.com>
    
    Closes #2494 from rnowling/spark-3614-idf-filter and squashes the following commits:
    
    0aa3c63 [RJ Nowling] Fix indentation
    e6523a8 [RJ Nowling] Remove unnecessary toDouble's from IDFSuite
    bfa82ec [RJ Nowling] Add space after if
    30d20b3 [RJ Nowling] Add spaces around equals signs
    9013447 [RJ Nowling] Add space before division operator
    79978fc [RJ Nowling] Remove unnecessary semi-colon
    40fd70c [RJ Nowling] Change minimumOccurence to minDocFreq in code and docs
    47850ab [RJ Nowling] Changed minimumOccurence to Int from Long
    9fb4093 [RJ Nowling] Remove unnecessary lines from IDF class docs
    1fc09d8 [RJ Nowling] Add backwards-compatible constructor to DocumentFrequencyAggregator
    1801fd2 [RJ Nowling] Fix style errors in IDF.scala
    6897252 [RJ Nowling] Preface minimumOccurence members with val to make them final and immutable
    a200bab [RJ Nowling] Remove unnecessary else statement
    4b974f5 [RJ Nowling] Remove accidentally-added import from testing
    c0cc643 [RJ Nowling] Add minimumOccurence filtering to IDF