[SPARK-18408][ML] API Improvements for LSH
## What changes were proposed in this pull request?

(1) Change the output schema to `Array of Vector` instead of `Vectors`
(2) Use `numHashTables` as the dimension of the Array
(3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, and `MinHash` to `MinHashLSH`
(4) Make `randUnitVectors/randCoefficients` private
(5) Make Multi-Probe NN Search and `hashDistance` private for future discussion

Saved for future PRs:

(1) AND-amplification, and `numHashFunctions` as the dimension of Vector, are saved for a future PR.
(2) `hashDistance` and Multi-Probe NN Search need more discussion. The current implementation is just a backward-compatible one.

## How was this patch tested?

Related unit tests are modified to make sure that the performance of LSH is preserved and that the outputs of the APIs meet expectations.

Author: Yun Ni <yunn@uber.com>
Author: Yunni <Euler57721@gmail.com>

Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
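To make the new output shape concrete, here is a minimal, Spark-free sketch of what "an Array of Vector with `numHashTables` entries" could look like for a bucketed random projection hash. Everything in it is an assumption for illustration: `BucketedRandomProjectionSketch`, `fit`, and `hash` are hypothetical names, plain `Array[Double]` stands in for Spark's `Vector`, and the formula `floor(x · v / bucketLength)` is the usual bucketed random projection hash rather than code copied from this patch.

```scala
import scala.util.Random

// Hypothetical stand-in for the fitted model; this is NOT the Spark class.
final case class BucketedRandomProjectionSketch(
    randUnitVectors: Array[Array[Double]], // one random unit vector per hash table
    bucketLength: Double) {

  // The number of hash tables equals the number of random unit vectors.
  def numHashTables: Int = randUnitVectors.length

  // Returns one entry per hash table. Each entry is a one-element "vector"
  // holding the bucket index floor(x . v / bucketLength) for that table.
  def hash(x: Array[Double]): Array[Array[Double]] =
    randUnitVectors.map { v =>
      val dot = x.zip(v).map { case (a, b) => a * b }.sum
      Array(math.floor(dot / bucketLength))
    }
}

object BucketedRandomProjectionSketch {
  // Draws numHashTables Gaussian vectors and normalizes each to unit length.
  def fit(
      inputDim: Int,
      numHashTables: Int,
      bucketLength: Double,
      seed: Long): BucketedRandomProjectionSketch = {
    val rng = new Random(seed)
    val vectors = Array.fill(numHashTables) {
      val raw = Array.fill(inputDim)(rng.nextGaussian())
      val norm = math.sqrt(raw.map(a => a * a).sum)
      raw.map(_ / norm)
    }
    BucketedRandomProjectionSketch(vectors, bucketLength)
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val model = BucketedRandomProjectionSketch.fit(
      inputDim = 3, numHashTables = 4, bucketLength = 2.0, seed = 42L)
    val hashes = model.hash(Array(1.0, 0.5, -2.0))
    // The outer array has numHashTables entries; each inner vector has length 1.
    println(s"tables = ${hashes.length}, dims = ${hashes.map(_.length).mkString(",")}")
  }
}
```

In this sketch each inner vector has exactly one element because AND-amplification is not modeled; under the plan described above, a later `numHashFunctions` parameter would presumably widen each inner vector while the outer array stays at `numHashTables` entries.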
Changed files:
- mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala (43 additions, 34 deletions)
- mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala (75 additions, 63 deletions)
- mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala (59 additions, 53 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala (58 additions, 42 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala (12 additions, 5 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala (59 additions, 24 deletions)