  1. Jul 10, 2015
    • [SPARK-7977] [BUILD] Disallowing println · e14b545d
      Jonathan Alter authored
      Author: Jonathan Alter <jonalter@users.noreply.github.com>
      
      Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:
      
      ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
      7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
      10724b6 [Jonathan Alter] Changing some printlns to logs in tests
      eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0b1dcb4 [Jonathan Alter] More println cleanup
      aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0c16fa3 [Jonathan Alter] Replacing some printlns with logs
      45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      5c8e283 [Jonathan Alter] Allowing println in audit-release examples
      5b50da1 [Jonathan Alter] Allowing printlns in example files
      ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      83ab635 [Jonathan Alter] Fixing new printlns
      54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
      b837c3a [Jonathan Alter] Disallowing println
      e14b545d
  2. Jul 06, 2015
    • [SPARK-8124] [SPARKR] Created more examples on SparkR DataFrames · 293225e0
      Daniel Emaasit (PhD Student) authored
      Here are more examples on SparkR DataFrames, including creating a Spark context and a SQL
      context, loading data, and simple data manipulation.
      
      Author: Daniel Emaasit (PhD Student) <daniel.emaasit@gmail.com>
      
      Closes #6668 from Emaasit/dan-dev and squashes the following commits:
      
      3a97867 [Daniel Emaasit (PhD Student)] Used fewer rows for createDataFrame
      f7227f9 [Daniel Emaasit (PhD Student)] Using command line arguments
      a550f70 [Daniel Emaasit (PhD Student)] Used base R functions
      33f9882 [Daniel Emaasit (PhD Student)] Renamed file
      b6603e3 [Daniel Emaasit (PhD Student)] changed "Describe" function to "describe"
      90565dd [Daniel Emaasit (PhD Student)] Deleted the getting-started file
      b95a103 [Daniel Emaasit (PhD Student)] Deleted this file
      cc55cd8 [Daniel Emaasit (PhD Student)] combined all the code into one .R file
      c6933af [Daniel Emaasit (PhD Student)] changed variable name to SQLContext
      8e0fe14 [Daniel Emaasit (PhD Student)] provided two options for creating DataFrames
      2653573 [Daniel Emaasit (PhD Student)] Updates to a comment and variable name
      275b787 [Daniel Emaasit (PhD Student)] Added the Apache License at the top of the file
      2e8f724 [Daniel Emaasit (PhD Student)] Added the Apache License at the top of the file
      486f44e [Daniel Emaasit (PhD Student)] Added the Apache License at the file
      d705112 [Daniel Emaasit (PhD Student)] Created more examples on SparkR DataFrames
      293225e0
  3. Jul 01, 2015
    • [SPARK-8378] [STREAMING] Add the Python API for Flume · 75b9fe4c
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6830 from zsxwing/flume-python and squashes the following commits:
      
      78dfdac [zsxwing] Fix the compile error in the test code
      f1bf3c0 [zsxwing] Address TD's comments
      0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
      e93736b [zsxwing] Fix the test case for determine_modules_to_test
      9d5821e [zsxwing] Fix pyspark_core dependencies
      f9ee681 [zsxwing] Merge branch 'master' into flume-python
      7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
      b96b0de [zsxwing] Merge branch 'master' into flume-python
      ce85e83 [zsxwing] Fix incompatible issues for Python 3
      01cbb3d [zsxwing] Add import sys
      152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
      14ba0ff [zsxwing] Add flume-assembly for sbt building
      b8d5551 [zsxwing] Merge branch 'master' into flume-python
      4762c34 [zsxwing] Fix the doc
      0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
      9f33873 [zsxwing] Add the Python API for Flume
      75b9fe4c
  4. Jun 30, 2015
    • [SPARK-8551] [ML] Elastic net python code example · 54524574
      Shuo Xiang authored
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6946 from coderxiang/en-java-code-example and squashes the following commits:
      
      7a4bdf8 [Shuo Xiang] address comments
      cddb02b [Shuo Xiang] add elastic net python example code
      f4fa534 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      6ad4865 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      54524574
  5. Jun 19, 2015
    • [HOTFIX] Fix scala style in DFSReadWriteTest that caused tests to fail · 4a462c28
      Liang-Chi Hsieh authored
      This Scala style problem caused tests to fail.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6907 from viirya/hotfix_style and squashes the following commits:
      
      c53f188 [Liang-Chi Hsieh] Fix scala style.
      4a462c28
    • Add example that reads a local file, writes to a DFS path provided by th... · a9858036
      RJ Nowling authored
      Adds an example that reads a local file, writes to a DFS path provided by the user, reads the file back from the DFS, and compares word counts on the local and DFS versions. Useful for verifying DFS correctness.
      
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #3347 from rnowling/dfs_read_write_test and squashes the following commits:
      
      af8ccb7 [RJ Nowling] Don't use java.io.File since DFS may not be POSIX-compatible
      b0ef9ea [RJ Nowling] Fix string style
      07c6132 [RJ Nowling] Fix string style
      7d9a8df [RJ Nowling] Fix string style
      f74c160 [RJ Nowling] Fix else statement style
      b9edf12 [RJ Nowling] Fix spark wc style
      44415b9 [RJ Nowling] Fix local wc style
      94a4691 [RJ Nowling] Fix space
      df59b65 [RJ Nowling] Fix if statements
      1b314f0 [RJ Nowling] Add scaladoc
      a931d70 [RJ Nowling] Fix import order
      0c89558 [RJ Nowling] Add example that reads a local file, writes to a DFS path provided by the user, reads the file back from the DFS, and compares word counts on the local and DFS versions. Useful for verifying DFS correctness.
      a9858036
    • [SPARK-8151] [MLLIB] pipeline components should correctly implement copy · 43c7ec63
      Xiangrui Meng authored
      Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6622 from mengxr/SPARK-8087 and squashes the following commits:
      
      0e4c8c4 [Xiangrui Meng] fix merge issues
      26fc1f0 [Xiangrui Meng] address comments
      e607a04 [Xiangrui Meng] merge master
      b85b57e [Xiangrui Meng] fix examples/compile
      d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy
      84ec278 [Xiangrui Meng] remove setter checks due to generics
      2cf2ed0 [Xiangrui Meng] snapshot
      291814f [Xiangrui Meng] OneVsRest.copy
      1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages
      43c7ec63
    • [SPARK-8444] [STREAMING] Adding Python streaming example for queueStream · a2016b4b
      Bryan Cutler authored
      A Python example similar to the existing one for Scala.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6884 from BryanCutler/streaming-queueStream-example-8444 and squashes the following commits:
      
      435ba7e [Bryan Cutler] [SPARK-8444] Fixed style checks, increased sleep time to show empty queue
      257abb0 [Bryan Cutler] [SPARK-8444] Stop context gracefully, Removed unused import, Added description comment
      376ef6e [Bryan Cutler] [SPARK-8444] Fixed bug causing DStream.pprint to append empty parenthesis to output instead of blank line
      1ff5f8b [Bryan Cutler] [SPARK-8444] Adding Python streaming example for queue_stream
      a2016b4b
  6. Jun 17, 2015
    • [SPARK-7017] [BUILD] [PROJECT INFRA] Refactor dev/run-tests into Python · 50a0496a
      Brennon York authored
      All, this is a first attempt at refactoring `dev/run-tests` into Python. Initially I merely converted all Bash calls over to Python, then moved to a much more modular approach (more functions, moved the calls around, etc.). What is here is the initial culmination and should provide a great base for various downstream issues (e.g. SPARK-7016, modularize / parallelize testing, etc.). Would love comments / suggestions on this initial first step!
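      One of the ported checks mentioned in the commit log below is locating the java binary when JAVA_HOME is not set. A minimal Python sketch of that kind of environment check (the function name is illustrative, not the actual run-tests.py code):

      ```python
      import os
      import shutil

      def find_java_binary(env=None):
          """Locate the java executable, preferring JAVA_HOME when set.

          Hypothetical sketch of the style of check a Python run-tests
          script performs; not the actual project code.
          """
          env = os.environ if env is None else env
          java_home = env.get("JAVA_HOME")
          if java_home:
              # Trust JAVA_HOME when present.
              return os.path.join(java_home, "bin", "java")
          # Otherwise fall back to the first java on the PATH (may be None).
          return shutil.which("java")
      ```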
      
      /cc srowen pwendell nchammas
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5694 from brennonyork/SPARK-7017 and squashes the following commits:
      
      154ed73 [Brennon York] updated finding java binary if JAVA_HOME not set
      3922a85 [Brennon York] removed necessary passed in variable
      f9fbe54 [Brennon York] reverted doc test change
      8135518 [Brennon York] removed the test check for documentation changes until jenkins can get updated
      05d435b [Brennon York] added check for jekyll install
      22edb78 [Brennon York] add check if jekyll isn't installed on the path
      2dff136 [Brennon York] fixed pep8 whitespace errors
      767a668 [Brennon York] fixed path joining issues, ensured docs actually build on doc changes
      c42cf9a [Brennon York] unpack set operations with splat (*)
      fb85a41 [Brennon York] fixed minor set bug
      0379833 [Brennon York] minor doc addition to print the changed modules
      aa03d9e [Brennon York] added documentation builds as a top level test component, altered high level project changes to properly execute core tests only when necessary, changed variable names for simplicity
      ec1ae78 [Brennon York] minor name changes, bug fixes
      b7c72b9 [Brennon York] reverting streaming context
      03fdd7b [Brennon York] fixed the tuple () wraps around example lambda
      705d12e [Brennon York] changed example to comply with pep3113 supporting python3
      60b3d51 [Brennon York] prepend rather than append onto PATH
      7d2f5e2 [Brennon York] updated python tests to remove unused variable
      2898717 [Brennon York] added a change to streaming test to check if it only runs streaming tests
      eb684b6 [Brennon York] fixed sbt_test_goals reference error
      db7ae6f [Brennon York] reverted SPARK_HOME from start of command
      1ecca26 [Brennon York] fixed merge conflicts
      2fcdfc0 [Brennon York] testing targte branch dump on jenkins
      1f607b1 [Brennon York] finalizing revisions to modular tests
      8afbe93 [Brennon York] made error codes a global
      0629de8 [Brennon York] updated to refactor and remove various small bugs, removed pep8 complaints
      d90ab2d [Brennon York] fixed merge conflicts, ensured that for regular builds both core and sql tests always run
      b1248dc [Brennon York] exec python rather than running python and exiting with return code
      f9deba1 [Brennon York] python to python2 and removed newline
      6d0a052 [Brennon York] incorporated merge conflicts with SPARK-7249
      f950010 [Brennon York] removed building hive-0.12.0 per SPARK-6908
      703f095 [Brennon York] fixed merge conflicts
      b1ca593 [Brennon York] reverted the sparkR test
      afeb093 [Brennon York] updated to make sparkR test fail
      1dada6b [Brennon York] reverted pyspark test failure
      9a592ec [Brennon York] reverted mima exclude issue, added pyspark test failure
      d825aa4 [Brennon York] revert build break, add mima break
      f041d8a [Brennon York] added space from commented import to now test build breaking
      983f2a2 [Brennon York] comment out import to fail build test
      2386785 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-7017
      76335fb [Brennon York] reverted rat license issue for sparkconf
      e4a96cc [Brennon York] removed the import error and added license error, fixed the way run-tests and run-tests.py report their error codes
      56d3cb9 [Brennon York] changed test back and commented out import to break compile
      b37328c [Brennon York] fixed typo and added default return is no error block was found in the environment
      7613558 [Brennon York] updated to return the proper env variable for return codes
      a5bd445 [Brennon York] reverted license, changed test in shuffle to fail
      803143a [Brennon York] removed license file for SparkContext
      b0b2604 [Brennon York] comment out import to see if build fails and returns properly
      83e80ef [Brennon York] attempt at better python output when called from bash
      c095fa6 [Brennon York] removed another wait() call
      26e18e8 [Brennon York] removed unnecessary wait()
      07210a9 [Brennon York] minor doc string change for java version with namedtuple update
      ec03bf3 [Brennon York] added namedtuple for java version to add readability
      2cb413b [Brennon York] upcased global variables, changes various calling methods from check_output to check_call
      639f1e9 [Brennon York] updated with pep8 rules, fixed minor bugs, added run-tests file in bash to call the run-tests.py script
      3c53a1a [Brennon York] uncomment the scala tests :)
      6126c4f [Brennon York] refactored run-tests into python
      50a0496a
  7. Jun 12, 2015
    • [SPARK-8314][MLlib] improvement in performance of MLUtils.appendBias · 6e9c3ff1
      Roger Menezes authored
      The MLUtils.appendBias method is heavily used to add intercepts to linear models.
      It relies on Breeze's vector concatenation, which is very slow compared to a plain
      System.arraycopy. This change switches the implementation to System.arraycopy.
      
      I saw the following performance improvements after the change:
      Benchmark with mnist dataset for 50 times:
      MLUtils.appendBias (SparseVector Before): 47320 ms
      MLUtils.appendBias (SparseVector After): 1935 ms
      MLUtils.appendBias (DenseVector Before): 5340 ms
      MLUtils.appendBias (DenseVector After): 4080 ms
      This is almost a 24 times performance boost for SparseVectors.
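      The speedup comes from replacing generic vector concatenation with a single bulk copy plus one write. A plain-Python sketch of the idea (a stand-in for the Scala System.arraycopy implementation, not Spark code):

      ```python
      def append_bias(values):
          """Append a bias term of 1.0 to a dense vector.

          Mirrors the System.arraycopy approach: allocate the result once,
          bulk-copy the input, then set the final slot to the bias.
          """
          out = [0.0] * (len(values) + 1)
          out[:len(values)] = values  # the single bulk copy
          out[-1] = 1.0               # the appended bias term
          return out

      def append_bias_sparse(indices, values, size):
          """Sparse variant: copy the index/value arrays and append one
          (size, 1.0) entry; the vector grows by a single dimension."""
          return indices + [size], values + [1.0], size + 1
      ```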
      
      Author: Roger Menezes <rmenezes@netflix.com>
      
      Closes #6768 from rogermenezes/improve-append-bias and squashes the following commits:
      
      4e42f75 [Roger Menezes] address feedback
      e999d79 [Roger Menezes] first commit
      6e9c3ff1
  8. Jun 04, 2015
    • [SPARK-7743] [SQL] Parquet 1.7 · cd3176bd
      Thomas Omans authored
      Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).
      
      Trivial changes to versions and package names, plus a small fix in `ParquetTableOperations.scala`:
      
      ```diff
      -    val readContext = getReadSupport(configuration).init(
      +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
      ```
      
      This is needed because `ParquetInputFormat.getReadSupport` was made package-private in the latest release.
      
      Thanks
      -- Thomas Omans
      
      Author: Thomas Omans <tomans@cj.com>
      
      Closes #6597 from eggsby/SPARK-7743 and squashes the following commits:
      
      2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
      cd3176bd
  9. Jun 03, 2015
    • [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 · 2c4d550e
      Patrick Wendell authored
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
      
      2f42d02 [Patrick Wendell] A few more excludes
      4bebcf0 [Patrick Wendell] Update to RC4
      61aaf46 [Patrick Wendell] Using new release candidate
      55f1610 [Patrick Wendell] Another exclude
      04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
      36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
      2c4d550e
  10. Jun 02, 2015
    • [SPARK-7547] [ML] Scala Example code for ElasticNet · a86b3e9b
      DB Tsai authored
      This is Scala example code for both linear and logistic regression. Python and Java versions are to be added.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6576 from dbtsai/elasticNetExample and squashes the following commits:
      
      e7ca406 [DB Tsai] fix test
      6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter
      136e0dd [DB Tsai] address feedback
      1ec29d4 [DB Tsai] fix style
      9462f5f [DB Tsai] add example
      a86b3e9b
    • [SPARK-7387] [ML] [DOC] CrossValidator example code in Python · c3f4c325
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits:
      
      63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly
      aeb6bb6 [Ram Sriharsha] Python Style Fix
      54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      615e91c [Ram Sriharsha] cleanup
      204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python
      c3f4c325
  11. May 31, 2015
    • [SPARK-7979] Enforce structural type checker. · 4b5f12ba
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6536 from rxin/structural-type-checker and squashes the following commits:
      
      f833151 [Reynold Xin] Fixed compilation.
      633f9a1 [Reynold Xin] Fixed typo.
      d1fa804 [Reynold Xin] [SPARK-7979] Enforce structural type checker.
      4b5f12ba
    • [SPARK-3850] Trim trailing spaces for examples/streaming/yarn. · 564bc11e
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6530 from rxin/trim-whitespace-1 and squashes the following commits:
      
      7b7b3a0 [Reynold Xin] Reset again.
      dc14597 [Reynold Xin] Reset scalastyle.
      cd556c4 [Reynold Xin] YARN, Kinesis, Flume.
      4223fe1 [Reynold Xin] [SPARK-3850] Trim trailing spaces for examples/streaming.
      564bc11e
  12. May 29, 2015
    • [SPARK-6013] [ML] Add more Python ML examples for spark.ml · dbf8ff38
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6443 from harsha2010/SPARK-6013 and squashes the following commits:
      
      732506e [Ram Sriharsha] Code Review Feedback
      121c211 [Ram Sriharsha] python style fix
      5f9b8c3 [Ram Sriharsha] python style fixes
      925ca86 [Ram Sriharsha] Simple Params Example
      8b372b1 [Ram Sriharsha] GBT Example
      965ec14 [Ram Sriharsha] Random Forest Example
      dbf8ff38
  13. May 28, 2015
    • [SPARK-7929] Remove Bagel examples & whitespace fix for examples. · 2881d14c
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6480 from rxin/whitespace-example and squashes the following commits:
      
      8a4a3d4 [Reynold Xin] [SPARK-7929] Remove Bagel examples & whitespace fix for examples.
      2881d14c
    • [MINOR] Fix a minor bug in PageRank Example. · c771589c
      Li Yao authored
      Fix the bug where passing only one argument causes an array-out-of-bounds exception in the PageRank example.
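      The fix amounts to validating the argument count before indexing into it. A hedged Python sketch of the same guard (the actual example is Scala, and the names here are illustrative):

      ```python
      def parse_pagerank_args(argv):
          """Validate the PageRank example's arguments before indexing.

          Illustrative sketch: before the fix, argv[1] was read
          unconditionally, so passing only the input file raised an
          index-out-of-bounds error.
          """
          if len(argv) < 2:
              raise SystemExit("Usage: pagerank <links-file> <iterations>")
          return argv[0], int(argv[1])
      ```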
      
      Author: Li Yao <hnkfliyao@gmail.com>
      
      Closes #6455 from lastland/patch-1 and squashes the following commits:
      
      de06128 [Li Yao] Fix the bug that entering only 1 arg will cause array out of bounds exception.
      c771589c
    • [SPARK-7895] [STREAMING] [EXAMPLES] Move Kafka examples from scala-2.10/src to src · 000df2f0
      zsxwing authored
      Since `spark-streaming-kafka` now is published for both Scala 2.10 and 2.11, we can move `KafkaWordCount` and `DirectKafkaWordCount` from `examples/scala-2.10/src/` to `examples/src/` so that they will appear in `spark-examples-***-jar` for Scala 2.11.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6436 from zsxwing/SPARK-7895 and squashes the following commits:
      
      c6052f1 [zsxwing] Update examples/pom.xml
      0bcfa87 [zsxwing] Fix the sleep time
      b9d1256 [zsxwing] Move Kafka examples from scala-2.10/src to src
      000df2f0
  14. May 25, 2015
    • Close HBaseAdmin at the end of HBaseTest · 23bea97d
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6381 from ted-yu/master and squashes the following commits:
      
      e2f0ea1 [tedyu] Close HBaseAdmin at the end of HBaseTest
      23bea97d
  15. May 23, 2015
    • [SPARK-5090] [EXAMPLES] The improvement of python converter for hbase · 4583cf4b
      GenTang authored
      Hi,
      
      Following the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-td10001.html, I made some modifications to three files in the examples package:
      1. HBaseConverters.scala: the new converter converts all the records in an HBase result into a single string.
      2. hbase_input.py: since the value string may contain several records, we can use the ast package to convert the string into a dict.
      3. HBaseTest.scala: since the examples package uses HBase 0.98.7, the original HTableDescriptor constructor is deprecated; it has been updated to the new constructor.
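      Point 2 can be sketched in plain Python: `ast.literal_eval` safely parses a string of Python literals, so a value string holding several records can be recovered as a dict. The record layout below is made up for illustration:

      ```python
      import ast

      def parse_hbase_value(value_string):
          """Recover structured data from the single string the converter
          emits for an HBase result.

          ast.literal_eval evaluates only Python literals (strings, numbers,
          dicts, lists, ...), so unlike eval it cannot run arbitrary code.
          """
          return ast.literal_eval(value_string)

      # Hypothetical value string for one row with two cells.
      value = "{'f1:count': '3', 'f1:name': 'spark'}"
      record = parse_hbase_value(value)
      ```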
      
      Author: GenTang <gen.tang86@gmail.com>
      
      Closes #3920 from GenTang/master and squashes the following commits:
      
      d2153df [GenTang] import JSONObject precisely
      4802481 [GenTang] dump the result into a singl String
      62df7f0 [GenTang] remove the comment
      21de653 [GenTang] return the string in json format
      15b1fe3 [GenTang] the modification of comments
      5cbbcfc [GenTang] the improvement of pythonconverter
      ceb31c5 [GenTang] the modification for adapting updation of hbase
      3253b61 [GenTang] the modification accompanying the improvement of pythonconverter
      4583cf4b
  16. May 17, 2015
    • [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface. · 517eb37a
      Reynold Xin authored
      Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:
      
      7465c2c [Reynold Xin] Fixed unit test.
      118e609 [Reynold Xin] Updated tests.
      3441b57 [Reynold Xin] Updated javadoc.
      13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
      517eb37a
  17. May 16, 2015
    • [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API. · 161d0b4a
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6211 from rxin/mllib-reader and squashes the following commits:
      
      79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
      161d0b4a
    • [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API · 578bfeef
      Reynold Xin authored
      This patch introduces DataFrameWriter and DataFrameReader.
      
      DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
      ```scala
      sqlContext.read.json("...")
      sqlContext.read.parquet("...")
      ```
      
      DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid a proliferation of options when writing a DataFrame out. It currently implements:
      - mode
      - format (e.g. "parquet", "json")
      - options (generic options passed down into data sources)
      - partitionBy (partitioning columns)
      Example usage:
      ```scala
      df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
      ```
      
      TODO:
      
      - [ ] Documentation update
      - [ ] Move JDBC into reader / writer?
      - [ ] Deprecate the old interfaces
      - [ ] Move the generic load interface into reader.
      - [ ] Update example code and documentation
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6175 from rxin/reader-writer and squashes the following commits:
      
      b146c95 [Reynold Xin] Deprecation of old APIs.
      bd8abdf [Reynold Xin] Fixed merge conflict.
      26abea2 [Reynold Xin] Added general load methods.
      244fbec [Reynold Xin] Added equivalent to example.
      4f15d92 [Reynold Xin] Added documentation for partitionBy.
      7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
      578bfeef
  18. May 15, 2015
    • [SPARK-7575] [ML] [DOC] Example code for OneVsRest · cc12a86f
      Ram Sriharsha authored
      Java and Scala examples for OneVsRest. Fixes the base classifier to be Logistic Regression and accepts the configuration parameters of the base classifier.
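      The one-vs-rest reduction itself is simple to sketch in Python: train one binary model per class, treating that class as positive and everything else as negative. The callable below is a made-up stand-in for the configured base classifier:

      ```python
      def one_vs_rest_fit(labels, fit_binary):
          """Train one binary model per distinct class label.

          fit_binary is any callable that takes a list of 0/1 labels and
          returns a fitted model; here it stands in for the configurable
          base classifier (logistic regression in the example).
          """
          classes = sorted(set(labels))
          return {k: fit_binary([1 if y == k else 0 for y in labels])
                  for k in classes}
      ```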
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6115 from harsha2010/SPARK-7575 and squashes the following commits:
      
      87ad3c7 [Ram Sriharsha] extra line
      f5d9891 [Ram Sriharsha] Merge branch 'master' into SPARK-7575
      7076084 [Ram Sriharsha] cleanup
      dfd660c [Ram Sriharsha] cleanup
      8703e4f [Ram Sriharsha] update doc
      cb23995 [Ram Sriharsha] fix commandline options for JavaOneVsRestExample
      69e91f8 [Ram Sriharsha] cleanup
      7f4e127 [Ram Sriharsha] cleanup
      d4c40d0 [Ram Sriharsha] Code Review fixes
      461eb38 [Ram Sriharsha] cleanup
      e0106d9 [Ram Sriharsha] Fix typo
      935cf56 [Ram Sriharsha] Try to match Java and Scala Example Commandline options
      5323ff9 [Ram Sriharsha] cleanup
      196a59a [Ram Sriharsha] cleanup
      6adfa0c [Ram Sriharsha] Style Fix
      8cfc5d5 [Ram Sriharsha] [SPARK-7575] Example code for OneVsRest
      cc12a86f
  19. May 14, 2015
    • [SPARK-7568] [ML] ml.LogisticRegression doesn't output the right prediction · c1080b6f
      DB Tsai authored
      The difference arises because we previously did not fit the intercept in Spark 1.3. Here, we change the input `String` so that the probability of instance 6 can be classified as `1.0` without any ambiguity.
      
      with lambda = 0.001 in current LOR implementation, the prediction is
      ```
      (4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
      (5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
      (6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373], prediction=1.0
      (7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
      ```
      and the training accuracy is
      ```
      (0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594], prediction=1.0
      (1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
      (2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289], prediction=1.0
      (3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518], prediction=0.0
      ```
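      For a binary probability vector like those above, the predicted label with the default 0.5 threshold is simply the index of the larger entry. A minimal sketch (not the Spark implementation):

      ```python
      def predict_from_prob(prob):
          """Return the predicted label for a probability vector: the
          index of the largest entry, as a float label (0.0 or 1.0 when
          binary). With two classes and a 0.5 threshold this is
          equivalent to checking whether prob[1] > 0.5."""
          return float(max(range(len(prob)), key=lambda i: prob[i]))
      ```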
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6109 from dbtsai/lor-example and squashes the following commits:
      
      ac63ce4 [DB Tsai] first commit
      c1080b6f
    • [SPARK-7407] [MLLIB] use uid + name to identify parameters · 1b8625f4
      Xiangrui Meng authored
      A param instance is strongly attached to a parent in the current implementation, so if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name, so it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the size of the diff, and moved `parent` to a mutable field.
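      The identification scheme can be sketched in a few lines of Python (hypothetical class; the real code is Scala): a param's identity is the pair (parent UID, name), so a copy of a component that preserves the UID keeps its params interchangeable with the original's.

      ```python
      class Param:
          """A parameter identified by its parent's UID plus its own name,
          rather than by the parent object itself (sketch, not Spark code)."""

          def __init__(self, parent_uid, name):
              self.parent_uid = parent_uid
              self.name = name

          def __eq__(self, other):
              return (isinstance(other, Param) and
                      (self.parent_uid, self.name) == (other.parent_uid, other.name))

          def __hash__(self):
              return hash((self.parent_uid, self.name))
      ```

      Two Param objects belonging to an estimator and its copy compare equal as long as the copy keeps the same UID, which is what makes copying params across instances safe.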
      
      This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
      
      c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      520f0a2 [Xiangrui Meng] address comments
      2569168 [Xiangrui Meng] fix tests
      873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
      409ea08 [Xiangrui Meng] minor updates
      83a163c [Xiangrui Meng] update JavaDeveloperApiExample
      5db5325 [Xiangrui Meng] update OneVsRest
      7bde7ae [Xiangrui Meng] merge master
      697fdf9 [Xiangrui Meng] update Bucketizer
      7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      629d402 [Xiangrui Meng] fix LRSuite
      154516f [Xiangrui Meng] merge master
      aa4a611 [Xiangrui Meng] fix examples/compile
      a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
      fdbc415 [Xiangrui Meng] all tests passed
      c255f17 [Xiangrui Meng] fix tests in ParamsSuite
      818e1db [Xiangrui Meng] merge master
      e1160cf [Xiangrui Meng] fix tests
      fbc39f0 [Xiangrui Meng] pass test:compile
      108937e [Xiangrui Meng] pass compile
      8726d39 [Xiangrui Meng] use parent uid in Param
      eaeed35 [Xiangrui Meng] update Identifiable
      1b8625f4
  20. May 11, 2015
    • [SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option · 4f8a1551
      Bryan Cutler authored
      As is, to specify this option on the command line, you have to escape the angle brackets.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
      
      b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
      4f8a1551
  21. May 09, 2015
  22. May 06, 2015
    • [SPARK-7396] [STREAMING] [EXAMPLE] Update KafkaWordCountProducer to use new Producer API · 316a5c04
      jerryshao authored
      Otherwise it will throw an exception:
      
      ```
      Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
      	at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
      	at kafka.producer.Producer.send(Producer.scala:77)
      	at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
      	at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      ```
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5936 from jerryshao/SPARK-7396 and squashes the following commits:
      
      270bbe2 [jerryshao] Fix Kafka Produce throw Exception issue
      316a5c04
    • Shivaram Venkataraman's avatar
      [SPARK-6799] [SPARKR] Remove SparkR RDD examples, add dataframe examples · 4e930420
      Shivaram Venkataraman authored
      This PR also makes some of the DataFrame to RDD methods private as the RDD class is private in 1.4
      
      cc rxin pwendell
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5949 from shivaram/sparkr-examples and squashes the following commits:
      
      6c42fdc [Shivaram Venkataraman] Remove SparkR RDD examples, add dataframe examples
      4e930420
  23. May 05, 2015
    • Hrishikesh Subramonian's avatar
      [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity · 5995ada9
      Hrishikesh Subramonian authored
      The following items are added to Python kmeans:
      
      kmeans - setEpsilon, setInitializationSteps
      KMeansModel - computeCost, k
      
      Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
      
      Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
      
      b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
      5fd3ced [Hrishikesh Subramonian] doc test corrections
      20b3c68 [Hrishikesh Subramonian] python 3 fixes
      4d4e695 [Hrishikesh Subramonian] added arguments in python tests
      21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
      5995ada9
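      The `computeCost` method added above returns the within-set sum of squared distances from each point to its nearest cluster center. A stdlib-only sketch of that computation (illustrative only, not MLlib's implementation):

```python
def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0)]
centers = [(0.0, 0.0), (9.0, 9.0)]
cost = compute_cost(points, centers)  # 0 + 2 + 1 = 3.0
```

      In MLlib the same quantity is computed distributedly over an RDD of vectors; the sketch keeps only the arithmetic.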
    • Jihong MA's avatar
      [SPARK-7357] Improving HBaseTest example · 51f46200
      Jihong MA authored
      Author: Jihong MA <linlin200605@gmail.com>
      
      Closes #5904 from JihongMA/SPARK-7357 and squashes the following commits:
      
      7d6153a [Jihong MA] SPARK-7357 Improving HBaseTest example
      51f46200
    • Niccolo Becchi's avatar
      [MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and... · da738cff
      Niccolo Becchi authored
      [MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and kmeans.py to simplify readability
      
      With the previous names it could look as though the reduceByKey sums the abscissas and ordinates of some 2D points separately. This way the example should be easier to understand, especially for those who, like me, are just starting with functional programming.
      
      Author: Niccolo Becchi <niccolo.becchi@gmail.com>
      Author: pippobaudos <niccolo.becchi@gmail.com>
      
      Closes #5875 from pippobaudos/patch-1 and squashes the following commits:
      
      3bb3a47 [pippobaudos] renamed variables in LocalKMeans.scala and kmeans.py to simplify readability
      2c2a7a2 [Niccolo Becchi] Update SparkKMeans.scala
      da738cff
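      The confusion the rename addresses sits in the reduce step of the k-means examples: each value is a whole point vector plus a count, and reduceByKey adds vectors, not individual coordinates. A hypothetical stdlib sketch of that step (variable names are illustrative, not the examples' actual code):

```python
from collections import defaultdict

# (closest_center_index, (point, 1)) pairs, as produced by the map step
pairs = [(0, ((1.0, 2.0), 1)), (0, ((3.0, 4.0), 1)), (1, ((5.0, 5.0), 1))]

# The reduce step sums whole point vectors (and counts) per center,
# not abscissas and ordinates independently.
sums = defaultdict(lambda: ((0.0, 0.0), 0))
for center, (point, count) in pairs:
    (sx, sy), n = sums[center]
    sums[center] = ((sx + point[0], sy + point[1]), n + count)

new_centers = {c: (sx / n, sy / n) for c, ((sx, sy), n) in sums.items()}
# new_centers[0] == (2.0, 3.0)
```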
  24. May 04, 2015
    • Xiangrui Meng's avatar
      [SPARK-5956] [MLLIB] Pipeline components should be copyable. · e0833c59
      Xiangrui Meng authored
      This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
      
      ~~~scala
      def fit(dataset: DataFrame, extra: ParamMap): Model = {
        copy(extra).fit(dataset)
      }
      ~~~
      
      Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
      
      ~~~scala
      val effectiveRegParam = $(regParam) / yStd
      val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
      val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
      ~~~
      
      Meta-algorithms like `Pipeline` implement their own `copy(extra)`. So the fitted pipeline model stores all copied stages (no matter whether each is a transformer or a model).
      
      Other changes:
      * `Params$.inheritValues` is moved to `Params#copyValues` and returns the target instance.
      * `fittingParamMap` was removed because the `parent` carries this information.
      * `validate` was renamed to `validateParams` to be more precise.
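      The copy-with-extra-params mechanism described above can be sketched outside Spark; this hypothetical Python analogue only illustrates the pattern, not the actual `Params` implementation:

```python
import copy as _copy

class Params:
    def __init__(self, **defaults):
        self._params = dict(defaults)

    def copy(self, extra=None):
        """Return a copy of this instance with `extra` values overriding its params."""
        c = _copy.deepcopy(self)
        c._params.update(extra or {})
        return c

    def __getitem__(self, name):  # plays the role of the `$(...)` alias
        return self._params[name]

class LinearRegression(Params):
    def fit(self, dataset, extra=None):
        # Default implementation: copy with the extra values, then fit
        # using only the embedded param values.
        if extra:
            return self.copy(extra).fit(dataset)
        return ("model", self["regParam"])

lr = LinearRegression(regParam=0.1)
model = lr.fit([], extra={"regParam": 0.5})  # fitted with regParam = 0.5
```

      The original `lr` keeps `regParam = 0.1`; only the copy sees the override, which is what lets `fit` and `transform` read embedded values without threading a param map through.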
      
      TODOs:
      * [x] add tests for newly added methods
      * [ ] update documentation
      
      jkbradley dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
      
      7bef88d [Xiangrui Meng] address comments
      05229c3 [Xiangrui Meng] assert -> assertEquals
      b2927b1 [Xiangrui Meng] organize imports
      f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
      93e7924 [Xiangrui Meng] add tests for hasParam & copy
      463ecae [Xiangrui Meng] merge master
      2b954c3 [Xiangrui Meng] update Binarizer
      465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
      282a1a8 [Xiangrui Meng] fix test
      819dd2d [Xiangrui Meng] merge master
      b642872 [Xiangrui Meng] example code runs
      5a67779 [Xiangrui Meng] examples compile
      c76b4d1 [Xiangrui Meng] fix all unit tests
      0f4fd64 [Xiangrui Meng] fix some tests
      9286a22 [Xiangrui Meng] copyValues to trained models
      53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
      9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
      d882afc [Xiangrui Meng] test compile
      f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
      e0833c59
  25. Apr 29, 2015
    • Joseph K. Bradley's avatar
      [SPARK-7176] [ML] Add validation functionality to Param · 114bad60
      Joseph K. Bradley authored
      Main change: Added isValid field to Param.  Modified all usages to use isValid when relevant.  Added helper methods in ParamValidate.
      
      Also overrode Params.validate() in:
      * CrossValidator + model
      * Pipeline + model
      
      I made a few updates for the elastic net patch:
      * I changed "tol" to "convergenceTol"
      * I added some documentation
      
      This PR is Scala + Java only.  Python will be in a follow-up PR.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5740 from jkbradley/enforce-validate and squashes the following commits:
      
      ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
      76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
      af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg.  Fixed test failures.  Renamed ParamValidate to ParamValidators.  Removed explicit type from ParamValidators calls where possible.
      bb2665a [Joseph K. Bradley] merged with elastic net pr
      ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
      6895dfc [Joseph K. Bradley] small cleanups
      069ac6d [Joseph K. Bradley] many cleanups
      928fb84 [Joseph K. Bradley] Maybe done
      a910ac7 [Joseph K. Bradley] still workin
      6d60e2e [Joseph K. Bradley] Still workin
      b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
      dbc9fb2 [Joseph K. Bradley] merged with master.  enforcing Params.validate
      114bad60
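      The `isValid` field described above attaches a validation predicate to each param. A hedged stdlib sketch of the idea, with simplified ParamValidators-style helpers (not the spark.ml API):

```python
class Param:
    def __init__(self, name, is_valid=lambda v: True):
        self.name = name
        self.is_valid = is_valid  # predicate checked before a value is accepted

    def validate(self, value):
        if not self.is_valid(value):
            raise ValueError(f"{self.name} has invalid value {value!r}")
        return value

# Helpers in the spirit of ParamValidators
def gt(lower):
    return lambda v: v > lower

def in_range(lo, hi):
    return lambda v: lo <= v <= hi

convergence_tol = Param("tol", is_valid=gt(0.0))
convergence_tol.validate(1e-6)   # accepted
elastic_net = Param("elasticNetParam", is_valid=in_range(0.0, 1.0))
# elastic_net.validate(1.5) would raise ValueError
```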
  26. Apr 28, 2015
    • Ilya Ganelin's avatar
      [SPARK-5932] [CORE] Use consistent naming for size properties · 2d222fb3
      Ilya Ganelin authored
      I've added an interface to JavaUtils to do byte conversion and added hooks within Utils.scala to handle conversion within Spark code (like for time strings). I've added matching tests for size conversion, and then updated all deprecated configs and documentation as per SPARK-5933.
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5574 from ilganeli/SPARK-5932 and squashes the following commits:
      
      11f6999 [Ilya Ganelin] Nit fixes
      49a8720 [Ilya Ganelin] Whitespace fix
      2ab886b [Ilya Ganelin] Scala style
      fc85733 [Ilya Ganelin] Got rid of floating point math
      852a407 [Ilya Ganelin] [SPARK-5932] Added much improved overflow handling. Can now handle sizes up to Long.MAX_VALUE Petabytes instead of being capped at Long.MAX_VALUE Bytes
      9ee779c [Ilya Ganelin] Simplified fraction matches
      22413b1 [Ilya Ganelin] Made MAX private
      3dfae96 [Ilya Ganelin] Fixed some nits. Added automatic conversion of old paramter for kryoserializer.mb to new values.
      e428049 [Ilya Ganelin] resolving merge conflict
      8b43748 [Ilya Ganelin] Fixed error in pattern matching for doubles
      84a2581 [Ilya Ganelin] Added smoother handling of fractional values for size parameters. This now throws an exception and added a warning for old spark.kryoserializer.buffer
      d3d09b6 [Ilya Ganelin] [SPARK-5932] Fixing error in KryoSerializer
      fe286b4 [Ilya Ganelin] Resolved merge conflict
      c7803cd [Ilya Ganelin] Empty lines
      54b78b4 [Ilya Ganelin] Simplified byteUnit class
      69e2f20 [Ilya Ganelin] Updates to code
      f32bc01 [Ilya Ganelin] [SPARK-5932] Fixed error in API in SparkConf.scala where Kb conversion wasn't being done properly (was Mb). Added test cases for both timeUnit and ByteUnit conversion
      f15f209 [Ilya Ganelin] Fixed conversion of kryo buffer size
      0f4443e [Ilya Ganelin]     Merge remote-tracking branch 'upstream/master' into SPARK-5932
      35a7fa7 [Ilya Ganelin] Minor formatting
      928469e [Ilya Ganelin] [SPARK-5932] Converted some longs to ints
      5d29f90 [Ilya Ganelin] [SPARK-5932] Finished documentation updates
      7a6c847 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer
      afc9a38 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize and spark.storage.memoryMapThreshold
      ae7e9f6 [Ilya Ganelin] [SPARK-5932] Updated spark.io.compression.snappy.block.size
      2d15681 [Ilya Ganelin] [SPARK-5932] Updated spark.executor.logs.rolling.size.maxBytes
      1fbd435 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize
      eba4de6 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer.kb
      b809a78 [Ilya Ganelin] [SPARK-5932] Updated spark.kryoserializer.buffer.max
      0cdff35 [Ilya Ganelin] [SPARK-5932] Updated to use bibibytes in method names. Updated spark.kryoserializer.buffer.mb and spark.reducer.maxMbInFlight
      475370a [Ilya Ganelin] [SPARK-5932] Simplified ByteUnit code, switched to using longs. Updated docs to clarify that we use kibi, mebi etc instead of kilo, mega
      851d691 [Ilya Ganelin] [SPARK-5932] Updated memoryStringToMb to use new interfaces
      a9f4fcf [Ilya Ganelin] [SPARK-5932] Added unit tests for unit conversion
      747393a [Ilya Ganelin] [SPARK-5932] Added unit tests for ByteString conversion
      09ea450 [Ilya Ganelin] [SPARK-5932] Added byte string conversion to Jav utils
      5390fd9 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5932
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
      2d222fb3
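      The consistent size-property naming uses binary multipliers (kibi, mebi, ...), so "1k" means 1024 bytes, not 1000. A hypothetical stdlib sketch of such a byte-string parser (not Spark's JavaUtils implementation):

```python
import re

_UNITS = {"b": 1, "k": 1 << 10, "m": 1 << 20,
          "g": 1 << 30, "t": 1 << 40, "p": 1 << 50}

def byte_string_to_bytes(s):
    """Parse strings like '32k' or '64m' using binary (kibi/mebi) multipliers."""
    m = re.fullmatch(r"(\d+)\s*([bkmgtp]?)", s.strip().lower())
    if not m:
        raise ValueError(f"invalid size string: {s!r}")
    value, unit = int(m.group(1)), m.group(2) or "b"
    return value * _UNITS[unit]

byte_string_to_bytes("32k")  # 32768
byte_string_to_bytes("1m")   # 1048576
```

      Working in longs with explicit multipliers, as the PR's final commits do, avoids the floating-point rounding issues mentioned in the commit list.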
    • jerryshao's avatar
      [SPARK-5946] [STREAMING] Add Python API for direct Kafka stream · 9e4e82b7
      jerryshao authored
      Currently only the `createDirectStream` API is added; I'm not sure if `createRDD` is also needed, since some Java objects need to be wrapped in Python. Please help to review, thanks a lot.
      
      Author: jerryshao <saisai.shao@intel.com>
      Author: Saisai Shao <saisai.shao@intel.com>
      
      Closes #4723 from jerryshao/direct-kafka-python-api and squashes the following commits:
      
      a1fe97c [jerryshao] Fix rebase issue
      eebf333 [jerryshao] Address the comments
      da40f4e [jerryshao] Fix Python 2.6 Syntax error issue
      5c0ee85 [jerryshao] Style fix
      4aeac18 [jerryshao] Fix bug in example code
      7146d86 [jerryshao] Add unit test
      bf3bdd6 [jerryshao] Add more APIs and address the comments
      f5b3801 [jerryshao] Small style fix
      8641835 [Saisai Shao] Rebase and update the code
      589c05b [Saisai Shao] Fix the style
      d6fcb6a [Saisai Shao] Address the comments
      dfda902 [Saisai Shao] Style fix
      0f7d168 [Saisai Shao] Add the doc and fix some style issues
      67e6880 [Saisai Shao] Fix test bug
      917b0db [Saisai Shao] Add Python createRDD API for Kakfa direct stream
      c3fc11d [jerryshao] Modify the docs
      2c00936 [Saisai Shao] address the comments
      3360f44 [jerryshao] Fix code style
      e0e0f0d [jerryshao] Code clean and bug fix
      338c41f [Saisai Shao] Add python API and example for direct kafka stream
      9e4e82b7
  27. Apr 27, 2015
    • Yuhao Yang's avatar
      [SPARK-7090] [MLLIB] Introduce LDAOptimizer to LDA to further improve extensibility · 4d9e560b
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7090
      
      LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
      As Joseph Bradley jkbradley proposed in https://github.com/apache/spark/pull/4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
      Basically, class LDA would be a common entrance for LDA computing, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize the LDAOptimizer with specific parameters and assign it to LDA.
      
      Concrete changes:
      
      1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
      
      2. Move EMOptimizer to the file LDAOptimizer, have it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
              - Adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
              - Move the code from LDA.initialState to the initialState of EMLDAOptimizer.
      
      3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.
      
      4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
      
      Further work:
      add OnlineLDAOptimizer and other possible Optimizers once ready.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:
      
      0e2e006 [Yuhao Yang] respond to review comments
      08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      e756ce4 [Yuhao Yang] solve mima exception
      d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      0bb8400 [Yuhao Yang] refactor LDA with Optimizer
      ec2f857 [Yuhao Yang] protoptype for discussion
      4d9e560b
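      The refactoring above is a strategy pattern: LDA delegates the algorithm to a pluggable optimizer, with the EM implementation as the default. A minimal Python sketch of the structure (class and method names follow the PR description; the bodies are placeholders, not the MLlib code):

```python
class LDAOptimizer:
    """Common interface for concrete LDA algorithm implementations."""
    def initial_state(self, docs, k):
        raise NotImplementedError
    def next(self, state):
        raise NotImplementedError

class EMLDAOptimizer(LDAOptimizer):
    def initial_state(self, docs, k):
        return {"docs": docs, "k": k, "iteration": 0}
    def next(self, state):
        state["iteration"] += 1  # placeholder for one EM step
        return state

class LDA:
    def __init__(self, k=10, optimizer=None):
        self.k = k
        # EMLDAOptimizer is the default optimizer, as in the PR.
        self.lda_optimizer = optimizer or EMLDAOptimizer()
    def run(self, docs, iterations=3):
        state = self.lda_optimizer.initial_state(docs, self.k)
        for _ in range(iterations):
            state = self.lda_optimizer.next(state)
        return state  # a concrete optimizer would build an LDAModel here

model = LDA(k=5).run(["doc1", "doc2"])  # model["iteration"] == 3
```

      With this shape, adding OnlineLDAOptimizer later is just another subclass assigned to the `lda_optimizer` property, without touching LDA itself.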