  1. Dec 01, 2015
  2. Nov 30, 2015
  3. Nov 24, 2015
  4. Nov 23, 2015
    • [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in... · 98d7ec7d
      Yanbo Liang authored
      [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc
      
      ML ```LinearRegression``` uses ```data/mllib/sample_libsvm_data.txt``` as the dataset in examples and the user guide doc, but that is actually a classification dataset rather than a regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead.
      The deeper cause is that ```LinearRegression``` with the "normal" solver cannot solve this dataset correctly, possibly due to ill-conditioning and unreasonable labels. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
      It will confuse users if they run the example code but get an exception, so we should make this change, which clearly illustrates the usage of the ```LinearRegression``` algorithm.
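      The "normal" solver mentioned above solves the least-squares normal equations directly. As a minimal illustration of that approach (plain Python on a tiny one-feature dataset, not Spark's ```LinearRegression```; the function name here is an assumption of this sketch):

      ```python
      # Illustrative sketch of the "normal equation" approach to linear
      # regression; not Spark code.

      def fit_normal_equation(xs, ys):
          """Fit y = w*x + b by solving the 2x2 normal equations directly."""
          n = len(xs)
          sx = sum(xs)
          sy = sum(ys)
          sxx = sum(x * x for x in xs)
          sxy = sum(x * y for x, y in zip(xs, ys))
          # Solve [[sxx, sx], [sx, n]] @ [w, b] = [sxy, sy] via Cramer's rule.
          det = sxx * n - sx * sx
          w = (sxy * n - sx * sy) / det
          b = (sxx * sy - sx * sxy) / det
          return w, b

      # Points lying exactly on y = 2x + 1 recover w = 2, b = 1.
      w, b = fit_normal_equation([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
      print(w, b)  # 2.0 1.0
      ```

      When the design matrix is ill-conditioned, the determinant above approaches zero and the direct solve becomes numerically unstable, which matches the failure mode the commit message describes.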
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9905 from yanboliang/spark-11920.
  5. Nov 22, 2015
    • [SPARK-11895][ML] rename and refactor DatasetExample under mllib/examples · fe89c181
      Xiangrui Meng authored
      We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code.
      
      cc: yinxusen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9873 from mengxr/SPARK-11895.
  6. Nov 20, 2015
  7. Nov 18, 2015
  8. Nov 17, 2015
  9. Nov 16, 2015
  10. Nov 14, 2015
  11. Nov 13, 2015
  12. Nov 12, 2015
  13. Nov 11, 2015
    • [SPARK-11290][STREAMING] Basic implementation of trackStateByKey · 99f5f988
      Tathagata Das authored
      The current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses the new data and the existing state of the key to generate an updated state. However, based on community feedback, we have learned the following lessons:
      * Need for more optimized state management that does not scan every key
      * Need to make it easier to implement common use cases: (a) timeout of idle data, (b) returning items other than state
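      The updateStateByKey semantics described above can be sketched in plain Python (an illustration of the model only, not Spark's API; the dict-based state and the function names are assumptions of this sketch):

      ```python
      # Sketch of updateStateByKey-style semantics: the update function
      # receives each key's new values plus its existing state and returns
      # the new state. Plain Python, not Spark.

      def update_state_by_key(state, new_data, update_func):
          """Apply update_func(new_values, old_state) for every key that has
          either new data or existing state -- note that every known key is
          visited, which is the scan cost the first lesson above refers to."""
          keys = set(state) | set(new_data)
          new_state = {}
          for key in keys:
              updated = update_func(new_data.get(key, []), state.get(key))
              if updated is not None:  # returning None drops the key's state
                  new_state[key] = updated
          return new_state

      # Example: a running count per key across two batches.
      def running_count(new_values, old_count):
          return (old_count or 0) + len(new_values)

      state = {}
      state = update_state_by_key(state, {"a": [1, 2], "b": [3]}, running_count)
      state = update_state_by_key(state, {"a": [4]}, running_count)
      print(sorted(state.items()))  # [('a', 3), ('b', 1)]
      ```

      In the second batch only "a" has new data, yet "b" is still visited and its state recomputed, illustrating the per-key scan that motivates the new API.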
      
      The high-level idea of this PR:
      * Introduce a new API, trackStateByKey, that allows the user to update per-key state and emit arbitrary records. The new API is necessary because it will have significantly different semantics from the existing updateStateByKey API. This API will have direct support for timeouts.
      * Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the updated data and, if necessary, the old data.
      Here is the detailed design doc. Please take a look and provide feedback as comments.
      https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
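      A minimal sketch of the two semantics the bullets above describe, in plain Python (the class, method, and function names here are assumptions of this sketch, not the actual trackStateByKey API):

      ```python
      import time

      class TrackState:
          """Toy per-key state store with an idle timeout, illustrating the
          trackStateByKey semantics: a state update can also *emit* records
          that differ from the state itself. Not Spark code."""

          def __init__(self, timeout_secs, now=time.monotonic):
              self.timeout_secs = timeout_secs
              self.now = now
              self.state = {}  # key -> (value, last_update_time)

          def track(self, batch, track_func):
              """Apply track_func(key, value, old_state) to each (key, value)
              pair; unlike updateStateByKey, only keys with new data are
              touched. Returns the list of emitted records."""
              emitted = []
              t = self.now()
              for key, value in batch:
                  old = self.state.get(key, (None, t))[0]
                  new_state, record = track_func(key, value, old)
                  self.state[key] = (new_state, t)
                  emitted.append(record)
              # Idle timeout: drop keys not updated recently.
              self.state = {k: (v, ts) for k, (v, ts) in self.state.items()
                            if t - ts <= self.timeout_secs}
              return emitted

      # Example: keep a per-key sum as state, but emit (key, running_sum)
      # records -- i.e., the emitted items are not the state itself.
      def track_sum(key, value, old_sum):
          s = (old_sum or 0) + value
          return s, (key, s)

      ts = TrackState(timeout_secs=60)
      print(ts.track([("a", 1), ("a", 2), ("b", 5)], track_sum))
      # [('a', 1), ('a', 3), ('b', 5)]
      ```

      The timeout pruning at the end of track is a simplification; the design doc linked above covers how the real implementation organizes state inside RDD partitions.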
      
      This is still WIP. Major things left to be done.
      - [x] Implement basic functionality of state tracking, with initial RDD and timeouts
      - [x] Unit tests for state tracking
      - [x] Unit tests for initial RDD and timeout
      - [ ] Unit tests for TrackStateRDD
             - [x] state creating, updating, removing
             - [ ] emitting
             - [ ] checkpointing
      - [x] Misc unit tests for State, TrackStateSpec, etc.
      - [x] Update docs and experimental tags
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9256 from tdas/trackStateByKey.
  14. Nov 10, 2015
  15. Nov 09, 2015
  16. Nov 04, 2015
  17. Nov 02, 2015
    • [SPARK-11383][DOCS] Replaced example code in... · 2804674a
      Rishabh Bhardwaj authored
      [SPARK-11383][DOCS] Replaced example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example
      
      I have made the required changes in mllib-naive-bayes.md/mllib-isotonic-regression.md and also verified them.
      Kindly review it.
      
      Author: Rishabh Bhardwaj <rbnext29@gmail.com>
      
      Closes #9353 from rishabhbhardwaj/SPARK-11383.
  18. Oct 26, 2015
  19. Sep 23, 2015
  20. Sep 21, 2015
  21. Sep 15, 2015
  22. Sep 12, 2015
  23. Aug 28, 2015
  24. Aug 25, 2015
  25. Aug 19, 2015
    • [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs · 1f29d502
      zsxwing authored
      This PR includes the following fixes:
      1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
      2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3.
      3. Fix the commands in the docs so that the user can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copied to the command line, the path would be split into two parts by the extra spaces, forcing the user to fix it manually.
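      Fix 2 above can be sketched as follows (a plain-Python sketch of the idea, not the actual PySpark `utf8_decoder` source):

      ```python
      def utf8_decoder(s):
          """Decode bytes to str. A decoder that special-cases falsy input by
          returning it unchanged would pass b"" through and hand the caller
          bytes instead of str under Python 3; only None is passed through
          here, so empty bytes decode to an empty str."""
          if s is None:
              return None
          return s.decode("utf-8")

      assert utf8_decoder(b"") == ""              # str, not bytes
      assert isinstance(utf8_decoder(b""), str)
      assert utf8_decoder(b"spark") == "spark"
      assert utf8_decoder(None) is None
      ```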
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8315 from zsxwing/SPARK-9812.
  26. Aug 15, 2015