  1. Dec 01, 2015
  2. Nov 30, 2015
  3. Nov 24, 2015
  4. Nov 23, 2015
    • [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in... · 98d7ec7d
      Yanbo Liang authored
      [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc
      
      ML ```LinearRegression``` uses ```data/mllib/sample_libsvm_data.txt``` as the dataset in examples and the user guide doc, but that is actually a classification dataset rather than a regression dataset. We should use ```data/mllib/sample_linear_regression_data.txt``` instead.
      The deeper cause is that ```LinearRegression``` with the "normal" solver cannot solve this dataset correctly, possibly due to ill-conditioning and unreasonable labels. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
      It will confuse users if they run the example code but get an exception, so we should make this change, which clearly illustrates the usage of the ```LinearRegression``` algorithm.
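      The "normal" solver mentioned above solves the least-squares normal equations directly. As a minimal illustration of that approach (plain Python on a tiny one-feature dataset, not Spark's ```LinearRegression```; the function name here is an assumption of this sketch):

      ```python
      # Illustrative sketch of the "normal equation" approach to linear
      # regression; not Spark code.

      def fit_normal_equation(xs, ys):
          """Fit y = w*x + b by solving the 2x2 normal equations directly."""
          n = len(xs)
          sx = sum(xs)
          sy = sum(ys)
          sxx = sum(x * x for x in xs)
          sxy = sum(x * y for x, y in zip(xs, ys))
          # Solve [[sxx, sx], [sx, n]] @ [w, b] = [sxy, sy] via Cramer's rule.
          det = sxx * n - sx * sx
          w = (sxy * n - sx * sy) / det
          b = (sxx * sy - sx * sxy) / det
          return w, b

      # Points lying exactly on y = 2x + 1 recover w = 2, b = 1.
      w, b = fit_normal_equation([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
      print(w, b)  # 2.0 1.0
      ```

      When the design matrix is ill-conditioned, the determinant above approaches zero and the direct solve becomes numerically unstable, which matches the failure mode the commit message describes.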
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9905 from yanboliang/spark-11920.
  5. Nov 22, 2015
    • [SPARK-11895][ML] rename and refactor DatasetExample under mllib/examples · fe89c181
      Xiangrui Meng authored
      We used the name `Dataset` to refer to `SchemaRDD` in 1.2 in ML pipelines and created this example file. Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid confusion. This PR also removes support for dense format to simplify the example code.
      
      cc: yinxusen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9873 from mengxr/SPARK-11895.
  6. Nov 20, 2015
  7. Nov 18, 2015
  8. Nov 17, 2015
  9. Nov 16, 2015
  10. Nov 14, 2015
  11. Nov 13, 2015
  12. Nov 12, 2015
  13. Nov 11, 2015
    • [SPARK-11290][STREAMING] Basic implementation of trackStateByKey · 99f5f988
      Tathagata Das authored
      The current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses the new data and the existing state of the key to generate an updated state. However, based on community feedback, we have learned the following lessons:
      * Need for more optimized state management that does not scan every key
      * Need to make it easier to implement common use cases: (a) timeout of idle data, (b) returning items other than state
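      The updateStateByKey semantics described above can be sketched in plain Python (an illustration of the model only, not Spark's API; the dict-based state and the function names are assumptions of this sketch):

      ```python
      # Sketch of updateStateByKey-style semantics: the update function
      # receives each key's new values plus its existing state and returns
      # the new state. Plain Python, not Spark.

      def update_state_by_key(state, new_data, update_func):
          """Apply update_func(new_values, old_state) for every key that has
          either new data or existing state -- note that every known key is
          visited, which is the scan cost the first lesson above refers to."""
          keys = set(state) | set(new_data)
          new_state = {}
          for key in keys:
              updated = update_func(new_data.get(key, []), state.get(key))
              if updated is not None:  # returning None drops the key's state
                  new_state[key] = updated
          return new_state

      # Example: a running count per key across two batches.
      def running_count(new_values, old_count):
          return (old_count or 0) + len(new_values)

      state = {}
      state = update_state_by_key(state, {"a": [1, 2], "b": [3]}, running_count)
      state = update_state_by_key(state, {"a": [4]}, running_count)
      print(sorted(state.items()))  # [('a', 3), ('b', 1)]
      ```

      In the second batch only "a" has new data, yet "b" is still visited and its state recomputed, illustrating the per-key scan that motivates the new API.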
      
      The high-level idea of this PR:
      * Introduce a new API, trackStateByKey, that allows the user to update per-key state and emit arbitrary records. The new API is necessary because it will have significantly different semantics from the existing updateStateByKey API. This API will have direct support for timeouts.
      * Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the updated data and, if necessary, the old data.
      Here is the detailed design doc. Please take a look and provide feedback as comments.
      https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
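      A minimal sketch of the two semantics the bullets above describe, in plain Python (the class, method, and function names here are assumptions of this sketch, not the actual trackStateByKey API):

      ```python
      import time

      class TrackState:
          """Toy per-key state store with an idle timeout, illustrating the
          trackStateByKey semantics: a state update can also *emit* records
          that differ from the state itself. Not Spark code."""

          def __init__(self, timeout_secs, now=time.monotonic):
              self.timeout_secs = timeout_secs
              self.now = now
              self.state = {}  # key -> (value, last_update_time)

          def track(self, batch, track_func):
              """Apply track_func(key, value, old_state) to each (key, value)
              pair; unlike updateStateByKey, only keys with new data are
              touched. Returns the list of emitted records."""
              emitted = []
              t = self.now()
              for key, value in batch:
                  old = self.state.get(key, (None, t))[0]
                  new_state, record = track_func(key, value, old)
                  self.state[key] = (new_state, t)
                  emitted.append(record)
              # Idle timeout: drop keys not updated recently.
              self.state = {k: (v, ts) for k, (v, ts) in self.state.items()
                            if t - ts <= self.timeout_secs}
              return emitted

      # Example: keep a per-key sum as state, but emit (key, running_sum)
      # records -- i.e., the emitted items are not the state itself.
      def track_sum(key, value, old_sum):
          s = (old_sum or 0) + value
          return s, (key, s)

      ts = TrackState(timeout_secs=60)
      print(ts.track([("a", 1), ("a", 2), ("b", 5)], track_sum))
      # [('a', 1), ('a', 3), ('b', 5)]
      ```

      The timeout pruning at the end of track is a simplification; the design doc linked above covers how the real implementation organizes state inside RDD partitions.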
      
      This is still WIP. Major things left to be done.
      - [x] Implement basic functionality of state tracking, with initial RDD and timeouts
      - [x] Unit tests for state tracking
      - [x] Unit tests for initial RDD and timeout
      - [ ] Unit tests for TrackStateRDD
             - [x] state creating, updating, removing
             - [ ] emitting
             - [ ] checkpointing
      - [x] Misc unit tests for State, TrackStateSpec, etc.
      - [x] Update docs and experimental tags
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9256 from tdas/trackStateByKey.
  14. Nov 10, 2015
  15. Nov 09, 2015
  16. Nov 04, 2015
  17. Nov 02, 2015
    • [SPARK-11383][DOCS] Replaced example code in... · 2804674a
      Rishabh Bhardwaj authored
      [SPARK-11383][DOCS] Replaced example code in mllib-naive-bayes.md/mllib-isotonic-regression.md using include_example
      
      I have made the required changes in mllib-naive-bayes.md/mllib-isotonic-regression.md and also verified them.
      Kindly review it.
      
      Author: Rishabh Bhardwaj <rbnext29@gmail.com>
      
      Closes #9353 from rishabhbhardwaj/SPARK-11383.
  18. Oct 26, 2015
  19. Sep 23, 2015
  20. Sep 21, 2015
  21. Sep 15, 2015
  22. Sep 12, 2015
  23. Aug 28, 2015
  24. Aug 25, 2015
  25. Aug 19, 2015
    • [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs · 1f29d502
      zsxwing authored
      This PR includes the following fixes:
      1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
      2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3.
      3. Fix the commands in the docs so that the user can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copied to the command line, the path would be split into two parts by the extra spaces, forcing the user to fix it manually.
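      Fix 2 above can be sketched as follows (a plain-Python sketch of the idea, not the actual PySpark `utf8_decoder` source):

      ```python
      def utf8_decoder(s):
          """Decode bytes to str. A decoder that special-cases falsy input by
          returning it unchanged would pass b"" through and hand the caller
          bytes instead of str under Python 3; only None is passed through
          here, so empty bytes decode to an empty str."""
          if s is None:
              return None
          return s.decode("utf-8")

      assert utf8_decoder(b"") == ""              # str, not bytes
      assert isinstance(utf8_decoder(b""), str)
      assert utf8_decoder(b"spark") == "spark"
      assert utf8_decoder(None) is None
      ```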
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8315 from zsxwing/SPARK-9812.
  26. Aug 15, 2015