  1. Sep 01, 2017
[SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation · 12ab7f7e
Sean Owen authored
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
  2. Aug 16, 2017
      [SPARK-21680][ML][MLLIB] optimize Vector compress · a0345cbe
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
When using `Vector.compressed` to convert a `Vector` to a `SparseVector`, performance is very low compared with `Vector.toSparse`.
This is because `Vector.compressed` has to scan the values three times, whereas `Vector.toSparse` only needs two scans.
When the vector is long, there is a significant performance difference between the two methods.
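
As a rough illustration, here is a minimal sketch (hypothetical class names, not the Spark source) of where the extra scan comes from and how passing the precomputed nonzero count removes it:

```scala
// Sketch only: a stripped-down dense vector with the two conversion paths.
class MyDenseVector(val values: Array[Double]) {
  def numNonzeros: Int = values.count(_ != 0) // one full scan

  // Before the fix, compressed effectively did: numNonzeros (scan 1), then
  // toSparse, which recomputed numNonzeros (scan 2) before copying (scan 3).
  // Threading nnz through avoids the middle scan.
  def toSparse: MySparseVector = toSparse(numNonzeros)

  def toSparse(nnz: Int): MySparseVector = {
    val indices = new Array[Int](nnz)
    val vs = new Array[Double](nnz)
    var i = 0
    var k = 0
    while (i < values.length) { // the copy scan
      if (values(i) != 0) { indices(k) = i; vs(k) = values(i); k += 1 }
      i += 1
    }
    new MySparseVector(values.length, indices, vs)
  }

  def compressed: AnyRef = {
    val nnz = numNonzeros
    // storage heuristic: sparse wins when 1.5 * (nnz + 1) < size
    if (1.5 * (nnz + 1.0) < values.length) toSparse(nnz) else this
  }
}

class MySparseVector(val size: Int, val indices: Array[Int], val values: Array[Double])
```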
      
      ## How was this patch tested?
      
Existing unit tests.
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18899 from mpjlu/optVectorCompress.
  3. Aug 15, 2017
      [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
      
As part of the above, I moved some `SparkSession` imports in ML examples
inside the `$example on$` blocks; their placement didn't seem consistent
across source files to start with, and this avoids having to add more on/off
blocks around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
  4. Jul 13, 2017
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
  5. May 16, 2017
  6. May 09, 2017
      [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException · be53a783
      Jon McLean authored
      ## What changes were proposed in this pull request?
      
Added a check for the number of defined values. Previously, the argmax function assumed that at least one value was defined if the vector size was greater than zero.
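
A minimal sketch of the guard, with hypothetical names (the real fix lives in `SparseVector.argmax`):

```scala
// Sketch: argmax over a sparse vector must handle the case where no values
// are explicitly stored, i.e. every entry is an implicit zero.
def argmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (size == 0) {
    -1 // empty vector: no argmax exists
  } else if (values.isEmpty) {
    0 // nothing stored, so all entries are zero; the first index wins
  } else {
    var best = indices(0)
    var bestVal = values(0)
    var k = 1
    while (k < values.length) { // scan only the stored values
      if (values(k) > bestVal) { best = indices(k); bestVal = values(k) }
      k += 1
    }
    // (the real implementation also compares against an implicit zero when
    // the stored maximum is negative and the vector is not fully stored)
    best
  }
}
```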
      
      ## How was this patch tested?
      
      Tests were added to the existing VectorsSuite to cover this case.
      
      Author: Jon McLean <jon.mclean@atsid.com>
      
      Closes #17877 from jonmclean/vectorArgmaxIndexBug.
  7. Apr 24, 2017
  8. Apr 09, 2017
      [SPARK-20260][MLLIB] String interpolation required for error message · 261eaf51
      Vijay Ramesh authored
      ## What changes were proposed in this pull request?
      This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:
      
      ```
      Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
      ```
      (note the literal `$current` instead of the interpolated value)
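
A minimal before/after illustration of the fix (values are hypothetical):

```scala
val current = 3
val previous = 5
val line = "5 2:1.0 3:2.0"

// before: no interpolator, so "$current" appears literally in the message
val broken = "found current=$current, previous=$previous; line=\"$line\""

// after: the s-interpolator substitutes the runtime values
val fixed = s"found current=$current, previous=$previous; line=\"$line\""

println(broken) // found current=$current, previous=$previous; line="$line"
println(fixed)  // found current=3, previous=5; line="5 2:1.0 3:2.0"
```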
      
      
      Author: Vijay Ramesh <vramesh@demandbase.com>
      
      Closes #17572 from vijaykramesh/master.
  9. Mar 24, 2017
      [SPARK-17471][ML] Add compressed method to ML matrices · e8810b73
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a `compressed` method to ML `Matrix` class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in a `columnMajor: Boolean` parameter.
      
      The compressed implementation relies upon two new abstract methods `toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar to the compressed method implemented in the `Vector` class. These methods also allow the layout of the resulting matrix to be specified via the `columnMajor` parameter. More detail on the new methods is given below.
      ## How was this patch tested?
      
      Added many new unit tests
      ## New methods (summary, not exhaustive list)
      
      **Matrix trait**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` (abstract) - converts the matrix (either sparse or dense) to dense format
      - `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` (abstract) -  converts the matrix (either sparse or dense) to sparse format
- `def toDense: DenseMatrix = toDenseMatrix(true)` - converts the matrix (either sparse or dense) to dense format in column major layout
- `def toSparse: SparseMatrix = toSparseMatrix(true)` - converts the matrix (either sparse or dense) to sparse format in column major layout
      - `def compressed: Matrix` - finds the minimum space representation of this matrix, considering both column and row major layouts, and converts it
      - `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space representation of this matrix considering only column OR row major, and converts it
      
      **DenseMatrix class**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)
      - `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` - converts the dense matrix to sparse matrix, using the specified layout
      
      **SparseMatrix class**
      - `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the sparse matrix to a dense matrix, using the specified layout
- `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` - converts the sparse matrix to a sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the requested layout does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.
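
For intuition, a hedged sketch of the size arithmetic behind `compressed` (illustrative helper, not the actual implementation; byte counts assume 8-byte doubles and 4-byte ints):

```scala
def denseSize(numRows: Long, numCols: Long): Long =
  8L * numRows * numCols // one Double per entry, identical for either layout

def sparseSize(majorDim: Long, nnz: Long): Long =
  12L * nnz + 4L * (majorDim + 1) // values + minor indices + major pointers

def pickCompressed(numRows: Int, numCols: Int, nnz: Int): String =
  Seq(
    "dense"            -> denseSize(numRows, numCols),
    "sparse col-major" -> sparseSize(numCols, nnz), // CSC: colPtrs has numCols + 1 entries
    "sparse row-major" -> sparseSize(numRows, nnz)  // CSR: rowPtrs has numRows + 1 entries
  ).minBy(_._2)._1
```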
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15628 from sethah/matrix_compress.
  10. Mar 23, 2017
      [SPARK-19636][ML] Feature parity for correlation statistics in MLlib · d27daa54
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
This patch adds DataFrame-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.
      
      The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
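
For reference, a usage sketch of the DataFrame-based API as it appears in later Spark releases (the exact entry point in this initial commit may differ; assumes `spark.implicits._` is in scope):

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.dense(1.0, 0.5, -1.0),
  Vectors.dense(2.0, 1.0, -2.0),
  Vectors.dense(4.0, 2.5, -4.0))
val df = data.map(Tuple1.apply).toDF("features")

// Pearson is the default; other methods are selected by name.
val Row(pearson: Matrix) = Correlation.corr(df, "features").head
val Row(spearman: Matrix) = Correlation.corr(df, "features", "spearman").head
```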
      
      ## How was this patch tested?
      
      ```
      build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
      ```
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #17108 from thunterdb/19636.
  11. Feb 01, 2017
[SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation · f1a1f260
hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
- Support LaTex inline formulas, `\( ... \)`, in Scala API documentation
  It seems that currently,

  ```
  \( ... \)
  ```

  is rendered as-is, for example:

  <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">

  It seems extra backslashes were mistakenly added.
      
- Fix warnings in Scaladoc/Javadoc generation
  This PR fixes two types of warnings, as below:
      
        ```
        [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
        [warn]   /**
        [warn]   ^
        ```
      
        ```
        [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
        [warn]  * `${var}`, `${system:var}` and `${env:var}`.
        [warn]      ^
        ```
      
      - Fix Javadoc8 break
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
        [error]    *                          E.g., {link VectorUDT} for vector features.
        [error]                                            ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
        [error]  *                       E.g., {link VectorUDT} for vector features.
        [error]                                       ^
        [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
        [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
        [error]                                                  ^
        ```
      
      ## How was this patch tested?
      
Manually via `sbt unidoc` and `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16741 from HyukjinKwon/warn-and-break.
  12. Dec 21, 2016
      [SPARK-17807][CORE] split test-tags into test-JAR · afd9bc1d
      Ryan Williams authored
Remove the spark-tags compile-scope dependency (and, indirectly, spark-core's compile-scope transitive dependency) on scalatest by splitting test-oriented tags into a spark-tags test JAR.
      
      Alternative to #16303.
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #16311 from ryan-williams/tt.
  13. Dec 02, 2016
  14. Nov 29, 2016
  15. Nov 25, 2016
[SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility · 51b1c155
hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
This PR only tries to fix things that look pretty straightforward and were fixed in previous PRs.
      
      This PR roughly fixes several things as below:
      
- Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in `throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
- It seems a class can't have a `return` annotation, so two such cases were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
  16. Nov 19, 2016
[SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation · d5b1d5fc
hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
It seems the following note markers are used inconsistently across the Scala/Java documentation:
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
This PR proposes to fix all of these to `note` for consistency.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
And then fixed them one by one, comparing with the API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
  17. Oct 25, 2016
      [SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet · 78d740a0
      sethah authored
      ## What changes were proposed in this pull request?
      
      1. Make a pluggable solver interface for `WeightedLeastSquares`
      2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares`
      3. Add method `BLAS.dspmv` used by QN solver
      4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails.
      
      ## How was this patch tested?
      Unit tests - see below.
      
      ## Design choices
      
      **Pluggable Normal Solver**
      
      Before, the `WeightedLeastSquares` package always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, we specify the solver as a constructor argument to the `WeightedLeastSquares`. We introduce a new trait:
      
```scala
      private[ml] sealed trait NormalEquationSolver {
      
        def solve(
            bBar: Double,
            bbBar: Double,
            abBar: DenseVector,
            aaBar: DenseVector,
            aBar: DenseVector): NormalEquationSolution
      }
```
      
      We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface.
      
      **Always train in the standardized space**
      
      The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels.
      
      **Use L-BFGS locally to solve normal equations for singular matrix**
      
      When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well.
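
A minimal sketch of that fallback flow, with hypothetical names (the real `WeightedLeastSquares` internals differ in detail):

```scala
class SingularMatrixException(msg: String) extends RuntimeException(msg)

def choleskySolve(): Array[Double] =
  // stand-in for the normal-equation solve via lapack.dppsv, which
  // reports failure when the matrix is singular
  throw new SingularMatrixException("dppsv reported a singular matrix")

def quasiNewtonSolve(): Array[Double] =
  Array(0.0) // stand-in for the local L-BFGS/OWL-QN solve

// Auto solver: try Cholesky first, fall back to quasi-Newton on failure.
def solveNormalEquations(): Array[Double] =
  try choleskySolve()
  catch { case _: SingularMatrixException => quasiNewtonSolve() }
```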
      
      ## Test cases
      I found it helpful to enumerate some of the test cases and hopefully it makes review easier.
      
      **WeightedLeastSquares**
      
      1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
      2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully.
      3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero.
      4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's lm package.
      5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet.
      6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean.
      7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R.
      
      **LinearRegression**
      1. For all existing L1 test cases, test the "normal" solver too.
      2. Check that using the normal solver now handles singular matrices.
      3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA.
      
      **BLAS**
      1. Test new method `dspmv`.
      
      ## Performance Testing
      This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale, only observed by testing locally that there is a speed improvement.
      
      We should decide if this PR needs to be blocked before performance testing is conducted.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15394 from sethah/SPARK-17748.
  18. Oct 21, 2016
      [SPARK-17331][FOLLOWUP][ML][CORE] Avoid allocating 0-length arrays · a8ea4da8
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `Array[T]()` -> `Array.empty[T]` to avoid allocating 0-length arrays.
      Use regex `find . -name '*.scala' | xargs -i bash -c 'egrep "Array\[[A-Za-z]+\]\(\)" -n {} && echo {}'` to find modification candidates.
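
A small sketch of the intent (hypothetical helper; whether `Array.empty[T]` itself caches depends on the Scala version, but the companion's typed vals such as `Array.emptyDoubleArray` are explicitly cached instances):

```scala
// Returning a cached empty array avoids a fresh 0-length allocation per call.
def activeValues(hasAny: Boolean): Array[Double] =
  if (hasAny) Array(1.0) else Array.emptyDoubleArray
```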
      
      cc srowen
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15564 from zhengruifeng/avoid_0_length_array.
  19. Sep 29, 2016
      [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with SparseVector · 29396e7d
      Bjarne Fruergaard authored
      ## What changes were proposed in this pull request?
      
      * changes the implementation of gemv with transposed SparseMatrix and SparseVector both in mllib-local and mllib (identical)
      * adds a test that was failing before this change, but succeeds with these changes.
      
The problem in the previous implementation was that it only incremented `i`, which enumerates the column indices within a row of the SparseMatrix, when the row index of the vector matched the column index of the SparseMatrix. When a particular row of the SparseMatrix has non-zero values at column indices lower than the corresponding non-zero row indices of the SparseVector, the non-zero values of the SparseVector are enumerated without ever matching the column index at position `i`, and the remaining column indices i+1,...,indEnd-1 are never attempted. The test cases in this PR illustrate this issue.
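
A hedged sketch of the corrected inner loop: the dot product of one sparse column (sorted indices `aIdx`, values `aVal`) with a sparse vector (sorted indices `xIdx`, values `xVal`). The key is a two-pointer merge that advances whichever side is behind, instead of advancing `i` only on a match:

```scala
def sparseDot(aIdx: Array[Int], aVal: Array[Double],
              xIdx: Array[Int], xVal: Array[Double]): Double = {
  var i = 0 // position in the matrix column
  var k = 0 // position in the vector
  var sum = 0.0
  while (i < aIdx.length && k < xIdx.length) {
    if (aIdx(i) == xIdx(k)) {
      sum += aVal(i) * xVal(k); i += 1; k += 1
    } else if (aIdx(i) < xIdx(k)) {
      i += 1 // column index is behind: advance it (the case described above)
    } else {
      k += 1 // vector index is behind: advance it
    }
  }
  sum
}
```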
      
      ## How was this patch tested?
      
      I have run the specific `gemv` tests in both mllib-local and mllib. I am currently still running `./dev/run-tests`.
      
      ## ___
      As per instructions, I hereby state that this is my original work and that I license the work to the project (Apache Spark) under the project's open source license.
      
Mentioning dbtsai, viirya and brkyvz, whom I can see have worked on or authored these parts before.
      
      Author: Bjarne Fruergaard <bwahlgreen@gmail.com>
      
      Closes #15296 from bwahlgreen/bugfix-spark-17721.
  20. Sep 07, 2016
[SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths · 3ce3a282
Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.
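
Concretely, in Scala 2.11/2.12 `append` is declared with a varargs signature, while `+=` takes the element directly:

```scala
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer.empty[Int]
buf.append(1) // append(elems: A*): the argument goes through a varargs wrapper
buf += 2      // +=(elem: A): the element is passed directly, no wrapper
```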
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14914 from lw-lin/append_to_plus_eq_v2.
  21. Sep 04, 2016
      [MINOR][ML][MLLIB] Remove work around for breeze sparse matrix. · 1b001b52
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Since we have updated the breeze version to 0.12, we should remove the workaround for the breeze sparse matrix bug in v0.11.
I checked all mllib code and found this is the only workaround for breeze 0.11.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14953 from yanboliang/matrices.
  22. Sep 01, 2016
      [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays · 3893e8c5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
Avoid allocating some 0-length arrays, especially in UTF8String, by using `Array.empty[T]` in Scala instead of `Array[T]()`.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14895 from srowen/SPARK-17331.
  23. Aug 27, 2016
      [ML][MLLIB] The require condition and message doesn't match in SparseMatrix. · 40168dbe
      Peng, Meng authored
      ## What changes were proposed in this pull request?
The require condition and message don't match, and the condition should also be optimized.
Small change. Please kindly let me know if a JIRA is required.
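
A hypothetical illustration of this class of bug (not the actual Spark code): the condition should test exactly what the message describes, and the message should report the observed values.

```scala
val numCols = 3
val colPtrs = Array(0, 1, 2, 4)
// condition and message agree, and the message shows both sides
require(colPtrs.length == numCols + 1,
  s"colPtrs length (${colPtrs.length}) must equal numCols + 1 (${numCols + 1}).")
```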
      
      ## How was this patch tested?
      No additional test required.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14824 from mpjlu/smallChangeForMatrixRequire.
  24. Aug 26, 2016
      [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils · c0949dc9
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      
Fix a bug in comparing Vectors in TestingUtils.
The same bug exists for Matrix comparison; how to check the dimensions of a Matrix should be discussed first.
      
      ## How was this patch tested?
      
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14785 from mpjlu/testUtils.
  25. Aug 19, 2016
      [SPARK-16965][MLLIB][PYSPARK] Fix bound checking for SparseVector. · 072acf5e
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
1. In Scala, add negative lower-bound checking, and put all the lower/upper bound checks in one place
2. In Python, add lower/upper bound checking of indices
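
A minimal sketch of consolidated bound checking on the Scala side (hypothetical helper; the real checks live in the SparseVector validation):

```scala
def validateIndices(size: Int, indices: Array[Int]): Unit = {
  if (indices.nonEmpty) {
    require(indices.head >= 0,
      s"Found negative index: ${indices.head}.")
    require(indices.last < size,
      s"Index ${indices.last} out of bounds for vector of size $size.")
    // strictly increasing order implies every intermediate index is in range
    var i = 1
    while (i < indices.length) {
      require(indices(i) > indices(i - 1),
        s"Indices must be strictly increasing: found ${indices(i - 1)} followed by ${indices(i)}.")
      i += 1
    }
  }
}
```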
      
      ## How was this patch tested?
      
      unit test added
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14555 from zjffdu/SPARK-16965.
  26. Jul 19, 2016
  27. Jul 16, 2016
[SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility · 5ec0d692
Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end.
      
      ## How was this patch tested?
      
      Jenkins build of docs
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14221 from srowen/SPARK-3359.3.
  28. Jul 11, 2016
      [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT · ffcb6e05
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14130 from rxin/SPARK-16477.
  29. Jun 06, 2016
      [MINOR] Fix Typos 'an -> a' · fd8af397
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `an -> a`
      
      Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13515 from zhengruifeng/an_a.
  30. May 27, 2016
      [SPARK-15413][ML][MLLIB] Change `toBreeze` to `asBreeze` in Vector and Matrix · 21b2605d
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
We're using `asML` to convert the mllib vector/matrix to the ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underlying data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API; as a result, it will not affect any user's application.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #13198 from dbtsai/minor.
  31. May 19, 2016
  32. May 17, 2016
  33. Apr 30, 2016
      [SPARK-14653][ML] Remove json4s from mllib-local · 0847fe4e
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependencies minimal. The JSON encoding is used by Params, so we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.
      
      ## How was this patch tested?
      
      Copied existing unit tests over.
      
cc dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12802 from mengxr/SPARK-14653.
  34. Apr 28, 2016
  35. Apr 26, 2016
      [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local · bd2c9a6d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.
      
      This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
      * Renamed fields to match numpy, scipy: mu => mean, sigma => cov
      
      This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
      * Modifying the constructor
      * Adding a computeProbabilities method
      
      Also:
      * Added EPSILON to mllib-local for use in MultivariateGaussian
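
Usage sketch of the renamed API as it exists in later releases (package and method details hedged):

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}
import org.apache.spark.ml.stat.distribution.MultivariateGaussian

// mllib used (mu, sigma); the ml copy follows numpy/scipy naming.
val g = new MultivariateGaussian(
  Vectors.dense(0.0, 0.0),                        // mean
  Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)) // cov
)
val density = g.pdf(Vectors.dense(0.5, -0.5))
```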
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12593 from jkbradley/sparkml-gmm-fix.
  36. Apr 22, 2016
      [SPARK-6429] Implement hashCode and equals together · bf95b8da
      Joan authored
      ## What changes were proposed in this pull request?
      
Implement some `hashCode` and `equals` methods together in order to enable the scalastyle rule.
This is a first batch; I will continue to implement them, but I wanted to know your thoughts.
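
The contract being enforced: equal objects must have equal hash codes, so the two methods should be defined together over the same fields. An illustrative class (not from the patch):

```scala
class Point(val x: Int, val y: Int) {
  override def equals(other: Any): Boolean = other match {
    case p: Point => x == p.x && y == p.y
    case _        => false
  }
  // derived from the same fields as equals, keeping the contract intact
  override def hashCode(): Int = 31 * x + y
}
```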
      
      Author: Joan <joan@goyeau.com>
      
      Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
  37. Apr 15, 2016
      [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local · 96534aa4
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
This task will copy the Vector and Matrix classes from mllib to the ml package in the mllib-local jar. The UDTs and `Since` annotations in the ml vector and matrix will be removed for now. UDTs will be handled by SPARK-14487, and `Since` will be replaced by `/* since 1.2.0 */` comments.

The BLAS implementation will be copied, and some of the test utilities will be copied as well.
      
      Summary of changes:
      
      1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
        - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
      2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed, and we'll use standard `/* Since */` Javadoc. Will be in another PR.
  - `UDT`-related code was removed; `SPARK-13944` (https://github.com/apache/spark/pull/12259) will replace the annotation.
      3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
        - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
        - `Since` was removed.
        - `UDT` related code was removed.
        - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
      4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
        - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
      5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
        - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12317 from dbtsai/dbtsai-ml-vector.
  38. Apr 14, 2016