- Oct 29, 2015
-
Herman van Hovell authored
Java 8 javadoc does not like self-closing tags: ```<p/>```, ```<br/>```, ... This PR fixes those. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9339 from hvanhovell/SPARK-11388.
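A hedged before/after illustration on a hypothetical method (not from this PR):

```scala
/**
 * Returns the component's version.
 * <p>
 * Java 8 doclint accepts the plain `<p>` tag but rejects the
 * self-closing `<p/>` form that appeared throughout the docs.
 */
def version: String = "1.6.0-SNAPSHOT"
```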
-
tedyu authored
Author: tedyu <yuzhihong@gmail.com> Closes #9281 from tedyu/master.
-
Wenchen Fan authored
Before this PR, a user had to consume the iterator of one group before processing the next group, or we would get into an infinite loop. Author: Wenchen Fan <wenchen@databricks.com> Closes #9330 from cloud-fan/group.
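A hedged sketch of the pattern this fixes (Dataset API of this period, whose names and signatures were still in flux; a `sqlContext` with its implicits is assumed):

```scala
// a minimal sketch, assuming sqlContext implicits are in scope
import sqlContext.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
// with the fix, moving on to the next group no longer requires first
// draining the previous group's iterator
val keys = ds.groupBy(_._1).mapGroups { (key, iter) => Iterator(key) }
```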
-
Wenchen Fan authored
For an inner primitive type (e.g. inside a `Product`), we use `schemaFor` to get its catalyst type, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403. However, for a top-level primitive type, we used `dataTypeFor`, which is wrong. Author: Wenchen Fan <wenchen@databricks.com> Closes #9337 from cloud-fan/encoder.
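A hedged illustration of the two paths (assuming `sqlContext` implicits in scope; `Wrap` is a hypothetical case class):

```scala
import sqlContext.implicits._

case class Wrap(i: Int)
val nested   = Seq(Wrap(1), Wrap(2)).toDS() // inner Int: catalyst type via schemaFor
val topLevel = Seq(1, 2, 3).toDS()          // top-level Int: previously went through dataTypeFor
```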
-
- Oct 28, 2015
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-11322 As reported by JoshRosen in [databricks/spark-redshift/issues/89](https://github.com/databricks/spark-redshift/issues/89#issuecomment-149828308), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep the full stack trace in the captured exception. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9283 from viirya/py-exception-stacktrace.
-
Wenchen Fan authored
Author: Wenchen Fan <wenchen@databricks.com> Closes #9304 from cloud-fan/interval.
-
Cheng Lian authored
This PR fixes a mistake in the code generated by `GenerateColumnAccessor`. Interestingly, although the code is illegal in Java (the class has two fields with the same name), Janino happily accepts it and it accidentally works properly. Author: Cheng Lian <lian@databricks.com> Closes #9335 from liancheng/spark-11376.fix-generated-code.
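A hedged repro sketch of that claim using Janino's `SimpleCompiler`; that Janino accepts this exact snippet is an assumption based on the PR description, not verified here:

```scala
import org.codehaus.janino.SimpleCompiler

// javac rejects a class with two fields of the same name; per this fix's
// description, Janino (Spark's codegen compiler) accepted such a class
val compiler = new SimpleCompiler()
compiler.cook("public class T { int value; int value; }")
```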
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-11363 Some places in SparkStrategies use LeftSemiJoin; they should use LeftSemi. cc chenghao-intel liancheng Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9318 from viirya/no-left-semi-join.
-
Reynold Xin authored
Adds DataFrameReader.text and DataFrameWriter.text. Author: Reynold Xin <rxin@databricks.com> Closes #9259 from rxin/SPARK-11292.
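A hedged usage sketch of the new API (paths are illustrative; the reader is assumed to produce a DataFrame with a single string column):

```scala
val lines = sqlContext.read.text("README.md") // one string column per line of text
lines.write.text("/tmp/readme-copy")          // writes the string column back out as text
```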
-
Michael Armbrust authored
This is minor, but I ran into it while writing Datasets, and while it wasn't needed for the final solution, it was super confusing, so we should fix it. Basically, we recurse into a `Seq` to see if it has children. This breaks because we don't preserve the original subclass of `Seq` (and `StructType <:< Seq[StructField]`). Since a struct can never contain children, let's just not recurse into it. Author: Michael Armbrust <michael@databricks.com> Closes #9334 from marmbrus/structMakeCopy.
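A short illustration of why the generic `Seq` case misfires:

```scala
import org.apache.spark.sql.types._

// StructType is itself a Seq[StructField], so naive Seq recursion descends into it
val struct = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
assert(struct.isInstanceOf[Seq[_]]) // passes, which is what tripped up makeCopy
```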
-
Yanbo Liang authored
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) provided the ```WeightedLeastSquares``` solver ("normal") in ```LinearRegression``` with L2 regularization in Scala and R; Python ML ```LinearRegression``` should also support setting the solver ("auto", "normal", "l-bfgs"). Author: Yanbo Liang <ybliang8@gmail.com> Closes #9328 from yanboliang/spark-11367.
-
Yanbo Liang authored
SparkR glm currently supports: ```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```. We should also support setting standardize, which has been defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit). Author: Yanbo Liang <ybliang8@gmail.com> Closes #9331 from yanboliang/spark-11369.
-
Mageswaran.D authored
The recall-by-threshold snippet was using "precisionByThreshold". Author: Mageswaran.D <mageswaran1989@gmail.com> Closes #9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md.
-
Wenchen Fan authored
A simpler version of https://github.com/apache/spark/pull/9279 that only supports 2 datasets. Author: Wenchen Fan <wenchen@databricks.com> Closes #9324 from cloud-fan/cogroup2.
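A hedged sketch of two-Dataset cogroup (assuming `sqlContext` implicits; the exact signature of this period may differ):

```scala
import sqlContext.implicits._

val left  = Seq((1, "a"), (2, "b")).toDS().groupBy(_._1)
val right = Seq((1, "x"), (3, "y")).toDS().groupBy(_._1)
// for each key, the values from both sides are presented together
val merged = left.cogroup(right) { (key, l, r) =>
  Iterator(s"$key: left=${l.map(_._2).mkString(",")} right=${r.map(_._2).mkString(",")}")
}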
-
Nakul Jindal authored
WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
-
Xiangrui Meng authored
This fixes some compile time warnings. Author: Xiangrui Meng <meng@databricks.com> Closes #9319 from mengxr/mllib-compile-warn-20151027.
-
Sean Owen authored
[SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293 Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2.
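For reference, a hedged sketch of the quantity involved: given the eigendecomposition of the covariance matrix, the root-sigma-inverse is a (non-symmetric) root of the inverse covariance; the actual code additionally handles singular Σ:

```latex
\Sigma = U D U^{\top}, \qquad
R := D^{-1/2} U^{\top}, \qquad
R^{\top} R = U D^{-1} U^{\top} = \Sigma^{-1}
```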
-
- Oct 27, 2015
-
Cheng Hao authored
In some cases, we can broadcast the smaller relation in a cartesian join, which improves performance significantly. Author: Cheng Hao <hao.cheng@intel.com> Closes #8652 from chenghao-intel/cartesian.
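A minimal sketch of the affected pattern (names are illustrative):

```scala
val big   = sqlContext.range(0, 1000000).toDF("x")
val small = sqlContext.range(0, 10).toDF("y")
// no join condition => cartesian product; the small side can now be broadcast
val cart = big.join(small)
```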
-
Kay Ousterhout authored
Commit af3bc59d introduced new functionality so that if an executor dies for a reason that's not caused by one of the tasks running on the executor (e.g., due to pre-emption), Spark doesn't count the failure towards the maximum number of failures for the task. That commit introduced some vague naming that this commit attempts to fix; in particular:

1. The variable "isNormalExit", which was used to refer to cases where the executor died for a reason unrelated to the tasks running on the machine, has been renamed (and reversed) to "exitCausedByApp". The problem with the existing name is that it's not clear (at least to me!) what it means for an exit to be "normal"; the new name is intended to make the purpose of this variable more clear.
2. The variable "shouldEventuallyFailJob" has been renamed to "countTowardsTaskFailures". This variable is used to determine whether a task's failure should be counted towards the maximum number of failures allowed for a task before the associated Stage is aborted. The problem with the existing name is that it can be confused with implying that the task's failure should immediately cause the stage to fail because it is somehow fatal (this is the case for a fetch failure, for example: if a task fails because of a fetch failure, there's no point in retrying, and the whole stage should be failed).

Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9164 from kayousterhout/SPARK-11178.
-
zsxwing authored
[SPARK-11212][CORE][STREAMING] Make preferred locations support ExecutorCacheTaskLocation and update ReceiverTracker and ReceiverSchedulingPolicy to use it

This PR includes the following changes:

1. Add a new preferred location format, `executor_<host>_<executorID>` (e.g., "executor_localhost_2"), to support specifying the executor locations for an RDD.
2. Use the new preferred location format in `ReceiverTracker` to optimize the starting time of Receivers when there are multiple executors on a host.

The goal of this PR is to enable the streaming scheduler to place receivers (which run as tasks) on specific executors. Basically, I want to have more control over the placement of the receivers so that they are evenly distributed among the executors. We tried to do this without changing the core scheduling logic, but that does not allow specifying a particular executor as a preferred location, only the host level. So if there are two executors on the same host and I want two receivers to run on them (one on each executor), I cannot specify that. The current code only specifies the host as the preference, which may end up launching both receivers on the same executor. We tried to work around this by restarting a receiver when it did not launch on the desired executor, hoping that next time it would be started on the right one. But that causes lots of restarts and delays in correctly launching the receiver. So this change allows the streaming scheduler to specify the exact executor as the preferred location. This is not exposed to the user; only the streaming scheduler uses it.

Author: zsxwing <zsxwing@gmail.com> Closes #9181 from zsxwing/executor-location.
-
Burak Yavuz authored
Currently the Write Ahead Log in Spark Streaming flushes data as writes need to be made. S3 does not support flushing of data; data is written once the stream is actually closed. In case of failure, the data for the last minute (default rolling interval) will not be properly written. Therefore we need a flag to close the stream after the write, so that we achieve read-after-write consistency. cc tdas zsxwing Author: Burak Yavuz <brkyvz@gmail.com> Closes #9285 from brkyvz/caw-wal.
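A hedged sketch of enabling such a flag; the configuration key names below are an assumption for illustration, not confirmed by this log:

```scala
import org.apache.spark.SparkConf

// assumed flag names: close the WAL stream after each write, for read-after-write
// consistency on stores like S3 that only persist data on close
val conf = new SparkConf()
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
```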
-
vectorijk authored
Implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for the Python API in pyspark/ml/{classification, regression}.py. Author: vectorijk <jiangkai@gmail.com> Closes #9233 from vectorijk/spark-10024.
-
Michael Armbrust authored
This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.

```scala
case class ClassData(a: String, b: Int)

val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
val ds2 = Seq(("a", 1), ("b", 2)).toDS()

> ds1.joinWith(ds2, $"_1" === $"a").collect()
res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
```

This operation is similar to the relational `join` function with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`. This type of join can be useful both for preserving type-safety with the original object types as well as for working with relational data where either side of the join has column names in common.

## Required Changes to Encoders

In the process of working on this patch, several deficiencies in the way we were handling encoders were discovered. Specifically, it turned out to be very difficult to `rebind` the non-expression based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples). As a result, the following changes were made:

- `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types. Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
- All internal operations on `Dataset`s now require an `ExpressionEncoder`. If a user tries to pass a non-`ExpressionEncoder` in, an error will be thrown. We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping. This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
- Additionally, the semantics for different types of objects are now better defined. As stated in the `ExpressionEncoder` scaladoc:
  - Classes will have their subfields extracted by name using `UnresolvedAttribute` expressions and `UnresolvedExtractValue` expressions.
  - Tuples will have their subfields extracted by position using `BoundReference` expressions.
  - Primitives will have their values extracted from the first ordinal with a schema that defaults to the name `value`.
- Finally, the binding lifecycle for `Encoders` has now been unified across the codebase. Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`. This process replaces unresolved expressions with concrete `AttributeReference` expressions. Binding then happens on demand, when an encoder is going to be used to construct an object. This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries.

Author: Michael Armbrust <michael@databricks.com> Closes #9300 from marmbrus/datasets-tuples.
-
Mike Dusenberry authored
This PR adds addition and multiplication to PySpark's `BlockMatrix` class via `add` and `multiply` functions. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9139 from dusenberrymw/SPARK-6488_Add_Addition_and_Multiplication_to_PySpark_BlockMatrix.
-
Kay Ousterhout authored
This commit fixes a bug where, in Standalone mode, if a task fails and crashes the JVM, the failure is considered a "normal failure" (meaning it's considered unrelated to the task), so the failure isn't counted against the task's maximum number of failures: https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138. As a result, if a task fails in a way that results in it crashing the JVM, it will continuously be re-launched, resulting in a hang. This commit fixes that problem. This bug was introduced by #8007; andrewor14 mccheah vanzin can you take a look at this? This error is hard to trigger because we handle executor losses through 2 code paths (the second is via Akka, where Akka notices that the executor endpoint is disconnected). In my setup, the Akka code path completes first, and doesn't have this bug, so things work fine (see my recent email to the dev list about this). If I manually disable the Akka code path, I can see the hang (and this commit fixes the issue). Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9273 from kayousterhout/SPARK-11306.
-
Yanbo Liang authored
When sampling and then filtering a DataFrame, the SQL Optimizer will push the filter down into the sample and produce a wrong result. This is because the sampler is computed based on the original scope rather than the scope after filtering. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9294 from yanboliang/spark-11303.
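A minimal sketch of the affected pattern (assuming a `sqlContext`):

```scala
val df       = sqlContext.range(0, 100).toDF("id")
val sampled  = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
val filtered = sampled.filter("id > 50")
// if the Filter were pushed below the Sample, the fraction would be applied
// to the already-filtered rows, changing the sampled population
```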
-
Jia Li authored
I'm new to Spark. I was trying out the sort_array function and hit this exception. I looked into the Spark source code and found the root cause: sort_array does not check for an array of NULLs. It's not meaningful to sort an array of entirely NULLs anyway. I'm adding a check on the input array type to SortArray: if the array consists entirely of NULLs, there is no need to sort it. I have also added a test case for this. Please help to review my fix. Thanks! Author: Jia Li <jiali@us.ibm.com> Closes #9247 from jliwork/SPARK-11277.
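A hedged sketch of the triggering input:

```scala
// the problematic case: sorting an array consisting entirely of NULLs
sqlContext.sql("SELECT sort_array(array(null, null))").show()
```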
-
maxwell authored
When using the Kafka DirectStream API to create a checkpoint and restoring the saved checkpoint on restart, a ClassNotFoundException occurs. The reason for this error is that ObjectInputStreamWithLoader extends the ObjectInputStream class and overrides its resolveClass method, but instead of using Class.forName(desc, false, loader), Spark uses loader.loadClass(desc) to instantiate the class, which does not work with array classes. For example: Class.forName("[Lorg.apache.spark.streaming.kafka.OffsetRange;", false, loader) works well, while loader.loadClass("[Lorg.apache.spark.streaming.kafka.OffsetRange;") would throw a ClassNotFoundException. Details of the difference between Class.forName and loader.loadClass can be found at http://bugs.java.com/view_bug.do?bug_id=6446627. Author: maxwell <maxwellzdm@gmail.com> Author: DEMING ZHU <deming.zhu@linecorp.com> Closes #8955 from maxwellzdm/master.
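A short illustration of the JDK behavior behind the fix (requires the Kafka classes on the classpath):

```scala
// array classes resolve via Class.forName but not via ClassLoader.loadClass
// (see JDK bug 6446627)
val loader = Thread.currentThread().getContextClassLoader
Class.forName("[Lorg.apache.spark.streaming.kafka.OffsetRange;", false, loader) // ok
loader.loadClass("[Lorg.apache.spark.streaming.kafka.OffsetRange;")             // ClassNotFoundException
```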
-
Nick Evans authored
[SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API

jerryshao tdas I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise. Instead of doing something like:

```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```

You can do something like:

```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```

Before:

```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```

After:

```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```

I couldn't find any tests - am I missing something? Author: Nick Evans <me@nicolasevans.org> Closes #9236 from manygrams/topic_and_partition_equality.
-
Sem Mulder authored
The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class objects as keys, which results in strong references to the Class objects. If these classes are dynamically created, this prevents the corresponding ClassLoader from being GCed, leading to PermGen exhaustion. We use a map with weak keys to prevent this issue. Author: Sem Mulder <sem.mulder@site2mobile.com> Closes #9244 from SemMulder/fix-sizeestimator-classunloading.
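A hedged sketch of the approach using Guava's `MapMaker`; whether the patch uses exactly this construction is an assumption:

```scala
import com.google.common.collect.MapMaker

// weak keys let Class objects (and hence their ClassLoaders) be collected
val classInfos = new MapMaker().weakKeys().makeMap[Class[_], AnyRef]()
```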
-
Xusen Yin authored
mengxr https://issues.apache.org/jira/browse/SPARK-11297 Add new code tags to keep the same look and feel as previous documents. Author: Xusen Yin <yinxusen@gmail.com> Closes #9265 from yinxusen/SPARK-11297.
-
Reza Zadeh authored
Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix. With a test. Author: Reza Zadeh <reza@databricks.com> Closes #8792 from rezazadeh/colsims.
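A minimal usage sketch, assuming an active SparkContext `sc`:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))))
val mat  = new IndexedRowMatrix(rows)
val sims = mat.columnSimilarities() // upper-triangular CoordinateMatrix of cosine similarities
```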
-
- Oct 26, 2015
-
Sean Owen authored
Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier Author: Sean Owen <sowen@cloudera.com> Closes #9169 from srowen/SPARK-11184.
-
noelsmith authored
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark). Author: noelsmith <mail@noelsmith.com> Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
-
Xusen Yin authored
mengxr https://issues.apache.org/jira/browse/SPARK-11289 I made some changes to the ML feature extractors, i.e. TF-IDF, Word2Vec, and CountVectorizer, and added new example code in spark/examples; I hope it is the right place to add those examples. Author: Xusen Yin <yinxusen@gmail.com> Closes #9266 from yinxusen/SPARK-11289.
-
Wenchen Fan authored
https://issues.apache.org/jira/browse/SPARK-10562 Author: Wenchen Fan <wenchen@databricks.com> Closes #9226 from cloud-fan/par.
-
Sun Rui authored
Author: Sun Rui <rui.sun@intel.com> Closes #9193 from sun-rui/SPARK-11209.
-
Stephen De Gennaro authored
[SPARK-10947] [SQL] With schema inference from JSON into a DataFrame, add option to infer all primitive object types as strings

Currently, when a schema is inferred from a JSON file using sqlContext.read.json, the primitive object types are inferred as string, long, boolean, etc. However, if the inferred type is too specific (JSON obviously does not enforce types itself), this can cause issues with merging DataFrame schemas. This pull request adds the option "primitivesAsString" to the JSON DataFrameReader, which when true (defaults to false if not set) will infer all primitives as strings. Below is an example usage of this new functionality.

```
val jsonDf = sqlContext.read.option("primitivesAsString", "true").json(sampleJsonFile)

scala> jsonDf.printSchema()
root
 |-- bigInteger: string (nullable = true)
 |-- boolean: string (nullable = true)
 |-- double: string (nullable = true)
 |-- integer: string (nullable = true)
 |-- long: string (nullable = true)
 |-- null: string (nullable = true)
 |-- string: string (nullable = true)
```

Author: Stephen De Gennaro <stepheng@realitymine.com> Closes #9249 from stephend-realitymine/stephend-primitives.
-
Nong Li authored
Author: Nong Li <nongli@gmail.com> Closes #9286 from nongli/spark-11325.
-
Alexander Slesarenko authored
rxin just noticed this while reading the code. Author: Alexander Slesarenko <avslesarenko@gmail.com> Closes #9284 from aslesarenko/doc-typos.
-