Commits · 08698ee1d6f29b2c999416f18a074d5193cdacd5 · cs525-sp18-g07 / spark

Oct 16, 2015

[SPARK-11094] Strip extra strings from Java version in test runner · 08698ee1

Jakob Odersky authored 9 years ago

Removes any extra strings from the Java version, fixing subsequent integer parsing.
This is required since some OpenJDK versions (specifically in Debian testing), append an extra "-internal" string to the version field.

Author: Jakob Odersky <jodersky@gmail.com>

Closes #9111 from jodersky/fixtestrunner.

08698ee1

[SPARK-11092] [DOCS] Add source links to scaladoc generation · ed775042

Jakob Odersky authored 9 years ago

Modify the SBT build script to include GitHub source links for generated Scaladocs, on releases only (no snapshots).

Author: Jakob Odersky <jodersky@gmail.com>

Closes #9110 from jodersky/unidoc.

ed775042

[SPARK-11060] [STREAMING] Fix some potential NPE in DStream transformation · 43f5d1f3

jerryshao authored 9 years ago

This patch fixes:

1. Guard out against NPEs in `TransformedDStream` when parent DStream returns None instead of empty RDD.
2. Verify some input streams which will potentially return None.
3. Add unit test to verify the behavior when input stream returns None.

cc tdas , please help to review, thanks a lot :).

Author: jerryshao <sshao@hortonworks.com>

Closes #9070 from jerryshao/SPARK-11060.

43f5d1f3

Oct 15, 2015

[SPARK-11135] [SQL] Exchange incorrectly skips sorts when existing ordering is... · eb0b4d6e

Josh Rosen authored 9 years ago

[SPARK-11135] [SQL] Exchange incorrectly skips sorts when existing ordering is non-empty subset of required ordering

In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange.

However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however.

This bug was introduced in https://github.com/apache/spark/pull/7458, so it affects 1.5.0+.

This patch fixes the bug and significantly improves the unit test coverage of Exchange's sort-planning logic.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9140 from JoshRosen/SPARK-11135.

eb0b4d6e

[SPARK-10412] [SQL] report memory usage for tungsten sql physical operator · 6a2359ff

Wenchen Fan authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-10412

some screenshots:
### aggregate:
![screen shot 2015-10-12 at 2 23 11 pm](https://cloud.githubusercontent.com/assets/3182036/10439534/618320a4-70ef-11e5-94d8-62ea7f2d1531.png)

### join
![screen shot 2015-10-12 at 2 23 29 pm](https://cloud.githubusercontent.com/assets/3182036/10439537/6724797c-70ef-11e5-8f75-0cf5cbd42048.png)

Author: Wenchen Fan <wenchen@databricks.com>
Author: Wenchen Fan <cloud0fan@163.com>

Closes #8931 from cloud-fan/viz.

6a2359ff

[SPARK-11078] Ensure spilling tests actually spill · 3b364ff0

Andrew Or authored 9 years ago

#9084 uncovered that many tests that test spilling don't actually spill. This is a follow-up patch to fix that to ensure our unit tests actually catch potential bugs in spilling. The size of this patch is inflated by the refactoring of `ExternalSorterSuite`, which had a lot of duplicate code and logic.

Author: Andrew Or <andrew@databricks.com>

Closes #9124 from andrewor14/spilling-tests.

3b364ff0

[SPARK-10515] When killing executor, the pending replacement executors should not be lost · 2d000124

KaiXinXiaoLei authored 9 years ago

If the heartbeat receiver kills executors (and new ones are not registered to replace them), the idle timeout for the old executors will be lost (and then change a total number of executors requested by Driver), So new ones will be not to asked to replace them.
For example, executorsPendingToRemove=Set(1), and executor 2 is idle timeout before a new executor is asked to replace executor 1. Then driver kill executor 2, and sending RequestExecutors to AM. But executorsPendingToRemove=Set(1,2), So AM doesn't allocate a executor to replace 1.

see: https://github.com/apache/spark/pull/8668

Author: KaiXinXiaoLei <huleilei1@huawei.com>
Author: huleilei <huleilei1@huawei.com>

Closes #8945 from KaiXinXiaoLei/pendingexecutor.

2d000124

fix typo bellow -> below · 723aa75a

Britta Weber authored 9 years ago

Author: Britta Weber <britta.weber@elasticsearch.com>

Closes #9136 from brwe/typo-bellow.

723aa75a

[SPARK-11071] [LAUNCHER] Fix flakiness in LauncherServerSuite::timeout. · a5719804

Marcelo Vanzin authored 9 years ago

The test could fail depending on scheduling of the various threads
involved; the change removes some sources of races, while making the
test a little more resilient by trying a few times before giving up.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9079 from vanzin/SPARK-11071.

a5719804

[SPARK-11039][Documentation][Web UI] Document additional ui configurations · b591de7c

Nick Pritchard authored 9 years ago

Add documentation for configuration:
- spark.sql.ui.retainedExecutions
- spark.streaming.ui.retainedBatches

Author: Nick Pritchard <nicholas.pritchard@falkonry.com>

Closes #9052 from pnpritchard/SPARK-11039.

b591de7c

[SPARK-11047] Internal accumulators miss the internal flag when replaying... · d45a0d3c

Carson Wang authored 9 years ago

[SPARK-11047] Internal accumulators miss the internal flag when replaying events in the history server

Internal accumulators don't write the internal flag to event log. So on the history server Web UI, all accumulators are not internal. This causes incorrect peak execution memory and unwanted accumulator table displayed on the stage page.
To fix it, I add the "internal" property of AccumulableInfo when writing the event log.

Author: Carson Wang <carson.wang@intel.com>

Closes #9061 from carsonwang/accumulableBug.

d45a0d3c

[SPARK-11066] Update DAGScheduler's "misbehaved ResultHandler" · 523adc24

shellberg authored 9 years ago

Restrict tasks (of job) to only 1 to ensure that the causing Exception asserted for job failure is the deliberately thrown DAGSchedulerSuiteDummyException intended, not an UnsupportedOperationException from any second/subsequent tasks that can propagate from a race condition during code execution.

Author: shellberg <sah@zepler.org>

Closes #9076 from shellberg/shellberg-DAGSchedulerSuite-misbehavedResultHandlerTest-patch-1.

523adc24

[SPARK-11099] [SPARK SHELL] [SPARK SUBMIT] Default conf property file i… · aec4400b
Jeff Zhang authored 9 years ago
```
Please help review it. Thanks

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9114 from zjffdu/SPARK-11099.
```
aec4400b

[SPARK-11093] [CORE] ChildFirstURLClassLoader#getResources should return all... · 0f62c228

Adam Lewandowski authored 9 years ago

[SPARK-11093] [CORE] ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

Author: Adam Lewandowski <alewandowski@ipcoop.com>

Closes #9106 from alewando/childFirstFix.

0f62c228

Oct 14, 2015

[SPARK-11076] [SQL] Add decimal support for floor and ceil · 9808052b

Cheng Hao authored 9 years ago

Actually all of the `UnaryMathExpression` doens't support the Decimal, will create follow ups for supporing it. This is the first PR which will be good to review the approach I am taking.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #9086 from chenghao-intel/ceiling.

9808052b

[SPARK-11017] [SQL] Support ImperativeAggregates in TungstenAggregate · 4ace4f8a

Josh Rosen authored 9 years ago

This patch extends TungstenAggregate to support ImperativeAggregate functions. The existing TungstenAggregate operator only supported DeclarativeAggregate functions, which are defined in terms of Catalyst expressions and can be evaluated via generated projections. ImperativeAggregate functions, on the other hand, are evaluated by calling their `initialize`, `update`, `merge`, and `eval` methods.

The basic strategy here is similar to how SortBasedAggregate evaluates both types of aggregate functions: use a generated projection to evaluate the expression-based declarative aggregates with dummy placeholder expressions inserted in place of the imperative aggregate function output, then invoke the imperative aggregate functions and target them against the aggregation buffer. The bulk of the diff here consists of code that was copied and adapted from SortBasedAggregate, with some key changes to handle TungstenAggregate's sort fallback path.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9038 from JoshRosen/support-interpreted-in-tungsten-agg-final.

4ace4f8a

[SPARK-10829] [SQL] Filter combine partition key and attribute doesn't work in DataSource scan · 1baaf2b9

Cheng Hao authored 9 years ago

```scala
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
      withTempPath { dir =>
        val path = s"${dir.getCanonicalPath}/part=1"
        (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

        // If the "part = 1" filter gets pushed down, this query will throw an exception since
        // "part" is not a valid column in the actual Parquet file
        checkAnswer(
          sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
          (2 to 3).map(i => Row(i, i.toString, 1)))
      }
    }
```

We expect the result to be:
```
2,1
3,1
```
But got
```
1,1
2,1
3,1
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8916 from chenghao-intel/partition_filter.

1baaf2b9

[SPARK-11113] [SQL] Remove DeveloperApi annotation from private classes. · 2b5e31c7

Reynold Xin authored 9 years ago

o.a.s.sql.catalyst and o.a.s.sql.execution are supposed to be private.

Author: Reynold Xin <rxin@databricks.com>

Closes #9121 from rxin/SPARK-11113.

2b5e31c7

[SPARK-10104] [SQL] Consolidate different forms of table identifiers · 56d7da14

Wenchen Fan authored 9 years ago

Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to represent table identifiers. We should only have one form and TableIdentifier is the best one because it provides methods to get table name, database name, return unquoted string, and return quoted string.

Author: Wenchen Fan <wenchen@databricks.com>
Author: Wenchen Fan <cloud0fan@163.com>

Closes #8453 from cloud-fan/table-name.

56d7da14

[SPARK-11068] [SQL] [FOLLOW-UP] move execution listener to util · 9a430a02
Wenchen Fan authored 9 years ago
```
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9119 from cloud-fan/callback.
```
9a430a02

[SPARK-11096] Post-hoc review Netty based RPC implementation - round 2 · cf2e0ae7

Reynold Xin authored 9 years ago

A few more changes:

1. Renamed IDVerifier -> RpcEndpointVerifier
2. Renamed NettyRpcAddress -> RpcEndpointAddress
3. Simplified NettyRpcHandler a bit by removing the connection count tracking. This is OK because I now force spark.shuffle.io.numConnectionsPerPeer to 1
4. Reduced spark.rpc.connect.threads to 64. It would be great to eventually remove this extra thread pool.
5. Minor cleanup & documentation.

Author: Reynold Xin <rxin@databricks.com>

Closes #9112 from rxin/SPARK-11096.

cf2e0ae7

[SPARK-10973] · 615cc858

Reynold Xin authored 9 years ago

Close #9064
Close #9063
Close #9062

These pull requests were merged into branch-1.5, branch-1.4, and branch-1.3.

615cc858

[SPARK-8386] [SQL] add write.mode for insertIntoJDBC when the parm overwrite is false · 7e1308d3

Huaxin Gao authored 9 years ago

the fix is for jira https://issues.apache.org/jira/browse/SPARK-8386

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #9042 from huaxingao/spark8386.

7e1308d3

[SPARK-11040] [NETWORK] Make sure SASL handler delegates all events. · 31f31598
Marcelo Vanzin authored 9 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9053 from vanzin/SPARK-11040.
```
31f31598

[SPARK-10619] Can't sort columns on Executor Page · 135a2ce5

Tom Graves authored 9 years ago

should pick into spark 1.5.2 also.

https://issues.apache.org/jira/browse/SPARK-10619

looks like this was broken by commit: https://github.com/apache/spark/commit/fb1d06fc242ec00320f1a3049673fbb03c4a6eb9#diff-b8adb646ef90f616c34eb5c98d1ebd16
It looks like somethings were change to use the UIUtils.listingTable but executor page wasn't converted so when it removed sortable from the UIUtils. TABLE_CLASS_NOT_STRIPED it broke this page.

Simply add the sortable tag back in and it fixes both active UI and the history server UI.

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #9101 from tgravescs/SPARK-10619.

135a2ce5

[SPARK-10996] [SPARKR] Implement sampleBy() in DataFrameStatFunctions. · 390b22fa
Sun Rui authored 9 years ago
```
Author: Sun Rui <rui.sun@intel.com>

Closes #9023 from sun-rui/SPARK-10996.
```
390b22fa

[SPARK-10981] [SPARKR] SparkR Join improvements · 8b328857

Monica Liu authored 9 years ago

I was having issues with collect() and orderBy() in Spark 1.5.0 so I used the DataFrame.R file and test_sparkSQL.R file from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi" and added corresponding test cases in the test for join() and merge() in test_sparkSQL.R file.
Pull request because I filed this JIRA bug report:
https://issues.apache.org/jira/browse/SPARK-10981

Author: Monica Liu <liu.monica.f@gmail.com>

Closes #9029 from mfliu/master.

8b328857

Oct 13, 2015

[SPARK-11091] [SQL] Change spark.sql.canonicalizeView to spark.sql.nativeView. · ce3f9a80

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11091

Author: Yin Huai <yhuai@databricks.com>

Closes #9103 from yhuai/SPARK-11091.

ce3f9a80

[SPARK-11068] [SQL] add callback to query execution · 15ff85b3

Wenchen Fan authored 9 years ago

With this feature, we can track the query plan, time cost, exception during query execution for spark users.

Author: Wenchen Fan <cloud0fan@163.com>

Closes #9078 from cloud-fan/callback.

15ff85b3

[SPARK-11032] [SQL] correctly handle having · e170c221

Wenchen Fan authored 9 years ago

We should not stop resolving having when the having condtion is resolved, or something like `count(1)` will crash.

Author: Wenchen Fan <cloud0fan@163.com>

Closes #9105 from cloud-fan/having.

e170c221

[SPARK-11090] [SQL] Constructor for Product types from InternalRow · 328d1b3e

Michael Armbrust authored 9 years ago

This is a first draft of the ability to construct expressions that will take a catalyst internal row and construct a Product (case class or tuple) that has fields with the correct names.  Support include:
 - Nested classes
 - Maps
 - Efficiently handling of arrays of primitive types

Not yet supported:
 - Case classes that require custom collection types (i.e. List instead of Seq).

Author: Michael Armbrust <michael@databricks.com>

Closes #9100 from marmbrus/productContructor.

328d1b3e

[SPARK-11059] [ML] Change range of quantile probabilities in AFTSurvivalRegression · 3889b1c7

vectorijk authored 9 years ago

Value of the quantile probabilities array should be in the range (0, 1) instead of [0,1]
in `AFTSurvivalRegression.scala` according to [Discussion] (https://github.com/apache/spark/pull/8926#discussion-diff-40698242)

Author: vectorijk <jiangkai@gmail.com>

Closes #9083 from vectorijk/spark-11059.

3889b1c7

[SPARK-10932] [PROJECT INFRA] Port two minor changes to release-build.sh from scripts' old repo · d0482f6a

Josh Rosen authored 9 years ago

Spark's release packaging scripts used to live in a separate repository. Although these scripts are now part of the Spark repo, there are some minor patches made against the old repos that are missing in Spark's copy of the script. This PR ports those changes.

/cc shivaram, who originally submitted these changes against https://github.com/rxin/spark-utils

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8986 from JoshRosen/port-release-build-fixes-from-rxin-repo.

d0482f6a

[SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisions · ef72673b

Josh Rosen authored 9 years ago

In the current implementation of named expressions' `ExprIds`, we rely on a per-JVM AtomicLong to ensure that expression ids are unique within a JVM. However, these expression ids will not be _globally_ unique. This opens the potential for id collisions if new expression ids happen to be created inside of tasks rather than on the driver.

There are currently a few cases where tasks allocate expression ids, which happen to be safe because those expressions are never compared to expressions created on the driver. In order to guard against the introduction of invalid comparisons between driver-created and executor-created expression ids, this patch extends `ExprId` to incorporate a UUID to identify the JVM that created the id, which prevents collisions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9093 from JoshRosen/SPARK-11080.

ef72673b

[SPARK-11052] Spaces in the build dir causes failures in the build/mv… · 0d1b73b7
trystanleftwich authored 9 years ago
```
…n script

Author: trystanleftwich <trystan@atscale.com>

Closes #9065 from trystanleftwich/SPARK-11052.
```
0d1b73b7

[SPARK-10983] Unified memory manager · b3ffac51

Andrew Or authored 9 years ago

This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:

- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.

- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.

- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.

For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.

Author: Andrew Or <andrew@databricks.com>

Closes #9084 from andrewor14/unified-memory-manager.

b3ffac51

[SPARK-7402] [ML] JSON SerDe for standard param types · 2b574f52

Xiangrui Meng authored 9 years ago

This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementation of `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9090 from mengxr/SPARK-7402.

2b574f52

[PYTHON] [MINOR] List modules in PySpark tests when given bad name · c75f058b

Joseph K. Bradley authored 9 years ago

Output list of supported modules for python tests in error message when given bad module name.

CC: davies

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9088 from jkbradley/python-tests-modules.

c75f058b

[SPARK-10913] [SPARKR] attach() function support · f7f28ee7

Adrian Zhuang authored 9 years ago

Bring the change code up to date.

Author: Adrian Zhuang <adrian555@users.noreply.github.com>
Author: adrian555 <wzhuang@us.ibm.com>

Closes #9031 from adrian555/attach2.

f7f28ee7

[SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame · 1e0aba90

Narine Kokhlikyan authored 9 years ago

as.DataFrame is more a R-style like signature.
Also, I'd like to know if we could make the context, e.g. sqlContext global, so that we do not have to specify it as an argument, when we each time create a dataframe.

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #8952 from NarineK/sparkrasDataFrame.

1e0aba90