Commits · 73b70f076d4e22396b7e145f2ce5974fbf788048 · cs525-sp18-g07 / spark

Dec 28, 2015

[SPARK-12517] add default RDD name for one created via sc.textFile · 73b70f07

Yaron Weinsberg authored 9 years ago

The feature was first added at commit: 7b877b27 but was later removed (probably by mistake) at commit: fc8b5819.
This change sets the default name of RDDs created via sc.textFile(...) to the path argument.

Here is the symptom:

* Using spark-1.5.2-bin-hadoop2.6:

scala> sc.textFile("/home/root/.bashrc").name
res5: String = null

scala> sc.binaryFiles("/home/root/.bashrc").name
res6: String = /home/root/.bashrc

* while using Spark 1.3.1:

scala> sc.textFile("/home/root/.bashrc").name
res0: String = /home/root/.bashrc

scala> sc.binaryFiles("/home/root/.bashrc").name
res1: String = /home/root/.bashrc

Author: Yaron Weinsberg <wyaron@gmail.com>
Author: yaron <yaron@il.ibm.com>

Closes #10456 from wyaron/master.

73b70f07

[SPARK-12231][SQL] create a combineFilters' projection when we call buildPartitionedTableScan · fd50df41

Kevin Yu authored 9 years ago

Hello Michael & All:

We had some issues submitting the new code in the other PR (#10299), so we closed that PR and opened this one with the fix.

The reason for the previous failure is that the projection for the scan when there is a filter that is not pushed down (the "left-over" filter) could be different, in elements or ordering, from the original projection.

With the new code, the approach to solving this problem is:

Insert a new Project if the "left-over" filter is nonempty and (the original projection is not empty and the projection for the scan has more than one element, which could otherwise cause a different ordering in the projection).
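The condition above can be written as a small predicate. This is only a sketch in plain Python, not the planner code: the function name `needs_extra_project` and the list-based arguments are hypothetical and chosen for illustration.

```python
def needs_extra_project(leftover_filters, original_projection, scan_projection):
    """Decide whether an extra Project should be inserted above the scan.

    Mirrors the rule stated in the commit message: the left-over filter is
    nonempty, the original projection is not empty, and the projection used
    for the scan has more than one element.
    """
    return (
        len(leftover_filters) > 0
        and len(original_projection) > 0
        and len(scan_projection) > 1
    )


# Hypothetical inputs: one filter that could not be pushed down, and a scan
# projection that had to include the filter's column as well.
with_leftover = needs_extra_project(["b > 1"], ["a"], ["a", "b"])
without_leftover = needs_extra_project([], ["a"], ["a"])
```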

We added 3 test cases to cover the cases that would otherwise fail.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #10388 from kevinyu98/spark-12231.

fd50df41

[HOT-FIX] bypass hive test when parse logical plan to json · 8543997f

Wenchen Fan authored 9 years ago

https://github.com/apache/spark/pull/10311 introduces some rare, non-deterministic flakiness for hive udf tests, see https://github.com/apache/spark/pull/10311#issuecomment-166548851

I can't reproduce it locally and may need more time to investigate. A quick solution is to bypass Hive tests for JSON serialization.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10430 from cloud-fan/hot-fix.

8543997f

[SPARK-12508][PROJECT-INFRA] Fix minor bugs in dev/tests/pr_public_classes.sh script · ab6bedd8

Josh Rosen authored 9 years ago

This patch fixes a handful of minor bugs in the `dev/tests/pr_public_classes.sh` script, which is used by the `run_tests_jenkins` script to detect the addition of new public classes:

- Account for differences between BSD and GNU `sed` in order to allow the script to run on OS X.
- Diff `$ghprbActualCommit^...$ghprbActualCommit` instead of `master...$ghprbActualCommit`: since `ghprbActualCommit` is a merge commit which results from merging the PR into the target branch, this will give us the desired diff and will avoid certain race conditions which could lead to false positives.
- Use `echo -e` instead of `echo` so that newline characters are handled correctly in output. This should fix a formatting glitch which caused the output to appear on a single line in the GitHub comment (see [the SC2028 page](https://github.com/koalaman/shellcheck/wiki/SC2028) on the Shellcheck wiki for more details).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10455 from JoshRosen/fix-pr-public-classes-test.

ab6bedd8

[SPARK-12218] Fixes ORC conjunction predicate push down · 8e23d8db

Cheng Lian authored 9 years ago

This PR is a follow-up of PR #10362.

Two major changes:

1. The fix introduced in #10362 is OK for Parquet, but may disable ORC PPD in many cases

PR #10362 stops converting an `AND` predicate if any branch is inconvertible. On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into ORC `SearchArgument`. This means, if any filter is inconvertible, no filters can be pushed down. This PR fixes this issue by finding out all convertible filters first before doing the actual conversion.

The current implementation is shaped mostly by limitations of the ORC `SearchArgument` builder, which are documented in detail in this PR.

2. Copied the `AND` predicate fix for ORC from #10362 to avoid merge conflicts.
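The first change (finding all convertible filters before building the conjunction) can be sketched in plain Python. This is a hypothetical model, not the `OrcFilters` code: the tuple-based filter representation, `SUPPORTED_OPS`, and `convert` are assumptions made for illustration. It also assumes that pushing down only a subset of the conjuncts is acceptable because the remaining filters are evaluated again after the scan.

```python
from functools import reduce

# Assumption: the set of operators the target format can evaluate.
SUPPORTED_OPS = {"=", "<", ">"}


def convert(single_filter):
    """Return a pushed-down form of one filter, or None if it is inconvertible."""
    op, column, value = single_filter
    if op not in SUPPORTED_OPS:
        return None
    return f"{column} {op} {value!r}"


def build_conjunction(filters):
    """Keep only the convertible filters, then AND them together."""
    converted = [c for c in map(convert, filters) if c is not None]
    if not converted:
        return None  # nothing can be pushed down
    return reduce(lambda left, right: f"({left} AND {right})", converted)


# The "like" filter is inconvertible here, but it no longer blocks the others.
filters = [("=", "a", 1), ("like", "b", "x%"), ("<", "c", 10)]
pushed = build_conjunction(filters)
print(pushed)
```

Converting the whole conjunction in one step would instead yield nothing as soon as a single branch failed to convert, which is the behavior described for the earlier approach.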

Same as #10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5.

Author: Cheng Lian <lian@databricks.com>

Closes #10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.

8e23d8db

[SPARK-12353][STREAMING][PYSPARK] Fix countByValue inconsistent output in Python API · 8d494009

jerryshao authored 9 years ago

The semantics of Python's countByValue differ from the Scala API: it behaves more like countDistinctValue. This change makes it consistent with the Scala/Java API.
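The difference between the two semantics can be illustrated without Spark. The sketch below uses plain Python lists in place of a stream batch; the function names are hypothetical and only model the two behaviors described above.

```python
from collections import Counter


def count_by_value(elements):
    """Count the occurrences of each distinct value (Scala/Java-style result)."""
    return dict(Counter(elements))


def count_distinct_values(elements):
    """Count how many distinct values exist (what the old behavior resembled)."""
    return len(set(elements))


batch = ["a", "b", "a", "c", "a"]
per_value = count_by_value(batch)
distinct = count_distinct_values(batch)
```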

Author: jerryshao <sshao@hortonworks.com>

Closes #10350 from jerryshao/SPARK-12353.

8d494009

[SPARK-12515][SQL][DOC] minor doc update for read.jdbc · 5aa2710c

felixcheung authored 9 years ago

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10465 from felixcheung/dfreaderjdbcdoc.

5aa2710c

[SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join · 9ab296ec

gatorsmile authored 9 years ago

After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double checked the code.

For example, users can do the Equi-Join like
  ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
- There is a bug in 1.5 and 1.4: the code ignores the third parameter (join type) that users pass. The join actually performed is `Inner`, even if the user specifies another type (e.g., `Outer`).
- After PR https://github.com/apache/spark/pull/8600, 1.6 no longer has this issue, but the description has not been updated.

Plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using Equi-Join.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10477 from gatorsmile/pyOuterJoin.

9ab296ec

Dec 25, 2015

[SPARK-12396][CORE] Modify the function scheduleAtFixedRate to schedule. · 1e978139

echo2mei authored 9 years ago

Instead of just cancelling the registrationRetryTimer to keep the driver from retrying its connection to the master, change the function to schedule.
There is no need to register with the master repeatedly.

Author: echo2mei <534384876@qq.com>

Closes #10447 from echoTomei/master.

1e978139

Dec 24, 2015

[SPARK-12440][CORE] Avoid setCheckpoint warning when directory is not local · ea4aab7e

pierre-borckmans authored 9 years ago

In SparkContext method `setCheckpointDir`, a warning is issued when spark master is not local and the passed directory for the checkpoint dir appears to be local.

In practice, when relying on HDFS configuration file and using a relative path for the checkpoint directory (using an incomplete URI without HDFS scheme, ...), this warning should not be issued and might be confusing.
In fact, in this case, the checkpoint directory is successfully created, and the checkpointing mechanism works as expected.

This PR uses the `FileSystem` instance created with the given directory, and checks whether it is local or not.
(The rationale is that this same `FileSystem` instance is used to create the checkpoint dir anyway and can therefore be reliably used to determine whether it is local.)

The warning is now only issued if that `FileSystem` is local, on top of the existing conditions.

Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #10392 from pierre-borckmans/SPARK-12440_CheckpointDir_Warning_NonLocal.

ea4aab7e

[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax · 502476e4

CK50 authored 9 years ago

In the past Spark JDBC write only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()):

INSERT INTO $table VALUES ( ?, ?, ..., ? )

But some technologies require a list of column names:

INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )

This was blocking the use of e.g. the Progress JDBC Driver for Cassandra.

Another limitation is that syntax 1 relies on the DataFrame field ordering matching that of the target table. This works fine as long as the target table has been created by writer.jdbc().

If the target table contains more columns (not created by writer.jdbc()), then the insert fails due to a mismatch in the number of columns or their data types.

This PR switches to the recommended second INSERT syntax. Column names are taken from the DataFrame field names.
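As a rough illustration of the second syntax, the statement can be assembled from the field names as follows. This is a plain-Python sketch, not the code in JdbcUtils.scala: the function `insert_statement` and the sample table and column names are hypothetical, and identifier quoting is deliberately left out.

```python
def insert_statement(table, column_names):
    """Build a parameterized INSERT that names its columns explicitly."""
    columns = ", ".join(column_names)
    # One "?" placeholder per column, in the same order as the column list.
    placeholders = ", ".join("?" for _ in column_names)
    return f"INSERT INTO {table} ( {columns} ) VALUES ( {placeholders} )"


sql = insert_statement("people", ["name", "age"])
print(sql)
```

Because the columns are named, the statement no longer depends on the field order matching the table's column order, and extra columns in the target table do not cause a column-count mismatch by themselves.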

Author: CK50 <christian.kurz@oracle.com>

Closes #10380 from CK50/master-SPARK-12010-2.

502476e4

[SPARK-12311][CORE] Restore previous value of "os.arch" property in test... · 39204661

Kazuaki Ishizaki authored 9 years ago

[SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property

Restore the original value of os.arch property after each test

Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10289 from kiszk/SPARK-12311.

39204661

[SPARK-12502][BUILD][PYTHON] Script /dev/run-tests fails when IBM Java is used · 9e85bb71

Kazuaki Ishizaki authored 9 years ago

Fix an exception with the IBM JDK by removing the update field from the JavaVersion tuple, because the IBM JDK does not report the update information ('_xx').
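A minimal sketch of version parsing that tolerates a missing update suffix is shown below. It is not the code from the test script: the function `parse_java_version`, the regular expression, and the sample version strings are assumptions made for illustration.

```python
import re
from collections import namedtuple

# Hypothetical tuple without an "update" field, mirroring the change described above.
JavaVersion = namedtuple("JavaVersion", ["major", "minor", "patch"])


def parse_java_version(version_output):
    """Extract major.minor.patch, ignoring any trailing '_xx' update suffix."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no Java version found in: {version_output!r}")
    return JavaVersion(*(int(part) for part in match.groups()))


# Assumed sample strings: one with an update suffix, one without.
with_update = parse_java_version('java version "1.7.0_79"')
without_update = parse_java_version('java version "1.7.0"')
```

Since the update part is never required, both styles of version string parse to the same tuple.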

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10463 from kiszk/SPARK-12502.

9e85bb71

Dec 23, 2015

[SPARK-12499][BUILD] don't force MAVEN_OPTS · ead6abf7

Adrian Bridgett authored 9 years ago

allow the user to override MAVEN_OPTS (2GB wasn't sufficient for me)

Author: Adrian Bridgett <adrian@smop.co.uk>

Closes #10448 from abridgett/feature/do_not_force_maven_opts.

ead6abf7

[SPARK-12500][CORE] Fix Tachyon deprecations; pull Tachyon dependency into one class · ae1f54aa

Sean Owen authored 9 years ago

Fix Tachyon deprecations; pull Tachyon dependency into `TachyonBlockManager` only

CC calvinjia as I probably need a double-check that the usage of the new API is correct.

Author: Sean Owen <sowen@cloudera.com>

Closes #10449 from srowen/SPARK-12500.

ae1f54aa

[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields · 43b2a639

pierre-borckmans authored 9 years ago

Accessing null elements in an array field fails when Tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.

This PR solves this by checking if the accessed element in the array field is null, in the generated code.

Example:
```
// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
```

With Tungsten disabled:
```
0 = [a]
1 = [null]
2 = [b]
```

With Tungsten enabled:
```
0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
```

Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.

43b2a639

[SPARK-11164][SQL] Add InSet pushdown filter back for Parquet · 50301c0a

Liang-Chi Hsieh authored 9 years ago

When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10278 from gatorsmile/parquetFilterNot.

50301c0a

Dec 22, 2015

[SPARK-12478][SQL] Bugfix: Dataset fields of product types can't be null · 86761e10

Cheng Lian authored 9 years ago

When creating extractors for product types (i.e. case classes and tuples), a null check is missing, thus we always assume input product values are non-null.

This PR adds a null check in the extractor expression for product types. The null check is stripped off for top level product fields, which are mapped to the outermost `Row`s, since they can't be null.

Thanks cloud-fan for helping investigate this issue!

Author: Cheng Lian <lian@databricks.com>

Closes #10431 from liancheng/spark-12478.top-level-null-field.

86761e10

[SPARK-12429][STREAMING][DOC] Add Accumulator and Broadcast example for Streaming · 20591afd

Shixiong Zhu authored 9 years ago

This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10385 from zsxwing/accumulator-broadcast-example.

20591afd

[SPARK-12487][STREAMING][DOCUMENT] Add docs for Kafka message handler · 93db50d1

Shixiong Zhu authored 9 years ago

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10439 from zsxwing/kafka-message-handler-doc.

93db50d1

[SPARK-12102][SQL] Cast a non-nullable struct field to a nullable field during analysis · b374a258

Dilip Biswal authored 9 years ago

Compare both the left and right sides of the case expression ignoring nullability when checking for type equality.
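Comparing two struct types while ignoring nullability can be modeled by normalizing every field to nullable before testing equality. The sketch below is plain Python with hypothetical `Field` and `Struct` classes; it is not the Catalyst implementation and does not handle nested types.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    nullable: bool


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


def as_nullable(struct):
    """Return a copy of the struct with every field marked nullable."""
    return Struct(tuple(replace(f, nullable=True) for f in struct.fields))


def same_type_ignoring_nullability(left, right):
    """Two structs match if they are equal once nullability is normalized."""
    return as_nullable(left) == as_nullable(right)


non_nullable = Struct((Field("a", "int", False),))
nullable = Struct((Field("a", "int", True),))
different = Struct((Field("a", "string", True),))
```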

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #10156 from dilipbiswal/spark-12102.

b374a258

[SPARK-12471][CORE] Spark daemons will log their pid on start up. · 575a1327

Nong Li authored 9 years ago

Author: Nong Li <nong@databricks.com>

Closes #10422 from nongli/12471-pids.

575a1327

Minor corrections, i.e. typo fixes and follow deprecated · 7c970f90

Jacek Laskowski authored 9 years ago

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10432 from jaceklaskowski/minor-corrections.

7c970f90

[SPARK-12456][SQL] Add ExpressionDescription to misc functions · b5ce84a1

Xiu Guo authored 9 years ago

First try, not sure how much information we need to provide in the usage part.

Author: Xiu Guo <xguo27@gmail.com>

Closes #10423 from xguo27/SPARK-12456.

b5ce84a1

[SPARK-12475][BUILD] Upgrade Zinc from 0.3.5.3 to 0.3.9 · bc0f30d0

Josh Rosen authored 9 years ago

We should update to the latest version of Zinc in order to match our SBT version.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10426 from JoshRosen/update-zinc.

bc0f30d0

[SPARK-11677][SQL][FOLLOW-UP] Add tests for checking the ORC filter creation... · 364d244a

hyukjinkwon authored 9 years ago

[SPARK-11677][SQL][FOLLOW-UP] Add tests for checking the ORC filter creation against pushed down filters.

https://issues.apache.org/jira/browse/SPARK-11677
Although the filters are checked indirectly by the number of results when ORC filter push-down is enabled, the filters themselves are not tested.
So, this PR adds tests similar to those in `ParquetFilterSuite`.
Since the results are checked by `OrcQuerySuite`, this `OrcFilterSuite` only checks whether the appropriate filters are created.

One difference from `ParquetFilterSuite` is that this suite does not check the results, because they are already checked in `OrcQuerySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #10341 from HyukjinKwon/SPARK-11677-followup.

364d244a

[SPARK-12371][SQL] Runtime nullability check for NewInstance · 42bfde29

Cheng Lian authored 9 years ago

This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime.

Author: Cheng Lian <lian@databricks.com>

Closes #10331 from liancheng/dataset-nullability-check.

42bfde29

[SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions · 8c1b867c

Takeshi YAMAMURO authored 9 years ago

No tests done for JDBCRDD#compileFilter.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10409 from maropu/AddTestsInJdbcRdd.

8c1b867c

[SPARK-12296][PYSPARK][MLLIB] Feature parity for pyspark mllib standard scaler model · 969d5665

Holden Karau authored 9 years ago

Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel.

Author: Holden Karau <holden@us.ibm.com>

Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.

969d5665

[SPARK-11823][SQL] Fix flaky JDBC cancellation test in HiveThriftBinaryServerSuite · 2235cd44

Josh Rosen authored 9 years ago

This patch fixes a flaky "test jdbc cancel" test in HiveThriftBinaryServerSuite. This test is prone to a race condition which causes it to block indefinitely while waiting for an extremely slow query to complete, which caused many Jenkins builds to time out.

For more background, see my comments on #6207 (the PR which introduced this test).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10425 from JoshRosen/SPARK-11823.

2235cd44

[MINOR] Fix typos in JavaStreamingContext · 93da8565

Shixiong Zhu authored 9 years ago

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10424 from zsxwing/typo.

93da8565

[SPARK-11807] Remove support for Hadoop < 2.2 · 0a38637d

Reynold Xin authored 9 years ago

i.e. Hadoop 1 and Hadoop 2.0

Author: Reynold Xin <rxin@databricks.com>

Closes #10404 from rxin/SPARK-11807.

0a38637d

Dec 21, 2015

[SPARK-12388] change default compression to lz4 · 29cecd4a

Davies Liu authored 9 years ago

According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.

After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).

[1] https://github.com/ning/jvm-compressor-benchmark/wiki

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #10342 from davies/lz4.

29cecd4a

[SPARK-12466] Fix harmless NPE in tests · d655d37d

Andrew Or authored 9 years ago

```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull

This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests).

Tested locally to verify that the NPE is gone.

Author: Andrew Or <andrew@databricks.com>

Closes #10417 from andrewor14/fix-harmless-npe.

d655d37d

[SPARK-2331] SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] · a820ca19

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #10394 from rxin/SPARK-2331.

a820ca19

[SPARK-12339][SPARK-11206][WEBUI] Added a null check that was removed in · b0849b8a

Alex Bozarth authored 9 years ago

Updates made in SPARK-11206 missed an edge case which causes a NullPointerException when a task is killed. In some cases when a task ends in failure, taskMetrics is initialized as null (see JobProgressListener.onTaskEnd()). To address this, a null check was added. Before the changes in SPARK-11206, this null check was performed at the start of the updateTaskAccumulatorValues() function.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #10405 from ajbozarth/spark12339.

b0849b8a

Doc typo: ltrim = trim from left end, not right · fc6dbcc7

pshearer authored 9 years ago

Author: pshearer <pshearer@massmutual.com>

Closes #10414 from pshearer/patch-1.

fc6dbcc7

[SPARK-5882][GRAPHX] Add a test for GraphLoader.edgeListFile · 1eb90bc9

Takeshi YAMAMURO authored 9 years ago

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #4674 from maropu/AddGraphLoaderSuite.

1eb90bc9

[SPARK-12392][CORE] Optimize a location order of broadcast blocks by... · 935f4663

Takeshi YAMAMURO authored 9 years ago

[SPARK-12392][CORE] Optimize a location order of broadcast blocks by considering preferred local hosts

When multiple workers exist in a host, we can bypass unnecessary remote access for broadcasts; block managers fetch broadcast blocks from the same host instead of remote hosts.
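The ordering idea can be sketched in plain Python: candidate locations on the local host are moved to the front so they are tried first. This is a hypothetical model, not the block manager code; the `(host, port)` tuples, the function `order_locations`, and the initial shuffle are assumptions made for illustration.

```python
import random


def order_locations(locations, local_host, rng=random):
    """Shuffle candidate locations, then move same-host ones to the front."""
    shuffled = list(locations)
    rng.shuffle(shuffled)
    # sorted() is stable, so the shuffled order is kept within each group:
    # key False (same host) sorts before key True (remote host).
    return sorted(shuffled, key=lambda location: location[0] != local_host)


locations = [("host-b", 7001), ("host-a", 7002), ("host-c", 7003), ("host-a", 7004)]
ordered = order_locations(locations, "host-a", random.Random(0))
```

Fetching in this order means a remote host is contacted only when no block manager on the same host can serve the block.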

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10346 from maropu/OptimizeBlockLocationOrder.

935f4663

[SPARK-12374][SPARK-12150][SQL] Adding logical/physical operators for Range · 4883a508

gatorsmile authored 9 years ago

Based on the suggestions from marmbrus, added logical/physical operators for Range to improve performance.

Also added another API to resolve the JIRA SPARK-12150.

Could you take a look at my implementation, marmbrus? If it is not good, I can rework it. : )

Thank you very much!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10335 from gatorsmile/rangeOperators.

4883a508