- Nov 17, 2015
-
jerryshao authored
Exception details can be seen here (https://issues.apache.org/jira/browse/SPARK-11744). Author: jerryshao <sshao@hortonworks.com> Closes #9721 from jerryshao/SPARK-11744.
-
Philipp Hoffmann authored
MESOS_NATIVE_LIBRARY was renamed in favor of MESOS_NATIVE_JAVA_LIBRARY. This commit fixes the reference in the documentation. Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #9768 from philipphoffmann/patch-2.
-
yangping.wu authored
In the **[Task Launching Overheads](http://spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads)** section:

> Task Serialization: Using Kryo serialization for serializing tasks can reduce the task sizes, and therefore reduce the time taken to send them to the slaves.

As we know, **Task Serialization** is configured by the **spark.closure.serializer** parameter, but currently only the Java serializer is supported. If we set **spark.closure.serializer** to **org.apache.spark.serializer.KryoSerializer**, this will throw an exception. Author: yangping.wu <wyphao.2007@163.com> Closes #9734 from 397090770/397090770-patch-1.
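A minimal repro sketch of the unsupported setting (an assumed local-mode snippet; the exact exception depends on the Spark version):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only the Java closure serializer is supported, so pointing
// spark.closure.serializer at Kryo fails once tasks are serialized.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("closure-serializer-repro")
  .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
sc.parallelize(1 to 10).map(_ * 2).count() // expected to fail here
```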
-
Cheng Lian authored
According to discussion in PR #9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now. Author: Cheng Lian <lian@databricks.com> Closes #9737 from liancheng/spark-11191.follow-up.
-
Liang-Chi Hsieh authored
The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements; when that happens, the test fails. This patch fixes it. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9757 from viirya/fix-randomgenerated-udt.
-
Kevin Yu authored
[SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType

While executing the PromoteStrings rule, if one side of a binaryComparison is StringType and the other side is not, the current code promotes (casts) the StringType side to DoubleType; if the string doesn't contain a number, it becomes null. So when doing <=> (null-safe equal) with null, it does not filter anything, causing the problem reported in this JIRA. I propose these changes through this PR; can you review my code changes? This problem only happens for <=>; other operators work fine.

```scala
scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> df.registerTempTable("DF")

scala> sqlContext.sql("select * from DF where 'column' = NULL")
res27: org.apache.spark.sql.DataFrame = [column: string]

scala> res27.show
+------+
|column|
+------+
+------+
```

Author: Kevin Yu <qyu@us.ibm.com> Closes #9720 from kevinyu98/working_on_spark-11447.
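Continuing the session above, a hedged sketch of the operator the fix actually targets (before the patch, the null-safe comparison against a null literal matched every non-numeric row instead of none):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Literal

// With the buggy promotion, the string column was cast to DoubleType,
// non-numeric strings became null, and null <=> null was true for
// every such row; with the fix this returns no rows.
val nullSafeDF = df.filter(df("column") <=> new Column(Literal(null)))
nullSafeDF.show()
```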
-
hyukjinkwon authored
[SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY As discussed in https://github.com/apache/spark/pull/9660 and https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing metadata for Parquet. For the fixed-length byte array test, I verified the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools). Author: hyukjinkwon <gurwls223@gmail.com> Closes #9754 from HyukjinKwon/SPARK-11694-followup.
-
- Nov 16, 2015
-
Reynold Xin authored
This patch adds an alias for current_timestamp (now function). Also fixes SPARK-9196 to re-enable the test case for current_timestamp. Author: Reynold Xin <rxin@databricks.com> Closes #9753 from rxin/SPARK-11768.
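A small usage sketch of the alias (assuming it is exposed through the SQL function registry like other function aliases):

```scala
// now() should be interchangeable with current_timestamp()
sqlContext.sql("SELECT now(), current_timestamp()").show()
```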
-
Marcelo Vanzin authored
The code was using the wrong API to add data to the internal composite buffer, causing buffers to leak in certain situations. Use the right API and enhance the tests to catch memory leaks. Also, avoid reusing the composite buffers when downstream handlers keep references to them; this seems to cause a few different issues even though the ref counting code seems to be correct, so instead pay the cost of copying a few bytes when that situation happens. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9619 from vanzin/SPARK-11617.
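An illustrative sketch of the kind of composite-buffer pitfall described, as a generic Netty example rather than the actual Spark patch:

```scala
import io.netty.buffer.{ByteBuf, Unpooled}

// CompositeByteBuf.addComponent() alone does not advance the writer
// index, so bytes appended that way stay invisible to readers and the
// component is easy to leak; the index must be moved explicitly.
val composite = Unpooled.compositeBuffer()
val chunk: ByteBuf = Unpooled.wrappedBuffer("payload".getBytes("UTF-8"))

composite.addComponent(chunk)
composite.writerIndex(composite.writerIndex + chunk.readableBytes) // without this, readableBytes stays 0
```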
-
Joseph K. Bradley authored
Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable. Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9674 from jkbradley/pipeline-io.
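A hedged sketch of the persistence surface this PR describes (the path is a placeholder; save succeeds only when every stage is Writable):

```scala
import org.apache.spark.ml.PipelineModel

// Assumes `model: PipelineModel` was produced by Pipeline.fit(...)
model.write.save("/tmp/pipeline-model") // fails if any stage is not Writable
val restored = PipelineModel.read.load("/tmp/pipeline-model")
```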
-
jerryshao authored
Author: jerryshao <sshao@hortonworks.com> Closes #9730 from jerryshao/clickstream-fix.
-
Andrew Or authored
Author: Andrew Or <andrew@databricks.com> Closes #9676 from andrewor14/memory-management-docs.
-
Kousuke Saruta authored
When we call AsyncRDDActions#takeAsync, another DAGScheduler#runJob is actually called from another thread, so we cannot get proper callsite information. The following screenshots show the UI before and after this patch.

Before:
https://cloud.githubusercontent.com/assets/4736016/10914069/0ffc1306-8294-11e5-8e89-c4fadf58dd12.png
https://cloud.githubusercontent.com/assets/4736016/10914070/0ffe84ce-8294-11e5-8b2a-69d36276bedb.png

After:
https://cloud.githubusercontent.com/assets/4736016/10914080/1d8cfb7a-8294-11e5-9e09-ede25c2563e8.png
https://cloud.githubusercontent.com/assets/4736016/10914081/1d934e3a-8294-11e5-8b5e-e3dc37aaced3.png

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #9437 from sarutak/SPARK-11480.
-
Shivaram Venkataraman authored
See https://github.com/apache/spark/pull/9390#issuecomment-157160063 and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #9744 from shivaram/sparkr-package-test-disable.
-
Wenchen Fan authored
Author: Wenchen Fan <wenchen@databricks.com> Closes #9591 from cloud-fan/agg-test.
-
gatorsmile authored
This fix changes the equals method of AttributeReference to check all of the specified fields for equality. Author: gatorsmile <gatorsmile@gmail.com> Closes #9216 from gatorsmile/namedExpressEqual.
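A simplified, assumed stand-in (not the merged code) showing the shape of the fix, where equality compares every identifying field rather than a subset:

```scala
// Hypothetical, trimmed analogue of Catalyst's AttributeReference.
class AttrRef(val name: String, val dataType: String, val nullable: Boolean,
              val exprId: Long, val qualifiers: Seq[String]) {
  override def equals(other: Any): Boolean = other match {
    case ar: AttrRef =>
      name == ar.name && dataType == ar.dataType && nullable == ar.nullable &&
        exprId == ar.exprId && qualifiers == ar.qualifiers
    case _ => false
  }
  override def hashCode: Int =
    Seq(name, dataType, nullable, exprId, qualifiers).hashCode
}
```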
-
Bartlomiej Alberski authored
Invoking a getter for a type extending AnyVal returns the type's default value (if the field value is null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details. Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com> Closes #9642 from alberskib/bugfix/SPARK-11553.
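A sketch of the symptom as described above (an assumed repro; the row construction is illustrative):

```scala
import org.apache.spark.sql.Row

val row = Row.fromSeq(Seq(null))
// Buggy behavior: a null field read through a primitive getter silently
// came back as the AnyVal default instead of failing.
val n: Int = row.getInt(0) // returned 0 rather than throwing a NullPointerException
```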
-
Shixiong Zhu authored
<img width="1365" alt="screen shot 2015-11-13 at 9 57 43 pm" src="https://cloud.githubusercontent.com/assets/1000778/11162322/9b88e204-8a51-11e5-8c57-a44889cab713.png"> Author: Shixiong Zhu <shixiong@databricks.com> Closes #9711 from zsxwing/failure-info.
-
Davies Liu authored
Revert "[SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap to reduce memory usage" This reverts commit e209fa27.
-
Zee Chen authored
…ishable Propagate pushed filters to PhysicalRDD in DataSourceStrategy.apply Author: Zee Chen <zeechen@us.ibm.com> Closes #9679 from zeocio/spark-11390.
-
Wenchen Fan authored
These two are very similar; we can consolidate them into one. Also adds tests for it and fixes a bug. Author: Wenchen Fan <wenchen@databricks.com> Closes #9729 from cloud-fan/tuple.
-
jerryshao authored
Currently, if dynamic allocation is enabled, explicitly killing an executor gets no response, so the executor metadata is wrong on the driver side, which makes dynamic allocation on YARN fail to work. The problem is that `disableExecutor` returns false for pending-kill executors when `onDisconnect` is detected, so nothing further is done. One solution is to bypass these explicitly killed executors and use `super.onDisconnect` to remove them; this is simple. Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may get killed and reported in the same AM-RM communication, the current way of adding a pending loss-reason request does not work (the container-complete event is already processed), so here we should store this loss reason for a later query. This PR chooses solution 2. Please help to review. vanzin I think this part was changed by you previously, would you please help to review? Thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #9684 from jerryshao/SPARK-11718.
-
Daniel Jalova authored
Author: Daniel Jalova <djalova@us.ibm.com> Closes #9186 from djalova/SPARK-6328.
-
Burak Yavuz authored
Using batching on the driver for the WriteAheadLog should be an improvement for all environments and use cases. Users will be able to scale to a much higher number of receivers with the BatchedWriteAheadLog. Therefore we should turn it on by default and QA it in the QA period. I've also added some tests to make sure the default configurations are correct regarding recent additions:
- batching on by default
- closeFileAfterWrite off by default
- parallelRecovery off by default

Author: Burak Yavuz <brkyvz@gmail.com> Closes #9695 from brkyvz/enable-batch-wal.
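A hedged sketch of the opt-out knobs (config keys assumed from the streaming WAL namespace of this era; verify against the Spark version in use):

```scala
import org.apache.spark.SparkConf

// With this patch, driver-side WAL batching defaults to on, so only
// opting out (or opting in to file closing) needs an explicit flag.
val conf = new SparkConf()
  .set("spark.streaming.driver.writeAheadLog.allowBatching", "false")      // opt out of the new default
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true") // off by default
```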
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-11743 RowEncoder doesn't support UserDefinedType yet; this patch adds support for it. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9712 from viirya/rowencoder-udt.
-
Wenchen Fan authored
code snippet to reproduce it:
```
TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
val us = fromJavaTimestamp(t)
assert(getSeconds(us) === t.getSeconds)
```
It would be good to add a regression test for this, but the reproducing code needs to change the default timezone, and even if we change it back, the `lazy val defaultTimeZone` in `DateTimeUtils` is already fixed. Author: Wenchen Fan <wenchen@databricks.com> Closes #9728 from cloud-fan/seconds.
-
xin Wu authored
When computing partitions for a non-Parquet relation, `HadoopRDD.compute` is used, but it does not set the thread-local variable `inputFileName` in `NewSqlHadoopRDD` the way `NewSqlHadoopRDD.compute` does. Yet when getting the `inputFileName`, `NewSqlHadoopRDD.inputFileName` is expected, which is empty now. Setting `inputFileName` in `HadoopRDD.compute` resolves this issue. Author: xin Wu <xinwu@us.ibm.com> Closes #9542 from xwu0226/SPARK-11522.
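A sketch of the symptom (an assumed repro; the table name is a placeholder):

```scala
import org.apache.spark.sql.functions.input_file_name

// A non-Parquet (e.g. Hive text) relation is read through
// HadoopRDD.compute; before the fix the thread-local was never set,
// so input_file_name() came back empty.
sqlContext.table("some_text_table").select(input_file_name()).show()
```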
-
hyukjinkwon authored
Parquet supports some JSON and BSON datatypes. They are represented as binary for BSON and string (UTF-8) for JSON internally. I searched a bit and found that Apache Drill also supports both in this way, [link](https://drill.apache.org/docs/parquet-format/). Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #9658 from HyukjinKwon/SPARK-11692.
-
hyukjinkwon authored
https://issues.apache.org/jira/browse/SPARK-11044 Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user. So, in this PR, it keeps the writer version if given, or sets version 1 as the default. Author: hyukjinkwon <gurwls223@gmail.com> Author: HyukjinKwon <gurwls223@gmail.com> Closes #9060 from HyukjinKwon/SPARK-11044.
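A hedged sketch of requesting the v2 writer (the key and value strings are parquet-mr's, assumed here; `sc` and `df` are placeholders):

```scala
// parquet-mr reads the writer version from the Hadoop configuration;
// before this patch Spark wrote v1 regardless of this setting.
sc.hadoopConfiguration.set("parquet.writer.version", "v2")
df.write.parquet("/tmp/out-v2")
```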
-
Reynold Xin authored
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
* `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)

To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options. Author: Reynold Xin <rxin@databricks.com> Closes #9724 from rxin/SPARK-11745.
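A usage sketch of the new options (names from the list above; the path is a placeholder):

```scala
// Permissively parse JSON that carries comments and unquoted field names.
val df = sqlContext.read
  .option("allowComments", "true")
  .option("allowUnquotedFieldNames", "true")
  .json("/path/to/nonstandard.json")
```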
-
Josh Rosen authored
This reverts commit 3e0a6cf1.
-
- Nov 15, 2015
-
gatorsmile authored
LogicalLocalTable in ExistingRDD.scala appears to have been replaced by localRelation in LocalRelation.scala. Do you know any reason why we still keep this class? Author: gatorsmile <gatorsmile@gmail.com> Closes #9717 from gatorsmile/LogicalLocalTable.
-
Sun Rui authored
The basic idea is this: the archive of the SparkR package itself, sparkr.zip, is created during the build process and is contained in the Spark binary distribution. It is not changed after the distribution is installed, as the directory where it resides ($SPARK_HOME/R/lib) may not be writable. When R source code is contained in jars or in Spark packages specified with the "--jars" or "--packages" command line options, a temporary directory is created by calling Utils.createTempDir(), and the R packages built from that R source code are installed there. The temporary directory is writable, won't interfere with other concurrent SparkR sessions, and is deleted when the SparkR session ends. The R binary packages installed in the temporary directory are then packed into an archive named rpkg.zip. sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes. Distribution of rpkg.zip in Standalone modes is not supported in this PR and will be addressed in another PR. Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for other R packages) so that these packages can be accessed in R. Author: Sun Rui <rui.sun@intel.com> Closes #9390 from sun-rui/SPARK-10500.
-
zero323 authored
Use `dropFactors` column-wise instead of a nested loop when `createDataFrame` is called on a local `data.frame`. At the moment SparkR's createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow, especially with data.table (~2 orders of magnitude compared to the PySpark / Pandas version on a DataFrame of size 1M rows x 2 columns). A simple improvement is to apply `dropFactors` column-wise and then reshape the output list. It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277). Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9099 from zero323/SPARK-11086.
-
Yu Gao authored
On driver process start-up, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, so the static var UserGroupInformation.loginUser is set to that principal with Kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared for Hive metastore clients; instead it is loaded separately and of course cannot see the prepared Kerberos login credentials in the main thread. The first proposed fix would cause other classloader conflict errors and is not an appropriate solution. This new change does the Kerberos login during Hive client initialization, which makes credentials ready for the particular Hive client instance. yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this? Author: Yu Gao <ygao@us.ibm.com> Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com> Author: Yu Gao <crystalgaoyu@gmail.com> Closes #9272 from yolandagao/master.
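A minimal sketch of the login pattern described, assuming it runs inside the isolated client's initialization (the principal and keytab path are placeholders):

```scala
import org.apache.hadoop.security.UserGroupInformation

// Log in from the keytab inside the isolated classloader so this
// client's UserGroupInformation carries valid Kerberos credentials.
if (UserGroupInformation.isSecurityEnabled) {
  UserGroupInformation.loginUserFromKeytab("spark/host@EXAMPLE.COM", "/etc/security/spark.keytab")
}
```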
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-11738 Author: Yin Huai <yhuai@databricks.com> Closes #9718 from yhuai/makingArrayOrderable.
-
Xiangrui Meng authored
The same as #9694, but for the Java test suite. yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9719 from mengxr/SPARK-11672.4.
-
Reynold Xin authored
I didn't remove the old Sort operator, since we still use it in randomized tests. I moved it into test module and renamed it ReferenceSort. Author: Reynold Xin <rxin@databricks.com> Closes #9700 from rxin/SPARK-11734.
-
- Nov 14, 2015
-
Yin Huai authored
https://issues.apache.org/jira/browse/SPARK-11736 Author: Yin Huai <yhuai@databricks.com> Closes #9703 from yhuai/MonotonicallyIncreasingID.
-
Rohan Bhanderi authored
Use a 2-second batch size, as the duration specified in the JavaStreamingContext constructor is 2000 ms. Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9714 from RohanBhanderi/patch-2.
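A sketch of the mismatch being corrected (Scala equivalent of the Java example; `conf` is a placeholder):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The example's context uses a 2000 ms batch duration, so the prose
// should say "2 seconds".
val ssc = new StreamingContext(conf, Seconds(2)) // same as Milliseconds(2000)
```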
-