  1. Aug 10, 2015
    • Hao Zhu's avatar
      [SPARK-9801] [STREAMING] Check if file exists before deleting temporary files. · 3c9802d9
      Hao Zhu authored
      Spark Streaming deletes the temp file and backup files without checking whether they exist.
      
      Author: Hao Zhu <viadeazhu@gmail.com>
      
      Closes #8082 from viadea/master and squashes the following commits:
      
      242d05f [Hao Zhu] [SPARK-9801][Streaming]No need to check the existence of those files
      fd143f2 [Hao Zhu] [SPARK-9801][Streaming]Check if backupFile exists before deleting backupFile files.
      087daf0 [Hao Zhu] SPARK-9801
      3c9802d9
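      The fix boils down to the standard Hadoop FileSystem idiom of checking that a file exists before deleting it. A minimal sketch, assuming hypothetical checkpoint paths rather than the actual WriteAheadLog code:

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      object SafeDelete {
        def main(args: Array[String]): Unit = {
          val conf = new Configuration()
          // Hypothetical temp/backup paths standing in for Spark Streaming's checkpoint files.
          val tempFile = new Path("/tmp/checkpoint-temp")
          val backupFile = new Path("/tmp/checkpoint-backup")
          val fs: FileSystem = tempFile.getFileSystem(conf)

          // Guard each delete with an existence check so a missing file is not treated as an error.
          Seq(tempFile, backupFile).foreach { p =>
            if (fs.exists(p)) {
              fs.delete(p, false) // non-recursive delete of a single file
            }
          }
        }
      }
      ```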
    • Prabeesh K's avatar
      [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python · 853809e9
      Prabeesh K authored
      This PR is based on #4229, thanks prabeesh.
      
      Closes #4229
      
      Author: Prabeesh K <prabsmails@gmail.com>
      Author: zsxwing <zsxwing@gmail.com>
      Author: prabs <prabsmails@gmail.com>
      Author: Prabeesh K <prabeesh.k@namshi.com>
      
      Closes #7833 from zsxwing/pr4229 and squashes the following commits:
      
      9570bec [zsxwing] Fix the variable name and check null in finally
      4a9c79e [zsxwing] Fix pom.xml indentation
      abf5f18 [zsxwing] Merge branch 'master' into pr4229
      935615c [zsxwing] Fix the flaky MQTT tests
      47278c5 [zsxwing] Include the project class files
      478f844 [zsxwing] Add unpack
      5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
      734db99 [zsxwing] Merge branch 'master' into pr4229
      126608a [Prabeesh K] address the comments
      b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
      d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
      a6747cb [Prabeesh K] wait for starting the receiver before publishing data
      87fc677 [Prabeesh K] address the comments:
      97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
      80474d1 [Prabeesh K] fix
      1f0cfe9 [Prabeesh K] python style fix
      e1ee016 [Prabeesh K] scala style fix
      a5a8f9f [Prabeesh K] added Python test
      9767d82 [Prabeesh K] implemented Python-friendly class
      a11968b [Prabeesh K] fixed python style
      795ec27 [Prabeesh K] address comments
      ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
      3f4df12 [Prabeesh K] updated version
      b34c3c1 [prabs] adress comments
      3aa7fff [prabs] Added Python streaming mqtt word count example
      b7d42ff [prabs] Mqtt streaming support in Python
      853809e9
    • Davies Liu's avatar
      [SPARK-9759] [SQL] improve decimal.times() and cast(int, decimalType) · c4fd2a24
      Davies Liu authored
      This patch optimizes two things:
      
      1. passing MathContext to JavaBigDecimal.multiply/divide/remainder to do the rounding correctly, because java.math.BigDecimal.apply(MathContext) is expensive
      
      2. Cast integer/short/byte to decimal directly (without double)
      
      These two optimizations speed up the end-to-end time of an aggregation (SUM(short * decimal(5, 2))) by 75% (from 19s to 10.8s).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8052 from davies/optimize_decimal and squashes the following commits:
      
      225efad [Davies Liu] improve decimal.times() and cast(int, decimalType)
      c4fd2a24
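      The first optimization corresponds to supplying a MathContext to the java.math.BigDecimal operation itself, so rounding happens as part of the multiply rather than in a separate step; the second avoids routing integral values through a double. A standard-library sketch (precision and rounding mode are illustrative, not Spark's Decimal internals):

      ```scala
      import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

      object DecimalTimes {
        def main(args: Array[String]): Unit = {
          val a = new JBigDecimal("12345.67")
          val b = new JBigDecimal("0.125")

          // Round as part of the multiply itself rather than constructing a rounded copy afterwards.
          val mc = new MathContext(38, RoundingMode.HALF_UP)
          println(a.multiply(b, mc)) // 1543.20875

          // Build a decimal from an int directly, with no intermediate double.
          println(new JBigDecimal(42))
        }
      }
      ```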
    • Davies Liu's avatar
      [SPARK-9620] [SQL] generated UnsafeProjection should support many columns or large exressions · fe2fb7fb
      Davies Liu authored
      Currently, a generated UnsafeProjection can reach the 64KB bytecode limit of Java. This patch splits the generated expressions into multiple functions to avoid the limitation.
      
      After this patch, we can work well with tables that have up to 64k columns (hitting the limit on the number of constants in Java), which should be enough in practice.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8044 from davies/wider_table and squashes the following commits:
      
      9192e6c [Davies Liu] fix generated safe projection
      d1ef81a [Davies Liu] fix failed tests
      737b3d3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
      ffcd132 [Davies Liu] address comments
      1b95be4 [Davies Liu] put the generated class into sql package
      77ed72d [Davies Liu] address comments
      4518e17 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
      75ccd01 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
      495e932 [Davies Liu] support wider table with more than 1k columns for generated projections
      fe2fb7fb
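      The splitting technique is sketched below with plain string templating rather than Spark's actual code generator: emit the per-column code in fixed-size chunks, wrap each chunk in its own helper method, and have apply() call the helpers in order so no single method exceeds the JVM's 64KB-per-method limit. Names and the chunk size are illustrative.

      ```scala
      object SplitGeneratedCode {
        def main(args: Array[String]): Unit = {
          // Pretend each element is the generated code for writing one column of the projection.
          val perColumnCode: Seq[String] = (0 until 5000).map(i => s"writeColumn($i, row);")

          // Group the snippets into chunks and wrap each chunk in its own method,
          // so no generated method grows past the JVM's per-method bytecode limit.
          val chunkSize = 100
          val helperMethods = perColumnCode.grouped(chunkSize).zipWithIndex.map { case (chunk, idx) =>
            s"""private void apply_$idx(InternalRow row) {
               |  ${chunk.mkString("\n  ")}
               |}""".stripMargin
          }.toSeq

          // The entry point just delegates to the helpers in order.
          val applyMethod =
            s"""public void apply(InternalRow row) {
               |  ${helperMethods.indices.map(i => s"apply_$i(row);").mkString("\n  ")}
               |}""".stripMargin

          println(helperMethods.size) // 50 helpers instead of one giant method
          println(applyMethod.split("\n").take(3).mkString("\n"))
        }
      }
      ```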
    • Reynold Xin's avatar
      [SPARK-9763][SQL] Minimize exposure of internal SQL classes. · 40ed2af5
      Reynold Xin authored
      There are a few changes in this pull request:
      
      1. Moved all data sources to execution.datasources, except the public JDBC APIs.
      2. In order to maintain backward compatibility with 1, added a backward-compatibility translation map in data source resolution.
      3. Moved ui and metric package into execution.
      4. Added more documentation on some internal classes.
      5. Renamed DataSourceRegister.format -> shortName.
      6. Added "override" modifier on shortName.
      7. Removed IntSQLMetric.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8056 from rxin/SPARK-9763 and squashes the following commits:
      
      9df4801 [Reynold Xin] Removed hardcoded name in test cases.
      d9babc6 [Reynold Xin] Shorten.
      e484419 [Reynold Xin] Removed VisibleForTesting.
      171b812 [Reynold Xin] MimaExcludes.
      2041389 [Reynold Xin] Compile ...
      79dda42 [Reynold Xin] Compile.
      0818ba3 [Reynold Xin] Removed IntSQLMetric.
      c46884f [Reynold Xin] Two more fixes.
      f9aa88d [Reynold Xin] [SPARK-9763][SQL] Minimize exposure of internal SQL classes.
      40ed2af5
    • Josh Rosen's avatar
      [SPARK-9784] [SQL] Exchange.isUnsafe should check whether codegen and unsafe are enabled · 0fe66744
      Josh Rosen authored
      Exchange.isUnsafe should check whether codegen and unsafe are enabled.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8073 from JoshRosen/SPARK-9784 and squashes the following commits:
      
      7a1019f [Josh Rosen] [SPARK-9784] Exchange.isUnsafe should check whether codegen and unsafe are enabled
      0fe66744
    • Mahmoud Lababidi's avatar
      Fixed AtmoicReference<> Example · d2852127
      Mahmoud Lababidi authored
      Author: Mahmoud Lababidi <lababidi@gmail.com>
      
      Closes #8076 from lababidi/master and squashes the following commits:
      
      af4553b [Mahmoud Lababidi] Fixed AtmoicReference<> Example
      d2852127
    • Feynman Liang's avatar
      [SPARK-9755] [MLLIB] Add docs to MultivariateOnlineSummarizer methods · 00b655cc
      Feynman Liang authored
      Adds method documentation back to `MultivariateOnlineSummarizer`; it was present in 1.4 but disappeared somewhere along the way to 1.5.
      
      jkbradley
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8045 from feynmanliang/SPARK-9755 and squashes the following commits:
      
      af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs
      00b655cc
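      For reference, a short usage sketch of the class whose method docs were restored; this relies on the public MLlib API, with made-up data:

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

      object SummarizerExample {
        def main(args: Array[String]): Unit = {
          val summarizer = new MultivariateOnlineSummarizer()

          // Samples are added one at a time; statistics are maintained online.
          Seq(
            Vectors.dense(1.0, 10.0),
            Vectors.dense(2.0, 20.0),
            Vectors.dense(3.0, 30.0)
          ).foreach(summarizer.add)

          println(summarizer.count)    // 3
          println(summarizer.mean)     // [2.0,20.0]
          println(summarizer.variance) // [1.0,100.0]
          println(summarizer.max)      // [3.0,30.0]
        }
      }
      ```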
    • Marcelo Vanzin's avatar
      [SPARK-9710] [TEST] Fix RPackageUtilsSuite when R is not available. · 0f3366a4
      Marcelo Vanzin authored
      RUtils.isRInstalled throws an exception if R is not installed,
      instead of returning false. Fix that.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8008 from vanzin/SPARK-9710 and squashes the following commits:
      
      df72d8c [Marcelo Vanzin] [SPARK-9710] [test] Fix RPackageUtilsSuite when R is not available.
      0f3366a4
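      The general shape of such a fix is to treat "the probe command failed to run" as "not installed" instead of letting the exception escape. A hedged, generic sketch, not the actual RUtils code:

      ```scala
      import scala.sys.process._
      import scala.util.Try

      object RCheck {
        /** Returns false, rather than throwing, when the `R` binary cannot be executed. */
        def isRInstalled: Boolean =
          Try(Seq("R", "--version").! == 0).getOrElse(false)

        def main(args: Array[String]): Unit = {
          println(s"R installed: $isRInstalled")
        }
      }
      ```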
    • Cheng Lian's avatar
      [SPARK-9743] [SQL] Fixes JSONRelation refreshing · e3fef0f9
      Cheng Lian authored
      PR #7696 added two `HadoopFsRelation.refresh()` calls ([this] [1], and [this] [2]) in `DataSourceStrategy` to make test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.
      
      The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD with the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to it, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.
      
      This PR removes those two expensive refresh calls, and moves the refresh into `JSONRelation` to fix this issue. We might want to update `HadoopFsRelation` interface to provide better support for this use case.
      
      [1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
      [2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:
      
      ec1957d [Cheng Lian] Fixes JSONRelation refreshing
      e3fef0f9
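      The use case the test exercises, re-stated against the public 1.5-era DataFrame API (paths and table names are illustrative; after this patch the refresh happens inside JSONRelation rather than on every HadoopFsRelation scan):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.{SQLContext, SaveMode}

      object JsonRefreshUseCase {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("json-refresh").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          val path = "/tmp/json-refresh-demo"

          // Create a JSON dataset at a path and register a temporary table on top of it.
          Seq((1, "a"), (2, "b")).toDF("id", "value").write.mode(SaveMode.Overwrite).json(path)
          sqlContext.read.json(path).registerTempTable("json_table")

          // Write directly to the same path, behind the table's back...
          Seq((3, "c")).toDF("id", "value").write.mode(SaveMode.Append).json(path)

          // ...and expect the newly written rows to be visible without a full-table refresh on every scan.
          sqlContext.sql("SELECT COUNT(*) FROM json_table").show()

          sc.stop()
        }
      }
      ```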
    • Yin Huai's avatar
      [SPARK-9777] [SQL] Window operator can accept UnsafeRows · be80def0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-9777
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8064 from yhuai/windowUnsafe and squashes the following commits:
      
      8fb3537 [Yin Huai] Set canProcessUnsafeRows to true.
      be80def0
  2. Aug 09, 2015
    • Shivaram Venkataraman's avatar
      [CORE] [SPARK-9760] Use Option instead of Some for Ivy repos · 46025616
      Shivaram Venkataraman authored
      This was introduced in #7599
      
      cc rxin brkyvz
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #8055 from shivaram/spark-packages-repo-fix and squashes the following commits:
      
      890f306 [Shivaram Venkataraman] Remove test case
      51d69ee [Shivaram Venkataraman] Add test case for --packages without --repository
      c02e0b4 [Shivaram Venkataraman] Use Option instead of Some for Ivy repos
      46025616
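      The underlying Scala distinction is a general one: Option(x) collapses a null into None, while Some(x) wraps it as-is. A tiny illustration with a made-up repository variable:

      ```scala
      object OptionVsSome {
        def main(args: Array[String]): Unit = {
          val maybeRepo: String = null // e.g. no extra Ivy repository was supplied

          println(Some(maybeRepo))   // Some(null) -- a "present" value that can NPE later
          println(Option(maybeRepo)) // None       -- absence represented correctly
        }
      }
      ```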
    • Josh Rosen's avatar
      [SPARK-9703] [SQL] Refactor EnsureRequirements to avoid certain unnecessary shuffles · 23cf5af0
      Josh Rosen authored
      This pull request refactors the `EnsureRequirements` planning rule in order to avoid the addition of certain unnecessary shuffles.
      
      As an example of how unnecessary shuffles can occur, consider SortMergeJoin, which requires clustered distribution and sorted ordering of its children's input rows. Say that both of SMJ's children produce unsorted output and each is SinglePartition. In this case, we will need to inject sort operators but should not need to inject Exchanges. Unfortunately, it looks like EnsureRequirements unnecessarily repartitions using a hash partitioning.
      
      This patch solves this problem by refactoring `EnsureRequirements` to properly implement the `compatibleWith` checks that were broken in earlier implementations. See the significant inline comments for a better description of how this works. The majority of this PR is new comments and test cases, with few actual changes to the code.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7988 from JoshRosen/exchange-fixes and squashes the following commits:
      
      38006e7 [Josh Rosen] Rewrite EnsureRequirements _yet again_ to make things even simpler
      0983f75 [Josh Rosen] More guarantees vs. compatibleWith cleanup; delete BroadcastPartitioning.
      8784bd9 [Josh Rosen] Giant comment explaining compatibleWith vs. guarantees
      1307c50 [Josh Rosen] Update conditions for requiring child compatibility.
      18cddeb [Josh Rosen] Rename DummyPlan to DummySparkPlan.
      2c7e126 [Josh Rosen] Merge remote-tracking branch 'origin/master' into exchange-fixes
      fee65c4 [Josh Rosen] Further refinement to comments / reasoning
      642b0bb [Josh Rosen] Further expand comment / reasoning
      06aba0c [Josh Rosen] Add more comments
      8dbc845 [Josh Rosen] Add even more tests.
      4f08278 [Josh Rosen] Fix the test by adding the compatibility check to EnsureRequirements
      a1c12b9 [Josh Rosen] Add failing test to demonstrate allCompatible bug
      0725a34 [Josh Rosen] Small assertion cleanup.
      5172ac5 [Josh Rosen] Add test for requiresChildrenToProduceSameNumberOfPartitions.
      2e0f33a [Josh Rosen] Write a more generic test for EnsureRequirements.
      752b8de [Josh Rosen] style fix
      c628daf [Josh Rosen] Revert accidental ExchangeSuite change.
      c9fb231 [Josh Rosen] Rewrite exchange to fix better handle this case.
      adcc742 [Josh Rosen] Move test to PlannerSuite.
      0675956 [Josh Rosen] Preserving ordering and partitioning in row format converters also does not help.
      cc5669c [Josh Rosen] Adding outputPartitioning to Repartition does not fix the test.
      2dfc648 [Josh Rosen] Add failing test illustrating bad exchange planning.
      23cf5af0
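      A toy model of the decision described above, using hypothetical case classes rather than Spark's planner types: an exchange is only needed when a child's partitioning fails to satisfy the required distribution, and a sort is only needed when the required ordering is missing; the two checks are independent, so two SinglePartition children need sorts but no exchange.

      ```scala
      object EnsureRequirementsToy {
        sealed trait Distribution
        case object UnspecifiedDistribution extends Distribution
        case class ClusteredDistribution(keys: Seq[String]) extends Distribution

        sealed trait Partitioning { def satisfies(d: Distribution): Boolean }
        case object SinglePartition extends Partitioning {
          // A single partition trivially keeps every clustering key in one place.
          def satisfies(d: Distribution): Boolean = true
        }
        case class HashPartitioning(keys: Seq[String]) extends Partitioning {
          def satisfies(d: Distribution): Boolean = d match {
            case UnspecifiedDistribution         => true
            case ClusteredDistribution(required) => required == keys
          }
        }

        def needsExchange(child: Partitioning, required: Distribution): Boolean =
          !child.satisfies(required)

        def needsSort(childOrdering: Seq[String], requiredOrdering: Seq[String]): Boolean =
          requiredOrdering.nonEmpty && childOrdering != requiredOrdering

        def main(args: Array[String]): Unit = {
          // The SortMergeJoin case from the description: both children are SinglePartition and unsorted.
          println(needsExchange(SinglePartition, ClusteredDistribution(Seq("k")))) // false: no shuffle needed
          println(needsSort(Nil, Seq("k")))                                        // true: sort still needed
        }
      }
      ```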
    • Yadong Qi's avatar
      [SPARK-9737] [YARN] Add the suggested configuration when required executor... · 86fa4ba6
      Yadong Qi authored
      [SPARK-9737] [YARN] Add the suggested configuration when required executor memory is above the max threshold of this cluster on YARN mode
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #8028 from watermen/SPARK-9737 and squashes the following commits:
      
      48bdf3d [Yadong Qi] Add suggested configuration.
      86fa4ba6
    • Yijie Shen's avatar
      [SPARK-8930] [SQL] Throw a AnalysisException with meaningful messages if... · 68ccc6e1
      Yijie Shen authored
      [SPARK-8930] [SQL] Throw a AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #8057 from yjshen/explode_star and squashes the following commits:
      
      eae181d [Yijie Shen] change explaination message
      54c9d11 [Yijie Shen] meaning message for * in explode
      68ccc6e1
    • Reynold Xin's avatar
      [SPARK-9752][SQL] Support UnsafeRow in Sample operator. · e9c36938
      Reynold Xin authored
      In order for this to work, I had to disable gap sampling.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8040 from rxin/SPARK-9752 and squashes the following commits:
      
      f9e248c [Reynold Xin] Fix the test case for real this time.
      adbccb3 [Reynold Xin] Fixed test case.
      589fb23 [Reynold Xin] Merge branch 'SPARK-9752' of github.com:rxin/spark into SPARK-9752
      55ccddc [Reynold Xin] Fixed core test.
      78fa895 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
      c9e7112 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
      e9c36938
  3. Aug 08, 2015
    • Yijie Shen's avatar
      [SPARK-6212] [SQL] The EXPLAIN output of CTAS only shows the analyzed plan · 3ca995b7
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6212
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7986 from yjshen/ctas_explain and squashes the following commits:
      
      bb6fee5 [Yijie Shen] refine test
      f731041 [Yijie Shen] address comment
      b2cf8ab [Yijie Shen] bug fix
      bd7eb20 [Yijie Shen] ctas explain
      3ca995b7
    • CodingCat's avatar
      [MINOR] inaccurate comments for showString() · 25c363e9
      CodingCat authored
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #8050 from CodingCat/minor and squashes the following commits:
      
      5bc4b89 [CodingCat] inaccurate comments
      25c363e9
    • Joseph Batchik's avatar
      [SPARK-9486][SQL] Add data source aliasing for external packages · a3aec918
      Joseph Batchik authored
      Users currently have to provide the full class name for external data sources, like:
      
      `sqlContext.read.format("com.databricks.spark.avro").load(path)`
      
      This allows external data source packages to register themselves using a Service Loader so that they can add a custom alias, like:
      
      `sqlContext.read.format("avro").load(path)`
      
      This makes it so that using external data source packages uses the same format as the internal data sources like parquet, json, etc.
      
      Author: Joseph Batchik <joseph.batchik@cloudera.com>
      Author: Joseph Batchik <josephbatchik@gmail.com>
      
      Closes #7802 from JDrit/service_loader and squashes the following commits:
      
      49a01ec [Joseph Batchik] fixed a couple of format / error bugs
      e5e93b2 [Joseph Batchik] modified rat file to only excluded added services
      72b349a [Joseph Batchik] fixed error with orc data source actually
      9f93ea7 [Joseph Batchik] fixed error with orc data source
      87b7f1c [Joseph Batchik] fixed typo
      101cd22 [Joseph Batchik] removing unneeded changes
      8f3cf43 [Joseph Batchik] merged in changes
      b63d337 [Joseph Batchik] merged in master
      95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves
      74db85e [Joseph Batchik] reformatted class loader
      ac2270d [Joseph Batchik] removing some added test
      a6926db [Joseph Batchik] added test cases for data source loader
      208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources
      946186e [Joseph Batchik] started working on service loader
      a3aec918
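      A minimal sketch of what an external package can do after this change, assuming the 1.5 `DataSourceRegister` trait; the package name, alias, and relation stub are hypothetical:

      ```scala
      package com.example.myformat

      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

      /**
       * Discovered through the Java ServiceLoader: the jar ships a resource file
       *   META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
       * containing the single line "com.example.myformat.DefaultSource".
       */
      class DefaultSource extends RelationProvider with DataSourceRegister {

        // The short alias users can pass to sqlContext.read.format("myformat").
        override def shortName(): String = "myformat"

        override def createRelation(
            sqlContext: SQLContext,
            parameters: Map[String, String]): BaseRelation = {
          // A real implementation would construct its BaseRelation here.
          throw new UnsupportedOperationException("stub for illustration only")
        }
      }
      ```

      With the service file on the classpath, `sqlContext.read.format("myformat").load(path)` resolves to this provider instead of requiring the fully qualified class name.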
    • Yijie Shen's avatar
      [SPARK-9728][SQL]Support CalendarIntervalType in HiveQL · 23695f1d
      Yijie Shen authored
      This PR enables converting an interval term in HiveQL into a CalendarInterval literal.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-9728
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #8034 from yjshen/interval_hiveql and squashes the following commits:
      
      7fe9a5e [Yijie Shen] declare throw exception and add unit test
      fce7795 [Yijie Shen] convert hiveql interval term into CalendarInterval literal
      23695f1d
    • Davies Liu's avatar
      [SPARK-6902] [SQL] [PYSPARK] Row should be read-only · ac507a03
      Davies Liu authored
      Raise a read-only exception when a user tries to mutate a Row.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8009 from davies/readonly_row and squashes the following commits:
      
      8722f3f [Davies Liu] add tests
      05a3d36 [Davies Liu] Row should be read-only
      ac507a03
    • Davies Liu's avatar
      [SPARK-4561] [PYSPARK] [SQL] turn Row into dict recursively · 74a6541a
      Davies Liu authored
      Add an option `recursive` to `Row.asDict()`; when it is True (the default is False), nested Rows are converted into dicts as well.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8006 from davies/as_dict and squashes the following commits:
      
      922cc5a [Davies Liu] turn Row into dict recursively
      74a6541a
    • Wenchen Fan's avatar
      [SPARK-9738] [SQL] remove FromUnsafe and add its codegen version to GenerateSafe · 106c0789
      Wenchen Fan authored
      In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nested unsafe data like array/map/struct to safe versions. It was a quick solution, and we already have `GenerateSafe`, which is code-generated, to do the conversion. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8029 from cloud-fan/from-unsafe and squashes the following commits:
      
      ed40d8f [Wenchen Fan] add the copy back
      a93fd4b [Wenchen Fan] cogengen FromUnsafe
      106c0789
    • Cheng Lian's avatar
      [SPARK-4176] [SQL] [MINOR] Should use unscaled Long to write decimals for... · 11caf1ce
      Cheng Lian authored
      [SPARK-4176] [SQL] [MINOR] Should use unscaled Long to write decimals for precision <= 18 rather than 8
      
      This PR fixes a minor bug introduced in #7455: when writing decimals, we should use the unscaled Long for better performance when the precision is <= 18, rather than <= 8 (which appears to be a typo). This bug doesn't affect correctness, but it hurts Parquet decimal writing performance.
      
      This PR also replaced similar magic numbers with newly defined constants.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8031 from liancheng/spark-4176/minor-fix-for-writing-decimals and squashes the following commits:
      
      10d4ea3 [Cheng Lian] Should use unscaled Long to write decimals for precision <= 18 rather than 8
      11caf1ce
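      The reason 18 is the right cutoff: a decimal with at most 18 digits of precision always has an unscaled value that fits in a signed 64-bit Long, so it can be written as a plain long rather than a binary-encoded BigDecimal. A standard-library check with illustrative values:

      ```scala
      import java.math.BigDecimal

      object UnscaledLongDemo {
        def main(args: Array[String]): Unit = {
          // 18 significant digits: the unscaled value fits comfortably in a Long.
          val d = new BigDecimal("1234567890.12345678") // precision 18, scale 8
          val unscaled: Long = d.unscaledValue().longValueExact()

          println(s"precision=${d.precision()} scale=${d.scale()} unscaled=$unscaled")

          // Long.MaxValue has 19 digits, so a precision-19 unscaled value can already overflow:
          // new BigDecimal("9999999999999999999").unscaledValue().longValueExact()
          // throws ArithmeticException.
        }
      }
      ```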
    • Carson Wang's avatar
      [SPARK-9731] Standalone scheduling incorrect cores if spark.executor.cores is not set · ef062c15
      Carson Wang authored
      The issue only happens if `spark.executor.cores` is not set and executor memory is set to a high value.
      For example, if we have a worker with 4G and 10 cores and we set `spark.executor.memory` to 3G, then only 1 core is assigned to the executor. The correct number should be 10 cores.
      I've added a unit test to illustrate the issue.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #8017 from carsonwang/SPARK-9731 and squashes the following commits:
      
      d09ec48 [Carson Wang] Fix code style
      86b651f [Carson Wang] Simplify the code
      943cc4c [Carson Wang] fix scheduling correct cores to executors
      ef062c15
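      The arithmetic behind the fix, as a hedged standalone sketch rather than the Master's actual scheduling code (variable names are made up): when spark.executor.cores is unset, the single executor that fits in memory should receive all of the worker's free cores, not one core per scheduling pass.

      ```scala
      object StandaloneCoresSketch {
        def main(args: Array[String]): Unit = {
          val workerCores = 10
          val workerMemoryMb = 4096
          val executorMemoryMb = 3072
          val coresPerExecutor: Option[Int] = None // spark.executor.cores not set

          // How many executors fit on this worker by memory alone.
          val executorsByMemory = workerMemoryMb / executorMemoryMb // 1

          val assignedCores = coresPerExecutor match {
            // Explicit setting: each executor gets exactly that many cores.
            case Some(c) => math.min(workerCores, executorsByMemory * c)
            // Unset: the executor we can afford should get all free cores,
            // not just the one core per round that the bug produced.
            case None if executorsByMemory >= 1 => workerCores
            case None => 0
          }

          println(s"cores assigned on this worker: $assignedCores") // 10
        }
      }
      ```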
  4. Aug 07, 2015
    • Yin Huai's avatar
      [SPARK-9753] [SQL] TungstenAggregate should also accept InternalRow instead of just UnsafeRow · c564b274
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-9753
      
      This PR makes TungstenAggregate accept `InternalRow` instead of just `UnsafeRow`. Also, it adds a `getAggregationBufferFromUnsafeRow` method to `UnsafeFixedWidthAggregationMap`, which is useful when we already have grouping keys stored in `UnsafeRow`s. Finally, it wraps `InputStream` and `OutputStream` in `UnsafeRowSerializer` with `BufferedInputStream` and `BufferedOutputStream`, respectively.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8041 from yhuai/joinedRowForProjection and squashes the following commits:
      
      7753e34 [Yin Huai] Use BufferedInputStream and BufferedOutputStream.
      d68b74e [Yin Huai] Use joinedRow instead of UnsafeRowJoiner.
      e93c009 [Yin Huai] Add getAggregationBufferFromUnsafeRow for cases that the given groupingKeyRow is already an UnsafeRow.
      c564b274
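      The stream wrapping mentioned last is the usual java.io buffering idiom; a generic sketch with in-memory streams, not the UnsafeRowSerializer code itself:

      ```scala
      import java.io.{BufferedInputStream, BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}

      object BufferedStreams {
        def main(args: Array[String]): Unit = {
          val rawOut = new ByteArrayOutputStream()
          // Buffer the raw stream so many small row writes coalesce into fewer underlying writes.
          val out = new BufferedOutputStream(rawOut, 64 * 1024)
          out.write(Array[Byte](1, 2, 3))
          out.flush()
          out.close()

          val in = new BufferedInputStream(new ByteArrayInputStream(rawOut.toByteArray), 64 * 1024)
          println(in.read()) // 1
          in.close()
        }
      }
      ```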
    • Reynold Xin's avatar
      [SPARK-9754][SQL] Remove TypeCheck in debug package. · 998f4ff9
      Reynold Xin authored
      TypeCheck no longer applies in the new "Tungsten" world.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8043 from rxin/SPARK-9754 and squashes the following commits:
      
      4ec471e [Reynold Xin] [SPARK-9754][SQL] Remove TypeCheck in debug package.
      998f4ff9
    • Feynman Liang's avatar
      [SPARK-9719] [ML] Clean up Naive Bayes doc · 85be65b3
      Feynman Liang authored
      Small documentation cleanups, including:
       * Adds documentation for `pi` and `theta`
       * setParam to `setModelType`
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8047 from feynmanliang/SPARK-9719 and squashes the following commits:
      
      b372438 [Feynman Liang] Clean up naive bayes doc
      85be65b3
    • Feynman Liang's avatar
      [SPARK-9756] [ML] Make constructors in ML decision trees private · cd540c1e
      Feynman Liang authored
      These should be made private until there is a public constructor for providing `rootNode: Node` to use these constructors.
      
      jkbradley
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8046 from feynmanliang/SPARK-9756 and squashes the following commits:
      
      2cbdf08 [Feynman Liang] Make RFRegressionModel aux constructor private
      a06f596 [Feynman Liang] Make constructors in ML decision trees private
      cd540c1e
    • Michael Armbrust's avatar
      [SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions · 49702bd7
      Michael Armbrust authored
      Previously, we would open a new file for each new dynamic partition written out using `HadoopFsRelation`.  For formats like parquet this is very costly due to the buffers required to get good compression.  In this PR I refactor the code, allowing us to fall back on an external sort when many partitions are seen.  As such each task will open no more than `spark.sql.sources.maxFiles` files.  I also did the following cleanup:
      
       - Instead of keying the file HashMap on an expensive to compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations.
       - The control flow for instantiating and invoking a writer container has been simplified.  Now instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`.
       - `InternalOutputWriter` has been removed.  Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method.  This method can be overridden by internal data sources to avoid the conversion.  This change removes a lot of code duplication and per-row `asInstanceOf` checks.
       - `commands.scala` has been split up.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #8010 from marmbrus/fsWriting and squashes the following commits:
      
      00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes
      775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting
      17b690e [Michael Armbrust] remove comment
      40f0372 [Michael Armbrust] address comments
      f5675bd [Michael Armbrust] char -> string
      7e2d0a4 [Michael Armbrust] make sure we close current writer
      8100100 [Michael Armbrust] delete empty commands.scala
      71cc717 [Michael Armbrust] update comment
      8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions
      49702bd7
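      A schematic of the fallback strategy in plain Scala, with made-up row and writer types rather than Spark's writer containers: keep one writer per partition key until a cap is hit, then defer the remaining rows and sort them by key so they can be written with only one file open at a time.

      ```scala
      import scala.collection.mutable

      object DynamicPartitionFallback {
        type PartitionKey = String
        case class Row(key: PartitionKey, value: Int)

        class Writer(val key: PartitionKey) {
          def write(r: Row): Unit = () // stand-in for a real output writer
          def close(): Unit = ()
        }

        def writeAll(rows: Iterator[Row], maxOpenFiles: Int): Unit = {
          val writers = mutable.Map.empty[PartitionKey, Writer]
          var deferred = List.empty[Row]

          rows.foreach { row =>
            writers.get(row.key) match {
              case Some(w) => w.write(row)
              case None if writers.size < maxOpenFiles =>
                val w = new Writer(row.key); writers(row.key) = w; w.write(row)
              case None =>
                deferred = row :: deferred // too many open files: defer to the sort-based path
            }
          }
          writers.values.foreach(_.close())

          // Fallback: sort deferred rows by partition key so each key's rows are contiguous,
          // then stream them out with a single writer open at any moment.
          var current: Option[Writer] = None
          deferred.sortBy(_.key).foreach { row =>
            if (!current.exists(_.key == row.key)) {
              current.foreach(_.close())
              current = Some(new Writer(row.key))
            }
            current.foreach(_.write(row))
          }
          current.foreach(_.close())
        }

        def main(args: Array[String]): Unit = {
          writeAll(Iterator(Row("a", 1), Row("b", 2), Row("c", 3), Row("a", 4)), maxOpenFiles = 2)
          println("done")
        }
      }
      ```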
    • Bertrand Dechoux's avatar
      [SPARK-9748] [MLLIB] Centriod typo in KMeansModel · 902334fd
      Bertrand Dechoux authored
      A minor typo (centriod -> centroid). Readable variable names help all users.
      
      Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
      
      Closes #8037 from BertrandDechoux/kmeans-typo and squashes the following commits:
      
      47632fe [Bertrand Dechoux] centriod typo
      902334fd
    • Dariusz Kobylarz's avatar
      [SPARK-8481] [MLLIB] GaussianMixtureModel predict accepting single vector · e2fbbe73
      Dariusz Kobylarz authored
      Resubmit of [https://github.com/apache/spark/pull/6906] for adding single-vec predict to GMMs
      
      CC: dkobylarz  mengxr
      
      To be merged with master and branch-1.5
      Primary author: dkobylarz
      
      Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
      
      Closes #8039 from jkbradley/gmm-predict-vec and squashes the following commits:
      
      bfbedc4 [Dariusz Kobylarz] [SPARK-8481] [MLlib] GaussianMixtureModel predict accepting single vector
      e2fbbe73
    • Andrew Or's avatar
      [SPARK-9674] Re-enable ignored test in SQLQuerySuite · 881548ab
      Andrew Or authored
      The original code that this test tests is removed in https://github.com/apache/spark/commit/9270bd06fd0b16892e3f37213b5bc7813ea11fdd. It was ignored shortly before that so we never caught it. This patch re-enables the test and adds the code necessary to make it pass.
      
      JoshRosen yhuai
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8015 from andrewor14/SPARK-9674 and squashes the following commits:
      
      225eac2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into SPARK-9674
      8c24209 [Andrew Or] Fix NPE
      e541d64 [Andrew Or] Track aggregation memory for both sort and hash
      0be3a42 [Andrew Or] Fix test
      881548ab
    • Reynold Xin's avatar
      [SPARK-9733][SQL] Improve physical plan explain for data sources · 05d04e10
      Reynold Xin authored
      All data sources show up as "PhysicalRDD" in physical plan explain. It'd be better if we can show the name of the data source.
      
      Without this patch:
      ```
      == Physical Plan ==
      NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false))
       Exchange hashpartitioning(date#0,cat#1)
        NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false))
         PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at
      ```
      
      With this patch:
      ```
      == Physical Plan ==
      TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)]
       Exchange hashpartitioning(date#0,cat#1)
        TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)]
         ConvertToUnsafe
          Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2]
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8024 from rxin/SPARK-9733 and squashes the following commits:
      
      811b90e [Reynold Xin] Fixed Python test case.
      52cab77 [Reynold Xin] Cast.
      eea9ccc [Reynold Xin] Fix test case.
      fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.
      05d04e10
    • Reynold Xin's avatar
      [SPARK-9667][SQL] followup: Use GenerateUnsafeProjection.canSupport to test... · aeddeafc
      Reynold Xin authored
      [SPARK-9667][SQL] followup: Use GenerateUnsafeProjection.canSupport to test Exchange supported data types.
      
      This way we recursively test the data types.
      
      cc chenghao-intel
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8036 from rxin/cansupport and squashes the following commits:
      
      f7302ff [Reynold Xin] Can GenerateUnsafeProjection.canSupport to test Exchange supported data types.
      aeddeafc
    • Reynold Xin's avatar
      [SPARK-9736] [SQL] JoinedRow.anyNull should delegate to the underlying rows. · 9897cc5e
      Reynold Xin authored
      JoinedRow.anyNull currently loops through every field to check for null, which is inefficient if the underlying rows are UnsafeRows. It should just delegate to the underlying implementation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8027 from rxin/SPARK-9736 and squashes the following commits:
      
      03a2e92 [Reynold Xin] Include all files.
      90f1add [Reynold Xin] [SPARK-9736][SQL] JoinedRow.anyNull should delegate to the underlying rows.
      9897cc5e
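      The idea, as a tiny sketch over a hypothetical joined-row wrapper rather than Spark's JoinedRow:

      ```scala
      object AnyNullDelegation {
        trait RowLike {
          def numFields: Int
          def isNullAt(i: Int): Boolean
          // Field-by-field fallback; an UnsafeRow can answer this from its null bitset instead.
          def anyNull: Boolean = (0 until numFields).exists(isNullAt)
        }

        class Joined(left: RowLike, right: RowLike) extends RowLike {
          def numFields: Int = left.numFields + right.numFields
          def isNullAt(i: Int): Boolean =
            if (i < left.numFields) left.isNullAt(i) else right.isNullAt(i - left.numFields)
          // Delegate instead of re-scanning every field of the combined row.
          override def anyNull: Boolean = left.anyNull || right.anyNull
        }

        def main(args: Array[String]): Unit = {
          val joined = new Joined(
            new RowLike { def numFields = 2; def isNullAt(i: Int) = false },
            new RowLike { def numFields = 1; def isNullAt(i: Int) = i == 0 })
          println(joined.anyNull) // true
        }
      }
      ```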
    • Wenchen Fan's avatar
      [SPARK-8382] [SQL] Improve Analysis Unit test framework · 2432c2e2
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8025 from cloud-fan/analysis and squashes the following commits:
      
      51461b1 [Wenchen Fan] move test file to test folder
      ec88ace [Wenchen Fan] Improve Analysis Unit test framework
      2432c2e2
    • Reynold Xin's avatar
      [SPARK-9674][SPARK-9667] Remove SparkSqlSerializer2 · 76eaa701
      Reynold Xin authored
      It is now subsumed by various Tungsten operators.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7981 from rxin/SPARK-9674 and squashes the following commits:
      
      144f96e [Reynold Xin] Re-enable test
      58b7332 [Reynold Xin] Disable failing list.
      fb797e3 [Reynold Xin] Match all UDTs.
      be9f243 [Reynold Xin] Updated if.
      71fc99c [Reynold Xin] [SPARK-9674][SPARK-9667] Remove GeneratedAggregate & SparkSqlSerializer2.
      76eaa701
    • zsxwing's avatar
      [SPARK-9467][SQL]Add SQLMetric to specialize accumulators to avoid boxing · ebfd91c5
      zsxwing authored
      This PR adds SQLMetric/SQLMetricParam/SQLMetricValue to specialize accumulators to avoid boxing. All SQL metrics should use these classes rather than `Accumulator`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7996 from zsxwing/sql-accu and squashes the following commits:
      
      14a5f0a [zsxwing] Address comments
      367ca23 [zsxwing] Use localValue directly to avoid changing Accumulable
      42f50c3 [zsxwing] Add SQLMetric to specialize accumulators to avoid boxing
      ebfd91c5
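      The boxing issue the new classes avoid can be shown with a generic versus a specialized counter in plain Scala (not the actual SQLMetric API):

      ```scala
      object MetricBoxing {
        // Generic version: because T erases to Object, every add allocates a boxed java.lang.Long.
        class GenericMetric[T](var value: T)(implicit num: Numeric[T]) {
          def add(v: T): Unit = { value = num.plus(value, v) }
        }

        // Specialized version: the counter stays an unboxed primitive long.
        class LongMetric(var value: Long) {
          def add(v: Long): Unit = { value += v }
        }

        def main(args: Array[String]): Unit = {
          val boxed = new GenericMetric[Long](0L)
          val unboxed = new LongMetric(0L)
          (1 to 1000000).foreach { _ => boxed.add(1L); unboxed.add(1L) }
          println(s"${boxed.value} ${unboxed.value}") // 1000000 1000000
        }
      }
      ```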