Commits · c07838b5a9cdf96c0f49055ea1c397e0f0e915d2 · cs525-sp18-g07 / spark

Jul 21, 2015

[SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector. · c07838b5

Dennis Huo authored 9 years ago

IsolatedClientLoader.isSharedClass includes all of com.google.\*, presumably
for Guava, protobuf, and/or other shared Google libraries, but needs to
count com.google.cloud.\* as "hive classes" when determining which ClassLoader
to use. Otherwise, things like HiveContext.parquetFile will throw a
ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://)
path. On StackOverflow: http://stackoverflow.com/questions/31478955
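
The rule described above can be sketched as a prefix check in which the narrower `com.google.cloud.` prefix is tested before the broader `com.google.` one. This is a minimal illustration, not the actual `IsolatedClientLoader` code: the prefix tuples and the function name `is_shared_class` are assumptions, and the real rule set covers more packages.

```python
# Hypothetical prefix lists, for illustration only; the real rule set in
# IsolatedClientLoader is longer and is not reproduced here.
SHARED_PREFIXES = ("com.google.",)
HIVE_SIDE_PREFIXES = ("com.google.cloud.",)


def is_shared_class(name: str) -> bool:
    """Return True if the class should be loaded by the shared class loader."""
    # The narrower exclusion is checked first, so GCS connector classes are
    # treated like Hive classes even though they live under com.google.
    if name.startswith(HIVE_SIDE_PREFIXES):
        return False
    return name.startswith(SHARED_PREFIXES)
```

Checking the exclusion first keeps the broad rule for Guava and protobuf intact while routing connector classes to the same loader as the Hive classes.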

EDIT: Adding yhuai who worked on the relevant classloading isolation pieces.

Author: Dennis Huo <dhuo@google.com>

Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits:

1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.

[SPARK-8906][SQL] Move all internal data source classes into execution.datasources. · 60c0ce13

Reynold Xin authored 9 years ago

This way, the sources package contains only public facing interfaces.

Author: Reynold Xin <rxin@databricks.com>

Closes #7565 from rxin/move-ds and squashes the following commits:

7661aff [Reynold Xin] Mima
9d5196a [Reynold Xin] Rearranged imports.
3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.

[SPARK-8357] Fix unsafe memory leak on empty inputs in GeneratedAggregate · 9ba7c64d

navis.ryu authored 9 years ago

This patch fixes a managed memory leak in GeneratedAggregate. The leak occurs when the unsafe aggregation path is used to perform grouped aggregation on an empty input; in this case, GeneratedAggregate allocates an UnsafeFixedWidthAggregationMap that is never cleaned up because `next()` is never called on the aggregate result iterator.

This patch fixes this by short-circuiting on empty inputs.
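
A rough sketch of the short-circuit idea, using a toy stand-in for the aggregation map. `TrackedMap` and `aggregate` are hypothetical names, and the real operator works on Spark's managed memory rather than a Python counter; the point is only that an empty input returns before anything is allocated.

```python
import itertools


class TrackedMap:
    """Stand-in for a buffer backed by manually managed memory."""

    live = 0  # number of maps allocated but not yet freed

    def __init__(self):
        TrackedMap.live += 1
        self.groups = {}

    def free(self):
        TrackedMap.live -= 1


def aggregate(rows):
    """Sum values per key; the map is freed once the result is drained."""
    it = iter(rows)
    try:
        first = next(it)
    except StopIteration:
        # Short-circuit: an empty input never allocates a map,
        # so there is nothing that could be left unfreed.
        return iter(())

    agg = TrackedMap()
    for key, value in itertools.chain([first], it):
        agg.groups[key] = agg.groups.get(key, 0) + value

    def results():
        yield from agg.groups.items()
        agg.free()  # reached only when the consumer drains the iterator

    return results()
```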

This patch is an updated version of #6810.

Closes #6810.

Author: navis.ryu <navis@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7560 from JoshRosen/SPARK-8357 and squashes the following commits:

3486ce4 [Josh Rosen] Some minor cleanup
c649310 [Josh Rosen] Revert SparkPlan change:
3c7db0f [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-8357
adc8239 [Josh Rosen] Back out Projection changes.
c5419b3 [navis.ryu] addressed comments
143e1ef [navis.ryu] fixed format & added test for CCE case
735972f [navis.ryu] used new conf apis
1a02a55 [navis.ryu] Rolled-back test-conf cleanup & fixed possible CCE & added more tests
51178e8 [navis.ryu] addressed comments
4d326b9 [navis.ryu] fixed test fails
15c5afc [navis.ryu] added a test as suggested by JoshRosen
d396589 [navis.ryu] added comments
1b07556 [navis.ryu] [SPARK-8357] [SQL] Memory leakage on unsafe aggregation path with empty input

Revert "[SPARK-9154] [SQL] codegen StringFormat" · 87d890cc

Michael Armbrust authored 9 years ago

This reverts commit 7f072c3d.

Revert #7546

Author: Michael Armbrust <michael@databricks.com>

Closes #7570 from marmbrus/revert9154 and squashes the following commits:

ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"

[SPARK-5989] [MLLIB] Model save/load for LDA · 89db3c0b

MechCoder authored 9 years ago

Add support for saving and loading LDA models, for both the local and distributed versions.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6948 from MechCoder/lda_save_load and squashes the following commits:

49bcdce [MechCoder] minor style fixes
cc14054 [MechCoder] minor
4587d1d [MechCoder] Minor changes
c753122 [MechCoder] Load and save the model in private methods
2782326 [MechCoder] [SPARK-5989] Model save/load for LDA

[SPARK-9154] [SQL] codegen StringFormat · 7f072c3d

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9154

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7546 from tarekauel/SPARK-9154 and squashes the following commits:

a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format

[SPARK-5423] [CORE] Register a TaskCompletionListener to make sure release all resources · d45355ee

zsxwing authored 9 years ago

Make `DiskMapIterator.cleanup` idempotent and register a TaskCompletionListener to make sure `cleanup` is called.
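
The pattern can be sketched as follows. The `TaskContext` and `SpillIterator` classes here are toy stand-ins written for this example, not Spark's classes; the sketch only shows why `cleanup` must tolerate being called both at exhaustion and again at task completion.

```python
class TaskContext:
    """Toy task context that runs registered callbacks when the task ends."""

    def __init__(self):
        self._listeners = []

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def mark_completed(self):
        for fn in self._listeners:
            fn()


class SpillIterator:
    """Iterator over spilled data that releases its resources exactly once."""

    def __init__(self, items, context):
        self._items = iter(items)
        self._closed = False
        self.cleanup_calls = 0  # counts cleanups that actually did work
        # Registering here guarantees cleanup even if the consumer
        # stops iterating before the data is exhausted.
        context.add_completion_listener(self.cleanup)

    def cleanup(self):
        if self._closed:  # idempotent: later calls are no-ops
            return
        self._closed = True
        self.cleanup_calls += 1

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._items)
        except StopIteration:
            self.cleanup()
            raise
```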

Author: zsxwing <zsxwing@gmail.com>

Closes #7529 from zsxwing/SPARK-5423 and squashes the following commits:

3e3c413 [zsxwing] Remove TODO
9556c78 [zsxwing] Fix NullPointerException for tests
3d574d9 [zsxwing] Register a TaskCompletionListener to make sure release all resources

[SPARK-4598] [WEBUI] Task table pagination for the Stage page · 4f7f1ee3

zsxwing authored 9 years ago

This PR adds pagination for the task table to solve the scalability issue of the stage page. Here is the initial screenshot:
<img width="1347" alt="pagination" src="https://cloud.githubusercontent.com/assets/1000778/8679669/9e63863c-2a8e-11e5-94e4-994febcd6717.png">
The task table shows only 100 tasks at a time. There is a page navigation bar above the table. Users can click the page navigation or type a page number to jump to another page. The table can be sorted by clicking the headers. However, unlike the previous implementation, sorting is now done on the server, so clicking a table column to sort refreshes the web page.
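
As an illustration of server-side sorting combined with pagination, here is a small sketch. The function `task_page` and the dictionary-based task records are assumptions made for the example; the default page size of 100 matches the description above.

```python
def task_page(tasks, sort_key, page=1, page_size=100, descending=False):
    """Sort all tasks on the server, then return one page and the page count."""
    if page_size < 1:
        raise ValueError("page size must be positive")
    ordered = sorted(tasks, key=lambda t: t[sort_key], reverse=descending)
    total_pages = max(1, -(-len(ordered) // page_size))  # ceiling division
    if not 1 <= page <= total_pages:
        raise ValueError(f"page must be between 1 and {total_pages}")
    start = (page - 1) * page_size
    return ordered[start:start + page_size], total_pages
```

Because the sort runs over the full task list before slicing, each page is consistent with the global order, which is why a sort click requires a new request.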

Author: zsxwing <zsxwing@gmail.com>

Closes #7399 from zsxwing/task-table-pagination and squashes the following commits:

144f513 [zsxwing] Display the page navigation when the page number is out of range
a3eee22 [zsxwing] Add extra space for the error message
54c5b84 [zsxwing] Reset page to 1 if the user changes the page size
c2f7f39 [zsxwing] Add a text field to let users fill the page size
bad52eb [zsxwing] Display user-friendly error messages
410586b [zsxwing] Scroll down to the tasks table if the url contains any sort column
a0746d1 [zsxwing] Use expand-dag-viz-arrow-job and expand-dag-viz-arrow-stage instead of expand-dag-viz-arrow-true and expand-dag-viz-arrow-false
b123f67 [zsxwing] Use localStorage to remember the user's actions and replay them when loading the page
894a342 [zsxwing] Show the link cursor when hovering for headers and page links and other minor fix
4d4fecf [zsxwing] Address Carson's comments
d9285f0 [zsxwing] Add comments and fix the style
74285fa [zsxwing] Merge branch 'master' into task-table-pagination
db6c859 [zsxwing] Task table pagination for the Stage page

[SPARK-7171] Added a method to retrieve metrics sources in TaskContext · 31954910

Jacek Lewandowski authored 9 years ago

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #5805 from jacek-lewandowski/SPARK-7171 and squashes the following commits:

ed20bda [Jacek Lewandowski] SPARK-7171: Added a method to retrieve metrics sources in TaskContext

[SPARK-9128] [CORE] Get outerclasses and objects with only one method calling in ClosureCleaner · 9a4fd875

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-9128

Currently, in `ClosureCleaner`, the outer classes and objects are retrieved using two different methods. However, the logic of the two methods is the same, so we can get both the outer classes and the objects with a single method call.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7459 from viirya/remove_extra_closurecleaner and squashes the following commits:

7c9858d [Liang-Chi Hsieh] For comments.
a096941 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into remove_extra_closurecleaner
2ec5ce1 [Liang-Chi Hsieh] Remove unnecessary methods.
4df5a51 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into remove_extra_closurecleaner
dc110d1 [Liang-Chi Hsieh] Add method to get outerclasses and objects at the same time.

[SPARK-9036] [CORE] SparkListenerExecutorMetricsUpdate messages not included in JsonProtocol · f67da43c

Ben authored 9 years ago

This PR implements a JSON serializer and deserializer in `JsonProtocol` to handle the (de)serialization of SparkListenerExecutorMetricsUpdate events. It also includes a unit test in `JsonProtocolSuite`. This was implemented to satisfy the improvement request in JIRA issue SPARK-9036.

Author: Ben <benjaminpiering@gmail.com>

Closes #7555 from NamelessAnalyst/master and squashes the following commits:

fb4e3cc [Ben] Update JSON Protocol and tests
aa69517 [Ben] Update JSON Protocol and tests --Corrected Stage Attempt to Stage Attempt ID
33e5774 [Ben] Update JSON Protocol Tests
3f237e7 [Ben] Update JSON Protocol Tests
84ca798 [Ben] Update JSON Protocol Tests
cde57a0 [Ben] Update JSON Protocol Tests
8049600 [Ben] Update JSON Protocol Tests
c5bc061 [Ben] Update JSON Protocol Tests
6f25785 [Ben] Merge remote-tracking branch 'origin/master'
df2a609 [Ben] Update JSON Protocol
dcda80b [Ben] Update JSON Protocol

[SPARK-9193] Avoid assigning tasks to "lost" executor(s) · 6592a605

Grace authored 9 years ago

Currently, when some executors are killed by dynamic allocation, tasks are sometimes mis-assigned to those lost executors. Such mis-assignment causes task failures, or even job failure if the error repeats 4 times.

The root cause is that ***killExecutors*** doesn't remove the executors being killed right away. It relies on the ***OnDisassociated*** event to refresh the active working list later. The delay depends on the cluster status (from several milliseconds to just under a minute). When new tasks are scheduled during that period, they are assigned to executors that are still "active" but being killed, and then fail with "executor lost". A better approach is to exclude the executors being killed in makeOffers(), so that no tasks are allocated to executors that are about to be lost.
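
A minimal sketch of the approach, with a toy scheduler standing in for the real backend. The class `Scheduler`, its fields, and the method names in snake case are hypothetical; only the idea of filtering pending-kill executors out of the offers comes from the description above.

```python
class Scheduler:
    """Toy scheduler backend that tracks executors awaiting removal."""

    def __init__(self, executors):
        self.free_cores = dict(executors)  # executor id -> free cores
        self.pending_removal = set()       # kill requested, not yet confirmed

    def kill_executors(self, ids):
        # Remember the request immediately instead of waiting for the
        # disassociation event to arrive.
        self.pending_removal.update(i for i in ids if i in self.free_cores)

    def on_disassociated(self, executor_id):
        self.free_cores.pop(executor_id, None)
        self.pending_removal.discard(executor_id)

    def make_offers(self):
        # Offer resources only on executors that are not being killed.
        return {e: c for e, c in self.free_cores.items()
                if e not in self.pending_removal}
```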

Author: Grace <jie.huang@intel.com>

Closes #7528 from GraceH/AssignToLostExecutor and squashes the following commits:

ecc1da6 [Grace] scala style fix
6e2ed96 [Grace] Re-word makeOffers by more readable lines
b5546ce [Grace] Add comments about the fix
30a9ad0 [Grace] Avoid assigning tasks to lost executors

[SPARK-8915] [DOCUMENTATION, MLLIB] Added @since tags to mllib.classification · df4ddb31

petz2000 authored 9 years ago

Created `@since` tags for methods in mllib.classification.

Author: petz2000 <petz2000@gmail.com>

Closes #7371 from petz2000/add_since_mllib.classification and squashes the following commits:

39fe291 [petz2000] Removed whitespace in block comment
c9b1e03 [petz2000] Removed @since tags again from protected and private methods
cd759b6 [petz2000] Added @since tags to methods

[SPARK-9081] [SPARK-9168] [SQL] nanvl & dropna/fillna supporting nan as well · be5c5d37

Yijie Shen authored 9 years ago

JIRA:
https://issues.apache.org/jira/browse/SPARK-9081
https://issues.apache.org/jira/browse/SPARK-9168

This PR makes three modifications:
1.  Change `isNaN` to return `false` on `null` input
2.  Make `dropna` and `fillna` drop/fill NaN values as well
3.  Implement `nanvl`
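
The intended semantics can be sketched in plain Python, with `None` playing the role of SQL `null`. The helper names below mirror the SQL functions but are written for this example only, and rows are modelled as tuples; the actual expressions operate on Catalyst rows.

```python
import math


def is_nan(x):
    """NaN test that returns False (not null) for a null input."""
    return x is not None and isinstance(x, float) and math.isnan(x)


def nanvl(a, b):
    """Return a unless it is NaN, in which case return b."""
    return b if is_nan(a) else a


def dropna(rows):
    """Drop rows that contain a null or a NaN in any column."""
    return [r for r in rows if not any(v is None or is_nan(v) for v in r)]


def fillna(rows, value):
    """Replace nulls and NaNs with the given value."""
    return [tuple(value if v is None or is_nan(v) else v for v in r)
            for r in rows]
```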

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7523 from yjshen/fillna_dropna and squashes the following commits:

f0a51db [Yijie Shen] make coalesce untouched and implement nanvl
1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna
2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna

[SPARK-8401] [BUILD] Scala version switching build enhancements · f5b6dc5e

Michael Allman authored 9 years ago

These commits address a few minor issues in the Scala cross-version support in the build:

1. Correct two missing `${scala.binary.version}` pom file substitutions.
2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.

This is my original work and I license this work to the Spark project under the Apache License.

Author: Michael Allman <michael@videoamp.com>

Closes #6832 from mallman/scala-versions and squashes the following commits:

cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}

[SPARK-8875] Remove BlockStoreShuffleFetcher class · 6364735b

Kay Ousterhout authored 9 years ago

The shuffle code has gotten increasingly difficult to read as it has evolved, and many classes
have evolved significantly since they were originally created. The BlockStoreShuffleFetcher class
now serves little purpose other than to make the code more difficult to read; this commit moves its
functionality into the ShuffleBlockFetcherIterator class.

cc massie JoshRosen (Josh, this PR also removes the Try you pointed out as being confusing / not necessarily useful in a previous comment). Matt, would be helpful to know whether this will interfere in any negative ways with your new shuffle PR (I took a look and it seems like this should still cleanly integrate with your parquet work, but want to double check).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #7268 from kayousterhout/SPARK-8875 and squashes the following commits:

2b24a97 [Kay Ousterhout] Fixed DAGSchedulerSuite compile error
98a1831 [Kay Ousterhout] Merge remote-tracking branch 'upstream/master' into SPARK-8875
90f0e89 [Kay Ousterhout] Fixed broken test
14bfcbb [Kay Ousterhout] Last style fix
bc69d2b [Kay Ousterhout] Style improvements based on Josh's code review
ad3c8d1 [Kay Ousterhout] Better documentation for MapOutputTracker methods
0bc0e59 [Kay Ousterhout] [SPARK-8875] Remove BlockStoreShuffleFetcher class

[SPARK-9173][SQL]UnionPushDown should also support Intersect and Except · ae230596

Yijie Shen authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-9173

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7540 from yjshen/union_pushdown and squashes the following commits:

278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except

[SPARK-8230][SQL] Add array/map size method · 560c658a

Pedro Rodriguez authored 9 years ago

Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230

This PR implements array/map size for Spark SQL. The code is ready for review by a committer. Cheng Hao is on the JIRA ticket, but I don't know his GitHub username; rxin is also on the JIRA ticket.

Things to review:
1. Where should the added functions go, namespace-wise? They seem to be part of a small group of collection operations that includes `sort_array` and `array_contains`, hence the names `collectionOperations.scala` and `_collection_functions` in Python.
2. In the Python code, should it be in a `1.5.0` function array or in a collections array?
3. Are there any missing methods on the `Size` case class? Many of these functions appear to have generated Java code; is that also needed in this case?
4. Something else?

Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Pedro Rodriguez <prodriguez@trulia.com>

Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:

9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
9aea3bb [Pedro Rodriguez] removed imports from python docs
15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
d88247c [Pedro Rodriguez] removed python code
bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
130839f [Pedro Rodriguez] fixed failing test
aa9bade [Pedro Rodriguez] fix style
e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
9a1a2ff [Pedro Rodriguez] added unit tests for map size
2bfbcb6 [Pedro Rodriguez] added unit test for size
20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
99a6a5c [Pedro Rodriguez] fixed failing test
cac75ac [Pedro Rodriguez] fix style
933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
f9c3b8a [Pedro Rodriguez] added unit tests for map size
2515d9f [Pedro Rodriguez] added documentation
0e60541 [Pedro Rodriguez] added unit test for size
acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays

[SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replace · 8c8f0ef5

Cheng Hao authored 9 years ago

Add expressions `regexp_extract` & `regexp_replace`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7468 from chenghao-intel/regexp and squashes the following commits:

e5ea476 [Cheng Hao] minor update for documentation
ef96fd6 [Cheng Hao] update the code gen
72cf28f [Cheng Hao] Add more log for compilation error
4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support

[SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC · d38c5029

Cheng Lian authored 9 years ago

This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.

Author: Cheng Lian <lian@databricks.com>

Closes #7444 from liancheng/spark-9100 and squashes the following commits:

284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC

[SPARK-9161][SQL] codegen FormatNumber · 1ddd0f2f

Tarek Auel authored 9 years ago

Jira https://issues.apache.org/jira/browse/SPARK-9161

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7545 from tarekauel/SPARK-9161 and squashes the following commits:

21425c8 [Tarek Auel] [SPARK-9161][SQL] codegen FormatNumber

[SPARK-9179] [BUILD] Use default primary author if unspecified · 228ab65a

Shivaram Venkataraman authored 9 years ago

Fixes the feature introduced in #7508 so that it uses the default value if nothing is specified on the command line.

cc liancheng rxin pwendell

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #7558 from shivaram/merge-script-fix and squashes the following commits:

7092141 [Shivaram Venkataraman] Use default primary author if unspecified

[SPARK-9023] [SQL] Followup for #7456 (Efficiency improvements for UnsafeRows in Exchange) · 48f8fd46

Josh Rosen authored 9 years ago

This patch addresses code review feedback from #7456.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7551 from JoshRosen/unsafe-exchange-followup and squashes the following commits:

76dbdf8 [Josh Rosen] Add comments + more methods to UnsafeRowSerializer
3d7a1f2 [Josh Rosen] Add writeToStream() method to UnsafeRow

[SPARK-9208][SQL] Remove variant of DataFrame string functions that accept column names. · 67570bee

Reynold Xin authored 9 years ago

It can be ambiguous whether that is a string literal or a column name.

cc marmbrus

Author: Reynold Xin <rxin@databricks.com>

Closes #7556 from rxin/str-exprs and squashes the following commits:

92afa83 [Reynold Xin] [SPARK-9208][SQL] Remove variant of DataFrame string functions that accept column names.

[SPARK-9157] [SQL] codegen substring · 560b355c

Tarek Auel authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-9157

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7534 from tarekauel/SPARK-9157 and squashes the following commits:

e65e3e9 [Tarek Auel] [SPARK-9157] indent fix
44e89f8 [Tarek Auel] [SPARK-9157] use EMPTY_UTF8
37d54c4 [Tarek Auel] Merge branch 'master' into SPARK-9157
60732ea [Tarek Auel] [SPARK-9157] created substringSQL in UTF8String
18c3576 [Tarek Auel] [SPARK-9157][SQL] remove slice pos
1a2e611 [Tarek Auel] [SPARK-9157][SQL] codegen substring

[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and... · c032b0bf

Josh Rosen authored 9 years ago

[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL

This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.

Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).

In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
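
The semantics described above can be sketched in Python for doubles. The helper names and the choice of `0x7FF8000000000000` as the canonical bit pattern are assumptions for this illustration (that value is the quiet NaN pattern produced by Java's `Double.doubleToLongBits`); the patch itself changes Spark's comparison and `UnsafeRow` code.

```python
import math
import struct


def sort_key(x):
    """Total order for doubles in which NaN sorts after every other value."""
    # Tuples compare element by element, so (1, ...) follows every (0, x).
    return (1, 0.0) if math.isnan(x) else (0, x)


def sql_equal(a, b):
    """Equality in which NaN = NaN is true."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b


CANONICAL_NAN_BITS = 0x7FF8000000000000  # assumed canonical quiet NaN


def canonical_bits(x):
    """64-bit pattern of x, with every NaN mapped to one representation."""
    if math.isnan(x):
        return CANONICAL_NAN_BITS
    return struct.unpack(">Q", struct.pack(">d", x))[0]
```

With a single canonical bit pattern, two rows holding NaNs with different payloads compare equal byte for byte, which is what binary equality comparison needs.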

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7194 from JoshRosen/nan and squashes the following commits:

983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
88bd73c [Josh Rosen] Fix Row.equals()
a702e2e [Josh Rosen] normalization -> canonicalization
a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
b31eb19 [Josh Rosen] Uncomment failing tests
7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
fc6b4d2 [Josh Rosen] Update CodeGenerator
3998ef2 [Josh Rosen] Remove unused code
a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs
a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
b20837b [Josh Rosen] Add failing test for new NaN comparision ordering
5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
630ebc5 [Josh Rosen] Specify an ordering for NaN values.
9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
13fc06a [Josh Rosen] Add regression test for NaN sorting issue
f9efbb5 [Josh Rosen] Fix ORDER BY NULL
e7dc4fb [Josh Rosen] Add very generic test for ordering
7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.

[SPARK-9204][ML] Add default params test for linearyregression suite · 4d97be95

Holden Karau authored 9 years ago

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7553 from holdenk/SPARK-9204-add-default-params-test-to-linear-regression and squashes the following commits:

630ba19 [Holden Karau] style fix
faa08a3 [Holden Karau] Add default params test for linearyregression suite

[SPARK-9132][SPARK-9163][SQL] codegen conv · a3c7a3ce

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9132
https://issues.apache.org/jira/browse/SPARK-9163

rxin as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed any of the logic of `NumberConverter`.

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7552 from tarekauel/SPARK-9163 and squashes the following commits:

40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix
fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv

Jul 20, 2015

[SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula · 1cbdd899

Eric Liang authored 9 years ago

This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.

The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7483 from ericl/spark-8774 and squashes the following commits:

3dfac0c [Eric Liang] update
17ef516 [Eric Liang] more comments
1753a0f [Eric Liang] make glm generic
b0f50f8 [Eric Liang] equivalence test
550d56d [Eric Liang] export methods
c015697 [Eric Liang] second pass
117949a [Eric Liang] comments
5afbc67 [Eric Liang] test label columns
6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
d33211b [Eric Liang] r support
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer

[SPARK-9052] [SPARKR] Fix comments after curly braces · 2bdf9914

Yu ISHIKAWA authored 9 years ago

[[SPARK-9052] Fix comments after curly braces - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9052)

This is the full result of lintr at revision 01155162:
[[SPARK-9052] the result of lint-r at the revision:01155162](https://gist.github.com/yu-iskw/e7246041b173a3f29482)

This is the difference between the results before and after the change:
https://gist.github.com/yu-iskw/e7246041b173a3f29482/revisions

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7440 from yu-iskw/SPARK-9052 and squashes the following commits:

015d738 [Yu ISHIKAWA] Fix the indentations and move the placement of commna
5cc30fe [Yu ISHIKAWA] Fix the indentation in a condition
4ead0e5 [Yu ISHIKAWA] [SPARK-9052][SparkR] Fix comments after curly braces

[SPARK-9164] [SQL] codegen hex/unhex · 936a96cb

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9164

The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from both `eval` and `codeGen`.

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7548 from tarekauel/SPARK-9164 and squashes the following commits:

dd91c57 [Tarek Auel] [SPARK-9164][SQL] codegen hex/unhex

[SPARK-9142][SQL] Removing unnecessary self types in expressions. · e90543e5

Reynold Xin authored 9 years ago

Also added documentation to expressions to explain the important traits and abstract classes.

Author: Reynold Xin <rxin@databricks.com>

Closes #7550 from rxin/remove-self-types and squashes the following commits:

b2a3ec1 [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in expressions.

[SPARK-9156][SQL] codegen StringSplit · 6853ac7c

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9156

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7547 from tarekauel/SPARK-9156 and squashes the following commits:

0be2700 [Tarek Auel] [SPARK-9156][SQL] indention fix
b860eaf [Tarek Auel] [SPARK-9156][SQL] codegen StringSplit
5ad6a1f [Tarek Auel] [SPARK-9156] codegen StringSplit

[SPARK-9178][SQL] Add an empty string constant to UTF8String · 047ccc8c

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9178

In order to avoid calls of `UTF8String.fromString("")`, this PR adds an empty-string constant to `UTF8String` (initially named `EMPTY_STRING`, renamed to `EMPTY_UTF8` in a later commit). A `UTF8String` is immutable, so we can use a constant, can't we?

I searched for current usage of `UTF8String.fromString("")` with
`grep -R  "UTF8String.fromString(\"\")" .`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7509 from tarekauel/SPARK-9178 and squashes the following commits:

8d6c405 [Tarek Auel] [SPARK-9178] revert intellij indents
3627b80 [Tarek Auel] [SPARK-9178] revert concat tests changes
3f5fbf5 [Tarek Auel] [SPARK-9178] rebase and add final to UTF8String.EMPTY_UTF8
47cda68 [Tarek Auel] Merge branch 'master' into SPARK-9178
4a37344 [Tarek Auel] [SPARK-9178] changed name to EMPTY_UTF8, added tests
748b87a [Tarek Auel] [SPARK-9178] Add empty string constant to UTF8String

[SPARK-9187] [WEBUI] Timeline view may show negative value for running tasks · 66bb8003

Carson Wang authored 9 years ago

For running tasks, the executorRunTime metric is 0, which causes a negative executorComputingTime in the timeline. It also causes an incorrect SchedulerDelay time.
![timelinenegativevalue](https://cloud.githubusercontent.com/assets/9278199/8770953/f4362378-2eec-11e5-81e6-a06a07c04794.png)

Author: Carson Wang <carson.wang@intel.com>

Closes #7526 from carsonwang/timeline-negValue and squashes the following commits:

7b17db2 [Carson Wang] Fix negative value in timeline view

[SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1 · ff3c72db

Meihua Wu authored 9 years ago

Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1
Also include unit tests to verify the fix.
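
For reference, gemm computes `C := alpha * A * B + beta * C`, so when `alpha == 0` the product can be skipped but `C` must still be scaled by `beta`. The sketch below shows that behaviour on nested Python lists; it is an illustration of the expected result, not the MLlib implementation.

```python
def gemm(alpha, a, b, beta, c):
    """Compute C := alpha * A * B + beta * C in place on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            if alpha != 0.0:
                acc = alpha * sum(a[i][k] * b[k][j] for k in range(inner))
            # The beta scaling is applied even when the product is
            # skipped, so alpha == 0 with beta != 1 still updates C.
            c[i][j] = acc + beta * c[i][j]
    return c
```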

mengxr brkyvz

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits:

fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1

[SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests · a5d05819

Joseph K. Bradley authored 9 years ago

Several places in the PySpark SparseVector docs define a vector as:
```
SparseVector(4, [2, 4], [1.0, 2.0])
```
Index 4 is out of bounds for a vector of size 4, whose valid indices are 0 to 3 (but this is not checked).
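
To make the out-of-bounds condition concrete, here is a standalone check that a sparse vector's indices fit its declared size. The function `check_sparse_vector` is hypothetical and is not part of PySpark; the requirement that indices be strictly increasing is an extra assumption of this sketch.

```python
def check_sparse_vector(size, indices, values):
    """Raise if the (indices, values) pair is invalid for the given size."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    for i in indices:
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of bounds for size {size}")
    # Assumption of this sketch: indices are expected in increasing order.
    if any(x >= y for x, y in zip(indices, indices[1:])):
        raise ValueError("indices must be strictly increasing")
```

With this check, `check_sparse_vector(4, [2, 4], [1.0, 2.0])` raises `IndexError`, while a call such as `check_sparse_vector(4, [1, 3], [1.0, 2.0])` passes.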

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits:

c806a65 [Joseph K. Bradley] fixed doc test
e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests

[SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery · a1064df0

Cheng Lian authored 9 years ago

This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery.  The acceleration is done by the following means:

- Turning off schema merging by default

  Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.

- Avoiding `FileSystem.globStatus()` call when possible

  `FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3).  This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contains glob-pattern-specific character(s) (`{}[]*?\`).

  This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially.

- Listing leaf files in parallel when the number of input paths exceeds a threshold

  Listing leaf files is required by partition discovery.  Currently it is done on the driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC.  In this PR, we list leaf files in a BFS style, and resort to a Spark job once the number of directories that need to be listed exceeds a threshold.

  The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.

- Discovering Parquet schema in parallel

  Currently, schema merging is also done on driver side, and needs to read footers of all part-files.  This PR uses a Spark job to do schema merging.  Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now.
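
Two of the ideas above, skipping the glob call for plain paths and switching to parallel listing past a threshold, can be sketched as follows. All function names are hypothetical, the `expand` and `list_status` callbacks stand in for the file system calls, and a thread pool stands in for the Spark job; the threshold default of 32 is taken from the description.

```python
from concurrent.futures import ThreadPoolExecutor

GLOB_CHARS = set("{}[]*?\\")


def is_glob_path(path):
    """True if the path contains any glob-pattern-specific character."""
    return any(ch in GLOB_CHARS for ch in path)


def glob_path_if_necessary(path, expand):
    """Call the expensive expand(path) only when the path is a pattern."""
    return expand(path) if is_glob_path(path) else [path]


def list_leaf_files(roots, list_status, threshold=32):
    """Breadth-first listing; list_status(dir) yields (path, is_dir) pairs."""
    leaves, frontier = [], list(roots)
    while frontier:
        if len(frontier) > threshold:
            # Stand-in for the Spark job: list this level in parallel.
            with ThreadPoolExecutor() as pool:
                listings = list(pool.map(list_status, frontier))
        else:
            listings = [list_status(d) for d in frontier]
        frontier = []
        for entries in listings:
            for path, is_dir in entries:
                (frontier if is_dir else leaves).append(path)
    return leaves
```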

Author: Cheng Lian <lian@databricks.com>

Closes #7396 from liancheng/accel-parquet and squashes the following commits:

5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row]
ff32cd0 [Cheng Lian] Excludes directories while listing leaf files
3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false"
b1646aa [Cheng Lian] Should allow empty input paths
32e5f0d [Cheng Lian] Moves schema merging to executor side

[SPARK-9160][SQL] codegen encode, decode · dac7dbf5

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9160

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7543 from tarekauel/SPARK-9160 and squashes the following commits:

7528f0e [Tarek Auel] [SPARK-9160][SQL] codegen encode, decode

[SPARK-9159][SQL] codegen ascii, base64, unbase64 · c9db8eaa

Tarek Auel authored 9 years ago

Jira: https://issues.apache.org/jira/browse/SPARK-9159

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7542 from tarekauel/SPARK-9159 and squashes the following commits:

772e6bc [Tarek Auel] [SPARK-9159][SQL] codegen ascii, base64, unbase64
