Commits · 536d2adc12c88ff67b5dde3dec998bcacef50fba · cs525-sp18-g07 / spark

Aug 02, 2015

[SPARK-9535][SQL][DOCS] Modify document for codegen. · 536d2adc

KaiXinXiaoLei authored 9 years ago

#7142 made codegen enabled by default so let's modify the corresponding documents.

Closes #7142

Author: KaiXinXiaoLei <huleilei1@huawei.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #7863 from sarutak/SPARK-9535 and squashes the following commits:

0884424 [Kousuke Saruta] Removed a line which mentioned about the effect of codegen enabled
3c11af0 [Kousuke Saruta] Merge branch 'sqlconfig' of https://github.com/KaiXinXiaoLei/spark into SPARK-9535
4ee531d [KaiXinXiaoLei] delete space
4cfd11d [KaiXinXiaoLei] change spark.sql.planner.externalSort
d624cf8 [KaiXinXiaoLei] sql config is wrong


[SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter. · 9d03ad91

Reynold Xin authored 9 years ago

The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter(), working as follows:

1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types)
2. Run UnsafeKVExternalSorter on the generated data
3. Collect the output from the sorter, and make sure the keys are sorted in ascending order
4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match.
5. Check memory allocation to make sure there is no memory leak.

There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records.
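
The steps above can be sketched as a small Python model. This is an illustration under assumptions, not the code in UnsafeKVExternalSorterSuite: the fixed integer-key/string-value schema, the helper names `random_record`, `check_kv_sorter` and `sort_by_key`, and the use of Python's built-in sort as the sorter under test are all hypothetical. Step 5 (the memory leak check) and the spill flag are not modeled.

```python
import random


def random_record(rng):
    """Draw one (key, value) pair. The int/str schema is an arbitrary choice."""
    key = rng.randint(-1000, 1000)
    value = "".join(rng.choice("abcdef") for _ in range(rng.randint(0, 5)))
    return key, value


def check_kv_sorter(sorter, num_records=500, seed=0):
    """Run `sorter` on random input and verify ordering and completeness."""
    rng = random.Random(seed)
    # Step 1: generate random input.
    records = [random_record(rng) for _ in range(num_records)]
    # Step 2: run the sorter under test on a copy of the input.
    output = list(sorter(list(records)))
    # Step 3: keys must come out in ascending order.
    keys = [k for k, _ in output]
    assert all(a <= b for a, b in zip(keys, keys[1:])), "keys not ascending"
    # Step 4: sort both sides by key and value, then compare record by record.
    assert sorted(records) == sorted(output), "records lost or altered"
    return True


def sort_by_key(records):
    """Stand-in for the sorter under test (hypothetical)."""
    return sorted(records, key=lambda kv: kv[0])
```

Calling `check_kv_sorter(sort_by_key)` passes, while a sorter that drops or reorders records trips one of the two assertions.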

Author: Reynold Xin <rxin@databricks.com>

Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits:

a08c251 [Reynold Xin] Resource cleanup.
0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.


[SPARK-7937][SQL] Support comparison on StructType · 0722f433

Liang-Chi Hsieh authored 9 years ago

This brings #6519 up-to-date with master branch.

Closes #6519.

Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7877 from rxin/sort-struct and squashes the following commits:

4968231 [Reynold Xin] Minor fixes.
2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct
d2ba8ad [Liang-Chi Hsieh] Remove unused import.
3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering.
dae6aad [Liang-Chi Hsieh] Fix nested struct.
d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering.
f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
1187a65 [Liang-Chi Hsieh] Fix scala style.
9d67f68 [Liang-Chi Hsieh] Fix wrongly merging.
8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison.
2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
3c142e4 [Liang-Chi Hsieh] Fix scala style.
cf58dc3 [Liang-Chi Hsieh] Use checkAnswer.
f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it.
b6e1009 [Liang-Chi Hsieh] Fix scala style.
3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.


[SPARK-9531] [SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter · 2e981b7b

Reynold Xin authored 9 years ago

This pull request adds a destructAndCreateExternalSorter method to UnsafeFixedWidthAggregationMap. The new method does the following:

1. Creates a new external sorter UnsafeKVExternalSorter
2. Adds all the data into an in-memory sorter and sorts it
3. Spills the sorted in-memory data to disk

This method can be used to fall back to sort-based aggregation when under memory pressure.
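
The three steps can be pictured with a small Python model. It is only an illustration under assumptions: a plain dict stands in for the aggregation map, `pickle` and a temporary file stand in for the spill format, and `destruct_and_spill` and `read_spill` are hypothetical names rather than Spark APIs.

```python
import os
import pickle
import tempfile


def destruct_and_spill(agg_map):
    """Sort the map's entries by key, spill them to disk, and empty the map."""
    # Sort the entries in memory by grouping key.
    entries = sorted(agg_map.items())
    # The map gives up its contents, which is the "destruct" part.
    agg_map.clear()
    # Write the sorted run to a spill file.
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "wb") as f:
        for entry in entries:
            pickle.dump(entry, f)
    return path


def read_spill(path):
    """Yield the spilled (key, value) entries back in sorted order."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

After the call the map is empty and the spill file holds one sorted run, which a sort-based aggregation could later merge with other runs.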

The pull request also includes accounting fixes from JoshRosen.

TODOs (that can be done in follow-up PRs)
- [x] Address Josh's feedback from #7849
- [x] More documentation and test cases
- [x] Make sure we are doing memory accounting correctly with test cases (e.g. did we release the memory in BytesToBytesMap twice?)
- [ ] Look harder at possible memory leaks and exception handling
- [ ] Randomized tester for the KV sorter as well as the aggregation map

Author: Reynold Xin <rxin@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #7860 from rxin/kvsorter and squashes the following commits:

986a58c [Reynold Xin] Bug fix.
599317c [Reynold Xin] Style fix and slightly more compact code.
fe7bd4e [Reynold Xin] Bug fixes.
fd71bef [Reynold Xin] Merge remote-tracking branch 'josh/large-records-in-sql-sorter' into kvsorter-with-josh-fix
3efae38 [Reynold Xin] More fixes and documentation.
45f1b09 [Josh Rosen] Ensure that spill files are cleaned up
f6a9bd3 [Reynold Xin] Josh feedback.
9be8139 [Reynold Xin] Remove testSpillFrequency.
7cbe759 [Reynold Xin] [SPARK-9531][SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter.
ae4a8af [Josh Rosen] Detect leaked unsafe memory in UnsafeExternalSorterSuite.
52f9b06 [Josh Rosen] Detect ShuffleMemoryManager leaks in UnsafeExternalSorter.


[SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly · 66924ffa

Xiangrui Meng authored 9 years ago

1. Use `PrefixSpanModel` to wrap the frequent sequences.
2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence`
3. Overload `run` for Java users.
4. Added a unit test in Java to check Java compatibility.

zhangjiajin feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #7869 from mengxr/SPARK-9527 and squashes the following commits:

4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly


[SPARK-9208][SQL] Sort DataFrame functions alphabetically. · 8eafa2ae

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #7861 from rxin/api-audit and squashes the following commits:

7200256 [Reynold Xin] [SPARK-9208][SQL] Sort DataFrame functions alphabetically.


[SPARK-9149] [ML] [EXAMPLES] Add an example of spark.ml KMeans · 244016a9

Yu ISHIKAWA authored 9 years ago

[SPARK-9149] Add an example of spark.ml KMeans - ASF JIRA https://issues.apache.org/jira/browse/SPARK-9149

jkbradley Should we support other data formats, such as TSV or CSV? These examples support only space-separated files, the same as the example for `spark.mllib`'s `KMeans`.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7697 from yu-iskw/SPARK-9149 and squashes the following commits:

7137bad [Yu ISHIKAWA] Fix the typo
56b9da2 [Yu ISHIKAWA] Fix the place of the wrong import statment
554e574 [Yu ISHIKAWA] Change the way to format input data in KMeansExample
e7a948a [Yu ISHIKAWA] Import spark.ml.clustering.KMeans
1901e0c [Yu ISHIKAWA] Change how to initialize an array for a DataFrame schema
d8043f5 [Yu ISHIKAWA] Return a value directly
d81bf55 [Yu ISHIKAWA] Fix a typo and its access specifiers
3e0862d [Yu ISHIKAWA] Make KMeansExample more simple
51ce9c1 [Yu ISHIKAWA] Make JavaKMeansExample more simple
a5a01e0 [Yu ISHIKAWA] Fix a Javadoc about the command to execute the example
b09ec13 [Yu ISHIKAWA] [SPARK-9149][ML][Examples] Add an example of spark.ml KMeans


[SPARK-9521] [BUILD] Require Maven 3.3.3+ in the build · 9d1c0252

Sean Owen authored 9 years ago

Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)

Author: Sean Owen <sowen@cloudera.com>

Closes #7852 from srowen/SPARK-9521 and squashes the following commits:

3093039 [Sean Owen] Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)


[SPARK-9529] [SQL] improve TungstenSort on DecimalType · 16b928c5

Davies Liu authored 9 years ago

Generate prefix for DecimalType, fix the random generator of decimal

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7857 from davies/sort_decimal and squashes the following commits:

2433959 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_decimal
de24253 [Davies Liu] fix style
0a54c1a [Davies Liu] sort decimal


[SPARK-9000] [MLLIB] Support generic item types in PrefixSpan · 28d944e8

Feynman Liang authored 9 years ago

mengxr Please review after #7818 merges and master is rebased.

Continues work by rikima

Closes #7400

Author: Feynman Liang <fliang@databricks.com>
Author: masaki rikitoku <rikima3132@gmail.com>

Closes #7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits:

8b2c756 [Feynman Liang] Remove orig
92443c8 [Feynman Liang] Style fixes
42c6349 [Feynman Liang] Style fix
14e67fc [Feynman Liang] Generic prefixSpan itemtypes
b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api
b86e0d5 [masaki rikitoku] modify to support generic item type


Aug 01, 2015

[SPARK-9459] [SQL] use generated FromUnsafeProjection to do deep copy for UTF8String and struct · 57084e0c

Davies Liu authored 9 years ago

When accessing a column in an UnsafeRow, it is good to avoid copying the data. We should therefore do a deep copy when turning the UnsafeRow into a generic Row; this PR uses a generated FromUnsafeProjection to do that.

This PR also fixes the expressions that cache a UTF8String, which should also copy it.

Author: Davies Liu <davies@databricks.com>

Closes #7840 from davies/avoid_copy and squashes the following commits:

230c8a1 [Davies Liu] address comment
fd797c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into avoid_copy
e095dd0 [Davies Liu] rollback rename
8ef5b0b [Davies Liu] copy String in Columnar
81360b8 [Davies Liu] fix class name
9aecb88 [Davies Liu] use FromUnsafeProjection to do deep copy for UTF8String and struct


[SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff,... · c1b0cbd7

Davies Liu authored 9 years ago

[SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp

This PR is based on #7643 , thanks to adrian-wang

Author: Davies Liu <davies@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7847 from davies/datediff and squashes the following commits:

74333d7 [Davies Liu] fix bug
22d8a8c [Davies Liu] optimize
85cdd21 [Davies Liu] remove unnecessary tests
241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff
e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp
c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)


[SPARK-8269] [SQL] string function: initcap · 00cd92f3

HuJiayin authored 9 years ago

This PR is based on #7208 , thanks to HuJiayin

Closes #7208

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7850 from davies/initcap and squashes the following commits:

54472e9 [Davies Liu] fix python test
17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
3a906e4 [Davies Liu] implement title case in UTF8String
8b2506a [HuJiayin] Update functions.py
2cd43e5 [HuJiayin] fix python style check
b616c0e [HuJiayin] add python api
1f5a0ef [HuJiayin] add codegen
7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
6a0b958 [HuJiayin] add column
c79482d [HuJiayin] support soundex
7ce416b [HuJiayin] support initcap rebase code


[SPARK-9495] prefix of DateType/TimestampType · 5d9e33d9

Davies Liu authored 9 years ago

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7856 from davies/sort_improve and squashes the following commits:

5fc81bd [Davies Liu] support DateType/TimestampType


[SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and... · 84a6982b

Meihua Wu authored 9 years ago

[SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and DistributedLDAModel.topDocumentsPerTopic as approximate

Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.

cc jkbradley

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7858 from rotationsymmetry/SPARK-9530 and squashes the following commits:

b574923 [Meihua Wu] Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.


[SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap · 3d1535d4

Reynold Xin authored 9 years ago

This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key.

This is needed so we can fall back to external sorting for aggregation.

Author: Reynold Xin <rxin@databricks.com>

Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits:

75018c6 [Reynold Xin] Updated documentation.
81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.


[SPARK-9491] Avoid fetching HBase tokens when not needed. · df733cbe

Marcelo Vanzin authored 9 years ago

Look at HBase's configuration to make sure it's configured for
Kerberos. If the HBase configuration is missing, or if HBase is
configured for non-kerberos authentication, then skip getting
tokens.

Reference: http://hbase.apache.org/book.html#security.prerequisites
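
The decision described above can be sketched as follows. This is a simplified model, not the patch itself: the configuration is represented as a plain dict (or `None` when it is missing), `should_fetch_hbase_tokens` is a hypothetical name, and the `hbase.security.authentication` property is taken from the HBase security documentation referenced above as an assumption about what the check reads.

```python
def should_fetch_hbase_tokens(hbase_conf):
    """Return True only when HBase is configured for Kerberos authentication."""
    # A missing configuration means there is nothing to get tokens for.
    if not hbase_conf:
        return False
    # Any authentication mode other than Kerberos skips token fetching.
    auth = hbase_conf.get("hbase.security.authentication", "")
    return auth.strip().lower() == "kerberos"
```

With this shape, a cluster using the default (non-Kerberos) authentication never triggers a token request.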

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7810 from vanzin/SPARK-9491 and squashes the following commits:

a57c776 [Marcelo Vanzin] [SPARK-9491] Avoid fetching HBase tokens when not needed.


[SPARK-4751] Dynamic allocation in standalone mode · 6688ba6e

Andrew Or authored 9 years ago

Dynamic allocation is a feature that allows a Spark application to scale the number of executors up and down dynamically based on the workload. Support was first introduced for YARN in 1.2 and was recently extended to Mesos coarse-grained mode. Today, it is finally supported in standalone mode as well!

I tested this locally and it works as expected. This is WIP because unit tests are coming.

Author: Andrew Or <andrew@databricks.com>

Closes #7532 from andrewor14/standalone-da and squashes the following commits:

b3c1736 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
879e928 [Andrew Or] Add end-to-end tests for standalone dynamic allocation
accc8f6 [Andrew Or] Address comments
ee686a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
c0a2c02 [Andrew Or] Fix build after merge conflict
24149eb [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
2e762d6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
6832bd7 [Andrew Or] Add tests for scheduling with executor limit
a82e907 [Andrew Or] Fix comments
0a8be79 [Andrew Or] Simplify logic by removing the worker blacklist
b7742af [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
2eb5f3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
1334e9a [Andrew Or] Fix MiMa
32abe44 [Andrew Or] Fix style
58cb06f [Andrew Or] Privatize worker blacklist for cleanliness
42ac215 [Andrew Or] Clean up comments and rewrite code for readability
49702d1 [Andrew Or] Clean up shuffle files after application exits
80047aa [Andrew Or] First working implementation


[SPARK-8263] [SQL] substr/substring should also support binary type · c5166f7a

zhichao.li authored 9 years ago

This is based on #7641, thanks to zhichao-li

Closes #7641

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7848 from davies/substr and squashes the following commits:

461b709 [Davies Liu] remove bytearry from tests
b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
01d795e [zhichao.li] scala style
99aa130 [zhichao.li] add substring to dataframe
4f68bfe [zhichao.li] add binary type support for substring


[SPARK-8232] [SQL] Add sort_array support · cf6c9ca3

Cheng Hao authored 9 years ago

This PR is based on #7581 and only fixes the merge conflict.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7851 from davies/sort_array and squashes the following commits:

a80ef66 [Davies Liu] fix conflict
7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add asending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support


[SPARK-8169] [ML] Add StopWordsRemover as a transformer · 87656650

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-8169

stop words: http://en.wikipedia.org/wiki/Stop_words

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.

Currently I use a minimal stop words set, since in some [cases](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) a small set of stop words is preferred.
ASCII char has been tested, yet I cannot check it in due to the style check.

Further thoughts:
1. Maybe I should use OpenHashSet. Is it recommended?
2. Currently I leave the null in input array untouched, i.e. Array(null, null) => Array(null, null).
3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).
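
The behaviour described above, including the handling of nulls from point 2, can be modeled in a few lines of Python. This is a sketch under assumptions: `remove_stop_words` and `DEFAULT_STOP_WORDS` are hypothetical names, the word list is illustrative rather than the default set shipped with the transformer, and matching is exact (case handling is not specified in the description).

```python
# Illustrative stop words only; not the transformer's actual default list.
DEFAULT_STOP_WORDS = frozenset({"a", "an", "the", "of", "and"})


def remove_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS):
    """Drop stop words from a token array, leaving None entries untouched."""
    return [t for t in tokens if t is None or t not in stop_words]
```

For example, `remove_stop_words(["the", "quick", None, "fox", "of"])` keeps `"quick"`, `None` and `"fox"`, and an input of `[None, None]` comes back unchanged.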

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6742 from hhbyyh/stopwords and squashes the following commits:

fa959d8 [Yuhao Yang] separating udf
f190217 [Yuhao Yang] replace default list and other small fix
04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords
b3aa957 [Yuhao Yang] add stopWordsRemover


[SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences · d2a9b66f

zhangjiajin authored 9 years ago

mengxr Extends PrefixSpan to non-temporal itemsets. Continues work by zhangjiajin

 * Internal API uses List[Set[Int]] which is likely not efficient; will need to refactor during QA

Closes #7646

Author: zhangjiajin <zhangjiajin@huawei.com>
Author: Feynman Liang <fliang@databricks.com>
Author: zhang jiajin <zhangjiajin@huawei.com>

Closes #7818 from feynmanliang/SPARK-8999-nonTemporal and squashes the following commits:

4ded81d [Feynman Liang] Replace all filters to filter nonempty
350e67e [Feynman Liang] Code review feedback
03156ca [Feynman Liang] Fix tests, drop delimiters at boundaries of sequences
d1fe0ed [Feynman Liang] Remove comments
86ca4e5 [Feynman Liang] Fix style
7c7bf39 [Feynman Liang] Fixed itemSet sequences
6073b10 [Feynman Liang] Basic itemset functionality, failing test
1a7fb48 [Feynman Liang] Add delimiter to results
5db00aa [Feynman Liang] Working for items, not itemsets
6787716 [Feynman Liang] Working on temporal sequences
f1114b9 [Feynman Liang] Add -1 delimiter
00fe756 [Feynman Liang] Reset base files for rebase
f486dcd [zhangjiajin] change maxLocalProjDBSize and fix a bug (remove -3 from frequent items).
60a0b76 [zhangjiajin] fixed a scala style error.
740c203 [zhangjiajin] fixed a scala style error.
5785cb8 [zhangjiajin] support non-temporal sequence
a5d649d [zhangjiajin] restore original version
09dc409 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into multiItems_2
ae8c02d [zhangjiajin] Fixed some Scala style errors.
216ab0c [zhangjiajin] Support non-temporal sequence in PrefixSpan
b572f54 [zhangjiajin] initialize file before rebase.
f06772f [zhangjiajin] fix a scala style error.
a7e50d4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
c1d13d0 [zhang jiajin] Delete PrefixspanSuite.scala
d9d8137 [zhang jiajin] Delete Prefixspan.scala
c6ceb63 [zhangjiajin] Add new algorithm PrefixSpan and test file.


[SPARK-7446] [MLLIB] Add inverse transform for string indexer · 65038973

Holden Karau authored 9 years ago

It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.
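
The idea can be sketched in plain Python. This is a minimal model, not the spark.ml transformer: `index_strings` and `invert_indices` are hypothetical helpers, and the label array is passed in explicitly instead of being read from column metadata.

```python
def index_strings(values, labels):
    """Encode each string as its position in the `labels` array."""
    lookup = {label: i for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


def invert_indices(indices, labels):
    """Map encoded indices (ints or whole-number floats) back to their strings."""
    return [labels[int(i)] for i in indices]
```

Applying `invert_indices` to the output of `index_strings` with the same `labels` returns the original strings, which is what makes results easy to inspect.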

Author: Holden Karau <holden@pigscanfly.ca>

Closes #6339 from holdenk/SPARK-7446-inverse-transform-for-string-indexer and squashes the following commits:

7cdf915 [Holden Karau] scala style comment fix
b9cffb6 [Holden Karau] Update the labels param to have the metadata note
6a38edb [Holden Karau] Setting the default needs to come after the value gets defined
9e241d8 [Holden Karau] use Array.empty
21c8cfa [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
64dd3a3 [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
4f06c59 [Holden Karau] Fix comment styles, use empty array as the default, etc.
a60c0e3 [Holden Karau] CR feedback (remove old constructor, add a note about use of setLabels)
1987b95 [Holden Karau] Use default copy
71e8d66 [Holden Karau] Make labels a local param for StringIndexerInverse
8450d0b [Holden Karau] Use the labels param in StringIndexerInverse
7464019 [Holden Karau] Add a labels param
868b1a9 [Holden Karau] Update scaladoc since we don't have labelsCol anymore
5aa38bf [Holden Karau] Add an inverse test using only meta data, pass labels when calling inverse method
f3e0c64 [Holden Karau] CR feedback
ebed932 [Holden Karau] Add Experimental tag and some scaladocs. Also don't require that the inputCol has the metadata on it, instead have the labelsCol specified when creating the inverse.
03ebf95 [Holden Karau] Add explicit type for invert function
ecc65e0 [Holden Karau] Read the metadata correctly, use the array, pass the test
a42d773 [Holden Karau] Fix test to supply cols as per new invert method
16cc3c3 [Holden Karau] Add an invert method
d4bcb20 [Holden Karau] Make the inverse string indexer into a transformer (still needs test updates but compiles)
e8bf3ad [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
c3fdee1 [Holden Karau] Some WIP refactoring based on jkbradley's CR feedback. Definite work-in-progress
557bef8 [Holden Karau] Instead of using a private inverse transform, add an invert function so we can use it in a pipeline
88779c1 [Holden Karau] fix long line
78b28c1 [Holden Karau] Finish reverse part and add a test :)
bb16a6a [Holden Karau] Some progress


Revert "[SPARK-8232] [SQL] Add sort_array support" · 60ea7ab4
Davies Liu authored 9 years ago
```
This reverts commit 67ad4e21.
```

[SPARK-9480][SQL] add MapData and cleanup internal row stuff · 1d59a416

Wenchen Fan authored 9 years ago

This PR adds a `MapData` as internal representation of map type in Spark SQL, and provides a default implementation with just 2 `ArrayData`.

After that, we have specialized getters for all internal types, so I removed the generic getter in `ArrayData` and added a specialized `toArray` for it.
Also did some refactoring and cleanup for `InternalRow` and its subclasses.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7799 from cloud-fan/map-data and squashes the following commits:

77d482f [Wenchen Fan] fix python
e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
40cc9db [Wenchen Fan] add toString
6e06ec9 [Wenchen Fan] some more cleanup
a90aca1 [Wenchen Fan] add MapData


[SPARK-9517][SQL] BytesToBytesMap should encode data the same way as UnsafeExternalSorter · d90f2cf7

Reynold Xin authored 9 years ago

BytesToBytesMap currently encodes key/value data in the following format:
```
8B key length, key data, 8B value length, value data
```

UnsafeExternalSorter, on the other hand, encodes data this way:
```
4B record length, data
```

As a result, we cannot pass records encoded by BytesToBytesMap directly into UnsafeExternalSorter for sorting. However, if we rearrange data slightly, we can then pass the key/value records directly into UnsafeExternalSorter:
```
4B key+value length, 4B key length, key data, value data
```
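
The rearranged layout can be illustrated with Python's `struct` module. This is a sketch under stated assumptions rather than the actual implementation: the leading length field is assumed to count everything after it (the 4-byte key length plus key and value data), little-endian byte order is an arbitrary choice, and `encode_kv` and `decode_kv` are hypothetical names.

```python
import struct


def encode_kv(key: bytes, value: bytes) -> bytes:
    """Pack one record as: 4B total length, 4B key length, key data, value data."""
    # Assumption: the total counts the key-length field plus both payloads.
    record_len = 4 + len(key) + len(value)
    return struct.pack("<ii", record_len, len(key)) + key + value


def decode_kv(buf: bytes):
    """Split a record produced by encode_kv back into (key, value)."""
    record_len, key_len = struct.unpack_from("<ii", buf, 0)
    key = buf[8:8 + key_len]
    # The record ends 4 bytes (the length field itself) plus record_len in.
    value = buf[8 + key_len:4 + record_len]
    return key, value
```

A reader that only understands "4B record length, data" can treat everything after the first four bytes as an opaque record, while a key/value-aware reader uses the second length field to find the boundary between key and value.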

Author: Reynold Xin <rxin@databricks.com>

Closes #7845 from rxin/kvsort-rebase and squashes the following commits:

5716b59 [Reynold Xin] Fixed test.
2e62ccb [Reynold Xin] Updated BytesToBytesMap's data encoding to put the key first.
a51b641 [Reynold Xin] Added a KV sorter interface.


[SPARK-8232] [SQL] Add sort_array support · 67ad4e21

Cheng Hao authored 9 years ago

Add expression `sort_array` support.

Author: Cheng Hao <hao.cheng@intel.com>

This patch had conflicts when merged, resolved by
Committer: Davies Liu <davies.liu@gmail.com>

Closes #7581 from chenghao-intel/sort_array and squashes the following commits:

664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add asending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support


[SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate · 3320b0ba

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-9415

Following up #7787. We shouldn't use MapType as grouping keys or join keys either.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7819 from viirya/map_join_groupby and squashes the following commits:

005ee0c [Liang-Chi Hsieh] For comments.
7463398 [Liang-Chi Hsieh] MapType can't be used as join keys, grouping keys.


Jul 31, 2015

[SPARK-9464][SQL] Property checks for UTF8String · 14f26344

Josh Rosen authored 9 years ago

This PR is based on the original work by JoshRosen in #7780, which adds ScalaCheck property-based tests for UTF8String.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7830 from yjshen/utf8-property-checks and squashes the following commits:

593da3a [Yijie Shen] resolve comments
c0800e6 [Yijie Shen] Finish all todos in suite
52f51a0 [Josh Rosen] Add some more failing tests
49ed0697 [Josh Rosen] Rename suite
9209c64 [Josh Rosen] UTF8String Property Checks.


[SPARK-8264][SQL]add substring_index function · 6996bd2e

zhichao.li authored 9 years ago

This PR is based on #7533 , thanks to zhichao-li

Closes #7533

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7843 from davies/str_index and squashes the following commits:

391347b [Davies Liu] add python api
3ce7802 [Davies Liu] fix substringIndex
f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
515519b [zhichao.li] add foldable and remove null checking
9546991 [zhichao.li] scala style
67c253a [zhichao.li] hide some apis and clean code
b19b013 [zhichao.li] add codegen and clean code
ac863e9 [zhichao.li] reduce the calling of numChars
12e108f [zhichao.li] refine unittest
d92951b [zhichao.li] add lastIndexOf
52d7b03 [zhichao.li] add substring_index function


[SPARK-9358][SQL] Code generation for UnsafeRow joiner. · 03377d25

Reynold Xin authored 9 years ago

This patch creates a code generated unsafe row concatenator that can be used to concatenate/join two UnsafeRows into a single UnsafeRow.

Since it is inherently hard to test such low-level code, the test suites rely heavily on randomized testing to check correctness.

Author: Reynold Xin <rxin@databricks.com>

Closes #7821 from rxin/rowconcat and squashes the following commits:

8717f35 [Reynold Xin] Rebase and code review.
72c5d8e [Reynold Xin] Fixed a bug.
a84ed2e [Reynold Xin] Fixed offset.
40c3fb2 [Reynold Xin] Reset random data generator.
f0913aa [Reynold Xin] Test fixes.
6687b6f [Reynold Xin] Updated documentation.
00354b9 [Reynold Xin] Support concat data as well.
e9a4347 [Reynold Xin] Updated.
6269f96 [Reynold Xin] Fixed a bug .
0f89716 [Reynold Xin] [SPARK-9358][SQL][WIP] Code generation for UnsafeRow concat.


[SPARK-9318] [SPARK-9320] [SPARKR] Aliases for merge and summary functions on DataFrames · 712f5b7a

Hossein authored 9 years ago

This PR adds synonyms for `merge` and `summary` in SparkR DataFrame API.

cc shivaram

Author: Hossein <hossein@databricks.com>

Closes #7806 from falaki/SPARK-9320 and squashes the following commits:

72600f7 [Hossein] Updated docs
92a6e75 [Hossein] Fixed merge generic signature issue
4c2b051 [Hossein] Fixing naming with mllib summary
0f3a64c [Hossein] Added ... to generic for merge
30fbaf8 [Hossein] Merged master
ae1a4cf [Hossein] Merge branch 'master' into SPARK-9320
e8eb86f [Hossein] Add a generic for merge
fc01f2d [Hossein] Added unit test
8d92012 [Hossein] Added merge as an alias for join
5b8bedc [Hossein] Added unit test
632693d [Hossein] Added summary as an alias for describe for DataFrame


[SPARK-9451] [SQL] Support entries larger than default page size in... · 8cb415a4

Josh Rosen authored 9 years ago

[SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager

This patch adds support for entries larger than the default page size in BytesToBytesMap.  These large rows are handled by allocating special overflow pages to hold individual entries.

In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:

- Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
- Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
  - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory.  The caller can use this value to respond to the memory pressure (e.g. by spilling).
- `UnsafeFixedWidthAggregationMap.getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
- Updated all uses of these classes to handle these error conditions.
- Added new tests for allocating large records and for allocations which fail due to memory pressure.
- Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
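
The caller-side pattern implied by the boolean return value can be sketched as follows. This is a toy model with assumptions throughout: `BoundedMap` stands in for a map whose memory reservation can fail (an entry cap replaces real memory accounting), `insert_with_spill` is a hypothetical caller, keys are assumed to be distinct, and "spilling" just moves a sorted snapshot into a list.

```python
class BoundedMap:
    """Toy map whose inserts fail once a fixed entry budget is used up."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.data = {}

    def put_new_key(self, key, value):
        """Return True on success, False when no more 'memory' is available."""
        if len(self.data) >= self.max_entries:
            return False
        self.data[key] = value
        return True


def insert_with_spill(records, max_entries):
    """Insert records, spilling the map's contents whenever an insert fails."""
    m = BoundedMap(max_entries)
    spills = []
    for key, value in records:
        if not m.put_new_key(key, value):
            # React to memory pressure: spill a sorted run and free the map.
            spills.append(sorted(m.data.items()))
            m.data.clear()
            if not m.put_new_key(key, value):
                raise MemoryError("record does not fit even in an empty map")
    return m, spills
```

The point of the design is that the map reports failure instead of throwing or growing without bound, so the decision about how to relieve memory pressure stays with the caller.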

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7762 from JoshRosen/large-rows and squashes the following commits:

ae7bc56 [Josh Rosen] Fix compilation
82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
34ab943 [Josh Rosen] Remove semi
31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
626b33c [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated
ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core.
642ed69 [Josh Rosen] Rename size to numElements
bea1152 [Josh Rosen] Add basic test.
2cd3570 [Josh Rosen] Remove accidental duplicated code
07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.


[SPARK-8936] [MLLIB] OnlineLDA document-topic Dirichlet hyperparameter optimization · f51fd6fb

Feynman Liang authored 9 years ago

Adds `alpha` (document-topic Dirichlet parameter) hyperparameter optimization to `OnlineLDAOptimizer` following Huang: Maximum Likelihood Estimation of Dirichlet Distribution Parameters. Also introduces a private `setSampleWithReplacement` to `OnlineLDAOptimizer` for unit testing purposes.

Author: Feynman Liang <fliang@databricks.com>

Closes #7836 from feynmanliang/SPARK-8936-alpha-optimize and squashes the following commits:

4bef484 [Feynman Liang] Documentation improvements
c3c6c1d [Feynman Liang] Fix docs
151e859 [Feynman Liang] Fix style
fa77518 [Feynman Liang] Hyperparameter optimization

f51fd6fb

[SPARK-8271][SQL] string function: soundex · 4d5a6e7b

HuJiayin authored 9 years ago

This PR adds the SQL function soundex(); see https://issues.apache.org/jira/browse/HIVE-9738

It is based on #7115; thanks to HuJiayin.
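
For readers unfamiliar with the encoding, here is a minimal sketch of the common American Soundex rules. It is not the code from this PR, and the SQL function may treat edge cases (non-letters, empty input, non-ASCII characters) differently; the function name and the choice to return an empty string for input without letters are assumptions made for the example.

```python
# Digit codes for consonants; vowels and h, w, y have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of word (ASCII letters only)."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""  # assumption: no letters means no code
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")  # pad short codes with zeros
```

Names that sound alike map to the same code: `soundex("Robert")` and `soundex("Rupert")` both give `R163`.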

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7812 from davies/soundex and squashes the following commits:

fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
a4bd6d8 [Davies Liu] fix soundex
2538908 [HuJiayin] add codegen soundex
d15d329 [HuJiayin] add back ut
ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
e2dec2c [HuJiayin] support soundex rebase code

4d5a6e7b

[SPARK-9233] [SQL] Enable code-gen in window function unit tests · 3fc0cb92

Yin Huai authored 9 years ago

Since code-gen is enabled by default, it is better to run window function tests with code-gen.

https://issues.apache.org/jira/browse/SPARK-9233

Author: Yin Huai <yhuai@databricks.com>

Closes #7832 from yhuai/SPARK-9233 and squashes the following commits:

4e4e4cc [Yin Huai] style
ca80e07 [Yin Huai] Test window function with codegen.

3fc0cb92

[SPARK-9324] [SPARK-9322] [SPARK-9321] [SPARKR] Some aliases for R-like functions in DataFrames · 710c2b5d

Hossein authored 9 years ago

Adds the following aliases:
* unique (distinct)
* rbind (unionAll): accepts many DataFrames
* nrow (count)
* ncol
* dim
* names (columns): along with the replacement function to change names
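
The semantics of the aliases listed above can be illustrated with a toy in-memory table. This sketch is not SparkR and is not the code from this patch: `Table`, `union_all`, and the module-level `rbind` helper are hypothetical stand-ins that only show how each alias delegates to an existing operation.

```python
class Table:
    """Toy in-memory table (hypothetical; not the SparkR DataFrame)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]

    def distinct(self):
        # Keep the first occurrence of each row, preserving order.
        seen, out = set(), []
        for row in self.rows:
            if row not in seen:
                seen.add(row)
                out.append(row)
        return Table(self.columns, out)

    def union_all(self, other):
        # Concatenate rows without removing duplicates.
        return Table(self.columns, self.rows + other.rows)

    def count(self):
        return len(self.rows)

    # R-like aliases that simply delegate to the methods above.
    unique = distinct
    nrow = count

    def ncol(self):
        return len(self.columns)

    def dim(self):
        return (self.nrow(), self.ncol())

    @property
    def names(self):
        return list(self.columns)

    @names.setter
    def names(self, new_names):
        # Replacement form: rename every column at once.
        if len(new_names) != len(self.columns):
            raise ValueError("expected %d names" % len(self.columns))
        self.columns = list(new_names)


def rbind(first, *others):
    """Stack any number of tables, like a repeated union_all."""
    result = first
    for table in others:
        result = result.union_all(table)
    return result
```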

Author: Hossein <hossein@databricks.com>

Closes #7764 from falaki/sparkR-alias and squashes the following commits:

56016f5 [Hossein] Updated R documentation
5e4a4d0 [Hossein] Removed extra code
f51cbef [Hossein] Merge branch 'master' into sparkR-alias
c1b88bd [Hossein] Moved setGeneric and other comments applied
d9307f8 [Hossein] Added tests
b5aa988 [Hossein] Added dim, ncol, nrow, names, rbind, and unique functions to DataFrames

710c2b5d

[SPARK-9510] [SPARKR] Remaining SparkR style fixes · 82f47b81

Shivaram Venkataraman authored 9 years ago

With the change in this patch, I get no more warnings from `./dev/lint-r` on my machine.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #7834 from shivaram/sparkr-style-fixes and squashes the following commits:

716cd8e [Shivaram Venkataraman] Remaining SparkR style fixes

82f47b81

[SPARK-9507] [BUILD] Remove dependency reduced POM hack now that shade plugin is updated · 6e5fd613

Sean Owen authored 9 years ago

Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here

See https://issues.apache.org/jira/browse/SPARK-8819

I verified that `mvn clean package -DskipTests` works with Maven 3.3.3.

pwendell are you up for trying this for the 1.5.0 release?

Author: Sean Owen <sowen@cloudera.com>

Closes #7826 from srowen/SPARK-9507 and squashes the following commits:

e0b0fd2 [Sean Owen] Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here

6e5fd613

[SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code... · 873ab0f9

Sean Owen authored 9 years ago

[SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement

Use print(x) not print x for Python 3 in eval examples
CC sethah mengxr -- just wanted to close this out before 1.5
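
A small illustration of the change, with a made-up metric name and value (the helper `format_metric` is hypothetical and does not come from the docs being fixed). The statement form `print x` is a syntax error on Python 3, while the call form with a single argument runs on both Python 2 and Python 3.

```python
def format_metric(name, value):
    """Build the line to print; the name and value are made-up examples."""
    return "%s = %s" % (name, value)


# Call form: valid on Python 3, and also accepted by Python 2
# when there is exactly one parenthesized argument.
print(format_metric("Precision", 0.75))
```

With more than one argument, Python 2 would print a tuple unless `from __future__ import print_function` is placed at the top of the file, so formatting the message into a single string first keeps the output identical across versions.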

Author: Sean Owen <sowen@cloudera.com>

Closes #7822 from srowen/SPARK-9490 and squashes the following commits:

01abeba [Sean Owen] Change "print x" to "print(x)" in the rest of the docs too
bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples

873ab0f9