Commits · aa6966ff34dacc83c3ca675b5109b05e35015469 · cs525-sp18-g07 / spark

Apr 25, 2015

[SQL] Update SQL readme to include instructions on generating golden answer... · aa6966ff

Yin Huai authored 9 years ago

[SQL] Update SQL readme to include instructions on generating golden answer files based on Hive 0.13.1.

Author: Yin Huai <yhuai@databricks.com>

Closes #5702 from yhuai/howToGenerateGoldenFiles and squashes the following commits:

9c4a7f8 [Yin Huai] Update readme to include instructions on generating golden answer files based on Hive 0.13.1.


[SPARK-6113] [ML] Tree ensembles for Pipelines API · a7160c4e

Joseph K. Bradley authored 9 years ago

This is a continuation of [https://github.com/apache/spark/pull/5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees. Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.

This PR follows the example set by the previous PR for Decision Trees. It includes a few cleanups to Decision Trees.

Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap. I plan to submit a separate PR which makes those values in Model be Options. It does not matter much which PR gets merged first.

CC: mengxr manishamde codedeft chouqin

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5626 from jkbradley/dt-api-ensembles and squashes the following commits:

729167a [Joseph K. Bradley] small cleanups based on code review
bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
855aa9a [Joseph K. Bradley] scala style fix
ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml. Not tested yet. Need to add example as well
d045ebd [Joseph K. Bradley] some more updates, but far from done
ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.


Revert "[SPARK-6752][Streaming] Allow StreamingContext to be recreated from... · a61d65fc

Patrick Wendell authored 9 years ago

Revert "[SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint and existing SparkContext"

This reverts commit 534f2a43.


update the deprecated CountMinSketchMonoid function to TopPctCMS function · cca9905b

KeheCAI authored 9 years ago

http://twitter.github.io/algebird/index.html#com.twitter.algebird.legacy.CountMinSketchMonoid$
The CountMinSketchMonoid has been deprecated since 0.8.1. Newer code should use TopPctCMS.monoid().

![image](https://cloud.githubusercontent.com/assets/1327396/7269619/d8b48b92-e8d5-11e4-8902-087f630e6308.png)

Author: KeheCAI <caikehe@gmail.com>

Closes #5629 from caikehe/master and squashes the following commits:

e8aa06f [KeheCAI] update algebird-core to version 0.9.0 from 0.8.1
5653351 [KeheCAI] change scala code style
4c0dfd1 [KeheCAI] update the deprecated CountMinSketchMonoid function to TopPctCMS function


Apr 24, 2015

[SPARK-7136][Docs] Spark SQL and DataFrame Guide fix example file and paths · 59b7cfc4

Deborah Siegel authored 9 years ago

Changes the example file for Generic Load/Save Functions to users.parquet rather than people.parquet, which doesn't exist unless a later example has already been executed. Also adds file paths.

Author: Deborah Siegel <deborah.siegel@gmail.com>
Author: DEBORAH SIEGEL <deborahsiegel@d-140-142-0-49.dhcp4.washington.edu>
Author: DEBORAH SIEGEL <deborahsiegel@DEBORAHs-MacBook-Pro.local>
Author: DEBORAH SIEGEL <deborahsiegel@d-69-91-154-197.dhcp4.washington.edu>

Closes #5693 from d3borah/master and squashes the following commits:

4d5e43b [Deborah Siegel] sparkSQL doc change
b15a497 [Deborah Siegel] Revert "sparkSQL doc change"
5a2863c [DEBORAH SIEGEL] Merge remote-tracking branch 'upstream/master'
91972fc [DEBORAH SIEGEL] sparkSQL doc change
f000e59 [DEBORAH SIEGEL] Merge remote-tracking branch 'upstream/master'
db54173 [DEBORAH SIEGEL] fixed aggregateMessages example in graphX doc


[PySpark][Minor] Update sql example, so that can read file correctly · d874f8b5

linweizhong authored 9 years ago

By default, Spark reads the file from HDFS if we don't set the URI scheme.

Author: linweizhong <linweizhong@huawei.com>

Closes #5684 from Sephiroth-Lin/pyspark_example_minor and squashes the following commits:

19fe145 [linweizhong] Update example sql.py, so that can read file correctly


[SPARK-6122] [CORE] Upgrade tachyon-client version to 0.6.3 · 438859eb

Calvin Jia authored 9 years ago

This is a reopening of #4867.
A short summary of the issues resolved from the previous PR:

1. HTTPClient version mismatch: Selenium (used for UI tests) requires version 4.3.x, and Tachyon included 4.2.5 through a transitive dependency of its shaded thrift jar. To address this, Tachyon 0.6.3 will promote the transitive dependencies of the shaded jar so they can be excluded in spark.

2. Jackson-Mapper-ASL version mismatch: In lower versions of hadoop-client (i.e. 1.0.4), version 1.0.1 is included. The parquet library used in spark sql requires version 1.8+. It's unclear to me why upgrading tachyon-client would cause this dependency to break. The solution was to exclude jackson-mapper-asl from hadoop-client.

It seems that the dependency management in spark-parent will not work on transitive dependencies. One way to make sure jackson-mapper-asl is included with the correct version is to add it as a top-level dependency. The best solution would be to exclude the dependency in the modules which require a higher version, but that did not fix the unit tests. Any suggestions on the best way to solve this would be appreciated!

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #5354 from calvinjia/upgrade_tachyon_0.6.3 and squashes the following commits:

0eefe4d [Calvin Jia] Handle httpclient version in maven dependency management. Remove httpclient version setting from profiles.
7c00dfa [Calvin Jia] Set httpclient version to 4.3.2 for selenium. Specify version of httpclient for sql/hive (previously 4.2.5 transitive dependency of libthrift).
9263097 [Calvin Jia] Merge master to test latest changes
dbfc1bd [Calvin Jia] Use Tachyon 0.6.4 for cleaner dependencies.
e2ff80a [Calvin Jia] Exclude the jetty and curator promoted dependencies from tachyon-client.
a3a29da [Calvin Jia] Update tachyon-client exclusions.
0ae6c97 [Calvin Jia] Change tachyon version to 0.6.3
a204df9 [Calvin Jia] Update make distribution tachyon version.
a93c94f [Calvin Jia] Exclude jackson-mapper-asl from hadoop client since it has a lower version than spark's expected version.
a8a923c [Calvin Jia] Exclude httpcomponents from Tachyon
910fabd [Calvin Jia] Update to master
eed9230 [Calvin Jia] Update tachyon version to 0.6.1.
11907b3 [Calvin Jia] Use TachyonURI for tachyon paths instead of strings.
71bf441 [Calvin Jia] Upgrade Tachyon client version to 0.6.0.


[SPARK-6852] [SPARKR] Accept numeric as numPartitions in SparkR. · caf0136e

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #5613 from sun-rui/SPARK-6852 and squashes the following commits:

abaf02e [Sun Rui] Change the type of default numPartitions from integer to numeric in generics.R.
29d67c1 [Sun Rui] [SPARK-6852][SPARKR] Accept numeric as numPartitions in SparkR.


[SPARK-7033] [SPARKR] Clean usage of split. Use partition instead where applicable. · ebb77b2a

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #5628 from sun-rui/SPARK-7033 and squashes the following commits:

046bc9e [Sun Rui] Clean split usage in tests.
d531c86 [Sun Rui] [SPARK-7033][SPARKR] Clean usage of split. Use partition instead where applicable.


[SPARK-6528] [ML] Add IDF transformer · 6e57d57b

Xusen Yin authored 9 years ago

See [SPARK-6528](https://issues.apache.org/jira/browse/SPARK-6528). Add IDF transformer in ML package.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5266 from yinxusen/SPARK-6528 and squashes the following commits:

741db31 [Xusen Yin] get param from new paramMap
d169967 [Xusen Yin] add final to param and IDF class
c9c3759 [Xusen Yin] simplify test suite
5867c09 [Xusen Yin] refine IDF transformer with new interfaces
7727cae [Xusen Yin] Merge branch 'master' into SPARK-6528
4338a37 [Xusen Yin] Merge branch 'master' into SPARK-6528
aef2cdf [Xusen Yin] add doc and group for param
5760b49 [Xusen Yin] fix code style
2add691 [Xusen Yin] fix code style and test
03fbecb [Xusen Yin] remove duplicated code
2aa4be0 [Xusen Yin] clean test suite
4802c67 [Xusen Yin] add IDF transformer and test suite


[SPARK-7115] [MLLIB] skip the very first 1 in poly expansion · 78b39c7e

Xiangrui Meng authored 9 years ago

yinxusen

Author: Xiangrui Meng <meng@databricks.com>

Closes #5681 from mengxr/SPARK-7115 and squashes the following commits:

9ac27cd [Xiangrui Meng] skip the very first 1 in poly expansion


[SPARK-5894] [ML] Add polynomial mapper · 8509519d

Xusen Yin authored 9 years ago

See [SPARK-5894](https://issues.apache.org/jira/browse/SPARK-5894).

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5245 from yinxusen/SPARK-5894 and squashes the following commits:

dc461a6 [Xusen Yin] merge polynomial expansion v2
6d0c3cc [Xusen Yin] Merge branch 'SPARK-5894' of https://github.com/mengxr/spark into mengxr-SPARK-5894
57bfdd5 [Xusen Yin] Merge branch 'master' into SPARK-5894
3d02a7d [Xusen Yin] Merge branch 'master' into SPARK-5894
a067da2 [Xiangrui Meng] a new approach for poly expansion
0789d81 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5894
4e9aed0 [Xusen Yin] fix test suite
95d8fb9 [Xusen Yin] fix sparse vector indices
8d39674 [Xusen Yin] fix sparse vector expansion error
5998dd6 [Xusen Yin] fix dense vector fillin
fa3ade3 [Xusen Yin] change the functional code into imperative one to speedup
b70e7e1 [Xusen Yin] remove useless case class
6fa236f [Xusen Yin] fix vector slice error
daff601 [Xusen Yin] fix index error of sparse vector
6bd0a10 [Xusen Yin] merge repeated features
419f8a2 [Xusen Yin] need to merge same columns
4ebf34e [Xusen Yin] add test suite of polynomial expansion
372227c [Xusen Yin] add polynomial expansion


Fixed a typo from the previous commit. · 4c722d77

Reynold Xin authored 9 years ago


Apr 23, 2015

[SQL] Fixed expression data type matching. · d3a302de

Reynold Xin authored 9 years ago

Also took the chance to improve documentation for various types.

Author: Reynold Xin <rxin@databricks.com>

Closes #5675 from rxin/data-type-matching-expr and squashes the following commits:

0f31856 [Reynold Xin] One more function documentation.
27c1973 [Reynold Xin] Added more documentation.
336a36d [Reynold Xin] [SQL] Fixed expression data type matching.


Update sql-programming-guide.md · 67bccbda

Ken Geis authored 9 years ago

fix typo

Author: Ken Geis <geis.ken@gmail.com>

Closes #5674 from kgeis/patch-1 and squashes the following commits:

5ae67de [Ken Geis] Update sql-programming-guide.md


[SPARK-7060][SQL] Add alias function to python dataframe · 2d010f7a

Yin Huai authored 9 years ago

This PR tries to provide a way for Python users to work around https://issues.apache.org/jira/browse/SPARK-6231.

Author: Yin Huai <yhuai@databricks.com>

Closes #5634 from yhuai/pythonDFAlias and squashes the following commits:

8465acd [Yin Huai] Add an alias to a Python DF.


[SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in... · 336f7f53

Cheolsoo Park authored 9 years ago

[SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in spark-shell and spark-submit

When specifying non-spark properties (i.e. names don't start with spark.) in the command line and config file, spark-submit and spark-shell behave differently, causing confusion to users.
Here is the summary:
* spark-submit
  * --conf k=v => silently ignored
  * spark-defaults.conf => applied
* spark-shell
  * --conf k=v => warning shown, then ignored
  * spark-defaults.conf => warning shown, then ignored

I assume that ignoring non-spark properties is intentional. If so, they should be ignored with a warning message in all cases.
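
The rule proposed here (drop every property whose name does not start with `spark.`, and warn each time) can be sketched as follows. This is a minimal Python illustration, not the Scala code from the PR; the function name `ignore_non_spark_properties` and the sample keys are hypothetical.

```python
import warnings


def ignore_non_spark_properties(props):
    """Return only the 'spark.*' entries, warning about every entry dropped.

    Hypothetical helper: the same filter is meant to apply whether the
    properties came from --conf or from spark-defaults.conf.
    """
    kept = {}
    for key, value in props.items():
        if key.startswith("spark."):
            kept[key] = value
        else:
            warnings.warn(f"Ignoring non-spark config property: {key}={value}")
    return kept


# Sample input (made-up keys): one Spark property, one non-Spark property.
merged = {"spark.master": "local[2]", "my.custom.flag": "true"}
spark_only = ignore_non_spark_properties(merged)
```

Routing both input sources through one filter is what removes the spark-submit / spark-shell difference summarized in the list.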

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #5617 from piaozhexiu/SPARK-7037 and squashes the following commits:

8957950 [Cheolsoo Park] Add IgnoreNonSparkProperties method
fedd01c [Cheolsoo Park] Ignore non-spark properties with a warning message in all cases


[SPARK-6818] [SPARKR] Support column deletion in SparkR DataFrame API. · 73db132b

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #5655 from sun-rui/SPARK-6818 and squashes the following commits:

7c66570 [Sun Rui] [SPARK-6818][SPARKR] Support column deletion in SparkR DataFrame API.


[SQL] Break dataTypes.scala into multiple files. · 6220d933

Reynold Xin authored 9 years ago

It was over 1000 lines of code, making it harder to find all the types. Only moved code around, and didn't change any.

Author: Reynold Xin <rxin@databricks.com>

Closes #5670 from rxin/break-types and squashes the following commits:

8c59023 [Reynold Xin] Check in missing files.
dcd5193 [Reynold Xin] [SQL] Break dataTypes.scala into multiple files.


[SPARK-7070] [MLLIB] LDA.setBeta should call setTopicConcentration. · 1ed46a60

Xiangrui Meng authored 9 years ago

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5649 from mengxr/SPARK-7070 and squashes the following commits:

c66023c [Xiangrui Meng] setBeta should call setTopicConcentration


[SPARK-7087] [BUILD] Fix path issue change version script · 6d0749ca

Tijo Thomas authored 9 years ago

Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #5656 from tijoparacka/FIX_PATHISSUE_CHANGE_VERSION_SCRIPT and squashes the following commits:

ab4f4b1 [Tijo Thomas] removed whitespace
24478c9 [Tijo Thomas] modified to provide the spark base dir while searching for pom and also while changing the vesrion no
7b8e10b [Tijo Thomas] Modified for providing the base directories while finding the list of pom files and also while changing the version no


[SPARK-6879] [HISTORYSERVER] check if app is completed before clean it up · baa83a9a

WangTaoTheTonic authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-6879

Use `applications` to replace `FileStatus`, and check if the app is completed before cleaning it up.
If an exception is thrown, add the app to `applications` to wait for the next loop.
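
A rough Python sketch of that cleanup loop, under stated assumptions: `AppInfo`, `clean_logs`, the `max_age` threshold, and the use of `OSError` as the failure type are all hypothetical stand-ins, not names from the history server code.

```python
from dataclasses import dataclass


@dataclass
class AppInfo:
    """Hypothetical stand-in for one tracked application entry."""
    app_id: str
    last_updated: float
    completed: bool


def clean_logs(applications, max_age, now, delete):
    """Delete logs of expired *and* completed apps.

    Returns the apps that must still be tracked: those not yet expired,
    those still running, and those whose deletion failed (retried on the
    next loop).
    """
    remaining = []
    for app in applications:
        expired = now - app.last_updated > max_age
        if not (expired and app.completed):
            remaining.append(app)
            continue
        try:
            delete(app)
        except OSError:
            # Deletion failed: keep the app so the next loop tries again.
            remaining.append(app)
    return remaining


apps = [
    AppInfo("app-1", last_updated=0.0, completed=True),   # expired and completed
    AppInfo("app-2", last_updated=0.0, completed=False),  # expired but still running
    AppInfo("app-3", last_updated=90.0, completed=True),  # not expired yet
]
deleted = []
applications = clean_logs(apps, max_age=50.0, now=100.0, delete=deleted.append)
```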

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits:

4a533eb [WangTaoTheTonic] treat ACE specially
cb45105 [WangTaoTheTonic] rebase
d4d5251 [WangTaoTheTonic] per Marcelo's comments
d7455d8 [WangTaoTheTonic] slightly change when delete file
b0abca5 [WangTaoTheTonic] use global var to store apps to clean
94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted
9872a9d [WangTaoTheTonic] use the right path
fdef4d6 [WangTaoTheTonic] check if app is completed before clean it up


[SPARK-7085][MLlib] Fix miniBatchFraction parameter in train method called with 4 arguments · 3e91cc27

wizz authored 9 years ago

Author: wizz <wizz@wizz-dev01.kawasaki.flab.fujitsu.com>

Closes #5658 from kuromatsu-nobuyuki/SPARK-7085 and squashes the following commits:

6ec2d21 [wizz] Fix miniBatchFraction parameter in train method called with 4 arguments


[SPARK-7058] Include RDD deserialization time in "task deserialization time" metric · 6afde2c7

Josh Rosen authored 9 years ago

The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.
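
As a loose illustration of the accounting involved, the Python sketch below times task deserialization and RDD deserialization separately, then reports their sum as the deserialization metric while keeping both out of the run time. It is not Spark's implementation: `run_task`, the metric names, and the use of `pickle` are assumptions made for the example.

```python
import pickle
import time


def run_task(task_bytes, rdd_bytes):
    """Run a pickled task over a pickled dataset and collect timing metrics."""
    start = time.perf_counter()
    task = pickle.loads(task_bytes)          # deserialize the task itself
    task_deser = time.perf_counter() - start

    rdd_start = time.perf_counter()
    rdd = pickle.loads(rdd_bytes)            # deserialize the (broadcast) dataset
    rdd_deser = time.perf_counter() - rdd_start

    run_start = time.perf_counter()
    result = task(rdd)                       # only this part counts as run time
    run_time = time.perf_counter() - run_start

    metrics = {
        "deserialize_time": task_deser + rdd_deser,  # both parts are included
        "run_time": run_time,                        # deserialization not double-counted
    }
    return result, metrics


# The built-in `sum` plays the role of the task; the list plays the role of the RDD.
result, metrics = run_task(pickle.dumps(sum), pickle.dumps([1, 2, 3]))
```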

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5635 from JoshRosen/SPARK-7058 and squashes the following commits:

ed90f75 [Josh Rosen] Update UI tooltip
a3743b4 [Josh Rosen] Update comments.
4f52910 [Josh Rosen] Roll back whitespace change
e9cf9f4 [Josh Rosen] Remove unused variable
9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task.
21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime
1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric


[SPARK-7055][SQL]Use correct ClassLoader for JDBC Driver in JDBCRDD.getConnector · c1213e6a

Vinod K C authored 9 years ago

Author: Vinod K C <vinod.kc@huawei.com>

Closes #5633 from vinodkc/use_correct_classloader_driverload and squashes the following commits:

73c5380 [Vinod K C] Use correct ClassLoader for JDBC Driver


[SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint... · 534f2a43

Tathagata Das authored 9 years ago

[SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint and existing SparkContext

Currently, if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevents StreamingContext from being recreated from checkpoints in managed environments where SparkContext is precreated.

The solution in this PR: Introduce the following methods on StreamingContext
1. `new StreamingContext(checkpointDirectory, sparkContext)`
   Recreate StreamingContext from checkpoint using the provided SparkContext
2. `StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)`
   If checkpoint file exists, then recreate StreamingContext using the provided SparkContext (that is, call 1.), else create StreamingContext using the provided createFunction
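
The control flow of the two methods above can be sketched in Python. This is a toy model only: the `StreamingContext` class here is a hypothetical stand-in rather than the Spark class, and "checkpoint exists" is approximated as "the directory exists and is non-empty".

```python
import os


class StreamingContext:
    """Hypothetical minimal stand-in; it only records how it was built."""

    def __init__(self, spark_context, restored=False):
        self.spark_context = spark_context
        self.restored = restored


def get_or_create(checkpoint_dir, spark_context, create_function):
    """Reuse the given SparkContext in both branches.

    If checkpoint data is present, recreate the context from it (method 1);
    otherwise build a fresh one with create_function (method 2's fallback).
    """
    if os.path.isdir(checkpoint_dir) and os.listdir(checkpoint_dir):
        return StreamingContext(spark_context, restored=True)
    return create_function(spark_context)


sc = object()  # placeholder for a precreated SparkContext
ctx = get_or_create("/nonexistent/checkpoint-dir-example", sc,
                    lambda s: StreamingContext(s))
```

In both branches the caller's SparkContext is passed through, which is the point of the change for managed environments.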

TODO: the corresponding Java and Python APIs have to be added as well.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5428 from tdas/SPARK-6752 and squashes the following commits:

94db63c [Tathagata Das] Fix long line.
524f519 [Tathagata Das] Many changes based on PR comments.
eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
36a7823 [Tathagata Das] Minor changes.
204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext


[SPARK-7044] [SQL] Fix the deadlock in script transformation · cc48e638

Cheng Hao authored 9 years ago

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5625 from chenghao-intel/transform and squashes the following commits:

5ec1dd2 [Cheng Hao] fix the deadlock issue in ScriptTransform


[minor][streaming]fixed scala string interpolation error · 975f53e4

Prabeesh K authored 9 years ago

Author: Prabeesh K <prabeesh.k@namshi.com>

Closes #5653 from prabeesh/fix and squashes the following commits:

9d7a9f5 [Prabeesh K] fixed scala string interpolation error


[HOTFIX] [SQL] Fix compilation for scala 2.11. · a7d65d38

Prashant Sharma authored 9 years ago

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #5652 from ScrapCodes/hf/compilation-fix-scala-2.11 and squashes the following commits:

819ff06 [Prashant Sharma] [HOTFIX] Fix compilation for scala 2.11.


[SPARK-7069][SQL] Rename NativeType -> AtomicType. · f60bece1

Reynold Xin authored 9 years ago

Also renamed JvmType to InternalType.

Author: Reynold Xin <rxin@databricks.com>

Closes #5651 from rxin/native-to-atomic-type and squashes the following commits:

cbd4028 [Reynold Xin] [SPARK-7069][SQL] Rename NativeType -> AtomicType.


[SPARK-7068][SQL] Remove PrimitiveType · 29163c52

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #5646 from rxin/remove-primitive-type and squashes the following commits:

01b673d [Reynold Xin] [SPARK-7068][SQL] Remove PrimitiveType


[MLlib] Add support for BooleanType to VectorAssembler. · 2d33323c

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #5648 from rxin/vectorAssembler-boolean and squashes the following commits:

1bf3d40 [Reynold Xin] [MLlib] Add support for BooleanType to VectorAssembler.


[HOTFIX][SQL] Fix broken cached test · d9e70f33

Liang-Chi Hsieh authored 9 years ago

Added in #5475. Pointed out as broken in #5639.
/cc marmbrus

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5640 from viirya/fix_cached_test and squashes the following commits:

c0cf69a [Liang-Chi Hsieh] Fix broken cached test.


Apr 22, 2015

[SPARK-7046] Remove InputMetrics from BlockResult · 03e85b4a

Kay Ousterhout authored 9 years ago

This is a code cleanup.

The BlockResult class originally contained an InputMetrics object so that InputMetrics could
directly be used as the InputMetrics for the whole task. Now we copy the fields out of here, and
the presence of this object is confusing because it holds only partial input metrics (it doesn't
include the records read). Because this object is no longer useful (and is confusing), it should
be removed.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #5627 from kayousterhout/SPARK-7046 and squashes the following commits:

bf64bbe [Kay Ousterhout] Import fix
a08ca19 [Kay Ousterhout] [SPARK-7046] Remove InputMetrics from BlockResult


[SPARK-7066][MLlib] VectorAssembler should use NumericType not NativeType. · d2068606

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #5642 from rxin/mllib-native-type and squashes the following commits:

e23af5b [Reynold Xin] Remove StringType
7cbb205 [Reynold Xin] [SPARK-7066][MLlib] VectorAssembler should use NumericType and StringType, not NativeType.


[MLlib] UnaryTransformer nullability should not depend on PrimitiveType. · 1b85e085

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #5644 from rxin/mllib-nullable and squashes the following commits:

a727e5b [Reynold Xin] [MLlib] UnaryTransformer nullability should not depend on primitive types.


Disable flaky test: ReceiverSuite "block generator throttling". · b69c4f9b

Reynold Xin authored 9 years ago


[SPARK-6967] [SQL] fix date type convertion in jdbcrdd · 04525c07

Daoyuan Wang authored 9 years ago

This PR converts the java.sql.Date type into Int for JDBCRDD.
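
The commit message only states that dates become Int. The Python sketch below assumes the integer is a count of days since 1970-01-01, which is an assumption made for illustration; the helper names `date_to_days` and `days_to_date` are hypothetical.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def date_to_days(d):
    """Encode a calendar date as an int: days elapsed since the epoch (assumed encoding)."""
    return (d - EPOCH).days


def days_to_date(n):
    """Decode the int back into a calendar date."""
    return date.fromordinal(EPOCH.toordinal() + n)


encoded = date_to_days(date(2015, 4, 22))
decoded = days_to_date(encoded)
```

Storing the date as a plain integer keeps comparisons and ordering cheap, at the cost of needing a decode step when the value is shown to the user.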

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #5590 from adrian-wang/datebug and squashes the following commits:

f897b81 [Daoyuan Wang] add a test case
3c9184c [Daoyuan Wang] fix date type convertion in jdbcrdd


[SPARK-6827] [MLLIB] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API · f4f39981

Yanbo Liang authored 9 years ago

Make PySpark ```FPGrowthModel.freqItemsets``` consistent with the Java/Scala API, like ```MatrixFactorizationModel.userFeatures```.
It returns an RDD in which each tuple is composed of an array and a long value.
I think it's difficult to implement namedtuples to wrap the output because items of freqItemsets can be of any type and arbitrary length, which makes it tedious to implement the corresponding SerDe function.
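
For illustration, the shape of the result (an array of items paired with a long count) can be wrapped in a namedtuple on the Python side, which is the direction the squashed "use namedtuple" commit below points to. The sketch is plain Python over a list, not an RDD; `FreqItemset`, `wrap_freq_itemsets`, and the sample data are assumptions, not the PR's code.

```python
from collections import namedtuple

# Hypothetical record type: the items of one itemset plus its frequency.
FreqItemset = namedtuple("FreqItemset", ["items", "freq"])


def wrap_freq_itemsets(raw_pairs):
    """Wrap raw (items, count) pairs so fields can be read by name."""
    return [FreqItemset(list(items), int(freq)) for items, freq in raw_pairs]


# Made-up sample data: items may be of any type and any length.
raw = [(["a"], 4), (["a", "b"], 3)]
itemsets = wrap_freq_itemsets(raw)
```

Because a namedtuple is still a tuple, code that unpacks `(items, freq)` positionally keeps working alongside code that uses `.items` and `.freq`.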

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5614 from yanboliang/spark-6827 and squashes the following commits:

da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API


[SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin. · baf865dd

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #5638 from rxin/joinUsing and squashes the following commits:

13e9cc9 [Reynold Xin] Code review + Python.
b1bd914 [Reynold Xin] [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin and self join.
