  1. Aug 27, 2014
    • [SPARK-2830][MLLIB] doc update for 1.1 · 43dfc84f
      Xiangrui Meng authored
      1. renamed mllib-basics to mllib-data-types
      2. renamed mllib-stats to mllib-statistics
      3. moved random data generation to the bottom of mllib-stats
      4. updated toc accordingly
      
      atalwalkar
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:
      
      0bd79f3 [Xiangrui Meng] add mllib-data-types
      b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide
      f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
      4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
      e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md
      43dfc84f
    • [SPARK-3237][SQL] Fix parquet filters with UDFs · e1139dd6
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2153 from marmbrus/parquetFilters and squashes the following commits:
      
      712731a [Michael Armbrust] Use closure serializer for sending filters.
      1e83f80 [Michael Armbrust] Clean udf functions.
      e1139dd6
    • [SPARK-3139] Made ContextCleaner to not block on shuffles · 3e2864e4
      Tathagata Das authored
      As a workaround for SPARK-3015, the ContextCleaner was made "blocking", that is, it cleaned items one by one. But shuffles can take a long time to be deleted. Given that the RC for 1.1 is imminent, this PR makes a narrow change in the context cleaner: it no longer waits for shuffle cleanups to complete. It also downgrades the error messages on failed deletes to milder warnings, since an exception in the delete code path for one item does not really stop the system from functioning. A configuration sketch follows this entry.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2143 from tdas/cleaner-shuffle-fix and squashes the following commits:
      
      9c84202 [Tathagata Das] Restoring default blocking behavior in ContextCleanerSuite, and added docs to identify that spark.cleaner.referenceTracking.blocking does not control shuffle.
      2181329 [Tathagata Das] Mark shuffle cleanup as non-blocking.
      e337cc2 [Tathagata Das] Changed semantics based on PR comments.
      387b578 [Tathagata Das] Made ContextCleaner to not block on shuffles
      3e2864e4
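      For context, a minimal Scala sketch of toggling the flag mentioned in commit 9c84202; as that commit notes, the flag does not control shuffle cleanup, which is now non-blocking regardless:
      ```
      import org.apache.spark.{SparkConf, SparkContext}

      // Keep non-shuffle cleanups blocking (the behavior restored in ContextCleanerSuite);
      // per this change, shuffle cleanup stays non-blocking regardless of this flag.
      val conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("cleaner-demo")
        .set("spark.cleaner.referenceTracking.blocking", "true")

      val sc = new SparkContext(conf)
      ```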
    • HOTFIX: Minor typo in conf template · 9d65f271
      Patrick Wendell authored
      9d65f271
    • [SPARK-3167] Handle special driver configs in Windows · 7557c4cf
      Andrew Or authored
      This is an effort to bring the Windows scripts up to speed after the recent sweeping changes in #1845.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2129 from andrewor14/windows-config and squashes the following commits:
      
      881a8f0 [Andrew Or] Add reference to Windows taskkill
      92e6047 [Andrew Or] Update a few comments (minor)
      22b1acd [Andrew Or] Fix style again (minor)
      afcffea [Andrew Or] Fix style (minor)
      72004c2 [Andrew Or] Actually respect --driver-java-options
      803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH
      eeb34a0 [Andrew Or] Update outdated comment (minor)
      35caecc [Andrew Or] In Windows, actually kill Java processes on exit
      f97daa2 [Andrew Or] Fix Windows spark shell stdin issue
      83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
      7557c4cf
    • [SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI) · bf719056
      Reynold Xin authored
      This is a HOTFIX for 1.1.
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #2127 from rxin/SPARK-3224 and squashes the following commits:
      
      effb1ce [Reynold Xin] Move log message.
      49282b3 [Reynold Xin] Kay's feedback.
      3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224
      796d282 [Kay Ousterhout] Added unit test for SPARK-3224
      3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds.
      1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.
      bf719056
  2. Aug 26, 2014
    • Manually close old pull requests · e70aff6c
      Matei Zaharia authored
      Closes #671, Closes #515
      e70aff6c
    • Manually close some old pull requests · ee91eb8c
      Matei Zaharia authored
      Closes #530, Closes #223, Closes #738, Closes #546
      ee91eb8c
    • Fix unclosed HTML tag in Yarn docs. · d8345471
      Josh Rosen authored
      d8345471
    • [SPARK-3240] Adding known issue for MESOS-1688 · be043e3f
      Martin Weindel authored
      When using Mesos in fine-grained mode, a Spark job can run into a deadlock when allocatable memory on Mesos slaves is low. As a work-around, 32 MB (= Mesos MIN_MEM) are allocated for each task to ensure that Mesos makes new offers after task completion.
      From my perspective, it would be better to fix this problem in Mesos by dropping the memory constraint on offers, but as a temporary solution this patch helps avoid the deadlock on current Mesos versions.
      See [[MESOS-1688] No offers if no memory is allocatable](https://issues.apache.org/jira/browse/MESOS-1688) for details for this problem.
      
      Author: Martin Weindel <martin.weindel@gmail.com>
      
      Closes #1860 from MartinWeindel/master and squashes the following commits:
      
      5762030 [Martin Weindel] reverting work-around
      a6bf837 [Martin Weindel] added known issue for issue MESOS-1688
      d9d2ca6 [Martin Weindel] work around for problem with Mesos offering semantic (see [https://issues.apache.org/jira/browse/MESOS-1688])
      be043e3f
    • [SPARK-3036][SPARK-3037][SQL] Add MapType/ArrayType containing null value support to Parquet. · 727cb25b
      Takuya UESHIN authored
      JIRA:
      - https://issues.apache.org/jira/browse/SPARK-3036
      - https://issues.apache.org/jira/browse/SPARK-3037
      
      Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`:
      
      ```
      message root {
        optional group a (MAP) {
          repeated group map (MAP_KEY_VALUE) {
            required int32 key;
            optional int32 value;
          }
        }
      }
      ```
      
      for `ArrayType` when `containsNull` is `true`:
      
      ```
      message root {
        optional group a (LIST) {
          repeated group bag {
            optional int32 array;
          }
        }
      }
      ```
      
      We have to think about compatibility with older versions of Spark, Hive, and the other systems I mentioned in the JIRA issues. A usage sketch follows this entry.
      
      Notice:
      This PR is based on #1963 and #1889.
      Please check them first.
      
      /cc marmbrus, yhuai
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits:
      
      4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet.
      013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet.
      62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037
      8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037
      727cb25b
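      As an illustration, a short Scala sketch (against the 1.1-era `applySchema`/`saveAsParquetFile` API) of writing a Parquet file whose schema declares nullable map values and nullable array elements; the field names and output path are made up:
      ```
      import org.apache.spark.sql._

      val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

      // Schema with a map whose values may be null and an array whose elements may be null.
      val schema = StructType(Seq(
        StructField("m", MapType(IntegerType, IntegerType, valueContainsNull = true), nullable = true),
        StructField("a", ArrayType(IntegerType, containsNull = true), nullable = true)))

      val rows = sc.parallelize(Seq(Row(Map(1 -> 10, 2 -> null), Seq(1, null, 3))))

      // applySchema + saveAsParquetFile are the 1.1-era SchemaRDD APIs.
      sqlContext.applySchema(rows, schema).saveAsParquetFile("/tmp/nullable.parquet")
      ```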
    • [Docs] Run tests like in contributing guide · 73b3089b
      nchammas authored
      The Contributing to Spark guide [recommends](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting) running tests by calling `./dev/run-tests`. The README should, too.
      
      `./sbt/sbt test` does not cover Python tests or style tests.
      
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #2149 from nchammas/patch-2 and squashes the following commits:
      
      2b3b132 [nchammas] [Docs] Run tests like in contributing guide
      73b3089b
    • [SPARK-2964] [SQL] Remove duplicated code from spark-sql and start-thriftserver.sh · faeb9c0e
      Cheng Lian authored
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1886 from sarutak/SPARK-2964 and squashes the following commits:
      
      8ef8751 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2964
      26e7c95 [Kousuke Saruta] Revert "Shorten timeout to more reasonable value"
      ffb68fa [Kousuke Saruta] Modified spark-sql and start-thriftserver.sh to use bin/utils.sh
      8c6f658 [Kousuke Saruta] Merge branch 'spark-3026' of https://github.com/liancheng/spark into SPARK-2964
      81b43a8 [Cheng Lian] Shorten timeout to more reasonable value
      a89e66d [Cheng Lian] Fixed command line options quotation in scripts
      9c894d3 [Cheng Lian] Fixed bin/spark-sql -S option typo
      be4736b [Cheng Lian] Report better error message when running JDBC/CLI without hive-thriftserver profile enabled
      faeb9c0e
    • [SPARK-3225]Typo in script · 2ffd3290
      WangTao authored
      use_conf_dir => user_conf_dir in load-spark-env.sh.
      
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #1926 from WangTaoTheTonic/TypoInScript and squashes the following commits:
      
      0c104ad [WangTao] Typo in script
      2ffd3290
    • [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() · f1e71d4c
      Davies Liu authored
      Use an external sort to support sorting large datasets in the reduce stage.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1978 from davies/sort and squashes the following commits:
      
      bbcd9ba [Davies Liu] check spilled bytes in tests
      b125d2f [Davies Liu] add test for external sort in rdd
      eae0176 [Davies Liu] choose different disks from different processes and instances
      1f075ed [Davies Liu] Merge branch 'master' into sort
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
      f1e71d4c
    • [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences · c4787a36
      Michael Armbrust authored
      It is common to want to describe sets of attributes that are in various parts of a query plan.  However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically.  For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.
      
      In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality.  (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32))  This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.
      
      I also took this opportunity to refactor the expression and query plan base classes.  In all but one instance the logic for computing the `references` of an `Expression` was the same, so I moved this logic into the base class.
      
      For query plans the semantics of the `references` method were ill-defined (are they the references of the output? those used by expression evaluation? or something else?).  As a result, this method wasn't really used very much, so I removed it.
      
      TODO:
       - [x] Finish scala doc for `AttributeSet`
       - [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
       - [x] Finish removing `references` from `QueryPlan`
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2109 from marmbrus/attributeSets and squashes the following commits:
      
      1c0dae5 [Michael Armbrust] work on serialization bug.
      9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
      3ae5288 [Michael Armbrust] review comments
      40ce7f6 [Michael Armbrust] style
      d577cc7 [Michael Armbrust] Scaladoc
      cae5d22 [Michael Armbrust] remove more references implementations
      d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
      fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
      c4787a36
    • [SPARK-2839][MLlib] Stats Toolkit documentation updated · 1208f72a
      Burak authored
      Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar
      
      https://issues.apache.org/jira/browse/SPARK-2839
      
      P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one.
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits:
      
      a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments
      bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark
      213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review
      fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated
      1208f72a
    • [SPARK-3226][MLLIB] doc update for native libraries · adbd5c16
      Xiangrui Meng authored
      to mention the `-Pnetlib-lgpl` option. atalwalkar
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2128 from mengxr/mllib-native and squashes the following commits:
      
      4cbba57 [Xiangrui Meng] update mllib dependencies
      adbd5c16
    • [SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map. · 6b5584ef
      Takuya UESHIN authored
      Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` values.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits:
      
      3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
      4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
      9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
      d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value.
      6b5584ef
    • [SPARK-2969][SQL] Make ScalaReflection be able to handle... · 98c2bb0b
      Takuya UESHIN authored
      [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
      
      Make `ScalaReflection` able to handle mappings like the following (a sketch follows this entry):
      
      - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
      - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
      - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
      - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:
      
      24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
      79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
      7cd1a7a [Takuya UESHIN] Fix json test failures.
      2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
      2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
      9fa02f5 [Takuya UESHIN] Fix a test failure.
      1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
      98c2bb0b
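      A small Scala sketch of inspecting the inferred schema for such case classes, assuming the 1.1-era `createSchemaRDD` implicit and `printSchema()`; the `Record` class and field names are illustrative:
      ```
      import org.apache.spark.sql.SQLContext

      case class Record(
        a: Seq[Int],                  // inferred as ArrayType(IntegerType, containsNull = false)
        b: Seq[java.lang.Integer],    // inferred as ArrayType(IntegerType, containsNull = true)
        m: Map[Int, java.lang.Long])  // inferred as MapType(IntegerType, LongType, valueContainsNull = true)

      val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
      import sqlContext.createSchemaRDD

      // The implicit conversion turns this RDD of case classes into a SchemaRDD;
      // printSchema() shows the inferred containsNull / valueContainsNull flags.
      val records = sc.parallelize(Seq(
        Record(Seq(1, 2), Seq[java.lang.Integer](Integer.valueOf(1), null), Map(1 -> java.lang.Long.valueOf(1L)))))
      records.printSchema()
      ```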
    • [SPARK-2871] [PySpark] add histogram() API · 3cedc4f4
      Davies Liu authored
      RDD.histogram(buckets)
      
              Compute a histogram using the provided buckets. The buckets
              are all open to the right except for the last, which is closed.
              e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
              which means 1<=x<10, 10<=x<20, 20<=x<=50. On the input of 1
              and 50 we would have a histogram of 1,0,1.
      
              If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
              this can be switched from an O(log n) insertion to O(1) per
              element (where n = # buckets).
      
              Buckets must be sorted, must not contain any duplicates, and
              must have at least two elements.
      
              If `buckets` is a number, it will generate buckets that are
              evenly spaced between the minimum and maximum of the RDD. For
              example, if the min value is 0 and the max is 100, given buckets
              as 2, the resulting buckets will be [0,50) [50,100]. buckets must
              be at least 1. If the RDD contains infinity or NaN, an exception
              is thrown. If the elements in the RDD do not vary (max == min),
              a single bucket is always returned.
      
              It returns a tuple of buckets and histogram.
      
              >>> rdd = sc.parallelize(range(51))
              >>> rdd.histogram(2)
              ([0, 25, 50], [25, 26])
              >>> rdd.histogram([0, 5, 25, 50])
              ([0, 5, 25, 50], [5, 20, 26])
              >>> rdd.histogram([0, 15, 30, 45, 60], True)
              ([0, 15, 30, 45, 60], [15, 15, 15, 6])
              >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
              >>> rdd.histogram(("a", "b", "c"))
              (('a', 'b', 'c'), [2, 2])
      
      Closes #122, which is a duplicate.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2091 from davies/histgram and squashes the following commits:
      
      a322f8a [Davies Liu] fix deprecation of e.message
      84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
      d9a0722 [Davies Liu] address comments
      0e18a2d [Davies Liu] add histgram() API
      3cedc4f4
    • [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext · 8856c3d8
      chutium authored
      There are 4 different compression codecs available for ```ParquetOutputFormat```.
      
      In Spark SQL, the codec was set as a hard-coded value in ```ParquetRelation.defaultCompression```.
      
      Original discussion:
      https://github.com/apache/spark/pull/195#discussion-diff-11002083
      
      I added a new config property in SQLConf to allow users to change this compression codec, using a similar short-name syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0). A usage sketch follows this entry.
      
      By the way, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415) and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).
      
      Author: chutium <teng.qiu@gmail.com>
      
      Closes #2039 from chutium/parquet-compression and squashes the following commits:
      
      2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
      e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
      21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
      8856c3d8
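      A brief Scala sketch of how the new property might be used, assuming the key is `spark.sql.parquet.compression.codec` and the short names follow the SPARK-2953 style; `Point` and the output path are made up:
      ```
      import org.apache.spark.sql.SQLContext

      case class Point(x: Int, y: Int)

      val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
      import sqlContext.createSchemaRDD

      // Select the Parquet compression codec by its short name before writing
      // (codecs discussed in this PR: uncompressed, snappy, gzip, lzo).
      sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

      // Parquet files written afterwards use the selected codec.
      sc.parallelize(Seq(Point(1, 2), Point(3, 4))).saveAsParquetFile("/tmp/points.parquet")
      ```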
    • [SPARK-2886] Use more specific actor system name than "spark" · b21ae5bb
      Andrew Or authored
      As of #1777 we log the name of the actor system when it binds to a port. The current name "spark" is super general and does not convey any meaning. For instance, the following lines are taken from my driver log after setting `spark.driver.port` to 5001.
      ```
      14/08/13 19:33:29 INFO Remoting: Remoting started; listening on addresses:
      [akka.tcp://sparkandrews-mbp:5001]
      14/08/13 19:33:29 INFO Remoting: Remoting now listens on addresses:
      [akka.tcp://sparkandrews-mbp:5001]
      14/08/06 13:40:05 INFO Utils: Successfully started service 'spark' on port 5001.
      ```
      This commit renames this to "sparkDriver" and "sparkExecutor". The goal of this unambitious PR is simply to make the logged information more explicit without introducing any change in functionality.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1810 from andrewor14/service-name and squashes the following commits:
      
      8c459ed [Andrew Or] Use a common variable for driver/executor actor system names
      3a92843 [Andrew Or] Change actor name to sparkDriver and sparkExecutor
      921363e [Andrew Or] Merge branch 'master' of github.com:apache/spark into service-name
      c8c6a62 [Andrew Or] Do not include hyphens in actor name
      1c1b42e [Andrew Or] Avoid spaces in akka system name
      f644b55 [Andrew Or] Use more specific service name
      b21ae5bb
    • [Spark-3222] [SQL] Cross join support in HiveQL · 52fbdc2d
      Daoyuan Wang authored
      We can simply treat a cross join as an inner join without join conditions; a usage sketch follows this entry.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: adrian-wang <daoyuanwong@gmail.com>
      
      Closes #2124 from adrian-wang/crossjoin and squashes the following commits:
      
      8c9b7c5 [Daoyuan Wang] add a test
      7d47bbb [adrian-wang] add cross join support for hql
      52fbdc2d
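      A minimal Scala sketch of the kind of query this enables, using made-up table names and assuming `HiveContext.sql` parses HiveQL (the 1.1 default dialect):
      ```
      import org.apache.spark.sql.hive.HiveContext

      val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

      // With this change, CROSS JOIN is parsed as an inner join with no condition,
      // so every row of `colors` is paired with every row of `sizes`.
      val crossed = hiveContext.sql("SELECT c.name, s.label FROM colors c CROSS JOIN sizes s")
      crossed.collect().foreach(println)
      ```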
  3. Aug 25, 2014
    • [SPARK-2976] Replace tabs with spaces · 62f5009f
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1895 from sarutak/SPARK-2976 and squashes the following commits:
      
      1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976
      d1e0666 [Kousuke Saruta] Modified styles
      c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java
      c003b36 [Kousuke Saruta] Removed tab from sorttable.js
      62f5009f
    • SPARK-2481: The environment variable SPARK_HISTORY_OPTS is covered in spark-env.sh · 9f04db17
      witgo authored
      Author: witgo <witgo@qq.com>
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1341 from witgo/history_env and squashes the following commits:
      
      b4fd9f8 [GuoQiang Li] review commit
      0ebe401 [witgo] *-history-server.sh load spark-config.sh
      9f04db17
    • [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile · 4243bb66
      Chia-Yung Su authored
      Fix a compile error on Hadoop 0.23 for pull request #1924.
      
      Author: Chia-Yung Su <chiayung@appier.com>
      
      Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:
      
      be30793 [Chia-Yung Su] remove .* and _* except _metadata
      8fe2398 [Chia-Yung Su] add note to explain
      40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
      c7e44f2 [Chia-Yung Su] match syntax
      f8fc32a [Chia-Yung Su] filter out tmp dir
      4243bb66
    • [SQL] logWarning should be logInfo in getResultSetSchema · 507a1b52
      wangfei authored
      Author: wangfei <wangfei_hello@126.com>
      
      Closes #1939 from scwf/patch-5 and squashes the following commits:
      
      f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
      507a1b52
    • [SPARK-3058] [SQL] Support EXTENDED for EXPLAIN · 156eb396
      Cheng Hao authored
      Provide `extended` keyword support for the `explain` command in SQL, e.g.
      ```
      explain extended select key as a1, value as a2 from src where key=1;
      == Parsed Logical Plan ==
      Project ['key AS a1#3,'value AS a2#4]
       Filter ('key = 1)
        UnresolvedRelation None, src, None
      
      == Analyzed Logical Plan ==
      Project [key#8 AS a1#3,value#9 AS a2#4]
       Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
        MetastoreRelation default, src, None
      
      == Optimized Logical Plan ==
      Project [key#8 AS a1#3,value#9 AS a2#4]
       Filter (CAST(key#8, DoubleType) = 1.0)
        MetastoreRelation default, src, None
      
      == Physical Plan ==
      Project [key#8 AS a1#3,value#9 AS a2#4]
       Filter (CAST(key#8, DoubleType) = 1.0)
        HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
      
      Code Generation: false
      == RDD ==
      (2) MappedRDD[14] at map at HiveContext.scala:350
        MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
        MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
        MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
        MappedRDD[10] at map at TableReader.scala:240
        HadoopRDD[9] at HadoopRDD at TableReader.scala:230
      ```
      
      It's a sub-task of #1847, but it can go in without any dependency.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1962 from chenghao-intel/explain_extended and squashes the following commits:
      
      295db74 [Cheng Hao] Fix bug in printing the simple execution plan
      48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
      156eb396
    • [SPARK-2929][SQL] Refactored Thrift server and CLI suites · cae9414d
      Cheng Lian authored
      Removed most hard-coded timeouts, timing assumptions, and all `Thread.sleep` calls. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites run more robustly and faster.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1856 from liancheng/thriftserver-tests and squashes the following commits:
      
      2d914ca [Cheng Lian] Minor refactoring
      0e12e71 [Cheng Lian] Cleaned up test output
      0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
      cae9414d
    • [SPARK-3204][SQL] MaxOf would be foldable if both left and right are foldable. · d299e2bf
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following commits:
      
      7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.
      d299e2bf
    • Fixed a typo in docs/running-on-mesos.md · 805fec84
      Cheng Lian authored
      It should be `spark-env.sh` rather than `spark.env.sh`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2119 from liancheng/fix-mesos-doc and squashes the following commits:
      
      f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md
      805fec84
    • [FIX] fix error message in sendMessageReliably · fd8ace2d
      Xiangrui Meng authored
      rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2120 from mengxr/sendMessageReliably and squashes the following commits:
      
      b14400c [Xiangrui Meng] fix error message in sendMessageReliably
      fd8ace2d
    • SPARK-3180 - Better control of security groups · cc40a709
      Allan Douglas R. de Oliveira authored
      Adds the --authorized-address and --additional-security-group options as explained in the issue.
      
      Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>
      
      Closes #2088 from douglaz/configurable_sg and squashes the following commits:
      
      e3e48ca [Allan Douglas R. de Oliveira] Adds the option to specify the address authorized to access the SG and another option to provide an additional existing SG
      cc40a709
    • SPARK-2798 [BUILD] Correct several small errors in Flume module pom.xml files · cd30db56
      Sean Owen authored
      (EDIT) The scalatest issue has since been resolved, so this is now about a few small problems in the Flume Sink `pom.xml`:
      
      - `scalatest` is not declared as a test-scope dependency
      - Its Avro version doesn't match the rest of the build
      - Its Flume version is not synced with the other Flume module
      - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version
      - It depends on Scala Lang directly, which it shouldn't
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #1726 from srowen/SPARK-2798 and squashes the following commits:
      
      a46e2c6 [Sean Owen] scalatest to test scope, harmonize Avro and Flume versions, remove direct Scala dependency, fix '2.10' in Flume dependency
      cd30db56
    • [SPARK-2495][MLLIB] make KMeans constructor public · 220f4136
      Xiangrui Meng authored
      to re-construct k-means models freeman-lab
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2112 from mengxr/public-constructors and squashes the following commits:
      
      18d53a9 [Xiangrui Meng] make KMeans constructor public
      220f4136
  4. Aug 24, 2014
    • [SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId() · fb0db772
      Davies Liu authored
      RDD.zipWithIndex()
      
              Zips this RDD with its element indices.
      
              The ordering is first based on the partition index and then the
              ordering of items within each partition. So the first item in
              the first partition gets index 0, and the last item in the last
              partition receives the largest index.
      
              This method needs to trigger a Spark job when this RDD contains
              more than one partition.
      
              >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
              [(0, 0), (1, 1), (2, 2), (3, 3)]
      
      RDD.zipWithUniqueId()
      
              Zips this RDD with generated unique Long ids.
      
              Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
              n is the number of partitions. So there may be gaps, but this
              method won't trigger a Spark job, which is different from
              L{zipWithIndex}.
      
              >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
              [(0, 0), (2, 1), (1, 2), (3, 3)]
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2092 from davies/zipWith and squashes the following commits:
      
      cebe5bf [Davies Liu] improve test cases, reverse the order of index
      0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
      fb0db772
    • [MLlib][SPARK-2997] Update SVD documentation to reflect roughly square · b1b20301
      Reza Zadeh authored
      Update the documentation to reflect the fact that we can handle roughly square matrices.
      
      Author: Reza Zadeh <rizlar@gmail.com>
      
      Closes #2070 from rezazadeh/svddocs and squashes the following commits:
      
      826b8fe [Reza Zadeh] left singular vectors
      3f34fc6 [Reza Zadeh] PCA is still TS
      7ffa2aa [Reza Zadeh] better title
      aeaf39d [Reza Zadeh] More docs
      788ed13 [Reza Zadeh] add computational cost explanation
      6429c59 [Reza Zadeh] Add link to rowmatrix docs
      1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square
      b1b20301
    • [SPARK-2841][MLlib] Documentation for feature transformations · 572952ae
      DB Tsai authored
      Documentation for newly added feature transformations:
      1. TF-IDF
      2. StandardScaler
      3. Normalizer
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #2068 from dbtsai/transformer-documentation and squashes the following commits:
      
      109f324 [DB Tsai] address feedback
      572952ae
    • [SPARK-3192] Some scripts have 2 space indentation but other scripts have 4 space indentation. · ded6796b
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2104 from sarutak/SPARK-3192 and squashes the following commits:
      
      db78419 [Kousuke Saruta] Modified indentation of spark-shell
      ded6796b