Commits · 1725a1a5d10a53762bd80f391eddbf306f2841ee · cs525-sp18-g07 / spark

Sep 05, 2014

[SPARK-3391][EC2] Support attaching up to 8 EBS volumes. · 1725a1a5

Reynold Xin authored 10 years ago

Please merge this at the same time as https://github.com/mesos/spark-ec2/pull/66

Author: Reynold Xin <rxin@apache.org>

Closes #2260 from rxin/ec2-ebs-vol and squashes the following commits:

b9527d9 [Reynold Xin] Removed io1 ebs type.
bf9c403 [Reynold Xin] Made EBS volume type configurable.
c8e25ea [Reynold Xin] Support up to 8 EBS volumes.
adf4f2e [Reynold Xin] Revert git repo change.
020c542 [Reynold Xin] [SPARK-3391] Support attaching more than 1 EBS volumes.

1725a1a5

Sep 04, 2014

[SPARK-3392] [SQL] Show value spark.sql.shuffle.partitions for mapred.reduce.tasks · 1904bac3

Cheng Hao authored 10 years ago

This is a tiny fix so that getting the value of "mapred.reduce.tasks" shows the value of spark.sql.shuffle.partitions, which makes more sense for Hive users.
It also covers the command "set -v", which should output verbose information for all of the key/value pairs.
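
The aliasing described above can be sketched roughly as follows. This is an illustration in Python, not Spark's code: the `SqlConf` class, the `ALIASES` table, and the initial value `"200"` are all hypothetical, and the only assumption taken from the message is that `mapred.reduce.tasks` resolves to `spark.sql.shuffle.partitions`.

```python
# Hypothetical sketch: a Hive-style key is resolved to the Spark SQL key it aliases.
SHUFFLE_PARTITIONS = "spark.sql.shuffle.partitions"
ALIASES = {"mapred.reduce.tasks": SHUFFLE_PARTITIONS}


class SqlConf:
    """Toy configuration store (not Spark's SQLConf)."""

    def __init__(self):
        # Illustrative starting value only.
        self._settings = {SHUFFLE_PARTITIONS: "200"}

    def set(self, key, value):
        # Writes through the alias, so both keys share one stored value.
        self._settings[ALIASES.get(key, key)] = str(value)

    def get(self, key):
        # Reads through the alias as well.
        return self._settings[ALIASES.get(key, key)]


conf = SqlConf()
conf.set("mapred.reduce.tasks", 10)
print(conf.get("mapred.reduce.tasks"))
print(conf.get(SHUFFLE_PARTITIONS))
```

Both lookups return the same stored value, which is the behaviour a Hive user would expect from the legacy key.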

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2261 from chenghao-intel/set_mapreduce_tasks and squashes the following commits:

653858a [Cheng Hao] show value spark.sql.shuffle.partitions for mapred.reduce.tasks

1904bac3

[SPARK-2219][SQL] Added support for the "add jar" command · ee575f12

Cheng Lian authored 10 years ago

Adds logical and physical command classes for the "add jar" command.

Note that this PR conflicts with and should be merged after #2215.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2242 from liancheng/add-jar and squashes the following commits:

e43a2f1 [Cheng Lian] Updates AddJar according to conventions introduced in #2215
b99107f [Cheng Lian] Added test case for ADD JAR command
095b2c7 [Cheng Lian] Also forward ADD JAR command to Hive
9be031b [Cheng Lian] Trims Jar path string
8195056 [Cheng Lian] Added support for the "add jar" command

ee575f12

[SPARK-3310][SQL] Directly use currentTable without unnecessary implicit conversion · 3eb6ef31

Liang-Chi Hsieh authored 10 years ago

We can directly use currentTable there without unnecessary implicit conversion.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2203 from viirya/direct_use_inmemoryrelation and squashes the following commits:

4741d02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into direct_use_inmemoryrelation
b671f67 [Liang-Chi Hsieh] Can directly use currentTable there without unnecessary implicit conversion.

3eb6ef31

Manually close old PR · 90b17a70

Matei Zaharia authored 10 years ago

Closes #544

90b17a70

Manually close old PR · 0fdf2f5a

Matei Zaharia authored 10 years ago

Closes #1588

0fdf2f5a

[SPARK-3378] [DOCS] Replace the word "SparkSQL" with right word "Spark SQL" · dc1ba9e9

Kousuke Saruta authored 10 years ago

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2251 from sarutak/SPARK-3378 and squashes the following commits:

0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378
bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL"
6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"

dc1ba9e9

[SPARK-3401][PySpark] Wrong usage of tee command in python/run-tests · 4feb46c5

Kousuke Saruta authored 10 years ago

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2272 from sarutak/SPARK-3401 and squashes the following commits:

2b35a59 [Kousuke Saruta] Modified wrong usage of tee command in python/run-tests

4feb46c5

[Minor]Remove extra semicolon in FlumeStreamSuite.scala · 90586190

GuoQiang Li authored 10 years ago

Author: GuoQiang Li <witgo@qq.com>

Closes #2265 from witgo/FlumeStreamSuite and squashes the following commits:

6c99e6e [GuoQiang Li] Remove extra semicolon in FlumeStreamSuite.scala

90586190

[HOTFIX] [SPARK-3400] Revert "fix GraphX EdgeRDD zipPartitions" · 00362dac

Ankur Dave authored 10 years ago

9b225ac3 has been causing GraphX tests
to fail nondeterministically, which is blocking development for others.

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2271 from ankurdave/SPARK-3400 and squashes the following commits:

10c2a97 [Ankur Dave] [HOTFIX] [SPARK-3400] Revert 9b225ac3 "fix GraphX EdgeRDD zipPartitions"

00362dac

Sep 03, 2014

[SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to... · 1bed0a38

Kousuke Saruta authored 10 years ago

[SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to multi-byte character contained in Gradient.scala

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2248 from sarutak/SPARK-3372 and squashes the following commits:

73a28b8 [Kousuke Saruta] Replaced UTF-8 hyphen with ascii hyphen

1bed0a38

[SPARK-2435] Add shutdown hook to pyspark · 7c6e71f0

Matthew Farrellee authored 10 years ago

Author: Matthew Farrellee <matt@redhat.com>

Closes #2183 from mattf/SPARK-2435 and squashes the following commits:

ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark

7c6e71f0

[SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF · c5cbc492

Davies Liu authored 10 years ago

After this patch, broadcast can be used in Python UDF.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2243 from davies/udf_broadcast and squashes the following commits:

7b88861 [Davies Liu] support broadcast in UDF

c5cbc492

[SPARK-2961][SQL] Use statistics to prune batches within cached partitions · 248067ad

Cheng Lian authored 10 years ago

This PR is based on #1883 authored by marmbrus. Key differences:

1. Batch pruning instead of partition pruning

When #1883 was authored, batched column buffer building (#1880) hadn't been introduced. This PR combines the two and provides batch-level pruning within partitions, which leads to a smaller memory footprint and can generally skip more elements. The cost is that the pruning predicates are evaluated more frequently (the number of partitions multiplied by the number of batches per partition).

2. More filters are supported

Filter predicates built from `=`, `<`, `<=`, `>`, `>=`, and their conjunctions and disjunctions, are supported.
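
As a rough illustration of statistics-based batch pruning, the sketch below keeps a lower and upper bound per batch and skips any batch that cannot satisfy a `>` predicate. It is written in Python for brevity; the `Batch` class and `scan_greater_than` function are hypothetical names, not the classes touched by this PR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Batch:
    """One cached batch plus the min/max statistics gathered when it was built."""
    rows: List[int]
    lower: int = field(init=False)
    upper: int = field(init=False)

    def __post_init__(self):
        self.lower = min(self.rows)
        self.upper = max(self.rows)


def scan_greater_than(batches, threshold):
    """Return rows > threshold and the number of batches skipped via statistics."""
    result, skipped = [], 0
    for batch in batches:
        if batch.upper <= threshold:
            # The batch maximum rules out every row, so the batch is never read.
            skipped += 1
            continue
        result.extend(r for r in batch.rows if r > threshold)
    return result, skipped


batches = [Batch([1, 2, 3]), Batch([4, 5, 6]), Batch([7, 8, 9])]
rows, skipped = scan_greater_than(batches, 6)
print(rows, skipped)
```

The predicate is checked once per batch rather than once per partition, which is the extra evaluation cost mentioned above.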

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2188 from liancheng/in-mem-batch-pruning and squashes the following commits:

68cf019 [Cheng Lian] Marked sqlContext as @transient
4254f6c [Cheng Lian] Enables in-memory partition pruning in PartitionBatchPruningSuite
3784105 [Cheng Lian] Overrides InMemoryColumnarTableScan.sqlContext
d2a1d66 [Cheng Lian] Disables in-memory partition pruning by default
062c315 [Cheng Lian] HiveCompatibilitySuite code cleanup
16b77bf [Cheng Lian] Fixed pruning predication conjunctions and disjunctions
16195c5 [Cheng Lian] Enabled both disjunction and conjunction
89950d0 [Cheng Lian] Worked around Scala style check
9c167f6 [Cheng Lian] Minor code cleanup
3c4d5c7 [Cheng Lian] Minor code cleanup
ea59ee5 [Cheng Lian] Renamed PartitionSkippingSuite to PartitionBatchPruningSuite
fc517d0 [Cheng Lian] More test cases
1868c18 [Cheng Lian] Code cleanup, bugfix, and adding tests
cb76da4 [Cheng Lian] Added more predicate filters, fixed table scan stats for testing purposes
385474a [Cheng Lian] Merge branch 'inMemStats' into in-mem-batch-pruning

248067ad

[SPARK-2973][SQL] Lightweight SQL commands without distributed jobs when calling .collect() · f48420fd

Cheng Lian authored 10 years ago

By overriding `executeCollect()` in the physical plan classes of all commands, we can avoid kicking off a distributed job when collecting the result of a SQL command, e.g. `sql("SET").collect()`.

Previously, `Command.sideEffectResult` returned a `Seq[Any]`, and the `execute()` method in sub-classes of `Command` typically converted that to a `Seq[Row]` and then parallelized it into an RDD. With this PR, `sideEffectResult` is required to return a `Seq[Row]` directly, so that `executeCollect()` can use it directly and be factored into the `Command` parent class.
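
The idea can be sketched as follows, in simplified Python rather than the actual Scala classes. `Cluster`, `Command`, and `SetCommand` here are hypothetical stand-ins, and `execute()` is simplified to collect immediately instead of returning an RDD.

```python
class Cluster:
    """Hypothetical stand-in for a distributed backend that counts submitted jobs."""

    def __init__(self):
        self.jobs_run = 0

    def collect(self, rows):
        self.jobs_run += 1  # pretend a distributed job ran
        return list(rows)


class Command:
    """Parent class: sub-classes only supply side_effect_result()."""

    def __init__(self, cluster):
        self.cluster = cluster

    def side_effect_result(self):
        raise NotImplementedError

    def execute(self):
        # General path: hand the rows to the cluster.
        return self.cluster.collect(self.side_effect_result())

    def execute_collect(self):
        # Fast path: the rows already live on the driver, so no job is needed.
        return list(self.side_effect_result())


class SetCommand(Command):
    def side_effect_result(self):
        return [("spark.sql.shuffle.partitions", "200")]


cluster = Cluster()
rows = SetCommand(cluster).execute_collect()
print(rows, cluster.jobs_run)
```

Because every command returns rows in the same shape, the fast path can live once in the parent class instead of being repeated in each sub-class.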

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2215 from liancheng/lightweight-commands and squashes the following commits:

3fbef60 [Cheng Lian] Factored execute() method of physical commands to parent class Command
5a0e16c [Cheng Lian] Passes test suites
e0e12e9 [Cheng Lian] Refactored Command.sideEffectResult and Command.executeCollect
995bdd8 [Cheng Lian] Cleaned up DescribeHiveTableCommand
542977c [Cheng Lian] Avoids confusion between logical and physical plan by adding package prefixes
55b2aa5 [Cheng Lian] Avoids distributed jobs when execution SQL commands

f48420fd

[SPARK-3233] Executor never stops its SparkEnv, BlockManager, ConnectionManager etc. · 4bba10c4

Kousuke Saruta authored 10 years ago

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2138 from sarutak/SPARK-3233 and squashes the following commits:

c0205b7 [Kousuke Saruta] Merge branch 'SPARK-3233' of github.com:sarutak/spark into SPARK-3233
064679d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
d3005fd [Kousuke Saruta] Modified Class definition format of BlockManagerMaster
039b747 [Kousuke Saruta] Modified style
889e2d1 [Kousuke Saruta] Modified BlockManagerMaster to be able to be passed isDriver flag
4da8535 [Kousuke Saruta] Modified BlockManagerMaster#stop to send StopBlockManagerMaster message when sender is Driver
6518c3a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
d5ab19a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
6bce25c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3233
6058a58 [Kousuke Saruta] Modified Executor not to invoke SparkEnv#stop in local mode
e5ad9d3 [Kousuke Saruta] Modified Executor to stop SparkEnv at the end of itself

4bba10c4

[SPARK-3303][core] fix SparkContextSchedulerCreationSuite test error · e08ea739

scwf authored 10 years ago

Running the test on the master branch with this command when the Mesos native lib is set:
sbt/sbt -Phive "test-only org.apache.spark.SparkContextSchedulerCreationSuite"

produces this error:
[info] SparkContextSchedulerCreationSuite:
[info] - bad-master
[info] - local
[info] - local-*
[info] - local-n
[info] - local--n-failures
[info] - local-n-failures
[info] - bad-local-n
[info] - bad-local-n-failures
[info] - local-default-parallelism
[info] - simr
[info] - local-cluster
[info] - yarn-cluster
[info] - yarn-standalone
[info] - yarn-client
[info] - mesos fine-grained
[info] - mesos coarse-grained *** FAILED ***
[info] Executor Spark home `spark.mesos.executor.home` is not set!

Since `executorSparkHome` is only used in `createCommand`, move `val executorSparkHome...` into `createCommand` to fix this issue.
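
A minimal sketch of why moving the lookup helps, written in Python with hypothetical names (`SchedulerBackend`, `create_command`, and the `bin/hypothetical-executor` path are illustrative only): when the setting is read inside the method that needs it, merely constructing the backend no longer fails.

```python
class SchedulerBackend:
    """Toy backend: the executor home is looked up only when a command is built."""

    def __init__(self, conf):
        # No lookup here, so construction succeeds even without the setting.
        self.conf = conf

    def create_command(self):
        home = self.conf.get("spark.mesos.executor.home")
        if home is None:
            raise ValueError(
                "Executor Spark home `spark.mesos.executor.home` is not set!")
        return home + "/bin/hypothetical-executor"


# Construction alone (what a scheduler-creation test exercises) does not raise.
backend = SchedulerBackend({})

configured = SchedulerBackend({"spark.mesos.executor.home": "/opt/spark"})
print(configured.create_command())
```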

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei_hello@126.com>

Closes #2199 from scwf/SparkContextSchedulerCreationSuite and squashes the following commits:

ef1de22 [scwf] fix code format
19d26f3 [scwf] fix conflict
d9a8a60 [wangfei] fix SparkContextSchedulerCreationSuite test error

e08ea739

[SPARK-2419][Streaming][Docs] Updates to the streaming programming guide · a5224079

Tathagata Das authored 10 years ago

Updated the main streaming programming guide, and also added source-specific guides for Kafka, Flume, Kinesis.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Jacek Laskowski <jacek@japila.pl>

Closes #2254 from tdas/streaming-doc-fix and squashes the following commits:

e45c6d7 [Jacek Laskowski] More fixes from an old PR
5125316 [Tathagata Das] Fixed links
dc02f26 [Tathagata Das] Refactored streaming kinesis guide and made many other changes.
acbc3e3 [Tathagata Das] Fixed links between streaming guides.
cb7007f [Tathagata Das] Added Streaming + Flume integration guide.
9bd9407 [Tathagata Das] Updated streaming programming guide with additional information from SPARK-2419.

a5224079

[SPARK-3345] Do correct parameters for ShuffleFileGroup · 996b7434

Liang-Chi Hsieh authored 10 years ago

In the `newFileGroup` method of class `FileShuffleBlockManager`, the parameters for creating a new `ShuffleFileGroup` object are in the wrong order.

Because the parameters `shuffleId` and `fileId` are not used in the current code, this does not cause a problem now. However, it should be corrected for readability and to avoid future problems.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2235 from viirya/correct_shufflefilegroup_params and squashes the following commits:

fe72567 [Liang-Chi Hsieh] Do correct parameters for ShuffleFileGroup.

996b7434

[Minor] Fix outdated Spark version · 2784822e

Andrew Or authored 10 years ago

This is causing the event logs to include a file called SPARK_VERSION_1.0.0, which is not accurate.

Author: Andrew Or <andrewor14@gmail.com>
Author: andrewor14 <andrewor14@gmail.com>

Closes #2255 from andrewor14/spark-version and squashes the following commits:

1fbdfe9 [andrewor14] Snapshot
805a1c8 [Andrew Or] JK. Update Spark version to 1.2.0 instead.
bffbaab [Andrew Or] Update Spark version to 1.1.0

2784822e

[SPARK-3388] Expose application ID in ApplicationStart event, use it in history server. · f2b5b619

Marcelo Vanzin authored 10 years ago

This change exposes the application ID generated by the Spark Master, Mesos or Yarn
via the SparkListenerApplicationStart event. It then uses that information to expose the
application via its ID in the history server, instead of using the internal directory name
generated by the event logger as an application id. This allows someone who knows
the application ID to easily figure out the URL for the application's entry in the HS, aside
from looking better.

In Yarn mode, this is used to generate a direct link from the RM application list to the
Spark history server entry (thus providing a fix for SPARK-2150).

Note this sort of assumes that the different managers will generate app ids that are
sufficiently different from each other that clashes will not occur.

Author: Marcelo Vanzin <vanzin@cloudera.com>

This patch had conflicts when merged, resolved by
Committer: Andrew Or <andrewor14@gmail.com>

Closes #1218 from vanzin/yarn-hs-link-2 and squashes the following commits:

2d19f3c [Marcelo Vanzin] Review feedback.
6706d3a [Marcelo Vanzin] Implement applicationId() in base classes.
56fe42e [Marcelo Vanzin] Fix cluster mode history address, plus a cleanup.
44112a8 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
8278316 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
a86bbcf [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
a0056e6 [Marcelo Vanzin] Unbreak test.
4b10cfd [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
cb0cab2 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
25f2826 [Marcelo Vanzin] Add MIMA excludes.
f0ba90f [Marcelo Vanzin] Use BufferedIterator.
c90a08d [Marcelo Vanzin] Remove unused code.
3f8ec66 [Marcelo Vanzin] Review feedback.
21aa71b [Marcelo Vanzin] Fix JSON test.
b022bae [Marcelo Vanzin] Undo SparkContext cleanup.
c6d7478 [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
4e3483f [Marcelo Vanzin] Fix test.
57517b8 [Marcelo Vanzin] Review feedback. Mostly, more consistent use of Scala's Option.
311e49d [Marcelo Vanzin] Merge branch 'master' into yarn-hs-link-2
d35d86f [Marcelo Vanzin] Fix yarn backend after rebase.
36dc362 [Marcelo Vanzin] Don't use Iterator::takeWhile().
0afd696 [Marcelo Vanzin] Wait until master responds before returning from start().
abc4697 [Marcelo Vanzin] Make FsHistoryProvider keep a map of applications by id.
26b266e [Marcelo Vanzin] Use Mesos framework ID as Spark application ID.
b3f3664 [Marcelo Vanzin] [yarn] Make the RM link point to the app directly in the HS.
2fb7de4 [Marcelo Vanzin] Expose the application ID in the ApplicationStart event.
ed10348 [Marcelo Vanzin] Expose application id to spark context.

f2b5b619

[SPARK-2845] Add timestamps to block manager events. · ccc69e26

Marcelo Vanzin authored 10 years ago

These are not used by the UI but are useful when analysing the
logs from a spark job.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #654 from vanzin/bm-event-tstamp and squashes the following commits:

d5d6e66 [Marcelo Vanzin] Fix tests.
ec06218 [Marcelo Vanzin] Review feedback.
f134dbc [Marcelo Vanzin] Merge branch 'master' into bm-event-tstamp
b495b7c [Marcelo Vanzin] Merge branch 'master' into bm-event-tstamp
7d2fe9e [Marcelo Vanzin] Review feedback.
d6f381c [Marcelo Vanzin] Update tests added after patch was created.
45e3bf8 [Marcelo Vanzin] Fix unit test after merge.
b37a10f [Marcelo Vanzin] Use === in test assertions.
ef72824 [Marcelo Vanzin] Handle backwards compatibility with 1.0.0.
aca1151 [Marcelo Vanzin] Fix unit test to check new fields.
efdda8e [Marcelo Vanzin] Add timestamps to block manager events.

ccc69e26

[SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR #720 · e5d37680

RJ Nowling authored 10 years ago

PR #720 made multiple changes to GraphGenerator.logNormalGraph including:

* Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect.
* Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated.
* Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter.
* Failed to update scala docs and programming guide for API changes
* Added a Synthetic Benchmark in the examples.

This PR:
* Removes the in-line calls and calls original vertex / edge generation functions again
* Adds an optional seed parameter for deterministic behavior (when desired)
* Keeps the number of partitions parameter that was added.
* Keeps compatibility with the synthetic benchmark example
* Maintains backwards-compatible API
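
The optional-seed behaviour listed above can be sketched like this. It is a Python illustration, not the GraphX code: `sample_log_normal` mirrors the name used in the commit list, but its signature and the sample parameter values are assumptions.

```python
import math
import random


def sample_log_normal(mu, sigma, max_val, seed=None):
    """Draw a log-normally distributed integer below max_val.

    seed=None gives a randomly seeded generator; an explicit seed makes
    the result repeatable.
    """
    rng = random.Random(seed)
    while True:
        x = int(math.exp(mu + sigma * rng.gauss(0.0, 1.0)))
        if x < max_val:
            return x


# Same seed, same value; omitting the seed leaves the result random.
first = sample_log_normal(4.0, 1.3, 100, seed=42)
second = sample_log_normal(4.0, 1.3, 100, seed=42)
print(first == second)
```

Making the seed optional with a `None` default keeps existing call sites working, which is how the signature stays backwards compatible.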

Author: RJ Nowling <rnowling@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>

Closes #2168 from rnowling/graphgenrand and squashes the following commits:

f1cd79f [Ankur Dave] Style fixes
e11918e [RJ Nowling] Fix bad comparisons in unit tests
785ac70 [RJ Nowling] Fix style error
c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed
41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed
799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal
43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges
2faf75f [RJ Nowling] Added unit test for logNormalGraph
82f22397 [RJ Nowling] Add unit test for sampleLogNormal
b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing
6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges
1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing
dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs
b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed
1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate
98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters
d40141a [RJ Nowling] Fix style error
684804d [RJ Nowling] revert PR #720 which introduce errors in logNormalGraph and messed up seeding of RNGs.  Add user-defined optional seed for deterministic behavior
c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed.
015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit 894ecde0

e5d37680

[SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274

Davies Liu authored 10 years ago

Put all public APIs in __all__, and also put them all in pyspark.__init__.py, so that we can get all the documentation for the public API with `pydoc pyspark`. It can also be used by other programs (such as Sphinx or Epydoc) to generate documentation for public APIs only.
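
For readers unfamiliar with `__all__`, here is a small self-contained sketch of its effect on `from module import *`. The module name `demo_pkg` and the classes inside it are hypothetical and exist only for this example.

```python
import sys
import types

# Source of a throwaway module that declares its public API via __all__.
source = '''
__all__ = ["SparkContext"]

class SparkContext:
    """Public entry point."""

class _InternalHelper:
    """Private by naming convention."""

class Undocumented:
    """Public-looking name that is deliberately left out of __all__."""
'''

demo = types.ModuleType("demo_pkg")
exec(source, demo.__dict__)
sys.modules["demo_pkg"] = demo

# A star import copies only the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Only the name listed in `__all__` is exported, even though `Undocumented` has no leading underscore.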

Author: Davies Liu <davies.liu@gmail.com>

Closes #2205 from davies/public and squashes the following commits:

c6c5567 [Davies Liu] fix message
f7b35be [Davies Liu] put SchemaRDD, Row in pyspark.sql module
7e3016a [Davies Liu] add __all__ in mllib
6281b48 [Davies Liu] fix doc for SchemaRDD
6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py

6481d274

[SPARK-3187] [yarn] Cleanup allocator code. · 6a72a369

Marcelo Vanzin authored 10 years ago

Move all shared logic to the base YarnAllocator class, and leave
the version-specific logic in the version-specific module.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2169 from vanzin/SPARK-3187 and squashes the following commits:

46c2826 [Marcelo Vanzin] Hide the privates.
4dc9c83 [Marcelo Vanzin] Actually release containers.
8b1a077 [Marcelo Vanzin] Changes to the Yarn alpha allocator.
f3f5f1d [Marcelo Vanzin] [SPARK-3187] [yarn] Cleanup allocator code.

6a72a369

Sep 02, 2014

SPARK-3358: [EC2] Switch back to HVM instances for m3.X. · c64cc435

Patrick Wendell authored 10 years ago

During regression tests of Spark 1.1 we discovered perf issues with
PVM instances when running PySpark. This reverts a change added in #1156
which changed the default type for m3 instances to PVM.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #2244 from pwendell/ec2-hvm and squashes the following commits:

1342d7e [Patrick Wendell] SPARK-3358: [EC2] Switch back to HVM instances for m3.X.

c64cc435

[SPARK-3300][SQL] No need to call clear() and shorten build() · 24ab3840

Liang-Chi Hsieh authored 10 years ago

The function `ensureFreeSpace` in object `ColumnBuilder` clears the old buffer before copying its content to the new buffer. This PR fixes that.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2195 from viirya/fix_buffer_clear and squashes the following commits:

792f009 [Liang-Chi Hsieh] no need to call clear(). use flip() instead of calling limit(), position() and rewind().
df2169f [Liang-Chi Hsieh] should clean old buffer after copying its content.

24ab3840

[SQL] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats · 19d3e1e8

Cheng Lian authored 10 years ago

Class names of these two are just too similar.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2189 from liancheng/column-metrics and squashes the following commits:

8bb3b21 [Cheng Lian] Renamed ColumnStat to ColumnMetrics to avoid confusion between ColumnStats

19d3e1e8

[SPARK-3341][SQL] The dataType of Sqrt expression should be DoubleType. · 0cd91f66

Takuya UESHIN authored 10 years ago

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2233 from ueshin/issues/SPARK-3341 and squashes the following commits:

e497320 [Takuya UESHIN] Fix data type of Sqrt expression.

0cd91f66

[SPARK-2823][GraphX]fix GraphX EdgeRDD zipPartitions · 9b225ac3

luluorta authored 10 years ago

If users set “spark.default.parallelism” to a value that differs from the EdgeRDD partition number, GraphX jobs will throw:
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions

Author: luluorta <luluorta@gmail.com>

Closes #1763 from luluorta/fix-graph-zip and squashes the following commits:

8338961 [luluorta] fix GraphX EdgeRDD zipPartitions

9b225ac3

[SPARK-1981][Streaming][Hotfix] Fixed docs related to kinesis · e9bb12be

Tathagata Das authored 10 years ago

- Include kinesis in the unidocs
- Hide non-public classes from docs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2239 from tdas/kinesis-doc-fix and squashes the following commits:

156e20c [Tathagata Das] More fixes, based on PR comments.
e9a6c01 [Tathagata Das] Fixed docs related to kinesis

e9bb12be

[SPARK-2981][GraphX] EdgePartition1D Int overflow · aa7de128

Larry Xiao authored 10 years ago

Minor fix.
Details are here: https://issues.apache.org/jira/browse/SPARK-2981
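
The commit message does not describe the overflow itself, so the sketch below is only an assumption about the general class of bug: narrowing a wide product to 32 bits before taking the remainder can yield a negative partition index. The constant, the function names, and the Python emulation of JVM integer semantics are all illustrative, and the sketch assumes `src` is small enough that the product fits in 64 bits.

```python
MIXING_PRIME = 1125899906842597  # illustrative large prime (an assumption)


def to_int32(x):
    """Keep the low 32 bits and reinterpret them as signed, like a JVM long-to-int cast."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def jvm_rem(a, n):
    """Remainder whose sign follows the dividend, as on the JVM."""
    r = abs(a) % n
    return r if a >= 0 else -r


def partition_narrow_first(src, num_parts):
    # Overflow-prone order: narrow the product to 32 bits, then take the remainder.
    return jvm_rem(to_int32(abs(src) * MIXING_PRIME), num_parts)


def partition_mod_first(src, num_parts):
    # Safer order: take the remainder on the wide value, then narrow.
    return to_int32(abs(src) * MIXING_PRIME % num_parts)


print(partition_narrow_first(1, 4), partition_mod_first(1, 4))
```

With the first ordering the result can fall outside the range `0` to `num_parts - 1`; with the second it cannot.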

Author: Larry Xiao <xiaodi@sjtu.edu.cn>

Closes #1902 from larryxiao/2981 and squashes the following commits:

88059a2 [Larry Xiao] [SPARK-2981][GraphX] EdgePartition1D Int overflow

aa7de128

[SPARK-3123][GraphX]: override the "setName" function to set EdgeRDD's name... · 7c9bbf17

uncleGen authored 10 years ago

[SPARK-3123][GraphX]: override the "setName" function to set EdgeRDD's name manually just as VertexRDD does.

Author: uncleGen <hustyugm@gmail.com>

Closes #2033 from uncleGen/master_origin and squashes the following commits:

801994b [uncleGen] Update EdgeRDD.scala

7c9bbf17

[SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples · 7c92b49d

Larry Xiao authored 10 years ago

to support ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt --numEPart=256

Author: Larry Xiao <xiaodi@sjtu.edu.cn>

Closes #1766 from larryxiao/1986 and squashes the following commits:

bb77cd9 [Larry Xiao] [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples

7c92b49d

SPARK-3328 fixed make-distribution script --with-tachyon option. · 644e3152

Prudhvi Krishna authored 10 years ago

The directory paths for the dependencies jar and resources have changed in Tachyon 0.5.0.

Author: Prudhvi Krishna <prudhvi953@gmail.com>

Closes #2228 from prudhvije/SPARK-3328/make-dist-fix and squashes the following commits:

d1d2c22 [Prudhvi Krishna] SPARK-3328 fixed make-distribution script --with-tachyon option.

644e3152

[SPARK-2871] [PySpark] add countApproxDistinct() API · e2c901b4

Davies Liu authored 10 years ago

RDD.countApproxDistinct(relativeSD=0.05):

        :: Experimental ::
        Return approximate number of distinct elements in the RDD.

        The algorithm used is based on streamlib's implementation of
        "HyperLogLog in Practice: Algorithmic Engineering of a State
        of The Art Cardinality Estimation Algorithm", available
        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.

        This supports all the types of objects that are supported by
        Pyrolite, which covers nearly all builtin types.

        param relativeSD Relative accuracy. Smaller values create
                           counters that require more space.
                           It must be greater than 0.000017.

        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
        >>> 950 < n < 1050
        True
        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
        >>> 18 < n < 22
        True

Author: Davies Liu <davies.liu@gmail.com>

Closes #2142 from davies/countApproxDistinct and squashes the following commits:

e20da47 [Davies Liu] remove the correction in Python
c38c4e4 [Davies Liu] fix doc tests
2ab157c [Davies Liu] fix doc tests
9d2565f [Davies Liu] add comments and link for hash collision correction
d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
ded624f [Davies Liu] calculate hash in Python
4cba98f [Davies Liu] add more tests
a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
e97e342 [Davies Liu] add countApproxDistinct()

e2c901b4

SPARK-3052. Misleading and spurious FileSystem closed errors whenever a ... · 81b9d5b6

Sandy Ryza authored 10 years ago

...job fails while reading from Hadoop

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1956 from sryza/sandy-spark-3052 and squashes the following commits:

815813a [Sandy Ryza] SPARK-3052. Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

81b9d5b6

[SPARK-3347] [yarn] Fix yarn-alpha compilation. · 066f31a6

Marcelo Vanzin authored 10 years ago

Missing import. Oops.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2236 from vanzin/SPARK-3347 and squashes the following commits:

594fc39 [Marcelo Vanzin] [SPARK-3347] [yarn] Fix yarn-alpha compilation.

066f31a6

[SPARK-1919] Fix Windows spark-shell --jars · 8f1f9aaf

Andrew Or authored 10 years ago

We were trying to add `file:/C:/path/to/my.jar` to the class path. We should add `C:/path/to/my.jar` instead. Tested on Windows 8.1.
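
The path conversion described above might look roughly like this. It is a Python sketch under the assumption that only a leading `file:` scheme in front of a drive letter needs stripping; `format_windows_path` is a hypothetical helper, not the function changed in this PR.

```python
import re


def format_windows_path(uri):
    """Turn file:/C:/x.jar into C:/x.jar; leave other strings untouched."""
    # Strip "file:" plus its slashes only when a drive letter follows.
    return re.sub(r"^file:/+(?=[A-Za-z]:/)", "", uri)


print(format_windows_path("file:/C:/path/to/my.jar"))
```

Paths that already start with a drive letter pass through unchanged, so the helper is safe to apply more than once.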

Author: Andrew Or <andrewor14@gmail.com>

Closes #2211 from andrewor14/windows-shell-jars and squashes the following commits:

262c6a2 [Andrew Or] Oops... Add the new code to the correct place
0d5a0c1 [Andrew Or] Format jar path only for adding to shell classpath
42bd626 [Andrew Or] Remove unnecessary code
0049f1b [Andrew Or] Remove embarrassing log messages
b1755a0 [Andrew Or] Format jar paths properly before adding them to the classpath

8f1f9aaf

[SPARK-3061] Fix Maven build under Windows · 378b2315

Josh Rosen authored 10 years ago

The Maven build was failing on Windows because it tried to call the unix `unzip` utility to extract the Py4J files into core's build directory.  I've fixed this issue by using the `maven-antrun-plugin` to perform the unzipping.

I also fixed an issue that prevented tests from running under Windows:

In the Maven ScalaTest plugin, the filename listed in <filereports> is placed under the <reportsDirectory>; the current code places it in a subdirectory of reportsDirectory, e.g.

```
${project.build.directory}/surefire-reports/${project.build.directory}/SparkTestSuite.txt
```

This caused problems under Windows because it would try to create a subdirectory named "c:\\".

Note that the tests still fail under Windows (for other reasons); this PR just allows them to run and fail rather than crash when trying to create the test reports directory.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2165 from JoshRosen/windows-support and squashes the following commits:

651d210 [Josh Rosen] Unzip to python/build instead of core/build
fbf3e61 [Josh Rosen] 4 spaces -> 2 spaces
e347668 [Josh Rosen] Fix Maven scalatest filereports path:
4994af1 [Josh Rosen] [SPARK-3061] Use maven-antrun-plugin to unzip Py4J.

378b2315