Commits · 9497b12d429cf9d075807896637e40e205175203 · cs525-sp18-g07 / spark

Aug 14, 2014

[SPARK-3006] Failed to execute spark-shell in Windows OS · 9497b12d

Modified the order of the options and arguments in spark-shell.cmd

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #1918 from tsudukim/feature/SPARK-3006 and squashes the following commits:

8bba494 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
1a32410 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS

Aug 13, 2014

SPARK-3020: Print completed indices rather than tasks in web UI · 0c7b4529

Patrick Wendell authored 10 years ago

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1933 from pwendell/speculation and squashes the following commits:

33a3473 [Patrick Wendell] Use OpenHashSet
8ce2ff0 [Patrick Wendell] SPARK-3020: Print completed indices rather than tasks in web UI

[SPARK-2986] [SQL] fixed: setting properties does not take effect · 63d67777

guowei authored 10 years ago

It seems that the set command is not run by SparkSQLDriver; it runs on the Hive API.
As a result, users cannot change the number of reducers by setting spark.sql.shuffle.partitions.

However, I think setting Hive properties should simply be handled by Spark SQL.

Author: guowei <guowei@upyoo.com>

Closes #1904 from guowei2/temp-branch and squashes the following commits:

7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective

[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled · 905dc4b4

Kousuke Saruta authored 10 years ago

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1891 from sarutak/SPARK-2970 and squashes the following commits:

4a2d2fe [Kousuke Saruta] Modified comment style
8bd833c [Kousuke Saruta] Modified style
6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem

[SPARK-2935][SQL]Fix parquet predicate push down bug · 9fde1ff5

Michael Armbrust authored 10 years ago

Author: Michael Armbrust <michael@databricks.com>

Closes #1863 from marmbrus/parquetPredicates and squashes the following commits:

10ad202 [Michael Armbrust] left <=> right
f249158 [Michael Armbrust] quiet parquet tests.
802da5b [Michael Armbrust] Add test case.
eab2eda [Michael Armbrust] Fix parquet predicate push down bug

[SPARK-2650][SQL] More precise initial buffer size estimation for in-memory column buffer · 376a82e1

Cheng Lian authored 10 years ago

This is a follow up of #1880.

Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.
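The estimate described above can be sketched as a product of the known row count and a per-value size. This is only an illustration: the `DEFAULT_SIZES` table and the `initial_buffer_size` helper are hypothetical names, and the byte sizes are assumptions rather than values taken from the patch.

```python
# Hypothetical per-value sizes in bytes; the commit message does not state
# which column types or sizes the real implementation uses.
DEFAULT_SIZES = {"int": 4, "long": 8, "double": 8}


def initial_buffer_size(column_type, rows_in_batch):
    """Estimate the initial buffer size from the known number of rows."""
    return DEFAULT_SIZES[column_type] * rows_in_batch


# A batch of 1000 int values would start with a 4000-byte buffer.
size = initial_buffer_size("int", 1000)
```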

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:

d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer

[SPARK-2994][SQL] Support for udfs that take complex types · 9256d4a9

Michael Armbrust authored 10 years ago

Author: Michael Armbrust <michael@databricks.com>

Closes #1915 from marmbrus/arrayUDF and squashes the following commits:

a1c503d [Michael Armbrust] Support for udfs that take complex types

[SPARK-2817] [SQL] add "show create table" support · 13f54e2b

tianyi authored 10 years ago

In the Spark SQL component, the "show create table" syntax had been disabled.
We think it is a useful function for describing a Hive table.

Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi@asiainfo.com>
Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #1760 from tianyi/spark-2817 and squashes the following commits:

7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
cbffe8b [tianyi] [SPARK-2817] fix the case problem
565ec14 [tianyi] [SPARK-2817] fix the case problem
60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
a337bd6 [tianyi] [SPARK-2817] add "show create table" support
a03db77 [tianyi] [SPARK-2817] add "show create table" support

[SPARK-3004][SQL] Added null checking when retrieving row set · bdc7a1a4

Cheng Lian authored 10 years ago

JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004)

HiveThriftServer2 throws an exception when the result set contains `NULL`. We should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`.

Note that simply using `row.addColumnValue(null)` doesn't work, since Hive sets the column type of a null `ColumnValue` to String by default.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1920 from liancheng/spark-3004 and squashes the following commits:

1b1db1c [Cheng Lian] Adding NULL column values in the Hive way
2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set

[MLLIB] use Iterator.fill instead of Array.fill · 7ecb867c

Xiangrui Meng authored 10 years ago

Iterator.fill uses less memory
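The memory difference can be illustrated with a rough Python analogue, where a list plays the role of `Array.fill` and a generator plays the role of `Iterator.fill`. This is not the Scala code from the patch; the function names and the choice of Gaussian samples are assumptions made for the sketch.

```python
import random
import sys


def eager_samples(n, seed=0):
    """Materialize all n samples at once (analogous to Array.fill)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def lazy_samples(n, seed=0):
    """Produce samples one at a time (analogous to Iterator.fill)."""
    rng = random.Random(seed)
    return (rng.gauss(0.0, 1.0) for _ in range(n))


# The generator object stays small no matter how large n is,
# while the list grows with n.
list_bytes = sys.getsizeof(eager_samples(100000))
generator_bytes = sys.getsizeof(lazy_samples(100000))
```

Both functions yield the same values for the same seed; only the point at which the values are held in memory differs.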

Author: Xiangrui Meng <meng@databricks.com>

Closes #1930 from mengxr/rand-gen-iter and squashes the following commits:

24178ca [Xiangrui Meng] use Iterator.fill instead of Array.fill

[SPARK-2983] [PySpark] improve performance of sortByKey() · 434bea1c

Davies Liu authored 10 years ago

1. skip partitionBy() when numOfPartition is 1
2. use bisect_left (O(lg(N))) instead of a loop (O(N)) in rangePartitioner
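The second point can be sketched as follows. The two helper functions are hypothetical and are not the PySpark source; they only show that a binary search over sorted partition bounds returns the same partition index as a linear scan.

```python
from bisect import bisect_left


def partition_by_loop(key, bounds):
    """O(N) scan: index of the first bound that is >= key."""
    for i, bound in enumerate(bounds):
        if key <= bound:
            return i
    return len(bounds)


def partition_by_bisect(key, bounds):
    """O(lg(N)) binary search over the same sorted bounds."""
    return bisect_left(bounds, key)


# Three bounds split the key space into four partitions (indices 0..3).
bounds = [10, 20, 30]
indices = [partition_by_bisect(k, bounds) for k in (5, 10, 11, 30, 31)]
```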

Author: Davies Liu <davies.liu@gmail.com>

Closes #1898 from davies/sort and squashes the following commits:

0a9608b [Davies Liu] Merge branch 'master' into sort
1cf9565 [Davies Liu] improve performance of sortByKey()

[SPARK-3013] [SQL] [PySpark] convert array into list · c974a716

Davies Liu authored 10 years ago

Convert arrays into lists, because Pyrolite does not support array from Python 2.6.
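A minimal sketch of such a conversion is shown below. The `arrays_to_lists` helper is hypothetical and is not the code from the patch; it walks nested containers and replaces every `array.array` with a plain list before the value would be pickled.

```python
from array import array


def arrays_to_lists(obj):
    """Recursively replace array.array values with plain lists."""
    if isinstance(obj, array):
        return obj.tolist()
    if isinstance(obj, (list, tuple)):
        # Rebuild the container with the same type (list or tuple).
        return type(obj)(arrays_to_lists(item) for item in obj)
    if isinstance(obj, dict):
        return dict((k, arrays_to_lists(v)) for k, v in obj.items())
    return obj


row = {"ids": array("i", [1, 2]), "nested": (array("d", [1.5]),)}
converted = arrays_to_lists(row)
```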

Author: Davies Liu <davies.liu@gmail.com>

Closes #1928 from davies/fix_array and squashes the following commits:

858e6c5 [Davies Liu] convert array into list

[SPARK-2963] [SQL] There is no documentation about building to use HiveServer and CLI for SparkSQL · 869f06c7

Kousuke Saruta authored 10 years ago

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1885 from sarutak/SPARK-2963 and squashes the following commits:

ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL

[SPARK-2993] [MLLib] colStats (wrapper around MultivariateStatisticalSummary) in Statistics · fe473595

Doris Xin authored 10 years ago

For both Scala and Python.

The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside of `MultivariateStatisticalSummarySerialized`, which is then referenced inside of a method in `PythonMLLibAPI`.

`MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1911 from dorx/colStats and squashes the following commits:

77b9924 [Doris Xin] developerAPI tag
de9cbbe [Doris Xin] reviewer comments and moved more ser/de
459faba [Doris Xin] colStats in Statistics for both Scala and Python

[SPARK-1777 (partial)] bugfix: make size of requested memory correctly · 2bd81263

Zhang, Liye authored 10 years ago

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #1892 from liyezhang556520/lazy_memory_request and squashes the following commits:

335ab61 [Zhang, Liye] [SPARK-1777 (partial)] bugfix: make size of requested memory correctly

Use transferTo when copy merge files in ExternalSorter · 246cb3f1

Raymond Liu authored 10 years ago

Since this is a file to file copy, using transferTo should be faster.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1884 from colorant/externalSorter and squashes the following commits:

6e42f3c [Raymond Liu] More code into copyStream
bfb496b [Raymond Liu] Use transferTo when copy merge files in ExternalSorter

[SPARK-2953] Allow using short names for io compression codecs · 676f9828

Reynold Xin authored 10 years ago

Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
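The lookup can be sketched as a small table with a fallback to the name as given. This is an illustration, not the Spark implementation: only the LZ4 class name appears in the message above, the other two class names are assumed by analogy, and the case-insensitive matching is also an assumption.

```python
# Short name -> fully qualified class name. Only the LZ4 entry is taken from
# the commit message; the LZF and Snappy entries are assumed by analogy.
SHORT_CODEC_NAMES = {
    "lz4": "org.apache.spark.io.LZ4CompressionCodec",
    "lzf": "org.apache.spark.io.LZFCompressionCodec",
    "snappy": "org.apache.spark.io.SnappyCompressionCodec",
}


def resolve_codec(name):
    """Map a short name to a class name; pass any other value through."""
    # Lower-casing the lookup key is an assumption of this sketch.
    return SHORT_CODEC_NAMES.get(name.lower(), name)


resolved = resolve_codec("lz4")
```

Falling back to the input keeps fully qualified class names working unchanged.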

Author: Reynold Xin <rxin@apache.org>

Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:

9f50962 [Reynold Xin] Specify short-form compression codec names first.
63f78ee [Reynold Xin] Updated configuration documentation.
47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs

Aug 12, 2014

SPARK-2830 [MLlib]: re-organize mllib documentation · c235b83e

Ameet Talwalkar authored 10 years ago

As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.

Author: Ameet Talwalkar <atalwalkar@gmail.com>

Closes #1908 from atalwalkar/master and squashes the following commits:

fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation

fix flaky tests · 882da57a

Davies Liu authored 10 years ago

Python 2.6 does not handle floating-point error as well as 2.7+ does.
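One common way to make float-based tests less sensitive to such differences is to compare with a tolerance instead of relying on exact values. The sketch below illustrates that general technique; it is an assumption that it resembles the actual fix, and `almost_equal` is a hypothetical helper.

```python
def almost_equal(a, b, tol=1e-9):
    """Compare two floats up to an absolute tolerance."""
    return abs(a - b) <= tol


# Exact equality fails because of rounding error in binary floating point,
# while the tolerant comparison succeeds.
exact = (0.1 + 0.2) == 0.3
tolerant = almost_equal(0.1 + 0.2, 0.3)
```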

Author: Davies Liu <davies.liu@gmail.com>

Closes #1910 from davies/fix_test and squashes the following commits:

7e51200 [Davies Liu] fix flaky tests

[MLlib] Correctly set vectorSize and alpha · f0060b75

Liquan Pei authored 10 years ago

mengxr
Correctly set vectorSize and alpha in Word2Vec training.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1900 from Ishiihara/Word2Vec-bugfix and squashes the following commits:

85f64f2 [Liquan Pei] correctly set vectorSize and alpha

[SPARK-2923][MLLIB] Implement some basic BLAS routines · 9038d94e

Xiangrui Meng authored 10 years ago

Having some basic BLAS operations implemented in MLlib can help simplify the current implementation and improve performance.

Tested on my local machine:

~~~
bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
examples/target/scala-*/spark-examples-*.jar --algorithm LR --regType L2 \
--regParam 1.0 --numIterations 1000 ~/share/data/rcv1.binary/rcv1_train.binary
~~~

1. before: ~1m
2. after: ~30s
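For readers unfamiliar with the routines named in the squashed commits (axpy, dot, scal, copy), here is a plain-Python sketch of their standard level-1 BLAS semantics on dense lists. It is not the MLlib code, which is written in Scala, and it ignores sparse vectors entirely.

```python
def axpy(a, x, y):
    """y += a * x, updating y in place."""
    for i in range(len(x)):
        y[i] += a * x[i]


def dot(x, y):
    """Return the inner product of x and y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def scal(a, x):
    """x *= a, updating x in place."""
    for i in range(len(x)):
        x[i] *= a


def copy(x, y):
    """Copy the contents of x into y, in place."""
    y[:] = x


y = [1.0, 2.0, 3.0]
axpy(2.0, [1.0, 1.0, 1.0], y)  # y is now [3.0, 4.0, 5.0]
```

Updating the destination in place avoids allocating a temporary vector for every step.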

CC: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #1849 from mengxr/ml-blas and squashes the following commits:

ba583a2 [Xiangrui Meng] exclude Vector.copy
a4d7d2f [Xiangrui Meng] Merge branch 'master' into ml-blas
6edeab9 [Xiangrui Meng] address comments
940bdeb [Xiangrui Meng] rename MLlibBLAS to BLAS
c2a38bc [Xiangrui Meng] enhance dot tests
4cfaac4 [Xiangrui Meng] add apache header
48d01d2 [Xiangrui Meng] add tests for zeros and copy
3b882b1 [Xiangrui Meng] use blas.scal in gradient
735eb23 [Xiangrui Meng] remove d from BLAS routines
d2d7d3c [Xiangrui Meng] update gradient and lbfgs
7f78186 [Xiangrui Meng] add zeros to Vectors; add dscal and dcopy to BLAS
14e6645 [Xiangrui Meng] add ddot
cbb8273 [Xiangrui Meng] add daxpy test
07db0bb [Xiangrui Meng] Merge branch 'master' into ml-blas
e8c326d [Xiangrui Meng] axpy

Aug 11, 2014

[SQL] [SPARK-2826] Reduce the memory copy while building the hashmap for HashOuterJoin · 5d54d71d

Cheng Hao authored 10 years ago

This is a follow-up to #1147. This PR improves performance by about 10% to 15% in my local tests.
```
Before:
LeftOuterJoin: took 16750 ms ([3000000] records)
LeftOuterJoin: took 15179 ms ([3000000] records)
RightOuterJoin: took 15515 ms ([3000000] records)
RightOuterJoin: took 15276 ms ([3000000] records)
FullOuterJoin: took 19150 ms ([6000000] records)
FullOuterJoin: took 18935 ms ([6000000] records)

After:
LeftOuterJoin: took 15218 ms ([3000000] records)
LeftOuterJoin: took 13503 ms ([3000000] records)
RightOuterJoin: took 13663 ms ([3000000] records)
RightOuterJoin: took 14025 ms ([3000000] records)
FullOuterJoin: took 16624 ms ([6000000] records)
FullOuterJoin: took 16578 ms ([6000000] records)
```

Besides the performance improvement, I also did some cleanup as suggested in #1147.
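The 10% to 15% figure can be checked against the timings listed above. The sketch below copies those numbers; averaging the two runs per join type is a choice made here and is not something the commit message states.

```python
# Timings in ms, copied from the benchmark output (two runs per join type).
BEFORE = {
    "LeftOuterJoin": [16750, 15179],
    "RightOuterJoin": [15515, 15276],
    "FullOuterJoin": [19150, 18935],
}
AFTER = {
    "LeftOuterJoin": [15218, 13503],
    "RightOuterJoin": [13663, 14025],
    "FullOuterJoin": [16624, 16578],
}


def mean(values):
    return sum(values) / float(len(values))


def improvement_percent(before, after):
    """Relative reduction in mean run time, as a percentage."""
    return 100.0 * (mean(before) - mean(after)) / mean(before)


improvements = dict(
    (name, improvement_percent(BEFORE[name], AFTER[name])) for name in BEFORE
)
```

With these inputs, each join type lands between 10% and 15%.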

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits:

ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap

[SPARK-2650][SQL] Build column buffers in smaller batches · bad21ed0

Michael Armbrust authored 10 years ago

Author: Michael Armbrust <michael@databricks.com>

Closes #1880 from marmbrus/columnBatches and squashes the following commits:

0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches

[SPARK-2968][SQL] Fix nullabilities of Explode. · c686b7dd

Takuya UESHIN authored 10 years ago

Output nullabilities of `Explode` can be determined by `ArrayType.containsNull` or `MapType.valueContainsNull`.
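The rule can be sketched with two toy type descriptors. These namedtuples are hypothetical stand-ins for the Catalyst types, and treating map keys as non-nullable is an assumption of the sketch.

```python
from collections import namedtuple

# Toy stand-ins for the Catalyst data types (hypothetical, not the real API).
ArrayType = namedtuple("ArrayType", ["element_type", "contains_null"])
MapType = namedtuple("MapType", ["key_type", "value_type", "value_contains_null"])


def explode_output_nullability(data_type):
    """Return the nullability of each output column produced by an explode."""
    if isinstance(data_type, ArrayType):
        # One output column: the array element.
        return [data_type.contains_null]
    if isinstance(data_type, MapType):
        # Two output columns: key (assumed non-nullable) and value.
        return [False, data_type.value_contains_null]
    raise TypeError("explode expects an array or a map type")


array_result = explode_output_nullability(ArrayType("int", True))
map_result = explode_output_nullability(MapType("string", "int", False))
```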

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1888 from ueshin/issues/SPARK-2968 and squashes the following commits:

d128c95 [Takuya UESHIN] Fix nullability of Explode.

[SPARK-2965][SQL] Fix HashOuterJoin output nullabilities. · c9c89c31

Takuya UESHIN authored 10 years ago

Output attributes of the opposite side of an `OuterJoin` should be nullable.
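A minimal sketch of that rule, assuming attributes are modeled as `(name, nullable)` pairs and join types as plain strings (both are illustrative choices, not the Catalyst API):

```python
def as_nullable(attributes):
    """Mark every attribute in the list as nullable."""
    return [(name, True) for name, _ in attributes]


def outer_join_output(join_type, left, right):
    """Compute output attributes; the non-preserved side becomes nullable."""
    if join_type == "left_outer":
        return left + as_nullable(right)
    if join_type == "right_outer":
        return as_nullable(left) + right
    if join_type == "full_outer":
        return as_nullable(left) + as_nullable(right)
    raise ValueError("not an outer join: %s" % join_type)


left = [("id", False)]
right = [("value", False)]
output = outer_join_output("left_outer", left, right)
```

In a left outer join, unmatched left rows are padded with nulls on the right, which is why the right-side attributes must be reported as nullable even if the source column is not.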

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1887 from ueshin/issues/SPARK-2965 and squashes the following commits:

bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.

[SQL] A tiny refactoring in HiveContext#analyze · 647aeba3

Yin Huai authored 10 years ago

I should use `EliminateAnalysisOperators` in `analyze` instead of manually pattern matching.

Author: Yin Huai <huaiyin.thu@gmail.com>

Closes #1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits:

f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.

[sql]use SparkSQLEnv.stop() in ShutdownHook · e83fdcd4

wangfei authored 10 years ago

Author: wangfei <wangfei1@huawei.com>

Closes #1852 from scwf/patch-3 and squashes the following commits:

ae28c29 [wangfei] use SparkSQLEnv.stop() in ShutdownHook

[SPARK-2590][SQL] Added option to handle incremental collection, disabled by default · 21a95ef0

Cheng Lian authored 10 years ago

JIRA issue: [SPARK-2590](https://issues.apache.org/jira/browse/SPARK-2590)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1853 from liancheng/inc-collect-option and squashes the following commits:

cb3ea45 [Cheng Lian] Moved incremental collection option to Thrift server
43ce3aa [Cheng Lian] Changed incremental collect option name
623abde [Cheng Lian] Added option to handle incremental collection, disabled by default

[SPARK-2844][SQL] Correctly set JVM HiveContext if it is passed into Python HiveContext constructor · 490ecfa2

Ahir Reddy authored 10 years ago

https://issues.apache.org/jira/browse/SPARK-2844

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #1768 from ahirreddy/python-hive-context-fix and squashes the following commits:

7972d3b [Ahir Reddy] Correctly set JVM HiveContext if it is passed into Python HiveContext constructor

[SPARK-2934][MLlib] Adding LogisticRegressionWithLBFGS Interface · 6fab941b

DB Tsai authored 10 years ago

For training with the LBFGS optimizer, which will converge faster than SGD.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1862 from dbtsai/dbtsai-lbfgs-lor and squashes the following commits:

aa84b81 [DB Tsai] small change
f852bcd [DB Tsai] Remove duplicate method
f119fdc [DB Tsai] Formatting
97776aa [DB Tsai] address more feedback
85b4a91 [DB Tsai] address feedback
3cf50c2 [DB Tsai] LogisticRegressionWithLBFGS interface

[SPARK-2515][mllib] Chi Squared test · 32638b5e

Doris Xin authored 10 years ago

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1733 from dorx/chisquare and squashes the following commits:

cafb3a7 [Doris Xin] fixed p-value for extreme case.
d286783 [Doris Xin] Merge branch 'master' into chisquare
e95e485 [Doris Xin] reviewer comments.
7dde711 [Doris Xin] ChiSqTestResult renaming and changed to Class
80d03e2 [Doris Xin] Reviewer comments.
c39eeb5 [Doris Xin] units passed with updated API
e90d90a [Doris Xin] Merge branch 'master' into chisquare
7eea80b [Doris Xin] WIP
d64c2fb [Doris Xin] Merge branch 'master' into chisquare
5686082 [Doris Xin] facelift
bc7eb2e [Doris Xin] unit passed; still need docs and some refactoring
50703a5 [Doris Xin] merge master
4e4e361 [Doris Xin] WIP
e6b83f3 [Doris Xin] reviewer comments
3d61582 [Doris Xin] input names
706d436 [Doris Xin] Added API for RDD[Vector]
6598379 [Doris Xin] API and code structure.
ff17423 [Doris Xin] WIP

[SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels · 7712e724

Josh Rosen authored 10 years ago

This addresses SPARK-2931, a bug where getAllowedLocalityLevel() could throw ArrayIndexOutOfBoundsException. The fix here is to reset currentLocalityIndex after recomputing the locality levels.
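The failure mode and the fix can be sketched with a toy model. `ToyTaskSet`, the level names, and the remapping rule are assumptions for illustration; this is not the TaskSetManager code. The point is that an index into the old list of locality levels can fall outside the new list unless it is reset when the list is recomputed.

```python
# Toy ordering from most to least strict locality (illustrative names).
LEVEL_ORDER = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


class ToyTaskSet(object):
    def __init__(self, levels):
        self.levels = list(levels)
        self.current_index = 0

    def recompute(self, new_levels):
        """Replace the level list and reset the index to a valid position."""
        previous = self.levels[self.current_index]
        self.levels = list(new_levels)
        rank = LEVEL_ORDER.index(previous)
        # Pick the first new level that is at least as relaxed as the old
        # one. This sketch assumes "ANY" is always present in new_levels.
        self.current_index = next(
            i for i, level in enumerate(self.levels)
            if LEVEL_ORDER.index(level) >= rank
        )

    def allowed_level(self):
        return self.levels[self.current_index]


task_set = ToyTaskSet(LEVEL_ORDER)
task_set.current_index = 3  # currently at "ANY"
task_set.recompute(["NODE_LOCAL", "ANY"])  # stale index 3 would be out of range
```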

Thanks to kayousterhout, mridulm, and lirui-intel for helping me to debug this.

Author: Josh Rosen <joshrosen@apache.org>

Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits:

48b60b5 [Josh Rosen] Move FakeRackUtil.cleanUp() info beforeEach().
6fec474 [Josh Rosen] Set currentLocalityIndex after recomputing locality levels.
9384897 [Josh Rosen] Update SPARK-2931 test to reflect changes in 63bdb1f4.
9ecd455 [Josh Rosen] Apply @mridulm's patch for reproducing SPARK-2931.

[SPARK-2952] Enable logging actor messages at DEBUG level · 37338666

Reynold Xin authored 10 years ago

Example messages:
```
14/08/09 21:37:01 DEBUG BlockManagerMasterActor: [actor] received message RegisterBlockManager(BlockManagerId(0, rxin-mbp, 58092, 0),278302556,Actor[akka.tcp://spark@rxin-mbp:58088/user/BlockManagerActor1#-63596539]) from Actor[akka.tcp://spark@rxin-mbp:58088/temp/$c]

14/08/09 21:37:01 DEBUG BlockManagerMasterActor: [actor] handled message (0.279 ms) RegisterBlockManager(BlockManagerId(0, rxin-mbp, 58092, 0),278302556,Actor[akka.tcp://spark@rxin-mbp:58088/user/BlockManagerActor1#-63596539]) from Actor[akka.tcp://spark@rxin-mbp:58088/temp/$c]
```

cc @mengxr @tdas @pwendell

Author: Reynold Xin <rxin@apache.org>

Closes #1870 from rxin/actorLogging and squashes the following commits:

c531ee5 [Reynold Xin] Added license header for ActorLogReceive.
f6b1ebe [Reynold Xin] [SPARK-2952] Enable logging actor messages at DEBUG level

[PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes · db06a81f

Josh Rosen authored 10 years ago

- Modify python/run-tests to test with Python 2.6
- Use unittest2 when running on Python 2.6.
- Fix issue with namedtuple.
- Skip TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed.
- Fix MLlib _deserialize_double on Python 2.6.
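The unittest2 item above usually takes the form of a version-dependent import. The sketch below shows that pattern; the `unittest_module_name` helper is hypothetical and the exact form used in the patch is not shown in this log.

```python
import sys


def unittest_module_name(version_info=sys.version_info):
    """Pick the unittest2 backport on Python 2.6 and older."""
    if tuple(version_info[:2]) <= (2, 6):
        return "unittest2"
    return "unittest"


# Import whichever module applies to the running interpreter.
unittest = __import__(unittest_module_name())
```

Test code can then refer to `unittest.TestCase` without caring which module was loaded.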

Closes #1868.  Closes #1042.

Author: Josh Rosen <joshrosen@apache.org>

Closes #1874 from JoshRosen/python2.6 and squashes the following commits:

983d259 [Josh Rosen] [SPARK-2954] Fix MLlib _deserialize_double on Python 2.6.
5d18fd7 [Josh Rosen] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 fixes

Aug 10, 2014

[SPARK-2936] Migrate Netty network module from Java to Scala · ba28a8fc

Reynold Xin authored 10 years ago

The Netty network module was originally written when Scala 2.9.x had a bug that prevented a pure Scala implementation, so a subset of the files was written in Java. We have since upgraded to Scala 2.10 and can now migrate all Java files to Scala.

https://github.com/netty/netty/issues/781

https://github.com/mesos/spark/pull/522

Author: Reynold Xin <rxin@apache.org>

Closes #1865 from rxin/netty and squashes the following commits:

332422f [Reynold Xin] Code review feedback
ca9eeee [Reynold Xin] Minor update.
7f1434b [Reynold Xin] [SPARK-2936] Migrate Netty network module from Java to Scala

[SPARK-2937] Separate out sampleByKeyExact as its own API in PairRDDFunctions · b715aa0c

Doris Xin authored 10 years ago

This enables consistency with Python and an `Experimental` label for the `sampleByKeyExact` API.

Author: Doris Xin <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1866 from dorx/stratified and squashes the following commits:

0ad97b2 [Doris Xin] reviewer comments.
2948aae [Doris Xin] remove unrelated changes
e990325 [Doris Xin] Merge branch 'master' into stratified
555a3f9 [Doris Xin] separate out sampleByKeyExact as its own API
616e55c [Doris Xin] merge master
245439e [Doris Xin] moved minSamplingRate to getUpperBound
eaf5771 [Doris Xin] bug fixes.
17a381b [Doris Xin] fixed a merge issue and a failed unit
ea7d27f [Doris Xin] merge master
b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java
b3013a4 [Xiangrui Meng] move math3 back to test scope
eecee5f [Doris Xin] Merge branch 'master' into stratified
f4c21f3 [Doris Xin] Reviewer comments
a10e68d [Doris Xin] style fix
a2bf756 [Doris Xin] Merge branch 'master' into stratified
680b677 [Doris Xin] use mapPartitionWithIndex instead
9884a9f [Doris Xin] style fix
bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
ee9d260 [Doris Xin] addressed reviewer comments
6b5b10b [Doris Xin] Merge branch 'master' into stratified
254e03c [Doris Xin] minor fixes and Java API.
4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
bd9dc6e [Doris Xin] unit bug and style violation fixed
1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
0214a76 [Doris Xin] cleanUp
90d94c0 [Doris Xin] merge master
9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
7327611 [Doris Xin] merge master
50581fc [Doris Xin] added a TODO for logging in python
46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
7e1a481 [Doris Xin] changed the permission on SamplingUtil
1d413ce [Doris Xin] fixed checkstyle issues
9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS

[SPARK-2898] [PySpark] fix bugs in daemon.py · 28dcbb53

Davies Liu authored 10 years ago

1. do not use a signal handler for SIGCHLD; it can easily cause deadlock
2. handle EINTR during accept()
3. pass errno into JVM
4. handle EAGAIN during fork()
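Items 2 and 4 both amount to retrying a system call when it fails with a transient errno. A minimal sketch of that pattern follows; `retry_on` is a hypothetical helper and not the code in daemon.py. Python 3.5 and later already retry `accept()` on EINTR automatically, so the explicit loop mainly reflects the Python 2 situation.

```python
import errno


def retry_on(errnos, func, max_attempts=5):
    """Call func(), retrying when it raises OSError with a listed errno."""
    for attempt in range(max_attempts):
        try:
            return func()
        except OSError as e:
            # Re-raise unrelated errors, and give up after the last attempt.
            if e.errno not in errnos or attempt == max_attempts - 1:
                raise


# Simulated accept() that is interrupted twice before succeeding.
calls = []


def flaky_accept():
    calls.append(1)
    if len(calls) < 3:
        raise OSError(errno.EINTR, "interrupted system call")
    return "connection"


result = retry_on({errno.EINTR}, flaky_accept)
```

The same helper could wrap a fork call with `errno.EAGAIN` in the retry set, ideally with a short sleep between attempts.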

Now it can pass the 50k-task tests in 180 seconds.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1842 from davies/qa and squashes the following commits:

f0ea451 [Davies Liu] fix lint
03a2e8c [Davies Liu] cleanup dead children every seconds
32cb829 [Davies Liu] fix lint
0cd0817 [Davies Liu] fix bugs in deamon.py

[SPARK-2950] Add gc time and shuffle write time to JobLogger · 1d03a26a

Shivaram Venkataraman authored 10 years ago

The JobLogger is very useful for performing offline performance profiling of Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but are currently missing from the JobLogger output. This patch adds these two fields.

~~Since this is a small change, I didn't create a JIRA. Let me know if I should do that.~~

cc kayousterhout

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #1869 from shivaram/job-logger and squashes the following commits:

1b709fc [Shivaram Venkataraman] Add a space before GC_TIME
c418105 [Shivaram Venkataraman] Add gc time and shuffle write time to JobLogger

Remove extra semicolon in Task.scala · 3570119c

GuoQiang Li authored 10 years ago

Author: GuoQiang Li <witgo@qq.com>

Closes #1876 from witgo/remove_semicolon_in_Task_scala and squashes the following commits:

c6ea732 [GuoQiang Li] Remove extra semicolon in Task.scala

Turn UpdateBlockInfo into case class. · 482c5afb

Reynold Xin authored 10 years ago

This helps us log UpdateBlockInfo properly once #1870 is merged.

Author: Reynold Xin <rxin@apache.org>

Closes #1872 from rxin/UpdateBlockInfo and squashes the following commits:

0cee1c2 [Reynold Xin] Turn UpdateBlockInfo into case class.
