Commits · 557a797a273f1668065806cba53e19e6134a66d3 · cs525-sp18-g07 / spark

Apr 15, 2015

[SPARK-6937][MLLIB] Fixed bug in PICExample in which the radius were not being accepted on c... · 557a797a

sboeschhuawei authored 9 years ago

Fixes a tiny bug in PowerIterationClusteringExample in which the radius was not accepted from the command line.

Author: sboeschhuawei <stephen.boesch@huawei.com>

Closes #5531 from javadba/picsub and squashes the following commits:

2aab8cf [sboeschhuawei] Fixed bug in PICExample in which the radius were not being accepted on command line

557a797a

[SPARK-6844][SQL] Clean up accumulators used in InMemoryRelation when it is uncached · cf38fe04

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-6844

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5475 from viirya/cache_memory_leak and squashes the following commits:

0b41235 [Liang-Chi Hsieh] fix style.
dc1d5d5 [Liang-Chi Hsieh] For comments.
78af229 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cache_memory_leak
26c9bb6 [Liang-Chi Hsieh] Add configuration to enable in-memory table scan accumulators.
1c3b06e [Liang-Chi Hsieh] Clean up accumulators used in InMemoryRelation when it is uncached.

cf38fe04

[SPARK-6638] [SQL] Improve performance of StringType in SQL · 85842760

Davies Liu authored 9 years ago

This PR changes the internal representation of StringType from java.lang.String to UTF8String, which is implemented using Array[Byte].

This PR should not break any public API; Row.getString() will still return java.lang.String.

This is the first step toward improving the performance of strings in SQL.
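A minimal sketch of the idea, assuming nothing about the real class: the value is held as UTF-8 bytes, comparisons work directly on those bytes, and a java.lang.String is produced only on request. `Utf8Bytes` and its methods are hypothetical names for illustration; this is not Spark's UTF8String.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

// Hypothetical stand-in for illustration only; not Spark's actual UTF8String.
final class Utf8Bytes private (private val bytes: Array[Byte]) {
  // Size in bytes, which can differ from the number of characters.
  def numBytes: Int = bytes.length

  // Decode only when a java.lang.String is actually needed.
  override def toString: String = new String(bytes, UTF_8)

  // Equality and hashing work on the raw bytes, with no decoding step.
  override def equals(other: Any): Boolean = other match {
    case that: Utf8Bytes => Arrays.equals(bytes, that.bytes)
    case _               => false
  }

  override def hashCode: Int = Arrays.hashCode(bytes)
}

object Utf8Bytes {
  // Encode once at the boundary where a String enters the system.
  def fromString(s: String): Utf8Bytes = new Utf8Bytes(s.getBytes(UTF_8))
}
```

In this sketch a public accessor would call `toString` at the API boundary, which mirrors the statement above that callers still receive java.lang.String.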

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #5350 from davies/string and squashes the following commits:

3b7bfa8 [Davies Liu] fix schema of AddJar
2772f0d [Davies Liu] fix new test failure
6d776a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
59025c8 [Davies Liu] address comments from @marmbrus
341ec2c [Davies Liu] turn off scala style check in UTF8StringSuite
744788f [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
b04a19c [Davies Liu] add comment for getString/setString
08d897b [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
5116b43 [Davies Liu] rollback unrelated changes
1314a37 [Davies Liu] address comments from Yin
867bf50 [Davies Liu] fix String filter push down
13d9d42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
2089d24 [Davies Liu] add hashcode check back
ac18ae6 [Davies Liu] address comment
fd11364 [Davies Liu] optimize UTF8String
8d17f21 [Davies Liu] fix hive compatibility tests
e5fa5b8 [Davies Liu] remove clone in UTF8String
28f3d81 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
28d6f32 [Davies Liu] refactor
537631c [Davies Liu] some comment about Date
9f4c194 [Davies Liu] convert data type for data source
956b0a4 [Davies Liu] fix hive tests
73e4363 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
9dc32d1 [Davies Liu] fix some hive tests
23a766c [Davies Liu] refactor
8b45864 [Davies Liu] fix codegen with UTF8String
bb52e44 [Davies Liu] fix scala style
c7dd4d2 [Davies Liu] fix some catalyst tests
38c303e [Davies Liu] fix python sql tests
5f9e120 [Davies Liu] fix sql tests
6b499ac [Davies Liu] fix style
a85fb27 [Davies Liu] refactor
d32abd1 [Davies Liu] fix utf8 for python api
4699c3a [Davies Liu] use Array[Byte] in UTF8String
21f67c6 [Davies Liu] cleanup
685fd07 [Davies Liu] use UTF8String instead of String for StringType

85842760

[SPARK-6887][SQL] ColumnBuilder misses FloatType · 785f9558

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-6887

Author: Yin Huai <yhuai@databricks.com>

Closes #5499 from yhuai/inMemFloat and squashes the following commits:

84cba38 [Yin Huai] Add test.
4b75ba6 [Yin Huai] Add FloatType back.

785f9558

[SPARK-6800][SQL] Update doc for JDBCRelation's columnPartition · e3e4e9a3

Liang-Chi Hsieh authored 9 years ago

JIRA https://issues.apache.org/jira/browse/SPARK-6800

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5488 from viirya/fix_jdbc_where and squashes the following commits:

51386c8 [Liang-Chi Hsieh] Update code comment.
1dcc929 [Liang-Chi Hsieh] Update document.
3eb74d6 [Liang-Chi Hsieh] Revert and modify doc.
df11783 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_jdbc_where
3e7db15 [Liang-Chi Hsieh] Fix wrong logic to generate WHERE clause for JDBC.

e3e4e9a3

[SPARK-6730][SQL] Allow using keyword as identifier in OPTIONS · b75b3070

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-6730

It is quite likely that a keyword will be used as an identifier in `OPTIONS`; this PR makes that work.

Alternatively, we could require that `OPTIONS` not include keywords and that an alternative identifier be used when needed (e.g. table -> cassandraTable).

If that approach is preferred, please let me know and I will close this PR. Thanks.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5520 from viirya/relax_options and squashes the following commits:

339fd68 [Liang-Chi Hsieh] Use regex parser.
92be11c [Liang-Chi Hsieh] Allow using keyword as identifier in OPTIONS.

b75b3070

[SPARK-6886] [PySpark] fix big closure with shuffle · f11288d5

Davies Liu authored 9 years ago

Currently, the created broadcast object has the same life cycle as the RDD in Python. For multistage jobs, a PythonRDD is created in the JVM and the RDD in Python may be GCed, in which case the broadcast is destroyed in the JVM before the PythonRDD.

This PR changes the code to use PythonRDD to track the life cycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of a PythonRDD, which could be heavy.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #5496 from davies/big_closure and squashes the following commits:

9a0ea4c [Davies Liu] fix big closure with shuffle

f11288d5

SPARK-6861 [BUILD] Scalastyle config prevents building Maven child modules alone · 6c5ed8a6

Sean Owen authored 9 years ago

Move scalastyle-config.xml to dev/ (SBT config still doesn't work) to fix running mvn targets from subdirs; make scalastyle a verify stage target again in Maven; output results in target not project root; update to scalastyle 0.7.0

Author: Sean Owen <sowen@cloudera.com>

Closes #5471 from srowen/SPARK-6861 and squashes the following commits:

acac637 [Sean Owen] Oops, add back execution but leave it at the default verify phase
35a4fd2 [Sean Owen] Revert change to scalastyle-config.xml location, but return scalastyle Maven check to verify phase instead of package to get it farther out of the way, since the Maven invocation is optional
c4fb42c [Sean Owen] Move scalastyle-config.xml to dev/ (SBT config still doesn't work) to fix running mvn targets from subdirs; make scalastyle a verify stage target again in Maven; output results in target not project root; update to scalastyle 0.7.0

6c5ed8a6

[HOTFIX] [SPARK-6896] [SQL] fix compile error in hive-thriftserver · 29aabdd6

Daoyuan Wang authored 9 years ago

SPARK-6440 (#5424) imported guava but did not promote the guava dependency to compile level.

[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[info] Compiling 8 Scala sources to /root/projects/spark/sql/hive-thriftserver/target/scala-2.10/classes...
[error] bad symbolic reference. A signature in Utils.class refers to term util
[error] in package com.google.common which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling Utils.class.
[error]
[error] while compiling: /root/projects/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala
[error] during phase: erasure
[error] library version: version 2.10.4
[error] compiler version: version 2.10.4
[error] reconstructed args: -deprecation -classpath

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #5507 from adrian-wang/guava and squashes the following commits:

c337dad [Daoyuan Wang] fix compile error

29aabdd6

[SPARK-6871][SQL] WITH clause in CTE can not following another WITH clause · 6be91894

Liang-Chi Hsieh authored 9 years ago

JIRA https://issues.apache.org/jira/browse/SPARK-6871

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5480 from viirya/no_cte_after_cte and squashes the following commits:

4da3712 [Liang-Chi Hsieh] Create new test.
40b38ed [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into no_cte_after_cte
0edf568 [Liang-Chi Hsieh] for comments.
6591b79 [Liang-Chi Hsieh] WITH clause in CTE can not following another WITH clause.

6be91894

Apr 14, 2015

[SPARK-5634] [core] Show correct message in HS when no incomplete apps f... · 30a6e0dc

Marcelo Vanzin authored 9 years ago

...ound.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5515 from vanzin/SPARK-5634 and squashes the following commits:

f74ecf1 [Marcelo Vanzin] [SPARK-5634] [core] Show correct message in HS when no incomplete apps found.

30a6e0dc

[SPARK-6890] [core] Fix launcher lib work with SPARK_PREPEND_CLASSES. · 97173893

Marcelo Vanzin authored 9 years ago

The fix for SPARK-6406 broke the case where sub-processes are launched
when SPARK_PREPEND_CLASSES is set, because the code now would only add
the launcher's build directory to the sub-process's classpath instead
of the complete assembly.

This patch fixes the problem by having the launch scripts stash the
assembly's location in an environment variable. This is not the prettiest
solution, but it avoids having to plumb that location all the way through
the Worker code that launches executors. The env variable is always
set by the launch scripts, so users cannot override it.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5504 from vanzin/SPARK-6890 and squashes the following commits:

7aec921 [Marcelo Vanzin] Fix tests.
ff87a60 [Marcelo Vanzin] Merge branch 'master' into SPARK-6890
31d3ce8 [Marcelo Vanzin] [SPARK-6890] [core] Fix launcher lib work with SPARK_PREPEND_CLASSES.

97173893

[SPARK-6796][Streaming][WebUI] Add "Active Batches" and "Completed Batches" lists to StreamingPage · 6de282e2

zsxwing authored 9 years ago

This PR adds two lists, `Active Batches` and `Completed Batches`. Here is the screenshot:

![batch_list](https://cloud.githubusercontent.com/assets/1000778/7060458/d8898572-deb3-11e4-938b-6f8602c71a9f.png)

Due to [SPARK-6766](https://issues.apache.org/jira/browse/SPARK-6766), I needed to merge #5414 on my local machine to get the above screenshot.

Author: zsxwing <zsxwing@gmail.com>

Closes #5434 from zsxwing/SPARK-6796 and squashes the following commits:

be50fc6 [zsxwing] Fix the code style
51b792e [zsxwing] Fix the unit test
6f3078e [zsxwing] Make 'startTime' readable
f40e0a9 [zsxwing] Merge branch 'master' into SPARK-6796
2525336 [zsxwing] Rename 'Processed batches' and 'Waiting batches' and also add links
a69c091 [zsxwing] Show the number of total completed batches too
a12ad7b [zsxwing] Change 'records' to 'events' in the UI
86b5e7f [zsxwing] Make BatchTableBase abstract
b248787 [zsxwing] Add tests to verify the new tables
d18ab7d [zsxwing] Fix the code style
6ceffb3 [zsxwing] Add "Active Batches" and "Completed Batches" lists to StreamingPage

6de282e2

Revert "[SPARK-6352] [SQL] Add DirectParquetOutputCommitter" · a76b921a

Josh Rosen authored 9 years ago

This reverts commit b29663ee.

I'm reverting this because it broke test compilation for the Hadoop 1.x
profiles.

a76b921a

[SPARK-6769][YARN][TEST] Usage of the ListenerBus in YarnClusterSuite is wrong · 4d4b2492

Kousuke Saruta authored 9 years ago

In YarnClusterSuite, a test case uses `SaveExecutorInfo` to handle ExecutorAddedEvent as follows.

```
private class SaveExecutorInfo extends SparkListener {
  val addedExecutorInfos = mutable.Map[String, ExecutorInfo]()

  override def onExecutorAdded(executor: SparkListenerExecutorAdded) {
    addedExecutorInfos(executor.executorId) = executor.executorInfo
  }
}

...

    listener = new SaveExecutorInfo
    val sc = new SparkContext(new SparkConf()
      .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and $dollarSigns"))
    sc.addSparkListener(listener)
    val status = new File(args(0))
    var result = "failure"
    try {
      val data = sc.parallelize(1 to 4, 4).collect().toSet
      assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS))
      data should be (Set(1, 2, 3, 4))
      result = "success"
    } finally {
      sc.stop()
      Files.write(result, status, UTF_8)
    }
```

However, this usage is wrong: executors are spawned while SparkContext is being initialized, and SparkContext#addSparkListener can only be invoked after initialization, that is, after the executors have spawned. As a result, SaveExecutorInfo never handles ExecutorAddedEvent.

The following code relies on the result of handling ExecutorAddedEvent. For the reason above, the assertion is never reached.

```
    // verify log urls are present
    listener.addedExecutorInfos.values.foreach { info =>
      assert(info.logUrlMap.nonEmpty)
    }
```
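The ordering problem described above can be reproduced with a toy event bus. This is only a sketch: `TinyBus` and `LateListenerDemo` are hypothetical names, and the bus is a deliberately simplified stand-in that assumes events are delivered only to listeners registered at the time they are posted.

```scala
import scala.collection.mutable

// Hypothetical minimal event bus; the names are illustrative, not Spark's API.
final class TinyBus {
  private val listeners = mutable.ArrayBuffer.empty[String => Unit]

  def addListener(l: String => Unit): Unit = listeners += l

  // Events reach only the listeners registered when they are posted; nothing is replayed.
  def post(event: String): Unit = listeners.foreach(l => l(event))
}

object LateListenerDemo {
  // Returns the events seen by a listener added before the event and by one added after it.
  def run(): (List[String], List[String]) = {
    val early = mutable.ArrayBuffer.empty[String]
    val late = mutable.ArrayBuffer.empty[String]
    val bus = new TinyBus

    bus.addListener(e => early += e)
    bus.post("ExecutorAdded(1)")    // fired during "initialization"
    bus.addListener(e => late += e) // registered too late to see the event

    (early.toList, late.toList)
  }
}
```

Because the late listener's collection stays empty, a `foreach` over it never runs its body, which is the same reason the assertion in the test above is never reached.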

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5417 from sarutak/SPARK-6769 and squashes the following commits:

8adc8ba [Kousuke Saruta] Fixed compile error
e258530 [Kousuke Saruta] Fixed style
591cf3e [Kousuke Saruta] Fixed style
48ec89a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6769
860c965 [Kousuke Saruta] Simplified code
207d325 [Kousuke Saruta] Added findListenersByClass method to ListenerBus
2408c84 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6769
2d7e409 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6769
3874adf [Kousuke Saruta] Fixed the usage of listener bus in LogUrlsStandaloneSuite
153a91b [Kousuke Saruta] Fixed the usage of listener bus in YarnClusterSuite

4d4b2492

[SPARK-5808] [build] Package pyspark files in sbt assembly. · 65774370

Marcelo Vanzin authored 9 years ago

This turned out to be more complicated than I wanted because the
layout of python/ doesn't really follow the usual maven conventions.
So some extra code is needed to copy just the right things.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5461 from vanzin/SPARK-5808 and squashes the following commits:

7153dac [Marcelo Vanzin] Only try to create resource dir if it doesn't already exist.
ee90e84 [Marcelo Vanzin] [SPARK-5808] [build] Package pyspark files in sbt assembly.

65774370

[SPARK-6905] Upgrade to snappy-java 1.1.1.7 · 6adb8bcb

Josh Rosen authored 9 years ago

We should upgrade our snappy-java dependency to 1.1.1.7 in order to include a fix for a bug that results in worse compression in SnappyOutputStream (see https://github.com/xerial/snappy-java/issues/100).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits:

f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7.

6adb8bcb

[SPARK-6700] [yarn] Re-enable flaky test. · b075e4b7

Marcelo Vanzin authored 9 years ago

Test runs have been successful on jenkins. So let's re-enable the test and look out for any failures, and fix things appropriately.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5459 from vanzin/SPARK-6700 and squashes the following commits:

2ead85b [Marcelo Vanzin] WIP: re-enable flaky test to catch failure in jenkins.

b075e4b7

SPARK-1706: Allow multiple executors per worker in Standalone mode · 8f8dc45f

CodingCat authored 9 years ago

Resubmission of https://github.com/apache/spark/pull/636 with a totally different algorithm

https://issues.apache.org/jira/browse/SPARK-1706

In the current implementation, the user has to start multiple workers on a server in order to run multiple executors on that server, which introduces additional overhead from the extra JVM processes...

In this patch, I changed the scheduling logic in the master to enable the user to start multiple executor processes within the same JVM process.

1. The user configures spark.executor.maxCoreNumPerExecutor to suggest the maximum number of cores to allocate to each executor.

2. The Master assigns executors to workers mainly based on memoryPerExecutor and worker.freeMemory, and tries to allocate as many cores as possible to each executor: ```min(min(memoryPerExecutor, worker.freeCore), maxLeftCoreToAssign)```, where ```maxLeftCoreToAssign = maxExecutorCanAssign * maxCoreNumPerExecutor```

---------------------------------------

Other small changes include

Change memoryPerSlave in ApplicationDescription to memoryPerExecutor, as "Slave" is overloaded to represent both worker and executor in the documentation... (we had some discussion on this before?)

Author: CodingCat <zhunansjtu@gmail.com>

Closes #731 from CodingCat/SPARK-1706-2 and squashes the following commits:

6dee808 [CodingCat] change filter predicate
fbeb7e5 [CodingCat] address the comments
940cb42 [CodingCat] avoid unnecessary allocation
b8ca561 [CodingCat] revert a change
45967b4 [CodingCat] remove unused method
2eeff77 [CodingCat] stylistic fixes
12a1b32 [CodingCat] change the semantic of coresPerExecutor to exact core number
f035423 [CodingCat] stylistic fix
d9c1685 [CodingCat] remove unused var
f595bd6 [CodingCat] recover some unintentional changes
63b3df9 [CodingCat] change the description of the parameter in the submit script
4cf61f1 [CodingCat] improve the code and docs
ff011e2 [CodingCat] start multiple executors on the worker by rewriting startExeuctor logic
2c2bcc5 [CodingCat] fix wrong usage info
497ec2c [CodingCat] address andrew's comments
878402c [CodingCat] change the launching executor code
f64a28d [CodingCat] typo fix
387f4ec [CodingCat] bug fix
35c462c [CodingCat] address Andrew's comments
0b64fea [CodingCat] fix compilation issue
19d3da7 [CodingCat] address the comments
5b81466 [CodingCat] remove outdated comments
ec7d421 [CodingCat] test commit
e5efabb [CodingCat] more java docs and consolidate canUse function
a26096d [CodingCat] stylistic fix
a5d629a [CodingCat] java doc
b34ec0c [CodingCat] make master support multiple executors per worker

8f8dc45f

[SPARK-2033] Automatically cleanup checkpoint · 25998e4d

GuoQiang Li authored 9 years ago

Author: GuoQiang Li <witgo@qq.com>

Closes #855 from witgo/cleanup_checkpoint_date and squashes the following commits:

1649850 [GuoQiang Li] review commit
c0087e0 [GuoQiang Li] Automatically cleanup checkpoint

25998e4d

[CORE] SPARK-6880: Fixed null check when all the dependent stages are... · dcf8a9f3

pankaj arora authored 9 years ago

[CORE] SPARK-6880: Fixed null check when all the dependent stages are cancelled due to previous stage failure

Fixed null check when all the dependent stages are cancelled due to previous stage failure. This happens when one of the executor nodes goes down and all the dependent stages are cancelled.

Author: pankaj arora <pankaj.arora@guavus.com>

Closes #5494 from pankajarora12/NEWBRANCH and squashes the following commits:

55ba5e3 [pankaj arora] [CORE] SPARK-6880: Fixed null check when all the dependent stages are cancelled due to previous stage failure
4575720 [pankaj arora] [CORE] SPARK-6880: Fixed null check when all the dependent stages are cancelled due to previous stage failure

dcf8a9f3

[SPARK-6894]spark.executor.extraLibraryOptions => spark.executor.extraLibraryPath · f63b44a5

WangTaoTheTonic authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-6894

cc vanzin

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5506 from WangTaoTheTonic/SPARK-6894 and squashes the following commits:

4b7ced7 [WangTaoTheTonic] spark.executor.extraLibraryOptions => spark.executor.extraLibraryPath

f63b44a5

[SPARK-6081] Support fetching http/https uris in driver runner. · 320bca45

Timothy Chen authored 9 years ago

Currently, when passed URIs such as http/https, the driver runner is not able to fetch them because it only calls HadoopFs get.
This fix uses the existing util method to fetch remote URIs as well.

Author: Timothy Chen <tnachen@gmail.com>

Closes #4832 from tnachen/driver_remote and squashes the following commits:

aa52cd6 [Timothy Chen] Support fetching remote uris in driver runner.

320bca45

SPARK-6878 [CORE] Fix for sum on empty RDD fails with exception · 51b306b9

Erik van Oosten authored 9 years ago

Author: Erik van Oosten <evanoosten@ebay.com>

Closes #5489 from erikvanoosten/master and squashes the following commits:

1c91954 [Erik van Oosten] Rewrote double range matcher to an exact equality assert (SPARK-6878)
f1708c9 [Erik van Oosten] Fix for sum on empty RDD fails with exception (SPARK-6878)
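
The failure mode named in the title can be illustrated with plain Scala collections rather than RDDs. This is a sketch by analogy, and `EmptySum` is a hypothetical name; it assumes the exception comes from reducing without a starting value and does not claim to show the change made in this commit.

```scala
object EmptySum {
  // reduce has no starting value, so it throws on an empty collection.
  def viaReduce(xs: Seq[Double]): Double = xs.reduce(_ + _)

  // fold starts from 0.0, so an empty collection simply yields 0.0.
  def viaFold(xs: Seq[Double]): Double = xs.fold(0.0)(_ + _)
}
```

Both functions agree on non-empty input; only the empty case differs.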

51b306b9

[SPARK-6731] Bump version of apache commons-math3 · 628a72f7

Punyashloka Biswal authored 9 years ago

Version 3.1.1 is two years old and the newer version includes
approximate percentile statistics (among other things).

Author: Punyashloka Biswal <punya.biswal@gmail.com>

Closes #5380 from punya/patch-1 and squashes the following commits:

226622b [Punyashloka Biswal] Bump version of apache commons-math3

628a72f7

[WIP][HOTFIX][SPARK-4123]: Fix bug in PR dependency (all deps. removed issue) · 77eeb10f

Brennon York authored 9 years ago

We're seeing a bug sporadically in the new PR dependency comparison test whereby it notes that *all* dependencies are removed. This happens when the current PR is built, but the final, sorted, dependency file is left blank. I believe this is an error either in the way the `git checkout` calls have been made or within the `mvn` build for that PR (again, likely related to the `git checkout`). As such, I've set the checkouts to now force (with the `-f` flag), which is more in line with what Jenkins currently does on the initial checkout.

Setting this as a WIP for now to trigger the build process myriad times to see if the issue still arises.

Author: Brennon York <brennon.york@capitalone.com>

Closes #5443 from brennonyork/HOTFIX2-SPARK-4123 and squashes the following commits:

f2186be [Brennon York] added output for the various git commit refs
3f073d6 [Brennon York] removed the git checkouts piping to dev null
07765a6 [Brennon York] updated the diff logic to reference the filenames rather than hardlink
e3f63c7 [Brennon York] added '-f' to the checkout flags for git
710c8d1 [Brennon York] added 30 minutes to the test benchmark

77eeb10f

Apr 13, 2015

[SPARK-5957][ML] better handling of parameters · 971b95b0

Xiangrui Meng authored 9 years ago

The design doc was posted on the JIRA page. Python changes will be in a follow-up PR. jkbradley

1. Use codegen for shared params.
1. Move shared params to package `ml.param.shared`.
1. Set default values in `Params` instead of in `Param`.
1. Add a few methods to `Params` and `ParamMap`.
1. Move schema handling to `SchemaUtils` from `Params`.

- [x] check visibility of the methods added

Author: Xiangrui Meng <meng@databricks.com>

Closes #5431 from mengxr/SPARK-5957 and squashes the following commits:

d19236d [Xiangrui Meng] fix test
26ae2d7 [Xiangrui Meng] re-gen code and mark clear protected
38b78c7 [Xiangrui Meng] update Param.toString and remove Params.explain()
409e2d5 [Xiangrui Meng] address comments
2d637bd [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
eec2264 [Xiangrui Meng] make get* public in Params
4090d95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
4fee9e7 [Xiangrui Meng] re-gen shared params
2737c2d [Xiangrui Meng] rename SharedParamCodeGen to SharedParamsCodeGen
e938f81 [Xiangrui Meng] update code to set default parameter values
28ed322 [Xiangrui Meng] merge master
55be1f3 [Xiangrui Meng] merge master
d63b5cc [Xiangrui Meng] fix examples
29b004c [Xiangrui Meng] update ParamsSuite
94fd98e [Xiangrui Meng] fix explain params
48d0e84 [Xiangrui Meng] add remove and update explainParams
4ac6348 [Xiangrui Meng] move schema utils to SchemaUtils add a few methods to Params
0d9594e [Xiangrui Meng] add getOrElse to ParamMap
eeeffe8 [Xiangrui Meng] map ++ paramMap => extractValues
0d3fc5b [Xiangrui Meng] setDefault after param
a9dbf59 [Xiangrui Meng] minor updates
d9302b8 [Xiangrui Meng] generate default values
1c72579 [Xiangrui Meng] pass test compile
abb7a3b [Xiangrui Meng] update default values handling
dcab97a [Xiangrui Meng] add codegen for shared params

971b95b0

[Minor][SparkR] Minor refactor and removes redundancy related to cleanClosure. · 0ba3fdd5

hlin09 authored 9 years ago

1. Only use `cleanClosure` in the creation of RRDDs. Normally, users and developers do not need to call `cleanClosure` in their function definitions.
2. Remove redundant code (e.g. unnecessary wrapper functions) related to `cleanClosure`.

Author: hlin09 <hlin09pu@gmail.com>

Closes #5495 from hlin09/cleanClosureFix and squashes the following commits:

74ec303 [hlin09] Minor refactor and removes redundancy.

0ba3fdd5

[SPARK-5794] [SQL] fix add jar · b45059d0

Daoyuan Wang authored 9 years ago

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4586 from adrian-wang/addjar and squashes the following commits:

efdd602 [Daoyuan Wang] move jar to another place
6c707e8 [Daoyuan Wang] restrict hive version for test
32c4fb8 [Daoyuan Wang] fix style and add a test
9957d87 [Daoyuan Wang] use sessionstate classloader in makeRDDforTable
0810e71 [Daoyuan Wang] remove variable substitution
1898309 [Daoyuan Wang] fix classnotfound
95a40da [Daoyuan Wang] support env argus in add jar, and set add jar ret to 0

b45059d0

[SQL] [Minor] Fix for SqlApp.scala · 3782e1f2

Fei Wang authored 9 years ago

SqlApp.scala is out of date.

Author: Fei Wang <wangfei1@huawei.com>

Closes #5485 from scwf/patch-1 and squashes the following commits:

6f731c2 [Fei Wang] SqlApp.scala compile error

3782e1f2

[Spark-4848] Allow different Worker configurations in standalone cluster · 435b8779

Nathan Kronenfeld authored 9 years ago

This refixes #3699 with the latest code.
This fixes SPARK-4848

I've changed the stand-alone cluster scripts to allow different workers to have different numbers of instances, with both the port and the web-ui port following along appropriately.

I did this by moving the loop over instances from start-slaves and stop-slaves (on the master) to start-slave and stop-slave (on the worker).

While I was at it, I changed SPARK_WORKER_PORT to work the same way as SPARK_WORKER_WEBUI_PORT, since the new methods work fine for both.

Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>

Closes #5140 from nkronenfeld/feature/spark-4848 and squashes the following commits:

cf5f47e [Nathan Kronenfeld] Merge remote branch 'upstream/master' into feature/spark-4848
044ca6f [Nathan Kronenfeld] Documentation and formatting as requested by by andrewor14
d739640 [Nathan Kronenfeld] Move looping through instances from the master to the workers, so that each worker respects its own number of instances and web-ui port

435b8779

[SPARK-6877][SQL] Add code generation support for Min · 4898dfa4

Liang-Chi Hsieh authored 9 years ago

Currently, `min` is not supported in code generation. This PR adds support for it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5487 from viirya/add_min_codegen and squashes the following commits:

0ddec23 [Liang-Chi Hsieh] Add code generation support for Min.

4898dfa4

[SPARK-6303][SQL] Remove unnecessary Average in GeneratedAggregate · 5b8b324f

Liang-Chi Hsieh authored 9 years ago

Because `Average` is a `PartialAggregate`, we never get an `Average` node when reaching `HashAggregation` to prepare `GeneratedAggregate`.

That is why in SQLQuerySuite there is already a test for `avg` with codegen. And it works.

But we can find a case in `GeneratedAggregate` to deal with `Average`. Based on the above, we actually never execute this case.

So we can remove this case from `GeneratedAggregate`.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4996 from viirya/add_average_codegened and squashes the following commits:

621c12f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_average_codegened
368cfbc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_average_codegened
74926d1 [Liang-Chi Hsieh] Add Average in canBeCodeGened lists.

5b8b324f

[SPARK-6881][SparkR] Changes the checkpoint directory name. · d7f2c198

hlin09 authored 9 years ago

Author: hlin09 <hlin09pu@gmail.com>

Closes #5493 from hlin09/fixCheckpointDir and squashes the following commits:

e67fc40 [hlin09] Change to temp dir.
1f7ed9e [hlin09] Change the checkpoint dir name.

d7f2c198

[SPARK-5931][CORE] Use consistent naming for time properties · c4ab255e

Ilya Ganelin authored 9 years ago

I've added new utility methods that convert times specified as, e.g., 120s, 240ms, or 360us into a consistent internal representation. I've updated usage of these constants throughout the code to be consistent.

I believe I've captured all usages of time-based properties throughout the code. I've also updated variable names in a number of places to reflect their units for clarity and updated documentation where appropriate.
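
As a rough sketch of this kind of conversion (not the parser added in this PR), the helper below splits a time string into a number and an optional suffix with a regex and converts it with `java.util.concurrent.TimeUnit`. The object name `TimeStrings`, the methods `parse`, `asMs`, and `asSeconds`, the suffix table, and the error handling are all assumptions made for illustration.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper for illustration; not the implementation from this PR.
object TimeStrings {
  // A number followed by an optional lowercase unit suffix, e.g. "120s" or "240ms".
  private val Pattern = "(-?[0-9]+)([a-z]+)?".r

  private val suffixes: Map[String, TimeUnit] = Map(
    "us" -> TimeUnit.MICROSECONDS,
    "ms" -> TimeUnit.MILLISECONDS,
    "s"  -> TimeUnit.SECONDS,
    "m"  -> TimeUnit.MINUTES,
    "h"  -> TimeUnit.HOURS,
    "d"  -> TimeUnit.DAYS
  )

  // Converts `str` to `target`; a bare number is interpreted in `default` units.
  def parse(str: String, target: TimeUnit, default: TimeUnit): Long =
    str.trim.toLowerCase match {
      case Pattern(num, suffix) =>
        val unit =
          if (suffix == null) default // the optional group did not match
          else suffixes.getOrElse(
            suffix,
            throw new NumberFormatException(s"Invalid suffix: $suffix"))
        target.convert(num.toLong, unit) // truncates toward zero
      case _ =>
        throw new NumberFormatException(s"Invalid time string: $str")
    }

  def asMs(str: String): Long =
    parse(str, TimeUnit.MILLISECONDS, TimeUnit.MILLISECONDS)

  def asSeconds(str: String): Long =
    parse(str, TimeUnit.SECONDS, TimeUnit.SECONDS)
}
```

Under these assumptions, `TimeStrings.asMs("120s")` yields 120000, while `TimeStrings.asMs("360us")` yields 0 because `TimeUnit.convert` truncates when converting to a coarser unit.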

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Author: Ilya Ganelin <ilganeli@gmail.com>

Closes #5236 from ilganeli/SPARK-5931 and squashes the following commits:

4526c81 [Ilya Ganelin] Update configuration.md
de3bff9 [Ilya Ganelin] Fixing style errors
f5fafcd [Ilya Ganelin] Doc updates
951ca2d [Ilya Ganelin] Made the most recent round of changes
bc04e05 [Ilya Ganelin] Minor fixes and doc updates
25d3f52 [Ilya Ganelin] Minor nit fixes
642a06d [Ilya Ganelin] Fixed logic for invalid suffixes and addid matching test
8927e66 [Ilya Ganelin] Fixed handling of -1
69fedcc [Ilya Ganelin] Added test for zero
dc7bd08 [Ilya Ganelin] Fixed error in exception handling
7d19cdd [Ilya Ganelin] Added fix for possible NPE
6f651a8 [Ilya Ganelin] Now using regexes to simplify code in parseTimeString. Introduces getTimeAsSec and getTimeAsMs methods in SparkConf. Updated documentation
cbd2ca6 [Ilya Ganelin] Formatting error
1a1122c [Ilya Ganelin] Formatting fixes and added m for use as minute formatter
4e48679 [Ilya Ganelin] Fixed priority order and mixed up conversions in a couple spots
d4efd26 [Ilya Ganelin] Added time conversion for yarn.scheduler.heartbeat.interval-ms
cbf41db [Ilya Ganelin] Got rid of thrown exceptions
1465390 [Ilya Ganelin] Nit
28187bf [Ilya Ganelin] Convert straight to seconds
ff40bfe [Ilya Ganelin] Updated tests to fix small bugs
19c31af [Ilya Ganelin] Added cleaner computation of time conversions in tests
6387772 [Ilya Ganelin] Updated suffix handling to handle overlap of units more gracefully
5193d5f [Ilya Ganelin] Resolved merge conflicts
76cfa27 [Ilya Ganelin] [SPARK-5931] Minor nit fixes'
bf779b0 [Ilya Ganelin] Special handling of overlapping usffixes for java
dd0a680 [Ilya Ganelin] Updated scala code to call into java
b2fc965 [Ilya Ganelin] replaced get or default since it's not present in this version of java
39164f9 [Ilya Ganelin] [SPARK-5931] Updated Java conversion to be similar to scala conversion. Updated conversions to clean up code a little using TimeUnit.convert. Added Unit tests
3b126e1 [Ilya Ganelin] Fixed conversion to US from seconds
1858197 [Ilya Ganelin] Fixed bug where all time was being converted to us instead of the appropriate units
bac9edf [Ilya Ganelin] More whitespace
8613631 [Ilya Ganelin] Whitespace
1c0c07c [Ilya Ganelin] Updated Java code to add day, minutes, and hours
647b5ac [Ilya Ganelin] Udpated time conversion to use map iterator instead of if fall through
70ac213 [Ilya Ganelin] Fixed remaining usages to be consistent. Updated Java-side time conversion
68f4e93 [Ilya Ganelin] Updated more files to clean up usage of default time strings
3a12dd8 [Ilya Ganelin] Updated host revceiver
5232a36 [Ilya Ganelin] [SPARK-5931] Changed default behavior of time string conversion.
499bdf0 [Ilya Ganelin] Merge branch 'SPARK-5931' of github.com:ilganeli/spark into SPARK-5931
9e2547c [Ilya Ganelin] Reverting doc changes
8f741e1 [Ilya Ganelin] Update JavaUtils.java
34f87c2 [Ilya Ganelin] Update Utils.scala
9a29d8d [Ilya Ganelin] Fixed misuse of time in streaming context test
42477aa [Ilya Ganelin] Updated configuration doc with note on specifying time properties
cde9bff [Ilya Ganelin] Updated spark.streaming.blockInterval
c6a0095 [Ilya Ganelin] Updated spark.core.connection.auth.wait.timeout
5181597 [Ilya Ganelin] Updated spark.dynamicAllocation.schedulerBacklogTimeout
2fcc91c [Ilya Ganelin] Updated spark.dynamicAllocation.executorIdleTimeout
6d1518e [Ilya Ganelin] Updated spark.speculation.interval
3f1cfc8 [Ilya Ganelin] Updated spark.scheduler.revive.interval
3352d34 [Ilya Ganelin] Updated spark.scheduler.maxRegisteredResourcesWaitingTime
272c215 [Ilya Ganelin] Updated spark.locality.wait
7320c87 [Ilya Ganelin] updated spark.akka.heartbeat.interval
064ebd6 [Ilya Ganelin] Updated usage of spark.cleaner.ttl
21ef3dd [Ilya Ganelin] updated spark.shuffle.sasl.timeout
c9f5cad [Ilya Ganelin] Updated spark.shuffle.io.retryWait
4933fda [Ilya Ganelin] Updated usage of spark.storage.blockManagerSlaveTimeout
7db6d2a [Ilya Ganelin] Updated usage of spark.akka.timeout
404f8c3 [Ilya Ganelin] Updated usage of spark.core.connection.ack.wait.timeout
59bf9e1 [Ilya Ganelin] [SPARK-5931] Updated Utils and JavaUtils classes to add helper methods to handle time strings. Updated time strings in a few places to properly parse time
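
The commits above describe regex-based parsing of time strings with unit suffixes and a default unit per accessor. A minimal sketch of that idea follows, in Python rather than the Scala/Java of the actual change. The function names, the accepted suffix set, and the error handling are assumptions made for illustration; they are not taken from `Utils.scala` or `JavaUtils.java`.

```python
import re

# Microseconds per unit. The suffix set is an assumption for this sketch.
_US_PER_UNIT = {
    "us": 1,
    "ms": 1_000,
    "s": 1_000_000,
    "m": 60 * 1_000_000,
    "min": 60 * 1_000_000,
    "h": 3_600 * 1_000_000,
    "d": 86_400 * 1_000_000,
}

# Capture the whole suffix at once so that "ms" is never mistaken for "m" or "s".
_TIME_RE = re.compile(r"^([0-9]+)([a-z]+)?$")


def parse_time_string(text, default_unit):
    """Return the duration in microseconds; a bare number uses default_unit."""
    match = _TIME_RE.match(text.strip().lower())
    if match is None:
        raise ValueError("invalid time string: %r" % text)
    suffix = match.group(2) or default_unit
    if suffix not in _US_PER_UNIT:
        raise ValueError("unknown time suffix: %r" % suffix)
    return int(match.group(1)) * _US_PER_UNIT[suffix]


def time_string_as_ms(text):
    """Hypothetical helper: bare numbers are read as milliseconds."""
    return parse_time_string(text, "ms") // 1_000


def time_string_as_seconds(text):
    """Hypothetical helper: bare numbers are read as seconds."""
    return parse_time_string(text, "s") // 1_000_000
```

Matching the full suffix in one capture group is one way to handle overlapping units; the real implementation may resolve the overlap differently.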

[SPARK-5941] [SQL] Unit Test loads the table `src` twice for leftsemijoin.q · c5602bdc

Cheng Hao authored 9 years ago

`leftsemijoin.q` already contains a data loading command for table `sales`, but `TestHive` also creates the table `sales`, which causes duplicate records to be inserted into `sales`.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4506 from chenghao-intel/df_table and squashes the following commits:

0be05f7 [Cheng Hao] Remove the creation of table `sales` from TestHive

[SPARK-6872] [SQL] add copy in external sort · e63a86ab

Daoyuan Wang authored 9 years ago

We need to add a copy before calling external sort.
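
The message does not say why the copy is needed. One common reason, assumed here, is that the upstream iterator reuses a single mutable row object, so a sorter that buffers references ends up holding the same object many times. A minimal Python sketch of that hazard (all names are hypothetical, and this is not the Spark code):

```python
def reusing_row_iterator(values):
    """Yield the same mutable list on every step, overwriting its contents."""
    row = [None]
    for value in values:
        row[0] = value
        yield row


def buffered_sort(rows, copy_rows):
    """Buffer all rows, then sort them; optionally copy each row first."""
    buffer = []
    for row in rows:
        buffer.append(list(row) if copy_rows else row)
    return sorted(buffer, key=lambda r: r[0])


# Without a copy, every buffered entry aliases the one reused row.
without_copy = buffered_sort(reusing_row_iterator([3, 1, 2]), copy_rows=False)
# With a copy, each buffered entry keeps its own value.
with_copy = buffered_sort(reusing_row_iterator([3, 1, 2]), copy_rows=True)
```

Here `without_copy` is `[[2], [2], [2]]`, while `with_copy` is `[[1], [2], [3]]`.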

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #5481 from adrian-wang/extsort and squashes the following commits:

9611586 [Daoyuan Wang] fix bug in external sort

[SPARK-5972] [MLlib] Cache residuals and gradient in GBT during training and validation · 2a55cb41

MechCoder authored 9 years ago

The previous PR https://github.com/apache/spark/pull/4906 helped to extract the learning curve, giving the error for each iteration. This PR continues that work by refactoring some code and extending the same logic to training and validation.
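
As a rough illustration of why caching helps, the sketch below keeps a running prediction per example so that each boosting iteration only evaluates the newest tree instead of re-evaluating the whole ensemble. It is a toy squared-error booster over one-dimensional stumps written in Python; the function names, the stump learner, and the learning rate are assumptions for illustration and are not MLlib's implementation.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        sse = sum(
            (r - (left_mean if x <= threshold else right_mean)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_iterations, learning_rate=0.5):
    """Return (trees, cached predictions, mean squared error per iteration)."""
    trees = []
    cached_pred = [0.0] * len(xs)  # running ensemble prediction per example
    errors = []
    for _ in range(num_iterations):
        # Residuals come from the cache, not from re-running every tree.
        residuals = [y - p for y, p in zip(ys, cached_pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Only the new tree is evaluated to refresh the cache.
        cached_pred = [p + learning_rate * tree(x) for x, p in zip(xs, cached_pred)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, cached_pred)) / len(ys))
    return trees, cached_pred, errors


def predict_from_scratch(trees, x, learning_rate=0.5):
    """Reference path: evaluate every tree in the ensemble for one input."""
    return sum(learning_rate * tree(x) for tree in trees)
```

The `errors` list plays the role of the per-iteration learning curve; the same cache can be kept for a validation set to track validation error without recomputing the ensemble.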

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5330 from MechCoder/spark-5972 and squashes the following commits:

0b5d659 [MechCoder] minor
32d409d [MechCoder] evaluateEachIteration and training cache should follow different paths
d542bb0 [MechCoder] Remove unused imports and docs
58f4932 [MechCoder] Remove unpersist
70d3b4c [MechCoder] Broadcast for each tree
5869533 [MechCoder] Access broadcasted values locally and other minor changes
923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation

[SQL][SPARK-6742]: Don't push down predicates which reference partition column(s) · 3a205bbd

Yash Datta authored 9 years ago

cc liancheng

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #5390 from saucam/fpush and squashes the following commits:

3f026d6 [Yash Datta] SPARK-6742: Fix scalastyle
ce3d702 [Yash Datta] SPARK-6742: Add test case, fix scalastyle
8592acc [Yash Datta] SPARK-6742: Don't push down predicates which reference partition column(s)
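
A minimal Python sketch of the rule in the title: predicates that reference a partition column are kept out of the set pushed down to the data source. The representation of a predicate as a `(description, referenced_columns)` pair and the function name are assumptions for illustration; the motivation noted in the comment (partition values may not be stored in the data files) is also an assumption, since the commit message does not state one.

```python
def split_predicates(predicates, partition_columns):
    """Split predicates into (pushable, retained).

    predicates: list of (description, referenced_columns) pairs.
    A predicate is retained, i.e. not pushed down, if it references any
    partition column. Assumed motivation: partition column values may not
    be present in the data files that the pushed-down filter would see.
    """
    partition_set = set(partition_columns)
    pushable, retained = [], []
    for predicate in predicates:
        _, columns = predicate
        if partition_set & set(columns):
            retained.append(predicate)
        else:
            pushable.append(predicate)
    return pushable, retained


pushable, retained = split_predicates(
    [("a > 1", ["a"]), ("part = 2", ["part"]), ("a < part", ["a", "part"])],
    partition_columns=["part"],
)
```

Only `a > 1` is pushable in this example; the other two predicates reference `part` and are evaluated after the scan.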

[SPARK-6130] [SQL] support if not exists for insert overwrite into partition in hiveQl · 85ee0cab

Daoyuan Wang authored 9 years ago

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
 
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
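
The effect of the `IF NOT EXISTS` clause can be sketched as follows, assuming the Hive behaviour that an `INSERT OVERWRITE` into a partition that already exists is skipped when the clause is present. This is a toy Python model with a dict standing in for the table's partitions; the function and parameter names are hypothetical.

```python
def insert_overwrite(table, partition, rows, if_not_exists=False):
    """Overwrite one partition of a toy table.

    table: dict mapping a partition spec (tuple) to its list of rows.
    Returns True if data was written, False if the insert was skipped.
    """
    if if_not_exists and partition in table:
        # The partition already exists, so its data is left untouched.
        return False
    table[partition] = list(rows)
    return True


table = {("2015-04-14",): ["old"]}
skipped = insert_overwrite(table, ("2015-04-14",), ["new"], if_not_exists=True)
created = insert_overwrite(table, ("2015-04-15",), ["new"], if_not_exists=True)
```

After these calls the existing partition still holds `["old"]`, and the new partition holds `["new"]`. Without the clause, the first call would have replaced `["old"]`.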

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4865 from adrian-wang/insertoverwrite and squashes the following commits:

2fce94f [Daoyuan Wang] add assert
10ea6f3 [Daoyuan Wang] add name for boolean parameter
0bbe9b9 [Daoyuan Wang] fix failure
4391154 [Daoyuan Wang] support if not exists for insert overwrite into partition in hiveQl