- Nov 07, 2016
-
-
Yanbo Liang authored
## What changes were proposed in this pull request? SparkR ```spark.glm``` predict should output original label when family = "binomial". ## How was this patch tested? Add unit test. You can also run the following code to test: ```R training <- suppressWarnings(createDataFrame(iris)) training <- training[training$Species %in% c("versicolor", "virginica"), ] model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit")) showDF(predict(model, training)) ``` Before this change: ``` +------------+-----------+------------+-----------+----------+-----+-------------------+ |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label| prediction| +------------+-----------+------------+-----------+----------+-----+-------------------+ | 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| 0.8271421517601544| | 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| 0.6044595910413112| | 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| 0.7916340858281998| | 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|0.16080518180591158| | 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| 0.6112229217050189| | 5.7| 2.8| 4.5| 1.3|versicolor| 0.0| 0.2555087295500885| | 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| 0.5681507664364834| | 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|0.05990570219972002| | 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| 0.6644434078306246| | 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|0.11293577405862379| | 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|0.06152372321585971| | 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|0.35250697207602555| | 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|0.32267018290814303| | 6.1| 2.9| 4.7| 1.4|versicolor| 0.0| 0.433391153814592| | 5.6| 2.9| 3.6| 1.3|versicolor| 0.0| 0.2280744262436993| | 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| 0.7219848389339459| | 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|0.23527698971404695| | 5.8| 2.7| 4.1| 1.0|versicolor| 0.0| 0.285024533520016| | 6.2| 2.2| 4.5| 1.5|versicolor| 0.0| 0.4107047877447493| | 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|0.20083561961645083| +------------+-----------+------------+-----------+----------+-----+-------------------+ ``` After this change: ``` +------------+-----------+------------+-----------+----------+-----+----------+ |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label|prediction| +------------+-----------+------------+-----------+----------+-----+----------+ | 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| virginica| | 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| virginica| | 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| virginica| | 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|versicolor| | 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| virginica| | 5.7| 2.8| 4.5| 1.3|versicolor| 0.0|versicolor| | 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| virginica| | 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|versicolor| | 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| virginica| | 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|versicolor| | 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|versicolor| | 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|versicolor| | 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|versicolor| | 6.1| 2.9| 4.7| 1.4|versicolor| 0.0|versicolor| | 5.6| 2.9| 3.6| 1.3|versicolor| 0.0|versicolor| | 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| virginica| | 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|versicolor| | 5.8| 2.7| 4.1| 1.0|versicolor| 0.0|versicolor| | 6.2| 2.2| 4.5| 1.5|versicolor| 0.0|versicolor| | 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|versicolor| +------------+-----------+------------+-----------+----------+-----+----------+ ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #15788 from yanboliang/spark-18291.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? As reported in the jira, sometimes the generated java code in codegen will cause compilation error. Code snippet to test it: case class Route(src: String, dest: String, cost: Int) case class GroupedRoutes(src: String, dest: String, routes: Seq[Route]) val ds = sc.parallelize(Array( Route("a", "b", 1), Route("a", "b", 2), Route("a", "c", 2), Route("a", "d", 10), Route("b", "a", 1), Route("b", "a", 5), Route("b", "c", 6)) ).toDF.as[Route] val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r))) .groupByKey(r => (r.src, r.dest)) .reduceGroups { (g1: GroupedRoutes, g2: GroupedRoutes) => GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes) }.map(_._2) The problem here is, in `ReferenceToExpressions` we evaluate the children vars to local variables. Then the result expression is evaluated to use those children variables. In the above case, the result expression code is too long and will be split by `CodegenContext.splitExpression`. So those local variables cannot be accessed and cause compilation error. ## How was this patch tested? Jenkins tests. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #15693 from viirya/fix-codege-compilation-error.
-
gatorsmile authored
### What changes were proposed in this pull request? Currently, the Hive built-in `hash` function is not being used in Spark since Spark 2.0. The public interface does not allow users to unregister the Spark built-in functions. Thus, users will never use Hive's built-in `hash` function. The only exception here is `TestHiveFunctionRegistry`, which allows users to unregister the built-in functions. Thus, we can load Hive's hash function in the test cases. If we disable it, 10+ test cases will fail because the results are different from the Hive golden answer files. This PR is to remove `hash` from the list of `hiveFunctions` in `HiveSessionCatalog`. It will also remove `TestHiveFunctionRegistry`. This removal makes us easier to remove `TestHiveSessionState` in the future. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #14498 from gatorsmile/removeHash.
-
Reynold Xin authored
## What changes were proposed in this pull request? We have an undocumented naming convention to call expression unit tests ExpressionsSuite, and the end-to-end tests FunctionsSuite. It'd be great to make all test suites consistent with this naming convention. ## How was this patch tested? This is a test-only naming change. Author: Reynold Xin <rxin@databricks.com> Closes #15793 from rxin/SPARK-18296.
-
Reynold Xin authored
-
- Nov 06, 2016
-
-
Wenchen Fan authored
## What changes were proposed in this pull request? Previously `TRUNCATE TABLE ... PARTITION` will always truncate the whole table for data source tables, this PR fixes it and improve `InMemoryCatalog` to make this command work with it. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #15688 from cloud-fan/truncate.
-
hyukjinkwon authored
## What changes were proposed in this pull request? Currently, there are the three cases when reading CSV by datasource when it is `PERMISSIVE` parse mode. - schema == parsed tokens (from each line) No problem to cast the value in the tokens to the field in the schema as they are equal. - schema < parsed tokens (from each line) It slices the tokens into the number of fields in schema. - schema > parsed tokens (from each line) It appends `null` into parsed tokens so that safely values can be casted with the schema. However, when `null` is appended in the third case, we should take `null` into account when casting the values. In case of `StringType`, it is fine as `UTF8String.fromString(datum)` produces `null` when the input is `null`. Therefore, this case will happen only when schema is explicitly given and schema includes data types that are not `StringType`. The codes below: ```scala val path = "/tmp/a" Seq("1").toDF().write.text(path.getAbsolutePath) val schema = StructType( StructField("a", IntegerType, true) :: StructField("b", IntegerType, true) :: Nil) spark.read.schema(schema).option("header", "false").csv(path).show() ``` prints **Before** ``` java.lang.NumberFormatException: null at java.lang.Integer.parseInt(Integer.java:542) at java.lang.Integer.parseInt(Integer.java:615) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24) ``` **After** ``` +---+----+ | a| b| +---+----+ | 1|null| +---+----+ ``` ## How was this patch tested? Unit test in `CSVSuite.scala` and `CSVTypeCastSuite.scala` Author: hyukjinkwon <gurwls223@gmail.com> Closes #15767 from HyukjinKwon/SPARK-18269.
-
Wojciech Szymanski authored
## What changes were proposed in this pull request? Motivation: `org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)` Solution: - fix for Pipeline UID - introduced new tests for `org.apache.spark.ml.Pipeline.copy` - minor improvements in test for `org.apache.spark.ml.PipelineModel.copy` ## How was this patch tested? Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"` Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"` Author: Wojciech Szymanski <wk.szymanski@gmail.com> Closes #15759 from wojtek-szymanski/SPARK-18210.
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes `rand`/`randn` accept `null` as input in Scala/SQL and `LongType` as input in SQL. In this case, it treats the values as `0`. So, this PR includes both changes below: - `null` support It seems MySQL also accepts this. ``` sql mysql> select rand(0); +---------------------+ | rand(0) | +---------------------+ | 0.15522042769493574 | +---------------------+ 1 row in set (0.00 sec) mysql> select rand(NULL); +---------------------+ | rand(NULL) | +---------------------+ | 0.15522042769493574 | +---------------------+ 1 row in set (0.00 sec) ``` and also Hive does according to [HIVE-14694](https://issues.apache.org/jira/browse/HIVE-14694) So the codes below: ``` scala spark.range(1).selectExpr("rand(null)").show() ``` prints.. **Before** ``` Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444) ``` **After** ``` +-----------------------+ |rand(CAST(NULL AS INT))| +-----------------------+ | 0.13385709732307427| +-----------------------+ ``` - `LongType` support in SQL. In addition, it make the function allows to take `LongType` consistently within Scala/SQL. In more details, the codes below: ``` scala spark.range(1).select(rand(1), rand(1L)).show() spark.range(1).selectExpr("rand(1)", "rand(1L)").show() ``` prints.. **Before** ``` +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ Input argument to rand must be an integer literal.;; line 1 pos 0 org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465) at ``` **After** ``` +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ +------------------+------------------+ | rand(1)| rand(1)| +------------------+------------------+ |0.2630967864682161|0.2630967864682161| +------------------+------------------+ ``` ## How was this patch tested? Unit tests in `DataFrameSuite.scala` and `RandomSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15432 from HyukjinKwon/SPARK-17854.
-
sethah authored
## What changes were proposed in this pull request? Only some of the models which contain a training summary currently set the summaries in the copy method. Linear/Logistic regression do, GLR, GMM, KM, and BKM do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent. ## How was this patch tested? Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary. Author: sethah <seth.hendrickson16@gmail.com> Closes #15773 from sethah/SPARK-18276.
-
- Nov 05, 2016
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes to improve documentation and fix some descriptions equivalent to several minor fixes identified in https://github.com/apache/spark/pull/15677 Also, this suggests to change `Note:` and `NOTE:` to `.. note::` consistently with the others which marks up pretty. ## How was this patch tested? Jenkins tests and manually. For PySpark, `Note:` and `NOTE:` to `.. note::` make the document as below: **From**      **To**      Author: hyukjinkwon <gurwls223@gmail.com> Closes #15765 from HyukjinKwon/minor-function-doc.
-
Susan X. Huynh authored
## What changes were proposed in this pull request? Enabled SparkR with Mesos client mode and cluster mode. Just a few changes were required to get this working on Mesos: (1) removed the SparkR on Mesos error checks and (2) do not require "--class" to be specified for R apps. The logic to check spark.mesos.executor.home was already in there. sun-rui ## How was this patch tested? 1. SparkSubmitSuite 2. On local mesos cluster (on laptop): ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. 3. On multi-node mesos cluster: ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. I tested with the following --conf values set: spark.mesos.executor.docker.image and spark.mesos.executor.home This contribution is my original work and I license the work to the project under the project's open source license. Author: Susan X. Huynh <xhuynh@mesosphere.com> Closes #15700 from susanxhuynh/susan-r-branch.
-
wangyang authored
## What changes were proposed in this pull request? Prior this pr, the following code would cause an NPE: `case class point(a:String, b:String, c:String, d: Int)` `val data = Seq( point("1","2","3", 1), point("4","5","6", 1), point("7","8","9", 1) )` `sc.parallelize(data).toDF().registerTempTable("table")` `spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a)) ").show()` The reason is that when the grouping_id() behavior was changed in #10677, some code (which should be changed) was left out. Take the above code for example, prior #10677, the bit mask for set "(a)" was `001`, while after #10677 the bit mask was changed to `011`. However, the `nonNullBitmask` was not changed accordingly. This pr will fix this problem. ## How was this patch tested? add integration tests Author: wangyang <wangyang@haizhi.com> Closes #15416 from yangw1234/groupingid.
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes to fix ```diff test("FileStreamSink - json") { - testFormat(Some("text")) + testFormat(Some("json")) } ``` `text` is being tested above ``` test("FileStreamSink - text") { testFormat(Some("text")) } ``` ## How was this patch tested? Fixed test in `FileStreamSinkSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15785 from HyukjinKwon/SPARK-18192.
-
Reynold Xin authored
## What changes were proposed in this pull request? As the title suggests, this patch moves hash expressions from misc.scala into hash.scala, to make it easier to find the hash functions. I wanted to do this a while ago but decided to wait for the branch-2.1 cut so the chance of conflicts will be smaller. ## How was this patch tested? Test cases were also moved out of MiscFunctionsSuite into HashExpressionsSuite. Author: Reynold Xin <rxin@databricks.com> Closes #15784 from rxin/SPARK-18287.
-
Wenchen Fan authored
[SPARK-17183][SPARK-17983][SPARK-18101][SQL] put hive serde table schema to table properties like data source table ## What changes were proposed in this pull request? For data source tables, we will put its table schema, partition columns, etc. to table properties, to work around some hive metastore issues, e.g. not case-preserving, bad decimal type support, etc. We should also do this for hive serde tables, to reduce the difference between hive serde tables and data source tables, e.g. column names should be case preserving. ## How was this patch tested? existing tests, and a new test in `HiveExternalCatalog` Author: Wenchen Fan <wenchen@databricks.com> Closes #14750 from cloud-fan/minor1.
-
Burak Yavuz authored
## What changes were proposed in this pull request? `from_json` is currently not safe against `null` rows. This PR adds a fix and a regression test for it. ## How was this patch tested? Regression test Author: Burak Yavuz <brkyvz@gmail.com> Closes #15771 from brkyvz/json_fix.
-
Weiqing Yang authored
## What changes were proposed in this pull request? Add comments. ## How was this patch tested? Build passed. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15776 from weiqingy/SPARK-17710.
-
Reynold Xin authored
closes #15774
-
- Nov 04, 2016
-
-
Josh Rosen authored
## What changes were proposed in this pull request? This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes: - **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own` Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method. - **Don't call `Utils.getFormattedClassName` for every event**: the old code called` Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls. ## How was this patch tested? Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half:  Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`:  After this patch, these hotspots are completely eliminated. Author: Josh Rosen <joshrosen@databricks.com> Closes #15756 from JoshRosen/speed-up-jsonprotocol.
-
Eric Liang authored
## What changes were proposed in this pull request? It seems the proximate cause of the test failures is that `cast(str as decimal)` in derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034 Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due this cast exception. As commented on in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning, however when the pruning predicate contains an unsupported operator such as `>`, that will fail as well. The only remaining question is why this behavior is nondeterministic. We know that when the test flakes, retries do not help, therefore the cause must be environmental. The current best hypothesis is that some config is different between different jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failure vs success we can isolate the root cause of the flakiness. **Update:** we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to enable the non-flaky parts of the test since we are fairly confident these issues only occur with Derby (which is not used in production). ## How was this patch tested? N/A Author: Eric Liang <ekl@databricks.com> Closes #15725 from ericl/print-confs-out.
-
Herman van Hovell authored
## What changes were proposed in this pull request? The `PushDownPredicate` rule can create a wrong result if we try to push a filter containing a predicate subquery through a project when the subquery and the project share attributes (have the same source). The current PR fixes this by making sure that we do not push down when there is a predicate subquery that outputs the same attributes as the filters new child plan. ## How was this patch tested? Added a test to `SubquerySuite`. nsyca has done previous work this. I have taken test from his initial PR. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15761 from hvanhovell/SPARK-17337.
-
Adam Roberts authored
## What changes were proposed in this pull request? This improvement works by using the fastest comparison test first and we observed a 1% throughput performance improvement on PageRank (HiBench large profile) with this change. We used tprof and before the change in AppendOnlyMap.changeValue (where the optimisation occurs) this method was being used for 8053 profiling ticks representing 0.72% of the overall application time. After this change we observed this method only occurring for 2786 ticks and for 0.25% of the overall time. ## How was this patch tested? Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #15714 from a-roberts/patch-9.
-
Reynold Xin authored
Closes #15758 Closes #15753 Closes #12708
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent. **Before** ``` nextPowerOf2(0) => 2 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` **After** ``` nextPowerOf2(0) => 1 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15754 from dongjoon-hyun/SPARK-18200-2.
-
Felix Cheung authored
## What changes were proposed in this pull request? minor doc update that should go to master & branch-2.1 ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15747 from felixcheung/pySPARK-14393.
-
- Nov 03, 2016
-
-
Herman van Hovell authored
## What changes were proposed in this pull request? `QueryExecution.toString` currently captures `java.lang.Throwable`s; this is far from a best practice and can lead to confusing situation or invalid application states. This PR fixes this by only capturing `AnalysisException`s. ## How was this patch tested? Added a `QueryExecutionSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15760 from hvanhovell/SPARK-18259.
-
Sean Owen authored
[SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0 ## What changes were proposed in this pull request? Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it. ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15733 from srowen/SPARK-18138.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch improves error reporting for FileStressSuite, when there is an error in Spark itself (not user code). This works by simply tightening the exception verification, and gets rid of the unnecessary thread for starting the stream. Also renamed the class FileStreamStressSuite to make it more obvious it is a streaming suite. ## How was this patch tested? This is a test only change and I manually verified error reporting by injecting some bug in the addBatch code for FileStreamSink. Author: Reynold Xin <rxin@databricks.com> Closes #15757 from rxin/SPARK-18257.
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them. ## How was this patch tested? Existing unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15697 from wangmiao1981/rtest.
-
cody koeninger authored
## What changes were proposed in this pull request? Increase poll timeout to try and address flaky test ## How was this patch tested? Ran existing unit tests Author: cody koeninger <cody@koeninger.org> Closes #15737 from koeninger/SPARK-18212.
-
Kishor Patil authored
## What changes were proposed in this pull request? During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives. ## How was this patch tested? Manually tested: 1. if same jar is mentioned in --jars and --files it will continue to submit the job. - basically functionality [SPARK-14423] #12203 is unchanged 1. if same file is mentioned in --files and --archives it will fail to submit the job. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. … under archives and files Author: Kishor Patil <kpatil@yahoo-inc.com> Closes #15627 from kishorvpatil/spark18099.
-
福星 authored
hive.exec.stagingdir have no effect in spark2.0.1, Hive confs in hive-site.xml will be loaded in `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf` Author: 福星 <fuxing@wacai.com> Closes #15744 from ClassNotFoundExp/master.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch renames partitionProviderIsHive to tracksPartitionsInCatalog, as the old name was too Hive specific. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #15750 from rxin/SPARK-18244.
-
Cheng Lian authored
## What changes were proposed in this pull request? This PR adds a new hash-based aggregate operator named `ObjectHashAggregateExec` that supports `TypedImperativeAggregate`, which may use arbitrary Java objects as aggregation states. Please refer to the [design doc](https://issues.apache.org/jira/secure/attachment/12834260/%5BDesign%20Doc%5D%20Support%20for%20Arbitrary%20Aggregation%20States.pdf) attached in [SPARK-17949](https://issues.apache.org/jira/browse/SPARK-17949) for more details about it. The major benefit of this operator is better performance when evaluating `TypedImperativeAggregate` functions, especially when there are relatively few distinct groups. Functions like Hive UDAFs, `collect_list`, and `collect_set` may also benefit from this after being migrated to `TypedImperativeAggregate`. The following feature flag is introduced to enable or disable the new aggregate operator: - Name: `spark.sql.execution.useObjectHashAggregateExec` - Default value: `true` We can also configure the fallback threshold using the following SQL operation: - Name: `spark.sql.objectHashAggregate.sortBased.fallbackThreshold` - Default value: 128 Fallback to sort-based aggregation when more than 128 distinct groups are accumulated in the aggregation hash map. This number is intentionally made small to avoid GC problems since aggregation buffers of this operator may contain arbitrary Java objects. This may be improved by implementing size tracking for this operator, but that can be done in a separate PR. Code generation and size tracking are planned to be implemented in follow-up PRs. ## Benchmark results ### `ObjectHashAggregateExec` vs `SortAggregateExec` The first benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating `typed_count`, a testing `TypedImperativeAggregate` version of the SQL `count` function. ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5 Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz object agg v.s. sort agg: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ sort agg w/ group by 31251 / 31908 3.4 298.0 1.0X object agg w/ group by w/o fallback 6903 / 7141 15.2 65.8 4.5X object agg w/ group by w/ fallback 20945 / 21613 5.0 199.7 1.5X sort agg w/o group by 4734 / 5463 22.1 45.2 6.6X object agg w/o group by w/o fallback 4310 / 4529 24.3 41.1 7.3X ``` The next benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating the Spark native version of `percentile_approx`. Note that `percentile_approx` is so heavy an aggregate function that the bottleneck of the benchmark is evaluating the aggregate function itself rather than the aggregate operator since I couldn't run a large scale benchmark on my laptop. That's why the results are so close and looks counter-intuitive (aggregation with grouping is even faster than that aggregation without grouping). ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5 Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz object agg v.s. sort agg: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ sort agg w/ group by 3418 / 3530 0.6 1630.0 1.0X object agg w/ group by w/o fallback 3210 / 3314 0.7 1530.7 1.1X object agg w/ group by w/ fallback 3419 / 3511 0.6 1630.1 1.0X sort agg w/o group by 4336 / 4499 0.5 2067.3 0.8X object agg w/o group by w/o fallback 4271 / 4372 0.5 2036.7 0.8X ``` ### Hive UDAF vs Spark AF This benchmark compares the following two kinds of aggregate functions: - "hive udaf": Hive implementation of `percentile_approx`, without partial aggregation supports, evaluated using `SortAggregateExec`. - "spark af": Spark native implementation of `percentile_approx`, with partial aggregation support, evaluated using `ObjectHashAggregateExec` The performance differences are mostly due to faster implementation and partial aggregation support in the Spark native version of `percentile_approx`. This benchmark basically shows the performance differences between the worst case, where an aggregate function without partial aggregation support is evaluated using `SortAggregateExec`, and the best case, where a `TypedImperativeAggregate` with partial aggregation support is evaluated using `ObjectHashAggregateExec`. ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5 Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz hive udaf vs spark af: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ hive udaf w/o group by 5326 / 5408 0.0 81264.2 1.0X spark af w/o group by 93 / 111 0.7 1415.6 57.4X hive udaf w/ group by 3804 / 3946 0.0 58050.1 1.4X spark af w/ group by w/o fallback 71 / 90 0.9 1085.7 74.8X spark af w/ group by w/ fallback 98 / 111 0.7 1501.6 54.1X ``` ### Real world benchmark We also did a relatively large benchmark using a real world query involving `percentile_approx`: - Hive UDAF implementation, sort-based aggregation, w/o partial aggregation support 24.77 minutes - Native implementation, sort-based aggregation, w/ partial aggregation support 4.64 minutes - Native implementation, object hash aggregator, w/ partial aggregation support 1.80 minutes ## How was this patch tested? New unit tests and randomized test cases are added in `ObjectAggregateFunctionSuite`. Author: Cheng Lian <lian@databricks.com> Closes #15590 from liancheng/obj-hash-agg.
-
gatorsmile authored
### What changes were proposed in this pull request? When `FilterExec` contains `isNotNull`, which could be inferred and pushed down or users specified, we convert the nullability of the involved columns if the top-layer expression is null-intolerant. However, this is not correct, if the top-layer expression is not a leaf expression, it could still tolerate the null when it has null-tolerant child expressions. For example, `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a null-intolerant expression, but obviously`coalesce` is null-tolerant. Thus, it could eat null. When the nullability is wrong, we could generate incorrect results in different cases. For example, ``` Scala val df1 = Seq((1, 2), (2, 3)).toDF("a", "b") val df2 = Seq((2, 5), (3, 4)).toDF("a", "c") val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0) val df3 = Seq((3, 1)).toDF("a", "d") joinedDf.join(df3, "a").show ``` The optimized plan is like ``` Project [a#29, b#30, c#31, d#42] +- Join Inner, (a#29 = a#41) :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31] : +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int)) : +- Join FullOuter, (a#5 = a#15) : :- LocalRelation [a#5, b#6] : +- LocalRelation [a#15, c#16] +- LocalRelation [a#41, d#42] ``` Without the fix, it returns an empty result. With the fix, it can return a correct answer: ``` +---+---+---+---+ | a| b| c| d| +---+---+---+---+ | 3| 0| 4| 1| +---+---+---+---+ ``` ### How was this patch tested? Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result. Author: gatorsmile <gatorsmile@gmail.com> Closes #15523 from gatorsmile/nullabilityFilterExec.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? Add missing 'subsamplingRate' of pyspark GBTClassifier ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15692 from zhengruifeng/gbt_subsamplingRate.
-
Reynold Xin authored
## What changes were proposed in this pull request? I was reading this part of the code and was really confused by the "partition" parameter. This patch adds some documentation for it to reduce confusion in the future. I also looked around other logical plans but most of them are either already documented, or pretty self-evident to people that know Spark SQL. ## How was this patch tested? N/A - doc change only. Author: Reynold Xin <rxin@databricks.com> Closes #15749 from rxin/doc-improvement.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API. As part of this patch, I also moved the speficiation of the random uuid for the write path out of the commit protocol, and instead pass in a job id. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #15731 from rxin/SPARK-18219.
-
Daoyuan Wang authored
## What changes were proposed in this pull request? In Spark 1.6 and earlier, we can drop the database we are using. In Spark 2.0, native implementation prevent us from dropping current database, which may break some old queries. This PR would re-enable the feature. ## How was this patch tested? one new unit test in `SessionCatalogSuite`. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #15011 from adrian-wang/dropcurrent.
-