- Jan 08, 2017
-
-
Dilip Biswal authored
## What changes were proposed in this pull request? Consider the plans inside subquery expressions while looking up cache manager to make use of cached data. Currently CacheManager.useCachedData does not consider the subquery expressions in the plan. SQL ``` select * from rows where not exists (select * from rows) ``` Before the fix ``` == Optimized Logical Plan == Join LeftAnti :- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] +- Relation[_1#3775,_2#3776] parquet ``` After ``` == Optimized Logical Plan == Join LeftAnti :- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) : +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string> +- Project [_1#256 AS _1#256#298, _2#257 AS _2#257#299] +- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string> ``` Query2 ``` SELECT * FROM t1 WHERE c1 IN (SELECT c1 FROM t2 WHERE c1 IN (SELECT c1 FROM t3 WHERE c1 = 1)) ``` Before ``` == Analyzed Logical Plan == c1: int Project [c1#3] +- Filter predicate-subquery#47 [(c1#3 = c1#10)] : +- Project [c1#10] : +- Filter predicate-subquery#46 [(c1#10 = c1#17)] : : +- Project [c1#17] : : +- Filter (c1#17 = 1) : : +- SubqueryAlias t3, `t3` : : +- Project [value#15 AS c1#17] : : +- LocalRelation [value#15] : +- SubqueryAlias t2, `t2` : +- Project [value#8 AS c1#10] : +- LocalRelation [value#8] +- SubqueryAlias t1, `t1` +- Project [value#1 AS c1#3] +- LocalRelation [value#1] == Optimized Logical Plan == Join LeftSemi, (c1#3 = c1#10) :- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1 : +- LocalTableScan [c1#3] +- Project [value#8 AS c1#10] +- Join LeftSemi, (value#8 = c1#17) :- LocalRelation [value#8] +- Project [value#15 AS c1#17] +- Filter (value#15 = 1) +- LocalRelation [value#15] ``` After ``` == Analyzed Logical Plan == c1: int Project [c1#3] +- Filter predicate-subquery#47 [(c1#3 = c1#10)] : +- Project [c1#10] : +- Filter predicate-subquery#46 [(c1#10 = c1#17)] : : +- Project [c1#17] : : +- Filter (c1#17 = 1) : : +- SubqueryAlias t3, `t3` : : +- Project [value#15 AS c1#17] : : +- LocalRelation [value#15] : +- SubqueryAlias t2, `t2` : +- Project [value#8 AS c1#10] : +- LocalRelation [value#8] +- SubqueryAlias t1, `t1` +- Project [value#1 AS c1#3] +- LocalRelation [value#1] == Optimized Logical Plan == Join LeftSemi, (c1#3 = c1#10) :- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1 : +- LocalTableScan [c1#3] +- Join LeftSemi, (c1#10 = c1#17) :- InMemoryRelation [c1#10], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t2 : +- LocalTableScan [c1#10] +- Filter (c1#17 = 1) +- InMemoryRelation [c1#17], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1 +- LocalTableScan [c1#3] ``` ## How was this patch tested? Added new tests in CachedTableSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #16493 from dilipbiswal/SPARK-19093.
-
zuotingbing authored
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19026 SPARK_LOCAL_DIRS (Standalone) can be a comma-separated list of multiple directories on different disks, e.g. SPARK_LOCAL_DIRS=/dir1,/dir2,/dir3, if there is a IOExecption when create sub directory on dir3 , the sub directory which have been created successfully on dir1 and dir2 cannot be deleted anymore when the application finishes. So we should catch the IOExecption at Utils.createDirectory , otherwise the variable "appDirectories(appId)" which the function maybeCleanupApplication calls will not be set then dir1 and dir2 will not be cleaned up . Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #16439 from zuotingbing/master.
-
Yanbo Liang authored
## What changes were proposed in this pull request? SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, I'd like to split it into multiple files to make us easy to maintain: * mllib_classification.R * mllib_clustering.R * mllib_recommendation.R * mllib_regression.R * mllib_stat.R * mllib_tree.R * mllib_utils.R Note: Only reorg, no actual code change. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16312 from yanboliang/spark-18862.
-
- Jan 07, 2017
-
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276). ## How was this patch tested? ``` SKIP_API=1 jekyll build ``` **Newly Added Description** <img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png"> Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16400 from dongjoon-hyun/SPARK-18941.
-
Sean Owen authored
## What changes were proposed in this pull request? configuration.html section headings were not specified correctly in markdown and weren't rendering, being recognized correctly. Removed extra p tags and pulled level 4 titles up to level 3, since level 3 had been skipped. This improves the TOC. ## How was this patch tested? Doc build, manual check. Author: Sean Owen <sowen@cloudera.com> Closes #16490 from srowen/SPARK-19106.
-
wm624@hotmail.com authored
[SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model ## What changes were proposed in this pull request? While adding DistributedLDAModel training summary for SparkR, I found that the logPrior for original and loaded model is different. For example, in the test("read/write DistributedLDAModel"), I add the test: val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior assert(logPrior === logPrior2) The test fails: -4.394180878889078 did not equal -4.294290536919573 The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`. Please refer to #16464 for details. ## How was this patch tested? Add a new unit test for testing logPrior. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16491 from wangmiao1981/ldabug.
-
Wenchen Fan authored
## What changes were proposed in this pull request? `OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs: 1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it. 2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write` and remove `def write(row: Row)` and it's related converter code. All implementations should just implement `def write(row: InternalRow)` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #16479 from cloud-fan/hive-writer.
-
Yanbo Liang authored
## What changes were proposed in this pull request? #16126 bumps master branch version to 2.2.0-SNAPSHOT, but it seems R version was omitted. ## How was this patch tested? N/A Author: Yanbo Liang <ybliang8@gmail.com> Closes #16488 from yanboliang/r-version.
-
hyukjinkwon authored
[SPARK-13748][PYSPARK][DOC] Add the description for explictly setting None for a named argument for a Row ## What changes were proposed in this pull request? It seems allowed to not set a key and value for a dict to represent the value is `None` or missing as below: ``` python spark.createDataFrame([{"x": 1}, {"y": 2}]).show() ``` ``` +----+----+ | x| y| +----+----+ | 1|null| |null| 2| +----+----+ ``` However, it seems it is not for `Row` as below: ``` python spark.createDataFrame([Row(x=1), Row(y=2)]).show() ``` ``` scala 16/06/19 16:25:56 ERROR Executor: Exception in task 6.0 in stage 66.0 (TID 316) java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 1 values are provided. at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:147) at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656) at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) ``` The behaviour seems right but it seems it might confuse users just like this JIRA was reported. This PR adds the explanation for `Row` class. ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #13771 from HyukjinKwon/SPARK-13748.
-
- Jan 06, 2017
-
-
sueann authored
## What changes were proposed in this pull request? Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions. ## How was this patch tested? Ran unit tests and checked the log file (see output in comments). Author: sueann <sueann@databricks.com> Closes #16480 from sueann/SPARK-18194.
-
Tathagata Das authored
[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options ## What changes were proposed in this pull request? Updates - Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure. - Updated Output Modes section with Update mode - Added options for all the sources and sinks --------------------------- ---------------------------  --------------------------- --------------------------- <img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png"> <img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png"> --------------------------- ---------------------------    Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16468 from tdas/SPARK-19074.
-
zuotingbing authored
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19083# sbin/start-history-server.sh script use of $ without quotes, this will affect the length of args which used in HistoryServerArguments::parse(args: List[String]) Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #16484 from zuotingbing/sh.
-
Kay Ousterhout authored
In the existing code, there are three layers of serialization involved in sending a task from the scheduler to an executor: - A Task object is serialized - The Task object is copied to a byte buffer that also contains serialized information about any additional JARs, files, and Properties needed for the task to execute. This byte buffer is stored as the member variable serializedTask in the TaskDescription class. - The TaskDescription is serialized (in addition to the serialized task + JARs, the TaskDescription class contains the task ID and other metadata) and sent in a LaunchTask message. While it *is* necessary to have two layers of serialization, so that the JAR, file, and Property info can be deserialized prior to deserializing the Task object, the third layer of deserialization is unnecessary. This commit eliminates a layer of serialization by moving the JARs, files, and Properties into the TaskDescription class. This commit also serializes the Properties manually (by traversing the map), as is done with the JARs and files, which reduces the final serialized size. Unit tests This is a simpler alternative to the approach proposed in #15505. shivaram and I did some benchmarking of this and #15505 on a 20-machine m2.4xlarge EC2 machines (160 cores). We ran ~30 trials of code [1] (a very simple job with 10K tasks per stage) and measured the average time per stage: Before this change: 2490ms With this change: 2345 ms (so ~6% improvement over the baseline) With witgo's approach in #15505: 2046 ms (~18% improvement over baseline) The reason that #15505 has a more significant improvement is that it also moves the serialization from the TaskSchedulerImpl thread to the CoarseGrainedSchedulerBackend thread. I added that functionality on top of this change, and got almost the same improvement [1] as #15505 (average of 2103ms). I think we should decouple these two changes, both so we have some record of the improvement form each individual improvement, and because this change is more about simplifying the code base (the improvement is negligible) while the other is about performance improvement. The plan, currently, is to merge this PR and then merge the remaining part of #15505 that moves serialization. [1] The reason the improvement wasn't quite as good as with #15505 when we ran the benchmarks is almost certainly because, at the point when we ran the benchmarks, I hadn't updated the code to manually serialize the Properties (instead the code was using Java's default serialization for the Properties object, whereas #15505 manually serialized the Properties). This PR has since been updated to manually serialize the Properties, just like the other maps. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #16053 from kayousterhout/SPARK-17931.
-
jerryshao authored
## What changes were proposed in this pull request? Current HistoryServer's ACLs is derived from application event-log, which means the newly changed ACLs cannot be applied to the old data, this will become a problem where newly added admin cannot access the old application history UI, only the new application can be affected. So here propose to add admin ACLs for history server, any configured user/group could have the view access to all the applications, while the view ACLs derived from application run-time still take effect. ## How was this patch tested? Unit test added. Author: jerryshao <sshao@hortonworks.com> Closes #16470 from jerryshao/SPARK-19033.
-
Michal Senkyr authored
[SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list) ## What changes were proposed in this pull request? Added a `to` call at the end of the code generated by `ScalaReflection.deserializerFor` if the requested type is not a supertype of `WrappedArray[_]` that uses `CanBuildFrom[_, _, _]` to convert result into an arbitrary subtype of `Seq[_]`. Care was taken to preserve the original deserialization where it is possible to avoid the overhead of conversion in cases where it is not needed `ScalaReflection.serializerFor` could already be used to serialize any `Seq[_]` so it was not altered `SQLImplicits` had to be altered and new implicit encoders added to permit serialization of other sequence types Also fixes [SPARK-16815] Dataset[List[T]] leads to ArrayStoreException ## How was this patch tested? ```bash ./build/mvn -DskipTests clean package && ./dev/run-tests ``` Also manual execution of the following sets of commands in the Spark shell: ```scala case class TestCC(key: Int, letters: List[String]) val ds1 = sc.makeRDD(Seq( (List("D")), (List("S","H")), (List("F","H")), (List("D","L","L")) )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC] val test1=ds1.map{_.key} test1.show ``` ```scala case class X(l: List[String]) spark.createDataset(Seq(List("A"))).map(X).show ``` ```scala spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect ``` After adding arbitrary sequence support also tested with the following commands: ```scala case class QueueClass(q: scala.collection.immutable.Queue[Int]) spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #16240 from michalsenkyr/sql-caseclass-list-fix.
-
- Jan 05, 2017
-
-
Kevin Yu authored
## What changes were proposed in this pull request? This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite. Based on the discussion, we will create `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory. This is the high level grouping for IN subquery: `subquery/in-subquery/` `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` `subquery/in-subquery/not-in-group-by.sql` `subquery/in-subquery/in-order-by.sql` `subquery/in-subquery/in-limit.sql` `subquery/in-subquery/in-having.sql` `subquery/in-subquery/in-joins.sql` `subquery/in-subquery/not-in-joins.sql` `subquery/in-subquery/in-set-operations.sql` `subquery/in-subquery/in-with-cte.sql` `subquery/in-subquery/not-in-with-cte.sql` subquery/in-subquery/in-multiple-columns.sql` We will deliver it through multiple prs, this is the first pr for the IN subquery, it has `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` These are the results from running on DB2. [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt) [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt) ## How was this patch tested? This patch is adding tests. Author: Kevin Yu <qyu@us.ibm.com> Closes #16337 from kevinyu98/spark-18871.
-
Yanbo Liang authored
## What changes were proposed in this pull request? Set correct column names for ```force to use probability2prediction``` in ```LogisticRegressionSuite```. ## How was this patch tested? Change unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16477 from yanboliang/lor-pred.
-
Wenchen Fan authored
## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16296 from cloud-fan/create-table.
-
Rui Li authored
## What changes were proposed in this pull request? TaskResultGetter tries to deserialize the TaskEndReason before handling the failed task. If an error is thrown during deserialization, the failed task won't be handled, which leaves the job hanging. The PR proposes to handle the failed task in a finally block. ## How was this patch tested? In my case I hit a NoClassDefFoundError and the job hangs. Manually verified the patch can fix it. Author: Rui Li <rui.li@intel.com> Author: Rui Li <lirui@apache.org> Author: Rui Li <shlr@cn.ibm.com> Closes #12775 from lirui-intel/SPARK-14958.
-
Wenchen Fan authored
## What changes were proposed in this pull request? When we append data to a partitioned table with `DataFrameWriter.saveAsTable`, there are 2 issues: 1. doesn't work when the partition has custom location. 2. will recover all partitions This PR fixes them by moving the special partition handling code from `DataSourceAnalysis` to `InsertIntoHadoopFsRelationCommand`, so that the `DataFrameWriter.saveAsTable` code path can also benefit from it. ## How was this patch tested? newly added regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16460 from cloud-fan/append.
-
- Jan 04, 2017
-
-
uncleGen authored
## What changes were proposed in this pull request? add streaming rest api doc related to pr #16253 cc saturday-shi srowen ## How was this patch tested? Author: uncleGen <hustyugm@gmail.com> Closes #16414 from uncleGen/SPARK-19009.
-
Kay Ousterhout authored
This commit changes Utils.writeByteBuffer so that it does not change the position of the ByteBuffer that it writes out, and adds a unit test for this functionality. cc mridulm Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #16462 from kayousterhout/SPARK-19062.
-
Herman van Hovell authored
## What changes were proposed in this pull request? Dataset actions currently spin off a new `Dataframe` only to track query execution. This PR simplifies this code path by using the `Dataset.queryExecution` directly. This PR also merges the typed and untyped action evaluation paths. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #16466 from hvanhovell/SPARK-19070.
-
Niranjan Padmanabhan authored
## What changes were proposed in this pull request? There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words. ## How was this patch tested? N/A since only docs or comments were updated. Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com> Closes #16455 from neurons/np.structure_streaming_doc.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? eliminate unnecessary extra pass in NB's train ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16453 from zhengruifeng/nb_getNC.
-
Wenchen Fan authored
## What changes were proposed in this pull request? Now all aggregation functions support partial aggregate, we can remove the `supportsPartual` flag in `AggregateFunction` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #16461 from cloud-fan/partial.
-
mingfei authored
## What changes were proposed in this pull request? LauncherState should be only set to SUBMITTED after the application is submitted. Currently the state is set before the application is actually submitted. ## How was this patch tested? no test is added in this patch Author: mingfei <mingfei.smf@alipay.com> Closes #16459 from shimingfei/fixLauncher.
-
Wenchen Fan authored
## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/16402 we made a mistake that, when double/float is infinity, the `Literal` codegen will output boxed value and cause wrong result. This PR fixes this by special handling infinity to not output boxed value. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #16469 from cloud-fan/literal.
-
- Jan 03, 2017
-
-
gatorsmile authored
[SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog ### What changes were proposed in this pull request? The data in the managed table should be deleted after table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for the partition when they adding a partition. This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`. ### How was this patch tested? Added test cases for both HiveExternalCatalog and InMemoryCatalog Author: gatorsmile <gatorsmile@gmail.com> Closes #16448 from gatorsmile/unsetSerdeProp.
-
Devaraj K authored
## What changes were proposed in this pull request? Not adding the Killed applications for retry. ## How was this patch tested? I have verified manually in the Mesos cluster, with the changes the killed applications move to Finished Drivers section and will not retry. Author: Devaraj K <devaraj@apache.org> Closes #13323 from devaraj-kavali/SPARK-15555.
-
Dongjoon Hyun authored
[SPARK-18877][SQL] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar` ## What changes were proposed in this pull request? CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`. **decimal.csv** ``` 9.03E+12 1.19E+11 ``` **BEFORE** ```scala scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema root |-- _c0: decimal(3,-9) (nullable = true) scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show 16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3 ``` **AFTER** ```scala scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema root |-- _c0: decimal(4,-9) (nullable = true) scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show +---------+ | _c0| +---------+ |9.030E+12| | 1.19E+11| +---------+ ``` ## How was this patch tested? Pass the newly add test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16320 from dongjoon-hyun/SPARK-18877.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? Currently collect_set/collect_list aggregation expression don't support partial aggregation. This patch is to enable partial aggregation for them. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16371 from viirya/collect-partial-support.
-
Weiqing Yang authored
## What changes were proposed in this pull request? Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`. ## How was this patch tested? Pass the build and unit tests. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #16426 from weiqingy/testIssue.
-
- Jan 02, 2017
-
-
Zhenhua Wang authored
## What changes were proposed in this pull request? We add a cbo configuration to switch between default stats and estimated stats. We also define a new statistics method `planStats` in LogicalPlan with conf as its parameter, in order to pass the cbo switch and other estimation related configurations in the future. `planStats` is used on the caller sides (i.e. in Optimizer and Strategies) to make transformation decisions based on stats. ## How was this patch tested? Add a test case using a dummy LogicalPlan. Author: Zhenhua Wang <wzh_zju@163.com> Closes #16401 from wzhfy/cboSwitch.
-
gatorsmile authored
### What changes were proposed in this pull request? Remove useless `databaseName ` from `SimpleCatalogRelation`. ### How was this patch tested? Existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #16438 from gatorsmile/removeDBFromSimpleCatalogRelation.
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes to check pep8 against all other Python scripts and fix the errors as below: ```bash ./dev/create-release/generate-contributors.py ./dev/create-release/releaseutils.py ./dev/create-release/translate-contributors.py ./dev/lint-python ./python/docs/epytext.py ./examples/src/main/python/mllib/decision_tree_classification_example.py ./examples/src/main/python/mllib/decision_tree_regression_example.py ./examples/src/main/python/mllib/gradient_boosting_classification_example.py ./examples/src/main/python/mllib/gradient_boosting_regression_example.py ./examples/src/main/python/mllib/linear_regression_with_sgd_example.py ./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py ./examples/src/main/python/mllib/naive_bayes_example.py ./examples/src/main/python/mllib/random_forest_classification_example.py ./examples/src/main/python/mllib/random_forest_regression_example.py ./examples/src/main/python/mllib/svm_with_sgd_example.py ./examples/src/main/python/streaming/network_wordjoinsentiments.py ./sql/hive/src/test/resources/data/scripts/cat.py ./sql/hive/src/test/resources/data/scripts/cat_error.py ./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py ./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py ./sql/hive/src/test/resources/data/scripts/escapednewline.py ./sql/hive/src/test/resources/data/scripts/escapedtab.py ./sql/hive/src/test/resources/data/scripts/input20_script.py ./sql/hive/src/test/resources/data/scripts/newline.py ``` ## How was this patch tested? - `./python/docs/epytext.py` ```bash cd ./python/docs $$ make html ``` - pep8 check (Python 2.7 / Python 3.3.6) ``` ./dev/lint-python ``` - `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working) ```bash python -m doctest -v ./dev/merge_spark_pr.py ``` - `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working) ```bash python generate-contributors.py python translate-contributors.py ``` - Examples (Python 2.7 / Python 3.3.6) ```bash ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.p ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py ``` - Examples (Python 2.7 only / Python 3.3.6 not working) ``` ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py ``` - `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes) Manually tested only changed ones. - `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working) Manually tested this after disabling actually adding comments and links. And also via Jenkins tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #16405 from HyukjinKwon/minor-pep8.
-
hyukjinkwon authored
## What changes were proposed in this pull request? There are two tests failing on Windows due to the different newlines. ``` - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds) "{ "id" : "39788670-6722-48b7-a248-df6ba08722ac", "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", "name" : "myName", ... }" did not equal "{ "id" : "39788670-6722-48b7-a248-df6ba08722ac", "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", "name" : "myName", ... }" ... ``` ``` - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds) "{ "message" : "active", "isDataAvailable" : true, "isTriggerActive" : false }" did not equal "{ "message" : "active", "isDataAvailable" : true, "isTriggerActive" : false }" ... ``` The reason is, `pretty` in `org.json4s.pretty` writes OS-dependent newlines but the string defined in the tests are `\n`. This ends up with test failures. This PR proposes to compare these regardless of newline concerns. ## How was this patch tested? Manually tested via AppVeyor. **Before** https://ci.appveyor.com/project/spark-test/spark/build/417-newlines-fix-before **After** https://ci.appveyor.com/project/spark-test/spark/build/418-newlines-fix Author: hyukjinkwon <gurwls223@gmail.com> Closes #16433 from HyukjinKwon/tests-StreamingQueryStatusAndProgressSuite.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the doc is not updated yet. ## How was this patch tested? N/A. Just doc change. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16444 from viirya/minor-credential-provider-doc.
-
Liwei Lin authored
## What changes were proposed in this pull request? Currently some code snippets in the programming guide just do not compile. We should fix them. ## How was this patch tested? ``` SKIP_API=1 jekyll build ``` ## Screenshot from part of the change:  Author: Liwei Lin <lwlin7@gmail.com> Closes #16442 from lw-lin/ss-pro-guide-.
-
Sean Owen authored
Closes #12968 Closes #16215 Closes #16212 Closes #16086 Closes #15713 Closes #16413 Closes #16396 Author: Sean Owen <sowen@cloudera.com> Closes #16447 from srowen/CloseStalePRs.
-