- Apr 13, 2016
-
-
Davies Liu authored
## What changes were proposed in this pull request? When prune the partitions or push down predicates, case-sensitivity is not respected. In order to make it work with case-insensitive, this PR update the AttributeReference inside predicate to use the name from schema. ## How was this patch tested? Add regression tests for case-insensitive. Author: Davies Liu <davies@databricks.com> Closes #12371 from davies/case_insensi.
-
Bryan Cutler authored
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params and JavaCallable is a separate mixin that provides methods to make Java calls. This change simplifies the class structure and to define the Java wrapper in a plain base class along with methods to make Java calls. Also, renames Java wrapper classes to better reflect their purpose. Ran existing Python ml tests and generated documentation to test this change. Author: Bryan Cutler <cutlerb@gmail.com> Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-13089 Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files). Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11015 from hhbyyh/naiveBayesDoc.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? Add python CountVectorizerExample ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11917 from zhengruifeng/cv_pe.
-
Yanbo Liang authored
## What changes were proposed in this pull request? * Modify ```KMeansSummary.clusterSizes``` method to make it robust to empty clusters. * Add unit test for spark.ml ```KMeansSummary```. * Add Since tag. ## How was this patch tested? unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12254 from yanboliang/spark-14375.
-
Yanbo Liang authored
## What changes were proposed in this pull request? GLM training summaries should provide solver. ## How was this patch tested? Unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12253 from yanboliang/spark-14461.
-
Yanbo Liang authored
```PrefixSpanModel``` supports ```save/load```. It's similar with #9267. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10664 from yanboliang/spark-10386.
-
Davies Liu authored
## What changes were proposed in this pull request? Right now, filter push down only works with Project, Aggregate, Generate and Join, they can't be pushed through many other plans. This PR added support for Union, Intersect, Except and all unary plans. ## How was this patch tested? Added tests. Author: Davies Liu <davies@databricks.com> Closes #12342 from davies/filter_hint.
-
Yanbo Liang authored
## What changes were proposed in this pull request? * Added save/load for ```GBTClassifier/GBTClassificationModel/GBTRegressor/GBTRegressionModel```. * Meanwhile, I modified ```EnsembleModelReadWrite.saveImpl/loadImpl``` to support save/load ```treeWeights```. ## How was this patch tested? Adds standard unit tests for GBT save/load. cc jkbradley GayathriMurali Author: Yanbo Liang <ybliang8@gmail.com> Closes #12230 from yanboliang/spark-13783.
-
Andrew Or authored
## What changes were proposed in this pull request? This patch implements the `CREATE TABLE` command using the `SessionCatalog`. Previously we handled only `CTAS` and `CREATE TABLE ... USING`. This requires us to refactor `CatalogTable` to accept various fields (e.g. bucket and skew columns) and pass them to Hive. WIP: Note that I haven't verified whether this actually works yet! But I believe it does. ## How was this patch tested? Tests will come in a future commit. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12271 from andrewor14/create-table-ddl.
-
Timothy Hunter authored
## What changes were proposed in this pull request? This adds extra logging information about a `LogisticRegression` estimator when being fit on a dataset. With this PR, you see the following extra lines when running the example in the documentation: ``` 16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training: numPartitions=1 storageLevel=StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1) 16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): {"regParam":0.3,"elasticNetParam":0.8,"maxIter":10} ... 16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numClasses=2 16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numFeatures=692 ... 16/04/13 07:19:01 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training finished ``` ## How was this patch tested? This PR was manually tested. Author: Timothy Hunter <timhunter@databricks.com> Closes #12331 from thunterdb/1604-instrumentation.
-
Xiangrui Meng authored
This reverts commit d2a819a6.
-
Charles Allen authored
This patch makes the postStartHook throw an IllegalStateException if the SparkContext is shutdown while it is waiting for the backend to be ready Author: Charles Allen <charles@allen-net.com> Closes #12301 from drcrallen/SPARK-14537.
-
Liwei Lin authored
## What changes were proposed in this pull request? - updated `OFF_HEAP` semantics for `StorageLevels.java` - updated `OFF_HEAP` semantics for `storagelevel.py` ## How was this patch tested? no need to test Author: Liwei Lin <lwlin7@gmail.com> Closes #12126 from lw-lin/storagelevel.py.
-
- Apr 12, 2016
-
-
Wenchen Fan authored
## What changes were proposed in this pull request? address this comment: https://github.com/apache/spark/pull/12322#discussion_r59417359 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #12346 from cloud-fan/tmp.
-
hyukjinkwon authored
## What changes were proposed in this pull request? It looks several recent commits for datasources (maybe while removing old `HadoopFsRelation` interface) missed removing some unused imports. This PR removes some unused imports in datasources. ## How was this patch tested? `sbt scalastyle` and some unit tests for them. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12326 from HyukjinKwon/minor-imports.
-
Shixiong Zhu authored
## What changes were proposed in this pull request? There is a race condition in `StreamExecution.processAllAvailable`. Here is an execution order to reproduce it. | Time |Thread 1 | MicroBatchThread | |:-------------:|:-------------:|:-----:| | 1 | | `dataAvailable in constructNextBatch` returns false | | 2 | addData(newData) | | | 3 | `noNewData = false` in processAllAvailable | | | 4 | | noNewData = true | | 5 | `noNewData` is true so just return | | The root cause is that `checking dataAvailable and change noNewData to true` is not atomic. This PR puts these two actions into `synchronized` to make sure they are atomic. In addition, this PR also has the following changes: - Make `committedOffsets` and `availableOffsets` volatile to make sure they can be seen in other threads. - Copy the reference of `availableOffsets` to a local variable so that `sourceStatuses` can use a snapshot of `availableOffsets`. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12339 from zsxwing/race-condition.
-
Davies Liu authored
## What changes were proposed in this pull request? The wide schema, the expression of fields will be splitted into multiple functions, but the variable for loopVar can't be accessed in splitted functions, this PR change them as class member. ## How was this patch tested? Added regression test. Author: Davies Liu <davies@databricks.com> Closes #12338 from davies/nested_row.
-
Sital Kedia authored
## What changes were proposed in this pull request? Fix memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOM and also memory under utilization. This is a regression partially introduced in PR https://github.com/apache/spark/pull/9241 ## How was this patch tested? Tested by running a job and observed around 30% speedup after this change. Author: Sital Kedia <skedia@fb.com> Closes #12285 from sitalkedia/executor_oom.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch changes the connection creation logic in the network client module to avoid DNS resolution when reusing connections. ## How was this patch tested? Testing in production. This is too difficult to test in isolation (for high fidelity unit tests, we'd need to change the DNS resolution behavior in the JVM). Author: Reynold Xin <rxin@databricks.com> Closes #12315 from rxin/SPARK-14547.
-
Davies Liu authored
## What changes were proposed in this pull request? This PR improve the performance of SQL UI by: 1) remove the details column in all executions page (the first page in SQL tab). We can check the details by enter the execution page. 2) break-all is super slow in Chrome recently, so switch to break-word. 3) Using "display: none" to hide a block. 4) using one js closure for for all the executions, not one for each. 5) remove the height limitation of details, don't need to scroll it in the tiny window. ## How was this patch tested? Exists tests.  Author: Davies Liu <davies@databricks.com> Closes #12311 from davies/ui_perf.
-
Terence Yim authored
## What changes were proposed in this pull request? Shutting down `QueuedThreadPool` used by Jetty `Server` to avoid threads leakage after SparkContext is stopped. Note: If this fix is going to apply to the `branch-1.6`, one more patch on the `NettyRpcEnv` class is needed so that the `NettyRpcEnv._fileServer.shutdown` is called in the `NettyRpcEnv.cleanup` method. This is due to the removal of `_fileServer` field in the `NettyRpcEnv` class in the master branch. Please advice if a second PR is necessary for bring this fix back to `branch-1.6` ## How was this patch tested? Ran the ./dev/run-tests locally Author: Terence Yim <terence@cask.co> Closes #12318 from chtyim/fixes/SPARK-14513-thread-leak.
-
bomeng authored
## What changes were proposed in this pull request? Before we are using `AnalysisException`, `ParseException`, `NoSuchFunctionException` etc when a parsing error encounters. I am trying to make it consistent and also **minimum** code impact to the current implementation by changing the class hierarchy. 1. `NoSuchItemException` is removed, since it is an abstract class and it just simply takes a message string. 2. `NoSuchDatabaseException`, `NoSuchTableException`, `NoSuchPartitionException` and `NoSuchFunctionException` now extends `AnalysisException`, as well as `ParseException`, they are all under `AnalysisException` umbrella, but you can also determine how to use them in a granular way. ## How was this patch tested? The existing test cases should cover this patch. Author: bomeng <bmeng@us.ibm.com> Closes #12314 from bomeng/SPARK-14414.
-
Davies Liu authored
## What changes were proposed in this pull request? Currently, Union only takes intersect of the constraints from it's children, all others are dropped, we should try to merge them together. This PR try to merge the constraints that have the same reference but came from different children, for example: `a > 10` and `a < 100` could be merged as `a > 10 || a < 100`. ## How was this patch tested? Added more cases in existing test. Author: Davies Liu <davies@databricks.com> Closes #12328 from davies/union_const.
-
Liwei Lin authored
## What changes were proposed in this pull request? - `StateStoreConf.**max**DeltasForSnapshot` was renamed to `StateStoreConf.**min**DeltasForSnapshot` - some state switch checks were added - improved consistency between method names and string literals - other comments & typo fix ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12323 from lw-lin/streaming-state-clean-up.
-
Yanbo Liang authored
## What changes were proposed in this pull request? SparkR does not support type of vector which is the default type of feature column in ML. R predict also does not output intermediate feature column. So SparkR ```predict``` should not output feature column. In this PR, I only fix this issue for ```naiveBayes``` and ```survreg```. ```kmeans``` has the right code route already and ```glm``` will be fixed at SparkRWrapper refactor(#12294). ## How was this patch tested? No new tests. cc mengxr shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes #11958 from yanboliang/spark-14147.
-
Xiangrui Meng authored
## What changes were proposed in this pull request? Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems of using `__THIS__` are: * It doesn't work under HiveContext (in Spark 1.6) * Race conditions ## How was this patch tested? * Manual test with HiveContext. * Added a unit test for `transformSchema` to improve coverage. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #12330 from mengxr/SPARK-14563.
-
Kai Jiang authored
## What changes were proposed in this pull request? Python API for GeneralizedLinearRegression JIRA: https://issues.apache.org/jira/browse/SPARK-13597 ## How was this patch tested? The patch is tested with Python doctest. Author: Kai Jiang <jiangkai@gmail.com> Closes #11468 from vectorijk/spark-13597.
-
Yanbo Liang authored
## What changes were proposed in this pull request? AFTSurvivalRegression should support feature standardization, it will improve the convergence rate. Test the convergence rate on the [Ovarian](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/ovarian.html) data which is standard data comes with Survival library in R, * without standardization(before this PR) -> 74 iterations. * with standardization(after this PR) -> 38 iterations. But after this fix, with or without ```standardization``` will converge to the same solution. It means that ```standardization = false``` will run the same code route as ```standardization = true```. Because if the features are not standardized at all, it will result convergency issue when the features have very different scales. This behavior is the same as ML [```LinearRegression``` and ```LogisticRegression```](https://issues.apache.org/jira/browse/SPARK-8522). See more discussion about this topic at #11247. cc mengxr ## How was this patch tested? unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11365 from yanboliang/spark-13322.
-
Yanbo Liang authored
* SparkR glm supports families and link functions which match R's signature for family. * SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```. * This PR is focus on glm() and predict(), summary statistics will be done in a separate PR after this get in. * This PR depends on #12287 which make GLMs support link prediction at Scala side. After that merged, I will add more tests for predict() to this PR. Unit tests. cc mengxr jkbradley hhbyyh Author: Yanbo Liang <ybliang8@gmail.com> Closes #12294 from yanboliang/spark-12566.
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Now that we have a single location for storing checkpointed state. This PR just propagates the checkpoint location into FileStreamSource so that we don't have one random log off on its own. ## How was this patch tested? test("metadataPath should be in checkpointLocation") Author: Shixiong Zhu <shixiong@databricks.com> Closes #12247 from zsxwing/file-source-log-location.
-
Yong Tang authored
## What changes were proposed in this pull request? This PR tries to support more options for feature subset size in RandomForest implementation. Previously, RandomForest only support "auto", "all", "sort", "log2", "onethird". This PR tries to support any given value to allow model search. In this PR, `featureSubsetStrategy` could be passed with: a) a real number in the range of `(0.0-1.0]` that represents the fraction of the number of features in each subset, b) an integer number (`>0`) that represents the number of features in each subset. ## How was this patch tested? Two tests `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite` have been updated to check the additional options for params in this PR. An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR. Author: Yong Tang <yong.tang.github@outlook.com> Closes #11989 from yongtang/SPARK-3724.
-
Cheng Lian authored
[SPARK-14488][SPARK-14493][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table ## What changes were proposed in this pull request? When planning logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generates a `CreateMetastoreDataSourceAsSelect`. This PR fixes this issue generating `CreateTempTableUsingAsSelect` when `temporary` is true. This PR also fixes SPARK-14493 since the root cause of SPARK-14493 is that we were `CreateMetastoreDataSourceAsSelect` uses default Hive warehouse location when `PATH` data source option is absent. ## How was this patch tested? Added a test case to create a temporary table using the target syntax and check whether it's indeed a temporary table. Author: Cheng Lian <lian@databricks.com> Closes #12303 from liancheng/spark-14488-fix-ctas-using.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule. ``` case: Always omit braces in case clauses. ``` This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code. ## How was this patch tested? Pass the Jenkins tests (including Scala style checking) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12280 from dongjoon-hyun/SPARK-14508.
-
Wenchen Fan authored
## What changes were proposed in this pull request? Now `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12300 from cloud-fan/remove.
-
Wenchen Fan authored
## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/12047/files#diff-94a1f59bcc9b6758c4ca874652437634R529, we may split field expressions codes in `CreateExternalRow` to support wide table. However, the whole stage codegen framework doesn't support it, because the input for expressions is not always the input row, but can be `CodeGenContext.currentVars`, which doesn't work well with `CodeGenContext.splitExpressions`. Actually we do have a check to guard against this cases, but it's incomplete, it only checks output fields. This PR improves the whole stage codegen support check, to disable it if there are too many input fields, so that we can avoid splitting field expressions codes in `CreateExternalRow` for whole stage codegen. TODO: Is it a better solution if we can make `CodeGenContext.currentVars` work well with `CodeGenContext.splitExpressions`? ## How was this patch tested? new test in DatasetSuite. Author: Wenchen Fan <wenchen@databricks.com> Closes #12322 from cloud-fan/codegen.
-
gatorsmile authored
#### What changes were proposed in this pull request? In this PR, we are trying to address the comment in the original PR: https://github.com/apache/spark/commit/dfce9665c4b2b29a19e6302216dae2800da68ff9#commitcomment-17057030 In this PR, we checks if table/view exists at the beginning and then does not need to capture the exceptions, including `NoSuchTableException` and `InvalidTableException`. We still capture the NonFatal exception when doing `sqlContext.cacheManager.tryUncacheQuery`. #### How was this patch tested? The existing test cases should cover the code changes of this PR. Author: gatorsmile <gatorsmile@gmail.com> Closes #12321 from gatorsmile/dropViewFollowup.
-
- Apr 11, 2016
-
-
Andrew Or authored
## What changes were proposed in this pull request? This implements a few alter table partition commands using the `SessionCatalog`. In particular: ``` ALTER TABLE ... ADD PARTITION ... ALTER TABLE ... DROP PARTITION ... ALTER TABLE ... RENAME PARTITION ... TO ... ``` The following operations are not supported, and an `AnalysisException` with a helpful error message will be thrown if the user tries to use them: ``` ALTER TABLE ... EXCHANGE PARTITION ... ALTER TABLE ... ARCHIVE PARTITION ... ALTER TABLE ... UNARCHIVE PARTITION ... ALTER TABLE ... TOUCH ... ALTER TABLE ... COMPACT ... ALTER TABLE ... CONCATENATE MSCK REPAIR TABLE ... ``` ## How was this patch tested? `DDLSuite`, `DDLCommandSuite` and `HiveDDLCommandSuite` Author: Andrew Or <andrew@databricks.com> Closes #12220 from andrewor14/alter-partition-ddl.
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Fixes to eliminate warnings during package and doc builds. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12263 from jkbradley/warning-cleanups.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-14520 `VectorizedParquetInputFormat` inherits `ParquetInputFormat` and overrides `createRecordReader`. However, its overridden `createRecordReader` returns a `ParquetRecordReader`. It should return a `RecordReader`. Otherwise, `ClassCastException` will be thrown. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12292 from viirya/fix-vectorized-input-format.
-