- Apr 30, 2016
-
-
pshearer authored
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13973 Following discussion with srowen the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing. ## How was this patch tested? Manual testing; set IPYTHON=1 and verified that the error message prints. Author: pshearer <pshearer@massmutual.com> Author: shearerp <shearerp@umich.edu> Closes #12528 from shearerp/master.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch removes some code that are no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #12806 from rxin/SPARK-15028.
-
Xiangrui Meng authored
## What changes were proposed in this pull request? Continue the work of #12789 to rename ml.asve/ml.load to write.ml/read.ml, which are more consistent with read.df/write.df and other methods in SparkR. I didn't rename `data` to `df` because we still use `predict` for prediction, which uses `newData` to match the signature in R. ## How was this patch tested? Existing unit tests. cc: yanboliang thunterdb Author: Xiangrui Meng <meng@databricks.com> Closes #12807 from mengxr/SPARK-14831.
-
Xiangrui Meng authored
## What changes were proposed in this pull request? As discussed in #12660, this PR renames * intermediateRDDStorageLevel -> intermediateStorageLevel * finalRDDStorageLevel -> finalStorageLevel The argument name in `ALS.train` will be addressed in SPARK-15027. ## How was this patch tested? Existing unit tests. Author: Xiangrui Meng <meng@databricks.com> Closes #12803 from mengxr/SPARK-14412.
-
Sean Owen authored
[SPARK-14533][MLLIB] RowMatrix.computeCovariance inaccurate when values are very large (partial fix) ## What changes were proposed in this pull request? Fix for part of SPARK-14533: trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299 which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #12779 from srowen/SPARK-14533.2.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? Now, `SQLContext` is used for backward-compatibility, we had better use `SparkSession` in Spark 2.0 examples. ## How was this patch tested? It's just example change. After building, run `bin/run-example org.apache.spark.examples.sql.RDDRelation`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12808 from dongjoon-hyun/rddrelation.
-
Xiangrui Meng authored
## What changes were proposed in this pull request? This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` to avoid boxing/unboxing. ## How was this patch tested? Exiting unit tests. cc: cloud-fan Author: Xiangrui Meng <meng@databricks.com> Closes #12805 from mengxr/SPARK-14850.
-
Marcelo Vanzin authored
There's actually a race here: the state of the handler was changed before the connection was set, so the test code could be notified of the state change, wake up, and still see the connection as null, triggering the assert. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12785 from vanzin/SPARK-14391.
-
Timothy Hunter authored
## What changes were proposed in this pull request? This PR splits the MLlib algorithms into two flavors: - the R flavor, which tries to mimic the existing R API for these algorithms (and works as an S4 specialization for Spark dataframes) - the Spark flavor, which follows the same API and naming conventions as the rest of the MLlib algorithms in the other languages In practice, the former calls the latter. ## How was this patch tested? The tests for the various algorithms were adapted to be run against both interfaces. Author: Timothy Hunter <timhunter@databricks.com> Closes #12789 from thunterdb/14831.
-
Wenchen Fan authored
## What changes were proposed in this pull request? This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT. ## How was this patch tested? existing tests and new test suite `UnsafeArraySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12640 from cloud-fan/ml.
-
hyukjinkwon authored
[SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource. ## What changes were proposed in this pull request? This PR adds the support to specify custom date format for `DateType` and `TimestampType`. For `TimestampType`, this uses the given format to infer schema and also to convert the values For `DateType`, this uses the given format to convert the values. If the `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwords compatibility. When it's given, then it uses `SimpleDateFormat` for parsing data. In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences. In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`. ## How was this patch tested? unit tests are used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11550 from HyukjinKwon/SPARK-13667.
-
Yin Huai authored
## What changes were proposed in this pull request? CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser. ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12796 from yhuai/removeDataTypeParser.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch fixes a null handling bug in EqualNullSafe's code generation. ## How was this patch tested? Updated unit test so they would fail without the fix. Closes #12628. Author: Reynold Xin <rxin@databricks.com> Author: Arash Nabili <arash@levyx.com> Closes #12799 from rxin/equalnullsafe.
-
Nick Pentreath authored
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them. ## How was this patch tested? New test cases in `ALSSuite` and `tests.py`. cc yanboliang jkbradley sethah rishabhbhardwaj Author: Nick Pentreath <nickp@za.ibm.com> Closes #12660 from MLnick/SPARK-14412-als-storage-params.
-
- Apr 29, 2016
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14917 As it is described in the JIRA, it seems Hive 1.2.1 which Spark uses now supports snappy and none. So, this PR enables some tests for writing ORC files with compression codes, `SNAPPY` and `NONE`. ## How was this patch tested? Unittests in `OrcQuerySuite` and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12699 from HyukjinKwon/SPARK-14917.
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily. This support should be re-designed and added in the next release. ## How was this patch tested? Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields. Author: Joseph K. Bradley <joseph@databricks.com> Closes #12782 from jkbradley/remove-python-tuning-saveload.
-
Andrew Or authored
## What changes were proposed in this pull request? 1. Remove all the `spark.setConf` etc. Just expose `spark.conf` 2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused This was done for both the Python and Scala APIs. ## How was this patch tested? `SQLConfSuite`, python tests. This one fixes the failed tests in #12787 Closes #12787 Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12798 from yhuai/conf-api.
-
Wenchen Fan authored
## What changes were proposed in this pull request? The RPC framework will not serialize and deserialize messages in local mode, we should not call `acc.value` when receive heartbeat message, because the serialization hook of new accumulator may not be triggered and the `atDriverSide` flag may not be set. ## How was this patch tested? tested it locally via spark shell Author: Wenchen Fan <wenchen@databricks.com> Closes #12795 from cloud-fan/bug.
-
Yin Huai authored
## What changes were proposed in this pull request? This PR makes two changes: 1. We will propagate Spark Confs to HiveConf created in HiveClientImpl. So, users can also use spark conf to set warehouse location and metastore url. 2. In sql/hive, HiveClientImpl will be the only place where we create a new HiveConf. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12791 from yhuai/onlyUseHiveConfInHiveClientImpl.
-
tedyu authored
## What changes were proposed in this pull request? This PR proposes to use ConcurrentHashMap in place of HashMap for NewAccumulator.originals This should result in better performance. ## How was this patch tested? Existing unit test suite cloud-fan Author: tedyu <yuzhihong@gmail.com> Closes #12776 from tedyu/master.
-
Herman van Hovell authored
The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example : ```scala range(0, 10).registerTempTable("a") range(5, 15).registerTempTable("b") range(7, 25).registerTempTable("c") range(3, 12).registerTempTable("d") val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)") plan.explain(true) ``` Leads to the following Analyzed & Optimized plans: ``` == Parsed Logical Plan == ... == Analyzed Logical Plan == id: bigint, id: bigint, id: bigint Project [id#0L,id#4L,id#8L] +- Filter predicate-subquery#16 [(id#0L = id#12L)] : +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)] : +- Project [id#12L] : +- SubqueryAlias d : +- Range 3, 12, 1, 8, [id#12L] +- Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- SubqueryAlias a : : +- Range 0, 10, 1, 8, [id#0L] : +- SubqueryAlias b : +- Range 5, 15, 1, 8, [id#4L] +- SubqueryAlias c +- Range 7, 25, 1, 8, [id#8L] == Optimized Logical Plan == Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- Join LeftSemi, Some((id#0L = id#12L)) : : :- Range 0, 10, 1, 8, [id#0L] : : +- Range 3, 12, 1, 8, [id#12L] : +- Range 5, 15, 1, 8, [id#4L] +- Range 7, 25, 1, 8, [id#8L] == Physical Plan == ... ``` I have also taken the opportunity to move quite a bit of code around: - Rewriting subqueris and pulling out correlated predicated from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do. - I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child). - Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution. These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere. Current tests and added test cases in FilterPushdownSuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12720 from hvanhovell/SPARK-14858.
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Modified Kmeans to store cluster centers with one per row ## How was this patch tested? Existing tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12792 from jkbradley/kmeans-save-fix.
-
Andrew Or authored
## What changes were proposed in this pull request? Addresses comments in #12765. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12784 from andrewor14/python-followup.
-
Sun Rui authored
## What changes were proposed in this pull request? dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is: dapply(df, function(localDF) {}, schema = NULL) R function input: local data.frame from the partition on local node R function output: local data.frame Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply(). ## How was this patch tested? SparkR unit tests. Author: Sun Rui <rui.sun@intel.com> Author: Sun Rui <sunrui2016@gmail.com> Closes #12493 from sun-rui/SPARK-12919.
-
BenFradet authored
## What changes were proposed in this pull request? Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor} ## How was this patch tested? No tests involved since it's logging related. Author: BenFradet <benjamin.fradet@gmail.com> Closes #12536 from BenFradet/SPARK-14570.
-
Yin Huai authored
## What changes were proposed in this pull request? The hiveConf in HiveSessionState is not actually used anymore. Let's remove it. ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12786 from yhuai/removeHiveConf.
-
Cheng Lian authored
## What changes were proposed in this pull request? Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column. ## How was this patch tested? A test case is added to test the invalid sorting order by checking exception message. Author: Cheng Lian <lian@databricks.com> Closes #12759 from liancheng/spark-14981.
-
Reynold Xin authored
## What changes were proposed in this pull request? We recently inlined Hive's thrift server code in SPARK-15004. This patch removes the code related to zookeeper service discovery, Tez, and Hive on Spark, since they are irrelevant. ## How was this patch tested? N/A - removing dead code Author: Reynold Xin <rxin@databricks.com> Closes #12780 from rxin/SPARK-15004.
-
Yin Huai authored
This test always fail with sbt's hadoop 2.3 and 2.4 tests. Let'e disable it for now and investigate the problem. Author: Yin Huai <yhuai@databricks.com> Closes #12783 from yhuai/SPARK-15011-ignore.
-
Jeff Zhang authored
## What changes were proposed in this pull request? pyspark.ml API for LDA * LDA, LDAModel, LocalLDAModel, DistributedLDAModel * includes persistence This replaces [https://github.com/apache/spark/pull/10242] ## How was this patch tested? * doc test for LDA, including Param setters * unit test for persistence Author: Joseph K. Bradley <joseph@databricks.com> Author: Jeff Zhang <zjffdu@apache.org> Closes #12723 from jkbradley/zjffdu-SPARK-11940.
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Deprecated model field in LinearRegressionSummary Removed unnecessary Since annotations ## How was this patch tested? Existing tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12763 from jkbradley/lr-summary-api.
-
Yanbo Liang authored
SparkR ```glm``` and ```kmeans``` model persistence. Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Author: Gayathri Murali <gayathri.m.softie@gmail.com> Closes #12778 from yanboliang/spark-14311. Closes #12680 Closes #12683
-
Andrew Or authored
## What changes were proposed in this pull request? The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12765 from andrewor14/python-spark-session-more.
-
Davies Liu authored
## What changes were proposed in this pull request? This PR copy the thrift-server from hive-service-1.2 (including TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12764 from davies/thrift_server.
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? Add log instrumentation for parameters: rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha, userCol, itemCol, ratingCol, predictionCol, maxIter, regParam, nonnegative, checkpointInterval, seed Add log instrumentation for numUserFeatures and numItemFeatures ## How was this patch tested? Manual test: Set breakpoint in intellij and run def testALS(). Single step debugging and check the log method is called. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12560 from wangmiao1981/log.
-
dding3 authored
## What changes were proposed in this pull request? This PR removes duplicate implementation of compute in LogisticGradient class ## How was this patch tested? unit tests Author: dding3 <dingding@dingding-ubuntu.sh.intel.com> Closes #12747 from dding3/master.
-
Jakob Odersky authored
## What changes were proposed in this pull request? In the past, genjavadoc had issues with package private members which led the spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for scala versions 2.10, 2.11 **and 2.12**, hence a forked version for spark is no longer necessary. This pull request updates the build configuration to use the newest upstream genjavadoc. ## How was this patch tested? The build was run `sbt unidoc`. During the process javadoc emits some errors on the generated java stubs, however these errors were also present before the upgrade. Furthermore, the produced html is fine. Author: Jakob Odersky <jakob@odersky.com> Closes #12707 from jodersky/SPARK-14511-genjavadoc.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch removes executionHive from HiveSessionState and HiveSharedState. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12770 from rxin/SPARK-14994.
-
Sameer Agarwal authored
## What changes were proposed in this pull request? This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL. ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #12771 from sameeragarwal/tpcds-2.
-
gatorsmile authored
#### What changes were proposed in this pull request? Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). ```SQL SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 ``` Note: 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the enerated join conditions will be incorrect. This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like ```SQL test("except") { val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id") val df_right = Seq(1, 3).toDF("id") checkAnswer( df_left.except(df_right), Row(2) :: Row(2) :: Row(4) :: Nil ) } ``` After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`. #### How was this patch tested? Modified and added a few test cases to verify the optimization rule and the results of operators. Author: gatorsmile <gatorsmile@gmail.com> Closes #12736 from gatorsmile/exceptByAntiJoin.
-