- Apr 25, 2016
-
-
gatorsmile authored
[SPARK-14892][SQL][TEST] Disable the HiveCompatibilitySuite test case for INPUTDRIVER and OUTPUTDRIVER. #### What changes were proposed in this pull request? Disable the test case involving INPUTDRIVER and OUTPUTDRIVER, which are not supported #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #12662 from gatorsmile/disableInOutDriver.
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Removed instances of JavaMLWriter, JavaMLReader appearing in public Python API docs ## How was this patch tested? n/a Author: Joseph K. Bradley <joseph@databricks.com> Closes #12542 from jkbradley/javamlwriter-doc.
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? Add Python API in ML for GaussianMixture ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Add doctest and test cases are the same as mllib Python tests ./dev/lint-python PEP8 checks passed. rm -rf _build/* pydoc checks passed. ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-ml'] Finished test(python2.7): pyspark.ml.evaluation (18s) Finished test(python2.7): pyspark.ml.clustering (40s) Finished test(python2.7): pyspark.ml.classification (49s) Finished test(python2.7): pyspark.ml.recommendation (44s) Finished test(python2.7): pyspark.ml.feature (64s) Finished test(python2.7): pyspark.ml.regression (45s) Finished test(python2.7): pyspark.ml.tuning (30s) Finished test(python2.7): pyspark.ml.tests (56s) Tests passed in 106 seconds Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12402 from wangmiao1981/gmm.
-
Marcelo Vanzin authored
First, make all dependencies in the examples module provided, and explicitly list a couple of ones that somehow are promoted to compile by maven. This means that to run streaming examples, the streaming connector package needs to be provided to run-examples using --packages or --jars, just like regular apps. Also, remove a couple of outdated examples. HBase has had Spark bindings for a while and is even including them in the HBase distribution in the next version, making the examples obsolete. The same applies to Cassandra, which seems to have a proper Spark binding library already. I just tested the build, which passes, and ran SparkPi. The examples jars directory now has only two jars: ``` $ ls -1 examples/target/scala-2.11/jars/ scopt_2.11-3.3.0.jar spark-examples_2.11-2.0.0-SNAPSHOT.jar ``` Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12544 from vanzin/SPARK-14744.
-
Jason Lee authored
## What changes were proposed in this pull request? Removed expectedType arg from PySpark Param __init__, as suggested by the JIRA. ## How was this patch tested? Manually looked through all places that use Param. Compiled and ran all ML PySpark test cases before and after the fix. Author: Jason Lee <cjlee@us.ibm.com> Closes #12581 from jasoncl/SPARK-14768.
-
Cheng Lian authored
## What changes were proposed in this pull request? This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it. ## How was this patch tested? N/A Author: Cheng Lian <lian@databricks.com> Closes #12652 from liancheng/spark-14875.
-
Peter Ableda authored
## What changes were proposed in this pull request? Implement the same memory size validations for the StaticMemoryManager (Legacy) as the UnifiedMemoryManager has. ## How was this patch tested? Manual tests were done in CDH cluster. Test with small executor memory: ` spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn --executor-memory 15m --conf spark.memory.useLegacyMode=true /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples*.jar 10 ` Exception thrown: ``` ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Executor memory 15728640 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration. at org.apache.spark.memory.StaticMemoryManager$.org$apache$spark$memory$StaticMemoryManager$$getMaxExecutionMemory(StaticMemoryManager.scala:127) at org.apache.spark.memory.StaticMemoryManager.<init>(StaticMemoryManager.scala:46) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:352) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:289) at org.apache.spark.SparkContext.<init>(SparkContext.scala:462) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ``` Author: Peter Ableda <peter.ableda@cloudera.com> Closes #12395 from peterableda/SPARK-14636.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? add the checking for StepSize and Tol in sharedParams ## How was this patch tested? Unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12530 from zhengruifeng/ml_args_checking.
-
Eric Liang authored
## What changes were proposed in this pull request? Sbt compile and test should also run scalastyle. This makes it less likely you forget to run scalastyle and fail in jenkins. Scalastyle results are cached for efficiency. This patch was originally written by ahirreddy; I just fixed it up to work with scalastyle 0.8.0. ## How was this patch tested? Tested manually with `build/sbt package`. Author: Eric Liang <ekl@databricks.com> Closes #12555 from ericl/scalastyle.
-
Sameer Agarwal authored
## What changes were proposed in this pull request? This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a. ## How was this patch tested? 1. Added regression test in `DataFrameAggregateSuite`. 2. Verified that TPCDS q14a works Author: Sameer Agarwal <sameer@databricks.com> Closes #12651 from sameeragarwal/tpcds-fix.
-
felixcheung authored
[SPARK-14881] [PYTHON] [SPARKR] pyspark and sparkR shell default log level should match spark-shell/Scala ## What changes were proposed in this pull request? Change default logging to WARN for pyspark shell and sparkR shell for a much cleaner environment. ## How was this patch tested? Manually running pyspark and sparkR shell Author: felixcheung <felixcheung_m@hotmail.com> Closes #12648 from felixcheung/pylogging.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules. - Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later. - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency - Fix datatypes in `sparkr.md`. - Update a data result in `sparkr.md`. - Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`. - Other minor syntax fixes and a typo. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12649 from dongjoon-hyun/SPARK-14883.
-
- Apr 24, 2016
-
-
Yin Huai authored
[SPARK-14885][SQL] When creating a CatalogColumn, we should use the catalogString of a DataType object. ## What changes were proposed in this pull request? Right now, the data type field of a CatalogColumn is using the string representation. When we create this string from a DataType object, there are places where we use simpleString instead of catalogString. Although catalogString is the same as simpleString right now, it is still good to use catalogString. So, we will not silently introduce issues when we change the semantic of simpleString or the implementation of catalogString. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12654 from yhuai/useCatalogString.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items. - Adds a new line at the end of the files (19 files) - Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder) ## How was this patch tested? After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.) ```bash $ dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12632 from dongjoon-hyun/SPARK-14868.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch changes SparkSession to be case insensitive by default, in order to match other database systems. ## How was this patch tested? N/A - I'm sure some tests will fail and I will need to fix those. Author: Reynold Xin <rxin@databricks.com> Closes #12643 from rxin/SPARK-14876.
-
Reynold Xin authored
-
jliwork authored
!< means not less than which is equivalent to >= !> means not greater than which is equivalent to <= I'd to create a PR to support these two operators. I've added new test cases in: DataFrameSuite, ExpressionParserSuite, JDBCSuite, PlanParserSuite, SQLQuerySuite dilipbiswal viirya gatorsmile Author: jliwork <jiali@us.ibm.com> Closes #12316 from jliwork/SPARK-14548.
-
gatorsmile authored
#### What changes were proposed in this pull request? So far, we are capturing each unsupported Alter Table in separate visit functions. They should be unified and issue the same ParseException instead. This PR is to refactor the existing implementation and make error message consistent for Alter Table DDL. #### How was this patch tested? Updated the existing test cases and also added new test cases to ensure all the unsupported statements are covered. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12459 from gatorsmile/cleanAlterTable.
-
Jacek Laskowski authored
## What changes were proposed in this pull request? Added screenshot + minor fixes to improve reading ## How was this patch tested? Manual Author: Jacek Laskowski <jacek@japila.pl> Closes #12569 from jaceklaskowski/docs-accumulators.
-
Steve Loughran authored
Add to the REST API details on the ? args and examples from the test suite. I've used the existing table, adding all the fields to the second table. see [in the pr](https://github.com/steveloughran/spark/blob/history/SPARK-13267-doc-params/docs/monitoring.md). There's a slightly more sophisticated option: make the table 3 columns wide, and for all existing entries, have the initial `td` span 2 columns. The new entries would then have an empty 1st column, param in 2nd and text in 3rd, with any examples after a `br` entry. Author: Steve Loughran <stevel@hortonworks.com> Closes #11152 from steveloughran/history/SPARK-13267-doc-params.
-
mathieu longtin authored
## What changes were proposed in this pull request? In Python, sqlContext.getConf didn't allow getting the system default (getConf with one parameter). Now the following are supported: ``` sqlContext.getConf(confName) # System default if not locally set, this is new sqlContext.getConf(confName, myDefault) # myDefault if not locally set, old behavior ``` I also added doctests to this function. The original behavior does not change. ## How was this patch tested? Manually, but doctests were added. Author: mathieu longtin <mathieu.longtin@nuance.com> Closes #12488 from mathieulongtin/pyfixgetconf3.
-
Yin Huai authored
## What changes were proposed in this pull request? CreateMetastoreDataSource and CreateMetastoreDataSourceAsSelect are not Hive-specific. So, this PR moves them from sql/hive to sql/core. Also, I am adding `Command` suffix to these two classes. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12645 from yhuai/moveCreateDataSource.
-
- Apr 23, 2016
-
-
Tathagata Das authored
[SPARK-14833][SQL][STREAMING][TEST] Refactor StreamTests to test for source fault-tolerance correctly. ## What changes were proposed in this pull request? Current StreamTest allows testing of a streaming Dataset generated explicitly wraps a source. This is different from the actual production code path where the source object is dynamically created through a DataSource object every time a query is started. So all the fault-tolerance testing in FileSourceSuite and FileSourceStressSuite is not really testing the actual code path as they are just reusing the FileStreamSource object. This PR fixes StreamTest and the FileSource***Suite to test this correctly. Instead of maintaining a mapping of source --> expected offset in StreamTest (which requires reuse of source object), it now maintains a mapping of source index --> offset, so that it is independent of the source object. Summary of changes - StreamTest refactored to keep track of offset by source index instead of source - AddData, AddTextData and AddParquetData updated to find the FileStreamSource object from an active query, so that it can work with sources generated when query is started. - Refactored unit tests in FileSource***Suite to test using DataFrame/Dataset generated with public, rather than reusing the same FileStreamSource. This correctly tests fault tolerance. The refactoring changed a lot of indents in FileSourceSuite, so its recommended to hide whitespace changes with this - https://github.com/apache/spark/pull/12592/files?w=1 ## How was this patch tested? Refactored unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12592 from tdas/SPARK-14833.
-
Liang-Chi Hsieh authored
[SPARK-14838] [SQL] Set default size for ObjecType to avoid failure when estimating sizeInBytes in ObjectProducer ## What changes were proposed in this pull request? We have logical plans that produce domain objects which are `ObjectType`. As we can't estimate the size of `ObjectType`, we throw an `UnsupportedOperationException` if trying to do that. We should set a default size for `ObjectType` to avoid this failure. ## How was this patch tested? `DatasetSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12599 from viirya/skip-broadcast-objectproducer.
-
felixcheung authored
## What changes were proposed in this pull request? Fixed inadvertent roxygen2 doc changes, added class name change to programming guide Follow up of #12621 ## How was this patch tested? manually checked Author: felixcheung <felixcheung_m@hotmail.com> Closes #12647 from felixcheung/rdataframe.
-
tedyu authored
## What changes were proposed in this pull request? There was a typo in the message for second assertion in "returning batch for wide table" test ## How was this patch tested? Existing tests. Author: tedyu <yuzhihong@gmail.com> Closes #12639 from tedyu/master.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? TernaryExpressions should thows proper error message for itself. ```scala protected def nullSafeEval(input1: Any, input2: Any, input3: Any): Any = - sys.error(s"BinaryExpressions must override either eval or nullSafeEval") + sys.error(s"TernaryExpressions must override either eval or nullSafeEval") ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12642 from dongjoon-hyun/minor_fix_error_msg_in_ternaryexpression.
-
Reynold Xin authored
## What changes were proposed in this pull request? It is unnecessary as DataType.catalogString largely replaces the need for this class. ## How was this patch tested? Mostly removing dead code and should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12644 from rxin/SPARK-14877.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch improves error handling in view creation. CreateViewCommand itself will analyze the view SQL query first, and if it cannot successfully analyze it, throw an AnalysisException. In addition, I also added the following two conservative guards for easier identification of Spark bugs: 1. If there is a bug and the generated view SQL cannot be analyzed, throw an exception at runtime. Note that this is not an AnalysisException because it is not caused by the user and more likely indicate a bug in Spark. 2. SQLBuilder when it gets an unresolved plan, it will also show the plan in the error message. I also took the chance to simplify the internal implementation of CreateViewCommand, and *removed* a fallback path that would've masked an exception from before. ## How was this patch tested? 1. Added a unit test for the user facing error handling. 2. Manually introduced some bugs in Spark to test the internal defensive error handling. 3. Also added a test case to test nested views (not super relevant). Author: Reynold Xin <rxin@databricks.com> Closes #12633 from rxin/SPARK-14865.
-
Reynold Xin authored
## What changes were proposed in this pull request? In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence. ## How was this patch tested? I manually hacked some bugs into Spark and made sure the exceptions were being propagated up. Author: Reynold Xin <rxin@databricks.com> Closes #12634 from rxin/SPARK-14869.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch restructures sql.execution.command package to break the commands into multiple files, in some logical organization: databases, tables, views, functions. I also renamed basicOperators.scala to basicLogicalOperators.scala and basicPhysicalOperators.scala. ## How was this patch tested? N/A - all I did was moving code around. Author: Reynold Xin <rxin@databricks.com> Closes #12636 from rxin/SPARK-14872.
-
Reynold Xin authored
## What changes were proposed in this pull request? Spark SQL inherited from Shark to use the StatsReportListener. Unfortunately this clutters the spark-sql CLI output and makes it very difficult to read the actual query results. ## How was this patch tested? Built and tested in spark-sql CLI. Author: Reynold Xin <rxin@databricks.com> Closes #12635 from rxin/SPARK-14871.
-
Davies Liu authored
-
Reynold Xin authored
## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Reynold Xin <rxin@databricks.com> Closes #12565 from rxin/test-flaky.
-
felixcheung authored
## What changes were proposed in this pull request? When JVM backend fails without going proper error handling (eg. process crashed), the R error message could be ambiguous. ``` Error in if (returnStatus != 0) { : argument is of length zero ``` This change attempts to make it more clear (however, one would still need to investigate why JVM fails) ## How was this patch tested? manually Author: felixcheung <felixcheung_m@hotmail.com> Closes #12622 from felixcheung/rreturnstatus.
-
Reynold Xin authored
Closes #7647 Closes #8195 Closes #8741 Closes #8972 Closes #9490 Closes #10419 Closes #10761 Closes #11003 Closes #11201 Closes #11803 Closes #12111 Closes #12442
-
Sean Owen authored
[SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object ## What changes were proposed in this pull request? Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values ## How was this patch tested? Existing (updated) Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #12637 from srowen/SPARK-14873.
-
felixcheung authored
## What changes were proposed in this pull request? Changed class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vector already defines "DataFrame" - this change is to avoid conflict. Aside from class name and API/roxygen2 references, SparkR APIs like `createDataFrame`, `as.DataFrame` are not changed (S4Vector does not define a "as.DataFrame"). Since in R, one would rarely reference type/class, this change should have minimal/almost-no impact to a SparkR user in terms of back compat. ## How was this patch tested? SparkR tests, manually loading S4Vector then SparkR package Author: felixcheung <felixcheung_m@hotmail.com> Closes #12621 from felixcheung/rdataframe.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? del unused imports in ML/MLLIB ## How was this patch tested? unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12497 from zhengruifeng/del_unused_imports.
-
Rajesh Balamohan authored
## What changes were proposed in this pull request? When FileSourceStrategy is used, record reader is created which incurs a NN call internally. Later in OrcRelation.unwrapOrcStructs, it ends ups reading the file information to get the ObjectInspector. This incurs additional NN call. It would be good to avoid this additional NN call (specifically for partitioned datasets). Added OrcRecordReader which is very similar to OrcNewInputFormat.OrcRecordReader with an option of exposing the ObjectInspector. This eliminates the need to look up the file later for generating the object inspector. This would be specifically be useful for partitioned tables/datasets. ## How was this patch tested? Ran tpc-ds queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite,org.apache.spark.sql.hive.orc.OrcQuerySuite,org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite,OrcPartitionDiscoverySuite.OrcHadoopFsRelationSuite,org.apache.spark.sql.hive.execution.HiveCompatibilitySuite …SourceStrategy mode Author: Rajesh Balamohan <rbalamohan@apache.org> Closes #12319 from rajeshbalamohan/SPARK-14551.
-