- May 25, 2016
-
-
Jurriaan Pruis authored
## What changes were proposed in this pull request? Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this. See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247 This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2) https://issues.apache.org/jira/browse/SPARK-15493 ## How was this patch tested? Added a test that verifies the output is quoted correctly. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13267 from jurriaan/quote-escaping.
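The quoting rule from RFC 4180 section 2 can be tried outside Spark. This is only a minimal sketch using Python's standard `csv` module, not Spark or uniVocity; it assumes the module's `doublequote` behaviour is a fair stand-in for what `escapeQuotes` enables, and the helper name `to_csv_line` and the sample row are made up for illustration.

```python
import csv
import io


def to_csv_line(row):
    """Serialize one row; a field containing a quote is wrapped in quotes
    and each inner quote is doubled, as RFC 4180 section 2 describes."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        quoting=csv.QUOTE_MINIMAL,
        doublequote=True,
        lineterminator="\r\n",
    )
    writer.writerow(row)
    return buf.getvalue()


# Hypothetical sample row: only the first field contains quotes.
print(to_csv_line(['say "hi"', "plain"]), end="")
```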
-
Takuya UESHIN authored
## What changes were proposed in this pull request? Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations that does not include the extra strategies. This PR fixes `IncrementalExecution` to include the extra strategies so that they are used. ## How was this patch tested? I added a test to check that extra strategies work for streams. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13261 from ueshin/issues/SPARK-15483.
-
Nick Pentreath authored
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice. We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields. Tests N/A. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
-
lfzCarlosC authored
Fixed typos in the source code of the [mllib], [streaming] and [SQL] components. Testing: none; the changes are obvious. Author: lfzCarlosC <lfz.carlos@gmail.com> Closes #13298 from lfzCarlosC/master.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?
This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports.
```scala
- logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
+ logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
...
- // since its not removed yet
+ // since it's not removed yet
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.
-
Eric Liang authored
[SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions ## What changes were proposed in this pull request? This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200. ## How was this patch tested? Python doc tests. cc andrewor14 Author: Eric Liang <ekl@databricks.com> Closes #13289 from ericl/spark-15520.
-
Jeff Zhang authored
[SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext ## What changes were proposed in this pull request? Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach. ## How was this patch tested? Manually verified in spark-shell. rxin Please help review it, I think this is a very critical issue for Spark 2.0. Author: Jeff Zhang <zjffdu@apache.org> Closes #13160 from zjffdu/SPARK-15345.
-
Lukasz authored
## What changes were proposed in this pull request? 1. Making 'name' field of RDDInfo mutable. 2. In StorageListener: catching the fact that RDD's name was changed and updating it in RDDInfo. ## How was this patch tested? 1. Manual verification - the 'Storage' tab now behaves as expected. 2. The commit also contains a new unit test which verifies this. Author: Lukasz <lgieron@gmail.com> Closes #13264 from lgieron/SPARK-9044.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly. ## How was this patch tested? Created a new SparkSqlParserSuite. Author: Reynold Xin <rxin@databricks.com> Closes #13292 from rxin/SPARK-15436.
-
Krishna Kalyan authored
## What changes were proposed in this pull request? In the "Upgrading From SparkR 1.5.x to 1.6.x" section, added a note that SparkSQL converts `NA` in R to `null`. ## How was this patch tested? Document update, no tests. Author: Krishna Kalyan <krishnakalyan3@gmail.com> Closes #13268 from krishnakalyan3/spark-12071-1.
-
Holden Karau authored
[SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions ## What changes were proposed in this pull request? PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways. User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on Ubuntu and add sudo to match the rest of the commands. User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some Ubuntu but maybe more). ## How was this patch tested? Built pydocs locally, tested new user build instructions. Author: Holden Karau <holden@us.ibm.com> Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
-
Shixiong Zhu authored
## What changes were proposed in this pull request? `JavaKafkaStreamSuite.testKafkaStream` assumes that when `sent.size == result.size`, the contents of `sent` and `result` are the same. However, that's not true: the content of `result` may not be final yet. This PR modifies the test to always retry the assertions, even if the contents of `sent` and `result` are not the same. Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13281 from zsxwing/flaky-kafka-test.
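The idea of retrying an assertion until it holds, rather than trusting a size check, can be sketched as follows. This is a plain Python illustration, not the test's actual code; `eventually`, the timeout values and the simulated late update are all assumptions.

```python
import time


def eventually(assertion, timeout_s=5.0, interval_s=0.05):
    """Re-run `assertion` until it stops raising AssertionError or the
    timeout expires; the last failure is re-raised on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            assertion()
            return
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)


# Hypothetical data: `result` already has the same size as `sent`,
# but one count is not final yet.
sent = {"a": 3, "b": 2}
result = {"a": 1, "b": 2}
calls = {"n": 0}


def check():
    calls["n"] += 1
    if calls["n"] == 3:
        result["a"] = 3  # simulate the late record arriving
    assert result == sent


eventually(check)
```

A size comparison would have accepted `result` on the first attempt; retrying the full content comparison only succeeds once the late update is visible.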
-
- May 24, 2016
-
-
Wenchen Fan authored
## What changes were proposed in this pull request? This PR fixes 3 slow tests: 1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit". 2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size. 3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: improving `CodeFormatter.format` (introduced at https://github.com/apache/spark/pull/12979) dramatically speeds it up. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13273 from cloud-fan/test.
-
Parth Brahmbhatt authored
[SPARK-15365][SQL] When table size statistics are not available from metastore, we should fall back to HDFS ## What changes were proposed in this pull request? Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins. ## How was this patch tested? I have executed queries locally to test. Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com> Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
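The fallback described above amounts to a small decision rule. The sketch below is only an illustration in plain Python: the function names, the use of a callable for the filesystem lookup and the 10 MB threshold are assumptions made for the example, not Spark's code.

```python
# Hypothetical threshold, for the example only.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def estimated_table_size(metastore_size, filesystem_size):
    """Prefer the metastore statistic; if it is missing or non-positive,
    fall back to asking the filesystem (passed as a callable so the
    lookup only happens when needed)."""
    if metastore_size is not None and metastore_size > 0:
        return metastore_size
    return filesystem_size()


def can_broadcast(metastore_size, filesystem_size,
                  threshold=BROADCAST_THRESHOLD_BYTES):
    """A table qualifies for a broadcast join when its estimated size
    does not exceed the threshold."""
    return estimated_table_size(metastore_size, filesystem_size) <= threshold


# No metastore statistic: the filesystem size (4 KB here) decides.
print(can_broadcast(None, lambda: 4096))
```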
-
Reynold Xin authored
## What changes were proposed in this pull request?
This patch renames various scheduler backends to make them consistent:
- LocalScheduler -> LocalSchedulerBackend
- AppClient -> StandaloneAppClient
- AppClientListener -> StandaloneAppClientListener
- SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
- CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
- MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend
## How was this patch tested?
Updated test cases to reflect the name change.
Author: Reynold Xin <rxin@databricks.com>
Closes #13288 from rxin/SPARK-15518.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?
Previously, SPARK-8893 added a constraint requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part of that and adds two explicit test cases.
**Before**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0).collect()
res1: Array[Int] = Array() // empty
scala> spark.sql("select 1").coalesce(0)
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").coalesce(0).collect()
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
scala> spark.sql("select 1").repartition(0)
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").repartition(0).collect()
res4: Array[org.apache.spark.sql.Row] = Array() // empty
```
**After**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
```
## How was this patch tested?
Pass the Jenkins tests with new test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13282 from dongjoon-hyun/SPARK-15512.
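The point of the change is that the check fails eagerly, when the operation is declared, instead of silently producing an empty result later. A minimal Python analogue is sketched below; `require_positive_partitions` is a hypothetical helper, and `ValueError` stands in for the `IllegalArgumentException` shown above.

```python
def require_positive_partitions(num_partitions):
    """Reject a non-positive partition count up front, mirroring the
    message shown in the REPL output above."""
    if num_partitions <= 0:
        raise ValueError(
            f"Number of partitions ({num_partitions}) must be positive."
        )
    return num_partitions


try:
    require_positive_partitions(0)
except ValueError as err:
    print(err)
```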
-
Tathagata Das authored
## What changes were proposed in this pull request?
If the user relies on the schema being inferred for file streams, the query can break easily for multiple reasons:
- accidentally running on a directory which has no data
- the schema changing underneath
- on restart, the query infers the schema again and may unexpectedly infer an incorrect schema, as the files in the directory may be different at the time of the restart.
To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default with a config, so that the user is forced to state explicitly which schema they want, rather than the system trying to infer it and running into weird corner cases. In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
## How was this patch tested?
Updated unit tests that test error behavior with and without schema inference enabled.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #13238 from tdas/SPARK-15458.
-
Nick Pentreath authored
This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range. (Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)). Also cleaned up a reference to `mllib` in the ML doc. ## How was this patch tested? Built and viewed User Guide doc locally. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This PR fixes some obsolete comments and assertion in `takeSample` testcase of `RDDSuite.scala`. ## How was this patch tested? This fixes the testcase only. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13260 from dongjoon-hyun/SPARK-15481.
-
wangyang authored
## What changes were proposed in this pull request? spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1. I think it is introduced by pr #12853. Fixing it by catching Exception (not NoSuchObjectException) and string matching. ## How was this patch tested? added a unit test and also tested it manually Author: wangyang <wangyang@haizhi.com> Closes #13177 from wangyang1992/fixCreateFunc2.
-
Marcelo Vanzin authored
We only need one copy of it. The client code that was uploading the second copy just needs to be modified to update the metadata in the cache, so that the AM knows where to find the configuration. Tested by running app on YARN and verifying in the logs only one archive is uploaded. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #13232 from vanzin/SPARK-15405.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? Currently PySpark core test uses the `SerDe` from `PythonMLLibAPI` which includes many MLlib things. It should use `SerDeUtil` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13214 from viirya/pycore-use-serdeutil.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?
This PR is an up-to-date and slightly improved version of #11019 of rxin for
- (1) preventing recursive printing of expressions in generated code.

Since the major function of this PR is indeed the above, he should be credited for the work he did. In addition to #11019, this PR improves the following in code generation.
- (2) Improve multiline comment indentation.
- (3) Reduce the number of empty lines (mainly consecutive empty lines).
- (4) Remove all space characters on empty lines.

**Example**
```scala
spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)
```
**Before**
```
Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */ * Codegend pipeline for
/* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */ */
...
/* 075 */ // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */
/* 077 */ // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 078 */
/* 079 */ // initialize Range
...
/* 092 */ // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 093 */
/* 094 */ // CONSUME: WholeStageCodegen
/* 095 */
/* 096 */ // (((input[0, bigint, false] + 1) + 2) + 3)
/* 097 */ // ((input[0, bigint, false] + 1) + 2)
/* 098 */ // (input[0, bigint, false] + 1)
...
/* 107 */ // (((input[0, bigint, false] + 4) + 5) + 6)
/* 108 */ // ((input[0, bigint, false] + 4) + 5)
/* 109 */ // (input[0, bigint, false] + 4)
...
/* 126 */ }
```
**After**
```
Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */ * Codegend pipeline for
/* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */ */
...
/* 075 */ // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */ // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 077 */ // initialize Range
...
/* 090 */ // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 091 */ // CONSUME: WholeStageCodegen
/* 092 */ // (((input[0, bigint, false] + 1) + 2) + 3)
...
/* 101 */ // (((input[0, bigint, false] + 4) + 5) + 6)
...
/* 118 */ }
```
## How was this patch tested?
Pass the Jenkins tests and see the result of the following command manually.
```scala
scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen()
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Author: Reynold Xin <rxin@databricks.com>
Closes #13192 from dongjoon-hyun/SPARK-13135.
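Items (3) and (4) above, fewer consecutive empty lines and no spaces left on empty lines, can be illustrated with a few lines of Python. This is only a sketch of the idea, not the `CodeFormatter` implementation; `tidy_generated_code` is a hypothetical name, and it keeps one empty line per run rather than reproducing the exact output shown above.

```python
def tidy_generated_code(code):
    """Blank out whitespace-only lines and collapse each run of
    consecutive empty lines into a single empty line."""
    cleaned = []
    previous_blank = False
    for line in code.split("\n"):
        is_blank = line.strip() == ""
        if is_blank and previous_blank:
            continue  # skip repeated empty lines
        cleaned.append("" if is_blank else line)
        previous_blank = is_blank
    return "\n".join(cleaned)


sample = "int a = 1;\n   \n\n \nint b = 2;"
print(tidy_generated_code(sample))
```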
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? Jackson supports an `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity" and "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass. ## How was this patch tested? `JsonParsingOptionsSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9759 from viirya/fix-json-nonnumric.
-
Nick Pentreath authored
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`). Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`. ## How was this patch tested? A little doctest and built API docs locally to check HTML doc generation. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
-
Daoyuan Wang authored
## What changes were proposed in this pull request? In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. This results from a different interpretation of the third parameter of the `locate` UDF. It is the starting position and is counted from 1, so when 0 is passed the result should always be 0. ## How was this patch tested? Tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #13186 from adrian-wang/locate.
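The Hive semantics quoted above (1-based start position, 0 for "not found" and for any start position below 1) can be written down in a few lines. This is a plain Python sketch for illustration, not Spark's implementation, and it does not cover null handling.

```python
def locate(substr, s, pos=1):
    """Return the 1-based index of the first occurrence of `substr` in `s`
    at or after the 1-based position `pos`; 0 if absent or if pos < 1."""
    if pos < 1:
        return 0
    # str.find is 0-based and returns -1 when absent, so +1 maps
    # "not found" to 0 and a hit to its 1-based index.
    return s.find(substr, pos - 1) + 1


for start in (0, 1, 2):
    print(start, locate("aa", "aaa", start))
```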
-
- May 23, 2016
-
-
Kazuaki Ishizaki authored
## What changes were proposed in this pull request? This PR splits the generated code for `SafeProjection.apply` by using `ctx.splitExpressions()`. This is because the large code body for `NewInstance` may grow beyond 64KB bytecode size for `apply()` method. ## How was this patch tested? Added new tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #13243 from kiszk/SPARK-15285.
-
gatorsmile authored
#### What changes were proposed in this pull request? So far, the page Configuration in the official documentation does not have a section for Spark SQL. http://spark.apache.org/docs/latest/configuration.html For Spark users, the information and default values of these public configuration parameters are very useful. This PR is to add this missing section to the configuration.html. rxin yhuai marmbrus #### How was this patch tested? Below is the generated webpage. <img width="924" alt="screenshot 2016-05-23 11 35 57" src="https://cloud.githubusercontent.com/assets/11567269/15480492/b08fefc4-20da-11e6-9fa2-7cd5b699ed35.png"> <img width="914" alt="screenshot 2016-05-23 11 37 38" src="https://cloud.githubusercontent.com/assets/11567269/15480499/c5f9482e-20da-11e6-95ff-10821add1af4.png"> <img width="923" alt="screenshot 2016-05-23 11 36 11" src="https://cloud.githubusercontent.com/assets/11567269/15480506/cbd81644-20da-11e6-9d27-effb716b2fac.png"> <img width="920" alt="screenshot 2016-05-23 11 36 18" src="https://cloud.githubusercontent.com/assets/11567269/15480511/d013e332-20da-11e6-854a-cf8813c46f36.png"> Author: gatorsmile <gatorsmile@gmail.com> Closes #13263 from gatorsmile/configurationSQL.
-
WeichenXu authored
[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code ## What changes were proposed in this pull request? Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
-
gatorsmile authored
#### What changes were proposed in this pull request?
So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example,
```SQL
CREATE TABLE tabName(i INT, j STRING)
SELECT * FROM tabName
INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
```
In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT`
```
org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None;
```
This PR is to issue appropriate exceptions in this case. The message will be like
```
org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`;
```
#### How was this patch tested?
Added a test case in `DDLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #13093 from gatorsmile/selectAfterCreate.
-
Xin Wu authored
## What changes were proposed in this pull request?
Currently command `ADD FILE|JAR <filepath | jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to the resources that can not be looked up by `LIST FILE(s)|JAR(s)` command because the `LIST` command is passed to Hive command processor in Spark-SQL or simply not supported in Spark-shell. There is no way users can find out what files/jars are added to the spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)

This PR is to support following commands: `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
### For example:
##### LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []
scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+
scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```
##### LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]
scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.
Author: Xin Wu <xinwu@us.ibm.com>
Author: xin Wu <xinwu@us.ibm.com>
Closes #13212 from xwu0226/list_command.
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR adds a description of how to run unit tests on Windows. ## How was this patch tested? On a bare machine (Windows 7, 32-bit), this was manually built and tested. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13217 from HyukjinKwon/minor-r-doc.
-
sureshthalamati authored
[SPARK-15315][SQL] Adding error check to the CSV datasource writer for unsupported complex data types. ## What changes were proposed in this pull request? Adds error handling to the CSV writer for unsupported complex data types. Currently garbage gets written to the output csv files if the data frame schema has complex data types. ## How was this patch tested? Added new unit test case. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #13105 from sureshthalamati/csv_complex_types_SPARK-15315.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that. ## How was this patch tested? It's only about docs. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13087 from dongjoon-hyun/SPARK-15282.
-
Andrew Or authored
## What changes were proposed in this pull request?
The user may do something like:
```
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
```
None of these should be allowed because the SerDes conflict. As of this patch:
- `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
- `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
## How was this patch tested?
New tests in `DDLCommandSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes #13068 from andrewor14/row-format-conflict.
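The two compatibility rules listed above boil down to a lookup table. The following Python sketch shows the check in isolation; it is not the parser code, and `ALLOWED_FILE_FORMATS` and `is_compatible` are hypothetical names.

```python
# Compatibility table taken from the two rules stated above.
ALLOWED_FILE_FORMATS = {
    "DELIMITED": {"TEXTFILE"},
    "SERDE": {"TEXTFILE", "RCFILE", "SEQUENCEFILE"},
}


def is_compatible(row_format, stored_as):
    """Return True when `ROW FORMAT <row_format>` may be combined with
    `STORED AS <stored_as>`; both arguments are case-insensitive."""
    allowed = ALLOWED_FILE_FORMATS[row_format.upper()]
    return stored_as.upper() in allowed


print(is_compatible("SERDE", "PARQUET"))
print(is_compatible("DELIMITED", "TEXTFILE"))
```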
-
Wenchen Fan authored
## What changes were proposed in this pull request? 1. simplify the logic of deserializing option type. 2. simplify the logic of serializing array type, and remove silentSchemaFor 3. remove some unnecessary code. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #13250 from cloud-fan/encoder.
-
Davies Liu authored
## What changes were proposed in this pull request? Currently, we create a CSVWriter for every row, which is very expensive and memory hungry; it took about 15 seconds to write out 1 million rows (two columns). This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster). ## How was this patch tested? Manually benchmarked it. Author: Davies Liu <davies@databricks.com> Closes #13229 from davies/csv_writer.
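The batching idea, one writer per batch of rows instead of one per row, can be sketched with Python's standard `csv` module. This is an illustration of the pattern only, not the Spark code; `write_rows_batched` is a hypothetical helper and the default batch size of 1000 simply mirrors the "1k rows" mentioned above.

```python
import csv
import io


def write_rows_batched(rows, out, batch_size=1000):
    """Write `rows` to the text stream `out`, creating one csv.writer per
    batch rather than per row. Returns the number of writers created."""
    writers_created = 0
    batch = []

    def flush():
        nonlocal writers_created
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(batch)
        out.write(buf.getvalue())
        writers_created += 1
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            flush()
    if batch:  # write the final, partially filled batch
        flush()
    return writers_created


out = io.StringIO()
created = write_rows_batched(((i, i * 2) for i in range(2500)), out)
print(created)
```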
-
Sameer Agarwal authored
## What changes were proposed in this pull request? In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default) that if not set, results in a `SparkException` if the query contains one or more cartesian products. ## How was this patch tested? Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products. Author: Sameer Agarwal <sameer@databricks.com> Closes #13209 from sameeragarwal/disallow-cartesian.
- May 22, 2016
-
-
wangyang authored
## What changes were proposed in this pull request? When an invalid date string like "2015-02-29 00:00:00" was cast as date or timestamp using Spark SQL, it did not return null but another valid date (2015-03-01 in this case). With this PR, invalid date strings like "2015-02-29" and "2016-04-31" are returned as null when cast as date or timestamp. ## How was this patch tested? Unit tests are added. Author: wangyang <wangyang@haizhi.com> Closes #13169 from wangyang1992/invalid_date.
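The intended behaviour, null for a calendar date that does not exist rather than rolling over to the next valid day, looks like this in a plain Python sketch. It relies on the standard `datetime` module and is not the Spark implementation; `parse_date_or_none` is a hypothetical helper and `None` stands in for SQL null.

```python
from datetime import datetime


def parse_date_or_none(text):
    """Parse a yyyy-MM-dd string strictly; return None for dates that do
    not exist on the calendar instead of rolling over."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


for candidate in ("2015-02-29", "2016-04-31", "2016-02-28"):
    print(candidate, parse_date_or_none(candidate))
```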
-