- Sep 07, 2017
-
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? This is a follow-up of #19050 to deal with `ExistenceJoin` case. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19151 from viirya/SPARK-21835-followup.
-
- Sep 06, 2017
-
-
Tucker Beck authored
## Problem Description When pyspark is listed as a dependency of another package, installing the other package will cause an install failure in pyspark. When the other package is being installed, pyspark's setup_requires requirements are installed including pypandoc. Thus, the exception handling on setup.py:152 does not work because the pypandoc module is indeed available. However, the pypandoc.convert() function fails if pandoc itself is not installed (in our use cases it is not). This raises an OSError that is not handled, and setup fails. The following is a sample failure: ``` $ which pandoc $ pip freeze | grep pypandoc pypandoc==1.4 $ pip install pyspark Collecting pyspark Downloading pyspark-2.2.0.post0.tar.gz (188.3MB) 100% |████████████████████████████████| 188.3MB 16.8MB/s Complete output from command python setup.py egg_info: Maybe try: sudo apt-get install pandoc See http://johnmacfarlane.net/pandoc/installing.html for installation options --------------------------------------------------------------- Traceback (most recent call last): File "<string>", line 1, in <module> File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module> long_description = pypandoc.convert('README.md', 'rst') File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert outputfile=outputfile, filters=filters) File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input _ensure_pandoc_path() File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path raise OSError("No pandoc was found: either install pandoc and add it\n" OSError: No pandoc was found: either install pandoc and add it to your PATH or or call pypandoc.download_pandoc(...) or install pypandoc wheels with included pandoc. ---------------------------------------- Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/ ``` ## What changes were proposed in this pull request? This change simply adds an additional exception handler for the OSError that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed. ## How was this patch tested? I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel. Here is the output ``` $ pip freeze | grep pypandoc pypandoc==1.4 $ which pandoc $ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0) Installing collected packages: pyspark Successfully installed pyspark-2.3.0.dev0 ``` Author: Tucker Beck <tucker.beck@rentrakmail.com> Closes #18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.
-
Jacek Laskowski authored
## What changes were proposed in this pull request? Just `StateOperatorProgress.toString` + few formatting fixes ## How was this patch tested? Local build. Waiting for OK from Jenkins. Author: Jacek Laskowski <jacek@japila.pl> Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
-
Jose Torres authored
## What changes were proposed in this pull request? Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening. ## How was this patch tested? new and existing unit tests Author: Jose Torres <joseph.torres@databricks.com> Author: Jose Torres <joseph-torres@databricks.com> Closes #19056 from joseph-torres/SPARK-21765-followup.
-
Felix Cheung authored
## What changes were proposed in this pull request? set.seed() before running tests ## How was this patch tested? jenkins, appveyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19111 from felixcheung/rranseed.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery` during optimization. It is possibly that the two sides of the `Join` have conflicting attributes. The query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity. We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19050 from viirya/SPARK-21835.
-
hyukjinkwon authored
[SPARK-21903][BUILD][FOLLOWUP] Upgrade scalastyle-maven-plugin and scalastyle as well in POM and SparkBuild.scala ## What changes were proposed in this pull request? This PR proposes to match scalastyle version in POM and SparkBuild.scala ## How was this patch tested? Manual builds. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19146 from HyukjinKwon/SPARK-21903-follow-up.
-
Bryan Cutler authored
## What changes were proposed in this pull request? Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid. The level of parallelism is controlled by a parameter `numParallelEval` used to schedule a number of models to be trained/evaluated so that the jobs can be run concurrently. This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately. The default value is `1` which will train/evaluate in serial. ## How was this patch tested? Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel. Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
-
Riccardo Corbella authored
## What changes were proposed in this pull request? Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." as follow "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." under the programming structured streaming programming guide. Author: Riccardo Corbella <r.corbella@reply.it> Closes #19137 from riccardocorbella/bugfix.
-
- Sep 05, 2017
-
-
jerryshao authored
## What changes were proposed in this pull request? This PR exposes Netty memory usage for Spark's `TransportClientFactory` and `TransportServer`, including the details of each direct arena and heap arena metrics, as well as aggregated metrics. The purpose of adding the Netty metrics is to better know the memory usage of Netty in Spark shuffle, rpc and others network communications, and guide us to better configure the memory size of executors. This PR doesn't expose these metrics to any sink, to leverage this feature, still requires to connect to either MetricsSystem or collect them back to Driver to display. ## How was this patch tested? Add Unit test to verify it, also manually verified in real cluster. Author: jerryshao <sshao@hortonworks.com> Closes #18935 from jerryshao/SPARK-9104.
-
jerryshao authored
Spark ThriftServer doesn't support spnego auth for thrift/http protocol, this mainly used for knox+thriftserver scenario. Since in HiveServer2 CLIService there already has existing codes to support it. So here copy it to Spark ThriftServer to make it support. Related Hive JIRA HIVE-6697. Manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #18628 from jerryshao/SPARK-21407. Change-Id: I61ef0c09f6972bba982475084a6b0ae3a74e385e
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? All built-in data sources support `Partition Discovery`. We had better update the document to give the users more benefit clearly. **AFTER** <img width="906" alt="1" src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png"> ## How was this patch tested? ``` SKIP_API=1 jekyll serve --watch ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19139 from dongjoon-hyun/partitiondiscovery.
-
Xingbo Jiang authored
## What changes were proposed in this pull request? For the given example below, the predicate added by `InferFiltersFromConstraints` is folded by `ConstantPropagation` later, this leads to unconverged optimize iteration: ``` Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1") Seq(1, 2).toDF("col").createOrReplaceTempView("t2") sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col") ``` We can fix this by adjusting the indent of the optimize rules. ## How was this patch tested? Add test case that would have failed in `SQLQuerySuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #19099 from jiangxb1987/unconverge-optimization.
-
Burak Yavuz authored
Forgot to update docs with behavior change. Author: Burak Yavuz <brkyvz@gmail.com> Closes #19138 from brkyvz/trigger-doc-fix.
-
gatorsmile authored
## What changes were proposed in this pull request? We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases. ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19119 from gatorsmile/fallbackCodegen.
-
hyukjinkwon authored
## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below: ```scala val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS()) df.show() ``` **Before** ``` java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89) at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) ... ``` **After** ``` +---+----+--------+ | a| b|unparsed| +---+----+--------+ | a|null| a| +---+----+--------+ ``` It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this. ## How was this patch tested? Unit test added in `CSVSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19113 from HyukjinKwon/bump-up-univocity.
-
hyukjinkwon authored
## What changes were proposed in this pull request? 1.0.0 fixes an issue with import order, explicit type for public methods, line length limitation and comment validation: ``` [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49: File line length exceeds 100 characters [error] .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21: Are you sure you want to println? If yes, wrap the code block with [error] // scalastyle:off println [error] println(...) [error] // scalastyle:on println [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15: Public method must have explicit type [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2: Insert a space after the start of the comment [error] .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43: JavaDStream should come before JavaDStreamLike. ``` This PR also fixes the workaround added in SPARK-16877 for `org.scalastyle.scalariform.OverrideJavaChecker` feature, added from 0.9.0. ## How was this patch tested? Manually tested. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19116 from HyukjinKwon/scalastyle-1.0.0.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? Currently, `withDatabase` fails if the database is not empty. It would be great if we drop cleanly with CASCADE. ## How was this patch tested? This is a change on test util. Pass the existing Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19125 from dongjoon-hyun/SPARK-21913.
-
- Sep 04, 2017
-
-
Sean Owen authored
[SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true ## What changes were proposed in this pull request? If no SparkConf is available to Utils.redact, simply don't redact. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19123 from srowen/SPARK-21418.
-
- Sep 03, 2017
-
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? SQL predicates don't have complete expression description. This patch goes to complement the description by adding arguments, examples. This change also adds related test cases for the SQL predicate expressions. ## How was this patch tested? Existing tests. And added predicate test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18869 from viirya/SPARK-21654.
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes to add a wrapper for `unionByName` API to R and Python as well. **Python** ```python df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"]) df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"]) df1.unionByName(df2).show() ``` ``` +----+----+----+ |col0|col1|col3| +----+----+----+ | 1| 2| 3| | 6| 4| 5| +----+----+----+ ``` **R** ```R df1 <- select(createDataFrame(mtcars), "carb", "am", "gear") df2 <- select(createDataFrame(mtcars), "am", "gear", "carb") head(unionByName(limit(df1, 2), limit(df2, 2))) ``` ``` carb am gear 1 4 1 4 2 4 1 4 3 4 1 4 4 4 1 4 ``` ## How was this patch tested? Doctests for Python and unit test added in `test_sparkSQL.R` for R. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19105 from HyukjinKwon/unionByName-r-python.
-
- Sep 02, 2017
-
-
gatorsmile authored
## What changes were proposed in this pull request? Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`. After this change, the DDL becomes ``` CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name USING table_provider [OPTIONS table_property_list] [PARTITIONED BY (col_name, col_name, ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS ] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (property_name=property_value, ...)] [[AS] select_statement]; ``` ## How was this patch tested? Add a few tests Author: gatorsmile <gatorsmile@gmail.com> Closes #19100 from gatorsmile/addTablePropsToCreateTableUsing.
-
- Sep 01, 2017
-
-
WeichenXu authored
[SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns ## What changes were proposed in this pull request? Add test for prediction using the model with all combinations of output columns turned on/off. Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on. ## How was this patch tested? Test updated. Author: WeichenXu <weichen.xu@databricks.com> Author: WeichenXu <WeichenXu123@outlook.com> Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.
-
gatorsmile authored
## What changes were proposed in this pull request? Supporting moving tables across different database in HiveClient `alterTable` ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19104 from gatorsmile/alterTable.
-
Sean Owen authored
[SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure ## What changes were proposed in this pull request? This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts. In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11. It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release. - Scalatest 2.x -> 3.0.3 - Chill 0.8.0 -> 0.8.4 - Clapper 1.0.x -> 1.1.2 - json4s 3.2.x -> 3.4.2 - Jackson 2.6.x -> 2.7.9 (required by json4s) This change does _not_ fully enable a Scala 2.12 build: - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too. What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build. ## How was this patch tested? Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above. Author: Sean Owen <sowen@cloudera.com> Closes #18645 from srowen/SPARK-14280.
-
he.qiao authored
## What changes were proposed in this pull request? As shown below, for example, When the job 5 is running, It was a mistake to think that five jobs were running, So I think it would be more appropriate to change jobs to job id.  ## How was this patch tested? no need Author: he.qiao <he.qiao17@zte.com.cn> Closes #19093 from Geek-He/08_31_sqltable.
-
Marcelo Vanzin authored
- SecurityManager complains when auth is enabled but no secret is defined; SparkSubmit doesn't use the auth functionality of the SecurityManager, so use a dummy secret to work around the exception. - Only reset the log4j configuration when Spark was the one initializing it, otherwise user-defined log configuration may be lost. Tested with the log config file posted to the bug, on a secured YARN cluster. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19089 from vanzin/SPARK-21728.
-
- Aug 31, 2017
-
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR proposes to remove private functions that look not used in the main codes, `_split_schema_abstract`, `_parse_field_abstract`, `_parse_schema_abstract` and `_infer_schema_type`. ## How was this patch tested? Existing tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18647 from HyukjinKwon/remove-abstract.
-
hyukjinkwon authored
## What changes were proposed in this pull request? This PR make `DataFrame.sample(...)` can omit `withReplacement` defaulting `False`, consistently with equivalent Scala / Java API. In short, the following examples are allowed: ```python >>> df = spark.range(10) >>> df.sample(0.5).count() 7 >>> df.sample(fraction=0.5).count() 3 >>> df.sample(0.5, seed=42).count() 5 >>> df.sample(fraction=0.5, seed=42).count() 5 ``` In addition, this PR also adds some type checking logics as below: ```python >>> df = spark.range(10) >>> df.sample().count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got []. >>> df.sample(True).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>]. >>> df.sample(42).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>]. >>> df.sample(fraction=False, seed="a").count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>]. >>> df.sample(seed=[1]).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>]. >>> df.sample(withReplacement="a", fraction=0.5, seed=1) ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>]. ``` ## How was this patch tested? Manually tested, unit tests added in doc tests and manually checked the built documentation for Python. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18999 from HyukjinKwon/SPARK-21779.
-
WeichenXu authored
## What changes were proposed in this pull request? add overflow check in PCA, otherwise it is possible to throw `NegativeArraySizeException` when `k` and `numFeatures` are too large. The overflow checking formula is here: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87 ## How was this patch tested? N/A Author: WeichenXu <weichen.xu@databricks.com> Closes #19078 from WeichenXu123/SVD_overflow_check.
-
WeichenXu authored
[SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary ## What changes were proposed in this pull request? add an "asBinary" method to LogisticRegressionSummary for convenient casting to BinaryLogisticRegressionSummary. ## How was this patch tested? Testcase updated. Author: WeichenXu <weichen.xu@databricks.com> Closes #19072 from WeichenXu123/mlor_summary_as_binary.
-
Andrew Ray authored
## What changes were proposed in this pull request? Allows `BinaryComparison` operators to work on any data type that actually supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of relying on the incomplete list `TypeCollection.Ordered` (which is removed by this PR). ## How was this patch tested? Updated unit tests to cover structs and arrays. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18818 from aray/SPARK-21110.
-
gatorsmile authored
## What changes were proposed in this pull request? Also remove useless function `partitionByDeterministic` after the changes of https://github.com/apache/spark/pull/14687 ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19097 from gatorsmile/followupSPARK-17107.
-
Bryan Cutler authored
Removing a check in the ColumnarBatchSuite that depended on a Java assertion. This assertion is being compiled out in the Maven builds causing the test to fail. This part of the test is not specifically from to the functionality that is being tested here. Author: Bryan Cutler <cutlerb@gmail.com> Closes #19098 from BryanCutler/hotfix-ColumnarBatchSuite-assertion.
-
ArtRand authored
Mesos has secrets primitives for environment and file-based secrets, this PR adds that functionality to the Spark dispatcher and the appropriate configuration flags. Unit tested and manually tested against a DC/OS cluster with Mesos 1.4. Author: ArtRand <arand@soe.ucsc.edu> Closes #18837 from ArtRand/spark-20812-dispatcher-secrets-and-labels.
-
Jacek Laskowski authored
… Dataset with LogicalRDD logical operator ## What changes were proposed in this pull request? Reusing `SparkSession.internalCreateDataFrame` wherever possible (to cut dups) ## How was this patch tested? Local build and waiting for Jenkins Author: Jacek Laskowski <jacek@japila.pl> Closes #19095 from jaceklaskowski/SPARK-21886-internalCreateDataFrame.
-
gatorsmile authored
## What changes were proposed in this pull request? Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases. Also, move two SQLMetrics test cases from sql/hive to sql/core. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19092 from gatorsmile/rewriteSQLMetrics.
-
- Aug 30, 2017
-
-
Bryan Cutler authored
## What changes were proposed in this pull request? This PR allows the creation of a `ColumnarBatch` from `ReadOnlyColumnVectors` where previously a columnar batch could only allocate vectors internally. This is useful for using `ArrowColumnVectors` in a batch form to do row-based iteration. Also added `ArrowConverter.fromPayloadIterator` which converts `ArrowPayload` iterator to `InternalRow` iterator and uses a `ColumnarBatch` internally. ## How was this patch tested? Added a new unit test for creating a `ColumnarBatch` with `ReadOnlyColumnVectors` and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using `toPayloadIterator` and `fromPayloadIterator`. Author: Bryan Cutler <cutlerb@gmail.com> Closes #18787 from BryanCutler/arrow-ColumnarBatch-support-SPARK-21583.
-
Liang-Chi Hsieh authored
[SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from python row with empty bytearray ## What changes were proposed in this pull request? `PickleException` is thrown when creating dataframe from python row with empty bytearray spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})).show() net.razorvine.pickle.PickleException: invalid pickle data for bytearray; expected 1 or 2 args, got 0 at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java ... `ByteArrayConstructor` doesn't deal with empty byte array pickled by Python3. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19085 from viirya/SPARK-21534.
-
jerryshao authored
In the current code, if NM recovery is not enabled then `YarnShuffleService` will write shuffle metadata to NM local dir-1, if this local dir-1 is on bad disk, then `YarnShuffleService` will be failed to start. So to solve this issue, in Spark side if NM recovery is not enabled, then Spark will not persist data into leveldb, in that case yarn shuffle service can still be served but lose the ability for recovery, (it is fine because the failure of NM will kill the containers as well as applications). Tested in the local cluster with NM recovery off and on to see if folder is created or not. MiniCluster UT isn't added because in MiniCluster NM will always set port to 0, but NM recovery requires non-ephemeral port. Author: jerryshao <sshao@hortonworks.com> Closes #19032 from jerryshao/SPARK-17321. Change-Id: I8f2fe73d175e2ad2c4e380caede3873e0192d027
-