  1. Mar 17, 2016
    • Josh Rosen's avatar
      [SPARK-13948] MiMa check should catch if the visibility changes to private · 82066a16
      Josh Rosen authored
      MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version.
      
      This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility).
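
      As a rough illustration (not the actual `MimaBuild` code), an exclude generated for a class name can be kept from also covering its companion object, since the companion's name simply carries a trailing `$`:

      ```scala
      import com.typesafe.tools.mima.core._

      // Sketch: build filters only for the named class. No filter is generated for
      // className + "$" (the companion object), because the class and its companion
      // can have different accessibility.
      def excludeClassOnly(className: String): Seq[ProblemFilter] = Seq(
        ProblemFilters.exclude[MissingClassProblem](className),
        ProblemFilters.exclude[MissingTypesProblem](className)
      )
      ```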
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11774 from JoshRosen/SPARK-13948.
      82066a16
    • Ryan Blue's avatar
      [SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors. · 5faba9fa
      Ryan Blue authored
      This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
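
      As a hedged sketch of the change (assuming `sc` is the active SparkContext), the idea is to seed every internal HiveConf with `sc.hadoopConfiguration` instead of a fresh Hadoop configuration:

      ```scala
      import org.apache.hadoop.hive.conf.HiveConf

      // Settings such as spark.hadoop.fs.s3.impl become visible to Hive because the
      // HiveConf is constructed from the SparkContext's Hadoop configuration.
      val hiveConf = new HiveConf(sc.hadoopConfiguration, classOf[HiveConf])
      ```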
      
      I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
      5faba9fa
    • Josh Rosen's avatar
      [SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types · de1a84e5
      Josh Rosen authored
      Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
      
      This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
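
      A simplified sketch of the selection idea (hypothetical helper names, not the actual `SerializerManager` code):

      ```scala
      import scala.reflect.ClassTag
      import org.apache.spark.serializer.Serializer

      // A ClassTag is "Kryo-safe" here if it denotes a primitive, an array of
      // primitives, or String.
      def canUseKryo(ct: ClassTag[_]): Boolean = {
        val cls = ct.runtimeClass
        cls.isPrimitive ||
          (cls.isArray && cls.getComponentType.isPrimitive) ||
          cls == classOf[String]
      }

      // Pick Kryo automatically when all shuffled types are known to be Kryo-safe.
      def chooseSerializer(
          keyTag: ClassTag[_],
          valueTag: ClassTag[_],
          defaultSerializer: Serializer,
          kryoSerializer: Serializer): Serializer = {
        if (canUseKryo(keyTag) && canUseKryo(valueTag)) kryoSerializer else defaultSerializer
      }
      ```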
      
      In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11755 from JoshRosen/automatically-pick-best-serializer.
      de1a84e5
    • Daoyuan Wang's avatar
      [SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test · d1c193a2
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      Since the developer API for the pluggable parser has been removed in #10801, the docs should be updated accordingly.
      
      ## How was this patch tested?
      
      This patch will not affect the real code path.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #11758 from adrian-wang/spark12855.
      d1c193a2
    • Dongjoon Hyun's avatar
      [MINOR][SQL][BUILD] Remove duplicated lines · c890c359
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR removes three minor duplicated lines. The first one produces the following unreachable-code warning:
      ```
      JoinSuite.scala:52: unreachable code
      [warn]       case j: BroadcastHashJoin => j
      ```
      The other two are just consecutive repetitions in the `Seq` of MiMa filters.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11773 from dongjoon-hyun/remove_duplicated_line.
      c890c359
  2. Mar 16, 2016
    • Jakob Odersky's avatar
      [SPARK-13118][SQL] Expression encoding for optional synthetic classes · 7eef2463
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      
      Fix expression generation for optional types.
      Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases.
      
      This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR).
      
      ## How was this patch tested?
      A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #11708 from jodersky/SPARK-13118-package-objects.
      7eef2463
    • Davies Liu's avatar
      [SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen · c100d31d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      We need to copy the UnsafeRow since a Join could produce multiple rows from a single input row. We can avoid that copy if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.
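
      A rough sketch of the idea (hypothetical names, not the generated code):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.UnsafeRow

      // Copy defensively only when some operator in the generated stage (e.g. a join)
      // can emit more than one output row per input row; otherwise reuse the buffer.
      def maybeCopy(row: UnsafeRow, stageMayProduceMultipleRows: Boolean): UnsafeRow =
        if (stageMayProduceMultipleRows) row.copy() else row
      ```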
      
      Updated the benchmark for `collect`; it shows a 20-30% speedup.
      
      ## How was this patch tested?
      
      existing unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11740 from davies/avoid_copy2.
      c100d31d
    • hyukjinkwon's avatar
      [SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same field · 917f4000
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      PR https://github.com/apache/spark/pull/2400 added support for parsing JSON rows wrapped in an array. However, this throws an exception when the given data contains array data and struct data in the same field, as below:
      
      ```json
      {"a": {"b": 1}}
      {"a": []}
      ```
      
      and the schema is given as below:
      
      ```scala
      val schema =
        StructType(
          StructField("a", StructType(
            StructField("b", StringType) :: Nil
          )) :: Nil)
      ```
      
      - **Before**
      
      ```scala
      sqlContext.read.schema(schema).json(path).show()
      ```
      
      ```scala
      Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
      	at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
      ...
      ```
      
      - **After**
      
      ```scala
      sqlContext.read.schema(schema).json(path).show()
      ```
      
      ```bash
      +----+
      |   a|
      +----+
      | [1]|
      |null|
      +----+
      ```
      
      For other data types, the given values are converted to `null` in such cases, but only this case emits an exception.
      
      This PR makes the support for wrapped rows applied only at the top level.
      
      ## How was this patch tested?
      
      Unit tests were used, and `./dev/run_tests` was run for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.
      917f4000
    • Andrew Or's avatar
      [SPARK-13923][SQL] Implement SessionCatalog · ca9ef86c
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.
      
      A recent patch, #11573, makes Spark parse Hive commands itself, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.
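
      A rough sketch of the delegation pattern (hypothetical names and signatures, not the actual `SessionCatalog` API):

      ```scala
      import scala.collection.mutable
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

      // Stand-in for ExternalCatalog: only the lookup used below is modelled.
      trait MetastoreLookup {
        def lookupTable(db: String, table: String): Option[LogicalPlan]
      }

      class SessionCatalogSketch(external: MetastoreLookup, currentDb: String) {
        // Temporary tables are session-local state.
        private val tempTables = mutable.HashMap.empty[String, LogicalPlan]

        def registerTempTable(name: String, plan: LogicalPlan): Unit =
          tempTables(name) = plan

        // Temp tables shadow metastore tables when no database is given; everything
        // else is delegated to the external (metastore) catalog.
        def lookupRelation(db: Option[String], table: String): Option[LogicalPlan] = db match {
          case None    => tempTables.get(table).orElse(external.lookupTable(currentDb, table))
          case Some(d) => external.lookupTable(d, table)
        }
      }
      ```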
      
      ## How was this patch tested?
      
      800+ lines of tests in `SessionCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11750 from andrewor14/temp-catalog.
      ca9ef86c
    • Yuhao Yang's avatar
      [SPARK-13761][ML] Deprecate validateParams · 92b70576
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Deprecate validateParams() method here: https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553
      Move all functionality in overridden methods to transformSchema().
      Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema.
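
      A hedged illustration of that guidance (hypothetical parameters, not a real Spark transformer): checks between interacting Params now live where the schema is validated.

      ```scala
      import org.apache.spark.sql.types.StructType

      class ScalerSketch(minRatio: Double, maxRatio: Double) {
        // The two "params" interact, so their consistency is verified in
        // transformSchema instead of a deprecated validateParams() override.
        def transformSchema(schema: StructType): StructType = {
          require(minRatio <= maxRatio,
            s"minRatio ($minRatio) must not exceed maxRatio ($maxRatio)")
          schema  // the schema itself is unchanged in this toy example
        }
      }
      ```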
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #11620 from hhbyyh/depreValid.
      92b70576
    • Jakob Odersky's avatar
      [SPARK-11011][SQL] Narrow type of UDT serialization · d4d84936
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      
      Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type.
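
      A hedged sketch of the narrowed signature for a made-up user type (`Point2D` and its UDT below are illustrative only):

      ```scala
      import org.apache.spark.sql.catalyst.InternalRow
      import org.apache.spark.sql.types._

      case class Point2D(x: Double, y: Double)

      class Point2DUDT extends UserDefinedType[Point2D] {
        override def sqlType: DataType =
          StructType(StructField("x", DoubleType) :: StructField("y", DoubleType) :: Nil)

        // Previously `def serialize(obj: Any): Any`; the parameter is now the user type.
        override def serialize(p: Point2D): Any = InternalRow(p.x, p.y)

        override def deserialize(datum: Any): Point2D = datum match {
          case row: InternalRow => Point2D(row.getDouble(0), row.getDouble(1))
        }

        override def userClass: Class[Point2D] = classOf[Point2D]
      }
      ```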
      
      ## How was this patch tested?
      
      Existing tests were successfully run on local machine.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #11379 from jodersky/SPARK-11011-udt-types.
      d4d84936
    • Sameer Agarwal's avatar
      [SPARK-13869][SQL] Remove redundant conditions while combining filters · 77ba3021
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      **[I'll link it to the JIRA once ASF JIRA is back online]**
      
      This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to `table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of `table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`).
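
      A simplified sketch of the de-duplication idea, assuming catalyst's `ExpressionSet` for semantics-aware de-duplication (this is not the actual rule):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.{And, Expression, ExpressionSet}

      // Split both filter conditions into conjuncts, drop semantic duplicates, and
      // AND the remaining predicates back together.
      def combineConditions(top: Expression, bottom: Expression): Expression = {
        def splitConjuncts(e: Expression): Seq[Expression] = e match {
          case And(left, right) => splitConjuncts(left) ++ splitConjuncts(right)
          case other            => Seq(other)
        }
        ExpressionSet(splitConjuncts(top) ++ splitConjuncts(bottom)).reduceLeft(And(_, _))
      }
      ```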
      
      ## How was this patch tested?
      
      Unit test in `FilterPushdownSuite`
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11670 from sameeragarwal/combine-filters.
      77ba3021
    • Sameer Agarwal's avatar
      [SPARK-13871][SQL] Support for inferring filters from data constraints · f96997ba
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints`, a rule that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things (a small example follows this list):
      
      (a) no redundant filters are generated, and
      (b) filters that do not contribute to any further optimizations are not generated.
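
      A hedged example of the intended inference, written with catalyst's test DSL (the exact form produced by the rule may differ):

      ```scala
      import org.apache.spark.sql.catalyst.dsl.expressions._
      import org.apache.spark.sql.catalyst.dsl.plans._
      import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

      val r = LocalRelation('a.int, 'b.int)

      // From the constraint 'a === 'b the rule can infer that both sides are non-null,
      // so the original filter ...
      val original = r.where('a === 'b)
      // ... becomes equivalent to a plan that filters nulls explicitly:
      val withInferredFilters = r.where('a.isNotNull && 'b.isNotNull && ('a === 'b))
      ```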
      
      ## How was this patch tested?
      
      Extended all tests in `InferFiltersFromConstraintsSuite` (initially based on `NullFilteringSuite`) to test filter inference in `Filter` and `Join` operators.
      
      In particular, the two tests (`single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys`) attempt to highlight/test the real potential of this rule for join optimization.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11665 from sameeragarwal/infer-filters.
      f96997ba
    • Sameer Agarwal's avatar
      [SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader · b90c0206
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      It's common for many SQL operators to not care about reading `null` values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in the vectorized parquet reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls).
      
      ## How was this patch tested?
      
              Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
              String with Nulls Scan (0%):        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
              -------------------------------------------------------------------------------------------
              SQL Parquet Vectorized                   1229 / 1648          8.5         117.2       1.0X
              PR Vectorized                             833 /  846         12.6          79.4       1.5X
              PR Vectorized (Null Filtering)            732 /  782         14.3          69.8       1.7X
      
              Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
              String with Nulls Scan (50%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
              -------------------------------------------------------------------------------------------
              SQL Parquet Vectorized                    995 / 1053         10.5          94.9       1.0X
              PR Vectorized                             732 /  772         14.3          69.8       1.4X
              PR Vectorized (Null Filtering)            725 /  790         14.5          69.1       1.4X
      
              Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
              String with Nulls Scan (95%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
              -------------------------------------------------------------------------------------------
              SQL Parquet Vectorized                    326 /  333         32.2          31.1       1.0X
              PR Vectorized                             190 /  200         55.1          18.2       1.7X
              PR Vectorized (Null Filtering)            168 /  172         62.2          16.1       1.9X
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11749 from sameeragarwal/perf-testing.
      b90c0206
    • Dongjoon Hyun's avatar
      [SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x · 4ce2d24e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following seem to be the only legacy sections. For Spark 2.x, we had better clean up those docs.
      
      **Migration Guide**
      ```
      - ## Migration Guide for Shark Users
      - ...
      - ### Scheduling
      - ...
      - ### Reducer number
      - ...
      - ### Caching
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11770 from dongjoon-hyun/SPARK-13942.
      4ce2d24e
    • GayathriMurali's avatar
      [SPARK-13034] PySpark ml.classification support export/import · 27e1f388
      GayathriMurali authored
      ## What changes were proposed in this pull request?
      
      Add export/import for all estimators and transformers (which have a Scala implementation) under pyspark/ml/classification.py.
      
      ## How was this patch tested?
      
      ./python/run-tests
      ./dev/lint-python
      Unit tests added to check persistence in Logistic Regression
      
      Author: GayathriMurali <gayathri.m.softie@gmail.com>
      
      Closes #11707 from GayathriMurali/SPARK-13034.
      27e1f388
    • Xiangrui Meng's avatar
      [SPARK-13927][MLLIB] add row/column iterator to local matrices · 85c42fda
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly.
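
      A usage sketch, assuming the new iterators are exposed as `rowIter`/`colIter` on `Matrix`:

      ```scala
      import org.apache.spark.mllib.linalg.{Matrices, Vector}

      // 2 x 3 dense matrix, values given in column-major order.
      val m = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

      val rows: Iterator[Vector] = m.rowIter   // two row vectors of length 3
      val cols: Iterator[Vector] = m.colIter   // three column vectors of length 2
      ```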
      
      ## How was this patch tested?
      
      Unit tests on sparse and dense matrix.
      
      cc: dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #11757 from mengxr/SPARK-13927.
      85c42fda
    • Joseph K. Bradley's avatar
      [SPARK-11888][ML] Decision tree persistence in spark.ml · 6fc2b654
      Joseph K. Bradley authored
      ### What changes were proposed in this pull request?
      
      Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
      * The shared implementation is in treeModels.scala
      * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.
      
      Other changes:
      * Made CategoricalSplit.numCategories public (to use in persistence)
      * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument.  This caused an error in LDASuite, which I fixed.
      
      ### How was this patch tested?
      
      Persistence is tested via unit tests.  For each algorithm, there are 2 non-trivial trees (depth 2).  One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11581 from jkbradley/dt-io.
      6fc2b654
    • Yanbo Liang's avatar
      [SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format · 3f06eb72
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Provide ignored test cases to export the test dataset into CSV format in `LinearRegressionSuite`, `LogisticRegressionSuite`, `AFTSurvivalRegressionSuite` and `GeneralizedLinearRegressionSuite`, so users can validate the training accuracy compared with R's glm, glmnet and survival packages.
      cc mengxr
      ## How was this patch tested?
      The test suite is ignored, but I have enabled all these cases offline and it works as expected.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11463 from yanboliang/spark-13613.
      3f06eb72
    • Xusen Yin's avatar
      [SPARK-13038][PYSPARK] Add load/save to pipeline · ae6c677c
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
      
      1. Add load/save to PySpark Pipeline and PipelineModel
      
      2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
      
      ## How was this patch tested?
      
      Test with doctest.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #11683 from yinxusen/SPARK-13038-only.
      ae6c677c
    • gatorsmile's avatar
      [SPARK-12721][SQL] SQL Generation for Script Transformation · c4bd5760
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR converts analyzed logical plans containing the operator `ScriptTransformation` back to SQL.
      
      For example, below is the SQL containing `Transform`
      ```
      SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
      ```
      
      Its logical plan is like
      ```
      ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
      +- SubqueryAlias parquet_t2
         +- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
      ```
      
      The generated SQL will be like
      ```
      SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2`
      ```
      #### How was this patch tested?
      
      Seven test cases are added to `LogicalPlanToSQLSuite`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11503 from gatorsmile/transformToSQL.
      c4bd5760
    • Wenchen Fan's avatar
      [SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs... · 1d1de28a
      Wenchen Fan authored
      [SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs while generate SQL string
      
      ## What changes were proposed in this pull request?
      
      This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into a SQL string and put it after the FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the `Join` operator), and this kind of plan can't be put under a subquery, as we would erase and assign a new qualifier to all outputs, making it impossible to distinguish same-name outputs.
      
      To solve this problem, this PR renames all attributes with globally unique names (using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
      
      For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
      ```
      SELECT sq_1.key, sq_1.max
      FROM (
          SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
          FROM (
              SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
          ) AS sq_0
      ) AS sq_1
      ```
      You can see, the `key` columns become ambiguous after `sq_0`.
      
      After this PR, it will generate something like:
      ```
      SELECT attr_30 AS key, attr_37 AS max
      FROM (
          SELECT attr_30, attr_37
          FROM (
              SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
              FROM (
                  SELECT attr_30, attr_35 FROM
                      (SELECT key AS attr_30 FROM t1) AS sq_0
                  INNER JOIN
                      (SELECT key AS attr_35 FROM t1) AS sq_1
              ) AS sq_2
          ) AS sq_3
      ) AS sq_4
      ```
      The outermost SELECT turns the generated names back into real names, and the innermost SELECT aliases real columns to our generated names. Between them, there is no name ambiguity anymore.
      
      ## How was this patch tested?
      
      existing tests and new tests in LogicalPlanToSQLSuite.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11658 from cloud-fan/gensql.
      1d1de28a
    • Zheng RuiFeng's avatar
      [SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx · 91984978
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13816
      
      ## What changes were proposed in this pull request?
      
      Add parameter checks for algorithms in GraphX: Pregel, LabelPropagation, PageRank, SVDPlusPlus.
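
      A hedged illustration of the kind of checks added (exact messages and locations may differ):

      ```scala
      // e.g. for PageRank: the iteration count and reset probability must be sensible.
      def validatePageRankParams(numIter: Int, resetProb: Double): Unit = {
        require(numIter > 0,
          s"Number of iterations must be greater than 0, but got $numIter")
        require(resetProb >= 0 && resetProb <= 1,
          s"Random reset probability must belong to [0, 1], but got $resetProb")
      }
      ```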
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11655 from zhengruifeng/graphx_param_check.
      91984978
    • Cheng Hao's avatar
      [SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet · d9670f84
      Cheng Hao authored
      ## What changes were proposed in this pull request?
      https://issues.apache.org/jira/browse/SPARK-13894
      Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
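
      A usage sketch after the change (assuming `sqlContext` is an existing SQLContext; the element type is left implicit here):

      ```scala
      // range now returns a Dataset, so typed operations are available directly.
      val ds = sqlContext.range(0, 1000)
      val evens = ds.filter(n => n % 2 == 0).count()
      ```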
      
      ## How was this patch tested?
      No additional unit test required.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #11730 from chenghao-intel/range.
      d9670f84
    • Wenchen Fan's avatar
      [SPARK-13924][SQL] officially support multi-insert · d9e8f26d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      There is a feature of hive SQL called multi-insert. For example:
      ```
      FROM src
      INSERT OVERWRITE TABLE dest1
      SELECT key + 1
      INSERT OVERWRITE TABLE dest2
      SELECT key WHERE key > 2
      INSERT OVERWRITE TABLE dest3
      SELECT col EXPLODE(arr) exp AS col
      ...
      ```
      
      We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE.
      
      This PR removes these limitations and makes multi-insert fully supported.
      
      ## How was this patch tested?
      
      new tests in `SQLQuerySuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11754 from cloud-fan/lateral-view.
      d9e8f26d
    • Jeff Zhang's avatar
      [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON… · eacd9d8e
      Jeff Zhang authored
      … is not picked up in yarn-cluster mode
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #11238 from zjffdu/SPARK-13360.
      eacd9d8e
    • Wesley Tang's avatar
      [SPARK-13281][CORE] Switch broadcast of RDD to exception from warning · 5f6bdf97
      Wesley Tang authored
      ## What changes were proposed in this pull request?
      
      In SparkContext, throw IllegalArgumentException when trying to broadcast an RDD directly, instead of just logging a warning.
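
      A hedged sketch of the new behaviour (assuming `sc` is an existing SparkContext):

      ```scala
      val rdd = sc.parallelize(1 to 10)
      try {
        sc.broadcast(rdd)   // previously this only logged a warning
      } catch {
        case e: IllegalArgumentException =>
          println(s"Cannot broadcast an RDD directly: ${e.getMessage}")
      }
      ```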
      
      ## How was this patch tested?
      
      mvn clean install
      Add UT in BroadcastSuite
      
      Author: Wesley Tang <tangmingjun@mininglamp.com>
      
      Closes #11735 from breakdawn/master.
      5f6bdf97
    • Sean Owen's avatar
      [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to... · 9412547e
      Sean Owen authored
      [SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines
      
      ## What changes were proposed in this pull request?
      
      I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console
      
      ```
      testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 0.023 sec  <<< FAILURE!
      java.lang.AssertionError: expected:<app-id> but was:<null>
      	at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
      ```
      
      However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences.
      
      This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway.
      
      Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence:
      
      - First attempt to acquire semaphore for 10ms actually silently times out
      - The change being waited for happens just after that, a bit too late
      - Assertion passes since condition became true just in time
      - `release()` fires from the listener
      - The next `tryAcquire` then succeeds immediately (the permit released above was never consumed, since the first `tryAcquire` acquired nothing), but the condition checked after it is not yet true; this would explain why the second check always fails
      
      Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect really slow response.) The timeout is increased to a second to allow plenty of time anyway.
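
      A small sketch of the resulting test pattern (simplified, not the actual `LauncherServerSuite` code):

      ```scala
      import java.util.concurrent.{Semaphore, TimeUnit}

      val semaphore = new Semaphore(0)

      // The listener releases a permit when the awaited event fires.
      def onAppIdReceived(): Unit = semaphore.release()

      // The test now asserts that the acquire actually happened instead of silently
      // timing out; the generous timeout only guards against hangs on slow machines.
      assert(semaphore.tryAcquire(1, TimeUnit.SECONDS), "expected event within 1 second")
      ```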
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11763 from srowen/SPARK-13823.3.
      9412547e
    • Carson Wang's avatar
      [SPARK-13889][YARN] Fix integer overflow when calculating the max number of executor failure · 496d2a2b
      Carson Wang authored
      ## What changes were proposed in this pull request?
      The max number of executor failures before failing the application defaults to twice the maximum number of executors when dynamic allocation is enabled. The default value for "spark.dynamicAllocation.maxExecutors" is Int.MaxValue, so doubling it overflows and yields a wrong result: the calculated default max number of executor failures ends up being only 3. This PR adds a check to avoid the overflow.
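
      A hedged sketch of the guard (assuming `sparkConf` is the application's SparkConf; the config key is taken from the description above):

      ```scala
      val maxNumExecutors =
        sparkConf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)

      // Doubling Int.MaxValue overflows, so cap the default instead of multiplying blindly.
      val defaultMaxNumExecutorFailures =
        if (maxNumExecutors > Int.MaxValue / 2) Int.MaxValue else maxNumExecutors * 2
      ```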
      
      ## How was this patch tested?
      It tests whether the value is greater than Int.MaxValue / 2 to avoid the overflow when it is multiplied by 2.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #11713 from carsonwang/IntOverflow.
      496d2a2b
    • Tejas Patil's avatar
      [SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while reading parent RDD · 1d95fb67
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      PipedRDD creates a child thread to read the output of the parent stage and feed it to the pipe process. This patch uses a variable to save any exception thrown in the child thread, and the main thread then propagates that exception if the variable was set.
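
      A hedged sketch of the pattern (not the actual PipedRDD code):

      ```scala
      // The child thread that feeds the pipe records any failure; the main thread
      // checks the flag and rethrows, so the task fails (and can be retried) instead
      // of silently losing the error.
      class PipeFeeder(feedPipe: () => Unit) {
        @volatile private var childThreadException: Throwable = null

        private val writerThread = new Thread("pipe stdin writer") {
          override def run(): Unit =
            try feedPipe() catch { case t: Throwable => childThreadException = t }
        }

        def start(): Unit = writerThread.start()

        // Called from the main thread while consuming the pipe's output.
        def propagateChildException(): Unit = {
          val t = childThreadException
          if (t != null) throw t
        }
      }
      ```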
      
      ## How was this patch tested?
      
      - Added a unit test
      - Ran all the existing tests in PipedRDDSuite and they all pass with the change
      - Tested the patch with a real pipe() job, bounced the executor node which ran the parent stage to simulate a fetch failure, and observed that the parent stage was re-run.
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #11628 from tejasapatil/pipe_rdd.
      1d95fb67
    • GayathriMurali's avatar
      [SPARK-13396] Stop using our internal deprecated .metrics on Exceptio… · 56d88247
      GayathriMurali authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13396
      
      Stop using our internal deprecated .metrics on ExceptionFailure; use accumUpdates instead.
      
      Author: GayathriMurali <gayathri.m.softie@gmail.com>
      
      Closes #11544 from GayathriMurali/SPARK-13396.
      56d88247
    • Sean Owen's avatar
      [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up · 3b461d9e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Follow up to https://github.com/apache/spark/pull/11657
      
      - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` (see the one-line sketch after this list)
      - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
      - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
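
      For the charset change, a one-line sketch of the preferred form:

      ```scala
      import java.nio.charset.StandardCharsets

      val bytes = "hello".getBytes(StandardCharsets.UTF_8)  // instead of getBytes("UTF-8")
      ```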
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11725 from srowen/SPARK-13823.2.
      3b461d9e
    • Yonathan Randolph's avatar
      [SPARK-13906] Ensure that there are at least 2 dispatcher threads. · 05ab2948
      Yonathan Randolph authored
      ## What changes were proposed in this pull request?
      
      Force at least two dispatcher-event-loop threads. Since SparkDeploySchedulerBackend (in AppClient) calls askWithRetry on CoarseGrainedScheduler in the same process, the driver needs at least two dispatcher threads to prevent the dispatcher thread from hanging.
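
      A hedged sketch of the sizing logic (simplified; the real code also honours configuration):

      ```scala
      // Never create fewer than two dispatcher threads, so that an RPC sent to an
      // endpoint in the same process (ask + reply) cannot deadlock a single-threaded
      // dispatcher.
      val availableCores = Runtime.getRuntime.availableProcessors()
      val numDispatcherThreads = math.max(2, availableCores)
      ```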
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Yonathan Randolph <yonathan@gmail.com>
      
      Author: Yonathan Randolph <yonathan@liftigniter.com>
      
      Closes #11728 from yonran/SPARK-13906.
      05ab2948
    • Dongjoon Hyun's avatar
      [SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementError during reflection" · 431a3d04
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is to re-enable a regression test.
      Historically, the target regression test was added by [SPARK-8498](https://github.com/apache/spark/commit/093c34838d1db7a9375f36a9a2ab5d96a23ae683), but was temporarily disabled by [SPARK-12615](https://github.com/apache/spark/commit/8ce645d4eeda203cf5e100c4bdba2d71edd44e6a) due to a binary compatibility error.
      
      The following is the current error message when submitting a Spark job with the pre-built `test.jar` file used in the target regression test.
      ```
      Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map;
      ```
      
      Simply rebuilding `test.jar` cannot restore the purpose of the test case, since we need to support both Scala 2.10 and 2.11 for a while. For example, we will face the following Scala 2.11 error if we use a `test.jar` built with Scala 2.10.
      ```
      Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
      ```
      
      This PR replaces the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar` and improves the regression test to use the suitable jar file.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11744 from dongjoon-hyun/SPARK-12653.
      431a3d04
    • hyukjinkwon's avatar
      [SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV data source · 92024797
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13899
      
      This PR makes CSV data source produce `InternalRow` instead of `Row`.
      
      Basically, this resembles JSON data source. It uses the same codes for casting.
      
      ## How was this patch tested?
      
      Unit tests were run within the IDE, and code style was checked by `./dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11717 from HyukjinKwon/SPARK-13899.
      92024797
    • Dongjoon Hyun's avatar
      [SPARK-13920][BUILD] MIMA checks should apply to @Experimental and @DeveloperAPI APIs · 3c578c59
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      We are able to change `Experimental` and `DeveloperAPI` APIs freely, but we should also monitor and manage those APIs carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables MiMa checks and adds filters for them.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including MiMa).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11751 from dongjoon-hyun/SPARK-13920.
      3c578c59
    • Yanbo Liang's avatar
      [SPARK-9837][ML] R-like summary statistics for GLMs via iteratively reweighted least squares · 3665294d
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Provide R-like summary statistics for GLMs via iteratively reweighted least squares.
      ## How was this patch tested?
      unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11694 from yanboliang/spark-9837.
      3665294d
    • Davies Liu's avatar
      [SPARK-13917] [SQL] generate broadcast semi join · 421f6c20
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR brings codegen support for broadcast left-semi join.
      
      ## How was this patch tested?
      
      Existing tests. Added a benchmark; the result shows a 7X speedup.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11742 from davies/gen_semi.
      421f6c20
  3. Mar 15, 2016
    • Yucai Yu's avatar
      [MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWithoutCodegen · 52b6a899
      Yucai Yu authored
      ## What changes were proposed in this pull request?
      
      Remove the wrong "expected" parameter in MathFunctionsSuite.scala's checkNaNWithoutCodegen.
      This function checks NaN values, so the "expected" parameter is useless. Callers do not pass an "expected" value, and the similar functions checkNaNWithGeneratedProjection and checkNaNWithOptimization do not use it either.
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #11718 from yucai/unused_expected.
      52b6a899
    • Davies Liu's avatar
      [SPARK-13918][SQL] Merge SortMergeJoin and SortMergeOuterJoin · bbd887f5
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR just moves some code from SortMergeOuterJoin into SortMergeJoin.
      
      This is in preparation for supporting codegen for outer joins.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11743 from davies/gen_smjouter.
      bbd887f5