  1. Apr 28, 2017
    • Bill Chambers's avatar
      [SPARK-20496][SS] Bug in KafkaWriter Looks at Unanalyzed Plans · 733b81b8
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      We didn't enforce analyzed plans in Spark 2.1 when writing out to Kafka.
      
      ## How was this patch tested?
      
      New unit test.
      
      Author: Bill Chambers <bill@databricks.com>
      
      Closes #17804 from anabranch/SPARK-20496-2.
      733b81b8
    • hyukjinkwon's avatar
      [SPARK-20465][CORE] Throws a proper exception when any temp directory could not be got · 8c911ada
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to throw an exception with a better message, rather than `ArrayIndexOutOfBoundsException`, when temp directories cannot be created.
      
      Running the commands below:
      
      ```bash
      ./bin/spark-shell --conf spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
      ```
      
      produces ...
      
      **Before**
      
      ```
      Exception in thread "main" java.lang.ExceptionInInitializerError
              ...
      Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
              ...
      ```
      
      **After**
      
      ```
      Exception in thread "main" java.lang.ExceptionInInitializerError
              ...
      Caused by: java.io.IOException: Failed to get a temp directory under [/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO].
              ...
      ```
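
      A minimal sketch of the idea, not the actual patch: taking the first element of an empty array of candidate directories is what surfaces as `ArrayIndexOutOfBoundsException: 0`, whereas checking for the empty case allows a descriptive `IOException`. The names `TempDirSketch` and `firstUsableDir` are hypothetical, and for simplicity the sketch treats a directory as usable only if it already exists.

      ```scala
      import java.io.{File, IOException}

      object TempDirSketch {
        // Hypothetical helper: returns the first configured directory that exists,
        // or fails with a message naming every directory that was tried.
        def firstUsableDir(configured: Seq[String]): File = {
          val usable = configured.map(new File(_)).filter(_.isDirectory)
          usable.headOption.getOrElse {
            throw new IOException(
              s"Failed to get a temp directory under [${configured.mkString(",")}].")
          }
        }
      }
      ```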
      
      ## How was this patch tested?
      
      Unit tests in `LocalDirsSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17768 from HyukjinKwon/throws-temp-dir-exception.
      8c911ada
    • Takeshi Yamamuro's avatar
      [SPARK-14471][SQL] Aliases in SELECT could be used in GROUP BY · 59e3a564
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds a new rule in `Analyzer` to resolve aliases in `GROUP BY`.
      The current master throws an exception if a `GROUP BY` clause refers to an alias defined in `SELECT`:
      ```
      scala> spark.sql("select a a1, a1 + 1 as b, count(1) from t group by a1")
      org.apache.spark.sql.AnalysisException: cannot resolve '`a1`' given input columns: [a]; line 1 pos 51;
      'Aggregate ['a1], [a#83L AS a1#87L, ('a1 + 1) AS b#88, count(1) AS count(1)#90L]
      +- SubqueryAlias t
         +- Project [id#80L AS a#83L]
            +- Range (0, 10, step=1, splits=Some(8))
      
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
      ```
      
      ## How was this patch tested?
      Added tests in `SQLQuerySuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17191 from maropu/SPARK-14471.
      59e3a564
    • Xiao Li's avatar
      [SPARK-20476][SQL] Block users to create a table that use commas in the column names · e3c81604
      Xiao Li authored
      ### What changes were proposed in this pull request?
      ```SQL
      hive> create table t1(`a,` string);
      OK
      Time taken: 1.399 seconds
      
      hive> create table t2(`a,` string, b string);
      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!)
      
      hive> create table t2(`a,` string, b string) stored as parquet;
      FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: ParquetHiveSerde initialization failed. Number of column name and column type differs. columnNames = [a, , b], columnTypes = [string, string]
      ```
      The behavior above is a bug in the Hive metastore.

      When users do not provide an alias name in the SELECT query, we call `toPrettySQL` to generate the alias name. For example, the string `get_json_object(jstring, '$.f1')` will be the alias name for the function call in the statement
      ```SQL
      SELECT key, get_json_object(jstring, '$.f1') FROM tempView
      ```
      This is not an issue for SELECT query statements. However, for CTAS, we hit the issue because of the Hive metastore bug: the metastore does not accept column names containing commas and returns a confusing error message, such as:
      ```
      17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
      org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements while columns.types has 1 elements!
      ```
      
      Thus, this PR blocks users from creating a table in the Hive metastore when the table has a column whose name contains a comma.
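
      The check described above could look roughly like the following sketch. It is not the code from the PR: `ColumnNameCheckSketch`, `InvalidColumnName`, and `verifyColumnNames` are hypothetical names, and the error text is made up for illustration.

      ```scala
      object ColumnNameCheckSketch {
        // Hypothetical stand-in for the analysis error Spark would raise.
        final case class InvalidColumnName(message: String) extends Exception(message)

        // Rejects any column name containing a comma before the table definition
        // would be handed to the metastore.
        def verifyColumnNames(tableName: String, columnNames: Seq[String]): Unit = {
          columnNames.find(_.contains(",")).foreach { bad =>
            throw InvalidColumnName(
              s"Cannot create table $tableName: column name '$bad' contains a comma")
          }
        }
      }
      ```

      With such a check, a CTAS whose generated alias is `get_json_object(jstring, '$.f1')` would fail early with a clear message instead of reaching the SerDe error shown above.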
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17781 from gatorsmile/blockIllegalColumnNames.
      e3c81604
    • wangmiao1981's avatar
      [SPARKR][DOC] Document LinearSVC in R programming guide · 7fe82497
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      Add a link to svmLinear in the SparkR programming guide.
      
      ## How was this patch tested?
      
      Built the docs manually and clicked the link to the document. It looks good.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17797 from wangmiao1981/doc.
      7fe82497
  2. Apr 27, 2017
    • Wenchen Fan's avatar
      [SPARK-12837][CORE] Do not send the name of internal accumulator to executor side · b90bf520
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When sending accumulator updates back to the driver, the network overhead is significant because there are many accumulators: `TaskMetrics` sends about 20 accumulators every time, and there may be many `SQLMetric` instances if the query plan is complicated.

      Therefore, it's critical to reduce the size of serialized accumulators. A simple way is to not send the names of internal accumulators to the executor side, as they are unnecessary there. When an executor sends accumulator updates back to the driver, we can easily look up the accumulator name in `AccumulatorContext`. Note that we still need to send the names of normal accumulators, as user code running on the executor side may rely on accumulator names.
      
      In the future, we should reimplement `TaskMetrics` to not rely on accumulators and use custom serialization.
      
      On the example in https://issues.apache.org/jira/browse/SPARK-12837, the size of the serialized accumulators was cut down by about 40%.
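
      The idea can be illustrated with a simplified model. This is a sketch under assumptions, not Spark's implementation: `AccMeta`, `Registry`, and `forExecutor` are hypothetical names, with `Registry` standing in for the driver-side lookup that the description attributes to `AccumulatorContext`.

      ```scala
      object AccumulatorNameSketch {
        // Simplified, hypothetical model of accumulator metadata.
        final case class AccMeta(id: Long, name: Option[String], internal: Boolean)

        // Driver-side registry: keeps the full metadata, keyed by accumulator id.
        final class Registry {
          private val byId = scala.collection.mutable.Map.empty[Long, AccMeta]
          def register(meta: AccMeta): Unit = { byId.update(meta.id, meta) }
          def nameOf(id: Long): Option[String] = byId.get(id).flatMap(_.name)
        }

        // What would be shipped to executors: internal accumulators lose their
        // name, user accumulators keep it because user code may read it.
        def forExecutor(meta: AccMeta): AccMeta =
          if (meta.internal) meta.copy(name = None) else meta
      }
      ```

      When an update for a nameless internal accumulator comes back, the driver can recover the name from the registry by id.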
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17596 from cloud-fan/oom.
      b90bf520
    • Shixiong Zhu's avatar
      [SPARK-20452][SS][KAFKA] Fix a potential ConcurrentModificationException for batch Kafka DataFrame · 823baca2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      If a batch Kafka query is cancelled but one of its tasks cannot be cancelled, rerunning the same DataFrame may cause a ConcurrentModificationException, because it may launch two tasks sharing the same group id.

      This PR always creates a new consumer when `reuseKafkaConsumer = false` to avoid the ConcurrentModificationException. It also contains other minor fixes.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17752 from zsxwing/kafka-fix.
      823baca2
    • Shixiong Zhu's avatar
      [SPARK-20461][CORE][SS] Use UninterruptibleThread for Executor and fix the potential hang in CachedKafkaConsumer · 01c999e7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR changes Executor's threads to `UninterruptibleThread` so that we can use `runUninterruptibly` in `CachedKafkaConsumer`. However, this is only a best-effort measure to avoid hanging forever. If the user uses `CachedKafkaConsumer` in another thread (e.g., by creating a new thread or a Future), the potential hang may still happen.
      
      ## How was this patch tested?
      
      The newly added test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17761 from zsxwing/int.
      01c999e7
    • Yanbo Liang's avatar
      [SPARK-20047][ML] Constrained Logistic Regression · 606432a1
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      MLlib ```LogisticRegression``` should support bound-constrained optimization (only for L2 regularization). Users can add bound constraints to coefficients so that the solver produces a solution in the specified range.

      Under the hood, we call Breeze [```L-BFGS-B```](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGSB.scala) as the solver for bound-constrained optimization. However, the current Breeze implementation of L-BFGS-B has some bugs, which https://github.com/scalanlp/breeze/pull/633 fixes. We need to upgrade the Breeze dependency later; for now, this PR temporarily uses a workaround L-BFGS-B for review.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17715 from yanboliang/spark-20047.
      606432a1
    • Davis Shepherd's avatar
      [SPARK-20483][MINOR] Test for Mesos Coarse mode may starve other Mesos frameworks · 039e32ca
      Davis Shepherd authored
      ## What changes were proposed in this pull request?
      
      Add test case for scenarios where executor.cores is set as a
      (non)divisor of spark.cores.max
      This tests the change in
      #17786
      
      ## How was this patch tested?
      
      Ran the existing test suite with the new tests
      
      dbtsai
      
      Author: Davis Shepherd <dshepherd@netflix.com>
      
      Closes #17788 from dgshep/add_mesos_test.
      039e32ca
    • Tejas Patil's avatar
      [SPARK-20487][SQL] `HiveTableScan` node is quite verbose in explained plan · a4aa4665
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Changed `TreeNode.argString` to handle `CatalogTable` separately (otherwise it would call the default `toString` on the `CatalogTable`)
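
      As a rough illustration of the special-casing (a sketch with hypothetical, heavily simplified types, not Spark's `TreeNode` or `CatalogTable`): when rendering a node's arguments, a table is reduced to its quoted identifier instead of its full default `toString`.

      ```scala
      object ArgStringSketch {
        // Hypothetical stand-ins for a table identifier and a catalog table.
        final case class TableId(database: String, table: String) {
          def quoted: String = s"`$database`.`$table`"
        }
        final case class CatalogTableLike(id: TableId, owner: String, location: String)

        // Renders plan-node arguments; tables are printed as their identifier only.
        def argString(args: Seq[Any]): String =
          args.map {
            case t: CatalogTableLike => t.id.quoted
            case other => other.toString
          }.mkString(", ")
      }
      ```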
      
      ## How was this patch tested?
      
      - Expanded scope of existing unit test to ensure that verbose information is not present
      - Manual testing
      
      Before
      
      ```
      scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
      == Parsed Logical Plan ==
      'Project [*]
      +- 'Filter ('name = foo)
         +- 'UnresolvedRelation `my_table`
      
      == Analyzed Logical Plan ==
      user_id: bigint, name: string, ds: string
      Project [user_id#13L, name#14, ds#15]
      +- Filter (name#14 = foo)
         +- SubqueryAlias my_table
            +- CatalogRelation CatalogTable(
      Database: default
      Table: my_table
      Owner: tejasp
      Created: Fri Apr 14 17:05:50 PDT 2017
      Last Access: Wed Dec 31 16:00:00 PST 1969
      Type: MANAGED
      Provider: hive
      Properties: [serialization.format=1]
      Statistics: 9223372036854775807 bytes
      Location: file:/tmp/warehouse/my_table
      Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat: org.apache.hadoop.mapred.TextInputFormat
      OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Partition Provider: Catalog
      Partition Columns: [`ds`]
      Schema: root
      -- user_id: long (nullable = true)
      -- name: string (nullable = true)
      -- ds: string (nullable = true)
      ), [user_id#13L, name#14], [ds#15]
      
      == Optimized Logical Plan ==
      Filter (isnotnull(name#14) && (name#14 = foo))
      +- CatalogRelation CatalogTable(
      Database: default
      Table: my_table
      Owner: tejasp
      Created: Fri Apr 14 17:05:50 PDT 2017
      Last Access: Wed Dec 31 16:00:00 PST 1969
      Type: MANAGED
      Provider: hive
      Properties: [serialization.format=1]
      Statistics: 9223372036854775807 bytes
      Location: file:/tmp/warehouse/my_table
      Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat: org.apache.hadoop.mapred.TextInputFormat
      OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Partition Provider: Catalog
      Partition Columns: [`ds`]
      Schema: root
      -- user_id: long (nullable = true)
      -- name: string (nullable = true)
      -- ds: string (nullable = true)
      ), [user_id#13L, name#14], [ds#15]
      
      == Physical Plan ==
      *Filter (isnotnull(name#14) && (name#14 = foo))
      +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation CatalogTable(
      Database: default
      Table: my_table
      Owner: tejasp
      Created: Fri Apr 14 17:05:50 PDT 2017
      Last Access: Wed Dec 31 16:00:00 PST 1969
      Type: MANAGED
      Provider: hive
      Properties: [serialization.format=1]
      Statistics: 9223372036854775807 bytes
      Location: file:/tmp/warehouse/my_table
      Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat: org.apache.hadoop.mapred.TextInputFormat
      OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Partition Provider: Catalog
      Partition Columns: [`ds`]
      Schema: root
      -- user_id: long (nullable = true)
      -- name: string (nullable = true)
      -- ds: string (nullable = true)
      ), [user_id#13L, name#14], [ds#15]
      ```
      
      After
      
      ```
      scala> hc.sql(" SELECT * FROM my_table WHERE name = 'foo' ").explain(true)
      == Parsed Logical Plan ==
      'Project [*]
      +- 'Filter ('name = foo)
         +- 'UnresolvedRelation `my_table`
      
      == Analyzed Logical Plan ==
      user_id: bigint, name: string, ds: string
      Project [user_id#13L, name#14, ds#15]
      +- Filter (name#14 = foo)
         +- SubqueryAlias my_table
            +- CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15]
      
      == Optimized Logical Plan ==
      Filter (isnotnull(name#14) && (name#14 = foo))
      +- CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15]
      
      == Physical Plan ==
      *Filter (isnotnull(name#14) && (name#14 = foo))
      +- HiveTableScan [user_id#13L, name#14, ds#15], CatalogRelation `default`.`my_table`, [user_id#13L, name#14], [ds#15]
      ```
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #17780 from tejasapatil/SPARK-20487_verbose_plan.
      a4aa4665
    • Kris Mok's avatar
      [SPARK-20482][SQL] Resolving Casts is too strict on having time zone set · 26ac2ce0
      Kris Mok authored
      ## What changes were proposed in this pull request?
      
      Relax the requirement that a `TimeZoneAwareExpression` has to have its `timeZoneId` set to be considered resolved.
      With this change, a `Cast` (which is a `TimeZoneAwareExpression`) can be considered resolved if the `(fromType, toType)` combination doesn't require time zone information.
      
      Also de-relaxed test cases in `CastSuite` so Casts in that test suite don't get a default `timeZoneId = Option("GMT")`.
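
      The relaxed rule can be sketched with a toy model. Everything here is an assumption for illustration: the type names mirror Spark's but are redefined locally, and the set of `(fromType, toType)` pairs that need a time zone is a guess, not taken from the patch.

      ```scala
      object CastResolutionSketch {
        sealed trait DataType
        case object StringType extends DataType
        case object TimestampType extends DataType
        case object DateType extends DataType
        case object IntegerType extends DataType

        // Assumption for illustration: only these conversions consult a time zone.
        def needsTimeZone(from: DataType, to: DataType): Boolean = (from, to) match {
          case (StringType, TimestampType) | (TimestampType, StringType) => true
          case (DateType, TimestampType) | (TimestampType, DateType) => true
          case _ => false
        }

        final case class Cast(from: DataType, to: DataType, timeZoneId: Option[String] = None) {
          // Resolved when no time zone is required, or when one has been supplied.
          def resolved: Boolean = !needsTimeZone(from, to) || timeZoneId.isDefined
        }
      }
      ```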
      
      ## How was this patch tested?
      
      Ran the de-relaxed `CastSuite` and it's passing. Also ran the SQL unit tests and they're passing too.
      
      Author: Kris Mok <kris.mok@databricks.com>
      
      Closes #17777 from rednaxelafx/fix-catalyst-cast-timezone.
      26ac2ce0
    • jinxing's avatar
      [SPARK-20426] Lazy initialization of FileSegmentManagedBuffer for shuffle service. · 85c6ce61
      jinxing authored
      ## What changes were proposed in this pull request?
      When an application contains a large number of shuffle blocks, the NodeManager requires a lot of memory to keep metadata (`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is big enough, the NodeManager can run out of memory (OOM). This PR proposes lazy initialization of `FileSegmentManagedBuffer` in the shuffle service.
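
      A minimal sketch of what lazy initialization means here, assuming hypothetical types (`BlockRef`, `SegmentBuffer`) rather than the shuffle service's real classes: only lightweight block descriptions are kept up front, and each buffer object is built when the consumer asks for it.

      ```scala
      object LazyBufferSketch {
        // Lightweight description of a shuffle block: just enough to open it later.
        final case class BlockRef(file: String, offset: Long, length: Long)

        // Hypothetical stand-in for the heavier buffer object.
        final class SegmentBuffer(val ref: BlockRef)

        // Counter used only to make the laziness observable in this sketch.
        var buffersCreated = 0

        // Returns an iterator that builds each buffer only when it is requested,
        // instead of materializing one buffer per block up front.
        def lazyBuffers(refs: Seq[BlockRef]): Iterator[SegmentBuffer] =
          refs.iterator.map { ref =>
            buffersCreated += 1
            new SegmentBuffer(ref)
          }
      }
      ```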
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #17744 from jinxing64/SPARK-20426.
      85c6ce61
    • Marcelo Vanzin's avatar
      [SPARK-20421][CORE] Mark internal listeners as deprecated. · 561e9cc3
      Marcelo Vanzin authored
      These listeners weren't really meant for external consumption, but they're
      public and marked with DeveloperApi. Adding the deprecated tag warns people
      that they may soon go away (as they will as part of the work for SPARK-18085).
      
      Note that not all types made public by https://github.com/apache/spark/pull/648
      are being deprecated. Some remaining types are still exposed through the
      SparkListener API.
      
      Also note the text for StorageStatus is a tiny bit different, since I'm not
      so sure I'll be able to remove it. But the effect for the users should be the
      same (they should stop trying to use it).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17766 from vanzin/SPARK-20421.
      561e9cc3
    • Davis Shepherd's avatar
      [SPARK-20483] Mesos Coarse mode may starve other Mesos frameworks · 7633933e
      Davis Shepherd authored
      ## What changes were proposed in this pull request?
      
      Set maxCores to be a multiple of the smallest executor that can be launched. This ensures that we correctly detect the condition where no more executors will be launched when spark.cores.max is not a multiple of spark.executor.cores.
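
      The arithmetic can be sketched as follows. `MaxCoresSketch` and `effectiveMaxCores` are hypothetical names, and the sketch only shows the rounding, not the scheduler logic around it.

      ```scala
      object MaxCoresSketch {
        // Rounds the requested core cap down to a multiple of the per-executor
        // core count, so the cap is exactly reachable by whole executors.
        def effectiveMaxCores(coresMax: Int, executorCores: Int): Int = {
          require(executorCores > 0, "executorCores must be positive")
          coresMax - (coresMax % executorCores)
        }
      }
      ```

      For example, with a cap of 10 cores and 4 cores per executor, the effective cap becomes 8: once two executors hold 8 cores, the cap is reached, rather than leaving 2 cores outstanding that no executor can ever fill.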
      
      ## How was this patch tested?
      
      This was manually tested with other sample frameworks measuring their incoming offers to determine if starvation would occur.
      
      dbtsai mgummelt
      
      Author: Davis Shepherd <dshepherd@netflix.com>
      
      Closes #17786 from dgshep/fix_mesos_max_cores.
      7633933e
    • zero323's avatar
      [SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide · ba766627
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add `spark.fpGrowth` to SparkR programming guide.
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17775 from zero323/SPARK-20208-FOLLOW-UP.
      ba766627
    • zero323's avatar
      [DOCS][MINOR] Add missing since to SparkR repeat_string note. · b58cf77c
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replace
      
          note repeat_string 2.3.0
      
      with
      
          note repeat_string since 2.3.0
      
      ## How was this patch tested?
      
      `create-docs.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17779 from zero323/REPEAT-NOTE.
      b58cf77c
    • Takeshi Yamamuro's avatar
      [SPARK-20425][SQL] Support a vertical display mode for Dataset.show · b4724db1
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds a new display mode for `Dataset.show` that prints output rows vertically (one line per column value). In the current master, output for a Dataset with many columns is hard to read, for example:
      
      ```
      scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() AS c$i"): _*)
      scala> df.show(3, 0)
      +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+
      |c0                |c1                |c2                |c3                 |c4                |c5                |c6                 |c7                |c8                |c9                |c10               |c11                |c12               |c13               |c14               |c15                |c16                |c17                |c18               |c19               |c20                |c21               |c22                |c23               |c24                |c25                |c26                |c27                 |c28                |c29               |c30                |c31                 |c32               |c33               |c34                |c35                |c36                |c37               |c38               |c39                |c40               |c41               |c42                |c43                |c44                |c45               |c46                 |c47                 |c48                |c49                |c50                |c51                |c52                |c53                |c54                 |c55                |c56                |c57                |c58                |c59               |c60               |c61                |c62                |c63               |c64                |c65               |c66               |c67              |c68                |c69                |c70               |c71                |c72               |c73                |c74                |c75                |c76               |c77                |c78               |c79                |c80                |c81                |c82                |c83                |c84                |c85                |c86                |c87               |c88                |c89                |c90               |c91               |c92               |c93                |c94               |c95                |c96               |c97                |c98                |c99                |
      +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+
      |0.6306087152476858|0.9174349686288383|0.5511324165035159|0.3320844128641819 |0.7738486877101489|0.2154915886962553|0.4754997600674299 |0.922780639280355 |0.7136894772661909|0.2277580838165979|0.5926874459847249|0.40311408392226633|0.467830264333843 |0.8330466896984213|0.1893258482389527|0.6320849515511165 |0.7530911056912044 |0.06700254871955424|0.370528597355559 |0.2755437445193154|0.23704391110980128|0.8067400174905822|0.13597793616251852|0.1708888820162453|0.01672725007605702|0.983118121881555  |0.25040195628629924|0.060537253723083384|0.20000530582637488|0.3400572407133511|0.9375689433322597 |0.057039316954370256|0.8053269714347623|0.5247817572228813|0.28419308820527944|0.9798908885194533 |0.31805988175678146|0.7034448027077574|0.5400575751346084|0.25336322371116216|0.9361634546853429|0.6118681368289798|0.6295081549153907 |0.13417468943957422|0.41617137072255794|0.7267230869252035|0.023792726137561115|0.5776157058356362  |0.04884204913195467|0.26728716103441275|0.646680370807925  |0.9782712690657244 |0.16434031314818154|0.20985522381321275|0.24739842475440077 |0.26335189682977334|0.19604841662422068|0.10742950487300651|0.20283136488091502|0.3100312319723688|0.886959006630645 |0.25157102269776244|0.34428775168410786|0.3500506818575777|0.3781142441912052 |0.8560316444386715|0.4737104888956839|0.735903101602148|0.02236617130529006|0.8769074095835873 |0.2001426662503153|0.5534032319238532 |0.7289496620397098|0.41955191309992157|0.9337700133660436 |0.34059094378451005|0.6419144759403556|0.08167496930341167|0.9947099478497635|0.48010888605366586|0.22314796858167918|0.17786598882331306|0.7351521162297135 |0.5422057170020095 |0.9521927872726792 |0.7459825486368227 |0.40907708791990627|0.8903819313311575|0.7251413746923618 |0.2977174938745204 |0.9515209660203555|0.9375968604766713|0.5087851740042524|0.4255237544908751 |0.8023768698664653|0.48003189618006703|0.1775841829745185|0.09050775629268382|0.6743909291138167 |0.2498415755876865 |
      |0.6866473844170801|0.4774360641212433|0.631696201340726 |0.33979113021468343|0.5663049010847052|0.7280190472258865|0.41370958502324806|0.9977433873622218|0.7671957338989901|0.2788708556233931|0.3355106391656496|0.88478952319287   |0.0333974166999893|0.6061744715862606|0.9617779139652359|0.22484954822341863|0.12770906021550898|0.5577789629508672 |0.2877649024640704|0.5566577406549361|0.9334933255278052 |0.9166720585157266|0.9689249324600591 |0.6367502457478598|0.7993572745928459 |0.23213222324218108|0.11928284054154137|0.6173493362456599  |0.0505122058694798 |0.9050228629552983|0.17112767911121707|0.47395598348370005 |0.5820498657823081|0.6241124650645072|0.18587258258036776|0.14987593554122225|0.3079446253653946 |0.9414228822867968|0.8362276265462365|0.9155655305576353 |0.5121559807153562|0.8963362656525707|0.22765970274318037|0.8177039187132797 |0.8190326635933787 |0.5256005177032199|0.8167598457269669  |0.030936807130934496|0.6733006585281015 |0.4208049626816347 |0.24603085738518538|0.22719198954208153|0.1622280557565281 |0.22217325159218038|0.014684419513742553|0.08987111517447499|0.2157764759142622 |0.8223414104088321 |0.4868624404491777 |0.4016191733088167|0.6169281906889263|0.15603611040433385|0.18289285085714913|0.9538408988218972|0.15037154865295121|0.5364516961987454|0.8077254873163031|0.712600478545675|0.7277477241003857 |0.19822912960348305|0.8305051199208777|0.18631911396566114|0.8909532487898342|0.3470409226992506 |0.35306974180587636|0.9107058868891469 |0.3321327206004986|0.48952332459050607|0.3630403307479373|0.5400046826340376 |0.5387377194310529 |0.42860539421837585|0.23214101630985995|0.21438968839794847|0.15370603160082352|0.04355605642700022|0.6096006707067466 |0.6933354157094292|0.06302172470859002|0.03174631856164001|0.664243581650643 |0.7833239547446621|0.696884598352864 |0.34626385933237736|0.9263495598791336|0.404818892816584  |0.2085585394755507|0.6150004897990109 |0.05391193524302473|0.28188484028329097|
      +------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+--------------------+-------------------+------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+-------------------+-------------------+-------------------+------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+
      only showing top 2 rows
      ```
      
      `psql`, the PostgreSQL CLI, supports a vertical display mode for this case, as described in:
      http://stackoverflow.com/questions/9604723/alternate-output-format-for-psql
      
      ```
      -RECORD 0-------------------
       c0  | 0.6306087152476858
       c1  | 0.9174349686288383
       c2  | 0.5511324165035159
      ...
       c98 | 0.05391193524302473
       c99 | 0.28188484028329097
      -RECORD 1-------------------
       c0  | 0.6866473844170801
       c1  | 0.4774360641212433
       c2  | 0.631696201340726
      ...
       c98 | 0.05391193524302473
       c99 | 0.28188484028329097
      only showing top 2 rows
      ```
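
The vertical layout above can be sketched as a small standalone formatter. This is an illustrative sketch only, not the code added to Spark by this patch; the function name `format_vertical` and the `rule_width` parameter are hypothetical.

```python
def format_vertical(columns, rows, rule_width=28):
    """Render each row as one "name | value" line per column.

    `rule_width` (an arbitrary choice here) is the width to which the
    "-RECORD n" header is padded with dashes.
    """
    name_width = max((len(name) for name in columns), default=0)
    lines = []
    for index, row in enumerate(rows):
        # Header line separating one record from the next.
        lines.append(f"-RECORD {index}".ljust(rule_width, "-"))
        for name, value in zip(columns, row):
            # Left-align the column names so the separators line up.
            lines.append(f" {name.ljust(name_width)} | {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_vertical(["c0", "c1"], [[0.63, 0.91], [0.68, 0.47]]))
```

Padding the column names to a common width keeps the `|` separators aligned, which is what makes a 100-column row readable when printed this way.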
      
      ## How was this patch tested?
      Added tests in `DataFrameSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17733 from maropu/SPARK-20425.
      b4724db1
  3. Apr 26, 2017
    • Mark Grover's avatar
      [SPARK-20435][CORE] More thorough redaction of sensitive information · 66636ef0
      Mark Grover authored
      This change redacts sensitive information from the logs and the UI more thoroughly.
      It also adds unit tests to ensure that no regression leaks sensitive information to the logs.
      
      The motivation for this change was the appearance of a password in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations, like so:
      `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."`
      Previously, the redaction logic only checked whether the key matched the secret regex pattern and, if so, redacted its value. That worked for most cases. However, in the above case the key (`sun.java.command`) reveals little, so the value needs to be searched too. This PR expands the check to cover values as well.
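
The key-or-value check described above can be sketched as follows. This is a minimal standalone illustration, not Spark's implementation; the pattern, the `REDACTED` placeholder text and the function name `redact` are assumptions made for the example.

```python
import re

# Hypothetical pattern and placeholder; the real ones are not shown in this commit message.
SECRET_PATTERN = re.compile(r"(?i)secret|password")
REDACTED = "<redacted>"


def redact(pairs, pattern=SECRET_PATTERN):
    """Redact a value when either its key or the value itself matches the pattern."""
    result = []
    for key, value in pairs:
        if pattern.search(key) or pattern.search(value):
            result.append((key, REDACTED))
        else:
            result.append((key, value))
    return result
```

With a key-only check, the `sun.java.command` entry would pass through untouched, because only its value contains the password.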
      
      ## How was this patch tested?
      
      New unit tests were added to ensure that no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value would not be redacted. However, that non-sensitive value contained the literal "secret", which caused it to be redacted. Simply changing the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed the test.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17725 from markgrover/spark-20435.
      66636ef0
    • Weiqing Yang's avatar
      [SPARK-12868][SQL] Allow adding jars from hdfs · 2ba1eba3
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      Spark 2.2 is about to be cut, and it would be great if SPARK-12868 could be resolved before then. There have been several PRs for this, such as [PR#16324](https://github.com/apache/spark/pull/16324), but all of them have been inactive for a long time or have been closed.
      
      This PR adds a SparkUrlStreamHandlerFactory, which relies on the 'protocol' to choose the appropriate
      UrlStreamHandlerFactory, such as FsUrlStreamHandlerFactory, to create the URLStreamHandler.
      
      ## How was this patch tested?
      1. Add a new unit test.
      2. Check manually.
      Before: throw an exception with " failed unknown protocol: hdfs"
      <img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png">
      
      After:
      <img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png">
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #17342 from weiqingy/SPARK-18910.
      2ba1eba3
    • Michal Szafranski's avatar
      [SPARK-20474] Fixing OnHeapColumnVector reallocation · a277ae80
      Michal Szafranski authored
      ## What changes were proposed in this pull request?
      On reallocation, OnHeapColumnVector copies data to the new storage only up to 'elementsAppended'. This variable is updated only by the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
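
The problem can be illustrated with a toy vector. This is a sketch, not Spark's `OnHeapColumnVector`; the class and method names are hypothetical, and `reserve_fixed` assumes that the fix copies the whole old capacity, which the commit message does not spell out.

```python
class ToyVector:
    """Toy stand-in for an on-heap column vector (illustration only)."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.elements_appended = 0  # advanced only by append()

    def put(self, index, value):
        # Positional write; does not touch elements_appended.
        self.data[index] = value

    def append(self, value):
        self.data[self.elements_appended] = value
        self.elements_appended += 1

    def reserve_buggy(self, new_capacity):
        # Copies only the appended prefix, so values written with put() are lost.
        new_data = [0] * new_capacity
        count = self.elements_appended
        new_data[:count] = self.data[:count]
        self.data = new_data

    def reserve_fixed(self, new_capacity):
        # Assumed fix: copy everything up to the old capacity.
        new_data = [0] * new_capacity
        old_capacity = len(self.data)
        new_data[:old_capacity] = self.data
        self.data = new_data
```

After two `put` calls and no `append`, `reserve_buggy` carries nothing over to the new storage, while `reserve_fixed` keeps both values.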
      
      ## How was this patch tested?
      Tested using existing unit tests.
      
      Author: Michal Szafranski <michal@databricks.com>
      
      Closes #17773 from michal-databricks/spark-20474.
      a277ae80
    • Michal Szafranski's avatar
      [SPARK-20473] Enabling missing types in ColumnVector.Array · 99c6cf9e
      Michal Szafranski authored
      ## What changes were proposed in this pull request?
      ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should also be added to ColumnVector.Array.
      
      ## How was this patch tested?
      Tested using existing unit tests.
      
      Author: Michal Szafranski <michal@databricks.com>
      
      Closes #17772 from michal-databricks/spark-20473.
      99c6cf9e
    • jerryshao's avatar
      [SPARK-20391][CORE] Rename memory related fields in ExecutorSummary · 66dd5b83
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #14617 to make the names of memory-related fields more meaningful.
      
      For backward compatibility, I didn't change the `maxMemory` and `memoryUsed` fields.
      
      ## How was this patch tested?
      
      Existing UT and local verification.
      
      CC squito and tgravescs .
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17700 from jerryshao/SPARK-20391.
      66dd5b83
    • Yanbo Liang's avatar
      [MINOR][ML] Fix some PySpark & SparkR flaky tests · dbb06c68
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Some PySpark & SparkR tests run with a tiny dataset and a tiny `maxIter`, which means they have not converged. I don’t think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue in #17746 when we upgraded breeze to 0.13.1.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17757 from yanboliang/flaky-test.
      dbb06c68
    • Tom Graves's avatar
      [SPARK-19812] YARN shuffle service fails to relocate recovery DB across NFS directories · 7fecf513
      Tom Graves authored
      
      ## What changes were proposed in this pull request?
      
      Change from using Java's Files.move to using Hadoop filesystem operations to move the directories. Java's Files.move does not work when moving directories across NFS mounts, and in fact its documentation also says that if the directory has entries you should do a recursive move. We are already using the Hadoop filesystem here, so we just use its local filesystem, which handles this properly.
      
      Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var.
      
      ## How was this patch tested?
      
      Ran the YarnShuffleServiceSuite unit tests. Unfortunately, I couldn't easily add a new one here since it would involve NFS.
      Ran manual tests to verify that the DB directories were properly moved across NFS-mounted directories. We have been running this internally for weeks.
      
      Author: Tom Graves <tgraves@apache.org>
      
      Closes #17748 from tgravescs/SPARK-19812.
      7fecf513
    • anabranch's avatar
      [SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools · 7a365257
      anabranch authored
      ## What changes were proposed in this pull request?
      
      Simple documentation change to remove explicit vendor references.
      
      ## How was this patch tested?
      
      NA
      
      Author: anabranch <bill@databricks.com>
      
      Closes #17695 from anabranch/remove-vendor.
      7a365257
    • zero323's avatar
      [SPARK-20437][R] R wrappers for rollup and cube · df58a95a
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add `rollup` and `cube` methods and corresponding generics.
      - Add short description to the vignette.
      
      ## How was this patch tested?
      
      - Existing unit tests.
      - Additional unit tests covering new features.
      - `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17728 from zero323/SPARK-20437.
      df58a95a
  4. Apr 25, 2017
    • Eric Wasserman's avatar
      [SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions · 57e1da39
      Eric Wasserman authored
      ## What changes were proposed in this pull request?
      
      Change to using the following method of Jackson's `com.fasterxml.jackson.core.JsonFactory`:
      
          public JsonParser createParser(String content)
      
      ## How was this patch tested?
      
      existing unit tests
      
      Author: Eric Wasserman <ericw@sgn.com>
      
      Closes #17693 from ewasserman/SPARK-20314.
      57e1da39
    • Sameer Agarwal's avatar
      [SPARK-18127] Add hooks and extension points to Spark · caf39202
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch adds support for customizing the spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies or even a customized parser.
      
      ## How was this patch tested?
      
      Unit Tests in SparkSessionExtensionSuite
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17724 from sameeragarwal/session-extensions.
      caf39202
    • ding's avatar
      [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel · 0a7f5f27
      ding authored
      ## What changes were proposed in this pull request?
      
      Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.
      
      This PR makes Pregel checkpoint the graph periodically if the checkpoint directory is set.
      It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
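
Why periodic checkpointing bounds the lineage can be seen in a toy model. This is a sketch under the simplifying assumptions that each iteration adds one step to the lineage and that a checkpoint resets it to zero; the function name and the `checkpoint_interval` parameter are hypothetical and are not Pregel's API.

```python
def longest_lineage(num_iterations, checkpoint_interval=None):
    """Return the longest lineage chain reached during the run."""
    lineage = 0
    longest = 0
    for iteration in range(1, num_iterations + 1):
        lineage += 1  # each iteration depends on the previous one
        longest = max(longest, lineage)
        if checkpoint_interval and iteration % checkpoint_interval == 0:
            lineage = 0  # a checkpoint truncates the lineage
    return longest
```

Without checkpointing, the chain grows with the number of iterations; with an interval of `k`, it never exceeds `k`.
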
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: ding <ding@localhost.localdomain>
      Author: dding3 <ding.ding@intel.com>
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15125 from dding3/cp2_pregel.
      0a7f5f27
    • Yanbo Liang's avatar
      [SPARK-20449][ML] Upgrade breeze version to 0.13.1 · 67eef47a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17746 from yanboliang/spark-20449.
      67eef47a
    • wangmiao1981's avatar
      [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant · 387565cf
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #17478.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #17754 from wangmiao1981/followup.
      387565cf
    • Sergey Zhemzhitsky's avatar
      [SPARK-20404][CORE] Using Option(name) instead of Some(name) · 0bc7a902
      Sergey Zhemzhitsky authored
      Using Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following
      ```
      sparkContext.accumulator(0, null)
      ```
      
      Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>
      
      Closes #17740 from szhem/SPARK-20404-null-acc-names.
      0bc7a902
    • Armin Braun's avatar
      [SPARK-20455][DOCS] Fix Broken Docker IT Docs · c8f12195
      Armin Braun authored
      ## What changes were proposed in this pull request?
      
      Just added the Maven `test` goal.
      
      ## How was this patch tested?
      
      No test needed, just a trivial documentation fix.
      
      Author: Armin Braun <me@obrown.io>
      
      Closes #17756 from original-brownbear/SPARK-20455.
      c8f12195
    • Sameer Agarwal's avatar
      [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit · 31345fde
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping
      splits.
      
      To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.
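
The role of the sort can be sketched with a standalone model of one partition. This is an illustration, not Spark's sampler: it assumes each split replays the same seeded random sequence and keeps the rows whose draw falls in its own `[lo, hi)` range. The function name `sample_split` is hypothetical.

```python
import random


def sample_split(rows, lo, hi, seed):
    """Materialize one split: keep rows whose seeded draw falls in [lo, hi).

    Sorting first makes the i-th draw always belong to the same row,
    regardless of the order in which the rows arrive.
    """
    rng = random.Random(seed)
    return [row for row in sorted(rows) if lo <= rng.random() < hi]


if __name__ == "__main__":
    rows = list(range(10))
    first = sample_split(rows, 0.0, 0.5, seed=42)
    # The second split sees the rows in a different order, as an unordered source might.
    second = sample_split(rows[::-1], 0.5, 1.0, seed=42)
    print(first, second)
```

Because the rows are sorted before sampling, the two splits are disjoint and together cover the input even though they were materialized from differently ordered inputs. The sketch also shows why the rows must be sortable, which is the reason map-typed columns have to be pruned from the sort order.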
      
      ## How was this patch tested?
      
      Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes and nested mapTypes.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17751 from sameeragarwal/randomsplit2.
      31345fde
  5. Apr 24, 2017
    • Josh Rosen's avatar
      [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT · f44c8a84
      Josh Rosen authored
      This patch bumps the master branch version to `2.3.0-SNAPSHOT`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17753 from JoshRosen/SPARK-20453.
      f44c8a84
    • jerryshao's avatar
      [SPARK-20239][CORE] Improve HistoryServer's ACL mechanism · 5280d93e
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The current SHS (Spark History Server) has two different ACLs:
      
      * The ACL of the base URL. It is controlled by "spark.acls.enable" or "spark.ui.acls.enable". With this enabled, only a user configured with "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started SHS, can list all the applications; otherwise none of them can be listed. This also affects the REST APIs that list the summary of all apps and of one app.
      * The per-application ACL. It is controlled by "spark.history.ui.acls.enable". With this enabled, only the history admin user and the user/group who ran the app can access the details of that app.
      
      With these two ACLs, we may encounter several unexpected behaviors:
      
      1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app.
      2. If the base URL's ACL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A".
      3. Changes to the Live UI's ACL affect the History UI's ACL, since they share the same conf file.
      
      These unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage everything.
      
      So, to improve SHS's ACL mechanism, this PR proposes to:
      
      1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for the history server.
      2. Check permission for event-log download REST API.
      
      With this PR:
      
      1. Admin user could see/download the list of all applications, as well as application details.
      2. A normal user can see the list of all applications, but can only download and check the details of applications accessible to them.
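
The resulting permission model can be sketched as two checks. This is a simplified illustration, not the history server's code; the function names and the `app_view_acls` and `history_admins` parameters are hypothetical, and group membership is left out.

```python
def can_list_applications(user):
    """With this PR, every user may see the list of applications."""
    return True


def can_access_application(user, app_owner, app_view_acls, history_admins):
    """Single gate for both the details pages and the event-log download."""
    return (
        user in history_admins
        or user == app_owner
        or user in app_view_acls
    )
```

Routing the event-log download through the same check as the details pages is what closes the gap described in the second unexpected behavior.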
      
      ## How was this patch tested?
      
      New UTs were added, and the change was also verified in a real cluster.
      
      CC tgravescs vanzin, please help to review; this PR changes the semantics you implemented previously. Thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17582 from jerryshao/SPARK-20239.
      5280d93e
    • zero323's avatar
      [SPARK-20438][R] SparkR wrappers for split and repeat · 8a272ddc
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add wrappers for `o.a.s.sql.functions`:
      
      - `split` as `split_string`
      - `repeat` as `repeat_string`
      
      ## How was this patch tested?
      
      Existing tests, additional unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17729 from zero323/SPARK-20438.
      8a272ddc
    • wm624@hotmail.com's avatar
      [SPARK-18901][ML] Require in LR LogisticAggregator is redundant · 90264ace
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      In MultivariateOnlineSummarizer, `add` and `merge` already check weights and feature sizes. The corresponding checks in LR are therefore redundant and are removed in this PR.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #17478 from wangmiao1981/logit.
      90264ace
    • Xiao Li's avatar
      [SPARK-20439][SQL] Fix Catalog API listTables and getTable when failed to fetch table metadata · 776a2c0e
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      `spark.catalog.listTables` and `spark.catalog.getTable` do not work if we are unable to retrieve table metadata for any reason (e.g., the table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table, without the description and tableType.
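
The fallback behavior can be sketched as follows. This is a standalone illustration rather than the Catalog implementation; the `Table` dataclass, `make_table` and the `fetch_metadata` callback are hypothetical names, and using `None` for the missing fields is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Table:
    name: str
    description: Optional[str] = None
    table_type: Optional[str] = None


def make_table(name: str, fetch_metadata: Callable[[str], Tuple[str, str]]) -> Table:
    """Build a Table entry, degrading gracefully when the metadata lookup fails."""
    try:
        description, table_type = fetch_metadata(name)
    except Exception:
        # Metadata is unavailable: still return the table, minus the optional fields.
        return Table(name)
    return Table(name, description, table_type)
```

A listing built this way keeps every table visible, so one unreadable table no longer makes the whole call fail.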
      
      ### How was this patch tested?
      Added a test case
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17730 from gatorsmile/listTables.
      776a2c0e