  1. Feb 24, 2016
  2. Feb 23, 2016
    • Davies Liu's avatar
      [SPARK-13431] [SQL] [test-maven] split keywords from ExpressionParser.g · 86c852cf
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR pulls all the keywords (and some other rules) out of ExpressionParser.g into KeywordParser.g, because ExpressionParser is too large to compile.
      
      ## How was this patch tested?
      
      unit test, maven build
      
      Closes #11329
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11331 from davies/split_expr.
      86c852cf
    • Davies Liu's avatar
      [SPARK-13376] [SQL] improve column pruning · e9533b41
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR mostly rewrites the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset).
      
      ## How was this patch tested?
      
      This is tested by unit tests, and also manually with TPCDS Q78, where all unused columns were pruned successfully, improving performance by 78% (from 22s to 12s).
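      The core idea of column pruning can be illustrated with a toy plan model (hypothetical simplified classes, not Spark's actual ColumnPruning rule):

```python
# Toy column-pruning pass over a simplified plan tree (illustrative only;
# the real rule handles many more plan node types).

class Scan:
    def __init__(self, columns):
        self.columns = list(columns)

class Project:
    def __init__(self, columns, child):
        self.columns = list(columns)
        self.child = child

def prune(plan):
    """Push the set of referenced columns down into the scan."""
    if isinstance(plan, Project) and isinstance(plan.child, Scan):
        # Only read the columns the Project actually needs.
        plan.child.columns = [c for c in plan.child.columns if c in plan.columns]
    return plan

plan = prune(Project(["id"], Scan(["id", "name", "payload"])))
print(plan.child.columns)  # ['id']
```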
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11256 from davies/fix_column_pruning.
      e9533b41
    • JeremyNixon's avatar
      [SPARK-10759][ML] update cross validator with include_example · 230bbeaa
      JeremyNixon authored
      This pull request uses {%include_example%} to add an example for the Python cross validator to ml-guide.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11240 from JeremyNixon/pipeline_include_example.
      230bbeaa
    • Xusen Yin's avatar
      [SPARK-13011] K-means wrapper in SparkR · 8d29001d
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-13011
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #11124 from yinxusen/SPARK-13011.
      8d29001d
    • Timothy Hunter's avatar
      [SPARK-6761][SQL][ML] Fixes to API and documentation of approximate quantiles · 15e30155
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This continues thunterdb's work on the `approxQuantile` API. It changes the signature of `approxQuantile` from `(col: String, quantile: Double, epsilon: Double): Double` to `(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]` and updates the API doc. It also improves the error message in tests and simplifies the merge algorithm for summaries.
      
      ## How was this patch tested?
      
      Use the same unit tests as before.
      
      Closes #11325
      
      Author: Timothy Hunter <timhunter@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #11332 from mengxr/SPARK-6761.
      15e30155
    • Davies Liu's avatar
      [SPARK-13373] [SQL] generate sort merge join · 9cdd867d
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Generates code for SortMergeJoin.
      
      ## How was this patch tested?
      
      Unit tests, plus manual testing with TPCDS Q72, which showed a 70% performance improvement (from 42s to 25s). Micro benchmarks only show minor improvements; the gain may depend on the data distribution and the number of columns.
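      The join being code-generated is the classic sort-merge equi-join; a minimal sketch of that algorithm over key-sorted lists (illustrative Python, not the generated code):

```python
# Minimal sort-merge equi-join over two key-sorted lists of (key, value)
# pairs. Sketch of the algorithm only; the patch generates Java code for it.

def sort_merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the matching key groups.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0  # rewind for the next left row with the same key
    return out

rows = sort_merge_join([(1, "a"), (2, "b")], [(2, "x"), (2, "y"), (3, "z")])
print(rows)  # [(2, 'b', 'x'), (2, 'b', 'y')]
```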
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11248 from davies/gen_smj.
      9cdd867d
    • Davies Liu's avatar
      [SPARK-13329] [SQL] considering output for statistics of logical plan · c481bdf5
      Davies Liu authored
      The current implementation of statistics for UnaryNode does not consider the output (for example, Project may produce far fewer columns than its child); we should consider it to make a better guess.
      
      We usually join on only a few columns from a Parquet table, so the size of the projected plan can be much smaller than the original Parquet files. Having a better size guess helps us choose between broadcast join and sort merge join.
      
      After this PR, I saw a few queries choose broadcast join rather than sort merge join without tuning spark.sql.autoBroadcastJoinThreshold for every query, ending up with about 6-8X improvements in end-to-end time.
      
      We use `defaultSize` of DataType to estimate the size of a column. Currently, for DecimalType/StringType/BinaryType and UDT, we over-estimate by too much (4096 bytes), so this PR changes them to more reasonable values. Here are the new defaultSize values:
      
      DecimalType:  8 or 16 bytes, based on the precision
      StringType:  20 bytes
      BinaryType: 100 bytes
      UDT: default size of its SQL type
      
      These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.
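      As a rough illustration, per-row size estimation with these defaults could look like the following (hypothetical helper, plain Python; the sizes are the ones listed above, and the precision threshold for 8-vs-16-byte decimals is an assumption):

```python
# Sketch of row-size estimation with the new defaults described above.
# DEFAULT_SIZE values come from this commit; the function itself and the
# precision <= 18 cutoff are illustrative assumptions.

DEFAULT_SIZE = {
    "int": 4,
    "long": 8,
    "double": 8,
    "string": 20,    # was 4096 before this patch
    "binary": 100,   # was 4096 before this patch
}

def decimal_size(precision):
    # 8 bytes for precisions that fit in a long, 16 otherwise (assumption).
    return 8 if precision <= 18 else 16

def estimated_row_size(schema):
    """schema: list of type names, plus ('decimal', precision) tuples."""
    total = 0
    for t in schema:
        if isinstance(t, tuple) and t[0] == "decimal":
            total += decimal_size(t[1])
        else:
            total += DEFAULT_SIZE[t]
    return total

print(estimated_row_size(["long", "string", ("decimal", 38)]))  # 8 + 20 + 16 = 44
```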
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11210 from davies/statics.
      c481bdf5
    • Michael Armbrust's avatar
      [SPARK-13440][SQL] ObjectType should accept any ObjectType, If should not care about nullability · c5bfe5d2
      Michael Armbrust authored
      The type checking functions of `If` and `UnwrapOption` are fixed to eliminate spurious failures.  `UnwrapOption` was checking for an input of `ObjectType` but `ObjectType`'s accept function was hard-coded to return `false`.  `If`'s type check returned a false negative when its two branches differed only in nullability.
      
      Tests added:
       -  an end-to-end regression test is added to `DatasetSuite` for the reported failure.
       - all the unit tests in `ExpressionEncoderSuite` are augmented to also confirm successful analysis.  These tests are actually what pointed out the additional issues with `If` resolution.
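      The nullability-insensitive comparison can be sketched with a toy type model (hypothetical classes, not Catalyst's actual DataType):

```python
# Sketch of a type check that ignores nullability when comparing the two
# branches of If (simplified, illustrative type model).

from dataclasses import dataclass

@dataclass
class DataType:
    name: str
    nullable: bool = True

def same_type_ignoring_nullability(a, b):
    return a.name == b.name

# Differing only in nullability should type-check...
print(same_type_ignoring_nullability(DataType("int", True), DataType("int", False)))  # True
# ...while genuinely different types should not.
print(same_type_ignoring_nullability(DataType("int"), DataType("string")))  # False
```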
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #11316 from marmbrus/datasetOptions.
      c5bfe5d2
    • Lianhui Wang's avatar
      [SPARK-7729][UI] Executor which has been killed should also be displayed on Executor Tab · 9f426339
      Lianhui Wang authored
      andrewor14 squito Dead Executors should also be displayed on Executor Tab.
      as following:
      ![image](https://cloud.githubusercontent.com/assets/545478/11492707/ae55d7f6-982b-11e5-919a-b62cd84684b2.png)
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Andrew Or <andrew@databricks.com>
      
      Closes #10058 from lianhuiwang/SPARK-7729.
      9f426339
    • Grzegorz Chilkiewicz's avatar
      [SPARK-13338][ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion · 5d69eaf0
      Grzegorz Chilkiewicz authored
      Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
      
      Closes #11216 from grzegorz-chilkiewicz/master.
      5d69eaf0
    • zhuol's avatar
      [SPARK-13364] Sort appId as num rather than str in history page. · 4d1e5f92
      zhuol authored
      ## What changes were proposed in this pull request?
      
      The history page currently sorts the appId as a string, which can lead to unexpected ordering for cases like "application_11111_9" and "application_11111_20".
      Adding a new sort type called appId-numeric fixes it.
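      The underlying issue can be shown in a few lines (illustrative Python, not the UI's JavaScript):

```python
# Numeric appId ordering: compare the trailing counter as a number rather
# than the whole id as a string (sketch only).

import re

def app_id_key(app_id):
    # "application_11111_9" -> (11111, 9)
    m = re.match(r"application_(\d+)_(\d+)", app_id)
    return (int(m.group(1)), int(m.group(2)))

ids = ["application_11111_20", "application_11111_9"]
print(sorted(ids))                  # string sort puts ..._20 before ..._9
print(sorted(ids, key=app_id_key))  # numeric sort puts ..._9 first
```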
      
      ## How was this patch tested?
      This patch was manually tested with the UI. See the screenshot below:
      ![sortappidbetter](https://cloud.githubusercontent.com/assets/11683054/13185564/7f941a16-d707-11e5-8fb7-0316368d3030.png)
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #11259 from zhuoliu/13364.
      4d1e5f92
    • Liang-Chi Hsieh's avatar
      [SPARK-13358] [SQL] Retrieve grep path when do benchmark · 87d7f890
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13358
      
      When trying to run a benchmark, I found that on my Ubuntu Linux box grep is not in /usr/bin/ but in /bin/. So I wonder whether it is better to use `which` to retrieve the grep path.
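      The same idea in Python terms, for illustration (the benchmark itself shells out to `which`):

```python
# Locate grep via the PATH rather than hard-coding /usr/bin/grep --
# equivalent in spirit to shelling out to `which grep`.

import shutil

grep = shutil.which("grep")  # e.g. /bin/grep on some Ubuntu installs
print(grep if grep else "grep not found on PATH")
```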
      
      cc davies
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11231 from viirya/benchmark-grep-path.
      87d7f890
    • jerryshao's avatar
      [SPARK-13220][CORE] deprecate yarn-client and yarn-cluster mode · e99d0170
      jerryshao authored
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11229 from jerryshao/SPARK-13220.
      e99d0170
    • gatorsmile's avatar
      [SPARK-13263][SQL] SQL Generation Support for Tablesample · 87250580
      gatorsmile authored
      In the parser, the tableSample clause is part of tableSource.
      ```
      tableSource
      init { gParent.pushMsg("table source", state); }
      after { gParent.popMsg(state); }
          : tabname=tableName
          ((tableProperties) => props=tableProperties)?
          ((tableSample) => ts=tableSample)?
          ((KW_AS) => (KW_AS alias=Identifier)
          |
          (Identifier) => (alias=Identifier))?
          -> ^(TOK_TABREF $tabname $props? $ts? $alias?)
          ;
      ```
      
      Two typical query samples using TABLESAMPLE are:
      ```
          "SELECT s.id FROM t0 TABLESAMPLE(10 PERCENT) s"
          "SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)"
      ```
      
      FYI, the logical plan of a TABLESAMPLE query:
      ```
      sql("SELECT * FROM t0 TABLESAMPLE(0.1 PERCENT)").explain(true)
      
      == Analyzed Logical Plan ==
      id: bigint
      Project [id#16L]
      +- Sample 0.0, 0.001, false, 381
         +- Subquery t0
            +- Relation[id#16L] ParquetRelation
      ```
      
      Thanks! cc liancheng
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      This patch had conflicts when merged, resolved by
      Committer: Cheng Lian <lian@databricks.com>
      
      Closes #11148 from gatorsmile/tablesplitsample.
      87250580
    • movelikeriver's avatar
      [SPARK-13257][IMPROVEMENT] Refine naive Bayes example by checking model after loading it · 5cd3e6f6
      movelikeriver authored
      Refine naive Bayes example by checking model after loading it
      
      Author: movelikeriver <mars.lenjoy@gmail.com>
      
      Closes #11125 from movelikeriver/naive_bayes.
      5cd3e6f6
    • Xiangrui Meng's avatar
      [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply · 764ca180
      Xiangrui Meng authored
      `GraphImpl.fromExistingRDDs` expects preprocessed vertex RDD as input. We call it in LDA without validating this requirement. So it might introduce errors. Replacing it by `Graph.apply` would be safer and more proper because it is a public API. The tests still pass. So maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #11226 from mengxr/SPARK-13355.
      764ca180
    • Yanbo Liang's avatar
      [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib · 72427c3e
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      In order to provide better and consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6```, making it equal to ML ```LogisticRegression```.
      cc dbtsai
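      The trade-off behind a tighter tolerance can be illustrated with a toy gradient descent (plain Python, not MLlib's L-BFGS):

```python
# Illustration only: a tighter convergence tolerance trades extra
# iterations for a more accurate solution.

def minimize(tol, lr=0.1, x=0.0):
    """Gradient descent on f(x) = (x - 3)^2, stopping when the step is < tol."""
    for it in range(1, 100_000):
        step = lr * 2 * (x - 3)  # gradient of (x - 3)^2 times the learning rate
        x -= step
        if abs(step) < tol:
            return x, it
    return x, it

x4, n4 = minimize(1e-4)
x6, n6 = minimize(1e-6)
# The tighter tolerance runs more iterations and lands closer to the optimum.
print(n4, n6)
```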
      ## How was this patch tested?
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11299 from yanboliang/spark-13429.
      72427c3e
    • Timothy Hunter's avatar
      [SPARK-6761][SQL] Approximate quantile for DataFrame · 4fd19936
      Timothy Hunter authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6761
      
      Compute approximate quantile based on the paper Greenwald, Michael and Khanna, Sanjeev, "Space-efficient Online Computation of Quantile Summaries," SIGMOD '01.
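      A toy, mergeable quantile summary conveys the summarize/merge/query shape of such sketches. This is a crude stand-in for illustration only, not the Greenwald-Khanna algorithm itself:

```python
# Crude mergeable quantile summary: keep a small sorted sample of evenly
# spaced order statistics. Toy stand-in, not the GK sketch.

def summarize(values, k=32):
    s = sorted(values)
    if len(s) <= k:
        return s
    step = (len(s) - 1) / (k - 1)
    return [s[round(i * step)] for i in range(k)]

def merge(a, b, k=32):
    # Merging two summaries is just summarizing their union here.
    return summarize(a + b, k)

def query(summary, p):
    idx = min(int(p * len(summary)), len(summary) - 1)
    return summary[idx]

s = merge(summarize(range(0, 500)), summarize(range(500, 1000)))
print(query(s, 0.5))  # close to the true median of 500
```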
      
      Author: Timothy Hunter <timhunter@databricks.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6042 from viirya/approximate_quantile.
      4fd19936
    • gatorsmile's avatar
      [SPARK-13236] SQL Generation for Set Operations · 01e10c9f
      gatorsmile authored
      This PR is to implement SQL generation for the following three set operations:
      - Union Distinct
      - Intersect
      - Except
      
      liancheng Thanks!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11195 from gatorsmile/setOpSQLGen.
      01e10c9f
    • gatorsmile's avatar
      [SPARK-12723][SQL] Comprehensive Verification and Fixing of SQL Generation Support for Expressions · 9dd5399d
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      Ensure that all built-in expressions can be mapped to its SQL representation if there is one (e.g. ScalaUDF doesn't have a SQL representation). The function lists are from the expression list in `FunctionRegistry`.
      
      Window functions, grouping sets functions (`cube`, `rollup`, `grouping`, `grouping_id`), and generator functions (`explode` and `json_tuple`) are covered by separate JIRAs and PRs, so this PR does not cover them. Apart from these, all the built-in expressions are covered. For details, see the list in `ExpressionToSQLSuite`.
      
      Fixed a few issues. For example, the `prettyName` of `approx_count_distinct` is not right. The `sql` of `hash` function is not right, since the `hash` function does not accept `seed`.
      
      Additionally, this also corrects the order of expressions in `FunctionRegistry` so that it is easier to find which functions are missing.
      
      cc liancheng
      
      #### How was this patch tested?
      Added two test cases in LogicalPlanToSQLSuite for covering `not like` and `not in`.
      
      Added a new test suite `ExpressionToSQLSuite` to cover the functions:
      
      1. misc non-aggregate functions + complex type creators + null expressions
      2. math functions
      3. aggregate functions
      4. string functions
      5. date time functions + calendar interval
      6. collection functions
      7. misc functions
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11314 from gatorsmile/expressionToSQL.
      9dd5399d
  3. Feb 22, 2016
    • Daoyuan Wang's avatar
      [SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec · 5d80fac5
      Daoyuan Wang authored
      In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value is an instance of `SessionState`. See the exception below.
      
      spark-sql> !echo "test";
      Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
      	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
      	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
      	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
      	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
      	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #9589 from adrian-wang/clicommand.
      5d80fac5
    • Shixiong Zhu's avatar
      [SPARK-13298][CORE][UI] Escape "label" to avoid DAG being broken by some special character · a11b3995
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When there are special characters (e.g., `"`, `\`) in `label`, the DAG will be broken. This patch escapes `label` to avoid the DAG being broken by such characters.
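      The fix's idea in a few lines (illustrative Python; the actual patch escapes labels in the Scala UI code):

```python
# Escape characters that are meaningful to the graph description before
# embedding a label. Escaping the backslash first is essential, otherwise
# the quote's escape character would itself get double-escaped.

def escape_label(label):
    return label.replace("\\", "\\\\").replace('"', '\\"')

print(escape_label('stage "shuffle" \\ map'))  # stage \"shuffle\" \\ map
```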
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11309 from zsxwing/SPARK-13298.
      a11b3995
    • Narine Kokhlikyan's avatar
      [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements -... · 33ef3aa7
      Narine Kokhlikyan authored
      [SPARK-13295][ ML, MLLIB ] AFTSurvivalRegression.AFTAggregator improvements - avoid creating new instances of arrays/vectors for each record
      
      As also marked by a TODO in the AFTAggregator.AFTAggregator.add(data: AFTPoint) method, a new array is created for the intercept value and concatenated
      with another array containing the betas; the resulting array is converted into a dense vector, which in turn is converted into a Breeze vector.
      This is expensive and not necessarily beautiful.
      
      I've tried to solve the above-mentioned problem by a simple algebraic decomposition - keeping and treating the intercept independently.
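      The decomposition in words: instead of building [intercept] ++ beta and a widened feature vector per record, compute dot(beta, x) + intercept directly. A toy comparison (hypothetical function names, plain Python):

```python
# Both compute the same margin; the second avoids the per-record array
# allocations and concatenations described above.

def margin_concat(beta, intercept, x):
    w = [intercept] + beta  # new arrays per record (the old way)
    xi = [1.0] + x
    return sum(wi * xij for wi, xij in zip(w, xi))

def margin_direct(beta, intercept, x):
    # Treat the intercept independently: no new arrays needed.
    return sum(b * xj for b, xj in zip(beta, x)) + intercept

beta, intercept, x = [0.5, -2.0], 1.0, [4.0, 1.0]
print(margin_concat(beta, intercept, x), margin_direct(beta, intercept, x))  # 1.0 1.0
```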
      
      Please let me know what do you think and if you have any questions.
      
      Thanks,
      Narine
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #11179 from NarineK/survivaloptim.
      33ef3aa7
    • Devaraj K's avatar
      [SPARK-13012][DOCUMENTATION] Replace example code in ml-guide.md using include_example · 02b1feff
      Devaraj K authored
      Replaced example code in ml-guide.md using include_example
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11053 from devaraj-kavali/SPARK-13012.
      02b1feff
    • Devaraj K's avatar
      [SPARK-13016][DOCUMENTATION] Replace example code in... · 9f410871
      Devaraj K authored
      [SPARK-13016][DOCUMENTATION] Replace example code in mllib-dimensionality-reduction.md using include_example
      
      Replaced example code in mllib-dimensionality-reduction.md using
      include_example
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #11132 from devaraj-kavali/SPARK-13016.
      9f410871
    • Xiu Guo's avatar
      [SPARK-13422][SQL] Use HashedRelation instead of HashSet in Left Semi Joins · 20637818
      Xiu Guo authored
      Use HashedRelation, which is a more optimized data structure, and reduce code complexity
      
      Author: Xiu Guo <xguo27@gmail.com>
      
      Closes #11291 from xguo27/SPARK-13422.
      20637818
    • Michael Armbrust's avatar
      [SPARK-12546][SQL] Change default number of open parquet files · 173aa949
      Michael Armbrust authored
      A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs.  The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms.  As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #11308 from marmbrus/parquetWriteOOM.
      173aa949
    • Reynold Xin's avatar
      [SPARK-13413] Remove SparkContext.metricsSystem · 4a91806a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      
      This patch removes SparkContext.metricsSystem. SparkContext.metricsSystem returns MetricsSystem, which is a private class. I think it was added by accident.
      
      In addition, I also removed an unused private[spark] setter for schedulerBackend.
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11282 from rxin/SPARK-13413.
      4a91806a
    • Timothy Chen's avatar
      [SPARK-10749][MESOS] Support multiple roles with mesos cluster mode. · 00461bb9
      Timothy Chen authored
      Currently the Mesos cluster dispatcher does not use offers from multiple roles correctly: it simply aggregates all the offers' resource values into one pool, but does not apply them correctly before launching the driver, since Mesos needs each resource from an offer to specify which role it originally belongs to. Multiple roles are already supported by the fine/coarse-grained schedulers, so this ports that logic to the cluster scheduler.
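      The point above, sketched with a hypothetical offer model (illustrative Python, not the dispatcher's real code):

```python
# Resources must be tracked per role, not summed into one pool, because
# Mesos requires each consumed resource to name its originating role.

def aggregate_wrong(offers):
    return sum(cpus for _role, cpus in offers)  # loses role information

def aggregate_by_role(offers):
    by_role = {}
    for role, cpus in offers:
        by_role[role] = by_role.get(role, 0) + cpus
    return by_role

offers = [("*", 2.0), ("spark", 4.0), ("spark", 2.0)]
print(aggregate_wrong(offers))    # 8.0
print(aggregate_by_role(offers))  # {'*': 2.0, 'spark': 6.0}
```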
      
      https://issues.apache.org/jira/browse/SPARK-10749
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #8872 from tnachen/cluster_multi_roles.
      00461bb9
    • Yanbo Liang's avatar
      [SPARK-13334][ML] ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should set parent · 40e6d40f
      Yanbo Liang authored
      ML ```KMeansModel / BisectingKMeansModel / QuantileDiscretizer``` should set parent.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11214 from yanboliang/spark-13334.
      40e6d40f
    • Bryan Cutler's avatar
      [SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format · e298ac91
      Bryan Cutler authored
      Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the fpm and recommendation modules.
      
      Closes #10602
      Closes #10897
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: somideshmukh <somilde@us.ibm.com>
      
      Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
      e298ac91
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments · 024482bf
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR tries to fix all typos in all markdown files under `docs` module,
      and fixes similar typos in other comments, too.
      
      ## How was this patch tested?
      
      manual tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11300 from dongjoon-hyun/minor_fix_typos.
      024482bf
    • Holden Karau's avatar
      [SPARK-13399][STREAMING] Fix checkpointsuite type erasure warnings · 1b144455
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Change how CheckpointSuite gets the output streams so that the generic type is explicitly unchecked, to avoid the warnings. This only impacts test code.
      
      Alternatively we could encode the type tag in TestOutputStreamWithPartitions and filter on the type tag as well - but this is unnecessary since multiple test output streams are not registered and the previous code was not actually checking the type.
      
      ## How was this patch tested?
      
      unit tests (streaming/testOnly org.apache.spark.streaming.CheckpointSuite)
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11286 from holdenk/SPARK-13399-checkpointsuite-type-erasure.
      1b144455