  1. Mar 14, 2016
    • [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue · 07cb323e
      Josh Rosen authored
      This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.
      
      In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.
      
      Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2
      
      /cc zsxwing tdas davies brkyvz
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11687 from JoshRosen/py4j-0.9.2.
    • [SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean expressions · 6a4bfcd6
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13658
      
      ## What changes were proposed in this pull request?
      
      Quoted from the JIRA description: when running TPCDS Q3 [1] with lots of predicates to filter out partitions, the optimizer rule BooleanSimplification takes about 2 seconds (it makes many semanticEquals calls, each of which copies the whole tree).

      It would be great if we could speed it up.
      
      [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
      
      How to speed it up:

      When we ask for the canonicalized version of an `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms all expressions contained in this expression tree. However, we don't keep the canonicalized versions of those child expressions. So the next time we ask for the canonicalized version of a child expression (e.g., in `BooleanSimplification`), we rerun `Canonicalize.execute` on it, which wastes a lot of time.

      By forcing the child expressions to compute and keep their canonicalized versions first, we can avoid re-canonicalizing these expressions.
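
      As a rough illustration of the caching idea (a minimal sketch with made-up node types, not Catalyst's actual classes), memoizing the canonical form in a `lazy val` built from the children's cached forms means each subtree is canonicalized at most once:
      ```
      sealed trait Expr {
        // Computed once per node and cached; children contribute their own
        // cached canonical forms, so no subtree is ever re-canonicalized.
        lazy val canonicalized: Expr = this match {
          case And(l, r) =>
            // Order commutative operands deterministically.
            val Seq(a, b) = Seq(l.canonicalized, r.canonicalized).sortBy(_.hashCode)
            And(a, b)
          case leaf => leaf
        }
        def semanticEquals(that: Expr): Boolean = canonicalized == that.canonicalized
      }
      case class Attr(name: String) extends Expr
      case class And(left: Expr, right: Expr) extends Expr
      ```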
      
      I simply benchmarked it with an expression that is part of the WHERE clause in TPCDS Q3:
      
          val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)
      
          val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))
      
          val plan = testRelation.where(input).analyze
          val actual = Optimize.execute(plan)
      
      With this patch:
      
          352 milliseconds
          346 milliseconds
          340 milliseconds
      
      Without this patch:
      
          585 milliseconds
          880 milliseconds
          677 milliseconds
      
      ## How was this patch tested?
      
      Existing tests should pass.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11647 from viirya/improve-expr-canonicalize.
    • [SPARK-13779][YARN] Avoid cancelling non-local container requests. · 63f642ae
      Ryan Blue authored
      To maximize locality, the YarnAllocator would cancel any requests with a
      stale locality preference or no locality preference. This assumed that
      the majority of tasks had locality preferences, but that may not be the
      case when scanning S3. This caused container requests for S3 tasks to be
      constantly cancelled and resubmitted.
      
      This changes the allocator's logic to cancel only stale requests and
      just enough requests without locality preferences to submit requests
      with locality preferences. This avoids cancelling requests without
      locality preferences that would be resubmitted without locality
      preferences.
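
      As a sketch of the intended arithmetic (illustrative names; the real logic lives in YarnAllocator), all stale localized requests are cancelled, but no-preference requests are cancelled only to free the slots still needed for new locality-aware requests:
      ```
      // Decide how many pending requests to cancel, given the number of stale
      // locality-preferring requests, pending no-preference requests, and new
      // locality-aware requests that need to be submitted.
      def requestsToCancel(stale: Int, noPreference: Int, newLocalized: Int): (Int, Int) = {
        val cancelNoPref = math.max(0, math.min(noPreference, newLocalized - stale))
        (stale, cancelNoPref)
      }

      // e.g. 2 stale + 10 no-preference pending, 5 new localized to place:
      // requestsToCancel(2, 10, 5) == (2, 3) -- 7 no-preference requests survive.
      ```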
      
      We've deployed this patch on our clusters and verified that jobs that couldn't get executors because they kept canceling and resubmitting requests are fixed. Large jobs are running fine.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #11612 from rdblue/SPARK-13779-fix-yarn-allocator-requests.
    • [SPARK-13578][CORE] Modify launch scripts to not use assemblies. · 45f8053b
      Marcelo Vanzin authored
      Instead of looking for a specially-named assembly, the scripts now will
      blindly add all jars under the libs directory to the classpath. This
      libs directory is still currently the old assembly dir, so things should
      keep working the same way as before until we make more packaging changes.
      
      The only lost feature is the detection of multiple assemblies; I consider
      that a minor nicety that only really affects a few developers, so it's
      probably OK.
      
      Tested locally by running spark-shell; also did some minor Win32 testing
      (just made sure spark-shell started).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11591 from vanzin/SPARK-13578.
    • [SPARK-13833] Guard against race condition when re-caching disk blocks in memory · 9a87afd7
      Josh Rosen authored
      When reading data from the DiskStore and attempting to cache it back into the memory store, we should guard against race conditions where multiple readers are attempting to re-cache the same block in memory.
      
      This patch accomplishes this by synchronizing on the block's `BlockInfo` object while trying to re-cache a block.
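
      A self-contained sketch of the guard (hypothetical names, not the actual BlockManager code): readers lock the block's info object and re-check the memory store inside the critical section, so only one of them materializes the block:
      ```
      import scala.collection.concurrent.TrieMap

      object RecacheSketch {
        final class BlockInfo // per-block lock object, analogous to BlockInfo

        private val infos = TrieMap.empty[String, BlockInfo]
        private val memoryStore = TrieMap.empty[String, Array[Byte]]

        def read(blockId: String, readFromDisk: () => Array[Byte]): Array[Byte] = {
          val info = infos.getOrElseUpdate(blockId, new BlockInfo)
          info.synchronized {
            // Double-check under the lock: another reader may have just re-cached it.
            memoryStore.getOrElseUpdate(blockId, readFromDisk())
          }
        }
      }
      ```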
      
      (Will file JIRA as soon as ASF JIRA stops being down / laggy).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11660 from JoshRosen/concurrent-recaching-fixes.
    • [SPARK-13139][SQL] Follow-ups to #11573 · 9a1680c2
      Andrew Or authored
      Addressing outstanding comments in #11573.
      
      Jenkins, new test case in `DDLCommandSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11667 from andrewor14/ddl-parser-followups.
    • [SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files. · 250832c7
      Yin Huai authored
      If a `_SUCCESS` file appears in an inner partitioning dir, partition discovery will treat it as a data file and then fail because it finds that the dir structure is not valid. We should ignore those `_SUCCESS` files.

      In the future, it would be better to ignore all files/dirs starting with `_` or `.`. This PR does not make that change; I am keeping this change simple so we can consider getting it into branch 1.6.

      To ignore all files/dirs starting with `_` or `.`, the main change is to give ParquetRelation another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g., metadata files used by Parquet) and data files, so that change requires more work.
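
      A tiny sketch of the filtering step, with paths as plain strings for illustration:
      ```
      val leafFiles = Seq(
        "/table/year=2016/month=3/part-00000.parquet",
        "/table/year=2016/month=3/_SUCCESS")

      // Partition discovery should only see data files, so _SUCCESS markers
      // (and, eventually, anything starting with `_` or `.`) are dropped first.
      val dataFiles = leafFiles.filterNot(_.split('/').last == "_SUCCESS")
      // dataFiles: List(/table/year=2016/month=3/part-00000.parquet)
      ```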
      
      https://issues.apache.org/jira/browse/SPARK-13207
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #11088 from yhuai/SPARK-13207.
    • [SPARK-13746][TESTS] stop using deprecated SynchronizedSet · 31d069d4
      Wilson Wu authored
      Trait `SynchronizedSet` in package `scala.collection.mutable` is deprecated.
      
      Author: Wilson Wu <wilson888888888@gmail.com>
      
      Closes #11580 from wilson888888888/spark-synchronizedset.
    • [MINOR][DOCS] Fix more typos in comments/strings. · acdf2197
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes 135 typos over 107 files:
      * 121 typos in comments
      * 11 typos in test case names
      * 3 typos in log messages
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11689 from dongjoon-hyun/fix_more_typos.
    • Closes #11668 · e58fa19d
      Reynold Xin authored
  2. Mar 13, 2016
    • [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items) · 18408528
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on the platform default encoding to use UTF-8 instead (see the sketch after this list)
      - Same for `InputStreamReader` and `OutputStreamWriter` constructors
      - Standardizes on UTF-8 everywhere
      - Standardizes on specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
      - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c )
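
      A short sketch of the pattern this change standardizes on:
      ```
      import java.io.InputStreamReader
      import java.nio.charset.StandardCharsets.UTF_8

      val bytes  = "héllo".getBytes(UTF_8)                  // was: "héllo".getBytes()
      val text   = new String(bytes, UTF_8)                 // was: new String(bytes)
      val reader = new InputStreamReader(System.in, UTF_8)  // was: platform default
      // Unlike the string name "UTF-8", the StandardCharsets constant cannot
      // trigger a checked UnsupportedEncodingException, so no handling is needed.
      ```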
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11657 from srowen/SPARK-13823.
    • [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x. · 473263f9
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      For 2.0.0, we should bring **sbt** and the **sbt plugins** up to date. This PR checks the status of each plugin and bumps the following:
      
      * sbt: 0.13.9 --> 0.13.11
      * sbteclipse-plugin: 2.2.0 --> 4.0.0
      * sbt-dependency-graph: 0.7.4 --> 0.8.2
      * sbt-mima-plugin: 0.1.6 --> 0.1.9
      * sbt-revolver: 0.7.2 --> 0.8.0
      
      All other plugins are up-to-date. (Note that `sbt-avro` seems to have changed from 0.3.2 to 1.0.1, but it's not published in the repository.)

      During the upgrade, this PR also updated the following MiMa exclusion. Note that the related exclusion filter is already registered correctly; the difference seems to be due to a change in the problem type that MiMa reports.
      ```
       // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics
       ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"),
      -ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"),
      +ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"),
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins build.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11669 from dongjoon-hyun/update_mima.
    • [SQL] fix typo in DataSourceRegister · f3daa099
      Jacky Li authored
      ## What changes were proposed in this pull request?
      fix typo in DataSourceRegister
      
      ## How was this patch tested?
      
      found when going through latest code
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #11686 from jackylk/patch-12.
    • [SPARK-13812][SPARKR] Fix SparkR lint-r test errors. · c7e68c39
      Sun Rui authored
      ## What changes were proposed in this pull request?
      
      This PR fixes all newly captured SparkR lint-r errors after the lintr package is updated from github.
      
      ## How was this patch tested?
      
      dev/lint-r
      SparkR unit tests
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #11652 from sun-rui/SPARK-13812.
    • [SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Exceptions · 515e4afb
      Bjorn Jonsson authored
      ## What changes were proposed in this pull request?
      Currently, when a java.net.BindException is thrown, it displays the following message:
      
      java.net.BindException: Address already in use: Service '$serviceName' failed after 16 retries!
      
      This change adds port configuration suggestions to the BindException message. For example, for the UI it now displays:
      
      java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries! Consider explicitly setting the appropriate port for 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
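
      A sketch of the idea with an illustrative helper (not the exact change): catch the bind failure and rethrow it with the config keys a user can adjust:
      ```
      import java.net.BindException

      def startServiceWithSuggestion[T](serviceName: String, portConf: String,
                                        maxRetries: Int)(start: () => T): T =
        try start() catch {
          case e: BindException =>
            throw new BindException(
              s"Address already in use: Service '$serviceName' failed after " +
              s"$maxRetries retries! Consider explicitly setting the appropriate " +
              s"port for '$serviceName' (for example $portConf for $serviceName) " +
              "to an available port or increasing spark.port.maxRetries.")
        }
      ```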
      
      ## How was this patch tested?
      Manual tests
      
      Author: Bjorn Jonsson <bjornjon@gmail.com>
      
      Closes #11644 from bjornjon/master.
  3. Mar 12, 2016
    • [MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc. · db88d020
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` in the Java API. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc.
      
      * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html
      * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
    • [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows() · c079420d
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.
      
      ## How was this patch tested?
      
      Existing tests should do the work.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
    • [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown from QueryExecution.assertAnalyzed · 4eace4d3
      Cheng Lian authored
      
      PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached the partially analyzed plan to the `AnalysisException` thrown in `QueryExecution.assertAnalyzed()`.  However, the original stack trace wasn't properly inherited.  This PR fixes the issue by inheriting the stack trace.
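
      A self-contained sketch of the mechanism, with a stand-in exception class rather than Spark's `AnalysisException`: when re-creating the exception with the plan attached, copy the original stack trace over instead of letting the new constructor capture a fresh one:
      ```
      class RichAnalysisError(message: String, val plan: Option[String] = None)
        extends Exception(message)

      def withPlan(e: RichAnalysisError, plan: String): RichAnalysisError = {
        val enriched = new RichAnalysisError(e.getMessage, Some(plan))
        enriched.setStackTrace(e.getStackTrace) // inherit the original trace
        enriched
      }
      ```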
      
      A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11677 from liancheng/analysis-exception-stacktrace.
    • [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources · ba8c86d0
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR splits PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames created from an existing RDD; PhysicalScan is used for DataFrames created from data sources. This enables us to apply different optimizations to each of them.

      Also fixes the problem with sameResult() on two DataSourceScan nodes.

      Also fixes the equality check for `In` to compare toString. It would be better to use Seq there, but we can't break this public API (sad).
      
      ## How was this patch tested?
      
      Existing tests. Manually tested with TPCDS queries Q59 and Q64; all the duplicated exchanges can be re-used now, and we saw a 40+% performance improvement (saving half of the scans).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11514 from davies/existing_rdd.
  4. Mar 11, 2016
    • [SPARK-13830] prefer block manager than direct result for large result · 2ef4c596
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      The current RPC layer can't handle large blocks very well; it's very slow to fetch a 100 MB block (about 1 minute). After switching to the block manager to fetch it, it took about 10 seconds (this could still be improved).
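
      A sketch of the resulting decision logic (illustrative types and threshold name, not the actual TaskRunner code): small results still travel inline over RPC, while large ones are written to the block manager and only a reference is sent back:
      ```
      sealed trait TaskResult
      case class DirectResult(bytes: Array[Byte]) extends TaskResult
      case class IndirectResult(blockId: String, size: Long) extends TaskResult

      def packageResult(bytes: Array[Byte], maxDirectResultBytes: Long,
                        putBlock: Array[Byte] => String): TaskResult =
        if (bytes.length <= maxDirectResultBytes) DirectResult(bytes)
        else IndirectResult(putBlock(bytes), bytes.length.toLong)
      ```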
      
      ## How was this patch tested?
      
      existing unit tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11659 from davies/direct_result.
    • [SPARK-13139][SQL] Parse Hive DDL commands ourselves · 66d9d0ed
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This patch is ported over from viirya's changes in #11048. Currently, for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a prelude to merging `SQLContext` and `HiveContext`.
      
      Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
      
      ## How was this patch tested?
      
      Jenkins, plus the new `DDLCommandSuite`, which comprises about 40% of the changes here.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11573 from andrewor14/parser-plus-plus.
    • [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples files · 42afd72c
      Zheng RuiFeng authored
      JIRA:  https://issues.apache.org/jira/browse/SPARK-13814
      
      ## What changes were proposed in this pull request?
      
      delete unnecessary imports in python examples files
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11651 from zhengruifeng/del_import_pe.
    • [SPARK-13807] De-duplicate `Python*Helper` instantiation code in PySpark streaming · 073bf9d4
      Josh Rosen authored
      This patch de-duplicates code in PySpark streaming which loads the `Python*Helper` classes. I also changed a few `raise e` statements to simply `raise` in order to preserve the full exception stacktrace when re-throwing.
      
      Here's a link to the whitespace-change-free diff: https://github.com/apache/spark/compare/master...JoshRosen:pyspark-reflection-deduplication?w=0
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11641 from JoshRosen/pyspark-reflection-deduplication.
    • [SPARK-13328][CORE] Poor read performance for broadcast variables with dynamic resource allocation · ff776b2f
      Nezih Yigitbasi authored
      When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in this list can be stale due to dynamic resource allocation. This situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. With the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).
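
      As a rough sketch of a mitigation in this spirit (illustrative only, not necessarily what the patch does): exhaust the cached, possibly stale location list once, then refresh it from the driver and retry, rather than paying the full retry cost for every stale entry:
      ```
      def fetchBlock[T](cachedLocations: Seq[String],
                        refreshFromDriver: () => Seq[String],
                        tryFetch: String => Option[T]): Option[T] = {
        def firstHit(locs: Seq[String]): Option[T] =
          locs.view.flatMap(loc => tryFetch(loc)).headOption
        // One pass over the cached locations, then one refreshed pass so
        // removed executors don't cost retries forever.
        firstHit(cachedLocations).orElse(firstHit(refreshFromDriver()))
      }
      ```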
      
      Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
      
      Closes #11241 from nezihyigitbasi/SPARK-13328.
    • [STREAMING][MINOR] Fix a duplicate "be" in comments · eb650a81
      Liwei Lin authored
      Author: Liwei Lin <proflin.me@gmail.com>
      
      Closes #11650 from lw-lin/typo.
    • [SPARK-13780][SQL] Add missing dependency to build. · 99b7187c
      Marcelo Vanzin authored
      This is needed to avoid odd compiler errors when building just the
      sql package with maven, because of odd interactions between scalac
      and shaded classes.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11640 from vanzin/SPARK-13780.
    • [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame · 6d37e1eb
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      PR #11443 temporarily disabled MiMA check, this PR re-enables it.
      
      One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API changes.
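
      For reference, such excludes take roughly this shape in `MimaExcludes.scala` (entries here are illustrative):
      ```
      import com.typesafe.tools.mima.core.{MissingClassProblem, ProblemFilters}

      val excludes = Seq(
        // Hide the removed DataFrame companion object (and its class) from MiMa.
        ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame"),
        ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.DataFrame$")
      )
      ```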
      
      ## How was this patch tested?
      
      Tested by MiMA check triggered by Jenkins.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #11656 from liancheng/re-enable-mima.
    • [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive. · 07f1c544
      Marcelo Vanzin authored
      In preparation for the demise of assemblies, this change allows the
      YARN backend to use multiple jars and globs as the "Spark jar". The
      config option has been renamed to "spark.yarn.jars" to reflect that.
      
      A second option "spark.yarn.archive" was also added; if set, this
      takes precedence and uploads an archive expected to contain the jar
      files with the Spark code and its dependencies.
      
      Existing deployments should keep working, mostly. This change drops
      support for the "SPARK_JAR" environment variable, and also no longer
      falls back to using "jarOfClass" if no configuration is set; instead
      it falls back to finding files under SPARK_HOME. This should be fine,
      since "jarOfClass" probably wouldn't work unless you were using
      spark-submit anyway.
      
      Tested with the unit tests, and trying the different config options
      on a YARN cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11500 from vanzin/SPARK-13577.
    • [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java MaxAbsScaler example · 8fff0f92
      Nick Pentreath authored
      ## What changes were proposed in this pull request?
      
      Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).
      
      ## How was this patch tested?
      
      Existing build/unit tests
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #11653 from MLnick/java-maxabs-example-fix.
    • [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision tree and random forest · 234f781a
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.
      
      ## How was this patch tested?
      
      Python doc tests for the affected classes were updated to check feature importances.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11622 from sethah/SPARK-13787.
    • [SPARK-13512][ML] add example and doc for MaxAbsScaler · 0b713e04
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-13512
      Add example and doc for ml.feature.MaxAbsScaler.
      
      ## How was this patch tested?
       unit tests
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #11392 from hhbyyh/maxabsdoc.
    • [SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class / Spark assembly · 6ca990fb
      Josh Rosen authored
      This patch removes the need to build a full Spark assembly before running the `dev/mima` script.
      
      - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies.
         - This required me to delete two classes full of dead code that we don't use anymore
      - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build.
      - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11178 from JoshRosen/remove-assembly-in-run-tests.
    • [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB · d18276cb
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13672
      
      ## What changes were proposed in this pull request?
      
      add two python examples of BisectingKMeans for ml and mllib
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11515 from zhengruifeng/mllib_bkm_pe.
    • [MINOR][CORE] Fix a duplicate "and" in a log message. · e33bc67c
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11642 from vanzin/spark-conf-typo.
  5. Mar 10, 2016
    • [HOT-FIX] fix compile · 74c4e265
      Wenchen Fan authored
      Fix the compilation failure introduced by https://github.com/apache/spark/pull/11555 because of a merge conflict.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11648 from cloud-fan/hotbug.
    • [SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions · 6871cc8f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Add SQL generation support for window functions. The idea is simple: just treat the `Window` operator like `Project`, i.e. add a subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, and implement the `sql` method for window-related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.
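
      A toy illustration of the "treat `Window` like `Project`" idea, as simplified string building rather than the actual SQL generation code:
      ```
      // The window operator's SQL is a SELECT over its child's SQL, with the
      // window expressions rendered as OVER clauses alongside the child's output.
      def windowToSQL(childSQL: String, childOutput: Seq[String],
                      windowExprs: Seq[String]): String =
        s"SELECT ${(childOutput ++ windowExprs).mkString(", ")} FROM ($childSQL) w"

      windowToSQL("SELECT a, b FROM t", Seq("a", "b"),
                  Seq("RANK() OVER (PARTITION BY a ORDER BY b) AS r"))
      // SELECT a, b, RANK() OVER (PARTITION BY a ORDER BY b) AS r
      //   FROM (SELECT a, b FROM t) w
      ```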
      
      This PR also fixed SPARK-13720 by improving the process of adding an extra `SubqueryAlias` (the `RecoverScopingInfo` rule). Before this PR, we updated the qualifiers in the project list while adding the subquery. However, this is incomplete, as we need to update the qualifiers in all ancestors that refer to attributes here. In this PR, we split `RecoverScopingInfo` into two rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only adds a subquery if necessary, and `UpdateQualifier` re-propagates and updates qualifiers bottom-up.
      
      Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.
      
      Many thanks to gatorsmile for the initial discussion and test cases!
      
      ## How was this patch tested?
      
      new tests in `LogicalPlanToSQLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11555 from cloud-fan/window.
    • [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and Eliminate useless Window · 560489f4
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      `projectList` is useless: its value is always the same as `child.output`. Remove it from the `Window` class; the removal simplifies the code in the Analyzer and Optimizer.
      
      This PR is based on the discussion started by cloud-fan in a separate PR:
      https://github.com/apache/spark/pull/5604#discussion_r55140466
      
      This PR also eliminates useless `Window`.
      
      cloud-fan yhuai
      
      #### How was this patch tested?
      
      Existing test cases cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11565 from gatorsmile/removeProjListWindow.
    • [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs · 4d535d1f
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      
      SparkR support first/last with ignore NAs
      
      cc sun-rui felixcheung shivaram
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11267 from yanboliang/spark-13389.
    • [SPARK-13789] Infer additional constraints from attribute equality · c3a6269c
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for inferring an additional set of data constraints based on attribute equality. For example, if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`.
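
      A minimal sketch of the inference itself, with toy expression types rather than Catalyst's: for every attribute equality `a = b`, any constraint pinning `a` to a literal is rewritten to pin `b` as well:
      ```
      sealed trait Expr
      case class Attr(name: String) extends Expr
      case class Lit(value: Int) extends Expr
      case class EqualTo(left: Expr, right: Expr) extends Expr

      def inferFromEquality(constraints: Set[EqualTo]): Set[EqualTo] = {
        val inferred = for {
          EqualTo(a: Attr, b: Attr) <- constraints
          EqualTo(x, v: Lit) <- constraints if x == a
        } yield EqualTo(b, v)
        constraints ++ inferred
      }

      // inferFromEquality(Set(EqualTo(Attr("a"), Lit(5)), EqualTo(Attr("a"), Attr("b"))))
      // additionally contains EqualTo(Attr("b"), Lit(5))
      ```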
      
      ## How was this patch tested?
      
      Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test)
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11618 from sameeragarwal/infer-isequal-constraints.
    • [SPARK-13327][SPARKR] Added parameter validations for colnames<- · 416e71af
      Oscar D. Lara Yejas authored
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      
      Closes #11220 from olarayej/SPARK-13312-3.