  1. Jul 14, 2017
  2. Jul 09, 2017
  3. Jul 08, 2017
  4. Jul 06, 2017
    • [SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream · 7f7b63bb
      Sumedh Wale authored
      
      ## What changes were proposed in this pull request?
      
      Corrects the offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have their own SparkPlan implementations and cause an EXCHANGE in writes.
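      
      For illustration (not part of the original patch), a self-contained sketch of this bug class; `backing` and `startIndex` are hypothetical names, not the actual `UnsafeRow` fields:
      
      ```scala
      import java.io.ByteArrayOutputStream
      
      // A row of `sizeInBytes` bytes stored inside a larger byte[] at a
      // non-zero index: the write must start at that index, not at 0.
      object WriteToStreamSketch extends App {
        val backing = new Array[Byte](32)
        val startIndex = 8          // non-zero offset: the case the new test covers
        val sizeInBytes = 16
        for (i <- 0 until sizeInBytes) backing(startIndex + i) = i.toByte
      
        val out = new ByteArrayOutputStream()
        out.write(backing, startIndex, sizeInBytes)   // correct: honor the offset
        assert(out.toByteArray.sameElements(backing.slice(startIndex, startIndex + sizeInBytes)))
      }
      ```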
      
      ## How was this patch tested?
      
      Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset.
      
      Author: Sumedh Wale <swale@snappydata.io>
      
      Closes #18535 from sumwale/SPARK-21312.
      
      (cherry picked from commit 14a3bb3a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  5. Jul 04, 2017
    • [SPARK-20256][SQL][BRANCH-2.1] SessionState should be created more lazily · 8f1ca695
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      `SessionState` is designed to be created lazily. However, in reality, it is created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)).
      
      This PR aims to restore the lazy behavior by keeping the options in `initialSessionOptions`. The benefit is that users can start `spark-shell` and use RDD operations without any problems.
      
      **BEFORE**
      ```scala
      $ bin/spark-shell
      java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
      ...
      Caused by: org.apache.spark.sql.AnalysisException:
          org.apache.hadoop.hive.ql.metadata.HiveException:
             MetaException(message:java.security.AccessControlException:
                Permission denied: user=spark, access=READ,
                   inode="/apps/hive/warehouse":hive:hdfs:drwx------
      ```
      As reported in SPARK-20256, this happens when the user is not allowed to access the warehouse directory.
      
      **AFTER**
      ```scala
      $ bin/spark-shell
      ...
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.1.2-SNAPSHOT
            /_/
      
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> sc.range(0, 10, 1).count()
      res0: Long = 10
      ```
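      
      A minimal sketch of the deferred-initialization pattern (illustrative names, not the real `SparkSession` internals): builder options are stashed in a map, and the expensive state is only constructed on first use, so RDD-only work never touches the metastore.
      
      ```scala
      class SessionSketch {
        val initialSessionOptions = scala.collection.mutable.Map.empty[String, String]
      
        // `lazy val` defers creation; in Spark this is where the Hive metastore would be hit.
        lazy val sessionState: Map[String, String] = {
          println("building session state")
          initialSessionOptions.toMap
        }
      }
      
      object SessionSketch extends App {
        val s = new SessionSketch
        s.initialSessionOptions("spark.sql.warehouse.dir") = "/tmp/warehouse"
        // Nothing has been built yet; creation happens only on this first access.
        println(s.sessionState("spark.sql.warehouse.dir"))
      }
      ```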
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18530 from dongjoon-hyun/SPARK-20256-BRANCH-2.1.
  6. Jun 30, 2017
  7. Jun 29, 2017
    • [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling · d995dac1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or its backing bytes) is reused for other things. This can happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
      
      This was not seen before because the spilling logic rarely performed actual spills and used an in-memory page instead. That page was not cleaned up during window processing and ensured that unsafe objects pointed to their own dedicated memory locations. This was changed by https://github.com/apache/spark/pull/16909; after that PR, Spark spills more eagerly.
      
      This PR provides a surgical fix because we are close to releasing Spark 2.2. The change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
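      
      A self-contained sketch of the reuse hazard (a simplified stand-in for the `UnsafeSorterSpillReader` behavior described above, not the actual patch):
      
      ```scala
      object BufferReuseSketch extends App {
        val buffer = new Array[Byte](4)   // one buffer, reused for every row read
        def readRow(value: Byte): Array[Byte] = {
          java.util.Arrays.fill(buffer, value)
          buffer
        }
      
        val first       = readRow(1)           // FIRST-style: keeps a reference, no copy
        val firstCopied = readRow(1).clone()   // the fix's idea: copy before storing
        readRow(2)                             // the next row overwrites the shared buffer
      
        println(first(0))        // 2 -- corrupted: still points at the reused buffer
        println(firstCopied(0))  // 1 -- safe: owns its own bytes
      }
      ```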
      
      ## How was this patch tested?
      Added a regression test to `DataFrameWindowFunctionsSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #18470 from hvanhovell/SPARK-21258.
      
      (cherry picked from commit e2f32ee4)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 · 083adb07
      IngoSchuster authored
      ## What changes were proposed in this pull request?
      Please see also https://issues.apache.org/jira/browse/SPARK-21176
      
      This change limits the number of selector threads that Jetty creates to a maximum of 8 per proxy servlet (the Jetty default is the number of processors / 2).
      The `newHttpClient` method of Jetty's `ProxyServlet` class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers). Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override.
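      
      A hedged sketch of the override (close in spirit to the change, but not verbatim): keep Jetty's processors/2 heuristic while capping the selector count at 8.
      
      ```scala
      import org.eclipse.jetty.client.HttpClient
      import org.eclipse.jetty.client.http.HttpClientTransportOverHTTP
      import org.eclipse.jetty.proxy.ProxyServlet
      
      class CappedProxyServlet extends ProxyServlet {
        override def newHttpClient(): HttpClient = {
          // Jetty's default heuristic (#processors / 2), capped at 8 selector threads.
          val numSelectors = math.max(1, math.min(8, Runtime.getRuntime.availableProcessors() / 2))
          new HttpClient(new HttpClientTransportOverHTTP(numSelectors), null)
        }
      }
      ```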
      
      I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR?
      
      ## How was this patch tested?
      The patch was tested manually on a Spark cluster whose head node has 88 processors, using JMX to verify that the number of selector threads is now limited to 8 per proxy.
      
      gurvindersingh zsxwing can you please review the change?
      
      Author: IngoSchuster <ingo.schuster@de.ibm.com>
      Author: Ingo Schuster <ingo.schuster@de.ibm.com>
      
      Closes #18437 from IngoSchuster/master.
      
      (cherry picked from commit 88a536ba)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  8. Jun 25, 2017
  9. Jun 24, 2017
    • [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct · 0d6b701e
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE `tab1`
      (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
      USING parquet
      
      INSERT INTO `tab1`
      SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
      
      SELECT custom_fields.id, custom_fields.value FROM tab1
      ```
      
      The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
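      
      A self-contained sketch of the shared-mutable-row hazard (a simplified stand-in for `GenericInternalRow`, not the actual optimizer code):
      
      ```scala
      object SharedRowSketch extends App {
        final class MutableRow { var id: Long = _; var value: String = _ }
      
        val shared = new MutableRow
        val buggy = Seq((1L, "a"), (2L, "b")).map { case (i, v) =>
          shared.id = i; shared.value = v; shared   // same object for every element
        }
        println(buggy.map(r => (r.id, r.value)))    // List((2,b), (2,b)) -- all last struct
      
        val fixed = Seq((1L, "a"), (2L, "b")).map { case (i, v) =>
          val r = new MutableRow; r.id = i; r.value = v; r   // fresh object per element
        }
        println(fixed.map(r => (r.id, r.value)))    // List((1,a), (2,b))
      }
      ```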
      
      ### How was this patch tested?
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18412 from gatorsmile/castStruct.
      
      (cherry picked from commit 2e1586f6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. · 6750db3f
      Marcelo Vanzin authored
      
      Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
      the same scheduler implementation is used, and if it tries to connect to the
      launcher it will fail. So fix the scheduler so it only tries that in client mode;
      cluster mode applications will be correctly launched and will work, but monitoring
      through the launcher handle will not be available.
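      
      A hedged sketch of the guard's shape (illustrative names, not the actual scheduler code):
      
      ```scala
      object LauncherGuardSketch extends App {
        // Only client-mode submissions attempt the launcher callback; cluster
        // mode launches normally, just without launcher-handle monitoring.
        def maybeConnectToLauncher(deployMode: String)(connect: () => Unit): Unit =
          if (deployMode == "client") connect()
      
        maybeConnectToLauncher("client")(() => println("connected to launcher"))
        maybeConnectToLauncher("cluster")(() => println("never runs"))
      }
      ```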
      
      Tested by running a cluster mode app with "SparkLauncher.startApplication".
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18397 from vanzin/SPARK-21159.
      
      (cherry picked from commit bfd73a7c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path · f12883e3
      Gabor Feher authored
      This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830
      
      When merging this PR, please give the credit to gaborfeher
      
      Added a test case to OracleIntegrationSuite.scala
      
      Author: Gabor Feher <gabor.feher@lynxanalytics.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18408 from gatorsmile/OracleType.
  10. Jun 23, 2017
    • [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method · bcaf06c4
      Ong Ming Yang authored
      
      ## What changes were proposed in this pull request?
      
      * Following the first few examples in this file, the remaining methods should also be methods of `df.na`, not `df` (see the sketch below).
      * Filled in some missing parentheses
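      
      A small usage sketch matching the corrected docs (hypothetical data; the point is that the fill/drop family lives on `df.na`, i.e. `DataFrameNaFunctions`, not on `df`):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      object NaDocsSketch extends App {
        val spark = SparkSession.builder().master("local[1]").appName("na-docs").getOrCreate()
        import spark.implicits._
      
        val df = Seq(("a", "x"), (null, "y")).toDF("s1", "s2")
        df.na.fill("unknown").show()   // correct: a method of df.na, not of df
        spark.stop()
      }
      ```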
      
      ## How was this patch tested?
      
      N/A
      
      Author: Ong Ming Yang <me@ongmingyang.com>
      
      Closes #18398 from ongmingyang/master.
      
      (cherry picked from commit 4cc62951)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
    • [SPARK-21181] Release byteBuffers to suppress netty error messages · f8fd3b48
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      We explicitly call release on the ByteBufs used to encode the string to Base64, to suppress the memory-leak error messages reported by netty. This makes the logs less confusing for the user.
      
      ### Changes proposed in this fix
      By explicitly invoking release on the ByteBufs, we decrement the internal reference counts of the wrapped ByteBufs. When the GC kicks in, they are reclaimed as before; netty just no longer reports memory-leak error messages, since the internal reference counts are now 0.
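      
      A minimal sketch of the reference-count mechanics behind this (not the actual Spark code): once release() drops the refCnt to 0, netty's leak detector has nothing to report when the object is later reclaimed.
      
      ```scala
      import io.netty.buffer.Unpooled
      
      object ReleaseSketch extends App {
        val buf = Unpooled.wrappedBuffer("payload".getBytes("UTF-8"))
        println(buf.refCnt())   // 1 while referenced
        buf.release()           // explicit release: refCnt -> 0, no leak warning
        println(buf.refCnt())   // 0
      }
      ```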
      
      ## How was this patch tested?
      Ran a few spark-applications and examined the logs. The error message no longer appears.
      
      Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392
      
      
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #18407 from dhruve/master.
      
      (cherry picked from commit 1ebe7ffe)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
  11. Jun 22, 2017
  12. Jun 20, 2017
    • [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are... · 8923bac1
      assafmendelson authored
      [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1
      
      ## What changes were proposed in this pull request?
      
      The descriptions for several options of the File Source for Structured Streaming appeared in the File Sink description instead.
      
      This commit continues PR #18342 and targets the documentation fixes for Spark version 2.1.
      
      ## How was this patch tested?
      
      Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide.
      
      zsxwing This is the PR to fix version 2.1 as discussed in PR #18342
      
      Author: assafmendelson <assaf.mendelson@gmail.com>
      
      Closes #18363 from assafmendelson/spark-21123-for-spark2.1.
  13. Jun 19, 2017
  14. Jun 15, 2017
  15. Jun 14, 2017
    • [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values... · a890466b
      gatorsmile authored
      [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2
      
      ---
      
      The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
      
      A BigDecimal's precision is the digit count starting from the leftmost nonzero digit, per [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal follows the database decimal standard, in which precision is the total number of digits, including those both to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
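      
      An illustration of the two definitions (the final adjustment is an assumed simplification of the conversion, not the exact patch):
      
      ```scala
      object DecimalPrecisionSketch extends App {
        val bd = new java.math.BigDecimal("0.0001")
        println(bd.precision)   // 1 -- Java's definition: digits from the leftmost nonzero digit
        println(bd.scale)       // 4
      
        // Assumed conversion: a value in (-1, 1) needs at least `scale` total digits.
        val sqlPrecision = math.max(bd.precision, bd.scale)
        println(sqlPrecision)   // 4 -- database-style precision, i.e. DECIMAL(4, 4)
      }
      ```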
      
      Before this PR, the following queries failed:
      ```SQL
      select 1 > 0.0001
      select floor(0.0001)
      select ceil(0.0001)
      ```
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18297 from gatorsmile/backport18244.
      
      (cherry picked from commit 62651195)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  16. Jun 13, 2017
  17. Jun 08, 2017
  18. Jun 03, 2017
  19. Jun 01, 2017
    • [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. · 0b25a7d9
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18178 from vanzin/SPARK-20922-hotfix.
    • [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. · 772a9b96
      Marcelo Vanzin authored
      
      Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up executing code (think readObject()).
      
      Since the launcher protocol is pretty self-contained, there's just a handful
      of classes it legitimately needs to deserialize, and they're in just two
      packages, so add a filter that throws errors if classes from any other
      package show up in the stream.
      
      This also maintains backwards compatibility (the updated launcher code can
      still communicate with the backend code in older Spark releases).
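      
      A hedged sketch of the filtering idea (the class and package names here are illustrative, not the launcher's actual filter):
      
      ```scala
      import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}
      
      class AllowlistObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
        private val allowedPackages = Seq("java.lang.", "org.apache.spark.launcher.")
      
        // Reject any class outside the allowed packages before it is instantiated.
        override def resolveClass(desc: ObjectStreamClass): Class[_] = {
          if (!allowedPackages.exists(p => desc.getName.startsWith(p))) {
            throw new InvalidClassException(desc.getName, "not allowed by the launcher protocol")
          }
          super.resolveClass(desc)
        }
      }
      ```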
      
      Tested with new and existing unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18166 from vanzin/SPARK-20922.
      
      (cherry picked from commit 8efc6e98)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
  20. May 31, 2017
    • [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException · dade85f7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `IllegalAccessError` is a fatal error (a subclass of LinkageError), and its meaning is "Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to". Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad, because it usually just kills executors or the SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing a SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`.
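      
      A minimal sketch of the substitution (the message text is illustrative): `IllegalStateException` is a plain RuntimeException, so the paths that treat LinkageError subclasses as fatal are never triggered.
      
      ```scala
      object AccumulatorCheckSketch {
        def assertMetadataAssigned(metadata: Option[String]): Unit = {
          if (metadata.isEmpty) {
            // before: throw new IllegalAccessError(...) -- a LinkageError, treated as fatal
            throw new IllegalStateException("The metadata of this accumulator has not been assigned yet.")
          }
        }
      }
      ```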
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18168 from zsxwing/SPARK-20940.
      
      (cherry picked from commit 24db3582)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
  21. May 30, 2017
    • [SPARK-20275][UI] Do not display "Completed" column for in-progress applications · 46400867
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The current HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of unnecessarily showing this incorrect completed date, here we propose to make this column invisible for in-progress applications.
      
      The purpose of only making this column invisible, rather than deleting the field, is that this data is fetched through the REST API, where the format is as shown below and `endTime` matches `endTimeEpoch`. So instead of changing the REST API and breaking backward compatibility, we choose the simple solution of only making this column invisible.
      
      ```
      [ {
        "id" : "local-1491805439678",
        "name" : "Spark shell",
        "attempts" : [ {
          "startTime" : "2017-04-10T06:23:57.574GMT",
          "endTime" : "1969-12-31T23:59:59.999GMT",
          "lastUpdated" : "2017-04-10T06:23:57.574GMT",
          "duration" : 0,
          "sparkUser" : "",
          "completed" : false,
          "startTimeEpoch" : 1491805437574,
          "endTimeEpoch" : -1,
          "lastUpdatedEpoch" : 1491805437574
        } ]
      } ]
      ```
      
      Here is the UI before the change:
      
      https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png
      
      And after:
      
      https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17588 from jerryshao/SPARK-20275.
      
      (cherry picked from commit 52ed9b28)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  22. May 27, 2017
  23. May 26, 2017
  24. May 25, 2017
  25. May 24, 2017
    • [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files · 7015f6f0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up to #18073, taking a safer approach to shutting down the pool to prevent possible issues. It also uses `ThreadUtils.newForkJoinPool` instead, to set a better thread name.
      
      ## How was this patch tested?
      
      Manual test.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18100 from viirya/SPARK-20848-followup.
      
      (cherry picked from commit 6b68d61c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion... · c3302e81
      Xingbo Jiang authored
      [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion iterator read lock release
      
      This is a backport PR of #18076 to 2.1.
      
      ## What changes were proposed in this pull request?
      
      When a TaskContext is not propagated properly to all child threads of a task, as in the cases reported in this issue, we fail to get the TID from the TaskContext, which makes us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.
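      
      A hedged sketch of the fix's shape (illustrative names, not the actual `BlockInfoManager` API): the task passes its TID explicitly instead of relying on a thread-local TaskContext that child threads may not inherit.
      
      ```scala
      object ReadLockSketch {
        private val readLocks = scala.collection.mutable.Map.empty[String, Long]
      
        def lockForReading(blockId: String, tid: Long): Unit =
          readLocks.synchronized { readLocks(blockId) = tid }
      
        // Explicit `tid` parameter: no lookup through a thread-local context.
        def unlock(blockId: String, tid: Long): Unit = readLocks.synchronized {
          assert(readLocks.get(blockId).contains(tid), s"read lock on $blockId not held by task $tid")
          readLocks.remove(blockId)
        }
      }
      ```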
      
      ## How was this patch tested?
      
      Added a new regression test case in `RDDSuite` that fails without this fix.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18099 from jiangxb1987/completion-iterator-2.1.
    • [SPARK-20848][SQL] Shutdown the pool after reading parquet files · 2f68631f
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      From the JIRA: on each call to `spark.read.parquet`, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state and never stopped, which leads to unbounded growth in the number of threads.
      
      We should shut down the pool after reading parquet files.
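      
      A hedged sketch of the fix's shape (not the actual `ParquetFileFormat` code): always shut the pool down after use so its worker threads cannot accumulate across repeated calls.
      
      ```scala
      import java.util.concurrent.{Callable, ForkJoinPool}
      
      object PoolShutdownSketch extends App {
        val pool = new ForkJoinPool(8)
        try {
          val task = pool.submit(new Callable[Int] { def call(): Int = 42 })
          println(task.get())
        } finally {
          pool.shutdown()   // without this, a WAITING worker survives each call
        }
      }
      ```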
      
      ## How was this patch tested?
      
      Added a test to ParquetFileFormatSuite.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18073 from viirya/SPARK-20848.
      
      (cherry picked from commit f72ad303)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    • [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel · 13adc0fc
      Bago Amirbekian authored
      
      ## What changes were proposed in this pull request?
      
      Fixed a TypeError with Python 3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, Python 3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
      
      ## How was this patch tested?
      
      Existing tests run using python3 and numpy 1.12.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18081 from MrBago/BF-py3floatbug.
      
      (cherry picked from commit bc66a77b)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>