  1. Jun 29, 2017
    • [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling · d995dac1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or its backing bytes) is reused elsewhere. This can happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer that the `UnsafeRow` points to, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
      
      This was not seen before because the spilling logic rarely performed actual spills and instead used an in-memory page. This page was not cleaned up during window processing and ensured that unsafe objects pointed to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909; after that PR, Spark spills more eagerly.
      
      This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
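      To illustrate the idea behind the fix, a minimal hedged sketch (the helper name is hypothetical, not the actual Spark patch): copying an `UnsafeRow` before it is kept in an aggregation buffer detaches it from the spill reader's reusable byte buffer.

      ```scala
      import org.apache.spark.sql.catalyst.expressions.UnsafeRow

      // Hypothetical helper, for illustration only: defensively copy a row whose
      // backing bytes may be reused by the spill reader after this call returns.
      def bufferSafely(incoming: UnsafeRow): UnsafeRow = {
        // UnsafeRow.copy() materializes the row into its own byte array, so later
        // reuse of the reader's buffer cannot corrupt the buffered value.
        incoming.copy()
      }
      ```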
      
      ## How was this patch tested?
      Added a regression test to `DataFrameWindowFunctionsSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #18470 from hvanhovell/SPARK-21258.
      
      (cherry picked from commit e2f32ee4)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      d995dac1
    • [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 · 083adb07
      IngoSchuster authored
      ## What changes were proposed in this pull request?
      Please see also https://issues.apache.org/jira/browse/SPARK-21176
      
      This change limits the number of selector threads that Jetty creates to at most 8 per proxy servlet (Jetty's default is the number of processors / 2).
      The `newHttpClient` method of Jetty's `ProxyServlet` class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
      Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override.
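      For illustration, a hedged sketch of such an override (the servlet class name here is made up; this is not necessarily the exact Spark code):

      ```scala
      import org.eclipse.jetty.client.HttpClient
      import org.eclipse.jetty.client.http.HttpClientTransportOverHTTP
      import org.eclipse.jetty.proxy.ProxyServlet

      // Illustrative proxy servlet that caps Jetty's selector threads at 8
      // instead of the default of (number of processors / 2).
      class CappedSelectorProxyServlet extends ProxyServlet {
        override def newHttpClient(): HttpClient = {
          new HttpClient(new HttpClientTransportOverHTTP(8), null)
        }
      }
      ```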
      
      I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR?
      
      ## How was this patch tested?
      The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
      
      gurvindersingh zsxwing can you please review the change?
      
      Author: IngoSchuster <ingo.schuster@de.ibm.com>
      Author: Ingo Schuster <ingo.schuster@de.ibm.com>
      
      Closes #18437 from IngoSchuster/master.
      
      (cherry picked from commit 88a536ba)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      083adb07
  2. Jun 25, 2017
  3. Jun 24, 2017
    • [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct · 0d6b701e
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE `tab1`
      (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
      USING parquet
      
      INSERT INTO `tab1`
      SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
      
      SELECT custom_fields.id, custom_fields.value FROM tab1
      ```
      
      The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
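      As a hedged, plain-Scala analogy for the reuse bug (this is not the Catalyst code itself): writing every array element through one shared mutable buffer leaves all elements aliasing the same object, which ends up holding only the last values written.

      ```scala
      // Broken: all three elements reference the same mutable array, which finally
      // holds (3, "v3") -- analogous to reusing one GenericInternalRow per cast.
      val shared = new Array[Any](2)
      val broken = (1 to 3).map { i => shared(0) = i; shared(1) = s"v$i"; shared }

      // Fixed: allocate a fresh object for each element.
      val fixed = (1 to 3).map { i => Array[Any](i, s"v$i") }
      ```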
      
      ### How was this patch tested?
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18412 from gatorsmile/castStruct.
      
      (cherry picked from commit 2e1586f6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      0d6b701e
    • [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. · 6750db3f
      Marcelo Vanzin authored
      
      Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
      the same scheduler implementation is used, and if it tries to connect to the
      launcher it will fail. So fix the scheduler so it only tries that in client mode;
      cluster mode applications will be correctly launched and will work, but monitoring
      through the launcher handle will not be available.
      
      Tested by running a cluster mode app with "SparkLauncher.startApplication".
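      For reference, a hedged usage sketch of that scenario (paths and class names are hypothetical); after this change the application launches correctly, but the returned handle receives no state updates because monitoring is client-mode only:

      ```scala
      import org.apache.spark.launcher.SparkLauncher

      val handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")   // hypothetical artifact
        .setMainClass("com.example.Main")     // hypothetical main class
        .setMaster("spark://master:7077")
        .setDeployMode("cluster")
        .startApplication()
      ```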
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18397 from vanzin/SPARK-21159.
      
      (cherry picked from commit bfd73a7c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      6750db3f
    • [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path · f12883e3
      Gabor Feher authored
      This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830
      
      When merging this PR, please give the credit to gaborfeher
      
      Added a test case to OracleIntegrationSuite.scala
      
      Author: Gabor Feher <gabor.feher@lynxanalytics.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18408 from gatorsmile/OracleType.
      f12883e3
  4. Jun 23, 2017
    • [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method · bcaf06c4
      Ong Ming Yang authored
      
      ## What changes were proposed in this pull request?
      
      * Following the first few examples in this file, the remaining methods should also be methods of `df.na`, not `df` (see the sketch after this list).
      * Filled in some missing parentheses
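      A minimal hedged sketch of the documented point (the data here is made up):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("na-demo").getOrCreate()
      import spark.implicits._

      val df = Seq((Some(1.0), "a"), (None: Option[Double], "b")).toDF("x", "y")
      val cleaned = df.na.fill(0.0)   // correct: fill() is a method of df.na
      // df.fill(0.0)                 // does not compile: DataFrame has no fill()
      ```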
      
      ## How was this patch tested?
      
      N/A
      
      Author: Ong Ming Yang <me@ongmingyang.com>
      
      Closes #18398 from ongmingyang/master.
      
      (cherry picked from commit 4cc62951)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      bcaf06c4
    • [SPARK-21181] Release byteBuffers to suppress netty error messages · f8fd3b48
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      We explicitly call release on the ByteBufs used to encode the string to Base64, to suppress the memory leak error message reported by Netty. This makes it less confusing for the user.
      
      ### Changes proposed in this fix
      By explicitly invoking release on the ByteBufs, we decrement the internal reference counts of the wrapped ByteBufs. When the GC kicks in, these are reclaimed as before; Netty just no longer reports memory leak error messages because the internal reference counts are now 0.
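      A hedged sketch of the general pattern (illustrative only, not the exact Spark code):

      ```scala
      import io.netty.buffer.{ByteBuf, Unpooled}

      val buf: ByteBuf = Unpooled.wrappedBuffer("payload".getBytes("UTF-8"))
      try {
        // ... encode / use the bytes ...
      } finally {
        buf.release()   // refCnt drops 1 -> 0, so Netty's leak detector stays quiet
      }
      ```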
      
      ## How was this patch tested?
      Ran a few spark-applications and examined the logs. The error message no longer appears.
      
      Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392
      
      
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #18407 from dhruve/master.
      
      (cherry picked from commit 1ebe7ffe)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      f8fd3b48
  5. Jun 22, 2017
  6. Jun 20, 2017
    • [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are... · 8923bac1
      assafmendelson authored
      [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1
      
      ## What changes were proposed in this pull request?
      
      The description for several options of File Source for structured streaming appeared in the File Sink description instead.
      
      This commit continues PR #18342 and targets the documentation fixes for Spark version 2.1.
      
      ## How was this patch tested?
      
      Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide.
      
      zsxwing This is the PR to fix version 2.1 as discussed in PR #18342
      
      Author: assafmendelson <assaf.mendelson@gmail.com>
      
      Closes #18363 from assafmendelson/spark-21123-for-spark2.1.
      8923bac1
  7. Jun 19, 2017
  8. Jun 15, 2017
  9. Jun 14, 2017
    • [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values... · a890466b
      gatorsmile authored
      [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2
      
      ---
      
      The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
      
      The BigDecimal's precision is the digit count starting from the leftmost nonzero digit, based on [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal follows the database decimal standard, in which precision is the total number of digits, including those both to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
      
      Before this PR, the following queries failed:
      ```SQL
      select 1 > 0.0001
      select floor(0.0001)
      select ceil(0.0001)
      ```
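      To see the precision mismatch concretely, a hedged Scala illustration:

      ```scala
      import java.math.BigDecimal

      val bd = new BigDecimal("0.0001")
      bd.precision()  // 1 -- Java counts the digits of the unscaled value
      bd.scale()      // 4 -- four digits to the right of the decimal point
      // A SQL-style decimal requires precision >= scale, so 0.0001 should map to
      // Decimal(4, 4) rather than Decimal(1, 4); the fix adjusts this during conversion.
      ```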
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18297 from gatorsmile/backport18244.
      
      (cherry picked from commit 62651195)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      a890466b
  10. Jun 13, 2017
  11. Jun 08, 2017
  12. Jun 03, 2017
  13. Jun 01, 2017
    • [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. · 0b25a7d9
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18178 from vanzin/SPARK-20922-hotfix.
      0b25a7d9
    • [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. · 772a9b96
      Marcelo Vanzin authored
      
      Blindly deserializing classes using Java serialization opens the code up to
      issues in other libraries, since just deserializing data from a stream may
      end up executing code (think `readObject()`).
      
      Since the launcher protocol is pretty self-contained, there's just a handful
      of classes it legitimately needs to deserialize, and they're in just two
      packages, so add a filter that throws errors if classes from any other
      package show up in the stream.
      
      This also maintains backwards compatibility (the updated launcher code can
      still communicate with the backend code in older Spark releases).
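      A hedged sketch of the general technique (class and package names here are illustrative, not the actual launcher code):

      ```scala
      import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

      // Reject any class outside an allowed set of package prefixes while deserializing.
      // (Simplified: array and primitive descriptors are not handled here.)
      class FilteredObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
        private val allowedPrefixes = Seq("java.lang.", "org.apache.spark.launcher.")

        override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
          if (!allowedPrefixes.exists(p => desc.getName.startsWith(p))) {
            throw new InvalidClassException(desc.getName, "disallowed class in launcher stream")
          }
          super.resolveClass(desc)
        }
      }
      ```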
      
      Tested with new and existing unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18166 from vanzin/SPARK-20922.
      
      (cherry picked from commit 8efc6e98)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      772a9b96
  14. May 31, 2017
    • [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException · dade85f7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18168 from zsxwing/SPARK-20940.
      
      (cherry picked from commit 24db3582)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      dade85f7
  15. May 30, 2017
    • [SPARK-20275][UI] Do not display "Completed" column for in-progress applications · 46400867
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The current HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of showing this incorrect completed date, this change makes the column invisible for in-progress applications.
      
      The column is only made invisible rather than the field deleted because this data is fetched through the REST API, whose format is shown below (in it, `endTime` corresponds to `endTimeEpoch`). So, instead of changing the REST API and breaking backward compatibility, the simple solution chosen here is to make this column invisible.
      
      ```
      [ {
        "id" : "local-1491805439678",
        "name" : "Spark shell",
        "attempts" : [ {
          "startTime" : "2017-04-10T06:23:57.574GMT",
          "endTime" : "1969-12-31T23:59:59.999GMT",
          "lastUpdated" : "2017-04-10T06:23:57.574GMT",
          "duration" : 0,
          "sparkUser" : "",
          "completed" : false,
          "startTimeEpoch" : 1491805437574,
          "endTimeEpoch" : -1,
          "lastUpdatedEpoch" : 1491805437574
        } ]
      } ]
      ```
      
      Here is the UI before the change:
      
      <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">
      
      And after:
      
      <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png">
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17588 from jerryshao/SPARK-20275.
      
      (cherry picked from commit 52ed9b28)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      46400867
  16. May 27, 2017
  17. May 26, 2017
  18. May 25, 2017
  19. May 24, 2017
    • [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files · 7015f6f0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up to #18073, taking a safer approach to shutting down the pool to prevent possible issues. It also uses `ThreadUtils.newForkJoinPool` instead to set a better thread name.
      
      ## How was this patch tested?
      
      Manual test.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18100 from viirya/SPARK-20848-followup.
      
      (cherry picked from commit 6b68d61c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      7015f6f0
    • [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion... · c3302e81
      Xingbo Jiang authored
      [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion iterator read lock release
      
      This is a backport PR of  #18076 to 2.1.
      
      ## What changes were proposed in this pull request?
      
      When a TaskContext is not propagated properly to all child threads of the task, as in the cases reported in this issue, we fail to get the TID from the TaskContext, which leaves us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.
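      A hedged sketch of the failure mode and workaround (method and parameter names are simplified, not the exact BlockInfoManager API):

      ```scala
      import org.apache.spark.TaskContext

      // The TaskContext is thread-local; a child thread created without propagation
      // sees null, so the lock holder's TID must be passed in explicitly.
      def releaseLock(blockId: String, taskAttemptId: Option[Long]): Unit = {
        val tid = taskAttemptId
          .orElse(Option(TaskContext.get()).map(_.taskAttemptId()))
          .getOrElse(throw new IllegalStateException(s"no task id available to unlock $blockId"))
        // ... release the read lock held by task `tid` on block `blockId` ...
      }
      ```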
      
      ## How was this patch tested?
      
      Add new failing regression test case in `RDDSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18099 from jiangxb1987/completion-iterator-2.1.
      c3302e81
    • [SPARK-20848][SQL] Shutdown the pool after reading parquet files · 2f68631f
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state, and never stopped, which leads to unbounded growth in number of threads.
      
      We should shut down the pool after reading parquet files.
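      A hedged sketch of the pattern (not the exact Spark code, which the follow-up moved to `ThreadUtils.newForkJoinPool`):

      ```scala
      import java.util.concurrent.ForkJoinPool

      val pool = new ForkJoinPool(8)
      try {
        // ... submit the parallel file-footer reads to this pool ...
      } finally {
        pool.shutdown()   // without this, an idle worker thread survives every call
      }
      ```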
      
      ## How was this patch tested?
      
      Added a test to ParquetFileFormatSuite.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18073 from viirya/SPARK-20848.
      
      (cherry picked from commit f72ad303)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2f68631f
    • [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel · 13adc0fc
      Bago Amirbekian authored
      
      ## What changes were proposed in this pull request?
      
      Fixed a TypeError with Python 3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, Python 3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
      
      ## How was this patch tested?
      
      Existing tests run using python3 and numpy 1.12.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18081 from MrBago/BF-py3floatbug.
      
      (cherry picked from commit bc66a77b)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      13adc0fc
  20. May 22, 2017
    • [SPARK-20763][SQL][BACKPORT-2.1] The function of `month` and `day` return the... · f4538c95
      liuxian authored
      [SPARK-20763][SQL][BACKPORT-2.1] The function of `month` and `day` return the value which is not we expected.
      
      ## What changes were proposed in this pull request?
      
      This PR is to backport #17997 to Spark 2.1.
      
      When the date is before 1582-10-04, the `month` and `day` functions return values that are not what we expect.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18054 from 10110346/wip-lx-0522.
      f4538c95
    • [SPARK-20756][YARN] yarn-shuffle jar references unshaded guava · f5ef0762
      Mark Grover authored
      
      and contains scala classes
      
      ## What changes were proposed in this pull request?
      This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava classes already provided in the jar.
      
      Also, it explicitly excludes scala classes from being added to the jar.
      
      ## How was this patch tested?
      Ran unit tests on the module and they passed.
      javap now returns the expected result - a reference to the shaded guava under `org/spark_project` (previously this referred to `com.google...`):
      ```
      javap -cp common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar -c org/apache/spark/network/yarn/YarnShuffleService | grep Lists
            57: invokestatic  #138                // Method org/spark_project/guava/collect/Lists.newArrayList:()Ljava/util/ArrayList;
      ```
      
      Guava is still shaded in the jar:
      ```
      jar -tf common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar | grep guava | head
      META-INF/maven/com.google.guava/
      META-INF/maven/com.google.guava/guava/
      META-INF/maven/com.google.guava/guava/pom.properties
      META-INF/maven/com.google.guava/guava/pom.xml
      org/spark_project/guava/
      org/spark_project/guava/annotations/
      org/spark_project/guava/annotations/Beta.class
      org/spark_project/guava/annotations/GwtCompatible.class
      org/spark_project/guava/annotations/GwtIncompatible.class
      org/spark_project/guava/annotations/VisibleForTesting.class
      ```
      (not sure if the above META-INF/* is a problem or not)
      
      I took this jar, deployed it on a yarn cluster with shuffle service enabled, and made sure the YARN node managers came up. An application with a shuffle was run and it succeeded.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17990 from markgrover/spark-20756.
      
      (cherry picked from commit 36309110)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      f5ef0762
    • [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix · c3a986b1
      Ignacio Bermudez authored
      ## What changes were proposed in this pull request?
      
      When two Breeze SparseMatrices are operated on, the result matrix may contain extra provisional 0 values in the rowIndices and data arrays. This causes an incoherence with the colPtrs data, but Breeze gets away with it by keeping a counter of the valid data.
      
      In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, which might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using the counter of active data.
      
      This conversion is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.
      
      See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
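      A hedged sketch of the slicing idea (simplified; the real change is in `Matrices.fromBreeze`):

      ```scala
      import breeze.linalg.CSCMatrix

      // Only the first `activeSize` entries of rowIndices/data are guaranteed valid;
      // trailing provisional zeros left by Breeze operations must be dropped.
      def validArrays(m: CSCMatrix[Double]): (Array[Int], Array[Double]) = {
        val n = m.activeSize
        (m.rowIndices.slice(0, n), m.data.slice(0, n))
      }
      ```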
      
      ## How was this patch tested?
      
      Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
      
      Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
      
      
      
      Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
      Author: Ignacio Bermudez Corrales <icorrales@splunk.com>
      
      Closes #17940 from ghoto/bug-fix/SPARK-20687.
      
      (cherry picked from commit 06dda1d5)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      c3a986b1
  21. May 19, 2017