  1. Sep 21, 2016
    • hyukjinkwon's avatar
      [SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding... · 25a020be
      hyukjinkwon authored
      [SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
      
      ## What changes were proposed in this pull request?
      
      This PR includes the changes below:
      
      1. Upgrade Univocity library from 2.1.1 to 2.2.1
      
        This includes some performance improvements and also enables the auto-expanding buffer for the `maxCharsPerColumn` option in CSV. Please refer to the [release notes](https://github.com/uniVocity/univocity-parsers/releases).
      
      2. Remove useless `rowSeparator` variable existing in `CSVOptions`
      
        We have this unused variable in [CSVOptions.scala#L127](https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127), but it can cause confusion because it does not actually handle `\r\n`. For example, there is an open issue about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable.
      
        This variable is effectively unused because we rely on Hadoop's `LineRecordReader`, which already handles both `\n` and `\r\n`.
      
      3. Set the default value of `maxCharsPerColumn` to auto-expanding.
      
        We currently set 1000000 as the maximum length of each column. It would be more sensible to let the buffer auto-expand by default rather than impose a fixed length.
      
        For reference, the use of `-1` for auto-expansion is described in the [2.2.0 release notes](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15138 from HyukjinKwon/SPARK-17583.
      25a020be
    • VinceShieh's avatar
      [SPARK-17219][ML] Add NaN value handling in Bucketizer · 57dc326b
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR fixes an issue when Bucketizer is called on a dataset containing NaN values.
      NaN values can still be meaningful to users, so in these cases Bucketizer should
      reserve one extra bucket for NaN values instead of throwing an exception.
      Before:
      ```
      Bucketizer.transform on NaN value threw an illegal exception.
      ```
      After:
      ```
      NaN values will be grouped in an extra bucket.
      ```
      ## How was this patch tested?
      New test cases added in `BucketizerSuite`.
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #14858 from VinceShieh/spark-17219.
      57dc326b
    • Peng, Meng's avatar
      [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test · b366f184
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      
      Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
      http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
      
      ## How was this patch tested?
      
      Added Scala unit tests.
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14597 from mpjlu/fprChiSquare.
      b366f184
    • Burak Yavuz's avatar
      [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist · 28fafa3e
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `ListingFileCatalog` lists files given a set of resolved paths. If a folder is deleted at any time between when the paths are resolved and when the file catalog checks for the folder, the Spark job fails. This may abruptly stop long-running StructuredStreaming jobs, for example.
      
      Folders may be deleted by users or automatically by retention policies. These cases should not prevent jobs from successfully completing.
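
      The tolerant-listing idea can be sketched in plain Scala (hypothetical names; the real catalog works against Hadoop `FileSystem` listings):

      ```scala
      import java.io.FileNotFoundException

      // Sketch: list each resolved path, but tolerate a folder that
      // disappeared between path resolution and listing instead of failing
      // the whole job.
      def listResilient(paths: Seq[String], list: String => Seq[String]): Seq[String] =
        paths.flatMap { p =>
          try list(p)
          catch { case _: FileNotFoundException => Nil } // deleted in the meantime: skip it
        }
      ```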
      
      ## How was this patch tested?
      
      Unit test in `FileCatalogSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15153 from brkyvz/SPARK-17599.
      28fafa3e
    • Sean Zhong's avatar
      [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value · 3977223a
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      Remainder(%) expression's `eval()` returns an incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to compute "%", and that conversion loses precision.
      
      This bug only affects the `eval()` that is used by constant folding, the codegen path is not impacted.
      
      ### Before change
      ```
      scala> -5083676433652386516D % 10
      res2: Double = -6.0
      
      scala> spark.sql("select -5083676433652386516D % 10 as a").show
      +---+
      |  a|
      +---+
      |0.0|
      +---+
      ```
      
      ### After change
      ```
      scala> spark.sql("select -5083676433652386516D % 10 as a").show
      +----+
      |   a|
      +----+
      |-6.0|
      +----+
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #15171 from clockfly/SPARK-17617.
      3977223a
    • William Benton's avatar
      [SPARK-17595][MLLIB] Use a bounded priority queue to find synonyms in Word2VecModel · 7654385f
      William Benton authored
      ## What changes were proposed in this pull request?
      
      The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.
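
      The single-pass idea can be sketched in plain Scala (a simplified stand-in with hypothetical names; the real code ranks similarity scores inside `Word2VecModel`):

      ```scala
      import scala.collection.mutable

      // Keep only the top-`num` (word, score) pairs in one pass, using a
      // heap bounded at `num` elements instead of sorting the whole collection.
      def topK(scores: Iterator[(String, Double)], num: Int): List[(String, Double)] = {
        // Order by negated score so the heap's head is the *smallest* retained
        // score, which is the element to evict when a better candidate arrives.
        val heap = mutable.PriorityQueue.empty[(String, Double)](
          Ordering.by[(String, Double), Double](-_._2))
        for (pair <- scores) {
          if (heap.size < num) heap.enqueue(pair)
          else if (pair._2 > heap.head._2) { heap.dequeue(); heap.enqueue(pair) }
        }
        heap.dequeueAll.reverse.toList  // descending by score
      }
      ```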
      
      ## How was this patch tested?
      
      This patch adds no user-visible functionality and its correctness should be exercised by existing tests.  To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`:
      
      ```
      object W2VTiming {
        import org.apache.spark.{SparkContext, SparkConf}
        import org.apache.spark.mllib.feature.Word2VecModel
        def run(modelPath: String, scOpt: Option[SparkContext] = None) {
          val sc = scOpt.getOrElse(new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
          val model = Word2VecModel.load(sc, modelPath)
          val keys = model.getVectors.keys
          val start = System.currentTimeMillis
          for(key <- keys) {
            model.findSynonyms(key, 5)
            model.findSynonyms(key, 10)
            model.findSynonyms(key, 25)
            model.findSynonyms(key, 50)
          }
          val finish = System.currentTimeMillis
          println("run completed in " + (finish - start) + "ms")
        }
      }
      ```
      
      I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach.  (If the `num` argument to `findSynonyms` is very close to the vocabulary size, the new approach will have less of an advantage over the old one.)
      
      Author: William Benton <willb@redhat.com>
      
      Closes #15150 from willb/SPARK-17595.
      7654385f
    • Yanbo Liang's avatar
      [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively · d3b88697
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Users sometimes want to add a directory as a dependency. In Scala, they can use `SparkContext.addFile` with the argument `recursive=true` to recursively add all files under the directory. But Python users can only add files, not directories, so we should support this in PySpark as well.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15140 from yanboliang/spark-17585.
      d3b88697
    • wm624@hotmail.com's avatar
      [CORE][DOC] Fix errors in comments · 61876a42
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      While reading the source code of CORE and SQL core, I found some minor errors in comments such as extra spaces, missing blank lines, and grammar errors.
      
      I fixed these minor errors and might find more during my source code study.
      
      ## How was this patch tested?
      Manually build
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15151 from wangmiao1981/mem.
      61876a42
    • jerryshao's avatar
      [SPARK-15698][SQL][STREAMING][FOLLOW-UP] Fix FileStream source and sink log get configuration issue · e48ebc4e
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      This issue was introduced in the previous commit for SPARK-15698, which mistakenly changed the way the configuration is obtained. This follow-up PR reverts that change back to the original approach.
      
      ## How was this patch tested?
      
      N/A
      
      Ping zsxwing, please review again; sorry for the inconvenience. Thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #15173 from jerryshao/SPARK-15698-follow.
      e48ebc4e
  2. Sep 20, 2016
    • Weiqing Yang's avatar
      [MINOR][BUILD] Fix CheckStyle Error · 1ea49916
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      This PR is to fix the code style errors before 2.0.1 release.
      
      ## How was this patch tested?
      Manual.
      
      Before:
      ```
      ./dev/lint-java
      Using `mvn` from path: /usr/local/bin/mvn
      Checkstyle checks failed at following occurrences:
      [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107).
      [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108).
      [ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115).
      [ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107).
      [ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104).
      [ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
      [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[26] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[33] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[38] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[43] (sizes) LineLength: Line is longer than 100 characters (found 106).
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[48] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[0] (misc) NewlineAtEndOfFile: File does not end with a newline.
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java:[67] (sizes) LineLength: Line is longer than 100 characters (found 106).
      [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[200] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[309] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[332] (regexp) RegexpSingleline: No trailing whitespace allowed.
      [ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[348] (regexp) RegexpSingleline: No trailing whitespace allowed.
      ```
      After:
      ```
      ./dev/lint-java
      Using `mvn` from path: /usr/local/bin/mvn
      Checkstyle checks passed.
      ```
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15170 from Sherry302/fixjavastyle.
      1ea49916
    • petermaxlee's avatar
      [SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata · 976f3b12
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235.
      
      This is a resubmission of #15126, which was based on work by frreiss in #15067, but fixed the test case along with some typos.
      
      ## How was this patch tested?
      A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #15166 from petermaxlee/SPARK-17513-2.
      976f3b12
    • Marcelo Vanzin's avatar
      [SPARK-17611][YARN][TEST] Make shuffle service test really test auth. · 7e418e99
      Marcelo Vanzin authored
      Currently, the code is just swallowing exceptions, and not really checking
      whether the auth information was being recorded properly. Fix both problems,
      and also avoid tests inadvertently affecting other tests by modifying the
      shared config variable (by making it not shared).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15161 from vanzin/SPARK-17611.
      7e418e99
    • Yin Huai's avatar
      [SPARK-17549][SQL] Revert "[] Only collect table size stat in driver for cached relation." · 9ac68dbc
      Yin Huai authored
      This reverts commit 39e2bad6 because of the problem mentioned at https://issues.apache.org/jira/browse/SPARK-17549?focusedCommentId=15505060&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15505060
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #15157 from yhuai/revert-SPARK-17549.
      9ac68dbc
    • jerryshao's avatar
      [SPARK-15698][SQL][STREAMING] Add the ability to remove the old MetadataLog in FileStreamSource · a6aade00
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Currently, the `metadataLog` in `FileStreamSource` adds a checkpoint file for each batch but has no ability to remove or compact them, which leads to a large number of small files over long runs. This PR proposes to compact the old logs into one file. The method is quite similar to `FileStreamSinkLog` but simpler.
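
      The compaction idea can be sketched with an in-memory stand-in (hypothetical shape; the real `FileStreamSource` log works with serialized entries in checkpoint files):

      ```scala
      // Merge all per-batch entries at or below `compactBatchId` into a single
      // compacted record keyed by that batch id, keeping newer batches as-is.
      def compact(entries: Map[Long, Seq[String]],
                  compactBatchId: Long): Map[Long, Seq[String]] = {
        val (old, recent) = entries.partition { case (id, _) => id <= compactBatchId }
        val merged = old.toSeq.sortBy(_._1).flatMap(_._2) // preserve batch order
        recent + (compactBatchId -> merged)
      }
      ```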
      
      ## How was this patch tested?
      
      Unit test added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #13513 from jerryshao/SPARK-15698.
      a6aade00
    • Wenchen Fan's avatar
      [SPARK-17051][SQL] we should use hadoopConf in InsertIntoHiveTable · eb004c66
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Hive confs in hive-site.xml will be loaded in `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14634 from cloud-fan/bug.
      eb004c66
    • gatorsmile's avatar
      [SPARK-17502][SQL] Fix Multiple Bugs in DDL Statements on Temporary Views · d5ec5dbb
      gatorsmile authored
      ### What changes were proposed in this pull request?
      - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example,
      ```
      Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`';
      ```
      - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property. For example,
      ```
      Attempted to unset non-existent property 'p' in table '`testView`';
      ```
      - When `ANALYZE TABLE` is called on a view or a temporary view, we should issue an error message. However, it reports a strange error:
      ```
      ANALYZE TABLE is not supported for Project
      ```
      
      - When inserting into a temporary view that is generated from `Range`, we will get the following error message:
      ```
      assertion failed: No plan for 'InsertIntoTable Range (0, 10, step=1, splits=Some(1)), false, false
      +- Project [1 AS 1#20]
         +- OneRowRelation$
      ```
      
      This PR is to fix the above four issues.
      
      ### How was this patch tested?
      Added multiple test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15054 from gatorsmile/tempViewDDL.
      d5ec5dbb
    • Adrian Petrescu's avatar
      [SPARK-17437] Add uiWebUrl to JavaSparkContext and pyspark.SparkContext · 4a426ff8
      Adrian Petrescu authored
      ## What changes were proposed in this pull request?
      
      The Scala version of `SparkContext` has a handy field called `uiWebUrl` that tells you which URL the SparkUI spawned by that instance lives at. This is often very useful because the value for `spark.ui.port` in the config is only a suggestion; if that port number is taken by another Spark instance on the same machine, Spark will just keep incrementing the port until it finds a free one. So, on a machine with a lot of running PySpark instances, you often have to start trying all of them one-by-one until you find your application name.
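
      The port-increment behavior described above can be sketched in plain Scala (a hypothetical helper, not Spark's actual retry code):

      ```scala
      import java.net.ServerSocket

      // Try successive ports until a bind succeeds, mimicking how a service
      // walks past ports already taken by other instances on the same machine.
      // (Port 0 asks the OS for any free ephemeral port.)
      def bindWithRetry(startPort: Int, maxRetries: Int = 16): ServerSocket = {
        var attempt = 0
        while (attempt <= maxRetries) {
          try return new ServerSocket(startPort + attempt)
          catch { case _: java.io.IOException => attempt += 1 } // port taken: try next
        }
        throw new java.io.IOException(s"could not bind after $maxRetries retries")
      }
      ```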
      
      Scala users have a way around this with `uiWebUrl` but Java and Python users do not. This pull request fixes this in the most straightforward way possible, simply propagating this field through the `JavaSparkContext` and into pyspark through the Java gateway.
      
      Please let me know if any additional documentation/testing is needed.
      
      ## How was this patch tested?
      
      Existing tests were run to make sure there were no regressions, and a binary distribution was created and tested manually for the correct value of `sc.uiWebUrl` in a variety of circumstances.
      
      Author: Adrian Petrescu <apetresc@gmail.com>
      
      Closes #15000 from apetresc/pyspark-uiweburl.
      4a426ff8
    • Wenchen Fan's avatar
      f039d964
    • petermaxlee's avatar
      [SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata · be9d57fc
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235.
      
      This is based on work by frreiss in #15067, but fixed the test case along with some typos.
      
      ## How was this patch tested?
      A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #15126 from petermaxlee/SPARK-17513.
      be9d57fc
  3. Sep 19, 2016
    • sethah's avatar
      [SPARK-17163][ML] Unified LogisticRegression interface · 26145a5a
      sethah authored
      ## What changes were proposed in this pull request?
      
      Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`.
      
      Marked as WIP because we should discuss the coefficients API in the model. See discussion below.
      
      JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)
      
      ## How was this patch tested?
      
      Merged test suites and added some new unit tests.
      
      ## Design
      
      ### Switching between binomial and multinomial
      
      We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.
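
      The dispatch rule can be sketched as follows (a hypothetical helper illustrating the description above, not the patch's actual code):

      ```scala
      // Resolve the `family` parameter: "auto" picks binomial for 1-2 label
      // classes and multinomial otherwise; an explicit "binomial" with more
      // than 2 classes is an error.
      def resolveFamily(family: String, numClasses: Int): String = family match {
        case "auto" => if (numClasses <= 2) "binomial" else "multinomial"
        case "binomial" if numClasses > 2 =>
          throw new IllegalArgumentException(
            s"binomial family supports at most 2 classes, found $numClasses")
        case other => other
      }
      ```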
      
      ### coefficients/intercept model API (TODO)
      
      This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:
      
      * We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double`
      * There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts.
      
      Some options:
      
      1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.
      
      2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.
      
      If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.
      
      **Update:** We have chosen option 2
      
      ### Threshold/thresholds (TODO)
      
      Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.
      
      **Update:** Let's leave it for a follow up PR
      
      ## Follow up
      
      * Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
      * Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #14834 from sethah/SPARK-17163.
      26145a5a
    • Josh Rosen's avatar
      [SPARK-17160] Properly escape field names in code-generated error messages · e719b1c0
      Josh Rosen authored
      This patch addresses a corner-case escaping bug where field names which contain special characters were unsafely interpolated into error message string literals in generated Java code, leading to compilation errors.
      
      This patch addresses these issues by using `addReferenceObj` to store the error messages as string fields rather than inline string constants.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15156 from JoshRosen/SPARK-17160.
      e719b1c0
    • Davies Liu's avatar
      [SPARK-17100] [SQL] fix Python udf in filter on top of outer join · d8104158
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      In the optimizer, we try to evaluate the condition to see whether it is nullable or not, but some expressions are not evaluable, so we should check for that before evaluating them.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15103 from davies/udf_join.
      d8104158
    • Davies Liu's avatar
      [SPARK-16439] [SQL] bring back the separator in SQL UI · e0632062
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the SQL metrics look like `number of rows: 111111111111`, which makes it very hard to read how large the number is. A separator was added by #12425 but removed by #14142, because the separator looked odd in some locales (for example, pl_PL). This PR adds it back, always using "," as the separator, since the SQL UI is entirely in English.
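
      A locale-fixed formatter along these lines would implement the rule (a hypothetical sketch, not necessarily the patch's exact code):

      ```scala
      import java.text.NumberFormat
      import java.util.Locale

      // Format metric values with Locale.US so the grouping separator is
      // always "," regardless of the JVM's default locale.
      def formatMetric(value: Long): String =
        NumberFormat.getIntegerInstance(Locale.US).format(value)
      ```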
      
      ## How was this patch tested?
      
      Existing tests.
      ![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15106 from davies/metric_sep.
      e0632062
    • Shixiong Zhu's avatar
      [SPARK-17438][WEBUI] Show Application.executorLimit in the application page · 80d66559
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds `Application.executorLimit` to the application page
      
      ## How was this patch tested?
      
      Checked the UI manually.
      
      Screenshots:
      
      1. Dynamic allocation is disabled
      
      <img width="484" alt="screen shot 2016-09-07 at 4 21 49 pm" src="https://cloud.githubusercontent.com/assets/1000778/18332029/210056ea-7518-11e6-9f52-76d96046c1c0.png">
      
      2. Dynamic allocation is enabled.
      
      <img width="466" alt="screen shot 2016-09-07 at 4 25 30 pm" src="https://cloud.githubusercontent.com/assets/1000778/18332034/2c07700a-7518-11e6-8fce-aebe25014902.png">
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15001 from zsxwing/fix-core-info.
      80d66559
    • sureshthalamati's avatar
      [SPARK-17473][SQL] fixing docker integration tests error due to different versions of jars. · cdea1d13
      sureshthalamati authored
      ## What changes were proposed in this pull request?
      Docker tests use an older version of the Jersey jars (1.19), which was used in older releases of Spark. In the 2.0 releases, Spark was upgraded to the 2.x version of Jersey, and after that upgrade the Docker tests started failing with AbstractMethodError. Now that Spark uses the 2.x Jersey version, the shaded Docker jars may no longer be required. This PR removes the exclusions/overrides of Jersey-related classes from the pom file and changes docker-client to use the regular jar instead of the shaded one.
      
      ## How was this patch tested?
      
      Tested  using existing  docker-integration-tests
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #15114 from sureshthalamati/docker_testfix-spark-17473.
      cdea1d13
    • Sean Owen's avatar
      [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not relative to a calendar · d720a401
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Clarify that slide and window duration are absolute, and not relative to a calendar.
      
      ## How was this patch tested?
      
      Doc build (no functional change)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15142 from srowen/SPARK-17297.
      d720a401
  4. Sep 18, 2016
    • petermaxlee's avatar
      [SPARK-17571][SQL] AssertOnQuery.condition should always return Boolean value · 8f0c35a4
      petermaxlee authored
      ## What changes were proposed in this pull request?
      AssertOnQuery has two apply constructors: one that accepts a closure returning Boolean, and another that accepts a closure returning Unit. This is very confusing because developers could mistakenly think that AssertOnQuery always requires a boolean return type and verifies the result, when in fact the value of the last statement is ignored in one of the constructors.
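
      The hazard can be sketched with hypothetical names (not the real AssertOnQuery code): the Unit-accepting constructor silently discards the closure's last value, so a failing `false` is never checked.

      ```scala
      // `Check(...)` verifies the closure's result; `Check.fromSideEffect(...)`
      // runs the closure for its side effects and always reports success.
      class Check(val condition: () => Boolean)
      object Check {
        def apply(cond: => Boolean): Check = new Check(() => cond)
        def fromSideEffect(body: => Unit): Check = new Check(() => { body; true })
      }
      ```

      With these names, `Check(1 == 2).condition()` is `false` and would fail a test, while `Check.fromSideEffect { 1 == 2 }.condition()` is `true` because the comparison result is thrown away.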
      
      This pull request makes the two constructors consistent and always requires a boolean value, making the test suites more robust against developer errors overall.
      
      As an evidence for the confusing behavior, this change also identified a bug with an existing test case due to file system time granularity. This pull request fixes that test case as well.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #15127 from petermaxlee/SPARK-17571.
      8f0c35a4
    • Liwei Lin's avatar
      [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly · 1dbb725d
      Liwei Lin authored
      ## Problem
      
      CSV in Spark 2.0.0:
      -  does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
      - does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.
      
      ## What changes were proposed in this pull request?
      
      This patch makes changes to read all empty values back as `null`s.
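
      The rule can be sketched as a simplified cast step (hypothetical helper; the real code handles every CSV data type):

      ```scala
      // A field that equals the configured nullValue (or is empty) is read
      // back as null (None) for every type, including StringType.
      def castField(datum: String, nullValue: String): Option[String] =
        if (datum == nullValue || datum.isEmpty) None else Some(datum)
      ```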
      
      ## How was this patch tested?
      
      New test cases.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14118 from lw-lin/csv-cast-null.
      1dbb725d
    • hyukjinkwon's avatar
      [SPARK-17586][BUILD] Do not call static member via instance reference · 7151011b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes a warning message as below:
      
      ```
      [WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method should be qualified by type name, TaskMemoryManager, instead of by an expression
      [WARNING]       currentPageNumber = memoryManager.decodePageNumber(recordPointer)
      ```
      
      by referencing the static member via class not instance reference.
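A small self-contained Java example of the warning and its fix (the class here is a simplified stand-in, not Spark's actual `TaskMemoryManager`):

```java
class TaskMemoryManager {
    // Hypothetical stand-in: extract a page number from the upper bits
    // of a record pointer.
    static long decodePageNumber(long recordPointer) {
        return recordPointer >>> 51;
    }
}

public class StaticQualify {
    public static void main(String[] args) {
        TaskMemoryManager memoryManager = new TaskMemoryManager();
        // Before: qualifying a static method by an expression triggers the
        // [static] javac warning shown above.
        long before = memoryManager.decodePageNumber(1L << 51);
        // After: qualify by the type name instead.
        long after = TaskMemoryManager.decodePageNumber(1L << 51);
        System.out.println(before == after); // prints true
    }
}
```

The behavior is identical either way; qualifying by type name simply makes it clear at the call site that no instance state is involved.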
      
      ## How was this patch tested?
      
      Existing tests should cover this - Jenkins tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15141 from HyukjinKwon/SPARK-17586.
      Unverified
      7151011b
    • Sean Owen's avatar
      [SPARK-17546][DEPLOY] start-* scripts should use hostname -f · 342c0e65
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
Call `hostname -f` to get the fully qualified host name.
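The distinction can also be seen from the JVM side; `getHostName()` may return the short name, while `getCanonicalHostName()` attempts the fully qualified one, roughly mirroring `hostname` vs `hostname -f` (a hedged analogy, not what the start-* scripts do):

```java
import java.net.InetAddress;

public class Fqdn {
    // Analog of `hostname -f`: resolve the fully qualified local host name.
    static String fqdn() throws Exception {
        return InetAddress.getLocalHost().getCanonicalHostName();
    }

    public static void main(String[] args) throws Exception {
        // e.g. "myhost -> myhost.example.com" (output depends on DNS setup)
        System.out.println(InetAddress.getLocalHost().getHostName() + " -> " + fqdn());
    }
}
```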
      
      ## How was this patch tested?
      
      Jenkins tests of course, but also verified output of command on OS X and Linux
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15129 from srowen/SPARK-17546.
      Unverified
      342c0e65
    • jiangxingbo's avatar
      [SPARK-17506][SQL] Improve the check double values equality rule. · 5d3f4615
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
In `ExpressionEvalHelper`, we check the equality between two double values by testing whether the expected value is within the range [target - tolerance, target + tolerance], but this can cause a false negative when the compared numbers are very large.
      Before:
      ```
      val1 = 1.6358558070241E306
      val2 = 1.6358558070240974E306
      ExpressionEvalHelper.compareResults(val1, val2)
      false
      ```
In fact, `val1` and `val2` represent the same value at different precisions; we should tolerate this case by comparing with a percentage-based range, e.g., the expected value is within the range [target - target * tolerance_percentage, target + target * tolerance_percentage].
      After:
      ```
      val1 = 1.6358558070241E306
      val2 = 1.6358558070240974E306
      ExpressionEvalHelper.compareResults(val1, val2)
      true
      ```
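The relative-tolerance idea can be sketched in a few lines of Java (hypothetical helper, not the actual `ExpressionEvalHelper` code, which is Scala):

```java
public class RelativeCompare {
    // Relative (percentage-based) tolerance: the allowed gap scales with the
    // magnitude of the expected value, so huge doubles that differ only in
    // their last few bits still compare equal.
    static boolean relativelyEqual(double expected, double actual, double tolerance) {
        if (expected == actual) return true;
        return Math.abs(expected - actual) <= Math.abs(expected) * tolerance;
    }

    public static void main(String[] args) {
        double val1 = 1.6358558070241E306;
        double val2 = 1.6358558070240974E306;
        System.out.println(relativelyEqual(val1, val2, 1e-8)); // prints true
        // An absolute-range check fails here: the raw difference is itself
        // an astronomically large number.
        System.out.println(Math.abs(val1 - val2) <= 1e-8);     // prints false
    }
}
```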
      
      ## How was this patch tested?
      
Existing test cases.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15059 from jiangxb1987/deq.
      Unverified
      5d3f4615
    • Wenchen Fan's avatar
      [SPARK-17541][SQL] fix some DDL bugs about table management when same-name temp view exists · 3fe630d3
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
In `SessionCatalog`, we have several operations (`tableExists`, `dropTable`, `lookupRelation`, etc.) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp views only or metastore tables/views only. These bugs are:
      
      1. `CREATE TABLE USING` will fail if a same-name temp view exists
2. `Catalog.dropTempView` will un-cache and drop a metastore table if a same-name table exists
      3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists.
      
These bug fixes are pulled out from https://github.com/apache/spark/pull/14962 and target both master and the 2.0 branch.
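The root cause of all three bugs can be sketched in Java (hypothetical simplification; the real `SessionCatalog` is Scala with far more state): a single lookup path that mixes temp views and metastore tables lets a same-name temp view shadow, or be shadowed by, a table.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionCatalogSketch {
    final Map<String, String> tempViews = new HashMap<>();
    final Map<String, String> metastoreTables = new HashMap<>();

    // Pre-fix behavior: true when EITHER a temp view or a table matches,
    // which makes a metastore-only command like CREATE TABLE USING fail
    // spuriously when only a same-name temp view exists.
    boolean tableExists(String name) {
        return tempViews.containsKey(name) || metastoreTables.containsKey(name);
    }

    // What a DDL command that targets metastore tables only should check.
    boolean metastoreTableExists(String name) {
        return metastoreTables.containsKey(name);
    }

    public static void main(String[] args) {
        SessionCatalogSketch c = new SessionCatalogSketch();
        c.tempViews.put("t", "SELECT 1");
        System.out.println(c.tableExists("t"));          // prints true
        System.out.println(c.metastoreTableExists("t")); // prints false
    }
}
```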
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15099 from cloud-fan/fix-view.
      3fe630d3
    • gatorsmile's avatar
      [SPARK-17518][SQL] Block Users to Specify the Internal Data Source Provider Hive · 3a3c9ffb
      gatorsmile authored
      ### What changes were proposed in this pull request?
In Spark 2.1, we introduced a new internal provider `hive` for distinguishing Hive serde tables from data source tables. This PR blocks users from specifying it in `DataFrameWriter` and SQL APIs.
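The guard amounts to a simple provider-name check; a hedged Java sketch (hypothetical names and message; the real check lives in Spark's Scala analysis code):

```java
public class ProviderCheck {
    // Reject the internal "hive" provider when it arrives through a
    // user-facing API such as DataFrameWriter.format(...) or USING.
    static void validateProvider(String provider) {
        if ("hive".equalsIgnoreCase(provider)) {
            throw new IllegalArgumentException(
                "Cannot create Hive serde table through this API; "
                + "use CREATE TABLE syntax instead.");
        }
    }

    public static void main(String[] args) {
        validateProvider("parquet"); // ordinary data sources pass
        try {
            validateProvider("hive");
        } catch (IllegalArgumentException e) {
            System.out.println("blocked: " + e.getMessage());
        }
    }
}
```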
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15073 from gatorsmile/formatHive.
      3a3c9ffb
  5. Sep 17, 2016