  1. Nov 16, 2016
• [SPARK-18420][BUILD] Fix the errors caused by lint check in Java · b0ae8712
      Xianyang Liu authored
      
A small fix for the errors reported by the Java lint check:
      
- Remove unused objects and unused imports (`UnusedImports`).
- Add comments around the `finalize` method of `NioBufferedFileInputStream` to turn off checkstyle.
- Split lines longer than 100 characters into two lines.
      
Verified with Travis CI:
      ```
      $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
      $ dev/lint-java
      ```
      Before:
      ```
      Checkstyle checks failed at following occurrences:
      [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
      [ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
      [ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
      [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
      ```
      
      After:
      ```
      $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
      $ dev/lint-java
      Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
      Checkstyle checks passed.
      ```
      
      Author: Xianyang Liu <xyliu0530@icloud.com>
      
      Closes #15865 from ConeyLiu/master.
      
      (cherry picked from commit 7569cf6c)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18446][ML][DOCS] Add links to API docs for ML algos · 416bc3dd
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
      Add links to API docs for ML algos
      ## How was this patch tested?
      Manual checking for the API links
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15890 from zhengruifeng/algo_link.
      
      (cherry picked from commit a75e3fe9)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18434][ML] Add missing ParamValidations for ML algos · 6b6eb4e5
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
      Add missing ParamValidations for ML algos
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15881 from zhengruifeng/arg_checking.
      
      (cherry picked from commit c68f1a38)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
• [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation · 82084700
  Weiqing Yang authored
      
      ## What changes were proposed in this pull request?
      
      Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation.
      
      ## How was this patch tested?
      Manually.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15886 from weiqingy/fixTypo.
      
      (cherry picked from commit 241e04bc)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18410][STREAMING] Add structured kafka example · 6b2301b8
      uncleGen authored
      
      ## What changes were proposed in this pull request?
      
This PR provides structured Kafka word count examples.
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #15849 from uncleGen/SPARK-18410.
      
      (cherry picked from commit e6145772)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18400][STREAMING] NPE when resharding Kinesis Stream · a94659ce
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Avoid NPE in KinesisRecordProcessor when shutdown happens without successful init
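A minimal sketch of the guard this implies, with invented names (a `shardId` field set only in `initialize()`); this is an illustration, not the actual `KinesisRecordProcessor` code:

```scala
// Hypothetical simplification: shardId is assigned only in initialize(),
// so shutdown must tolerate it never having been set.
class RecordProcessorSketch {
  @volatile private var shardId: String = _

  def initialize(id: String): Unit = { shardId = id }

  def shutdown(checkpoint: String => Unit): Unit = {
    // Guard against shutdown-before-init instead of dereferencing null.
    if (shardId != null) {
      checkpoint(shardId)
    }
  }
}
```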
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15882 from srowen/SPARK-18400.
      
      (cherry picked from commit 43a26899)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [DOC][MINOR] Kafka doc: breakup into lines · 4567db9d
      Liwei Lin authored
      ## Before
      
      ![before](https://cloud.githubusercontent.com/assets/15843379/20340231/99b039fe-ac1b-11e6-9ba9-b44582427459.png)
      
      ## After
      
![after](https://cloud.githubusercontent.com/assets/15843379/20340236/9d5796e2-ac1b-11e6-92bb-6da40ba1a383.png)
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15903 from lw-lin/kafka-doc-lines.
      
      (cherry picked from commit 3e01f128)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive · b18c5a9b
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      This PR aims to improve DataSource option keys to be more case-insensitive
      
DataSource partially uses CaseInsensitiveMap in its code path. For example, the following fails to find the `url` option:
      
      ```scala
      val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
      df.write.format("jdbc")
          .option("UrL", url1)
          .option("dbtable", "TEST.SAVETEST")
          .options(properties.asScala)
          .save()
      ```
      
This PR makes DataSource options use CaseInsensitiveMap internally, and also makes DataSource use CaseInsensitiveMap generally, except for `InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`. We cannot pass them a CaseInsensitiveMap because they create new case-sensitive HadoopConfs by calling newHadoopConfWithOptions(options) inside.
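The core idea is a map that normalizes key case on both construction and lookup. A minimal sketch (illustrative, not Spark's actual class):

```scala
// Sketch: lower-case keys at construction and at lookup, so option("UrL", ...)
// and a later lookup of "url" hit the same entry.
class CaseInsensitiveMap(original: Map[String, String]) extends Map[String, String] {
  private val baseMap = original.map { case (k, v) => k.toLowerCase -> v }

  override def get(key: String): Option[String] = baseMap.get(key.toLowerCase)
  override def iterator: Iterator[(String, String)] = baseMap.iterator
  override def +[B1 >: String](kv: (String, B1)): Map[String, B1] = baseMap + kv
  override def -(key: String): Map[String, String] =
    new CaseInsensitiveMap(baseMap - key.toLowerCase)
}

// new CaseInsensitiveMap(Map("UrL" -> "jdbc:h2:mem:test")).get("url")
// => Some("jdbc:h2:mem:test")
```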
      
      ## How was this patch tested?
      
      Pass the Jenkins test with newly added test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15884 from dongjoon-hyun/SPARK-18433.
      
      (cherry picked from commit 74f5c217)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
• [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula. · 7b57e480
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
`spark.mlp` should support `RFormula` like other ML algorithm wrappers.
BTW, I did some cleanup and improvement for `spark.mlp`.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15883 from yanboliang/spark-18438.
      
      (cherry picked from commit 95eb06bd)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
2. Nov 14, 2016
• [SPARK-18430][SQL] Fixed Exception Messages when Hitting an Invocation Exception of Function Lookup · a0125fd6
      gatorsmile authored
      ### What changes were proposed in this pull request?
      When the exception is an invocation exception during function lookup, we return a useless/confusing error message:
      
      For example,
      ```Scala
      df.selectExpr("concat_ws()")
      ```
      Below is the error message we got:
      ```
      null; line 1 pos 0
      org.apache.spark.sql.AnalysisException: null; line 1 pos 0
      ```
      
To get the meaningful error message, we need to get the cause. The fix is exactly the same as what we did in https://github.com/apache/spark/pull/12136. After the fix, the message we get is the exception issued in the constructor of the function implementation:
      ```
      requirement failed: concat_ws requires at least one argument.; line 1 pos 0
      org.apache.spark.sql.AnalysisException: requirement failed: concat_ws requires at least one argument.; line 1 pos 0
      ```
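A sketch of the unwrapping idea (simplified; not the exact Spark code path): when function lookup fails with a reflective invocation exception, report the cause's message rather than the outer exception's message, which is `null` here.

```scala
import java.lang.reflect.InvocationTargetException

// Simplified: surface the underlying cause of an InvocationTargetException,
// since its own getMessage is null in this scenario.
def analysisErrorMessage(e: Throwable): String = e match {
  case ite: InvocationTargetException if ite.getCause != null =>
    ite.getCause.getMessage
  case _ =>
    e.getMessage
}
```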
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15878 from gatorsmile/functionNotFound.
      
      (cherry picked from commit 86430cc4)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18428][DOC] Update docs for GraphX · 649c15fa
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
1. Add links for `VertexRDD` and `EdgeRDD`.
2. Note in `Vertex and Edge RDDs` that not all methods are listed.
3. `VertexID` -> `VertexId`.
      
      ## How was this patch tested?
No tests; only docs are modified.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15875 from zhengruifeng/update_graphop_doc.
      
      (cherry picked from commit c31def1d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18124] Observed delay based Event Time Watermarks · 27999b36
      Michael Armbrust authored
      
This PR adds a new method `withWatermark` to the `Dataset` API, which can be used to specify an _event time watermark_.  An event time watermark allows the streaming engine to reason about the point in time after which we no longer expect to see late data.  This PR also augments `StreamExecution` to use this watermark for several purposes:
        - To know when a given time window aggregation is finalized and thus results can be emitted when using output modes that do not allow updates (e.g. `Append` mode).
- To minimize the amount of state that we need to keep for on-going aggregations, by evicting state for groups that are no longer expected to change.  We do still maintain all state if the query requires it (i.e. if the event time is not present in the `groupBy` or when running in `Complete` mode).
      
      An example that emits windowed counts of records, waiting up to 5 minutes for late data to arrive.
      ```scala
      df.withWatermark("eventTime", "5 minutes")
        .groupBy(window($"eventTime", "1 minute") as 'window)
        .count()
        .writeStream
        .format("console")
        .mode("append") // In append mode, we only output finalized aggregations.
        .start()
      ```
      
### Calculating the watermark
      The current event time is computed by looking at the `MAX(eventTime)` seen this epoch across all of the partitions in the query minus some user defined _delayThreshold_.  An additional constraint is that the watermark must increase monotonically.
      
      Note that since we must coordinate this value across partitions occasionally, the actual watermark used is only guaranteed to be at least `delay` behind the actual event time.  In some cases we may still process records that arrive more than delay late.
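In code form, the per-trigger update described above reduces to a sketch like this (names invented; the real logic lives inside the streaming engine):

```scala
// Monotonic watermark update: the max event time seen so far across all
// partitions, minus the user-specified delay, never moving backwards.
def updateWatermark(currentWatermarkMs: Long,
    maxEventTimeMs: Long,
    delayThresholdMs: Long): Long = {
  math.max(currentWatermarkMs, maxEventTimeMs - delayThresholdMs)
}
```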
      
      This mechanism was chosen for the initial implementation over processing time for two reasons:
        - it is robust to downtime that could affect processing delay
        - it does not require syncing of time or timezones between the producer and the processing engine.
      
      ### Other notable implementation details
       - A new trigger metric `eventTimeWatermark` outputs the current value of the watermark.
       - We mark the event time column in the `Attribute` metadata using the key `spark.watermarkDelay`.  This allows downstream operations to know which column holds the event time.  Operations like `window` propagate this metadata.
       - `explain()` marks the watermark with a suffix of `-T${delayMs}` to ease debugging of how this information is propagated.
       - Currently, we don't filter out late records, but instead rely on the state store to avoid emitting records that are both added and filtered in the same epoch.
      
      ### Remaining in this PR
       - [ ] The test for recovery is currently failing as we don't record the watermark used in the offset log.  We will need to do so to ensure determinism, but this is deferred until #15626 is merged.
      
      ### Other follow-ups
      There are some natural additional features that we should consider for future work:
       - Ability to write records that arrive too late to some external store in case any out-of-band remediation is required.
       - `Update` mode so you can get partial results before a group is evicted.
       - Other mechanisms for calculating the watermark.  In particular a watermark based on quantiles would be more robust to outliers.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15702 from marmbrus/watermarks.
      
      (cherry picked from commit c0718782)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
• [SPARK-17348][SQL] Incorrect results from subquery transformation · ae66799f
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      
Return an AnalysisException when there is a correlated non-equality predicate in a subquery and the correlated column from the outer reference is not from the immediate parent operator of the subquery. This PR prevents incorrect results from the subquery transformation in such cases.
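A hypothetical query of the affected shape (table and column names invented): `t2.b < t1.b` is a correlated non-equality predicate, and the aggregate between it and the outer query means the outer reference is not consumed by the subquery's immediate parent operator.

```scala
// Before this PR, queries of this shape could be transformed incorrectly;
// with it, they raise an AnalysisException instead of returning wrong results.
spark.sql("""
  SELECT t1.a
  FROM t1
  WHERE t1.a IN (SELECT max(t2.a)
                 FROM t2
                 WHERE t2.b < t1.b)
""")
```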
      
      Test cases, both positive and negative tests, are added.
      
      ## How was this patch tested?
      
sql/test, catalyst/test, hive/test, and scenarios that would produce incorrect results without this PR and produce correct results when the subquery transformation does happen.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #15763 from nsyca/spark-17348.
      
      (cherry picked from commit bd85603b)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
• [SPARK-11496][GRAPHX][FOLLOWUP] Add param checking for runParallelPersonalizedPageRank · cff7a70b
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
Add the param checking to keep in line with other algos.
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15876 from zhengruifeng/param_check_runParallelPersonalizedPageRank.
      
      (cherry picked from commit 75934457)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
• [SPARK-17510][STREAMING][KAFKA] config max rate on a per-partition basis · db691f05
      cody koeninger authored
      
      ## What changes were proposed in this pull request?
      
Allow configuration of the max rate on a per-TopicPartition basis.
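A sketch of what such a per-partition hook can look like (class and method names assumed from the description, not verbatim from this PR):

```scala
import org.apache.kafka.common.TopicPartition

// A pluggable policy returning the max messages/second for each
// topic-partition, so a hot partition can be throttled differently.
abstract class PerPartitionRateConfig extends Serializable {
  def maxRatePerPartition(tp: TopicPartition): Long
}

class SkewAwareRateConfig extends PerPartitionRateConfig {
  override def maxRatePerPartition(tp: TopicPartition): Long =
    if (tp.topic == "events" && tp.partition == 0) 10000L else 1000L
}
```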
      ## How was this patch tested?
      
      Unit tests.
      
      The reporter (Jeff Nadler) said he could test on his workload, so let's wait on that report.
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15132 from koeninger/SPARK-17510.
      
      (cherry picked from commit 89d1fa58)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
• [SPARK-18416][STRUCTURED STREAMING] Fixed temp file leak in state store · 3c623d22
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
StateStore.get() causes temporary files to be created immediately, even if the store is not used to make updates for a new version. The temp file is not closed, as store.commit() is not called in those cases, thus keeping the output stream to the temp file open forever.
      
      This PR fixes it by opening the temp file only when there are updates being made.
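A simplified illustration of the fix (not the actual `StateStore` code): create the temp file's stream lazily on the first update, so a store that is read but never updated leaves nothing open.

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream, OutputStream}

// The output stream is opened on the first put(), not at construction time.
class DeltaFileWriterSketch(tempFile: File) {
  private var out: OutputStream = null

  private def stream: OutputStream = {
    if (out == null) {
      out = new BufferedOutputStream(new FileOutputStream(tempFile))
    }
    out
  }

  def put(bytes: Array[Byte]): Unit = stream.write(bytes) // file created here

  def commit(): Unit = if (out != null) { out.flush(); out.close() }
}
```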
      
      ## How was this patch tested?
      
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15859 from tdas/SPARK-18416.
      
      (cherry picked from commit bdfe60ac)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
• [SPARK-18432][DOC] Changed HDFS default block size from 64MB to 128MB · c07fe1c5
      Noritaka Sekiyama authored
      Changed HDFS default block size from 64MB to 128MB.
      https://issues.apache.org/jira/browse/SPARK-18432
      
      
      
      Author: Noritaka Sekiyama <moomindani@gmail.com>
      
      Closes #15879 from moomindani/SPARK-18432.
      
      (cherry picked from commit 9d07ceee)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
• [SPARK-18396][HISTORYSERVER] "Duration" column makes search result confused, maybe we should make it unsearchable · 518dc1e1
  WangTaoTheTonic authored
      
      ## What changes were proposed in this pull request?
      
When we search data in the History Server, it checks whether any column contains the search string. Duration is represented as a long value in the table, so if we search for a simple string like "003" or "111", any duration containing "003" or "111" will be shown, which makes little sense to users.
We cannot simply convert the long value to a meaningful format like "1 h" or "3.2 min" because the values are also used for sorting. A better way to handle this is to exclude the "Duration" column from searching.
      
## How was this patch tested?
      
Manual tests.
      
      Before("local-1478225166651" pass the filter because its duration in long value, which is "257244245" contains search string "244"):
      ![before](https://cloud.githubusercontent.com/assets/5276001/20203166/f851ffc6-a7ff-11e6-8fe6-91a90ca92b23.jpg)
      
      After:
![after](https://cloud.githubusercontent.com/assets/5276001/20178646/2129fbb0-a78d-11e6-9edb-39f885ce3ed0.jpg)
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #15838 from WangTaoTheTonic/duration.
      
      (cherry picked from commit 637a0bb8)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18166][MLLIB] Fix Poisson GLM bug due to wrong requirement of response values · d554c02f
      actuaryzhang authored
      
      ## What changes were proposed in this pull request?
      
The current implementation of Poisson GLM seems to allow only positive values. This is incorrect, since the support of the Poisson distribution includes zero. The bug is easily fixed by changing the check on the response variable from `require(y > 0.0)` to `require(y >= 0.0)`.
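A minimal sketch of the corrected check (the requirement message text is assumed, not quoted from the patch):

```scala
// Poisson responses are non-negative counts, so zero must be accepted.
def validatePoissonResponse(y: Double): Unit = {
  require(y >= 0.0, s"The response variable of Poisson family should be non-negative, but got $y")
}

validatePoissonResponse(0.0) // passes now; the old `y > 0.0` check rejected it
```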
      
      mengxr  srowen
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      Author: actuaryzhang <actuaryzhang@uber.com>
      
      Closes #15683 from actuaryzhang/master.
      
      (cherry picked from commit ae6cddb7)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18382][WEBUI] "run at null:-1" in UI when no file/line info in call site info · 12bde11c
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
Avoid reporting a null/-1 file/line number in call sites when encountering a StackTraceElement without this info.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15862 from srowen/SPARK-18382.
      
      (cherry picked from commit f95b124c)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
3. Nov 11, 2016
• [SPARK-18060][ML] Avoid unnecessary computation for MLOR · 56859c02
      sethah authored
      
      ## What changes were proposed in this pull request?
      
      Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed the computation `numFeatures * numClasses` times. We only need to perform that computation `numFeatures` times, however. If we re-order the inner and outer loop, we can avoid this, but then we lose sequential memory access. In this patch, we instead lay out the coefficients in column major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model.
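An illustrative sketch of the reordering (not Spark's actual gradient code): the division by the feature's standard deviation happens once per feature, and the column-major layout (`featureIndex * numClasses + classIndex`) keeps the inner loop's memory access sequential.

```scala
// values: dense feature values for one instance
// featuresStd: per-feature standard deviations
// multipliers: per-class multipliers for this instance
// gradient: column-major buffer of size numFeatures * numClasses
def addGradient(values: Array[Double], featuresStd: Array[Double],
    multipliers: Array[Double], gradient: Array[Double]): Unit = {
  val numClasses = multipliers.length
  var j = 0
  while (j < values.length) {
    val stdValue = values(j) / featuresStd(j) // numFeatures divisions in total
    var k = 0
    while (k < numClasses) {
      gradient(j * numClasses + k) += multipliers(k) * stdValue // sequential walk
      k += 1
    }
    j += 1
  }
}
```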
      
      ## How was this patch tested?
      
      This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below, and show significant speedups.
      ## Performance Tests
      
      **Setup**
      
      3 node bare-metal cluster
      120 cores total
384 GB RAM total
      
      **Results**
      
      NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for a single iteration of L-BFGS or OWL-QN.
      
      |    |   numPoints |   numFeatures |   numClasses |   regParam |   elasticNetParam |   currentMasterTime (sec) |   thisPatchTime (sec) |   pctSpeedup |
      |----|-------------|---------------|--------------|------------|-------------------|---------------------------|-----------------------|--------------|
      |  0 |       1e+07 |           100 |          500 |       0.5  |                 0 |                        90 |                    18 |           80 |
      |  1 |       1e+08 |           100 |           50 |       0.5  |                 0 |                        90 |                    19 |           78 |
      |  2 |       1e+08 |           100 |           50 |       0.05 |                 1 |                        72 |                    19 |           73 |
      |  3 |       1e+06 |           100 |         5000 |       0.5  |                 0 |                        93 |                    53 |           43 |
      |  4 |       1e+07 |           100 |         5000 |       0.5  |                 0 |                       900 |                   390 |           56 |
      |  5 |       1e+08 |           100 |          500 |       0.5  |                 0 |                       840 |                   174 |           79 |
      |  6 |       1e+08 |           100 |          200 |       0.5  |                 0 |                       360 |                    72 |           80 |
      |  7 |       1e+08 |          1000 |            5 |       0.5  |                 0 |                         9 |                     3 |           66 |
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.
      
      (cherry picked from commit 46b2550b)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
• [SPARK-18264][SPARKR] build vignettes with package, update vignettes for CRAN release build and add info on release · c2ebda44
  Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
Changes to DESCRIPTION to build vignettes.
Changes the metadata for vignettes to generate the recommended format (which is under 10% of the previous size). Unfortunately it does not look as nice
(before - left, after - right):
      
      ![image](https://cloud.githubusercontent.com/assets/8969467/20040492/b75883e6-a40d-11e6-9534-25cdd5d59a8b.png)
      
![image](https://cloud.githubusercontent.com/assets/8969467/20040490/a40f4d42-a40d-11e6-8c91-af00ddcbdad9.png)
      
      Also add information on how to run build/release to CRAN later.
      
      ## How was this patch tested?
      
      manually, unit tests
      
      shivaram
      
      We need this for branch-2.1
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15790 from felixcheung/rpkgvignettes.
      
      (cherry picked from commit ba23f768)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
• [SPARK-18387][SQL] Add serialization to checkEvaluation. · 87820da7
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      This removes the serialization test from RegexpExpressionsSuite and
      replaces it by serializing all expressions in checkEvaluation.
      
      This also fixes math constant expressions by making LeafMathExpression
      Serializable and fixes NumberFormat values that are null or invalid
      after serialization.
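The round-trip itself is plain Java serialization; a sketch of the helper idea (the name is invented, not the actual `checkEvaluation` internals):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Serialize and deserialize a value; running the evaluation check on the
// copy proves the expression survives serialization.
def roundTripSerialize[T](obj: T): T = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  in.readObject().asInstanceOf[T]
}
```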
      
      ## How was this patch tested?
      
This patch only modifies tests.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15847 from rdblue/SPARK-18387-fix-serializable-expressions.
      
      (cherry picked from commit 6e95325f)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with parenthesis for LIMIT · 465e4b40
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
Currently, `SQLBuilder` handles `LIMIT` by always adding `LIMIT` at the end of the generated subSQL. This causes `RuntimeException`s like the following. This PR always adds parentheses, except when `SubqueryAlias` is used together with `LIMIT`.
      
      **Before**
      
      ``` scala
      scala> sql("CREATE TABLE tbl(id INT)")
      scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
      java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
      ```
      
      **After**
      
      ``` scala
      scala> sql("CREATE TABLE tbl(id INT)")
      scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
      scala> sql("SELECT id2 FROM v1")
      res4: org.apache.spark.sql.DataFrame = [id2: int]
      ```
      
      **Fixed cases in this PR**
      
      The following two cases are the detail query plans having problematic SQL generations.
      
      1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`
      
    Please note the **FROM SELECT** part of the generated SQL below. When we don't use '()' for LIMIT, this fails.
      
      ```scala
      # Original logical plan:
      Project [id#1]
      +- GlobalLimit 2
         +- LocalLimit 2
            +- Project [id#1]
               +- MetastoreRelation default, tbl
      
      # Canonicalized logical plan:
      Project [gen_attr_0#1 AS id#4]
      +- SubqueryAlias tbl
         +- Project [gen_attr_0#1]
            +- GlobalLimit 2
               +- LocalLimit 2
                  +- Project [gen_attr_0#1]
                     +- SubqueryAlias gen_subquery_0
                        +- Project [id#1 AS gen_attr_0#1]
                           +- SQLTable default, tbl, [id#1]
      
      # Generated SQL:
      SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT `gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2) AS tbl
      ```
      
      2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`
      
    Please note the **((~~~) AS gen_subquery_0 LIMIT 2)** part below. When we use '()' for LIMIT on a `SubqueryAlias`, this fails.
      
      ```scala
      # Original logical plan:
      Project [id#1]
      +- Project [id#1]
         +- GlobalLimit 2
            +- LocalLimit 2
               +- MetastoreRelation default, tbl
      
      # Canonicalized logical plan:
      Project [gen_attr_0#1 AS id#4]
      +- SubqueryAlias tbl
         +- Project [gen_attr_0#1]
            +- GlobalLimit 2
               +- LocalLimit 2
                  +- SubqueryAlias gen_subquery_0
                     +- Project [id#1 AS gen_attr_0#1]
                        +- SQLTable default, tbl, [id#1]
      
      # Generated SQL:
      SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15546 from dongjoon-hyun/SPARK-17982.
      
      (cherry picked from commit d42bb7cc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
• [SPARK-17843][WEB UI] Indicate event logs pending for processing on history server UI · 00c9c7d9
      Vinayak authored
      ## What changes were proposed in this pull request?
      
Updates the History Server UI's application listing to display information on event logs currently being processed, so that a user knows that, pending this processing, an application may not appear in the listing.
      
When there are no event logs under process, the application list page shows a "Last Updated" date-time at the top, indicating the date-time of the last _completed_ scan of the event logs. The value is displayed in the user's local time zone.
      ## How was this patch tested?
      
All unit tests pass. In particular, all the suites under org.apache.spark.deploy.history.* were run to test the changes.
      - Very first startup - Pending logs - no logs processed yet:
      
      <img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png">
      - Very first startup - Pending logs - some logs processed:
      
      <img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png">
      - Last updated - No currently pending logs:
      
      <img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm" src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png">
      - Last updated - With some currently pending logs:
      
      <img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm" src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png">
      - No applications found and No currently pending logs:
      
<img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm" src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png">
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      
      Closes #15410 from vijoshi/SAAS-608_master.
      
      (cherry picked from commit a531fe1a)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>