  1. Jul 19, 2015
    • Carl Anders Düvel's avatar
      [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 · 344d1567
      Carl Anders Düvel authored
      We are running Spark 1.4.0 in production and ran into problems: after a network hiccup (which happens often in our current environment), no more metrics were reported to Graphite, leaving us blind to the current state of our Spark applications. [This problem](https://github.com/dropwizard/metrics/commit/70559816f1fc3a0a0122b5263d5478ff07396991) was fixed in the current version of the metrics library. We now run Spark with this change in production and have seen no problems. We also reviewed the commit history since 3.1.0 and did not detect any potentially incompatible changes, but found many fixes which could help other users as well.
      
      Author: Carl Anders Düvel <c.a.duevel@gmail.com>
      
      Closes #7493 from hackbert/bump-metrics-lib-version and squashes the following commits:
      
      6677565 [Carl Anders Düvel] [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 in order to get this fix https://github.com/dropwizard/metrics/commit/70559816f1fc3a0a0122b5263d5478ff07396991
      344d1567
    • Liang-Chi Hsieh's avatar
      [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API · 9b644c41
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9166
      
      Simply capture and hide `IllegalArgumentException` in Python API.
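The pattern can be illustrated with a small pure-Python sketch (the names here are stand-ins, not PySpark's actual code; `Py4JJavaError` below substitutes for `py4j.protocol.Py4JJavaError`): a wrapper intercepts the gateway error and re-raises a clean Python exception without the Java stack trace.

```python
# Illustrative sketch only; Py4JJavaError is a stand-in class.
class Py4JJavaError(Exception):
    pass

class IllegalArgumentException(Exception):
    """Clean Python-side exception, raised without the Java stack trace."""

def capture_illegal_argument(f):
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Py4JJavaError as e:
            s = str(e)
            if "java.lang.IllegalArgumentException" in s:
                # Hide the Java-side details; keep only the message.
                raise IllegalArgumentException(s.split(": ", 1)[-1])
            raise
    return wrapper
```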
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7497 from viirya/hide_illegalargument and squashes the following commits:
      
      8324dce [Liang-Chi Hsieh] Fix python style.
      9ace67d [Liang-Chi Hsieh] Also check exception message.
      8b2ce5c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into hide_illegalargument
      7be016a [Liang-Chi Hsieh] Capture and hide IllegalArgumentException in Python.
      9b644c41
    • Reynold Xin's avatar
      89d13585
    • Herman van Hovell's avatar
      [SPARK-8638] [SQL] Window Function Performance Improvements · a9a0d0ce
      Herman van Hovell authored
      ## Description
      Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.
      
      ## Improvements
      * Much better performance (~10x) for running frames (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING frames. The current implementation in Spark uses a sliding-window approach in these cases: an aggregate is maintained for every row, so space usage is O(N) (N being the number of rows), and all of these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because data is only ever added to an aggregate (no reset is required), so we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update it for each row, and read its value after each update. This is what the new implementation does: it uses a single buffer and requires only N updates. I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case uses the same approach plus the fact that the aggregate operations are commutative, with one twist: it processes the input buffer in reverse.
      * Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
      * A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
      * A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.
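The running-frame idea from the first bullet can be sketched in pure Python (an illustration of the complexity argument, not the Spark code): the naive sliding approach rebuilds the frame's aggregate for every row, while the running approach keeps one accumulator and emits its value after each update.

```python
def running_sums_naive(values):
    # O(N^2): recompute the UNBOUNDED PRECEDING..CURRENT ROW frame per row.
    return [sum(values[: i + 1]) for i in range(len(values))]

def running_sums_fast(values):
    # O(N): one accumulator, updated once per row, emitted after each update.
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

def unbounded_following_sums(values):
    # CURRENT ROW..UNBOUNDED FOLLOWING: the same trick on the reversed
    # input, relying on the aggregate being commutative.
    return running_sums_fast(values[::-1])[::-1]
```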
      
      ## Benchmarking
      I have done a small benchmark using [on-time performance](http://www.transtats.bts.gov) data for the month of April. I used the origin as a partitioning key; as a result there is quite a bit of variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per frame type:
      
      Frame | Master | SPARK-8638
      ----- | ------ | ----------
      Entire Frame | 2 s | 1 s
      Sliding | 18 s | 1 s
      Growing | 14 s | 0.9 s
      Shrinking | 13 s | 1 s
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits:
      
      3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
      2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
      2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
      b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
      e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
      1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
      ac2f682 [Herman van Hovell] Added a few more comments.
      1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
      bb020e6 [Herman van Hovell] Major overhaul of Window operator.
      a9a0d0ce
    • Reynold Xin's avatar
      Fixed test cases. · 04c1b49f
      Reynold Xin authored
      04c1b49f
    • Tarek Auel's avatar
      [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK... · 83b682be
      Tarek Auel authored
      [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
      
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8199
      https://issues.apache.org/jira/browse/SPARK-8184
      https://issues.apache.org/jira/browse/SPARK-8183
      https://issues.apache.org/jira/browse/SPARK-8182
      https://issues.apache.org/jira/browse/SPARK-8181
      https://issues.apache.org/jira/browse/SPARK-8180
      https://issues.apache.org/jira/browse/SPARK-8179
      https://issues.apache.org/jira/browse/SPARK-8177
      https://issues.apache.org/jira/browse/SPARK-9115
      
      Regarding `day` and `dayofmonth`: are both necessary?
      
      ~~I am going to add `Quarter` to this PR as well.~~ Done.
      
      ~~As soon as the Scala coding is reviewed and discussed, I'll add the Python API.~~ Done.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Tarek Auel <tarek.auel@gmail.com>
      
      Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:
      
      f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
      bb567b6 [Tarek Auel] [SPARK-8199] fixed test
      3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
      256c357 [Tarek Auel] [SPARK-8199] code cleanup
      5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
      6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
      4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
      ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
      70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
      3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
      fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
      cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
      746b80a [Tarek Auel] [SPARK-8199] build fix
      0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
      523542d [Tarek Auel] [SPARK-8199] address comments
      2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
      d01b977 [Tarek Auel] [SPARK-8199] python underscore
      56c4a92 [Tarek Auel] [SPARK-8199] update python docu
      e223bc0 [Tarek Auel] [SPARK-8199] refactoring
      d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
      b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
      1b2e540 [Tarek Auel] [SPARK-8119] style fix
      0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
      ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
      1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
      740af0e [Tarek Auel] implement date function using a calculation based on days
      4fb66da [Tarek Auel] WIP: date functions on calculation only
      1a436c9 [Tarek Auel] wip
      f775f39 [Tarek Auel] fixed return type
      ad17e96 [Tarek Auel] improved implementation
      c42b444 [Tarek Auel] Removed merge conflict file
      ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
      10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
      7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
      f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
      6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
      d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
      7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
      5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
      eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
      f120415 [Tarek Auel] improved runtime
      a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
      5fe74e1 [Tarek Auel] fixed python style
      3bfac90 [Tarek Auel] fixed style
      356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
      02efc5d [Tarek Auel] removed doubled code
      a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
      b680db6 [Tarek Auel] added codegeneration to all functions
      c739788 [Tarek Auel] added support for quarter SPARK-8178
      849fb41 [Tarek Auel] fixed stupid test
      638596f [Tarek Auel] improved codegen
      4d8049b [Tarek Auel] fixed tests and added type check
      5ebb235 [Tarek Auel] resolved naming conflict
      d0e2f99 [Tarek Auel] date functions
      83b682be
  2. Jul 18, 2015
    • Forest Fang's avatar
      [SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits · 6cb6096c
      Forest Fang authored
      By grouping projection calls into multiple apply functions, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to check 5k, as 60k took 15 s for the unit test to complete.
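The splitting idea can be sketched in pure Python (chunk size and names are illustrative; the real change operates on generated Java source, which the JVM limits to 64 KB of bytecode per method): group the projection expressions into chunks, build one function per chunk, and have the top-level apply call each chunk in turn.

```python
def split_projections(exprs, max_per_method=16):
    # Hypothetical sketch: one callable per chunk instead of one giant apply().
    chunks = [exprs[i : i + max_per_method]
              for i in range(0, len(exprs), max_per_method)]

    def make_block(block):
        def apply_block(row, out, base):
            # Evaluate this chunk's expressions into its slice of the output.
            for j, f in enumerate(block):
                out[base + j] = f(row)
        return apply_block

    blocks = [make_block(c) for c in chunks]

    def apply(row):
        out = [None] * len(exprs)
        base = 0
        for block, chunk in zip(blocks, chunks):
            block(row, out, base)
            base += len(chunk)
        return out

    return apply
```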
      
      Author: Forest Fang <forest.fang@outlook.com>
      
      Closes #7076 from saurfang/codegen_size_limit and squashes the following commits:
      
      b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test
      adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code
      1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only
      9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit
      6cb6096c
    • Reynold Xin's avatar
      [SPARK-8278] Remove non-streaming JSON reader. · 45d798c3
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7501 from rxin/jsonrdd and squashes the following commits:
      
      767ec55 [Reynold Xin] More Mima
      51f456e [Reynold Xin] Mima exclude.
      789cb80 [Reynold Xin] Fixed compilation error.
      b4cf50d [Reynold Xin] [SPARK-8278] Remove non-streaming JSON reader.
      45d798c3
    • Reynold Xin's avatar
      [SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait · 9914b1b2
      Reynold Xin authored
      It is very hard to track which expressions have code generation implemented and which do not. This patch removes the default fallback gencode implementation from Expression and moves it into a new trait called CodegenFallback. Each concrete expression must either implement code generation or mix in CodegenFallback. This makes it very easy to track which expressions already have code generation implemented.
      
      Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).
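As an analogy, the two traits can be sketched with Python mixins (an illustration of the design, not the Catalyst code; only the trait names come from the patch, the rest is hypothetical):

```python
class Expression:
    def eval(self, row):
        raise NotImplementedError
    def gen_code(self):
        # No default implementation: each concrete expression must either
        # generate code itself or explicitly mix in CodegenFallback.
        raise NotImplementedError

class CodegenFallback(Expression):
    # Opt-in fallback: the "generated" code just calls interpreted eval().
    def gen_code(self):
        return lambda row: self.eval(row)

class Unevaluable(Expression):
    # Marker for expressions (e.g. Star) that must be resolved away
    # during analysis and can never be evaluated.
    def eval(self, row):
        raise TypeError(f"{type(self).__name__} cannot be evaluated")

class Add(Expression):
    def __init__(self, a, b):
        self.a, self.b = a, b
    def eval(self, row):
        return row[self.a] + row[self.b]
    def gen_code(self):
        a, b = self.a, self.b
        return lambda row: row[a] + row[b]  # dedicated fast path

class LegacyFn(CodegenFallback):
    def eval(self, row):
        return max(row)
```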
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7487 from rxin/codegenfallback and squashes the following commits:
      
      14ebf38 [Reynold Xin] Fixed Conv
      6c1c882 [Reynold Xin] Fixed Alias.
      b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
      cb5c066 [Reynold Xin] Removed extra import.
      39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
      9914b1b2
    • Reynold Xin's avatar
      [SPARK-9174][SQL] Add documentation for all public SQLConfs. · e16a19a3
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7500 from rxin/sqlconf and squashes the following commits:
      
      a5726c8 [Reynold Xin] [SPARK-9174][SQL] Add documentation for all public SQLConfs.
      e16a19a3
    • Reynold Xin's avatar
      [SPARK-8240][SQL] string function: concat · 6e1e2eba
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7486 from rxin/concat and squashes the following commits:
      
      5217d6e [Reynold Xin] Removed Hive's concat test.
      f5cb7a3 [Reynold Xin] Concat is never nullable.
      ae4e61f [Reynold Xin] Removed extra import.
      fddcbbd [Reynold Xin] Fixed NPE.
      22e831c [Reynold Xin] Added missing file.
      57a2352 [Reynold Xin] [SPARK-8240][SQL] string function: concat
      6e1e2eba
    • Yijie Shen's avatar
      [SPARK-9055][SQL] WidenTypes should also support Intersect and Except · 3d2134fc
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9055
      
      cc rxin
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7491 from yijieshen/widen and squashes the following commits:
      
      079fa52 [Yijie Shen] widenType support for intersect and expect
      3d2134fc
    • Reynold Xin's avatar
      Closes #6122 · cdc36eef
      Reynold Xin authored
      cdc36eef
    • Liang-Chi Hsieh's avatar
      [SPARK-9151][SQL] Implement code generation for Abs · 225de8da
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9151
      
      Add codegen support for `Abs`.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7498 from viirya/abs_codegen and squashes the following commits:
      
      0c8410f [Liang-Chi Hsieh] Implement code generation for Abs.
      225de8da
    • Wenchen Fan's avatar
      [SPARK-9171][SQL] add and improve tests for nondeterministic expressions · 86c50bf7
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7496 from cloud-fan/tests and squashes the following commits:
      
      0958f90 [Wenchen Fan] improve test for nondeterministic expressions
      86c50bf7
    • Wenchen Fan's avatar
      [SPARK-9167][SQL] use UTC Calendar in `stringToDate` · 692378c0
      Wenchen Fan authored
      fix 2 bugs introduced in https://github.com/apache/spark/pull/7353
      
      1. We should use a UTC Calendar when casting a string to a date. Before #7353, we used `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast a string to a date, and `fromJavaDate` calls `millisToDays` to avoid the time zone issue. Now that we use `DateTimeUtils.stringToDate(s)`, we should create a Calendar with UTC at the beginning.
      2. We should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` are only evaluated once per thread, so we can't set the default time zone back anymore.
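The first bug can be illustrated in plain Python (a sketch of the pitfall, not the Spark code): counting days since the epoch must be done against UTC, otherwise the result shifts by one depending on the machine's time zone.

```python
from datetime import datetime, timezone

def string_to_epoch_days_utc(s):
    # Correct: interpret 'yyyy-MM-dd' as a UTC calendar date, so the
    # days-since-epoch count is independent of the local time zone.
    dt = datetime.strptime(s, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) // 86400

def string_to_epoch_days_local(s):
    # Buggy variant: a naive datetime's timestamp() uses the process's
    # local zone, so in a zone east of UTC this is off by one day.
    return int(datetime.strptime(s, "%Y-%m-%d").timestamp()) // 86400
```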
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7488 from cloud-fan/datetime and squashes the following commits:
      
      9cd6005 [Wenchen Fan] address comments
      21ef293 [Wenchen Fan] fix 2 bugs in datetime
      692378c0
    • Wenchen Fan's avatar
      [SPARK-9142][SQL] remove more self type in catalyst · 1b4ff055
      Wenchen Fan authored
      a follow up of https://github.com/apache/spark/pull/7479.
      The `TreeNode` is the root cause of the requirement for the `self: Product =>` stuff, so why not make `TreeNode` extend `Product`?
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7495 from cloud-fan/self-type and squashes the following commits:
      
      8676af7 [Wenchen Fan] remove more self type
      1b4ff055
    • Josh Rosen's avatar
      [SPARK-9143] [SQL] Add planner rule for automatically inserting Unsafe <->... · b8aec6cd
      Josh Rosen authored
      [SPARK-9143] [SQL] Add planner rule for automatically inserting Unsafe <-> Safe row format converters
      
      Now that we have two different internal row formats, UnsafeRow and the old Java-object-based row format, we end up having to perform conversions between these two formats. These conversions should not be performed by the operators themselves; instead, the planner should be responsible for inserting appropriate format conversions when they are needed.
      
      This patch makes the following changes:
      
      - Add two new physical operators for performing row format conversions, `ConvertToUnsafe` and `ConvertFromUnsafe`.
      - Add new methods to `SparkPlan` to allow operators to express whether they output UnsafeRows and whether they can handle safe or unsafe rows as inputs.
      - Implement an `EnsureRowFormats` rule to automatically insert converter operators where necessary.
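A toy version of such a rule might look like the following (pure-Python sketch with hypothetical names; the real rule is `EnsureRowFormats` operating on `SparkPlan`): walk the plan bottom-up and wrap any child whose output format does not match what its parent consumes.

```python
class Op:
    """Toy physical operator: tracks which row format it produces/consumes."""
    def __init__(self, name, children=(), outputs_unsafe=False, wants_unsafe=False):
        self.name = name
        self.children = list(children)
        self.outputs_unsafe = outputs_unsafe
        self.wants_unsafe = wants_unsafe

def ensure_row_formats(plan):
    # Bottom-up: fix the children first, then insert a converter in front
    # of any child whose output format mismatches this operator's input.
    fixed = [ensure_row_formats(c) for c in plan.children]
    plan.children = [
        c if c.outputs_unsafe == plan.wants_unsafe
        else Op("ConvertToUnsafe" if plan.wants_unsafe else "ConvertToSafe",
                [c],
                outputs_unsafe=plan.wants_unsafe,
                wants_unsafe=c.outputs_unsafe)
        for c in fixed
    ]
    return plan
```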
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7482 from JoshRosen/unsafe-converter-planning and squashes the following commits:
      
      7450fa5 [Josh Rosen] Resolve conflicts in favor of choosing UnsafeRow
      5220cce [Josh Rosen] Add roundtrip converter test
      2bb8da8 [Josh Rosen] Add Union unsafe support + tests to bump up test coverage
      6f79449 [Josh Rosen] Add even more assertions to execute()
      08ce199 [Josh Rosen] Rename ConvertFromUnsafe -> ConvertToSafe
      0e2d548 [Josh Rosen] Add assertion if operators' input rows are in different formats
      cabb703 [Josh Rosen] Add tests for Filter
      3b11ce3 [Josh Rosen] Add missing test file.
      ae2195a [Josh Rosen] Fixes
      0fef0f8 [Josh Rosen] Rename file.
      d5f9005 [Josh Rosen] Finish writing EnsureRowFormats planner rule
      b5df19b [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-converter-planning
      9ba3038 [Josh Rosen] WIP
      b8aec6cd
    • Reynold Xin's avatar
      [SPARK-9169][SQL] Improve unit test coverage for null expressions. · fba3f5ba
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7490 from rxin/unit-test-null-funcs and squashes the following commits:
      
      7b276f0 [Reynold Xin] Move isNaN.
      8307287 [Reynold Xin] [SPARK-9169][SQL] Improve unit test coverage for null expressions.
      fba3f5ba
    • Paweł Kozikowski's avatar
      [MLLIB] [DOC] Seed fix in mllib naive bayes example · b9ef7ac9
      Paweł Kozikowski authored
      The previous seed resulted in an empty test data set.
      
      Author: Paweł Kozikowski <mupakoz@gmail.com>
      
      Closes #7477 from mupakoz/patch-1 and squashes the following commits:
      
      f5d41ee [Paweł Kozikowski] Mllib Naive Bayes example data set enlarged
      b9ef7ac9
  3. Jul 17, 2015
    • Rekha Joshi's avatar
      [SPARK-9118] [ML] Implement IntArrayParam in mllib · 10179082
      Rekha Joshi authored
      Implement IntArrayParam in mllib
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #7481 from rekhajoshm/SPARK-9118 and squashes the following commits:
      
      d3b1766 [Joshi] Implement IntArrayParam
      0be142d [Rekha Joshi] Merge pull request #3 from apache/master
      106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
      10179082
    • Yu ISHIKAWA's avatar
      [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines · 34a889db
      Yu ISHIKAWA authored
      I implemented the KMeans API for spark.ml Pipelines. It doesn't include clustering abstractions for spark.ml (SPARK-7610); those would fit better in another issue, and I'll try them later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.
      
      [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:
      
      be752de [Yu ISHIKAWA] Add assertions
      a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
      4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
      fb2417c [Yu ISHIKAWA] Use getInt, instead of get
      f397be4 [Yu ISHIKAWA] Switch the comparisons.
      ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
      effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
      c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
      19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
      1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
      f8338bc [Yu ISHIKAWA] Add the placeholders in Python
      4a03003 [Yu ISHIKAWA] Test for contains in Python
      6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
      288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
      5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception
      97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
      e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
      978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
      2ec80bc [Yu ISHIKAWA] Fit on 1 line
      e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
      b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
      f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
      3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
      4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
      2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
      19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
      4d2ad1e [Yu ISHIKAWA] Modify the indentations
      0ae422f [Yu ISHIKAWA] Add a test for `setParams`
      4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
      11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
      220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
      92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
      c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
      6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
      687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
      a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
      5bedc51 [Yu ISHIKAWA] Remve an extra new line
      444c289 [Yu ISHIKAWA] Add the validation for `runs`
      e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
      7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
      7991e15 [Yu ISHIKAWA] Add a validation for `k`
      c2df35d [Yu ISHIKAWA] Make `predict` private
      93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
      d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs
      e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
      8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
      6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
      99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
      79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
      6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      11c2a12 [Yu ISHIKAWA] Limit the imports
      badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
      f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods
      85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
      aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
      c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
      598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
      63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
      34a889db
    • Yijie Shen's avatar
      [SPARK-8280][SPARK-8281][SQL]Handle NaN, null and Infinity in math · 529a2c2d
      Yijie Shen authored
      JIRA:
      https://issues.apache.org/jira/browse/SPARK-8280
      https://issues.apache.org/jira/browse/SPARK-8281
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7451 from yijieshen/nan_null2 and squashes the following commits:
      
      47a529d [Yijie Shen] style fix
      63dee44 [Yijie Shen] handle log expressions similar to Hive
      188be51 [Yijie Shen] null to nan in Math Expression
      529a2c2d
    • Daoyuan Wang's avatar
      [SPARK-7026] [SQL] fix left semi join with equi key and non-equi condition · 17072386
      Daoyuan Wang authored
      When the `condition` extracted by `ExtractEquiJoinKeys` contains a join predicate for a left semi join, we cannot plan it as a semi join. For example:
      
          SELECT * FROM testData2 x
          LEFT SEMI JOIN testData2 y
          ON x.b = y.b
          AND x.a >= y.a + 2
      
      The condition `x.a >= y.a + 2` cannot be evaluated against table `x` alone, so it throws errors.
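The intended semantics can be sketched in plain Python (a hypothetical helper, not the Spark operator): build a hash table on the equi key, then apply the non-equi predicate to each candidate pair, rather than trying to evaluate it against one side alone.

```python
from collections import defaultdict

def left_semi_join(left, right, key, extra_pred):
    # Keep a left row if ANY right row with a matching equi key
    # also satisfies the non-equi condition.
    by_key = defaultdict(list)
    for r in right:
        by_key[key(r)].append(r)
    return [l for l in left
            if any(extra_pred(l, r) for r in by_key[key(l)])]
```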
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5643 from adrian-wang/spark7026 and squashes the following commits:
      
      cc09809 [Daoyuan Wang] refactor semijoin and add plan test
      575a7c8 [Daoyuan Wang] fix notserializable
      27841de [Daoyuan Wang] fix rebase
      10bf124 [Daoyuan Wang] fix style
      72baa02 [Daoyuan Wang] fix style
      8e0afca [Daoyuan Wang] merge commits for rebase
      17072386
    • Tathagata Das's avatar
      [SPARK-9030] [STREAMING] Add Kinesis.createStream unit tests that actual sends data · b13ef772
      Tathagata Das authored
      The current Kinesis unit tests do not test createStream by sending data. This PR adds such a unit test. Note that this test will not run by default; it will only run when the relevant environment variables are set.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7413 from tdas/kinesis-tests and squashes the following commits:
      
      0e16db5 [Tathagata Das] Added more comments regarding testOrIgnore
      1ea5ce0 [Tathagata Das] Added more comments
      c7caef7 [Tathagata Das] Address comments
      a297b59 [Tathagata Das] Reverted unnecessary change in KafkaStreamSuite
      90c9bde [Tathagata Das] Removed scalatest.FunSuite
      deb7f4f [Tathagata Das] Removed scalatest.FunSuite
      18c2208 [Tathagata Das] Changed how SparkFunSuite is inherited
      dbb33a5 [Tathagata Das] Added license
      88f6dab [Tathagata Das] Added scala docs
      c6be0d7 [Tathagata Das] minor changes
      24a992b [Tathagata Das] Moved KinesisTestUtils to src instead of test for future python usage
      465b55d [Tathagata Das] Made unit tests optional in a nice way
      4d70703 [Tathagata Das] Added license
      129d436 [Tathagata Das] Minor updates
      cc36510 [Tathagata Das] Added KinesisStreamSuite
      b13ef772
    • Wenchen Fan's avatar
      [SPARK-9117] [SQL] fix BooleanSimplification in case-insensitive · bd903ee8
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7452 from cloud-fan/boolean-simplify and squashes the following commits:
      
      2a6e692 [Wenchen Fan] fix style
      d3cfd26 [Wenchen Fan] fix BooleanSimplification in case-insensitive
      bd903ee8
    • Wenchen Fan's avatar
      [SPARK-9113] [SQL] enable analysis check code for self join · fd6b3101
      Wenchen Fan authored
      The check was unreachable before, as `case operator: LogicalPlan` catches everything already.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7449 from cloud-fan/tmp and squashes the following commits:
      
      2bb6637 [Wenchen Fan] add test
      5493aea [Wenchen Fan] add the check back
      27221a7 [Wenchen Fan] remove unnecessary analysis check code for self join
      fd6b3101
    • Yijie Shen's avatar
      [SPARK-9080][SQL] add isNaN predicate expression · 15fc2ffe
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9080
      
      cc rxin
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7464 from yijieshen/isNaN and squashes the following commits:
      
      11ae039 [Yijie Shen] add isNaN in functions
      666718e [Yijie Shen] add isNaN predicate expression
      15fc2ffe
    • Reynold Xin's avatar
      [SPARK-9142] [SQL] Removing unnecessary self types in Catalyst. · b2aa490b
      Reynold Xin authored
      Just a small change to add Product type to the base expression/plan abstract classes, based on suggestions on #7434 and offline discussions.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7479 from rxin/remove-self-types and squashes the following commits:
      
      e407ffd [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in Catalyst.
      b2aa490b
    • Joshi's avatar
      [SPARK-8593] [CORE] Sort app attempts by start time. · 42d8a012
      Joshi authored
      This makes sure attempts are listed in the order they were executed, and that the
      app's state matches the state of the most current attempt.
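The ordering itself is trivial (a sketch with an assumed `startTime` field; the squashed commits indicate descending start time, i.e. newest attempt first):

```python
def sorted_attempts(attempts):
    # Newest-first, so the app's displayed state matches its most
    # recent attempt. 'startTime' is an assumed field name.
    return sorted(attempts, key=lambda a: a["startTime"], reverse=True)
```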
      
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #7253 from rekhajoshm/SPARK-8593 and squashes the following commits:
      
      874dd80 [Joshi] History Server: updated order for multiple attempts(logcleaner)
      716e0b1 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
      548c753 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
      83306a8 [Joshi] History Server: updated order for multiple attempts(descending start time)
      b0fc922 [Joshi] History Server: updated order for multiple attempts(updated comment)
      cc0fda7 [Joshi] History Server: updated order for multiple attempts(updated test)
      304cb0b [Joshi] History Server: updated order for multiple attempts(reverted HistoryPage)
      85024e8 [Joshi] History Server: updated order for multiple attempts
      a41ac4b [Joshi] History Server: updated order for multiple attempts
      ab65fa1 [Joshi] History Server: some attempt completed to work with showIncomplete
      0be142d [Rekha Joshi] Merge pull request #3 from apache/master
      106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
      42d8a012
    • Bryan Cutler's avatar
      [SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles · 8b8be1f5
      Bryan Cutler authored
      Broadcast of ensemble models in transformImpl before call to predict
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6300 from BryanCutler/bcast-ensemble-models-7127 and squashes the following commits:
      
      86e73de [Bryan Cutler] [SPARK-7127] Replaced deprecated callUDF with udf
      40a139d [Bryan Cutler] Merge branch 'master' into bcast-ensemble-models-7127
      9afad56 [Bryan Cutler] [SPARK-7127] Simplified calls by overriding transformImpl and using broadcasted model in callUDF to make prediction
      1f34be4 [Bryan Cutler] [SPARK-7127] Removed accidental newline
      171a6ce [Bryan Cutler] [SPARK-7127] Used modelAccessor parameter in predictImpl to access broadcasted model
      6fd153c [Bryan Cutler] [SPARK-7127] Applied broadcasting to remaining ensemble models
      aaad77b [Bryan Cutler] [SPARK-7127] Removed abstract class for broadcasting model, instead passing a prediction function as param to transform
      83904bb [Bryan Cutler] [SPARK-7127] Adding broadcast of model before prediction in RandomForestClassifier
      8b8be1f5
    • Yanbo Liang's avatar
      [SPARK-8792] [ML] Add Python API for PCA transformer · 830666f6
      Yanbo Liang authored
      Add Python API for PCA transformer
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7190 from yanboliang/spark-8792 and squashes the following commits:
      
      8f4ac31 [Yanbo Liang] address comments
      8a79cc0 [Yanbo Liang] Add Python API for PCA transformer
      830666f6
    • Feynman Liang's avatar
      [SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary,... · 6da10696
      Feynman Liang authored
      [SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary, EnsembleTestHelper, and SquaredError
      
      Make the definition of residuals in Spark consistent with the literature. We have been using `prediction - label` for residuals, but the literature usually defines `residual = label - prediction`.
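      The sign flip is easy to illustrate with a minimal standalone sketch (hypothetical example code, not the Spark patch itself):

      ```java
      // Hypothetical sketch of the two residual conventions.
      public class ResidualConvention {
          // Convention in the literature, adopted by this change:
          static double residual(double label, double prediction) {
              return label - prediction;
          }

          public static void main(String[] args) {
              double label = 3.0, prediction = 2.5;
              // The old Spark behavior was prediction - label = -0.5;
              // the fixed definition gives label - prediction = 0.5.
              System.out.println(residual(label, prediction)); // prints 0.5
          }
      }
      ```

      Only the sign changes, so quantities derived from squared residuals (e.g. squared error) are unaffected; anything that inspects the sign of a residual is.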
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7435 from feynmanliang/SPARK-9090-Fix-LinearRegressionSummary-Residuals and squashes the following commits:
      
      f4b39d8 [Feynman Liang] Fix doc
      bc12a92 [Feynman Liang] Tweak EnsembleTestHelper and SquaredError residuals
      63f0d60 [Feynman Liang] Fix definition of residual
      6da10696
    • zsxwing's avatar
      [SPARK-5681] [STREAMING] Move 'stopReceivers' to the event loop to resolve the race condition · ad0954f6
      zsxwing authored
      This is an alternative way to fix `SPARK-5681`. It minimizes the changes.
      
      Closes #4467
      
      Author: zsxwing <zsxwing@gmail.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6294 from zsxwing/pr4467 and squashes the following commits:
      
      709ac1f [zsxwing] Fix the comment
      e103e8a [zsxwing] Move ReceiverTracker.stop into ReceiverTracker.stop
      f637142 [zsxwing] Address minor code style comments
      a178d37 [zsxwing] Move 'stopReceivers' to the event looop to resolve the race condition
      51fb07e [zsxwing] Fix the code style
      3cb19a3 [zsxwing] Merge branch 'master' into pr4467
      b4c29e7 [zsxwing] Stop receiver only if we start it
      c41ee94 [zsxwing] Make stopReceivers private
      7c73c1f [zsxwing] Use trackerStateLock to protect trackerState
      a8120c0 [zsxwing] Merge branch 'master' into pr4467
      7b1d9af [zsxwing] "case Throwable" => "case NonFatal"
      15ed4a1 [zsxwing] Register before starting the receiver
      fff63f9 [zsxwing] Use a lock to eliminate the race condition when stopping receivers and registering receivers happen at the same time.
      e0ef72a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
      19b76d9 [Liang-Chi Hsieh] Remove timeout.
      34c18dc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
      c419677 [Liang-Chi Hsieh] Fix style.
      9e1a760 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout
      355f9ce [Liang-Chi Hsieh] Separate register and start events for receivers.
      3d568e8 [Liang-Chi Hsieh] Let receivers get registered first before going started.
      ae0d9fd [Liang-Chi Hsieh] Merge branch 'master' into tracker_status_timeout
      77983f3 [Liang-Chi Hsieh] Add tracker status and stop to receive messages when stopping tracker.
      ad0954f6
    • Wenchen Fan's avatar
      [SPARK-9136] [SQL] fix several bugs in DateTimeUtils.stringToTimestamp · 074085d6
      Wenchen Fan authored
      A follow-up to https://github.com/apache/spark/pull/7353:
      
      1. We should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR` (`Calendar.HOUR` is the 12-hour field, meant to be used together with AM/PM).
      2. We should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance`, since `getInstance` starts from the current time.
      
      I'm not sure why the tests didn't fail in Jenkins, but I ran the latest Spark master branch locally and `DateTimeUtilsSuite` failed.
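      Both pitfalls can be demonstrated outside Spark with plain `java.util.Calendar` (a minimal standalone sketch, not the patched Spark code):

      ```java
      import java.util.Calendar;

      public class CalendarHourExample {
          public static void main(String[] args) {
              Calendar c = Calendar.getInstance();

              // Calendar.HOUR is the 12-hour-clock field; on its own it cannot
              // distinguish 1 AM from 1 PM and must be combined with AM_PM.
              c.set(Calendar.AM_PM, Calendar.PM);
              c.set(Calendar.HOUR, 1);                         // 1 PM on the 12-hour clock
              System.out.println(c.get(Calendar.HOUR_OF_DAY)); // prints 13

              // Calendar.HOUR_OF_DAY is the unambiguous 24-hour field.
              c.set(Calendar.HOUR_OF_DAY, 13);
              System.out.println(c.get(Calendar.HOUR_OF_DAY)); // prints 13

              // getInstance() starts from the current wall-clock time, so
              // MILLISECOND holds whatever "now" happened to be; zero it
              // explicitly for reproducible timestamps.
              c.set(Calendar.MILLISECOND, 0);
              System.out.println(c.get(Calendar.MILLISECOND)); // prints 0
          }
      }
      ```

      Parsing an hour into `Calendar.HOUR` without also setting `AM_PM` silently drops the afternoon half of the day, which is exactly the kind of bug this change fixes in `DateTimeUtils.stringToTimestamp`.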
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7473 from cloud-fan/datetime and squashes the following commits:
      
      66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp
      074085d6
    • Yanbo Liang's avatar
      [SPARK-8600] [ML] Naive Bayes API for spark.ml Pipelines · 99746428
      Yanbo Liang authored
      Naive Bayes API for spark.ml Pipelines
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7284 from yanboliang/spark-8600 and squashes the following commits:
      
      bc890f7 [Yanbo Liang] remove labels valid check
      c3de687 [Yanbo Liang] remove labels from ml.NaiveBayesModel
      a2b3088 [Yanbo Liang] address comments
      3220b82 [Yanbo Liang] trigger jenkins
      3018a41 [Yanbo Liang] address comments
      208e166 [Yanbo Liang] Naive Bayes API for spark.ml Pipelines
      99746428
    • Yuhao Yang's avatar
      [SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true) · 806c579f
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-9062
      
      Currently the output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and other transformers, since their input type is Array(String, true). A Seq[String] in a UDF is treated as Array(String, true) by default.
      
      I'm not sure what the recommended way is for Tokenizer to handle null values in the input. Any suggestions are welcome.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7414 from hhbyyh/tokenizer and squashes the following commits:
      
      c01bd7a [Yuhao Yang] change output type of tokenizer
      806c579f
    • Davies Liu's avatar
      [SPARK-9138] [MLLIB] fix Vectors.dense · f9a82a88
      Davies Liu authored
      Vectors.dense() should accept numbers directly, like the Scala version. We already use it this way in doctests; it worked by luck.
      
      cc mengxr jkbradley
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7476 from davies/fix_vectors_dense and squashes the following commits:
      
      e0fd292 [Davies Liu] fix Vectors.dense
      f9a82a88
    • tien-dungle's avatar
      [SPARK-9109] [GRAPHX] Keep the cached edge in the graph · 587c315b
      tien-dungle authored
      The change here is to keep the cached RDDs in the graph object so that, when graph.unpersist() is called, these RDDs are correctly unpersisted.
      
      ```scala
      import org.apache.spark.graphx._
      import org.apache.spark.rdd.RDD
      import org.slf4j.LoggerFactory
      import org.apache.spark.graphx.util.GraphGenerators
      
      // Create an RDD for the vertices
      val users: RDD[(VertexId, (String, String))] =
        sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                             (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
      // Create an RDD for edges
      val relationships: RDD[Edge[String]] =
        sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                             Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
      // Define a default user in case there are relationships with missing users
      val defaultUser = ("John Doe", "Missing")
      // Build the initial Graph
      val graph = Graph(users, relationships, defaultUser)
      graph.cache().numEdges
      
      graph.unpersist()
      
      sc.getPersistentRDDs.foreach( r => println( r._2.toString))
      ```
      
      Author: tien-dungle <tien-dung.le@realimpactanalytics.com>
      
      Closes #7469 from tien-dungle/SPARK-9109_Graphx-unpersist and squashes the following commits:
      
      8d87997 [tien-dungle] Keep the cached edge in the graph
      587c315b
    • Liang-Chi Hsieh's avatar
      [SPARK-8945][SQL] Add add and subtract expressions for IntervalType · eba6a1af
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8945
      
      Add add and subtract expressions for IntervalType.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@databricks.com>
      
      Closes #7398 from viirya/interval_add_subtract and squashes the following commits:
      
      acd1f1e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
      5abae28 [Liang-Chi Hsieh] For comments.
      6f5b72e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
      dbe3906 [Liang-Chi Hsieh] For comments.
      13a2fc5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract
      83ec129 [Liang-Chi Hsieh] Remove intervalMethod.
      acfe1ab [Liang-Chi Hsieh] Fix scala style.
      d3e9d0e [Liang-Chi Hsieh] Add add and subtract expressions for IntervalType.
      eba6a1af