  1. Feb 01, 2015
    • Jacky Li
      [SPARK-4001][MLlib] adding parallel FP-Growth algorithm for frequent pattern mining in MLlib · 859f7249
      Jacky Li authored
      Apriori is the classic algorithm for frequent itemset mining in a transactional data set, and it would be useful to have it in MLlib; this PR adds an implementation.
      There is one point I am not sure is the most efficient: to filter out the eligible frequent itemsets, I currently use a cartesian operation on two RDDs to compute the support of each itemset. I am not sure whether using a broadcast variable to achieve the same would be better.
      
      I will add an example of using this algorithm if required.
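The support-counting question above (a cartesian join of candidate itemsets against transactions vs. broadcasting the candidates) can be illustrated with a deliberately naive single-machine sketch that counts every candidate itemset against every transaction. This is illustrative only and is not the parallel FP-Growth implementation from the PR:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    """Naive frequent-itemset mining: enumerate candidate itemsets per
    transaction and count their support, then keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    # Keep itemsets whose relative support meets the threshold.
    return {s: c for s, c in counts.items() if c / n >= min_support}
```

A distributed version would instead broadcast the candidate sets (or build FP-trees per partition) rather than enumerate everything on one machine.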
      
      Author: Jacky Li <jacky.likun@huawei.com>
      Author: Jacky Li <jackylk@users.noreply.github.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2847 from jackylk/apriori and squashes the following commits:
      
      bee3093 [Jacky Li] Merge pull request #1 from mengxr/SPARK-4001
      7e69725 [Xiangrui Meng] simplify FPTree and update FPGrowth
      ec21f7d [Jacky Li] fix scalastyle
      93f3280 [Jacky Li] create FPTree class
      d110ab2 [Jacky Li] change test case to use MLlibTestSparkContext
      a6c5081 [Jacky Li] Add Parallel FPGrowth algorithm
      eb3e4ca [Jacky Li] add FPGrowth
      03df2b6 [Jacky Li] refactory according to comments
      7b77ad7 [Jacky Li] fix scalastyle check
      f68a0bd [Jacky Li] add 2 apriori implemenation and fp-growth implementation
      889b33f [Jacky Li] modify per scalastyle check
      da2cba7 [Jacky Li] adding apriori algorithm for frequent item set mining in Spark
      859f7249
    • Yuhao Yang
      [Spark-5406][MLlib] LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound · d85cd4eb
      Yuhao Yang authored
      JIRA link: https://issues.apache.org/jira/browse/SPARK-5406
      
      The workspace-size computation in Breeze's svd imposes the upper bound for LocalLAPACK in RowMatrix.computeSVD.
      Code from Breeze's svd (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):
           val workSize = ( 3
              * scala.math.min(m, n)
              * scala.math.min(m, n)
              + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
                * scala.math.min(m, n) + 4 * scala.math.min(m, n))
            )
            val work = new Array[Double](workSize)
      
      As a result, we need at least 7 * n * n + 4 * n < Int.MaxValue (the exact limit depends on the JVM).
      
      In some worse cases, e.g. n = 25000, the work size overflows Int and becomes positive again (80032704), leading to weird behavior.
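The quoted value 80032704 is consistent with the dominant 7·n² term of the formula above wrapping around in 32-bit JVM integer arithmetic. A small sketch of the wraparound, with `to_int32` emulating a JVM Int:

```python
def to_int32(x):
    """Emulate JVM Int arithmetic: wrap an integer into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

n = 25000  # LocalLAPACK decomposes the n x n Gramian, so m == n here
overflowed = to_int32(7 * n * n)  # the dominant 7*n*n term wraps past Int.MaxValue
print(overflowed)  # a small positive number again, despite 7*n*n being ~4.4e9
```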
      
      This PR is only the beginning of supporting Genbase (an important biological benchmark that would help promote Spark in genetic applications, http://www.paradigm4.com/wp-content/uploads/2014/06/Genomics-Benchmark-Technical-Report.pdf),
      which needs to compute the SVD of matrices up to 60K * 70K. I found many potential issues and would like to know whether there is any ongoing plan to expand the range of matrix computation based on Spark.
      Thanks.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4200 from hhbyyh/rowMatrix and squashes the following commits:
      
      f7864d0 [Yuhao Yang] update auto logic for rowMatrix svd
      23860e4 [Yuhao Yang] fix comment style
      e48a6e4 [Yuhao Yang] make latent svd computation constraint clear
      d85cd4eb
    • Cheng Lian
      [SPARK-5465] [SQL] Fixes filter push-down for Parquet data source · ec100321
      Cheng Lian authored
      Not all Catalyst filter expressions can be converted to Parquet filter predicates. We should try to convert each individual predicate and then collect those convertible ones.
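The per-predicate strategy can be sketched as follows; the tuple-based `convert` here is a hypothetical stand-in for the real Catalyst-to-Parquet conversion:

```python
def convert(pred):
    """Hypothetical converter: return a Parquet-style predicate, or None
    when the expression has no Parquet equivalent."""
    supported = {"=", "<", ">", "<=", ">="}
    op, col, val = pred
    return (op, col, val) if op in supported else None

def convertible_predicates(preds):
    # Try to convert each predicate individually and collect the convertible
    # ones, instead of giving up on the whole filter when one predicate fails.
    return [c for c in (convert(p) for p in preds) if c is not None]
```

The unconvertible predicates would still be evaluated by Spark after the scan, so the result stays correct; push-down is purely an optimization.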
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4255 from liancheng/spark-5465 and squashes the following commits:
      
      14ccd37 [Cheng Lian] Fixes filter push-down for Parquet data source
      ec100321
    • Daoyuan Wang
      [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for... · 8cf4a1f0
      Daoyuan Wang authored
      [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for parameters of coalesce
      
      I'll add a test case in #4040
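The type-widening behavior for coalesce arguments might be sketched like this, with a toy two-level numeric ordering standing in for Catalyst's type coercion rules:

```python
# Toy numeric widening order (assumption: stands in for Catalyst's coercion).
_WIDTH = {int: 0, float: 1}

def widened_type(types):
    """Pick the widest type among the argument types."""
    return max(types, key=lambda t: _WIDTH[t])

def coalesce(*args):
    """Return the first non-None argument, cast to the widest numeric type
    among all non-None arguments, so coalesce(1, 2.5) yields 1.0, not 1."""
    present = [a for a in args if a is not None]
    if not present:
        return None
    target = widened_type({type(a) for a in present})
    return target(present[0])
```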
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4057 from adrian-wang/coal and squashes the following commits:
      
      4d0111a [Daoyuan Wang] address Yin's comments
      c393e18 [Daoyuan Wang] fix rebase conflicts
      e47c03a [Daoyuan Wang] add coalesce in parser
      c74828d [Daoyuan Wang] cast types for coalesce
      8cf4a1f0
    • OopsOutOfMemory
      [SPARK-5196][SQL] Support `comment` in Create Table Field DDL · 1b56f1d6
      OopsOutOfMemory authored
      Support `comment` when creating a table field.
      __CREATE TEMPORARY TABLE people(name string `comment` "the name of a person")__
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #3999 from OopsOutOfMemory/meta_comment and squashes the following commits:
      
      39150d4 [OopsOutOfMemory] add comment and refine test suite
      1b56f1d6
    • Masayoshi TSUZUKI
      [SPARK-1825] Make Windows Spark client work fine with Linux YARN cluster · 7712ed5b
      Masayoshi TSUZUKI authored
      Modified environment strings and path separators to a platform-independent style where possible.
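The general idea is to never hard-code one platform's separator. A minimal Python sketch using the standard library's `os.pathsep`, mirroring what `File.pathSeparator` does on the JVM:

```python
import os

def build_classpath(entries):
    """Join path entries with the platform's separator (';' on Windows,
    ':' elsewhere) instead of hard-coding one of them."""
    return os.pathsep.join(entries)
```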
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #3943 from tsudukim/feature/SPARK-1825 and squashes the following commits:
      
      ec4b865 [Masayoshi TSUZUKI] Rebased and modified as comments.
      f8a1d5a [Masayoshi TSUZUKI] Merge branch 'master' of github.com:tsudukim/spark into feature/SPARK-1825
      3d03d35 [Masayoshi TSUZUKI] [SPARK-1825] Make Windows Spark client work fine with Linux YARN cluster
      7712ed5b
    • Tom Panning
      [SPARK-5176] The thrift server does not support cluster mode · 1ca0a101
      Tom Panning authored
      Output an error message if the thrift server is started in cluster mode.
      
      Author: Tom Panning <tom.panning@nextcentury.com>
      
      Closes #4137 from tpanningnextcen/spark-5176-thrift-cluster-mode-error and squashes the following commits:
      
      f5c0509 [Tom Panning] [SPARK-5176] The thrift server does not support cluster mode
      1ca0a101
    • Kousuke Saruta
      [SPARK-5155] Build fails with spark-ganglia-lgpl profile · c80194b3
      Kousuke Saruta authored
      The build currently fails with the spark-ganglia-lgpl profile because the pom.xml for spark-ganglia-lgpl has not been updated.
      
      This PR is related to #4218, #4209 and #3812.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #4303 from sarutak/fix-ganglia-pom-for-metric and squashes the following commits:
      
      5cf455f [Kousuke Saruta] Fixed pom.xml for ganglia in order to use io.dropwizard.metrics
      c80194b3
    • Liang-Chi Hsieh
      [Minor][SQL] Little refactor DataFrame related codes · ef89b82d
      Liang-Chi Hsieh authored
      Simplify some code related to DataFrame.
      
      *  Calling `toAttributes` instead of a `map`.
      *  Original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create `StructType` and call `toAttributes` on it directly.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4298 from viirya/refactor_df and squashes the following commits:
      
      1d61c64 [Liang-Chi Hsieh] Revert it.
      f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame.
      2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
      ef89b82d
    • zsxwing
      [SPARK-4859][Core][Streaming] Refactor LiveListenerBus and StreamingListenerBus · 883bc88d
      zsxwing authored
      This PR refactors LiveListenerBus and StreamingListenerBus and extracts the common codes to a parent class `ListenerBus`.
      
      It also includes bug fixes in #3710:
      1. Fix the race condition of queueFullErrorMessageLogged in LiveListenerBus and StreamingListenerBus to avoid outputting `queue-full-error` logs multiple times.
      2. Make sure the SHUTDOWN message will be delivered to listenerThread, so that we can make sure listenerThread will always be able to exit.
      3. Log the error from listener rather than crashing listenerThread in StreamingListenerBus.
      
      While fixing the above bugs, we found it is better for LiveListenerBus and StreamingListenerBus to have the same behaviors; doing so naively, however, would leave much duplicated code in the two classes.
      
      Therefore, I extracted their common code into `ListenerBus` as a parent class: LiveListenerBus and StreamingListenerBus only need to extend `ListenerBus` and implement `onPostEvent` (how to process an event) and `onDropEvent` (what to do when dropping an event).
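The resulting class layout can be sketched roughly as follows. This is a simplified, single-threaded Python analogue, not Spark's actual classes; only the names mirror the PR's `onPostEvent`/`onDropEvent`:

```python
from queue import Queue, Full

class ListenerBus:
    """Common base: queueing, delivery, and drop handling live here;
    subclasses only decide how to dispatch and how to handle drops."""
    def __init__(self, capacity=10000):
        self.listeners = []
        self.queue = Queue(maxsize=capacity)

    def add_listener(self, listener):
        self.listeners.append(listener)

    def post(self, event):
        try:
            self.queue.put_nowait(event)
        except Full:
            self.on_drop_event(event)  # bounded queue: drop, don't block

    def deliver_pending(self):
        while not self.queue.empty():
            event = self.queue.get_nowait()
            for listener in self.listeners:
                try:
                    self.on_post_event(listener, event)
                except Exception as e:
                    print(f"listener error: {e}")  # log; never crash the loop

    def on_post_event(self, listener, event):
        raise NotImplementedError

    def on_drop_event(self, event):
        raise NotImplementedError

class StreamingListenerBus(ListenerBus):
    def on_post_event(self, listener, event):
        listener(event)

    def on_drop_event(self, event):
        pass  # e.g. log a queue-full error exactly once
```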
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4006 from zsxwing/SPARK-4859-refactor and squashes the following commits:
      
      c8dade2 [zsxwing] Fix the code style after renaming
      5715061 [zsxwing] Rename ListenerHelper to ListenerBus and the original ListenerBus to AsynchronousListenerBus
      f0ef647 [zsxwing] Fix the code style
      4e85ffc [zsxwing] Merge branch 'master' into SPARK-4859-refactor
      d2ef990 [zsxwing] Add private[spark]
      4539f91 [zsxwing] Remove final to pass MiMa tests
      a9dccd3 [zsxwing] Remove SparkListenerShutdown
      7cc04c3 [zsxwing] Refactor LiveListenerBus and StreamingListenerBus and make them share same code base
      883bc88d
    • Xiangrui Meng
      [SPARK-5424][MLLIB] make the new ALS impl take generic ID types · 4a171225
      Xiangrui Meng authored
      This PR makes the ALS implementation take generic ID types, e.g., Long and String, and expose it as a developer API.
      
      TODO:
      - [x] make sure that specialization works (validated in profiler)
      
      srowen You may like this change:) I hit a Scala compiler bug with specialization. It compiles now but users and items must have the same type. I'm going to check whether specialization really works.
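One common way such an API handles generic ID types internally is to map arbitrary hashable IDs to dense integer indices before solving. A small illustrative sketch (this indexing helper is an assumption about the general pattern, not the PR's specialized implementation):

```python
from typing import Dict, Hashable, List, Tuple, TypeVar

ID = TypeVar("ID", bound=Hashable)

def index_ids(ratings: List[Tuple[ID, ID, float]]) -> Tuple[Dict[ID, int], Dict[ID, int]]:
    """Map generic user/item IDs (Long, String, ...) to dense int indices,
    so the solver can work on int-indexed matrices internally."""
    users: Dict[ID, int] = {}
    items: Dict[ID, int] = {}
    for u, i, _ in ratings:
        users.setdefault(u, len(users))
        items.setdefault(i, len(items))
    return users, items
```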
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4281 from mengxr/generic-als and squashes the following commits:
      
      96072c3 [Xiangrui Meng] merge master
      135f741 [Xiangrui Meng] minor update
      c2db5e5 [Xiangrui Meng] make test pass
      86588e1 [Xiangrui Meng] use a single ID type for both users and items
      74f1f73 [Xiangrui Meng] compile but runtime error at test
      e36469a [Xiangrui Meng] add classtags and make it compile
      7a5aeb3 [Xiangrui Meng] UserType -> User, ItemType -> Item
      c8ee0bc [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into generic-als
      72b5006 [Xiangrui Meng] remove generic from pipeline interface
      8bbaea0 [Xiangrui Meng] make ALS take generic IDs
      4a171225
    • Octavian Geagla
      [SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use · bdb0680d
      Octavian Geagla authored
      This seems complete; the duplication of tests for provided means/variances might be overkill. I would appreciate some feedback.
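The idea of re-using statistics can be sketched as a model constructed directly from precomputed (std, mean) rather than fitted from data. A toy Python analogue, not the MLlib class itself:

```python
class StandardScalerModel:
    """Scaler built from precomputed statistics, so fitted mean/std can be
    re-used to transform new datasets without refitting."""
    def __init__(self, std, mean=None, with_mean=False, with_std=True):
        self.std, self.mean = std, mean
        self.with_mean, self.with_std = with_mean, with_std

    def transform(self, vector):
        out = list(vector)
        if self.with_mean:
            out = [x - m for x, m in zip(out, self.mean)]
        if self.with_std:
            # Guard against zero std to avoid division by zero.
            out = [x / s if s != 0 else 0.0 for x, s in zip(out, self.std)]
        return out
```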
      
      Author: Octavian Geagla <ogeagla@gmail.com>
      
      Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:
      
      fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
      9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
      997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
      64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
      bdb0680d
    • Ryan Williams
      [SPARK-5422] Add support for sending Graphite metrics via UDP · 80bd715a
      Ryan Williams authored
      Depends on [SPARK-5413](https://issues.apache.org/jira/browse/SPARK-5413) / #4209, included here, will rebase once the latter's merged.
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #4218 from ryan-williams/udp and squashes the following commits:
      
      ebae393 [Ryan Williams] Add support for sending Graphite metrics via UDP
      cb58262 [Ryan Williams] bump metrics dependency to v3.1.0
      80bd715a
  2. Jan 31, 2015
    • Sean Owen
      SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` doesn't work with Java 8 · c84d5a10
      Sean Owen authored
      These are more `javadoc` 8-related changes I spotted while investigating. These should be helpful in any event, but this does not nearly resolve SPARK-3359, which may never be feasible while using `unidoc` and `javadoc` 8.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4193 from srowen/SPARK-3359 and squashes the following commits:
      
      5b33f66 [Sean Owen] Additional scaladoc fixes for javadoc 8; still not going to be javadoc 8 compatible
      c84d5a10
    • Burak Yavuz
      [SPARK-3975] Added support for BlockMatrix addition and multiplication · ef8974b1
      Burak Yavuz authored
      Support for multiplying and adding large distributed matrices!
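The two operations can be sketched on matrices stored as a map from block coordinates to dense blocks (plain nested lists here; the real implementation distributes the blocks as an RDD):

```python
def _mat_add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def _mat_mul(x, y):
    return [[sum(x[r][t] * y[t][c] for t in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]

def block_add(A, B):
    """Add block matrices stored as {(blockRow, blockCol): nested lists};
    blocks missing from one side count as zero."""
    out = {}
    for key in set(A) | set(B):
        if key not in A:
            out[key] = [row[:] for row in B[key]]
        elif key not in B:
            out[key] = [row[:] for row in A[key]]
        else:
            out[key] = _mat_add(A[key], B[key])
    return out

def block_multiply(A, B):
    """C[i, k] = sum over j of A[i, j] * B[j, k], accumulating per output block."""
    out = {}
    for (i, j), a in A.items():
        for (j2, k), b in B.items():
            if j != j2:
                continue
            prod = _mat_mul(a, b)
            out[(i, k)] = prod if (i, k) not in out else _mat_add(out[(i, k)], prod)
    return out
```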
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <brkyvz@dn51t42l.sunet>
      Author: Burak Yavuz <brkyvz@dn51t4rd.sunet>
      Author: Burak Yavuz <brkyvz@dn0a221430.sunet>
      Author: Burak Yavuz <brkyvz@dn0a22b17d.sunet>
      
      Closes #4274 from brkyvz/SPARK-3975PR2 and squashes the following commits:
      
      17abd59 [Burak Yavuz] added indices to error message
      ac25783 [Burak Yavuz] merged masyer
      b66fd8b [Burak Yavuz] merged masyer
      e39baff [Burak Yavuz] addressed code review v1
      2dba642 [Burak Yavuz] [SPARK-3975] Added support for BlockMatrix addition and multiplication
      fb7624b [Burak Yavuz] merged master
      98c58ea [Burak Yavuz] added tests
      cdeb5df [Burak Yavuz] before adding tests
      c9bf247 [Burak Yavuz] fixed merge conflicts
      1cb0d06 [Burak Yavuz] [SPARK-3976] Added doc
      f92a916 [Burak Yavuz] merge upstream
      1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required
      1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist
      e3d24c3 [Burak Yavuz] [SPARK-3976] Pulled upstream changes
      fa3774f [Burak Yavuz] [SPARK-3976] updated matrix multiplication and addition implementation
      239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments
      add7b05 [Burak Yavuz] [SPARK-3976] Updated code according to upstream changes
      e29acfd [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3976
      3127233 [Burak Yavuz] fixed merge conflicts with upstream
      ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm
      ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust
      9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable
      d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner
      8e954ab [Burak Yavuz] save changes
      bbeae8c [Burak Yavuz] merged master
      987ea53 [Burak Yavuz] merged master
      49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master
      645afbe [Burak Yavuz] [SPARK-3974] Pull latest master
      beb1edd [Burak Yavuz] merge conflicts fixed
      f41d8db [Burak Yavuz] update tests
      b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes
      56b0546 [Burak Yavuz] updates from 3974 PR
      b7b8a8f [Burak Yavuz] pull updates from master
      b2dec63 [Burak Yavuz] Pull changes from 3974
      19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol
      5f062e6 [Burak Yavuz] updates with 3974
      6729fbd [Burak Yavuz] Updated with respect to SPARK-3974 PR
      589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed
      63a4858 [Burak Yavuz] added grid multiplication
      aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added
      7381b99 [Burak Yavuz] merge with PR1
      f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready
      b693209 [Burak Yavuz] Ready for Pull request
      ef8974b1
    • martinzapletal
      [MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm · 34250a61
      martinzapletal authored
      This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.
      
      The Isotonic regression problem is sufficiently described in [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false), [Wikipedia](http://en.wikipedia.org/wiki/Isotonic_regression) or [Stat Wiki](http://stat.wikia.com/wiki/Isotonic_regression).
      
      Pool adjacent violators was introduced by M. Ayer et al. in 1955. A history and development of isotonic regression algorithms is given in [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), and a list of available algorithms, including their complexity, appears in [Stout, Fastest Isotonic Regression Algorithms](http://web.eecs.umich.edu/~qstout/IsoRegAlg_140812.pdf).
      
      An approach to parallelize the computation of PAV was presented in [Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression](http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf).
      
      The implemented Pool adjacent violators algorithm is based on  [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false) (Chapter Isotonic regression problems, p. 86) and  [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), also nicely formulated in [Tibshirani,  Hoefling, Tibshirani, Nearly-Isotonic Regression](http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf). Implementation itself inspired by R implementations [Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism](http://cran.r-project.org/web/packages/fdrtool/index.html) and [R Development Core Team, stats, 2009](https://github.com/lgautier/R-3-0-branch-alt/blob/master/src/library/stats/R/isoreg.R). I ran tests with both these libraries and confirmed they yield the same results. More R implementations referenced in aforementioned [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators
      Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper). The implementation is also inspired by and cross-checked against other implementations: [Ted Harding, 2007](https://stat.ethz.ch/pipermail/r-help/2007-March/127981.html), [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/_isotonic.pyx), [Andrew Tulloch, 2014, Julia](https://github.com/ajtulloch/Isotonic.jl/blob/master/src/pooled_pava.jl), [Andrew Tulloch, 2014, c++](https://gist.github.com/ajtulloch/9499872), described in [Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x](http://tullo.ch/articles/speeding-up-isotonic-regression/), [Fabian Pedregosa, 2012](https://gist.github.com/fabianp/3081831), [Sreangsu Acharyya. libpav](https://bitbucket.org/sreangsu/libpav/src/f744bc1b0fea257f0cacaead1c922eab201ba91b/src/pav.h?at=default) and [Gustav Larsson](https://gist.github.com/gustavla/9499068).
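The sequential pool-adjacent-violators step at the core of these references can be sketched compactly. This is a weighted PAV over a single sequence, not the parallel version in the PR:

```python
def pav(y, w=None):
    """Weighted pool adjacent violators: scan left to right, merging
    adjacent blocks whenever they violate monotonicity, and replace each
    merged block by its weighted mean."""
    w = [1.0] * len(y) if w is None else list(w)
    blocks = []  # each block: [mean, total weight, point count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fitted = []
    for m, _, c in blocks:
        fitted.extend([m] * c)
    return fitted
```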
      
      Author: martinzapletal <zapletal-martin@email.cz>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Martin Zapletal <zapletal-martin@email.cz>
      
      Closes #3519 from zapletal-martin/SPARK-3278 and squashes the following commits:
      
      5a54ea4 [Martin Zapletal] Merge pull request #2 from mengxr/isotonic-fix-java
      37ba24e [Xiangrui Meng] fix java tests
      e3c0e44 [martinzapletal] Merge remote-tracking branch 'origin/SPARK-3278' into SPARK-3278
      d8feb82 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      ded071c [Martin Zapletal] Merge pull request #1 from mengxr/SPARK-3278
      4dfe136 [Xiangrui Meng] add cache back
      0b35c15 [Xiangrui Meng] compress pools and update tests
      35d044e [Xiangrui Meng] update paraPAVA
      077606b [Xiangrui Meng] minor
      05422a8 [Xiangrui Meng] add unit test for model construction
      5925113 [Xiangrui Meng] Merge remote-tracking branch 'zapletal-martin/SPARK-3278' into SPARK-3278
      80c6681 [Xiangrui Meng] update IRModel
      3da56e5 [martinzapletal] SPARK-3278 fixed indentation error
      75eac55 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      88eb4e2 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Isotonic parameter removed from algorithm, defined behaviour for multiple data points with the same feature value, added tests to verify it
      e60a34f [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Styling and comment fixes.
      d93c8f9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Change to IsotonicRegression api. Isotonic parameter now follows api of other mllib algorithms
      1fff77d [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Java api changes, test refactoring, comments and citations, isotonic regression model validations, linear interpolation for predictions
      12151e6 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      7aca4cc [martinzapletal] SPARK-3278 comment spelling
      9ae9d53 [martinzapletal] SPARK-3278 changes after PR feedback https://github.com/apache/spark/pull/3519. Binary search used for isotonic regression model predictions
      fad4bf9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519
      ce0e30c [martinzapletal] SPARK-3278 readability refactoring
      f90c8c7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      0d14bd3 [martinzapletal] SPARK-3278 changed Java api to match Scala api's (Double, Double, Double)
      3c2954b [martinzapletal] SPARK-3278 Isotonic regression java api
      45aa7e8 [martinzapletal] SPARK-3278 Isotonic regression java api
      e9b3323 [martinzapletal] Merge branch 'SPARK-3278-weightedLabeledPoint' into SPARK-3278
      823d803 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      941fd1f [martinzapletal] SPARK-3278 Isotonic regression java api
      a24e29f [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
      deb0f17 [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
      8cefd18 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278-weightedLabeledPoint
      cab5a46 [martinzapletal] SPARK-3278 PR 3519 refactoring WeightedLabeledPoint to tuple as per comments
      b8b1620 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
      34760d5 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
      089bf86 [martinzapletal] Removed MonotonicityConstraint, Isotonic and Antitonic constraints. Replced by simple boolean
      c06f88c [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      6046550 [martinzapletal] SPARK-3278 scalastyle errors resolved
      8f5daf9 [martinzapletal] SPARK-3278 added comments and cleaned up api to consistently handle weights
      629a1ce [martinzapletal] SPARK-3278 added isotonic regression for weighted data. Added tests for Java api
      05d9048 [martinzapletal] SPARK-3278 isotonic regression refactoring and api changes
      961aa05 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      3de71d0 [martinzapletal] SPARK-3278 added initial version of Isotonic regression algorithm including proposed API
      34250a61
    • Reynold Xin
      [SPARK-5307] Add a config option for SerializationDebugger. · 63640831
      Reynold Xin authored
      Just in case there is a bug in the SerializationDebugger that makes error reporting worse than it was.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4297 from rxin/ser-config and squashes the following commits:
      
      f1d4629 [Reynold Xin] [SPARK-5307] Add a config option for SerializationDebugger.
      63640831
    • kai
      [SQL] remove redundant field "childOutput" from execution.Aggregate, use child.output instead · f54c9f60
      kai authored
      Author: kai <kaizeng@eecs.berkeley.edu>
      
      Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits:
      
      78658ef [kai] remove redundant field "childOutput"
      f54c9f60
    • Reynold Xin
      [SPARK-5307] SerializationDebugger · 740a5686
      Reynold Xin authored
      This patch adds a SerializationDebugger that is used to add a serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path to the object that cannot be serialized, and constructs information to help the user find that object.
      
      The patch uses the internals of JVM serialization (in particular, heavy usage of ObjectStreamClass). Compared with an earlier attempt, this one provides extra information including field names, array offsets, writeExternal calls, etc.
      
      An example serialization stack:
      ```
      Serialization stack:
        - object not serializable (class: org.apache.spark.serializer.NotSerializable, value: org.apache.spark.serializer.NotSerializable2c43caa4)
        - element of array (index: 0)
        - array (class [Ljava.lang.Object;, size 1)
        - field (class: org.apache.spark.serializer.SerializableArray, name: arrayField, type: class [Ljava.lang.Object;)
        - object (class org.apache.spark.serializer.SerializableArray, org.apache.spark.serializer.SerializableArray193c5908)
        - writeExternal data
        - externalizable object (class org.apache.spark.serializer.ExternalizableClass, org.apache.spark.serializer.ExternalizableClass320bdadc)
      ```
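The same visit-the-object-graph idea can be sketched for Python's pickle; this is an illustrative analogue of the technique, not the JVM implementation:

```python
import pickle

def find_unserializable(obj, path="root"):
    """Return a path string to the first sub-object that cannot be
    pickled, or None if the whole object serializes fine."""
    try:
        pickle.dumps(obj)
        return None
    except Exception:
        pass
    # This object fails: descend into its children to localize the culprit.
    if isinstance(obj, (list, tuple)):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(obj)]
    elif isinstance(obj, dict):
        children = [(f"{path}[{k!r}]", v) for k, v in obj.items()]
    elif hasattr(obj, "__dict__"):
        children = [(f"{path}.{k}", v) for k, v in vars(obj).items()]
    else:
        children = []
    for child_path, v in children:
        found = find_unserializable(v, child_path)
        if found:
            return found
    return path  # no single child is at fault, so this object itself is
```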
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4098 from rxin/SerializationDebugger and squashes the following commits:
      
      553b3ff [Reynold Xin] Update SerializationDebuggerSuite.scala
      572d0cb [Reynold Xin] Disable automatically when reflection fails.
      b349b77 [Reynold Xin] [SPARK-5307] SerializationDebugger to help debug NotSerializableException - take 2
      740a5686
  3. Jan 30, 2015
    • Joseph K. Bradley
      [SPARK-5504] [sql] convertToCatalyst should support nested arrays · e643de42
      Joseph K. Bradley authored
      After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should.
      
      The test suite modification made the test fail before the fix in ScalaReflection.  The fix makes the test suite succeed.
      
      CC: marmbrus
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits:
      
      6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.
      e643de42
    • Travis Galoppo
      SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture · 98697734
      Travis Galoppo authored
      Decoupling the model and the algorithm
      
      Author: Travis Galoppo <tjg2107@columbia.edu>
      
      Closes #4290 from tgaloppo/spark-5400 and squashes the following commits:
      
      9c1534c [Travis Galoppo] Fixed invokation instructions in comments
      d848076 [Travis Galoppo] SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
      98697734
    • sboeschhuawei
      [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function · f377431a
      sboeschhuawei authored
      Add single pseudo-eigenvector PIC,
      including documentation and an updated pom.xml, with the following new files:
      mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
      mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala
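The core of PIC, truncated power iteration on the row-normalized affinity matrix, whose intermediate vector is then clustered (k-means step omitted here), can be sketched with NumPy. A toy sketch, not the implementation in this PR:

```python
import numpy as np

def pic_embedding(affinity, iterations=20, seed=0):
    """Run truncated power iteration on W = D^-1 A. The iterate does not
    fully converge; its intermediate values expose cluster structure and
    would then be grouped with k-means."""
    A = np.asarray(affinity, dtype=float)
    W = A / A.sum(axis=1, keepdims=True)  # make each row sum to 1
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])
    for _ in range(iterations):
        v = W @ v
        v /= np.abs(v).max()  # keep the iterate's scale bounded
    return v
```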
      
      Author: sboeschhuawei <stephen.boesch@huawei.com>
      Author: Fan Jiang <fanjiang.sc@huawei.com>
      Author: Jiang Fan <fjiang6@gmail.com>
      Author: Stephen Boesch <stephen.boesch@huawei.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4254 from fjiang6/PIC and squashes the following commits:
      
      4550850 [sboeschhuawei] Removed pic test data
      f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
      4b78aaf [Xiangrui Meng] refactor PIC
      24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
      c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
      92d4752 [sboeschhuawei] Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the test suite
      7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
      121e4d5 [sboeschhuawei] Remove unused testing data files
      1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
      218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
      43ab10b [sboeschhuawei] Change last two println's to log4j logger
      88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
      24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
      060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
      be659e3 [sboeschhuawei] Added mllib specific log4j
      90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
      bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
      b29c0db [Fan Jiang] Update PIClustering.scala
      ace9749 [Fan Jiang] Update PIClustering.scala
      a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
      f656c34 [sboeschhuawei] Added iris dataset
      b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
      a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
      9294263 [sboeschhuawei] Added visualization/plotting of input/output data
      e5df2b8 [sboeschhuawei] First end to end working PIC
      0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
      32a90dc [sboeschhuawei] Update circles test data values
      0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
      3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
      d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
      a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
      f377431a
    • Burak Yavuz's avatar
      [SPARK-5486] Added validate method to BlockMatrix · 6ee8338b
      Burak Yavuz authored
      The `validate` method will allow users to debug their `BlockMatrix`, if operations like `add` or `multiply` return unexpected results. It checks the following properties in a `BlockMatrix`:
      - Are the dimensions of the `BlockMatrix` consistent with what the user entered (`nRows`, `nCols`)?
      - Are the dimensions of each `MatrixBlock` consistent with what the user entered (`rowsPerBlock`, `colsPerBlock`)?
      - Are there blocks with duplicate indices?
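The third check above can be sketched in plain Python. This is a hypothetical illustration of the duplicate-index logic, using an in-memory list in place of the RDD of blocks backing a real `BlockMatrix`; the function name is invented for the example.

```python
# Hypothetical sketch of the duplicate-index check, with a plain list
# standing in for the RDD-backed blocks of a real BlockMatrix.
from collections import Counter

def has_duplicate_indices(blocks):
    """blocks: list of ((block_row, block_col), values) pairs."""
    counts = Counter(index for index, _ in blocks)
    # Any (row, col) pair seen more than once means two blocks claim
    # the same position in the block grid.
    return any(n > 1 for n in counts.values())
```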
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #4279 from brkyvz/SPARK-5486 and squashes the following commits:
      
      c152a73 [Burak Yavuz] addressed code review v2
      598c583 [Burak Yavuz] merged master
      b55ac5c [Burak Yavuz] addressed code review v1
      25f083b [Burak Yavuz] simplify implementation
      0aa519a [Burak Yavuz] [SPARK-5486] Added validate method to BlockMatrix
      6ee8338b
    • Xiangrui Meng's avatar
      [SPARK-5496][MLLIB] Allow both classification and Classification in Algo for trees. · 0a95085f
      Xiangrui Meng authored
      This keeps the `Algo` parsing backward compatible.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4287 from mengxr/SPARK-5496 and squashes the following commits:
      
      a025c53 [Xiangrui Meng] Allow both classification and Classification in Algo for trees.
      0a95085f
    • Joseph J.C. Tang's avatar
      [MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount · 54d95758
      Joseph J.C. Tang authored
      When vocabSize\*vectorSize is larger than Int.MaxValue/8, we now throw a RuntimeException, because under this circumstance allocating memory to serialize the arrays syn0Global and syn1Global would definitely trigger an OOM: both are float arrays, and serializing either one needs a byte array more than 8 times its element count.
      Also, if we catch an OOM even when vocabSize\*vectorSize is less than Int.MaxValue/8, we give users hints to increase the minCount or decrease the vectorSize.
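The bound described above is simple arithmetic and can be checked directly. This is an illustrative back-of-the-envelope helper, not Spark code; the function name is invented for the example.

```python
# Each of syn0Global and syn1Global holds vocab_size * vector_size
# floats, and serializing one needs a byte array more than 8x that
# element count -- which must stay below the JVM's Int.MaxValue.
INT_MAX = 2**31 - 1

def fits_in_byte_array(vocab_size, vector_size):
    # Python ints don't overflow, so the product is exact.
    return vocab_size * vector_size < INT_MAX // 8
```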
      
      Author: Joseph J.C. Tang <jinntrance@gmail.com>
      
      Closes #4247 from jinntrance/w2v-fix and squashes the following commits:
      
      b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount
      54d95758
    • Sandy Ryza's avatar
      SPARK-5393. Flood of util.RackResolver log messages after SPARK-1714 · 254eaa4d
      Sandy Ryza authored
      Previously I had tried to solve this by adding a line in Spark's log4j-defaults.properties.
      
      The issue with the message in log4j-defaults.properties was that the log4j.properties packaged inside Hadoop was getting picked up instead. While it would be ideal to fix that as well, we still want to quiet this in situations where a user supplies their own custom log4j properties.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4192 from sryza/sandy-spark-5393 and squashes the following commits:
      
      4d5dedc [Sandy Ryza] Only set log level if unset
      46e07c5 [Sandy Ryza] SPARK-5393. Flood of util.RackResolver log messages after SPARK-1714
      254eaa4d
    • Takuya UESHIN's avatar
      [SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct. · 6f21dce5
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits:
      
      3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct.
      faea19d [Takuya UESHIN] Use overload instead of default value for Java support.
      d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457
      663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
      6f21dce5
    • Kazuki Taniguchi's avatar
      [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees · bc1fc9b6
      Kazuki Taniguchi authored
      This PR is implementing the Gradient Boosted Trees for Python API.
      
      Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>
      
      Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:
      
      620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
      bc1fc9b6
  4. Jan 29, 2015
    • Burak Yavuz's avatar
      [SPARK-5322] Added transpose functionality to BlockMatrix · dd4d84cf
      Burak Yavuz authored
      BlockMatrices can now be transposed!
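The essence of a block-matrix transpose can be sketched with plain lists standing in for the RDD of blocks: swap each block's (row, col) index and transpose the block's local values. This is an illustrative sketch, not the PR's implementation.

```python
def transpose_blocks(blocks):
    """blocks: list of ((block_row, block_col), matrix) pairs, where
    matrix is a rectangular list of rows."""
    # zip(*m) transposes the local values; (i, j) -> (j, i) moves the
    # block to its mirrored position in the grid.
    return [((j, i), [list(col) for col in zip(*m)]) for (i, j), m in blocks]
```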
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #4275 from brkyvz/SPARK-5322 and squashes the following commits:
      
      33806ed [Burak Yavuz] added lazy comment
      33e9219 [Burak Yavuz] made transpose lazy
      5a274cd [Burak Yavuz] added cached tests
      5dcf85c [Burak Yavuz] [SPARK-5322] Added transpose functionality to BlockMatrix
      dd4d84cf
    • Reynold Xin's avatar
      [SQL] Support df("*") to select all columns in a data frame. · 80def9de
      Reynold Xin authored
      This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")).
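The shape of that refactoring can be sketched as a small class hierarchy. The type names follow the PR, but the members here are simplified stand-ins, not the actual Catalyst expression classes.

```python
# Star becomes a base type with two variants: one left for the analyzer
# to resolve, one already bound to concrete columns.
class Star:
    pass

class UnresolvedStar(Star):
    """Used for `*` and `tblName.*`."""
    def __init__(self, table=None):
        self.table = table

class ResolvedStar(Star):
    """Used for df("*"), where the columns are already known."""
    def __init__(self, columns):
        self.columns = columns
```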
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4283 from rxin/df-star and squashes the following commits:
      
      c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar.
      1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
      80def9de
    • Josh Rosen's avatar
      [SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply() · 22271f96
      Josh Rosen authored
      This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names.  This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits:
      
      b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().
      22271f96
    • Davies Liu's avatar
      [SPARK-5395] [PySpark] fix python process leak while coalesce() · 5c746eed
      Davies Liu authored
      Currently, the Python worker process is released back into the pool only after the task has finished, which causes many processes to be forked when coalesce() is called.
      
      This PR changes this to release the process as soon as all of the data has been read from it (i.e., once the partition is finished), so that a single process can be reused to handle multiple partitions within a task.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4238 from davies/py_leak and squashes the following commits:
      
      ec80a43 [Davies Liu] add @volatile
      6da437a [Davies Liu] address comments
      24ed322 [Davies Liu] fix python process leak while coalesce()
      5c746eed
    • Reynold Xin's avatar
      [SQL] DataFrame API improvements · ce9c43ba
      Reynold Xin authored
      1. Added Dsl.column in case Dsl.col is shadowed.
      2. Allow using String to specify the target data type in cast.
      3. Support sorting on multiple columns using column names.
      4. Added Java API test file.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4280 from rxin/dsl1 and squashes the following commits:
      
      33ecb7a [Reynold Xin] Add the Java test.
      d06540a [Reynold Xin] [SQL] DataFrame API improvements.
      ce9c43ba
    • Patrick Wendell's avatar
      Revert "[WIP] [SPARK-3996]: Shade Jetty in Spark deliverables" · d2071e8f
      Patrick Wendell authored
      This reverts commit f240fe39.
      d2071e8f
    • Yoshihiro Shimizu's avatar
      remove 'return' · 5338772f
      Yoshihiro Shimizu authored
      looks unnecessary :grinning:
      
      Author: Yoshihiro Shimizu <shimizu@amoad.com>
      
      Closes #4268 from y-shimizu/remove-return and squashes the following commits:
      
      12be0e9 [Yoshihiro Shimizu] remove 'return'
      5338772f
    • Patrick Wendell's avatar
      [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables · f240fe39
      Patrick Wendell authored
      This patch piggy-backs on vanzin's work to simplify the Guava shading,
      and adds Jetty as a shaded library in Spark. Other than adding Jetty,
      it consolidates the \<artifactSet\>'s into the root pom. I found it was
      a bit easier to follow that way, since you don't need to look into
      child pom's to find out the specific artifact sets included in the shading.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #4252 from pwendell/jetty and squashes the following commits:
      
      19f0710 [Patrick Wendell] More code review feedback
      961452d [Patrick Wendell] Responding to feedback from Marcello
      6df25ca [Patrick Wendell] [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables
      f240fe39
    • Josh Rosen's avatar
      [SPARK-5464] Fix help() for Python DataFrame instances · 0bb15f22
      Josh Rosen authored
      This fixes an exception that prevented users from calling `help()` on Python DataFrame instances.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4278 from JoshRosen/SPARK-5464-python-dataframe-help-command and squashes the following commits:
      
      08f95f7 [Josh Rosen] Fix exception when calling help() on Python DataFrame instances
      0bb15f22
    • Yin Huai's avatar
      [SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions · c00d517d
      Yin Huai authored
      I believe that SPARK-4296 has been fixed by 3684fd21. I am adding tests based on #3910 (changing the udf to a HiveUDF instead).
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits:
      
      6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin
      6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd21.
      d42b707 [Yin Huai] Update comment.
      8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block,     revert this change.
      443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
      c00d517d
    • wangfei's avatar
      [SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect result · c1b3eebf
      wangfei authored
      `select key, count( * ) from src group by key, 1`  will get the wrong answer.
      
      e.g. for this table
      ```
        val testData2 =
          TestSQLContext.sparkContext.parallelize(
            TestData2(1, 1) ::
            TestData2(1, 2) ::
            TestData2(2, 1) ::
            TestData2(2, 2) ::
            TestData2(3, 1) ::
            TestData2(3, 2) :: Nil, 2).toSchemaRDD
        testData2.registerTempTable("testData2")
      ```
      the (incorrect) result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1` is
      
      ```
                           [1,1]
                           [2,2]
                           [3,1]
      ```
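For comparison, the expected counts can be cross-checked with plain Python: grouping the six (a, b) rows by `a` alone (the literal 1 should not affect the grouping) yields two rows per key, not the counts shown above.

```python
from collections import Counter

# The same six rows as TestData2 above.
test_data2 = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]

# GROUP BY a: count rows per distinct value of the first column.
counts = Counter(a for a, _ in test_data2)
```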
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #4169 from scwf/agg-bug and squashes the following commits:
      
      05751db [wangfei] fix bugs when literal in agg grouping expressioons
      c1b3eebf
    • wangfei's avatar
      [SPARK-5367][SQL] Support star expression in udf · fbaf9e08
      wangfei authored
      Spark SQL currently does not support star expressions in UDFs; running the following SQL with spark-sql produces an error:
      ```
      select concat(*) from src
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #4163 from scwf/udf-star and squashes the following commits:
      
      9db7b39 [wangfei] addressed comments
      da1da09 [scwf] minor fix
      f87b5f9 [scwf] added test case
      587bf7e [wangfei] compile fix
      eb93c16 [wangfei] fix star resolve issue in udf
      fbaf9e08