  1. Feb 01, 2015
    • [SPARK-5155] Build fails with spark-ganglia-lgpl profile · c80194b3
      Kousuke Saruta authored
      The build currently fails with the spark-ganglia-lgpl profile because the pom.xml for spark-ganglia-lgpl has not been updated.
      
      This PR is related to #4218, #4209 and #3812.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #4303 from sarutak/fix-ganglia-pom-for-metric and squashes the following commits:
      
      5cf455f [Kousuke Saruta] Fixed pom.xml for ganglia in order to use io.dropwizard.metrics
    • [Minor][SQL] Little refactor DataFrame related codes · ef89b82d
      Liang-Chi Hsieh authored
      Simplify some code related to DataFrame.
      
      *  Call `toAttributes` instead of a `map`.
      *  The original `createDataFrame` creates the `StructType` and its attributes in a redundant way. It is refactored to create the `StructType` and call `toAttributes` on it directly.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4298 from viirya/refactor_df and squashes the following commits:
      
      1d61c64 [Liang-Chi Hsieh] Revert it.
      f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame.
      2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
    • [SPARK-4859][Core][Streaming] Refactor LiveListenerBus and StreamingListenerBus · 883bc88d
      zsxwing authored
      This PR refactors LiveListenerBus and StreamingListenerBus and extracts the common code into a parent class `ListenerBus`.
      
      It also includes the bug fixes from #3710:
      1. Fix the race condition of queueFullErrorMessageLogged in LiveListenerBus and StreamingListenerBus to avoid outputting `queue-full-error` logs multiple times.
      2. Make sure the SHUTDOWN message will be delivered to listenerThread, so that listenerThread will always be able to exit.
      3. Log the error from listener rather than crashing listenerThread in StreamingListenerBus.
      
      While fixing the above bugs, we found it better to make LiveListenerBus and StreamingListenerBus have the same behaviors. That, however, leaves a lot of duplicated code in LiveListenerBus and StreamingListenerBus.
      
      Therefore, I extracted their common code into `ListenerBus` as a parent class: LiveListenerBus and StreamingListenerBus only need to extend `ListenerBus` and implement `onPostEvent` (how to process an event) and `onDropEvent` (do something when dropping an event).
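
The shape of this refactoring can be sketched roughly as follows. This is a Python illustration only, not Spark's Scala code: the class names `ListenerBusSketch` and `CollectingBus`, the queue capacity, and the sentinel-based shutdown are assumptions made for the example. Only the hook names `on_post_event` and `on_drop_event` mirror `onPostEvent` and `onDropEvent` from the description above.

```python
import queue
import threading


class ListenerBusSketch:
    """Hypothetical parent class: owns the queue, the thread and the dispatch loop."""

    _SHUTDOWN = object()  # sentinel telling the listener thread to exit

    def __init__(self, capacity=2):
        self._queue = queue.Queue(maxsize=capacity)
        self._listeners = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def add_listener(self, listener):
        self._listeners.append(listener)

    def start(self):
        self._thread.start()

    def post(self, event):
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.on_drop_event(event)

    def stop(self):
        # A blocking put waits for room, so the sentinel is always delivered.
        self._queue.put(self._SHUTDOWN)
        self._thread.join()

    def _run(self):
        while True:
            event = self._queue.get()
            if event is self._SHUTDOWN:
                return
            for listener in self._listeners:
                try:
                    self.on_post_event(listener, event)
                except Exception as exc:
                    # Log the failure and keep the thread alive.
                    print(f"listener failed: {exc!r}")

    def on_post_event(self, listener, event):
        raise NotImplementedError

    def on_drop_event(self, event):
        raise NotImplementedError


class CollectingBus(ListenerBusSketch):
    """Hypothetical subclass: supplies only the two hooks."""

    def __init__(self, capacity=2):
        super().__init__(capacity)
        self.dropped = []
        self.warnings = 0
        self._warned = False
        self._lock = threading.Lock()

    def on_post_event(self, listener, event):
        listener(event)

    def on_drop_event(self, event):
        self.dropped.append(event)
        # Flip the flag under a lock so the warning is emitted once only.
        with self._lock:
            if not self._warned:
                self._warned = True
                self.warnings += 1
```

With a capacity of 2 and four events posted before the thread starts, the first two events reach the listener and the other two go through `on_drop_event`, which records a single warning.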
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4006 from zsxwing/SPARK-4859-refactor and squashes the following commits:
      
      c8dade2 [zsxwing] Fix the code style after renaming
      5715061 [zsxwing] Rename ListenerHelper to ListenerBus and the original ListenerBus to AsynchronousListenerBus
      f0ef647 [zsxwing] Fix the code style
      4e85ffc [zsxwing] Merge branch 'master' into SPARK-4859-refactor
      d2ef990 [zsxwing] Add private[spark]
      4539f91 [zsxwing] Remove final to pass MiMa tests
      a9dccd3 [zsxwing] Remove SparkListenerShutdown
      7cc04c3 [zsxwing] Refactor LiveListenerBus and StreamingListenerBus and make them share same code base
    • [SPARK-5424][MLLIB] make the new ALS impl take generic ID types · 4a171225
      Xiangrui Meng authored
      This PR makes the ALS implementation take generic ID types, e.g., Long and String, and exposes it as a developer API.
      
      TODO:
      - [x] make sure that specialization works (validated in profiler)
      
      srowen You may like this change:) I hit a Scala compiler bug with specialization. It compiles now but users and items must have the same type. I'm going to check whether specialization really works.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4281 from mengxr/generic-als and squashes the following commits:
      
      96072c3 [Xiangrui Meng] merge master
      135f741 [Xiangrui Meng] minor update
      c2db5e5 [Xiangrui Meng] make test pass
      86588e1 [Xiangrui Meng] use a single ID type for both users and items
      74f1f73 [Xiangrui Meng] compile but runtime error at test
      e36469a [Xiangrui Meng] add classtags and make it compile
      7a5aeb3 [Xiangrui Meng] UserType -> User, ItemType -> Item
      c8ee0bc [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into generic-als
      72b5006 [Xiangrui Meng] remove generic from pipeline interface
      8bbaea0 [Xiangrui Meng] make ALS take generic IDs
    • [SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use · bdb0680d
      Octavian Geagla authored
      This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.
      
      Author: Octavian Geagla <ogeagla@gmail.com>
      
      Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:
      
      fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
      9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
      997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
      64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
    • [SPARK-5422] Add support for sending Graphite metrics via UDP · 80bd715a
      Ryan Williams authored
      Depends on [SPARK-5413](https://issues.apache.org/jira/browse/SPARK-5413) / #4209, which is included here; I will rebase once the latter is merged.
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #4218 from ryan-williams/udp and squashes the following commits:
      
      ebae393 [Ryan Williams] Add support for sending Graphite metrics via UDP
      cb58262 [Ryan Williams] bump metrics dependency to v3.1.0
  2. Jan 31, 2015
    • SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` doesn't work with Java 8 · c84d5a10
      Sean Owen authored
      These are more `javadoc` 8-related changes I spotted while investigating. These should be helpful in any event, but this does not nearly resolve SPARK-3359, which may never be feasible while using `unidoc` and `javadoc` 8.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4193 from srowen/SPARK-3359 and squashes the following commits:
      
      5b33f66 [Sean Owen] Additional scaladoc fixes for javadoc 8; still not going to be javadoc 8 compatible
    • [SPARK-3975] Added support for BlockMatrix addition and multiplication · ef8974b1
      Burak Yavuz authored
      Support for multiplying and adding large distributed matrices!
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <brkyvz@dn51t42l.sunet>
      Author: Burak Yavuz <brkyvz@dn51t4rd.sunet>
      Author: Burak Yavuz <brkyvz@dn0a221430.sunet>
      Author: Burak Yavuz <brkyvz@dn0a22b17d.sunet>
      
      Closes #4274 from brkyvz/SPARK-3975PR2 and squashes the following commits:
      
      17abd59 [Burak Yavuz] added indices to error message
      ac25783 [Burak Yavuz] merged masyer
      b66fd8b [Burak Yavuz] merged masyer
      e39baff [Burak Yavuz] addressed code review v1
      2dba642 [Burak Yavuz] [SPARK-3975] Added support for BlockMatrix addition and multiplication
      fb7624b [Burak Yavuz] merged master
      98c58ea [Burak Yavuz] added tests
      cdeb5df [Burak Yavuz] before adding tests
      c9bf247 [Burak Yavuz] fixed merge conflicts
      1cb0d06 [Burak Yavuz] [SPARK-3976] Added doc
      f92a916 [Burak Yavuz] merge upstream
      1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required
      1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist
      e3d24c3 [Burak Yavuz] [SPARK-3976] Pulled upstream changes
      fa3774f [Burak Yavuz] [SPARK-3976] updated matrix multiplication and addition implementation
      239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments
      add7b05 [Burak Yavuz] [SPARK-3976] Updated code according to upstream changes
      e29acfd [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3976
      3127233 [Burak Yavuz] fixed merge conflicts with upstream
      ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm
      ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust
      9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable
      d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner
      8e954ab [Burak Yavuz] save changes
      bbeae8c [Burak Yavuz] merged master
      987ea53 [Burak Yavuz] merged master
      49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master
      645afbe [Burak Yavuz] [SPARK-3974] Pull latest master
      beb1edd [Burak Yavuz] merge conflicts fixed
      f41d8db [Burak Yavuz] update tests
      b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes
      56b0546 [Burak Yavuz] updates from 3974 PR
      b7b8a8f [Burak Yavuz] pull updates from master
      b2dec63 [Burak Yavuz] Pull changes from 3974
      19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol
      5f062e6 [Burak Yavuz] updates with 3974
      6729fbd [Burak Yavuz] Updated with respect to SPARK-3974 PR
      589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed
      63a4858 [Burak Yavuz] added grid multiplication
      aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added
      7381b99 [Burak Yavuz] merge with PR1
      f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready
      b693209 [Burak Yavuz] Ready for Pull request
    • [MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm · 34250a61
      martinzapletal authored
      This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.
      
      The Isotonic regression problem is sufficiently described in [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false), [Wikipedia](http://en.wikipedia.org/wiki/Isotonic_regression) or [Stat Wiki](http://stat.wikia.com/wiki/Isotonic_regression).
      
      Pool adjacent violators was introduced by M. Ayer et al. in 1955. The history and development of isotonic regression algorithms is covered in [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), and a list of available algorithms, including their complexity, is given in [Stout, Fastest Isotonic Regression Algorithms](http://web.eecs.umich.edu/~qstout/IsoRegAlg_140812.pdf).
      
      An approach to parallelize the computation of PAV was presented in [Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression](http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf).
      
      The implemented Pool adjacent violators algorithm is based on [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false) (Chapter Isotonic regression problems, p. 86) and [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), and is also nicely formulated in [Tibshirani, Hoefling, Tibshirani, Nearly-Isotonic Regression](http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf). The implementation itself is inspired by the R implementations [Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism](http://cran.r-project.org/web/packages/fdrtool/index.html) and [R Development Core Team, stats, 2009](https://github.com/lgautier/R-3-0-branch-alt/blob/master/src/library/stats/R/isoreg.R). I ran tests with both these libraries and confirmed that they yield the same results. More R implementations are referenced in the aforementioned [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper).
      The implementation was also inspired by and cross-checked against other implementations: [Ted Harding, 2007](https://stat.ethz.ch/pipermail/r-help/2007-March/127981.html), [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/_isotonic.pyx), [Andrew Tulloch, 2014, Julia](https://github.com/ajtulloch/Isotonic.jl/blob/master/src/pooled_pava.jl), [Andrew Tulloch, 2014, c++](https://gist.github.com/ajtulloch/9499872), described in [Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x](http://tullo.ch/articles/speeding-up-isotonic-regression/), [Fabian Pedregosa, 2012](https://gist.github.com/fabianp/3081831), [Sreangsu Acharyya. libpav](https://bitbucket.org/sreangsu/libpav/src/f744bc1b0fea257f0cacaead1c922eab201ba91b/src/pav.h?at=default) and [Gustav Larsson](https://gist.github.com/gustavla/9499068).
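
For readers unfamiliar with pool adjacent violators, a minimal sequential sketch in Python may help. It is not the parallel implementation this PR adds, and the function name `pava` is hypothetical. It assumes the inputs are already ordered by feature value and that all weights are positive.

```python
def pava(values, weights=None):
    """Return the non-decreasing weighted least-squares fit to `values`.

    Assumes `values` is already ordered by feature and weights are positive;
    `weights` defaults to 1.0 for every point.
    """
    if weights is None:
        weights = [1.0] * len(values)
    # Each block holds [weighted mean, total weight, number of points pooled].
    blocks = []
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, w2, n2 = blocks.pop()
            mean1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(mean1 * w1 + mean2 * w2) / total, total, n1 + n2])
    # Expand pooled blocks back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

For example, `pava([1, 3, 2, 4])` pools the violating pair `3, 2` into their mean and returns `[1.0, 2.5, 2.5, 4.0]`.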
      
      Author: martinzapletal <zapletal-martin@email.cz>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Martin Zapletal <zapletal-martin@email.cz>
      
      Closes #3519 from zapletal-martin/SPARK-3278 and squashes the following commits:
      
      5a54ea4 [Martin Zapletal] Merge pull request #2 from mengxr/isotonic-fix-java
      37ba24e [Xiangrui Meng] fix java tests
      e3c0e44 [martinzapletal] Merge remote-tracking branch 'origin/SPARK-3278' into SPARK-3278
      d8feb82 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      ded071c [Martin Zapletal] Merge pull request #1 from mengxr/SPARK-3278
      4dfe136 [Xiangrui Meng] add cache back
      0b35c15 [Xiangrui Meng] compress pools and update tests
      35d044e [Xiangrui Meng] update paraPAVA
      077606b [Xiangrui Meng] minor
      05422a8 [Xiangrui Meng] add unit test for model construction
      5925113 [Xiangrui Meng] Merge remote-tracking branch 'zapletal-martin/SPARK-3278' into SPARK-3278
      80c6681 [Xiangrui Meng] update IRModel
      3da56e5 [martinzapletal] SPARK-3278 fixed indentation error
      75eac55 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      88eb4e2 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Isotonic parameter removed from algorithm, defined behaviour for multiple data points with the same feature value, added tests to verify it
      e60a34f [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Styling and comment fixes.
      d93c8f9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Change to IsotonicRegression api. Isotonic parameter now follows api of other mllib algorithms
      1fff77d [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Java api changes, test refactoring, comments and citations, isotonic regression model validations, linear interpolation for predictions
      12151e6 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      7aca4cc [martinzapletal] SPARK-3278 comment spelling
      9ae9d53 [martinzapletal] SPARK-3278 changes after PR feedback https://github.com/apache/spark/pull/3519. Binary search used for isotonic regression model predictions
      fad4bf9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519
      ce0e30c [martinzapletal] SPARK-3278 readability refactoring
      f90c8c7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      0d14bd3 [martinzapletal] SPARK-3278 changed Java api to match Scala api's (Double, Double, Double)
      3c2954b [martinzapletal] SPARK-3278 Isotonic regression java api
      45aa7e8 [martinzapletal] SPARK-3278 Isotonic regression java api
      e9b3323 [martinzapletal] Merge branch 'SPARK-3278-weightedLabeledPoint' into SPARK-3278
      823d803 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      941fd1f [martinzapletal] SPARK-3278 Isotonic regression java api
      a24e29f [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
      deb0f17 [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
      8cefd18 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278-weightedLabeledPoint
      cab5a46 [martinzapletal] SPARK-3278 PR 3519 refactoring WeightedLabeledPoint to tuple as per comments
      b8b1620 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
      34760d5 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
      089bf86 [martinzapletal] Removed MonotonicityConstraint, Isotonic and Antitonic constraints. Replced by simple boolean
      c06f88c [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      6046550 [martinzapletal] SPARK-3278 scalastyle errors resolved
      8f5daf9 [martinzapletal] SPARK-3278 added comments and cleaned up api to consistently handle weights
      629a1ce [martinzapletal] SPARK-3278 added isotonic regression for weighted data. Added tests for Java api
      05d9048 [martinzapletal] SPARK-3278 isotonic regression refactoring and api changes
      961aa05 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      3de71d0 [martinzapletal] SPARK-3278 added initial version of Isotonic regression algorithm including proposed API
    • [SPARK-5307] Add a config option for SerializationDebugger. · 63640831
      Reynold Xin authored
      Just in case there is a bug in the SerializationDebugger that makes error reporting worse than it was.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4297 from rxin/ser-config and squashes the following commits:
      
      f1d4629 [Reynold Xin] [SPARK-5307] Add a config option for SerializationDebugger.
    • [SQL] remove redundant field "childOutput" from execution.Aggregate, use child.output instead · f54c9f60
      kai authored
      Author: kai <kaizeng@eecs.berkeley.edu>
      
      Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits:
      
      78658ef [kai] remove redundant field "childOutput"
    • [SPARK-5307] SerializationDebugger · 740a5686
      Reynold Xin authored
      This patch adds a SerializationDebugger that is used to add the serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find the object.
      
      The patch uses the internals of JVM serialization (in particular, heavy usage of ObjectStreamClass). Compared with an earlier attempt, this one provides extra information including field names, array offsets, writeExternal calls, etc.
      
      An example serialization stack:
      ```
      Serialization stack:
        - object not serializable (class: org.apache.spark.serializer.NotSerializable, value: org.apache.spark.serializer.NotSerializable@2c43caa4)
        - element of array (index: 0)
        - array (class [Ljava.lang.Object;, size 1)
        - field (class: org.apache.spark.serializer.SerializableArray, name: arrayField, type: class [Ljava.lang.Object;)
        - object (class org.apache.spark.serializer.SerializableArray, org.apache.spark.serializer.SerializableArray@193c5908)
        - writeExternal data
        - externalizable object (class org.apache.spark.serializer.ExternalizableClass, org.apache.spark.serializer.ExternalizableClass@320bdadc)
      ```
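
The idea of walking the object graph to locate the offending object can be illustrated with a rough Python analogy built on `pickle`. This is not how the patch works internally (it relies on JVM serialization internals such as ObjectStreamClass). The function name `find_unpicklable_path` and the label format are hypothetical.

```python
import pickle


def find_unpicklable_path(obj, path=None):
    """Return a list of steps leading to the first value pickle rejects, or None.

    Sketch only: cyclic object graphs are not handled and dict keys are not inspected.
    """
    path = path or []
    try:
        pickle.dumps(obj)
        return None  # this subtree serializes fine
    except Exception:
        pass
    # Descend into containers and instance attributes to narrow the failure down.
    if isinstance(obj, dict):
        children = [(f"key {k!r}", v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"index {i}", v) for i, v in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [(f"field {name}", v) for name, v in vars(obj).items()]
    else:
        children = []
    for label, child in children:
        step = f"{label} of {type(obj).__name__}"
        found = find_unpicklable_path(child, path + [step])
        if found is not None:
            return found
    # No child failed on its own, so this object is the culprit.
    return path + [f"object not serializable ({type(obj).__name__})"]
```

Calling it on an object whose `items` field is a list containing a `threading.Lock()` at index 1 yields a path that starts with `field items of ...`, continues with `index 1 of list`, and ends at the lock itself.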
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4098 from rxin/SerializationDebugger and squashes the following commits:
      
      553b3ff [Reynold Xin] Update SerializationDebuggerSuite.scala
      572d0cb [Reynold Xin] Disable automatically when reflection fails.
      b349b77 [Reynold Xin] [SPARK-5307] SerializationDebugger to help debug NotSerializableException - take 2
  3. Jan 30, 2015
    • [SPARK-5504] [sql] convertToCatalyst should support nested arrays · e643de42
      Joseph K. Bradley authored
      After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should.
      
      The test suite modification made the test fail before the fix in ScalaReflection.  The fix makes the test suite succeed.
      
      CC: marmbrus
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits:
      
      6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.
    • SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture · 98697734
      Travis Galoppo authored
      Decoupling the model and the algorithm
      
      Author: Travis Galoppo <tjg2107@columbia.edu>
      
      Closes #4290 from tgaloppo/spark-5400 and squashes the following commits:
      
      9c1534c [Travis Galoppo] Fixed invokation instructions in comments
      d848076 [Travis Galoppo] SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
    • [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function · f377431a
      sboeschhuawei authored
      Add single pseudo-eigenvector PIC.
      This includes documentation, an updated pom.xml, and the following source files:
      mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
      mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala
      
      Author: sboeschhuawei <stephen.boesch@huawei.com>
      Author: Fan Jiang <fanjiang.sc@huawei.com>
      Author: Jiang Fan <fjiang6@gmail.com>
      Author: Stephen Boesch <stephen.boesch@huawei.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4254 from fjiang6/PIC and squashes the following commits:
      
      4550850 [sboeschhuawei] Removed pic test data
      f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
      4b78aaf [Xiangrui Meng] refactor PIC
      24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
      c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
      92d4752 [sboeschhuawei] Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the test suite
      7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
      121e4d5 [sboeschhuawei] Remove unused testing data files
      1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
      218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
      43ab10b [sboeschhuawei] Change last two println's to log4j logger
      88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
      24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
      060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
      be659e3 [sboeschhuawei] Added mllib specific log4j
      90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
      bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
      b29c0db [Fan Jiang] Update PIClustering.scala
      ace9749 [Fan Jiang] Update PIClustering.scala
      a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
      f656c34 [sboeschhuawei] Added iris dataset
      b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
      a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
      9294263 [sboeschhuawei] Added visualization/plotting of input/output data
      e5df2b8 [sboeschhuawei] First end to end working PIC
      0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
      32a90dc [sboeschhuawei] Update circles test data values
      0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
      3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
      d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
      a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
    • [SPARK-5486] Added validate method to BlockMatrix · 6ee8338b
      Burak Yavuz authored
      The `validate` method will allow users to debug their `BlockMatrix`, if operations like `add` or `multiply` return unexpected results. It checks the following properties in a `BlockMatrix`:
      - Are the dimensions of the `BlockMatrix` consistent with what the user entered: (`nRows`, `nCols`)
      - Are the dimensions of each `MatrixBlock` consistent with what the user entered: (`rowsPerBlock`, `colsPerBlock`)
      - Are there blocks with duplicate indices
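
The three checks listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark implementation: the function name `validate_blocks` is hypothetical, blocks are modelled as `((block_row, block_col), list_of_rows)` pairs, and blocks on the last block-row or block-column are assumed to be allowed to be smaller than the nominal block size.

```python
def validate_blocks(blocks, n_rows, n_cols, rows_per_block, cols_per_block):
    """Raise ValueError if the blocks are inconsistent with the declared sizes."""
    seen = set()
    for (bi, bj), matrix in blocks:
        # Check 3: no two blocks may share the same index.
        if (bi, bj) in seen:
            raise ValueError(f"duplicate block index {(bi, bj)}")
        seen.add((bi, bj))
        # Blocks in the last block-row or block-column may be smaller than nominal.
        expected_rows = min(rows_per_block, n_rows - bi * rows_per_block)
        expected_cols = min(cols_per_block, n_cols - bj * cols_per_block)
        # Check 1: the block must lie inside the declared n_rows x n_cols matrix.
        if bi < 0 or bj < 0 or expected_rows <= 0 or expected_cols <= 0:
            raise ValueError(f"block {(bi, bj)} lies outside a {n_rows}x{n_cols} matrix")
        actual_rows = len(matrix)
        actual_cols = len(matrix[0]) if matrix else 0
        if any(len(row) != actual_cols for row in matrix):
            raise ValueError(f"block {(bi, bj)} has ragged rows")
        # Check 2: the block must have the size implied by the block dimensions.
        if (actual_rows, actual_cols) != (expected_rows, expected_cols):
            raise ValueError(
                f"block {(bi, bj)} is {actual_rows}x{actual_cols}, "
                f"expected {expected_rows}x{expected_cols}"
            )
    return True
```

A 3x3 matrix split into 2x2 blocks passes when its four blocks have sizes 2x2, 2x1, 1x2 and 1x1; adding a second block at index `(0, 0)` raises a `ValueError`.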
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #4279 from brkyvz/SPARK-5486 and squashes the following commits:
      
      c152a73 [Burak Yavuz] addressed code review v2
      598c583 [Burak Yavuz] merged master
      b55ac5c [Burak Yavuz] addressed code review v1
      25f083b [Burak Yavuz] simplify implementation
      0aa519a [Burak Yavuz] [SPARK-5486] Added validate method to BlockMatrix
    • [SPARK-5496][MLLIB] Allow both classification and Classification in Algo for trees. · 0a95085f
      Xiangrui Meng authored
      to be backward compatible.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4287 from mengxr/SPARK-5496 and squashes the following commits:
      
      a025c53 [Xiangrui Meng] Allow both classification and Classification in Algo for trees.
    • [MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount · 54d95758
      Joseph J.C. Tang authored
      When vocabSize\*vectorSize is larger than Int.MaxValue/8, we throw a RuntimeException, because under this circumstance allocating memory to serialize the arrays syn0Global and syn1Global would definitely throw an OOM. syn0Global and syn1Global are float arrays; serializing them should need a byte array of more than 8 times syn0Global's size.
      Also, if we catch an OOM even when vocabSize\*vectorSize is less than Int.MaxValue/8, we should hint that users increase the minCount or decrease the vectorSize.
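
The arithmetic behind the guard can be checked with a small Python sketch. The function name `check_word2vec_size` and the wording of the error message are hypothetical; only the threshold `Int.MaxValue/8` and the two suggested remedies come from the description above.

```python
INT_MAX = 2**31 - 1  # value of Int.MaxValue on the JVM


def check_word2vec_size(vocab_size, vector_size):
    """Return vocab_size * vector_size, or raise if it exceeds Int.MaxValue / 8."""
    limit = INT_MAX // 8  # 268435455
    cells = vocab_size * vector_size
    if cells > limit:
        raise RuntimeError(
            f"vocabSize * vectorSize = {cells} exceeds {limit}; "
            "increase minCount or decrease vectorSize"
        )
    return cells
```

With a vector size of 100, a vocabulary of 1,000,000 words stays under the limit (100,000,000 cells), while 3,000,000 words (300,000,000 cells) exceeds it.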
      
      Author: Joseph J.C. Tang <jinntrance@gmail.com>
      
      Closes #4247 from jinntrance/w2v-fix and squashes the following commits:
      
      b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount
    • SPARK-5393. Flood of util.RackResolver log messages after SPARK-1714 · 254eaa4d
      Sandy Ryza authored
      Previously I had tried to solve this by adding a line in Spark's log4j-defaults.properties.
      
      The issue with the message in log4j-defaults.properties was that the log4j.properties packaged inside Hadoop was getting picked up instead. While it would be ideal to fix that as well, we still want to quiet this in situations where a user supplies their own custom log4j properties.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4192 from sryza/sandy-spark-5393 and squashes the following commits:
      
      4d5dedc [Sandy Ryza] Only set log level if unset
      46e07c5 [Sandy Ryza] SPARK-5393. Flood of util.RackResolver log messages after SPARK-1714
    • [SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct. · 6f21dce5
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits:
      
      3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct.
      faea19d [Takuya UESHIN] Use overload instead of default value for Java support.
      d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457
      663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
    • [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees · bc1fc9b6
      Kazuki Taniguchi authored
      This PR implements the Python API for Gradient Boosted Trees.
      
      Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>
      
      Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:
      
      620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
  4. Jan 29, 2015
    • [SPARK-5322] Added transpose functionality to BlockMatrix · dd4d84cf
      Burak Yavuz authored
      BlockMatrices can now be transposed!
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #4275 from brkyvz/SPARK-5322 and squashes the following commits:
      
      33806ed [Burak Yavuz] added lazy comment
      33e9219 [Burak Yavuz] made transpose lazy
      5a274cd [Burak Yavuz] added cached tests
      5dcf85c [Burak Yavuz] [SPARK-5322] Added transpose functionality to BlockMatrix
    • [SQL] Support df("*") to select all columns in a data frame. · 80def9de
      Reynold Xin authored
      This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")).
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4283 from rxin/df-star and squashes the following commits:
      
      c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar.
      1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
    • [SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply() · 22271f96
      Josh Rosen authored
      This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names.  This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits:
      
      b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().
      22271f96
    • Davies Liu's avatar
      [SPARK-5395] [PySpark] fix python process leak while coalesce() · 5c746eed
      Davies Liu authored
      Currently, the Python process is released back into the pool only after the task has finished, which causes many processes to be forked if coalesce() is called.
      
      This PR releases the process as soon as all the data has been read from it (that is, when the partition is finished), so that a single process can be reused to process multiple partitions within a single task.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4238 from davies/py_leak and squashes the following commits:
      
      ec80a43 [Davies Liu] add @volatile
      6da437a [Davies Liu] address comments
      24ed322 [Davies Liu] fix python process leak while coalesce()
      5c746eed
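The effect of releasing a worker at the end of each partition, rather than at the end of the task, can be shown with a toy pool. This is a minimal sketch, not PySpark's worker management: `WorkerPool`, `run_task`, and the `release_early` flag are hypothetical names, and "forking" is reduced to incrementing a counter.

```python
class WorkerPool:
    """Toy pool that counts how many workers had to be created (forked)."""

    def __init__(self):
        self.idle = []
        self.forked = 0

    def acquire(self):
        # Reuse an idle worker if one exists; otherwise "fork" a new one.
        if self.idle:
            return self.idle.pop()
        self.forked += 1
        return f"worker-{self.forked}"

    def release(self, worker):
        self.idle.append(worker)


def run_task(pool, partitions, release_early):
    """Process every partition of one coalesced task and return the results."""
    held = []
    results = []
    for partition in partitions:
        worker = pool.acquire()
        results.append(sum(partition))  # stand-in for the work done by the worker
        if release_early:
            pool.release(worker)  # partition fully read: hand the worker back now
        else:
            held.append(worker)  # keep the worker until the whole task finishes
    for worker in held:
        pool.release(worker)
    return results


# One task that covers four coalesced partitions (hypothetical data).
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
late_pool, early_pool = WorkerPool(), WorkerPool()
late_results = run_task(late_pool, partitions, release_early=False)
early_results = run_task(early_pool, partitions, release_early=True)
```

In this sketch the late-release pool creates one worker per partition, while the early-release pool creates a single worker and reuses it for all four partitions.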
    • Reynold Xin's avatar
      [SQL] DataFrame API improvements · ce9c43ba
      Reynold Xin authored
      1. Added Dsl.column in case Dsl.col is shadowed.
      2. Allow using String to specify the target data type in cast.
      3. Support sorting on multiple columns using column names.
      4. Added Java API test file.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4280 from rxin/dsl1 and squashes the following commits:
      
      33ecb7a [Reynold Xin] Add the Java test.
      d06540a [Reynold Xin] [SQL] DataFrame API improvements.
      ce9c43ba
    • Patrick Wendell's avatar
      Revert "[WIP] [SPARK-3996]: Shade Jetty in Spark deliverables" · d2071e8f
      Patrick Wendell authored
      This reverts commit f240fe39.
      d2071e8f
    • Yoshihiro Shimizu's avatar
      remove 'return' · 5338772f
      Yoshihiro Shimizu authored
      looks unnecessary :grinning:
      
      Author: Yoshihiro Shimizu <shimizu@amoad.com>
      
      Closes #4268 from y-shimizu/remove-return and squashes the following commits:
      
      12be0e9 [Yoshihiro Shimizu] remove 'return'
      5338772f
    • Patrick Wendell's avatar
      [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables · f240fe39
      Patrick Wendell authored
      This patch piggybacks on vanzin's work to simplify the Guava shading,
      and adds Jetty as a shaded library in Spark. Other than adding Jetty,
      it consolidates the <artifactSet> elements into the root pom. I found it was
      a bit easier to follow that way, since you don't need to look into
      child poms to find out specific artifact sets included in shading.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #4252 from pwendell/jetty and squashes the following commits:
      
      19f0710 [Patrick Wendell] More code review feedback
      961452d [Patrick Wendell] Responding to feedback from Marcello
      6df25ca [Patrick Wendell] [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables
      f240fe39
    • Josh Rosen's avatar
      [SPARK-5464] Fix help() for Python DataFrame instances · 0bb15f22
      Josh Rosen authored
      This fixes an exception that prevented users from calling `help()` on Python DataFrame instances.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4278 from JoshRosen/SPARK-5464-python-dataframe-help-command and squashes the following commits:
      
      08f95f7 [Josh Rosen] Fix exception when calling help() on Python DataFrame instances
      0bb15f22
    • Yin Huai's avatar
      [SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions · c00d517d
      Yin Huai authored
      I believe that SPARK-4296 has been fixed by 3684fd21. I am adding tests based on #3910 (changing the UDF to a HiveUDF instead).
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits:
      
      6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin
      6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd21.
      d42b707 [Yin Huai] Update comment.
      8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block,     revert this change.
      443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
      c00d517d
    • wangfei's avatar
      [SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect result · c1b3eebf
      wangfei authored
      `select key, count( * ) from src group by key, 1` returns the wrong answer.
      
      For example, given this table:
      ```
        val testData2 =
          TestSQLContext.sparkContext.parallelize(
            TestData2(1, 1) ::
            TestData2(1, 2) ::
            TestData2(2, 1) ::
            TestData2(2, 2) ::
            TestData2(3, 1) ::
            TestData2(3, 2) :: Nil, 2).toSchemaRDD
        testData2.registerTempTable("testData2")
      ```
      the query `SELECT a, count(1) FROM testData2 GROUP BY a, 1` returns the following incorrect result (every count should be 2):
      
      ```
                           [1,1]
                           [2,2]
                           [3,1]
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #4169 from scwf/agg-bug and squashes the following commits:
      
      05751db [wangfei] fix bugs when literal in agg grouping expressioons
      c1b3eebf
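Why a constant in the grouping expressions should not change the result can be checked in plain Python: adding the literal `1` to every grouping key leaves the groups themselves unchanged. This is a sketch using the six rows from the SPARK-5373 example above; the `group_count` helper is hypothetical and does not reflect how Spark SQL evaluates aggregates.

```python
from collections import Counter

# The (a, b) rows of testData2 from the example above.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]


def group_count(rows, key):
    """Count rows per grouping key, sorted by key for a stable result."""
    counts = Counter(key(a, b) for a, b in rows)
    return sorted(counts.items())


# GROUP BY a
by_a = group_count(rows, lambda a, b: (a,))
# GROUP BY a, 1 (the literal is identical for every row)
by_a_and_literal = group_count(rows, lambda a, b: (a, 1))

# Project to the [a, count] shape used in the example output.
expected = [[key[0], n] for key, n in by_a_and_literal]
```

Both groupings yield three groups of two rows each, so the expected output is a count of 2 for every value of `a`.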
    • wangfei's avatar
      [SPARK-5367][SQL] Support star expression in udf · fbaf9e08
      wangfei authored
      Spark SQL currently does not support star expressions in UDFs. Running the following SQL with spark-sql produces an error:
      ```
      select concat(*) from src
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #4163 from scwf/udf-star and squashes the following commits:
      
      9db7b39 [wangfei] addressed comments
      da1da09 [scwf] minor fix
      f87b5f9 [scwf] added test case
      587bf7e [wangfei] compile fix
      eb93c16 [wangfei] fix star resolve issue in udf
      fbaf9e08
    • Yash Datta's avatar
      [SPARK-4786][SQL]: Parquet filter pushdown for castable types · de221ea0
      Yash Datta authored
      Enable Parquet filter pushdown for castable types such as short and byte, which can be cast to integer.
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #4156 from saucam/filter_short and squashes the following commits:
      
      a403979 [Yash Datta] SPARK-4786: Fix styling issues
      d029866 [Yash Datta] SPARK-4786: Add test case
      cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types
      de221ea0
    • Michael Davies's avatar
      [SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for Strings. · 940f3756
      Michael Davies authored
      
      Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding.
      
      The Spark PrimitiveConverter was not using that API, and consequently, for String columns that used dictionary compression, it repeated Binary-to-String conversions for the same String.
      
      In measurements this could account for over 25% of entire query time.
      For example, a 500M-row table split across 16 blocks was aggregated and summed in a little under 30s before this change and a little under 20s after the change.
      
      Author: Michael Davies <Michael.BellDavies@gmail.com>
      
      Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits:
      
      327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
      33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
      940f3756
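The saving described in SPARK-5309 comes from decoding each dictionary entry once and then serving values by dictionary id. The sketch below illustrates that idea only. `DictionaryStringConverter` and its method names are hypothetical Python stand-ins loosely modelled on a converter with dictionary support, not the Parquet or Spark API.

```python
class DictionaryStringConverter:
    """Toy converter that decodes each dictionary entry once, then looks up by id."""

    def __init__(self):
        self.decoded = None
        self.decode_calls = 0  # counts binary-to-string conversions

    def _decode(self, raw):
        self.decode_calls += 1
        return raw.decode("utf-8")

    def set_dictionary(self, dictionary):
        # Decode every distinct dictionary entry exactly once, up front.
        self.decoded = [self._decode(raw) for raw in dictionary]

    def add_value_from_dictionary(self, dictionary_id):
        # No decoding here: reuse the string produced in set_dictionary.
        return self.decoded[dictionary_id]

    def add_binary(self, raw):
        # Fallback path for values that are not dictionary-encoded.
        return self._decode(raw)


# Hypothetical column: two distinct strings referenced six times.
dictionary = [b"spark", b"sql"]
ids = [0, 1, 0, 0, 1, 0]

with_dictionary = DictionaryStringConverter()
with_dictionary.set_dictionary(dictionary)
values = [with_dictionary.add_value_from_dictionary(i) for i in ids]

without_dictionary = DictionaryStringConverter()
plain_values = [without_dictionary.add_binary(dictionary[i]) for i in ids]
```

Here the dictionary-aware path performs one conversion per distinct value, while the fallback path performs one conversion per row.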
    • Liang-Chi Hsieh's avatar
      [SPARK-5429][SQL] Use javaXML plan serialization for Hive golden answers on Hive 0.13.1 · bce0ba1f
      Liang-Chi Hsieh authored
      I found that running `HiveComparisonTest.createQueryTest` to generate Hive golden answer files on Hive 0.13.1 throws a KryoException. I am not sure if this can be reproduced by others. Since Hive 0.13.0, Kryo plan serialization has replaced javaXML as the default plan serialization format. This is a quick fix that sets the Hive configuration to use javaXML serialization.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4223 from viirya/fix_hivetest and squashes the following commits:
      
      97a8760 [Liang-Chi Hsieh] Use javaXML plan serialization.
      bce0ba1f
    • Reynold Xin's avatar
      [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods. · 71563223
      Reynold Xin authored
      It turns out Scala does generate static methods for the ones defined in a companion object, so there is finally no need to separate api.java.dsl and api.scala.dsl.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4276 from rxin/dsl and squashes the following commits:
      
      30aa611 [Reynold Xin] Add all files.
      1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
      71563223
    • Marcelo Vanzin's avatar
      [SPARK-5466] Add explicit guava dependencies where needed. · f9e56945
      Marcelo Vanzin authored
      One side-effect of shading guava is that it disappears as a transitive
      dependency. For Hadoop 2.x, this was masked by the fact that Hadoop
      itself depends on guava. But certain versions of Hadoop 1.x also
      shade guava, leaving either no guava or some random version pulled
      by another dependency on the classpath.
      
      So be explicit about the dependency in modules that use guava directly,
      which is the right thing to do anyway.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4272 from vanzin/SPARK-5466 and squashes the following commits:
      
      e3f30e5 [Marcelo Vanzin] Dependency for catalyst is not needed.
      d3b2c84 [Marcelo Vanzin] [SPARK-5466] Add explicit guava dependencies where needed.
      f9e56945
    • Xiangrui Meng's avatar
      [SPARK-5477] refactor stat.py · a3dc6184
      Xiangrui Meng authored
      There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change.
      
      davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4266 from mengxr/py-stat-refactor and squashes the following commits:
      
      1a5e1db [Xiangrui Meng] refactor stat.py
      a3dc6184
    • Reynold Xin's avatar
      [SQL] Various DataFrame DSL update. · 5ad78f62
      Reynold Xin authored
      1. Added foreach, foreachPartition, flatMap to DataFrame.
      2. Added col() in dsl.
      3. Support renaming columns in toDataFrame.
      4. Support type inference on arrays (in addition to Seq).
      5. Updated mllib to use the new DSL.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4260 from rxin/sql-dsl-update and squashes the following commits:
      
      73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve.
      fab3ccc [Reynold Xin] Bug fix.
      d31fcd2 [Reynold Xin] Style fix.
      62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.
      5ad78f62