  1. Feb 02, 2015
    • Jacek Lewandowski
      SPARK-5425: Use synchronised methods in system properties to create SparkConf · 5a552616
      Jacek Lewandowski authored
      SPARK-5425: Fixed usages of system properties
      
      This patch fixes a few problems caused by the fact that the Scala wrapper over system properties is not thread-safe, and is basically invalid because it doesn't take into account the default values which could have been set in the properties object. The problem is fixed by modifying the `Utils.getSystemProperties` method so that it uses the `stringPropertyNames` method of the `Properties` class, which is thread-safe (internally it creates a defensive copy in a synchronized method) and returns the keys of both the properties which were set explicitly and those which are defined as defaults.
      The other related problem, which is also fixed here, was in the `ResetSystemProperties` mix-in: it created a copy of the system properties in the wrong way.
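      For illustration, a minimal sketch of the thread-safe approach, assuming Scala's `JavaConverters`; this is a simplification, not the verbatim patch:
      ```scala
      // stringPropertyNames() snapshots the keys in a synchronized method and
      // includes keys defined only as defaults, so iterating over the snapshot
      // is safe even while other threads mutate the underlying Properties.
      import scala.collection.JavaConverters._

      def getSystemProperties: Map[String, String] =
        System.getProperties.stringPropertyNames().asScala
          .map(key => (key, System.getProperty(key)))
          .toMap
      ```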
      
      This patch also introduces a test case for the thread-safety of SparkConf creation.
      
      Refer to the discussion in https://github.com/apache/spark/pull/4220 for more details.
      
      Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
      
      Closes #4222 from jacek-lewandowski/SPARK-5425-1.3 and squashes the following commits:
      
      03da61b [Jacek Lewandowski] SPARK-5425: Modified Utils.getSystemProperties to return a map of all system properties - explicit + defaults
      8faf2ea [Jacek Lewandowski] SPARK-5425: Use SerializationUtils to save properties in ResetSystemProperties trait
      71aa572 [Jacek Lewandowski] SPARK-5425: Use synchronised methods in system properties to create SparkConf
      5a552616
    • Martin Weindel
      Disabling Utils.chmod700 for Windows · bff65b5c
      Martin Weindel authored
      This patch makes Spark 1.2.1rc2 work again on Windows.
      
      Without it, you get the following log output when creating a Spark context:
      INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
      ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in .... Ignoring this directory.
      ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any local dir.
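      For context, a hedged sketch of what a JVM-only chmod700 looks like and why it is skipped on Windows; the java.io.File setter chain below is an assumption, not the verbatim Utils code:
      ```scala
      import java.io.File

      // Restrict a directory to owner-only access using java.io.File setters.
      // These POSIX-style semantics don't map cleanly onto Windows ACLs, which
      // is why this patch moves the check to the caller and disables it there.
      def chmod700(file: File): Boolean = {
        file.setReadable(false, false) &&  // drop read for everybody...
        file.setReadable(true, true) &&    // ...then re-grant it to the owner only
        file.setWritable(false, false) &&
        file.setWritable(true, true) &&
        file.setExecutable(false, false) &&
        file.setExecutable(true, true)
      }

      val isWindows = System.getProperty("os.name").startsWith("Windows")
      ```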
      
      Author: Martin Weindel <martin.weindel@gmail.com>
      Author: mweindel <m.weindel@usu-software.de>
      
      Closes #4299 from MartinWeindel/branch-1.2 and squashes the following commits:
      
      535cb7f [Martin Weindel] fixed last commit
      f17072e [Martin Weindel] moved condition to caller to avoid confusion on chmod700() return value
      4de5e91 [Martin Weindel] reverted to unix line ends
      fe2740b [mweindel] moved comment
      ac4749c [mweindel] fixed chmod700 for Windows
      bff65b5c
    • Marcelo Vanzin
      Make sure only owner can read / write to directories created for the job. · 52f5754f
      Marcelo Vanzin authored
      
      Whenever a directory is created by the utility method, immediately restrict
      its permissions so that only the owner has access to its contents.
      
      Signed-off-by: Josh Rosen <joshrosen@databricks.com>
      52f5754f
    • Patrick Wendell
    • Iulian Dragos
      [SPARK-4631][streaming][FIX] Wait for a receiver to start before publishing test data. · e908322c
      Iulian Dragos authored
      This fixes two sources of non-deterministic failures in this test:
      
      - wait for a receiver to be up before pushing data through MQTT
      - gracefully handle the case where the MQTT client is overloaded. There’s
      a hard-coded limit of 10 in-flight messages, and this test may hit it.
      Instead of crashing, we retry sending the message.
      
      Both of these are needed to make the test pass reliably on my machine.
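      A hedged sketch of the retry idea, assuming the Eclipse Paho client used by the MQTT connector; `publishWithRetry` and its arguments are illustrative names:
      ```scala
      import org.eclipse.paho.client.mqttv3.{MqttClient, MqttException}

      // Retry a publish that fails only because the hard-coded in-flight
      // window (10 messages) is full, instead of crashing the test.
      def publishWithRetry(client: MqttClient, topic: String, payload: Array[Byte]): Unit = {
        var sent = false
        while (!sent) {
          try {
            client.getTopic(topic).publish(payload, 1, false) // QoS 1, not retained
            sent = true
          } catch {
            case e: MqttException
                if e.getReasonCode == MqttException.REASON_CODE_MAX_INFLIGHT =>
              Thread.sleep(50) // window full; back off briefly and retry
          }
        }
      }
      ```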
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #4270 from dragos/issue/fix-flaky-test-SPARK-4631 and squashes the following commits:
      
      f66c482 [Iulian Dragos] [SPARK-4631][streaming] Wait for a receiver to start before publishing test data.
      d408a8e [Iulian Dragos] Install callback before connecting to MQTT broker.
      e908322c
    • Liang-Chi Hsieh
      [SPARK-5212][SQL] Add support of schema-less, custom field delimiter and SerDe for HiveQL transform · 683e9382
      Liang-Chi Hsieh authored
      This PR adds support for schema-less syntax, a custom field delimiter, and SerDe for HiveQL's transform.
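      For illustration, hedged examples of the forms this enables; the table name `src` and the script `cat` are hypothetical, and the queries are issued through HiveContext.sql:
      ```scala
      import org.apache.spark.sql.hive.HiveContext

      val hiveContext: HiveContext = ??? // assumed to already exist in the application

      // Schema-less transform: output columns fall back to a default schema.
      hiveContext.sql("SELECT TRANSFORM (key, value) USING 'cat' FROM src")

      // Custom field delimiter via ROW FORMAT DELIMITED.
      hiveContext.sql(
        """SELECT TRANSFORM (key, value)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          |USING 'cat' AS (k, v)
          |FROM src""".stripMargin)
      ```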
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4014 from viirya/schema_less_trans and squashes the following commits:
      
      ac2d1fe [Liang-Chi Hsieh] Refactor codes for comments.
      a137933 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans
      aa10fbd [Liang-Chi Hsieh] Add Hive golden answer files again.
      575f695 [Liang-Chi Hsieh] Add Hive golden answer files for new unit tests.
      a422562 [Liang-Chi Hsieh] Use createQueryTest for unit tests and remove unnecessary imports.
      ccb71e3 [Liang-Chi Hsieh] Refactor codes for comments.
      37bd391 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans
      6000889 [Liang-Chi Hsieh] Wrap input and output schema into ScriptInputOutputSchema.
      21727f7 [Liang-Chi Hsieh] Move schema-less output to proper place. Use multilines instead of a long line SQL.
      9a6dc04 [Liang-Chi Hsieh] setRecordReaderID is introduced in 0.13.1, use reflection API to call it.
      7a14f31 [Liang-Chi Hsieh] Fix bug.
      799b5e1 [Liang-Chi Hsieh] Call getSerializedClass instead of using Text.
      be2c3fc [Liang-Chi Hsieh] Fix style.
      32d3046 [Liang-Chi Hsieh] Add SerDe support.
      ab22f7b [Liang-Chi Hsieh] Fix style.
      7a48e42 [Liang-Chi Hsieh] Add support of custom field delimiter.
      b1729d9 [Liang-Chi Hsieh] Fix style.
      ccee49e [Liang-Chi Hsieh] Add unit test.
      f561c37 [Liang-Chi Hsieh] Add support of schema-less script transformation.
      683e9382
    • Xutingjun
      [SPARK-5530] Add executor container to executorIdToContainer · 62a93a16
      Xutingjun authored
      When the killExecutor method is called, it always goes to the else branch, because no value is ever put into the executorIdToContainer map.
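      A self-contained illustration of the bug pattern, with simplified names rather than the actual YARN allocator code:
      ```scala
      import scala.collection.mutable

      val executorIdToContainer = mutable.HashMap.empty[String, String]

      def killExecutor(executorId: String): Unit =
        executorIdToContainer.get(executorId) match {
          case Some(container) => println(s"releasing $container")       // desired branch
          case None => println(s"no container recorded for $executorId") // always taken before this fix
        }

      killExecutor("1")                              // falls through: the map was never populated
      executorIdToContainer.put("1", "container_01") // the missing put, added when an executor launches
      killExecutor("1")                              // now releases the container
      ```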
      
      Author: Xutingjun <1039320815@qq.com>
      
      Closes #4309 from XuTingjun/dynamicAllocator and squashes the following commits:
      
      c823418 [Xutingjun] fix bugwq
      62a93a16
    • Nicholas Chammas
      [Docs] Fix Building Spark link text · 3f941b68
      Nicholas Chammas authored
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4312 from nchammas/patch-2 and squashes the following commits:
      
      9d943aa [Nicholas Chammas] [Docs] Fix Building Spark link text
      3f941b68
    • lianhuiwang
      [SPARK-5173]support python application running on yarn cluster mode · f5e63751
      lianhuiwang authored
      Currently, spark-submit does not support running a Python application in yarn-cluster mode, so I modified the submit code and YARN's AM in order to support it.
      By specifying a .py file or a primaryResource file via spark-submit, we can make PySpark run in yarn-cluster mode.
      Example: spark-submit --master yarn-master --num-executors 1 --driver-memory 1g --executor-memory 1g  xx.py --primaryResource yy.conf
      This configuration is the same as for PySpark in yarn-client mode.
      First, we put the local path of the .py file or primaryResource into YARN's dist.files, so that it can be distributed to the slave nodes. Then, in spark-submit, we pass --py-files and --primaryResource to yarn.Client and use "org.apache.spark.deploy.PythonRunner" as the user class, which can run .py files on the ApplicationMaster.
      In yarn.Client we pass --py-files and --primaryResource on to the ApplicationMaster.
      In the ApplicationMaster, the user's class is org.apache.spark.deploy.PythonRunner and the user's args are the primaryResource and --py-files, so PySpark can run on the ApplicationMaster.
      JoshRosen tgravescs sryza
      
      Author: lianhuiwang <lianhuiwang09@gmail.com>
      Author: Wang Lianhui <lianhuiwang09@gmail.com>
      
      Closes #3976 from lianhuiwang/SPARK-5173 and squashes the following commits:
      
      28a8a58 [lianhuiwang] fix variable name
      67f8cee [lianhuiwang] update with andrewor's comments
      0319ae3 [lianhuiwang] address with sryza's comments
      2385ef6 [lianhuiwang] address with sryza's comments
      03640ab [lianhuiwang] add sparkHome to env
      47d2fc3 [lianhuiwang] fix test
      2adc8f5 [lianhuiwang] add spark.test.home
      d60bc60 [lianhuiwang] fix test
      5b30064 [lianhuiwang] add test
      097a5ec [lianhuiwang] fix line length exceeds 100
      905a106 [lianhuiwang] update with sryza and andrewor 's comments
      f1f55b6 [lianhuiwang] when yarn-cluster, all python files can be non-local
      172eec1 [Wang Lianhui] fix a min submit's bug
      9c941bc [lianhuiwang] support python application running on yarn cluster mode
      f5e63751
    • Sandy Ryza
      SPARK-4585. Spark dynamic executor allocation should use minExecutors as... · b2047b55
      Sandy Ryza authored
      ... initial number
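      A hedged configuration sketch; `initialNumExecutors` is named in the commit below, while the other keys and values are illustrative:
      ```scala
      import org.apache.spark.SparkConf

      // With this change, the initial executor count starts at minExecutors
      // unless spark.dynamicAllocation.initialNumExecutors overrides it.
      val conf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20")
      ```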
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4051 from sryza/sandy-spark-4585 and squashes the following commits:
      
      d1dd039 [Sandy Ryza] Add spark.dynamicAllocation.initialNumExecutors and make min and max not required
      b7c59dc [Sandy Ryza] SPARK-4585. Spark dynamic executor allocation should use minExecutors as initial number
      b2047b55
    • Alexander Ulanov
      [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection · c081b21b
      Alexander Ulanov authored
      The following is implemented:
      1) generic traits for feature selection and filtering
      2) trait for feature selection of LabeledPoint with discrete data
      3) traits for calculation of contingency table and chi squared
      4) class for chi-squared feature selection
      5) tests for the above
      
      Needs some optimization in matrix operations.
      
      This request is an attempt to implement feature selection for MLlib; the previous work by the issue author izendejas was not finished (https://issues.apache.org/jira/browse/SPARK-1473). This request is also related to the data discretization issues https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216, which weren't merged.
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      
      Closes #1484 from avulanov/featureselection and squashes the following commits:
      
      755d358 [Alexander Ulanov] Addressing reviewers comments @mengxr
      a6ad82a [Alexander Ulanov] Addressing reviewers comments @mengxr
      714b878 [Alexander Ulanov] Addressing reviewers comments @mengxr
      010acff [Alexander Ulanov] Rebase
      427ca4e [Alexander Ulanov] Addressing reviewers comments: implement VectorTransformer interface, use Statistics.chiSqTest
      f9b070a [Alexander Ulanov] Adding Apache header in tests...
      80363ca [Alexander Ulanov] Tests, comments, apache headers and scala style
      150a3e0 [Alexander Ulanov] Scala style fix
      f356365 [Alexander Ulanov] Chi Squared by contingency table. Refactoring
      2bacdc7 [Alexander Ulanov] Combinations and chi-squared values test
      66e0333 [Alexander Ulanov] Feature selector, fix of lazyness
      aab9b73 [Alexander Ulanov] Feature selection redesign with vigdorchik
      e24eee4 [Alexander Ulanov] Traits for FeatureSelection, CombinationsCalculator and FeatureFilter
      ca49e80 [Alexander Ulanov] Feature selection filter
      2ade254 [Alexander Ulanov] Code style
      0bd8434 [Alexander Ulanov] Chi Squared feature selection: initial version
      c081b21b
    • Sandy Ryza
      SPARK-5492. Thread statistics can break with older Hadoop versions · 6f341310
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4305 from sryza/sandy-spark-5492 and squashes the following commits:
      
      b7d4497 [Sandy Ryza] SPARK-5492. Thread statistics can break with older Hadoop versions
      6f341310
    • jerryshao
      [SPARK-5478][UI][Minor] Add missing right parentheses · 63dfe21d
      jerryshao authored
      ![UI](https://dl.dropboxusercontent.com/u/19230832/Capture.PNG)
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4267 from jerryshao/SPARK-5478 and squashes the following commits:
      
      9fe51cc [jerryshao] Add missing right parentheses
      63dfe21d
  2. Feb 01, 2015
    • Tobias Schlatter
      [SPARK-5353] Log failures in REPL class loading · 9f0a6e18
      Tobias Schlatter authored
      Author: Tobias Schlatter <tobias@meisch.ch>
      
      Closes #4130 from gzm0/log-repl-loading and squashes the following commits:
      
      4fa0582 [Tobias Schlatter] Log failures in REPL class loading
      9f0a6e18
    • Patrick Wendell
      [SPARK-3996]: Shade Jetty in Spark deliverables · a15f6e31
      Patrick Wendell authored
      (v2 of this patch, with a fix that was only relevant for the Maven build).
      
      This patch piggy-backs on vanzin's work to simplify the Guava shading,
      and adds Jetty as a shaded library in Spark. Other than adding Jetty,
      it consolidates the <artifactSet>s into the root pom. I found it was
      a bit easier to follow that way, since you don't need to look into
      child poms to find out which specific artifact sets are included in shading.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #4285 from pwendell/jetty and squashes the following commits:
      
      d3e7f4e [Patrick Wendell] Fix for shaded deps causing compile errors
      19f0710 [Patrick Wendell] More code review feedback
      961452d [Patrick Wendell] Responding to feedback from Marcello
      6df25ca [Patrick Wendell] [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables
      a15f6e31
    • Jacky Li
      [SPARK-4001][MLlib] adding parallel FP-Growth algorithm for frequent pattern mining in MLlib · 859f7249
      Jacky Li authored
      Apriori is the classic algorithm for frequent item set mining in a transactional data set, and it would be useful to have it in MLlib. This PR adds an implementation of it.
      There is one point I am not sure about: in order to filter out the eligible frequent item sets, I am currently using a cartesian operation on two RDDs to calculate the degree of support of each item set, and I am not sure whether it would be better to use a broadcast variable to achieve the same.
      
      I will add an example of using this algorithm if required.
      
      Author: Jacky Li <jacky.likun@huawei.com>
      Author: Jacky Li <jackylk@users.noreply.github.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2847 from jackylk/apriori and squashes the following commits:
      
      bee3093 [Jacky Li] Merge pull request #1 from mengxr/SPARK-4001
      7e69725 [Xiangrui Meng] simplify FPTree and update FPGrowth
      ec21f7d [Jacky Li] fix scalastyle
      93f3280 [Jacky Li] create FPTree class
      d110ab2 [Jacky Li] change test case to use MLlibTestSparkContext
      a6c5081 [Jacky Li] Add Parallel FPGrowth algorithm
      eb3e4ca [Jacky Li] add FPGrowth
      03df2b6 [Jacky Li] refactory according to comments
      7b77ad7 [Jacky Li] fix scalastyle check
      f68a0bd [Jacky Li] add 2 apriori implemenation and fp-growth implementation
      889b33f [Jacky Li] modify per scalastyle check
      da2cba7 [Jacky Li] adding apriori algorithm for frequent item set mining in Spark
      859f7249
    • Yuhao Yang
      [Spark-5406][MLlib] LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound · d85cd4eb
      Yuhao Yang authored
      JIRA link: https://issues.apache.org/jira/browse/SPARK-5406
      
      The code in breeze's svd imposes the upper bound for LocalLAPACK in RowMatrix.computeSVD. From breeze's svd (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):
      ```scala
      val workSize = ( 3
        * scala.math.min(m, n)
        * scala.math.min(m, n)
        + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
          * scala.math.min(m, n) + 4 * scala.math.min(m, n))
      )
      val work = new Array[Double](workSize)
      ```
      
      As a result, we need at least 7 * n * n + 4 * n < Int.MaxValue (the exact bound depends on the JVM).
      
      In some worse cases, such as n = 25000, the work size overflows Int and becomes positive again (80032704), causing weird behavior.
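      To make the overflow concrete in the square case m = n, where the expression above reduces to 7 * n * n + 4 * n, a small hedged check:
      ```scala
      // For m = n: workSize = 3n^2 + max(n, 4n^2 + 4n) = 7n^2 + 4n (n >= 1).
      val n = 25000L
      val exact = 7 * n * n + 4 * n // 4375100000, well above Int.MaxValue = 2147483647
      val wrapped = exact.toInt     // Int arithmetic wraps around to a positive value
      println(s"exact=$exact wrapped=$wrapped")
      ```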
      
      This PR is only the beginning of supporting Genbase (an important biological benchmark that would help promote Spark for genetic applications, http://www.paradigm4.com/wp-content/uploads/2014/06/Genomics-Benchmark-Technical-Report.pdf),
      which needs to compute the SVD of matrices up to 60K * 70K. I found many potential issues and would like to know whether there is any plan under way to expand the range of matrix computation based on Spark.
      Thanks.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4200 from hhbyyh/rowMatrix and squashes the following commits:
      
      f7864d0 [Yuhao Yang] update auto logic for rowMatrix svd
      23860e4 [Yuhao Yang] fix comment style
      e48a6e4 [Yuhao Yang] make latent svd computation constraint clear
      d85cd4eb
    • Cheng Lian
      [SPARK-5465] [SQL] Fixes filter push-down for Parquet data source · ec100321
      Cheng Lian authored
      Not all Catalyst filter expressions can be converted to Parquet filter predicates. We should try to convert each individual predicate and then collect the convertible ones.
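      A hedged sketch of that collection strategy, with placeholder types rather than Spark's actual signatures:
      ```scala
      // Convert each predicate independently and keep only the successes,
      // instead of failing the whole conjunction when one predicate can't convert.
      def collectConvertible[Pred, ParquetFilter](
          predicates: Seq[Pred])(convert: Pred => Option[ParquetFilter]): Seq[ParquetFilter] =
        predicates.flatMap(convert(_))
      ```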
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4255 from liancheng/spark-5465 and squashes the following commits:
      
      14ccd37 [Cheng Lian] Fixes filter push-down for Parquet data source
      ec100321
    • Daoyuan Wang
      [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for... · 8cf4a1f0
      Daoyuan Wang authored
      [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for parameters of coalesce
      
      I'll add a test case in #4040
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4057 from adrian-wang/coal and squashes the following commits:
      
      4d0111a [Daoyuan Wang] address Yin's comments
      c393e18 [Daoyuan Wang] fix rebase conflicts
      e47c03a [Daoyuan Wang] add coalesce in parser
      c74828d [Daoyuan Wang] cast types for coalesce
      8cf4a1f0
    • OopsOutOfMemory
      [SPARK-5196][SQL] Support `comment` in Create Table Field DDL · 1b56f1d6
      OopsOutOfMemory authored
      Support `comment` when creating a table field.
      __CREATE TEMPORARY TABLE people(name string `comment` "the name of a person")__
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #3999 from OopsOutOfMemory/meta_comment and squashes the following commits:
      
      39150d4 [OopsOutOfMemory] add comment and refine test suite
      1b56f1d6
    • Masayoshi TSUZUKI
      [SPARK-1825] Make Windows Spark client work fine with Linux YARN cluster · 7712ed5b
      Masayoshi TSUZUKI authored
      Modified environment strings and path separators to platform-independent style if possible.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #3943 from tsudukim/feature/SPARK-1825 and squashes the following commits:
      
      ec4b865 [Masayoshi TSUZUKI] Rebased and modified as comments.
      f8a1d5a [Masayoshi TSUZUKI] Merge branch 'master' of github.com:tsudukim/spark into feature/SPARK-1825
      3d03d35 [Masayoshi TSUZUKI] [SPARK-1825] Make Windows Spark client work fine with Linux YARN cluster
      7712ed5b
    • Tom Panning
      [SPARK-5176] The thrift server does not support cluster mode · 1ca0a101
      Tom Panning authored
      Output an error message if the thrift server is started in cluster mode.
      
      Author: Tom Panning <tom.panning@nextcentury.com>
      
      Closes #4137 from tpanningnextcen/spark-5176-thrift-cluster-mode-error and squashes the following commits:
      
      f5c0509 [Tom Panning] [SPARK-5176] The thrift server does not support cluster mode
      1ca0a101
    • Kousuke Saruta
      [SPARK-5155] Build fails with spark-ganglia-lgpl profile · c80194b3
      Kousuke Saruta authored
      The build currently fails with the spark-ganglia-lgpl profile, because the pom.xml for spark-ganglia-lgpl has not been updated.
      
      This PR is related to #4218, #4209 and #3812.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #4303 from sarutak/fix-ganglia-pom-for-metric and squashes the following commits:
      
      5cf455f [Kousuke Saruta] Fixed pom.xml for ganglia in order to use io.dropwizard.metrics
      c80194b3
    • Liang-Chi Hsieh
      [Minor][SQL] Little refactor DataFrame related codes · ef89b82d
      Liang-Chi Hsieh authored
      Simplify some code related to DataFrame.
      
      *  Call `toAttributes` instead of a `map`.
      *  The original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create the `StructType` and call `toAttributes` on it directly.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4298 from viirya/refactor_df and squashes the following commits:
      
      1d61c64 [Liang-Chi Hsieh] Revert it.
      f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame.
      2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
      ef89b82d
    • zsxwing
      [SPARK-4859][Core][Streaming] Refactor LiveListenerBus and StreamingListenerBus · 883bc88d
      zsxwing authored
      This PR refactors LiveListenerBus and StreamingListenerBus and extracts their common code into a parent class, `ListenerBus`.
      
      It also includes the bug fixes from #3710:
      1. Fix the race condition on queueFullErrorMessageLogged in LiveListenerBus and StreamingListenerBus to avoid outputting `queue-full-error` logs multiple times.
      2. Make sure the SHUTDOWN message will be delivered to listenerThread, so that listenerThread will always be able to exit.
      3. Log errors from listeners rather than crashing listenerThread in StreamingListenerBus.
      
      While fixing the above bugs, we found it's better for LiveListenerBus and StreamingListenerBus to have the same behavior, which would otherwise leave a lot of duplicated code between them.
      
      Therefore, I extracted their common code into `ListenerBus` as a parent class: LiveListenerBus and StreamingListenerBus only need to extend `ListenerBus` and implement `onPostEvent` (how to process an event) and `onDropEvent` (what to do when dropping an event).
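      A minimal sketch of the extracted parent class, assuming simplified signatures; the real trait carries more (queueing, `onDropEvent`, proper logging):
      ```scala
      import java.util.concurrent.CopyOnWriteArrayList
      import scala.collection.JavaConverters._

      trait ListenerBus[L, E] {
        private val listeners = new CopyOnWriteArrayList[L]()

        def addListener(listener: L): Unit = listeners.add(listener)

        /** Post an event to every listener; log instead of crashing the thread. */
        final def postToAll(event: E): Unit = {
          listeners.asScala.foreach { listener =>
            try onPostEvent(listener, event)
            catch { case t: Throwable => println(s"Listener error: $t") } // real code logs
          }
        }

        /** Concrete buses decide how one event reaches one listener. */
        protected def onPostEvent(listener: L, event: E): Unit
      }
      ```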
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4006 from zsxwing/SPARK-4859-refactor and squashes the following commits:
      
      c8dade2 [zsxwing] Fix the code style after renaming
      5715061 [zsxwing] Rename ListenerHelper to ListenerBus and the original ListenerBus to AsynchronousListenerBus
      f0ef647 [zsxwing] Fix the code style
      4e85ffc [zsxwing] Merge branch 'master' into SPARK-4859-refactor
      d2ef990 [zsxwing] Add private[spark]
      4539f91 [zsxwing] Remove final to pass MiMa tests
      a9dccd3 [zsxwing] Remove SparkListenerShutdown
      7cc04c3 [zsxwing] Refactor LiveListenerBus and StreamingListenerBus and make them share same code base
      883bc88d
    • Xiangrui Meng
      [SPARK-5424][MLLIB] make the new ALS impl take generic ID types · 4a171225
      Xiangrui Meng authored
      This PR makes the ALS implementation take generic ID types, e.g., Long and String, and exposes it as a developer API.
      
      TODO:
      - [x] make sure that specialization works (validated in profiler)
      
      srowen You may like this change :) I hit a Scala compiler bug with specialization. It compiles now, but users and items must have the same type. I'm going to check whether specialization really works.
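      A hedged sketch of the shape of a generic-ID signature; names are illustrative, not the actual developer API:
      ```scala
      import scala.reflect.ClassTag
      import org.apache.spark.rdd.RDD

      // A single ID type parameter is shared by users and items, matching the
      // constraint noted above; the ClassTag is needed for typed RDDs/arrays.
      case class Rating[ID](user: ID, item: ID, rating: Float)

      def train[ID: ClassTag](ratings: RDD[Rating[ID]], rank: Int, maxIter: Int): Unit = {
        // factorization elided; the point is the generic, shared ID type
      }
      ```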
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4281 from mengxr/generic-als and squashes the following commits:
      
      96072c3 [Xiangrui Meng] merge master
      135f741 [Xiangrui Meng] minor update
      c2db5e5 [Xiangrui Meng] make test pass
      86588e1 [Xiangrui Meng] use a single ID type for both users and items
      74f1f73 [Xiangrui Meng] compile but runtime error at test
      e36469a [Xiangrui Meng] add classtags and make it compile
      7a5aeb3 [Xiangrui Meng] UserType -> User, ItemType -> Item
      c8ee0bc [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into generic-als
      72b5006 [Xiangrui Meng] remove generic from pipeline interface
      8bbaea0 [Xiangrui Meng] make ALS take generic IDs
      4a171225
    • Octavian Geagla
      [SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use · bdb0680d
      Octavian Geagla authored
      This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.
      
      Author: Octavian Geagla <ogeagla@gmail.com>
      
      Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:
      
      fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
      9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
      997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
      64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
      bdb0680d
    • Ryan Williams
      [SPARK-5422] Add support for sending Graphite metrics via UDP · 80bd715a
      Ryan Williams authored
      Depends on [SPARK-5413](https://issues.apache.org/jira/browse/SPARK-5413) / #4209, which is included here; I will rebase once the latter is merged.
      
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #4218 from ryan-williams/udp and squashes the following commits:
      
      ebae393 [Ryan Williams] Add support for sending Graphite metrics via UDP
      cb58262 [Ryan Williams] bump metrics dependency to v3.1.0
      80bd715a
  3. Jan 31, 2015
    • Sean Owen
      SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` doesn't work with Java 8 · c84d5a10
      Sean Owen authored
      These are more `javadoc` 8-related changes I spotted while investigating. These should be helpful in any event, but this does not nearly resolve SPARK-3359, which may never be feasible while using `unidoc` and `javadoc` 8.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4193 from srowen/SPARK-3359 and squashes the following commits:
      
      5b33f66 [Sean Owen] Additional scaladoc fixes for javadoc 8; still not going to be javadoc 8 compatible
      c84d5a10
    • Burak Yavuz
      [SPARK-3975] Added support for BlockMatrix addition and multiplication · ef8974b1
      Burak Yavuz authored
      Support for multiplying and adding large distributed matrices!
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <brkyvz@dn51t42l.sunet>
      Author: Burak Yavuz <brkyvz@dn51t4rd.sunet>
      Author: Burak Yavuz <brkyvz@dn0a221430.sunet>
      Author: Burak Yavuz <brkyvz@dn0a22b17d.sunet>
      
      Closes #4274 from brkyvz/SPARK-3975PR2 and squashes the following commits:
      
      17abd59 [Burak Yavuz] added indices to error message
      ac25783 [Burak Yavuz] merged masyer
      b66fd8b [Burak Yavuz] merged masyer
      e39baff [Burak Yavuz] addressed code review v1
      2dba642 [Burak Yavuz] [SPARK-3975] Added support for BlockMatrix addition and multiplication
      fb7624b [Burak Yavuz] merged master
      98c58ea [Burak Yavuz] added tests
      cdeb5df [Burak Yavuz] before adding tests
      c9bf247 [Burak Yavuz] fixed merge conflicts
      1cb0d06 [Burak Yavuz] [SPARK-3976] Added doc
      f92a916 [Burak Yavuz] merge upstream
      1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required
      1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist
      e3d24c3 [Burak Yavuz] [SPARK-3976] Pulled upstream changes
      fa3774f [Burak Yavuz] [SPARK-3976] updated matrix multiplication and addition implementation
      239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments
      add7b05 [Burak Yavuz] [SPARK-3976] Updated code according to upstream changes
      e29acfd [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3976
      3127233 [Burak Yavuz] fixed merge conflicts with upstream
      ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm
      ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust
      9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable
      d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner
      8e954ab [Burak Yavuz] save changes
      bbeae8c [Burak Yavuz] merged master
      987ea53 [Burak Yavuz] merged master
      49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master
      645afbe [Burak Yavuz] [SPARK-3974] Pull latest master
      beb1edd [Burak Yavuz] merge conflicts fixed
      f41d8db [Burak Yavuz] update tests
      b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes
      56b0546 [Burak Yavuz] updates from 3974 PR
      b7b8a8f [Burak Yavuz] pull updates from master
      b2dec63 [Burak Yavuz] Pull changes from 3974
      19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol
      5f062e6 [Burak Yavuz] updates with 3974
      6729fbd [Burak Yavuz] Updated with respect to SPARK-3974 PR
      589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed
      63a4858 [Burak Yavuz] added grid multiplication
      aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added
      7381b99 [Burak Yavuz] merge with PR1
      f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready
      b693209 [Burak Yavuz] Ready for Pull request
      ef8974b1
    • martinzapletal
      [MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm · 34250a61
      martinzapletal authored
      This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.
      
      The Isotonic regression problem is sufficiently described in [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false), [Wikipedia](http://en.wikipedia.org/wiki/Isotonic_regression) or [Stat Wiki](http://stat.wikia.com/wiki/Isotonic_regression).
      
      Pool adjacent violators was introduced by M. Ayer et al. in 1955. The history and development of isotonic regression algorithms are covered in [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), and the available algorithms, including their complexity, are listed in [Stout, Fastest Isotonic Regression Algorithms](http://web.eecs.umich.edu/~qstout/IsoRegAlg_140812.pdf).
      
      An approach to parallelize the computation of PAV was presented in [Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression](http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf).
      
      The implemented Pool adjacent violators algorithm is based on [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false) (chapter Isotonic regression problems, p. 86) and [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), and is also nicely formulated in [Tibshirani, Hoefling, Tibshirani, Nearly-Isotonic Regression](http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf). The implementation itself was inspired by the R implementations [Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism](http://cran.r-project.org/web/packages/fdrtool/index.html) and [R Development Core Team, stats, 2009](https://github.com/lgautier/R-3-0-branch-alt/blob/master/src/library/stats/R/isoreg.R). I ran tests with both these libraries and confirmed they yield the same results. More R implementations are referenced in the aforementioned [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper). The implementation was also inspired by, and cross-checked against, other implementations: [Ted Harding, 2007](https://stat.ethz.ch/pipermail/r-help/2007-March/127981.html), [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/_isotonic.pyx), [Andrew Tulloch, 2014, Julia](https://github.com/ajtulloch/Isotonic.jl/blob/master/src/pooled_pava.jl), [Andrew Tulloch, 2014, C++](https://gist.github.com/ajtulloch/9499872), described in [Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x](http://tullo.ch/articles/speeding-up-isotonic-regression/), [Fabian Pedregosa, 2012](https://gist.github.com/fabianp/3081831), [Sreangsu Acharyya, libpav](https://bitbucket.org/sreangsu/libpav/src/f744bc1b0fea257f0cacaead1c922eab201ba91b/src/pav.h?at=default) and [Gustav Larsson](https://gist.github.com/gustavla/9499068).
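      For readers who want the gist without the references, a compact, hedged sketch of sequential pool adjacent violators for an increasing fit; this is not the parallel MLlib implementation:
      ```scala
      // Merge adjacent "pools" while any pool's weighted mean drops below its
      // predecessor's, then expand each pool back over the positions it covers.
      def pav(y: Array[Double], w: Array[Double]): Array[Double] = {
        case class Pool(var sum: Double, var weight: Double, var size: Int) {
          def mean: Double = sum / weight
        }
        val pools = scala.collection.mutable.ArrayBuffer.empty[Pool]
        for (i <- y.indices) {
          pools += Pool(y(i) * w(i), w(i), 1)
          while (pools.length > 1 && pools(pools.length - 2).mean > pools.last.mean) {
            val last = pools.remove(pools.length - 1) // violation: merge into predecessor
            pools.last.sum += last.sum
            pools.last.weight += last.weight
            pools.last.size += last.size
          }
        }
        pools.flatMap(p => Seq.fill(p.size)(p.mean)).toArray
      }
      ```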
      
      Author: martinzapletal <zapletal-martin@email.cz>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Martin Zapletal <zapletal-martin@email.cz>
      
      Closes #3519 from zapletal-martin/SPARK-3278 and squashes the following commits:
      
      5a54ea4 [Martin Zapletal] Merge pull request #2 from mengxr/isotonic-fix-java
      37ba24e [Xiangrui Meng] fix java tests
      e3c0e44 [martinzapletal] Merge remote-tracking branch 'origin/SPARK-3278' into SPARK-3278
      d8feb82 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      ded071c [Martin Zapletal] Merge pull request #1 from mengxr/SPARK-3278
      4dfe136 [Xiangrui Meng] add cache back
      0b35c15 [Xiangrui Meng] compress pools and update tests
      35d044e [Xiangrui Meng] update paraPAVA
      077606b [Xiangrui Meng] minor
      05422a8 [Xiangrui Meng] add unit test for model construction
      5925113 [Xiangrui Meng] Merge remote-tracking branch 'zapletal-martin/SPARK-3278' into SPARK-3278
      80c6681 [Xiangrui Meng] update IRModel
      3da56e5 [martinzapletal] SPARK-3278 fixed indentation error
      75eac55 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      88eb4e2 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Isotonic parameter removed from algorithm, defined behaviour for multiple data points with the same feature value, added tests to verify it
      e60a34f [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Styling and comment fixes.
      d93c8f9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Change to IsotonicRegression api. Isotonic parameter now follows api of other mllib algorithms
      1fff77d [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Java api changes, test refactoring, comments and citations, isotonic regression model validations, linear interpolation for predictions
      12151e6 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      7aca4cc [martinzapletal] SPARK-3278 comment spelling
      9ae9d53 [martinzapletal] SPARK-3278 changes after PR feedback https://github.com/apache/spark/pull/3519. Binary search used for isotonic regression model predictions
      fad4bf9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519
      ce0e30c [martinzapletal] SPARK-3278 readability refactoring
      f90c8c7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      0d14bd3 [martinzapletal] SPARK-3278 changed Java api to match Scala api's (Double, Double, Double)
      3c2954b [martinzapletal] SPARK-3278 Isotonic regression java api
      45aa7e8 [martinzapletal] SPARK-3278 Isotonic regression java api
      e9b3323 [martinzapletal] Merge branch 'SPARK-3278-weightedLabeledPoint' into SPARK-3278
      823d803 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      941fd1f [martinzapletal] SPARK-3278 Isotonic regression java api
      a24e29f [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
      deb0f17 [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
      8cefd18 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278-weightedLabeledPoint
      cab5a46 [martinzapletal] SPARK-3278 PR 3519 refactoring WeightedLabeledPoint to tuple as per comments
      b8b1620 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
      34760d5 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
      089bf86 [martinzapletal] Removed MonotonicityConstraint, Isotonic and Antitonic constraints. Replced by simple boolean
      c06f88c [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      6046550 [martinzapletal] SPARK-3278 scalastyle errors resolved
      8f5daf9 [martinzapletal] SPARK-3278 added comments and cleaned up api to consistently handle weights
      629a1ce [martinzapletal] SPARK-3278 added isotonic regression for weighted data. Added tests for Java api
      05d9048 [martinzapletal] SPARK-3278 isotonic regression refactoring and api changes
      961aa05 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
      3de71d0 [martinzapletal] SPARK-3278 added initial version of Isotonic regression algorithm including proposed API
      34250a61
    • Reynold Xin
      [SPARK-5307] Add a config option for SerializationDebugger. · 63640831
      Reynold Xin authored
      Just in case there is a bug in the SerializationDebugger that makes error reporting worse than it was.
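      A hedged sketch of the escape hatch; the config key below is an assumption made for illustration, not taken from this commit:
      ```scala
      import org.apache.spark.SparkConf

      // Assumption: a boolean conf toggles the SerializationDebugger; the key
      // name "spark.serializer.extraDebugInfo" is our guess, not verified here.
      val conf = new SparkConf().set("spark.serializer.extraDebugInfo", "false")
      ```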
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4297 from rxin/ser-config and squashes the following commits:
      
      f1d4629 [Reynold Xin] [SPARK-5307] Add a config option for SerializationDebugger.
      63640831
    • kai
      [SQL] remove redundant field "childOutput" from execution.Aggregate, use child.output instead · f54c9f60
      kai authored
      Author: kai <kaizeng@eecs.berkeley.edu>
      
      Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits:
      
      78658ef [kai] remove redundant field "childOutput"
      f54c9f60
    • Reynold Xin
      [SPARK-5307] SerializationDebugger · 740a5686
      Reynold Xin authored
      This patch adds a SerializationDebugger that is used to add a serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help the user find that object.
      
      The patch uses the internals of JVM serialization (in particular, heavy usage of ObjectStreamClass). Compared with an earlier attempt, this one provides extra information including field names, array offsets, writeExternal calls, etc.
      
      An example serialization stack:
      ```
      Serialization stack:
        - object not serializable (class: org.apache.spark.serializer.NotSerializable, value: org.apache.spark.serializer.NotSerializable@2c43caa4)
        - element of array (index: 0)
        - array (class [Ljava.lang.Object;, size 1)
        - field (class: org.apache.spark.serializer.SerializableArray, name: arrayField, type: class [Ljava.lang.Object;)
        - object (class org.apache.spark.serializer.SerializableArray, org.apache.spark.serializer.SerializableArray@193c5908)
        - writeExternal data
        - externalizable object (class org.apache.spark.serializer.ExternalizableClass, org.apache.spark.serializer.ExternalizableClass@320bdadc)
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4098 from rxin/SerializationDebugger and squashes the following commits:
      
      553b3ff [Reynold Xin] Update SerializationDebuggerSuite.scala
      572d0cb [Reynold Xin] Disable automatically when reflection fails.
      b349b77 [Reynold Xin] [SPARK-5307] SerializationDebugger to help debug NotSerializableException - take 2
      740a5686
  4. Jan 30, 2015
    • Joseph K. Bradley
      [SPARK-5504] [sql] convertToCatalyst should support nested arrays · e643de42
      Joseph K. Bradley authored
      After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should.
      
      The test suite modification made the test fail before the fix in ScalaReflection.  The fix makes the test suite succeed.
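      A hedged sketch of the shape of the fix; the real method handles many more types:
      ```scala
      // Recurse into arrays (and sequences) so nested arrays are converted
      // element by element instead of being passed through unconverted.
      def convertToCatalyst(a: Any): Any = a match {
        case arr: Array[_] => arr.map(convertToCatalyst)
        case seq: Seq[_]   => seq.map(convertToCatalyst)
        case other         => other // primitives and everything else (simplified)
      }
      ```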
      
      CC: marmbrus
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits:
      
      6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.
      e643de42
    • Travis Galoppo
      SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture · 98697734
      Travis Galoppo authored
      Decoupling the model and the algorithm
      
      Author: Travis Galoppo <tjg2107@columbia.edu>
      
      Closes #4290 from tgaloppo/spark-5400 and squashes the following commits:
      
      9c1534c [Travis Galoppo] Fixed invokation instructions in comments
      d848076 [Travis Galoppo] SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
      98697734
    • sboeschhuawei
      [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function · f377431a
      sboeschhuawei authored
      Add single pseudo-eigenvector PIC, including documentation and an updated pom.xml, with the following files:
      mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
      mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala
      
      Author: sboeschhuawei <stephen.boesch@huawei.com>
      Author: Fan Jiang <fanjiang.sc@huawei.com>
      Author: Jiang Fan <fjiang6@gmail.com>
      Author: Stephen Boesch <stephen.boesch@huawei.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4254 from fjiang6/PIC and squashes the following commits:
      
      4550850 [sboeschhuawei] Removed pic test data
      f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
      4b78aaf [Xiangrui Meng] refactor PIC
      24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
      c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
      92d4752 [sboeschhuawei] Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the test suite
      7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
      121e4d5 [sboeschhuawei] Remove unused testing data files
      1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
      218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
      43ab10b [sboeschhuawei] Change last two println's to log4j logger
      88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
      24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
      060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
      be659e3 [sboeschhuawei] Added mllib specific log4j
      90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
      bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
      b29c0db [Fan Jiang] Update PIClustering.scala
      ace9749 [Fan Jiang] Update PIClustering.scala
      a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
      f656c34 [sboeschhuawei] Added iris dataset
      b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
      a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
      9294263 [sboeschhuawei] Added visualization/plotting of input/output data
      e5df2b8 [sboeschhuawei] First end to end working PIC
      0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
      32a90dc [sboeschhuawei] Update circles test data values
      0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
      3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
      d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
      a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
      f377431a
    • Burak Yavuz
      [SPARK-5486] Added validate method to BlockMatrix · 6ee8338b
      Burak Yavuz authored
      The `validate` method will allow users to debug their `BlockMatrix`, if operations like `add` or `multiply` return unexpected results. It checks the following properties in a `BlockMatrix` (see the usage sketch after this list):
      - Are the dimensions of the `BlockMatrix` consistent with what the user entered (`nRows`, `nCols`)?
      - Are the dimensions of each `MatrixBlock` consistent with what the user entered (`rowsPerBlock`, `colsPerBlock`)?
      - Are there blocks with duplicate indices?
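      A hedged usage sketch; `validate` is the method added by this PR, while how the matrix is constructed is incidental:
      ```scala
      import org.apache.spark.mllib.linalg.distributed.BlockMatrix

      val matrix: BlockMatrix = ??? // e.g. built from an RDD of ((rowIndex, colIndex), Matrix)
      matrix.validate()             // throws with a descriptive message if any check above fails
      val sum = matrix.add(matrix)  // safe to operate on once validation passes
      ```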
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #4279 from brkyvz/SPARK-5486 and squashes the following commits:
      
      c152a73 [Burak Yavuz] addressed code review v2
      598c583 [Burak Yavuz] merged master
      b55ac5c [Burak Yavuz] addressed code review v1
      25f083b [Burak Yavuz] simplify implementation
      0aa519a [Burak Yavuz] [SPARK-5486] Added validate method to BlockMatrix
      6ee8338b
    • Xiangrui Meng
      [SPARK-5496][MLLIB] Allow both classification and Classification in Algo for trees. · 0a95085f
      Xiangrui Meng authored
      to be backward compatible.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4287 from mengxr/SPARK-5496 and squashes the following commits:
      
      a025c53 [Xiangrui Meng] Allow both classification and Classification in Algo for trees.
      0a95085f
    • Joseph J.C. Tang
      [MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount · 54d95758
      Joseph J.C. Tang authored
      When vocabSize * vectorSize is larger than Int.MaxValue / 8, we throw a RuntimeException, because under this circumstance allocating the memory to serialize the arrays syn0Global and syn1Global would definitely cause an OOM. syn0Global and syn1Global are float arrays, and serializing them needs a byte array of more than 8 times the size of syn0Global.
      Also, if we catch an OOM even though vocabSize * vectorSize is less than Int.MaxValue / 8, we give users hints to increase the minCount or decrease the vectorSize.
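      A hedged sketch of the guard's arithmetic; the 8x factor comes from the serialization overhead described above:
      ```scala
      // Two float arrays of vocabSize * vectorSize entries get serialized; the
      // byte buffer needs more than 8x that many bytes, so the product must stay
      // below Int.MaxValue / 8 or allocation is guaranteed to fail.
      val vocabSize = 100000
      val vectorSize = 100
      val entries = vocabSize.toLong * vectorSize // 1.0e7 here, safely under the bound
      require(entries < Int.MaxValue / 8,
        s"vocabSize * vectorSize = $entries is too large; " +
          "increase minCount or decrease vectorSize")
      ```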
      
      Author: Joseph J.C. Tang <jinntrance@gmail.com>
      
      Closes #4247 from jinntrance/w2v-fix and squashes the following commits:
      
      b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount
      54d95758