  1. Apr 21, 2015
    • [SPARK-6113] [ML] Small cleanups after original tree API PR · 607eff0e
      Joseph K. Bradley authored
      This does a few clean-ups.  With this PR, all spark.ml tree components have ```private[ml]``` constructors.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5567 from jkbradley/dt-api-dt2 and squashes the following commits:
      
      2263b5b [Joseph K. Bradley] Added note about tree example issue.
      bb9f610 [Joseph K. Bradley] Small cleanups after original tree API PR
      607eff0e
    • [MINOR] Comment improvements in ExternalSorter. · 70f9f8ff
      Patrick Wendell authored
      1. Clearly specifies the contract/interactions for users of this class.
      2. Minor fix in one doc to avoid ambiguity.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #5620 from pwendell/cleanup and squashes the following commits:
      
      8d8f44f [Patrick Wendell] [Minor] Comment improvements in ExternalSorter.
      70f9f8ff
    • [SPARK-6490][Docs] Add docs for rpc configurations · 3a3f7100
      zsxwing authored
      Added docs for rpc configurations and also fixed two places that should have been fixed in #5595.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5607 from zsxwing/SPARK-6490-docs and squashes the following commits:
      
      25a6736 [zsxwing] Increase the default timeout to 120s
      6e37c30 [zsxwing] Update docs
      5577540 [zsxwing] Use spark.network.timeout as the default timeout if it presents
      4f07174 [zsxwing] Fix unit tests
      1c2cf26 [zsxwing] Add docs for rpc configurations
      3a3f7100
    • [SPARK-1684] [PROJECT INFRA] Merge script should standardize SPARK-XXX prefix · a0761ec7
      texasmichelle authored
      Cleans up the pull request title in the merge script to follow conventions outlined in the wiki under Contributing Code.
      https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCode
      
      [MODULE] SPARK-XXXX: Description
      
      Author: texasmichelle <texasmichelle@gmail.com>
      
      Closes #5149 from texasmichelle/master and squashes the following commits:
      
      9b6b0a7 [texasmichelle] resolved variable scope issue
      7d5fa20 [texasmichelle] only prompt if title has been modified
      8c195bb [texasmichelle] removed erroneous line
      4f1ed46 [texasmichelle] Deque removal, logic simplifications, & prompt user to pick a title (orig or modified)
      df73f6a [texasmichelle] reworked regex's to enforce brackets around JIRA ref
      43b5aed [texasmichelle] Merge remote-tracking branch 'apache/master'
      25229c6 [texasmichelle] Merge remote-tracking branch 'apache/master'
      aa20a6e [texasmichelle] Move code into main() and add doctest for new text parsing method
      48520ba [texasmichelle] SPARK-1684: Corrected import statement
      042099d [texasmichelle] SPARK-1684 Merge script should standardize SPARK-XXX prefix
      8f4a7d1 [texasmichelle] SPARK-1684 Merge script should standardize SPARK-XXX prefix
      a0761ec7
    • Closes #5427 · 41ef78a9
      Reynold Xin authored
      41ef78a9
    • [SPARK-6953] [PySpark] speed up python tests · 3134c3fe
      Reynold Xin authored
      This PR tries to speed up some Python tests:
      
      ```
      tests.py                       144s -> 103s      -41s
      mllib/classification.py         24s -> 17s        -7s
      mllib/regression.py             27s -> 15s       -12s
      mllib/tree.py                   27s -> 13s       -14s
      mllib/tests.py                  64s -> 31s       -33s
      streaming/tests.py             185s -> 84s      -101s
      ```
      Taking Python 3 into account, the total saving will be 558s (almost 10 minutes), since core and streaming run three times and mllib runs twice.
      
      During testing, it now shows the time used for each test file:
      ```
      Run core tests ...
      Running test: pyspark/rdd.py ... ok (22s)
      Running test: pyspark/context.py ... ok (16s)
      Running test: pyspark/conf.py ... ok (4s)
      Running test: pyspark/broadcast.py ... ok (4s)
      Running test: pyspark/accumulators.py ... ok (4s)
      Running test: pyspark/serializers.py ... ok (6s)
      Running test: pyspark/profiler.py ... ok (5s)
      Running test: pyspark/shuffle.py ... ok (1s)
      Running test: pyspark/tests.py ... ok (103s)   144s
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5605 from rxin/python-tests-speed and squashes the following commits:
      
      d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
      89321ee [Xiangrui Meng] fix seed in tests
      3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
      3134c3fe
    • [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races. · e72c16e3
      Marcelo Vanzin authored
      This change adds some new utility code to handle shutdown hooks in
      Spark. The main goal is to take advantage of Hadoop 2.x's API for
      shutdown hooks, which allows Spark to register a hook that will
      run before the one that cleans up HDFS clients, and thus avoids
      some races that would cause exceptions to show up and other issues
      such as failure to properly close event logs.
      
      Unfortunately, Hadoop 1.x does not have such APIs, so in that case
      correctness is still left to chance.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5560 from vanzin/SPARK-6014 and squashes the following commits:
      
      edfafb1 [Marcelo Vanzin] Better scaladoc.
      fcaeedd [Marcelo Vanzin] Merge branch 'master' into SPARK-6014
      e7039dc [Marcelo Vanzin] [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races.
      e72c16e3
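To make the priority mechanism described in the entry above concrete, here is a minimal, hypothetical sketch (not Spark's actual utility code) of registering a hook through Hadoop 2.x's `ShutdownHookManager` so that it runs before the hook that closes HDFS clients; the `+ 10` offset and the log-flushing body are illustrative assumptions.

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

object ShutdownHookSketch {
  def main(args: Array[String]): Unit = {
    val flushLogs = new Runnable {
      override def run(): Unit = println("flushing event logs before HDFS clients shut down")
    }
    // Hooks with higher priority run earlier; FileSystem.SHUTDOWN_HOOK_PRIORITY is the
    // priority of the hook that closes HDFS clients, so we register ours above it.
    ShutdownHookManager.get().addShutdownHook(flushLogs, FileSystem.SHUTDOWN_HOOK_PRIORITY + 10)
    println("hook registered; it will run on JVM shutdown")
  }
}
```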
    • Avoid warning message about invalid refuse_seconds value in Mesos >=0.21... · b063a61b
      mweindel authored
      Starting with version 0.21.0, Apache Mesos is very noisy if the filter parameter refuse_seconds is set to an invalid value like `-1`.
      I have seen systems with millions of log lines like
      ```
      W0420 18:00:48.773059 32352 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative
      ```
      in the Mesos master INFO and WARNING log files.
      Therefore the CoarseMesosSchedulerBackend should set the default value for `refuse_seconds` (i.e. 5 seconds) directly.
      This is no problem for the fine-grained MesosSchedulerBackend, as it uses the value 1 second for this parameter.
      
      Author: mweindel <m.weindel@usu-software.de>
      
      Closes #5597 from MartinWeindel/master and squashes the following commits:
      
      2f99ffd [mweindel] Avoid warning message about invalid refuse_seconds value in Mesos >=0.21.
      b063a61b
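As an illustration of the fix described in the entry above, the following sketch (assuming the Mesos Java protobuf API; the scheduler wiring is omitted) builds an offer filter with an explicit, valid `refuse_seconds` value instead of `-1`.

```scala
import org.apache.mesos.Protos.Filters

object RefuseSecondsSketch {
  def main(args: Array[String]): Unit = {
    // 5 seconds mirrors Mesos' own default; a negative value triggers the warning quoted above.
    val filters = Filters.newBuilder().setRefuseSeconds(5.0).build()
    println(s"refuse_seconds = ${filters.getRefuseSeconds}")
    // In a scheduler backend this Filters object would be passed to
    // SchedulerDriver.launchTasks(...) or declineOffer(...).
  }
}
```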
    • [Minor][MLLIB] Fix a minor formatting bug in toString method in Node.scala · ae036d08
      Alain authored
      
      add missing comma and space
      
      Author: Alain <aihe@usc.edu>
      
      Closes #5621 from AiHe/tree-node-issue and squashes the following commits:
      
      159a7bb [Alain] [Minor][MLLIB] Fix a minor formatting bug in toString methods in Node.scala
      
      (cherry picked from commit 4508f01890a723f80d631424ff8eda166a13a727)
      Signed-off-by: Xiangrui Meng <meng@databricks.com>
      ae036d08
    • [SPARK-7036][MLLIB] ALS.train should support DataFrames in PySpark · 686dd742
      Xiangrui Meng authored
      SchemaRDD works with ALS.train in 1.2, so we should continue to support DataFrames for compatibility. coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5619 from mengxr/SPARK-7036 and squashes the following commits:
      
      dfcaf5a [Xiangrui Meng] ALS.train should support DataFrames in PySpark
      686dd742
    • [SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls · 7fe6142c
      MechCoder authored
      1. Use blas calls to find the dot product between two vectors.
      2. Prevent re-computing the L2 norm of the given vector for each word in the model.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5467 from MechCoder/spark-6065 and squashes the following commits:
      
      dd0b0b2 [MechCoder] Preallocate wordVectors
      ffc9240 [MechCoder] Minor
      6b74c81 [MechCoder] Switch back to native blas calls
      da1642d [MechCoder] Explicit types and indexing
      64575b0 [MechCoder] Save indexedmap and a wordvecmat instead of matrix
      fbe0108 [MechCoder] Made the following changes 1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas
      1350cf3 [MechCoder] [SPARK-6065] Optimize word2vec.findSynonynms using blas calls
      7fe6142c
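The following is a rough sketch, not the actual `Word2VecModel` code, of the two ideas in the entry above: compute each word vector's L2 norm once up front and use netlib BLAS dot products when ranking candidates. The toy vectors and word list are made up for illustration.

```scala
import com.github.fommil.netlib.BLAS.{getInstance => blas}

object CosineWithBlasSketch {
  def main(args: Array[String]): Unit = {
    val vectors = Map(
      "king"  -> Array(0.9, 0.1, 0.3),
      "queen" -> Array(0.8, 0.2, 0.35)
    )
    // Norms computed once up front, analogous to "Calculate norms during initialization".
    val norms = vectors.mapValues(v => blas.dnrm2(v.length, v, 1))

    val query = Array(0.85, 0.15, 0.3)
    val queryNorm = blas.dnrm2(query.length, query, 1)

    // BLAS dot product per candidate word; no norm is recomputed here.
    val similarities = vectors.map { case (word, vec) =>
      val dot = blas.ddot(vec.length, vec, 1, query, 1)
      word -> dot / (norms(word) * queryNorm)
    }
    similarities.toSeq.sortBy(-_._2).foreach { case (w, s) => println(f"$w%-6s $s%.4f") }
  }
}
```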
    • [minor] [build] Set java options when generating mima ignores. · a70e849c
      Marcelo Vanzin authored
      The default java options make the call to GenerateMIMAIgnore take
      forever to run since it's gc'ing all the time. Improve things by
      setting the perm gen size / max heap size to larger values.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5615 from vanzin/gen-mima-fix and squashes the following commits:
      
      f44e921 [Marcelo Vanzin] [minor] [build] Set java options when generating mima ignores.
      a70e849c
    • [SPARK-3386] Share and reuse SerializerInstances in shuffle paths · f83c0f11
      Josh Rosen authored
      This patch modifies several shuffle-related code paths to share and re-use SerializerInstances instead of creating new ones.  Some serializers, such as KryoSerializer or SqlSerializer, can be fairly expensive to create or may consume moderate amounts of memory, so it's probably best to avoid unnecessary serializer creation in hot code paths.
      
      The key change in this patch is modifying `getDiskWriter()` / `DiskBlockObjectWriter` to accept `SerializerInstance`s instead of `Serializer`s (which are factories for instances).  This allows the disk writer's creator to decide whether the serializer instance can be shared or re-used.
      
      The rest of the patch modifies several write and read paths to use shared serializers.  One big win is in `ShuffleBlockFetcherIterator`, where we used to create a new serializer per received block.  Similarly, the shuffle write path used to create a new serializer per file even though in many cases only a single thread would be writing to a file at a time.
      
      I made a small serializer reuse optimization in CoarseGrainedExecutorBackend as well, since it seemed like a small and obvious improvement.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5606 from JoshRosen/SPARK-3386 and squashes the following commits:
      
      f661ce7 [Josh Rosen] Remove thread local; add comment instead
      64f8398 [Josh Rosen] Use ThreadLocal for serializer instance in CoarseGrainedExecutorBackend
      aeb680e [Josh Rosen] [SPARK-3386] Reuse SerializerInstance in shuffle code paths
      f83c0f11
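A small sketch of the reuse pattern the entry above describes: create one `SerializerInstance` from the factory-like `Serializer` and reuse it across many records, instead of calling `newInstance()` per block or per file. This is illustrative code, not the shuffle write/read path itself, and note that a `SerializerInstance` is generally not safe to share across threads.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object SerializerReuseSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val serializer = new KryoSerializer(conf)   // the factory; relatively expensive to create
    val instance = serializer.newInstance()     // created once, reused in the hot path below

    val records = Seq("a" -> 1, "b" -> 2, "c" -> 3)
    val roundTripped = records.map { rec =>
      val bytes = instance.serialize(rec)
      instance.deserialize[(String, Int)](bytes)
    }
    println(roundTripped)
  }
}
```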
    • [SPARK-5817] [SQL] Fix bug of udtf with column names · 7662ec23
      Cheng Hao authored
      There is a bug when running a query like:
      ```sql
      select d from (select explode(array(1,1)) d from src limit 1) t
      ```
      It throws an exception like:
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7
      at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
      at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
      at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
      at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
      at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      ```
      
      To solve the bug, some refactoring of the UDTF code is required.
      The major changes are:
      * Simplify UDTF development: the UDTF no longer manages the output attribute names; instead, `logical.Generate` handles that properly.
      * The UDTF is asked for its output schema (data types) during logical plan analysis.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4602 from chenghao-intel/explode_bug and squashes the following commits:
      
      c2a5132 [Cheng Hao] add back resolved for Alias
      556e982 [Cheng Hao] revert the unncessary change
      002c361 [Cheng Hao] change the rule of resolved for Generate
      04ae500 [Cheng Hao] add qualifier only for generator output
      5ee5d2c [Cheng Hao] prepend the new qualifier
      d2e8b43 [Cheng Hao] Update the code as feedback
      ca5e7f4 [Cheng Hao] shrink the commits
      7662ec23
    • [SPARK-6996][SQL] Support map types in java beans · 2a24bf92
      Punya Biswal authored
      liancheng mengxr this is similar to #5146.
      
      Author: Punya Biswal <pbiswal@palantir.com>
      
      Closes #5578 from punya/feature/SPARK-6996 and squashes the following commits:
      
      d56c3e0 [Punya Biswal] Fix imports
      c7e308b [Punya Biswal] Support java iterable types in POJOs
      5e00685 [Punya Biswal] Support map types in java beans
      2a24bf92
    • [SPARK-6969][SQL] Refresh the cached table when REFRESH TABLE is used · 6265cba0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-6969
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5583 from yhuai/refreshTableRefreshDataCache and squashes the following commits:
      
      1e5142b [Yin Huai] Add todo.
      92b2498 [Yin Huai] Minor updates.
      367df92 [Yin Huai] Recache data in the command of REFRESH TABLE.
      6265cba0
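A minimal sketch of the behavior the entry above fixes, assuming a Hive-enabled build and an existing table; the table name and paths are hypothetical. After SPARK-6969, `REFRESH TABLE` also re-caches a table that was previously cached.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RefreshTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("refresh-table").setMaster("local"))
    val sqlContext = new HiveContext(sc)

    sqlContext.sql("CACHE TABLE my_table")
    // ... files underneath my_table change externally ...
    sqlContext.sql("REFRESH TABLE my_table")   // with SPARK-6969 the cached data is refreshed too
    sqlContext.sql("SELECT COUNT(*) FROM my_table").show()
  }
}
```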
    • [SQL][minor] make it more clear that we only need to re-throw GetField... · 03fd9216
      Wenchen Fan authored
      [SQL][minor] make it more clear that we only need to re-throw GetField exception for UnresolvedAttribute
      
      For a `GetField` outside `UnresolvedAttribute`, we will throw an exception in `Analyzer`.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5588 from cloud-fan/tmp and squashes the following commits:
      
      7ac74d2 [Wenchen Fan] small refactor
      03fd9216
    • [SPARK-6994] Allow to fetch field values by name in sql.Row · 2e8c6ca4
      vidmantas zemleris authored
      It seemed odd that, until now, there was no way in Spark's Scala API to access fields of a `DataFrame`/`sql.Row` by name, only by their index.

      This PR tries to solve that issue.
      
      Author: vidmantas zemleris <vidmantas@vinted.com>
      
      Closes #5573 from vidma/features/row-with-named-fields and squashes the following commits:
      
      6145ae3 [vidmantas zemleris] [SPARK-6994][SQL] Allow to fetch field values by name on Row
      9564ebb [vidmantas zemleris] [SPARK-6994][SQL] Add fieldIndex to schema (StructType)
      2e8c6ca4
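A short sketch of the API added in the entry above, as described in its squashed commits: `StructType.fieldIndex` resolves a field name to an index, and `Row.getAs` reads a value directly by name. The sample data is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RowByNameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("row-by-name").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")

    val first = df.first()
    val ageIdx = df.schema.fieldIndex("age")   // index lookup on the StructType
    println(first.getInt(ageIdx))              // access by resolved index
    println(first.getAs[String]("name"))       // access directly by field name
  }
}
```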
    • [SPARK-7011] Build(compilation) fails with scala 2.11 option, because a... · 04bf34e3
      Prashant Sharma authored
      [SPARK-7011] Build(compilation) fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
      
      [This](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L58) is where it is used and where compilation fails.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #5593 from ScrapCodes/SPARK-7011/build-fix and squashes the following commits:
      
      e6d57a3 [Prashant Sharma] [SPARK-7011] Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
      04bf34e3
    • [SPARK-6845] [MLlib] [PySpark] Add isTranposed flag to DenseMatrix · 45c47fa4
      MechCoder authored
      Since sparse matrices now support an isTransposed flag for row-major data, DenseMatrices should do the same.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5455 from MechCoder/spark-6845 and squashes the following commits:
      
      525c370 [MechCoder] minor
      004a37f [MechCoder] Cast boolean to int
      151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
      cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
      45c47fa4
    • SPARK-3276 Added a new configuration spark.streaming.minRememberDuration · c25ca7c5
      emres authored
      SPARK-3276 Added a new configuration parameter ``spark.streaming.minRememberDuration``, with a default value of 1 minute.
      
      This way, when a Spark Streaming application is started, an arbitrary number of minutes can be used as the threshold for remembering.
      
      Author: emres <emre.sevinc@gmail.com>
      
      Closes #5438 from emres/SPARK-3276 and squashes the following commits:
      
      766f938 [emres] SPARK-3276 Switched to using newly added getTimeAsSeconds method.
      affee1d [emres] SPARK-3276 Changed the property name and variable name for minRememberDuration
      c9d58ca [emres] SPARK-3276 Minor code re-formatting.
      1c53ba9 [emres] SPARK-3276 Started to use ssc.conf rather than ssc.sparkContext.getConf,  and also getLong method directly.
      bfe0acb [emres] SPARK-3276 Moved the minRememberDurationMin to the class
      daccc82 [emres] SPARK-3276 Changed the property name to reflect the unit of value and reduced number of fields.
      43cc1ce [emres] SPARK-3276 Added a new configuration parameter spark.streaming.minRemember duration, with a default value of 1 minute.
      c25ca7c5
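A hedged sketch of setting the new parameter from application code. The `120s` value format is an assumption based on the `getTimeAsSeconds` commit above, and the input directory is hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinRememberDurationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("min-remember-duration")
      .setMaster("local[2]")
      .set("spark.streaming.minRememberDuration", "120s")  // default is 1 minute

    val ssc = new StreamingContext(conf, Seconds(10))
    // File streams started on this context now consider files up to ~120 seconds old.
    ssc.textFileStream("/tmp/input").print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```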
    • [SPARK-5360] [SPARK-6606] Eliminate duplicate objects in serialized CoGroupedRDD · c035c0f2
      Kay Ousterhout authored
      CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object.
      
      This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the RDDs and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (which is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to fix this old and central part of the code, so I'm hoping to get input from others here.
      
      I did some simple experiments to see how much this affects closure size. For this example:
      $ val a = sc.parallelize(1 to 10).map((_, 1))
      $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x))
      $ a.cogroup(b).collect()
      the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle.
      
      For this example:
      $ val sortedA = a.sortByKey()
      $ val sortedB = b.sortByKey()
      $ sortedA.cogroup(sortedB).collect()
      the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies.
      
      The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). It would also get bigger for a big RDD -- although I can't think of any examples where the RDD object gets large.  The difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4145 from kayousterhout/SPARK-5360 and squashes the following commits:
      
      85156c3 [Kay Ousterhout] Better comment the narrowDeps parameter
      cff0209 [Kay Ousterhout] Fixed spelling issue
      658e1af [Kay Ousterhout] [SPARK-5360] Eliminate duplicate objects in serialized CoGroupedRDD
      c035c0f2
    • [SPARK-6985][streaming] Receiver maxRate over 1000 causes a StackOverflowError · 5fea3e5c
      David McGuire authored
      A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.
      
      However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.
      
      Author: David McGuire <david.mcguire2@nike.com>
      
      Closes #5559 from dmcguire81/master and squashes the following commits:
      
      d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing
      8be6934 [David McGuire] Fix spacing per code review
      90e98b9 [David McGuire] Address scalastyle errors
      29011bd [David McGuire] Further ratchet down the error margins
      b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator
      8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code
      70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent
      82ee46d [David McGuire] Replace guard clause with nested conditional
      2794717 [David McGuire] Replace the RateLimiter with the Guava implementation
      38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail
      24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion
      d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s
      5fea3e5c
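To illustrate the Guava solution mentioned in the entry above (this is not the receiver's actual code), `RateLimiter.acquire()` blocks just long enough to hold the caller at the requested rate, which sidesteps the sleep-and-recurse pattern that overflowed the stack at rates above 1000 messages/second. The target rate below is illustrative.

```scala
import com.google.common.util.concurrent.RateLimiter

object GuavaRateLimiterSketch {
  def main(args: Array[String]): Unit = {
    val limiter = RateLimiter.create(2000.0)   // 2000 permits (messages) per second

    val start = System.nanoTime()
    var count = 0
    while (System.nanoTime() - start < 1000000000L) {  // run for roughly one second
      limiter.acquire()                                // blocks until a permit is available
      count += 1
    }
    println(s"Processed about $count messages in ~1s (target rate: 2000/s)")
  }
}
```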
    • [SPARK-5990] [MLLIB] Model import/export for IsotonicRegression · 1f2f723b
      Yanbo Liang authored
      Model import/export for IsotonicRegression
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5270 from yanboliang/spark-5990 and squashes the following commits:
      
      872028d [Yanbo Liang] fix code style
      f80ec1b [Yanbo Liang] address comments
      49600cc [Yanbo Liang] address comments
      429ff7d [Yanbo Liang] store each interval as a record
      2b2f5a1 [Yanbo Liang] Model import/export for IsotonicRegression
      1f2f723b
    • [SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression · ab9128fb
      Davies Liu authored
      This PR enables auto_convert in JavaGateway, so that we can register a converter for given types, for example, date and datetime.
      
      There are two bugs related to auto_convert (see [1] and [2]); we work around them in this PR.
      
      [1]  https://github.com/bartdag/py4j/issues/160
      [2] https://github.com/bartdag/py4j/issues/161
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5570 from davies/py4j_date and squashes the following commits:
      
      eb4fa53 [Davies Liu] fix tests in python 3
      d17d634 [Davies Liu] rollback changes in mllib
      2e7566d [Davies Liu] convert tuple into ArrayList
      ceb3779 [Davies Liu] Update rdd.py
      3c373f3 [Davies Liu] support date and datetime by auto_convert
      cb094ff [Davies Liu] enable auto convert
      ab9128fb
    • [SPARK-6490][Core] Add spark.rpc.* and deprecate spark.akka.* · 8136810d
      zsxwing authored
      Deprecated `spark.akka.num.retries`, `spark.akka.retry.wait`, `spark.akka.askTimeout`,  `spark.akka.lookupTimeout`, and added `spark.rpc.num.retries`, `spark.rpc.retry.wait`, `spark.rpc.askTimeout`, `spark.rpc.lookupTimeout`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5595 from zsxwing/SPARK-6490 and squashes the following commits:
      
      e0d80a9 [zsxwing] Use getTimeAsMs and getTimeAsSeconds and other minor fixes
      31dbe69 [zsxwing] Add spark.rpc.* and deprecate spark.akka.*
      8136810d
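A small sketch of switching to the new keys listed in the entry above; only the timeout keys named there are used here, and the `120s` values are illustrative rather than recommendations.

```scala
import org.apache.spark.SparkConf

object RpcConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.rpc.askTimeout", "120s")      // replaces the deprecated spark.akka.askTimeout
      .set("spark.rpc.lookupTimeout", "120s")   // replaces the deprecated spark.akka.lookupTimeout

    println(conf.get("spark.rpc.askTimeout"))
    println(conf.get("spark.rpc.lookupTimeout"))
  }
}
```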
  2. Apr 20, 2015
    • [SPARK-6635][SQL] DataFrame.withColumn should replace columns with identical column names · c736220d
      Liang-Chi Hsieh authored
      JIRA https://issues.apache.org/jira/browse/SPARK-6635
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5541 from viirya/replace_with_column and squashes the following commits:
      
      b539c7b [Liang-Chi Hsieh] For comment.
      72f35b1 [Liang-Chi Hsieh] DataFrame.withColumn can replace original column with identical column name.
      c736220d
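A small sketch of the new behavior described in the entry above: reusing an existing column name in `withColumn` replaces that column instead of adding a duplicate. The sample DataFrame is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WithColumnReplaceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("with-column").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")

    // Re-using the name "age" replaces the original column rather than creating a second one.
    val updated = df.withColumn("age", df("age") + 10)
    updated.printSchema()   // still exactly two columns: name, age
    updated.show()
  }
}
```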
    • [SPARK-6368][SQL] Build a specialized serializer for Exchange operator. · ce7ddabb
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6368
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5497 from yhuai/serializer2 and squashes the following commits:
      
      da562c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2
      50e0c3d [Yin Huai] When no filed is emitted to shuffle, use SparkSqlSerializer for now.
      9f1ed92 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2
      6d07678 [Yin Huai] Address comments.
      4273b8c [Yin Huai] Enabled SparkSqlSerializer2.
      09e587a [Yin Huai] Remove TODO.
      791b96a [Yin Huai] Use UTF8String.
      60a1487 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2
      3e09655 [Yin Huai] Use getAs for Date column.
      43b9fb4 [Yin Huai] Test.
      8297732 [Yin Huai] Fix test.
      c9373c8 [Yin Huai] Support DecimalType.
      2379eeb [Yin Huai] ASF header.
      39704ab [Yin Huai] Specialized serializer for Exchange.
      ce7ddabb
    • [doc][streaming] Fixed broken link in mllib section · 517bdf36
      BenFradet authored
      The commit message is pretty self-explanatory.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #5600 from BenFradet/master and squashes the following commits:
      
      108492d [BenFradet] [doc][streaming] Fixed broken link in mllib section
      517bdf36
    • fixed doc · 97fda73d
      Eric Chiang authored
      The contribution is my original work. I license the work to the project under the project's open source license.
      
      Small typo in the programming guide.
      
      Author: Eric Chiang <eric.chiang.m@gmail.com>
      
      Closes #5599 from ericchiang/docs-typo and squashes the following commits:
      
      1177942 [Eric Chiang] fixed doc
      97fda73d
    • [Minor][MLlib] Incorrect path to test data is used in DecisionTreeExample · 1ebceaa5
      Liang-Chi Hsieh authored
      It should load from `testInput` instead of `input` for test data.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5594 from viirya/use_testinput and squashes the following commits:
      
      5e8b174 [Liang-Chi Hsieh] Fix style.
      b60b475 [Liang-Chi Hsieh] Use testInput.
      1ebceaa5
    • [SPARK-6661] Python type errors should print type, not object · 77176619
      Elisey Zanko authored
      Author: Elisey Zanko <elisey.zanko@gmail.com>
      
      Closes #5361 from 31z4/spark-6661 and squashes the following commits:
      
      73c5d79 [Elisey Zanko] Python type errors should print type, not object
      77176619
    • [SPARK-7003] Improve reliability of connection failure detection between Netty... · 968ad972
      Aaron Davidson authored
      [SPARK-7003] Improve reliability of connection failure detection between Netty block transfer service endpoints
      
      Currently we rely on the assumption that an exception will be raised and the channel closed if two endpoints cannot communicate over a Netty TCP channel. However, this guarantee does not hold in all network environments, and [SPARK-6962](https://issues.apache.org/jira/browse/SPARK-6962) seems to point to a case where only the server side of the connection detected a fault.
      
      This patch improves robustness of fetch/rpc requests by having an explicit timeout in the transport layer which closes the connection if there is a period of inactivity while there are outstanding requests.
      
      NB: This patch is actually only around 50 lines added if you exclude the testing-related code.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #5584 from aarondav/timeout and squashes the following commits:
      
      8699680 [Aaron Davidson] Address Reynold's comments
      37ce656 [Aaron Davidson] [SPARK-7003] Improve reliability of connection failure detection between Netty block transfer service endpoints
      968ad972
    • [SPARK-5924] Add the ability to specify withMean or withStd parameters with StandarScaler · 1be20707
      jrabary authored
      The current implementation calls the default constructor of mllib.feature.StandardScaler without the possibility of specifying the withMean or withStd options.
      
      Author: jrabary <Jaonary@gmail.com>
      
      Closes #4704 from jrabary/master and squashes the following commits:
      
      fae8568 [jrabary] style fix
      8896b0e [jrabary] Comments fix
      ef96d73 [jrabary] style fix
      8e52607 [jrabary] style fix
      edd9d48 [jrabary] Fix default param initialization
      17e1a76 [jrabary] Fix default param initialization
      298f405 [jrabary] Typo fix
      45ed914 [jrabary] Add withMean and withStd params to StandarScaler
      1be20707
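For context, a brief sketch of the `(withMean, withStd)` constructor of `mllib.feature.StandardScaler` that the entry above says was not reachable; this calls the MLlib feature transformer directly rather than the patched caller, and the input vectors are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

object StandardScalerParamsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("standard-scaler").setMaster("local"))
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)
    ))

    // Both centering and scaling requested explicitly, instead of the default constructor.
    val scalerModel = new StandardScaler(withMean = true, withStd = true).fit(data)
    scalerModel.transform(data).collect().foreach(println)
  }
}
```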
  3. Apr 19, 2015
  4. Apr 18, 2015
    • SPARK-6993 : Add default min, max methods for JavaDoubleRDD · 8fbd45c7
      Olivier Girardot authored
      The default methods will use Guava's Ordering instead of
      java.util.Comparator.naturalOrder(), because the latter is not
      available in Java 7, only in Java 8.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5571 from ogirardot/master and squashes the following commits:
      
      7fe2e9e [Olivier Girardot] SPARK-6993 : Add default min, max methods for JavaDoubleRDD
      8fbd45c7