  1. Aug 02, 2014
    • Chris Fregly's avatar
      [SPARK-1981] Add AWS Kinesis streaming support · 91f9504e
      Chris Fregly authored
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #1434 from cfregly/master and squashes the following commits:
      
      4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
      0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
      691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
      0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
      e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
      d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
      912640c [Chris Fregly] changed the foundKinesis class to be a publically-avail class
      db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
      338997e [Chris Fregly] improve build docs for kinesis
      828f8ae [Chris Fregly] more cleanup
      e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      cd68c0d [Chris Fregly] fixed typos and backward compatibility
      d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
      91f9504e
    • Yin Huai's avatar
      [SQL] Set outputPartitioning of BroadcastHashJoin correctly. · 67bd8e3c
      Yin Huai authored
      I think we will not generate the plan triggering this bug at this moment. But, let me explain it...
      
      Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
      ```sql
      SELECT l.key, count(*)
      FROM (SELECT key, count(*) as cnt
            FROM src
            GROUP BY key) l -- This is buildPlan
      JOIN r -- This is the streamedPlan
      ON (l.cnt = r.value)
      GROUP BY l.key
      ```
      Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` as the `outputPartitioning` of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` uses `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s `outputPartitioning` may not match the required distribution of the last `GROUP BY`, so we fail to group data correctly.
      
      JIRA is being reindexed. I will create a JIRA ticket once it is back online.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:
      
      96d9cb3 [Yin Huai] Set outputPartitioning correctly.
      67bd8e3c
    • Joseph K. Bradley's avatar
      [SPARK-2478] [mllib] DecisionTree Python API · 3f67382e
      Joseph K. Bradley authored
      Added experimental Python API for Decision Trees.
      
      API:
      * class DecisionTreeModel
      ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
      ** numNodes()
      ** depth()
      ** __str__()
      * class DecisionTree
      ** trainClassifier()
      ** trainRegressor()
      ** train()
      
      Examples and testing:
      * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
      * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
      
      Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
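
The difference between the two checks mentioned above can be illustrated with a minimal sketch. The classes `RDD` and `CustomRDD` below are hypothetical stand-ins defined only for this example; they are not the PySpark classes, and the helper names are assumptions.

```python
class RDD(object):
    """Hypothetical stand-in for an RDD base class (illustration only)."""


class CustomRDD(RDD):
    """Hypothetical subclass of the stand-in RDD class."""


def is_rdd_by_type(x):
    # Exact type comparison: an instance of a subclass is rejected.
    return type(x) == RDD


def is_rdd_by_isinstance(x):
    # isinstance() follows the inheritance chain, so subclasses are accepted.
    return isinstance(x, RDD)


sub = CustomRDD()
print(is_rdd_by_type(sub))        # False
print(is_rdd_by_isinstance(sub))  # True
```

With the `type()` comparison, a subclass instance would fall through to the non-RDD code path, which is the kind of miss the fix is meant to prevent.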
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
      
      3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
      6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
      67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
      aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
      fa10ea7 [Joseph K. Bradley] Small style update
      7968692 [Joseph K. Bradley] small braces typo fix
      e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
      db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
      6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      93953f1 [Joseph K. Bradley] Likely done with Python API.
      6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
      188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
      cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
      3f67382e
    • Andrew Or's avatar
      [HOTFIX] Do not throw NPE if spark.test.home is not set · e09e18b3
      Andrew Or authored
      `spark.test.home` was introduced in #1734. This is fine for SBT but is failing maven tests. Either way it shouldn't throw an NPE.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1739 from andrewor14/fix-spark-test-home and squashes the following commits:
      
      ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set
      e09e18b3
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 87738bfa
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #706 (close requested by 'pwendell')
      Closes #453 (close requested by 'pwendell')
      Closes #557 (close requested by 'tdas')
      Closes #495 (close requested by 'tdas')
      Closes #1232 (close requested by 'pwendell')
      Closes #82 (close requested by 'pwendell')
      Closes #600 (close requested by 'pwendell')
      Closes #473 (close requested by 'pwendell')
      Closes #351 (close requested by 'pwendell')
      87738bfa
    • Patrick Wendell's avatar
      HOTFIX: Fix concurrency issue in FlumePollingStreamSuite. · 44460ba5
      Patrick Wendell authored
      This has been failing on master. One possible cause is that the port
      gets contended if multiple test runs happen concurrently and they
      hit this test at the same time. Since this test takes a long time
      (60 seconds) that's very plausible. This patch randomizes the port
      used in this test to avoid contention.
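
One common way to avoid port contention between concurrent test runs is sketched below. Note that this is a related but different technique from the one the patch describes: instead of picking a random port number, it asks the operating system for a currently free port by binding to port 0. The helper name `find_free_port` is an assumption made for this example.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a free TCP port by binding to port 0 (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, 0))          # port 0 lets the OS choose an unused port
        return s.getsockname()[1]  # the port number that was actually assigned
    finally:
        s.close()


port = find_free_port()
print(port)
```

The port is released before the caller uses it, so another process could still claim it in between; the approach reduces contention rather than eliminating it.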
      44460ba5
    • Patrick Wendell's avatar
      HOTFIX: Fixing test error in maven for flume-sink. · 25cad6ad
      Patrick Wendell authored
      We needed to add an explicit dependency on scalatest since this
      module will not get it from spark core like others do.
      25cad6ad
    • Anand Avati's avatar
      [SPARK-1812] sql/catalyst - Provide explicit type information · 08c095b6
      Anand Avati authored
      For Scala 2.11 compatibility.
      
      Without the explicit type specification, withNullability
      return type is inferred to be Attribute, and thus calling
      at() on the returned object fails in these tests:
      
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:370: value at is not a
      [ERROR]     val c4_notNull = 'a.boolean.notNull.at(3)
      [ERROR]                                         ^
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:371: value at is not a
      [ERROR]     val c5_notNull = 'a.boolean.notNull.at(4)
      [ERROR]                                         ^
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:372: value at is not a
      [ERROR]     val c6_notNull = 'a.boolean.notNull.at(5)
      [ERROR]                                         ^
      [ERROR] /Users/avati/work/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvaluationSuite.scala:558: value at is not a
      [ERROR]     val s_notNull = 'a.string.notNull.at(0)
      
      Signed-off-by: Anand Avati <avati@redhat.com>
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1709 from avati/SPARK-1812-notnull and squashes the following commits:
      
      0470eb3 [Anand Avati] SPARK-1812: sql/catalyst - Provide explicit type information
      08c095b6
    • Andrew Or's avatar
      [SPARK-2454] Do not ship spark home to Workers · 148af608
      Andrew Or authored
      When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.
      
      The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.
      
      This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:
      
      f71f391 [Andrew Or] Revert changes in python
      1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
      188fc5d [Andrew Or] Avoid using spark.home where possible
      09272b7 [Andrew Or] Always use Worker's working directory as spark home
      148af608
    • Andrew Or's avatar
      [SPARK-2316] Avoid O(blocks) operations in listeners · d934801d
      Andrew Or authored
      The existing code in `StorageUtils` is not the most efficient. Every time we want to update an `RDDInfo` we end up iterating through all blocks on all block managers just to discard most of them. The symptoms manifest themselves in the bountiful UI bugs observed in the wild. Many of these bugs are caused by the slow consumption of events in `LiveListenerBus`, which frequently leads to the event queue overflowing and `SparkListenerEvent`s being dropped on the floor. The changes made in this PR avoid this by first filtering out only the blocks relevant to us before computing storage information from them.
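
The idea of filtering down to the relevant blocks before computing storage information can be sketched as follows. This is an illustrative sketch only, not the code in `StorageUtils`: the function names are hypothetical, and it assumes block names of the form `rdd_<rddId>_<partition>` for RDD blocks.

```python
from collections import defaultdict


def index_blocks_by_rdd(blocks):
    """Group (block_name, size) pairs by RDD id once, skipping non-RDD blocks."""
    by_rdd = defaultdict(list)
    for name, size in blocks:
        parts = name.split("_")
        # Assumed naming scheme: "rdd_<rddId>_<partition>".
        if len(parts) == 3 and parts[0] == "rdd":
            by_rdd[int(parts[1])].append((name, size))
    return by_rdd


def memory_used_by_rdd(by_rdd, rdd_id):
    # Touches only the blocks of one RDD instead of scanning every block.
    return sum(size for _, size in by_rdd.get(rdd_id, []))


blocks = [("rdd_0_0", 100), ("rdd_0_1", 150), ("rdd_1_0", 40), ("broadcast_3", 999)]
index = index_blocks_by_rdd(blocks)
print(memory_used_by_rdd(index, 0))  # 250
```

Once the index exists, each per-RDD update costs time proportional to that RDD's own blocks rather than to all blocks on all block managers.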
      
      It's worth a mention that this corner of the Spark code is also not very well-tested at all. The bulk of the changes in this PR (more than 60%) is actually test cases for the various logic in `StorageUtils.scala` as well as `StorageTab.scala`. These will eventually be extended to cover the various listeners that constitute the `SparkUI`.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1679 from andrewor14/fix-drop-events and squashes the following commits:
      
      f80c1fa [Andrew Or] Rewrite fold and reduceOption as sum
      e132d69 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
      14fa1c3 [Andrew Or] Simplify some code + update a few comments
      a91be46 [Andrew Or] Make ExecutorsPage blazingly fast
      bf6f09b [Andrew Or] Minor changes
      8981de1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
      af19bc0 [Andrew Or] *UsedByRDD -> *UsedByRdd (minor)
      6970bc8 [Andrew Or] Add extensive tests for StorageListener and the new code in StorageUtils
      e080b9e [Andrew Or] Reduce run time of StorageUtils.updateRddInfo to near constant
      2c3ef6a [Andrew Or] Actually filter out only the relevant RDDs
      6fef86a [Andrew Or] Add extensive tests for new code in StorageStatus
      b66b6b0 [Andrew Or] Use more efficient underlying data structures for blocks
      6a7b7c0 [Andrew Or] Avoid chained operations on TraversableLike
      a9ec384 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
      b12fcd7 [Andrew Or] Fix tests + simplify sc.getRDDStorageInfo
      da8e322 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
      8e91921 [Andrew Or] Iterate through a filtered set of blocks when updating RDDInfo
      7b2c4aa [Andrew Or] Rewrite blockLocationsFromStorageStatus + clean up method signatures
      41fa50d [Andrew Or] Add a legacy constructor for StorageStatus
      53af15d [Andrew Or] Refactor StorageStatus + add a bunch of tests
      d934801d
    • GuoQiang Li's avatar
      [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api · adc83032
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:
      
      66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
      73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
      adc83032
    • Jeremy Freeman's avatar
      StatCounter on NumPy arrays [PYSPARK][SPARK-2012] · 4bc3bb29
      Jeremy Freeman authored
      These changes allow StatCounters to work properly on NumPy arrays, fixing the issue reported here (https://issues.apache.org/jira/browse/SPARK-2012).
      
      If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so it will work on arrays with NumPy, but will also work without NumPy.
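
The fallback described above can be sketched as follows. This is a minimal illustration, not the StatCounter source: `merge_extremes` is a hypothetical helper that merges only the max and min parts of two summaries.

```python
import math

try:
    # Element-wise versions that work on arrays as well as scalars.
    from numpy import maximum, minimum, sqrt
except ImportError:
    # Scalar fallbacks when NumPy is not installed.
    maximum = max
    minimum = min
    sqrt = math.sqrt


def merge_extremes(a, b):
    """Merge two (max, min) summaries (hypothetical helper for illustration)."""
    return maximum(a[0], b[0]), minimum(a[1], b[1])


hi, lo = merge_extremes((3.0, 1.0), (5.0, 0.5))
print(hi, lo)
```

Because the names are bound once at import time, the merge code itself does not need to know whether NumPy is present.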
      
      New unit tests added, along with a check for NumPy in the tests.
      
      Author: Jeremy Freeman <the.freeman.lab@gmail.com>
      
      Closes #1725 from freeman-lab/numpy-max-statcounter and squashes the following commits:
      
      fe973b1 [Jeremy Freeman] Avoid duplicate array import in tests
      7f0e397 [Jeremy Freeman] Refactored check for numpy
      8e764dd [Jeremy Freeman] Explicit numpy imports
      875414c [Jeremy Freeman] Fixed indents
      1c8a832 [Jeremy Freeman] Unit tests for StatCounter with NumPy arrays
      176a127 [Jeremy Freeman] Use numpy arrays in StatCounter
      4bc3bb29
    • Burak's avatar
      [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator.... · fda47598
      Burak authored
      [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator. RandomRDD is now of generic type
      
      The RandomRDDGenerators used to only output RDD[Double].
      Now RandomRDDGenerators.randomRDD can be used to generate a random RDD[T] via a class that extends RandomDataGenerator, by supplying a type T and overriding the nextValue() function as they wish.
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #1732 from brkyvz/SPARK-2801 and squashes the following commits:
      
      c94a694 [Burak] [SPARK-2801][MLlib] Missing ClassTags added
      22d96fe [Burak] [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator, generic types added for RandomRDD instead of Double
      fda47598
  2. Aug 01, 2014
    • Tor Myklebust's avatar
      [SPARK-1580][MLLIB] Estimate ALS communication and computation costs. · e25ec061
      Tor Myklebust authored
      Continue the work from #493.
      
      Closes #493 and Closes #593
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1731 from mengxr/tmyklebu-alscost and squashes the following commits:
      
      9b56a8b [Xiangrui Meng] updated API and added a simple test
      68a3229 [Xiangrui Meng] merge master
      217bd1d [Tor Myklebust] Documentation and choleskies -> subproblems.
      8cbb718 [Tor Myklebust] Braces get spaces.
      0455cd4 [Tor Myklebust] Parens for collectAsMap.
      2b2febe [Tor Myklebust] Use `makeLinkRDDs` when estimating costs.
      2ab7a5d [Tor Myklebust] Reindent estimateCost's declaration and make it return Seqs.
      8b21e6d [Tor Myklebust] Fix overlong lines.
      8cbebf1 [Tor Myklebust] Rename and clean up the return format of cost estimator.
      6615ed5 [Tor Myklebust] It's more useful to give per-partition estimates.  Do that.
      5530678 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into alscost
      6c31324 [Tor Myklebust] Make it actually build...
      a1184d1 [Tor Myklebust] Mark ALS.evaluatePartitioner DeveloperApi.
      657a71b [Tor Myklebust] Simple-minded estimates of computation and communication costs in ALS.
      dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
      23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
      495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      674933a [Tor Myklebust] Fix style.
      40edc23 [Tor Myklebust] Fix missing space.
      f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
      5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
      36a0f43 [Tor Myklebust] Make the partitioner private.
      d872b09 [Tor Myklebust] Add negative id ALS test.
      df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
      c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
      c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
      e25ec061
    • Michael Giannakopoulos's avatar
      [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. · c2811892
      Michael Giannakopoulos authored
      Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1624 from miccagiann/new-branch and squashes the following commits:
      
      c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
      8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
      fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
      44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
      8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
      638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
      ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
      b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
      78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
      3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
      c2811892
    • Jeremy Freeman's avatar
      Streaming mllib [SPARK-2438][MLLIB] · f6a18993
      Jeremy Freeman authored
      This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.
      
      __Summary of additions:__
      
      _StreamingLinearAlgorithm_
      - An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.
      
      _StreamingLinearRegressionWithSGD_
      - Class and companion object for running streaming linear regression
      
      _StreamingLinearRegressionTestSuite_
      - Unit tests
      
      _StreamingLinearRegression_
      - Example use case: fitting a model online to data from one stream, and making predictions on other data
      
      __Notes__
      - If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).
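
The online training idea described above can be sketched in plain Python. This is not the `StreamingLinearRegressionWithSGD` implementation; it is a minimal sketch in which each incoming batch triggers one gradient step on a least-squares objective. The function name `sgd_update`, the step size, and the toy data are all assumptions.

```python
def sgd_update(weights, batch, step_size=0.5):
    """Apply one least-squares gradient step using a batch of (features, label) pairs."""
    n = len(batch)
    if n == 0:
        return weights  # nothing arrived in this interval; keep the current model
    grad = [0.0] * len(weights)
    for features, label in batch:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for j, x in enumerate(features):
            grad[j] += error * x / n
    return [w - step_size * g for w, g in zip(weights, grad)]


# Toy "stream": the same batch arrives 200 times, generated from y = 2*x1 - x2.
batch = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
weights = [0.0, 0.0]  # initial weights must be supplied before training starts
for _ in range(200):
    weights = sgd_update(weights, batch)

print(weights)
```

The model is just the current weight vector, so predictions on another stream can use whatever weights are available at that moment.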
      
      Author: Jeremy Freeman <the.freeman.lab@gmail.com>
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:
      
      775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
      4086fee [Jeremy Freeman] Fixed current weight formatting
      8b95b27 [Jeremy Freeman] Restored broadcasting
      29f27ec [Jeremy Freeman] Formatting
      8711c41 [Jeremy Freeman] Used return to avoid indentation
      777b596 [Jeremy Freeman] Restored treeAggregate
      74cf440 [Jeremy Freeman] Removed static methods
      d28cf9a [Jeremy Freeman] Added usage notes
      c3326e7 [Jeremy Freeman] Improved documentation
      9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
      66eba5e [Jeremy Freeman] Fixed line lengths
      2fe0720 [Jeremy Freeman] Minor cleanup
      7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
      b9b69f6 [Jeremy Freeman] Added setter methods
      c3f8b5a [Jeremy Freeman] Modified logging
      00aafdc [Jeremy Freeman] Add modifiers
      14b801e [Jeremy Freeman] Name changes
      c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
      4b0a5d3 [Jeremy Freeman] Cleaned up tests
      74188d6 [Jeremy Freeman] Eliminate dependency on commons
      50dd237 [Jeremy Freeman] Removed experimental tag
      6bfe1e6 [Jeremy Freeman] Fixed imports
      a2a63ad [freeman] Makes convergence test more robust
      86220bc [freeman] Streaming linear regression unit tests
      fb4683a [freeman] Minor changes for scalastyle consistency
      fd31e03 [freeman] Changed logging behavior
      453974e [freeman] Fixed indentation
      c4b1143 [freeman] Streaming linear regression
      604f4d7 [freeman] Expanded private class to include mllib
      d99aa85 [freeman] Helper methods for streaming MLlib apps
      0898add [freeman] Added dependency on streaming
      f6a18993
    • Josh Rosen's avatar
      [SPARK-2764] Simplify daemon.py process structure · e8e0fd69
      Josh Rosen authored
      Currently, daemon.py forks a pool of numProcessors subprocesses, and those processes fork themselves again to create the actual Python worker processes that handle data.
      
      I think that this extra layer of indirection is unnecessary and adds a lot of complexity.  This commit attempts to remove this middle layer of subprocesses by launching the workers directly from daemon.py.
      
      See https://github.com/mesos/spark/pull/563 for the original PR that added daemon.py, where I raise some issues with the current design.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1680 from JoshRosen/pyspark-daemon and squashes the following commits:
      
      5abbcb9 [Josh Rosen] Replace magic number: 4 -> EINTR
      5495dff [Josh Rosen] Throw IllegalStateException if worker launch fails.
      b79254d [Josh Rosen] Detect failed fork() calls; improve error logging.
      282c2c4 [Josh Rosen] Remove daemon.py exit logging, since it caused problems:
      8554536 [Josh Rosen] Fix daemon’s shutdown(); log shutdown reason.
      4e0fab8 [Josh Rosen] Remove shared-memory exit_flag; don't die on worker death.
      e9892b4 [Josh Rosen] [WIP] [SPARK-2764] Simplify daemon.py process structure.
      e8e0fd69
    • GuoQiang Li's avatar
      [SPARK-2800]: Exclude scalastyle-output.xml Apache RAT checks · a38d3c9e
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1729 from witgo/SPARK-2800 and squashes the following commits:
      
      13ca966 [GuoQiang Li] Add scalastyle-output.xml  to .rat-excludes file
      a38d3c9e
    • Albert Chu's avatar
      [SPARK-2116] Load spark-defaults.conf from SPARK_CONF_DIR if set · 0da07da5
      Albert Chu authored
      If SPARK_CONF_DIR environment variable is set, search it for spark-defaults.conf.
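
The lookup rule can be sketched as below. This is an illustration, not the actual Spark code: the function name `resolve_defaults_file` is hypothetical, and the fallback to a `conf` directory under the Spark home is an assumption made for the example.

```python
import os


def resolve_defaults_file(env, spark_home):
    """Return the path where spark-defaults.conf is looked up (hypothetical helper)."""
    conf_dir = env.get("SPARK_CONF_DIR")
    if conf_dir:
        # SPARK_CONF_DIR is set: search that directory.
        return os.path.join(conf_dir, "spark-defaults.conf")
    # Assumed fallback: the conf directory under the Spark home.
    return os.path.join(spark_home, "conf", "spark-defaults.conf")


print(resolve_defaults_file({"SPARK_CONF_DIR": "/etc/spark"}, "/opt/spark"))
print(resolve_defaults_file({}, "/opt/spark"))
```

Passing the environment as a dictionary (for example `os.environ`) keeps the function easy to test.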
      
      Author: Albert Chu <chu11@llnl.gov>
      
      Closes #1059 from chu11/SPARK-2116 and squashes the following commits:
      
      9f3ac94 [Albert Chu] SPARK-2116: If SPARK_CONF_DIR environment variable is set, search it for spark-defaults.conf.
      0da07da5
    • Yin Huai's avatar
      [SPARK-2212][SQL] Hash Outer Join (follow-up bug fix). · 3822f33f
      Yin Huai authored
      We need to carefully set the outputPartitioning of the HashOuterJoin operator. Otherwise, we may not correctly handle nulls.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1721 from yhuai/SPARK-2212-BugFix and squashes the following commits:
      
      ed5eef7 [Yin Huai] Correctly choosing outputPartitioning for the HashOuterJoin operator.
      3822f33f
    • Davies Liu's avatar
      [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD · 880eabec
      Davies Liu authored
      Convert each Row in a JavaSchemaRDD into an Array[Any], unpickle it as a tuple in Python, and then convert it into a namedtuple, so that users can access fields just like attributes.

      This lets nested structures be accessed as objects. It also reduces the size of the serialized data and improves performance.
      
      root
       |-- field1: integer (nullable = true)
       |-- field2: string (nullable = true)
       |-- field3: struct (nullable = true)
       |    |-- field4: integer (nullable = true)
       |    |-- field5: array (nullable = true)
       |    |    |-- element: integer (containsNull = false)
       |-- field6: array (nullable = true)
       |    |-- element: struct (containsNull = false)
       |    |    |-- field7: string (nullable = true)
      
      Then we can access them by row.field3.field5[0]  or row.field6[5].field7
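
The attribute-style access shown above can be imitated with plain `collections.namedtuple`, without Spark. The type names `Record`, `Field3` and `Element` and the sample values are assumptions for illustration; they mirror the schema printed above but are not produced by PySpark.

```python
from collections import namedtuple

# Hypothetical stand-ins for the rows PySpark would build from the schema above.
Field3 = namedtuple("Field3", ["field4", "field5"])
Element = namedtuple("Element", ["field7"])
Record = namedtuple("Record", ["field1", "field2", "field3", "field6"])

row = Record(
    field1=1,
    field2="a",
    field3=Field3(field4=2, field5=[10, 20]),
    field6=[Element(field7="x")],
)

print(row.field3.field5[0])  # 10
print(row.field6[0].field7)  # x
```

Nested structs become nested tuples with named fields, while arrays stay ordinary lists, so attribute access and indexing can be chained.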
      
      It also will infer the schema in Python, convert Row/dict/namedtuple/objects into tuple before serialization, then call applySchema in JVM. During inferSchema(), the top level of dict in row will be StructType, but any nested dictionary will be MapType.
      
      You can use pyspark.sql.Row to convert an unnamed structure into a Row object, so that the schema of the RDD can be inferred. For example:
      
      ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))
      
      Or you could use Row to create a class just like namedtuple, for example:
      
      Person = Row("name", "age")
      ctx.inferSchema(rdd.map(lambda x: Person(*x)))
      
      Also, you can call applySchema to apply a schema to an RDD of tuples/lists and turn it into a SchemaRDD. The `schema` should be a StructType; see the API docs for details.
      
      schema = StructType([StructField("name", StringType, True),
                           StructField("age", IntegerType, True)])
      ctx.applySchema(rdd, schema)
      
      PS: In order to use namedtuple to inferSchema, you should make namedtuple picklable.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1598 from davies/nested and squashes the following commits:
      
      f1d15b6 [Davies Liu] verify schema with the first few rows
      8852aaf [Davies Liu] check type of schema
      abe9e6e [Davies Liu] address comments
      61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
      1e5b801 [Davies Liu] improve cache of classes
      51aa135 [Davies Liu] use Row to infer schema
      e9c0d5c [Davies Liu] remove string typed schema
      353a3f2 [Davies Liu] fix code style
      63de8f8 [Davies Liu] fix typo
      c79ca67 [Davies Liu] fix serialization of nested data
      6b258b5 [Davies Liu] fix pep8
      9d8447c [Davies Liu] apply schema provided by string of names
      f5df97f [Davies Liu] refactor, address comments
      9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
      84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
      0eaaf56 [Davies Liu] fix doc tests
      b3559b4 [Davies Liu] use generated Row instead of namedtuple
      c4ddc30 [Davies Liu] fix conflict between name of fields and variables
      7f6f251 [Davies Liu] address all comments
      d69d397 [Davies Liu] refactor
      2cc2d45 [Davies Liu] refactor
      182fb46 [Davies Liu] refactor
      bc6e9e1 [Davies Liu] switch to new Schema API
      547bf3e [Davies Liu] Merge branch 'master' into nested
      a435b5a [Davies Liu] add docs and code refactor
      2c8debc [Davies Liu] Merge branch 'master' into nested
      644665a [Davies Liu] use tuple and namedtuple for schemardd
      880eabec
    • Joseph K. Bradley's avatar
      [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features · 7058a539
      Joseph K. Bradley authored
      Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      
      Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"
      
      Bug fix: Modified upper bound discussed above.
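
To see why the two bounds differ, the sketch below evaluates the expression quoted in the bug description next to the arity. The function names are hypothetical and this is not MLlib code; it only compares the two numbers.

```python
def unordered_upper_bound(arity):
    # Expression quoted in the bug description: 2^(arity - 1) - 1
    return 2 ** (arity - 1) - 1


def ordered_upper_bound(arity):
    # For ordered categorical features the bound is the arity itself
    return arity


for arity in (2, 3, 4, 5):
    print(arity, unordered_upper_bound(arity), ordered_upper_bound(arity))
# 2 1 2
# 3 3 3
# 4 7 4
# 5 15 5
```

The two bounds happen to coincide at arity 3 and disagree elsewhere, including at arity 2, which is the case the new test covers.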
      
      Also: Small improvements to coding style in DecisionTree.
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1720 from jkbradley/decisiontree-bugfix2 and squashes the following commits:
      
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      7058a539
    • Doris Xin's avatar
      [SPARK-2786][mllib] Python correlations · d88e6956
      Doris Xin authored
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1713 from dorx/pythonCorrelation and squashes the following commits:
      
      5f1e60c [Doris Xin] reviewer comments.
      46ff6eb [Doris Xin] reviewer comments.
      ad44085 [Doris Xin] style fix
      e69d446 [Doris Xin] fixed missed conflicts.
      eb5bf56 [Doris Xin] merge master
      cc9f725 [Doris Xin] units passed.
      9141a63 [Doris Xin] WIP2
      d199f1f [Doris Xin] Moved correlation names into a public object
      cd163d6 [Doris Xin] WIP
      d88e6956
    • Aaron Davidson's avatar
      SPARK-2791: Fix committing, reverting and state tracking in shuffle file consolidation · 78f2af58
      Aaron Davidson authored
      All changes from this PR are by mridulm and are drawn from his work in #1609. This patch is intended to fix all major issues related to shuffle file consolidation that mridulm found, while minimizing changes to the code, with the hope that it may be more easily merged into 1.1.
      
      This patch is **not** intended as a replacement for #1609, which provides many additional benefits, including fixes to ExternalAppendOnlyMap, improvements to DiskBlockObjectWriter's API, and several new unit tests.
      
      If it is feasible to merge #1609 for the 1.1 deadline, that is a preferable option.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1678 from aarondav/consol and squashes the following commits:
      
      53b3f6d [Aaron Davidson] Correct behavior when writing unopened file
      701d045 [Aaron Davidson] Rebase with sort-based shuffle
      9160149 [Aaron Davidson] SPARK-2532: Minimal shuffle consolidation fixes
      78f2af58
    • joyyoj's avatar
      [SPARK-2379] Fix the bug that streaming's receiver may fall into a dead loop · b270309d
      joyyoj authored
      Author: joyyoj <sunshch@gmail.com>
      
      Closes #1694 from joyyoj/SPARK-2379 and squashes the following commits:
      
      d73790d [joyyoj] SPARK-2379 Fix the bug that streaming's receiver may fall into a dead loop
      22e7821 [joyyoj] Merge remote-tracking branch 'apache/master'
      3f4a602 [joyyoj] Merge remote-tracking branch 'remotes/apache/master'
      f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly
      b270309d
    • zsxwing's avatar
      SPARK-1612: Fix potential resource leaks · f5d9bea2
      zsxwing authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-1612
      
      Move the "close" statements into a "finally" block.
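
The pattern is language independent. A minimal Python sketch, using a hypothetical `Resource` class that always fails on read, shows why the close call belongs in a `finally` block:

```python
class Resource:
    """Hypothetical resource whose read() always fails, to simulate an error."""

    def __init__(self):
        self.closed = False

    def read(self):
        raise IOError("simulated failure")

    def close(self):
        self.closed = True


def read_leaky(res):
    data = res.read()
    res.close()  # skipped when read() raises, so the resource leaks
    return data


def read_safely(res):
    try:
        return res.read()
    finally:
        res.close()  # runs whether or not read() raises


leaky, safe = Resource(), Resource()
for fn, res in ((read_leaky, leaky), (read_safely, safe)):
    try:
        fn(res)
    except IOError:
        pass
print(leaky.closed, safe.closed)  # False True
```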
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #535 from zsxwing/SPARK-1612 and squashes the following commits:
      
      ae52f50 [zsxwing] Update to follow the code style
      549ba13 [zsxwing] SPARK-1612: Fix potential resource leaks
      f5d9bea2
    • Liang-Chi Hsieh's avatar
      [SPARK-2490] Change recursive visiting on RDD dependencies to iterative approach · baf9ce1a
      Liang-Chi Hsieh authored
      When transformations are applied to RDDs over many iterations, the dependency chains of the RDDs can become very long. Recursively visiting these dependencies in Spark core can then easily cause a StackOverflowError. For example:
      
          var rdd = sc.makeRDD(Array(1))
          for (i <- 1 to 1000) {
            rdd = rdd.coalesce(1).cache()
            rdd.collect()
          }
      
      This PR changes the recursive visiting of an RDD's dependencies to an iterative approach to avoid StackOverflowError.
      
      Besides the recursive visiting, the Java serializer has a known [bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that also causes StackOverflowError when serializing/deserializing a large graph of objects, so this PR only solves part of the problem. Replacing the Java serializer with KryoSerializer might help. However, since KryoSerializer is not currently supported for `spark.closure.serializer`, I cannot test whether it solves the Java serializer's problem completely.
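
The recursive-to-iterative change can be sketched outside Spark. The `Node` class and both functions below are hypothetical stand-ins for an RDD and its dependency traversal, not the code in this PR; the sketch only shows how an explicit stack avoids deep call stacks.

```python
import sys


class Node:
    """Hypothetical stand-in for an RDD with a list of dependencies."""

    def __init__(self, deps=()):
        self.deps = list(deps)


def count_recursive(node):
    # One call-stack frame (or more) per level of the dependency chain
    return 1 + sum(count_recursive(d) for d in node.deps)


def count_iterative(root):
    # An explicit stack replaces the call stack, so depth is bounded by memory
    seen, stack, count = set(), [root], 0
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        count += 1
        stack.extend(node.deps)
    return count


# Build a dependency chain deeper than the interpreter's recursion limit.
depth = sys.getrecursionlimit() * 2
node = Node()
for _ in range(depth):
    node = Node([node])

print(count_iterative(node) == depth + 1)  # True
try:
    count_recursive(node)
except RecursionError:
    print("recursive version overflowed the call stack")
```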
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #1418 from viirya/remove_recursive_visit and squashes the following commits:
      
      6b2c615 [Liang-Chi Hsieh] change function name; comply with code style.
      5f072a7 [Liang-Chi Hsieh] add comments to explain Stack usage.
      8742dbb [Liang-Chi Hsieh] comply with code style.
      900538b [Liang-Chi Hsieh] change recursive visiting on rdd's dependencies to iterative approach to avoid stackoverflowerror.
      baf9ce1a
    • Aaron Staple's avatar
      [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions. · eb5bdcaf
      Aaron Staple authored
      getPreferredLocs traverses a dependency graph of partitions using depth first search.  Given a complex dependency graph, the old implementation may explore a set of paths in the graph that is exponential in the number of nodes.  By maintaining a set of visited nodes the new implementation avoids revisiting nodes, preventing exponential blowup.
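
The blowup is easy to reproduce on a small layered graph. In the sketch below (hypothetical helper names, not the DAGScheduler code) every node depends on both nodes of the next layer, so the number of paths doubles per layer while the number of nodes grows linearly.

```python
def build_layers(levels):
    """Adjacency dict where each node depends on both nodes of the next layer."""
    graph = {}
    for i in range(levels):
        for side in "ab":
            graph[(i, side)] = [(i + 1, "a"), (i + 1, "b")]
    graph[(levels, "a")] = []
    graph[(levels, "b")] = []
    return graph


def visits_without_set(graph, node):
    # Plain depth-first search: revisits a node once per path that reaches it
    return 1 + sum(visits_without_set(graph, d) for d in graph[node])


def visits_with_set(graph, node, visited=None):
    # Depth-first search that skips nodes it has already seen
    visited = set() if visited is None else visited
    if node in visited:
        return 0
    visited.add(node)
    return 1 + sum(visits_with_set(graph, d, visited) for d in graph[node])


graph = build_layers(15)
print(visits_without_set(graph, (0, "a")))  # 65535
print(visits_with_set(graph, (0, "a")))     # 31
```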
      
      Some comment and whitespace cleanups are also included.
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #1362 from staple/SPARK-695 and squashes the following commits:
      
      ecea0f3 [Aaron Staple] address review comments
      751c661 [Aaron Staple] [SPARK-695] Add a unit test.
      5adf326 [Aaron Staple] Replace getPreferredLocsInternal's HashMap argument with a simpler HashSet.
      58e37d0 [Aaron Staple] Replace comment documenting NarrowDependency.
      6751ced [Aaron Staple] Revert "Remove unused variable."
      04c7097 [Aaron Staple] Fix indentation.
      0030884 [Aaron Staple] Remove unused variable.
      33f67c6 [Aaron Staple] Clarify comment.
      4e42b46 [Aaron Staple] Remove apparently incorrect comment describing NarrowDependency.
      65c2d3d [Aaron Staple] [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions.
      eb5bdcaf
    • CrazyJvm's avatar
      [SQL] Documentation: Explain cacheTable command · c82fe478
      CrazyJvm authored
      add the `cacheTable` specification
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1681 from CrazyJvm/sql-programming-guide-cache and squashes the following commits:
      
      0a231e0 [CrazyJvm] grammar fixes
      a04020e [CrazyJvm] modify title to Cached tables
      18b6594 [CrazyJvm] fix format
      2cbbf58 [CrazyJvm] add cacheTable guide
      c82fe478
    • Cheng Hao's avatar
      [SPARK-2767] [SQL] SparkSQL CLI doens't output error message if query failed. · c0b47bad
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1686 from chenghao-intel/spark_sql_cli and squashes the following commits:
      
      eb664cc [Cheng Hao] Output detailed failure message in console
      93b0382 [Cheng Hao] Fix Bug of no output in cli if exception thrown internally
      c0b47bad
    • chutium's avatar
      [SPARK-2729] [SQL] Forgot to match Timestamp type in ColumnBuilder · 580c7011
      chutium authored
      Just a forgotten match case, found after SPARK-2710: TimestampType can be used by a SchemaRDD generated from a JDBC ResultSet.
      
      Author: chutium <teng.qiu@gmail.com>
      
      Closes #1636 from chutium/SPARK-2729 and squashes the following commits:
      
      71af77a [chutium] [SPARK-2729] [SQL] added Timestamp in NullableColumnAccessorSuite
      39cf9f8 [chutium] [SPARK-2729] add Timestamp Type into ColumnBuilder TestSuite, ref. #1636
      ab6ff97 [chutium] [SPARK-2729] Forgot to match Timestamp type in ColumnBuilder
      580c7011
    • Cheng Hao's avatar
      [SQL][SPARK-2212]Hash Outer Join · 4415722e
      Cheng Hao authored
      This patch adds support for hash-based outer joins. Currently, outer joins of big relations fall back to `BroadcastNestedLoopJoin`, which is super slow. This PR builds 2 hash tables, one for each relation in the same partition, which greatly reduces the table scans.
      
      Here is the testing code that I used:
      ```
      package org.apache.spark.sql.hive
      
      import org.apache.spark.SparkContext
      import org.apache.spark.SparkConf
      import org.apache.spark.sql._
      
      case class Record(key: String, value: String)
      
      object JoinTablePrepare extends App {
        import TestHive2._
      
        val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"${i % 828193}", s"val_$i")))
      
        runSqlHive("SHOW TABLES")
        runSqlHive("DROP TABLE if exists a")
        runSqlHive("DROP TABLE if exists b")
        runSqlHive("DROP TABLE if exists result")
        rdd.registerAsTable("records")
      
        runSqlHive("""CREATE TABLE a (key STRING, value STRING)
                       | ROW FORMAT SERDE
                       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                       | STORED AS RCFILE
                     """.stripMargin)
        runSqlHive("""CREATE TABLE b (key STRING, value STRING)
                       | ROW FORMAT SERDE
                       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                       | STORED AS RCFILE
                     """.stripMargin)
        runSqlHive("""CREATE TABLE result (key STRING, value STRING)
                       | ROW FORMAT SERDE
                       | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                       | STORED AS RCFILE
                     """.stripMargin)
      
        hql(s"""from records
                   | insert into table a
                   | select key, value
                 """.stripMargin)
        hql(s"""from records
                   | insert into table b select key + 100000, value
                 """.stripMargin)
      }
      
      object JoinTablePerformanceTest extends App {
        import TestHive2._
      
        hql("SHOW TABLES")
        hql("set spark.sql.shuffle.partitions=20")
      
        val leftOuterJoin = "insert overwrite table result select a.key, b.value from a left outer join b on a.key=b.key"
        val rightOuterJoin = "insert overwrite table result select a.key, b.value from a right outer join b on a.key=b.key"
        val fullOuterJoin = "insert overwrite table result select a.key, b.value from a full outer join b on a.key=b.key"
      
        val results = ("LeftOuterJoin", benchmark(leftOuterJoin)) :: ("LeftOuterJoin", benchmark(leftOuterJoin)) ::
                      ("RightOuterJoin", benchmark(rightOuterJoin)) :: ("RightOuterJoin", benchmark(rightOuterJoin)) ::
                      ("FullOuterJoin", benchmark(fullOuterJoin)) :: ("FullOuterJoin", benchmark(fullOuterJoin)) :: Nil
        val explains = hql(s"explain $leftOuterJoin").collect ++ hql(s"explain $rightOuterJoin").collect ++ hql(s"explain $fullOuterJoin").collect
        println(explains.mkString(",\n"))
        results.foreach { case (prompt, result) => {
            println(s"$prompt: took ${result._1} ms (${result._2} records)")
          }
        }
      
        def benchmark(cmd: String) = {
          val begin = System.currentTimeMillis()
          val result = hql(cmd)
          val end = System.currentTimeMillis()
          val count = hql("select count(1) from result").collect.mkString("")
          ((end - begin), count)
        }
      }
      ```
      And the results are shown below:
      ```
      [Physical execution plan:],
      [InsertIntoHiveTable (MetastoreRelation default, result, None), Map(), true],
      [ Project [key#95,value#98]],
      [  HashOuterJoin [key#95], [key#97], LeftOuter, None],
      [   Exchange (HashPartitioning [key#95], 20)],
      [    HiveTableScan [key#95], (MetastoreRelation default, a, None), None],
      [   Exchange (HashPartitioning [key#97], 20)],
      [    HiveTableScan [key#97,value#98], (MetastoreRelation default, b, None), None],
      [Physical execution plan:],
      [InsertIntoHiveTable (MetastoreRelation default, result, None), Map(), true],
      [ Project [key#102,value#105]],
      [  HashOuterJoin [key#102], [key#104], RightOuter, None],
      [   Exchange (HashPartitioning [key#102], 20)],
      [    HiveTableScan [key#102], (MetastoreRelation default, a, None), None],
      [   Exchange (HashPartitioning [key#104], 20)],
      [    HiveTableScan [key#104,value#105], (MetastoreRelation default, b, None), None],
      [Physical execution plan:],
      [InsertIntoHiveTable (MetastoreRelation default, result, None), Map(), true],
      [ Project [key#109,value#112]],
      [  HashOuterJoin [key#109], [key#111], FullOuter, None],
      [   Exchange (HashPartitioning [key#109], 20)],
      [    HiveTableScan [key#109], (MetastoreRelation default, a, None), None],
      [   Exchange (HashPartitioning [key#111], 20)],
      [    HiveTableScan [key#111,value#112], (MetastoreRelation default, b, None), None]
      LeftOuterJoin: took 16072 ms ([3000000] records)
      LeftOuterJoin: took 14394 ms ([3000000] records)
      RightOuterJoin: took 14802 ms ([3000000] records)
      RightOuterJoin: took 14747 ms ([3000000] records)
      FullOuterJoin: took 17715 ms ([6000000] records)
      FullOuterJoin: took 17629 ms ([6000000] records)
      ```
      
      Without this PR, the benchmark seems to never finish.
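
The idea of building one hash table per relation, described at the top of this message, can be sketched in plain Python for the full outer case. The function `hash_full_outer_join` is hypothetical and is not the HashOuterJoin operator; in particular it does not treat null keys specially.

```python
from collections import defaultdict


def hash_full_outer_join(left, right):
    """Full outer join of two lists of (key, value) pairs, one hash table per side."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_table[k].append(v)
    for k, v in right:
        right_table[k].append(v)

    out = []
    # Each left key is looked up once in the right table instead of scanning it.
    for k, lvals in left_table.items():
        rvals = right_table.get(k)
        if rvals:
            out.extend((k, lv, rv) for lv in lvals for rv in rvals)
        else:
            out.extend((k, lv, None) for lv in lvals)
    # Right rows with no matching left key are padded with None.
    for k, rvals in right_table.items():
        if k not in left_table:
            out.extend((k, None, rv) for rv in rvals)
    return out


joined = hash_full_outer_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")])
print(joined)  # [(1, 'a', None), (2, 'b', 'x'), (3, None, 'y')]
```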
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1147 from chenghao-intel/hash_based_outer_join and squashes the following commits:
      
      65c599e [Cheng Hao] Fix issues with the community comments
      72b1394 [Cheng Hao] Fix bug of stale value in joinedRow
      55baef7 [Cheng Hao] Add HashOuterJoin
      4415722e
    • Yin Huai's avatar
      [SPARK-2179][SQL] A minor refactoring Java data type APIs (2179 follow-up). · c41fdf04
      Yin Huai authored
      It is a follow-up PR of SPARK-2179 (https://issues.apache.org/jira/browse/SPARK-2179). It makes package names of data type APIs more consistent across languages (Scala: `org.apache.spark.sql`, Java: `org.apache.spark.sql.api.java`, Python: `pyspark.sql`).
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1712 from yhuai/javaDataType and squashes the following commits:
      
      62eb705 [Yin Huai] Move package-info.
      add4bcb [Yin Huai] Make the package names of data type classes consistent across languages by moving all Java data type classes to package sql.api.java.
      c41fdf04
    • Sandy Ryza's avatar
      SPARK-2099. Report progress while task is running. · 8d338f64
      Sandy Ryza authored
      This is a sketch of a patch that allows the UI to show metrics for tasks that have not yet completed.  It adds a heartbeat every 2 seconds from the executors to the driver, reporting metrics for all of the executor's tasks.
      
      It still needs unit tests, polish, and cluster testing, but I wanted to put it up to get feedback on the approach.
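
The heartbeat approach can be sketched with a background thread. The `Heartbeater` class, its callbacks, and the sample metrics are hypothetical and unrelated to Spark's executor code; the 2 second default mirrors the description above (a later commit in this PR raises the interval to 10 seconds).

```python
import threading


class Heartbeater:
    """Periodically sends a snapshot of metrics for running tasks."""

    def __init__(self, get_metrics, send, interval_s=2.0):
        self._get_metrics = get_metrics
        self._send = send
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat_once(self):
        self._send(self._get_metrics())

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called
        while not self._stop.wait(self._interval_s):
            self.beat_once()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


running_tasks = {"task-1": {"records_read": 100}}
received = []  # stands in for the driver side
hb = Heartbeater(lambda: dict(running_tasks), received.append)
hb.beat_once()
print(received)  # [{'task-1': {'records_read': 100}}]
```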
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1056 from sryza/sandy-spark-2099 and squashes the following commits:
      
      93b9fdb [Sandy Ryza] Up heartbeat interval to 10 seconds and other tidying
      132aec7 [Sandy Ryza] Heartbeat and HeartbeatResponse are already Serializable as case classes
      38dffde [Sandy Ryza] Additional review feedback and restore test that was removed in BlockManagerSuite
      51fa396 [Sandy Ryza] Remove hostname race, add better comments about threading, and some stylistic improvements
      3084f10 [Sandy Ryza] Make TaskUIData a case class again
      3bda974 [Sandy Ryza] Stylistic fixes
      0dae734 [Sandy Ryza] SPARK-2099. Report progress while task is running.
      8d338f64
    • Xiangrui Meng's avatar
      [HOTFIX] downgrade breeze version to 0.7 · 5328c0aa
      Xiangrui Meng authored
      breeze-0.8.1 causes dependency issues, as discussed in #940 .
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1718 from mengxr/revert-breeze and squashes the following commits:
      
      99c4681 [Xiangrui Meng] downgrade breeze version to 0.7
      5328c0aa
    • witgo's avatar
      [SPARK-1997] update breeze to version 0.8.1 · 0dacb1ad
      witgo authored
      `breeze 0.8.1` depends on `scala-logging-slf4j 2.1.1`. The relevant code is in #1369.
      
      Author: witgo <witgo@qq.com>
      
      Closes #940 from witgo/breeze-8.0.1 and squashes the following commits:
      
      65cc65e [witgo] update breeze  to version 0.8.1
      0dacb1ad
    • Sean Owen's avatar
      SPARK-2768 [MLLIB] Add product, user recommend method to MatrixFactorizationModel · 82d209d4
      Sean Owen authored
      Right now, `MatrixFactorizationModel` can only predict a score for one or more `(user,product)` tuples. As a comment in the file notes, it would be more useful to expose a recommend method, that computes top N scoring products for a user (or vice versa – users for a product).
      
      (This also corrects some long lines in the Java ALS test suite.)
      
      As you can see, it's a little messy to access the class from Java. Should there be a Java-friendly wrapper for it? With a pointer about where that should go, I could add that.
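
A top-N recommend method of the kind described above can be sketched in plain Python, assuming the score of a (user, product) pair is the dot product of their feature vectors. The function name and the sample feature vectors are hypothetical; this is not the MLlib implementation.

```python
import heapq


def recommend_products(user_features, product_features, user, n):
    """Return the n (product, score) pairs with the highest score for `user`."""
    u = user_features[user]
    scores = ((p, sum(a * b for a, b in zip(u, f)))
              for p, f in product_features.items())
    return heapq.nlargest(n, scores, key=lambda ps: ps[1])


# Hypothetical feature vectors of rank 2.
user_features = {1: [1.0, 0.0]}
product_features = {"a": [0.5, 2.0], "b": [2.0, 0.0], "c": [1.0, 1.0]}
print(recommend_products(user_features, product_features, 1, 2))
# [('b', 2.0), ('c', 1.0)]
```

Recommending users for a product is the same computation with the two feature maps swapped.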
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1687 from srowen/SPARK-2768 and squashes the following commits:
      
      b349675 [Sean Owen] Additional review changes
      c9edb04 [Sean Owen] Updates from code review
      7bc35f9 [Sean Owen] Add recommend methods to MatrixFactorizationModel
      82d209d4
    • jerryshao's avatar
      [SPARK-2103][Streaming] Change to ClassTag for KafkaInputDStream and fix reflection issue · a32f0fb7
      jerryshao authored
      This PR changes the Manifest previously used for KafkaInputDStream's Decoder to a ClassTag, and also fixes the problem described in [SPARK-2103](https://issues.apache.org/jira/browse/SPARK-2103).
      
      The previous Java interface could not actually get the type of the Decoder, so using this Manifest to reconstruct the decoder object caused a reflection exception.
      
      Also, for the other two Java interfaces, ClassTag[String] is unnecessary because calling the Scala API will get the right implicit ClassTag.
      
      The current Kafka unit tests cannot actually verify the interface. I've tested these interfaces in my local and distributed settings.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1508 from jerryshao/SPARK-2103 and squashes the following commits:
      
      e90c37b [jerryshao] Add Mima excludes
      7529810 [jerryshao] Change Manifest to ClassTag for KafkaInputDStream's Decoder and fix Decoder construct issue when using Java API
      a32f0fb7
    • Ye Xianjin's avatar
      [Spark 2557] fix LOCAL_N_REGEX in createTaskScheduler and make local-n and... · 284771ef
      Ye Xianjin authored
      [Spark 2557] fix LOCAL_N_REGEX in createTaskScheduler and make local-n and local-n-failures consistent
      
      [SPARK-2557](https://issues.apache.org/jira/browse/SPARK-2557)
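
The kind of master-string matching this change concerns can be sketched with Python regular expressions. The patterns and the `parse_master` helper below are illustrative assumptions, not necessarily the exact expressions used in Spark; they accept a thread count or `*`, optionally followed by a failure count.

```python
import re

# Illustrative patterns: "local[N]" / "local[*]" and "local[N, F]" / "local[*, F]"
LOCAL_N = re.compile(r"local\[([0-9]+|\*)\]")
LOCAL_N_FAILURES = re.compile(r"local\[([0-9]+|\*)\s*,\s*([0-9]+)\]")


def _threads(text):
    # "*" means all cores on the machine; otherwise an explicit thread count
    return "*" if text == "*" else int(text)


def parse_master(master):
    """Return (threads, max_failures), or None if the string is malformed."""
    m = LOCAL_N.fullmatch(master)
    if m:
        return (_threads(m.group(1)), None)
    m = LOCAL_N_FAILURES.fullmatch(master)
    if m:
        return (_threads(m.group(1)), int(m.group(2)))
    return None


print(parse_master("local[4]"))     # (4, None)
print(parse_master("local[*, 2]"))  # ('*', 2)
print(parse_master("local[2*]"))    # None
```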
      
      Author: Ye Xianjin <advancedxy@gmail.com>
      
      Closes #1464 from advancedxy/SPARK-2557 and squashes the following commits:
      
      d844d67 [Ye Xianjin] add local-*-n-failures, bad-local-n, bad-local-n-failures test case
      3bbc668 [Ye Xianjin] fix LOCAL_N_REGEX regular expression and make local_n_failures accept * as all cores on the computer
      284771ef