  1. May 05, 2015
    • zsxwing's avatar
      [SPARK-7350] [STREAMING] [WEBUI] Attach the Streaming tab when calling ssc.start() · c6d1efba
      zsxwing authored
      There is no point in displaying the Streaming tab before `ssc.start()` is called, so we should attach it in the `ssc.start` method instead.
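      The pattern here (attach the tab only when the context actually starts, not at construction) can be sketched generically; the class and method names below are illustrative stand-ins, not Spark's actual API:

      ```python
      class WebUI:
          """Minimal stand-in for a web UI that tracks attached tabs."""
          def __init__(self):
              self.tabs = []

          def attach_tab(self, name):
              self.tabs.append(name)


      class StreamingContext:
          """Sketch: defer tab attachment until start() is called."""
          def __init__(self, ui):
              self.ui = ui  # nothing attached yet; the tab would be meaningless here

          def start(self):
              # The tab only becomes meaningful once streaming actually starts.
              self.ui.attach_tab("Streaming")


      ui = WebUI()
      ssc = StreamingContext(ui)
      assert ui.tabs == []             # before start(): no Streaming tab
      ssc.start()
      assert ui.tabs == ["Streaming"]  # after start(): tab attached
      ```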
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5898 from zsxwing/SPARK-7350 and squashes the following commits:
      
      e676487 [zsxwing] Attach the Streaming tab when calling ssc.start()
      c6d1efba
    • zsxwing's avatar
      [SPARK-5074] [CORE] [TESTS] Fix the flakey test 'run shuffle with map stage... · 5ffc73e6
      zsxwing authored
      [SPARK-5074] [CORE] [TESTS] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite
      
      Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/2240/testReport/junit/org.apache.spark.scheduler/DAGSchedulerSuite/run_shuffle_with_map_stage_failure/
      
      This is because many tests share the same `JobListener`, and `scheduler` isn't stopped after each test, so it is actually still running. While the test `run shuffle with map stage failure` runs, a previous test may trigger the [ResubmitFailedStages](https://github.com/apache/spark/blob/ebc25a4ddfe07a67668217cec59893bc3b8cf730/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1120) logic, report `jobFailed`, and override the global `failure` variable.
      
      This PR uses `after` to call `scheduler.stop()` for each test.
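      The fix is the standard per-test teardown pattern; a minimal sketch in Python's `unittest` (names illustrative — the real fix uses ScalaTest's `after` to call `scheduler.stop()`):

      ```python
      import unittest


      class FakeScheduler:
          """Stand-in for a scheduler that keeps running unless stopped."""
          def __init__(self):
              self.running = True

          def stop(self):
              self.running = False


      class DAGSchedulerSuiteSketch(unittest.TestCase):
          def setUp(self):
              self.scheduler = FakeScheduler()

          def tearDown(self):
              # Stop the scheduler after *every* test so a previous test's
              # background work cannot leak into the next one.
              self.scheduler.stop()

          def test_example(self):
              self.assertTrue(self.scheduler.running)


      suite = unittest.TestLoader().loadTestsFromTestCase(DAGSchedulerSuiteSketch)
      outcome = unittest.TextTestRunner(verbosity=0).run(suite)
      ```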
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5903 from zsxwing/SPARK-5074 and squashes the following commits:
      
      1e6f13e [zsxwing] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite
      5ffc73e6
    • Liang-Chi Hsieh's avatar
      [MINOR] Minor update for document · b83091ae
      Liang-Chi Hsieh authored
      Fixes two minor documentation errors in `BytesToBytesMap` and `UnsafeRow`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5906 from viirya/minor_doc and squashes the following commits:
      
      27f9089 [Liang-Chi Hsieh] Minor update for doc.
      b83091ae
    • Imran Rashid's avatar
      [SPARK-3454] separate json endpoints for data in the UI · d4973580
      Imran Rashid authored
      Exposes data available in the UI as JSON over HTTP. Key points:
      
      * new endpoints, handled independently of the existing XyzPage classes; the root entry point is `JsonRootResource`
      * uses Jersey + Jackson for routing and for converting POJOs into JSON
      * tests against known results in `HistoryServerSuite`
      * also fixes some minor issues with the UI: synchronizing access to `StorageListener` and `StorageStatusListener`, and fixing some inconsistencies in the way we handle retained jobs and stages.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #4435 from squito/SPARK-3454 and squashes the following commits:
      
      da1e35f [Imran Rashid] typos etc.
      5e78b4f [Imran Rashid] fix rendering problems
      5ae02ad [Imran Rashid] Merge branch 'master' into SPARK-3454
      f016182 [Imran Rashid] change all constructors json-pojo class constructors to be private[spark] to protect us from mima-false-positives if we add fields
      3347b72 [Imran Rashid] mark EnumUtil as @Private
      ec140a2 [Imran Rashid] create @Private
      cc1febf [Imran Rashid] add docs on the metrics-as-json api
      cbaf287 [Imran Rashid] Merge branch 'master' into SPARK-3454
      56db31e [Imran Rashid] update tests for mulit-attempt
      7f3bc4e [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
      67008b4 [Imran Rashid] rats
      9e51400 [Imran Rashid] style
      c9bae1c [Imran Rashid] handle multiple attempts per app
      b87cd63 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
      188762c [Imran Rashid] multi-attempt
      2af11e5 [Imran Rashid] Merge branch 'master' into SPARK-3454
      befff0c [Imran Rashid] review feedback
      14ac3ed [Imran Rashid] jersey-core needs to be explicit; move version & scope to parent pom.xml
      f90680e [Imran Rashid] Merge branch 'master' into SPARK-3454
      dc8a7fe [Imran Rashid] style, fix errant comments
      acb7ef6 [Imran Rashid] fix indentation
      7bf1811 [Imran Rashid] move MetricHelper so mima doesnt think its exposed; comments
      9d889d6 [Imran Rashid] undo some unnecessary changes
      f48a7b0 [Imran Rashid] docs
      52bbae8 [Imran Rashid] StorageListener & StorageStatusListener needs to synchronize internally to be thread-safe
      31c79ce [Imran Rashid] asm no longer needed for SPARK_PREPEND_CLASSES
      b2f8b91 [Imran Rashid] @DeveloperApi
      2e19be2 [Imran Rashid] lazily convert ApplicationInfo to avoid memory overhead
      ba3d9d2 [Imran Rashid] upper case enums
      39ac29c [Imran Rashid] move EnumUtil
      d2bde77 [Imran Rashid] update error handling & scoping
      4a234d3 [Imran Rashid] avoid jersey-media-json-jackson b/c of potential version conflicts
      a157a2f [Imran Rashid] style
      7bd4d15 [Imran Rashid] delete security test, since it doesnt do anything
      a325563 [Imran Rashid] style
      a9c5cf1 [Imran Rashid] undo changes superceeded by master
      0c6f968 [Imran Rashid] update deps
      1ed0d07 [Imran Rashid] Merge branch 'master' into SPARK-3454
      4c92af6 [Imran Rashid] style
      f2e63ad [Imran Rashid] Merge branch 'master' into SPARK-3454
      c22b11f [Imran Rashid] fix compile error
      9ea682c [Imran Rashid] go back to good ol' java enums
      cf86175 [Imran Rashid] style
      d493b38 [Imran Rashid] Merge branch 'master' into SPARK-3454
      f05ae89 [Imran Rashid] add in ExecutorSummaryInfo for MiMa :(
      101a698 [Imran Rashid] style
      d2ef58d [Imran Rashid] revert changes that had HistoryServer refresh the application listing more often
      b136e39b [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
      e031719 [Imran Rashid] fixes from review
      1f53a66 [Imran Rashid] style
      b4a7863 [Imran Rashid] fix compile error
      2c8b7ee [Imran Rashid] rats
      1578a4a [Imran Rashid] doc
      674f8dc [Imran Rashid] more explicit about total numbers of jobs & stages vs. number retained
      9922be0 [Imran Rashid] Merge branch 'master' into stage_distributions
      f5a5196 [Imran Rashid] undo removal of renderJson from MasterPage, since there is no substitute yet
      db61211 [Imran Rashid] get JobProgressListener directly from UI
      fdfc181 [Imran Rashid] stage/taskList
      63eb4a6 [Imran Rashid] tests for taskSummary
      ad27de8 [Imran Rashid] error handling on quantile values
      b2efcaf [Imran Rashid] cleanup, combine stage-related paths into one resource
      aaba896 [Imran Rashid] wire up task summary
      a4b1397 [Imran Rashid] stage metric distributions
      e48ba32 [Imran Rashid] rename
      eaf3bbb [Imran Rashid] style
      25cd894 [Imran Rashid] if only given day, assume GMT
      51eaedb [Imran Rashid] more visibility fixes
      9f28b7e [Imran Rashid] ack, more cleanup
      99764e1 [Imran Rashid] Merge branch 'SPARK-3454_w_jersey' into SPARK-3454
      a61a43c [Imran Rashid] oops, remove accidental checkin
      a066055 [Imran Rashid] set visibility on a lot of classes
      1f361c8 [Imran Rashid] update rat-excludes
      0be5120 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      2382bef [Imran Rashid] switch to using new "enum"
      fef6605 [Imran Rashid] some utils for working w/ new "enum" format
      dbfc7bf [Imran Rashid] style
      b86bcb0 [Imran Rashid] update test to look at one stage attempt
      5f9df24 [Imran Rashid] style
      7fd156a [Imran Rashid] refactor jsonDiff to avoid code duplication
      73f1378 [Imran Rashid] test json; also add test cases for cleaned stages & jobs
      97d411f [Imran Rashid] json endpoint for one job
      0c96147 [Imran Rashid] better error msgs for bad stageId vs bad attemptId
      dddbd29 [Imran Rashid] stages have attempt; jobs are sorted; resource for all attempts for one stage
      190c17a [Imran Rashid] StagePage should distinguish no task data, from unknown stage
      84cd497 [Imran Rashid] AllJobsPage should still report correct completed & failed job count, even if some have been cleaned, to make it consistent w/ AllStagesPage
      36e4062 [Imran Rashid] SparkUI needs to know about startTime, so it can list its own applicationInfo
      b4c75ed [Imran Rashid] fix merge conflicts; need to widen visibility in a few cases
      e91750a [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      56d2fc7 [Imran Rashid] jersey needs asm for SPARK_PREPEND_CLASSES to work
      f7df095 [Imran Rashid] add test for accumulables, and discover that I need update after all
      9c0c125 [Imran Rashid] add accumulableInfo
      00e9cc5 [Imran Rashid] more style
      3377e61 [Imran Rashid] scaladoc
      d05f7a9 [Imran Rashid] dont use case classes for status api POJOs, since they have binary compatibility issues
      654cecf [Imran Rashid] move all the status api POJOs to one file
      b86e2b0 [Imran Rashid] style
      18a8c45 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      5598f19 [Imran Rashid] delete some unnecessary code, more to go
      56edce0 [Imran Rashid] style
      017c755 [Imran Rashid] add in metrics now available
      1b78cb7 [Imran Rashid] fix some import ordering
      0dc3ea7 [Imran Rashid] if app isnt found, reload apps from FS before giving up
      c7d884f [Imran Rashid] fix merge conflicts
      0c12b50 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      b6a96a8 [Imran Rashid] compare json by AST, not string
      cd37845 [Imran Rashid] switch to using java.util.Dates for times
      a4ab5aa [Imran Rashid] add in explicit dependency on jersey 1.9 -- maven wasn't happy before this
      4fdc39f [Imran Rashid] refactor case insensitive enum parsing
      cba1ef6 [Imran Rashid] add security (maybe?) for metrics json
      f0264a7 [Imran Rashid] switch to using jersey for metrics json
      bceb3a9 [Imran Rashid] set http response code on error, some testing
      e0356b6 [Imran Rashid] put new test expectation files in rat excludes (is this OK?)
      b252e7a [Imran Rashid] small cleanup of accidental changes
      d1a8c92 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
      4b398d0 [Imran Rashid] expose UI data as json in new endpoints
      d4973580
    • Jihong MA's avatar
      [SPARK-7357] Improving HBaseTest example · 51f46200
      Jihong MA authored
      Author: Jihong MA <linlin200605@gmail.com>
      
      Closes #5904 from JihongMA/SPARK-7357 and squashes the following commits:
      
      7d6153a [Jihong MA] SPARK-7357 Improving HBaseTest example
      51f46200
    • Sandy Ryza's avatar
      [SPARK-5112] Expose SizeEstimator as a developer api · 4222da68
      Sandy Ryza authored
      "The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
      -the Tuning Spark page
      
      This is a pain. It would be much nicer to simply expose functionality for understanding the memory footprint of a Java object.
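      A rough sense of what such an API offers can be sketched with Python's `sys.getsizeof`, walking an object graph to approximate a deep size. This is only an analogy for illustration, not Spark's `SizeEstimator` algorithm:

      ```python
      import sys


      def estimate_size(obj, seen=None):
          """Approximate deep size in bytes by walking containers (a shallow
          analogy to estimating the footprint of a JVM object graph)."""
          if seen is None:
              seen = set()
          if id(obj) in seen:
              return 0  # count shared objects only once
          seen.add(id(obj))
          size = sys.getsizeof(obj)
          if isinstance(obj, dict):
              size += sum(estimate_size(k, seen) + estimate_size(v, seen)
                          for k, v in obj.items())
          elif isinstance(obj, (list, tuple, set, frozenset)):
              size += sum(estimate_size(item, seen) for item in obj)
          return size


      small = [1, 2, 3]
      big = [list(range(100)) for _ in range(10)]
      assert estimate_size(big) > estimate_size(small)
      ```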
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:
      
      8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
      2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
      93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
      e21c1f4 [Sandy Ryza] Remove unused import
      798ab88 [Sandy Ryza] Update documentation and add to SparkContext
      34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
      4222da68
    • shekhar.bansal's avatar
      [SPARK-6653] [YARN] New config to specify port for sparkYarnAM actor system · fc8feaa8
      shekhar.bansal authored
      Author: shekhar.bansal <shekhar.bansal@guavus.com>
      
      Closes #5719 from zuxqoj/master and squashes the following commits:
      
      5574ff7 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
      5117258 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
      9de5330 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
      456a592 [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
      803e93e [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
      fc8feaa8
    • zsxwing's avatar
      [SPARK-7341] [STREAMING] [TESTS] Fix the flaky test: org.apache.spark.stre... · 4d29867e
      zsxwing authored
      ...aming.InputStreamsSuite.socket input stream
      
      Remove non-deterministic "Thread.sleep" and use deterministic strategies to fix the flaky failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2127/testReport/junit/org.apache.spark.streaming/InputStreamsSuite/socket_input_stream/
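      The general technique — replace `Thread.sleep` polling with an explicit synchronization primitive that the producer signals — can be sketched as follows; class and method names are illustrative, not the test suite's actual helpers:

      ```python
      import threading


      class BatchCounterSketch:
          """Deterministic wait: the consumer blocks on an event the producer
          sets, instead of sleeping for an arbitrary, flaky amount of time."""
          def __init__(self):
              self._done = threading.Event()
              self.batches = 0

          def record_batch(self):
              self.batches += 1
              self._done.set()

          def wait_for_batch(self, timeout=5.0):
              # Returns True as soon as a batch is recorded; no fixed sleep.
              return self._done.wait(timeout)


      counter = BatchCounterSketch()
      producer = threading.Thread(target=counter.record_batch)
      producer.start()
      assert counter.wait_for_batch()  # deterministic: wakes when signalled
      producer.join()
      assert counter.batches == 1
      ```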
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5891 from zsxwing/SPARK-7341 and squashes the following commits:
      
      611157a [zsxwing] Add wait methods to BatchCounter and use BatchCounter in InputStreamsSuite
      014b58f [zsxwing] Use withXXX to clean up the resources
      c9bf746 [zsxwing] Move 'waitForStart' into the 'start' method and fix the code style
      9d0de6d [zsxwing] [SPARK-7341][Streaming][Tests] Fix the flaky test: org.apache.spark.streaming.InputStreamsSuite.socket input stream
      4d29867e
    • jerryshao's avatar
      [SPARK-7113] [STREAMING] Support input information reporting for Direct Kafka stream · 8436f7e9
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5879 from jerryshao/SPARK-7113 and squashes the following commits:
      
      b0b506c [jerryshao] Address the comments
      0babe66 [jerryshao] Support input information reporting for Direct Kafka stream
      8436f7e9
    • Tathagata Das's avatar
      [HOTFIX] [TEST] Ignoring flaky tests · 8776fe0b
      Tathagata Das authored
      org.apache.spark.DriverSuite.driver should exit after finishing without cleanup (SPARK-530)
      https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2267/
      
      org.apache.spark.deploy.SparkSubmitSuite.includes jars passed in through --jars
      https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/
      
      org.apache.spark.streaming.flume.FlumePollingStreamSuite.flume polling test
      https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2269/
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5901 from tdas/ignore-flaky-tests and squashes the following commits:
      
      9cd8667 [Tathagata Das] Ignoring tests.
      8776fe0b
    • Tathagata Das's avatar
      [SPARK-7139] [STREAMING] Allow received block metadata to be saved to WAL and... · 1854ac32
      Tathagata Das authored
      [SPARK-7139] [STREAMING] Allow received block metadata to be saved to WAL and recovered on driver failure
      
      - Enabled ReceivedBlockTracker WAL by default
      - Stored block metadata in the WAL
      - Optimized WALBackedBlockRDD by skipping block fetch when the block is known to not exist in Spark
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5732 from tdas/SPARK-7139 and squashes the following commits:
      
      575476e [Tathagata Das] Added more tests to get 100% coverage of the WALBackedBlockRDD
      19668ba [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7139
      685fab3 [Tathagata Das] Addressed comments in PR
      637bc9c [Tathagata Das] Changed segment to handle
      466212c [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7139
      5f67a59 [Tathagata Das] Fixed HdfsUtils to handle append in local file system
      1bc5bc3 [Tathagata Das] Fixed bug on unexpected recovery
      d06fa21 [Tathagata Das] Enabled ReceivedBlockTracker by default, stored block metadata and optimized block fetching in WALBackedBlockRDD
      1854ac32
    • Marcelo Vanzin's avatar
      [MINOR] [BUILD] Declare ivy dependency in root pom. · c5790a2f
      Marcelo Vanzin authored
      Without this, any dependency that pulls in ivy transitively may override
      the version and potentially cause issues. On my machine, the hive tests
      were pulling an old version of ivy, and subsequently failing with a
      "NoSuchMethodError".
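      A hedged sketch of what declaring the dependency in the root `pom.xml` might look like; the groupId/artifactId are ivy's real Maven coordinates, but the version shown is illustrative:

      ```xml
      <!-- Pin ivy explicitly so transitive resolution cannot downgrade it. -->
      <dependency>
        <groupId>org.apache.ivy</groupId>
        <artifactId>ivy</artifactId>
        <version>2.4.0</version>
      </dependency>
      ```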
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5893 from vanzin/ivy-dep-fix and squashes the following commits:
      
      ea2112d [Marcelo Vanzin] [minor] [build] Declare ivy dependency in root pom.
      c5790a2f
    • Niccolo Becchi's avatar
      [MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and... · da738cff
      Niccolo Becchi authored
      [MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and kmeans.py to simplify readability
      
      With the previous naming it could look as if `reduceByKey` summed the abscissas and ordinates of some 2D points separately. Renaming the variables should make the example easier to understand, especially for readers who, like me, are just starting with functional programming.
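      The point of the rename can be illustrated in plain Python: the reduction sums whole point vectors per cluster key, not x and y coordinates separately. A local sketch of the `reduceByKey` step, not the example's actual code:

      ```python
      def add_points(p, q):
          # Element-wise sum of two point vectors (e.g. 2D points).
          return [a + b for a, b in zip(p, q)]


      # Each record is (closest_center_index, (point_vector, 1)).
      records = [
          (0, ([1.0, 2.0], 1)),
          (0, ([3.0, 4.0], 1)),
          (1, ([5.0, 6.0], 1)),
      ]


      def reduce_by_key(pairs):
          acc = {}
          for key, (vec, count) in pairs:
              if key in acc:
                  s, c = acc[key]
                  acc[key] = (add_points(s, vec), c + count)
              else:
                  acc[key] = (vec, count)
          return acc


      sums = reduce_by_key(records)
      # New center = vector sum / point count, per cluster.
      centers = {k: [x / c for x in s] for k, (s, c) in sums.items()}
      assert centers[0] == [2.0, 3.0]
      assert centers[1] == [5.0, 6.0]
      ```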
      
      Author: Niccolo Becchi <niccolo.becchi@gmail.com>
      Author: pippobaudos <niccolo.becchi@gmail.com>
      
      Closes #5875 from pippobaudos/patch-1 and squashes the following commits:
      
      3bb3a47 [pippobaudos] renamed variables in LocalKMeans.scala and kmeans.py to simplify readability
      2c2a7a2 [Niccolo Becchi] Update SparkKMeans.scala
      da738cff
    • Xiangrui Meng's avatar
      [SPARK-7314] [SPARK-3524] [PYSPARK] upgrade Pyrolite to 4.4 · e9b16e67
      Xiangrui Meng authored
      This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5850 from mengxr/SPARK-7314 and squashes the following commits:
      
      2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314
      da3c2dd [Xiangrui Meng] remove my repo
      fe7e29b [Xiangrui Meng] switch to maven central
      6ddac0e [Xiangrui Meng] reverse the machine code for float/double
      d2d5b5b [Xiangrui Meng] change back to 4.4
      7824a9c [Xiangrui Meng] use Pyrolite 3.1
      cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
      e9b16e67
  2. May 04, 2015
    • Bryan Cutler's avatar
      [SPARK-7236] [CORE] Fix to prevent AkkaUtils askWithReply from sleeping on final attempt · 8aa5aea7
      Bryan Cutler authored
      Added a check so that if `AkkaUtils.askWithReply` is on the final attempt, it will not sleep for the `retryInterval`.  This should also prevent the thread from sleeping for `Int.Max` when using `askWithReply` with default values for `maxAttempts` and `retryInterval`.
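      The fix is the usual "do not sleep after the last attempt" guard in a retry loop; a generic sketch under illustrative names, not the `AkkaUtils` API:

      ```python
      def ask_with_retry(attempt_fn, max_attempts, sleep_fn):
          """Retry attempt_fn up to max_attempts times, sleeping between
          attempts -- but never *after* the final, failed attempt."""
          last_error = None
          for attempt in range(1, max_attempts + 1):
              try:
                  return attempt_fn()
              except Exception as e:
                  last_error = e
                  if attempt < max_attempts:  # the guard this PR adds
                      sleep_fn()
          raise last_error


      sleeps = []

      def always_fail():
          raise RuntimeError("no reply")

      try:
          ask_with_retry(always_fail, max_attempts=3,
                         sleep_fn=lambda: sleeps.append(1))
      except RuntimeError:
          pass
      assert len(sleeps) == 2  # 3 attempts, only 2 sleeps: none after the last
      ```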
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #5896 from BryanCutler/askWithReply-sleep-7236 and squashes the following commits:
      
      653a07b [Bryan Cutler] [SPARK-7236] Fix to prevent AkkaUtils askWithReply from sleeping on final attempt
      8aa5aea7
    • Reynold Xin's avatar
      [SPARK-7266] Add ExpectsInputTypes to expressions when possible. · 678c4da0
      Reynold Xin authored
      This should give us analysis-time error messages (rather than runtime ones) and automatic type casting.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5796 from rxin/expected-input-types and squashes the following commits:
      
      c900760 [Reynold Xin] [SPARK-7266] Add ExpectsInputTypes to expressions when possible.
      678c4da0
    • Burak Yavuz's avatar
      [SPARK-7243][SQL] Contingency Tables for DataFrames · 80554111
      Burak Yavuz authored
      Computes a pair-wise frequency table of the given columns. Also known as cross-tabulation.
      cc mengxr rxin
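      The operation itself, pair-wise frequency counting, can be sketched without Spark using `collections.Counter`; a local analogy of what cross-tabulation computes, with one row per distinct value of the first column:

      ```python
      from collections import Counter


      def crosstab(rows, col1, col2):
          """Pair-wise frequency table of two columns (a local sketch of
          what DataFrame cross-tabulation computes)."""
          counts = Counter((row[col1], row[col2]) for row in rows)
          values1 = sorted({row[col1] for row in rows})
          values2 = sorted({row[col2] for row in rows})
          # One row per distinct value of col1, one column per value of col2.
          return {v1: {v2: counts[(v1, v2)] for v2 in values2}
                  for v1 in values1}


      rows = [
          {"age": 20, "dept": "eng"},
          {"age": 20, "dept": "eng"},
          {"age": 30, "dept": "sales"},
      ]
      table = crosstab(rows, "age", "dept")
      assert table[20]["eng"] == 2
      assert table[30]["eng"] == 0
      ```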
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5842 from brkyvz/df-cont and squashes the following commits:
      
      a07c01e [Burak Yavuz] addressed comments v4.1
      ae9e01d [Burak Yavuz] fix test
      9106585 [Burak Yavuz] addressed comments v4.0
      bced829 [Burak Yavuz] fix merge conflicts
      a63ad00 [Burak Yavuz] addressed comments v3.0
      a0cad97 [Burak Yavuz] addressed comments v3.0
      6805df8 [Burak Yavuz] addressed comments and fixed test
      939b7c4 [Burak Yavuz] lint python
      7f098bc [Burak Yavuz] add crosstab pyTest
      fd53b00 [Burak Yavuz] added python support for crosstab
      27a5a81 [Burak Yavuz] implemented crosstab
      80554111
    • Andrew Or's avatar
      [SPARK-6943] [SPARK-6944] DAG visualization on SparkUI · fc8b5819
      Andrew Or authored
      This patch adds the functionality to display the RDD DAG on the SparkUI.
      
      This DAG describes the relationships between
      - an RDD and its dependencies,
      - an RDD and its operation scopes, and
      - an RDD's operation scopes and the stage / job hierarchy
      
      An operation scope here refers to the existing public APIs that created the RDDs (e.g. `textFile`, `treeAggregate`). In the future, we can expand this to include higher level operations like SQL queries.
      
      *Note: This blatantly stole a few lines of HTML and JavaScript from #5547 (thanks shroffpradyumn!)*
      
      Here's what the job page looks like:
      <img src="https://issues.apache.org/jira/secure/attachment/12730286/job-page.png" width="700px"/>
      and the stage page:
      <img src="https://issues.apache.org/jira/secure/attachment/12730287/stage-page.png" width="300px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5729 from andrewor14/viz2 and squashes the following commits:
      
      666c03b [Andrew Or] Round corners of RDD boxes on stage page (minor)
      01ba336 [Andrew Or] Change RDD cache color to red (minor)
      6f9574a [Andrew Or] Add tests for RDDOperationScope
      1c310e4 [Andrew Or] Wrap a few more RDD functions in an operation scope
      3ffe566 [Andrew Or] Restore "null" as default for RDD name
      5fdd89d [Andrew Or] children -> child (minor)
      0d07a84 [Andrew Or] Fix python style
      afb98e2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
      0d7aa32 [Andrew Or] Fix python tests
      3459ab2 [Andrew Or] Fix tests
      832443c [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
      429e9e1 [Andrew Or] Display cached RDDs on the viz
      b1f0fd1 [Andrew Or] Rename OperatorScope -> RDDOperationScope
      31aae06 [Andrew Or] Extract visualization logic from listener
      83f9c58 [Andrew Or] Implement a programmatic representation of operator scopes
      5a7faf4 [Andrew Or] Rename references to viz scopes to viz clusters
      ee33d52 [Andrew Or] Separate HTML generating code from listener
      f9830a2 [Andrew Or] Refactor + clean up + document JS visualization code
      b80cc52 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
      0706992 [Andrew Or] Add link from jobs to stages
      deb48a0 [Andrew Or] Translate stage boxes taking into account the width
      5c7ce16 [Andrew Or] Connect RDDs across stages + update style
      ab91416 [Andrew Or] Introduce visualization to the Job Page
      5f07e9c [Andrew Or] Remove more return statements from scopes
      5e388ea [Andrew Or] Fix line too long
      43de96e [Andrew Or] Add parent IDs to StageInfo
      6e2cfea [Andrew Or] Remove all return statements in `withScope`
      d19c4da [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
      7ef957c [Andrew Or] Fix scala style
      4310271 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
      aa868a9 [Andrew Or] Ensure that HadoopRDD is actually serializable
      c3bfcae [Andrew Or] Re-implement scopes using closures instead of annotations
      52187fc [Andrew Or] Rat excludes
      09d361e [Andrew Or] Add ID to node label (minor)
      71281fa [Andrew Or] Embed the viz in the UI in a toggleable manner
      8dd5af2 [Andrew Or] Fill in documentation + miscellaneous minor changes
      fe7816f [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz
      205f838 [Andrew Or] Reimplement rendering with dagre-d3 instead of viz.js
      5e22946 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz
      6a7cdca [Andrew Or] Move RDD scope util methods and logic to its own file
      494d5c2 [Andrew Or] Revert a few unintended style changes
      9fac6f3 [Andrew Or] Re-implement scopes through annotations instead
      f22f337 [Andrew Or] First working implementation of visualization with vis.js
      2184348 [Andrew Or] Translate RDD information to dot file
      5143523 [Andrew Or] Expose the necessary information in RDDInfo
      a9ed4f9 [Andrew Or] Add a few missing scopes to certain RDD methods
      6b3403b [Andrew Or] Scope all RDD methods
      fc8b5819
    • 云峤's avatar
      [SPARK-7319][SQL] Improve the output from DataFrame.show() · f32e69ec
      云峤 authored
      Author: 云峤 <chensong.cs@alibaba-inc.com>
      
      Closes #5865 from kaka1992/df.show and squashes the following commits:
      
      c79204b [云峤] Update
      a1338f6 [云峤] Update python dataFrame show test and add empty df unit test.
      734369c [云峤] Update python dataFrame show test and add empty df unit test.
      84aec3e [云峤] Update python dataFrame show test and add empty df unit test.
      159b3d5 [云峤] update
      03ef434 [云峤] update
      7394fd5 [云峤] update test show
      ced487a [云峤] update pep8
      b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show
      30ac311 [云峤] [SPARK-7294] ADD BETWEEN
      7d62368 [云峤] [SPARK-7294] ADD BETWEEN
      baf839b [云峤] [SPARK-7294] ADD BETWEEN
      d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
      f32e69ec
    • Xiangrui Meng's avatar
      [SPARK-5956] [MLLIB] Pipeline components should be copyable. · e0833c59
      Xiangrui Meng authored
      This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
      
      ~~~scala
      def fit(dataset: DataFrame, extra: ParamMap): Model = {
        copy(extra).fit(dataset)
      }
      ~~~
      
      Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
      
      ~~~scala
      val effectiveRegParam = $(regParam) / yStd
      val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
      val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
      ~~~
      
      A meta-algorithm like `Pipeline` implements its own `copy(extra)`, so the fitted pipeline model stores all copied stages (no matter whether a stage is a transformer or a model).
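      The `copy(extra)` contract can be sketched generically: merge the instance's current param map with the extra values into a fresh instance, leaving the original untouched. A Python sketch of the semantics, not the Scala API:

      ```python
      class ParamsSketch:
          """Minimal copy(extra) semantics: a fresh instance whose params are
          the current params overridden by the extra map."""
          def __init__(self, params=None):
              self.params = dict(params or {})

          def copy(self, extra):
              merged = dict(self.params)
              merged.update(extra)  # extra values win, as in copy(extra)
              return ParamsSketch(merged)

          def fit(self, dataset, extra=None):
              # Default implementation: fit(dataset, extra) == copy(extra).fit(dataset)
              if extra:
                  return self.copy(extra).fit(dataset)
              return {"model_params": dict(self.params), "n": len(dataset)}


      est = ParamsSketch({"regParam": 0.1, "maxIter": 10})
      model = est.fit([1, 2, 3], extra={"regParam": 0.5})
      assert model["model_params"]["regParam"] == 0.5  # extra overrides
      assert model["model_params"]["maxIter"] == 10    # default preserved
      assert est.params["regParam"] == 0.1             # original untouched
      ```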
      
      Other changes:
      * `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
      * `fittingParamMap` was removed because the `parent` carries this information.
      * `validate` was renamed to `validateParams` to be more precise.
      
      TODOs:
      * [x] add tests for newly added methods
      * [ ] update documentation
      
      jkbradley dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
      
      7bef88d [Xiangrui Meng] address comments
      05229c3 [Xiangrui Meng] assert -> assertEquals
      b2927b1 [Xiangrui Meng] organize imports
      f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
      93e7924 [Xiangrui Meng] add tests for hasParam & copy
      463ecae [Xiangrui Meng] merge master
      2b954c3 [Xiangrui Meng] update Binarizer
      465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
      282a1a8 [Xiangrui Meng] fix test
      819dd2d [Xiangrui Meng] merge master
      b642872 [Xiangrui Meng] example code runs
      5a67779 [Xiangrui Meng] examples compile
      c76b4d1 [Xiangrui Meng] fix all unit tests
      0f4fd64 [Xiangrui Meng] fix some tests
      9286a22 [Xiangrui Meng] copyValues to trained models
      53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
      9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
      d882afc [Xiangrui Meng] test compile
      f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
      e0833c59
    • Andrew Or's avatar
      [MINOR] Fix python test typo? · 5a1a1075
      Andrew Or authored
      I suspect we haven't been using Anaconda in the tests for a while. I wonder whether this change actually does anything, but the line as it stands looks strictly less correct.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5883 from andrewor14/fix-run-tests-typo and squashes the following commits:
      
      a3ad720 [Andrew Or] Fix typo?
      5a1a1075
    • tianyi's avatar
      [SPARK-5100] [SQL] add webui for thriftserver · 343d3bfa
      tianyi authored
      This PR is a rebased version of #3946, mainly focused on creating an independent tab for the Thrift server in the Spark web UI.
      
      Features:
      
      1. Session-related statistics (username and IP are only supported in hive-0.13.1)
      2. Lists all the SQL statements executing or executed on this server
      3. Provides links to the jobs generated by a SQL statement
      4. Provides a link to show all SQL statements executing or executed in a specified session
      
      Prototype snapshots:
      
      This is the main page for thrift server
      
      ![image](https://cloud.githubusercontent.com/assets/1411869/7361379/df7dcc64-ed89-11e4-9964-4df0b32f475e.png)
      
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #5730 from tianyi/SPARK-5100 and squashes the following commits:
      
      cfd14c7 [tianyi] style fix
      0efe3d5 [tianyi] revert part of pom change
      c0f2fa0 [tianyi] extends HiveThriftJdbcTest to start/stop thriftserver for UI test
      aa20408 [tianyi] fix style problem
      c9df6f9 [tianyi] add testsuite for thriftserver ui and fix some style issue
      9830199 [tianyi] add webui for thriftserver
      343d3bfa
    • Yuhao Yang's avatar
      [SPARK-5563] [MLLIB] LDA with online variational inference · 3539cb7d
      Yuhao Yang authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-5563
      The PR contains an implementation of [Online LDA](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) based on the work of Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. Its major advantages are stream compatibility and economical time/memory consumption due to the corpus split. For more details, please refer to the JIRA.
      
Online LDA can act as a fast option for LDA, and will be especially helpful for users who need a quick result or who work with a large corpus.
      
Correctness test:
      I have tested current PR with https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation.
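The core of the online algorithm is the stochastic update of the global topic-word parameter from each minibatch. Below is a minimal, purely illustrative sketch of that update (the learning-rate schedule and blend step from Hoffman, Blei & Bach); the names `learning_rate` and `blend_update` are hypothetical stand-ins, not Spark's `OnlineLDAOptimizer` code.

```python
# Illustrative sketch of the global update step in online variational
# inference for LDA. kappa in (0.5, 1] guarantees convergence of the
# stochastic updates; tau0 down-weights early iterations.

def learning_rate(t, tau0=1024.0, kappa=0.51):
    """rho_t = (tau0 + t)^(-kappa), the weight given to the minibatch."""
    return (tau0 + t) ** (-kappa)

def blend_update(lam, lam_hat, t):
    """Blend the global topic-word parameter with a minibatch estimate."""
    rho = learning_rate(t)
    return [(1.0 - rho) * a + rho * b for a, b in zip(lam, lam_hat)]

lam_hat = [5.0, 0.5, 2.0]               # estimate from the current minibatch
lam = blend_update([1.0, 1.0, 1.0], lam_hat, t=0)
```

Each minibatch nudges the global parameter a shrinking step toward the minibatch estimate, which is what makes the algorithm stream-compatible and memory-economic.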
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4419 from hhbyyh/ldaonline and squashes the following commits:
      
      1045eec [Yuhao Yang] Merge pull request #2 from jkbradley/hhbyyh-ldaonline2
      cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors.  Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.
      6149ca6 [Yuhao Yang] fix for setOptimizer
      cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      54cf8da [Yuhao Yang] some style change
      68c2318 [Yuhao Yang] add a java ut
      4041723 [Yuhao Yang] add ut
      138bfed [Yuhao Yang] Merge pull request #1 from jkbradley/hhbyyh-ldaonline-update
      9e910d9 [Joseph K. Bradley] small fix
      61d60df [Joseph K. Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style)
      a996a82 [Yuhao Yang] respond to comments
      b1178cf [Yuhao Yang] fit into the optimizer framework
      dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      d19ef55 [Yuhao Yang] change OnlineLDA to class
      97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      e7bf3b0 [Yuhao Yang] move to seperate file
      f367cc9 [Yuhao Yang] change to optimization
      8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      02d0373 [Yuhao Yang] fix style in comment
      f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline
      d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      a570c9a [Yuhao Yang] use sample to pick up batch
      4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
      e271eb1 [Yuhao Yang] remove non ascii
      581c623 [Yuhao Yang] seperate API and adjust batch split
      37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline
      20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i
      aa365d1 [Yuhao Yang] merge upstream master
      3a06526 [Yuhao Yang] merge with new example
      0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline
      0d0f3ee [Yuhao Yang] replace random split with sliding
      fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline
      45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s
      f41c5ca [Yuhao Yang] style fix
      26dca1b [Yuhao Yang] style fix and make class private
      043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: 	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
      d640d9c [Yuhao Yang] online lda initial checkin
      3539cb7d
  3. May 03, 2015
    • Burak Yavuz's avatar
      [SPARK-7241] Pearson correlation for DataFrames · 9646018b
      Burak Yavuz authored
Submitting this PR from a phone, excuse the brevity.
Adds Pearson correlation to DataFrames, reusing the covariance calculation code
      
      cc mengxr rxin
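The reuse of the covariance code follows from the definition: corr(x, y) = cov(x, y) / (sd(x) · sd(y)). A small illustrative sketch (the helper names here are hypothetical; `DataFrame.stat.corr` is the actual API this PR adds):

```python
# Pearson correlation expressed in terms of the covariance it reuses:
# corr(x, y) = cov(x, y) / sqrt(cov(x, x) * cov(y, y)).
import math

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear data -> ~1.0
```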
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5858 from brkyvz/df-corr and squashes the following commits:
      
      285b838 [Burak Yavuz] addressed comments v2.0
      d10babb [Burak Yavuz] addressed comments v0.2
      4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
      4fe693b [Burak Yavuz] addressed comments v0.1
      a682d06 [Burak Yavuz] ready for PR
      9646018b
    • Xiangrui Meng's avatar
      [SPARK-7329] [MLLIB] simplify ParamGridBuilder impl · 1ffa8cb9
      Xiangrui Meng authored
      as suggested by justinuang on #5601.
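At its core, `ParamGridBuilder.build()` computes the cartesian product of every param's candidate values, one param map per combination. A hypothetical pure-Python sketch of that idea (the real builder returns `Array[ParamMap]`):

```python
# Build one param map per combination of candidate values.
from itertools import product

def build_grid(grid):
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]

maps = build_grid({"regParam": [0.1, 0.01], "maxIter": [10, 100]})
```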
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5873 from mengxr/SPARK-7329 and squashes the following commits:
      
      d08f9cf [Xiangrui Meng] simplify tests
      b7a7b9b [Xiangrui Meng] simplify grid build
      1ffa8cb9
    • Sean Owen's avatar
      [SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23 · 9e25b09f
      Sean Owen authored
      Remove references to Hadoop 0.23
      
      CC tgravescs Is this what you had in mind? basically all refs to 0.23?
      We don't support YARN 0.23, but also don't support Hadoop 0.23 anymore AFAICT. There are no builds or releases for it.
      
      In fact, on a related note, refs to CDH3 (Hadoop 0.20.2) should be removed as this certainly isn't supported either.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5863 from srowen/SPARK-7302 and squashes the following commits:
      
      42f5d1e [Sean Owen] Remove CDH3 (Hadoop 0.20.2) refs too
      dad02e3 [Sean Owen] Remove references to Hadoop 0.23
      9e25b09f
    • Michael Armbrust's avatar
      [SPARK-6907] [SQL] Isolated client for HiveMetastore · daa70bf1
      Michael Armbrust authored
      This PR adds initial support for loading multiple versions of Hive in a single JVM and provides a common interface for extracting metadata from the `HiveMetastoreClient` for a given version.  This is accomplished by creating an isolated `ClassLoader` that operates according to the following rules:
      
       - __Shared Classes__: Java, Scala, logging, and Spark classes are delegated to `baseClassLoader`
        allowing the results of calls to the `ClientInterface` to be visible externally.
       - __Hive Classes__: new instances are loaded from `execJars`.  These classes are not
        accessible externally due to their custom loading.
       - __Barrier Classes__: Classes such as `ClientWrapper` are defined in Spark but must link to a specific version of Hive.  As a result, the bytecode is acquired from the Spark `ClassLoader` but a new copy is created for each instance of `IsolatedClientLoader`.
  This new instance is able to see a specific version of Hive without using reflection wherever Hive is consistent across versions. Since
  this is a unique instance, it is not visible externally other than as a generic
  `ClientInterface`, unless `isolationOn` is set to `false`.
      
      In addition to the unit tests, I have also tested this locally against mysql instances of the Hive Metastore.  I've also successfully ported Spark SQL to run with this client, but due to the size of the changes, that will come in a follow-up PR.
      
      By default, Hive jars are currently downloaded from Maven automatically for a given version to ease packaging and testing.  However, there is also support for specifying their location manually for deployments without internet.
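The three rules above amount to a classification of class names at load time. A hypothetical sketch of that delegation logic (the prefixes and the `classify` helper are illustrative, not Spark's exact code; `ClientWrapper` is the barrier class named above):

```python
# Classify a class name as "shared" (delegate to the base loader),
# "hive" (load from execJars), or "barrier" (re-acquire Spark bytecode
# per IsolatedClientLoader instance).

SHARED_PREFIXES = ("java.", "scala.", "org.slf4j.", "org.apache.spark.")
BARRIER_CLASSES = {"org.apache.spark.sql.hive.client.ClientWrapper"}

def classify(name):
    if name in BARRIER_CLASSES:        # checked first: barrier classes also
        return "barrier"               # match the Spark shared prefix
    if name.startswith(SHARED_PREFIXES):
        return "shared"
    return "hive"
```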
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5851 from marmbrus/isolatedClient and squashes the following commits:
      
      c72f6ac [Michael Armbrust] rxins comments
      1e271fa [Michael Armbrust] [SPARK-6907][SQL] Isolated client for HiveMetastore
      daa70bf1
    • Omede Firouz's avatar
      [SPARK-7022] [PYSPARK] [ML] Add ML.Tuning.ParamGridBuilder to PySpark · f4af9255
      Omede Firouz authored
      Author: Omede Firouz <ofirouz@palantir.com>
      Author: Omede <omedefirouz@gmail.com>
      
      Closes #5601 from oefirouz/paramgrid and squashes the following commits:
      
      c9e2481 [Omede Firouz] Make test a doctest
      9a8ce22 [Omede] Fix linter issues
      8b8a6d2 [Omede Firouz] [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamGridBuilder to PySpark
      f4af9255
  4. May 02, 2015
    • WangTaoTheTonic's avatar
      [SPARK-7031] [THRIFTSERVER] let thrift server take SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS · 49549d5a
      WangTaoTheTonic authored
We should let the Thrift Server take these two parameters, as it is a daemon. And it is better to read driver-related configs, as it is an app submitted by spark-submit.
      
      https://issues.apache.org/jira/browse/SPARK-7031
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #5609 from WangTaoTheTonic/SPARK-7031 and squashes the following commits:
      
      8d3fc16 [WangTaoTheTonic] indent
      035069b [WangTaoTheTonic] better code style
      d3ddfb6 [WangTaoTheTonic] revert the unnecessary changes in suite
      624e652 [WangTaoTheTonic] fix break tests
      0565831 [WangTaoTheTonic] fix failed tests
      4fb25ed [WangTaoTheTonic] let thrift server take SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS
      49549d5a
    • BenFradet's avatar
      [SPARK-7255] [STREAMING] [DOCUMENTATION] Added documentation for spark.streaming.kafka.maxRetries · ea841efc
      BenFradet authored
      Added documentation for spark.streaming.kafka.maxRetries
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #5808 from BenFradet/master and squashes the following commits:
      
      cc72e7a [BenFradet] updated doc for spark.streaming.kafka.maxRetries to explain the default value
      18f823e [BenFradet] Added "consecutive" to the spark.streaming.kafka.maxRetries doc
      597fdeb [BenFradet] Mention that spark.streaming.kafka.maxRetries only applies to the direct kafka api
      0efad39 [BenFradet] Added documentation for spark.streaming.kafka.maxRetries
      ea841efc
    • Cheng Hao's avatar
      [SPARK-5213] [SQL] Pluggable SQL Parser Support · 5d6b90d9
      Cheng Hao authored
based on #4015, we should not delete `sqlParser` from SQLContext, as that leads to MiMa failures. Users implement a dialect to provide a fallback for `sqlParser`, and we should construct `sqlParser` in SQLContext according to the dialect:
`protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))`
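A toy sketch of the pluggable-dialect idea: the context builds its parser from the configured dialect and falls back to the default parse on failure. All class and method names here are hypothetical stand-ins for `SparkSQLParser` and `getSQLDialect()`:

```python
# The context parses through the configured dialect; a failing dialect
# falls back to the built-in parser.

class DefaultDialect:
    def parse(self, sql):
        return ("default", sql.strip().lower())

class ShoutingDialect:
    def parse(self, sql):
        if not sql.strip():
            raise ValueError("empty query")
        return ("shouting", sql.strip().upper())

class SQLContext:
    def __init__(self, dialect=None):
        self.dialect = dialect or DefaultDialect()

    def parse(self, sql):
        try:
            return self.dialect.parse(sql)
        except Exception:
            return DefaultDialect().parse(sql)  # fall back to the default parser

ctx = SQLContext(ShoutingDialect())
```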
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5827 from scwf/sqlparser1 and squashes the following commits:
      
      81b9737 [scwf] comment fix
      0878bd1 [scwf] remove comments
      c19780b [scwf] fix mima tests
      c2895cf [scwf] Merge branch 'master' of https://github.com/apache/spark into sqlparser1
      493775c [Cheng Hao] update the code as feedback
      81a731f [Cheng Hao] remove the unecessary comment
      aab0b0b [Cheng Hao] polish the code a little bit
      49b9d81 [Cheng Hao] shrink the comment for rebasing
      5d6b90d9
    • Marcelo Vanzin's avatar
      [MINOR] [HIVE] Fix QueryPartitionSuite. · 82c8c37c
      Marcelo Vanzin authored
      At least in the version of Hive I tested on, the test was deleting
      a temp directory generated by Hive instead of one containing partition
      data. So fix the filter to only consider partition directories when
      deciding what to delete.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5854 from vanzin/hive-test-fix and squashes the following commits:
      
      7594ae9 [Marcelo Vanzin] Fix typo.
      729fa80 [Marcelo Vanzin] [minor] [hive] Fix QueryPartitionSuite.
      82c8c37c
    • Ye Xianjin's avatar
      [SPARK-6030] [CORE] Using simulated field layout method to compute class shellSize · bfcd528d
      Ye Xianjin authored
      SizeEstimator gives wrong result for Integer on 64bit JVM with UseCompressedOops on, this pr fixes that. For more details, please refer [SPARK-6030](https://issues.apache.org/jira/browse/SPARK-6030)
sryza, I noticed there is a PR to expose SizeEstimator; maybe that should wait until this PR is merged, if we confirm this problem.
And shivaram, would you mind reviewing this PR since you contributed the related code? Also cc srowen and mateiz
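The `alignSizeUp` helper mentioned in the commits below rounds a size up to the next multiple of a power-of-two alignment with bitwise ops: `(size + align - 1) & ~(align - 1)`. A small illustrative sketch (the function name mirrors the patch; the body is an illustration, not the Scala source):

```python
# Round size up to the next multiple of a power-of-two alignment.
def align_size_up(size, align=8):
    assert align & (align - 1) == 0, "alignment must be a power of two"
    return (size + align - 1) & ~(align - 1)
```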
      
      Author: Ye Xianjin <advancedxy@gmail.com>
      
      Closes #4783 from advancedxy/SPARK-6030 and squashes the following commits:
      
      c4dcb41 [Ye Xianjin] Add super.beforeEach in the beforeEach method to make the trait stackable.. Remove useless leading whitespace.
      3f80640 [Ye Xianjin] The size of Integer class changes from 24 to 16 on a 64-bit JVM with -UseCompressedOops flag on after the fix. I don't how 100000 was originally calculated, It looks like 100000 is the magic number which makes sure spilling. Because of the size change, It fails because there is no spilling at all. Change the number to a slightly larger number fixes that.
      e849d2d [Ye Xianjin] Merge two shellSize assignments into one. Add some explanation to alignSizeUp method.
      85a0b51 [Ye Xianjin] Fix typos and update wording in comments. Using alignSizeUp to compute alignSize.
      d27eb77 [Ye Xianjin] Add some detailed comments in the code. Add some test cases. It's very difficult to design test cases as the final object alignment will hide a lot of filed layout details if we just considering the whole size.
      842aed1 [Ye Xianjin] primitiveSize(cls) can just return Int. Use a simplified class field layout method to calculate class instance size. Will add more documents and test cases. Add a new alignSizeUp function which uses bitwise operators to speedup.
      62e8ab4 [Ye Xianjin] Don't alignSize for objects' shellSize, alignSize when added to state.size. Add some primitive wrapper objects size tests.
      bfcd528d
    • Mridul Muralidharan's avatar
      [SPARK-7323] [SPARK CORE] Use insertAll instead of insert while merging combiners in reducer · da303526
      Mridul Muralidharan authored
      Author: Mridul Muralidharan <mridulm@yahoo-inc.com>
      
      Closes #5862 from mridulm/optimize_aggregator and squashes the following commits:
      
      61cf43a [Mridul Muralidharan] Use insertAll instead of insert - much more expensive to do it per tuple
      da303526
    • Dean Chen's avatar
      [SPARK-3444] Fix typo in Dataframes.py introduced in [] · 856a571e
      Dean Chen authored
      Author: Dean Chen <deanchen5@gmail.com>
      
      Closes #5866 from deanchen/patch-1 and squashes the following commits:
      
      0028bc4 [Dean Chen] Fix typo in Dataframes.py introduced in [SPARK-3444]
      856a571e
    • Tathagata Das's avatar
      [SPARK-7315] [STREAMING] [TEST] Fix flaky WALBackedBlockRDDSuite · ecc6eb50
      Tathagata Das authored
      `FileUtils.getTempDirectoryPath()` path may or may not exist. We want to make sure that it does not exist.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5853 from tdas/SPARK-7315 and squashes the following commits:
      
      141afd5 [Tathagata Das] Removed use of FileUtils
      b08d4f1 [Tathagata Das] Fix flaky WALBackedBlockRDDSuite
      ecc6eb50
    • Andrew Or's avatar
      [SPARK-7120] [SPARK-7121] Closure cleaner nesting + documentation + tests · 7394e7ad
      Andrew Or authored
      Note: ~600 lines of this is test code, and ~100 lines documentation.
      
      **[SPARK-7121]** ClosureCleaner does not handle nested closures properly. For instance, in SparkContext, I tried to do the following:
      ```
      def scope[T](body: => T): T = body // no-op
      def myCoolMethod(path: String): RDD[String] = scope {
        parallelize(1 to 10).map { _ => path }
      }
      ```
      and I got an exception complaining that SparkContext is not serializable. The issue here is that the inner closure is getting its path from the outer closure (the scope), but the outer closure references the SparkContext object itself to get the `parallelize` method.
      
      Note, however, that the inner closure doesn't actually need the SparkContext; it just needs a field from the outer closure. If we modify ClosureCleaner to clean the outer closure recursively using only the fields accessed by the inner closure, then we can serialize the inner closure.
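A toy model of that transitive cleaning, treating a "closure" as a dict of captured fields: cleaning the outer closure keeps only the fields the inner closure actually reads, so unserializable extras (like a SparkContext reference) are dropped. This is entirely illustrative; the real ClosureCleaner rewrites JVM bytecode-captured fields with ASM.

```python
# Prune the outer closure's captured fields to those the inner closure
# accesses, making the remainder serializable.

class NotSerializable:
    pass

def clean_outer(outer_fields, accessed_by_inner):
    """Return a pruned copy of the outer closure's captured fields."""
    return {k: v for k, v in outer_fields.items() if k in accessed_by_inner}

outer = {"path": "/data/input", "sc": NotSerializable()}
cleaned = clean_outer(outer, accessed_by_inner={"path"})
```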
      
**[SPARK-7120]** Also, this file is one of the least understood, partly because it is very low level and was written a long time ago. This patch attempts to change that by adding the missing documentation.
      
      This is blocking my effort on a separate task #5729.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5685 from andrewor14/closure-cleaner and squashes the following commits:
      
      cd46230 [Andrew Or] Revert a small change that affected streaming
      0bbe77f [Andrew Or] Fix style
      ea874bc [Andrew Or] Fix tests
      26c5072 [Andrew Or] Address comments
      16fbcfd [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
      26c7aba [Andrew Or] Revert "In sc.runJob, actually clean the inner closure"
      6f75784 [Andrew Or] Revert "Guard against NPE if CC is used outside of an application"
      e909a42 [Andrew Or] Guard against NPE if CC is used outside of an application
      3998168 [Andrew Or] In sc.runJob, actually clean the inner closure
      9187066 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
      d889950 [Andrew Or] Revert "Bypass SerializationDebugger for now (SPARK-7180)"
      9419efe [Andrew Or] Bypass SerializationDebugger for now (SPARK-7180)
      6d4d3f1 [Andrew Or] Fix scala style?
      4aab379 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
      e45e904 [Andrew Or] More minor updates (wording, renaming etc.)
      8b71cdb [Andrew Or] Update a few comments
      eb127e5 [Andrew Or] Use private method tester for a few things
      a3aa465 [Andrew Or] Add more tests for individual closure cleaner operations
      e672170 [Andrew Or] Guard against potential infinite cycles in method visitor
      6d36f38 [Andrew Or] Fix closure cleaner visibility
      2106f12 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
      263593d [Andrew Or] Finalize tests
      06fd668 [Andrew Or] Make closure cleaning idempotent
      a4866e3 [Andrew Or] Add tests (still WIP)
      438c68f [Andrew Or] Minor changes
      2390a60 [Andrew Or] Feature flag this new behavior
      86f7823 [Andrew Or] Implement transitive cleaning + add missing documentation
      7394e7ad
    • Burak Yavuz's avatar
      [SPARK-7242] added python api for freqItems in DataFrames · 2e0f3579
      Burak Yavuz authored
The Python API for DataFrame's freqItems, plus addressed comments from the previous PR.
      rxin
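For context, the Scala `freqItems` this wraps is based on the one-pass frequent-items counting algorithm of Karp et al. A purely illustrative sketch of that algorithm (the returned set may contain false positives, but every item occurring more than n/k times is guaranteed to survive):

```python
# One-pass frequent-items sketch with at most k-1 counters: increment a
# matching counter, start a new one if there is room, otherwise decrement
# all counters and drop any that hit zero.

def freq_items(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

candidates = freq_items([1, 1, 1, 2, 2, 3, 1, 2, 1], k=3)
```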
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5859 from brkyvz/df-freq-py2 and squashes the following commits:
      
      f9aa9ce [Burak Yavuz] addressed comments v0.1
      4b25056 [Burak Yavuz] added python api for freqItems
      2e0f3579
  5. May 01, 2015
    • Mridul Muralidharan's avatar
      [SPARK-7317] [Shuffle] Expose shuffle handle · b79aeb95
      Mridul Muralidharan authored
Details in JIRA; in a nutshell, all machinery for custom RDDs to leverage Spark shuffle directly (without exposing impl details of shuffle) exists - except for this small piece.
      
Exposing this will allow custom dependencies to get a handle to ShuffleHandle - which they can then leverage on the reduce side.
      
      Author: Mridul Muralidharan <mridulm@yahoo-inc.com>
      
      Closes #5857 from mridulm/expose_shuffle_handle and squashes the following commits:
      
      d8b6bd4 [Mridul Muralidharan] Expose ShuffleHandle
      b79aeb95
    • Marcelo Vanzin's avatar
      [SPARK-6229] Add SASL encryption to network library. · 38d4e9e4
      Marcelo Vanzin authored
      There are two main parts of this change:
      
- Extending the bootstrap mechanism in the network library to add a server-side
  bootstrap (which works a little bit differently than the client-side bootstrap), and
  to allow the bootstraps to modify the underlying channel.
      
      - Use SASL to encrypt data going through the RPC channel.
      
      The second item requires some non-optimal code to be able to work around the
      fact that the outbound path in netty is not thread-safe, and ordering is very important
      when encryption is in the picture.
      
      A lot of the changes outside the network/common library are just to adjust to the
      changed API for initializing the RPC server.
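A toy sketch of the server-side bootstrap chain described above: each bootstrap gets a chance to wrap or replace the channel before the RPC handler sees it, which is how a SASL bootstrap can swap in an encrypting channel. Names are hypothetical stand-ins for the real `TransportServerBootstrap` interface:

```python
# Chain of bootstraps, each of which may replace the channel it is given.

class Channel:
    def send(self, data):
        return data

class EncryptingChannel(Channel):
    def __init__(self, inner):
        self.inner = inner
    def send(self, data):
        # stand-in for SASL wrap(); real encryption happens in the codec
        return b"enc:" + self.inner.send(data)

class SaslServerBootstrap:
    def do_bootstrap(self, channel):
        return EncryptingChannel(channel)   # bootstraps may replace the channel

def apply_bootstraps(channel, bootstraps):
    for b in bootstraps:
        channel = b.do_bootstrap(channel)
    return channel

ch = apply_bootstraps(Channel(), [SaslServerBootstrap()])
```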
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5377 from vanzin/SPARK-6229 and squashes the following commits:
      
      ff01966 [Marcelo Vanzin] Use fancy new size config style.
      be53f32 [Marcelo Vanzin] Merge branch 'master' into SPARK-6229
      47d4aff [Marcelo Vanzin] Merge branch 'master' into SPARK-6229
      7a2a805 [Marcelo Vanzin] Clean up some unneeded changes.
      2f92237 [Marcelo Vanzin] Add comment.
      67bb0c6 [Marcelo Vanzin] Revert "Avoid exposing ByteArrayWritableChannel outside of test code."
      065f684 [Marcelo Vanzin] Add test to verify chunking.
      3d1695d [Marcelo Vanzin] Minor cleanups.
      73cff0e [Marcelo Vanzin] Skip bytes in decode path too.
      318ad23 [Marcelo Vanzin] Avoid exposing ByteArrayWritableChannel outside of test code.
      346f829 [Marcelo Vanzin] Avoid trip through channel selector by not reporting 0 bytes written.
      a4a5938 [Marcelo Vanzin] Review feedback.
      4797519 [Marcelo Vanzin] Remove unused import.
      9908ada [Marcelo Vanzin] Fix test, SASL backend disposal.
      7fe1489 [Marcelo Vanzin] Add a test that makes sure encryption is actually enabled.
      adb6f9d [Marcelo Vanzin] Review feedback.
      cf2a605 [Marcelo Vanzin] Clean up some code.
      8584323 [Marcelo Vanzin] Fix a comment.
      e98bc55 [Marcelo Vanzin] Add option to only allow encrypted connections to the server.
      dad42fc [Marcelo Vanzin] Make encryption thread-safe, less memory-intensive.
      b00999a [Marcelo Vanzin] Consolidate ByteArrayWritableChannel, fix SASL code to match master changes.
      b923cae [Marcelo Vanzin] Make SASL encryption handler thread-safe, handle FileRegion messages.
      39539a7 [Marcelo Vanzin] Add config option to enable SASL encryption.
      351a86f [Marcelo Vanzin] Add SASL encryption to network library.
      fbe6ccb [Marcelo Vanzin] Add TransportServerBootstrap, make SASL code use it.
      38d4e9e4