- May 05, 2015
-
-
zsxwing authored
It's meaningless to display the Streaming tab before `ssc.start()`. So we should attach it in the `ssc.start` method. Author: zsxwing <zsxwing@gmail.com> Closes #5898 from zsxwing/SPARK-7350 and squashes the following commits: e676487 [zsxwing] Attach the Streaming tab when calling ssc.start()
-
zsxwing authored
[SPARK-5074] [CORE] [TESTS] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/2240/testReport/junit/org.apache.spark.scheduler/DAGSchedulerSuite/run_shuffle_with_map_stage_failure/ The test fails because many tests share the same `JobListener` and `scheduler` isn't stopped after each test, so it keeps running. While the test `run shuffle with map stage failure` runs, a previous test may trigger the [ResubmitFailedStages](https://github.com/apache/spark/blob/ebc25a4ddfe07a67668217cec59893bc3b8cf730/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1120) logic, report `jobFailed`, and overwrite the global `failure` variable. This PR uses `after` to call `scheduler.stop()` after each test. Author: zsxwing <zsxwing@gmail.com> Closes #5903 from zsxwing/SPARK-5074 and squashes the following commits: 1e6f13e [zsxwing] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite
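The per-test cleanup described above can be sketched in Python's `unittest`, where `tearDown` plays the role of ScalaTest's `after`. This is only an illustration: `FakeScheduler`, `Listener`, and `SchedulerSuite` are hypothetical stand-ins, not Spark classes.

```python
import unittest


class Listener:
    """Hypothetical shared listener holding the last reported failure."""

    def __init__(self):
        self.failure = None


class FakeScheduler:
    """Hypothetical stand-in for a scheduler that reports failures to a listener."""

    def __init__(self, listener):
        self.listener = listener
        self.running = True

    def report_failure(self, error):
        # A stopped scheduler must not touch the shared listener state.
        if self.running:
            self.listener.failure = error

    def stop(self):
        self.running = False


class SchedulerSuite(unittest.TestCase):
    def setUp(self):
        self.listener = Listener()
        self.scheduler = FakeScheduler(self.listener)

    def tearDown(self):
        # Runs after every test, so no scheduler outlives the test that made it.
        self.scheduler.stop()

    def test_failure_is_recorded_while_running(self):
        self.scheduler.report_failure("boom")
        self.assertEqual(self.listener.failure, "boom")
```

Because the scheduler is stopped in `tearDown`, a late failure report from one test cannot overwrite state that a later test inspects.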
-
Liang-Chi Hsieh authored
Two minor doc errors in `BytesToBytesMap` and `UnsafeRow`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5906 from viirya/minor_doc and squashes the following commits: 27f9089 [Liang-Chi Hsieh] Minor update for doc.
-
Imran Rashid authored
Exposes data available in the UI as json over http. Key points: * new endpoints, handled independently of existing XyzPage classes. Root entrypoint is `JsonRootResource` * Uses jersey + jackson for routing & converting POJOs into json * tests against known results in `HistoryServerSuite` * also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages. Author: Imran Rashid <irashid@cloudera.com> Closes #4435 from squito/SPARK-3454 and squashes the following commits: da1e35f [Imran Rashid] typos etc. 5e78b4f [Imran Rashid] fix rendering problems 5ae02ad [Imran Rashid] Merge branch 'master' into SPARK-3454 f016182 [Imran Rashid] change all constructors json-pojo class constructors to be private[spark] to protect us from mima-false-positives if we add fields 3347b72 [Imran Rashid] mark EnumUtil as @Private ec140a2 [Imran Rashid] create @Private cc1febf [Imran Rashid] add docs on the metrics-as-json api cbaf287 [Imran Rashid] Merge branch 'master' into SPARK-3454 56db31e [Imran Rashid] update tests for mulit-attempt 7f3bc4e [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt" 67008b4 [Imran Rashid] rats 9e51400 [Imran Rashid] style c9bae1c [Imran Rashid] handle multiple attempts per app b87cd63 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt 188762c [Imran Rashid] multi-attempt 2af11e5 [Imran Rashid] Merge branch 'master' into SPARK-3454 befff0c [Imran Rashid] review feedback 14ac3ed [Imran Rashid] jersey-core needs to be explicit; move version & scope to parent pom.xml f90680e [Imran Rashid] Merge branch 'master' into SPARK-3454 dc8a7fe [Imran Rashid] style, fix errant comments acb7ef6 [Imran Rashid] fix indentation 7bf1811 [Imran Rashid] move MetricHelper so mima doesnt think its exposed; comments 9d889d6 [Imran Rashid] undo some unnecessary 
changes f48a7b0 [Imran Rashid] docs 52bbae8 [Imran Rashid] StorageListener & StorageStatusListener needs to synchronize internally to be thread-safe 31c79ce [Imran Rashid] asm no longer needed for SPARK_PREPEND_CLASSES b2f8b91 [Imran Rashid] @DeveloperApi 2e19be2 [Imran Rashid] lazily convert ApplicationInfo to avoid memory overhead ba3d9d2 [Imran Rashid] upper case enums 39ac29c [Imran Rashid] move EnumUtil d2bde77 [Imran Rashid] update error handling & scoping 4a234d3 [Imran Rashid] avoid jersey-media-json-jackson b/c of potential version conflicts a157a2f [Imran Rashid] style 7bd4d15 [Imran Rashid] delete security test, since it doesnt do anything a325563 [Imran Rashid] style a9c5cf1 [Imran Rashid] undo changes superceeded by master 0c6f968 [Imran Rashid] update deps 1ed0d07 [Imran Rashid] Merge branch 'master' into SPARK-3454 4c92af6 [Imran Rashid] style f2e63ad [Imran Rashid] Merge branch 'master' into SPARK-3454 c22b11f [Imran Rashid] fix compile error 9ea682c [Imran Rashid] go back to good ol' java enums cf86175 [Imran Rashid] style d493b38 [Imran Rashid] Merge branch 'master' into SPARK-3454 f05ae89 [Imran Rashid] add in ExecutorSummaryInfo for MiMa :( 101a698 [Imran Rashid] style d2ef58d [Imran Rashid] revert changes that had HistoryServer refresh the application listing more often b136e39b [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt" e031719 [Imran Rashid] fixes from review 1f53a66 [Imran Rashid] style b4a7863 [Imran Rashid] fix compile error 2c8b7ee [Imran Rashid] rats 1578a4a [Imran Rashid] doc 674f8dc [Imran Rashid] more explicit about total numbers of jobs & stages vs. 
number retained 9922be0 [Imran Rashid] Merge branch 'master' into stage_distributions f5a5196 [Imran Rashid] undo removal of renderJson from MasterPage, since there is no substitute yet db61211 [Imran Rashid] get JobProgressListener directly from UI fdfc181 [Imran Rashid] stage/taskList 63eb4a6 [Imran Rashid] tests for taskSummary ad27de8 [Imran Rashid] error handling on quantile values b2efcaf [Imran Rashid] cleanup, combine stage-related paths into one resource aaba896 [Imran Rashid] wire up task summary a4b1397 [Imran Rashid] stage metric distributions e48ba32 [Imran Rashid] rename eaf3bbb [Imran Rashid] style 25cd894 [Imran Rashid] if only given day, assume GMT 51eaedb [Imran Rashid] more visibility fixes 9f28b7e [Imran Rashid] ack, more cleanup 99764e1 [Imran Rashid] Merge branch 'SPARK-3454_w_jersey' into SPARK-3454 a61a43c [Imran Rashid] oops, remove accidental checkin a066055 [Imran Rashid] set visibility on a lot of classes 1f361c8 [Imran Rashid] update rat-excludes 0be5120 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey 2382bef [Imran Rashid] switch to using new "enum" fef6605 [Imran Rashid] some utils for working w/ new "enum" format dbfc7bf [Imran Rashid] style b86bcb0 [Imran Rashid] update test to look at one stage attempt 5f9df24 [Imran Rashid] style 7fd156a [Imran Rashid] refactor jsonDiff to avoid code duplication 73f1378 [Imran Rashid] test json; also add test cases for cleaned stages & jobs 97d411f [Imran Rashid] json endpoint for one job 0c96147 [Imran Rashid] better error msgs for bad stageId vs bad attemptId dddbd29 [Imran Rashid] stages have attempt; jobs are sorted; resource for all attempts for one stage 190c17a [Imran Rashid] StagePage should distinguish no task data, from unknown stage 84cd497 [Imran Rashid] AllJobsPage should still report correct completed & failed job count, even if some have been cleaned, to make it consistent w/ AllStagesPage 36e4062 [Imran Rashid] SparkUI needs to know about startTime, so it can list its 
own applicationInfo b4c75ed [Imran Rashid] fix merge conflicts; need to widen visibility in a few cases e91750a [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey 56d2fc7 [Imran Rashid] jersey needs asm for SPARK_PREPEND_CLASSES to work f7df095 [Imran Rashid] add test for accumulables, and discover that I need update after all 9c0c125 [Imran Rashid] add accumulableInfo 00e9cc5 [Imran Rashid] more style 3377e61 [Imran Rashid] scaladoc d05f7a9 [Imran Rashid] dont use case classes for status api POJOs, since they have binary compatibility issues 654cecf [Imran Rashid] move all the status api POJOs to one file b86e2b0 [Imran Rashid] style 18a8c45 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey 5598f19 [Imran Rashid] delete some unnecessary code, more to go 56edce0 [Imran Rashid] style 017c755 [Imran Rashid] add in metrics now available 1b78cb7 [Imran Rashid] fix some import ordering 0dc3ea7 [Imran Rashid] if app isnt found, reload apps from FS before giving up c7d884f [Imran Rashid] fix merge conflicts 0c12b50 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey b6a96a8 [Imran Rashid] compare json by AST, not string cd37845 [Imran Rashid] switch to using java.util.Dates for times a4ab5aa [Imran Rashid] add in explicit dependency on jersey 1.9 -- maven wasn't happy before this 4fdc39f [Imran Rashid] refactor case insensitive enum parsing cba1ef6 [Imran Rashid] add security (maybe?) for metrics json f0264a7 [Imran Rashid] switch to using jersey for metrics json bceb3a9 [Imran Rashid] set http response code on error, some testing e0356b6 [Imran Rashid] put new test expectation files in rat excludes (is this OK?) b252e7a [Imran Rashid] small cleanup of accidental changes d1a8c92 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt 4b398d0 [Imran Rashid] expose UI data as json in new endpoints
-
Jihong MA authored
Author: Jihong MA <linlin200605@gmail.com> Closes #5904 from JihongMA/SPARK-7357 and squashes the following commits: 7d6153a [Jihong MA] SPARK-7357 Improving HBaseTest example
-
Sandy Ryza authored
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD." -the Tuning Spark page This is a pain. It would be much nicer to expose simply functionality for understanding the memory footprint of a Java object. Author: Sandy Ryza <sandy@cloudera.com> Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits: 8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark 2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util" 93f4cd0 [Sandy Ryza] Move SizeEstimator out of util e21c1f4 [Sandy Ryza] Remove unused import 798ab88 [Sandy Ryza] Update documentation and add to SparkContext 34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
-
shekhar.bansal authored
Author: shekhar.bansal <shekhar.bansal@guavus.com> Closes #5719 from zuxqoj/master and squashes the following commits: 5574ff7 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system 5117258 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system 9de5330 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system 456a592 [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system 803e93e [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
-
zsxwing authored
...aming.InputStreamsSuite.socket input stream Remove non-deterministic "Thread.sleep" and use deterministic strategies to fix the flaky failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2127/testReport/junit/org.apache.spark.streaming/InputStreamsSuite/socket_input_stream/ Author: zsxwing <zsxwing@gmail.com> Closes #5891 from zsxwing/SPARK-7341 and squashes the following commits: 611157a [zsxwing] Add wait methods to BatchCounter and use BatchCounter in InputStreamsSuite 014b58f [zsxwing] Use withXXX to clean up the resources c9bf746 [zsxwing] Move 'waitForStart' into the 'start' method and fix the code style 9d0de6d [zsxwing] [SPARK-7341][Streaming][Tests] Fix the flaky test: org.apache.spark.streaming.InputStreamsSuite.socket input stream
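A minimal sketch of the idea of waiting on a counter instead of sleeping for a fixed time. The class below borrows the name `BatchCounter` from the commit message, but its methods and behavior are assumptions made for illustration, not the Spark test helper.

```python
import threading


class BatchCounter:
    """Counts completed batches and lets a test block until a target is reached."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def batch_completed(self):
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait_until_batches_completed(self, expected, timeout):
        """Return True once `expected` batches are done, False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count >= expected, timeout)


counter = BatchCounter()
workers = [threading.Thread(target=counter.batch_completed) for _ in range(3)]
for w in workers:
    w.start()
# The test proceeds as soon as the third batch completes; the timeout is only an upper bound.
reached = counter.wait_until_batches_completed(3, timeout=5.0)
for w in workers:
    w.join()
```

Unlike a fixed sleep, the wait returns as soon as the condition holds and fails explicitly when it never does.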
-
jerryshao authored
Author: jerryshao <saisai.shao@intel.com> Closes #5879 from jerryshao/SPARK-7113 and squashes the following commits: b0b506c [jerryshao] Address the comments 0babe66 [jerryshao] Support input information reporting for Direct Kafka stream
-
Tathagata Das authored
org.apache.spark.DriverSuite.driver should exit after finishing without cleanup (SPARK-530) https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2267/ org.apache.spark.deploy.SparkSubmitSuite.includes jars passed in through --jars https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/ org.apache.spark.streaming.flume.FlumePollingStreamSuite.flume polling test https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2269/ Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #5901 from tdas/ignore-flaky-tests and squashes the following commits: 9cd8667 [Tathagata Das] Ignoring tests.
-
Tathagata Das authored
[SPARK-7139] [STREAMING] Allow received block metadata to be saved to WAL and recovered on driver failure - Enabled ReceivedBlockTracker WAL by default - Stored block metadata in the WAL - Optimized WALBackedBlockRDD by skipping block fetch when the block is known to not exist in Spark Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #5732 from tdas/SPARK-7139 and squashes the following commits: 575476e [Tathagata Das] Added more tests to get 100% coverage of the WALBackedBlockRDD 19668ba [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7139 685fab3 [Tathagata Das] Addressed comments in PR 637bc9c [Tathagata Das] Changed segment to handle 466212c [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7139 5f67a59 [Tathagata Das] Fixed HdfsUtils to handle append in local file system 1bc5bc3 [Tathagata Das] Fixed bug on unexpected recovery d06fa21 [Tathagata Das] Enabled ReceivedBlockTracker by default, stored block metadata and optimized block fetching in WALBackedBlockRDD
-
Marcelo Vanzin authored
Without this, any dependency that pulls in ivy transitively may override the version and potentially cause issues. On my machine, the hive tests were pulling an old version of ivy, and subsequently failing with a "NoSuchMethodError". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #5893 from vanzin/ivy-dep-fix and squashes the following commits: ea2112d [Marcelo Vanzin] [minor] [build] Declare ivy dependency in root pom.
-
Niccolo Becchi authored
[MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and kmeans.py to simplify readability With the previous names, it could look as though `reduceByKey` summed the abscissas and ordinates of some 2D points separately. The new names should make the example easier to understand, especially for those who, like me, are just starting out with functional programming. Author: Niccolo Becchi <niccolo.becchi@gmail.com> Author: pippobaudos <niccolo.becchi@gmail.com> Closes #5875 from pippobaudos/patch-1 and squashes the following commits: 3bb3a47 [pippobaudos] renamed variables in LocalKMeans.scala and kmeans.py to simplify readability 2c2a7a2 [Niccolo Becchi] Update SparkKMeans.scala
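To illustrate the naming point, here is a plain-Python sketch of the reduce step in a k-means iteration. It is not the code from the examples touched by this commit: `reduce_by_key`, `add_points`, and the variable names are hypothetical, chosen so that each reduced value reads as a (point sum, count) pair rather than an (x, y) pair.

```python
def add_points(p, q):
    """Element-wise sum of two points of equal dimension."""
    return tuple(a + b for a, b in zip(p, q))


def reduce_by_key(pairs, combine):
    """Local stand-in for reduceByKey: fold values sharing a key with `combine`."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# (cluster index, (point, count)) pairs, as a map step would emit them.
closest = [
    (0, ((1.0, 2.0), 1)),
    (0, ((3.0, 4.0), 1)),
    (1, ((10.0, 10.0), 1)),
]

# Each value is a (point_sum, count) pair, not an (x, y) coordinate pair.
point_stats = reduce_by_key(
    closest,
    lambda s1, s2: (add_points(s1[0], s2[0]), s1[1] + s2[1]),
)

# New center of each cluster: the summed point divided by the number of points.
new_centers = {
    k: tuple(c / count for c in point_sum)
    for k, (point_sum, count) in point_stats.items()
}
```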
-
Xiangrui Meng authored
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there. Author: Xiangrui Meng <meng@databricks.com> Closes #5850 from mengxr/SPARK-7314 and squashes the following commits: 2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314 da3c2dd [Xiangrui Meng] remove my repo fe7e29b [Xiangrui Meng] switch to maven central 6ddac0e [Xiangrui Meng] reverse the machine code for float/double d2d5b5b [Xiangrui Meng] change back to 4.4 7824a9c [Xiangrui Meng] use Pyrolite 3.1 cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
-
- May 04, 2015
-
-
Bryan Cutler authored
Added a check so that if `AkkaUtils.askWithReply` is on the final attempt, it will not sleep for the `retryInterval`. This should also prevent the thread from sleeping for `Int.Max` when using `askWithReply` with default values for `maxAttempts` and `retryInterval`. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #5896 from BryanCutler/askWithReply-sleep-7236 and squashes the following commits: 653a07b [Bryan Cutler] [SPARK-7236] Fix to prevent AkkaUtils askWithReply from sleeping on final attempt
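The check described above can be sketched as a generic retry loop. This is a minimal illustration, not the `AkkaUtils` code: `ask_with_retry` and its parameters are hypothetical names, and the `sleep` argument exists only so the behavior can be observed without waiting.

```python
import time


def ask_with_retry(ask, max_attempts, retry_interval, sleep=time.sleep):
    """Call `ask` up to `max_attempts` times, sleeping only between attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ask()
        except Exception as error:  # a real caller would catch narrower errors
            last_error = error
            # No point in sleeping after the final attempt: nothing follows it.
            if attempt < max_attempts:
                sleep(retry_interval)
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error
```

With three attempts that all fail, the loop sleeps twice, not three times, before raising.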
-
Reynold Xin authored
This should give us better analysis-time error messages (rather than runtime ones) and automatic type casting. Author: Reynold Xin <rxin@databricks.com> Closes #5796 from rxin/expected-input-types and squashes the following commits: c900760 [Reynold Xin] [SPARK-7266] Add ExpectsInputTypes to expressions when possible.
-
Burak Yavuz authored
Computes a pair-wise frequency table of the given columns. Also known as cross-tabulation. cc mengxr rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes #5842 from brkyvz/df-cont and squashes the following commits: a07c01e [Burak Yavuz] addressed comments v4.1 ae9e01d [Burak Yavuz] fix test 9106585 [Burak Yavuz] addressed comments v4.0 bced829 [Burak Yavuz] fix merge conflicts a63ad00 [Burak Yavuz] addressed comments v3.0 a0cad97 [Burak Yavuz] addressed comments v3.0 6805df8 [Burak Yavuz] addressed comments and fixed test 939b7c4 [Burak Yavuz] lint python 7f098bc [Burak Yavuz] add crosstab pyTest fd53b00 [Burak Yavuz] added python support for crosstab 27a5a81 [Burak Yavuz] implemented crosstab
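For readers unfamiliar with cross-tabulation, the following plain-Python sketch shows what a pair-wise frequency table is. It does not use or reproduce the DataFrame API added by this commit; the `crosstab` helper and the sample rows are made up for illustration.

```python
from collections import Counter


def crosstab(rows, col1, col2):
    """Return (row_labels, col_labels, table) counting co-occurrences of two columns."""
    counts = Counter((row[col1], row[col2]) for row in rows)
    row_labels = sorted({a for a, _ in counts})
    col_labels = sorted({b for _, b in counts})
    # Counter returns 0 for pairs that never occur together.
    table = [[counts[(a, b)] for b in col_labels] for a in row_labels]
    return row_labels, col_labels, table


rows = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "S"},
    {"color": "red", "size": "S"},
]
row_labels, col_labels, table = crosstab(rows, "color", "size")
```

Each cell of `table` holds the number of rows in which the corresponding pair of values appears together.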
-
Andrew Or authored
This patch adds the functionality to display the RDD DAG on the SparkUI. This DAG describes the relationships between - an RDD and its dependencies, - an RDD and its operation scopes, and - an RDD's operation scopes and the stage / job hierarchy An operation scope here refers to the existing public APIs that created the RDDs (e.g. `textFile`, `treeAggregate`). In the future, we can expand this to include higher level operations like SQL queries. *Note: This blatantly stole a few lines of HTML and JavaScript from #5547 (thanks shroffpradyumn!)* Here's what the job page looks like: <img src="https://issues.apache.org/jira/secure/attachment/12730286/job-page.png" width="700px"/> and the stage page: <img src="https://issues.apache.org/jira/secure/attachment/12730287/stage-page.png" width="300px"/> Author: Andrew Or <andrew@databricks.com> Closes #5729 from andrewor14/viz2 and squashes the following commits: 666c03b [Andrew Or] Round corners of RDD boxes on stage page (minor) 01ba336 [Andrew Or] Change RDD cache color to red (minor) 6f9574a [Andrew Or] Add tests for RDDOperationScope 1c310e4 [Andrew Or] Wrap a few more RDD functions in an operation scope 3ffe566 [Andrew Or] Restore "null" as default for RDD name 5fdd89d [Andrew Or] children -> child (minor) 0d07a84 [Andrew Or] Fix python style afb98e2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2 0d7aa32 [Andrew Or] Fix python tests 3459ab2 [Andrew Or] Fix tests 832443c [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2 429e9e1 [Andrew Or] Display cached RDDs on the viz b1f0fd1 [Andrew Or] Rename OperatorScope -> RDDOperationScope 31aae06 [Andrew Or] Extract visualization logic from listener 83f9c58 [Andrew Or] Implement a programmatic representation of operator scopes 5a7faf4 [Andrew Or] Rename references to viz scopes to viz clusters ee33d52 [Andrew Or] Separate HTML generating code from listener f9830a2 [Andrew Or] Refactor + clean up + document JS visualization code 
b80cc52 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2 0706992 [Andrew Or] Add link from jobs to stages deb48a0 [Andrew Or] Translate stage boxes taking into account the width 5c7ce16 [Andrew Or] Connect RDDs across stages + update style ab91416 [Andrew Or] Introduce visualization to the Job Page 5f07e9c [Andrew Or] Remove more return statements from scopes 5e388ea [Andrew Or] Fix line too long 43de96e [Andrew Or] Add parent IDs to StageInfo 6e2cfea [Andrew Or] Remove all return statements in `withScope` d19c4da [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2 7ef957c [Andrew Or] Fix scala style 4310271 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2 aa868a9 [Andrew Or] Ensure that HadoopRDD is actually serializable c3bfcae [Andrew Or] Re-implement scopes using closures instead of annotations 52187fc [Andrew Or] Rat excludes 09d361e [Andrew Or] Add ID to node label (minor) 71281fa [Andrew Or] Embed the viz in the UI in a toggleable manner 8dd5af2 [Andrew Or] Fill in documentation + miscellaneous minor changes fe7816f [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz 205f838 [Andrew Or] Reimplement rendering with dagre-d3 instead of viz.js 5e22946 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz 6a7cdca [Andrew Or] Move RDD scope util methods and logic to its own file 494d5c2 [Andrew Or] Revert a few unintended style changes 9fac6f3 [Andrew Or] Re-implement scopes through annotations instead f22f337 [Andrew Or] First working implementation of visualization with vis.js 2184348 [Andrew Or] Translate RDD information to dot file 5143523 [Andrew Or] Expose the necessary information in RDDInfo a9ed4f9 [Andrew Or] Add a few missing scopes to certain RDD methods 6b3403b [Andrew Or] Scope all RDD methods
-
云峤 authored
Author: 云峤 <chensong.cs@alibaba-inc.com> Closes #5865 from kaka1992/df.show and squashes the following commits: c79204b [云峤] Update a1338f6 [云峤] Update python dataFrame show test and add empty df unit test. 734369c [云峤] Update python dataFrame show test and add empty df unit test. 84aec3e [云峤] Update python dataFrame show test and add empty df unit test. 159b3d5 [云峤] update 03ef434 [云峤] update 7394fd5 [云峤] update test show ced487a [云峤] update pep8 b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show 30ac311 [云峤] [SPARK-7294] ADD BETWEEN 7d62368 [云峤] [SPARK-7294] ADD BETWEEN baf839b [云峤] [SPARK-7294] ADD BETWEEN d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
-
Xiangrui Meng authored
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values, given the default implementation of `fit(dataset, extra)`:

~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~

Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:

~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~

Meta-algorithms like `Pipeline` implement their own `copy(extra)`, so the fitted pipeline model stores all copied stages (whether each one is a transformer or a model).

Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.
TODOs: * [x] add tests for newly added methods * [ ] update documentation jkbradley dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #5820 from mengxr/SPARK-5956 and squashes the following commits: 7bef88d [Xiangrui Meng] address comments 05229c3 [Xiangrui Meng] assert -> assertEquals b2927b1 [Xiangrui Meng] organize imports f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956 93e7924 [Xiangrui Meng] add tests for hasParam & copy 463ecae [Xiangrui Meng] merge master 2b954c3 [Xiangrui Meng] update Binarizer 465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956 282a1a8 [Xiangrui Meng] fix test 819dd2d [Xiangrui Meng] merge master b642872 [Xiangrui Meng] example code runs 5a67779 [Xiangrui Meng] examples compile c76b4d1 [Xiangrui Meng] fix all unit tests 0f4fd64 [Xiangrui Meng] fix some tests 9286a22 [Xiangrui Meng] copyValues to trained models 53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues 9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams d882afc [Xiangrui Meng] test compile f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
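The copy-then-fit pattern from this PR can be sketched in Python as follows. This is an illustration under stated assumptions, not the spark.ml implementation: the `Params` and `Scaler` classes, `get_or_default`, and the `factor` param are hypothetical.

```python
import copy
import uuid


class Params:
    """Hypothetical holder of param values with a copy-with-overrides method."""

    def __init__(self, **values):
        self.uid = uuid.uuid4().hex
        self._values = dict(values)

    def get_or_default(self, name):
        return self._values[name]

    def copy(self, extra=None):
        # New instance, new uid, embedded values overridden by `extra`.
        that = copy.copy(self)
        that.uid = uuid.uuid4().hex
        that._values = {**self._values, **(extra or {})}
        return that


class Scaler(Params):
    """Toy estimator: 'fitting' just multiplies the data by the `factor` param."""

    def fit(self, dataset, extra=None):
        if extra:
            # Default handling of extra params: copy, then fit without extras.
            return self.copy(extra).fit(dataset)
        return [x * self.get_or_default("factor") for x in dataset]
```

The benefit of this shape is that `fit` only ever reads embedded values, and the original instance is left unchanged when extra values are supplied.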
-
Andrew Or authored
I suspect we haven't been using anaconda in tests for a while. I wonder whether this change actually does anything, but the line as it stands looks strictly less correct. Author: Andrew Or <andrew@databricks.com> Closes #5883 from andrewor14/fix-run-tests-typo and squashes the following commits: a3ad720 [Andrew Or] Fix typo?
-
tianyi authored
This PR is a rebased version of #3946 and mainly focuses on creating an independent tab for the thrift server in the Spark web UI. Features: 1. Session-related statistics (username and IP are only supported in hive-0.13.1) 2. List all the SQL statements executing or executed on this server 3. Provide links to the jobs generated by SQL 4. Provide a link to show all SQL statements executing or executed in a specified session Prototype snapshots: This is the main page for the thrift server. Author: tianyi <tianyi.asiainfo@gmail.com> Closes #5730 from tianyi/SPARK-5100 and squashes the following commits: cfd14c7 [tianyi] style fix 0efe3d5 [tianyi] revert part of pom change c0f2fa0 [tianyi] extends HiveThriftJdbcTest to start/stop thriftserver for UI test aa20408 [tianyi] fix style problem c9df6f9 [tianyi] add testsuite for thriftserver ui and fix some style issue 9830199 [tianyi] add webui for thriftserver
-
Yuhao Yang authored
JIRA: https://issues.apache.org/jira/browse/SPARK-5563 The PR contains the implementation of [Online LDA](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf), based on the research of Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. The major advantages of the algorithm are its stream compatibility and its economical time/memory consumption due to the corpus split. For more details, please refer to the JIRA. Online LDA can act as a fast option for LDA, and will be especially helpful for users who need a quick result or have a large corpus. Correctness test: I have tested the current PR against https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #4419 from hhbyyh/ldaonline and squashes the following commits: 1045eec [Yuhao Yang] Merge pull request #2 from jkbradley/hhbyyh-ldaonline2 cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors. Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter. 6149ca6 [Yuhao Yang] fix for setOptimizer cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline 54cf8da [Yuhao Yang] some style change 68c2318 [Yuhao Yang] add a java ut 4041723 [Yuhao Yang] add ut 138bfed [Yuhao Yang] Merge pull request #1 from jkbradley/hhbyyh-ldaonline-update 9e910d9 [Joseph K. Bradley] small fix 61d60df [Joseph K.
Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style) a996a82 [Yuhao Yang] respond to comments b1178cf [Yuhao Yang] fit into the optimizer framework dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline 15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline d19ef55 [Yuhao Yang] change OnlineLDA to class 97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline e7bf3b0 [Yuhao Yang] move to seperate file f367cc9 [Yuhao Yang] change to optimization 8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline 62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline 02d0373 [Yuhao Yang] fix style in comment f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline a570c9a [Yuhao Yang] use sample to pick up batch 4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline e271eb1 [Yuhao Yang] remove non ascii 581c623 [Yuhao Yang] seperate API and adjust batch split 37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline 20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i aa365d1 [Yuhao Yang] merge upstream master 3a06526 [Yuhao Yang] merge with new example 0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline 0d0f3ee [Yuhao Yang] replace random split with sliding fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline 45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s f41c5ca [Yuhao Yang] style fix 
26dca1b [Yuhao Yang] style fix and make class private 043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala d640d9c [Yuhao Yang] online lda initial checkin
-
- May 03, 2015
-
-
Burak Yavuz authored
submitting this PR from a phone, excuse the brevity. adds Pearson correlation to Dataframes, reusing the covariance calculation code cc mengxr rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes #5858 from brkyvz/df-corr and squashes the following commits: 285b838 [Burak Yavuz] addressed comments v2.0 d10babb [Burak Yavuz] addressed comments v0.2 4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr 4fe693b [Burak Yavuz] addressed comments v0.1 a682d06 [Burak Yavuz] ready for PR
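The relationship this commit relies on, Pearson correlation expressed through covariance, can be sketched in plain Python. The helper names are hypothetical and this is not the DataFrame implementation; it only shows how a correlation can reuse a covariance routine.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """corr(X, Y) = cov(X, Y) / sqrt(cov(X, X) * cov(Y, Y))."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))
```

Since `cov(X, X)` is the variance of `X`, no separate variance routine is needed.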
-
Xiangrui Meng authored
as suggested by justinuang on #5601. Author: Xiangrui Meng <meng@databricks.com> Closes #5873 from mengxr/SPARK-7329 and squashes the following commits: d08f9cf [Xiangrui Meng] simplify tests b7a7b9b [Xiangrui Meng] simplify grid build
-
Sean Owen authored
Remove references to Hadoop 0.23 CC tgravescs Is this what you had in mind? basically all refs to 0.23? We don't support YARN 0.23, but also don't support Hadoop 0.23 anymore AFAICT. There are no builds or releases for it. In fact, on a related note, refs to CDH3 (Hadoop 0.20.2) should be removed as this certainly isn't supported either. Author: Sean Owen <sowen@cloudera.com> Closes #5863 from srowen/SPARK-7302 and squashes the following commits: 42f5d1e [Sean Owen] Remove CDH3 (Hadoop 0.20.2) refs too dad02e3 [Sean Owen] Remove references to Hadoop 0.23
-
Michael Armbrust authored
This PR adds initial support for loading multiple versions of Hive in a single JVM and provides a common interface for extracting metadata from the `HiveMetastoreClient` for a given version. This is accomplished by creating an isolated `ClassLoader` that operates according to the following rules: - __Shared Classes__: Java, Scala, logging, and Spark classes are delegated to `baseClassLoader`, allowing the results of calls to the `ClientInterface` to be visible externally. - __Hive Classes__: new instances are loaded from `execJars`. These classes are not accessible externally due to their custom loading. - __Barrier Classes__: Classes such as `ClientWrapper` are defined in Spark but must link to a specific version of Hive. As a result, the bytecode is acquired from the Spark `ClassLoader` but a new copy is created for each instance of `IsolatedClientLoader`. This new instance is able to see a specific version of Hive without using reflection wherever Hive is consistent across versions. Since this is a unique instance, it is not visible externally other than as a generic `ClientInterface`, unless `isolationOn` is set to `false`. In addition to the unit tests, I have also tested this locally against MySQL instances of the Hive Metastore. I've also successfully ported Spark SQL to run with this client, but due to the size of the changes, that will come in a follow-up PR. By default, Hive jars are currently downloaded from Maven automatically for a given version to ease packaging and testing. However, there is also support for specifying their location manually for deployments without internet access. Author: Michael Armbrust <michael@databricks.com> Closes #5851 from marmbrus/isolatedClient and squashes the following commits: c72f6ac [Michael Armbrust] rxins comments 1e271fa [Michael Armbrust] [SPARK-6907][SQL] Isolated client for HiveMetastore
-
Omede Firouz authored
Author: Omede Firouz <ofirouz@palantir.com> Author: Omede <omedefirouz@gmail.com> Closes #5601 from oefirouz/paramgrid and squashes the following commits: c9e2481 [Omede Firouz] Make test a doctest 9a8ce22 [Omede] Fix linter issues 8b8a6d2 [Omede Firouz] [SPARK-7022][PySpark][ML] Add ML.Tuning.ParamGridBuilder to PySpark
-
- May 02, 2015
-
-
WangTaoTheTonic authored
The Thrift Server should honor these two parameters (`SPARK_DAEMON_MEMORY` and `SPARK_DAEMON_JAVA_OPTS`), since it is a daemon. It should also read driver-related configs the same way an app submitted by spark-submit does. https://issues.apache.org/jira/browse/SPARK-7031 Author: WangTaoTheTonic <wangtao111@huawei.com> Closes #5609 from WangTaoTheTonic/SPARK-7031 and squashes the following commits: 8d3fc16 [WangTaoTheTonic] indent 035069b [WangTaoTheTonic] better code style d3ddfb6 [WangTaoTheTonic] revert the unnecessary changes in suite 624e652 [WangTaoTheTonic] fix break tests 0565831 [WangTaoTheTonic] fix failed tests 4fb25ed [WangTaoTheTonic] let thrift server take SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS
-
BenFradet authored
Added documentation for spark.streaming.kafka.maxRetries Author: BenFradet <benjamin.fradet@gmail.com> Closes #5808 from BenFradet/master and squashes the following commits: cc72e7a [BenFradet] updated doc for spark.streaming.kafka.maxRetries to explain the default value 18f823e [BenFradet] Added "consecutive" to the spark.streaming.kafka.maxRetries doc 597fdeb [BenFradet] Mention that spark.streaming.kafka.maxRetries only applies to the direct kafka api 0efad39 [BenFradet] Added documentation for spark.streaming.kafka.maxRetries
-
Cheng Hao authored
Based on #4015, we should not delete `sqlParser` from `SQLContext`, since removing it causes the MiMa checks to fail. Users implement a dialect to provide a fallback for `sqlParser`, so we should construct `sqlParser` in `SQLContext` according to the dialect: `protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))` Author: Cheng Hao <hao.cheng@intel.com> Author: scwf <wangfei1@huawei.com> Closes #5827 from scwf/sqlparser1 and squashes the following commits: 81b9737 [scwf] comment fix 0878bd1 [scwf] remove comments c19780b [scwf] fix mima tests c2895cf [scwf] Merge branch 'master' of https://github.com/apache/spark into sqlparser1 493775c [Cheng Hao] update the code as feedback 81a731f [Cheng Hao] remove the unecessary comment aab0b0b [Cheng Hao] polish the code a little bit 49b9d81 [Cheng Hao] shrink the comment for rebasing
-
Marcelo Vanzin authored
At least in the version of Hive I tested on, the test was deleting a temp directory generated by Hive instead of one containing partition data. So fix the filter to only consider partition directories when deciding what to delete. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #5854 from vanzin/hive-test-fix and squashes the following commits: 7594ae9 [Marcelo Vanzin] Fix typo. 729fa80 [Marcelo Vanzin] [minor] [hive] Fix QueryPartitionSuite.
-
Ye Xianjin authored
SizeEstimator gives the wrong result for Integer on a 64-bit JVM with UseCompressedOops on; this PR fixes that. For more details, please refer to [SPARK-6030](https://issues.apache.org/jira/browse/SPARK-6030). sryza, I noticed there is a PR to expose SizeEstimator; maybe that should wait until this PR is merged, if we confirm this problem. shivaram, would you mind reviewing this PR, since you contributed related code? Also cc srowen and mateiz. Author: Ye Xianjin <advancedxy@gmail.com> Closes #4783 from advancedxy/SPARK-6030 and squashes the following commits: c4dcb41 [Ye Xianjin] Add super.beforeEach in the beforeEach method to make the trait stackable.. Remove useless leading whitespace. 3f80640 [Ye Xianjin] The size of Integer class changes from 24 to 16 on a 64-bit JVM with -UseCompressedOops flag on after the fix. I don't how 100000 was originally calculated, It looks like 100000 is the magic number which makes sure spilling. Because of the size change, It fails because there is no spilling at all. Change the number to a slightly larger number fixes that. e849d2d [Ye Xianjin] Merge two shellSize assignments into one. Add some explanation to alignSizeUp method. 85a0b51 [Ye Xianjin] Fix typos and update wording in comments. Using alignSizeUp to compute alignSize. d27eb77 [Ye Xianjin] Add some detailed comments in the code. Add some test cases. It's very difficult to design test cases as the final object alignment will hide a lot of filed layout details if we just considering the whole size. 842aed1 [Ye Xianjin] primitiveSize(cls) can just return Int. Use a simplified class field layout method to calculate class instance size. Will add more documents and test cases. Add a new alignSizeUp function which uses bitwise operators to speedup. 62e8ab4 [Ye Xianjin] Don't alignSize for objects' shellSize, alignSize when added to state.size. Add some primitive wrapper objects size tests.
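The commit list above mentions an `alignSizeUp` function that uses bitwise operators, and an Integer size that changes from 24 to 16 bytes. As a minimal sketch of that rounding trick, written in Python rather than taken from the Spark source: the function below rounds a size up to the next multiple of a power-of-two alignment. The 8-byte alignment and the 12-byte and 16-byte object header sizes are assumptions about a typical 64-bit HotSpot JVM, not values stated in the commit message.

```python
def align_size_up(size, align_size=8):
    """Round size up to the next multiple of align_size.

    Assumes align_size is a power of two, so that (align_size - 1) is a
    mask of the low bits and clearing those bits yields a multiple.
    """
    return (size + align_size - 1) & ~(align_size - 1)


# Assumed header sizes (not from the commit message): 12 bytes with
# compressed oops, 16 bytes without. An Integer adds one 4-byte int field.
INT_FIELD = 4
with_compressed_oops = align_size_up(12 + INT_FIELD)     # 16 -> 16
without_compressed_oops = align_size_up(16 + INT_FIELD)  # 20 -> 24

if __name__ == "__main__":
    print(with_compressed_oops, without_compressed_oops)
```

Under these assumptions the two results match the 16 and 24 quoted in the commit list, which suggests why an estimator that ignores the smaller header would overestimate.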
-
Mridul Muralidharan authored
Author: Mridul Muralidharan <mridulm@yahoo-inc.com> Closes #5862 from mridulm/optimize_aggregator and squashes the following commits: 61cf43a [Mridul Muralidharan] Use insertAll instead of insert - much more expensive to do it per tuple
-
Dean Chen authored
Author: Dean Chen <deanchen5@gmail.com> Closes #5866 from deanchen/patch-1 and squashes the following commits: 0028bc4 [Dean Chen] Fix typo in Dataframes.py introduced in [SPARK-3444]
-
Tathagata Das authored
`FileUtils.getTempDirectoryPath()` path may or may not exist. We want to make sure that it does not exist. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #5853 from tdas/SPARK-7315 and squashes the following commits: 141afd5 [Tathagata Das] Removed use of FileUtils b08d4f1 [Tathagata Das] Fix flaky WALBackedBlockRDDSuite
-
Andrew Or authored
Note: ~600 lines of this is test code, and ~100 lines documentation.

**[SPARK-7121]** ClosureCleaner does not handle nested closures properly. For instance, in SparkContext, I tried to do the following:
```
def scope[T](body: => T): T = body // no-op
def myCoolMethod(path: String): RDD[String] = scope {
  parallelize(1 to 10).map { _ => path }
}
```
and I got an exception complaining that SparkContext is not serializable. The issue here is that the inner closure is getting its path from the outer closure (the scope), but the outer closure references the SparkContext object itself to get the `parallelize` method. Note, however, that the inner closure doesn't actually need the SparkContext; it just needs a field from the outer closure. If we modify ClosureCleaner to clean the outer closure recursively using only the fields accessed by the inner closure, then we can serialize the inner closure.

**[SPARK-7120]** Also, this file is one of the least understood, partly because it is very low level and was written a long time ago. This patch attempts to change that by adding the missing documentation. This is blocking my effort on a separate task #5729.
Author: Andrew Or <andrew@databricks.com> Closes #5685 from andrewor14/closure-cleaner and squashes the following commits: cd46230 [Andrew Or] Revert a small change that affected streaming 0bbe77f [Andrew Or] Fix style ea874bc [Andrew Or] Fix tests 26c5072 [Andrew Or] Address comments 16fbcfd [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner 26c7aba [Andrew Or] Revert "In sc.runJob, actually clean the inner closure" 6f75784 [Andrew Or] Revert "Guard against NPE if CC is used outside of an application" e909a42 [Andrew Or] Guard against NPE if CC is used outside of an application 3998168 [Andrew Or] In sc.runJob, actually clean the inner closure 9187066 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner d889950 [Andrew Or] Revert "Bypass SerializationDebugger for now (SPARK-7180)" 9419efe [Andrew Or] Bypass SerializationDebugger for now (SPARK-7180) 6d4d3f1 [Andrew Or] Fix scala style? 4aab379 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner e45e904 [Andrew Or] More minor updates (wording, renaming etc.) 8b71cdb [Andrew Or] Update a few comments eb127e5 [Andrew Or] Use private method tester for a few things a3aa465 [Andrew Or] Add more tests for individual closure cleaner operations e672170 [Andrew Or] Guard against potential infinite cycles in method visitor 6d36f38 [Andrew Or] Fix closure cleaner visibility 2106f12 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner 263593d [Andrew Or] Finalize tests 06fd668 [Andrew Or] Make closure cleaning idempotent a4866e3 [Andrew Or] Add tests (still WIP) 438c68f [Andrew Or] Minor changes 2390a60 [Andrew Or] Feature flag this new behavior 86f7823 [Andrew Or] Implement transitive cleaning + add missing documentation
-
Burak Yavuz authored
Adds the Python API for DataFrame's `freqItems`, and addresses your comments from the previous PR. rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes #5859 from brkyvz/df-freq-py2 and squashes the following commits: f9aa9ce [Burak Yavuz] addressed comments v0.1 4b25056 [Burak Yavuz] added python api for freqItems
-
- May 01, 2015
-
-
Mridul Muralidharan authored
Details are in the JIRA. In a nutshell, all the machinery for custom RDDs to leverage Spark shuffle directly (without exposing implementation details of shuffle) exists, except for this small piece. Exposing it allows custom dependencies to get a handle to `ShuffleHandle`, which they can then leverage on the reduce side. Author: Mridul Muralidharan <mridulm@yahoo-inc.com> Closes #5857 from mridulm/expose_shuffle_handle and squashes the following commits: d8b6bd4 [Mridul Muralidharan] Expose ShuffleHandle
-
Marcelo Vanzin authored
There are two main parts of this change: - Extending the bootstrap mechanism in the network library to add a server-side bootstrap (which works a little bit differently than the client-side bootstrap), and to allow the bootstraps to modify the underlying channel. - Use SASL to encrypt data going through the RPC channel. The second item requires some non-optimal code to be able to work around the fact that the outbound path in netty is not thread-safe, and ordering is very important when encryption is in the picture. A lot of the changes outside the network/common library are just to adjust to the changed API for initializing the RPC server. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #5377 from vanzin/SPARK-6229 and squashes the following commits: ff01966 [Marcelo Vanzin] Use fancy new size config style. be53f32 [Marcelo Vanzin] Merge branch 'master' into SPARK-6229 47d4aff [Marcelo Vanzin] Merge branch 'master' into SPARK-6229 7a2a805 [Marcelo Vanzin] Clean up some unneeded changes. 2f92237 [Marcelo Vanzin] Add comment. 67bb0c6 [Marcelo Vanzin] Revert "Avoid exposing ByteArrayWritableChannel outside of test code." 065f684 [Marcelo Vanzin] Add test to verify chunking. 3d1695d [Marcelo Vanzin] Minor cleanups. 73cff0e [Marcelo Vanzin] Skip bytes in decode path too. 318ad23 [Marcelo Vanzin] Avoid exposing ByteArrayWritableChannel outside of test code. 346f829 [Marcelo Vanzin] Avoid trip through channel selector by not reporting 0 bytes written. a4a5938 [Marcelo Vanzin] Review feedback. 4797519 [Marcelo Vanzin] Remove unused import. 9908ada [Marcelo Vanzin] Fix test, SASL backend disposal. 7fe1489 [Marcelo Vanzin] Add a test that makes sure encryption is actually enabled. adb6f9d [Marcelo Vanzin] Review feedback. cf2a605 [Marcelo Vanzin] Clean up some code. 8584323 [Marcelo Vanzin] Fix a comment. e98bc55 [Marcelo Vanzin] Add option to only allow encrypted connections to the server. dad42fc [Marcelo Vanzin] Make encryption thread-safe, less memory-intensive. 
b00999a [Marcelo Vanzin] Consolidate ByteArrayWritableChannel, fix SASL code to match master changes. b923cae [Marcelo Vanzin] Make SASL encryption handler thread-safe, handle FileRegion messages. 39539a7 [Marcelo Vanzin] Add config option to enable SASL encryption. 351a86f [Marcelo Vanzin] Add SASL encryption to network library. fbe6ccb [Marcelo Vanzin] Add TransportServerBootstrap, make SASL code use it.
-