  1. Nov 25, 2014
    • [SPARK-4344][DOCS] adding documentation on spark.yarn.user.classpath.first · d2407601
      arahuja authored
      The documentation for the two parameters is the same, with a pointer from the standalone parameter to the YARN parameter.
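      
      A minimal sketch of how this flag might be set when submitting to YARN, assuming the usual SparkConf/SparkContext setup (the app name is illustrative; only the `spark.yarn.user.classpath.first` key comes from this change):
      
      ```
      import org.apache.spark.{SparkConf, SparkContext}
      
      // Ask YARN to place the user's jars ahead of Spark's own classes on the executor classpath.
      val conf = new SparkConf()
        .setAppName("UserClasspathFirstExample")            // illustrative
        .set("spark.yarn.user.classpath.first", "true")
      
      val sc = new SparkContext(conf)
      ```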
      
      Author: arahuja <aahuja11@gmail.com>
      
      Closes #3209 from arahuja/yarn-classpath-first-param and squashes the following commits:
      
      51cb9b2 [arahuja] [SPARK-4344][DOCS] adding documentation for YARN on userClassPathFirst
      d2407601
    • [SPARK-4381][Streaming]Add warning log when user set spark.master to local in... · fef27b29
      jerryshao authored
      [SPARK-4381][Streaming] Add a warning log when the user sets spark.master to local in Spark Streaming and there is no job executed
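      
      For context, a hedged sketch of the setup the warning targets: with `local` or `local[1]`, the single core is consumed by the receiver, so no batches are ever processed; `local[2]` or more avoids the problem. The socket source and port below are illustrative.
      
      ```
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      // "local[1]" would leave no core for batch processing once the receiver starts,
      // so the job appears to hang; use "local[2]" or more for receiver-based streams.
      val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingExample")
      val ssc = new StreamingContext(conf, Seconds(1))
      
      val lines = ssc.socketTextStream("localhost", 9999)   // illustrative source
      lines.count().print()
      
      ssc.start()
      ssc.awaitTermination()
      ```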
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3244 from jerryshao/SPARK-4381 and squashes the following commits:
      
      d2486c7 [jerryshao] Improve the warning log
      d726e85 [jerryshao] Add local[1] to the filter condition
      eca428b [jerryshao] Add warning log
      fef27b29
    • [SPARK-4535][Streaming] Fix the error in comments · a51118a3
      q00251598 authored
      change `NetworkInputDStream` to `ReceiverInputDStream`
      change `ReceiverInputTracker` to `ReceiverTracker`
      
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #3400 from watermen/fix-comments and squashes the following commits:
      
      75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker'
      a51118a3
    • [SPARK-4526][MLLIB]GradientDescent get a wrong gradient value according to the gradient formula. · f515f943
      GuoQiang Li authored
      This is caused by the miniBatchSize parameter. The number of elements returned by `RDD.sample` is not fixed.
      cc mengxr
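      
      A small illustrative sketch of the underlying behaviour (dataset and fraction are made up): `RDD.sample` draws elements independently, so the returned count fluctuates around `fraction * count`, and the summed gradient must be divided by the actual sampled count rather than the expected one.
      
      ```
      val data = sc.parallelize(1 to 100000)   // assumes an existing SparkContext `sc`
      val fraction = 0.01
      
      // Each run may return a slightly different number of elements, so the mini-batch
      // size cannot be assumed to equal fraction * data.count().
      val sizes = (1 to 5).map(_ => data.sample(withReplacement = false, fraction).count())
      println(sizes.mkString(", "))            // e.g. 1012, 987, 995, 1003, 990 (illustrative)
      ```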
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #3399 from witgo/GradientDescent and squashes the following commits:
      
      13cb228 [GuoQiang Li] review commit
      668ab66 [GuoQiang Li] Double to Long
      b6aa11a [GuoQiang Li] Check miniBatchSize is greater than 0
      0b5c3e3 [GuoQiang Li] Minor fix
      12e7424 [GuoQiang Li] GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
      f515f943
    • [SPARK-4596][MLLib] Refactorize Normalizer to make code cleaner · 89f91226
      DB Tsai authored
      In this refactoring, performance is slightly improved by removing the
      overhead of the Breeze vector. The bottleneck is still in the Breeze norm,
      which is implemented with activeIterator.
      
      This inefficiency of the Breeze norm will be addressed in the next PR. At least,
      this PR makes the code more consistent across the codebase.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3446 from dbtsai/normalizer and squashes the following commits:
      
      e20a2b9 [DB Tsai] first commit
      89f91226
    • [DOC][Build] Wrong cmd for build spark with apache hadoop 2.4.X and hive 12 · 0fe54cff
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3335 from scwf/patch-10 and squashes the following commits:
      
      d343113 [wangfei] add '-Phive'
      60d595e [wangfei] [DOC] Wrong cmd for build spark with apache hadoop 2.4.X and Hive 12 support
      0fe54cff
  2. Nov 24, 2014
    • [SQL] Compute timeTaken correctly · 723be60e
      w00228970 authored
      ```timeTaken``` should not include the time spent printing the result.
      
      Author: w00228970 <wangfei1@huawei.com>
      
      Closes #3423 from scwf/time-taken-bug and squashes the following commits:
      
      da7e102 [w00228970] compute time taken correctly
      723be60e
    • [SPARK-4582][MLLIB] get raw vectors for further processing in Word2Vec · 9ce2bf38
      tkaessmann authored
      This is #3309 for the master branch.
      
      e.g. clustering
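      
      A hedged sketch of feeding the raw vectors into clustering; `getVectors` is the accessor this change exposes, while the corpus, the k-means parameters, and the SparkContext `sc` are illustrative assumptions.
      
      ```
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.feature.Word2Vec
      import org.apache.spark.mllib.linalg.Vectors
      
      val corpus = sc.textFile("docs.txt").map(_.split(" ").toSeq)   // illustrative input
      val model = new Word2Vec().fit(corpus)
      
      // Raw word vectors, exposed for further processing such as clustering.
      val vectors = model.getVectors.values.map(v => Vectors.dense(v.map(_.toDouble)))
      val clusters = KMeans.train(sc.parallelize(vectors.toSeq), k = 10, maxIterations = 20)
      ```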
      
      Author: tkaessmann <tobias.kaessmann@s24.com>
      
      Closes #3309 from tkaessmann/branch-1.2 and squashes the following commits:
      
      e3a3142 [tkaessmann] changes the comment for getVectors
      58d3d83 [tkaessmann] removes sign from comment
      a5be213 [tkaessmann] fixes getVectors to fit code guidelines
      3782fa9 [tkaessmann] get raw vectors for further processing
      
      Author: tkaessmann <tobias.kaessmann@s24.com>
      
      Closes #3437 from mengxr/SPARK-4582 and squashes the following commits:
      
      6c666b4 [tkaessmann] get raw vectors for further processing in Word2Vec
      9ce2bf38
    • [SPARK-4525] Mesos should decline unused offers · f0afb623
      Jongyoul Lee authored
      Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.
      
      I've also done some minor renaming/clean-up of variables in this class and tests.
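      
      The shape of the fix, as a rough sketch only (the helper name and the used-offer bookkeeping are illustrative; `declineOffer` is the standard Mesos `SchedulerDriver` call):
      
      ```
      import org.apache.mesos.Protos.{Offer, OfferID}
      import org.apache.mesos.SchedulerDriver
      
      // Decline every offer that no task was launched on, so Mesos can re-offer
      // those resources to other frameworks instead of keeping them reserved.
      def declineUnusedOffers(
          driver: SchedulerDriver,
          offers: Seq[Offer],
          usedOfferIds: Set[OfferID]): Unit = {
        offers.filterNot(o => usedOfferIds.contains(o.getId)).foreach { o =>
          driver.declineOffer(o.getId)
        }
      }
      ```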
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3436 from pwendell/mesos-issue and squashes the following commits:
      
      58c35b5 [Patrick Wendell] Adding unit test for this situation
      c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
      f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
      f0afb623
    • Revert "[SPARK-4525] Mesos should decline unused offers" · a68d4422
      Patrick Wendell authored
      This reverts commit b043c274.
      
      I accidentally committed this using my own authorship credential. However,
      I should have given authorship to the original author: Jongyoul Lee.
      a68d4422
    • [SPARK-4525] Mesos should decline unused offers · b043c274
      Patrick Wendell authored
      Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.
      
      I've also done some minor renaming/clean-up of variables in this class and tests.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3436 from pwendell/mesos-issue and squashes the following commits:
      
      58c35b5 [Patrick Wendell] Adding unit test for this situation
      c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
      f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
      b043c274
    • [SPARK-4266] [Web-UI] Reduce stage page load time. · d24d5bf0
      Kay Ousterhout authored
      The commit changes the JavaScript used to show/hide additional
      metrics in order to reduce page load time. SPARK-4016 significantly
      increased page load time for the stage page when stages had a lot
      (thousands or tens of thousands) of tasks, due to the additional
      JavaScript to hide some metrics by default and stripe the tables.
      This commit reduces page load time in two ways:
      
      (1) Now, all of the metrics that are hidden by default are
      hidden by setting "display: none;" in the page's CSS,
      rather than hiding them with JavaScript after the page loads.
      Without this change, for stages with thousands of tasks, there
      was a delay of a few seconds after page load, during which the
      additional metrics were first shown and then hidden once the
      relevant JS finished running.
      
      (2) CSS is used to stripe all of the tables except for the summary
      table. The summary table needs JavaScript to do the striping because
      some rows are hidden, but the JavaScript striping is slower, which
      again resulted in a delay when it was used for the task table (where,
      for a few seconds after page load, all of the rows in the task table
      would be white while the browser finished running the JS to stripe
      the table).
      
      cc pwendell
      
      This change is intended to be backported to 1.2 to avoid a regression in
      UI performance when users run large jobs.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #3328 from kayousterhout/SPARK-4266 and squashes the following commits:
      
      f964091 [Kay Ousterhout] [SPARK-4266] [Web-UI] Reduce stage page load time.
      d24d5bf0
    • [SPARK-4548] [SPARK-4517] improve performance of python broadcast · 6cf50768
      Davies Liu authored
      Re-implement the Python broadcast using files:
      
      1) Serialize the Python object using cPickle and write it to disk.
      2) Create a wrapper in the JVM (for the dumped file); it reads data from the file during serialization.
      3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to the executors.
      4) During deserialization, write the data to disk.
      5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object only on first access.
      
      It fixes the performance regression introduced in #2659, has similar performance to 1.1, but supports objects larger than 2G and also improves memory efficiency (only one compressed copy in the driver and each executor).
      
      Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
      
      | name | 1.1 | 1.2 with this patch | improvement |
      |------|-----|---------------------|-------------|
      | python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% |
      | python-broadcast-w-set | 4.13 | 4.50 | -8.35% |
      
      Testing with 100 tasks (16 CPUs):
      
      | name | 1.1 | 1.2 with this patch | improvement |
      |------|-----|---------------------|-------------|
      | python-broadcast-w-bytes | 38.16 | 8.40 | 353.98% |
      | python-broadcast-w-set | 23.29 | 9.59 | 142.80% |
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3417 from davies/pybroadcast and squashes the following commits:
      
      50a58e0 [Davies Liu] address comments
      b98de1d [Davies Liu] disable gc while unpickle
      e5ee6b9 [Davies Liu] support large string
      09303b8 [Davies Liu] read all data into memory
      dde02dd [Davies Liu] improve performance of python broadcast
      6cf50768
    • [SPARK-4578] fix asDict() with nested Row() · 050616b4
      Davies Liu authored
      The Row object is created on the fly once a field is accessed, so we should access fields via getattr() in asDict().
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3434 from davies/fix_asDict and squashes the following commits:
      
      b20f1e7 [Davies Liu] fix asDict() with nested Row()
      050616b4
    • [SPARK-4562] [MLlib] speedup vector · b660de7a
      Davies Liu authored
      This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.
      
      It also improves the serialization of DenseVector.
      
      Before this change:
      
      | trial | trainingTime | testTime |
      |-------|--------------|----------|
      | 0 | 5.126 | 1.786 |
      | 1 | 2.698 | 1.693 |
      
      After the change:
      
      | trial | trainingTime | testTime |
      |-------|--------------|----------|
      | 0 | 4.692 | 0.554 |
      | 1 | 2.307 | 0.525 |
      
      This could partially fix the performance regression during test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3420 from davies/ser2 and squashes the following commits:
      
      0e1e6f3 [Davies Liu] fix tests
      426f5db [Davies Liu] impove toArray()
      44707ec [Davies Liu] add name for ISO-8859-1
      fa7d791 [Davies Liu] address comments
      1cfb137 [Davies Liu] handle zero sparse vector
      2548ee2 [Davies Liu] fix tests
      9e6389d [Davies Liu] bugfix
      470f702 [Davies Liu] speed up DenseMatrix
      f0d3c40 [Davies Liu] speedup SparseVector
      ef6ce70 [Davies Liu] speed up dense vector
      b660de7a
    • [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files... · cb0e9b09
      Tathagata Das authored
      [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times
      
      Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories.
      
      pwendell Please take a look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3419 from tdas/filestream-fix2 and squashes the following commits:
      
      c19dd8a [Tathagata Das] Addressed PR comments.
      513b608 [Tathagata Das] Updated docs.
      d364faf [Tathagata Das] Added the current time condition back
      5526222 [Tathagata Das] Removed unnecessary imports.
      38bb736 [Tathagata Das] Fix long line.
      203bbc7 [Tathagata Das] Un-ignore tests.
      eaef4e1 [Tathagata Das] Fixed SPARK-4519
      9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.
      cb0e9b09
    • [SPARK-4145] Web UI job pages · 4a90276a
      Josh Rosen authored
      This PR adds two new pages to the Spark Web UI:
      
      - A jobs overview page, which shows details on running / completed / failed jobs.
      - A job details page, which displays information on an individual job's stages.
      
      The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.
      
      ### Screenshots
      
      #### New UI homepage
      
      ![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png)
      
      #### Job details page
      
      (This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png)
      
      ### Key changes in this PR
      
      - Rename `JobProgressPage` to `AllStagesPage`
      - Expose `StageInfo` objects in the `SparkListenerJobStart` event; add backwards-compatibility tests to JsonProtocol.
      - Add additional data structures to `JobProgressListener` to map from stages to jobs.
      - Add several fields to `JobUIData`.
      
      I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.
      
      ### Limitations
      
      If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.
      
      If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3009 from JoshRosen/job-page and squashes the following commits:
      
      eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
      f00c851 [Josh Rosen] Fix JsonProtocol compatibility
      b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
      ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
      6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
      2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages.
      61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables.
      1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
      0b77e3e [Josh Rosen] More bug fixes for phantom stages.
      034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
      eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
      67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
      7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page
      d69c775 [Josh Rosen] Fix table sorting on all jobs page.
      5eb39dc [Josh Rosen] Add pending stages table to job page.
      f2a15da [Josh Rosen] Add status field to job details page.
      171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
      e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
      8955f4c [Josh Rosen] Display information for pending stages on jobs page.
      8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
      5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
      79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur.
      d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
      1145c60 [Josh Rosen] Display text instead of progress bar for stages.
      3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
      8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
      b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed.
      4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups.
      4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
      85e9c85 [Josh Rosen] Extract startTime into separate variable.
      1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions.
      56701fa [Josh Rosen] Move last stage name / description logic out of markup.
      a475ea1 [Josh Rosen] Add progress bars to jobs page.
      45343b8 [Josh Rosen] More comments
      4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
      bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
      4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
      2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:
      4a90276a
    • [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY. · dd1c9cb3
      Kousuke Saruta authored
      When we use an ORDER BY clause, first the attributes referenced by the projection are resolved (1).
      Then the attributes referenced in the ORDER BY clause are resolved (2).
      But when resolving the attributes referenced in the ORDER BY clause, the resolution result generated in (1) is discarded, so, for example, the following query fails.
      
          SELECT c1 + c2 FROM mytable ORDER BY c1;
      
      The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:
      
      fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
      6e60c20 [Kousuke Saruta] Fixed conflicts
      cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
      282d529 [Kousuke Saruta] Fixed attributes reference resolution error
      b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
      317b7fb [Kousuke Saruta] WIP
      dd1c9cb3
    • [SQL] Fix path in HiveFromSpark · b3841193
      scwf authored
      It requires us to run ```HiveFromSpark``` from a specific directory because ```HiveFromSpark``` uses a relative path; this leads to a ```run-example``` error (http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html).
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3415 from scwf/HiveFromSpark and squashes the following commits:
      
      ed3d6c9 [scwf] revert no need change
      b00e20c [scwf] fix path usring spark_home
      dbd321b [scwf] fix path in hivefromspark
      b3841193
    • [SQL] Fix comment in HiveShim · d5834f07
      Daniel Darabos authored
      This file is for Hive 0.13.1 I think.
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #3432 from darabos/patch-2 and squashes the following commits:
      
      4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.
      d5834f07
    • [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on · a6d7b61f
      Cheng Lian authored
      This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,
      
      1. RDD element buffering is also bypassed, since buffering is the reason that `MutableRow`-backed row objects must be copied, and
      2. defensive copies in the `Exchange` operator are avoided.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:
      
      591f2e9 [Cheng Lian] Passes all shuffle suites
      0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
      ed5df3c [Cheng Lian] Fixes styling changes
      f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
      a6d7b61f
    • SPARK-4457. Document how to build for Hadoop versions greater than 2.4 · 29372b63
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:
      
      5e72b77 [Sandy Ryza] Feedback
      0cf05c1 [Sandy Ryza] Caveat
      be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
      29372b63
  3. Nov 22, 2014
    • [SPARK-4377] Fixed serialization issue by switching to akka provided serializer. · 9b2a3c61
      Prashant Sharma authored
      ... - there is no way around this for deserializing actorRef(s).
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #3402 from ScrapCodes/SPARK-4377/troubleDeserializing and squashes the following commits:
      
      77233fd [Prashant Sharma] Style fixes
      9b35c6e [Prashant Sharma] Scalastyle fixes
      29880da [Prashant Sharma] [SPARK-4377] Fixed serialization issue by switching to akka provided serializer - there is no way around this for deserializing actorRef(s).
      9b2a3c61
  4. Nov 21, 2014
    • [SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector · b5d17ef1
      DB Tsai authored
      Previously, we were using Breeze's activeIterator to access the non-zero elements
      in dense/sparse vectors. Due to the overhead, we switched back to a native `while loop`
      in #SPARK-4129.
      
      However, #SPARK-4129 requires de-referencing dv.values/sv.values on
      each access to a value, which is very expensive. Also, in MultivariateOnlineSummarizer,
      we're using Breeze's dense vector to store the partial stats, and this is very expensive compared
      with using a primitive Scala array.
      
      In this PR, an efficient foreachActive is implemented to unify the code path for dense and sparse
      vector operations, which makes the codebase easier to maintain. The Breeze dense vector is replaced
      by a primitive array to reduce the overhead further.
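      
      A small sketch of the unified traversal this provides; the `(index, value)` callback is the `foreachActive` signature introduced here, and the sum-of-squares computation is illustrative.
      
      ```
      import org.apache.spark.mllib.linalg.Vectors
      
      val dense  = Vectors.dense(1.0, 0.0, 3.0)
      val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
      
      // The same code path handles both representations; only the stored (active)
      // entries are visited, with no Breeze iterator overhead.
      def sumOfSquares(v: org.apache.spark.mllib.linalg.Vector): Double = {
        var sum = 0.0
        v.foreachActive((_, value) => sum += value * value)
        sum
      }
      
      println(sumOfSquares(dense))   // 10.0
      println(sumOfSquares(sparse))  // 10.0
      ```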
      
      Benchmarking with the mnist8m dataset on a single JVM,
      with the first 200 samples loaded in memory, repeating 5000 times.
      
      Before change:
      Sparse Vector - 30.02
      Dense Vector - 38.27
      
      With this PR:
      Sparse Vector - 6.29
      Dense Vector - 11.72
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3288 from dbtsai/activeIterator and squashes the following commits:
      
      844b0e6 [DB Tsai] formating
      03dd693 [DB Tsai] futher performance tunning.
      1907ae1 [DB Tsai] address feedback
      98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation.
      c0cbd5a [DB Tsai] fix a bug
      6441f92 [DB Tsai] Finished SPARK-4431
      b5d17ef1
    • [SPARK-4531] [MLlib] cache serialized java object · ce95bd8e
      Davies Liu authored
      Pyrolite is pretty slow (compared to the ad-hoc serializer in 1.1); it causes a large performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects in each step.
      
      This PR changes to caching the deserialized JavaRDD instead of the PythonRDD, to avoid the Pyrolite deserialization. It should have similar memory usage as before, but be much faster.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3397 from davies/cache and squashes the following commits:
      
      7f6e6ce [Davies Liu] Update -> Updater
      4b52edd [Davies Liu] using named argument
      63b984e [Davies Liu] fix
      7da0332 [Davies Liu] add unpersist()
      dff33e1 [Davies Liu] address comments
      c2bdfc2 [Davies Liu] refactor
      d572f00 [Davies Liu] Merge branch 'master' into cache
      f1063e1 [Davies Liu] cache serialized java object
      ce95bd8e
    • SPARK-4532: Fix bug in detection of Hive in Spark 1.2 · a81918c5
      Patrick Wendell authored
      Because the Hive profile is no longer defined in the root pom,
      we need to check specifically in the sql/hive pom when we
      perform the check in make-distribution.sh.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3398 from pwendell/make-distribution and squashes the following commits:
      
      8a58279 [Patrick Wendell] Fix bug in detection of Hive in Spark 1.2
      a81918c5
    • [SPARK-4397][Core] Reorganize 'implicit's to improve the API convenience · 65b987c3
      zsxwing authored
      This PR moved `implicit`s to the `package object` and `companion object` to enable the Scala compiler to find them automatically, without explicit imports.
      
      It should not break any API. A test project for backward compatibility is [here](https://github.com/zsxwing/SPARK-4397-Backforward-Compatibility). It proves that code compiled with Spark 1.1.0 can run with this PR.
      
      To summarize, the changes are:
      
      * Deprecated the old implicit conversion functions: this preserves binary compatibility for code compiled against earlier versions of Spark.
      * Removed "implicit" from them so they are just normal functions: this makes sure the compiler doesn't get confused and warn about multiple implicits in scope.
      * Created new implicit functions in the rdd package object, which is part of the scope that scalac searches when looking for implicit conversions on various RDD objects.
      
      The disadvantage is that there is duplicated code in SparkContext for backward compatibility.
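      
      A hedged before/after sketch of what the reorganization enables (the word-count pipeline and input path are illustrative; `sc` is assumed to be an existing SparkContext):
      
      ```
      // Before: pair-RDD operations required importing the implicit conversions explicitly,
      //   import org.apache.spark.SparkContext._
      // After this change the conversions live in the companion/package objects,
      // so scalac finds them without the import:
      val counts = sc.textFile("input.txt")
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)   // PairRDDFunctions picked up implicitly
      ```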
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3262 from zsxwing/SPARK-4397 and squashes the following commits:
      
      fc30314 [zsxwing] Update the comments
      9c27aff [zsxwing] Move implicit functions to object RDD and forward old functions to new implicit ones directly
      2b5f5a4 [zsxwing] Comments for the deprecated functions
      52353de [zsxwing] Remove private[spark] from object WritableConverter
      34641d4 [zsxwing] Move ImplicitSuite to org.apache.sparktest
      7266218 [zsxwing] Add comments to warn the duplicate codes in SparkContext
      185c12f [zsxwing] Remove simpleWritableConverter from SparkContext
      3bdcae2 [zsxwing] Move WritableConverter implicits to object WritableConverter
      9b73188 [zsxwing] Fix the code style issue
      3ac4f07 [zsxwing] Add license header
      1eda9e4 [zsxwing] Reorganize 'implicit's to improve the API convenience
      65b987c3
    • [SPARK-4472][Shell] Print "Spark context available as sc." only when SparkContext is created... · f1069b84
      zsxwing authored
      ... successfully
      
      It's weird to print "Spark context available as sc." when the SparkContext was not created successfully.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3341 from zsxwing/SPARK-4472 and squashes the following commits:
      
      4850093 [zsxwing] Print "Spark context available as sc." only when SparkContext is created successfully
      f1069b84
    • [Doc][GraphX] Remove unused png files. · 28fdc6f6
      Reynold Xin authored
      28fdc6f6
  5. Nov 20, 2014
    • [SPARK-4522][SQL] Parse schema with missing metadata. · 90a6a46b
      Michael Armbrust authored
      This is just a quick fix for 1.2.  SPARK-4523 describes a more complete solution.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3392 from marmbrus/parquetMetadata and squashes the following commits:
      
      bcc6626 [Michael Armbrust] Parse schema with missing metadata.
      90a6a46b
    • add Sphinx as a dependency of building docs · 8cd6eea6
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3388 from davies/doc_readme and squashes the following commits:
      
      daa1482 [Davies Liu] add Sphinx dependency
      8cd6eea6
    • [SPARK-4413][SQL] Parquet support through datasource API · 02ec058e
      Michael Armbrust authored
      Goals:
       - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
       - Support for folder based partitioning with automatic discovery of available partitions
       - Caching of file metadata
      
      See scaladoc of `ParquetRelation2` for more details.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3269 from marmbrus/newParquet and squashes the following commits:
      
      1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
      645768b [Michael Armbrust] Review comments.
      abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
      938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
      e9d2641 [Michael Armbrust] logging / formatting improvements.
      02ec058e
    • [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters · 84d79ee9
      Cheng Hao authored
      The query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` throws an exception, because some Hive Generic UDFs/UDAFs require the input object inspector to be a `ConstantObjectInspector`; however, we don't have that until expression optimization (constant folding) has executed.
      
      This PR is a workaround to fix this. (Ideally, the `output` of a LogicalPlan should be identical before and after optimization.)
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3109 from chenghao-intel/optimized and squashes the following commits:
      
      487ff79 [Cheng Hao] rebase to the latest master & update the unittest
      84d79ee9
    • [SPARK-4477] [PySpark] remove numpy from RDDSampler · d39f2e9c
      Davies Liu authored
      In RDDSampler, it tries to use numpy to gain better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in the pure Python implementation of poisson(), so there is not much performance gain from numpy.
      
      numpy is not a dependency of pyspark, so it may introduce problems, such as numpy being installed only on the master and not on the slaves, as reported in SPARK-927.
      
      It also complicates the code a lot, so we should remove numpy from RDDSampler.
      
      I also did some benchmark to verify that:
      ```
      >>> from pyspark.mllib.random import RandomRDDs
      >>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
      >>> rdd.count()  # cache it
      >>> rdd.sample(True, 0.9).count()    # measure this line
      ```
      the results:
      
      | withReplacement | random | numpy.random |
      |-----------------|--------|--------------|
      | True | 1.5 s | 1.4 s |
      | False | 0.6 s | 0.8 s |
      
      closes #2313
      
      Note: this patch includes some commits that are not mirrored to GitHub; it will be OK after it catches up.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3351 from davies/numpy and squashes the following commits:
      
      5c438d7 [Davies Liu] fix comment
      c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
      98eb31b [Xiangrui Meng] make poisson sampling slightly faster
      ee17d78 [Davies Liu] remove = for float
      13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
      f583023 [Davies Liu] fix tests
      51649f5 [Davies Liu] remove numpy in RDDSampler
      78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
      f5fdf63 [Davies Liu] fix bug with int in weights
      4dfa2cd [Davies Liu] refactor
      f866bcf [Davies Liu] remove unneeded change
      c7a2007 [Davies Liu] switch to python implementation
      95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
      0d9b256 [Davies Liu] refactor
      1715ee3 [Davies Liu] address comments
      41fce54 [Davies Liu] randomSplit()
      d39f2e9c
    • [SQL] fix function description mistake · ad5f1f3c
      Jacky Li authored
      The sample code in the description of SchemaRDD.where is not correct.
      
      Author: Jacky Li <jacky.likun@gmail.com>
      
      Closes #3344 from jackylk/patch-6 and squashes the following commits:
      
      62cd126 [Jacky Li] [SQL] fix function description mistake
      ad5f1f3c
    • [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command · 6aa0fc9f
      Cheng Hao authored
      Hive supports `EXPLAIN` for CTAS, which Spark SQL supported previously; however, it seems this was lost after the code refactoring in HiveQL.
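      
      A hedged example of the statement shape this re-enables (table and query are illustrative; `hiveContext` is assumed to be a HiveContext in scope):
      
      ```
      // EXPLAIN over a CREATE TABLE ... AS SELECT (CTAS) statement.
      hiveContext.sql(
        "EXPLAIN CREATE TABLE key_value_copy AS SELECT key, value FROM src"
      ).collect().foreach(println)
      ```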
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3357 from chenghao-intel/explain and squashes the following commits:
      
      7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
      6aa0fc9f
    • [SPARK-4318][SQL] Fix empty sum distinct. · 2c2e7a44
      Takuya UESHIN authored
      Executing sum distinct on an empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.
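      
      A hedged reproduction sketch (the table name is illustrative; `sqlContext` is assumed to be in scope):
      
      ```
      // Before this fix, SUM(DISTINCT ...) over a table with no rows failed with
      // java.lang.UnsupportedOperationException: empty.reduceLeft
      val result = sqlContext.sql("SELECT SUM(DISTINCT value) FROM emptyTable")
      result.collect().foreach(println)
      ```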
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:
      
      8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
      66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
      6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
      d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
      1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
      917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
      1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
      a5a57d2 [Takuya UESHIN] Fix empty Average.
      22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
      65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
      2c2e7a44
    • [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL · 98e94197
      ravipesala authored
      The relational operator '<=>' is not working in Spark SQL, while the same operator works in Spark HiveQL.
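      
      A hedged example of the null-safe equality operator the parser now accepts (table and columns are illustrative; `sqlContext` is assumed to be in scope):
      
      ```
      // '<=>' is null-safe equality: it evaluates to true when both sides are NULL,
      // instead of NULL as plain '=' would.
      val matched = sqlContext.sql("SELECT * FROM records WHERE a <=> b")
      matched.collect().foreach(println)
      ```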
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3387 from ravipesala/<=> and squashes the following commits:
      
      7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
      98e94197
    • [SPARK-4439] [MLlib] add python api for random forest · 1c53a5db
      Davies Liu authored
      ```
          class RandomForestModel
           |  A model trained by RandomForest
           |
           |  numTrees(self)
           |      Get number of trees in forest.
           |
           |  predict(self, x)
           |      Predict values for a single data point or an RDD of points using the model trained.
           |
           |  toDebugString(self)
           |      Full model
           |
           |  totalNumNodes(self)
           |      Get total number of nodes, summed over all trees in the forest.
           |
      
          class RandomForest
           |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
           |      Method to train a decision tree model for binary or multiclass classification.
           |
           |      :param data: Training dataset: RDD of LabeledPoint.
           |                   Labels should take values {0, 1, ..., numClasses-1}.
           |      :param numClassesForClassification: number of classes for classification.
           |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
           |                                  E.g., an entry (n -> k) indicates that feature n is categorical
           |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
           |      :param numTrees: Number of trees in the random forest.
           |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
           |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
           |                                If "auto" is set, this parameter is set based on numTrees:
           |                                  if numTrees == 1, set to "all";
           |                                  if numTrees > 1 (forest) set to "sqrt".
           |      :param impurity: Criterion used for information gain calculation.
           |                   Supported values: "gini" (recommended) or "entropy".
           |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
           |                       1 internal node + 2 leaf nodes. (default: 4)
           |      :param maxBins: maximum number of bins used for splitting features (default: 100)
           |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
           |      :return: RandomForestModel that can be used for prediction
           |
           |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
           |      Method to train a decision tree model for regression.
           |
           |      :param data: Training dataset: RDD of LabeledPoint.
           |                   Labels are real numbers.
           |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
           |                                   E.g., an entry (n -> k) indicates that feature n is categorical
           |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
           |      :param numTrees: Number of trees in the random forest.
           |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
           |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
           |                                 If "auto" is set, this parameter is set based on numTrees:
           |                                 if numTrees == 1, set to "all";
           |                                 if numTrees > 1 (forest) set to "onethird".
           |      :param impurity: Criterion used for information gain calculation.
           |                       Supported values: "variance".
           |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
           |                       1 internal node + 2 leaf nodes.(default: 4)
           |      :param maxBins: maximum number of bins used for splitting features (default: 100)
           |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
           |      :return: RandomForestModel that can be used for prediction
           |
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3320 from davies/forest and squashes the following commits:
      
      8003dfc [Davies Liu] reorder
      53cf510 [Davies Liu] fix docs
      4ca593d [Davies Liu] fix docs
      e0df852 [Davies Liu] fix docs
      0431746 [Davies Liu] rebased
      2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
      885abee [Davies Liu] address comments
      dae7fc0 [Davies Liu] address comments
      89a000f [Davies Liu] fix docs
      565d476 [Davies Liu] add python api for random forest
      1c53a5db