  1. Feb 24, 2015
    • [SPARK-5904][SQL] DataFrame Java API test suites. · 53a1ebf3
      Reynold Xin authored
      Added a new test suite to make sure Java DF programs can use varargs properly.
      Also moved all suites into the test.org.apache.spark package to make sure the suites also test for method visibility.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4751 from rxin/df-tests and squashes the following commits:
      
      1e8b8e4 [Reynold Xin] Fixed imports and renamed JavaAPISuite.
      a6ca53b [Reynold Xin] [SPARK-5904][SQL] DataFrame Java API test suites.
    • [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness · f816e739
      Cheng Lian authored
      **NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged.
      
      `HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicating with external server processes. This PR revamps the test suite for better robustness:
      
      1. Fixes a race condition that occurred while using `tail -f` to check the log file
      
         It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked.
      
      2. Retries up to 3 times if the server fails to start
      
         In most cases, the server fails to start because of a port conflict. This PR no longer asks the system to choose an available TCP port; instead it uses a random port first and retries up to 3 times if the server fails to start (see the sketch after this list).
      
      3. A server instance is reused among all test cases within a single suite
      
         The original `HiveThriftServer2Suite` is split into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases.
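
      As a hedged illustration of the retry strategy in (2), a minimal Scala sketch (the `startServer` helper and the port range are hypothetical, not from this PR):

      ```
      import scala.util.Random

      // Pick a random port and retry up to 3 times if the server fails to start.
      def startWithRetry(startServer: Int => Boolean, maxRetries: Int = 3): Int = {
        for (attempt <- 1 to maxRetries) {
          val port = 10000 + Random.nextInt(10000)  // random port, not OS-assigned
          if (startServer(port)) return port
          println(s"Attempt $attempt failed on port $port; retrying")
        }
        sys.error(s"Failed to start server after $maxRetries attempts")
      }
      ```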
      
      **TODO**
      
      - [ ] Start the Thrift server in the foreground once #3881 is merged (adding a `--foreground` flag to `spark-daemon.sh`)
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits:
      
      d6c80eb [Cheng Lian] Relaxes server startup timeout
      6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness
    • [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation · 2a0fe348
      MechCoder authored
      One can stop early if the decrease in error rate is less than a certain tolerance, or if the error increases because the training data is overfit.
      
      This introduces a new method, runWithValidation, which takes in a pair of RDDs: one for the training data and the other for validation.
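
      A hedged usage sketch of the new method (variable names like `data` and the split ratios are assumptions):

      ```
      import org.apache.spark.mllib.tree.GradientBoostedTrees
      import org.apache.spark.mllib.tree.configuration.BoostingStrategy

      // data: RDD[LabeledPoint], assumed to be defined already.
      val Array(training, validation) = data.randomSplit(Array(0.8, 0.2))
      val boostingStrategy = BoostingStrategy.defaultParams("Regression")
      boostingStrategy.numIterations = 100  // an upper bound; validation can stop training earlier
      val model = new GradientBoostedTrees(boostingStrategy)
        .runWithValidation(training, validation)
      ```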
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #4677 from MechCoder/spark-5436 and squashes the following commits:
      
      1bb21d4 [MechCoder] Combine regression and classification tests into a single one
      e4d799b [MechCoder] Addresses indentation and doc comments
      b48a70f [MechCoder] COSMIT
      b928a19 [MechCoder] Move validation while training section under usage tips
      fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
      55e5c3b [MechCoder] One liner for prevValidateError
      3e74372 [MechCoder] TST: Add test for classification
      77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
    • [SPARK-5973] [PySpark] fix zip with two RDDs with AutoBatchedSerializer · da505e59
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4745 from davies/fix_zip and squashes the following commits:
      
      2124b2c [Davies Liu] Update tests.py
      b5c828f [Davies Liu] increase the number of records
      c1e40fd [Davies Liu] fix zip with two RDDs with AutoBatchedSerializer
    • [SPARK-5952][SQL] Lock when using hive metastore client · a2b91379
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4746 from marmbrus/hiveLock and squashes the following commits:
      
      8b871cf [Michael Armbrust] [SPARK-5952][SQL] Lock when using hive metastore client
    • [Spark-5708] Add Slf4jSink to Spark Metrics · c5ba975e
      Judy authored
      Add Slf4jSink to Spark Metrics using Coda Hale's Slf4jReporter.
      This sends metrics to log4j, allowing Spark users to reuse the log4j pipeline for metrics collection.
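
      A hedged configuration sketch for `conf/metrics.properties` (the `period`/`unit` values are illustrative assumptions):

      ```
      # Enable the Slf4j sink for all instances
      *.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink
      *.sink.slf4j.period=10
      *.sink.slf4j.unit=seconds
      ```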
      
      Reviewed the existing unit tests and didn't see any sink-related tests. Please advise on whether tests should be added.
      
      Author: Judy <judynash@microsoft.com>
      Author: judynash <judynash@microsoft.com>
      
      Closes #4644 from judynash/master and squashes the following commits:
      
      57ef214 [judynash] doc clarification and indent fixes
      a751a66 [Judy] Spark-5708: Add Slf4jSink to Spark Metrics
    • [MLLIB] Change x_i to y_i in Variance's user guide · 105791e3
      Xiangrui Meng authored
      Variance is calculated on labels/responses.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4740 from mengxr/patch-1 and squashes the following commits:
      
      673317b [Xiangrui Meng] [MLLIB] Change x_i to y_i in Variance's user guide
    • [SPARK-5965] Standalone Worker UI displays {{USER_JAR}} · 6d2caa57
      Andrew Or authored
      For screenshot see: https://issues.apache.org/jira/browse/SPARK-5965
      This was caused by 20a60131.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4739 from andrewor14/user-jar-blocker and squashes the following commits:
      
      23c4a9e [Andrew Or] Use right argument
    • [Spark-5967] [UI] Correctly clean JobProgressListener.stageIdToActiveJobIds · 64d2c01f
      Tathagata Das authored
      The patch should be self-explanatory.
      CC: pwendell JoshRosen
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4741 from tdas/SPARK-5967 and squashes the following commits:
      
      653b5bb [Tathagata Das] Fixed the fix and added test
      e2de972 [Tathagata Das] Clear stages which have no corresponding active jobs.
    • [SPARK-5532][SQL] Repartition should not use external rdd representation · 20123662
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4738 from marmbrus/udtRepart and squashes the following commits:
      
      c06d7b5 [Michael Armbrust] fix compilation
      91c8829 [Michael Armbrust] [SQL][SPARK-5532] Repartition should not use external rdd representation
    • [SPARK-5910][SQL] Support for as in selectExpr · 0a59e45e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4736 from marmbrus/asExprs and squashes the following commits:
      
      5ba97e4 [Michael Armbrust] [SPARK-5910][SQL] Support for as in selectExpr
    • [SPARK-5968] [SQL] Suppresses ParquetOutputCommitter WARN logs · 84033313
      Cheng Lian authored
      Please refer to the [JIRA ticket] [1] for the motivation.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-5968
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4744 from liancheng/spark-5968 and squashes the following commits:
      
      caac6a8 [Cheng Lian] Suppresses ParquetOutputCommitter WARN logs
    • [SPARK-5958][MLLIB][DOC] update block matrix user guide · cf2e4165
      Xiangrui Meng authored
      * Removed SVD code from examples.
      * Corrected Java API doc link.
      * Updated variable names: `AtransposeA` -> `ata`.
      * Minor changes.
      
      CC: brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4737 from mengxr/update-block-matrix-user-guide and squashes the following commits:
      
      70f53ac [Xiangrui Meng] update block matrix user guide
  2. Feb 23, 2015
    • [SPARK-5873][SQL] Allow viewing of partially analyzed plans in queryExecution · 1ed57086
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4684 from marmbrus/explainAnalysis and squashes the following commits:
      
      afbaa19 [Michael Armbrust] fix python
      d93278c [Michael Armbrust] fix hive
      e5fa0a4 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
      52119f2 [Michael Armbrust] more tests
      82a5431 [Michael Armbrust] fix tests
      25753d2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
      aee1e6a [Michael Armbrust] fix hive
      b23a844 [Michael Armbrust] newline
      de8dc51 [Michael Armbrust] more comments
      acf620a [Michael Armbrust] [SPARK-5873][SQL] Show partially analyzed plans in query execution
    • [SPARK-5935][SQL] Accept MapType in the schema provided to a JSON dataset. · 48376bfe
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-5935
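
      A hedged sketch of the 1.3-era usage (the input RDD and field names are assumptions):

      ```
      import org.apache.spark.sql.types._

      // For records like {"name": "a", "scores": {"math": 90}}:
      val schema = StructType(Seq(
        StructField("name", StringType),
        StructField("scores", MapType(StringType, IntegerType))))
      val df = sqlContext.jsonRDD(jsonStrings, schema)  // jsonStrings: RDD[String], assumed
      ```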
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #4710 from yhuai/jsonMapType and squashes the following commits:
      
      3e40390 [Yin Huai] Remove unnecessary changes.
      f8e6267 [Yin Huai] Fix test.
      baa36e3 [Yin Huai] Accept MapType in the schema provided to jsonFile/jsonRDD.
    • [SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs · 59536cc8
      Joseph K. Bradley authored
      Fixes:
      * typo in Scala example
      * Removed comment "usually applied on sparse data" since that is debatable
      * small edits to text for clarity
      
      CC: avulanov. I noticed a typo post-hoc and ended up making a few small edits. Do the changes look OK?
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:
      
      9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
      3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
    • [MLLIB] SPARK-5912 Programming guide for feature selection · 28ccf5ee
      Alexander Ulanov authored
      Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      
      Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:
      
      19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
      58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
      eb6b9fe [Alexander Ulanov] Typo
      2921a1d [Alexander Ulanov] ChiSqSelector example of use
      c845350 [Alexander Ulanov] ChiSqSelector docs
    • [SPARK-5939][MLLib] make FPGrowth example app take parameters · 651a1c01
      Jacky Li authored
      Add parameter parsing to the FPGrowth example app in Scala and Java, and add a sample data file in the data/mllib folder.
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4714 from jackylk/parameter and squashes the following commits:
      
      8c478b3 [Jacky Li] fix according to comments
      3bb74f6 [Jacky Li] make FPGrowth exampl app take parameters
      f0e4d10 [Jacky Li] make FPGrowth exampl app take parameters
    • [SPARK-5724] fix the misconfiguration in AkkaUtils · 242d4958
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-5724
      
      In AkkaUtils, we set several failure-detector-related parameters as follows:
      
      ```
      val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
            .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
            s"""
            |akka.daemonic = on
      |akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
            |akka.stdout-loglevel = "ERROR"
            |akka.jvm-exit-on-fatal-error = off
            |akka.remote.require-cookie = "$requireCookie"
            |akka.remote.secure-cookie = "$secureCookie"
            |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
            |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
            |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
            |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
            |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
            |akka.remote.netty.tcp.hostname = "$host"
            |akka.remote.netty.tcp.port = $port
            |akka.remote.netty.tcp.tcp-nodelay = on
            |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
            |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
            |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
            |akka.actor.default-dispatcher.throughput = $akkaBatchSize
            |akka.log-config-on-start = $logAkkaConfig
            |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
            |akka.log-dead-letters = $lifecycleEvents
            |akka.log-dead-letters-during-shutdown = $lifecycleEvents
            """.stripMargin))
      
      ```
      
      Actually, we do not have any parameter named "akka.remote.transport-failure-detector.threshold" (see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html); what we have is "akka.remote.watch-failure-detector.threshold".
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:
      
      bafe56e [CodingCat] fix the grammar in configuration doc
      338296e [CodingCat] remove failure-detector related info
      8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
    • [SPARK-5943][Streaming] Update the test to use new API to reduce the warning · 757b14b8
      Saisai Shao authored
      Author: Saisai Shao <saisai.shao@intel.com>
      
      Closes #4722 from jerryshao/SPARK-5943 and squashes the following commits:
      
      1b01233 [Saisai Shao] Update the test to use new API to reduce the warning
    • [EXAMPLES] fix typo. · 93487674
      Makoto Fukuhara authored
      Author: Makoto Fukuhara <fukuo33@gmail.com>
      
      Closes #4724 from fukuo33/fix-typo and squashes the following commits:
      
      8c806b9 [Makoto Fukuhara] fix typo.
    • [SPARK-3885] Provide mechanism to remove accumulators once they are no longer used · 95cd643a
      Ilya Ganelin authored
      Instead of storing strong references to accumulators, I've replaced them with weak references and updated any code that uses these accumulators to check whether the reference resolves before using the accumulator. A weak reference is cleared as soon as no other copy of the variable exists, whereas with a soft reference accumulators would only be cleared when the GC is about to run out of memory.
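
      An illustrative Scala sketch of the idea, not Spark's exact code (all names are assumptions):

      ```
      import java.lang.ref.WeakReference
      import scala.collection.mutable

      // A registry of weak references: entries can be garbage-collected once
      // user code drops its last strong reference to the accumulator.
      object AccumulatorRegistry {
        private val originals = mutable.Map.empty[Long, WeakReference[AnyRef]]

        def register(id: Long, acc: AnyRef): Unit =
          originals(id) = new WeakReference(acc)

        // Returns None if the accumulator has already been collected.
        def lookup(id: Long): Option[AnyRef] =
          originals.get(id).flatMap(ref => Option(ref.get))
      }
      ```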
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #4021 from ilganeli/SPARK-3885 and squashes the following commits:
      
      4ba9575 [Ilya Ganelin]  Fixed error in test suite
      8510943 [Ilya Ganelin] Extra code
      bb76ef0 [Ilya Ganelin] File deleted somehow
      283a333 [Ilya Ganelin] Added cleanup method for accumulators to remove stale references within Accumulators.original to accumulators that are now out of scope
      345fd4f [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
      7485a82 [Ilya Ganelin] Fixed build error
      c8e0f2b [Ilya Ganelin] Added working test for accumulator garbage collection
      94ce754 [Ilya Ganelin] Still not being properly garbage collected
      8722b63 [Ilya Ganelin] Fixing gc test
      7414a9c [Ilya Ganelin] Added test for accumulator garbage collection
      18d62ec [Ilya Ganelin] Updated to throw Exception when accessing a GCd accumulator
      9a81928 [Ilya Ganelin] Reverting permissions changes
      28f705c [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
      b820ab4b [Ilya Ganelin] reset
      d78f4bf [Ilya Ganelin] Removed obsolete comment
      0746e61 [Ilya Ganelin] Updated DAGSchedulerSUite to fix bug
      3350852 [Ilya Ganelin] Updated DAGScheduler and Suite to correctly use new implementation of WeakRef Accumulator storage
      c49066a [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
      cbb9023 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
      a77d11b [Ilya Ganelin] Updated Accumulators class to store weak references instead of strong references to allow garbage collection of old accumulators
    • [SPARK-911] allow efficient queries for a range if RDD is partitioned with RangePartitioner · e4f9d03d
      Aaron Josephs authored
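
      A hedged usage sketch, assuming the method introduced here is `filterByRange` on a sorted pair RDD and that a SparkContext `sc` is available:

      ```
      val pairs = sc.parallelize((1 to 1000).map(i => (i, s"v$i")))
      val sorted = pairs.sortByKey()  // installs a RangePartitioner
      // Only the partitions that can contain keys in [100, 200] are scanned.
      val inRange = sorted.filterByRange(100, 200)
      println(inRange.count())
      ```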
      
      Author: Aaron Josephs <ajoseph4@binghamton.edu>
      
      Closes #1381 from aaronjosephs/PLAT-911 and squashes the following commits:
      
      e30ade5 [Aaron Josephs] [SPARK-911] allow efficient queries for a range if RDD is partitioned with RangePartitioner
  3. Feb 22, 2015
  4. Feb 21, 2015
    • [SPARK-5860][CORE] JdbcRDD: overflow on large range with high number of partitions · 7683982f
      Evan Yu authored
      Fix an overflow bug in JdbcRDD when calculating partitions for large BIGINT ids
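
      A hedged sketch of overflow-safe partition bounds; it mirrors the BigInt approach described in the commits below, but is not necessarily the exact merged code:

      ```
      def partitionBounds(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
        // BigInt avoids overflowing Long when upper - lower is near Long.MaxValue.
        val length = BigInt(1) + upper - lower
        (0 until numPartitions).map { i =>
          val start = lower + ((i * length) / numPartitions).toLong
          val end = lower + (((i + 1) * length) / numPartitions).toLong - 1
          (start, end)
        }
      }
      ```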
      
      Author: Evan Yu <ehotou@gmail.com>
      
      Closes #4701 from hotou/SPARK-5860 and squashes the following commits:
      
      9e038d1 [Evan Yu] [SPARK-5860][CORE] Prevent overflowing at the length level
      7883ad9 [Evan Yu] [SPARK-5860][CORE] Prevent overflowing at the length level
      c88755a [Evan Yu] [SPARK-5860][CORE] switch to BigInt instead of BigDecimal
      4e9ff4f [Evan Yu] [SPARK-5860][CORE] JdbcRDD overflow on large range with high number of partitions
    • [SPARK-5937][YARN] Fix ClientSuite to set YARN mode, so that the correct class is used in tests. · 7138816a
      Hari Shreedharan authored
      
      Without this, SparkHadoopUtil is used by the Client instead of YarnSparkHadoopUtil.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #4711 from harishreedharan/SPARK-5937 and squashes the following commits:
      
      d154de6 [Hari Shreedharan] Use System.clearProperty() instead of setting the value of SPARK_YARN_MODE to empty string.
      f729f70 [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
    • SPARK-5841 [CORE] [HOTFIX 2] Memory leak in DiskBlockManager · d3cbd38c
      Nishkam Ravi authored
      We continue to see IllegalStateException in YARN cluster mode. This adds a simple workaround for now.
      
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #4690 from nishkamravi2/master_nravi and squashes the following commits:
      
      d453197 [nishkamravi2] Update NewHadoopRDD.scala
      6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
      0ce2c32 [nishkamravi2] Update HadoopRDD.scala
      f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
      71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      494d8c0 [nishkamravi2] Update DiskBlockManager.scala
      3c5ddba [nishkamravi2] Update DiskBlockManager.scala
      f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
      79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
      535295a [nishkamravi2] Update TaskSetManager.scala
      3e1b616 [Nishkam Ravi] Modify test for maxResultSize
      9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
      5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
    • [MLlib] fix typo · e1553247
      Jacky Li authored
      fix typo: it should be "default:" instead of "default;"
      
      Author: Jacky Li <jackylk@users.noreply.github.com>
      
      Closes #4713 from jackylk/patch-10 and squashes the following commits:
      
      15daf2e [Jacky Li] [MLlib] fix typo
  5. Feb 20, 2015
    • [SPARK-5898] [SPARK-5896] [SQL] [PySpark] create DataFrame from pandas and tuple/list · 5b0a42cb
      Davies Liu authored
      Fix createDataFrame() from pandas DataFrame (not tested by Jenkins, depends on SPARK-5693).

      It also supports creating a DataFrame from a plain tuple/list without column names; `_1`, `_2`, etc. will be used as column names.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4679 from davies/pandas and squashes the following commits:
      
      c0cbe0b [Davies Liu] fix tests
      8466d1d [Davies Liu] fix create DataFrame from pandas
    • [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release · 4a17eedb
      Joseph K. Bradley authored
      For SPARK-5867:
      * The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
      * It should also include Python examples now.
      
      For SPARK-5892:
      * Fix Python docs
      * Various other cleanups
      
      BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch, which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
      
      CC: mengxr  (ML),  davies  (Python docs)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
      
      f191bb0 [Joseph K. Bradley] small cleanups
      e786efa [Joseph K. Bradley] small doc corrections
      6b1ab4a [Joseph K. Bradley] fixed python lint test
      946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
      da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
      629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
      b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
      34b067f [Joseph K. Bradley] small doc correction
      da16aef [Joseph K. Bradley] Fixed python mllib docs
      8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
      695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
      a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
      b05a80d [Joseph K. Bradley] organize imports. doc cleanups
      e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
    • SPARK-5744 [CORE] Take 2. RDD.isEmpty / take fails for (empty) RDD of Nothing · d3dfebeb
      Sean Owen authored
      Follow-on to https://github.com/apache/spark/pull/4591
      
      Document isEmpty / take / parallelize and their interaction with (an empty) RDD[Nothing] and RDD[Null]. Also, fix a marginally related minor issue with histogram() and EmptyRDD.
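
      A hedged sketch of the documented behavior (assuming a SparkContext `sc`):

      ```
      sc.emptyRDD[Int].isEmpty                 // true
      sc.parallelize(Seq.empty[Int]).take(1)   // Array()
      // Without an explicit element type, Seq() yields an RDD[Nothing], which
      // these operations cannot materialize into a concrete array.
      ```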
      
      CC rxin, since you reviewed the last one, although I imagine this is an uncontroversial resolution.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4698 from srowen/SPARK-5744.2 and squashes the following commits:
      
      9b2a811 [Sean Owen] 2 extra javadoc fixes
      d1b9fba [Sean Owen] Document isEmpty / take / parallelize and their interaction with (an empty) RDD[Nothing] and RDD[Null]. Also, fix a marginally related minor issue with histogram() and EmptyRDD.
    • [SPARK-5909][SQL] Add a clearCache command to Spark SQL's cache manager · 70bfb5c7
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-5909
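
      A hedged usage sketch (the table name is an assumption):

      ```
      sqlContext.cacheTable("logs")
      // ... run queries against the cached table ...
      sqlContext.clearCache()          // drop all cached tables at once
      sqlContext.sql("CLEAR CACHE")    // equivalent SQL command added by this patch
      ```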
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4694 from yhuai/clearCache and squashes the following commits:
      
      397ecc4 [Yin Huai] Address comments.
      a2702fc [Yin Huai] Update parser.
      3a54506 [Yin Huai] add isEmpty to CacheManager.
      6d14460 [Yin Huai] Python clearCache.
      f7b8dbd [Yin Huai] Add clear cache command.
  6. Feb 19, 2015
    • [SPARK-4808] Removing minimum number of elements read before spill check · 3be92cda
      mcheah authored
      In the general case, Spillable's heuristic of checking for memory stress
      on every 32nd item after 1000 items are read is good enough. We do not
      want to enact the spilling checks until later on in the job; checking
      for disk spilling too early can have an unacceptable performance impact
      in trivial cases.
      
      However, there are non-trivial cases, particularly if each serialized
      object is large, where checking for the necessity to spill too late
      would allow the memory to overflow. Consider if every item is 1.5 MB in
      size, and the heap size is 1000 MB. Then clearly if we only try to spill
      the in-memory contents to disk after 1000 items are read, we would have
      already accumulated 1500 MB of RAM and overflowed the heap.
      
      Patch #3656 attempted to circumvent this by checking the need to spill
      on every single item read, but that would cause unacceptable performance
      in the general case. However, the convoluted cases above should not be
      forced to be refactored to shrink the data items. Therefore it makes
      sense that the memory spilling thresholds be configurable.
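
      An illustrative sketch of the heuristic with the fixed minimum removed (names like `spillCheckInterval` are assumptions, not actual Spark configuration keys):

      ```
      // elementsRead counts items consumed since the last spill.
      var elementsRead = 0L
      val spillCheckInterval = 32  // check memory usage every N elements
      def maybeSpill(currentMemory: Long, memoryLimit: Long)(spill: => Unit): Unit = {
        elementsRead += 1
        // No fixed 1000-element minimum: the threshold is configurable instead.
        if (elementsRead % spillCheckInterval == 0 && currentMemory >= memoryLimit) {
          spill
        }
      }
      ```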
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits:
      
      6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check
    • [SPARK-5900][MLLIB] make PIC and FPGrowth Java-friendly · 0cfd2ceb
      Xiangrui Meng authored
      In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Int)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`.
      
      Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`, it doesn't really work, because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java.
      
      I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions.
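
      A hedged usage sketch of the wrappers (the model variables are assumptions; the fields follow the description above):

      ```
      // PowerIterationClustering: assignments are now RDD[Assignment]
      picModel.assignments.collect().foreach { a =>
        println(s"point ${a.id} -> cluster ${a.cluster}")
      }

      // FPGrowth: frequent itemsets are now RDD[FreqItemset]
      fpModel.freqItemsets.collect().foreach { fi =>
        println(fi.items.mkString("[", ",", "]") + ", freq=" + fi.freq)
      }
      ```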
      
      CC: jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4695 from mengxr/SPARK-5900 and squashes the following commits:
      
      865b5ca [Xiangrui Meng] make Assignment serializable
      cffa96e [Xiangrui Meng] fix test
      9c0e590 [Xiangrui Meng] remove unused Tuple2
      1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
    • SPARK-5570: No docs stating that `new SparkConf().set("spark.driver.memory", ...)` will not work · 6bddc403
      Ilya Ganelin authored
      I've updated the documentation to reflect the true behavior of this setting in client vs. cluster mode.
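
      A hedged sketch of the pitfall being documented:

      ```
      import org.apache.spark.{SparkConf, SparkContext}

      // In client mode the driver JVM is already running when this executes,
      // so the setting below cannot take effect.
      val conf = new SparkConf().set("spark.driver.memory", "4g")
      val sc = new SparkContext(conf)

      // Set it before the driver starts instead, e.g.:
      //   spark-submit --driver-memory 4g ...
      // or in conf/spark-defaults.conf: spark.driver.memory 4g
      ```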
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #4665 from ilganeli/SPARK-5570 and squashes the following commits:
      
      5d1c8dd [Ilya Ganelin] Added example configuration code
      a51700a [Ilya Ganelin] Getting rid of extra spaces
      85f7a08 [Ilya Ganelin] Reworded note
      5889d43 [Ilya Ganelin] Formatting adjustment
      f149ba1 [Ilya Ganelin] Minor updates
      1fec7a5 [Ilya Ganelin] Updated to add clarification for other driver properties
      db47595 [Ilya Ganelin] Slight formatting update
      c899564 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5570
      17b751d [Ilya Ganelin] Updated documentation for driver-memory to reflect its true behavior in client vs cluster mode
    • SPARK-4682 [CORE] Consolidate various 'Clock' classes · 34b7c353
      Sean Owen authored
      Another one from JoshRosen's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, and is necessary for consolidating the streaming one. I put the implementations together in the way that seemed simplest. Almost all of the change is standardizing class and method names.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4514 from srowen/SPARK-4682 and squashes the following commits:
      
      5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
      169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
      277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
      b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
      160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
      7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
    • [Spark-5889] Remove pid file after stopping service. · ad6b169d
      Zhan Zhang authored
      Currently the pid file is not deleted, which may potentially cause problems after the service is stopped. This fix removes the pid file after the service has stopped.
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      
      Closes #4676 from zhzhan/spark-5889 and squashes the following commits:
      
      eb01be1 [Zhan Zhang] solve review comments
      b4c009e [Zhan Zhang] solve review comments
      018110a [Zhan Zhang] spark-5889: remove pid file after stopping service
      088d2a2 [Zhan Zhang] squash all commits
      c1f1fa5 [Zhan Zhang] test
    • [SPARK-5902] [ml] Made PipelineStage.transformSchema public instead of private to ml · a5fed343
      Joseph K. Bradley authored
      For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema public instead of private to ml. This would be nice to include in Spark 1.3.
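
      A minimal hedged sketch of what a custom stage's transformSchema might do (the column names are hypothetical):

      ```
      import org.apache.spark.sql.types._

      // Validate that the input column exists, then append the output column.
      def transformSchema(schema: StructType): StructType = {
        require(schema.fieldNames.contains("features"), "missing column: features")
        StructType(schema.fields :+ StructField("prediction", DoubleType, nullable = false))
      }
      ```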
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4682 from jkbradley/SPARK-5902 and squashes the following commits:
      
      6f02357 [Joseph K. Bradley] Made transformSchema public
      0e6d0a0 [Joseph K. Bradley] made implementations of transformSchema protected as well
      fdaf26a [Joseph K. Bradley] Made PipelineStage.transformSchema protected instead of private[ml]