  1. Aug 14, 2014
    • Aaron Davidson's avatar
      [SPARK-3029] Disable local execution of Spark jobs by default · d069c5d9
      Aaron Davidson authored
Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst-case scenarios occur when the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is simply large and we apply a filter with high selectivity or computational overhead.
      
Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or reason about.
      
      This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
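For anyone who wants the old behavior back, a sketch of re-enabling it in spark-defaults.conf (assuming the flag introduced by SPARK-3029 is named `spark.localExecution.enabled`):
```
spark.localExecution.enabled true
```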
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1321 from aarondav/allowlocal and squashes the following commits:
      
      136b253 [Aaron Davidson] Fix DAGSchedulerSuite
      5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
      d069c5d9
    • Andrew Or's avatar
      [Docs] Add missing <code> tags (minor) · e4245656
      Andrew Or authored
These configs looked inconsistent with the rest.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1936 from andrewor14/docs-code and squashes the following commits:
      
      15f578a [Andrew Or] Add <code> tag
      e4245656
  2. Aug 13, 2014
    • Kousuke Saruta's avatar
[SPARK-2963] [SQL] There is no documentation about building to use HiveServer and CLI for SparkSQL · 869f06c7
      Kousuke Saruta authored
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #1885 from sarutak/SPARK-2963 and squashes the following commits:
      
ed53329 [Kousuke Saruta] Modified description and notation of proper noun
      07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
      6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
      c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
      869f06c7
    • Reynold Xin's avatar
      [SPARK-2953] Allow using short names for io compression codecs · 676f9828
      Reynold Xin authored
      Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
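With this change, spark-defaults.conf can use the short form, e.g.:
```
spark.io.compression.codec lz4
```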
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:
      
      9f50962 [Reynold Xin] Specify short-form compression codec names first.
      63f78ee [Reynold Xin] Updated configuration documentation.
      47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
      676f9828
  3. Aug 12, 2014
    • Ameet Talwalkar's avatar
      SPARK-2830 [MLlib]: re-organize mllib documentation · c235b83e
      Ameet Talwalkar authored
      As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
      
      Author: Ameet Talwalkar <atalwalkar@gmail.com>
      
      Closes #1908 from atalwalkar/master and squashes the following commits:
      
      fe6938a [Ameet Talwalkar] made xiangruis suggested changes
      840028b [Ameet Talwalkar] made xiangruis suggested changes
      7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
      c235b83e
  4. Aug 09, 2014
    • li-zhihui's avatar
      [SPARK-2635] Fix race condition at SchedulerBackend.isReady in standalone mode · 28dbae85
      li-zhihui authored
In SPARK-1946 (PR #900), the configuration <code>spark.scheduler.minRegisteredExecutorsRatio</code> was introduced. However, in standalone mode there is a race condition where isReady() can return true prematurely, because totalExpectedExecutors has not yet been correctly set.
      
Because the number of expected executors is uncertain in standalone mode, this PR tries to use CPU cores (<code>--total-executor-cores</code>) as the expected resources to judge whether the SchedulerBackend is ready.
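For illustration, a hedged sketch of supplying the expected cores at submission time (the master host, class name, and jar are placeholders):
```
./bin/spark-submit \
  --master spark://master-host:7077 \
  --total-executor-cores 8 \
  --class com.example.MyApp myapp.jar
```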
      
      Author: li-zhihui <zhihui.li@intel.com>
      Author: Li Zhihui <zhihui.li@intel.com>
      
      Closes #1525 from li-zhihui/fixre4s and squashes the following commits:
      
      e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
      abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
      ca54bd9 [li-zhihui] Format log with String interpolation
      88c7dc6 [li-zhihui] Few codes and docs refactor
      41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode
      28dbae85
  5. Aug 07, 2014
    • Matei Zaharia's avatar
      SPARK-2787: Make sort-based shuffle write files directly when there's no... · 6906b69c
      Matei Zaharia authored
      SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small
      
      As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.
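A sketch of the related spark-defaults.conf settings (the `spark.shuffle.sort.bypassMergeThreshold` name is an assumption based on this patch; `sort` as a short name for the shuffle manager follows commit ca3efd9 in this PR):
```
spark.shuffle.manager sort
spark.shuffle.sort.bypassMergeThreshold 200
```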
      
      On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:
      
      88cf26a [Matei Zaharia] Fix rebase
      10233af [Matei Zaharia] Review comments
      398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
      ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
      d0ae3c5 [Matei Zaharia] Fix some comments
      90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
      31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
      6906b69c
  6. Aug 06, 2014
    • Andrew Or's avatar
      [SPARK-2157] Enable tight firewall rules for Spark · 09f7e458
      Andrew Or authored
      The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.
      
The list covered here may or may not be the complete set of ports needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.
      
      My spark-env.sh looks like this:
      ```
      export SPARK_MASTER_PORT=6060
      export SPARK_WORKER_PORT=7070
      export SPARK_MASTER_WEBUI_PORT=9090
      export SPARK_WORKER_WEBUI_PORT=9091
      ```
      and my spark-defaults.conf looks like this:
      ```
      spark.master spark://andrews-mbp:6060
      spark.driver.port 5001
      spark.fileserver.port 5011
      spark.broadcast.port 5021
      spark.replClassServer.port 5031
      spark.blockManager.port 5041
      spark.executor.port 5051
      ```
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1777 from andrewor14/configure-ports and squashes the following commits:
      
      621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      8a6b820 [Andrew Or] Use a random UI port during tests
      7da0493 [Andrew Or] Fix tests
      523c30e [Andrew Or] Add test for isBindCollision
      b97b02a [Andrew Or] Minor fixes
      c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      93d359f [Andrew Or] Executors connect to wrong port when collision occurs
      d502e5f [Andrew Or] Handle port collisions when creating Akka systems
      a2dd05c [Andrew Or] Patrick's comment nit
      86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
      1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
      cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
      e837cde [Andrew Or] Remove outdated TODOs
      bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      de1b207 [Andrew Or] Update docs to reflect new ports
      b565079 [Andrew Or] Add spark.ports.maxRetries
      2551eb2 [Andrew Or] Remove spark.worker.watcher.port
      151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      9868358 [Andrew Or] Add a few miscellaneous ports
      6016e77 [Andrew Or] Add spark.executor.port
      8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
      4d9e6f3 [Andrew Or] Fix super subtle bug
      3f8e51b [Andrew Or] Correct erroneous docs...
      e111d08 [Andrew Or] Add names for UI services
      470f38c [Andrew Or] Special case non-"Address already in use" exceptions
      1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
      ba32280 [Andrew Or] Minor fixes
      6b550b0 [Andrew Or] Assorted fixes
      73fbe89 [Andrew Or] Move start service logic to Utils
      ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
      038a579 [Andrew Ash] Trust the server start function to report the port the service started on
      7c5bdc4 [Andrew Ash] Fix style issue
      0347aef [Andrew Ash] Unify port fallback logic to a single place
      24a4c32 [Andrew Ash] Remove type on val to match surrounding style
      9e4ad96 [Andrew Ash] Reformat for style checker
      5d84e0e [Andrew Ash] Document new port configuration options
      066dc7a [Andrew Ash] Fix up HttpServer port increments
      cad16da [Andrew Ash] Add fallover increment logic for HttpServer
      c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
      b80d2fd [Andrew Ash] Make Spark's block manager port configurable
      17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
      f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
      49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
      1c0981a [Andrew Ash] Make port in HttpServer configurable
      09f7e458
  7. Aug 05, 2014
    • Reynold Xin's avatar
      [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB. · acff9a7f
      Reynold Xin authored
      This can substantially reduce memory usage during shuffle.
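Users who prefer the previous behavior can restore the old 100KB buffer in spark-defaults.conf:
```
spark.shuffle.file.buffer.kb 100
```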
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:
      
      104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
      acff9a7f
    • Thomas Graves's avatar
      SPARK-1680: use configs for specifying environment variables on YARN · 41e0a21b
      Thomas Graves authored
Note that this also documents spark.executorEnv.*, which to me means it's public. If we don't want that, please speak up.
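For example, executor environment variables could be set in spark-defaults.conf like this (the variable names and values are placeholders):
```
spark.executorEnv.JAVA_HOME /usr/lib/jvm/java-7-oracle
spark.executorEnv.PYTHONPATH /opt/extra/libs
```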
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:
      
      11525df [Thomas Graves] more doc changes
      553bad0 [Thomas Graves] fix documentation
      152bf7c [Thomas Graves] fix docs
      5382326 [Thomas Graves] try fix docs
      32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN
      41e0a21b
    • Patrick Wendell's avatar
      SPARK-2380: Support displaying accumulator values in the web UI · 74f82c71
      Patrick Wendell authored
This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that can display in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets updated as tasks finish.
      
Currently in Spark, developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDDs. This provides a potentially nicer alternative to going through the existing accumulator framework (actually `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demos how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful, e.g., to decide to short-circuit a Spark stage depending on how things are going.
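A minimal Scala sketch of a named accumulator as added by this patch (the counter name and RDD are illustrative):
```
// Named accumulators show up in the web UI under their given name
val recordCount = sc.accumulator(0, "records processed")
rdd.foreach { _ => recordCount += 1 }
```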
      
      ![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png)
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1309 from pwendell/metrics and squashes the following commits:
      
      8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD
      93fbe0f [Patrick Wendell] Other minor fixes
      cc43f68 [Patrick Wendell] Updating unit tests
      c991b1b [Patrick Wendell] Moving some code into the Accumulators class
      9a9ba3c [Patrick Wendell] More merge fixes
      c5ace9e [Patrick Wendell] More merge conflicts
      1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics
      9860c55 [Patrick Wendell] Potential solution to posting listener events
      0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined
      0ec4ac7 [Patrick Wendell] Java API's
      e95bf69 [Patrick Wendell] Stash
      be97261 [Patrick Wendell] Style fix
      8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class
      64d405f [Patrick Wendell] Adding missing file
      5d8b156 [Patrick Wendell] Changes based on Kay's review.
      9f18bad [Patrick Wendell] Minor style changes and tests
      7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback
      ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics.
      0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
      74f82c71
    • Guancheng (G.C.) Chen's avatar
      [SPARK-2859] Update url of Kryo project in related docs · ac3440f4
      Guancheng (G.C.) Chen authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-2859
      
      Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md.
      
      Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
      
      Closes #1782 from gchen/kryo-docs and squashes the following commits:
      
      b62543c [Guancheng (G.C.) Chen] update url of Kryo project
      ac3440f4
    • Thomas Graves's avatar
      SPARK-1890 and SPARK-1891- add admin and modify acls · 1c5555a2
      Thomas Graves authored
It was easier to combine these 2 JIRAs since they touch many of the same places. This PR adds the following:
      
      - adds modify acls
      - adds admin acls (list of admins/users that get added to both view and modify acls)
      - modify Kill button on UI to take modify acls into account
- changes the config name from spark.ui.acls.enable to spark.acls.enable, since I chose poorly with the original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web UI as well as any CLI interfaces.
      - send view and modify acls information on to YARN so that YARN interfaces can use (yarn cli for killing applications for example).
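A hedged sketch of these settings in spark-defaults.conf (user names are placeholders):
```
spark.acls.enable true
spark.admin.acls alice,bob
spark.modify.acls carol
spark.ui.view.acls dave,erin
```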
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:
      
      8292eb1 [Thomas Graves] review comments
      b92ec89 [Thomas Graves] remove unneeded variable from applistener
      4c765f4 [Thomas Graves] Add in admin acls
      72eb0ac [Thomas Graves] Add modify acls
      1c5555a2
    • Thomas Graves's avatar
      SPARK-1528 - spark on yarn, add support for accessing remote HDFS · 2c0f705e
      Thomas Graves authored
Add a config (spark.yarn.access.namenodes) to allow applications running on YARN to access other secure HDFS clusters. Users just specify the namenodes of the other clusters, and we get tokens for those and ship them with the Spark application.
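For example, in spark-defaults.conf (host names are placeholders):
```
spark.yarn.access.namenodes hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020
```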
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1159 from tgravescs/spark-1528 and squashes the following commits:
      
      ddbcd16 [Thomas Graves] review comments
      0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS
      2c0f705e
    • Reynold Xin's avatar
      [SPARK-2856] Decrease initial buffer size for Kryo to 64KB. · 184048f8
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1780 from rxin/kryo-init-size and squashes the following commits:
      
      551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
      184048f8
    • Andrew Or's avatar
      [SPARK-2857] Correct properties to set Master / Worker ports · a646a365
      Andrew Or authored
      `master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values.
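Until the docs are corrected, the standalone environment variables remain a working way to set these ports, e.g. in spark-env.sh:
```
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_WEBUI_PORT=9091
```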
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1779 from andrewor14/master-worker-port and squashes the following commits:
      
      8475e95 [Andrew Or] Update docs to reflect changes in configs
      4db3d5d [Andrew Or] Stop using configs that don't actually work
      a646a365
  8. Aug 04, 2014
    • Matei Zaharia's avatar
      SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter · 8e7d5ba1
      Matei Zaharia authored
All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in ExternalAppendOnlyMap: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.
      
      In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1722 from mateiz/spark-2792 and squashes the following commits:
      
      5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
      18fe865 [Matei Zaharia] Update docs on objectStreamReset
      576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
      0374217 [Matei Zaharia] Remove super paranoid code to close file handles
      bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
      0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
      9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
      8e7d5ba1
  9. Aug 03, 2014
    • Michael Armbrust's avatar
      [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect' · 236dfac6
      Michael Armbrust authored
Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read hive tables.  In this PR I introduce a new configuration option `spark.sql.dialect` that picks which dialect will be used for parsing.  For SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
      
      The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
      
      **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
      
For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
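A brief Scala sketch of switching dialects explicitly (the table name is a placeholder; `setConf` is the SQLContext settings mechanism assumed here):
```
// Parse the query with the HiveQL dialect explicitly
hiveContext.setConf("spark.sql.dialect", "hiveql")
hiveContext.sql("SELECT key FROM src")
```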
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
      
      ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
      20c43f8 [Michael Armbrust] override function instead of just setting the value
      7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
      236dfac6
    • Stephen Boesch's avatar
      SPARK-2712 - Add a small note to maven doc that mvn package must happen ... · f8cd143b
      Stephen Boesch authored
      Per request by Reynold adding small note about proper sequencing of build then test.
      
      Author: Stephen Boesch <javadba@gmail.com>
      
      Closes #1615 from javadba/docs and squashes the following commits:
      
      6c3183e [Stephen Boesch] Moved updated testing blurb per PWendell
      5764757 [Stephen Boesch] SPARK-2712 - Add a small note to maven doc that mvn package must happen before test
      f8cd143b
  10. Aug 02, 2014
    • Michael Armbrust's avatar
      [SPARK-2739][SQL] Rename registerAsTable to registerTempTable · 1a804373
      Michael Armbrust authored
      There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.
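A short Scala sketch of the rename (table names are placeholders):
```
val people = sqlContext.sql("SELECT * FROM some_source")
people.registerTempTable("people")   // previously: people.registerAsTable("people")
```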
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
      
      d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      4dff086 [Michael Armbrust] Fix .java files too
      89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
      1a804373
    • Chris Fregly's avatar
      [SPARK-1981] Add AWS Kinesis streaming support · 91f9504e
      Chris Fregly authored
      Author: Chris Fregly <chris@fregly.com>
      
      Closes #1434 from cfregly/master and squashes the following commits:
      
      4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
      0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
      691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
      0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
      e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
      d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a publicly-available class
      db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
      338997e [Chris Fregly] improve build docs for kinesis
      828f8ae [Chris Fregly] more cleanup
      e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      cd68c0d [Chris Fregly] fixed typos and backward compatibility
      d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
      b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
      91f9504e
  11. Aug 01, 2014
    • CrazyJvm's avatar
      [SQL] Documentation: Explain cacheTable command · c82fe478
      CrazyJvm authored
      add the `cacheTable` specification
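A minimal Scala sketch of the command being documented (the table name is a placeholder):
```
sqlContext.cacheTable("people")    // materialize the table in memory
sqlContext.uncacheTable("people")  // release the cached data
```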
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1681 from CrazyJvm/sql-programming-guide-cache and squashes the following commits:
      
      0a231e0 [CrazyJvm] grammar fixes
      a04020e [CrazyJvm] modify title to Cached tables
      18b6594 [CrazyJvm] fix format
      2cbbf58 [CrazyJvm] add cacheTable guide
      c82fe478
    • Sandy Ryza's avatar
      SPARK-2099. Report progress while task is running. · 8d338f64
      Sandy Ryza authored
      This is a sketch of a patch that allows the UI to show metrics for tasks that have not yet completed.  It adds a heartbeat every 2 seconds from the executors to the driver, reporting metrics for all of the executor's tasks.
      
      It still needs unit tests, polish, and cluster testing, but I wanted to put it up to get feedback on the approach.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1056 from sryza/sandy-spark-2099 and squashes the following commits:
      
      93b9fdb [Sandy Ryza] Up heartbeat interval to 10 seconds and other tidying
      132aec7 [Sandy Ryza] Heartbeat and HeartbeatResponse are already Serializable as case classes
      38dffde [Sandy Ryza] Additional review feedback and restore test that was removed in BlockManagerSuite
      51fa396 [Sandy Ryza] Remove hostname race, add better comments about threading, and some stylistic improvements
      3084f10 [Sandy Ryza] Make TaskUIData a case class again
      3bda974 [Sandy Ryza] Stylistic fixes
      0dae734 [Sandy Ryza] SPARK-2099. Report progress while task is running.
      8d338f64
  12. Jul 31, 2014
    • kballou's avatar
      Docs: monitoring, streaming programming guide · cc820502
      kballou authored
      Fix several awkward wordings and grammatical issues in the following
      documents:
      
      *   docs/monitoring.md
      
      *   docs/streaming-programming-guide.md
      
      Author: kballou <kballou@devnulllabs.io>
      
      Closes #1662 from kennyballou/grammar_fixes and squashes the following commits:
      
      e1b8ad6 [kballou] Docs: monitoring, streaming programming guide
      cc820502
    • Michael Armbrust's avatar
      [SPARK-2397][SQL] Deprecate LocalHiveContext · 72cfb139
      Michael Armbrust authored
      LocalHiveContext is redundant with HiveContext.  The only difference is it creates `./metastore` instead of `./metastore_db`.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1641 from marmbrus/localHiveContext and squashes the following commits:
      
      e5ec497 [Michael Armbrust] Add deprecation version
      626e056 [Michael Armbrust] Don't remove from imports yet
      905cc5f [Michael Armbrust] Merge remote-tracking branch 'apache/master' into localHiveContext
      1c2727e [Michael Armbrust] Deprecate LocalHiveContext
      72cfb139
    • CrazyJvm's avatar
      automatically set master according to `spark.master` in `spark-defaults.... · 669e3f05
      CrazyJvm authored
      automatically set master according to `spark.master` in `spark-defaults.conf`
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #1644 from CrazyJvm/standalone-guide and squashes the following commits:
      
      bb12b95 [CrazyJvm] automatically set master according to `spark.master` in `spark-defaults.conf`
      669e3f05
  13. Jul 30, 2014
    • Kan Zhang's avatar
      [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
* No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
d998ad6 [Kan Zhang] [SPARK-2024] refactoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
      94d1f46f
    • Koert Kuipers's avatar
      SPARK-2543: Allow user to set maximum Kryo buffer size · 7c5fc28a
      Koert Kuipers authored
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #735 from koertkuipers/feat-kryo-max-buffersize and squashes the following commits:
      
      15f6d81 [Koert Kuipers] change default for spark.kryoserializer.buffer.max.mb to 64mb and add some documentation
      1bcc22c [Koert Kuipers] Merge branch 'master' into feat-kryo-max-buffersize
      0c9f8eb [Koert Kuipers] make default for kryo max buffer size 16MB
      143ec4d [Koert Kuipers] test resizable buffer in kryo Output
      0732445 [Koert Kuipers] support setting maxCapacity to something different than capacity in kryo Output
      7c5fc28a
  14. Jul 28, 2014
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix) · a7a9d144
      Cheng Lian authored
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, and tests fail because related .class files are not included in the assembly jar.
      
      In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:
      
      629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
      ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
      a7a9d144
  15. Jul 27, 2014
    • Patrick Wendell's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · e5bbce9a
      Patrick Wendell authored
      This reverts commit f6ff2a61.
      e5bbce9a
    • Andrew Or's avatar
      [SPARK-1777] Prevent OOMs from single partitions · ecf30ee7
      Andrew Or authored
      **Problem.** When caching, we currently unroll the entire RDD partition before making sure we have enough free memory. This is a common cause for OOMs especially when (1) the BlockManager has little free space left in memory, and (2) the partition is large.
      
      **Solution.** We maintain a global memory pool of `M` bytes shared across all threads, similar to the way we currently manage memory for shuffle aggregation. Then, while we unroll each partition, periodically check if there is enough space to continue. If not, drop enough RDD blocks to ensure we have at least `M` bytes to work with, then try again. If we still don't have enough space to unroll the partition, give up and drop the block to disk directly if applicable.
      
      **New configurations.**
      - `spark.storage.bufferFraction` - the value of `M` as a fraction of the storage memory. (default: 0.2)
- `spark.storage.safetyFraction` - a margin of safety in case size estimation is slightly off. This is the equivalent of the existing `spark.shuffle.safetyFraction`. (default: 0.9)
      
      For more detail, see the [design document](https://issues.apache.org/jira/secure/attachment/12651793/spark-1777-design-doc.pdf). Tests pending for performance and memory usage patterns.
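The unroll-with-periodic-checks loop described above can be sketched as follows (hypothetical names and sizes; the real logic lives in Spark's `MemoryStore`):

```python
import sys

def unroll_safely(values, initial_request, check_period, request_memory):
    """Materialize an iterator while periodically checking memory.

    `request_memory(n)` asks the shared pool for n more bytes (dropping
    cached blocks if needed) and returns True on success. Returns the
    unrolled list, or None if memory ran out partway through, in which
    case the caller may drop the block to disk instead.
    """
    if not request_memory(initial_request):
        return None
    reserved = initial_request
    unrolled = []
    used = 0
    for i, v in enumerate(values):
        unrolled.append(v)
        used += sys.getsizeof(v)
        # Check only every `check_period` elements to keep overhead low.
        if i % check_period == 0 and used > reserved:
            extra = used * 2 - reserved   # request with some headroom
            if not request_memory(extra):
                return None
            reserved += extra
    return unrolled
```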
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1165 from andrewor14/them-rdd-memories and squashes the following commits:
      
      e77f451 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      c7c8832 [Andrew Or] Simplify logic + update a few comments
      269d07b [Andrew Or] Very minor changes to tests
      6645a8a [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b7e165c [Andrew Or] Add new tests for unrolling blocks
      f12916d [Andrew Or] Slightly clean up tests
      71672a7 [Andrew Or] Update unrollSafely tests
      369ad07 [Andrew Or] Correct ensureFreeSpace and requestMemory behavior
      f4d035c [Andrew Or] Allow one thread to unroll multiple blocks
      a66fbd2 [Andrew Or] Rename a few things + update comments
      68730b3 [Andrew Or] Fix weird scalatest behavior
      e40c60d [Andrew Or] Fix MIMA excludes
      ff77aa1 [Andrew Or] Fix tests
      1a43c06 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      b9a6eee [Andrew Or] Simplify locking behavior on unrollMemoryMap
      ed6cda4 [Andrew Or] Formatting fix (super minor)
      f9ff82e [Andrew Or] putValues -> putIterator + putArray
      beb368f [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8448c9b [Andrew Or] Fix tests
      a49ba4d [Andrew Or] Do not expose unroll memory check period
      69bc0a5 [Andrew Or] Always synchronize on putLock before unrollMemoryMap
      3f5a083 [Andrew Or] Simplify signature of ensureFreeSpace
      dce55c8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      8288228 [Andrew Or] Synchronize put and unroll properly
      4f18a3d [Andrew Or] bufferFraction -> unrollFraction
      28edfa3 [Andrew Or] Update a few comments / log messages
      728323b [Andrew Or] Do not synchronize every 1000 elements
      5ab2329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      129c441 [Andrew Or] Fix bug: Use toArray rather than array
      9a65245 [Andrew Or] Update a few comments + minor control flow changes
      57f8d85 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      abeae4f [Andrew Or] Add comment clarifying the MEMORY_AND_DISK case
      3dd96aa [Andrew Or] AppendOnlyBuffer -> Vector (+ a few small changes)
      f920531 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      0871835 [Andrew Or] Add an effective storage level interface to BlockManager
      64e7d4c [Andrew Or] Add/modify a few comments (minor)
      8af2f35 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      4f4834e [Andrew Or] Use original storage level for blocks dropped to disk
      ecc8c2d [Andrew Or] Fix binary incompatibility
      24185ea [Andrew Or] Avoid dropping a block back to disk if reading from disk
      2b7ee66 [Andrew Or] Fix bug in SizeTracking*
      9b9a273 [Andrew Or] Fix tests
      20eb3e5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      649bdb3 [Andrew Or] Document spark.storage.bufferFraction
      a10b0e7 [Andrew Or] Add initial memory request threshold + rename a few things
      e9c3cb0 [Andrew Or] cacheMemoryMap -> unrollMemoryMap
      198e374 [Andrew Or] Unfold -> unroll
      0d50155 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d9d02a8 [Andrew Or] Remove unused param in unfoldSafely
      ec728d8 [Andrew Or] Add tests for safe unfolding of blocks
      22b2209 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      078eb83 [Andrew Or] Add check for hasNext in PrimitiveVector.iterator
      0871535 [Andrew Or] Fix tests in BlockManagerSuite
      d68f31e [Andrew Or] Safely unfold blocks for all memory puts
      5961f50 [Andrew Or] Fix tests
      195abd7 [Andrew Or] Refactor: move unfold logic to MemoryStore
      1e82d00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      3ce413e [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      d5dd3b4 [Andrew Or] Free buffer memory in finally
      ea02eec [Andrew Or] Fix tests
      b8e1d9c [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      a8704c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      e1b8b25 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      87aa75c [Andrew Or] Fix mima excludes again (typo)
      11eb921 [Andrew Or] Clarify comment (minor)
      50cae44 [Andrew Or] Remove now duplicate mima exclude
      7de5ef9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      df47265 [Andrew Or] Fix binary incompatibility
      6d05a81 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
      f94f5af [Andrew Or] Update a few comments (minor)
      776aec9 [Andrew Or] Prevent OOM if a single RDD partition is too large
      bbd3eea [Andrew Or] Fix CacheManagerSuite to use Array
      97ea499 [Andrew Or] Change BlockManager interface to use Arrays
      c12f093 [Andrew Or] Add SizeTrackingAppendOnlyBuffer and tests
      ecf30ee7
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · f6ff2a61
      Cheng Lian authored
      (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
      
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1600 from liancheng/jdbc and squashes the following commits:
      
      ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      f6ff2a61
    • Matei Zaharia's avatar
      SPARK-2680: Lower spark.shuffle.memoryFraction to 0.2 by default · b547f69b
      Matei Zaharia authored
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1593 from mateiz/spark-2680 and squashes the following commits:
      
      3c949c4 [Matei Zaharia] Lower spark.shuffle.memoryFraction to 0.2 by default
      b547f69b
  16. Jul 26, 2014
    • Hossein's avatar
      [SPARK-2696] Reduce default value of spark.serializer.objectStreamReset · 66f26a46
      Hossein authored
      The current default value of spark.serializer.objectStreamReset is 10,000.
When re-partitioning a large file (e.g., 500MB, containing 1MB records) into, say, 64 partitions, the serializer will cache 10,000 x 1MB x 64 ~= 640 GB of object references, which causes out-of-memory errors.
      
      This patch sets the default value to a more reasonable default value (100).
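The worst case follows from how Java's `ObjectOutputStream` keeps a back-reference to every object written since the last `reset()`. A small sketch of the arithmetic (hypothetical helper; the figures are the ones quoted above):

```python
def worst_case_cached_bytes(reset_interval, record_bytes, concurrent_streams):
    """Upper bound on memory pinned by serializer back-references:
    each stream may retain up to `reset_interval` records before a reset."""
    return reset_interval * record_bytes * concurrent_streams

MB = 1024 ** 2
# Old default: reset every 10,000 records, 1 MB records, 64 partitions.
old = worst_case_cached_bytes(10_000, 1 * MB, 64)   # ~640 GB, as above
# New default of 100 shrinks the bound by two orders of magnitude.
new = worst_case_cached_bytes(100, 1 * MB, 64)
```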
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #1595 from falaki/objectStreamReset and squashes the following commits:
      
      650a935 [Hossein] Updated documentation
      1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset
      66f26a46
  17. Jul 25, 2014
    • Michael Armbrust's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Michael Armbrust authored
      This reverts commit 06dc0d2c.
      
#1399 is making Jenkins fail. We should investigate and put this back once it is passing tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
      afd757a2
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      06dc0d2c
    • Davies Liu's avatar
      [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Davies Liu authored
During aggregation in the Python worker, if memory usage rises above spark.executor.memory, it will fall back to disk-spilling aggregation.

It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps the partitions to disk. After all the data are aggregated, it merges all the stages together (partition by partition).
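The stage/partition scheme can be sketched as below (a simplified toy, not the actual `ExternalMerger` in PySpark's `shuffle.py`; the memory limit here counts map entries rather than bytes):

```python
import os
import pickle
import tempfile
from collections import defaultdict

def spill_aggregate(items, combine, num_spill_parts=4, memory_limit=1000):
    """Hash-partitioned spilling aggregation of (key, value) pairs.

    Aggregates in memory; when the in-memory map exceeds `memory_limit`
    entries, partitions it by hash(key) and appends each partition to its
    own spill file. Finally merges the spill files one partition at a time,
    so only one partition's keys are ever in memory during the merge.
    """
    spill_dir = tempfile.mkdtemp()
    current = {}
    spilled = False

    def spill():
        parts = defaultdict(dict)
        for k, v in current.items():
            parts[hash(k) % num_spill_parts][k] = v
        for p, data in parts.items():
            with open(os.path.join(spill_dir, str(p)), "ab") as f:
                pickle.dump(data, f)
        current.clear()

    for k, v in items:
        current[k] = combine(current[k], v) if k in current else v
        if len(current) > memory_limit:
            spill()
            spilled = True

    if not spilled:
        return dict(current)
    spill()  # flush whatever remains in memory
    merged = {}
    for p in range(num_spill_parts):     # merge partition by partition
        path = os.path.join(spill_dir, str(p))
        if not os.path.exists(path):
            continue
        part = {}
        with open(path, "rb") as f:
            while True:
                try:
                    data = pickle.load(f)
                except EOFError:
                    break
                for k, v in data.items():
                    part[k] = combine(part[k], v) if k in part else v
        merged.update(part)
    return merged
```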
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
  18. Jul 24, 2014
    • Sandy Ryza's avatar
      SPARK-2310. Support arbitrary Spark properties on the command line with ... · e34922a2
      Sandy Ryza authored
      ...spark-submit
      
      The PR allows invocations like
        spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar
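The flag syntax shown here is the original proposal; per commit 91b244a below, the final format became `--conf PROP=VALUE`. A minimal parser sketch of that format (hypothetical, not Spark's actual `SparkSubmitArguments`):

```python
def parse_conf_args(argv):
    """Collect --conf PROP=VALUE pairs into a properties dict,
    passing all other arguments through untouched."""
    props, rest = {}, []
    it = iter(argv)
    for arg in it:
        if arg == "--conf":
            pair = next(it)
            key, sep, value = pair.partition("=")
            if not sep:
                raise ValueError(f"--conf expects PROP=VALUE, got {pair!r}")
            props[key] = value
        else:
            rest.append(arg)
    return props, rest
```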
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:
      
      1dc9855 [Sandy Ryza] More doc and cleanup
      00edfb9 [Sandy Ryza] Review comments
      91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
      8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
      e34922a2
  19. Jul 23, 2014
    • Ian O Connell's avatar
      [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a... · efdaeb11
      Ian O Connell authored
      [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.
      
      Author: Ian O Connell <ioconnell@twitter.com>
      
      Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:
      
      5498566 [Ian O Connell] Docs update suggested by Patrick
      20e8555 [Ian O Connell] Slight style change
      f92c294 [Ian O Connell] Add docs for new KryoSerializer option
      f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
      4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
      665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
      efdaeb11
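The "resource pool ... for Kryo instances" in the title amortizes the cost of constructing serializer instances across callers. A generic sketch of the pattern (hypothetical, not the actual Spark SQL pool class):

```python
import queue

class ResourcePool:
    """Bounded pool: borrow a cached instance if one is idle, otherwise
    build a fresh one; released instances are kept until the pool fills."""

    def __init__(self, factory, size):
        self._factory = factory
        self._pool = queue.Queue(maxsize=size)

    def borrow(self):
        try:
            return self._pool.get_nowait()
        except queue.Empty:
            return self._factory()   # pool empty: create a new instance

    def release(self, obj):
        try:
            self._pool.put_nowait(obj)
        except queue.Full:
            pass                     # pool full: drop the extra instance
```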
  20. Jul 20, 2014
    • Michael Giannakopoulos's avatar
      [SPARK-1945][MLLIB] Documentation Improvements for Spark 1.0 · db56f2df
      Michael Giannakopoulos authored
      Standalone application examples are added to 'mllib-linear-methods.md' file written in Java.
      This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945).
Also, I changed the name of the sigmoid function from 'logit' to 'f'. This is because the logit function
is the inverse of the sigmoid, not the sigmoid itself.
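The inverse relationship mentioned above is easy to check numerically:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic function: the log-odds of p in (0, 1)."""
    return math.log(p / (1.0 - p))
```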
      
      Thanks,
      Michael
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1311 from miccagiann/master and squashes the following commits:
      
      8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards.
      f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
      38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever is possible.
      cc0a089 [Michael Giannakopoulos] Modyfied Java examples so as to comply with coding standards.
      b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md].
      837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
      15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.
      db56f2df