  1. Nov 08, 2014
    • [MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API · 7e9d9756
      Michelangelo D'Agostino authored
      SPARK-1553 added alternating nonnegative least squares to MLlib; however, it was not possible to access it via the Python API. This pull request resolves that.
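
      A minimal usage sketch of what this exposes on the Python side (assumes a live `SparkContext` `sc`; the parameter placement follows the commit messages below):
      ```python
      from pyspark.mllib.recommendation import ALS, Rating

      ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)])

      # nonnegative=True constrains the learned factors to be >= 0;
      # per this PR, seed now comes after nonnegative in the parameter list.
      model = ALS.train(ratings, rank=10, iterations=5, nonnegative=True, seed=42)
      model.predict(2, 2)
      ```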
      
      Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>
      
      Closes #3095 from mdagost/python_nmf and squashes the following commits:
      
      a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI.  Remove the new static methods I added.  Set seed in tests.  Change ratings to ratingsRDD in both train and trainImplicit for consistency.
      7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places.
      3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list.
      bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null.
      cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes.  Also made the python ALS tests more exact.
      a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.
  2. Nov 07, 2014
    • [SPARK-4304] [PySpark] Fix sort on empty RDD · 77791097
      Davies Liu authored
      This PR fixes `sortBy()`/`sortByKey()` on an empty RDD.

      This should be backported into 1.1/1.2.
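
      A quick sketch of the regression this guards against (assuming a live `SparkContext` `sc`): sorting an empty RDD should return an empty result rather than fail.
      ```python
      # before the fix these raised an error; after it they return []
      assert sc.parallelize([]).sortBy(lambda x: x).collect() == []
      assert sc.parallelize([]).sortByKey().collect() == []
      ```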
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3162 from davies/fix_sort and squashes the following commits:
      
      84f64b7 [Davies Liu] add tests
      52995b5 [Davies Liu] fix sortByKey() on empty RDD
    • MAINTENANCE: Automated closing of pull requests. · 5923dd98
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3016 (close requested by 'andrewor14')
      Closes #2798 (close requested by 'andrewor14')
      Closes #2864 (close requested by 'andrewor14')
      Closes #3154 (close requested by 'JoshRosen')
      Closes #3156 (close requested by 'JoshRosen')
      Closes #214 (close requested by 'kayousterhout')
      Closes #2584 (close requested by 'andrewor14')
    • Update JavaCustomReceiver.java · 7c9ec529
      xiao321 authored
      Array index out of bounds.
      
      Author: xiao321 <1042460381@qq.com>
      
      Closes #3153 from xiao321/patch-1 and squashes the following commits:
      
      0ed17b5 [xiao321] Update JavaCustomReceiver.java
    • [SPARK-4292][SQL] Result set iterator bug in JDBC/ODBC · d6e55524
      wangfei authored
      Running `select * from src` returned a wrong result set, with rows duplicated as follows:
      ```
      ...
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 309  | val_309  |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      | 97   | val_97   |
      ...
      
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3149 from scwf/SPARK-4292 and squashes the following commits:
      
      1574a43 [wangfei] using result.collect
      8b2d845 [wangfei] adding test
      f64eddf [wangfei] result set iter bug
    • [SPARK-4203][SQL] Partition directories in random order when inserting into hive table · ac70c972
      Matthew Taylor authored
      When doing an insert into a Hive table with partitions, the folders written to the file system are in a random order instead of the order defined at table creation. It seems that the loadPartition method in Hive.java takes a Map<String,String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. Working on a test but having IntelliJ problems.
      
      Author: Matthew Taylor <matthew.t@tbfe.net>
      
      Closes #3076 from tbfenet/partition_dir_order_problem and squashes the following commits:
      
      f1b9a52 [Matthew Taylor] Comment format fix
      bca709f [Matthew Taylor] review changes
      0e50f6b [Matthew Taylor] test fix
      99f1a31 [Matthew Taylor] partition ordering fix
      369e618 [Matthew Taylor] partition ordering fix
    • [SPARK-4270][SQL] Fix Cast from DateType to DecimalType. · a6405c5d
      Takuya UESHIN authored
      `Cast` from `DateType` to `DecimalType` throws `NullPointerException`.
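
      A hypothetical repro sketch (assumes a `HiveContext` `hc` and a table `src`; the exact SQL accepted may vary by version):
      ```python
      # Before this fix, casting a DATE value to DECIMAL threw a NullPointerException.
      hc.sql("SELECT CAST(CAST('2014-11-07' AS DATE) AS DECIMAL) FROM src LIMIT 1").collect()
      ```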
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3134 from ueshin/issues/SPARK-4270 and squashes the following commits:
      
      7394e4b [Takuya UESHIN] Fix Cast from DateType to DecimalType.
    • [SPARK-4272] [SQL] Add more unwrapper functions for primitive type in TableReader · 60ab80f5
      Cheng Hao authored
      Currently, the data "unwrap" supports only a couple of primitive types, not all of them. This will not cause exceptions, but it may cost some performance when scanning tables with types like binary, date, timestamp, decimal, etc.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3136 from chenghao-intel/table_reader and squashes the following commits:
      
      fffb729 [Cheng Hao] fix bug for retrieving the timestamp object
      e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader
    • [SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators · 14c54f18
      Kousuke Saruta authored
      The following description is quoted from JIRA:
      
      When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error:
      scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$)
      Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters.
      To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB):
      
          create table sparkbug (
          id int,
          event string
          ) stored as parquet;
      
      Insert some sample data:
      
          insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1;
          insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1;
      
      Launch a spark shell and create a HiveContext to the metastore where the table above is located.
      
          import org.apache.spark.sql._
          import org.apache.spark.sql.SQLContext
          import org.apache.spark.sql.hive.HiveContext
          val hc = new HiveContext(sc)
          hc.setConf("spark.sql.shuffle.partitions", "10")
          hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
          hc.setConf("spark.sql.parquet.compression.codec", "snappy")
          import hc._
          hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'")
      
      A scala.MatchError will appear in the output.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3083 from sarutak/SPARK-4213 and squashes the following commits:
      
      4ab6e56 [Kousuke Saruta] WIP
      b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213
      9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings
    • [SQL] Modify keyword val location according to ordering · 68609c51
      Jacky Li authored
      'DOUBLE' should be moved before 'ELSE' according to the ordering convention.
      
      Author: Jacky Li <jacky.likun@gmail.com>
      
      Closes #3080 from jackylk/patch-5 and squashes the following commits:
      
      3c11df7 [Jacky Li] [SQL] Modify keyword val location according to ordering
    • [SQL] Support ScalaReflection of schema in different universes · 8154ed7d
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3096 from marmbrus/reflectionContext and squashes the following commits:
      
      adc221f [Michael Armbrust] Support ScalaReflection of schema in different universes
    • [SPARK-4225][SQL] Resorts to SparkContext.version to inspect Spark version · 86e9eaa3
      Cheng Lian authored
      This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0 MANIFEST.MF, probably because of the assembly/shading tricks.
      
      Another related PR is #3103, which tries to fix the MANIFEST issue.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3105 from liancheng/spark-4225 and squashes the following commits:
      
      d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version
    • [SQL][DOC][Minor] Spark SQL Hive now support dynamic partitioning · 636d7bcc
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3127 from scwf/patch-9 and squashes the following commits:
      
      e39a560 [wangfei] now support dynamic partitioning
    • [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages · d4fa04e5
      Aaron Davidson authored
      This PR eliminates the network package's usage of the Java serializer and replaces it with Encodable, which is a lightweight binary protocol. Each message is preceded by a type id, which will allow us to change messages (by only adding new ones), or to change the format entirely by switching to a special id (such as -1).
      
      This protocol has the advantage over Java that we can guarantee that messages will remain compatible across compiled versions and JVMs, though it does not provide a clean way to do schema migration. In the future, it may be good to use a more heavy-weight serialization format like protobuf, thrift, or avro, but these all add several dependencies which are unnecessary at the present time.
      
      Additionally this unifies the RPC messages of NettyBlockTransferService and ExternalShuffleClient.
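
      A conceptual sketch of such a type-id-prefixed protocol (the real implementation is the Java `Encodable` interface; the message id and fields here are purely illustrative):
      ```python
      import struct

      OPEN_BLOCKS = 0  # hypothetical type id; real ids live in the Java message classes

      def encode(type_id, fields):
          """Encode a message as a 1-byte type id followed by length-prefixed UTF-8 fields."""
          out = struct.pack(">B", type_id)
          for f in fields:
              b = f.encode("utf-8")
              out += struct.pack(">I", len(b)) + b
          return out

      def decode(buf):
          """Recover the type id, then each length-prefixed field."""
          type_id, fields, pos = buf[0], [], 1
          while pos < len(buf):
              (n,) = struct.unpack_from(">I", buf, pos)
              pos += 4
              fields.append(buf[pos:pos + n].decode("utf-8"))
              pos += n
          return type_id, fields

      msg = encode(OPEN_BLOCKS, ["app-20141107", "shuffle_0_1_2"])
      assert decode(msg) == (OPEN_BLOCKS, ["app-20141107", "shuffle_0_1_2"])
      ```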
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3146 from aarondav/free and squashes the following commits:
      
      ed1102a [Aaron Davidson] Remove some unused imports
      b8e2a49 [Aaron Davidson] Add appId to test
      538f2a3 [Aaron Davidson] [SPARK-4187] [Core] Switch to binary protocol for external shuffle service messages
  3. Nov 06, 2014
    • [SPARK-4204][Core][WebUI] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly · 3abdb1b2
      zsxwing authored
      
      This PR fixed `Utils.exceptionString` to output the full exception information. However, the stack trace may become very large, so I also updated the Web UI to collapse the error information by default (display the first line; clicking `+detail` will display the full info).
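
      The gist of "contain the inner exceptions", sketched in Python (the actual fix is in Scala's `Utils.exceptionString`; this just walks the cause chain):
      ```python
      import traceback

      def exception_string(exc):
          """Format an exception together with every nested cause."""
          parts = []
          while exc is not None:
              parts.append("".join(traceback.format_exception_only(type(exc), exc)).strip())
              exc = exc.__cause__ or exc.__context__
          return "\nCaused by: ".join(parts)
      ```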
      
      Here are the screenshots:
      
      Stages:
      ![stages](https://cloud.githubusercontent.com/assets/1000778/4882441/66d8cc68-6356-11e4-8346-6318677d9470.png)
      
      Details for one stage:
      ![stage](https://cloud.githubusercontent.com/assets/1000778/4882513/1311043c-6357-11e4-8804-ca14240a9145.png)
      
      The full information in the gray text field is:
      ```Java
      org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
      	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
      	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
      	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
      	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      	at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
      	at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
      	at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
      	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
      	at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
      	at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
      	at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:56)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.IOException: Connection reset by peer
      	at sun.nio.ch.FileDispatcher.read0(Native Method)
      	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      	at sun.nio.ch.IOUtil.read(IOUtil.java:166)
      	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
      	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
      	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
      	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
      	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
      	... 1 more
      ```
      
      /cc aarondav
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3073 from zsxwing/SPARK-4204 and squashes the following commits:
      
      176d1e3 [zsxwing] Add comments to explain the stack trace difference
      ca509d3 [zsxwing] Add fullStackTrace to the constructor of ExceptionFailure
      a07057b [zsxwing] Core style fix
      dfb0032 [zsxwing] Backward compatibility for old history server
      1e50f71 [zsxwing] Update as per review and increase the max height of the stack trace details
      94f2566 [zsxwing] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly
    • [SPARK-4236] Cleanup removed applications' files in shuffle service · 48a19a6d
      Aaron Davidson authored
      This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about.
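
      A conceptual sketch of that hook (the registry layout and names are illustrative, not the Java implementation):
      ```python
      import shutil

      executor_dirs = {}  # appId -> local dirs registered by that app's executors

      def remove_application(app_id, clean_local_dirs=True):
          """Hook the host service invokes once an application completes."""
          for d in executor_dirs.pop(app_id, []):
              if clean_local_dirs:
                  shutil.rmtree(d, ignore_errors=True)
      ```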
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3126 from aarondav/cleanup and squashes the following commits:
      
      33a64a9 [Aaron Davidson] Missing brace
      e6e428f [Aaron Davidson] Address comments
      16a0d27 [Aaron Davidson] Cleanup
      e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service
    • [SPARK-4188] [Core] Perform network-level retry of shuffle file fetches · f165b2bb
      Aaron Davidson authored
      This adds a RetryingBlockFetcher to the NettyBlockTransferService which is wrapped around our typical OneForOneBlockFetcher, adding retry logic in the event of an IOException.
      
      This sort of retry allows us to avoid marking an entire executor as failed due to garbage collection or high network load.
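
      The retry behavior, sketched conceptually (the real `RetryingBlockFetcher` is asynchronous Java code; the retry count and wait time are illustrative):
      ```python
      import time

      def fetch_with_retry(fetch_blocks, block_ids, max_retries=3, wait_secs=5):
          for attempt in range(max_retries + 1):
              try:
                  return fetch_blocks(block_ids)
              except IOError:
                  if attempt == max_retries:
                      raise  # only now is the failure reported upward
                  time.sleep(wait_secs)  # a GC pause or load spike may have passed
      ```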
      
      TODO:
      - [x] unit tests
      - [x] put in ExternalShuffleClient too
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3101 from aarondav/retry and squashes the following commits:
      
      72a2a32 [Aaron Davidson] Add that we should remove the condition around the retry thingy
      c7fd107 [Aaron Davidson] Fix unit tests
      e80e4c2 [Aaron Davidson] Address initial comments
      6f594cd [Aaron Davidson] Fix unit test
      05ff43c [Aaron Davidson] Add to external shuffle client and add unit test
      66e5a24 [Aaron Davidson] [SPARK-4238] [Core] Perform network-level retry of shuffle file fetches
    • [SPARK-4277] Support external shuffle service on Standalone Worker · 6e9ef10f
      Aaron Davidson authored
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3142 from aarondav/worker and squashes the following commits:
      
      3780bd7 [Aaron Davidson] Address comments
      2dcdfc1 [Aaron Davidson] Add private[worker]
      47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors)
      258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor
    • [SPARK-3797] Minor addendum to Yarn shuffle service · 96136f22
      Andrew Or authored
      I did not realize there was a `network.util.JavaUtils` when I wrote this code. This PR moves the `ByteBuffer` string conversion to the appropriate place. I tested the changes on a stable yarn cluster.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3144 from andrewor14/yarn-shuffle-util and squashes the following commits:
      
      b6c08bf [Andrew Or] Remove unused import
      94e205c [Andrew Or] Use netty Unpooled
      85202a5 [Andrew Or] Use guava Charsets
      057135b [Andrew Or] Reword comment
      adf186d [Andrew Or] Move byte buffer String conversion logic to JavaUtils
    • [HOT FIX] Make distribution fails · 470881b2
      Andrew Or authored
      This was added by me in https://github.com/apache/spark/commit/61a5cced049a8056292ba94f23fa7bd040f50685. The real fix will be added in [SPARK-4281](https://issues.apache.org/jira/browse/SPARK-4281).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3145 from andrewor14/fix-make-distribution and squashes the following commits:
      
      c78be61 [Andrew Or] Hot fix make distribution
    • [SPARK-4249][GraphX]fix a problem of EdgePartitionBuilder in Graphx · d15c6e9d
      lianhuiwang authored
      At first, srcIds is not initialized, so its entries are all 0; we therefore use edgeArray(0).srcId to initialize currSrcId.
      
      Author: lianhuiwang <lianhuiwang09@gmail.com>
      
      Closes #3138 from lianhuiwang/SPARK-4249 and squashes the following commits:
      
      3f4e503 [lianhuiwang] fix a problem of EdgePartitionBuilder in Graphx
    • [SPARK-4264] Completion iterator should only invoke callback once · 23eaf0e1
      Aaron Davidson authored
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3128 from aarondav/compiter and squashes the following commits:
      
      698e4be [Aaron Davidson] [SPARK-4264] Completion iterator should only invoke callback once
    • [SPARK-4186] add binaryFiles and binaryRecords in Python · b41a39e2
      Davies Liu authored
      add binaryFiles() and binaryRecords() in Python
      ```
      binaryFiles(self, path, minPartitions=None):
          :: Developer API ::
      
          Read a directory of binary files from HDFS, a local file system
          (available on all nodes), or any Hadoop-supported file system URI
          as a byte array. Each file is read as a single record and returned
          in a key-value pair, where the key is the path of each file, the
          value is the content of each file.
      
          Note: Small files are preferred; large files are also allowed, but
          may cause poor performance.
      
      binaryRecords(self, path, recordLength):
          Load data from a flat binary file, assuming each record is a set of numbers
          with the specified numerical format (see ByteBuffer), and the number of
          bytes per record is constant.
      
          :param path: Directory to the input data files
          :param recordLength: The length at which to split the records
      ```
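
      A short usage sketch (assumes a live `SparkContext` `sc`; the paths and the 8-byte record format are hypothetical):
      ```python
      import struct

      pairs = sc.binaryFiles("hdfs:///tmp/images")  # RDD of (path, file contents)
      doubles = sc.binaryRecords("hdfs:///tmp/doubles.bin", 8) \
                  .map(lambda rec: struct.unpack(">d", rec)[0])
      ```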
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3078 from davies/binary and squashes the following commits:
      
      cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      3aa349b [Davies Liu] add experimental notes
      24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      1900085 [Davies Liu] bugfix
      bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python
    • [SPARK-4255] Fix incorrect table striping · 5f27ae16
      Kay Ousterhout authored
      This commit stripes table rows after hiding some rows, to
      ensure that rows are correctly striped to alternate white
      and grey even when rows are hidden by default.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #3117 from kayousterhout/striping and squashes the following commits:
      
      be6e10a [Kay Ousterhout] [SPARK-4255] Fix incorrect table striping
  4. Nov 05, 2014
    • [SPARK-4137] [EC2] Don't change working dir on user · db45f5ad
      Nicholas Chammas authored
      This issue was uncovered after [this discussion](https://issues.apache.org/jira/browse/SPARK-3398?focusedCommentId=14187471&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14187471).
      
      Don't change the working directory on the user. This breaks relative paths the user may pass in, e.g., for the SSH identity file.
      
      ```
      ./ec2/spark-ec2 -i ../my.pem
      ```
      
      This patch will preserve the user's current working directory and allow calls like the one above to work.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2988 from nchammas/spark-ec2-cwd and squashes the following commits:
      
      f3850b5 [Nicholas Chammas] pep8 fix
      fbc20c7 [Nicholas Chammas] revert to old commenting style
      752f958 [Nicholas Chammas] specify deploy.generic path absolutely
      bcdf6a5 [Nicholas Chammas] fix typo
      77871a2 [Nicholas Chammas] add clarifying comment
      ce071fc [Nicholas Chammas] don't change working dir
    • [SPARK-4262][SQL] add .schemaRDD to JavaSchemaRDD · 3d2b5bc5
      Xiangrui Meng authored
       marmbrus
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3125 from mengxr/SPARK-4262 and squashes the following commits:
      
      307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD
    • [SPARK-4254] [mllib] MovieLensALS bug fix · c315d131
      Joseph K. Bradley authored
      Changed code so it does not try to serialize Params.
      CC: mengxr 	debasish83 srowen
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3116 from jkbradley/als-bugfix and squashes the following commits:
      
      e575bd8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into als-bugfix
      9401b16 [Joseph K. Bradley] changed implicitPrefs so it is not serialized to fix MovieLensALS example bug
    • [SPARK-4158] Fix for missing resources. · cb0eae3b
      Brenden Matthews authored
      Mesos offers may not contain all resources, and Spark needs to check to
      ensure they are present and sufficient.  Spark may throw an erroneous
      exception when resources aren't present.
      
      Author: Brenden Matthews <brenden@diddyinc.com>
      
      Closes #3024 from brndnmtthws/fix-mesos-resource-misuse and squashes the following commits:
      
      e5f9580 [Brenden Matthews] [SPARK-4158] Fix for missing resources.
    • SPARK-3223 runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode · f7ac8c2b
      Jongyoul Lee authored
      
      - change master newer
      
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3034 from jongyoul/SPARK-3223 and squashes the following commits:
      
      42b2ed3 [Jongyoul Lee] SPARK-3223 runAsSparkUser cannot change HDFS write permission properly in mesos cluster mode - change master newer
    • SPARK-4040. Update documentation to exemplify use of local (n) value, fo... · 868cd4c3
      jay@apache.org authored
      This is a minor docs update which helps to clarify the way local[n] is used for streaming apps.
      
      Author: jay@apache.org <jayunit100>
      
      Closes #2964 from jayunit100/SPARK-4040 and squashes the following commits:
      
      35b5a5e [jay@apache.org] SPARK-4040: Update documentation to exemplify use of local (n) value.
    • [SPARK-3797] Run external shuffle service in Yarn NM · 61a5cced
      Andrew Or authored
      This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.
      
      This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:
      
      ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      0ee67a2 [Andrew Or] Minor wording suggestions
      1c66046 [Andrew Or] Remove unused provided dependencies
      0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      6489db5 [Andrew Or] Try catch at the right places
      7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
      d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
      5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      9b6e058 [Andrew Or] Address various feedback
      f48b20c [Andrew Or] Fix tests again
      f39daa6 [Andrew Or] Do not make network-yarn an assembly module
      761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      15a5b37 [Andrew Or] Fix build for Hadoop 1.x
      baff916 [Andrew Or] Fix tests
      5bf9b7e [Andrew Or] Address a few minor comments
      5b419b8 [Andrew Or] Add missing license header
      804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
      cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
      ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
      1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
      b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
      43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
      b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
    • SPARK-4222 [CORE] use readFully in FixedLengthBinaryRecordReader · f37817b1
      industrial-sloth authored
      
      Replaces the existing read() call with readFully().
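
      The failure mode this guards against, sketched in Python: a plain read(n) may legally return fewer than n bytes, so a fixed-length record reader must loop until the record is full (a conceptual sketch, not the Java code):
      ```python
      def read_fully(stream, n):
          """Read exactly n bytes, or raise if the stream ends mid-record."""
          buf = b""
          while len(buf) < n:
              chunk = stream.read(n - len(buf))
              if not chunk:
                  raise EOFError("stream ended mid-record")
              buf += chunk
          return buf
      ```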
      
      Author: industrial-sloth <industrial-sloth@users.noreply.github.com>
      
      Closes #3093 from industrial-sloth/branch-1.2-fixedLenRecRdr and squashes the following commits:
      
      a245c8a [industrial-sloth] use readFully in FixedLengthBinaryRecordReader
      
      (cherry picked from commit 6844e7a8)
      Signed-off-by: Matei Zaharia <matei@databricks.com>
    • [SPARK-3984] [SPARK-3983] Fix incorrect scheduler delay and display task deserialization time in UI · a46497ee
      Kay Ousterhout authored
      This commit fixes the scheduler delay in the UI (which previously
      included things that are not scheduler delay, like time to
      deserialize the task and serialize the result), and also
      adds information about time to deserialize tasks to the optional
      additional metrics.  Time to deserialize the task can be large relative
      to task time for short jobs, and understanding when it is high can help
      developers realize that they should try to reduce closure size (e.g., by including
      less data in the task description).
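
      In effect the accounting becomes (a hedged sketch; the actual metric names differ in the Scala code):
      ```python
      def scheduler_delay(total_task_time, deserialize_time, run_time, result_ser_time):
          # deserialization and result serialization no longer inflate the delay
          return max(0, total_task_time - (deserialize_time + run_time + result_ser_time))
      ```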
      
      cc shivaram etrain
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #2832 from kayousterhout/SPARK-3983 and squashes the following commits:
      
      0c1398e [Kay Ousterhout] Fixed ordering
      531575d [Kay Ousterhout] Removed executor launch time
      1f13afe [Kay Ousterhout] Minor spacing fixes
      335be4b [Kay Ousterhout] Made metrics hideable
      5bc3cba [Kay Ousterhout] [SPARK-3984] [SPARK-3983] Improve UI task metrics.
    • [SPARK-4242] [Core] Add SASL to external shuffle service · 4c42986c
      Aaron Davidson authored
      Does three things: (1) Adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds unit test.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3108 from aarondav/sasl-client and squashes the following commits:
      
      48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
      3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
      b58518a [Aaron Davidson] ByteStreams.limit() not available :(
      cbe451a [Aaron Davidson] Address comments
      2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
    • [SPARK-4197] [mllib] GradientBoosting API cleanup and examples in Scala, Java · 5b3b6f6f
      Joseph K. Bradley authored
      ### Summary
      
      * Made it easier to construct default Strategy and BoostingStrategy and to set parameters using simple types.
      * Added Scala and Java examples for GradientBoostedTrees
      * small cleanups and fixes
      
      ### Details
      
      GradientBoosting bug fixes (“bug” = bad default options)
      * Force boostingStrategy.weakLearnerParams.algo = Regression
      * Force boostingStrategy.weakLearnerParams.impurity = impurity.Variance
      * Only persist data if not yet persisted (since it causes an error if persisted twice)
      
      BoostingStrategy
      * numEstimators: renamed to numIterations
      * removed subsamplingRate (duplicated by Strategy)
      * removed categoricalFeaturesInfo since it belongs with the weak learner params (since boosting can be oblivious to feature type)
      * Changed algo to var (not val) and added BeanProperty, with overload taking String argument
      * Added assertValid() method
      * Updated defaultParams() method and eliminated defaultWeakLearnerParams() since that belongs in Strategy
      
      Strategy (for DecisionTree)
      * Changed algo to var (not val) and added BeanProperty, with overload taking String argument
      * Added setCategoricalFeaturesInfo method taking Java Map.
      * Cleaned up assertValid
      * Changed val’s to def’s since parameters can now be changed.
      
      CC: manishamde mengxr codedeft
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3094 from jkbradley/gbt-api and squashes the following commits:
      
      7a27e22 [Joseph K. Bradley] scalastyle fix
      52013d5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into gbt-api
      e9b8410 [Joseph K. Bradley] Summary of changes
    • [SPARK-4029][Streaming] Update streaming driver to reliably save and recover received block metadata on driver failures · 5f13759d
      Tathagata Das authored
      
      As part of the initiative of preventing data loss on driver failure, this JIRA tracks the sub task of modifying the streaming driver to reliably save received block metadata, and recover them on driver restart.
      
      This was solved by introducing a `ReceivedBlockTracker` that takes all the responsibility of managing the metadata of received blocks (i.e., `ReceivedBlockInfo`) and any actions on them (e.g., allocating blocks to batches, etc.). All actions on block info get written out to a write ahead log (using `WriteAheadLogManager`). On recovery, all the actions are replayed to recreate the pre-failure state of the `ReceivedBlockTracker`, which includes the batch-to-block allocations and the unallocated blocks.
      
      Furthermore, the `ReceiverInputDStream` was modified to create `WriteAheadLogBackedBlockRDD`s when file segment info is present in the `ReceivedBlockInfo`. After recovery of all the block info (through recovery of the `ReceivedBlockTracker`), the `WriteAheadLogBackedBlockRDD`s get recreated with the recovered info, and the jobs are submitted. The data of the blocks gets pulled from the write ahead logs, thanks to the segment info present in the `ReceivedBlockInfo`.
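
      The write-ahead pattern, sketched conceptually (names are illustrative, not the Scala API): every mutation is logged before it is applied, so replaying the log rebuilds the pre-failure state.
      ```python
      class BlockTrackerSketch:
          def __init__(self, log):
              self.log = log                 # append-only write ahead log
              self.unallocated, self.batches = [], {}
              for action, payload in list(log):  # recovery: replay logged actions
                  self._apply(action, payload)

          def add_block(self, info):
              self.log.append(("add", info)); self._apply("add", info)

          def allocate_to_batch(self, batch_time):
              self.log.append(("alloc", batch_time)); self._apply("alloc", batch_time)

          def _apply(self, action, payload):
              if action == "add":
                  self.unallocated.append(payload)
              else:  # "alloc": assign all unallocated blocks to this batch
                  self.batches[payload], self.unallocated = self.unallocated, []
      ```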
      
      This is still a WIP. Things that are missing here are:
      
      - *End-to-end integration tests:* Unit tests that tests the driver recovery, by killing and restarting the streaming context, and verifying all the input data gets processed. This has been implemented but not included in this PR yet. A sneak peek of that DriverFailureSuite can be found in this PR (on my personal repo): https://github.com/tdas/spark/pull/25 I can either include it in this PR, or submit that as a separate PR after this gets in.
      
      - *WAL cleanup:* Cleaning up the received data write ahead log, by calling `ReceivedBlockHandler.cleanupOldBlocks`. This is being worked on.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3026 from tdas/driver-ha-rbt and squashes the following commits:
      
      a8009ed [Tathagata Das] Added comment
      1d704bb [Tathagata Das] Enabled storing recovered WAL-backed blocks to BM
      2ee2484 [Tathagata Das] More minor changes based on PR
      47fc1e3 [Tathagata Das] Addressed PR comments.
      9a7e3e4 [Tathagata Das] Refactored ReceivedBlockTracker API a bit to make things a little cleaner for users of the tracker.
      af63655 [Tathagata Das] Minor changes.
      fce2b21 [Tathagata Das] Removed commented lines
      59496d3 [Tathagata Das] Changed class names, made allocation more explicit and added cleanup
      19aec7d [Tathagata Das] Fixed casting bug.
      f66d277 [Tathagata Das] Fix line lengths.
      cda62ee [Tathagata Das] Added license
      25611d6 [Tathagata Das] Minor changes before submitting PR
      7ae0a7fb [Tathagata Das] Transferred changes from driver-ha-working branch
  5. Nov 04, 2014
    • [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API · c8abddc5
      Davies Liu authored
      ```
      pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
          :: Experimental ::
      
          If `observed` is Vector, conduct Pearson's chi-squared goodness
          of fit test of the observed data against the expected distribution,
          or against the uniform distribution (by default), with each category
          having an expected frequency of `1 / len(observed)`.
          (Note: `observed` cannot contain negative values)
      
          If `observed` is matrix, conduct Pearson's independence test on the
          input contingency matrix, which cannot contain negative entries or
          columns or rows that sum up to 0.
      
          If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
          test for every feature against the label across the input RDD.
          For each feature, the (feature, label) pairs are converted into a
          contingency matrix for which the chi-squared statistic is computed.
          All label and feature values must be categorical.
      
          :param observed: it could be a vector containing the observed categorical
                           counts/relative frequencies, or the contingency matrix
                           (containing either counts or relative frequencies),
                           or an RDD of LabeledPoint containing the labeled dataset
                           with categorical features. Real-valued features will be
                           treated as categorical for each distinct value.
          :param expected: Vector containing the expected categorical counts/relative
                           frequencies. `expected` is rescaled if the `expected` sum
                           differs from the `observed` sum.
          :return: ChiSquaredTest object containing the test statistic, degrees
                   of freedom, p-value, the method used, and the null hypothesis.
      ```
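
      A usage sketch of the new API (assumes a live `SparkContext`; attribute names follow the result object described above):
      ```python
      from pyspark.mllib.linalg import Vectors, Matrices
      from pyspark.mllib.stat import Statistics

      # goodness of fit against the (default) uniform distribution
      print(Statistics.chiSqTest(Vectors.dense([13.0, 47.0, 40.0])).pValue)

      # independence test on a 2x2 contingency matrix (column-major values)
      mat = Matrices.dense(2, 2, [13.0, 47.0, 40.0, 30.0])
      print(Statistics.chiSqTest(mat).statistic)
      ```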
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3091 from davies/his and squashes the following commits:
      
      145d16c [Davies Liu] address comments
      0ab0764 [Davies Liu] fix float
      5097d54 [Davies Liu] add Hypothesis test Python API
    • [SQL] Add String option for DSL AS · 515abb9a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3097 from marmbrus/asString and squashes the following commits:
      
      6430520 [Michael Armbrust] Add String option for DSL AS
    • [SPARK-2938] Support SASL authentication in NettyBlockTransferService · 5e73138a
      Aaron Davidson authored
      Also lays the groundwork for supporting it inside the external shuffle service.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3087 from aarondav/sasl and squashes the following commits:
      
      3481718 [Aaron Davidson] Delete rogue println
      44f8410 [Aaron Davidson] Delete documentation - muahaha!
      eb9f065 [Aaron Davidson] Improve documentation and add end-to-end test at Spark-level
      a6b95f1 [Aaron Davidson] Address comments
      785bbde [Aaron Davidson] Cleanup
      79973cb [Aaron Davidson] Remove unused file
      151b3c5 [Aaron Davidson] Add docs, timeout config, better failure handling
      f6177d7 [Aaron Davidson] Cleanup SASL state upon connection termination
      7b42adb [Aaron Davidson] Add unit tests
      8191bcb [Aaron Davidson] [SPARK-2938] Support SASL authentication in NettyBlockTransferService
    • [Spark-4060] [MLlib] exposing special rdd functions to the public · f90ad5d4
      Niklas Wilcke authored
      Author: Niklas Wilcke <1wilcke@informatik.uni-hamburg.de>
      
      Closes #2907 from numbnut/master and squashes the following commits:
      
      7f7c767 [Niklas Wilcke] [Spark-4060] [MLlib] exposing special rdd functions to the public, #2907