  1. Nov 03, 2014
    • Josh Rosen's avatar
      [SPARK-611] Display executor thread dumps in web UI · 4f035dd2
      Josh Rosen authored
      This patch allows executor thread dumps to be collected on-demand and viewed in the Spark web UI.
      
      The thread dumps are collected using Thread.getAllStackTraces().  To allow remote thread dumps to be triggered from the web UI, I added a new `ExecutorActor` that runs inside of the Executor actor system and responds to RPCs from the driver.  The driver's mechanism for obtaining a reference to this actor is a little bit hacky: it uses the block manager master actor to determine the host/port of the executor actor systems in order to construct ActorRefs to ExecutorActor.  Unfortunately, I couldn't find a much cleaner way to do this without a big refactoring of the executor -> driver communication.
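
As a rough illustration of on-demand thread dumps, the sketch below collects a stack trace for every live thread in the current process. It is written in Python, so it uses `sys._current_frames()` in place of the JVM's `Thread.getAllStackTraces()`. The `collect_thread_dump` name and the dictionary layout are assumptions made for illustration and are not part of the patch.

```python
import sys
import threading
import traceback


def collect_thread_dump():
    """Return a list of {id, name, stack} dicts, one per live thread."""
    # Map thread ids to names; a hypothetical layout, not Spark's format.
    names = {t.ident: t.name for t in threading.enumerate()}
    dump = []
    for thread_id, frame in sys._current_frames().items():
        dump.append({
            "id": thread_id,
            "name": names.get(thread_id, "unknown"),
            "stack": "".join(traceback.format_stack(frame)),
        })
    # Sort by thread id so repeated dumps are easy to compare.
    return sorted(dump, key=lambda entry: entry["id"])


if __name__ == "__main__":
    for entry in collect_thread_dump():
        print("Thread %d (%s)" % (entry["id"], entry["name"]))
        print(entry["stack"])
```

In a remote setting, a function like this would be the handler behind the RPC, with the resulting list serialized back to the caller.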
      
      Screenshots:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781793/7e7a0776-5cbf-11e4-874d-a91cd04620bd.png)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781794/8bce76aa-5cbf-11e4-8d13-8477748c9f7e.png)
      
      ![image](https://cloud.githubusercontent.com/assets/50748/4781797/bd11a8b8-5cbf-11e4-9ad7-a7459467ec8e.png)
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2944 from JoshRosen/jstack-in-web-ui and squashes the following commits:
      
      3c21a5d [Josh Rosen] Address review comments:
      880f7f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      f719266 [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      19707b0 [Josh Rosen] Add one comment.
      127a130 [Josh Rosen] Update to use SparkContext.DRIVER_IDENTIFIER
      b8e69aa [Josh Rosen] Merge remote-tracking branch 'origin/master' into jstack-in-web-ui
      3dfc2d4 [Josh Rosen] Add missing file.
      bc1e675 [Josh Rosen] Undo some leftover changes from the earlier approach.
      f4ac1c1 [Josh Rosen] Switch to on-demand collection of thread dumps
      dfec08b [Josh Rosen] Add option to disable thread dumps in UI.
      4c87d7f [Josh Rosen] Use separate RPC for sending thread dumps.
      2b8bdf3 [Josh Rosen] Enable thread dumps from the driver when running in non-local mode.
      cc3e6b3 [Josh Rosen] Fix test code in DAGSchedulerSuite.
      87b8b65 [Josh Rosen] Add new listener event for thread dumps.
      8c10216 [Josh Rosen] Add missing file.
      0f198ac [Josh Rosen] [SPARK-611] Display executor thread dumps in web UI
      4f035dd2
    • Zhang, Liye's avatar
      [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000 · 97a466ec
      Zhang, Liye authored
      The numbers of completed and failed stages shown in the web UI are always less than 1000, which is misleading when thousands of stages have already completed or failed. The counts should be correct even when only some of the stages are listed in the web UI (stage info is removed once the number of stages grows too large).
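
One way to picture the fix is to keep a running counter that is independent of the bounded list of retained stages. The Python sketch below is only an illustration: `StageTracker`, `retained_stages`, and the summary text are hypothetical names, and the limit of 1000 simply mirrors the number mentioned above.

```python
from collections import deque


class StageTracker:
    """Keeps only the most recent stages but counts every completed one."""

    def __init__(self, retained_stages=1000):
        # deque drops the oldest entry automatically once maxlen is reached.
        self.completed = deque(maxlen=retained_stages)
        self.num_completed = 0  # total count, never trimmed

    def on_stage_completed(self, stage_id):
        self.completed.append(stage_id)
        self.num_completed += 1

    def summary(self):
        return "Completed Stages (%d, showing %d)" % (
            self.num_completed, len(self.completed))


tracker = StageTracker(retained_stages=1000)
for stage_id in range(2500):
    tracker.on_stage_completed(stage_id)
print(tracker.summary())
```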
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #3035 from liyezhang556520/webStageNum and squashes the following commits:
      
      d9e29fb [Zhang, Liye] add detailed comments for variables
      4ea8fd1 [Zhang, Liye] change variable name accroding to comments
      f4c404d [Zhang, Liye] [SPARK-4168][WebUI] web statges number should show correctly when stages are more than 1000
      97a466ec
    • Michael Armbrust's avatar
      [SQL] Convert arguments to Scala UDFs · 15b58a22
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3077 from marmbrus/udfsWithUdts and squashes the following commits:
      
      34b5f27 [Michael Armbrust] style
      504adef [Michael Armbrust] Convert arguments to Scala UDFs
      15b58a22
    • Sandy Ryza's avatar
      SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation · 28128150
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3045 from sryza/sandy-spark-4178 and squashes the following commits:
      
      8d2e70e [Sandy Ryza] Kostas's review feedback
      e5b27c0 [Sandy Ryza] SPARK-4178. Hadoop input metrics ignore bytes read in RecordReader instantiation
      28128150
    • Michael Armbrust's avatar
      [SQL] More aggressive defaults · 25bef7e6
      Michael Armbrust authored
       - Turns on compression for in-memory cached data by default
       - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
       - Ups the batch size to 10,000 rows
       - Increases the broadcast threshold to 10mb.
       - Uses our parquet implementation instead of the hive one by default.
       - Cache parquet metadata by default.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:
      
      97ee9f8 [Michael Armbrust] parquet codec docs
      e641694 [Michael Armbrust] Remote also
      a12866a [Michael Armbrust] Cache metadata.
      2d73acc [Michael Armbrust] Update docs defaults.
      d63d2d5 [Michael Armbrust] document parquet option
      da373f9 [Michael Armbrust] More aggressive defaults
      25bef7e6
    • Cheng Hao's avatar
      [SPARK-4152] [SQL] Avoid data change in CTAS while table already existed · e83f13e8
      Cheng Hao authored
      CREATE TABLE t1 (a String);
      CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception
      CREATE TABLE if not exists t1 AS SELECT key FROM src; -- expected to do nothing; currently it overwrites t1, which is incorrect.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits:
      
      194113e [Cheng Hao] fix bug in CTAS when table already existed
      e83f13e8
    • Cheng Lian's avatar
      [SPARK-4202][SQL] Simple DSL support for Scala UDF · c238fb42
      Cheng Lian authored
      This feature is based on an offline discussion with mengxr and will hopefully be useful for the new MLlib pipeline API.
      
      For the following test snippet
      
      ```scala
      case class KeyValue(key: Int, value: String)
      val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD
      val foo = (a: Int, b: String) => a.toString + b
      ```
      
      the newly introduced DSL enables the following syntax
      
      ```scala
      import org.apache.spark.sql.catalyst.dsl._
      testData.select(Star(None), foo.call('key, 'value) as 'result)
      ```
      
      which is equivalent to
      
      ```scala
      testData.registerTempTable("testData")
      sqlContext.registerFunction("foo", foo)
      sql("SELECT *, foo(key, value) AS result FROM testData")
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3067 from liancheng/udf-dsl and squashes the following commits:
      
      f132818 [Cheng Lian] Adds DSL support for Scala UDF
      c238fb42
    • Davies Liu's avatar
      [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema for an RDD whose first row contains empty values (None, [], {}). It examines the first 100 rows and merges their types into one schema, also merging the fields of StructTypes. If the schema still contains a NullType, it shows a warning that tells the user to retry with sampling.

      If a sampling ratio is given, the schema is inferred from all the rows that remain after sampling.

      Also adds samplingRatio to jsonFile() and jsonRDD().
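
The merging step can be sketched in plain Python. This is a simplified illustration and not the PySpark implementation: rows are plain dicts, types are represented by their Python type names, and `infer_type`, `merge_types`, and `infer_schema` are hypothetical helpers.

```python
def infer_type(value):
    """Map a value to a simple type name; empty values become 'null'."""
    if value is None or value == [] or value == {}:
        return "null"
    return type(value).__name__


def merge_types(a, b):
    """Prefer a concrete type over 'null'; otherwise the types must agree."""
    if a == "null":
        return b
    if b == "null" or a == b:
        return a
    raise TypeError("cannot merge %s and %s" % (a, b))


def infer_schema(rows, max_rows=100):
    """Merge per-field types over the first max_rows rows."""
    schema = {}
    for row in rows[:max_rows]:
        for field, value in row.items():
            merged = merge_types(schema.get(field, "null"), infer_type(value))
            schema[field] = merged
    # Fields that are still 'null' are the ones that would trigger a warning.
    unresolved = sorted(f for f, t in schema.items() if t == "null")
    return schema, unresolved


rows = [
    {"name": None, "tags": []},
    {"name": "a", "tags": []},
    {"name": "b", "tags": ["x"]},
]
schema, unresolved = infer_schema(rows)
print(schema, unresolved)
```

Looking only at the first row would leave both fields unresolved, which is the situation the patch addresses by examining more rows.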
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
      24544fbc
    • ravipesala's avatar
      [SPARK-4207][SQL] Query which has syntax like 'not like' is not working in Spark SQL · 2b6e1ce6
      ravipesala authored
      Queries that use 'not like' do not work in Spark SQL:

      sql("SELECT * FROM records where value not like 'val%'")

      The same query works in Spark HiveQL.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3075 from ravipesala/SPARK-4207 and squashes the following commits:
      
      35c11e7 [ravipesala] Supported 'not like' syntax in sql
      2b6e1ce6
    • fi's avatar
      [SPARK-4211][Build] Fixes hive.version in Maven profile hive-0.13.1 · df607da0
      fi authored
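      Fixes the `hive-0.13.1` Maven profile, which referenced `hive.version=0.13.1` instead of the Spark-compatible `hive.version=0.13.1a`,
      e.g. when building with mvn -Phive -Phive-0.13.1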
      
      Note: `hive.version=0.13.1a` is the default property value. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected.
      References:  PR #2685, which resolved a package incompatibility issue with Hive-0.13.1 by introducing a special version Hive-0.13.1a
      
      Author: fi <coderfi@gmail.com>
      
      Closes #3072 from coderfi/master and squashes the following commits:
      
      7ca4b1e [fi] Fixes the `hive-0.13.1` maven profile referencing `hive.version=0.13.1` instead of the Spark compatible `hive.version=0.13.1a` Note: `hive.version=0.13.1a` is the default version. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected. e.g. mvn -Phive -Phive=0.13.1 See PR #2685
      df607da0
    • Xiangrui Meng's avatar
      [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
      
      Note: The new tests are not for this bug fix.
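
The offset can be reproduced outside Spark. The sketch below contrasts a seeding scheme that advances one shared sequence by one draw per partition index with a scheme that derives a distinct seed per partition. Both functions are illustrative assumptions and are not the code from the patch; mixing the seed with the partition index via XOR is just one simple option.

```python
import random


def skip_ahead_rng(seed, partition):
    """Problematic scheme: same seed, advanced one draw per partition index."""
    rng = random.Random(seed)
    for _ in range(partition):
        rng.random()
    return rng


def per_partition_rng(seed, partition):
    """Alternative: derive a distinct seed for each partition."""
    return random.Random(seed ^ partition)


def draws(rng, n=5):
    return [rng.random() for _ in range(n)]


seed = 10
a = draws(skip_ahead_rng(seed, 0))
b = draws(skip_ahead_rng(seed, 1))
# Partition 1 replays partition 0's sequence shifted by one position.
print(a[1:] == b[:-1])

c = draws(per_partition_rng(seed, 0))
d = draws(per_partition_rng(seed, 1))
# With distinct seeds the two sequences are no longer shifted copies.
print(c[1:] == d[:-1])
```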
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
      3cca1962
    • Nicholas Chammas's avatar
      [EC2] Factor out Mesos spark-ec2 branch · 2aca97c7
      Nicholas Chammas authored
      We reference a specific branch in two places. This patch makes it one place.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #3008 from nchammas/mesos-spark-ec2-branch and squashes the following commits:
      
      10a6089 [Nicholas Chammas] factor out mess spark-ec2 branch
      2aca97c7
    • zsxwing's avatar
      [SPARK-4163][Core][WebUI] Send the fetch failure message back to Web UI · 76386e1a
      zsxwing authored
      This is a PR to send the fetch failure message back to Web UI.
      Before:
      ![f1](https://cloud.githubusercontent.com/assets/1000778/4856595/1f036c80-60be-11e4-956f-335147fbccb7.png)
      ![f2](https://cloud.githubusercontent.com/assets/1000778/4856596/1f11cbea-60be-11e4-8fe9-9f9b2b35c884.png)
      
      After (Please ignore the meaning of exception, I threw it in the code directly because it's hard to simulate a fetch failure):
      ![e1](https://cloud.githubusercontent.com/assets/1000778/4856600/2657ea38-60be-11e4-9f2d-d56c5f900f10.png)
      ![e2](https://cloud.githubusercontent.com/assets/1000778/4856601/26595008-60be-11e4-912b-2744af786991.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3032 from zsxwing/SPARK-4163 and squashes the following commits:
      
      f7e1faf [zsxwing] Discard changes for FetchFailedException and minor modification
      4e946f7 [zsxwing] Add e as the cause of SparkException
      316767d [zsxwing] Add private[storage] to FetchResult
      d51b0b6 [zsxwing] Set e as the cause of FetchFailedException
      b88c919 [zsxwing] Use 'private[storage]' for case classes instead of 'sealed'
      62103fd [zsxwing] Update as per review
      0c07d1f [zsxwing] Backward-compatible support
      a3bca65 [zsxwing] Send the fetch failure message back to Web UI
      76386e1a
    • wangfei's avatar
      [SPARK-4177][Doc]update build doc since JDBC/CLI support hive 13 now · 001acc44
      wangfei authored
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3042 from scwf/patch-9 and squashes the following commits:
      
      3784ed1 [wangfei] remove 'TODO'
      1891553 [wangfei] update build doc since JDBC/CLI support hive 13
      001acc44
  2. Nov 02, 2014
    • Reynold Xin's avatar
      Close #2971. · d6e4c591
      Reynold Xin authored
      d6e4c591
    • Aaron Davidson's avatar
      [SPARK-4183] Enable NettyBlockTransferService by default · 1ae51f6d
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
      1ae51f6d
    • Joseph K. Bradley's avatar
      [SPARK-3572] [SQL] Internal API for User-Defined Types · ebd64805
      Joseph K. Bradley authored
      This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3063 from marmbrus/udts and squashes the following commits:
      
      7ccfc0d [Michael Armbrust] remove println
      46a3aee [Michael Armbrust] Slightly easier to read test output.
      6cc434d [Michael Armbrust] Recursively convert rows.
      e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts
      15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2
      f3c72fe [Joseph K. Bradley] Fixing merge
      e13cd8a [Joseph K. Bradley] Removed Vector UDTs
      5817b2b [Joseph K. Bradley] style edits
      30ce5b2 [Joseph K. Bradley] updates based on code review
      d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean
      a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs).  Cleaned up other code.  Extended JavaUserDefinedTypeSuite
      6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean
      20630bc [Joseph K. Bradley] fixed scalastyle
      fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now
      8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument.
      8b242ea [Joseph K. Bradley] Fixed merge error after last merge.  Note: Last merge commit also removed SQL UDT examples from mllib.
      7f29656 [Joseph K. Bradley] Moved udt case to top of all matches.  Small cleanups
      b028675 [Xiangrui Meng] allow any type in UDT
      4500d8a [Xiangrui Meng] update example code
      87264a5 [Xiangrui Meng] remove debug code
      3143ac3 [Xiangrui Meng] remove unnecessary changes
      cfbc321 [Xiangrui Meng] support UDT in parquet
      db16139 [Joseph K. Bradley] Added more doc for UserDefinedType.  Removed unused code in Suite
      759af7a [Joseph K. Bradley] Added more doc to UserDefineType
      63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions
      51e5282 [Joseph K. Bradley] fixed 1 test
      f025035 [Joseph K. Bradley] Cleanups before PR.  Added new tests
      85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors.
      dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs
      cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run
      34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL.
      e1f7b9c [Joseph K. Bradley] blah
      2f40c02 [Joseph K. Bradley] renamed UDT types
      3579035 [Joseph K. Bradley] udt annotation now working
      b226b9e [Joseph K. Bradley] Changing UDT to annotation
      fea04af [Joseph K. Bradley] more cleanups
      964b32e [Joseph K. Bradley] some cleanups
      893ee4c [Joseph K. Bradley] udt finallly working
      50f9726 [Joseph K. Bradley] udts
      04303c9 [Joseph K. Bradley] udts
      39f8707 [Joseph K. Bradley] removed old udt suite
      273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done
      8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging
      53de70f [Joseph K. Bradley] more udts...
      982c035 [Joseph K. Bradley] still working on UDTs
      19b2f60 [Joseph K. Bradley] still working on UDTs
      0eaeb81 [Joseph K. Bradley] Still working on UDTs
      105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.
      ebd64805
    • Aaron Davidson's avatar
      [SPARK-4183] Close transport-related resources between SparkContexts · 2ebd1df3
      Aaron Davidson authored
      A leak of event loops may be causing test failures.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3053 from aarondav/leak and squashes the following commits:
      
      e676d18 [Aaron Davidson] Typo!
      8f96475 [Aaron Davidson] Keep original ssc semantics
      7e49f10 [Aaron Davidson] A leak of event loops may be causing test failures.
      2ebd1df3
    • Cheng Lian's avatar
      [SPARK-2189][SQL] Adds dropTempTable API · 9081b9f9
      Cheng Lian authored
      This PR adds an API for unregistering temporary tables. If a temporary table has been cached before, it's unpersisted as well.
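
The described behavior (unregister the table and release its cached data in one step) can be sketched with a toy catalog. `TempTableCatalog` and its methods are hypothetical names, not Spark's API, and the lock is only a nod to the thread-safety commit listed below.

```python
import threading


class TempTableCatalog:
    """Toy catalog: tracks registered temporary tables and which are cached."""

    def __init__(self):
        self._lock = threading.Lock()  # guards both collections
        self._tables = {}
        self._cached = set()

    def register_temp_table(self, name, plan):
        with self._lock:
            self._tables[name] = plan

    def cache_table(self, name):
        with self._lock:
            if name not in self._tables:
                raise KeyError(name)
            self._cached.add(name)

    def is_cached(self, name):
        with self._lock:
            return name in self._cached

    def drop_temp_table(self, name):
        """Unregister the table and release its cache entry, if any."""
        with self._lock:
            existed = name in self._tables
            self._tables.pop(name, None)
            self._cached.discard(name)
            return existed


catalog = TempTableCatalog()
catalog.register_temp_table("testData", plan="scan")
catalog.cache_table("testData")
dropped = catalog.drop_temp_table("testData")
print(dropped, catalog.is_cached("testData"))
```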
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #3039 from liancheng/unregister-temp-table and squashes the following commits:
      
      54ae99f [Cheng Lian] Fixes Scala styling issue
      1948c14 [Cheng Lian] Removes the unpersist argument
      aca41d3 [Cheng Lian] Ensures thread safety
      7d4fb2b [Cheng Lian] Adds unregisterTempTable API
      9081b9f9
    • Yin Huai's avatar
      [SPARK-4185][SQL] JSON schema inference failed when dealing with type conflicts in arrays · 06232d23
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-4185.
      
      This PR also has the fix of #3052.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #3056 from yhuai/SPARK-4185 and squashes the following commits:
      
      ed3a5a8 [Yin Huai] Correctly handle type conflicts between structs and primitive types in an array.
      06232d23
    • wangfei's avatar
      [SPARK-4191][SQL]move wrapperFor to HiveInspectors to reuse it · e749f5de
      wangfei authored
      Moves wrapperFor from InsertIntoHiveTable to HiveInspectors so that it can be reused when writing data with an ObjectInspector (for example, for ORC support).
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3057 from scwf/reuse-wraperfor and squashes the following commits:
      
      7ccf932 [scwf] fix conflicts
      d44f4da [wangfei] fix imports
      9bf1b50 [wangfei] revert no related change
      9a5276a [wangfei] move wrapfor to hiveinspector to reuse them
      e749f5de
    • Cheng Lian's avatar
      [SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2 · c9f84004
      Cheng Lian authored
      This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
      
      TODO
      
      - [x] Find a general way to figure out Hive (or even any dependency) version.
      
        This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:
      
        1. must apply to both Maven and SBT builds
      
          For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
      
        2. must not rely on the original jars of dependencies to extract a specific dependency version, because Spark uses an assembly jar.

          This implies we can't read the Hive version from Hive jar files, since the standard Spark distribution doesn't include them.
      
        3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development.
      
           `SPARK_PREPEND_CLASSES` prevents classes from being loaded from the assembly jar, so we can't locate the jar file and read its manifest.
      
        Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
      
      **Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2843 from liancheng/get-info and squashes the following commits:
      
      a873d0f [Cheng Lian] Updates test case
      53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
      1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
      f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
      c9f84004
    • Cheng Lian's avatar
      [SQL] Fixes race condition in CliSuite · 495a1320
      Cheng Lian authored
      `CliSuite` has been flaky for a while. This PR tries to improve the situation by fixing a race condition in `CliSuite`. The `captureOutput` function captures both the stdout and stderr output of the forked external process in two background threads and searches for expected strings, but it was not properly synchronized before.
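
The kind of synchronization involved can be sketched in Python: two reader threads feed lines into one shared matcher, and a lock guards the shared state. This is a simplified stand-in and not the `CliSuite` code. `OutputMatcher` and `read_stream` are hypothetical names, and the matcher here is order-independent to keep the example deterministic.

```python
import threading


class OutputMatcher:
    """Tracks which expected strings have been seen on any captured stream."""

    def __init__(self, expected):
        self._remaining = set(expected)
        self._lock = threading.Lock()
        self.done = threading.Event()

    def capture(self, line):
        # Called from both reader threads, so shared state is guarded.
        with self._lock:
            self._remaining = {s for s in self._remaining if s not in line}
            if not self._remaining:
                self.done.set()


def read_stream(matcher, lines):
    for line in lines:
        matcher.capture(line)


matcher = OutputMatcher(["OK", "Time taken"])
readers = [
    threading.Thread(target=read_stream, args=(matcher, ["hive> ", "OK"])),
    threading.Thread(target=read_stream, args=(matcher, ["Time taken: 1s"])),
]
for reader in readers:
    reader.start()
for reader in readers:
    reader.join()
print(matcher.done.is_set())
```

A test would typically wait on `matcher.done` with a timeout instead of joining the readers, since the real streams stay open while the process runs.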
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3060 from liancheng/fix-cli-suite and squashes the following commits:
      
      a70569c [Cheng Lian] Fixes race condition in CliSuite
      495a1320
    • Cheng Lian's avatar
      [SPARK-4182][SQL] Fixes ColumnStats classes for boolean, binary and complex data types · e4b80894
      Cheng Lian authored
      `NoopColumnStats` was once used for binary, boolean and complex data types. This `ColumnStats` doesn't return properly shaped column statistics and causes caching failure if a table contains columns of the aforementioned types.
      
      This PR adds `BooleanColumnStats`, `BinaryColumnStats` and `GenericColumnStats`, used for boolean, binary and all complex data types respectively. In addition, `NoopColumnStats` now returns properly shaped column statistics containing null count and row count, but this class is used for testing purposes only.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3059 from liancheng/spark-4182 and squashes the following commits:
      
      b398cfd [Cheng Lian] Fixes failed test case
      fb3ee85 [Cheng Lian] Fixes SPARK-4182
      e4b80894
    • Michael Armbrust's avatar
      [SPARK-3247][SQL] An API for adding data sources to Spark SQL · 9c0eb57c
      Michael Armbrust authored
      This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.
      
      New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data.  BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects.  The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations.
      
      By implementing a class that inherits from RelationProvider, these data sources can be accessed using pure SQL.  I've used this functionality to update the JSON support so that it can now be used as follows:
      
      ```sql
      CREATE TEMPORARY TABLE jsonTableSQL
      USING org.apache.spark.sql.json
      OPTIONS (
        path '/home/michael/data.json'
      )
      ```
      
      Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
      
      There is also a library that uses this new API to read avro data available here:
      https://github.com/marmbrus/sql-avro
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2475 from marmbrus/foreign and squashes the following commits:
      
      1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      ab2c31f [Michael Armbrust] fix test
      1d41bb5 [Michael Armbrust] unify argument names
      5b47901 [Michael Armbrust] Remove sealed, more filter types
      fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      e3e690e [Michael Armbrust] Add hook for extraStrategies
      a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite
      70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java
      7d948ae [Michael Armbrust] Fix equality of AttributeReference.
      5545491 [Michael Armbrust] Address comments
      5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
      22963ef [Michael Armbrust] package objects compile wierdly...
      b069146 [Michael Armbrust] traits => abstract classes
      34f836a [Michael Armbrust] Make @DeveloperApi
      0d74bcf [Michael Armbrust] Add documention on object life cycle
      3e06776 [Michael Armbrust] remove line wraps
      de3b68c [Michael Armbrust] Remove empty file
      360cb30 [Michael Armbrust] style and java api
      2957875 [Michael Armbrust] add override
      0fd3a07 [Michael Armbrust] Draft of data sources API
      9c0eb57c
    • wangfei's avatar
      [HOTFIX][SQL] hive test missing some golden files · f0a4b630
      wangfei authored
      cc marmbrus
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3055 from scwf/hotfix and squashes the following commits:
      
      d881bd7 [wangfei] miss golden files
      f0a4b630
    • zsxwing's avatar
      [SPARK-4166][Core][WebUI] Display the executor ID in the Web UI when ExecutorLostFailure happens · 4e6a7a0b
      zsxwing authored
      Now when ExecutorLostFailure happens, it only displays `ExecutorLostFailure (executor lost)`. Adding the executor id will help locate the faulted executor.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3033 from zsxwing/SPARK-4166 and squashes the following commits:
      
      ff4664c [zsxwing] Backward-compatible support
      c5c4cf2 [zsxwing] Display the executor ID in the Web UI when ExecutorLostFailure happens
      4e6a7a0b
    • Davies Liu's avatar
      [SPARK-3466] Limit size of results that a driver collects for each action · 6181577e
      Davies Liu authored
      Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.

      This PR introduces spark.driver.maxResultSize. Once it is set, the driver aborts a job if the total size of its results exceeds this limit.

      By default, it's 1g (for backward compatibility in most cases).

      In local mode, the driver and executor share the same JVM, so the default setting cannot protect the JVM from OOM.
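
The check itself amounts to tracking a running total and aborting once it passes the limit. The sketch below illustrates this under stated assumptions: `ResultCollector` and `ResultSizeExceeded` are hypothetical names, sizes are plain byte counts, and treating 0 as "no limit" is a convention chosen for this example.

```python
class ResultSizeExceeded(Exception):
    """Raised when collected results grow past the configured limit."""


class ResultCollector:
    """Accumulates serialized task results and aborts past a size limit."""

    def __init__(self, max_result_size):
        self.max_result_size = max_result_size  # bytes; 0 disables the check
        self.total_size = 0
        self.results = []

    def add(self, serialized_result):
        new_total = self.total_size + len(serialized_result)
        # Check before storing, so an oversized result is never kept.
        if 0 < self.max_result_size < new_total:
            raise ResultSizeExceeded(
                "Total size of serialized results (%d bytes) is bigger than "
                "the limit (%d bytes)" % (new_total, self.max_result_size))
        self.total_size = new_total
        self.results.append(serialized_result)


collector = ResultCollector(max_result_size=10)
collector.add(b"12345")
collector.add(b"12345")
try:
    collector.add(b"1")
    aborted = False
except ResultSizeExceeded as error:
    aborted = True
    print(error)
```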
      
      cc mateiz
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3003 from davies/collect and squashes the following commits:
      
      248ed5e [Davies Liu] fix compile
      272522e [Davies Liu] address comments
      2c35773 [Davies Liu] add sizes in message of abort()
      5d62303 [Davies Liu] address comments
      bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
      11f97c5 [Davies Liu] address comments
      47b144f [Davies Liu] check the size of result before send and fetch
      3d81af2 [Davies Liu] address comments
      ca8267d [Davies Liu] limit the size of data by collect
      6181577e
  3. Nov 01, 2014
    • Matei Zaharia's avatar
      [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations · 23f966f4
      Matei Zaharia authored
      - Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
      - Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs
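
The second point relies on a standard trick: a decimal with a known scale can be stored as an unscaled integer, so arithmetic becomes plain integer arithmetic. The Python sketch below shows the idea only. `SmallDecimal` is a hypothetical class and not Spark's `Decimal`; because Python integers are unbounded, the switch to a larger representation for big values is not modeled.

```python
class SmallDecimal:
    """Stores a decimal as an unscaled integer plus a scale.

    For example, 12.34 with scale 2 is stored as the integer 1234.
    """

    def __init__(self, unscaled, scale):
        self.unscaled = unscaled
        self.scale = scale

    @classmethod
    def parse(cls, text):
        whole, _, frac = text.partition(".")
        return cls(int(whole + frac), len(frac))

    def __add__(self, other):
        # Align both operands to the larger scale, then add plain integers.
        scale = max(self.scale, other.scale)
        a = self.unscaled * 10 ** (scale - self.scale)
        b = other.unscaled * 10 ** (scale - other.scale)
        return SmallDecimal(a + b, scale)

    def __str__(self):
        if self.scale == 0:
            return str(self.unscaled)
        sign = "-" if self.unscaled < 0 else ""
        digits = str(abs(self.unscaled)).rjust(self.scale + 1, "0")
        return "%s%s.%s" % (sign, digits[:-self.scale], digits[-self.scale:])


total = SmallDecimal.parse("1.5") + SmallDecimal.parse("2.25")
print(total, total.unscaled, total.scale)
```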
      
      This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #2983 from mateiz/decimal-1 and squashes the following commits:
      
      35e6b02 [Matei Zaharia] Fix issues after merge
      227f24a [Matei Zaharia] Review comments
      31f915e [Matei Zaharia] Implement Davies's suggestions in Python
      eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
      4dc6bae [Matei Zaharia] Fix decimal support in PySpark
      d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
      b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
      2118c0d [Matei Zaharia] Some test and bug fixes
      81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
      7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
      ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
      23f966f4
    • Sung Chung's avatar
      [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees. · 56f2c61c
      Sung Chung authored
      jkbradley mengxr chouqin Please review this.
      
      Author: Sung Chung <schung@alpinenow.com>
      
      Closes #2868 from codedeft/SPARK-3161 and squashes the following commits:
      
      5f5a156 [Sung Chung] [SPARK-3161][MLLIB] Adding a node Id caching mechanism for training decision trees.
      56f2c61c
    • Xiangrui Meng's avatar
      [SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading · d8176b1c
      Xiangrui Meng authored
      In #2928, we shaded commons-math3 to prevent future conflicts with Hadoop. This caused problems with our Jenkins master build with Maven: some tests use local-cluster mode, where the assembly jar contains relocated math3 classes, while the MLlib test code still compiles against core and the untouched math3 classes.
      
      This PR sets commons-math3 version based on hadoop profiles.
      
      pwendell JoshRosen srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits:
      
      580f6d9 [Xiangrui Meng] replace tab by spaces
      7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts
      d3353d9 [Xiangrui Meng] do not shade commons-math3
      b4180dc [Xiangrui Meng] temp work
      d8176b1c
    • Patrick Wendell's avatar
      Revert "[SPARK-4183] Enable NettyBlockTransferService by default" · 7894de27
      Patrick Wendell authored
      This reverts commit 59e626c7.
      7894de27
    • Cheng Lian's avatar
      [SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2 · ad0fde10
      Cheng Lian authored
      `HiveThriftServer2` creates a global singleton `SessionState` instance and overrides `HiveContext` to inject the `SessionState` object. This messes up `SessionState` initialization and causes problems.
      
      This PR replaces the global `SessionState` with `HiveContext.sessionState` to avoid the initialization conflict. `HiveContext` also reuses an existing started `SessionState`, if any (this is required by `SparkSQLCLIDriver`, which uses a specialized `CliSessionState`).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2887 from liancheng/spark-4037 and squashes the following commits:
      
      8446675 [Cheng Lian] Removes redundant Driver initialization
      a28fef5 [Cheng Lian] Avoid starting HiveContext.sessionState multiple times
      49b1c5b [Cheng Lian] Reuses existing started SessionState if any
      3cd6fab [Cheng Lian] Fixes SPARK-4037
      ad0fde10
    • Aaron Davidson's avatar
      [SPARK-3796] Create external service which can serve shuffle files · f55218ae
      Aaron Davidson authored
      This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (set up in BlockManager).
      
      This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
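
      The idea of serving shuffle files without involving the executor that wrote them can be sketched roughly as follows. Every name here (`ExecutorKey`, `ShuffleRegistry`, `registerExecutor`, `resolve`) is hypothetical, and the hash-based directory choice is an assumption made for illustration; this is not the API introduced by the patch.

      ```scala
      import scala.collection.mutable

      // Hypothetical sketch: the service remembers where each executor keeps its
      // shuffle files, so a block can be located even after the executor is gone.
      final case class ExecutorKey(appId: String, execId: String)

      final class ShuffleRegistry {
        private val localDirs = mutable.Map.empty[ExecutorKey, Seq[String]]

        // Called once per executor, with the directories it writes shuffle output to.
        def registerExecutor(appId: String, execId: String, dirs: Seq[String]): Unit =
          localDirs(ExecutorKey(appId, execId)) = dirs

        // Resolves a block to a file path using only the registered metadata.
        // The directory is picked by hashing the block id (an assumed layout).
        def resolve(appId: String, execId: String, blockId: String): Option[String] =
          localDirs.get(ExecutorKey(appId, execId)).filter(_.nonEmpty).map { dirs =>
            val idx = java.lang.Math.floorMod(blockId.hashCode, dirs.length)
            dirs(idx) + "/" + blockId
          }
      }
      ```

      Because the sketch depends only on the standard library, it reflects the constraint described above that the service must not pull in Spark's own dependencies.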
      
      There are several outstanding tasks which must be completed before this PR can be merged:
      - [x] Complete unit testing of network/shuffle package.
      - [x] Performance and correctness testing on a real cluster.
      - [x] Remove example service instantiation from Worker.scala.
      
      There are even more shortcomings of this PR which should be addressed in followup patches:
      - Don't use Java serializer for RPC layer! It is not cross-version compatible.
      - Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
      - Documentation of the feature in the Spark docs.
      - Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
      - SSL and SASL integration
      - Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3001 from aarondav/shuffle-service and squashes the following commits:
      
      4d1f8c1 [Aaron Davidson] Remove changes to Worker
      705748f [Aaron Davidson] Rename Standalone* to External*
      fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
      9883918 [Aaron Davidson] Make suggested build changes
      3d62679 [Aaron Davidson] Add Spark integration test
      7fe51d5 [Aaron Davidson] Fix SBT integration
      56caa50 [Aaron Davidson] Address comments
      c8d1ac3 [Aaron Davidson] Add unit tests
      2f70c0c [Aaron Davidson] Fix unit tests
      5483e96 [Aaron Davidson] Fix unit tests
      46a70bf [Aaron Davidson] Whoops, bracket
      5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
      f55218ae
    • Xiangrui Meng's avatar
      [SPARK-3569][SQL] Add metadata field to StructField · 1d4f3552
      Xiangrui Meng authored
      Add `metadata: Metadata` to `StructField` to store extra information about columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.
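
      A wrapper of this kind can be sketched as below. This is an illustration under stated assumptions, not the class added by the patch: `Meta`, `Meta.Builder`, and their methods are hypothetical names, array values are left out, and the JSON rendering is a hand-rolled simplification.

      ```scala
      // Hypothetical sketch: a Map[String, Any] whose value types are restricted
      // by exposing only typed put methods on the builder.
      final class Meta private (private val values: Map[String, Any]) {
        def getLong(key: String): Long = values(key).asInstanceOf[Long]
        def getString(key: String): String = values(key).asInstanceOf[String]

        // Serializes to JSON with keys sorted so the output is deterministic.
        def toJson: String =
          values.toSeq.sortBy(_._1)
            .map { case (k, v) => "\"" + k + "\":" + Meta.render(v) }
            .mkString("{", ",", "}")
      }

      object Meta {
        // Renders one value as JSON; arrays are omitted from this sketch.
        private def render(v: Any): String = v match {
          case s: String => "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
          case m: Meta   => m.toJson
          case other     => other.toString // Boolean, Long, Double
        }

        final class Builder {
          private var values = Map.empty[String, Any]

          def putBoolean(key: String, v: Boolean): this.type = put(key, v)
          def putLong(key: String, v: Long): this.type = put(key, v)
          def putDouble(key: String, v: Double): this.type = put(key, v)
          def putString(key: String, v: String): this.type = put(key, v)
          def putMeta(key: String, v: Meta): this.type = put(key, v)

          def build(): Meta = new Meta(values)

          private def put(key: String, v: Any): this.type = {
            values = values + (key -> v)
            this
          }
        }
      }
      ```

      The private constructor means a `Meta` can only be obtained through the builder, which is how the sketch keeps unsupported value types out of the map.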
      
      Metadata is preserved through simple operations like `SELECT`.
      
      marmbrus liancheng
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2701 from mengxr/structfield-metadata and squashes the following commits:
      
      dedda56 [Xiangrui Meng] merge remote
      5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
      886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
      589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      1e2abcf [Xiangrui Meng] change default value of metadata to None in python
      611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
      ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
      3f49aab [Xiangrui Meng] remove StructField.toString
      24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      473a7c5 [Xiangrui Meng] merge master
      c9d7301 [Xiangrui Meng] organize imports
      1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
      60cc131 [Xiangrui Meng] add doc and header
      60614c7 [Xiangrui Meng] add metadata
      e42c452 [Xiangrui Meng] merge master
      93518fb [Xiangrui Meng] support metadata in python
      905bb89 [Xiangrui Meng] java conversions
      618e349 [Xiangrui Meng] make tests work in scala
      61b8e0f [Xiangrui Meng] merge master
      7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
      c41a664 [Xiangrui Meng] merge master
      d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
      67fdebb [Xiangrui Meng] add test on join
      d65072e [Xiangrui Meng] remove Map.empty
      367d237 [Xiangrui Meng] add test
      c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
      1d4f3552
    • Aaron Davidson's avatar
      [SPARK-4183] Enable NettyBlockTransferService by default · 59e626c7
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
      59e626c7
    • Kevin Mader's avatar
      [SPARK-2759][CORE] Generic Binary File Support in Spark · 7136719b
      Kevin Mader authored
      This patch adds the abstract `BinaryFileInputFormat` and `BinaryRecordReader` classes for reading data as a byte stream and converting it to another format using the `def parseByteArray(inArray: Array[Byte]): T` function.
      As a trivial example, `ByteInputFormat` and `ByteRecordReader` are included, which just return the `Array[Byte]` from a given file.
      Finally, an RDD for `BinaryFileInputFormat` was added (to allow for easier partitioning changes, as was done for WholeFileInput), along with the corresponding `byteFiles` method on `SparkContext`, so the functions can be easily used by others.
      A common use case might be to read in a folder of files:
      ```
      sc.byteFiles("s3://mydrive/tif/*.tif").map(rawData => ReadTiffFromByteArray(rawData))
      ```
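
      The `parseByteArray` hook described above can be illustrated without Spark or Hadoop. In the sketch below, `ByteParser`, `RawBytes`, and `BigEndianInts` are hypothetical names that stand in for the record-reader classes; only the `parseByteArray` signature is taken from the description.

      ```scala
      import java.nio.ByteBuffer

      // Hypothetical stand-in for the abstract reader: subclasses decide how a
      // file's raw bytes are turned into a value of type T.
      abstract class ByteParser[T] {
        def parseByteArray(inArray: Array[Byte]): T
      }

      // Trivial case: hand back the bytes unchanged.
      object RawBytes extends ByteParser[Array[Byte]] {
        def parseByteArray(inArray: Array[Byte]): Array[Byte] = inArray
      }

      // Decodes the input as consecutive 4-byte big-endian integers,
      // ignoring any trailing bytes that do not fill a whole integer.
      object BigEndianInts extends ByteParser[Seq[Int]] {
        def parseByteArray(inArray: Array[Byte]): Seq[Int] = {
          val buf = ByteBuffer.wrap(inArray) // ByteBuffer is big-endian by default
          (0 until inArray.length / 4).map(i => buf.getInt(i * 4))
        }
      }
      ```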
      
      Author: Kevin Mader <kevinmader@gmail.com>
      Author: Kevin Mader <kmader@users.noreply.github.com>
      
      Closes #1658 from kmader/master and squashes the following commits:
      
      3c49a30 [Kevin Mader] fixing wholetextfileinput to it has the same setMinPartitions function as in BinaryData files
      359a096 [Kevin Mader] making the final corrections suggested by @mateiz and renaming a few functions to make their usage clearer
      6379be4 [Kevin Mader] reorganizing code
      7b9d181 [Kevin Mader] removing developer API, cleaning up imports
      8ac288b [Kevin Mader] fixed a single slightly over 100 character line
      92bda0d [Kevin Mader] added new tests, renamed files, fixed several of the javaapi functions, formatted code more nicely
      a32fef7 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
      49174d9 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
      c27a8f1 [Kevin Mader] jenkins crashed before running anything last time, so making minor change
      b348ce1 [Kevin Mader] fixed order in check (prefix only appears on jenkins not when I run unit tests locally)
      0588737 [Kevin Mader] filename check in "binary file input as byte array" test now ignores prefixes and suffixes which might get added by Hadoop
      4163e38 [Kevin Mader] fixing line length and output from FSDataInputStream to DataInputStream to minimize sensitivity to Hadoop API changes
      19812a8 [Kevin Mader] Fixed the serialization issue with PortableDataStream since neither CombineFileSplit nor TaskAttemptContext implement the Serializable interface, by using ByteArrays for storing both and then recreating the objects from these bytearrays as needed.
      238c83c [Kevin Mader] fixed several scala-style issues, changed structure of binaryFiles, removed excessive classes added new tests. The caching tests still have a serialization issue, but that should be easily fixed as well.
      932a206 [Kevin Mader] Update RawFileInput.scala
      a01c9cf [Kevin Mader] Update RawFileInput.scala
      441f79a [Kevin Mader] fixed a few small comments and dependency
      12e7be1 [Kevin Mader] removing imglib from maven (definitely not ready yet)
      5deb79e [Kevin Mader] added new portabledatastream to code so that it can be serialized correctly
      f032bc0 [Kevin Mader] fixed bug in path name, renamed tests
      bc5c0b9 [Kevin Mader] made minor stylistic adjustments from mateiz
      df8e528 [Kevin Mader] fixed line lengths and changed java test
      9a313d5 [Kevin Mader] making classes that needn't be public private, adding automatic file closure, adding new tests
      edf5829 [Kevin Mader] fixing line lengths, adding new lines
      f4841dc [Kevin Mader] un-optimizing imports, silly intellij
      eacfaa6 [Kevin Mader] Added FixedLengthBinaryInputFormat and RecordReader from freeman-lab and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
      1622935 [Kevin Mader] changing the line lengths to make jenkins happy
      1cfa38a [Kevin Mader] added apache headers, added datainputstream directly as an output option for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
      84035f1 [Kevin Mader] adding binary and byte file support spark
      81c5f12 [Kevin Mader] Merge pull request #1 from apache/master
      7136719b
    • luluorta's avatar
      [SPARK-4115][GraphX] Add overrided count for edge counting of EdgeRDD. · ee29ef38
      luluorta authored
      Accumulate sizes of all the EdgePartitions just like the VertexRDD.
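
      As a rough sketch of the idea, with a hypothetical `EdgePartitionStub` standing in for GraphX's edge partitions and a local `Seq` standing in for the RDD's partitions, the count is just the sum of the per-partition sizes, so no individual edge has to be visited:

      ```scala
      // Hypothetical stand-in for an edge partition that already knows its size.
      final case class EdgePartitionStub(srcIds: Array[Long], dstIds: Array[Long]) {
        def size: Int = srcIds.length
      }

      object EdgeCount {
        // Sums partition sizes as Long to avoid Int overflow on large graphs.
        def countEdges(partitions: Seq[EdgePartitionStub]): Long =
          partitions.map(_.size.toLong).sum
      }
      ```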
      
      Author: luluorta <luluorta@gmail.com>
      
      Closes #2975 from luluorta/graph-edge-count and squashes the following commits:
      
      86ef0e5 [luluorta] Add overrided count for edge counting of EdgeRDD.
      ee29ef38
    • Joseph E. Gonzalez's avatar
      [SPARK-4142][GraphX] Default numEdgePartitions · f4e0b28c
      Joseph E. Gonzalez authored
      Changing the default number of edge partitions to match spark parallelism.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3006 from jegonzal/default_partitions and squashes the following commits:
      
      a9a5c4f [Joseph E. Gonzalez] Changing the default number of edge partitions to match spark parallelism
      f4e0b28c
    • Daniel Lemire's avatar
      Upgrading to roaring 0.4.5 (bug fix release) · 680fd87c
      Daniel Lemire authored
      I recommend upgrading roaring to 0.4.5 as it fixes a rarely occurring bug in iterators (that would otherwise throw an unwarranted exception). The upgrade should have no other consequence.
      
      Author: Daniel Lemire <lemire@gmail.com>
      
      Closes #3044 from lemire/master and squashes the following commits:
      
      54018c5 [Daniel Lemire] Recommended update to roaring 0.4.5 (bug fix release)
      048933e [Daniel Lemire] Merge remote-tracking branch 'upstream/master'
      431f3a0 [Daniel Lemire] Recommended bug fix release
      680fd87c