  1. Apr 29, 2015
    • [SPARK-6862] [STREAMING] [WEBUI] Add BatchPage to display details of a batch · 1b7106b8
      zsxwing authored
      This is an initial commit for SPARK-6862. Once SPARK-6796 is merged, I will add the links to StreamingPage so that the user can jump to BatchPage.
      
      Screenshots:
      ![success](https://cloud.githubusercontent.com/assets/1000778/7102439/bbe75406-e0b3-11e4-84fe-3e6de629a49a.png)
      ![failure](https://cloud.githubusercontent.com/assets/1000778/7102440/bc124454-e0b3-11e4-921a-c8b39d6b61bc.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5473 from zsxwing/SPARK-6862 and squashes the following commits:
      
      0727d35 [zsxwing] Change BatchUIData to a case class
      b380cfb [zsxwing] Add createJobStart to eliminate duplicate codes
      9a3083d [zsxwing] Rename XxxDatas -> XxxData
      087ba98 [zsxwing] Refactor BatchInfo to store only necessary fields
      cb62e4f [zsxwing] Use Seq[(OutputOpId, SparkJobId)] to store the id relations
      72f8e7e [zsxwing] Add unit tests for BatchPage
      1282b10 [zsxwing] Handle some corner cases and add tests for StreamingJobProgressListener
      77a69ae [zsxwing] Refactor codes as per TD's comments
      35ffd80 [zsxwing] Merge branch 'master' into SPARK-6862
      15bdf9b [zsxwing] Add batch links and unit tests
      4bf66b6 [zsxwing] Merge branch 'master' into SPARK-6862
      7168807 [zsxwing] Limit the max width of the error message and fix nits in the UI
      0b226f9 [zsxwing] Change 'Last Error' to 'Error'
      fc98a43 [zsxwing] Put clearing local properties to finally and remove redundant private[streaming]
      0c7b2eb [zsxwing] Add BatchPage to display details of a batch
      1b7106b8
    • [SPARK-7176] [ML] Add validation functionality to Param · 114bad60
      Joseph K. Bradley authored
      Main change: Added isValid field to Param.  Modified all usages to use isValid when relevant.  Added helper methods in ParamValidate.
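
      A minimal sketch of what a validated Param declaration looks like, assuming the `ParamValidators` helper name from the rename noted in the commit list below; treat it as illustrative rather than the verbatim patch:

      ```scala
      import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

      trait HasMaxIter extends Params {
        // isValid rejects bad values when the Param is set, not when fit() runs
        val maxIter: IntParam = new IntParam(this, "maxIter",
          "maximum number of iterations (>= 0)", ParamValidators.gtEq(0))
      }
      ```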
      
      Also overrode Params.validate() in:
      * CrossValidator + model
      * Pipeline + model
      
      I made a few updates for the elastic net patch:
      * I changed "tol" to "convergenceTol"
      * I added some documentation
      
      This PR is Scala + Java only.  Python will be in a follow-up PR.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5740 from jkbradley/enforce-validate and squashes the following commits:
      
      ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
      76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
      af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg.  Fixed test failures.  Renamed ParamValidate to ParamValidators.  Removed explicit type from ParamValidators calls where possible.
      bb2665a [Joseph K. Bradley] merged with elastic net pr
      ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
      6895dfc [Joseph K. Bradley] small cleanups
      069ac6d [Joseph K. Bradley] many cleanups
      928fb84 [Joseph K. Bradley] Maybe done
      a910ac7 [Joseph K. Bradley] still workin
      6d60e2e [Joseph K. Bradley] Still workin
      b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
      dbc9fb2 [Joseph K. Bradley] merged with master.  enforcing Params.validate
      114bad60
    • [SQL] [Minor] Print detail query execution info when spark answer is not right · 1fdfdb47
      wangfei authored
      Print detailed query execution info, including the parsed/analyzed/optimized/physical plans, for a query when the Spark answer is not right.
      
      ```
      Results do not match for query:
      == Parsed Logical Plan ==
      'Aggregate ['x.str], ['x.str,SUM('x.strCount) AS c1#46]
       'Join Inner, Some(('x.str = 'y.str))
        'UnresolvedRelation [df], Some(x)
        'UnresolvedRelation [df], Some(y)
      
      == Analyzed Logical Plan ==
      Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
       Join Inner, Some((str#44 = str#51))
        Subquery x
         Subquery df
          Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
           Project [_1#41 AS int#43,_2#42 AS str#44]
            LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]
        Subquery y
         Subquery df
          Aggregate [str#51], [str#51,COUNT(str#51) AS strCount#47L]
           Project [_1#41 AS int#50,_2#42 AS str#51]
            LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]
      
      == Optimized Logical Plan ==
      Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
       Project [str#44,strCount#45L]
        Join Inner, Some((str#44 = str#51))
         Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
          LocalRelation [str#44], [[1],[2],[3]]
         Aggregate [str#51], [str#51]
          LocalRelation [str#51], [[1],[2],[3]]
      
      == Physical Plan ==
      Aggregate false, [str#44], [str#44,CombineSum(PartialSum#53L) AS c1#46L]
       Aggregate true, [str#44], [str#44,SUM(strCount#45L) AS PartialSum#53L]
        Project [str#44,strCount#45L]
         BroadcastHashJoin [str#44], [str#51], BuildRight
          Aggregate false, [str#44], [str#44,Coalesce(SUM(PartialCount#55L),0) AS strCount#45L]
           Exchange (HashPartitioning [str#44], 5), []
            Aggregate true, [str#44], [str#44,COUNT(str#44) AS PartialCount#55L]
             LocalTableScan [str#44], [[1],[2],[3]]
          Aggregate false, [str#51], [str#51]
           Exchange (HashPartitioning [str#51], 5), []
            Aggregate true, [str#51], [str#51]
             LocalTableScan [str#51], [[1],[2],[3]]
      
      Code Generation: false
      == RDD ==
      == Results ==
      !== Correct Answer - 3 ==   == Spark Answer - 3 ==
       [1,1]                      [1,1]
      ![2,3]                      [2,1]
       [3,1]                      [3,1]
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #5774 from scwf/checkanswer and squashes the following commits:
      
      5be6f78 [wangfei] print detail query execution info when Spark Answer is not right
      1fdfdb47
    • [SPARK-7259] [ML] VectorIndexer: do not copy non-ML metadata to output column · b1ef6a60
      Joseph K. Bradley authored
      Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column.  Removed ml.util.TestingUtils since VectorIndexer was its only user.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5789 from jkbradley/vector-indexer-metadata and squashes the following commits:
      
      b28e159 [Joseph K. Bradley] Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column.  Removed ml.util.TestingUtils since VectorIndexer was the only use.
      b1ef6a60
    • [SPARK-7229] [SQL] SpecificMutableRow should take integer type as internal representation for Date · f8cbb0a4
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5772 from chenghao-intel/specific_row and squashes the following commits:
      
      2cd064d [Cheng Hao] scala style issue
      60347a2 [Cheng Hao] SpecificMutableRow should take integer type as internal representation for DateType
      f8cbb0a4
    • [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input · 3fc6cfd0
      yongtang authored
      See JIRA: https://issues.apache.org/jira/browse/SPARK-7155
      
      SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
      ```scala
      sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      ```
      will throw
      ```
      org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
      ```
      However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process a comma-separated list of files correctly.
      
      That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.
      
      This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input.
      
      A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile().
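
      A minimal usage sketch of the aligned behavior (paths are illustrative): after this change, all three APIs accept the same comma-separated list.

      ```scala
      import org.apache.hadoop.io.{LongWritable, Text}
      import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
      import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

      val paths = "/root/file1.txt,/root/file2.txt"

      sc.textFile(paths).count()
      sc.hadoopFile[LongWritable, Text, OldTextInputFormat](paths).count()
      sc.newAPIHadoopFile(paths, classOf[NewTextInputFormat],
        classOf[LongWritable], classOf[Text]).count()
      ```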
      
      Note: The contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: yongtang <yongtang@users.noreply.github.com>
      
      Closes #5708 from yongtang/SPARK-7155 and squashes the following commits:
      
      654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary.
      26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
      73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input.
      3fc6cfd0
    • [SPARK-7181] [CORE] fix infinite loop in ExternalSorter's mergeWithAggregation · 7f4b5837
      Qiping Li authored
      see [SPARK-7181](https://issues.apache.org/jira/browse/SPARK-7181).
      
      Author: Qiping Li <liqiping1991@gmail.com>
      
      Closes #5737 from chouqin/externalsorter and squashes the following commits:
      
      2924b93 [Qiping Li] fix infinite loop in ExternalSorter's mergeWithAggregation
      7f4b5837
    • [SPARK-7156][SQL] support RandomSplit in DataFrames · d7dbce8f
      Burak Yavuz authored
      This is built on top of kaka1992's PR #5711, using logical plans.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5761 from brkyvz/random-sample and squashes the following commits:
      
      a1fb0aa [Burak Yavuz] remove unrelated file
      69669c3 [Burak Yavuz] fix broken test
      1ddb3da [Burak Yavuz] copy base
      6000328 [Burak Yavuz] added python api and fixed test
      3c11d1b [Burak Yavuz] fixed broken test
      f400ade [Burak Yavuz] fix build errors
      2384266 [Burak Yavuz] addressed comments v0.1
      e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
      d7dbce8f
    • [SPARK-6529] [ML] Add Word2Vec transformer · c9d530e2
      Xusen Yin authored
      See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).
      
      There are some notes:
      
      1. I added `learningRate` to sharedParams since it is a common parameter for ML algorithms.
      2. We will not support finding synonyms from a `Vector` in transform yet; that will be supported in future JIRA issues.
      3. Word2Vec differs from other ML models in that its training set and transformed set are different: its training set is an `RDD[Iterable[String]]` representing documents, but the set we want to transform is an `RDD[String]` representing unique words. So you have to switch your `inputCol` between these two stages (see the sketch after this list).
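
      A hedged sketch of the two-stage `inputCol` switch from note 3; the column names and data here are assumptions for illustration, not taken from this patch:

      ```scala
      import org.apache.spark.ml.feature.Word2Vec

      // assumed inputs: `docs` holds documents (sequences of words),
      // `words` holds the individual words we want vectors for
      val docs  = sqlContext.createDataFrame(
        Seq(Tuple1(Seq("spark", "streaming", "ui")))).toDF("document")
      val words = sqlContext.createDataFrame(
        Seq(Tuple1(Seq("spark")))).toDF("word")

      val model = new Word2Vec()
        .setInputCol("document")  // training stage reads documents
        .setOutputCol("vector")
        .fit(docs)

      // transform stage reads words, so the inputCol is switched
      val vectors = model.setInputCol("word").transform(words)
      ```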
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5596 from yinxusen/SPARK-6529 and squashes the following commits:
      
      ee2b37a [Xusen Yin] merge with former HEAD
      4945462 [Xusen Yin] merge with #5626
      3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
      5dd4ee7 [Xusen Yin] fix scala style
      743e0d5 [Xusen Yin] fix comments and code style
      04c48e9 [Xusen Yin] ensure the functionality
      a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
      02848fa [Xusen Yin] refine comments
      34a55c0 [Xusen Yin] fix errors
      109d124 [Xusen Yin] add test suite and pass it
      04dde06 [Xusen Yin] add shared params
      c594095 [Xusen Yin] add word2vec transformer
      23d77fa [Xusen Yin] merge with #5626
      e8cfaf7 [Xusen Yin] fix conflict with master
      66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
      566ec20 [Xusen Yin] fix scala style
      b54399f [Xusen Yin] fix comments and code style
      1211e86 [Xusen Yin] ensure the functionality
      6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
      7cde18f [Xusen Yin] rm sharedParams
      618abd0 [Xusen Yin] refine comments
      e29680a [Xusen Yin] fix errors
      fe3afe9 [Xusen Yin] add test suite and pass it
      02767fb [Xusen Yin] add shared params
      6a514f1 [Xusen Yin] add word2vec transformer
      c9d530e2
    • [SPARK-7222] [ML] Added mathematical derivation in comment and compressed the... · 15995c88
      DB Tsai authored
      [SPARK-7222] [ML] Added mathematical derivation in comment and compressed the model, removed the correction terms in LinearRegression with ElasticNet
      
      Added a detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Refactored the code so the model is compressed based on storage; we may later try compression based on prediction time.
      
      Also, I found that diffSum is always zero mathematically, so no correction terms are required.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #5767 from dbtsai/lir-doc and squashes the following commits:
      
      5e346c9 [DB Tsai] refactoring
      fc9f582 [DB Tsai] doc
      58456d8 [DB Tsai] address feedback
      69757b8 [DB Tsai] actually diffSum is mathematically zero! No correction is needed.
      5929e49 [DB Tsai] typo
      63f7d1e [DB Tsai] Added compression to the model based on storage
      203a295 [DB Tsai] Add more documentation to LinearRegression in new ML framework.
      15995c88
    • [SPARK-6629] cancelJobGroup() may not work for jobs whose job groups are... · 3a180c19
      Josh Rosen authored
      [SPARK-6629] cancelJobGroup() may not work for jobs whose job groups are inherited from parent threads
      
      When a job is submitted with a job group and that job group is inherited from a parent thread, there are multiple bugs that may prevent this job from being canceled via `SparkContext.cancelJobGroup()`:
      
      - When filtering jobs based on their job group properties, DAGScheduler calls `get()` instead of `getProperty()`, which does not respect inheritance, so it will skip over jobs whose job group properties were inherited.
      - `Properties` objects are mutable, but we do not make defensive copies / snapshots, so modifications of the parent thread's job group will cause running jobs' groups to change; this also breaks cancellation.
      
      Both of these issues are easy to fix: use `getProperty()` and perform defensive copying.
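
      A minimal sketch of both fixes, assuming `props` is the submitting thread's local-properties object (`spark.jobGroup.id` is the real property key; the helper names are mine):

      ```scala
      import java.util.Properties

      // Fix 1: getProperty() consults inherited defaults; get() only sees
      // entries set directly on this thread's Properties object.
      def jobGroup(props: Properties): Option[String] =
        Option(props.getProperty("spark.jobGroup.id"))

      // Fix 2: snapshot the properties at submission time so later mutations
      // by the parent thread cannot change a running job's group.
      def snapshot(props: Properties): Properties = {
        val copy = new Properties()
        val names = props.stringPropertyNames().iterator() // includes defaults
        while (names.hasNext) {
          val k = names.next()
          copy.setProperty(k, props.getProperty(k))
        }
        copy
      }
      ```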
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5288 from JoshRosen/localProperties-mutability-race and squashes the following commits:
      
      9e29654 [Josh Rosen] Fix style issue
      5d90750 [Josh Rosen] Merge remote-tracking branch 'origin/master' into localProperties-mutability-race
      3f7b9e8 [Josh Rosen] Add JIRA reference; move clone into DAGScheduler
      707e417 [Josh Rosen] Clone local properties to prevent mutations from breaking job cancellation.
      b376114 [Josh Rosen] Fix bug that prevented jobs with inherited job group properties from being cancelled.
      3a180c19
    • [SPARK-6752] [STREAMING] [REOPENED] Allow StreamingContext to be recreated... · a9c4e299
      Tathagata Das authored
      [SPARK-6752] [STREAMING] [REOPENED] Allow StreamingContext to be recreated from checkpoint and existing SparkContext
      
      Original PR #5428 got reverted due to issues between MutableBoolean and Hadoop 1.0.4 (see JIRA). This replaces MutableBoolean with AtomicBoolean.
      
      srowen pwendell
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5773 from tdas/SPARK-6752 and squashes the following commits:
      
      a0c0ead [Tathagata Das] Fix for hadoop 1.0.4
      70ae85b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-6752
      94db63c [Tathagata Das] Fix long line.
      524f519 [Tathagata Das] Many changes based on PR comments.
      eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
      36a7823 [Tathagata Das] Minor changes.
      204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext
      a9c4e299
    • [SPARK-7056] [STREAMING] Make the Write Ahead Log pluggable · 1868bd40
      Tathagata Das authored
      Users may want the WAL data to be written to non-HDFS data storage systems. To allow that, we have to make the WAL pluggable. The following design doc outlines the plan.
      
      https://docs.google.com/a/databricks.com/document/d/1A2XaOLRFzvIZSi18i_luNw5Rmm9j2j4AigktXxIYxmY/edit?usp=sharing
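
      For reference, a hedged sketch of what the pluggable contract looks like; the names follow this PR's own notes (e.g. "renamed segment to record handle everywhere") but should be read as an approximation, not the verbatim interface:

      ```scala
      import java.nio.ByteBuffer
      import java.util.{Iterator => JIterator}

      // opaque pointer returned by write() and handed back to read()
      abstract class WriteAheadLogRecordHandle extends java.io.Serializable

      abstract class WriteAheadLog {
        def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle
        def read(handle: WriteAheadLogRecordHandle): ByteBuffer
        def readAll(): JIterator[ByteBuffer]
        def clean(threshTime: Long, waitForCompletion: Boolean): Unit
        def close(): Unit
      }
      ```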
      
      Things to add:
      * Unit tests for WriteAheadLogUtils
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5645 from tdas/wal-pluggable and squashes the following commits:
      
      2c431fd [Tathagata Das] Minor fixes.
      c2bc7384 [Tathagata Das] More changes based on PR comments.
      569a416 [Tathagata Das] fixed long line
      bde26b1 [Tathagata Das] Renamed segment to record handle everywhere
      b65e155 [Tathagata Das] More changes based on PR comments.
      d7cd15b [Tathagata Das] Fixed test
      1a32a4b [Tathagata Das] Fixed test
      e0d19fb [Tathagata Das] Fixed defaults
      9310cbf [Tathagata Das] style fix.
      86abcb1 [Tathagata Das] Refactored WriteAheadLogUtils, and consolidated all WAL related configuration into it.
      84ce469 [Tathagata Das] Added unit test and fixed compilation error.
      bce5e75 [Tathagata Das] Fixed long lines.
      837c4f5 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into wal-pluggable
      754fbf8 [Tathagata Das] Added license and docs.
      09bc6fe [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into wal-pluggable
      7dd2d4b [Tathagata Das] Added pluggable WriteAheadLog interface, and refactored all code along with it
      1868bd40
    • Fix a typo of "threshold" · c0c0ba6d
      Xusen Yin authored
      mengxr
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5769 from yinxusen/patch-1 and squashes the following commits:
      
      43235f4 [Xusen Yin] Update PearsonCorrelation.scala
      f7287ee [Xusen Yin] Fix a typo of "threshold"
      c0c0ba6d
    • [SQL][Minor] fix java doc for DataFrame.agg · 81ea42bf
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5712 from cloud-fan/minor and squashes the following commits:
      
      be23064 [Wenchen Fan] fix java doc for DataFrame.agg
      81ea42bf
    • Better error message on access to non-existing attribute · 3df9c5dd
      ksonj authored
      I believe column access via `__getattr__` is bad and shouldn't be implicitly encouraged by the error message when accessing a non-existing attribute on DataFrame. This patch changes the error message from 'no such column'  to the more generic 'no such attribute', which is also what Pandas DFs will throw.
      
      Author: ksonj <kson@siberie.de>
      
      Closes #5771 from ksonj/master and squashes the following commits:
      
      bcc2220 [ksonj] Better error message on access to non-existing attribute
      3df9c5dd
    • [SPARK-7223] Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask. · 687273d9
      Reynold Xin authored
      The old naming scheme was very confusing between askWithReply and sendWithReply. I also divided RpcEnv.scala into multiple files.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5768 from rxin/rpc-rename and squashes the following commits:
      
      a84058e [Reynold Xin] [SPARK-7223] Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask.
      687273d9
    • [SPARK-6918] [YARN] Secure HBase support. · baed3f2c
      Dean Chen authored
      Obtain HBase security token with Kerberos credentials locally to be sent to executors. Tested on eBay's secure HBase cluster.
      
      Similar to obtainTokenForNamenodes, and fails gracefully if HBase classes are not included in the classpath.
      
      Requires hbase-site.xml to be in the classpath (typically via the conf dir) for the ZooKeeper configuration. Should that go in the docs somewhere? I did not see an HBase section.
      
      Author: Dean Chen <deanchen5@gmail.com>
      
      Closes #5586 from deanchen/master and squashes the following commits:
      
      0c190ef [Dean Chen] [SPARK-6918][YARN] Secure HBase support.
      baed3f2c
    • [SPARK-7076][SPARK-7077][SPARK-7080][SQL] Use managed memory for aggregations · f49284b5
      Josh Rosen authored
      This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory.  This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses.  As a result, this allows for higher memory utilization.  It can also result in better cache locality since objects will be stored closer together in memory.
      
      This feature can be enabled by setting `spark.sql.unsafe.enabled=true`.  For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.
      
      ### Managing memory with sun.misc.Unsafe
      
      This patch supports both on- and off-heap managed memory.
      
      - In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
      - In off-heap mode, memory is addressed directly with 64-bit long addresses.
      
      To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields.  In off-heap mode, we simply pass `null` as `baseObject`.
      
      We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.
      
      By default, we use on-heap mode.  To enable off-heap mode, set `spark.unsafe.offHeap=true`.
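
      A small illustration of the dual addressing convention described above (an assumption-level sketch, not the project's actual classes):

      ```scala
      // (baseObject, baseOffset) addresses memory in both modes:
      //   on-heap : baseObject is a JVM object, baseOffset is relative to it
      //   off-heap: baseObject is null, baseOffset is an absolute 64-bit address
      final case class MemoryLocation(baseObject: AnyRef, baseOffset: Long)

      val page    = new Array[Long](1024)
      val onHeap  = MemoryLocation(page, 16L)             // 16 bytes into `page`
      val offHeap = MemoryLocation(null, 0x7f0000001000L) // illustrative raw address
      ```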
      
      To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`.  These classes work together to track allocations and detect memory leaks.
      
      ### Compact tuple format
      
      This patch introduces `UnsafeRow`, a compact row layout.  In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)
      
      - Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
      - The bit set is used for null tracking:
      	- Position _i_ is set if and only if field _i_ is null
      	- The bit set is aligned to an 8-byte word boundary.
      - Every field appears as an 8-byte word in the fixed-length values part:
      	- If a field is null, we zero out the values.
      	- If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
      - Each variable-length data type can have its own encoding:
      	- For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes.  If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.
      
      For example, a tuple that consists of 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)
      
      This format allows us to compare tuples for equality by directly comparing their raw bytes.  This also enables fast hashing of tuples.
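
      A back-of-the-envelope check of the layout rules for the (int, string, string) example; the sizes computed here follow the description above and are an illustration, not output from the actual code:

      ```scala
      // null bit set: 3 null bits fit in one 8-byte word
      val bitSetBytes = 8
      // fixed-length region: one 8-byte word per field
      val fixedBytes = 3 * 8
      // each string: one word for the length, then UTF-8 bytes padded to 8
      def roundUp(n: Int): Int = (n + 7) / 8 * 8
      val dataBytes   = 8 + roundUp("data".getBytes("UTF-8").length)   // 8 + 8
      val bricksBytes = 8 + roundUp("bricks".getBytes("UTF-8").length) // 8 + 8

      // field 0 is null, so its bit is set and its fixed-length word is zeroed;
      // fields 1 and 2 hold offsets (relative to the row base) into the
      // variable-length region
      val rowBytes = bitSetBytes + fixedBytes + dataBytes + bricksBytes // 64
      ```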
      
      ### Hash map for performing aggregations
      
      This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-width.  This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.
      
      `BytesToBytesMap` stores pointers to key and value tuples.  For each record with a new key, we copy the key, create the aggregation value buffer for that key, and put both into the map; the hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.
      
      This map is implemented using open hashing with triangular sequence probing.  Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode.  By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions (since the chance of hashcode collision is much lower than that of position collision).
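
      A compact sketch of triangular sequence probing, assuming a power-of-two capacity (illustrative, not the project's code); with such a capacity the sequence visits every slot exactly once:

      ```scala
      // probe positions: pos, pos+1, pos+1+2, pos+1+2+3, ... (mod capacity)
      def probeSequence(hash: Int, capacity: Int): Iterator[Int] = {
        require((capacity & (capacity - 1)) == 0, "capacity must be a power of two")
        var pos  = hash & (capacity - 1)
        var step = 1
        Iterator.continually {
          val current = pos
          pos = (pos + step) & (capacity - 1)
          step += 1
          current
        }
      }

      // probeSequence(42, 8).take(8).toList hits all 8 slots with no repeats
      ```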
      
      `UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map.  Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`.  This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5725 from JoshRosen/unsafe and squashes the following commits:
      
      eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns.
      81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate
      1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing.
      017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize()
      50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication
      70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager:
      6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods.
      de5e001 [Josh Rosen] Fix debug vs. trace in logging message.
      a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention.
      78a5b84 [Josh Rosen] Add logging to MemoryManager
      ce3c565 [Josh Rosen] More comments, formatting, and code cleanup.
      529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds.
      3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace
      162caf7 [Josh Rosen] Fix test compilation
      b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size.
      a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv.
      0925847 [Josh Rosen] Disable MiMa checks for new unsafe module
      cde4132 [Josh Rosen] Add missing pom.xml
      9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap
      6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow.
      31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
      a95291e [Josh Rosen] Cleanups to string handling code
      afe8dca [Josh Rosen] Some Javadoc cleanup
      f3dcbfe [Josh Rosen] More mod replacement
      854201a [Josh Rosen] Import and comment cleanup
      06e929d [Josh Rosen] More warning cleanup
      ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections
      29a7575 [Josh Rosen] Remove debug logging
      49aed30 [Josh Rosen] More long -> int conversion.
      b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation.
      765243d [Josh Rosen] Enable optional performance metrics for hash map.
      23a440a [Josh Rosen] Bump up default hash map size
      628f936 [Josh Rosen] Use ints intead of longs for indexing.
      92d5a06 [Josh Rosen] Address a number of minor code review comments.
      1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests.
      d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap
      bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...)
      b3eaccd [Josh Rosen] Extract aggregation map into its own class.
      d2bb986 [Josh Rosen] Update to implement new Row methods added upstream
      58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable)
      7df6008 [Josh Rosen] Optimizations related to zeroing out memory:
      c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free():
      738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate
      c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed.
      62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns.
      c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter
      ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort"
      c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup
      f764d13 [Josh Rosen] Simplify address + length calculation in Location.
      079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract.
      1a483c5 [Josh Rosen] First version that passes some aggregation tests:
      fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate
      53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters
      1ff814d [Josh Rosen] Add reminder to free memory on iterator completion
      8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration.
      5d55cef [Josh Rosen] Add skeleton for Row implementation.
      f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods.
      ab68e08 [Josh Rosen] Begin merging the UTF8String implementations.
      480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
      f49284b5
    • [SPARK-7204] [SQL] Fix callSite for Dataframe and SQL operations · 1fd6ed9a
      Patrick Wendell authored
      This patch adds SQL to the set of excluded libraries when
      generating a callSite. This makes the callSite mechanism work
      properly for the data frame API. I also added a small improvement for
      JDBC queries where we just use the string "Spark JDBC Server Query"
      instead of trying to give a callsite that doesn't make any sense
      to the user.
      
      Before (DF):
      ![screen shot 2015-04-28 at 1 29 26 pm](https://cloud.githubusercontent.com/assets/320616/7380170/ef63bfb0-edae-11e4-989c-f88a5ba6bbee.png)
      
      After (DF):
      ![screen shot 2015-04-28 at 1 34 58 pm](https://cloud.githubusercontent.com/assets/320616/7380181/fa7f6d90-edae-11e4-9559-26f163ed63b8.png)
      
      After (JDBC):
      ![screen shot 2015-04-28 at 2 00 10 pm](https://cloud.githubusercontent.com/assets/320616/7380185/02f5b2a4-edaf-11e4-8e5b-99bdc3df66dd.png)
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #5757 from pwendell/dataframes and squashes the following commits:
      
      0d931a4 [Patrick Wendell] Attempting to fix PySpark tests
      85bf740 [Patrick Wendell] [SPARK-7204] Fix callsite for dataframe operations.
      1fd6ed9a
    • [SPARK-7188] added python support for math DataFrame functions · fe917f5e
      Burak Yavuz authored
      Adds support for the math functions for DataFrames in PySpark.
      
      rxin I love Davies.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5750 from brkyvz/python-math-udfs and squashes the following commits:
      
      7c4f563 [Burak Yavuz] removed is_math
      3c4adde [Burak Yavuz] cleanup imports
      d5dca3f [Burak Yavuz] moved math functions to mathfunctions
      25e6534 [Burak Yavuz] addressed comments v2.0
      d3f7e0f [Burak Yavuz] addressed comments and added tests
      7b7d7c4 [Burak Yavuz] remove tests for removed methods
      33c2c15 [Burak Yavuz] fixed python style
      3ee0c05 [Burak Yavuz] added python functions
      fe917f5e
    • MAINTENANCE: Automated closing of pull requests. · 8dee2746
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3205 (close requested by 'srowen')
      Closes #5478 (close requested by 'andrewor14')
      Closes #4910 (close requested by 'srowen')
      Closes #5080 (close requested by 'marmbrus')
      Closes #537 (close requested by 'srowen')
      Closes #5691 (close requested by 'srowen')
      Closes #5469 (close requested by 'marmbrus')
      8dee2746
    • [SPARK-7205] Support `.ivy2/local` and `.m2/repositories/` in --packages · f98773a9
      Burak Yavuz authored
      In addition, I made a small change that will allow users to import 2 different artifacts with the same name. Cached artifacts are now named `[organization]_[artifact]-[revision].[ext]`; this used to be only `[artifact].[ext]`, which might have caused collisions between artifacts with the same artifactId but different groupIds.
      
      cc pwendell
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5755 from brkyvz/local-caches and squashes the following commits:
      
      c47c9c5 [Burak Yavuz] Small fixes to --packages
      f98773a9
    • [SPARK-7215] made coalesce and repartition a part of the query plan · 271c4c62
      Burak Yavuz authored
      Coalesce and repartition now show up as part of the query plan, rather than resulting in a new `DataFrame`.
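
      A quick way to see the difference (hypothetical usage; the exact operator names printed may differ):

      ```scala
      val df = sqlContext.range(0, 1000).toDF("id") // any DataFrame will do
      df.repartition(10).explain() // repartition now appears as a plan operator
      df.coalesce(2).explain()     // likewise for coalesce
      ```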
      
      cc rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5762 from brkyvz/df-repartition and squashes the following commits:
      
      b1e76dd [Burak Yavuz] added documentation on repartitions
      5807e35 [Burak Yavuz] renamed coalescepartitions
      fa4509f [Burak Yavuz] rename coalesce
      2c349b5 [Burak Yavuz] address comments
      f2e6af1 [Burak Yavuz] add ticks
      686c90b [Burak Yavuz] made coalesce and repartition a part of the query plan
      271c4c62
  2. Apr 28, 2015
    • [SPARK-6756] [MLLIB] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector · 5ef006fc
      Xiangrui Meng authored
      Add `compressed` to `Vector` with some other methods: `numActives`, `numNonzeros`, `toSparse`, and `toDense`. jkbradley
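
      A quick illustration of the new methods; the expected values in comments are my reading of the description, not tool output:

      ```scala
      import org.apache.spark.mllib.linalg.Vectors

      val v = Vectors.dense(0.0, 3.0, 0.0, 0.0)
      v.numActives          // 4: dense vectors store every entry explicitly
      v.numNonzeros         // 1: only one entry is nonzero
      val sv = v.toSparse   // sparse copy: size 4, index 1 -> 3.0
      val dv = sv.toDense   // back to a dense copy
      val c  = v.compressed // whichever representation uses less storage (sparse here)
      ```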
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5756 from mengxr/SPARK-6756 and squashes the following commits:
      
      8d4ecbd [Xiangrui Meng] address comment and add mima excludes
      da54179 [Xiangrui Meng] add toSparse, toDense, numActives, numNonzeros, and compressed to Vector
      5ef006fc
    • [SPARK-7208] [ML] [PYTHON] Added Matrix, SparseMatrix to __all__ list in linalg.py · a8aeadb7
      Joseph K. Bradley authored
      Added Matrix, SparseMatrix to __all__ list in linalg.py
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5759 from jkbradley/SPARK-7208 and squashes the following commits:
      
      deb51a2 [Joseph K. Bradley] Added Matrix, SparseMatrix to __all__ list in linalg.py
      a8aeadb7
    • [SPARK-7138] [STREAMING] Add method to BlockGenerator to add multiple records... · 5c8f4bd5
      Tathagata Das authored
      [SPARK-7138] [STREAMING] Add method to BlockGenerator to add multiple records to BlockGenerator with single callback
      
      This ensures that receivers that receive data in small batches (like Kinesis) can add them all but have the callback function called only once. This is for internal use only, for an improvement to the Kinesis Receiver that we are planning to do.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5695 from tdas/SPARK-7138 and squashes the following commits:
      
      a35cf7d [Tathagata Das] Fixed style.
      a7a4cb9 [Tathagata Das] Added extra method to BlockGenerator.
      5c8f4bd5
    • [SPARK-6965] [MLLIB] StringIndexer handles numeric input. · d36e6735
      Xiangrui Meng authored
      Cast numeric types to String for indexing. Boolean type is not handled in this PR. jkbradley
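
      A small sketch of the behavior; column names and data are illustrative:

      ```scala
      import org.apache.spark.ml.feature.StringIndexer

      // numeric labels are cast to strings before being indexed
      val df = sqlContext.createDataFrame(
        Seq((0, 100.0), (1, 200.0), (2, 100.0))).toDF("id", "label")

      val indexed = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("labelIndex")
        .fit(df)
        .transform(df)
      // 100.0 (most frequent) maps to index 0.0, 200.0 to 1.0
      ```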
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5753 from mengxr/SPARK-6965 and squashes the following commits:
      
      2e34f3c [Xiangrui Meng] add actual type in the error message
      ad938bf [Xiangrui Meng] StringIndexer handles numeric input.
      d36e6735
    • Closes #4807 · 555213eb
      Xiangrui Meng authored
      Closes #5055
      Closes #3583
      555213eb
    • [SPARK-7201] [MLLIB] move Identifiable to ml.util · f0a1f90f
      Xiangrui Meng authored
      It shouldn't live directly under `spark.ml`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5749 from mengxr/SPARK-7201 and squashes the following commits:
      
      53847f9 [Xiangrui Meng] move Identifiable to ml.util
      f0a1f90f
    • [MINOR] [CORE] Warn users who try to cache RDDs with dynamic allocation on. · 28b1af74
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5751 from vanzin/cached-rdd-warning and squashes the following commits:
      
      554cc07 [Marcelo Vanzin] Change message.
      9efb9da [Marcelo Vanzin] [minor] [core] Warn users who try to cache RDDs with dynamic allocation on.
      28b1af74
    • [SPARK-5338] [MESOS] Add cluster mode support for Mesos · 53befacc
      Timothy Chen authored
      This patch adds the support for cluster mode to run on Mesos.
      It introduces a new Mesos framework dedicated to launching new apps/drivers, and it can be used with the spark-submit script by pointing the --master flag at the cluster-mode REST interface instead of at the Mesos master.
      
      Example:
      ./bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --master mesos://10.0.0.206:8077 --executor-memory 1G --total-executor-cores 100 examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar 30
      
      Part of this patch is also to abstract the StandaloneRestServer so it can have different implementations of the REST endpoints.
      
      Features of the cluster mode in this PR:
      - Supports supervise mode, where the scheduler will keep trying to reschedule an exited job.
      - Adds a new UI for the cluster mode scheduler to see all the running jobs, finished jobs, and supervised jobs waiting to be retried.
      - Supports state persistence to ZooKeeper, so when the cluster scheduler fails over it can pick up all the queued and running jobs.
      
      Author: Timothy Chen <tnachen@gmail.com>
      Author: Luc Bourlier <luc.bourlier@typesafe.com>
      
      Closes #5144 from tnachen/mesos_cluster_mode and squashes the following commits:
      
      069e946 [Timothy Chen] Fix rebase.
      e24b512 [Timothy Chen] Persist submitted driver.
      390c491 [Timothy Chen] Fix zk conf key for mesos zk engine.
      e324ac1 [Timothy Chen] Fix merge.
      fd5259d [Timothy Chen] Address review comments.
      1553230 [Timothy Chen] Address review comments.
      c6c6b73 [Timothy Chen] Pass spark properties to mesos cluster tasks.
      f7d8046 [Timothy Chen] Change app name to spark cluster.
      17f93a2 [Timothy Chen] Fix head of line blocking in scheduling drivers.
      6ff8e5c [Timothy Chen] Address comments and add logging.
      df355cd [Timothy Chen] Add metrics to mesos cluster scheduler.
      20f7284 [Timothy Chen] Address review comments
      7252612 [Timothy Chen] Fix tests.
      a46ad66 [Timothy Chen] Allow zk cli param override.
      920fc4b [Timothy Chen] Fix scala style issues.
      862b5b5 [Timothy Chen] Support asking driver status when it's retrying.
      7f214c2 [Timothy Chen] Fix RetryState visibility
      e0f33f7 [Timothy Chen] Add supervise support and persist retries.
      371ce65 [Timothy Chen] Handle cluster mode recovery and state persistence.
      3d4dfa1 [Luc Bourlier] Adds support to kill submissions
      febfaba [Timothy Chen] Bound the finished drivers in memory
      543a98d [Timothy Chen] Schedule multiple jobs
      6887e5e [Timothy Chen] Support looking at SPARK_EXECUTOR_URI env variable in schedulers
      8ec76bc [Timothy Chen] Fix Mesos dispatcher UI.
      d57d77d [Timothy Chen] Add documentation
      825afa0 [Luc Bourlier] Supports more spark-submit parameters
      b8e7181 [Luc Bourlier] Adds a shutdown latch to keep the deamon running
      0fa7780 [Luc Bourlier] Launch task through the mesos scheduler
      5b7a12b [Timothy Chen] WIP: Making a cluster mode a mesos framework.
      4b2f5ef [Timothy Chen] Specify user jar in command to be replaced with local.
      e775001 [Timothy Chen] Support fetching remote uris in driver runner.
      7179495 [Timothy Chen] Change Driver page output and add logging
      880bc27 [Timothy Chen] Add Mesos Cluster UI to display driver results
      9986731 [Timothy Chen] Kill drivers when shutdown
      67cbc18 [Timothy Chen] Rename StandaloneRestClient to RestClient and add sbin scripts
      e3facdd [Timothy Chen] Add Mesos Cluster dispatcher
      53befacc
    • [SPARK-6314] [CORE] handle JsonParseException for history server · 80098109
      Zhang, Liye authored
      This is handled in the same way as [SPARK-6197](https://issues.apache.org/jira/browse/SPARK-6197). The result of this PR is that the exception shown in the history server log will be replaced by a warning, and applications with incomplete history log files will be listed on the history server web UI.
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #5736 from liyezhang556520/SPARK-6314 and squashes the following commits:
      
      b8d2d88 [Zhang, Liye] handle JsonParseException for history server
      80098109
    • [SPARK-5932] [CORE] Use consistent naming for size properties · 2d222fb3
      Ilya Ganelin authored
      I've added an interface to JavaUtils to do byte conversion and added hooks within Utils.scala to handle conversion within Spark code (like for time strings). I've added matching tests for size conversion, and then updated all deprecated configs and documentation as per SPARK-5933.
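
      An illustration of the suffixed-size parsing this adds; the helper names follow the patch description but should be treated as assumptions:

      ```scala
      import org.apache.spark.network.util.JavaUtils

      // units are binary (kibi/mebi/gibi), per the doc note in the commits below
      JavaUtils.byteStringAsBytes("1m") // 1048576
      JavaUtils.byteStringAsKb("1m")    // 1024
      JavaUtils.byteStringAsMb("1g")    // 1024
      ```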
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5574 from ilganeli/SPARK-5932 and squashes the following commits:
      
      11f6999 [Ilya Ganelin] Nit fixes
      49a8720 [Ilya Ganelin] Whitespace fix
      2ab886b [Ilya Ganelin] Scala style
      fc85733 [Ilya Ganelin] Got rid of floating point math
      852a407 [Ilya Ganelin] [SPARK-5932] Added much improved overflow handling. Can now handle sizes up to Long.MAX_VALUE Petabytes instead of being capped at Long.MAX_VALUE Bytes
      9ee779c [Ilya Ganelin] Simplified fraction matches
      22413b1 [Ilya Ganelin] Made MAX private
      3dfae96 [Ilya Ganelin] Fixed some nits. Added automatic conversion of old paramter for kryoserializer.mb to new values.
      e428049 [Ilya Ganelin] resolving merge conflict
      8b43748 [Ilya Ganelin] Fixed error in pattern matching for doubles
      84a2581 [Ilya Ganelin] Added smoother handling of fractional values for size parameters. This now throws an exception and added a warning for old spark.kryoserializer.buffer
      d3d09b6 [Ilya Ganelin] [SPARK-5932] Fixing error in KryoSerializer
      fe286b4 [Ilya Ganelin] Resolved merge conflict
      c7803cd [Ilya Ganelin] Empty lines
      54b78b4 [Ilya Ganelin] Simplified byteUnit class
      69e2f20 [Ilya Ganelin] Updates to code
      f32bc01 [Ilya Ganelin] [SPARK-5932] Fixed error in API in SparkConf.scala where Kb conversion wasn't being done properly (was Mb). Added test cases for both timeUnit and ByteUnit conversion
      f15f209 [Ilya Ganelin] Fixed conversion of kryo buffer size
      0f4443e [Ilya Ganelin]     Merge remote-tracking branch 'upstream/master' into SPARK-5932
      35a7fa7 [Ilya Ganelin] Minor formatting
      928469e [Ilya Ganelin] [SPARK-5932] Converted some longs to ints
      5d29f90 [Ilya Ganelin] [SPARK-5932] Finished documentation updates
      7a6c847 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer
      afc9a38 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize and spark.storage.memoryMapThreshold
      ae7e9f6 [Ilya Ganelin] [SPARK-5932] Updated spark.io.compression.snappy.block.size
      2d15681 [Ilya Ganelin] [SPARK-5932] Updated spark.executor.logs.rolling.size.maxBytes
      1fbd435 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize
      eba4de6 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer.kb
      b809a78 [Ilya Ganelin] [SPARK-5932] Updated spark.kryoserializer.buffer.max
      0cdff35 [Ilya Ganelin] [SPARK-5932] Updated to use bibibytes in method names. Updated spark.kryoserializer.buffer.mb and spark.reducer.maxMbInFlight
      475370a [Ilya Ganelin] [SPARK-5932] Simplified ByteUnit code, switched to using longs. Updated docs to clarify that we use kibi, mebi etc instead of kilo, mega
      851d691 [Ilya Ganelin] [SPARK-5932] Updated memoryStringToMb to use new interfaces
      a9f4fcf [Ilya Ganelin] [SPARK-5932] Added unit tests for unit conversion
      747393a [Ilya Ganelin] [SPARK-5932] Added unit tests for ByteString conversion
      09ea450 [Ilya Ganelin] [SPARK-5932] Added byte string conversion to Jav utils
      5390fd9 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5932
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
      2d222fb3
    • [SPARK-4286] Add an external shuffle service that can be run as a daemon. · 8aab94d8
      Iulian Dragos authored
      This allows Mesos deployments to use the shuffle service (and implicitly dynamic allocation). It does so by adding a new "main" class and two corresponding scripts in `sbin`:
      
      - `sbin/start-shuffle-service.sh`
      - `sbin/stop-shuffle-service.sh`
      
      Specific options can be passed in `SPARK_SHUFFLE_OPTS`.
      
      This is picking up work from #3861 /cc tnachen
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #4990 from dragos/feature/external-shuffle-service and squashes the following commits:
      
      6c2b148 [Iulian Dragos] Import order and wrong name fixup.
      07804ad [Iulian Dragos] Moved ExternalShuffleService to the `deploy` package + other minor tweaks.
      4dc1f91 [Iulian Dragos] Reviewer’s comments:
      8145429 [Iulian Dragos] Add an external shuffle service that can be run as a daemon.
      8aab94d8
    • [Core][test][minor] replace try finally block with tryWithSafeFinally · 52ccf1d3
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #5739 from liyezhang556520/trySafeFinally and squashes the following commits:
      
      55683e5 [Zhang, Liye] replace try finally block with tryWithSafeFinally
      52ccf1d3
    • [SPARK-7140] [MLLIB] only scan the first 16 entries in Vector.hashCode · b14cd236
      Xiangrui Meng authored
      The Python SerDe calls `Object.hashCode`, which is very expensive for Vectors. It is not necessary to scan the whole vector, especially for large ones. In this PR, we only scan the first 16 nonzeros. srowen
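
      A sketch of the idea: hashing only a fixed prefix of the nonzero entries keeps `hashCode` O(1) for large vectors (not the actual implementation):

      ```scala
      // combine index and value of at most the first 16 nonzero entries
      def prefixHash(indices: Array[Int], values: Array[Double]): Int = {
        var h = 31
        var i = 0
        var seen = 0
        while (i < values.length && seen < 16) {
          if (values(i) != 0.0) {
            h = 31 * h + indices(i)
            h = 31 * h + values(i).## // Scala's hash of the Double value
            seen += 1
          }
          i += 1
        }
        h
      }
      ```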
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5697 from mengxr/SPARK-7140 and squashes the following commits:
      
      2abc86d [Xiangrui Meng] typo
      8fb7d74 [Xiangrui Meng] update impl
      1ebad60 [Xiangrui Meng] only scan the first 16 nonzeros in Vector.hashCode
      b14cd236
    • [SPARK-5253] [ML] LinearRegression with L1/L2 (ElasticNet) using OWLQN · 6a827d5d
      DB Tsai authored
      Author: DB Tsai <dbt@netflix.com>
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #4259 from dbtsai/lir and squashes the following commits:
      
      a81c201 [DB Tsai] add import org.apache.spark.util.Utils back
      9fc48ed [DB Tsai] rebase
      2178b63 [DB Tsai] add comments
      9988ca8 [DB Tsai] addressed feedback and fixed a bug. TODO: documentation and build another synthetic dataset which can catch the bug fixed in this commit.
      fcbaefe [DB Tsai] Refactoring
      4eb078d [DB Tsai] first commit
      6a827d5d
    • [SPARK-6435] spark-shell --jars option does not add all jars to classpath · 268c419f
      Masayoshi TSUZUKI authored
      Modified to accept double-quoted args properly in spark-shell.cmd.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #5227 from tsudukim/feature/SPARK-6435-2 and squashes the following commits:
      
      ac55787 [Masayoshi TSUZUKI] removed unnecessary argument.
      60789a7 [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6435-2
      1fee420 [Masayoshi TSUZUKI] fixed test code for escaping '='.
      0d4dc41 [Masayoshi TSUZUKI] - escaped comma and semicolon in CommandBuilderUtils.java - added random string to the temporary filename - double-quotation followed by `cmd /c` did not work properly - no need to escape `=` by `^` - if a double-quoted string ended with `\` like a classpath, the last `\` was parsed as the escape character and the closing `"` didn't work properly
      2a332e5 [Masayoshi TSUZUKI] Merge branch 'master' into feature/SPARK-6435-2
      04f4291 [Masayoshi TSUZUKI] [SPARK-6435] spark-shell --jars option does not add all jars to classpath
      268c419f
    • [SPARK-7100] [MLLIB] Fix persisted RDD leak in GradientBoostTrees · 75905c57
      Jim Carroll authored
      This fixes a leak of a persisted RDD, where GradientBoostTrees can call persist but never unpersists it.
      
      Jira: https://issues.apache.org/jira/browse/SPARK-7100
      
      Discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/GradientBoostTrees-leaks-a-persisted-RDD-td11750.html
      
      Author: Jim Carroll <jim@dontcallme.com>
      
      Closes #5669 from jimfcarroll/gb-unpersist-fix and squashes the following commits:
      
      45f4b03 [Jim Carroll] [SPARK-7100][MLLib] Fix persisted RDD leak in GradientBoostTrees
      75905c57