  1. Aug 02, 2015
    • KaiXinXiaoLei's avatar
      [SPARK-9535][SQL][DOCS] Modify document for codegen. · 536d2adc
      KaiXinXiaoLei authored
      #7142 enabled codegen by default, so this updates the corresponding documentation.
      
      Closes #7142
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7863 from sarutak/SPARK-9535 and squashes the following commits:
      
      0884424 [Kousuke Saruta] Removed a line which mentioned about the effect of codegen enabled
      3c11af0 [Kousuke Saruta] Merge branch 'sqlconfig' of https://github.com/KaiXinXiaoLei/spark into SPARK-9535
      4ee531d [KaiXinXiaoLei] delete space
      4cfd11d [KaiXinXiaoLei] change spark.sql.planner.externalSort
      d624cf8 [KaiXinXiaoLei] sql config is wrong
      536d2adc
    • Reynold Xin's avatar
      [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter. · 9d03ad91
      Reynold Xin authored
      The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter() and works as follows:
      
      1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types)
      2. Run UnsafeKVExternalSorter on the generated data
      3. Collect the output from the sorter, and make sure the keys are sorted in ascending order
      4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match.
      5. Check memory allocation to make sure there is no memory leak.
      
      There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records.
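
The steps above can be sketched in plain Python. This is only an illustration of the randomized-testing idea, not the code in UnsafeKVExternalSorterSuite: `sorter_under_test`, `random_records`, and `check_sorter` are hypothetical names, the candidate key types are arbitrary, and the memory-leak check (step 5) and the spill flag are not modeled.

```python
import random


def random_records(rng, n):
    """Generate n random (key, value) pairs; the key type is itself drawn at random."""
    key_kind = rng.choice(["int", "str"])  # hypothetical list of candidate key types

    def make_key():
        if key_kind == "int":
            return rng.randint(-1000, 1000)
        return "".join(rng.choice("abc") for _ in range(rng.randint(0, 4)))

    return [(make_key(), rng.randint(0, 10**6)) for _ in range(n)]


def sorter_under_test(records):
    """Stand-in for the sorter being tested: orders records by key only."""
    return sorted(records, key=lambda kv: kv[0])


def check_sorter(seed, n=500):
    rng = random.Random(seed)
    records = random_records(rng, n)
    output = sorter_under_test(list(records))

    # Keys must come out in ascending order.
    keys = [k for k, _ in output]
    assert all(a <= b for a, b in zip(keys, keys[1:]))

    # Sorting input and output by (key, value) must give identical lists,
    # which shows that no record was lost, duplicated, or altered.
    assert sorted(records) == sorted(output)
    return True
```

Running `check_sorter` over many seeds exercises different key types and data distributions; a failing seed can be replayed to reproduce the problem.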
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits:
      
      a08c251 [Reynold Xin] Resource cleanup.
      0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
      9d03ad91
    • Liang-Chi Hsieh's avatar
      [SPARK-7937][SQL] Support comparison on StructType · 0722f433
      Liang-Chi Hsieh authored
      This brings #6519 up to date with the master branch.
      
      Closes #6519.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7877 from rxin/sort-struct and squashes the following commits:
      
      4968231 [Reynold Xin] Minor fixes.
      2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct
      d2ba8ad [Liang-Chi Hsieh] Remove unused import.
      3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering.
      dae6aad [Liang-Chi Hsieh] Fix nested struct.
      d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering.
      f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      1187a65 [Liang-Chi Hsieh] Fix scala style.
      9d67f68 [Liang-Chi Hsieh] Fix wrongly merging.
      8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison.
      2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct
      3c142e4 [Liang-Chi Hsieh] Fix scala style.
      cf58dc3 [Liang-Chi Hsieh] Use checkAnswer.
      f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it.
      b6e1009 [Liang-Chi Hsieh] Fix scala style.
      3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.
      0722f433
    • Reynold Xin's avatar
      [SPARK-9531] [SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter · 2e981b7b
      Reynold Xin authored
      This pull request adds a destructAndCreateExternalSorter method to UnsafeFixedWidthAggregationMap. The new method does the following:
      
      1. Creates a new external sorter UnsafeKVExternalSorter
      2. Adds all the data into an in-memory sorter, sorts them
      3. Spills the sorted in-memory data to disk
      
      This method can be used to fall back to sort-based aggregation when under memory pressure.
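
As a rough illustration of the three steps (not the actual UnsafeFixedWidthAggregationMap code), the sketch below drains an in-memory map, sorts its entries by key, and writes them to a spill file. A plain `dict` stands in for the aggregation map, and `destruct_and_spill` and `read_spill` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def destruct_and_spill(agg_map):
    """Drain agg_map, sort its entries by key, and spill them to a temp file.

    Returns the path of the spill file. The map is emptied, mirroring the
    "destruct" part of the method name.
    """
    records = sorted(agg_map.items(), key=lambda kv: kv[0])  # in-memory sort
    agg_map.clear()                                          # release the map's contents
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "wb") as f:
        for record in records:                               # spill in sorted order
            pickle.dump(record, f)
    return path


def read_spill(path):
    """Iterate over the records of a spill file in the order they were written."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

A sort-based aggregation could then merge several such sorted spill files, which is why the data is sorted before it is written.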
      
      The pull request also includes accounting fixes from JoshRosen.
      
      TODOs (that can be done in follow-up PRs)
      - [x] Address Josh's feedback from #7849
      - [x] More documentation and test cases
      - [x] Make sure we are doing memory accounting correctly with test cases (e.g. did we release the memory in BytesToBytesMap twice?)
      - [ ] Look harder at possible memory leaks and exception handling
      - [ ] Randomized tester for the KV sorter as well as the aggregation map
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7860 from rxin/kvsorter and squashes the following commits:
      
      986a58c [Reynold Xin] Bug fix.
      599317c [Reynold Xin] Style fix and slightly more compact code.
      fe7bd4e [Reynold Xin] Bug fixes.
      fd71bef [Reynold Xin] Merge remote-tracking branch 'josh/large-records-in-sql-sorter' into kvsorter-with-josh-fix
      3efae38 [Reynold Xin] More fixes and documentation.
      45f1b09 [Josh Rosen] Ensure that spill files are cleaned up
      f6a9bd3 [Reynold Xin] Josh feedback.
      9be8139 [Reynold Xin] Remove testSpillFrequency.
      7cbe759 [Reynold Xin] [SPARK-9531][SQL] UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter.
      ae4a8af [Josh Rosen] Detect leaked unsafe memory in UnsafeExternalSorterSuite.
      52f9b06 [Josh Rosen] Detect ShuffleMemoryManager leaks in UnsafeExternalSorter.
      2e981b7b
    • Xiangrui Meng's avatar
      [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly · 66924ffa
      Xiangrui Meng authored
      1. Use `PrefixSpanModel` to wrap the frequent sequences.
      2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence`
      3. Overload `run` for Java users.
      4. Added a unit test in Java to check Java compatibility.
      
      zhangjiajin feynmanliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7869 from mengxr/SPARK-9527 and squashes the following commits:
      
      4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly
      66924ffa
    • Reynold Xin's avatar
      [SPARK-9208][SQL] Sort DataFrame functions alphabetically. · 8eafa2ae
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7861 from rxin/api-audit and squashes the following commits:
      
      7200256 [Reynold Xin] [SPARK-9208][SQL] Sort DataFrame functions alphabetically.
      8eafa2ae
    • Yu ISHIKAWA's avatar
      [SPARK-9149] [ML] [EXAMPLES] Add an example of spark.ml KMeans · 244016a9
      Yu ISHIKAWA authored
      [SPARK-9149] Add an example of spark.ml KMeans - ASF JIRA https://issues.apache.org/jira/browse/SPARK-9149
      
      jkbradley Should we support other data formats, such as TSV or CSV? I have implemented these examples to support only space-separated files, the same as the example for `spark.mllib`'s `KMeans`.
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #7697 from yu-iskw/SPARK-9149 and squashes the following commits:
      
      7137bad [Yu ISHIKAWA] Fix the typo
      56b9da2 [Yu ISHIKAWA] Fix the place of the wrong import statment
      554e574 [Yu ISHIKAWA] Change the way to format input data in KMeansExample
      e7a948a [Yu ISHIKAWA] Import spark.ml.clustering.KMeans
      1901e0c [Yu ISHIKAWA] Change how to initialize an array for a DataFrame schema
      d8043f5 [Yu ISHIKAWA] Return a value directly
      d81bf55 [Yu ISHIKAWA] Fix a typo and its access specifiers
      3e0862d [Yu ISHIKAWA] Make KMeansExample more simple
      51ce9c1 [Yu ISHIKAWA] Make JavaKMeansExample more simple
      a5a01e0 [Yu ISHIKAWA] Fix a Javadoc about the command to execute the example
      b09ec13 [Yu ISHIKAWA] [SPARK-9149][ML][Examples] Add an example of spark.ml KMeans
      244016a9
    • Sean Owen's avatar
      [SPARK-9521] [BUILD] Require Maven 3.3.3+ in the build · 9d1c0252
      Sean Owen authored
      Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7852 from srowen/SPARK-9521 and squashes the following commits:
      
      3093039 [Sean Owen] Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
      9d1c0252
    • Davies Liu's avatar
      [SPARK-9529] [SQL] improve TungstenSort on DecimalType · 16b928c5
      Davies Liu authored
      Generate a prefix for DecimalType and fix the random generator for decimals.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7857 from davies/sort_decimal and squashes the following commits:
      
      2433959 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_decimal
      de24253 [Davies Liu] fix style
      0a54c1a [Davies Liu] sort decimal
      16b928c5
    • Feynman Liang's avatar
      [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan · 28d944e8
      Feynman Liang authored
      mengxr Please review after #7818 merges and master is rebased.
      
      Continues work by rikima
      
      Closes #7400
      
      Author: Feynman Liang <fliang@databricks.com>
      Author: masaki rikitoku <rikima3132@gmail.com>
      
      Closes #7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits:
      
      8b2c756 [Feynman Liang] Remove orig
      92443c8 [Feynman Liang] Style fixes
      42c6349 [Feynman Liang] Style fix
      14e67fc [Feynman Liang] Generic prefixSpan itemtypes
      b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api
      b86e0d5 [masaki rikitoku] modify to support generic item type
      28d944e8
  2. Aug 01, 2015
    • Davies Liu's avatar
      [SPARK-9459] [SQL] use generated FromUnsafeProjection to do deep copy for UTF8String and struct · 57084e0c
      Davies Liu authored
      When accessing a column in an UnsafeRow, it is good to avoid the copy; the deep copy should instead happen when the UnsafeRow is turned into a generic Row. This PR introduces a generated FromUnsafeProjection to do that.
      
      This PR also fixes the expressions that cache a UTF8String, which should copy it as well.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7840 from davies/avoid_copy and squashes the following commits:
      
      230c8a1 [Davies Liu] address comment
      fd797c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into avoid_copy
      e095dd0 [Davies Liu] rollback rename
      8ef5b0b [Davies Liu] copy String in Columnar
      81360b8 [Davies Liu] fix class name
      9aecb88 [Davies Liu] use FromUnsafeProjection to do deep copy for UTF8String and struct
      57084e0c
    • Davies Liu's avatar
      [SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff,... · c1b0cbd7
      Davies Liu authored
      [SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp
      
      This PR is based on #7643 , thanks to adrian-wang
      
      Author: Davies Liu <davies@databricks.com>
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #7847 from davies/datediff and squashes the following commits:
      
      74333d7 [Davies Liu] fix bug
      22d8a8c [Davies Liu] optimize
      85cdd21 [Davies Liu] remove unnecessary tests
      241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff
      e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp
      c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)
      c1b0cbd7
    • HuJiayin's avatar
      [SPARK-8269] [SQL] string function: initcap · 00cd92f3
      HuJiayin authored
      This PR is based on #7208 , thanks to HuJiayin
      
      Closes #7208
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7850 from davies/initcap and squashes the following commits:
      
      54472e9 [Davies Liu] fix python test
      17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      3a906e4 [Davies Liu] implement title case in UTF8String
      8b2506a [HuJiayin] Update functions.py
      2cd43e5 [HuJiayin] fix python style check
      b616c0e [HuJiayin] add python api
      1f5a0ef [HuJiayin] add codegen
      7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
      6a0b958 [HuJiayin] add column
      c79482d [HuJiayin] support soundex
      7ce416b [HuJiayin] support initcap rebase code
      00cd92f3
    • Davies Liu's avatar
      [SPARK-9495] prefix of DateType/TimestampType · 5d9e33d9
      Davies Liu authored
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7856 from davies/sort_improve and squashes the following commits:
      
      5fc81bd [Davies Liu] support DateType/TimestampType
      5d9e33d9
    • Meihua Wu's avatar
      [SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and... · 84a6982b
      Meihua Wu authored
      [SPARK-9530] [MLLIB] ScalaDoc should not indicate LDAModel.describeTopics and DistributedLDAModel.topDocumentsPerTopic as approximate
      
      Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
      
      cc jkbradley
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7858 from rotationsymmetry/SPARK-9530 and squashes the following commits:
      
      b574923 [Meihua Wu] Remove ScalaDoc that suggests describeTopics and topDocumentsPerTopic are approximate.
      84a6982b
    • Reynold Xin's avatar
      [SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap · 3d1535d4
      Reynold Xin authored
      This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key.
      
      This is needed so we can fall back to external sorting for aggregation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits:
      
      75018c6 [Reynold Xin] Updated documentation.
      81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.
      3d1535d4
    • Marcelo Vanzin's avatar
      [SPARK-9491] Avoid fetching HBase tokens when not needed. · df733cbe
      Marcelo Vanzin authored
      Look at HBase's configuration to make sure it's configured for
      Kerberos. If the HBase configuration is missing, or if HBase is
      configured for non-kerberos authentication, then skip getting
      tokens.
      
      Reference: http://hbase.apache.org/book.html#security.prerequisites
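
The decision described above can be sketched as a small predicate. This is an illustration only, not the patch's code: `should_fetch_hbase_tokens` is a hypothetical name, the configuration is modeled as a plain dict, and the property name `hbase.security.authentication` with a default of `simple` is taken from the HBase security documentation rather than from this commit.

```python
def should_fetch_hbase_tokens(hbase_conf):
    """Return True only when the HBase configuration asks for Kerberos.

    hbase_conf is a dict of HBase properties, or None when no HBase
    configuration could be found.
    """
    if not hbase_conf:
        # Missing configuration: skip getting tokens.
        return False
    # Assumed property name and default, per the HBase security docs.
    auth = hbase_conf.get("hbase.security.authentication", "simple")
    return auth.strip().lower() == "kerberos"
```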
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7810 from vanzin/SPARK-9491 and squashes the following commits:
      
      a57c776 [Marcelo Vanzin] [SPARK-9491] Avoid fetching HBase tokens when not needed.
      df733cbe
    • Andrew Or's avatar
      [SPARK-4751] Dynamic allocation in standalone mode · 6688ba6e
      Andrew Or authored
      Dynamic allocation is a feature that allows a Spark application to scale the number of executors up and down dynamically based on the workload. Support was first introduced for YARN in 1.2 and was recently extended to Mesos coarse-grained mode. Today, it is finally supported in standalone mode as well!
      
      I tested this locally and it works as expected. This is WIP because unit tests are coming.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7532 from andrewor14/standalone-da and squashes the following commits:
      
      b3c1736 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      879e928 [Andrew Or] Add end-to-end tests for standalone dynamic allocation
      accc8f6 [Andrew Or] Address comments
      ee686a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      c0a2c02 [Andrew Or] Fix build after merge conflict
      24149eb [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      2e762d6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      6832bd7 [Andrew Or] Add tests for scheduling with executor limit
      a82e907 [Andrew Or] Fix comments
      0a8be79 [Andrew Or] Simplify logic by removing the worker blacklist
      b7742af [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      2eb5f3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      1334e9a [Andrew Or] Fix MiMa
      32abe44 [Andrew Or] Fix style
      58cb06f [Andrew Or] Privatize worker blacklist for cleanliness
      42ac215 [Andrew Or] Clean up comments and rewrite code for readability
      49702d1 [Andrew Or] Clean up shuffle files after application exits
      80047aa [Andrew Or] First working implementation
      6688ba6e
    • zhichao.li's avatar
      [SPARK-8263] [SQL] substr/substring should also support binary type · c5166f7a
      zhichao.li authored
      This is based on #7641, thanks to zhichao-li
      
      Closes #7641
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7848 from davies/substr and squashes the following commits:
      
      461b709 [Davies Liu] remove bytearry from tests
      b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
      01d795e [zhichao.li] scala style
      99aa130 [zhichao.li] add substring to dataframe
      4f68bfe [zhichao.li] add binary type support for substring
      c5166f7a
    • Cheng Hao's avatar
      [SPARK-8232] [SQL] Add sort_array support · cf6c9ca3
      Cheng Hao authored
      This PR is based on #7581 and only fixes the merge conflict.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7851 from davies/sort_array and squashes the following commits:
      
      a80ef66 [Davies Liu] fix conflict
      7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
      664c960 [Cheng Hao] update the sort_array by using the ArrayData
      276d2d5 [Cheng Hao] add empty line
      0edab9c [Cheng Hao] Add asending/descending support for sort_array
      80fc0f8 [Cheng Hao] Add type checking
      a42b678 [Cheng Hao] Add sort_array support
      cf6c9ca3
    • Yuhao Yang's avatar
      [SPARK-8169] [ML] Add StopWordsRemover as a transformer · 87656650
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8169
      
      stop words: http://en.wikipedia.org/wiki/Stop_words
      
      StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.
      
      Currently I use a minimal stop word set, since in some [cases](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) a small set of stop words is preferred.
      ASCII char has been tested, yet I cannot check the test in due to the style check.
      
      Further thoughts:
      1. Maybe I should use OpenHashSet. Is it recommended?
      2. Currently I leave nulls in the input array untouched, i.e. Array(null, null) => Array(null, null).
      3. If the current stop word set looks too limited, any suggestions for a replacement? We could have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py).
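
The transformer's behavior on a single row can be sketched in plain Python. This is a minimal illustration, not the Spark implementation: `remove_stop_words` is a hypothetical name, matching is assumed to be exact (case-sensitive), and `None` plays the role of null.

```python
def remove_stop_words(tokens, stop_words):
    """Return tokens with every stop word removed; None entries are kept as-is."""
    stop = set(stop_words)  # set lookup keeps the filter linear in len(tokens)
    return [t for t in tokens if t is None or t not in stop]
```

For example, `remove_stop_words(["the", "quick", "fox"], ["the", "a"])` keeps `"quick"` and `"fox"`, and an input of `[None, None]` comes back unchanged, matching point 2 above.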
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6742 from hhbyyh/stopwords and squashes the following commits:
      
      fa959d8 [Yuhao Yang] separating udf
      f190217 [Yuhao Yang] replace default list and other small fix
      04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords
      b3aa957 [Yuhao Yang] add stopWordsRemover
      87656650
    • zhangjiajin's avatar
      [SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences · d2a9b66f
      zhangjiajin authored
      mengxr Extends PrefixSpan to non-temporal itemsets. Continues work by zhangjiajin
      
       * Internal API uses List[Set[Int]] which is likely not efficient; will need to refactor during QA
      
      Closes #7646
      
      Author: zhangjiajin <zhangjiajin@huawei.com>
      Author: Feynman Liang <fliang@databricks.com>
      Author: zhang jiajin <zhangjiajin@huawei.com>
      
      Closes #7818 from feynmanliang/SPARK-8999-nonTemporal and squashes the following commits:
      
      4ded81d [Feynman Liang] Replace all filters to filter nonempty
      350e67e [Feynman Liang] Code review feedback
      03156ca [Feynman Liang] Fix tests, drop delimiters at boundaries of sequences
      d1fe0ed [Feynman Liang] Remove comments
      86ca4e5 [Feynman Liang] Fix style
      7c7bf39 [Feynman Liang] Fixed itemSet sequences
      6073b10 [Feynman Liang] Basic itemset functionality, failing test
      1a7fb48 [Feynman Liang] Add delimiter to results
      5db00aa [Feynman Liang] Working for items, not itemsets
      6787716 [Feynman Liang] Working on temporal sequences
      f1114b9 [Feynman Liang] Add -1 delimiter
      00fe756 [Feynman Liang] Reset base files for rebase
      f486dcd [zhangjiajin] change maxLocalProjDBSize and fix a bug (remove -3 from frequent items).
      60a0b76 [zhangjiajin] fixed a scala style error.
      740c203 [zhangjiajin] fixed a scala style error.
      5785cb8 [zhangjiajin] support non-temporal sequence
      a5d649d [zhangjiajin] restore original version
      09dc409 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into multiItems_2
      ae8c02d [zhangjiajin] Fixed some Scala style errors.
      216ab0c [zhangjiajin] Support non-temporal sequence in PrefixSpan
      b572f54 [zhangjiajin] initialize file before rebase.
      f06772f [zhangjiajin] fix a scala style error.
      a7e50d4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
      c1d13d0 [zhang jiajin] Delete PrefixspanSuite.scala
      d9d8137 [zhang jiajin] Delete Prefixspan.scala
      c6ceb63 [zhangjiajin] Add new algorithm PrefixSpan and test file.
      d2a9b66f
    • Holden Karau's avatar
      [SPARK-7446] [MLLIB] Add inverse transform for string indexer · 65038973
      Holden Karau authored
      It is useful to convert the encoded indices back to their string representation for result inspection. We can add a function which creates an inverse transformation.
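
The idea can be sketched without Spark: an indexer maps each string to a numeric index, and the inverse looks the string back up from the list of labels. The helper names below are hypothetical, and ordering labels by descending frequency is an assumption made for this sketch.

```python
from collections import Counter


def fit_labels(values):
    """Build the label list: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def to_indices(values, labels):
    """Forward transform: string -> index into labels (as a float)."""
    position = {label: float(i) for i, label in enumerate(labels)}
    return [position[v] for v in values]


def to_strings(indices, labels):
    """Inverse transform: index -> original string, using the same labels."""
    return [labels[int(i)] for i in indices]
```

The inverse only needs the label list, which is why the squashed commits below pass the labels as a param (or read them from column metadata) rather than requiring the original string column.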
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6339 from holdenk/SPARK-7446-inverse-transform-for-string-indexer and squashes the following commits:
      
      7cdf915 [Holden Karau] scala style comment fix
      b9cffb6 [Holden Karau] Update the labels param to have the metadata note
      6a38edb [Holden Karau] Setting the default needs to come after the value gets defined
      9e241d8 [Holden Karau] use Array.empty
      21c8cfa [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
      64dd3a3 [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
      4f06c59 [Holden Karau] Fix comment styles, use empty array as the default, etc.
      a60c0e3 [Holden Karau] CR feedback (remove old constructor, add a note about use of setLabels)
      1987b95 [Holden Karau] Use default copy
      71e8d66 [Holden Karau] Make labels a local param for StringIndexerInverse
      8450d0b [Holden Karau] Use the labels param in StringIndexerInverse
      7464019 [Holden Karau] Add a labels param
      868b1a9 [Holden Karau] Update scaladoc since we don't have labelsCol anymore
      5aa38bf [Holden Karau] Add an inverse test using only meta data, pass labels when calling inverse method
      f3e0c64 [Holden Karau] CR feedback
      ebed932 [Holden Karau] Add Experimental tag and some scaladocs. Also don't require that the inputCol has the metadata on it, instead have the labelsCol specified when creating the inverse.
      03ebf95 [Holden Karau] Add explicit type for invert function
      ecc65e0 [Holden Karau] Read the metadata correctly, use the array, pass the test
      a42d773 [Holden Karau] Fix test to supply cols as per new invert method
      16cc3c3 [Holden Karau] Add an invert method
      d4bcb20 [Holden Karau] Make the inverse string indexer into a transformer (still needs test updates but compiles)
      e8bf3ad [Holden Karau] Merge branch 'master' into SPARK-7446-inverse-transform-for-string-indexer
      c3fdee1 [Holden Karau] Some WIP refactoring based on jkbradley's CR feedback. Definite work-in-progress
      557bef8 [Holden Karau] Instead of using a private inverse transform, add an invert function so we can use it in a pipeline
      88779c1 [Holden Karau] fix long line
      78b28c1 [Holden Karau] Finish reverse part and add a test :)
      bb16a6a [Holden Karau] Some progress
      65038973
    • Davies Liu's avatar
      Revert "[SPARK-8232] [SQL] Add sort_array support" · 60ea7ab4
      Davies Liu authored
      This reverts commit 67ad4e21.
      60ea7ab4
    • Wenchen Fan's avatar
      [SPARK-9480][SQL] add MapData and cleanup internal row stuff · 1d59a416
      Wenchen Fan authored
      This PR adds `MapData` as the internal representation of the map type in Spark SQL, and provides a default implementation backed by just two `ArrayData` instances.
      
      With that, we have specialized getters for all internal types, so I removed the generic getter in `ArrayData` and added a specialized `toArray` for it.
      This PR also does some refactoring and cleanup of `InternalRow` and its subclasses.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7799 from cloud-fan/map-data and squashes the following commits:
      
      77d482f [Wenchen Fan] fix python
      e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite
      40cc9db [Wenchen Fan] add toString
      6e06ec9 [Wenchen Fan] some more cleanup
      a90aca1 [Wenchen Fan] add MapData
      1d59a416
    • Reynold Xin's avatar
      [SPARK-9517][SQL] BytesToBytesMap should encode data the same way as UnsafeExternalSorter · d90f2cf7
      Reynold Xin authored
      BytesToBytesMap currently encodes key/value data in the following format:
      ```
      8B key length, key data, 8B value length, value data
      ```
      
      UnsafeExternalSorter, on the other hand, encodes data this way:
      ```
      4B record length, data
      ```
      
      As a result, we cannot pass records encoded by BytesToBytesMap directly into UnsafeExternalSorter for sorting. However, if we rearrange data slightly, we can then pass the key/value records directly into UnsafeExternalSorter:
      ```
      4B key+value length, 4B key length, key data, value data
      ```
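
The rearranged layout can be sketched with Python's `struct` module. This is an illustration under stated assumptions, not Spark's code: the leading 4-byte length is assumed to count everything after it (the key-length field, the key, and the value), which is what lets a reader that only knows the `4B record length, data` format step over each record, and the little-endian byte order is chosen arbitrarily for the sketch. All function names are hypothetical.

```python
import struct


def encode_record(key: bytes, value: bytes) -> bytes:
    """Encode as: 4B key+value length, 4B key length, key data, value data."""
    body = struct.pack("<i", len(key)) + key + value
    return struct.pack("<i", len(body)) + body


def split_records(buf: bytes):
    """Sorter's view: read '4B record length, data' without knowing about keys."""
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<i", buf, offset)
        yield buf[offset + 4 : offset + 4 + length]
        offset += 4 + length


def decode_body(body: bytes):
    """Map's view: split one record body back into (key, value)."""
    (key_len,) = struct.unpack_from("<i", body, 0)
    return body[4 : 4 + key_len], body[4 + key_len :]
```

Because the value length is implied by the total length minus the key length, the second length field of the old format is no longer needed.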
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7845 from rxin/kvsort-rebase and squashes the following commits:
      
      5716b59 [Reynold Xin] Fixed test.
      2e62ccb [Reynold Xin] Updated BytesToBytesMap's data encoding to put the key first.
      a51b641 [Reynold Xin] Added a KV sorter interface.
      d90f2cf7
    • Cheng Hao's avatar
      [SPARK-8232] [SQL] Add sort_array support · 67ad4e21
      Cheng Hao authored
      Add expression `sort_array` support.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Davies Liu <davies.liu@gmail.com>
      
      Closes #7581 from chenghao-intel/sort_array and squashes the following commits:
      
      664c960 [Cheng Hao] update the sort_array by using the ArrayData
      276d2d5 [Cheng Hao] add empty line
      0edab9c [Cheng Hao] Add asending/descending support for sort_array
      80fc0f8 [Cheng Hao] Add type checking
      a42b678 [Cheng Hao] Add sort_array support
      67ad4e21
    • Liang-Chi Hsieh's avatar
      [SPARK-9415][SQL] Throw AnalysisException when using MapType on Join and Aggregate · 3320b0ba
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9415
      
      Follow-up to #7787. MapType shouldn't be used as grouping keys or join keys either.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7819 from viirya/map_join_groupby and squashes the following commits:
      
      005ee0c [Liang-Chi Hsieh] For comments.
      7463398 [Liang-Chi Hsieh] MapType can't be used as join keys, grouping keys.
      3320b0ba
  3. Jul 31, 2015
    • Josh Rosen's avatar
      [SPARK-9464][SQL] Property checks for UTF8String · 14f26344
      Josh Rosen authored
      This PR is based on the original work by JoshRosen in #7780, which adds ScalaCheck property-based tests for UTF8String.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7830 from yjshen/utf8-property-checks and squashes the following commits:
      
      593da3a [Yijie Shen] resolve comments
      c0800e6 [Yijie Shen] Finish all todos in suite
      52f51a0 [Josh Rosen] Add some more failing tests
      49ed0697 [Josh Rosen] Rename suite
      9209c64 [Josh Rosen] UTF8String Property Checks.
      14f26344
    • zhichao.li's avatar
      [SPARK-8264][SQL]add substring_index function · 6996bd2e
      zhichao.li authored
      This PR is based on #7533 , thanks to zhichao-li
      
      Closes #7533
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7843 from davies/str_index and squashes the following commits:
      
      391347b [Davies Liu] add python api
      3ce7802 [Davies Liu] fix substringIndex
      f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
      515519b [zhichao.li] add foldable and remove null checking
      9546991 [zhichao.li] scala style
      67c253a [zhichao.li] hide some apis and clean code
      b19b013 [zhichao.li] add codegen and clean code
      ac863e9 [zhichao.li] reduce the calling of numChars
      12e108f [zhichao.li] refine unittest
      d92951b [zhichao.li] add lastIndexOf
      52d7b03 [zhichao.li] add substring_index function
      6996bd2e
    • Reynold Xin's avatar
      [SPARK-9358][SQL] Code generation for UnsafeRow joiner. · 03377d25
      Reynold Xin authored
      This patch creates a code generated unsafe row concatenator that can be used to concatenate/join two UnsafeRows into a single UnsafeRow.
      
      Since this kind of low-level code is inherently hard to test, the test suites rely heavily on randomized testing to build confidence in its correctness.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7821 from rxin/rowconcat and squashes the following commits:
      
      8717f35 [Reynold Xin] Rebase and code review.
      72c5d8e [Reynold Xin] Fixed a bug.
      a84ed2e [Reynold Xin] Fixed offset.
      40c3fb2 [Reynold Xin] Reset random data generator.
      f0913aa [Reynold Xin] Test fixes.
      6687b6f [Reynold Xin] Updated documentation.
      00354b9 [Reynold Xin] Support concat data as well.
      e9a4347 [Reynold Xin] Updated.
      6269f96 [Reynold Xin] Fixed a bug .
      0f89716 [Reynold Xin] [SPARK-9358][SQL][WIP] Code generation for UnsafeRow concat.
      03377d25
    • Hossein's avatar
      [SPARK-9318] [SPARK-9320] [SPARKR] Aliases for merge and summary functions on DataFrames · 712f5b7a
      Hossein authored
      This PR adds ```merge``` (an alias for ```join```) and ```summary``` (an alias for ```describe```) to the SparkR DataFrame API.
      
      cc shivaram
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #7806 from falaki/SPARK-9320 and squashes the following commits:
      
      72600f7 [Hossein] Updated docs
      92a6e75 [Hossein] Fixed merge generic signature issue
      4c2b051 [Hossein] Fixing naming with mllib summary
      0f3a64c [Hossein] Added ... to generic for merge
      30fbaf8 [Hossein] Merged master
      ae1a4cf [Hossein] Merge branch 'master' into SPARK-9320
      e8eb86f [Hossein] Add a generic for merge
      fc01f2d [Hossein] Added unit test
      8d92012 [Hossein] Added merge as an alias for join
      5b8bedc [Hossein] Added unit test
      632693d [Hossein] Added summary as an alias for describe for DataFrame
      712f5b7a
    • Josh Rosen's avatar
      [SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager · 8cb415a4
      Josh Rosen authored
      This patch adds support for entries larger than the default page size in BytesToBytesMap.  These large rows are handled by allocating special overflow pages to hold individual entries.
      
      In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:
      
      - Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
      - Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
        - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory.  The caller can use this value to respond to the memory pressure (e.g. by spilling).
      - `UnsafeFixedWidthAggregationMap.getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
      - Updated all uses of these classes to handle these error conditions.
      - Added new tests for allocating large records and for allocations which fail due to memory pressure.
      - Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
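
      The allocation behaviour described above can be modelled with a short Python sketch. It is an illustration under stated assumptions, not the BytesToBytesMap implementation: `MemoryManager`, `PagedMap`, `try_reserve`, and `put_new_key` are hypothetical names, and entry sizes are approximated by the byte lengths of the key and value. The sketch shows the two ideas from the description: an entry larger than the page size gets its own overflow page, and an insert reports failure with a boolean when the memory cannot be reserved.

```python
class MemoryManager:
    """Hypothetical stand-in for a shared memory budget."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.reserved = 0

    def try_reserve(self, num_bytes):
        """Reserve memory if the budget allows it; report success."""
        if self.reserved + num_bytes > self.limit_bytes:
            return False
        self.reserved += num_bytes
        return True

    def release(self, num_bytes):
        self.reserved -= num_bytes


class PagedMap:
    """Toy map that reserves memory page by page before storing entries."""

    def __init__(self, manager, page_size):
        self.manager = manager
        self.page_size = page_size
        self.index = {}
        self.page_free = 0  # bytes left in the current regular page
        self.reserved = 0   # total bytes this map holds from the manager

    def _reserve(self, num_bytes):
        if not self.manager.try_reserve(num_bytes):
            return False
        self.reserved += num_bytes
        return True

    def put_new_key(self, key, value):
        """Insert an entry; return False if memory could not be reserved."""
        needed = len(key) + len(value)
        if needed > self.page_size:
            # Large entry: a dedicated overflow page sized for this entry.
            if not self._reserve(needed):
                return False
        elif needed > self.page_free:
            # Current page is full: ask for a new regular page first.
            if not self._reserve(self.page_size):
                return False
            self.page_free = self.page_size - needed
        else:
            self.page_free -= needed
        self.index[key] = value
        return True

    def free(self):
        """Return all reserved memory, e.g. after the caller has spilled."""
        self.manager.release(self.reserved)
        self.reserved = 0
        self.page_free = 0
        self.index.clear()


manager = MemoryManager(limit_bytes=64)
table = PagedMap(manager, page_size=16)
inserted_small = table.put_new_key(b"k1", b"v1")        # fits a regular page
inserted_large = table.put_new_key(b"big", b"x" * 40)   # needs an overflow page
inserted_again = table.put_new_key(b"big2", b"y" * 40)  # exceeds the budget
if not inserted_again:
    table.free()  # a real caller would spill its data before freeing
```

      Returning a boolean rather than raising keeps the decision about how to react to memory pressure with the caller, which matches the behaviour described for `putNewKey()`.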
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7762 from JoshRosen/large-rows and squashes the following commits:
      
      ae7bc56 [Josh Rosen] Fix compilation
      82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
      34ab943 [Josh Rosen] Remove semi
      31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
      626b33c [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated
      ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core.
      642ed69 [Josh Rosen] Rename size to numElements
      bea1152 [Josh Rosen] Add basic test.
      2cd3570 [Josh Rosen] Remove accidental duplicated code
      07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.
      8cb415a4
    • Feynman Liang's avatar
      [SPARK-8936] [MLLIB] OnlineLDA document-topic Dirichlet hyperparameter optimization · f51fd6fb
      Feynman Liang authored
      Adds `alpha` (document-topic Dirichlet parameter) hyperparameter optimization to `OnlineLDAOptimizer`, following Huang, "Maximum Likelihood Estimation of Dirichlet Distribution Parameters". Also adds a private `setSampleWithReplacement` setter to `OnlineLDAOptimizer` for unit testing.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7836 from feynmanliang/SPARK-8936-alpha-optimize and squashes the following commits:
      
      4bef484 [Feynman Liang] Documentation improvements
      c3c6c1d [Feynman Liang] Fix docs
      151e859 [Feynman Liang] Fix style
      fa77518 [Feynman Liang] Hyperparameter optimization
      f51fd6fb
    • HuJiayin's avatar
      [SPARK-8271][SQL]string function: soundex · 4d5a6e7b
      HuJiayin authored
      This PR adds the SQL function soundex(); see https://issues.apache.org/jira/browse/HIVE-9738
      
      It's based on #7115; thanks to HuJiayin.
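
      As background, the classic American Soundex encoding can be sketched in Python as below. This is the textbook algorithm, not the code added by this PR, and Spark's soundex() may differ in edge cases such as non-letter input. The function name and the choice to return an empty string for input without ASCII letters are assumptions made for illustration.

```python
# Map each consonant to its Soundex digit; vowels, H, W, and Y have no code.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""  # assumption: nothing to encode
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate letters that share a code.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels reset `prev`, so a repeated code after a vowel is kept.
        prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for name in ("Robert", "Rupert", "Tymczak", "Ashcraft"):
    print(name, soundex(name))
```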
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7812 from davies/soundex and squashes the following commits:
      
      fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
      a4bd6d8 [Davies Liu] fix soundex
      2538908 [HuJiayin] add codegen soundex
      d15d329 [HuJiayin] add back ut
      ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
      e2dec2c [HuJiayin] support soundex rebase code
      4d5a6e7b
    • Yin Huai's avatar
      [SPARK-9233] [SQL] Enable code-gen in window function unit tests · 3fc0cb92
      Yin Huai authored
      Since code-gen is enabled by default, it is better to run window function tests with code-gen.
      
      https://issues.apache.org/jira/browse/SPARK-9233
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7832 from yhuai/SPARK-9233 and squashes the following commits:
      
      4e4e4cc [Yin Huai] style
      ca80e07 [Yin Huai] Test window function with codegen.
      3fc0cb92
    • Hossein's avatar
      [SPARK-9324] [SPARK-9322] [SPARK-9321] [SPARKR] Some aliases for R-like functions in DataFrames · 710c2b5d
      Hossein authored
      Adds the following aliases:
      * unique (distinct)
      * rbind (unionAll): accepts many DataFrames
      * nrow (count)
      * ncol
      * dim
      * names (columns): along with the replacement function to change names
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #7764 from falaki/sparkR-alias and squashes the following commits:
      
      56016f5 [Hossein] Updated R documentation
      5e4a4d0 [Hossein] Removed extra code
      f51cbef [Hossein] Merge branch 'master' into sparkR-alias
      c1b88bd [Hossein] Moved setGeneric and other comments applied
      d9307f8 [Hossein] Added tests
      b5aa988 [Hossein] Added dim, ncol, nrow, names, rbind, and unique functions to DataFrames
      710c2b5d
    • Shivaram Venkataraman's avatar
      [SPARK-9510] [SPARKR] Remaining SparkR style fixes · 82f47b81
      Shivaram Venkataraman authored
      With the changes in this patch, `./dev/lint-r` no longer reports any warnings on my machine.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7834 from shivaram/sparkr-style-fixes and squashes the following commits:
      
      716cd8e [Shivaram Venkataraman] Remaining SparkR style fixes
      82f47b81
    • Sean Owen's avatar
      [SPARK-9507] [BUILD] Remove dependency reduced POM hack now that shade plugin is updated · 6e5fd613
      Sean Owen authored
      Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of the shade plugin version so that child modules inherit it, and bump the assembly plugin version while here.
      
      See https://issues.apache.org/jira/browse/SPARK-8819
      
      I verified that `mvn clean package -DskipTests` works with Maven 3.3.3.
      
      pwendell, are you up for trying this for the 1.5.0 release?
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7826 from srowen/SPARK-9507 and squashes the following commits:
      
      e0b0fd2 [Sean Owen] Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here
      6e5fd613
    • Sean Owen's avatar
      [SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement · 873ab0f9
      Sean Owen authored
      Use `print(x)` rather than `print x` in the evaluation examples so that they also run on Python 3.
      CC sethah mengxr -- just wanted to close this out before 1.5
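
      A small illustration of the style this change moves to: calling `print` with a single parenthesized argument works on Python 3, where `print x` is a syntax error, and also runs on Python 2. The helper `format_metric` and the value shown are hypothetical and are not taken from the MLlib guide.

```python
def format_metric(name, value):
    """Build the line to print for one evaluation metric."""
    return "%s = %s" % (name, value)


precision = 0.75  # placeholder value, not a real result
# Function-call form: valid on Python 3; `print precision` would not parse.
print(format_metric("Precision", precision))
```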
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7822 from srowen/SPARK-9490 and squashes the following commits:
      
      01abeba [Sean Owen] Change "print x" to "print(x)" in the rest of the docs too
      bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples
      873ab0f9