  1. Dec 01, 2015
    • [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs · 0a7bca2d
      gatorsmile authored
      Persist and unpersist exist in both the RDD and DataFrame APIs, and I think they are still very important in the Dataset API. I am not sure if my understanding is correct; if so, could you help me check whether the implementation is acceptable?
      
      Please provide your opinions. marmbrus rxin cloud-fan
      
      Thank you very much!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #9889 from gatorsmile/persistDS.
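      Conceptually, persist/cache memoizes a computed result so that later actions reuse it, and unpersist releases it. A minimal non-Spark sketch of that contract in plain Scala (the `Cached` class here is hypothetical and for illustration only; `Dataset.persist` is distributed and additionally takes a storage level):

      ```scala
      // Toy stand-in for persist/unpersist: compute once, reuse, then release.
      final class Cached[A](compute: () => A) {
        private var cached: Option[A] = None
        def get: A = cached.getOrElse { val v = compute(); cached = Some(v); v }
        def unpersist(): Unit = cached = None // drop the cached value
      }

      var evals = 0
      val c = new Cached(() => { evals += 1; 42 })
      c.get
      c.get
      println(evals)  // 1: the second access hits the cache
      c.unpersist()
      c.get
      println(evals)  // 2: recomputed after unpersist
      ```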
    • [SPARK-11954][SQL] Encoder for JavaBeans · fd95eeaf
      Wenchen Fan authored
      Create Java versions of `constructorFor` and `extractorFor` in `JavaTypeInference`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #9937 from cloud-fan/pojo.
    • [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema · 9df24624
      Wenchen Fan authored
      When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" placeholders and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema.
      For example, if we build an encoder for `case class Data(a: Int, b: String)` but the real type is `[a: int, b: long]`, we will hit a runtime error saying that we can't construct class `Data` with an int and a long, because we lost the information that `b` should be a string.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9840 from cloud-fan/err-msg.
    • [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail · 8ddc55f1
      Wenchen Fan authored
      The reason is that, for a single-column `RowEncoder` (or a single-field product encoder), when we use it as the encoder for the grouping key, we should still combine the grouping attributes, even though there is only one grouping attribute.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10059 from cloud-fan/bug.
    • [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues · 69dbe6b4
      Cheng Lian authored
      This PR backports PR #10039 to master
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10063 from liancheng/spark-12046.doc-fix.master.
    • [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize · 14011665
      Shixiong Zhu authored
      `JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data. `ByteArrayOutputStream.toByteArray` copies the content of the internal array into a new array. However, since the result is immediately converted to a `ByteBuffer`, we can avoid the memory copy.
      
      This PR adds `ByteBufferOutputStream` to access the protected `buf` field and convert it to a `ByteBuffer` directly.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10051 from zsxwing/SPARK-12060.
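      The idea can be sketched in a few lines of plain Scala: subclass `ByteArrayOutputStream` and wrap its protected `buf`/`count` fields in a `ByteBuffer`, instead of calling `toByteArray`, which copies. This is a sketch of the approach described above, not necessarily Spark's exact implementation:

      ```scala
      import java.io.ByteArrayOutputStream
      import java.nio.ByteBuffer

      // Exposes the internal buffer as a ByteBuffer without the array copy
      // that ByteArrayOutputStream.toByteArray performs.
      class ByteBufferOutputStream extends ByteArrayOutputStream {
        // `buf` and `count` are protected fields inherited from ByteArrayOutputStream.
        def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
      }

      val out = new ByteBufferOutputStream
      out.write(Array[Byte](1, 2, 3))
      println(out.toByteBuffer.remaining())  // 3, with no intermediate copy
      ```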
    • [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values · c87531b7
      Liang-Chi Hsieh authored
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-11949
      
      The result of a cube plan uses an incorrect schema. The schema of the cube result should set the nullable property to true, because the grouping expressions will produce null values.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10038 from viirya/fix-cube.
    • [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec · a0af0e35
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-11898
      syn0Global and syn1Global in Word2Vec are quite large objects, each of size (vocab * vectorSize * 8) bytes, yet they are passed to workers using basic task serialization.
      
      Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast can:
      
      1. decrease the worker memory consumption by 45%.
      2. decrease running time by 40%.
      
      This will also help extend the upper limit for Word2Vec.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #9878 from hhbyyh/w2vBC.
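      The size claim is easy to check with quick arithmetic: each table is vocab * vectorSize * 8 bytes (per the formula in the message), so the benchmark setup ships two tables of roughly 0.8 GB each with every task unless they are broadcast:

      ```scala
      // Size of one Word2Vec global table, per the commit's formula.
      val vocab = 1000000L     // 1M vocabulary, as in the benchmark
      val vectorSize = 100L    // default vector size
      val bytes = vocab * vectorSize * 8L
      println(bytes)                    // 800000000 bytes (0.8 GB per table)
      println(bytes / (1024L * 1024L))  // 762 MiB (truncated)
      ```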
  2. Nov 28, 2015
    • [SPARK-9319][SPARKR] Add support for setting column names, types · c793d2d9
      felixcheung authored
      Add support for colnames, colnames<-, and coltypes<-.
      Also added tests for names and names<-, which had no tests previously.
      
      I merged with PR #8984 (coltypes), clicked the wrong thing, and screwed up the PR, so I recreated it here. Was #9218.
      
      shivaram sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9654 from felixcheung/colnamescoltypes.
    • [SPARK-12029][SPARKR] Improve column functions signature, param check, tests, fix doc and add examples · 28e46ab4
      felixcheung authored
      
      shivaram sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10019 from felixcheung/rfunctionsdoc.
    • [SPARK-12028] [SQL] get_json_object returns an incorrect result when the value is null literals · 149cd692
      gatorsmile authored
      When calling `get_json_object` in the following two cases, both results are the string `"null"`, even though in the first case the value is a JSON null and in the second it is the string `"null"`:
      
      ```scala
      // assumes `import sqlContext.implicits._` is in scope for toDF and $"..."
      val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
      val df: DataFrame = tuple.toDF("key", "jstring")
      val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
      ```
      ```scala
      val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
      val df2: DataFrame = tuple2.toDF("key", "jstring")
      val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
      ```
      
      Fixed the problem and also added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10018 from gatorsmile/get_json_object.
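      The two inputs are genuinely different and should give different results: in the first, `f1` is a JSON null; in the second, `f1` is the four-character string "null". A toy check in plain Scala showing the distinction (the helper `f1IsJsonNull` is hypothetical and for illustration only; Spark's `get_json_object` uses a real JSON parser):

      ```scala
      // Returns true when the top-level field f1 is a JSON null literal,
      // false when it is the string "null" (or anything else).
      def f1IsJsonNull(json: String): Boolean =
        json.replaceAll("\\s", "").contains("\"f1\":null")

      println(f1IsJsonNull("""{"f1": null}"""))   // true  -> should become SQL NULL
      println(f1IsJsonNull("""{"f1": "null"}""")) // false -> should stay the string "null"
      ```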